Data pipelines with Alloy, Take 2
In which I write some easy Alloy code for a data model, with change over time.
Alloy is a programming language and analysis tool for software modeling. I learnt of it 3 years ago, when I was starting with TLA+ (a language focused on how systems evolve over time). Alloy has a particularly clean syntax (see for instance this blog post) which I liked way more than TLA+, but I couldn’t find any interesting use cases. Half a year ago I wrote another blog post about modelling with it, particularly using it to find “gotchas” before writing the code. I also promised to explore either Electrum or Alloy 6: extensions that enable changes over time. Here it is.
I recommend you read the previous post first.
The problem is the same as in that post:
- Process some data in an “engine” (EMR, Databricks, Redshift…),
- Dump the resulting data in S3,
- Load this data in our analytics warehouse (a different Redshift).
I have no particular expectations: I want to write a similar model, but changing state over time instead of modelling the pipeline itself.
I’ll start by defining the universe we are going to consider in this model.
sig Datum {}
sig File {
data: set Datum
}
fact always_data {
all f: File | #f.data > 2
}
one sig Storage {
var contents: set File
}
sig Table {
var data: set Datum
}
one sig Database {
var tables: set Table
}
In this universe we have:
- (One or more)
File
s containing data (always at least 2 items of data), - One
Storage
, withcontents
varying over time. Contents areFiles
. - (One or more)
Table
s containing data, varying over time. - One
Database
, withtables
varying over time.
This time I just want to see what Alloy 6 brings and how to make it work nicely, so I won’t bother with data types, columns or anything from the previous model. Left as an exercise for the interested reader.
Next we need to define how the universe changes over time. This is done via predicates. Predicates are logical expressions, in this case they define what changes from one “tick” to the next.
pred Process[output: File] {
historically Storage -> output not in contents and
contents' = contents + Storage -> output and
tables' = tables and
Table.data' = Table.data
}
As you can see, it is formed by several conditions in each line, all tied together. They mean:
- historically (that is, before
Process
is true), the argumentoutput
is not inStorage.contents
(this would be requiring idempotency in the real process), and - in the next step,
Storage.contents
should containoutput
(the process does actually output something), and - there is no change to tables, and
- there is no change to data in tables.
The syntax used in conditions 1 and 2 may look a bit strange, but in Alloy (6 or before, doesn’t matter) everything is a relation. So, contents
(which is a “field” in Storage
) is actually the relationship Storage -> file
for some files, hence how the logic goes in the predicate. Historically
is one of the new temporal operators and means always before now.
In any state transition we need to specify what can or cannot happen to the rest of time-varying signatures, otherwise the analyser will happily add weird stuff to them.
We can do likewise for a Load
step:
pred Load[input: File] {
historically input.data not in Table.data and
Storage -> input in contents and
one t: Table |
{
t.data' = t.data + input.data and
tables' = tables + Database -> t
} and
contents' = contents
}
This one has the additional condition one table but is otherwise the same as before. In this case, we can think of this condition on some table existing as requiring Load
to either create the table or add to an existing table, this is why we make sure the table t
is in the tables in the Database
.
For a model to be interesting (and sometimes even computable) we also need to add a stuttering step (not this). Which is a step that basically does nothing to the state space:
pred Stutter {
tables' = tables and
contents' = contents and
Table.data' = Table.data
}
Finally, we need to actually initialise the state space and impose the phase transitions:
fact init {
no contents and
no tables and
no Table.data and
#File > 2
}
fact transitions {
always (some f: File | Process[f] or
some f: File | Load[f] or
Stutter)
}
This is pretty readable.
- We start with nothing in storage, tables or database, and more than two files.
- Then for some file we either
Process
(outputting it) orLoad
(with it as input), or - We just stutter, the system does something else.
Wait, what, where are those files in init
? Well, from all my experiments so far and previous experience with Alloy, the number of atoms (inhabitants of a signature) in Alloy is whatever exists at the beginning.
No new atoms are created in later steps (or that’s what I have seen, I may have been doing something wrong without realising), so if we want our model to work we will have several File
and Table
instance existing at the beginning, but associated with nothing. This makes the whole process work.
This feels artificial, but once understood is easy to work around and understand. In the model visualization we can customize the theme to prevent these instances from showing, too.
To embellish the presentation, we can use a trick described here to display what transition is going on from one step to the next
enum Event { Stutter, Process, Load }
fun process_happens : Event -> File -> Storage {
{ e: Process, f: File, st: Storage | Process[f] }
}
fun load_happens : Event -> File {
{ e: Load, f: File | Load[f] }
}
fun stutter_happens : set Event {
{ e: Stutter | Stutter }
}
These functions assign the relations with the singletons Process
, Load
or Stutter
to situations where the corresponding predicate is true, so in the visualizer we see what is going on.
One final tip, we need to add some form of weak fairness condition. In our case, we would like to make sure that if at some point there is data to load, we actually run load. Otherwise we would have nothing to really analyse.
pred fairness {
some f: File {
(eventually always Storage -> f in contents) implies
(always eventually Load[f])
}
}
And now everything is ready to run the simulation: under fairness, eventually we have a table with data:
run { fairness implies
eventually (#tables > 0 and #Table.data > 0)
} for 5
After tweaking the theme a bit, we’ll get a trace (a set of transitions) like this one.
First tick
When we start everything is empty, and the first predicate that can trigger is Process
, which creates File2
in Storage
.
Second tick
Storage contains a File
with 3 Datum
, and now we can Load
that file into Database
.
Third tick
The data has been loaded in the database, we still have the data in Storage
(obviously) and the system starts Stutter
ing forever.
Other possible traces
There are many more traces the visualiser can generate given this model, for example:
- Stuttering for a bit between
Process
andLoad
, - Additional
Process
for other files beforeLoad
, - Combinations of the above up to the size of the universe defined.
The model is always minimal, so the first run will always have the smallest possible trace.
You can get the whole model from here and the theme from here.
Recommended information sources for Alloy 6
Currently there is no complete documentation available. The two most informative sources are:
- Post by Hillel Wayne with an example and observations
- Formal Software Design with Alloy 6 (WIP book by the creators)
Conclusion
Sadly, nothing in particular.
- This model, even if more interesting to write, feels weaker than the previous model for this particular problem. The previous model gave interesting areas to focus on the code.
- There is no complex concurrency problem in a pipeline as described: the problems are in making sure we don’t mess up simple stuff, not complex stuff.
- I have the feeling having temporal logic in Alloy can help data engineering, I just don’t know how yet.