Data pipelines with Alloy, Take 2

This time, with time

December 12, 2021 7 minutes read | 1407 words by Ruben Berenguel

In which I write some easy Alloy code for a data model, with change over time.

Alloy is a programming language and analysis tool for software modeling. I learnt of it 3 years ago, when I was starting with TLA+ (a language focused on how systems evolve over time). Alloy has a particularly clean syntax (see for instance this blog post) which I liked way more than TLA+, but I couldn’t find any interesting use cases. Half a year ago I wrote another blog post about modelling with it, particularly using it to find “gotchas” before writing the code. I also promised to explore either Electrum or Alloy 6: extensions that enable changes over time. Here it is.

Note:

I recommend you read the previous post first.

The problem is the same as in that post:

Process some data in an “engine” (EMR, Databricks, Redshift…),
Dump the resulting data in S3,
Load this data in our analytics warehouse (a different Redshift).

I have no particular expectations: I want to write a similar model, but changing state over time instead of modelling the pipeline itself.

I’ll start by defining the universe we are going to consider in this model.

sig Datum {}

sig File {
  data: set Datum
}

fact always_data {
  all f: File | #f.data > 2
}

one sig Storage {
  var contents: set File
}

sig Table {
  var data: set Datum
}

one sig Database {
  var tables: set Table
}

In this universe we have:

(One or more) Files containing data (always at least 2 items of data),
One Storage, with contents varying over time. Contents are Files.
(One or more) Tables containing data, varying over time.
One Database, with tables varying over time.

This time I just want to see what Alloy 6 brings and how to make it work nicely, so I won’t bother with data types, columns or anything from the previous model. Left as an exercise for the interested reader.

Next we need to define how the universe changes over time. This is done via predicates. Predicates are logical expressions, in this case they define what changes from one “tick” to the next.

pred Process[output: File] {
  historically Storage -> output not in contents and
  contents' = contents + Storage -> output and
  tables' = tables and
  Table.data' = Table.data
}

As you can see, it is formed by several conditions in each line, all tied together. They mean:

historically (that is, before Process is true), the argument output is not in Storage.contents (this would be requiring idempotency in the real process), and
in the next step, Storage.contents should contain output (the process does actually output something), and
there is no change to tables, and
there is no change to data in tables.

The syntax used in conditions 1 and 2 may look a bit strange, but in Alloy (6 or before, doesn’t matter) everything is a relation. So, contents (which is a “field” in Storage) is actually the relationship Storage -> file for some files, hence how the logic goes in the predicate. Historically is one of the new temporal operators and means always before now.

In any state transition we need to specify what can or cannot happen to the rest of time-varying signatures, otherwise the analyser will happily add weird stuff to them.

We can do likewise for a Load step:

pred Load[input: File] {
  historically input.data not in Table.data and
  Storage -> input in contents and
  one t: Table |
    {
      t.data' = t.data + input.data and
      tables' = tables + Database -> t
    } and
  contents' = contents
}

This one has the additional condition one table but is otherwise the same as before. In this case, we can think of this condition on some table existing as requiring Load to either create the table or add to an existing table, this is why we make sure the table t is in the tables in the Database.

For a model to be interesting (and sometimes even computable) we also need to add a stuttering step (not this). Which is a step that basically does nothing to the state space:

pred Stutter {
  tables' = tables and
  contents' = contents and
  Table.data' = Table.data
}

Finally, we need to actually initialise the state space and impose the phase transitions:

fact init {
  no contents and
  no tables and
  no Table.data and
  #File > 2
}

fact transitions {
  always (some f: File | Process[f] or
      some f: File | Load[f] or
      Stutter)
}

This is pretty readable.

We start with nothing in storage, tables or database, and more than two files.
Then for some file we either Process (outputting it) or Load (with it as input), or
We just stutter, the system does something else.

Atoms, the universe and everything else

Wait, what, where are those files in init? Well, from all my experiments so far and previous experience with Alloy, the number of atoms (inhabitants of a signature) in Alloy is whatever exists at the beginning.

No new atoms are created in later steps (or that’s what I have seen, I may have been doing something wrong without realising), so if we want our model to work we will have several File and Table instance existing at the beginning, but associated with nothing. This makes the whole process work.

This feels artificial, but once understood is easy to work around and understand. In the model visualization we can customize the theme to prevent these instances from showing, too.

To embellish the presentation, we can use a trick described here to display what transition is going on from one step to the next

enum Event { Stutter, Process, Load }

fun process_happens : Event -> File -> Storage {
  { e: Process, f: File, st: Storage | Process[f] }
}

fun load_happens : Event -> File {
  { e: Load, f: File | Load[f] }
}

fun stutter_happens : set Event {
  { e: Stutter | Stutter }
}

These functions assign the relations with the singletons Process, Load or Stutter to situations where the corresponding predicate is true, so in the visualizer we see what is going on.

One final tip, we need to add some form of weak fairness condition. In our case, we would like to make sure that if at some point there is data to load, we actually run load. Otherwise we would have nothing to really analyse.

pred fairness {
  some f: File {
    (eventually always Storage -> f in contents) implies
    (always eventually Load[f])
  }
}

And now everything is ready to run the simulation: under fairness, eventually we have a table with data:

run { fairness implies
    eventually (#tables > 0 and #Table.data > 0)
  } for 5

After tweaking the theme a bit, we’ll get a trace (a set of transitions) like this one.

First tick

When we start everything is empty, and the first predicate that can trigger is Process, which creates File2 in Storage.

Second tick

Storage contains a File with 3 Datum, and now we can Load that file into Database.

Third tick

The data has been loaded in the database, we still have the data in Storage (obviously) and the system starts Stuttering forever.

Other possible traces

There are many more traces the visualiser can generate given this model, for example:

Stuttering for a bit between Process and Load,
Additional Process for other files before Load,
Combinations of the above up to the size of the universe defined.

The model is always minimal, so the first run will always have the smallest possible trace.

You can get the whole model from here and the theme from here.

Recommended information sources for Alloy 6

Currently there is no complete documentation available. The two most informative sources are:

Post by Hillel Wayne with an example and observations
Formal Software Design with Alloy 6 (WIP book by the creators)

Conclusion

Sadly, nothing in particular.

This model, even if more interesting to write, feels weaker than the previous model for this particular problem. The previous model gave interesting areas to focus on the code.
There is no complex concurrency problem in a pipeline as described: the problems are in making sure we don’t mess up simple stuff, not complex stuff.
I have the feeling having temporal logic in Alloy can help data engineering, I just don’t know how yet.