Model checking safety of Ben-Or's Byzantine consensus with Apalache
Author: Igor Konnov
This blogpost summarizes my experiments when specifying Ben-Or’s Byzantine consensus protocol in TLA+ and model checking it with Apalache. I presented these results at the TLA Community Meeting 2024 and VSTTE 2024. Both events were great, and I really enjoyed giving the talks there! The first part of this blogpost was written on a train from Prague to Vienna. I have already hinted at some of these results in Episode 1 of Why I use TLA+ and not(TLA+). This is one of the examples of using TLA+ to get good guarantees about a protocol in reasonable time.
Why Ben-Or’s consensus? The paper on “Another advantage of free choice: Completely asynchronous agreement protocols” was published in 1983, which makes it almost as old as I am. However, the algorithms in the paper use primitives that we still find in modern consensus protocols: sending messages to all processes, counting the number of received messages, and using monotonically increasing rounds. On top of that, both of Ben-Or’s algorithms use a probabilistic coin. We cannot deal with probabilities in Apalache. Hence, we only check safety. What I find exciting about this protocol is that it showcases the full spectrum of the techniques supported by Apalache:
- Randomized symbolic execution to reason about some executions up to length $k$,
- bounded model checking to reason about all executions up to length $k$, and
- inductiveness checking to reason about all executions of all lengths.
Prepare for a long read… If you find the text too hard to read, look at the funny cat images that ChatGPT happily generated for me. The plots are real and produced from the experimental data with gnuplot. The text is my own; no AI was involved in my typos.
1. Ben-Or’s consensus for Byzantine faults
I am not going to set the scene for the protocol. Ben-Or does it with amazing clarity. Just read the paper. It is three pages long, excluding the references. If you don’t need an introduction to consensus, here is the pseudo-code of Protocol B, just retyped in Markdown:
Process P: Initial value x_P
step 0: set r := 1.
step 1: Send the message (1, r, x_P) to all the processes
step 2: Wait till messages of type (1, r, *) are received from N - t processes.
If more than (N + t) / 2 messages have the same value v,
then send the message (2, r, v, D) to all processes.
Else send the message (2, r, ?) to all processes.
step 3: Wait till messages of type (2, r, *) arrive from N - t processes.
(a) If there are at least t + 1 D-messages (2, r, v, D), then set x_P := v.
(b) If there are more than (N + t) / 2 D-messages then decide v.
(c) Else set x_P = 1 or 0 each with probability 1/2.
step 4: Set r := r + 1 and go to step 1.
It does not look very complicated, right? The processes send messages around, receive messages, count the number of received messages and change their states. There are two challenges to understanding why this algorithm works:
- The processes are completely asynchronous. Say, process 1 may be in round 3, whereas process 2 may be in round 100.
- At most $t$ of $N$ processes may behave Byzantine. They may send conflicting messages to the correct processes, or not send any messages at all.
This looks like a good fit for TLA+ and Apalache.
2. Specifying Ben-Or’s consensus in TLA+
You can find the full specification in ben-or83. I wrote the specification in one go over a weekend, as it was not my first consensus specification in TLA+. However, to explain the choice of good specification abstractions, I am presenting it in several steps. The code snippets are rendered in the Unicode version of TLA+. If you would like to see the ASCII version (which you would normally type), check the sources on GitHub.
In the following, some paragraphs highlight non-obvious specification tricks.
2.1. Declaring specification parameters
Many specifications are parameterized, e.g., in the number of communicating processes. What kind of parameters do we have in Ben-Or’s consensus? Obviously, we have the total number of processes, which we call N. Also, we need the upper bound on the number of faulty processes, which we call T (for threshold). Since the actual number of faulty processes may differ from T, we will also introduce the actual number of faulty processes, which we call F. Hence, we declare the first three constants as follows:
CONSTANTS
\* The total number of processes.
\* @type: Int;
N,
\* An upper bound on the number of faulty processes.
\* @type: Int;
T,
\* The actual number of faulty processes (unknown to the processes).
\* @type: Int;
F
Introducing the parameter $F$ may look like cheating. Indeed, if the correct processes could read $F$, solving consensus would be much easier. However, the parameter $F$ is only used to reason about the executions of the protocol, and it should not be used in the steps executed by the correct processes.
In order to send messages, we also need process identifiers. To this end, we introduce two more parameters:
CONSTANTS
\* The set of the correct (honest) processes.
\* @type: Set(REPLICA);
CORRECT,
\* The set of the Byzantine (faulty) processes.
\* @type: Set(REPLICA);
FAULTY
Importantly, when we specify the protocol behavior, we should not refer to the constants CORRECT and FAULTY, as the correct processes do not know the identities of the faulty processes in a real distributed system. Alternatively, instead of declaring these two parameters, we could just introduce two sets of integers:
CorrectAlt == 1..(N - F)
FaultyAlt == (N - F + 1)..N
I generally try to avoid integers when they are not required. Integers implicitly suggest a total order on process identities, which is not needed in this protocol. Moreover, uninterpreted types such as REPLICA are translated to uninterpreted sorts in SMT, which are generally more efficient to reason about. In our specification, it would not make a big difference though.
Since the parameters N, T, and F are related to the parameters CORRECT and FAULTY, we add the following assumptions about them:
ASSUME N > 5 * T ∧ Cardinality(CORRECT) = N - F ∧ Cardinality(FAULTY) = F
We could add one more assumption about the relation between T and F, such as $T \ge F$. However, it is quite useful to check the cases where $F > T$. The protocol should be unsafe in those cases! Thus, we omit such an assumption.
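For reference, here is roughly what the omitted assumption would look like; we intentionally leave it out, so that we can also model check instances where the fault bound is exceeded:
\* deliberately NOT added: we also want to explore instances where F > T
ASSUME T ≥ F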
Finally, in the above protocol description, round numbers may grow unboundedly large. For model checking purposes, it is convenient to restrict round numbers to small sets. To this end, we introduce the set of all rounds as a parameter:
CONSTANT
\* The set of rounds, which we bound for model checking.
\* @type: Set(Int);
ROUNDS
ASSUME 1 ∈ ROUNDS
For proving a theorem, we can still instantiate ROUNDS with the set of all natural numbers Nat, but we will instantiate it with small sets such as 0..2 or 0..5 for model checking.
The protocol needs a few more values that could be declared as parameters. However, in this case, we can simply introduce them with definitions and assumptions:
\* The set of values to choose from
VALUES ≜ { 0, 1 }
NO_DECISION ≜ -1
ASSUME NO_DECISION ∉ VALUES
ALL ≜ CORRECT ∪ FAULTY
2.2. Declaring state variables
After declaring specification parameters, I would usually think about state variables.
The first portion of state variables is easy to come up with. We can see some control structure in the pseudo-code, which refers to steps, rounds, values, and decisions. Hence, we declare the following control state variables:
VARIABLES
\* The current value by a process, called $x_P$ in the paper.
\* @type: REPLICA -> Int;
value,
\* The decision by a process, where -1 means no decision.
\* @type: REPLICA -> Int;
decision,
\* The round number of a process, called $r$ in the paper.
\* @type: REPLICA -> Int;
round,
\* The process step: S1, S2, S3.
\* @type: REPLICA -> STEP;
step
The second portion of state variables is less obvious. We have to somehow model message passing in the specification. For this kind of algorithm, the standard pattern is to store the set of all messages that were sent by the processes. Hence, we simply declare two state variables to store the sent messages:
VARIABLES
\* Type-1 messages sent by the correct and faulty processes, mapped by rounds.
\* @type: Int -> Set($msgOne);
msgs1,
\* Type-2 messages sent by the correct and faulty processes, mapped by rounds.
\* @type: Int -> Set($msgTwo);
msgs2
Since we had to refer to message types such as $msgOne and $msgTwo, it is a good time to think about type definitions.
2.3. Type definitions
It is common to isolate type definitions in a separate file such as typedefs.tla. Usually, once we have figured out the type definitions, we are not going to revisit them much.
To start with, since I have decided to use the uninterpreted type STEP to encode steps, we have to refer to values with syntax like "S1_OF_STEP". This is too verbose. Hence, we introduce auxiliary definitions to keep things tidy:
\* predefined constants for the steps
S1 ≜ "S1_OF_STEP"
S2 ≜ "S2_OF_STEP"
S3 ≜ "S3_OF_STEP"
Now we have to decide on the types of messages. We have messages of type 1, which look like (1, r, v), and type 2, which look like (2, r, v, D) or (2, r, ?). Hence, for the messages of type 1, it’s easy to see that the record type should be sufficient:
{ src: REPLICA, r: Int, v: Int }
For the messages of type 2, we need to figure out how to encode (2, r, v, D) and (2, r, ?). This looks like a good application for the variant type:
msgTwo = D({ src: REPLICA, r: Int, v: Int })
| Q({ src: REPLICA, r: Int });
As type annotations have to be written inside comments in front of definitions in TLA+, we declare a dummy definition typedefs_aliases and add the above types as type aliases:
(*
* Type definitions:
*
* Type-1 messages.
* @typeAlias: msgOne = { src: REPLICA, r: Int, v: Int };
*
* Type-2 messages.
* @typeAlias: msgTwo = Q({ src: REPLICA, r: Int }) | D({ src: REPLICA, r: Int, v: Int });
*)
typedefs_aliases ≜ TRUE
I prefer to hide the difference between records and variants right in the type definitions. To this end, we declare the following constructors (not really a TLA+ term) and predicates:
\* @type: (REPLICA, Int, Int) => $msgOne;
M1(src, round, value) ≜ [ src ↦ src, r ↦ round, v ↦ value ]
\* @type: (REPLICA, Int) => $msgTwo;
Q2(src, round) ≜ Variant("Q", [ src ↦ src, r ↦ round ])
\* @type: $msgTwo => Bool;
IsQ2(msg) ≜ VariantTag(msg) = "Q"
\* @type: $msgTwo => { src: REPLICA, r: Int };
AsQ2(msg) ≜ VariantGetUnsafe("Q", msg)
\* @type: (REPLICA, Int, Int) => $msgTwo;
D2(src, round, value) ≜ Variant("D", [ src ↦ src, r ↦ round, v ↦ value ])
\* @type: $msgTwo => Bool;
IsD2(msg) ≜ VariantTag(msg) = "D"
\* @type: $msgTwo => { src: REPLICA, r: Int, v: Int };
AsD2(msg) ≜ VariantGetUnsafe("D", msg)
2.4. Specifying initial states
Looking at the protocol pseudo-code, initializing the state machine is trivial:
Init ≜
\* non-deterministically choose the initial values
∧ value ∈ [ CORRECT → VALUES ]
∧ decision = [ r ∈ CORRECT ↦ NO_DECISION ]
∧ round = [ r ∈ CORRECT ↦ 1 ]
∧ step = [ r ∈ CORRECT ↦ S1 ]
∧ msgs1 = [ r ∈ ROUNDS ↦ {}]
∧ msgs2 = [ r ∈ ROUNDS ↦ {}]
We will revisit this decision when modeling faults.
2.5. Specifying protocol steps
Step1. Step 1 is relatively straightforward:
- check that a process is at Step 1,
- send the message,
- go to Step 2.
This is how we write it in TLA+:
\* @type: REPLICA => Bool;
Step1(id) ≜
LET r ≜ round[id] IN
∧ step[id] = S1
\* "send the message (1, r, x_P) to all the processes"
∧ msgs1' = [msgs1 EXCEPT ![r] = @ ∪ { M1(id, r, value[id]) }]
∧ step' = [step EXCEPT ![id] = S2]
∧ UNCHANGED ⟨ value, decision, round, msgs2 ⟩
Step2. Step 2 has more logic than Step 1. In addition to checking that a process is at Step 2, we have to receive messages and count them.
Again, there is a simple pattern of receiving messages in a consensus protocol:
∃ received ∈ SUBSET msgs1[r]:
...
This looks too simple! Don’t we have to receive messages one by one and delete the received messages? It so happens that in consensus algorithms, neither of these two things matters. A process may receive messages in any order, so, technically, it may receive an arbitrary subset of the sent messages. When a process receives too few messages, it does nothing. So we just have to pick a large enough subset of messages. To this end, we can just use Cardinality(...). Moreover, once a process has processed this subset, it is not going to look at messages of the same shape ever again, as it proceeds to the next step.
Now we have the necessary ingredients to write Step 2:
Step2(id) ≜
LET r ≜ round[id] IN
∧ step[id] = S2
∧ ∃ received ∈ SUBSET msgs1[r]:
\* "wait till messages of type (1, r, *) are received from N - T processes"
∧ Cardinality(Senders1(received)) ≥ N - T
∧ LET Weights ≜ [ v ∈ VALUES ↦
Cardinality(Senders1({ m ∈ received: m.v = v })) ]
IN
∨ ∃ v ∈ VALUES:
\* "if more than (N + T)/2 messages have the same value v..."
∧ 2 * Weights[v] > N + T
\* "...then send the message (2, r, v, D) to all processes"
∧ msgs2' = [msgs2 EXCEPT ![r] = @ ∪ { D2(id, r, v) }]
∨∧ ∀ v ∈ VALUES: 2 * Weights[v] ≤ N + T
\* "Else send the message (2, r, ?) to all processes"
∧ msgs2' = [msgs2 EXCEPT ![r] = @ ∪ { Q2(id, r) }]
∧ step' = [ step EXCEPT ![id] = S3 ]
∧ UNCHANGED ⟨ value, decision, round, msgs1 ⟩
If you look at Step2 carefully, you will see that we are missing the definition of Senders1. Here is how we define Senders1 and Senders2:
\* @type: Set($msgOne) => Set(REPLICA);
Senders1(m1s) ≜
\* Filter the set of ALL, as it has fixed size, in contrast to m1s.
\* This is especially important, when we call Cardinality on the result.
{ id ∈ ALL: ∃ m ∈ m1s: m.src = id }
\* @type: Set($msgTwo) => Set(REPLICA);
Senders2(m2s) ≜
\* Filter the set of ALL, as it has fixed size, in contrast to m2s.
\* This is especially important, when we call Cardinality on the result.
{ id ∈ ALL:
∃ m ∈ m2s: (IsD2(m) ∧ AsD2(m).src = id) ∨ (IsQ2(m) ∧ AsQ2(m).src = id) }
As mentioned in the comments, these definitions are tuned towards the current implementation of Apalache: they restrict the small set ALL instead of simply mapping all messages to the sender field. Otherwise, the constraints for senders would grow as fast as the set of messages, even though they do not have to. After all, there are not so many process identities.
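For comparison, here is a sketch of the more direct encoding that we avoid (the name Senders1Alt is mine, not part of the specification). Mapping the messages to their sender field is shorter, but its constraints grow with the set of received messages:
\* @type: Set($msgOne) => Set(REPLICA);
\* The straightforward alternative, avoided in the specification for performance.
Senders1Alt(m1s) ≜ { m.src: m ∈ m1s }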
The definition of Step2 still contains one line that may look surprising:
∧ 2 * Weights[v] > N + T
Recall that the pseudo-code contained the statement:
If more than (N + t) / 2 messages have the same value v, then...
It is easy to overlook that $(N + t) / 2$ is not integer division, especially if you do programming as a daily activity. What is meant here is mathematical division over the reals. Obviously, rationals are sufficient in this case.
Luckily, we do not have to use reals or rationals, as we can simply use the school math rule: an inequality $x > \frac{a}{b}$, for $b > 0$, holds if and only if the inequality $b \cdot x > a$ holds. This is quite a standard transformation that drives us from rationals back to integers.
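To make this concrete, consider the instance with $N = 6$ and $T = 1$ that we check below. Since $Weights[v]$ is an integer, “more than $(N + T) / 2 = 3.5$ messages” means at least $4$ messages, which is exactly the integer constraint $2 \cdot Weights[v] > 7 = N + T$.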
Similarly, the following line encodes the “Else” arm of the statement. Surprisingly, I introduced a typo there, which I found later!
∨∧ ∀ v ∈ VALUES: 2 * Weights[v] ≤ N + T
Step 3. Finally, Step 3 is the most complex one, but once you have understood Steps 1 and 2, it should be easy to see what is going on:
Step3(id) ≜
LET r ≜ round[id] IN
∧ step[id] = S3
∧ ∃ received ∈ SUBSET msgs2[r]:
\* "Wait till messages of type (2, r, *) arrive from N - T processes"
∧ Cardinality(Senders2(received)) = N - T
∧ LET Weights ≜ [ v ∈ VALUES ↦
Cardinality(Senders2({ m ∈ received: IsD2(m) ∧ AsD2(m).v = v })) ]
IN
∨ ∃ v ∈ VALUES:
\* "(a) If there are at least T+1 D-messages (2, r, v, D),
\* then set x_P to v"
∧ Weights[v] ≥ T + 1
∧ value' = [value EXCEPT ![id] = v]
\* "(b) If there are more than (N + T)/2 D-messages..."
∧ IF 2 * Weights[v] > N + T
\* "...then decide v"
THEN decision' = [decision EXCEPT ![id] = v]
ELSE decision' = decision
∨ ∧ ∀ v ∈ VALUES: Weights[v] < T + 1
∧ ∃ next_v ∈ VALUES:
\* "(c) Else set x_P = 1 or 0 each with probability 1/2."
\* We replace probabilities with non-determinism.
∧ value' = [value EXCEPT ![id] = next_v]
∧ decision' = decision
\* the condition below is to bound the number of rounds for model checking
∧ r + 1 ∈ ROUNDS
\* "Set r := r + 1 and go to step 1"
∧ round' = [ round EXCEPT ![id] = r + 1 ]
∧ step' = [ step EXCEPT ![id] = S1 ]
∧ UNCHANGED ⟨ msgs1, msgs2 ⟩
If you look at the definition carefully, you will see that we have merged steps 3 and 4, as step 4 does nothing more than increasing the round number.
In comparison to the pseudo-code, we have introduced one additional constraint:
\* the condition below is to bound the number of rounds for model checking
∧ r + 1 ∈ ROUNDS
The purpose of the above condition is to restrict the execution space of the algorithm to the values in the set ROUNDS. When we set ROUNDS to Nat, the above condition is trivially true. When we set ROUNDS to a small interval 1..m, we restrict the search space. Actually, in this algorithm, setting ROUNDS to a finite set makes the state space finite. In theory, we could use the model checker TLC to exhaustively enumerate all states. In practice, the state space of this algorithm is too large even for sets of rounds as small as 0..2.
2.6. Model checking the behavior of the correct processes
Before we dive into potential faults, it is always a good idea to check the properties of the fault-free behavior. To this end, we have all the ingredients ready. It only remains to put together steps 1-3:
CorrectStep ≜
∃ id ∈ CORRECT:
∨ Step1(id)
∨ Step2(id)
∨ Step3(id)
Having Init and CorrectStep, we use Apalache to produce interesting examples. For instance, let’s define two invariants:
\* An example of a process that has made a decision
DecisionEx ≜
¬(∃ id ∈ CORRECT: decision[id] ≠ NO_DECISION)
\* An example of all correct processes having made a decision
AllDecisionEx ≜
¬(∀ id ∈ CORRECT: decision[id] ≠ NO_DECISION)
These invariants are intended to be violated. This is why I’ve added the suffix Ex to their names, for an “example”. Indeed, these “falsy” invariants are good for generating examples.
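The commands below refer to the instance module MC_n6t1f1.tla, which fixes the parameters to the good instance N = 6, T = 1, F = 1. I am not reproducing the actual file here; a rough sketch of such an instance module (the replica names, the module names ben_or83 and typedefs, and the exact set of rounds are my assumptions, so check the repository for the real thing) looks as follows:
---------------------------- MODULE MC_n6t1f1 ----------------------------
\* a rough sketch of an instance module for Apalache; details may differ
EXTENDS Integers, typedefs

\* fix the parameters of the parameterized specification
N ≜ 6
T ≜ 1
F ≜ 1
CORRECT ≜ { "0_OF_REPLICA", "1_OF_REPLICA", "2_OF_REPLICA", "3_OF_REPLICA", "4_OF_REPLICA" }
FAULTY ≜ { "5_OF_REPLICA" }
ROUNDS ≜ 0..3

VARIABLES
    \* @type: REPLICA -> Int;
    value,
    \* @type: REPLICA -> Int;
    decision,
    \* @type: REPLICA -> Int;
    round,
    \* @type: REPLICA -> STEP;
    step,
    \* @type: Int -> Set($msgOne);
    msgs1,
    \* @type: Int -> Set($msgTwo);
    msgs2

\* instantiate the main specification with the constants defined above
INSTANCE ben_or83
===========================================================================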
$ apalache-mc check --length=30 --inv=DecisionEx --init=Init --next=CorrectStep MC_n6t1f1.tla
...
State 11: Checking 1 state invariants
Check the trace in: ...
State 11: state invariant 0 violated.
Total time: 8.502 sec
Not bad. We have found an execution that leads to a state in which one of the correct processes made a decision. The example has 11 steps, the last of them containing the following value for decision:
State11 ==
decision
= SetAsFun({ <<"1_OF_REPLICA", -1>>,
<<"4_OF_REPLICA", -1>>,
<<"3_OF_REPLICA", -1>>,
<<"2_OF_REPLICA", -1>>,
<<"0_OF_REPLICA", 1>> })
...
Apalache outputs a function F as SetAsFun(S), where S is the set of pairs <<x, F[x]>>, for x ∈ DOMAIN F. We find this notation a bit more accessible than the one produced by TLC:
("1_OF_REPLICA" :> -1) @@ ("4_OF_REPLICA" :> -1)
@@ ("3_OF_REPLICA" :> -1) @@ ("2_OF_REPLICA" :> -1) @@ ("0_OF_REPLICA" :> 1)
Further, we look for an example of an execution that ends up in a state where all correct processes have made a decision. Apalache finds such an example in about one minute:
$ apalache-mc check --length=30 --inv=AllDecisionEx --init=Init --next=CorrectStep MC_n6t1f1.tla
State 15: Checking 1 state invariants
Check the trace in: ...
State 15: state invariant 0 violated.
Total time: 67.629 sec
2.7. Specifying faults
2.7.1. Injecting faults as an action
Let’s first check how Ben-Or defines Byzantine faults in the paper:
Here faulty processes might go completely haywire, perhaps even sending messages according to some malevolent plan.
Well, the first part of the sentence is not very helpful. How do we specify going “haywire”? The second part is actually giving us a hint: the faulty processes may send messages that deviate from the algorithm. So this is exactly what we do in the next definition of FaultyStep. The faulty processes may simply send an arbitrary subset of messages in a single step:
FaultyStep ≜
\* the faulty replicas collectively inject messages for a single round
∧ ∃ r ∈ ROUNDS:
∧ ∃ F1 ∈ SUBSET [ src: FAULTY, r: { r }, v: VALUES ]:
msgs1' = [ msgs1 EXCEPT ![r] = @ ∪ F1 ]
∧ ∃ F2D ∈ SUBSET { D2(src, r, v): src ∈ FAULTY, v ∈ VALUES }:
∃ F2Q ∈ SUBSET { Q2(src, r): src ∈ FAULTY }:
msgs2' = [ msgs2 EXCEPT ![r] = @ ∪ F2D ∪ F2Q ]
∧ UNCHANGED ⟨ value, decision, round, step ⟩
If you have been doing a lot of explicit-state model checking, e.g., with TLC, you surely frowned upon the above definition. Wouldn’t it be more desirable to inject exactly one faulty message at a time? Well, in the case of TLC, it’s probably more efficient to inject messages one by one. However, in the case of bounded model checking, it is not. Since bounded model checking almost always slows down with longer symbolic executions, it is more beneficial to do more in one step.
Now we can define a single step of the algorithm as either a step by a correct process, or a step by a faulty process:
Next ≜
∨ CorrectStep
∨ FaultyStep
Let’s check how long it takes to find our two examples in the presence of faults:
$ apalache-mc check --length=30 --inv=DecisionEx --init=Init --next=Next MC_n6t1f1.tla
...
State 10: Checking 1 state invariants
It took me 0 days 1 hours 30 min 11 sec
$ apalache-mc check --length=30 --inv=AllDecisionEx --init=Init --next=Next MC_n6t1f1.tla
...
State 15: state invariant 0 violated.
It took me 0 days 6 hours 20 min 59 sec
Wow. That was quite a slowdown! Why is it happening? The answer is that we have added the possibility of injecting a huge number of messages at every step, and the model checker has to take that into account.
Can we do better? It turns out that, usually, yes. See the next section.
2.7.2. Injecting faults in the initial states
Let’s think again about how we send and receive messages. To send a message to all processes, a process simply adds its message to the set of all sent messages. To receive messages, a process simply looks into the set of all sent messages. As a result, what matters is the ability of Byzantine processes to send a message whenever they like, and the ability of the correct processes to receive messages that were sent earlier. Moreover, the correct processes do not have to receive all the sent messages.
Hence, Byzantine processes can simply inject their messages in the initial states. The correct processes do not have to receive them:
InitWithFaults ≜
\* non-deterministically choose the initial values
∧ value ∈ [ CORRECT → VALUES ]
∧ decision = [ r ∈ CORRECT ↦ NO_DECISION ]
∧ round = [ r ∈ CORRECT ↦ 1 ]
∧ step = [ r ∈ CORRECT ↦ S1 ]
\* non-deterministically initialize the messages with faults
∧ ∃ F1 ∈ SUBSET [ src: FAULTY, r: ROUNDS, v: VALUES ]:
msgs1 = [ r ∈ ROUNDS ↦ { m ∈ F1: m.r = r } ]
∧ ∃ F1D ∈ SUBSET [ src: FAULTY, r: ROUNDS, v: VALUES ],
F1Q ∈ SUBSET [ src: FAULTY, r: ROUNDS ]:
msgs2 = [ r ∈ ROUNDS ↦
{ D2(mm.src, r, mm.v): mm ∈ { m ∈ F1D: m.r = r } }
∪ { Q2(mm.src, r): mm ∈ { m ∈ F1Q: m.r = r } }
]
Strictly speaking, the paper requires all messages to be eventually delivered to all correct processes. Expressing this would require a carefully written fairness constraint, for instance, that every message sent by a correct process is eventually received by every correct process. We do not add such a constraint here.
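Just to illustrate the flavor of such a constraint (this is a sketch of my own, not part of the checked specification, and in this blogpost we only check safety anyway): since the specification does not track message receipt explicitly, the closest approximation is weak fairness over the actions of the correct processes, which forces every process to eventually act on the messages that are already in the pool:
vars ≜ ⟨ value, decision, round, step, msgs1, msgs2 ⟩

\* A hypothetical fair specification: weak fairness on every correct process
\* approximates the assumption that sent messages are eventually received.
FairSpec ≜
    ∧ InitWithFaults
    ∧ □[CorrectStep]_vars
    ∧ ∀ id ∈ CORRECT: WF_vars(Step1(id) ∨ Step2(id) ∨ Step3(id))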
Now, we can ask Apalache to find examples again. This time, we are using InitWithFaults to inject faults in the initial states and CorrectStep to perform the steps by the correct processes.
$ apalache-mc check --length=30 --inv=DecisionEx \
--init=InitWithFaults --next=CorrectStep MC_n6t1f1.tla
...
State 9: state invariant 0 violated.
...
Total time: 12.738 sec
$ apalache-mc check --length=30 --inv=AllDecisionEx \
--init=InitWithFaults --next=CorrectStep MC_n6t1f1.tla
...
State 15: state invariant 0 violated.
...
Total time: 993.812 sec
Nice! This was a small change of our perspective, but a significant speed up of the model checker.
If you compare the above results with the results of checking the behavior of the correct processes, you will notice that the experiments in this section required fewer steps to produce the examples. This is an interesting effect of having Byzantine processes: They may help the correct processes to make decisions faster, if they want to.
We have to be careful about defining the behavior of faulty processes. It’s often easy to restrict their behavior way too much. If you have a friend from distributed computing, ask them whether your optimizations make sense. If you don’t have such a friend, ask the model checker. See the next section.
3. Model-checking agreement
Producing examples of the expected behavior is nice. In fact, we should always start with that. However, we also want to check the algorithm’s safety. As in many consensus algorithms, safety means that no two correct processes may decide on different values. This property is usually called “Agreement”. We can easily define it as a state invariant in TLA+:
AgreementInv ≜
∀ id1, id2 ∈ CORRECT:
∨ decision[id1] = NO_DECISION
∨ decision[id2] = NO_DECISION
∨ decision[id1] = decision[id2]
From the paper, we know that AgreementInv should hold true on the specification instances when $N > 5T \land T \ge F$. This naturally gives us two minimal instances to check:
- A good instance: $N = 6$, $T = 1$, $F = 1$. This is the minimal instance where agreement should hold true.
- A bad instance: $N = 6$, $T = 1$, $F = 2$. This is the minimal instance where agreement is expected to break.
Which instance shall we check? The answer is: we have to check both instances. Checking agreement on the good instance positively confirms our expectations. Checking agreement on the bad instance is our attempt to falsify our result. If it succeeds, something is wrong, either with our specification or with the algorithm. Of course, there is a third outcome, namely, that the model checker has a bug. However, it would not be that dramatic: we can always check the produced counterexample.
3.1. Checking agreement on the bad instance
Let’s start with the minimal bad instance. As expected, Apalache finds a counterexample to agreement:
$ apalache-mc check --length=10 --init=InitWithFaults \
--next=CorrectStep --inv=AgreementInv MC_n6t1f2.tla
State 10: state invariant 0 violated.
It took me 0 days 0 hours 2 min 59 sec
This time we were lucky. Apalache found a counterexample in three minutes. When we check the counterexample, we see that two correct replicas indeed decide on two different values:
State10 ==
decision
= SetAsFun({ <<"0_OF_REPLICA", -1>>,
<<"1_OF_REPLICA", 0>>,
<<"2_OF_REPLICA", 1>>,
<<"3_OF_REPLICA", -1>> })
...
3.2. Checking agreement on the good instance
When we check agreement for the same bound on the execution length of 10, Apalache does not produce a counterexample:
$ apalache-mc check --length=10 --init=InitWithFaults \
--next=CorrectStep --inv=AgreementInv MC_n6t1f1.tla
...
Checker reports no error up to computation length 10
It took me 0 days 0 hours 0 min 34 sec
However, how convincing is this result? We can increase the bound, say, to 15:
$ apalache-mc check --length=15 --init=InitWithFaults \
--next=CorrectStep --inv=AgreementInv MC_n6t1f1.tla
...
Checker reports no error up to computation length 15
It took me 1 days 20 hours 47 min 8 sec
Oh-oh. It took the model checker a really long time to tell us that it could not find a counterexample. The good news is that this is a hard guarantee: if check reports that there is no error up to computation length $k$, then the invariant holds true for all executions that contain up to $k$ steps ($k$-bounded). This guarantee is completeness of bounded model checking for $k$-bounded executions.
Recall that we had an invariant violation in just 10 steps for the case of $N = 6$, $T = 1$, $F = 2$. Hence, checking $15$-bounded executions seems to be enough. What if we want to check longer executions? In the rest of this blogpost, we will discuss two techniques:
- Randomized symbolic execution. This technique requires zero time investment, but it loses completeness.
- Inductive invariants. This technique requires plenty of additional work, but it gives us completeness for unbounded executions!
4. Boosting the search with randomized symbolic execution
4.1. Understanding slowdown
Why is bounded model checking slowing down so dramatically? Apalache gives us a hint in the execution log:
Step 0: picking a transition out of 1 transition(s)
Step 1: picking a transition out of 1 transition(s)
...
Step 4: picking a transition out of 1 transition(s)
Step 5: picking a transition out of 3 transition(s)
...
Step 8: picking a transition out of 3 transition(s)
Step 9: picking a transition out of 6 transition(s)
...
Step 15: picking a transition out of 6 transition(s)
In step 0, the model checker adds the constraints from InitWithFaults. In steps 1-4, it adds the constraints from Step1. In steps 5-8, it adds the constraints from Step1 ∨ Step2. Note that it chooses from three symbolic transitions, not two. This is because Apalache decomposes Step2 into two independent transitions. If you look carefully at Step2, then you will notice that it contains two disjunctive cases. Finally, in steps 9-15, the model checker adds the constraints from Step1 ∨ Step2 ∨ Step3.
We can also break it down in the table below:
Step No. | Action 1 | Action 2 | Action 3 |
---|---|---|---|
0 | Init | | |
1 | Step1 | | |
2 | Step1 | | |
3 | Step1 | | |
4 | Step1 | | |
5 | Step1 | Step2 | |
6 | Step1 | Step2 | |
7 | Step1 | Step2 | |
8 | Step1 | Step2 | |
9 | Step1 | Step2 | Step3 |
10 | Step1 | Step2 | Step3 |
11 | Step1 | Step2 | Step3 |
12 | Step1 | Step2 | Step3 |
13 | Step1 | Step2 | Step3 |
14 | Step1 | Step2 | Step3 |
15 | Step1 | Step2 | Step3 |
Essentially, we were asking Apalache to check all executions up to 15 steps, but it had to produce constraints for $15 + 11 + 7 = 33$ actions. Moreover, starting with step 5, the solver had to choose between Step1 and Step2, and starting with step 9, between Step1, Step2, and Step3. If you were solving systems of inequalities at school, you remember that disjunctions caused you a lot of trouble.
What if we save the solver the trouble of choosing the actions? For instance, we could fix the following schedule, that is, a sequence of one action per step (shown in bold):
Step No. | Action 1 | Action 2 | Action 3 |
---|---|---|---|
0 | **Init** | | |
1 | **Step1** | | |
2 | **Step1** | | |
3 | **Step1** | | |
4 | **Step1** | | |
5 | Step1 | **Step2** | |
6 | Step1 | **Step2** | |
7 | Step1 | **Step2** | |
8 | Step1 | **Step2** | |
9 | Step1 | Step2 | **Step3** |
10 | Step1 | Step2 | **Step3** |
11 | Step1 | Step2 | **Step3** |
12 | Step1 | Step2 | **Step3** |
13 | Step1 | Step2 | **Step3** |
14 | Step1 | Step2 | **Step3** |
15 | Step1 | Step2 | **Step3** |
The above schedule should produce significantly fewer constraints than the general one. One issue though is that it may be infeasible, that is, there is no actual execution that follows this schedule.
In our example, we have $2^4 + 3^7 = 2203$ schedules, if we count Step2 and Step3 as indivisible symbolic transitions. This is not a lot. We could just enumerate all of them. Recall that Step2 and Step3 produce two symbolic transitions each. This gives us $3^4 + 5^7 = 78206$ schedules. It’s getting harder. Now imagine we had executions of length 30, not 15.
In general, enumerating schedules is not an option. There are way too many of them. However, many of the schedules are “similar”. Say, if it’s possible to execute Step3 and then Step2, most likely, it’s possible to first execute Step2 and then Step3. Similar effects in various algorithms were noticed a long time ago. There are well-studied techniques of partial-order reduction (check the chapter in the Handbook of Model Checking) and Lipton’s reduction. Unfortunately, nobody has transferred these techniques to TLA+ in their full generality. Perhaps, the ideas by Ernie Cohen and Leslie Lamport in their Reduction in TLA could help us a bit.
4.2. Randomized symbolic execution
It would be great to have a complete reduction technique, but we don’t know how to get one. If we forget about completeness, there is a very simple way of exploiting similarities between schedules: simply generate feasible schedules uniformly at random! If many schedules are similar, then there should be a good chance of hitting one of them. This is exactly what the command simulate in Apalache does. I am using intuitive and imprecise wording on purpose, as I don’t have a precise theory to capture these properties yet.
This kind of randomized search is implemented with the command simulate. Let’s simply run simulate instead of check and see what happens:
$ apalache-mc simulate --max-run=100 --length=15 \
--init=InitWithFaults --next=CorrectStep --inv=AgreementInv MC_n6t1f1.tla
...
State 0: Checking 1 state invariants
State 0: state invariant 0 holds.
Step 0: randomly picked transition 0
Step 1: Transition #0 is disabled
Step 1: randomly picked transition 3
Step 2: randomly picked transition 3
Step 3: Transition #5 is disabled
Step 3: randomly picked transition 3
Step 4: Transition #4 is disabled
Step 4: Transition #5 is disabled
...
Step 15: Transition #3 is disabled
Step 15: randomly picked transition 2
----------------------------
Symbolic runs left: 99
State 0: Checking 1 state invariants
...
----------------------------
Symbolic runs left: 1
Checker reports no error up to computation length 15
It took me 0 days 0 hours 6 min 27 sec
Six minutes instead of almost two days. The above execution constructs 100 symbolic runs (as set with the option --max-run) and checks the invariant against all of them.
Can we push it beyond 15 steps? Let’s try 100 symbolic runs of 30 steps:
$ apalache-mc simulate --max-run=100 --length=30 \
--init=InitWithFaults --next=CorrectStep --inv=AgreementInv MC_n6t1f1.tla
...
Checker reports no error up to computation length 30
It took me 0 days 4 hours 26 min 57 sec
This time, it takes longer, but it is still much faster than bounded model checking for 15 steps, which has to exhaustively analyze all available actions.
Now I am feeling lucky. Let’s try symbolic runs of 100 steps:
$ apalache-mc simulate --length=100 --init=InitWithFaults --next=CorrectStep \
--inv=AgreementInv MC_n6t1f1.tla
...
Check the trace in: ...
Found a deadlock.
It took me 0 days 0 hours 4 min 38 sec
A deadlock is a bit unexpected. Let’s look at the produced example:
State40 ≜
...
∧ round
= SetAsFun({ ⟨"1_OF_REPLICA", 3⟩,
⟨"4_OF_REPLICA", 3⟩,
⟨"3_OF_REPLICA", 3⟩,
⟨"2_OF_REPLICA", 3⟩,
⟨"0_OF_REPLICA", 3⟩ })
...
Recall that we have defined ROUNDS
as 0..3
. This is it. All correct
processes have reached round 3 and cannot progress any further. Hence, this is
not a real deadlock but a by-product of our modeling. The solution is to either
ignore deadlocks with --no-deadlock
, or introduce an explicit stuttering step:
CorrectStepOrStutter ≜
∨ CorrectStep
∨ UNCHANGED ⟨decision, msgs1, msgs2, round, step, value⟩
TLA+ even has special notation for producing CorrectStepOrStutter, namely:
[CorrectStep]_⟨decision, msgs1, msgs2, round, step, value⟩
How it works. Every run is constructed incrementally, by randomly picking a symbolic transition (an action or its part):
i. As expected, $A_{n(0)}$ is the initial-state predicate, e.g., InitWithFaults.

ii. In a step $i > 0$, the model checker randomly selects a symbolic transition $A_{n(i)}$. Then, it checks whether there is at least one concrete execution that follows the actions $A_{n(0)}, A_{n(1)}, \dots, A_{n(i)}$.

- If there is such a concrete execution, that is, there is a model for the sequence of actions, the model checker fixes $A_{n(i)}$, checks the invariant and proceeds to step $i + 1$.
- If there is no such concrete execution, the model checker declares $A_{n(i)}$ to be disabled in the context of the sequence $n(0), n(1), \dots, n(i-1)$. Then, it randomly selects another value for $n(i)$ among the values that have not been tried yet.
- If the model checker found all actions at step $i$ to be disabled, it reports a deadlock (unless the option --no-deadlock is set).
4.3. Unexpected perk – parallel search
An interesting feature of randomized search is that it is infinitely parallelizable, essentially, for free!
My laptop has 12 CPU cores, and Apalache uses only one of them. I can easily run 10 symbolic runs on each of ten cores, which gives me 100 symbolic runs approximately 10 times faster! Even better, if there is a counterexample, the parallel search often finds it faster. All we need is GNU Parallel:
$ seq 0 9 | parallel --delay 1 --halt now,fail=1 \
apalache-mc simulate --max-run=10 --length=20 --out-dir={1} \
--init=InitWithFaults --next=CorrectStep --inv=AgreementInv MC_n6t1f2.tla
...
State 13: state invariant 0 violated.
Total time: 48.301 sec
When you run this command, your running times may differ a lot, due to randomization. The above line contains plenty of parameters to parallel. If you have not seen GNU Parallel before, check its manual. It’s totally worth it!
Remember the model checking runs that took over six hours? Let’s run ten randomized searches in parallel:
$ seq 0 9 | parallel --delay 1 --halt now,fail=1 \
apalache-mc simulate --max-run=10 --length=30 --inv=AllDecisionEx \
--init=Init --next=Next --out-dir={1} MC_n6t1f2.tla
...
State 20: state invariant 0 violated.
Total time: 55.31 sec
As you can see, randomized search seriously boosts our bug-finding capabilities.
5. Checking unbounded executions via an inductive invariant
If you have reached this part of the blogpost, you are going to learn about the most advanced technique supported by Apalache.
Bounded model checking and randomized symbolic execution are low-effort techniques in the sense that they do not require a lot of boilerplate to check the invariants of a TLA+ specification. However, these techniques do not give us a guarantee of safety for executions that are longer than the bound provided by the user (using the option --length).
Interestingly, bounded model checking can be used to reason about unbounded executions, as we noticed in the paper at OOPSLA’19. This is the technique of inductive invariants. In this section, we will discuss how to apply this technique to show that AgreementInv holds true for all executions of our specification.
5.1. Basics of inductive invariants
When we want to reason about unbounded executions, intuitively, we want to reason as follows:
- Start in an arbitrary “good” state of the algorithm.
- Let the algorithm make a single transition.
- Show that the resulting state is still “good”.
The technique of inductive invariants formalizes this intuition. For example, it is used in Specifying Systems (Chapter 5). Assume that we want to prove safety of our TLA+ specification with respect to a state invariant $Safety$. We have to find a state predicate $IndInv$, called an inductive invariant, that satisfies the following three properties:
1. All initial states satisfy the inductive invariant: $\models Init \Rightarrow IndInv$.
2. For every state $s$ that satisfies $IndInv$, every successor $s’$ of $s$ via $Next$ satisfies the inductive invariant: $\models IndInv \land Next \Rightarrow IndInv’$.
3. Every state $s$ that satisfies $IndInv$ also satisfies the safety property: $\models IndInv \Rightarrow Safety$.
Why is $IndInv$ called an inductive invariant? It allows us to use the induction principle to prove that $Safety$ holds true for every state reachable from $Init$. Requirements 1 and 3 give us the base case of the induction. Requirements 2 and 3 give us the inductive step.
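If you have not seen inductive invariants before, here is a toy example of my own (completely unrelated to Ben-Or’s protocol) that shows the difference between an invariant and an inductive invariant. The counter below never reaches 1, but the predicate x ≠ 1 alone is not inductive: a single step from the state x = -1, which satisfies x ≠ 1, leads to x = 1. The evenness predicate is inductive and implies x ≠ 1:
---- MODULE ToyCounter ----
EXTENDS Integers

VARIABLE
    \* @type: Int;
    x

ToyInit ≜ x = 0
ToyNext ≜ x' = x + 2

\* the safety property: the counter never hits 1
ToySafety ≜ x ≠ 1
\* an inductive invariant: ToyInit ⇒ ToyIndInv,
\* ToyIndInv ∧ ToyNext ⇒ ToyIndInv', and ToyIndInv ⇒ ToySafety
ToyIndInv ≜ x % 2 = 0
====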
5.2. Shape invariant
Since the state variables in TLA+ are completely unrestricted, the very first step in finding an inductive invariant is to constrain the shape of the state variables. This shape is usually given with a predicate called $TypeOK$, see Specifying Systems:
TypeOK ≜
∧ value ∈ [ CORRECT → VALUES ]
∧ decision ∈ [ CORRECT → VALUES ∪ { NO_DECISION } ]
∧ round ∈ [ CORRECT → ROUNDS ]
∧ step ∈ [ CORRECT → { S1, S2, S3 } ]
∧ ∃ A1 ∈ SUBSET [ src: ALL, r: ROUNDS, v: VALUES ]:
msgs1 = [ r ∈ ROUNDS ↦ { m ∈ A1: m.r = r } ]
∧ ∃ A1D ∈ SUBSET [ src: ALL, r: ROUNDS, v: VALUES ],
A1Q ∈ SUBSET [ src: ALL, r: ROUNDS ]:
msgs2 = [ r ∈ ROUNDS ↦
{ D2(mm.src, r, mm.v): mm ∈ { m ∈ A1D: m.r = r } }
∪ { Q2(mm.src, r): mm ∈ { m ∈ A1Q: m.r = r } }
]
In theory, we should be able to show that $TypeOK$ is also inductive (with $Safety$ being just $\textsf{TRUE}$). In practice, to check inductiveness of the above $TypeOK$, Apalache would have to reason about the powersets of relatively large sets. We can still try it:
$ apalache-mc check --init=TypeOK --inv=TypeOK --length=1 MC_n6t1f1_inductive.tla
...
The set $C$1100 is expanded, producing O(2^36) constraints. This may slow down the solver.
...
EXITCODE: ERROR (255)
To work around this issue, we use induction with a caveat. Instead of using just $IndInv$, we are using $IndInit$ and $IndInv$:
IndInv ≜
...
IndInit ≜
∧ TypeOK
∧ IndInv
5.3. Discovering IndInv
This is the hardest part of the technique. It requires some understanding of the algorithm. If we do not want to write a paper proof, we can follow a refinement loop:
1. Start with $TypeOK$ as the first approximation of $IndInv$. Check whether $IndInv$ implies $AgreementInv$:

$ apalache-mc check --init=IndInit --inv=AgreementInv --length=0 MC_n6t1f1_inductive.tla

2. If not, inspect the counterexample, add a conjunct to $IndInv$. Go to step 1.

3. Otherwise, check whether the inductive step holds true:

$ apalache-mc check --init=IndInit --inv=IndInv --length=1 MC_n6t1f1_inductive.tla

4. If not, inspect the counterexample, add a conjunct to $IndInv$. Go to step 3.

5. Check whether the initial states satisfy the inductive invariant:

$ apalache-mc check --init=InitWithFaults --inv=IndInv --length=0 MC_n6t1f1_inductive.tla
I have followed the above steps to find a good inductive invariant. You can check the code of IndInv. It contains 13 lemmas:
IndInv ≜
∧ Lemma2_NoEquivocation1ByCorrect
∧ Lemma3_NoEquivocation2ByCorrect
∧ Lemma4_MessagesNotFromFuture
∧ Lemma5_RoundNeedsSentMessages
∧ Lemma6_DecisionDefinesValue
∧ Lemma7_D2RequiresQuorum
∧ Lemma8_Q2RequiresNoQuorumFaster
∧ Lemma9_RoundsConnection
∧ Lemma10_M1RequiresQuorum
∧ Lemma11_ValueOnQuorumLessRam
∧ Lemma12_CannotJumpRoundsWithoutQuorum
∧ Lemma13_ValueLock
\* this lemma is rather slow
∧ Lemma1_DecisionRequiresLastQuorumLessRam
For instance, the crucial lemma 1 looks as follows:
ExistsQuorum2LessRam(r, v) ≜
LET nv ≜ Cardinality({ m ∈ msgs2[r]: IsD2(m) ∧ AsD2(m).v = v })
n ≜ Cardinality(msgs2[r])
IN
∧ n ≥ N - T
∧ nv ≥ T + 1
∧ 2 * nv > N + T
Lemma1_DecisionRequiresLastQuorumLessRam ≜
Lemma1c ∷
∀ id ∈ CORRECT:
∨ decision[id] = NO_DECISION
∨ round[id] > 1 ∧ ExistsQuorum2LessRam(round[id] - 1, decision[id])
5.4. Using the inductive invariant
Let’s see how much time it takes to check inductiveness of $IndInv$ and use it to prove $AgreementInv$, as we outlined in the basics.
- Checking the initial states:

$ apalache-mc check --init=InitWithFaults --inv=IndInv --length=0 MC_n6t1f1_inductive.tla
...
Checker reports no error up to computation length 0
Total time: 6.270 sec

- Checking agreement:

$ apalache-mc check --init=IndInit --inv=AgreementInv --length=0 MC_n6t1f1_inductive.tla
...
Total time: 452.58 sec

- Checking the inductive step:

$ apalache-mc check --init=IndInit --inv=IndInv --length=1 MC_n6t1f1_inductive.tla
# see the text below
Since this takes most of the time, we ran the above query for all 13 lemmas in parallel. The running times of checking the lemmas were as follows: the slowest lemma required over 8 days; the second slowest lemma required 2.5 days; all the remaining lemmas required less than 30 minutes.
5.5. Bonus: checking decision finality
It was a lot of work to check AgreementInv. Can we use the inductive invariant to check another property? It turns out that IndInv is general enough to check decision finality:
\* Once a correct replica decides, it never changes its decision
FinalityInv ≜
∀ id ∈ CORRECT:
∨ decision[id] = NO_DECISION
∨ decision'[id] = decision[id]
The invariant FinalityInv says that whenever a correct process has made a decision, it cannot change it. Notice that FinalityInv is an action invariant, that is, it must be checked against a pair of states $(s, s’)$, in contrast to a state invariant, which is checked against a single state $s$.
By looking at the pseudo-code and the specification, it is hard to tell whether this invariant holds true. When a correct process makes a decision, it continues sending and receiving messages. Hence, potentially, it could change its decision value. Using the inductive invariant, we check FinalityInv in five minutes:
$ apalache-mc check --init=IndInit --inv=FinalityInv --length=1 MC_n6t1f0_inductive.tla
...
Checker reports no error up to computation length 1
Total time: 304.89 sec
6. Conclusions
It took me two days to write the first working version of the TLA+ specification and check its basic properties. This was unusually fast. I believe that this is due to several factors:
- known patterns for specifying consensus in TLA+ and the plasticity of TLA+,
- my experience in specifying consensus algorithms in TLA+, and
- advanced analysis techniques implemented in Apalache.
You can check the timeline of finding the inductive invariant. In total, it took 15 hours of my (human) time to discover a good collection of lemmas. The first version required over 40 GB of RAM. It took me 4 more hours to write lemmas that required about 7 GB of RAM. It took significantly more hours of compute time running Apalache on a dedicated workstation. I do not have concrete figures for the whole timeline. The first version required 10 days to check, and the final version took 9 days. Approximately, the model checker was running for three weeks, burning from 1 to 13 CPU cores.
In my opinion, it is a very healthy balance between my time and the compute time:
I had to invest significantly less time than the model checker.
To be fair, I probably needed so little time to come up with lemmas because Ben-Or’s consensus was a running example in the paper by Nathalie Bertrand, Marijana Lazic, Josef Widder, and myself (2021). In that paper, we developed a technique for the parameterized case, that is, showing safety for all $N$, $T$, and $F$ satisfying $N > 5 \cdot T$, $T \ge F \ge 0$, and for unboundedly growing round numbers! In addition, we had a (manually written) proof of the algorithm’s termination. However, I have to warn you that the paper is much heavier on theory than this blogpost.
Our restriction to three rounds seems to be a serious limitation. Of course, we could increase the number of rounds, say, to five or ten, and conduct further experiments. The experiments will probably take longer than reasonable. Intuitively, three rounds is the minimally interesting configuration: some processes may be in “the past” (round 1), some processes may be in “the current round” (round 2), some processes may be in “the future” (round 3).
As we can see, randomized symbolic execution is unexpectedly efficient at discovering invariant violations. Combined with GNU Parallel, it opens space for fast experimentation with TLA+. Currently, the symbolic runs in simulate are completely independent. Similar to code fuzzers, we could explore prioritization of runs based on previous results. This is the next direction of my research.
I would really love to learn how much effort it takes to discover an inductive invariant (or prove inductiveness of IndInv) in TLAPS, Lean, Coq, or Dafny. Having such experimental evidence would help all of us understand where to put our efforts first. In my experience, we should start with lightweight tools that bring value to the customer fast. Once the customer senses the value of specifications and the lightweight tools, we can continue with heavier tools.
Finally, when I was finishing these conclusions, Google Scholar brought me the Master’s thesis by Marijus Jasinskas. I have only browsed through the thesis; it proposes optimization techniques that make TLC applicable to model checking of Ben-Or’s algorithm.