Author: Igor Konnov

Tags: specification model-checking tlaplus quint apalache SMT

Reviewers: ChatGPT was the first reviewer of this blog post. I let it improve some of my wording.

Recently, we have published two technical papers on arXiv that both use model checkers as the main vehicle for verifying properties of fault-tolerant distributed algorithms. I do not want to repeat these reports in this blog post. If you like this blog, you will probably enjoy reading the papers as well:

  1. ChonkyBFT: Consensus Protocol of ZKsync. This is a follow-up to the earlier blog post on ChonkyBFT.

  2. Exploring Automatic Model-Checking of the Ethereum Specification. This is a technical report on the work done under an academic grant from the Ethereum Foundation.

As an independent researcher, I have to deliver real value to customers. That means I find myself discussing the merits of different tools and approaches to protocol correctness more often than ever. In my view, we should embrace as many tools as possible to improve correctness and security across all critical dimensions.

These tools span a wide range of effort and cost, roughly in this order: testing, property-based testing, simulation, fuzzing, model checking, paper-and-pencil proofs, writing research papers, and proving with a proof assistant.

There’s simply no room for dogmatic thinking anymore. In this post, I want to focus specifically on model checking — and the value it brings.

What is value after all? I must admit that I still have some form of PTSD when people start talking about “value” and “product-market fit” around me. As a person with a technical education, by “value” I normally understand something measurable, like a number or a function of other values. Or, perhaps, monetary gains and losses. Even though business people would like to put a dollar value on everything they see, we can only subjectively evaluate deep technical work in dollars. For the rest of this blog post, by value I mean a technical improvement of a protocol, preferably a measurable one. Obviously, finding a protocol bug that has economic impact always brings value to the customer. But what if there is no such bug? Showing that the protocol has a given property within a specific scope also brings value to the customer. And if we can do it fast and spend less on personnel costs, that value becomes even more tangible.

1. What is the value of testing?

Before jumping into model checking, let’s think about the value of testing. I believe every other PhD thesis on formal methods contains this famous quote by Edsger W. Dijkstra:

Testing can only prove the presence of bugs, not their absence.

My thesis certainly had this quote. While the quote is technically true and clever, we tend to use it to argue that formal verification is ultimately superior to testing. In practice, it is not. You definitely should not even think about formal verification if you have not tested your system. Let me put it in a bigger font:

By all means, write tests. Please, do. Write many.

What is the value of tests? There are several “values”:

  1. A test makes sure that your system or protocol has at least one execution that does something useful, that is, it reaches at least one interesting state. From a purely mathematical view of a program, this makes little sense. Of course, a program computes something! Well, it may happen that all it computes is a bunch of errors. As our little nightly CI builders say: “For the night is dark and full of terrors”.

  2. If you break previous assumptions about the system behavior by changing the code, a test will quickly show that you did. Since software is constantly being rewritten, tests are a quick and cheap way to keep us away from regressions. This is why we all do continuous integration.

  3. Less obviously, a test suite may demonstrate a gap in the designer’s understanding. A large class of untested inputs may open up space for an attacker.

Now, as many people notice these days, single-input tests can only check the system’s behavior for one data point. That is not a lot, especially if the system has a huge input space. Intuitively, we expect this single data point to be representative of a large equivalence class of other data points, all of which make our system behave “more or less” the same.

Of course, it is hard to manually reason about these equivalence classes. Fortunately, we have property-based testing tools and fuzzers. They can try multiple inputs, and they can run for longer times. Unfortunately, naive application of these tools would give us false confidence in our system’s correctness. To use these tools efficiently, we have to understand how they work. This is a topic for another blog post.

Similar to property-based testing, we can also write a protocol simulator. For example, Quint has a built-in randomized simulator. It also gives us a quick start at disproving the correctness of our protocol. Sometimes, the simulator finds bugs very quickly, much faster than a model checker would. This especially applies to shallow bugs. In fact, I used the Quint simulator in parallel with the model checker when looking for the inductive invariant of ChonkyBFT.
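
To make this concrete, here is a minimal Quint sketch of the kind of spec the simulator consumes. It is a toy module with a deliberately shallow bug, not a fragment of the ChonkyBFT spec:

```quint
module toyTransfer {
  // two account balances
  var alice: int
  var bob: int

  action init = all { alice' = 10, bob' = 10 }

  // a deliberately shallow bug: the transferred amount
  // is never checked against Alice's balance
  action transferToBob = {
    nondet amount = 1.to(20).oneOf()
    all {
      alice' = alice - amount,
      bob' = bob + amount
    }
  }

  action step = transferToBob

  // the invariant we would like to hold
  val noOverdraft = alice >= 0 and bob >= 0
}
```

Assuming the module is saved as `toy_transfer.qnt`, a command along the lines of `quint run toy_transfer.qnt --invariant=noOverdraft` asks the randomized simulator to look for a violation. Since almost every random execution breaks `noOverdraft`, a counterexample shows up after a handful of samples.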

2. What is the value of model checking?

On the surface, there are two “value” points:

  1. Model checking, especially SMT-backed model checking, is efficient at finding bugs.

  2. Model checking proves that the protocol is correct.

Well. While the above points are sometimes true, they are more often not — especially when we do not use the tools right.

Bug finding. It is true that model checkers that use SMT solvers can find certain bugs relatively fast, when randomized exploration cannot. Intuitively, symbolic model checkers are the most effective when a bug is manifested by a “very narrow” equivalence class of inputs. When the bug’s equivalence class is “large”, randomized exploration is also effective. For example, Quint’s randomized simulator was good at finding some invariant violations in our work on the specification of the ZKsync Governance Protocol, but only Apalache found other violations. So both tools were good, but for different kinds of invariants. If you prefer watching to reading, you can also check out my talk at DSS’24, where I explain the approaches we tried.
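
Here is a contrived Quint sketch, purely for illustration, of what such a narrow bug looks like. The invariant fails in exactly one point of a large input space: a needle that random sampling is unlikely to hit, while for an SMT solver it is just one more constraint to satisfy:

```quint
module needle {
  var x: int

  action init = x' = 0

  // each step adds an arbitrary increment between 0 and 1,000,000
  action step = {
    nondet dx = 0.to(1000000).oneOf()
    x' = x + dx
  }

  // violated only when x lands exactly on 271828:
  // a "very narrow" equivalence class of executions
  val notTheMagicValue = x != 271828
}
```

A symbolic check along the lines of `quint verify needle.qnt --invariant=notTheMagicValue` should return a counterexample of length one almost instantly, whereas `quint run` with the same invariant may keep sampling for a long time without ever hitting the magic value.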

Proofs. Again, it is true that model checking can prove correctness under certain conditions. For instance, the famous traffic light example has 3–4 states, so we can exhaustively enumerate all of them (just search for it; there are plenty of links). However, when we deal with realistic protocols, we have to restrict the search scope, for example, by fixing protocol parameters such as the number of processes, faults, or time bounds. We often have to restrict the scope even when using symbolic model checkers. After all, if it were easy, why would we have written a whole book on Decidability of Parameterized Verification? Some people call it the “undecidability” book.
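
For completeness, here is roughly what that traffic light looks like in Quint. With only three reachable states, exhaustive exploration is trivial:

```quint
module trafficLight {
  var color: str

  action init = color' = "red"

  // the light cycles through red -> green -> yellow -> red
  action step = any {
    all { color == "red",    color' = "green" },
    all { color == "green",  color' = "yellow" },
    all { color == "yellow", color' = "red" }
  }

  // a checker can verify this by enumerating all three states
  val colorIsValid = Set("red", "green", "yellow").contains(color)
}
```

Covering three states takes a fraction of a second with any of the tools; the trouble, as noted above, starts with realistic protocols.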

The real value of model checking. If model checking isn’t primarily about finding bugs or generating proofs, then why do we need it at all? It took me some time, but I think I’ve come to see where its real value lies:

  1. Model checking is complete under bounded scopes, that is, if it terminates without reporting a counterexample, there is no bug in the given scope. If it does not terminate, we can reduce the search scope.

  2. Under small scopes, model checking terminates in reasonable time. Everybody has their own definition of reasonable. If we compare it with fuzzing, running experiments for days and weeks is still fine.

  3. Model checking is mostly automated. We still need a human expert to write a spec and to define the scope, but most of the search is done automatically. Moreover, there are ways to throw more compute resources at the problem. For instance, we used a machine with 48 cores and 256G of RAM in the ChonkyBFT experiments. I could even have used a beefier machine to get the results faster.

Let’s think about it a bit. On the one hand, we can run a PBT tool or a fuzzer for quite a while. Yet, we never know where it looked and where it did not. To be fair, code fuzzers usually produce line coverage, so we have an idea of the lines they missed, but we don’t know much about the inputs they explored. While line coverage is useful for sequential programs, it is almost useless for distributed algorithms, where several instances have to execute the same line in order to make progress. On the other hand, we could also specify the protocol in Lean 4 or Rocq and then prove its correctness right there. The question is always how long it would take and how many qualified people it would require. There is a lot of hope for LLMs doing the job. I am still looking for someone to prove safety of Ben-Or’s consensus, say, in Lean. You even have the inductive invariant found with Apalache at your disposal; see the blog post on Model checking safety of Ben-Or’s Byzantine consensus with Apalache. A professor friend of mine gave me a six-month estimate for their student to do that in Lean. Well, we talked about probabilities, too, so it may be faster in the case of safety alone. Another question that people in industry are asking: if it takes a year to prove the correctness of an industrial-strength protocol with a prover, what do we do after the protocol has been improved by the engineers?

We definitely encountered the same story of restricted search scopes, both in the ChonkyBFT paper and in the 3SF model-checking report. Interestingly, in the case of ChonkyBFT, the state space defined by the inductive invariant was so immense that we had to restrict the set cardinalities even further, to relatively small values. This required a bit of tool customization on the Quint side. See the paper for the details.

Bounded scopes come in different shapes:

  • Fixed or bounded protocol parameters, such as $n$, $t$, and $f$, say, in Ben-Or’s algorithm.

  • Execution length, such as the bound on the depth in bounded model checking.

  • Set bounds, such as the bounds on set cardinalities, e.g., as introduced in the ChonkyBFT paper.
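
As a small illustration of the first shape, here is how fixed parameters might look in a hypothetical Ben-Or-like Quint spec; the names and values are placeholders rather than the configuration used in the papers:

```quint
module benOrScope {
  // fixed protocol parameters: n processes, at most t faults tolerated,
  // and f faults actually injected into the model
  pure val n = 6
  pure val t = 1
  pure val f = 1

  // with the parameters fixed, the process sets are fixed as well
  pure val correct = Set("p1", "p2", "p3", "p4", "p5")
  pure val byzantine = Set("p6")

  // ...the protocol state and actions would go here...
  var round: int
  action init = round' = 0
  action step = round' = round + 1
}
```

The second shape, the execution length, is typically set on the command line, for example with an option like `--max-steps` for `quint verify`, while bounds on set cardinalities required the bit of tool customization mentioned above.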

3. Do we need better model checkers?

Over the course of 2024, I’ve become even more convinced that we still need better model checkers for distributed protocols. In Apalache, we introduced bounded scopes almost by accident—initially aiming for a quantifier-free encoding in SMT. Later, we added value generators, which offered a more intentional way to restrict the search space, similar to property-based testing. By contrast, Alloy had scopes from the very beginning, and bounded model checking has always centered around limiting the search to short executions.

After working with complex industrial protocols, I’ve grown a bit frustrated with SMT solvers. It’s no secret in the field that their performance can be unpredictable and highly sensitive to how the problem is encoded. In theory, Apalache uses a relatively simple, quantifier-free SMT encoding — which should make things more predictable. In practice, though, we still have very little control over execution times.

Maybe it’s time to write another model checker :smiley: