<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://protocols-made-fun.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://protocols-made-fun.com/" rel="alternate" type="text/html" /><updated>2026-03-09T15:49:06+00:00</updated><id>https://protocols-made-fun.com/feed.xml</id><title type="html">Protocols Made Fun</title><subtitle>All things about protocol specification, testing, and verification. Creative Commons Attribution 4.0 International License.</subtitle><author><name>{&quot;igor&quot;=&gt;{&quot;name&quot;=&gt;&quot;Igor Konnov&quot;, &quot;url&quot;=&gt;&quot;https://konnov.phd&quot;, &quot;email&quot;=&gt;&quot;igor@konnov.phd&quot;}}</name></author><entry><title type="html">All you need is a simulator? Nope</title><link href="https://protocols-made-fun.com/testing/model-checking/2026/03/09/random-walks.html" rel="alternate" type="text/html" title="All you need is a simulator? Nope" /><published>2026-03-09T00:00:00+00:00</published><updated>2026-03-09T00:00:00+00:00</updated><id>https://protocols-made-fun.com/testing/model-checking/2026/03/09/random-walks</id><content type="html" xml:base="https://protocols-made-fun.com/testing/model-checking/2026/03/09/random-walks.html"><![CDATA[<p><strong>Author:</strong> <a href="https://konnov.phd">Igor Konnov</a></p>

<p><strong>Date:</strong> March 09, 2026</p>

<p><strong>Punchline: Testing distributed protocols with random simulation and stateful
property-based testing (PBT) is not enough!</strong> Yes, running a simulator for days
is better than doing manual testing or just running unit tests. But <strong>you will
miss states, which may expose bugs</strong>. <strong>Even on very small systems.</strong> I have
been saying exactly this to many software engineers. Many times. However,
whiteboard arguments do not help. As humans, we have a great deal of trust in
probabilities, and our intuitive understanding of randomness is often wrong.
Hence, I am giving you concrete figures and plots in this blog post. I must
admit that my own intuition was also wrong: I expected fewer random walks to be
needed to achieve good coverage. For a quick glance, see <a href="#quick-summary">Quick
summary</a>.</p>

<p>Achieving <strong>complete coverage with random walks is hard</strong>. This is especially
important to know <strong>if you are using them to produce test cases for your
implementation</strong>. It is also crucial if you generate an
implementation of a distributed protocol with AI tools and <strong>hope for random
walks/PBT to work as an ultimate guardrail</strong>.</p>

<p>Don’t get me wrong. I like PBT and simulators (having written the <a href="https://github.com/informalsystems/quint">Quint</a>
simulator). I believe that these tools are must-haves for testing. See my
recent blog post on <a href="/pbt/2025/12/22/pbt-adversarial-llms.html">Property-based testing, adversarial developers, and
LLMs</a>. However, they are not the only tools that we need to make sure
that our systems work as expected. This is especially true now, when we do not
have time to properly design and review the AI-generated code.</p>

<p><strong>Why now?</strong> It has always been difficult to compare search procedures that were
developed by different branches of computer science. Everyone wanted to promote
their technique as the ultimate winner. Want to compare property-based testing
and model checking? Bad luck. Different tools require different inputs. Some are
libraries for programming languages (like <a href="https://en.wikipedia.org/wiki/QuickCheck">QuickCheck</a>), some are tools for
specification languages (like <a href="https://github.com/tlaplus/tlaplus">TLC</a> and <a href="https://apalache-mc.org">Apalache</a>). Now it is much faster
to design frameworks and to experiment with multiple search procedures. It is also
easier to do reproducible experiments with LLMs. Good times, if you know how to
conduct experimental research.</p>

<div class="tip-box">
    <div class="tip-header"><strong>💡 Tip:</strong></div>
    <div class="tip"><p>In contrast to the previous blog posts, I do not
provide the artifacts for download. AI slop forks are real. It still takes me
several days to design and conduct the experiments on a beefy machine, as well
as to find the right format to interpret and plot the data. Even with the help
of the frontier models, though they are of great help. It only takes 10-15
minutes to repackage the benchmarks and results with an AI tool, having the
experimental data. Hence, I am sharing my lab book with the customers and
researchers, upon request.</p>
</div>
</div>

<p><a id="quick-summary"></a></p>

<h2 id="1-quick-summary-for-the-impatient-readers">1. Quick summary for the impatient readers</h2>

<p>Look at the two groups of figures below. They summarize the results of running
random walks on specifications of three prominent distributed protocols:
two-phase commit, readers-writers, and FPaxos (see <a href="#benchmarks">Benchmarks</a>).
The figures show the coverage achieved by random walks, with 100% being the
number of distinct states (as reported by the model checker TLC). In addition, we
plot the running times of the random walks against the
right y-axis. All running times are on an AMD Ryzen 9 5950X processor (16
physical, 32 logical cores) with 128 GB of memory.</p>

<h3 id="11-coverage-for-minimal-instances">1.1 Coverage for minimal instances</h3>

<p>In this set of experiments, we do random walks on the minimal instances of the
benchmarks. We start with the <strong>meaningful default</strong> of 100,000 random walks,
with at most 100 steps per walk. As you can see from Figure 1, only in the case
of two-phase commit with two resource managers do we achieve complete state
coverage. This is not surprising, since this instance has only 56 states. It’s
tiny! For two-phase commit with three resource managers and readers-writers with
three actors, we achieve 85-90% coverage. This is also in the reasonable range.
On <strong>FPaxos with two acceptors, we achieve 77.5% coverage with 100k random
walks</strong>. This is a bit worrying, since the state space has only about 37k states.</p>

<p>The good news is that we can push all of the above benchmarks to achieve over
99% coverage. As you can see in the figures, it takes <strong>10 million random walks
to achieve 99% coverage</strong>. In addition to that, <strong>these runs require 1-2
hours</strong>.</p>

<p><strong>All of the above benchmarks are quite small by the model checking
standards. They have tens of thousands of states. It takes TLC only 1-3 seconds
to explore the state space and check the invariants for each of these
benchmarks.</strong></p>

<div class="figure-grid">
  <figure>
    <a href="/img/random-walks/twophase-n2-coverage.png" target="_blank" title="Click to open full-size"><picture>
      <img class="responsive-img" src="/img/random-walks/twophase-n2-coverage.png" alt="Coverage of random walks for the two-phase commit benchmark with 2 resource managers" />
    </picture></a>
    <figcaption>Figure 1.a: Two-phase commit, 2 RMs.</figcaption>
  </figure>
  <figure>
    <a href="/img/random-walks/twophase-n3-coverage.png" target="_blank" title="Click to open full-size"><picture>
      <img class="responsive-img" src="/img/random-walks/twophase-n3-coverage.png" alt="Coverage of random walks for the two-phase commit benchmark with 3 resource managers" />
    </picture></a>
    <figcaption>Figure 1.b: Two-phase commit, 3 RMs.</figcaption>
  </figure>
  <figure>
    <a href="/img/random-walks/rw-inst3-coverage.png" target="_blank" title="Click to open full-size"><picture>
      <img class="responsive-img" src="/img/random-walks/rw-inst3-coverage.png" alt="Coverage of random walks for the readers-writers benchmark with 3 actors" />
    </picture></a>
    <figcaption>Figure 1.c: Readers-writers, 3 actors.</figcaption>
  </figure>
  <figure>
    <a href="/img/random-walks/fpaxos-inst2-coverage.png" target="_blank" title="Click to open full-size"><picture>
      <img class="responsive-img" src="/img/random-walks/fpaxos-inst2-coverage.png" alt="Coverage of random walks for the FPaxos benchmark with 2 acceptors" />
    </picture></a>
    <figcaption>Figure 1.d: FPaxos, 2 acceptors.</figcaption>
  </figure>
</div>

<h3 id="12-slightly-larger-instances">1.2 Slightly larger instances</h3>

<p>What happens if we take instances that are still small but have one or two
more participants? Figure 2 shows the results of doing random walks on these
instances.</p>

<p>As you can see, with the meaningful default of 100,000 random walks, we achieve
extremely poor coverage, about 25-30%, on the benchmarks with up to 2 million states.
<strong>On FPaxos with 4 acceptors, we achieve only 3% coverage after 100,000
random walks</strong>. Really bad!</p>

<p>To see how far we could push the coverage, we did the experiments with 10-100
million random walks. It is clear that <strong>in 1-2 hours of simulation we get to
60-80% coverage</strong>. That is good, but not great. When we push FPaxos with 3
acceptors to 100 million random walks, we get to 94.5% coverage. Nice, though it
took us 7.5 hours to get there. However, <strong>on FPaxos with 4 acceptors, we get
poor coverage of 60.4% even with 100 million random walks, which took us 8.5
hours to run</strong>. This benchmark has about 11 million states. So it is reasonably
large, but, again, <strong>not that large by the model checking standards</strong>.</p>

<p>Again, <strong>it takes the model checker TLC up to 10 minutes to enumerate all the
states and check the invariants for these instances</strong>, whereas we have been
<strong>running the simulations for hours!</strong> This is especially
striking, given that we are <strong>running optimized simulators in Rust</strong>.</p>

<div class="figure-grid">
  <figure>
    <a href="/img/random-walks/twophase-n5-coverage.png" target="_blank" title="Click to open full-size"><picture>
      <img class="responsive-img" src="/img/random-walks/twophase-n5-coverage.png" alt="Coverage of random walks for the two-phase commit benchmark with 5 resource managers" />
    </picture></a>
    <figcaption>Figure 2.a: Two-phase commit, 5 RMs.</figcaption>
  </figure>
  <figure>
    <a href="/img/random-walks/rw-inst4-coverage.png" target="_blank" title="Click to open full-size"><picture>
      <img class="responsive-img" src="/img/random-walks/rw-inst4-coverage.png" alt="Coverage of random walks for the readers-writers benchmark with 4 actors" />
    </picture></a>
    <figcaption>Figure 2.b: Readers-writers, 4 actors.</figcaption>
  </figure>
  <figure>
    <a href="/img/random-walks/fpaxos-inst3-coverage.png" target="_blank" title="Click to open full-size"><picture>
      <img class="responsive-img" src="/img/random-walks/fpaxos-inst3-coverage.png" alt="Coverage of random walks for the FPaxos benchmark with 3 acceptors" />
    </picture></a>
    <figcaption>Figure 2.c: FPaxos, 3 acceptors.</figcaption>
  </figure>
  <figure>
    <a href="/img/random-walks/fpaxos-inst4-coverage.png" target="_blank" title="Click to open full-size"><picture>
      <img class="responsive-img" src="/img/random-walks/fpaxos-inst4-coverage.png" alt="Coverage of random walks for the FPaxos benchmark with 4 acceptors" />
    </picture></a>
    <figcaption>Figure 2.d: FPaxos, 4 acceptors.</figcaption>
  </figure>
</div>

<p><a id="benchmarks"></a></p>

<h2 id="2-the-benchmarks">2. The benchmarks</h2>

<p>As benchmarks, we use three specifications of distributed protocols. These are
prominent examples from the repository of <a href="https://github.com/tlaplus/Examples/">TLA+ Examples</a>:</p>

<ul>
  <li>
    <p><strong>Two-phase commit</strong>. This is the famous two-phase commit. The specification
 is explained in <a href="https://www.microsoft.com/en-us/research/publication/consensus-on-transaction-commit/">Consensus on Transaction Commit</a> by Jim Gray
 and Leslie Lamport. You can check the TLA<sup>+</sup> specification in
 <a href="https://github.com/tlaplus/Examples/blob/master/specifications/transaction_commit/TwoPhase.tla">TwoPhase.tla</a>.</p>
  </li>
  <li>
    <p><strong>Readers-writers</strong>. This is a solution to the <a href="https://en.wikipedia.org/wiki/Readers%E2%80%93writers_problem">Readers-Writers Problem</a>.
 The TLA<sup>+</sup> specification by Stephan Merz can be found in
 <a href="https://github.com/tlaplus/Examples/blob/master/specifications/ReadersWriters/ReadersWriters.tla">ReadersWriters.tla</a>.</p>
  </li>
  <li>
    <p><strong>FPaxos</strong>. This is <a href="https://fpaxos.github.io/">Flexible Paxos</a> by Heidi Howard, Dahlia Malkhi, and
 Alexander Spiegelman. The TLA<sup>+</sup> specification can be found in
 <a href="https://github.com/fpaxos/fpaxos-tlaplus/blob/main/FPaxos.tla">FPaxos.tla</a>.</p>
  </li>
</ul>

<p>All of the above specifications are parameterized in the number of participating
processes. We consider several instances of each benchmark. To give you an idea
of their state space sizes (the numbers of reachable states), we compute the
figures with <a href="https://github.com/tlaplus/tlaplus">TLC</a>. The reachable states are called <em>distinct states</em> in TLC,
whereas <em>produced states</em> are the states that TLC generates during the
search. Another important metric is the <em>diameter</em> of the state space, which is
the length of the longest shortest path from an initial state to any reachable
state (read it again!).</p>

<p>As you can see from Table 1, these transition systems are not tiny, but they are
actually small by the model checking standards. Surprisingly, they are
sophisticated enough to challenge random walks! <strong>Distributed protocols are
hard.</strong></p>

<figure>

  <table>
    <thead>
      <tr>
        <th>Benchmark</th>
        <th>Instance</th>
        <th>Distinct states</th>
        <th>Produced states</th>
        <th>Diameter</th>
        <th>TLC times</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>Two-phase commit</td>
        <td>2 resource managers</td>
        <td>56</td>
        <td>154</td>
        <td>8</td>
        <td>1 sec</td>
      </tr>
      <tr>
        <td> </td>
        <td>3 resource managers</td>
        <td>288</td>
        <td>1,146</td>
        <td>11</td>
        <td>2 sec</td>
      </tr>
      <tr>
        <td> </td>
        <td>5 resource managers</td>
        <td>8,832</td>
        <td>58,146</td>
        <td>17</td>
        <td>2 sec</td>
      </tr>
      <tr>
        <td>Readers-writers</td>
        <td>2 readers/writers</td>
        <td>390</td>
        <td>935</td>
        <td>9</td>
        <td>2 sec</td>
      </tr>
      <tr>
        <td> </td>
        <td>3 readers/writers</td>
        <td>21,527</td>
        <td>59,674</td>
        <td>13</td>
        <td>2 sec</td>
      </tr>
      <tr>
        <td> </td>
        <td>4 readers/writers</td>
        <td>2,192,020</td>
        <td>7,069,237</td>
        <td>17</td>
        <td>1 min</td>
      </tr>
      <tr>
        <td>FPaxos</td>
        <td>2 acceptors</td>
        <td>36,953</td>
        <td>245,288</td>
        <td>19</td>
        <td>4 sec</td>
      </tr>
      <tr>
        <td> </td>
        <td>3 acceptors</td>
        <td>362,361</td>
        <td>2,697,682</td>
        <td>25</td>
        <td>21 sec</td>
      </tr>
      <tr>
        <td> </td>
        <td>4 acceptors</td>
        <td>11,279,393</td>
        <td>96,056,172</td>
        <td>31</td>
        <td>9 min</td>
      </tr>
    </tbody>
  </table>

  <figcaption>Table 1: The state space size of the benchmarks</figcaption>
</figure>

<p>In the experiments, I am using a custom framework that represents the above
<strong>specifications-as-code</strong> and makes it easy to experiment with
different search procedures. To make sure that these specifications faithfully
represent the original TLA<sup>+</sup> specifications, I do the following:</p>

<ol>
  <li>
    <p>do a code review (obviously),</p>
  </li>
  <li>
    <p>automatically translate the specifications to TLA<sup>+</sup> and check them
 with TLC,</p>
  </li>
  <li>
    <p>run a custom-tailored model checker to compute the number of distinct states
   and check the invariants.</p>
  </li>
</ol>

<p><a id="experimental-results"></a></p>

<p><a id="what-are-random-walks"></a></p>

<h2 id="3-what-are-random-walks-and-state-enumeration">3. What are random walks and state enumeration?</h2>

<p>I have mentioned random walks and state enumeration multiple times so far.
Let’s clarify what these terms mean. The concept of a random walk is intuitively
simple, though the details matter. Instead of looking at a large specification,
let’s look at a simple example of a system that models adding and removing
workers from a pool. This example is inspired by the example in <a href="https://learntla.com/topics/tips.html#parameterize-your-actions">Parameterize
Your Actions</a> by Hillel Wayne. We add the variable
<code>count</code> to have a meaningful invariant. The specification is shown below. Even
if you do not know TLA<sup>+</sup>, it should be easy to understand. If you
still have trouble, just ask an LLM; they are good at
explaining TLA<sup>+</sup> specifications.</p>

<figure>

  <pre><code class="language-tla">EXTENDS Integers, FiniteSets

CONSTANTS
    (* The set of workers to choose from. *)
    (* @type: Set(Int);                   *)
    Worker

VARIABLES
    (* The set of active workers.         *)
    (* @type: Set(Int);                   *)
    active,
    (* The number of active workers.      *)
    (* @type: Int;                        *)
    count

(* Add a worker w to the set of active workers, if it is not already active. *)
(* @type: (Int) =&gt; Bool;                                                     *)
Add(w) ≜ w ∉ active ∧ active' = active ∪ {w} ∧ count' = count + 1

(* Remove a worker w from the set of active workers, if it is active.        *)
(* @type: (Int) =&gt; Bool;                                                     *)
Remove(w) ≜ w ∈ active ∧ active' = active \ {w} ∧ count' = count - 1

(* Initialize the system with no active workers and a count of zero.         *)
Init ≜ active = {} ∧ count = 0

(* In a next state, either add a worker or remove a worker.                  *)
Next ≜ ∃ w ∈ Worker:
          Add(w) ∨ Remove(w)

(* An invariant: `count` matches the cardinality of the active set.          *)
Inv ≜ (count = Cardinality(active))
</code></pre>

  <figcaption>Figure 3: TLA<sup>+</sup> specification for the Workers example.</figcaption>
</figure>

<p>If we fix the set of workers to be <code>Worker = {1, 2}</code>, we get a nice labelled
transition system (LTS) of 4 states. The graphical representation of this LTS is
shown below.</p>

<figure>

  <div><a href="/img/random-walks-lts.svg" target="_blank" title="Click to open full-size">
<picture>
  <img class="responsive-img full-width-img" src="/img/random-walks-lts.svg" alt="LTS for the Workers specification of two workers" />
</picture>
</a></div>

  <figcaption>Figure 4: The labelled transition system for two workers.</figcaption>
</figure>

<p>TLA<sup>+</sup> does not have any built-in notion of randomness or
probabilities.  It is what is usually called a <em>qualitative</em> specification.
When evaluating <code>Next</code> in a state, we can only evaluate whether a specific
transition is possible under a specific choice of <code>w</code> and the action scheduling
decision (whether to execute <code>Add(w)</code> or <code>Remove(w)</code>). This is the standard
semantics under the definition of behaviors. We can enumerate all reachable
states for the above system by breadth-first search or depth-first search. This
is what the model checker TLC does (it uses breadth-first search). This is what
I will call <em>state enumeration</em> in this blog post.</p>
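<p>To make state enumeration concrete, here is a minimal sketch in Python (not the framework used in the experiments; all names are illustrative) that enumerates the reachable states of the Workers example for <code>Worker = {1, 2}</code> with breadth-first search:</p>

```python
from collections import deque

WORKERS = (1, 2)

def successors(state):
    """Enumerate all successors of a state (active, count) under Next."""
    active, count = state
    for w in WORKERS:
        if w not in active:                      # Add(w) is enabled
            yield (frozenset(active | {w}), count + 1)
        else:                                    # Remove(w) is enabled
            yield (frozenset(active - {w}), count - 1)

def enumerate_states():
    """Breadth-first search from the initial state, as TLC does it."""
    init = (frozenset(), 0)
    seen = {init}
    queue = deque([init])
    while queue:
        state = queue.popleft()
        active, count = state
        assert count == len(active)              # check the invariant Inv
        for succ in successors(state):
            if succ not in seen:
                seen.add(succ)
                queue.append(succ)
    return seen

print(len(enumerate_states()))                   # prints 4 for two workers
```

Replacing the queue with a stack (popping from the right) would turn this into depth-first search; the set of reached states stays the same.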

<p>We could also interpret the choice of <code>w</code> and the action scheduling decision as
a random choice. Since the above specification is small, we can visualize it as
a <a href="https://en.wikipedia.org/wiki/Markov_decision_process">Markov decision process
(MDP)</a>. The states are
the same as in the LTS, but we also attach probabilities to the transitions.</p>

<figure>

  <div><a href="/img/random-walks-mdp.svg" target="_blank" title="Click to open full-size">
<picture>
  <img class="responsive-img full-width-img" src="/img/random-walks-mdp.svg" alt="MDP for the Workers specification of two workers" />
</picture>
</a></div>

  <figcaption>Figure 5: The MDP for two workers.</figcaption>
</figure>

<p>Notice that we assign probabilities for choosing the value of <code>w</code> and for
choosing the action to execute: <code>Add(w)</code> or <code>Remove(w)</code>. For example, in the
initial state, we choose <code>w=1</code> with probability 0.5, then the action <code>Add(1)</code>
with probability 0.5, which gives us a transition to the state where <code>active =
{1}</code> and <code>count = 1</code> (with probability 0.25). However, if we choose <code>w=1</code> and
the action <code>Remove(1)</code>, we have to backtrack to the initial state, since the
precondition of <code>Remove(1)</code> is not satisfied.</p>

<p>A <em>random walk</em> is a path through the MDP: a sequence of states that we
get by making random choices at each step. In the above figure, you can see one
walk in blue and one walk in red. To avoid too many backward edges, we have a
retry budget, typically 3-10 retries per step. We take this simple approach in
our custom framework. It is similar to what the randomized simulator in
<a href="https://github.com/informalsystems/quint">Quint</a> is doing, though the Quint simulator tries a bit harder locally
before backtracking. The probabilities are only used to produce diverse random
walks; they carry no inherent statistical meaning. This is very much how
stateful property-based testing works, too, though
PBT frameworks usually use biased coins instead of uniform ones.</p>
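<p>The walk with a retry budget can be sketched as follows (again an illustrative Python sketch on the Workers example, with uniform choices; the actual framework and the Quint simulator differ in details):</p>

```python
import random

WORKERS = (1, 2)

def try_step(state, w, action):
    """Return the successor of (active, count), or None if the action is disabled."""
    active, count = state
    if action == "Add" and w not in active:
        return (frozenset(active | {w}), count + 1)
    if action == "Remove" and w in active:
        return (frozenset(active - {w}), count - 1)
    return None  # precondition fails: this transition is disabled

def random_walk(max_steps=100, retries=5, rng=random):
    """One random walk; returns the set of visited states."""
    state = (frozenset(), 0)
    visited = {state}
    for _ in range(max_steps):
        for _ in range(retries):                 # the retry budget per step
            w = rng.choice(WORKERS)
            action = rng.choice(("Add", "Remove"))
            succ = try_step(state, w, action)
            if succ is not None:
                state = succ
                visited.add(state)
                break
        else:
            return visited                       # budget exhausted: stop the walk
    return visited

# accumulate coverage over many walks, with a fixed seed for reproducibility
rng = random.Random(2026)
covered = set()
for _ in range(100):
    covered |= random_walk(rng=rng)
```

On this 4-state example, a handful of walks covers everything; the point of the blog post is that this stops being true surprisingly quickly as instances grow.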

<div class="tip-box">
    <div class="tip-header"><strong>💡 Tip:</strong></div>
    <div class="tip"><p>TLC also supports random simulation, but it assigns
probabilities differently. Given a state, TLC first computes all successors of
the state and then chooses one of the successors uniformly at random. This would
give us a different MDP that filters out disabled transitions. Both approaches
have their merits and drawbacks. The approach of TLC requires us to enumerate
successors, unless we use reservoir sampling. It would actually work better on
the examples in this blog post, since they have many disabled transitions.
However, in systems that inject faults, this approach has an issue, as the
faulty transitions often dominate the search.</p>
</div>
</div>
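<p>For comparison, the TLC-style choice can be sketched like this (an illustrative Python sketch on the same Workers example; here, each worker always has exactly one enabled action, so no retries are ever needed):</p>

```python
import random

WORKERS = (1, 2)

def enabled_successors(state):
    """Compute all successors reachable by one enabled action."""
    active, count = state
    succs = []
    for w in WORKERS:
        if w not in active:                      # Add(w) is enabled
            succs.append((frozenset(active | {w}), count + 1))
        else:                                    # Remove(w) is enabled
            succs.append((frozenset(active - {w}), count - 1))
    return succs

def tlc_style_walk(max_steps=100, rng=random):
    """Enumerate enabled successors first, then pick one uniformly."""
    state = (frozenset(), 0)
    visited = {state}
    for _ in range(max_steps):
        succs = enabled_successors(state)
        if not succs:
            break                                # deadlock: no enabled action
        state = rng.choice(succs)
        visited.add(state)
    return visited
```

Note that the cost moved from retrying disabled transitions to enumerating all successors up front, which is exactly the trade-off discussed in the tip above.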

<h2 id="4-which-states-are-missing">4. Which states are missing?</h2>

<p>Since we can measure state coverage now, the next question is: What are these
states that we are missing? Maybe these states are not important at all. To
check that, I ran the random walks for the two-phase commit benchmark with 2
resource managers for 10,000 instead of 100,000 runs. Conveniently, exactly one
state was missing from the coverage. As our specifications are code, I just
asked Claude to instrument the search to experimentally evaluate the visit
frequencies per run for each reachable state. Figure 6 is quite detailed. Click
on it to see the full-size version.</p>

<figure>

  <div><a href="/img/two_phase_graph.svg" target="_blank" title="Click to open full-size">
    <picture>
      <img class="responsive-img full-width-img" src="/img/two_phase_graph.svg" alt="Experimental evaluation of the visit frequencies for each state in the two-phase commit benchmark with 2 resource managers" />
    </picture>
  </a></div>

  <figcaption>
    Figure 6: Reachability frequencies for the two-phase commit benchmark with
    2 resource managers.
  </figcaption>
</figure>

<p>As we can see, the missing state (with frequency 0) is the state where
the transaction manager has aborted the transaction, one resource manager has
also aborted, and the other resource manager is still in the “prepared”
state. This is an interesting state in this protocol, as the latter resource
manager still has the potential to commit the transaction, though it should not
do that.</p>

<p><strong>Bottom line:</strong> We may miss important states with random walks.</p>

<h2 id="5-more-coverage-plots">5. More coverage plots</h2>

<p>Figure 7 shows the coverage evolution for the larger instances of the benchmarks.
With it, we can see how increasing the number of random walks helps to
increase the coverage. It also shows the growing volume of covered and
missing states.</p>

<p>I wanted to share these flame plots with you. I find them cool.</p>

<div class="figure-grid">
  <figure>
    <a href="/img/random-walks/twophase-n5-overlay.png" target="_blank" title="Click to open full-size"><picture>
      <img class="responsive-img" src="/img/random-walks/twophase-n5-overlay.png" alt="Overlaid coverage of random walks for the two-phase commit benchmark with 5 resource managers" />
    </picture></a>
    <figcaption>Figure 7.a: Overlaid coverage for two-phase commit, 5 RMs.</figcaption>
  </figure>
  <figure>
    <a href="/img/random-walks/rw-inst4-overlay.png" target="_blank" title="Click to open full-size"><picture>
      <img class="responsive-img" src="/img/random-walks/rw-inst4-overlay.png" alt="Overlaid coverage of random walks for the readers-writers benchmark with 4 actors" />
    </picture></a>
    <figcaption>Figure 7.b: Overlaid coverage for readers-writers, 4 actors.</figcaption>
  </figure>
  <figure>
    <a href="/img/random-walks/fpaxos-inst3-overlay.png" target="_blank" title="Click to open full-size"><picture>
      <img class="responsive-img" src="/img/random-walks/fpaxos-inst3-overlay.png" alt="Overlaid coverage of random walks for the FPaxos benchmark with 3 acceptors" />
    </picture></a>
    <figcaption>Figure 7.c: Overlaid coverage for FPaxos, 3 acceptors.</figcaption>
  </figure>
  <figure>
    <a href="/img/random-walks/fpaxos-inst4-overlay.png" target="_blank" title="Click to open full-size"><picture>
      <img class="responsive-img" src="/img/random-walks/fpaxos-inst4-overlay.png" alt="Overlaid coverage of random walks for the FPaxos benchmark with 4 acceptors" />
    </picture></a>
    <figcaption>Figure 7.d: Overlaid coverage for FPaxos, 4 acceptors.</figcaption>
  </figure>
</div>

<h2 id="6-conclusions">6. Conclusions</h2>

<p><strong>Random walks are not sufficient to achieve complete coverage
except for very small state spaces</strong>. Moreover, <strong>random walks take
significantly longer than the model checker</strong>. This is especially striking,
given that we are <strong>running optimized simulators in Rust</strong>. Another issue with
state coverage by random walks is that <strong>you would not even know that you
achieved complete coverage</strong>. You can measure the speed of discovering new
states, but understanding that the simulator has converged basically requires a
model checker.</p>

<p>Interestingly, random walks behaved badly on FPaxos with four acceptors. This is
a relatively benign benchmark, not having a state explosion like specifications
of Byzantine consensus protocols (PBFT). In PBFT, the minimal configurations
contain 4-6 replicas, depending on the protocol. Hence, <strong>we should expect a
significantly worse coverage by random walks on PBFT</strong>.</p>

<p>Why do engineers keep running randomized experiments? Well, it is relatively
easy to write a simulator. (It is not that easy to write one that actually
works!) I have seen people playing with action distributions in the simulator,
just to drive the search towards “interesting” states. Whenever I asked
where the distributions were coming from, they could not explain.
Simulators are deceptive. You have to understand what you are doing, or,
better, incorporate feedback. The most basic feedback is state coverage, though
we can implement more sophisticated feedback mechanisms.</p>

<p>From our experiments, it may look like <strong>state enumeration is all we need</strong>. I
would argue that it is true <strong>as long as the set of reachable states fits into
memory</strong>. We do not have to store the states directly in memory: practical model
checkers store hashes of states. We can go as far as storing 2-3 bits per
state, assuming that collisions are acceptable (still better than random
walks!). With a machine with 128 GB of memory, we can store roughly 50
billion states. This is way more than the number of states in our
benchmarks – dozens of billions vs. thousands and millions.</p>
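<p>As a rough sketch of this hash-compaction idea (illustrative Python with one bit per state; real model checkers such as TLC use 64-bit fingerprints and more careful collision accounting):</p>

```python
class FingerprintSet:
    """Record visited states as single bits in a fixed-size array.

    Hash collisions make this a lossy set: a false "seen" answer may
    prune a genuinely new state, which is the accepted trade-off."""

    def __init__(self, n_bits=1 << 20):
        self.n_bits = n_bits
        self.bits = bytearray(n_bits // 8)       # one bit per slot

    def add(self, state):
        """Mark the state as visited; return True if its slot was unset."""
        i = hash(state) % self.n_bits
        byte, mask = i // 8, 1 << (i % 8)
        new = not (self.bits[byte] & mask)
        self.bits[byte] |= mask
        return new

fps = FingerprintSet()
assert fps.add((frozenset(), 0)) is True         # first visit: new
assert fps.add((frozenset(), 0)) is False        # second visit: already seen
```

The memory cost is fixed up front, independent of the state size, which is what makes hashed state enumeration scale to billions of states.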

<p>There are cases where randomness may find bugs, where state enumeration gets
stuck:</p>

<ol>
  <li>
<p><strong>Value domains are quite large.</strong> For example, if we choose values from the
 set of all 64-bit integers, it is not feasible to enumerate all successors even
 for a single state. A random walk can still make some progress without getting
 stuck. One can argue that choosing a value from the set $[0, 2^{64})$
 uniformly at random is shooting in the dark, but sometimes it helps us find
 bugs, especially if the large set has just a few large equivalence classes.
 Arguably, one should be able to apply data abstraction in this case. Also,
 this is usually the moment when you should consider a model checker that
 supports symbolic representation of states, like <a href="https://apalache-mc.org">Apalache</a>.</p>
  </li>
  <li>
<p><strong>Guided search.</strong> If we have a heuristic that guides the search towards
 interesting states, we can achieve better coverage with random walks faster.
 Maybe we use reinforcement learning to learn such a heuristic. Maybe we use an
 LLM to predict which actions are more likely to lead to interesting states.
 The main issue is that it is quite hard to find a direction for the search in
 the state space of distributed protocols.</p>
  </li>
</ol>

<h2 id="want-to-talk">Want to talk?</h2>

<!-- References -->]]></content><author><name>Igor Konnov</name></author><category term="testing" /><category term="model-checking" /><summary type="html"><![CDATA[Author: Igor Konnov]]></summary></entry><entry><title type="html">AI-generated shovels or second-order slop?</title><link href="https://protocols-made-fun.com/llms/testing/2026/02/12/second-order-slop.html" rel="alternate" type="text/html" title="AI-generated shovels or second-order slop?" /><published>2026-02-12T00:00:00+00:00</published><updated>2026-02-12T00:00:00+00:00</updated><id>https://protocols-made-fun.com/llms/testing/2026/02/12/second-order-slop</id><content type="html" xml:base="https://protocols-made-fun.com/llms/testing/2026/02/12/second-order-slop.html"><![CDATA[<p><strong>Author:</strong> <a href="https://konnov.phd">Igor Konnov</a></p>

<p><strong>Date:</strong> February 12, 2026</p>

<p>tl;dr:</p>

<ul>
  <li>
    <p><em>AI coding tools now reduce development costs, but they also accelerate the
 creation of software that appears high-quality while hiding serious correctness
 and reliability risks.</em></p>
  </li>
  <li>
    <p><em>When both code and tests are autogenerated, traditional quality checks lose
 their signaling value, increasing the likelihood of costly failures, outages,
 and liability exposure in production systems.</em></p>
  </li>
  <li>
    <p><em>To responsibly capture productivity gains without undermining trust,
 organizations must pair AI-generated code with specification-driven development
 and automated validation techniques that verify real system behavior rather
 than surface-level compliance.</em></p>
  </li>
</ul>

<p>In 2025, we saw plenty of enthusiastic announcements about LLMs generating (or,
more correctly, <em>replicating</em>) relatively complex projects, like web
applications or video games. Apparently, many people got so tired of all this
that you could hear the words “AI slop” quite often. So often that some very
important people asked all of us not to call the output of their amazing tools
“slop”.</p>

<p>Anyhow, by the end of 2025, the amazing AI tools became <em>visibly more amazing</em>.
In early 2025, I was only using ChatGPT and Copilot to produce small code
snippets and scripts, as well as to search for design solutions. In the summer
of 2025, I used Copilot &amp; Sonnet to produce boilerplate code. Now, I am using
Claude Code and Copilot (both with Opus and Sonnet) to generate code and tests
as well as to fix linting errors (still the hardest task!). I still have to
define the core data structures, write non-standard code and explain in detail
what I want to achieve. It is still hit and miss (see the most notable examples
below). However, it has become economically feasible for me to use these tools,
unless they get 10x more expensive. By the way, after finishing my experiment
with <a href="/tlaplus/2025/12/15/tftp-symbolic-testing.html">Symbolic testing of TFTP</a>, I was still not sure whether I
wanted to use agentic tools every day. The feedback loop was energy draining. It
looks like the tools have become better, and I’ve learnt how to give them more
focused and smaller-scoped tasks.</p>

<p>We have yet to see an AI-generated product that generates revenue. Are
there any examples, other than the AI coding assistants themselves? In any case,
this is not what I wanted to write about. I wanted to write about something that
looks like a new phenomenon to me. We all have heard the saying: <em>When everyone
is digging for gold, sell shovels</em>. Over just a couple of weeks, there was an
unusual number of announcements about development tools that were generated with
AI. This is what I call <em>AI-generated shovels</em>.  These announcements bring so
much joy to AI influencers that it’s hard to find anything else. Do these tools
actually work though? On closer inspection, some of the shovels break on the
first try, and some work only under very specific conditions. Most likely, you
have seen some announcements, and you know what I am talking about. It is also
very likely that you have not seen all of the announcements that I have in mind.
Since we are talking about development tools, libraries, or even languages that
do not actually work, not web apps, it is not just slop, it is a <em>second-order
slop</em>!</p>

<p>I am not going to name any names or do any finger-pointing. This is not the
point. What makes me seriously concerned about the second-order slop is that the
software development industry was cutting corners everywhere even before the AI
boom. “Move fast and break things!”, <a href="https://en.wikipedia.org/wiki/Minimum_viable_product">minimum-viable products</a> (or are they
solutions?), <a href="https://en.wikipedia.org/wiki/Product-market_fit">product-market-fit</a>, etc. A couple of years ago, I was joking
that I would rather not use an MVP compiler, operating system, or database.
Well, AI tools generate compilers. Here we are.</p>

<p><strong>Shovel ad!</strong> Since I have been working on pre-LLM shovels like <a href="https://apalache-mc.org">Apalache</a>
and <a href="https://github.com/informalsystems/quint">Quint</a> myself, I am in the shovel business, too! (Do you know that SMT
solvers were also considered AI?) Of course, I am developing new shovels, and
they are also AI-generated and AI-compatible, and they are the best in town, by
the way. So if you want to talk, <a href="#want-to-talk">drop me a message</a>. To be fair,
my time tracker shows that I’ve burnt six weeks of my time on the latest shovel,
in addition to burning through and over my Copilot and Claude budgets, so it’s
not entirely AI-generated. Perhaps a bit artisanal.</p>

<p><strong>Good shovel or slop?</strong> How do we distinguish a robust AI-generated shovel from
a second-order slop? In the pre-LLM years, I could just look at the test suite
and say whether the team was serious or not. Those were the amazing days when blockchain
engineers would nod their heads to the question: <em>Do you have integration
tests?</em> They were proudly demonstrating a <em>single</em> integration test that was 3-5
KLOC long. Also, by looking at the code, you could sense whether it was written
just yesterday, or someone had time to think about it.</p>

<p>In 2026, the code may look professionally written and follow all the best
practices and still be completely broken. On top of that, LLMs generate
good-looking tests, if you ask them. A lot of tests! The more tests you have,
the more tokens you have to pay for. Win-win. Moreover, the generated
tests may check that the code works, but this does not mean that the code does
what you expect. This happened to me (see below).</p>

<p>So when we evaluate an AI-generated shovel, we want to answer two questions:</p>

<ol>
  <li>
    <p>Does this shovel do what the authors claim it should do?</p>
  </li>
  <li>
    <p>Does this shovel work beyond a few simple tests?</p>
  </li>
</ol>

<p>These are not new questions. The testing and verification communities have been
trying to automate validation and verification for a long time. Interestingly,
these questions did not get much attention over the last two decades. It was
expected that open source projects and products by respectable companies were
“more or less” correct and complete. In my understanding, two factors
contributed to that:</p>

<ol>
  <li>
    <p>The code was written and reviewed by highly-skilled engineers, for fun or
 profit.</p>
  </li>
  <li>
    <p>The projects were extensively tested with continuous integration tools.</p>
  </li>
</ol>

<p>Now, if an LLM generated the code just yesterday, and all tests pass, are we
good? It is hard to tell. If we follow the brand new <em>spec-driven development</em>,
we have a bunch of markdown files. Apparently, we should ask a few other LLMs to
check whether the implemented code matches the markdown specs. Something like
that.</p>

<p><strong>Can we do better?</strong> I believe we can. For example, if you are developing
a distributed system, do not generate it directly. First, write or AI-generate a
sequential reference implementation (e.g., in Python) or, even better, a formal
specification (e.g., in TLA<sup>+</sup>). Second, use this artifact to produce
the code for the actual distributed system.</p>

<p>Why does this help? For two reasons:</p>

<ol>
  <li>
    <p>It is easier to compare the reference implementation or specification
 against the markdown requirements than to compare the entire codebase.</p>
  </li>
  <li>
    <p>The reference implementation/specification is an actionable artifact.  Use
 it to produce tests for the distributed system. Instead of generating 10 KLOC
 tests once (and paying for loading them into the LLM context), automatically
 produce as many tests as you can. This is where property-based testing and
 model checking start to shine. See <a href="/tlaplus/2025/12/15/tftp-symbolic-testing.html">Symbolic testing of TFTP</a>
 for an example.</p>
  </li>
</ol>
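<p>The second point can be sketched in a few lines of Python. This is my own
hypothetical illustration, not code from any real project: a trusted sequential
reference serves as a test oracle, so we can generate as many test cases as we
like and compare the two implementations on each of them.</p>

```python
import random

# Hypothetical sketch: a trusted sequential reference used as an oracle
# for the (possibly AI-generated) implementation under test.
# Both function names are illustrative.

def reference_add(a: int, b: int) -> int:
    """Reference semantics: unbounded integer addition."""
    return a + b

def implementation_add(a: int, b: int) -> int:
    """The implementation under test (here trivially correct)."""
    return a + b

# Differential testing: the reference produces the expected outputs,
# so every randomly generated input becomes a test case for free.
rng = random.Random(2026)
for _ in range(10_000):
    a = rng.randrange(-10**18, 10**18)
    b = rng.randrange(-10**18, 10**18)
    assert implementation_add(a, b) == reference_add(a, b)
```

<p>Property-based testing frameworks automate exactly this loop, and a model
checker explores the input space more systematically than random sampling.</p>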

<p><strong>Examples of LLMs hit and miss.</strong> If you <a href="https://www.linkedin.com/in/igor-konnov-at/">follow me on LinkedIn</a>, you could
have seen some of the examples. Below are the most curious instances that I would
regret missing in a code review (by Sonnet 4.5 and Opus 4.5):</p>

<ul>
  <li>
    <p><strong>The set minimum.</strong> When I asked an LLM to implement the search for the
 minimal element of a set by using its string representation (called <code>repr</code> in
 Python), it collected all set elements in a list, sorted them by <code>repr</code> and
 picked the first one. It looks like my requirement was slightly non-standard.</p>
  </li>
  <li>
    <p><strong>Sets with duplicates.</strong> An LLM has produced a unit test that constructed
 the data structure called “Set” from the list <code>[ V(1), V(2), V(3), V(1) ]</code> and
 asserted that the set cardinality was 4. The test passed, since <code>V</code> did not
 have equality defined, and two different instances of <code>V(1)</code> had different
 references. So it was doing the things right, but it was not doing the right
 things!</p>
  </li>
  <li>
    <p><strong>Performance bottleneck.</strong> An LLM translated my Python function into a Rust
 function. Perfectly looking code. However, instead of adding a big integer <code>x</code>
 to the big integer <code>y</code>, it used an iterator that made <code>y</code> increments of <code>x</code>.
 Almost like a theorem prover! A logically correct solution, but my Rust code
 was slower than the Python code. I only spotted it after running the profiler.
Again, a slightly non-standard setup threw it off.</p>
  </li>
</ul>
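<p>The second pitfall is easy to reproduce with Python’s built-in sets. The
class <code>V</code> below is my own minimal stand-in for the one in the
generated test, not the actual code:</p>

```python
# A minimal reconstruction of the "Sets with duplicates" pitfall.

class V:
    """No __eq__/__hash__: instances compare by object identity."""
    def __init__(self, n: int):
        self.n = n

# Two V(1) instances are distinct objects, so the "duplicate" survives
# and the cardinality is 4, not the expected 3.
assert len({V(1), V(2), V(3), V(1)}) == 4

class VFixed:
    """Value semantics: equal payloads compare (and hash) as equal."""
    def __init__(self, n: int):
        self.n = n
    def __eq__(self, other):
        return isinstance(other, VFixed) and self.n == other.n
    def __hash__(self):
        return hash(self.n)

# With value equality defined, the duplicate collapses as expected.
assert len({VFixed(1), VFixed(2), VFixed(3), VFixed(1)}) == 3
```

<p>The generated unit test asserted the behavior of the first class, so it
passed, while silently encoding the wrong expectation.</p>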

<h2 id="want-to-talk">Want to talk?</h2>

<!-- References -->]]></content><author><name>Igor Konnov</name></author><category term="llms" /><category term="testing" /><summary type="html"><![CDATA[Author: Igor Konnov]]></summary></entry><entry><title type="html">Property-based testing, adversarial developers, and LLMs</title><link href="https://protocols-made-fun.com/pbt/2025/12/22/pbt-adversarial-llms.html" rel="alternate" type="text/html" title="Property-based testing, adversarial developers, and LLMs" /><published>2025-12-22T00:00:00+00:00</published><updated>2025-12-22T00:00:00+00:00</updated><id>https://protocols-made-fun.com/pbt/2025/12/22/pbt-adversarial-llms</id><content type="html" xml:base="https://protocols-made-fun.com/pbt/2025/12/22/pbt-adversarial-llms.html"><![CDATA[<p><strong>Author:</strong> <a href="https://konnov.phd">Igor Konnov</a></p>

<p><strong>Date:</strong> December 22, 2025</p>

<p>I present a simple example that illustrates how property-based testing (PBT) and
model checking can help us catch unexpected behaviors of LLMs when they are used
to generate code. The example is inspired by the <a href="https://youtu.be/IYzDFHx6QPY">talk on property-based
testing</a> by <a href="https://scottwlaschin.com/">Scott Wlaschin</a>. If you are looking for a light
example that stresses the importance of writing good properties and having them
checked, this post is for you.</p>

<h2 id="1-adversarial-developer">1. Adversarial Developer</h2>

<p>A few days ago, I watched the <a href="https://youtu.be/IYzDFHx6QPY">talk on property-based testing</a> by
<a href="https://scottwlaschin.com/">Scott Wlaschin</a>. He started the talk by introducing a persona that he called
the <strong>Enterprise Developer from Hell</strong>. This is basically someone who
implements a feature to satisfy the given requirements, but they do it in
creatively evil (or just stupid) and unexpected ways. I will call such a persona
an <strong>adversarial developer</strong> in the rest of this post.</p>

<p>Then, Scott<sup id="fnref:scott-talk"><a href="#fn:scott-talk" class="footnote" rel="footnote" role="doc-noteref">1</a></sup> showed how an adversarial developer could ruin as simple a task as
adding up two numbers. For example, if we give them two tests $2+2=4$ and $10+33
= 43$, they will implement exactly those cases by case distinction. I am not
going to repeat Scott’s talk. <a href="https://youtu.be/IYzDFHx6QPY">Watch it</a>! It’s instructive and
entertaining.</p>

<p>Back in 2020, of course, Scott added that we are often adversarial developers
ourselves, and our peers are rarely that evil. It could be an enthusiastic
junior developer, who has just started and now wants to rewrite the whole code
base.  Now, we are a few weeks away from 2026, and <strong>we definitely have such a
peer</strong>!  It is called an LLM, or just AI, as the corporate marketers prefer.
LLMs are not necessarily evil, but they are definitely less predictable than
experienced human software engineers. I am not talking about <a href="https://genai.owasp.org/llmrisk/llm01-prompt-injection/">prompt
injection</a> here, which is another real issue with LLMs.</p>

<p>To be clear, in the rest of this text, I am talking about how an <strong>LLM could
behave like an adversarial developer when it generates code</strong>. It does not mean
that I ran one of the commercial LLMs and got those results.</p>

<h2 id="2-property-based-testing">2. Property-Based Testing</h2>

<p>The point of Scott’s talk was to show that a few data points (typical unit
tests) are insufficient to demonstrate correctness of the implementation.
A totally valid point!</p>

<p>In addition to the standard unit tests, we should also write the expected
properties of our implementation. The PBT frameworks test the code by producing
input values at random. For example, have a look at <a href="https://hypothesis.readthedocs.io/en/latest/">Hypothesis</a>. While this
may seem to be a silly idea at first, property-based tests uncover tricky
bugs. Moreover, the input value distribution does not have to be uniform. Keep
reading to see how this helps us catch the adversarial developer.</p>

<p>Here are the three properties of addition that Scott used to defeat the
adversarial developer:</p>

<ul>
  <li><strong>identity</strong>: for every number $x$, we have $x + 0 = x$,</li>
  <li><strong>commutativity</strong>: for all numbers $x$ and $y$, we have $x + y = y + x$, and</li>
  <li><strong>associativity</strong>: for all numbers $x$, $y$, and $z$, we have
$(x + y) + z = x + (y + z)$.</li>
</ul>

<p>At this point in the talk, I was like: Wait a minute! <strong>I could continue this
game of the adversarial developer</strong>. Before doing this, let’s look at where we
are with respect to the code and the properties. Here is the obvious
implementation of integer addition in Python, since the language has built-in
support for unbounded integers:</p>

<pre><code class="language-python">def add(a: int, b: int) -&gt; int:
    return a + b

</code></pre>

<p>Here are the property-based tests in <a href="https://hypothesis.readthedocs.io/en/latest/">Hypothesis</a>, generated by Claude Sonnet 4.5:</p>

<pre><code class="language-python">@given(st.integers())
def test_identity(a):
    """Test identity property: a + 0 = a."""
    assert add(a, 0) == a
    assert add(0, a) == a


@given(st.integers(), st.integers())
def test_commutativity(a, b):
    """Test commutativity property: a + b = b + a."""
    assert add(a, b) == add(b, a)


@given(st.integers(), st.integers(), st.integers())
def test_associativity(a, b, c):
    """Test associativity property: (a + b) + c = a + (b + c)."""
    assert add(add(a, b), c) == add(a, add(b, c))

</code></pre>

<p>You can find these and further examples in the <a href="https://github.com/konnov/pbt-example-summation">example repository</a>.</p>

<p>We run these tests with <code>pytest</code> to make sure that they all pass:</p>

<pre><code class="language-sh">$ git clone https://github.com/konnov/pbt-example-summation.git
$ cd pbt-example-summation/python
$ poetry run pytest tests/test_add.py \
  -k "test_identity or test_commutativity or test_associativity" --verbose
...
tests/test_add.py::test_identity PASSED                                        [ 33%]
tests/test_add.py::test_commutativity PASSED                                   [ 66%]
tests/test_add.py::test_associativity PASSED                                   [100%]
</code></pre>

<h2 id="3-symbolic-model-checking-with-apalache">3. Symbolic model checking with Apalache</h2>

<p>I decided to go even further and write a TLA<sup>+</sup> specification, to check
the three properties with <a href="https://apalache-mc.org">Apalache</a> and <a href="https://github.com/Z3Prover/z3">Z3</a> (only showing the relevant parts):</p>

<pre><code class="language-tlaplus">----------------------------------- MODULE Add --------------------------------
(*
 * A simple TLA+ specification of different kinds of addition.
 *
 * Igor Konnov, 2025
 *)
EXTENDS Integers

VARIABLE
    \* @type: Int;
    x,
    \* @type: Int;
    y,
    \* @type: Int;
    z

AddMath(a, b) == a + b

InitMath ==
    /\ x \in Int
    /\ y \in Int
    /\ z \in Int

Next == UNCHANGED &lt;&lt;x, y, z&gt;&gt;

Identity(F(_, _)) ==
    F(x, 0) = x

Commutativity(F(_, _)) ==
    F(x, y) = F(y, x)

Associativity(F(_, _)) ==
    F(F(x, y), z) = F(x, F(y, z))

InvMath ==
    /\ Identity(AddMath)
    /\ Commutativity(AddMath)
    /\ Associativity(AddMath)

</code></pre>

<p>With the above specification, we define a very simple state machine that
non-deterministically picks three integers <code>x</code>, <code>y</code>, and <code>z</code> with <code>InitMath</code>.
These variables do not change their values in the state machine, as you can see
from the definition of <code>Next</code>. We use <code>x</code>, <code>y</code>, and <code>z</code> to define three
properties of addition: <code>Identity</code>, <code>Commutativity</code>, and <code>Associativity</code>. As you
can see, these definitions are parameterized by the operator <code>F</code>, which is
<code>AddMath</code> for now. Our invariant <code>InvMath</code> is simply the conjunction of the three
properties.</p>

<p>This is how we run Apalache to check the invariant:</p>

<pre><code class="language-sh">$ cd pbt-example-summation/tla-spec
$ docker pull ghcr.io/apalache-mc/apalache
$ docker run --rm -v `pwd`:/var/apalache ghcr.io/apalache-mc/apalache \
  check --init=InitMath --inv=InvMath --length=0 Add.tla
</code></pre>

<p>With the above command, we tell Apalache to check the invariant <code>InvMath</code>
starting from the initial state <code>InitMath</code>. The <code>--length=0</code> option tells
Apalache to unroll <code>Next</code> zero times, which is sufficient in our case, since the
state machine does not change the values of the state variables.</p>

<h2 id="4-playing-adversarial">4. Playing adversarial</h2>

<p>Ok, the code and the specification above seem to be correct. But what if our
friendly AI produced something unexpected?</p>

<h3 id="41-hallucinating-addition-over-32-bit-unsigned-integers">4.1. Hallucinating addition over 32-bit unsigned integers</h3>

<p>Since we are dealing with an adversarial developer, they could simply use a
different definition of addition. So far, we have been talking about unbounded
mathematical integers, which Python conveniently implements for us.</p>

<p>Now, the adversarial developer gives us this implementation:</p>

<pre><code class="language-python">def add32(a: int, b: int) -&gt; int:
    return (a + b) % (2**32)

</code></pre>

<p>This implementation is actually not wrong. An LLM could copy it from a code base
that emulates a 32-bit CPU architecture in Python. This is a bit of a stretch,
but possible.</p>

<p>Let’s add property-based tests for this implementation as well:</p>

<pre><code class="language-python"># Tests for add32 (32-bit natural numbers with wrapping)

@given(st.integers(min_value=0, max_value=2**32 - 1))
def test_add32_identity(a):
    """Test identity property for add32: a + 0 = a."""
    assert add32(a, 0) == a
    assert add32(0, a) == a


@given(st.integers(min_value=0, max_value=2**32 - 1), st.integers(min_value=0, max_value=2**32 - 1))
def test_add32_commutativity(a, b):
    """Test commutativity property for add32: a + b = b + a."""
    assert add32(a, b) == add32(b, a)


@given(
    st.integers(min_value=0, max_value=2**32 - 1),
    st.integers(min_value=0, max_value=2**32 - 1),
    st.integers(min_value=0, max_value=2**32 - 1)
)
def test_add32_associativity(a, b, c):
    """Test associativity property for add32: (a + b) + c = a + (b + c)."""
    assert add32(add32(a, b), c) == add32(a, add32(b, c))

</code></pre>

<p>These tests also pass:</p>

<pre><code class="language-sh">$ poetry run pytest tests/test_add.py \
  -k "test_add32_identity or test_add32_commutativity or test_add32_associativity" \
  --verbose
...
tests/test_add.py::test_add32_identity PASSED                                [ 33%]
tests/test_add.py::test_add32_commutativity PASSED                           [ 66%]
tests/test_add.py::test_add32_associativity PASSED                           [100%]
</code></pre>

<p>What is going on? Well, identity, commutativity, and associativity also hold for
32-bit integers with overflow semantics. <strong>If we let AI generate not only the
implementation but also the properties, we may end up with a correct
implementation, but not the one we wanted!</strong> In this case, imagine an LLM has
added the <code>@given</code> decorators for the inputs to be in the range $[0,
2^{32})$, whereas we wanted unbounded integers! This is an example of the
classical question in requirements engineering:
<em>do we get things right</em> vs. <em>do we get the right things</em>.</p>

<p>Just to double-check that it is not just random chance, I ran Apalache on the
TLA<sup>+</sup> specification above with <code>Add32</code> instead of <code>AddMath</code>:</p>

<pre><code class="language-shell">$ docker run --rm -v `pwd`:/var/apalache ghcr.io/apalache-mc/apalache \
  check --length=0 --init=Init32 --inv=Inv32 Add.tla
...
Checker reports no error up to computation length 0
Total time: 2.205 sec
</code></pre>

<p>Further, the SMT solver Z3 confirms that identity, commutativity, and
associativity hold for 32-bit integers with overflow semantics. This is provided
that we pick the integers from the range $[0, 2^{32})$, which we do with
<code>Init32</code>.</p>

<p>However, <strong>this is not what we wanted initially</strong>. Let’s catch the adversarial
developer with the PBT tests that pick unbounded non-negative integers:</p>

<pre><code class="language-python">@given(st.integers(0))
def test_add32_unbounded_inputs_identity(a):
    """Test identity property for add32: a + 0 = a."""
    assert add32(a, 0) == a
    assert add32(0, a) == a

@given(st.integers(0), st.integers(0))
def test_add32_unbounded_inputs_commutativity(a, b):
    """Test commutativity property for add32: a + b = b + a."""
    assert add32(a, b) == add32(b, a)


@given(
    st.integers(0),
    st.integers(0),
    st.integers(0)
)
def test_add32_unbounded_inputs_associativity(a, b, c):
    """Test associativity property for add32: (a + b) + c = a + (b + c)."""
    assert add32(add32(a, b), c) == add32(a, add32(b, c))

</code></pre>

<p>This time, Hypothesis catches the issue with identity:</p>

<pre><code class="language-sh">$ poetry run pytest tests/test_add.py  --verbose \
  -k "test_add32_unbounded_inputs_identity or test_add32_unbounded_inputs_commutativity or test_add32_unbounded_inputs_associativity"
...
a = 4294967296

    @given(st.integers(0))
    def test_add32_unbounded_inputs_identity(a):
        """Test identity property for add32: a + 0 = a."""
&gt;       assert add32(a, 0) == a
E       assert 0 == 4294967296
E        +  where 0 = add32(4294967296, 0)
E       Falsifying example: test_add32_unbounded_inputs_identity(
E           a=4_294_967_296,
E       )
</code></pre>

<div class="tip-box">
    <div class="tip-header"><strong>💡 Tip:</strong></div>
    <div class="tip"><p>We can also implement <code>add64</code> that wraps
integers modulo $2^{64}$ and the PBT tests will catch the issue with
identity almost immediately.</p>
</div>
</div>

<h3 id="42-is-property-based-testing-a-magic-tool">4.2. Is property-based testing a magic tool?</h3>

<p>Let’s stop and think about our example. How did Hypothesis catch the issue over
an unbounded integer domain? Even if it was picking the integers from the interval
$[0, 2^{64})$, the chance of picking <code>4294967296</code> by uniform random sampling is
pretty slim. Yet, Hypothesis keeps picking this number.</p>

<p>Well, the trick is that its input generator tries
the well-known “magic numbers” such as <code>0</code>, <code>1</code>, <code>-1</code>, <code>2**32</code>, <code>2**64</code>, etc.
In this sense, Hypothesis does not use uniform random sampling. See the
discussion on <a href="https://hypothesis.readthedocs.io/en/latest/explanation/domain.html">domain and distribution</a> in the Hypothesis documentation for
more details.</p>
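<p>To see the idea, here is a toy input generator. It is my own sketch, not
Hypothesis internals: it mixes a few well-known boundary values into otherwise
small non-negative random inputs, and that bias is enough to hit the
<code>add32</code> identity violation quickly.</p>

```python
import random

# A toy biased generator (a sketch, not Hypothesis internals):
# boundary "magic" values are mixed into small "typical" inputs.
MAGIC = [0, 1, 2**32 - 1, 2**32, 2**64 - 1, 2**64]

def biased_int(rng: random.Random) -> int:
    # With probability 1/4, return a known boundary value...
    if rng.random() < 0.25:
        return rng.choice(MAGIC)
    # ...otherwise, sample a "typical" small value.
    return rng.randrange(0, 2**16)

def add32(a: int, b: int) -> int:
    # the adversarial implementation from above
    return (a + b) % (2**32)

# Only the boundary values reach beyond 2**32, so the identity bug
# is found via the magic numbers, not via uniform sampling.
rng = random.Random(2026)
counterexample = None
for _ in range(1000):
    a = biased_int(rng)
    if add32(a, 0) != a:
        counterexample = a
        break
assert counterexample is not None
```

<p>With uniform sampling over small values alone, the loop above would never
fail; the boundary values are what make it effective.</p>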

<p>What if our adversarial developer hallucinated an implementation that stays
undetected by Hypothesis? This is what our next example is about.</p>

<h3 id="43-hallucinating-addition-over-256-bit-unsigned-integers">4.3. Hallucinating addition over 256-bit unsigned integers</h3>

<p>This time, the adversarial developer uses 256-bit unsigned integers with overflow
semantics:</p>

<pre><code class="language-python">def add256(a: int, b: int) -&gt; int:
    return (a + b) % (2**256)

</code></pre>

<p>If you think that using 256-bit integers is absurd, well, the Ethereum Virtual
Machine (EVM) does exactly that. So an LLM could have adapted the above code
from an EVM-related code base.</p>

<p>Here are the corresponding property-based tests, this time with a budget of
100,000 examples per property:</p>

<pre><code class="language-python">@given(st.integers(0))
@settings(max_examples=100000)
def test_add256_unbounded_inputs_identity(a):
    """Test identity property for add256: a + 0 = a."""
    assert add256(a, 0) == a
    assert add256(0, a) == a

@given(st.integers(0), st.integers(0))
@settings(max_examples=100000)
def test_add256_unbounded_inputs_commutativity(a, b):
    """Test commutativity property for add256: a + b = b + a."""
    assert add256(a, b) == add256(b, a)


@given(
    st.integers(0),
    st.integers(0),
    st.integers(0)
)
@settings(max_examples=100000)
def test_add256_unbounded_inputs_associativity(a, b, c):
    """Test associativity property for add256: (a + b) + c = a + (b + c)."""
    assert add256(add256(a, b), c) == add256(a, add256(b, c))
</code></pre>

<p>This time, the adversarial developer gets away with it; all tests pass:</p>

<pre><code class="language-shell">$ poetry run pytest tests/test_add.py --verbose -k \
  "test_add256_unbounded_inputs_identity or test_add256_unbounded_inputs_commutativity or test_add256_unbounded_inputs_associativity" 
...
tests/test_add.py::test_add256_unbounded_inputs_identity PASSED                [ 33%]
tests/test_add.py::test_add256_unbounded_inputs_commutativity PASSED           [ 66%]
tests/test_add.py::test_add256_unbounded_inputs_associativity PASSED           [100%]
</code></pre>

<p>Why? Hypothesis does not try $2^{256}$ as a magic number. I gave it the budget of
100,000 examples, so it had a chance to try multiple powers of two, but it did
not try anything above $2^{256} - 1$.</p>

<p>We can make sure that the identity test indeed fails when we pass $2^{256}$
as an example:</p>

<pre><code class="language-python">@given(st.integers(0))
@settings(max_examples=100000)
@example(2**256)  # This should fail!
def test_add256_unbounded_inputs_identity(a):
    """Test identity property for add256: a + 0 = a."""
    assert add256(a, 0) == a
    assert add256(0, a) == a
</code></pre>

<h3 id="44-catching-the-adversarial-developer-with-apalache">4.4. Catching the adversarial developer with Apalache</h3>

<p>Here is how we modify the TLA<sup>+</sup> specification to use <code>Add256</code>:</p>

<pre><code class="language-tlaplus">Add256(a, b) == (a + b) % (2^256)

InitNat ==
    /\ x \in Nat
    /\ y \in Nat
    /\ z \in Nat

Inv256 ==
    /\ Identity(Add256)
    /\ Commutativity(Add256)
    /\ Associativity(Add256)

</code></pre>

<p>Apalache immediately finds the issue with identity when we run it with <code>Add256</code>:</p>

<pre><code class="language-shell">$ docker run --rm -v `pwd`:/var/apalache ghcr.io/apalache-mc/apalache \
  check --length=0 --init=InitNat --inv=Inv256 Add.tla
...
State 0: state invariant 0 violated.
Total time: 2.272 sec
</code></pre>

<p>If we check the counterexample, we see that the solver picks the value $2^{256}$
for <code>x</code>:</p>

<pre><code class="language-sh">$ head -n 14 _apalache-out/Add.tla/2025-12-22T15-20-31_8166638658721555415/violation.tla
---------------------------- MODULE counterexample ----------------------------
EXTENDS Add

(* Constant initialization state *)
ConstInit == TRUE

(* Initial state [_transition(0)] *)
State0 ==
  x
      = 115792089237316195423570985008687907853269984665640564039457584007913129639936
    /\ y = 0
    /\ z = 0

</code></pre>

<p><strong>This is not just luck, and it is not a heuristic!</strong> Apalache delegates solving to
the SMT solver <a href="https://github.com/Z3Prover/z3">Z3</a>, which solves integer constraints. If you want to make
sure that it’s not using magic numbers, go and change the modulo operator in
<code>Add256</code> to a large prime number, e.g., $2^{256} + 297$. Rerun the model
checker, and it will still find the issue with identity.</p>

<h3 id="45-for-the-curious-how-apalache-and-z3-work-together">4.5. For the curious: how Apalache and Z3 work together</h3>

<p>Our example is so simple that we can even go over the actual SMT constraints
that Apalache generates. Let’s run Apalache with the option <code>--debug</code>:</p>

<pre><code class="language-sh">$ docker run --rm -v `pwd`:/var/apalache ghcr.io/apalache-mc/apalache \
  check --debug \
  --length=0 --init=InitNat --inv=Inv256 Add.tla
</code></pre>

<p>Open the file
<code>_apalache-out/Add.tla/2025-12-23T09-31-46_16436938462355564409/log0.smt</code> (the
timestamp will be different on your machine). The log is pretty verbose. Here
are the crucial parts, which I’ve accompanied with explanations:</p>

<pre><code class="language-lisp">-- the initial value of `x`
(declare-const $C$6 Int)
-- `x` is a natural number
(assert (&gt;= $C$6 0))
-- introduce a boolean variable for the identity property
(declare-const $C$9 Bool)
-- Encode the identity property for `Add256` and `x`.
-- The huge number is 2^256.
(assert (= $C$9
   (= (mod $C$6
           115792089237316195423570985008687907853269984665640564039457584007913129639936)
      $C$6)))
-- assert that the identity property is violated
(declare-const $C$10 Bool)
(assert (= (not $C$10) $C$9))
(assert $C$10)
-- check, whether the above constraints have a solution
(check-sat)
</code></pre>

<p>If you want to understand what is going on, read the comments above. At first, I
was actually surprised that the SMT constraints did not contain addition at all.
Then I recalled that Apalache has a bunch of rewriting rules that simplify the
constraints. In this case, the symbolic model checker has applied the property
<code>a + 0 = a</code> internally to get rid of the addition (yeah, it is the identity
property!).  It was an equivalent transformation, so we are still left with the
modulo operator.</p>

<p>Essentially, we are asking Z3 to solve these inequalities over integers:</p>

\[\left\{
\begin{aligned}
x &amp;\ge 0\\
x \bmod 2^{256} &amp;\neq x
\end{aligned}
\right.\]
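
<p>As a plain-Python sanity check (just arithmetic, not Z3), $x = 2^{256}$ is a
witness: it is a natural number, yet <code>add256</code> does not map $(x, 0)$
back to $x$.</p>

```python
# A plain-Python sanity check (not Z3) that x = 2**256 is a witness
# for the violated identity property of add256.

def add256(a: int, b: int) -> int:
    return (a + b) % (2**256)

x = 2**256
assert x >= 0                  # x is a natural number
assert x % (2**256) == 0       # the modulo maps x to 0, not back to x
assert add256(x, 0) != x       # hence the identity property fails on x
```

<p>Z3, of course, does not guess this value; it derives it by solving the
integer constraints.</p>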

<p>What is crucial here is that the SMT solver <strong>Z3 has a strict contract with the
user</strong>. When we give it a set of constraints and ask it to check their
satisfiability, it will apply sound algorithms to arrive at one of the three
answers:</p>

<ul>
  <li>
    <p><strong>sat</strong>: there is a solution to the above constraints, i.e., an assignment of
values to the variables that makes the constraints true,</p>
  </li>
  <li>
    <p><strong>unsat</strong>: there is no solution, and</p>
  </li>
  <li>
    <p><strong>unknown</strong>: the constraints are too hard, or it took the solver too long to
solve them.</p>
  </li>
</ul>

<p>In contrast to PBT, it is not just like “I tried a few random inputs and did not
find a bug”. If Z3 answers <code>sat</code>, there is indeed a solution to the constraints,
and the solver gives it to us as a model. If it returns <code>unsat</code>, there is no
solution. Whenever you see <code>unknown</code>, it’s a bad day. Sometimes, it also
indicates a bug in the solver itself, as I’ve <a href="https://github.com/Z3Prover/z3/issues?q=is%3Aissue%20state%3Aclosed%20author%3Akonnov">reported to the Z3 developers a
few times</a>. However, Z3 is pretty reliable in my experience, and
producing an <code>unknown</code> is an achievement, unless you set very tight timeouts,
or use tricky non-linear arithmetic.</p>

<p>If you are further interested in how Z3 actually solved the above constraints, a
simple answer is that it used something like the <a href="https://en.wikipedia.org/wiki/Simplex_algorithm">Simplex algorithm</a> for
integer linear programming. The constraints are linear in our case, so Z3 could
apply this algorithm to find a solution. Most likely, Z3 used a more recent
algorithm, but the idea is similar.</p>

<p>In case you really want to know how SMT solvers work under the hood, I recommend
starting with the book on <a href="https://www.decision-procedures.org/">Decision Procedures</a> by Kroening and Strichman.</p>

<h2 id="5-conclusions">5. Conclusions</h2>

<p>We all have to <strong>learn how to write high-quality properties and understand the
boundaries of the “magic” tools</strong>. <strong>Even if you don’t use LLMs, your peers
will.</strong></p>

<p><strong>How do we learn to write good properties</strong>? We can play with property-based
testing. However, <strong>PBT tools are not reliable teachers</strong>. By their random
nature, a PBT tool may miss a bug on one run and find it on another run. Don’t
get me wrong. <strong>Property-based testing has its value</strong>, as do many other testing
and verification techniques. However, <strong>PBT is not a silver bullet</strong>. It may
miss bugs, especially if the input generator does not cover the right input
space well enough.</p>

<p><strong>Shall we use interactive provers like <a href="https://lean-lang.org/">Lean</a> and <a href="https://rocq-prover.org/">Rocq</a></strong>? Learning how to
prove code correct definitely helps! However, <strong>these tools only tell us that the
proof does not go through</strong>. They do not give us a counterexample.
State-of-the-art provers also recommend using PBT for bug finding.</p>

<p>In my opinion, <strong>model checking is the best way to learn how to write good
properties</strong>. You can write as many properties as you like, and the model
checker will produce counterexamples for you, or not. Importantly, if model
checkers terminate, they guarantee the absence of bugs within their <em>search
scope</em>. See my blog post on the <a href="/modelchecking/2025/04/08/value.html">value of model
checking</a> for more on that.</p>

<p>Usually, I recommend starting with <a href="https://github.com/tlaplus/tlaplus">TLC</a>. It works by state enumeration
and is easy to understand. If your search scope is small, TLC is a good learning
tool. In our example, the search scope is astronomical. In this case,
<a href="https://apalache-mc.org">Apalache</a> is there to help.</p>

<p>By the way, our example was so simple that we could encode it in <a href="https://github.com/Z3Prover/z3">Z3</a>
directly via its Python bindings. We could also use other model checkers. If you do
that, let me know!</p>
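<p>To give you an idea of what that encoding could look like, here is a minimal
sketch in Python, assuming the <code>z3-solver</code> package is installed; the
variable names are mine, not from any repository:</p>

<pre><code class="language-python"># a minimal sketch, assuming the z3-solver package (pip install z3-solver)
from z3 import Int, Solver, sat

x = Int("x")
solver = Solver()
# ask Z3 whether some non-negative x violates the identity, i.e.,
# whether (x + 0) mod 2**256 differs from x
solver.add(x >= 0, x % (2 ** 256) != x)

assert solver.check() == sat
witness = solver.model()[x].as_long()
# any witness must be at least 2**256, since smaller values satisfy the identity
assert witness >= 2 ** 256
</code></pre>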

<h2 id="6-bonus-hypothesis--crosshair">6. Bonus: Hypothesis + Crosshair</h2>

<p>Hypothesis offers an integration with <a href="https://crosshair.readthedocs.io/en/latest/">Crosshair</a>, which is a symbolic
execution engine for Python using Z3. I did not explore this integration in
depth. Claude told me that it is sufficient to just add this import to the test:</p>

<pre><code class="language-python">import hypothesis_crosshair_provider
</code></pre>

<p>Well, this did not help me find the violation of identity for <code>add256</code>.  If
you know how to make Crosshair work with Hypothesis, please let me know!</p>

<p>When we run Crosshair directly on the <code>add256</code> implementation, it finds the
issue with identity right away:</p>

<pre><code class="language-sh">$ poetry run crosshair check tests.test_add_crosshair.check_add256_identity
.../python/tests/test_add_crosshair.py:12: error: false when calling check_add256_identity(115792089237316195423570985008687907853269984665640564039457584007913129639936) (which returns False)
</code></pre>

<p>The Crosshair test looks as follows:</p>

<pre><code class="language-python">from crosshair.core_and_libs import standalone_statespace
from pbt_add import add256


def check_add256_identity(a: int) -&gt; bool:
    """
    Check identity property for add256: a + 0 = a.
    
    pre: a &gt;= 0
    post: _
    """
    return add256(a, 0) == a and add256(0, a) == a

</code></pre>
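<p>To double-check the counterexample by hand, here is a small sketch that mimics
<code>add256</code>; I am assuming that the implementation in <code>pbt_add</code>
wraps around modulo 2<sup>256</sup>, as its name suggests:</p>

<pre><code class="language-python"># a hedged stand-in for add256 from pbt_add, assuming wrap-around semantics
MASK = 2 ** 256

def add256(a, b):
    # unsigned 256-bit addition that wraps around on overflow
    return (a + b) % MASK

a = 2 ** 256  # exactly the value that Crosshair reported
assert add256(a, 0) == 0   # the sum wraps around to zero...
assert add256(a, 0) != a   # ...so the identity a + 0 = a is violated
assert add256(a - 1, 0) == a - 1  # any smaller non-negative value is fine
</code></pre>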

<!-- References -->

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:scott-talk">
      <p>I’ve never met Scott Wlaschin in real life, online or offline. I hope he would not mind me referring to him by his first name. <a href="#fnref:scott-talk" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Igor Konnov</name></author><category term="pbt" /><summary type="html"><![CDATA[Author: Igor Konnov]]></summary></entry><entry><title type="html">Interactive Symbolic Testing of TFTP with TLA+ and Apalache</title><link href="https://protocols-made-fun.com/tlaplus/2025/12/15/tftp-symbolic-testing.html" rel="alternate" type="text/html" title="Interactive Symbolic Testing of TFTP with TLA+ and Apalache" /><published>2025-12-15T00:00:00+00:00</published><updated>2025-12-15T00:00:00+00:00</updated><id>https://protocols-made-fun.com/tlaplus/2025/12/15/tftp-symbolic-testing</id><content type="html" xml:base="https://protocols-made-fun.com/tlaplus/2025/12/15/tftp-symbolic-testing.html"><![CDATA[<p><strong>Author:</strong> <a href="https://konnov.phd">Igor Konnov</a></p>

<p><strong>Date:</strong> December 15, 2025</p>

<p><em>Note: I mostly stopped using LLMs for proof-reading my texts, so you know
it is not all generated. Enjoy my typos and weird grammar!</em></p>

<p><strong>Abstract.</strong> As promised in the <a href="/tlaplus/2025/12/02/small-scope.html">blog post on small-scope
hypothesis</a>, I am continuing with the main body of the talk that I
presented at the internal Nvidia FM Week 2025. This blog post is rather long. If
you do not want to read the whole post, here are the most exciting new
developments:</p>

<ul>
  <li>
    <p>A <strong>new JSON-RPC server API</strong> for <a href="https://apalache-mc.org">Apalache</a>, which allows external tools
 and scripts to drive the symbolic execution of TLA<sup>+</sup> specifications
 and interact with the solver.  Read the section on <a href="#4-the-new-json-rpc-api-of-apalache">The new JSON-RPC API of
 Apalache</a>.</p>
  </li>
  <li>
    <p>A new approach to <strong>conformance testing of TLA<sup>+</sup> specifications and
 real implementations</strong>, called <strong>interactive symbolic testing</strong>. This approach
 is inspired by the work of <a href="https://www.mcmil.net/pubs/SIGCOMM19.pdf">McMillan and Zuck (2019)</a> on testing of the
 QUIC protocol with IVy and SMT. Read the section on <a href="#3-interactive-symbolic-testing-with-smt">Interactive symbolic
 testing with SMT</a>.</p>
  </li>
  <li>
    <p>A case study on <strong>testing multiple open-source implementations of TFTP</strong>,
 including unexpected (but not harmful) deviations from the protocol. This case
 study includes the experience report on using Claude to bootstrap the harness
 for testing TFTP implementations against the TLA<sup>+</sup> specification.
 Read the section on <a href="#7-bootstrapping-the-testing-harness-with-claude">Bootstrapping the testing harness with
 Claude</a> and <a href="#9-testing-against-adversarial-behavior">Testing against
 adversarial behavior</a>.  My point is
 <strong>not to brainwash you into using LLMs</strong>, but to <strong>show what works for me and
 what does not</strong>.</p>
  </li>
  <li>
    <p>The specification and the test harness are <strong>openly available</strong>. Check the
 <a href="https://github.com/konnov/tftp-symbolic-testing">Github repository</a>.</p>
  </li>
</ul>

<p>In this blog post, I am using TLA<sup>+</sup>. The same tooling and results
equally apply to <a href="https://github.com/informalsystems/quint">Quint</a>.</p>

<p><strong>Contents:</strong></p>

<ol>
  <li><a href="#1-introduction">Introduction</a></li>
  <li><a href="#2-model-based-testing-and-trace-validation">Model-based testing and trace validation</a></li>
  <li><a href="#3-interactive-symbolic-testing-with-smt">Interactive symbolic testing with SMT</a></li>
  <li><a href="#4-the-new-json-rpc-api-of-apalache">The new JSON-RPC API of Apalache</a></li>
  <li><a href="#5-case-study-tftp-protocol">Case study: TFTP protocol</a></li>
  <li><a href="#6-initial-tla-specification-of-tftp">Initial TLA<sup>+</sup> specification of TFTP</a></li>
  <li><a href="#7-bootstrapping-the-testing-harness-with-claude">Bootstrapping the testing harness with Claude</a></li>
  <li><a href="#8-debugging-the-tla-specification-with-the-implementation">Debugging the TLA<sup>+</sup> specification with the implementation</a></li>
  <li><a href="#9-testing-against-adversarial-behavior">Testing against adversarial behavior</a></li>
  <li><a href="#10-the-specification-as-a-differential-testing-oracle">The specification as a differential testing oracle</a></li>
  <li><a href="#11-prior-work">Prior Work</a></li>
  <li><a href="#12-conclusions">Conclusions</a></li>
</ol>

<h2 id="1-introduction">1. Introduction</h2>

<p>This work aims at demonstrating how to answer the following two questions with
<a href="https://apalache-mc.org">Apalache</a>:</p>

<p class="highlight-question"><strong><em>
  1. How to test the actual implementation against its TLA<sup>+</sup> specification?
</em></strong></p>

<p class="highlight-question"><strong><em>
  2. How to test the TLA<sup>+</sup> specification against the actual implementation?
</em></strong></p>

<p>For a long time, these questions were mostly ignored by the TLA<sup>+</sup>
community. Over the last 4-5 years, researchers started to look into these two
questions and found that having a connection between the specification and
the implementation is much more useful than initially thought. (Engineers
were telling me this all the time!) Check <a href="#11-prior-work">the prior work
section</a> for the papers and talks on this topic.  Roughly
speaking, the approaches follow two ideas:</p>

<ul>
  <li>
    <p><strong>Model-based testing (MBT)</strong>. The TLA<sup>+</sup> specification is used to
 generate test cases that are then executed against the implementation. This is
 an answer to question 1 above. The state exploration is driven by the
 specification. Hence, we are testing whether the implementation matches the
 inputs and outputs produced by the specification.</p>
  </li>
  <li>
    <p><strong>Trace validation (TV)</strong>. The traces are collected from the implementation
 and checked against the TLA<sup>+</sup> specification. This is an answer to
 question 2 above. State exploration is driven by the implementation, e.g., by
 executing the existing test suites, or just by running the system for some
 time. Hence, we are testing whether the specification matches the inputs and
 outputs of the implementation. Alternatively, we may check whether the
 implementation states may be lifted to the specification states, in order to
 produce a feasible trace in the specification.</p>
  </li>
</ul>

<p>If you re-read the description of MBT and TV above, you may notice that there
are two more dimensions of how to do testing:</p>

<ul>
  <li>
    <p><strong>State-based</strong>. In this case, we have to establish a relation between the
 implementation states and the specification states in each step of the trace.
 This is usually done by defining mapping functions, either from the implementation
 states to the specification states, or vice versa. Notice that mapping an
 implementation state to a specification state is usually much easier, as it
 involves <em>state abstraction</em> (e.g., dropping some variables). Mapping a
 specification state to an implementation state is more difficult, as it
 involves <em>state concretization</em>, e.g., choosing a representative concrete value
 for each abstract value in the specification state. For example, if the
 specification says $x \in [10, 20]$, then we have to choose a concrete value
 for $x$ in this range, e.g., at random.</p>
  </li>
  <li>
    <p><strong>Action-based</strong>. In this case, we have to establish a relation between the
 implementation actions and the specification actions. Again, we would need to
 define mappings. Interestingly, in my experience, defining action mappings is
 way easier than defining state mappings.</p>
  </li>
</ul>
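<p>To make the state-concretization idea concrete, here is a tiny sketch for the
interval example above; the function name is mine:</p>

<pre><code class="language-python">import random

def concretize_range(lo, hi, rng):
    # pick a representative concrete value for the abstract interval [lo, hi]
    return rng.randint(lo, hi)

# a fixed seed keeps the generated test inputs reproducible
rng = random.Random(2025)
x = concretize_range(10, 20, rng)
assert x in range(10, 21)
</code></pre>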

<h2 id="2-model-based-testing-and-trace-validation">2. Model-based testing and trace validation</h2>

<h3 id="21-model-based-testing-in-one-picture">2.1. Model-based testing in one picture</h3>

<p>Without going into too many details, the following picture illustrates the main
idea of model-based testing. We generate an “interesting” trace with a model
checker, e.g., with <a href="https://apalache-mc.org">Apalache</a>. This trace is fed to the test harness that:
(1) does action concretization, (2) executes the actions against the
implementation. The moment the implementation refuses to replay an action, we
know that there is a divergence. Notice that we often do not even need to query
the system for its current state, as we only care about the actions.</p>

<picture>
  <img class="responsive-img" src="/img/mbt.svg" alt="Model-based testing" />
</picture>

<p>One downside of this approach is that the model checker can be quickly overwhelmed
by the many possible action interleavings unless the search scope is further
restricted. In my experience, the SMT solver Z3 slows down dramatically when it
must solve two problems simultaneously:</p>

<ol>
  <li>
    <p>Choose a sequence of actions (a schedule) to explore, and</p>
  </li>
  <li>
    <p>Find variable assignments (states) that produce a feasible trace for the
chosen schedule.</p>
  </li>
</ol>

<p>When a schedule is fixed, the SMT solver must solve far fewer constraints
because it mainly propagates values through the actions. If the solver must also
pick a schedule, it must backtrack along two axes: (1) schedules and (2) states.
This increases solving times in practice.</p>

<p>To mitigate this, Apalache lets you randomly sample schedules and execute them
symbolically. To enumerate different “interesting” schedules, the user can
define a view operator, which usually projects state variables to more abstract
values. The model checker will then produce traces projected onto those views.
This works significantly better for test generation in practice. However, this
exploration strategy is fixed and cannot be changed without modifying Apalache
itself.</p>

<h3 id="22-trace-validation-in-one-picture">2.2. Trace validation in one picture</h3>

<p>Trace validation is conceptually simpler than model-based testing. We simply
execute the system under test (SUT) and collect traces. These traces are then
mapped to the abstract states, if necessary, and checked against the
specification.</p>

<picture>
  <img class="responsive-img" src="/img/tv.svg" alt="Trace validation" />
</picture>

<p>This approach has been tried in multiple projects that use the exhaustive-state
model checker <a href="https://github.com/tlaplus/tlaplus">TLC</a> as the back-end solver. See <a href="#11-prior-work">the prior work
section</a>.</p>

<p>Trace validation also has its challenges:</p>

<ol>
  <li>
    <p>We need a good test suite, in order to produce “interesting” traces.
 However, test cases are usually written for the happy-path scenarios. Hence,
 it is easy to miss handling of error cases and faults. <a href="https://www.youtube.com/watch?v=DO8MvouV29M">Srinidhi Nagendra et
 al. (2025)</a> address this issue by fuzzing the tests.</p>
  </li>
  <li>
    <p>Someone has to instrument the SUT to trace the relevant events. In some
 cases, it is easy to do, e.g., by tracing message exchanges, as presented by
 <a href="https://www.youtube.com/watch?v=NZmON-XmrkI">Markus Kuppe et al. (2024)</a>. In other cases, it may be quite difficult
 to do, e.g., when we want to dump the internal states of the SUT. In a
 concurrent system this may require a global lock and traversing large data
 structures. In a distributed system, this may further require a distributed
 snapshot or using vector clocks.</p>
  </li>
  <li>
    <p>We have to run the whole system to collect traces. It is hard to isolate one
 component, e.g., one network node.</p>
  </li>
</ol>

<h2 id="3-interactive-symbolic-testing-with-smt">3. Interactive symbolic testing with SMT</h2>

<p>As we can see, both model-based testing and trace validation in their above
formulation are non-interactive. They both require a complete trace to be
produced first, and <strong>there is no feedback loop between the specification and
the implementation</strong>.</p>

<p>There is a third way to do conformance testing that leverages SMT solvers, yet
receives feedback from the implementation during the testing. I will call it
<strong>interactive symbolic testing</strong>. I think the first time I heard about this
approach was from <a href="https://www.losa.fr/">Giuliano Losa</a>, when he explained the paper by <a href="https://www.mcmil.net/pubs/SIGCOMM19.pdf">Ken
McMillan and Lenore Zuck (2019)</a> to me. If you have not read this paper
yet, I highly recommend doing so. On the naming side, McMillan and Zuck call
their approach “specification-based testing”. I find this name to be a bit
non-descriptive, as MBT is also specification-based.</p>

<p>The idea is to generate an action with the SMT solver by following the
specification, execute it against the implementation, and then feed the results
back to the SMT solver to generate the next action. This way, we can
systematically explore the protocol specification while getting feedback from
the implementation.</p>

<p>The picture below illustrates this approach by approximately following the
internal transition executor of Apalache.</p>

<picture>
  <img class="responsive-img" src="/img/symbolic-testing.svg" alt="Symbolic testing" />
</picture>

<p>To implement this approach to testing with Apalache, we would have to find a way
for Apalache and the test harness to communicate. My experience with development
of Apalache shows that <strong>fixing exploration strategies inside the model checker
is not a good idea</strong>. People always want to tweak them a bit for their purposes.
Given this observation, <a href="https://blltprf.xyz">Thomas Pani</a> and I have decided to implement a simple
server API for Apalache that would allow external tools to drive the symbolic
execution of TLA<sup>+</sup> specifications.</p>

<h2 id="4-the-new-json-rpc-api-of-apalache">4. The new JSON-RPC API of Apalache</h2>

<p><a href="https://blltprf.xyz">Thomas</a> and I wanted to have a lightweight API that we could use
from any programming language without writing too much boilerplate code. At this
point, every engineer would whisper: hey, you need gRPC, I’ve got some. Well, we
tried gRPC in the integration of <a href="https://apalache-mc.org">Apalache</a> with <a href="https://github.com/informalsystems/quint">Quint</a>. It is hard to call
gRPC lightweight.</p>

<p>So we have decided to go with <a href="https://www.jsonrpc.org/">JSON-RPC</a> this time, which is a very simple
protocol that works over HTTP/HTTPS. Implementing a JSON-RPC server is quite
straightforward.  Since Apalache is written in Scala, which is JVM-compatible,
we can use well-known and battle-tested libraries. Perhaps a bit
unexpectedly for a Scala project, I’ve decided to implement this server with
<a href="https://jetty.org/">Jetty</a> for serving the HTTP requests and <a href="https://github.com/FasterXML/jackson">Jackson</a> for JSON serialization.
(The reason is that we have already burnt ourselves with fancy but poorly
supported libraries in Scala.) The resulting server is lightweight and fast.
Moreover, it can be tested with command-line tools like <a href="https://curl.se/">curl</a>.</p>

<p>The state-chart diagram of the Apalache JSON-RPC server for a single session is
shown below.</p>

<picture>
  <img class="responsive-img" src="/img/apalache-api.svg" alt="Apalache JSON-RPC API" />
</picture>

<p>To see a detailed description of this API, check <a href="https://github.com/apalache-mc/apalache/tree/main/json-rpc">Apalache JSON-RPC</a>.  Just to
give you a taste of it, here is how you start the server without having
anything installed but Docker:</p>

<pre><code class="language-shell">$ docker pull ghcr.io/apalache-mc/apalache
$ docker run --rm -v /tmp:/var/apalache -p 8822:8822 \
    ghcr.io/apalache-mc/apalache:latest \
    server --server-type=explorer
</code></pre>

<p>Now, we create a new Apalache session with a TLA<sup>+</sup> specification (in a
separate tab):</p>

<pre><code class="language-shell">$ SPEC=`cat &lt;&lt;EOF | base64
---- MODULE Inc ----
EXTENDS Integers
VARIABLE
  \* @type: Int;
  x
Init == I:: x = 0
Next == (A:: (x &lt; 3 /\\ x' = x + 1)) \\/ (B:: (x &gt; -3 /\\ x' = x - 1))
Inv3 == Inv:: x /= 0
\* @type: () =&gt; &lt;&lt;Bool, Bool, Bool&gt;&gt;;
View == &lt;&lt;x &lt; 0, x = 0, x &gt; 0&gt;&gt;
=====================
EOF`
$ curl -X POST http://localhost:8822/rpc \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"loadSpec","params":{"sources": [ "'${SPEC}'" ],
       "invariants": ["Inv3"], "exports": ["View"]},"id":1}'
</code></pre>

<p>Isn’t that amazing? No protobuf, no code generation, just pure shell and
readable JSON.</p>

<p>Having the specification loaded, we load the predicate <code>Init</code> into the solver
context, which is encoded as transition 0:</p>

<pre><code class="language-shell">$ curl -X POST http://localhost:8822/rpc \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"assumeTransition","params":{"sessionId":"1",
       "transitionId":0,"checkEnabled":true},"id":2}'
</code></pre>

<p>Assuming that the previous call returned <code>ENABLED</code>, we switch to the next
step, which applies the effect of <code>Init</code> to the current symbolic state:</p>

<pre><code class="language-shell">$ curl -X POST http://localhost:8822/rpc \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"nextStep","params":{"sessionId":"1"},"id":3}'
</code></pre>

<p>Now, we can check the invariant <code>Inv3</code> against all states that satisfy <code>Init</code>:</p>

<pre><code class="language-shell">$ curl -X POST http://localhost:8822/rpc \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"checkInvariant",
       "params":{"sessionId":"1","invariantId":0},"id":3}'
</code></pre>

<p>Since invariant <code>Inv3</code> is violated by the initial state, the server returns
<code>VIOLATED</code>, along with a counter-example trace:</p>

<pre><code class="language-json">{
  "jsonrpc": "2.0",
  "id": 3,
  "result": {
    "sessionId": "1",
    "invariantStatus": "VIOLATED",
    "trace": {
      "#meta": {
        "format": "ITF",
        "varTypes": { "x": "Int" },
        "format-description": "https://apalache-mc.org/docs/adr/015adr-trace.html",
        "description": "Created by Apalache on Thu Dec 11 16:56:47 CET 2025"
      },
      "vars": [ "x" ],
      "states": [ {
          "#meta": { "index": 0 },
          "x": { "#bigint": "0" }
      } ]
    }
  }
}
</code></pre>

<p>The trace is encoded in the <a href="https://apalache-mc.org/docs/adr/015adr-trace.html">ITF format</a>, which is a simple JSON-based
format for TLA<sup>+</sup> and Quint traces.</p>
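<p>Since ITF is plain JSON, decoding a trace like the one above takes only a few
lines. Here is a minimal sketch that handles just the <code>#bigint</code>
encoding, not the full ITF schema:</p>

<pre><code class="language-python">import json

raw = """
{ "vars": ["x"],
  "states": [ { "#meta": { "index": 0 }, "x": { "#bigint": "0" } } ] }
"""

def decode_value(value):
    # ITF wraps big integers as {"#bigint": "decimal-string"} to avoid
    # precision loss in JSON parsers
    if isinstance(value, dict) and "#bigint" in value:
        return int(value["#bigint"])
    return value

trace = json.loads(raw)
states = [
    { name: decode_value(v) for name, v in state.items() if name != "#meta" }
    for state in trace["states"]
]
assert states == [ { "x": 0 } ]
</code></pre>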

<p>Had the invariant been violated on a deeper trace, we would have to assume more
transitions by calling <code>assumeTransition</code> and <code>nextStep</code> multiple times.</p>

<p>If you want to access this API from Python right away, use two helper libraries:</p>

<ul>
  <li>
    <p><a href="https://github.com/konnov/apalache-rpc-client/">apalache-rpc-client</a> for interacting with the JSON-RPC server of
  Apalache, and</p>
  </li>
  <li>
    <p><a href="https://github.com/konnov/itf-py/">itf-py</a> for serializing and deserializing ITF traces.</p>
  </li>
</ul>
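<p>If you prefer to see the wire-level calls, here is a minimal sketch that builds
the same kind of JSON-RPC requests as the <code>curl</code> commands above, using
only the Python standard library; actually sending them requires a running server:</p>

<pre><code class="language-python">import json
import urllib.request

def rpc_request(method, params, req_id):
    # build a JSON-RPC 2.0 request for the Apalache server on localhost:8822
    payload = { "jsonrpc": "2.0", "method": method, "params": params, "id": req_id }
    return urllib.request.Request(
        "http://localhost:8822/rpc",
        data=json.dumps(payload).encode("utf-8"),
        headers={ "Content-Type": "application/json" },
    )

req = rpc_request("checkInvariant", { "sessionId": "1", "invariantId": 0 }, 3)
# with a running server: response = json.load(urllib.request.urlopen(req))
assert json.loads(req.data)["method"] == "checkInvariant"
</code></pre>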

<h2 id="5-case-study-tftp-protocol">5. Case study: TFTP protocol</h2>

<p>To experiment with interactive symbolic testing and the new JSON-RPC API, I
wanted to choose a relatively simple network protocol that had multiple
implementations. After several sessions with ChatGPT, I ended up with the
Trivial File Transfer Protocol (TFTP) as a reasonable target for this small
project.</p>

<p>The Wikipedia page on <a href="https://en.wikipedia.org/wiki/Trivial_File_Transfer_Protocol">TFTP</a> gives us a good overview of the protocol. In
short, TFTP is a simple protocol to transfer files over UDP. It supports reading
and writing files from a remote server. It is mostly used for booting from the
network. The protocol is simple enough to be specified in TLA<sup>+</sup>
without too much effort, yet it has enough complexity to make the testing effort
interesting. Actually, I’ve only specified read requests (RRQ) and not write
requests (WRQ), to keep the scope manageable.</p>

<p>You can find more detailed specifications in the original <a href="https://www.rfc-editor.org/rfc/rfc1350">RFC 1350</a>, as well
as in its extensions <a href="https://www.rfc-editor.org/rfc/rfc2347">RFC 2347</a>, <a href="https://www.rfc-editor.org/rfc/rfc2348">RFC 2348</a>, and <a href="https://www.rfc-editor.org/rfc/rfc2349">RFC 2349</a>. RFC 1350
defines a simple non-negotiated version of the protocol. Below is an example of
such an interaction between the client and the server. Notice that the client
first sends a read request (RRQ) to the server on the control port 69, which
responds with the first data block (DATA) on a newly allocated ephemeral port.
The client acknowledges (ACK) the received data block on the same ephemeral
port.  This continues until the server sends the last data block, which is
smaller than the maximum block size (512 bytes by default).</p>

<picture>
  <img class="responsive-img" src="/img/rrq1350.svg" alt="Read request and transfer as per RFC 1350" />
</picture>
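<p>For the curious, the RRQ packet in the diagram above has a very simple wire
format in RFC 1350: a two-byte opcode (1 for RRQ), followed by the filename and
the transfer mode, each terminated by a zero byte. A quick sketch in Python:</p>

<pre><code class="language-python">import struct

def rrq_packet(filename, mode="octet"):
    # opcode 1 = RRQ (big-endian), then NUL-terminated filename and mode
    return (
        struct.pack("!H", 1)
        + filename.encode("ascii") + b"\x00"
        + mode.encode("ascii") + b"\x00"
    )

pkt = rrq_packet("file1")
assert pkt == b"\x00\x01file1\x00octet\x00"
</code></pre>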

<p>Further, <a href="https://www.rfc-editor.org/rfc/rfc2347">RFC 2347</a> defines an option negotiation phase that happens right after the
read request. The client and the server may negotiate options like block size,
timeout, and transfer size. <a href="https://www.rfc-editor.org/rfc/rfc2348">RFC 2348</a> defines the block size option, while
<a href="https://www.rfc-editor.org/rfc/rfc2349">RFC 2349</a> defines the transfer size option. Below is an example interaction with
option negotiation:</p>

<picture>
  <img class="responsive-img" src="/img/rrq2347.svg" alt="Read request and transfer as per RFC 2347" />
</picture>

<p>The cool thing about TFTP is that it has multiple open-source client and
server implementations in different programming languages. Here are some of
them:</p>

<ul>
  <li>
    <p><a href="https://kernel.googlesource.com/pub/scm/network/tftp/tftp-hpa/">tftp-hpa</a> is the canonical implementation of TFTP for Linux (and UNIX?) in C.</p>
  </li>
  <li>
    <p><a href="https://github.com/madmartin/atftp">atftpd</a> is advanced TFTP, which is intended for fast boot in large
 clusters, also in C.</p>
  </li>
  <li>
    <p><a href="http://www.thekelleys.org.uk/dnsmasq/doc.html">dnsmasq</a> is a lightweight DNS and DHCP server that also includes a TFTP
 server, in C.</p>
  </li>
  <li>
    <p><a href="https://github.com/altugbakan/rs-tftpd">rs-tftpd</a> is an implementation of a TFTP server in Rust.</p>
  </li>
  <li>
    <p><a href="https://github.com/pin/tftp">gotfpd</a> is an implementation of a TFTP server in Go.</p>
  </li>
  <li>
    <p>busybox also has its own minimalistic implementation for file reads.</p>
  </li>
</ul>

<h2 id="6-initial-tla-specification-of-tftp">6. Initial TLA<sup>+</sup> specification of TFTP</h2>

<p>In the first stage of this experiment, I read the RFCs and wrote a
TLA<sup>+</sup> specification of the TFTP protocol. At that stage, I did not
introduce packet loss, duplication, or reordering. I just wanted to have a
simple working specification that I could use for testing the implementations.
<strong>This stage took me just two days.</strong> Well, I have been writing plenty of
TLA<sup>+</sup> specifications in the past.</p>

<p>You can check this initial specification in the <a href="https://github.com/konnov/tftp-symbolic-testing/tree/6fb00d1878b7e37a629868ac25b853d95b16cbdc">initial commit</a> of the
<a href="https://github.com/konnov/tftp-symbolic-testing">testing repo</a>. The main body of the specification lives in <code>tftp.tla</code>,
which imports several auxiliary modules:

<ul>
  <li>
    <p><code>typedefs.tla</code> defines the types of the data structures and the basic
 constructors for these data structures. Since I am using Apalache, the
 specification needs type definitions. Luckily, these days, I just write the
 type definitions in comments and let Claude generate the auxiliary operators
 such as constructors and accessors. If you already have an untyped
 specification, Claude is good at figuring out the types in the agent mode. Just
 use <a href="https://github.com/apalache-mc/apalache/blob/main/prompts/type-annotation-assistant.md">this prompt</a>.</p>
  </li>
  <li>
    <p><code>util.tla</code> defines common utility definitions such as <code>Min</code>, <code>Max</code>, and
 option conversions.</p>
  </li>
</ul>

<p>Finally, <code>MC2_tftp.tla</code> defines a protocol instance of two clients and one
server. If you stumble upon the definitions that end with <code>View</code> there, ignore
them. They are not essential for this blog post. I used them to experiment with
more advanced symbolic exploration scripts.</p>

<p>If you are not familiar with TLA<sup>+</sup>, or your TLA<sup>+</sup> skills are
rusty, I recommend giving one of the definitions and this prompt to ChatGPT. It
actually explains TLA<sup>+</sup> quite well:</p>

<pre><code>Assume that I am a software engineer. I don't know TLA+ but know Golang or Rust.
Explain me this TLA+ snippet using my knowledge: ...
</code></pre>

<p>To see the kinds of actions this initial specification had, have a look at the
definition of <code>Next</code> in <code>tftp.tla</code>:</p>

<pre><code class="language-tlaplus">Next ==
    \* the actions by the clients
    \/  \E srcIp \in CLIENT_IPS, srcPort \in PORTS:
            \E filename \in DOMAIN FILES, timeout \in 1..255:
                \* "man tftpd": 65464 is the theoretical maximum for block size
                \* https://linux.die.net/man/8/tftpd
                \E tsize \in 0..FILES[filename], blksize \in 0..65464:
                    \* choose a subset of the options to request
                    \E optionKeys \in SUBSET OPTIONS_RFC2349:
                        LET options ==
                            mk_options(optionKeys, blksize, tsize, timeout)
                        IN
                        ClientSendRRQ(srcIp, srcPort, filename, options)
    \/  \E udp \in packets:
            \/ ClientRecvDATA(udp)
            \/ ClientRecvOACK(udp)
            \/ ClientRecvErrorAndCloseConn(udp)
    \/  \E ipPort \in DOMAIN clientTransfers:
            ClientTimeout(ipPort)
    \* the server
    \/  \E udp \in packets:
            \/ ServerRecvRRQ(udp)
            \/ ServerSendDATA(udp)
            \/ ServerRecvAckAndCloseConn(udp)
            \/ ServerRecvErrorAndCloseConn(udp)
    \/  \E ipPort \in DOMAIN serverTransfers:
            ServerTimeout(ipPort)
    \* handle the clock and timeouts
    \/  \E delta \in 1..255:
            AdvanceClock(delta)

</code></pre>

<div class="tip-box">
    <div class="tip-header"><strong>💡 Tip:</strong></div>
    <div class="tip"><p>Do not spend too much time on reading this
initial specification. I misunderstood several things about TFTP from the
RFCs, which I fixed later. In particular, the timeouts are completely wrong
in this initial version. It is good that the actual implementations helped me to
find these mistakes!</p>
</div>
</div>

<p><strong>Falsy invariants</strong>. As I always do, I also specified “falsy invariants” to
produce interesting examples. For example, using the invariant
<code>RecvThreeDataBlocksEx</code> below, I can easily produce a trace where a client
receives three data blocks from the server.</p>

<pre><code class="language-tlaplus">\* Check this falsy invariant to see an example of a client receiving 3 blocks.
RecvThreeDataBlocksEx ==
    ~(\E p \in DOMAIN clientTransfers:
        Len(clientTransfers[p].blocks) &gt;= 3)
</code></pre>

<p>If you want to try it right away without installing anything, just do this with
docker:</p>

<pre><code class="language-shell">$ git clone git@github.com:konnov/tftp-symbolic-testing.git
$ cd tftp-symbolic-testing
$ git checkout 6fb00d1878b7e37a629868ac25b853d95b16cbdc
$ docker pull ghcr.io/apalache-mc/apalache
$ docker run --rm -v `pwd`:/var/apalache ghcr.io/apalache-mc/apalache \
  check --inv=RecvThreeDataBlocksEx MC2_tftp.tla
</code></pre>

<p><strong>Trace visualization.</strong>
Since Apalache emits traces in the <a href="https://apalache-mc.org/docs/adr/015adr-trace.html">ITF format</a>, which has a very simple
JSON schema, it was easy for me to convince Claude to produce a Python script
that converts ITF traces to human-readable sequence diagrams in Mermaid. Here
is an example of such a trace, produced by Apalache when checking the invariant
<code>RecvThreeDataBlocksEx</code>:</p>

<pre><code>sequenceDiagram
    participant ip10_0_0_3_port65000 as 10.0.0.3:65000
    participant ip10_0_0_1_port10000 as 10.0.0.1:10000
    participant ip10_0_0_1_port69 as 10.0.0.1:69

    ip10_0_0_3_port65000-&gt;&gt;ip10_0_0_1_port69: RRQ(file1, blksize=0, timeout=4)
    ip10_0_0_1_port10000-&gt;&gt;ip10_0_0_3_port65000: DATA(blk=1, 512B)
    ip10_0_0_3_port65000-&gt;&gt;ip10_0_0_1_port10000: ACK(blk=1)
    ip10_0_0_1_port10000-&gt;&gt;ip10_0_0_3_port65000: DATA(blk=2, 512B)
    ip10_0_0_3_port65000-&gt;&gt;ip10_0_0_1_port10000: ACK(blk=2)
    ip10_0_0_1_port10000-&gt;&gt;ip10_0_0_3_port65000: DATA(blk=3, 0B)
    ip10_0_0_3_port65000-&gt;&gt;ip10_0_0_1_port10000: ACK(blk=3)
</code></pre>

<p>This is how it looks when rendered by <a href="https://www.mermaidchart.com/">Mermaid</a>:</p>

<picture>
  <img class="responsive-img" src="/img/tftp3.svg" alt="Visualized trace of TFTP client receiving three data blocks" />
</picture>
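<p>Such a converter fits in a few lines of Python. Below is a minimal sketch,
not the script from the repository: it assumes an ITF trace whose states track
a variable <code>packets</code>, a set of records with illustrative field names
(<code>srcIp</code>, <code>srcPort</code>, <code>destIp</code>, <code>destPort</code>, <code>kind</code>).
A packet that appears in a state but not in its predecessor was sent during that step.</p>

```python
def itf_set(value):
    """Decode an ITF set value, encoded as {"#set": [...]}, into a list."""
    if isinstance(value, dict) and "#set" in value:
        return value["#set"]
    return []

def trace_to_mermaid(trace):
    """Render an ITF trace as a Mermaid sequence diagram.

    Assumes the spec variable `packets` is a set of records with
    srcIp/srcPort/destIp/destPort/kind fields (illustrative names).
    """
    lines = ["sequenceDiagram"]
    states = trace["states"]
    for prev, cur in zip(states, states[1:]):
        before = itf_set(prev.get("packets"))
        for pkt in itf_set(cur.get("packets")):
            if pkt not in before:  # the packet was sent during this step
                src = f'{pkt["srcIp"]}:{pkt["srcPort"]}'
                dst = f'{pkt["destIp"]}:{pkt["destPort"]}'
                lines.append(f'    {src}->>{dst}: {pkt["kind"]}')
    return "\n".join(lines)
```

<p>Feed it the result of <code>json.load</code> over an <code>.itf.json</code> file produced by
Apalache, and paste the output into any Mermaid renderer.</p>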

<p><strong>Note on abstractions.</strong> Similar to <a href="https://www.mcmil.net/pubs/SIGCOMM19.pdf">McMillan and Zuck</a>, I tried
to avoid unnecessary abstractions and approximations in the specification.  If
you look at the type definition of a TFTP packet in
<a href="https://github.com/konnov/tftp-symbolic-testing/blob/6fb00d1878b7e37a629868ac25b853d95b16cbdc/spec/typedefs.tla"><code>typedefs.tla</code></a>,
you will see that the fields are modeled as strings and integers, with <code>data</code> abstracted to its length:</p>

<pre><code class="language-tlaplus">  // TFTP Packet Types
  @typeAlias: tftpPacket =
    // Read Request (RFC 1350, Figure 5-1, RFC 2347).
    // See RFCs 2348-2349 for the options.
      RRQ({ opcode: Int, filename: Str, mode: Str, options: Str -&gt; Int })
    // Write Request (RFC 1350, Figure 5-1, RFC 2347).
    // See RFCs 2348-2349 for the options.
    | WRQ({ opcode: Int, filename: Str, mode: Str, options: Str -&gt; Int })
    // Acknowledgment (RFC 1350, Figure 5-3)
    | ACK({ opcode: Int, blockNum: Int })
    // Option Acknowledgment (RFC 2347)
    | OACK({ opcode: Int, options: Str -&gt; Int })
    // Data packet (RFC 1350, Figure 5-2)
    // In our specification, we simply pass the length of data instead of the
    // data itself. The test harness should pass the actual data.
    | DATA({ opcode: Int, blockNum: Int, data: Int })
    // Error packet (RFC 1350, Figure 5-4)
    | ERROR({ opcode: Int, errorCode: Int, msg: Str })
  ;

</code></pre>

<p>Thinking about it now, I could even model <code>data</code> as a sequence of bytes, but it
was obvious to me that only the length of <code>data</code> matters for the protocol logic.</p>

<h2 id="7-bootstrapping-the-testing-harness-with-claude">7. Bootstrapping the testing harness with Claude</h2>

<p>Now, we have the initial TLA<sup>+</sup> specification of TFTP and the standard
implementation <a href="https://kernel.googlesource.com/pub/scm/network/tftp/tftp-hpa/">tftp-hpa</a>, which is the default <code>tftpd</code> server on Linux.</p>

<p>I wanted to avoid running the TFTP server on my laptop. What if I accidentally
find a bug that corrupts my file system? So I have decided to run the server and
the client harnesses in Docker containers. This way, I could easily reset the
SUT and have an isolated network for the TFTP server and clients.</p>

<p>Below is the architecture of the test harness that I had in mind. It’s quite a
bit overengineered for testing TFTP. I also wanted to experiment with Docker
networking and managing multiple containers for potential future projects.</p>

<pre><code>┌──────────────────────────────────────────────────────────────────┐
│                        Host Machine                              │
│                                                                  │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │ harness.py                                                 │  │
│  │  - Coordinates symbolic execution                          │  │
│  │  - Manages Apalache server                                 │  │
│  │  - Controls Docker containers                              │  │
│  │  - Generates and saves test runs                           │  │
│  └────────┬────────────────────────┬──────────────────────────┘  │
│           │                        │                             │
│           ▼                        ▼                             │
│  ┌─────────────────┐     ┌──────────────────────────┐            │
│  │ Apalache Server │     │  Docker Manager          │            │
│  │  (port 8822)    │     │  - Network: 172.20.0.0/24│            │
│  └─────────────────┘     └──────────┬───────────────┘            │
│                                     │                            │
└─────────────────────────────────────┼────────────────────────────┘
                                      │
                         ┌────────────┴──────────────┐
                         │   Docker Network          │
                         │   (172.20.0.0/24)         │
                         │                           │
         ┌───────────────┼───────────────────────────┼─────────────┐
         │               │                           │             │
         ▼               ▼                           ▼             │
  ┌─────────────┐ ┌─────────────┐          ┌─────────────┐         │
  │ TFTP Server │ │  Client 1   │          │  Client 2   │         │
  │ 172.20.0.10 │ │ 172.20.0.11 │          │ 172.20.0.12 │         │
  │             │ │             │          │             │         │
  │ tftp-hpa    │ │ Python      │          │ Python      │         │
  │ Port: 69    │ │ TCP: 15001  │          │ TCP: 15002  │         │
  │ Data:1024-27│ │ (control)   │          │ (control)   │         │
  └─────────────┘ └─────────────┘          └─────────────┘         │
         ▲               │                           │             │
         │               │    UDP TFTP packets       │             │
         └───────────────┴───────────────────────────┘             │
                                                                   │
                         Docker Containers                         │
                                                                   │
                         tftp-test-harness:latest                  │
                                                                   │
└──────────────────────────────────────────────────────────────────┘
</code></pre>
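<p>The setup above can be driven with the plain <code>docker</code> CLI. Below is a
hypothetical sketch that only constructs the command lines; the image name,
the <code>/srv/tftp</code> path, and the exact <code>in.tftpd</code> flags are assumptions for
illustration, and the actual harness in the repository does quite a bit more:</p>

```python
import subprocess

NETWORK = "tftp-net"
SUBNET = "172.20.0.0/24"
IMAGE = "tftp-test-harness:latest"

def network_cmd():
    """Command that creates the isolated bridge network for the test."""
    return ["docker", "network", "create", f"--subnet={SUBNET}", NETWORK]

def server_cmd():
    """Command that starts tftp-hpa at the fixed IP from the diagram."""
    return ["docker", "run", "-d", "--name", "tftp-server",
            "--network", NETWORK, "--ip", "172.20.0.10", IMAGE,
            "in.tftpd", "--foreground", "--secure", "/srv/tftp"]

def client_cmd(index):
    """Command that starts the n-th Python control client."""
    return ["docker", "run", "-d", "--name", f"tftp-client-{index}",
            "--network", NETWORK, "--ip", f"172.20.0.1{index}", IMAGE,
            "python3", "/client.py", "--control-port", str(15000 + index)]

def start_all(run=subprocess.run):
    """Bring up the network, the server, and two clients."""
    for cmd in [network_cmd(), server_cmd(), client_cmd(1), client_cmd(2)]:
        run(cmd, check=True)
```

<p>Keeping the commands as data makes the setup trivially testable: the harness
can assert on the command lists without touching Docker at all.</p>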

<p><strong>LLMs will do the work?</strong> As you may have guessed, I had no interest in
writing the Docker files and the test harness from scratch. Having heard from so
many people how amazing LLMs are, I decided to give Claude a try at generating
the test harness.</p>

<p>Hence, I spent about four hours writing a very detailed prompt for Claude that
explained what I wanted the test harness to look like (the above architecture
diagram was actually generated by Claude from my prompt).</p>

<p><strong>Pushing the button!</strong> So I ran Claude in agent mode with my prompt and
went for a coffee break. You can see the first generated version in <a href="https://github.com/konnov/tftp-symbolic-testing/commit/063da7d2b79c07dfb64225da852440c98b76c41e3">this
commit</a>.
The result looked exciting and amazing until I looked at <code>CHECKLIST.md</code>:</p>

<pre><code>## Notes

- The framework is complete and production-ready
- Remaining work is mostly about connecting components
- Each task is independent and can be tackled separately
- Estimated effort: 4-8 hours for core integration (tasks 1-4)
- Additional 2-4 hours for polish and testing (tasks 5-8)
</code></pre>

<p>What is going on? Claude left me homework? I was also baffled by the hourly
estimates: Are these Claude hours or my hours? In hindsight, the estimate
was surprisingly accurate. It took me about 1.5 days to make this code do the
first test run in which the harness exchanged UDP packets with the TFTP server.</p>

<p>Then I looked at <code>harness.py</code>, which was supposed to be “complete and
production-ready”. Guess what? The main loop was left as a TODO!</p>

<pre><code class="language-python">        # TODO: Implement actual TFTP operation execution
        # This would involve:
        # - Querying Apalache for the transition details
        # - Sending commands to the TFTP client in Docker
        # - Collecting UDP packet responses
        # - Parsing the responses
</code></pre>

<p>The overall structure was there, but the most important pieces were left as
TODOs. Fine. It did the tedious part at least. So I started to chat with Claude
again to implement the missing pieces. If you look at the commit history, you
will see plenty of spaghetti code generated by Claude. In the end, it became a
bit better after my guidance, but I had to rewrite it at some point.</p>

<p>Even though I am making jokes about LLMs here, I must say that Claude really
helped me to debug the Docker setup and produce the code for communicating over
UDP in modern Python. I could easily have lost a couple of days there.</p>

<p>Of course, the exploration logic was totally broken. After all, there is not
much for LLMs to learn from. We are doing something new here!</p>

<p><strong>1.5 days later.</strong> Something was working, but even the happy path was not
there. So I had to take baby steps with Claude. Here are just a few examples
from my Copilot chat:</p>

<blockquote>
  <p>Let’s implement sending the RRQ packet over the wire.</p>
</blockquote>

<blockquote>
  <p>…</p>
</blockquote>

<blockquote>
  <p>Now I am receiving a response like below. This is good! What I want you to
do next. Decode the response and construct the expected packet for the TLA+
specification. Save this as the expectation that we will use in the next step.</p>
</blockquote>

<blockquote>
  <p>…</p>
</blockquote>

<blockquote>
  <p>You should not construct a TLA+ expression. Rather, convert the packet to an
ITF value using itf-py.</p>
</blockquote>

<blockquote>
  <p>…</p>
</blockquote>

<blockquote>
  <p>Can you implement this case for receiving OACK from the server and sending ACK
by the client to the server.</p>
</blockquote>
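<p>For context, the first of these steps, sending an RRQ over the wire, boils
down to the RFC 1350/2347 wire format: a 2-byte opcode, NUL-terminated
filename and mode strings, then NUL-terminated option name/value pairs. A
minimal sketch (an illustration, not the harness code):</p>

```python
import struct

OP_RRQ = 1

def encode_rrq(filename, mode="octet", options=None):
    """Encode a read request (RRQ) per RFC 1350, with RFC 2347 options."""
    parts = [struct.pack("!H", OP_RRQ),               # 2-byte opcode
             filename.encode("ascii") + b"\x00",      # NUL-terminated filename
             mode.encode("ascii") + b"\x00"]          # NUL-terminated mode
    for key, value in (options or {}).items():
        parts.append(key.encode("ascii") + b"\x00")   # option name
        parts.append(str(value).encode("ascii") + b"\x00")  # option value
    return b"".join(parts)
```

<p>The resulting bytes go straight into a UDP datagram addressed to port 69 of
the server.</p>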

<p><strong>2 more days later.</strong> I had the happy path working. At this point, I was tired
of reading the harness logs. So I needed some form of visualization for each
run. Obviously, I wanted to have the same kind of Mermaid diagrams as before.</p>

<p>So I asked Claude to generate a script that would reconstruct the sequence
diagram from the harness logs. Well, it took longer than expected. At some
point, Claude was producing quite convoluted log parsers with regular
expressions and Python loops. Of course, it took a human to define a simple log
format instead.</p>

<p>Below is an example of such a test run, visualized from the log by the generated
script in Mermaid:</p>

<picture>
  <img class="responsive-img" src="/img/tftp-happy.svg" alt="Trace visualization of the testing run" />
</picture>

<p>If you look at the above diagram carefully, you will notice that server responses
come in two flavors:</p>

<ol>
  <li>
    <p>The dashed arrows indicate that the client has received the UDP packet from
the UDP socket.</p>
  </li>
  <li>
    <p>The solid arrows indicate that the UDP packet was successfully replayed
with the TLA<sup>+</sup> specification.</p>
  </li>
</ol>

<h2 id="8-debugging-the-tla-specification-with-the-implementation">8. Debugging the TLA<sup>+</sup> specification with the implementation</h2>

<p>At that point, the tests started to produce actual interactions between the
TLA<sup>+</sup> specification (as solved by Apalache and Z3) and the real TFTP
server. This brought a lot of surprises! I am going to present some of them
below.</p>

<p>In this debugging session, I am keeping the scorecard of how many times the
TLA<sup>+</sup> specification was wrong versus how many times the implementation
(tftp-hpa) was wrong. The scorecard at this point looks like this:</p>

<table>
  <thead>
    <tr>
      <th>Spec bugs</th>
      <th>Implementation bugs</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>0</td>
      <td>0</td>
    </tr>
  </tbody>
</table>

<p>Actually, <a href="https://kernel.googlesource.com/pub/scm/network/tftp/tftp-hpa/">tftp-hpa</a> is quite a mature implementation, so I was not expecting
any bugs there. Keep reading to see what I found.</p>

<h3 id="81-sending-errors-on-read-request">8.1. Sending errors on read request</h3>

<p>The first surprise came from my misunderstanding of how exactly TFTP is supposed
to reply to a malformed read request (<code>RRQ</code>). Since a client sends <code>RRQ</code> to the
control port 69 of the server, I thought that the server would reply with an
error packet (<code>ERROR</code>) from port 69, instead of introducing a new ephemeral
port.</p>

<p>This is what <a href="https://www.rfc-editor.org/rfc/rfc2347">RFC 2347</a> says about option negotiation:</p>

<blockquote>
  <p>…the server should simply omit the option from the <code>OACK</code>, respond with an
alternate value, or send an <code>ERROR</code> packet, with error code 8, to terminate the
transfer.</p>
</blockquote>

<p>No explanation about the port from which the <code>ERROR</code> packet is sent. Well, my
understanding was wrong. The server always allocates a new ephemeral port for
sending the <code>ERROR</code> packet. This kind of makes sense, as the implementation simply
forks on a new request. One score to the implementation:</p>

<table>
  <thead>
    <tr>
      <th>Spec bugs</th>
      <th>Implementation bugs</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1</td>
      <td>0</td>
    </tr>
  </tbody>
</table>

<div class="tip-box">
    <div class="tip-header"><strong>💡 Tip:</strong></div>
    <div class="tip"><p>Actually, as I found later, the spec was not
always wrong, as the busybox implementation always uses port 69!</p>
</div>
</div>

<h3 id="82-the-server-may-send-duplicate-packets">8.2. The server may send duplicate packets</h3>

<p>Well, I knew that, but I was too lazy to write an action in the specification
that would handle duplicate packets. This is a typical shortcut when writing a
specification, since duplicate packets do not change the specification state and
are considered “stuttering” steps. The server implementation retransmitted a <code>DATA</code>
packet, which produced a deviation from the TLA<sup>+</sup> specification. Another
score to the implementation:</p>

<table>
  <thead>
    <tr>
      <th>Spec bugs</th>
      <th>Implementation bugs</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>2</td>
      <td>0</td>
    </tr>
  </tbody>
</table>

<p>Formally speaking, this action does not affect protocol safety, so it is
tempting to simply skip duplicates. However, in conformance testing, we have to
handle all possible actions of the implementation, even if they produce stuttering
steps in the theory of TLA<sup>+</sup>.</p>

<h3 id="83-input-output-conformance-does-not-work-with-udp">8.3. Input-output conformance does not work with UDP!</h3>

<p>The next issue was quite interesting. When I read the papers on input-output
conformance testing from the 1990s, there was always an assumption that the
system under test (SUT) is input-enabled. This means that the SUT can always
accept any input at any time and respond to it, possibly with an error message.
This assumption makes sense for synchronous systems (such as vending machines?),
where the tester can wait for the SUT to be ready to accept the input.</p>

<p>However, TFTP is not like that at all. The client may send an <code>ERROR</code> packet at
any point in time, and the server does not have to reply to it! This is exactly
the kind of deviating test run that the harness produced.</p>

<p>So instead of waiting for a reply from the server on each client action, the
test harness has to optimistically send the next UDP packet and then retrieve
the UDP packets from the server (remember that they live in Docker!).</p>

<p>This is where Claude was useful again. It helped me to collect the UDP packets
on the Docker client. Before taking the next step, the harness would retrieve
the buffered UDP packets from the Docker clients and replay these packets in the
TLA<sup>+</sup> specification, in arbitrary order.</p>

<p>This makes our testing approach a bit more sensitive to the timing of extracting
the buffered UDP packets, but it worked for TFTP.</p>
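<p>The core of this change is that reading from the socket must never block. A
minimal sketch of such a drain step, assuming a plain UDP socket (in the actual
harness, the buffered packets are pulled from the Docker clients over their
control connections):</p>

```python
import socket

def drain_udp(sock, max_packets=64):
    """Collect every UDP datagram currently buffered on a socket.

    The socket is switched to non-blocking mode, so the call returns
    immediately once the OS buffer is empty, instead of waiting for a
    server that may legitimately never reply (e.g., to an ERROR packet).
    """
    sock.setblocking(False)
    packets = []
    for _ in range(max_packets):
        try:
            data, addr = sock.recvfrom(65536)
        except BlockingIOError:
            break
        packets.append((addr, data))
    return packets
```

<p>The harness can then replay whatever this returns against the TLA<sup>+</sup>
specification, in arbitrary order.</p>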

<table>
  <thead>
    <tr>
      <th>Spec bugs</th>
      <th>Implementation bugs</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>3</td>
      <td>0</td>
    </tr>
  </tbody>
</table>

<h3 id="84-the-server-recycles-an-ephemeral-port-on-error">8.4. The server recycles an ephemeral port on ERROR</h3>

<p>Another interesting deviation happened when the server recycled an ephemeral
port. <a href="https://www.rfc-editor.org/rfc/rfc1350">RFC 1350</a> explains how the server allocates ephemeral ports:</p>

<blockquote>
  <p>In order to create a connection, each end of the connection chooses a TID for
itself, to be used for the duration of that connection. The TID’s chosen for a
connection should be randomly chosen, so that the probability that the same
number is chosen twice in immediate succession is very low.</p>
</blockquote>

<p>Well, in our test run, this low-probability event did happen (admittedly, I gave
the TFTP server a small range of ports to use):</p>

<picture>
  <img class="responsive-img" src="/img/tftp-fix5.svg" alt="Recycling ephemeral ports on error" />
</picture>

<p>Actually, this theme of reusing the same ephemeral port recurred multiple times
in the following debugging iterations. It is probably the most problematic
aspect of the protocol, as there is no notion of a session in TFTP. Another
score to the implementation:</p>

<table>
  <thead>
    <tr>
      <th>Spec bugs</th>
      <th>Implementation bugs</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>4</td>
      <td>0</td>
    </tr>
  </tbody>
</table>

<h3 id="85-the-server-recycles-an-ephemeral-port-on-success">8.5. The server recycles an ephemeral port on success</h3>

<p>Guess what? A very similar thing happened on a successful file transfer as well.
Here is a pruned version of the trace that shows this behavior (the initial
sequence of <code>RRQ</code>-<code>OACK</code>-<code>DATA</code>-<code>ACK</code> is omitted for brevity):</p>

<picture>
  <img class="responsive-img" src="/img/tftp-fix6.svg" alt="Recycling ephemeral ports on success" />
</picture>

<p>This behavior seems to be consistent with <a href="https://www.rfc-editor.org/rfc/rfc1350#section-6">Section 6 of RFC 1350</a>, though it
seems ambiguous to me. Anyway, another score to the implementation:</p>

<table>
  <thead>
    <tr>
      <th>Spec bugs</th>
      <th>Implementation bugs</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>5</td>
      <td>0</td>
    </tr>
  </tbody>
</table>

<h3 id="86-mixing-the-protocol-versions">8.6. Mixing the protocol versions</h3>

<p>TFTP essentially has two versions: the original version defined in RFC 1350 and
the extended version with option negotiation defined in RFC 2347. In combination
with packet duplication, this produced a very interesting deviation. I’ve not
saved the full trace, but here is what happened. The server processes an RRQ
with options and sends an OACK, as per RFC 2347. After that, the TLA<sup>+</sup>
specification of the server receives an earlier RRQ without options and sends a
DATA packet in response, as per RFC 1350. This corrupts the internal state of
the server in the specification.</p>

<p>Obviously, this is caused by non-determinism in the TLA<sup>+</sup>
specification, which allows the protocol to behave according to both protocol
versions at the same time. I had to fix the specification by disallowing the
server from behaving according to RFC 1350 when it receives an RRQ with options.
One score to the implementation:</p>

<table>
  <thead>
    <tr>
      <th>Spec bugs</th>
      <th>Implementation bugs</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>6</td>
      <td>0</td>
    </tr>
  </tbody>
</table>

<h3 id="87-more-deviations-on-the-specification-side">8.7. More deviations on the specification side</h3>

<p>At some point, I got tired of collecting the precise deviations. They still can
be recovered from the commit log though. Here are some of the further deviations
on the specification side that I fixed:</p>

<ul>
  <li>
    <p>The client must send <code>tsize = 0</code> in RRQ.</p>
  </li>
  <li>
    <p>The server should send the default timeout if it is not specified in the options.</p>
  </li>
  <li>
    <p>The server may send invalid (e.g., outdated) packets.</p>
  </li>
  <li>
    <p>My understanding of TFTP timeouts was wrong. I thought that a timeout was
 meant to close a transfer session. Instead, timeouts in TFTP just trigger
 packet retransmissions. The number of retries is not specified in the RFCs. In
 practice, tftp-hpa seems to retry 5 times before giving up.</p>
  </li>
  <li>
    <p>The server specification should store transfers for triplets <code>(clientIP,
 clientPort, serverPort)</code> instead of pairs <code>(clientIP, clientPort)</code>.</p>
  </li>
</ul>
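<p>The corrected timeout behavior can be sketched as a simple client-side
retransmission loop. This is an illustration of the semantics, not the harness
code; the 5-retry limit is just what tftp-hpa appears to use:</p>

```python
import socket

def send_with_retries(sock, packet, dest, timeout=1.0, retries=5):
    """Send a TFTP packet and wait for a reply, retransmitting on timeout.

    Per the RFCs, a timeout does not close the transfer; it only triggers
    a retransmission. The retry count is implementation-defined.
    """
    sock.settimeout(timeout)
    for _attempt in range(1 + retries):
        sock.sendto(packet, dest)
        try:
            return sock.recvfrom(65536)  # (data, addr) of the reply
        except socket.timeout:
            continue  # retransmit the same packet
    raise TimeoutError(f"no reply after {retries} retransmissions")
```

<p>Only after the retries are exhausted does the transfer get abandoned.</p>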

<p>In the end, the implementation scored another seven points before the tests
started to pass.</p>

<table>
  <thead>
    <tr>
      <th>Spec bugs</th>
      <th>Implementation bugs</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>13</td>
      <td>0</td>
    </tr>
  </tbody>
</table>

<p>It looks like my TLA<sup>+</sup> specification was a bit sloppy, in comparison
to the mature implementation of <code>tftp-hpa</code>. I have not designed this protocol
and did not give much thought to it. Obviously, the engineers have spent much
more time thinking about its behavior. You can check the specification in
<a href="https://github.com/konnov/tftp-symbolic-testing">the repository</a>.</p>

<h2 id="9-testing-against-adversarial-behavior">9. Testing against adversarial behavior</h2>

<p>At some point I thought: My clients are too well-behaved! They never lose,
duplicate, or reorder packets. What if they start to misbehave within the
protocol boundaries? Would I then find bugs in the implementation? I did.
Keep reading.</p>

<p>Hence, I have added one more action that simply lets a client retransmit a
previously sent packet in <code>Next</code>:</p>

<pre><code class="language-tlaplus">    \/ \E udp \in packets:
        ClientSendDup(udp)

</code></pre>

<p>Below is the action <code>ClientSendDup</code>. It does not change the specification state
at all. However, it produces an action that retransmits a packet in the harness:</p>

<pre><code class="language-tlaplus">\* A client resends a duplicate packet that it sent in the past.
\* This is to test for the Sourcerer's Apprentice syndrome.
\* @type: $udpPacket =&gt; Bool;
ClientSendDup(_udp) ==
    ClientSendDup::
    /\ _udp.destIp = SERVER_IP
    /\ lastAction' = ActionRecvSend(_udp)
    /\ UNCHANGED &lt;&lt;packets, serverTransfers, clientTransfers, clock&gt;&gt;

</code></pre>

<p>You can find the complete specification <a href="https://github.com/konnov/tftp-symbolic-testing/tree/main/spec">here</a>.</p>

<p><strong>Protocol deviation.</strong> It mostly worked as expected. However, a few traces were
reporting deviations. Here is one of them. It’s pretty long. Look for an
explanation below.</p>

<picture>
  <img class="responsive-img" src="/img/tftp-malformed-ack.svg" alt="The implementation diverging from the specification" />
</picture>

<p>The last UDP packet is an acknowledgment for block 1 from the server. If
you think about the protocol, the server should never send an ACK in the
sessions associated with read requests (RRQ). ACK packets are only sent by the
clients. Yet, this is what was happening. To double-check this, I asked
Claude to capture the traffic in pcap files in the Docker containers. Indeed,
Wireshark was showing the ACK packet from the server. Moreover, the packet was
malformed. It looked like an option acknowledgment (OACK) packet, but had the
first bytes of an ACK packet. Sounds like memory corruption!</p>

<p>Here is the core sequence of events that produced this behavior (a few details
removed):</p>

<ol>
  <li>
    <p>The client sends <code>RRQ("file1", blksize=NN)</code> to the server (172.20.0.10:69).</p>
  </li>
  <li>
    <p>The server sends a few OACK packets to the client.</p>
  </li>
  <li>
    <p>The client erroneously sends <code>ACK(1)</code> to the server, which is a duplicate
 packet from an earlier transfer. It could be simply a delayed packet though.</p>
  </li>
  <li>
    <p>The server responds with <code>ACK(1)</code> of length 64, which is basically the
 <code>OACK</code> packet with the first 4 bytes coming from <code>ACK(1)</code>.</p>
  </li>
</ol>

<p><strong>Investigation.</strong> Luckily, the source code is readily available. I looked
into the function <code>tftp_sendfile</code> of <code>tftp-hpa</code> that handles read requests.
Indeed, the option negotiation loop sends the option acknowledgment packet
<code>OACK</code> and waits for an <code>ACK</code> from the client. There are two cases:</p>

<ul>
  <li>
    <p>When it receives an <code>ACK</code> for block 0, it breaks out of the loop and continues with sending data blocks. <strong>This is the happy path.</strong></p>
  </li>
  <li>
    <p>When it receives an acknowledgment for a block other than 0, it simply
 continues the loop, retransmitting <code>OACK</code>. The issue is that <strong>the code uses
 the same buffer</strong> for sending <code>OACK</code> and receiving <code>ACK</code> packets via different
 pointers! Hence, it later sends an <code>OACK</code> packet that is corrupted with the
 contents of the <code>ACK</code> packet. <strong>I don’t think I would have found this by code
 review!</strong></p>
  </li>
</ul>
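<p>This deviation is easy to flag mechanically once packets are decoded against
the wire format: per RFC 1350, an ACK is exactly 4 bytes (a 2-byte opcode plus
a 2-byte block number), so a 64-byte packet with the ACK opcode is malformed by
construction. A hedged sketch of such a check (not the harness code):</p>

```python
import struct

OPCODES = {1: "RRQ", 2: "WRQ", 3: "DATA", 4: "ACK", 5: "ERROR", 6: "OACK"}

def check_packet(raw):
    """Classify a raw TFTP packet and flag an obviously malformed ACK.

    RFC 1350 fixes the ACK layout: a 2-byte opcode, a 2-byte block number,
    and nothing else. The corrupted OACK observed above therefore shows up
    as an over-long ACK.
    """
    (opcode,) = struct.unpack("!H", raw[:2])
    kind = OPCODES.get(opcode, "UNKNOWN")
    if kind == "ACK" and len(raw) != 4:
        return kind, "malformed: ACK must be exactly 4 bytes"
    return kind, "ok"
```

<p>Note that this check alone is not the oracle: it only catches packets that
violate the wire format, while the specification also catches well-formed
packets sent at the wrong time.</p>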

<p>Just for fun, I checked it with Claude. It could not identify this issue. The
trick is that the same buffer is pointed to by two different pointers, so Claude
is not clever enough to track this aliasing. When I explained the issue to
Claude, it was ecstatic: You have found a critical bug!</p>

<p>I continued looking for the blast radius of this bug. Even though it is a form
of memory corruption, it cannot crash the server, as the code is still writing
to the same buffer, allocated by the server itself. All it can do is produce
malformed packets. Hence, it could probably crash a sloppy client, but would not
do much harm to a well-behaved client or to the server itself. Moreover, if a
client crashes in such a case, anybody else on the network could have sent the
malformed ACK as well.</p>

<p>So this is a bug (from the specification p.o.v.), but it does not result in a
vulnerability. In any case, it was a deviation from the protocol specification.
Finally, one point to the specification!</p>

<table>
  <thead>
    <tr>
      <th>Spec bugs</th>
      <th>Implementation bugs</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>13</td>
      <td>1</td>
    </tr>
  </tbody>
</table>

<p><strong>Contacting the author.</strong> To be on the safe side, before writing this blog
post, I contacted the author of tftp-hpa. As I expected, he replied that
TFTP is an unencrypted, unauthenticated protocol, so we should not expect much
security there.</p>

<h2 id="10-the-specification-as-a-differential-testing-oracle">10. The specification as a differential testing oracle</h2>

<p>After finding the above implementation bug, I decided to test other TFTP
implementations as well. This is where Claude was super useful again. I just
asked it to generate Dockerfiles for other implementations, which it did
quickly. It turned out that a similar issue existed in another implementation. I
could not figure out the root cause in the source code of that other
implementation, as it is a bit harder to read than <code>tftp-hpa</code>, so I am not
giving the details here.</p>

<p>Apart from this second deviation, the other implementations worked fine. Overall,
the specification scored another point:</p>

<table>
  <thead>
    <tr>
      <th>Spec bugs</th>
      <th>Implementation bugs</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>13</td>
      <td>2</td>
    </tr>
  </tbody>
</table>

<p>Here is what I find really interesting. Whenever I talk to engineers about formal
specifications, they tell me that they would rather do <strong>differential testing</strong>
than write specifications; that is, they would like to compare the behavior of
one implementation against another implementation. However,
differential testing is not magic. It requires test inputs to compare the
implementations. Hence, <strong>if the test suite is missing adversarial test cases,
both implementations may pass the tests</strong>, even though they are both wrong.</p>

<p>What we did here with the TLA<sup>+</sup> specification is something more than
just differential testing. First, we have debugged the specification against
<code>tftp-hpa</code>, so we have extracted its expected behavior into a relatively small
and precise formal specification. Second, we have used this specification to
produce the tests for another implementation!</p>

<h2 id="11-prior-work">11. Prior Work</h2>

<p>In this section, I’ve collected the previous work on model-based testing and
trace validation with TLA<sup>+</sup>:</p>

<ul>
  <li>
    <p>Nagendra et al. Model guided fuzzing of distributed systems (2025).
Check <a href="https://www.youtube.com/watch?v=DO8MvouV29M">the talk</a>.</p>
  </li>
  <li>
    <p>Cirstea, Kuppe, Merz, Loillier. Validating Traces of Distributed Systems
Against TLA+ Specifications (2024). Check the
<a href="https://arxiv.org/abs/2404.16075">arxiv paper</a>.</p>
  </li>
  <li>
    <p>Chamayou et al. Validating System Executions with the TLA+ Tools (2024).
See <a href="https://www.youtube.com/watch?v=NZmON-XmrkI">the talk</a>.</p>
  </li>
  <li>
    <p>Halterman. Verifiability Gap: Why We Need More From Our Specs and
How We Can Get It (2020).
See <a href="https://www.youtube.com/watch?v=itcj9j2yWQo">the talk</a>.</p>
  </li>
  <li>
    <p>Davis et al. eXtreme Modelling in Practice (2020).
See <a href="https://www.youtube.com/watch?v=IIGzXX72weQ">the talk</a>.</p>
  </li>
  <li>
    <p>Kupriyanov, Konnov. Model-based testing with TLA+ and Apalache (2020).
See <a href="https://www.youtube.com/watch?v=aveoIMphzW8">the talk</a>.</p>
  </li>
  <li>
    <p>Pressler. Verifying Software Traces Against a Formal Specification with
TLA<sup>+</sup> and TLC (2018).
Check <a href="https://pron.github.io/files/Trace.pdf">the paper</a>.</p>
  </li>
</ul>

<p>I am pretty sure that this list is incomplete, so please let me know if you are
aware of any other relevant work.</p>

<h2 id="12-conclusions">12. Conclusions</h2>

<p>This was a lot of text! Thank you for reading it till the end. It may look like
this project took me an eternity to complete. In reality, <strong>it took me about two
weeks of part-time work</strong> from start to finish. On the one hand, I could
probably have done some parts of it faster if I had not relied so much on Claude
for generating the test harness. On the other hand, <strong>Claude quickly generates
the code to start and stop services, parse their logs, etc.</strong> All the
Docker-related work was done by Claude, and I did not have to touch it. This is
the kind of work that I find annoying and that LLMs just do. In this experiment,
I’ve burned all of my monthly premium requests included in the Copilot plan. To
be fair, I also had to add a few features to the new Apalache API, as I was still
experimenting with it.</p>

<p>What I find interesting in the approach outlined here is that it offers a
(relatively) <strong>lightweight way of testing real-world protocols</strong>. Thinking of
fuzzing in this context, <strong>I don’t think a standard fuzzer would have found the
above deviations in TFTP</strong>. Indeed, the implementation was not crashing. Nor was
it accessing memory out of bounds. It was just producing malformed packets
occasionally. To detect this, <strong>we needed a test oracle</strong> that would tell us
whether a deviation happened. Writing such an oracle manually would be tedious
and error-prone. Instead, we used <strong>a formal specification as a precise and
unambiguous oracle</strong>. Unambiguous does not mean deterministic, though. Our oracle
is non-deterministic, but it precisely defines the allowed behaviors of the
protocol.</p>

<p>In addition to that, <a href="https://kernel.googlesource.com/pub/scm/network/tftp/tftp-hpa/">tftp-hpa</a> is not just a piece of code that was written
by a startup over a weekend, or generated by an LLM. It is <strong>a very mature
project that was written by professionals in the times when people had time
to think</strong>. They guarded against the <a href="https://en.wikipedia.org/wiki/Sorcerer%27s_Apprentice_syndrome">Sorcerer’s Apprentice Syndrome</a>. This is
why I was quite surprised to see an unexpected packet from the server.</p>

<p>On the Apalache side, we finally have a symbolic approach that <strong>scales much
better than bounded model checking</strong>! In my experiments with TFTP, the new
JSON-RPC API showed signs of <strong>slowing down only after about 200 steps</strong> of
symbolic execution. This is a huge improvement over the previous approach, where
Apalache slowed down after about 10–20 steps. It is easy to see why: we feed the
concrete responses from the implementation into the SMT context, which
immediately enables a lot of simplifications.</p>
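<p>To make this concrete, here is a minimal Python sketch of a harness loop in this style. It is our own illustration, not the actual Apalache API: <code>impl_step</code> and <code>apalache_check_step</code> are hypothetical stand-ins for driving the implementation and for the JSON-RPC call to the checker.</p>

```python
# Hypothetical harness sketch: feed the concrete responses of the
# implementation into the checker step by step. All names are stand-ins.

def impl_step(state, action):
    # Stand-in for driving the real implementation and reading its
    # concrete response (e.g., a parsed protocol packet).
    return {"step": state["step"] + 1, "last": action}

def apalache_check_step(state, action):
    # Stand-in for a JSON-RPC call asking the model checker whether the
    # concrete transition from `state` is allowed by the specification.
    return True

def run(n_steps, reseed_every=50):
    """Run the implementation, validating each concrete step."""
    trace = [{"step": 0, "last": None}]
    for i in range(n_steps):
        action = f"action-{i}"
        if not apalache_check_step(trace[-1], action):
            return trace, "deviation"
        trace.append(impl_step(trace[-1], action))
        if (i + 1) % reseed_every == 0:
            # Keep only the last concrete state, so the checker can be
            # re-initialized from it instead of a growing SMT context.
            trace = trace[-1:]
    return trace, "ok"

trace, status = run(200)
print(status)  # ok
```

In this sketch, re-seeding every <code>reseed_every</code> steps keeps the symbolic context small; concrete states are cheap to store on the harness side.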

<p>We can <strong>improve this even further, to an essentially unlimited number of steps</strong>.
All that is needed is to keep the concrete trace on the harness side and
initialize the SMT context with the last state of the trace. We can do this every
step, or every <code>N</code> steps. The cool thing is that it can all be done outside of
Apalache, on the harness side! This opens the door to <strong>quick experimentation</strong>
with various strategies of mixing <strong>symbolic and concrete execution</strong>.</p>]]></content><author><name>Igor Konnov</name></author><category term="tlaplus" /><summary type="html"><![CDATA[Author: Igor Konnov]]></summary></entry><entry><title type="html">Formal Verification of the Aztec Governance Protocol</title><link href="https://protocols-made-fun.com/quint/2025/12/09/aztec-governance.html" rel="alternate" type="text/html" title="Formal Verification of the Aztec Governance Protocol" /><published>2025-12-09T00:00:00+00:00</published><updated>2025-12-09T00:00:00+00:00</updated><id>https://protocols-made-fun.com/quint/2025/12/09/aztec-governance</id><content type="html" xml:base="https://protocols-made-fun.com/quint/2025/12/09/aztec-governance.html"><![CDATA[<p><strong>Authors:</strong> <a href="https://blltprf.xyz/">Thomas Pani</a>, <a href="https://konnov.phd">Igor Konnov</a></p>

<p><strong>Date:</strong> December 9, 2025</p>

<h2 id="1-introduction">1. Introduction</h2>

<p>In August 2025, <a href="https://aztec-labs.com">Aztec Labs</a> engaged <a href="https://blltprf.xyz/">Thomas Pani</a> and <a href="https://konnov.phd">Igor Konnov</a> to formally specify and verify the new <strong>Aztec Governance Protocol</strong> – the core on-chain system that governs <a href="https://aztec.network/">Aztec Network</a>.</p>

<p>Over the course of five weeks, we reviewed every line of code in scope and developed a <strong>precise formal specification, verified automatically</strong> with <a href="https://github.com/apalache-mc/apalache">Apalache</a>. The result: scalable, massively parallel automated verification that explored the entire protocol state space to <strong>formally confirm correctness and uncover subtle, cross-contract issues</strong> that conventional audits or fuzzing can easily miss.</p>

<p>The team at Aztec Labs reviewed our findings and addressed all of them.</p>

<h3 id="at-a-glance-metrics">At-a-Glance Metrics</h3>

<p>For the impatient reader, here are some key figures:</p>

<table>
  <thead>
    <tr>
      <th> </th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>125 invariants</strong> specified across <strong>10 contracts</strong>, <strong>8 libraries</strong>, and <strong>8 interfaces</strong></td>
    </tr>
    <tr>
      <td><strong>992 verification conditions</strong> checked in total</td>
    </tr>
    <tr>
      <td><strong>72 physical cores / 368 GiB RAM</strong>, running for <strong>321 CPU-days</strong> (≈ 2 weeks)</td>
    </tr>
    <tr>
      <td><strong>Findings:</strong> <span style="background-color:#C48F00; color:#fff; padding:2px 6px; border-radius:6px;">5 Medium</span> • <span style="background-color:#2F7D32; color:#fff; padding:2px 6px; border-radius:6px;">3 Low</span> • <span style="background-color:#005A9E; color:#fff; padding:2px 6px; border-radius:6px;">6 Info</span>  <a href="#9-findings">(jump ahead)</a></td>
    </tr>
    <tr>
      <td><strong>Final complete verification run</strong> in <strong>576 CPU-hours</strong> (≈ 1 calendar day)</td>
    </tr>
    <tr>
      <td><strong>Contract size:</strong> <strong>~2 kLOC</strong> Solidity</td>
    </tr>
    <tr>
      <td><strong>Specification size:</strong> <strong>~4 kLOC</strong> Quint (incl. traceability comments)</td>
    </tr>
  </tbody>
</table>

<p>These runtimes are comparable to large-scale fuzzing campaigns – but with a crucial difference: <strong>formal verification explores every possible transaction symbolically</strong>, offering <em>exhaustive</em> reasoning rather than probabilistic coverage.</p>

<h3 id="highlights">Highlights</h3>

<p>Some of the key highlights from this article include:</p>

<ul>
  <li><strong>Representative issue: Governance Insolvency</strong> with root-cause analysis and fix (<a href="#91-governance-insolvency">§9.1</a>)</li>
  <li>Choosing the <strong>right tools</strong> (<a href="#4-choosing-the-right-tools">§4</a>)</li>
  <li><strong>Bootstrapping the formal specification with AI</strong> (<a href="#52-bootstrapping-the-formal-specification-with-ai">§5.2</a>)</li>
  <li><strong>Making verification scale</strong> with compositional reasoning and inductive invariants (<a href="#53-compositional-reasoning">§5.3</a>, <a href="#73-inductive-invariants-making-verification-scale">§7.3</a>)</li>
  <li><strong>Showing that the protocol can progress</strong>: witnesses of liveness (<a href="#74-witnesses-of-liveness-proving-the-protocol-can-progress">§7.4</a>)</li>
</ul>

<p>Our formal report can be accessed via <a href="https://github.com/konnov/aztec-governance-formal-verification-2025q3/blob/e313681ade9f9e96d0e83a5120a670a1e1e07188/reports/Aztec-Governance-Protocol.pdf">this link</a> and the specifications can be found via <a href="https://github.com/konnov/aztec-governance-formal-verification-2025q3">this link</a>.</p>

<h2 id="2-overview-of-aztec-governance">2. Overview of Aztec Governance</h2>

<p>Aztec Network’s governance is implemented as a suite of on-chain Solidity contracts. We summarize its multi-contract architecture, which required <strong>compositional analysis to verify</strong>, in the diagram below. The current implementation extends and formalizes the concepts from the <a href="https://forum.aztec.network/t/request-for-comments-aztec-governance/7413">Aztec Governance RFC</a> – see <a href="https://docs.aztec.network/the_aztec_network/concepts/governance">Aztec Governance</a> for the canonical documentation. This post reflects the protocol as of the <a href="https://github.com/AztecProtocol/aztec-packages/blob/8b10b2b220de38c9e2e2e2b7d05d7383701ba070">commit used in our engagement</a>. Parts of the codebase have evolved since our engagement.</p>

<p><img src="/assets/images/aztec-governance.webp" alt="Aztec diagram: contract architecture of Governance, GSE, Registry, Proposer, Slasher, and flows" /></p>

<p>(We highlight the key contracts, with no special meaning attached to the colors.)</p>

<p><strong>Rollups and Registry.</strong> Aztec Network manages a system of rollups recorded in the <a href="https://github.com/AztecProtocol/aztec-packages/blob/8b10b2b220de38c9e2e2e2b7d05d7383701ba070/l1-contracts/src/governance/Registry.sol">Registry</a>, which directs inflationary rewards to a single, designated <em>canonical rollup</em>.</p>

<p><strong>GovernanceProposer.</strong> <a href="https://github.com/AztecProtocol/aztec-packages/blob/8b10b2b220de38c9e2e2e2b7d05d7383701ba070/l1-contracts/src/governance/proposer/GovernanceProposer.sol">GovernanceProposer</a> (derived from <a href="https://github.com/AztecProtocol/aztec-packages/blob/8b10b2b220de38c9e2e2e2b7d05d7383701ba070/l1-contracts/src/governance/proposer/EmpireBase.sol">EmpireBase</a>) forms the foundational layer of the voting system, implementing a round-based signaling mechanism to determine which proposals advance to <code>Governance</code>.</p>

<p><strong>Governance.</strong> The <a href="https://github.com/AztecProtocol/aztec-packages/blob/8b10b2b220de38c9e2e2e2b7d05d7383701ba070/l1-contracts/src/governance/Governance.sol">Governance</a> contract manages the full proposal lifecycle, including submission, voting, and execution. Given their critical role in managing the Aztec Network, <code>Governance</code> incorporates access control, such as whitelisting beneficiaries that participate in voting. On top of that, it implements an emergency proposal mechanism which requires a substantial token lock.</p>

<p><strong>Governance Staking Escrow (GSE).</strong> Governance stakes and corresponding voting rights are managed by the <em>Governance Staking Escrow</em> (<a href="https://github.com/AztecProtocol/aztec-packages/blob/8b10b2b220de38c9e2e2e2b7d05d7383701ba070/l1-contracts/src/governance/GSE.sol">GSE</a>) contract. <code>GSE</code> enables seamless migration of staked assets to new canonical chains, addressing the “cold-start” problem by ensuring immediate operational support during network upgrades. Proposals made through <code>GovernanceProposer</code> tie back to <code>GSE</code> during execution, verifying that at least two-thirds of the total stake is allocated to the latest rollup.</p>

<p><strong>SlashingProposer and Slasher.</strong> The <a href="https://github.com/AztecProtocol/aztec-packages/blob/8b10b2b220de38c9e2e2e2b7d05d7383701ba070/l1-contracts/src/core/slashing/SlashingProposer.sol">SlashingProposer</a> contract, also derived from <a href="https://github.com/AztecProtocol/aztec-packages/blob/8b10b2b220de38c9e2e2e2b7d05d7383701ba070/l1-contracts/src/governance/proposer/EmpireBase.sol">EmpireBase</a>, uses the same round-based signaling mechanism to determine which slashing proposals are forwarded to the <a href="https://github.com/AztecProtocol/aztec-packages/blob/8b10b2b220de38c9e2e2e2b7d05d7383701ba070/l1-contracts/src/core/slashing/Slasher.sol">Slasher</a> contract.</p>

<p><strong>Libraries.</strong> The main contracts are supported by a set of custom libraries that use storage-layout compression for gas optimization. These libraries enable the system to retrieve historical, checkpointed state for computing voting power, implement a custom checkpointed set data structure built on OpenZeppelin’s <code>Checkpoints.Trace224</code> library, implement the vote-tallying algorithm, and provide helper functions that encode the proposal lifecycle state machine.</p>

<h2 id="3-attack-surface">3. Attack Surface</h2>

<p>The attack surface of the Governance Protocol is significant – with <strong>over 40 external state-mutating functions across multiple contracts</strong> in scope. Moreover, problematic scenarios typically:</p>

<ul>
  <li>involve multiple contracts,</li>
  <li>exercise them over several transactions, and</li>
  <li>can even involve multiple instances of the same contract (e.g., several Rollups, several Governance contracts, etc.).</li>
</ul>

<p><strong>Reasoning about time.</strong> <code>Governance</code> and <code>GovernanceProposer</code> use the block timestamp to organize signaling and voting phases. <code>GovernanceProposer</code> slots are short (<strong>fractions of a minute</strong>), while <code>Governance</code> voting periods are much longer (<strong>minutes to days</strong>). We must therefore reason about long time horizons interrupted by short-lived events.</p>

<p><strong>Malicious external inputs.</strong> To make things harder, we also considered scenarios in which a canonical rollup produces <strong>erroneous readings</strong> from time to time, e.g., due to a fault. For example, what if a canonical rollup starts to return slot numbers from the past (or far in the future)?</p>

<p>This poses both a <strong>challenge and an opportunity</strong>: standard techniques such as fuzzing, random simulation, or bounded model checking would not get us far – the state and action spaces are prohibitively large.</p>

<h2 id="4-choosing-the-right-tools">4. Choosing the Right Tools</h2>

<p>With the attack surface in view, the next question was tooling: how to verify the protocol logic without drowning in bytecode.</p>

<h3 id="41-protocol-level-specification">4.1. Protocol-Level Specification</h3>

<p>From an engineer’s perspective, the ideal solution would be to verify correctness directly at the implementation level – that is, to automatically reason about the Solidity code itself. Tools such as <a href="https://www.certora.com/">Certora Prover</a>, <a href="https://kontrol.runtimeverification.com/">Kontrol</a>, <a href="https://github.com/a16z/halmos">Halmos</a>, and <a href="https://hevm.dev/">HEVM</a> aim to do exactly this by automating formal reasoning over smart contracts. These tools are remarkable engineering achievements, but their task is inherently complex: among other things, they must reason precisely about stack behavior, memory, storage, and external calls – all the way down to the EVM bytecode.</p>

<p>Before diving into such low-level reasoning, however, we believe it is essential to <strong>ensure that the protocol’s logic is sound</strong>. If high-level properties of the protocol are violated, then verifying bit-level correctness provides limited value. Once the protocol logic is verified, attention can shift to the implementation.</p>

<p>In this project, we focused on <strong>specifying the Aztec Governance Protocol at the logic level</strong>. Our <strong>main objectives</strong> were to:</p>

<ul>
  <li>specify the high-level behavior of the protocol,</li>
  <li>identify its core invariants, and</li>
  <li>prove these invariants correct (or demonstrate violations through counterexamples).</li>
</ul>

<h3 id="42-languages-and-tools">4.2. Languages and Tools</h3>

<p>Several specification languages could serve this purpose. For instance, we could have expressed the protocol directly in an interactive theorem prover like <a href="https://lean-lang.org/">Lean</a> or <a href="https://rocq-prover.org/">Rocq</a>. However, both offer little automation, so only limited progress would have been feasible within our one-month timeframe. (Recently, Lean has seen exciting developments such as the <a href="https://lean-lang.org/doc/reference/latest/The--grind--tactic/"><code>grind</code> tactic</a> and research by the <a href="https://verse-lab.github.io/">VERSE group</a>. We may explore these in a future engagement!)</p>

<p><strong>TLA<sup>+</sup> and its tooling.</strong> <a href="https://tlapl.us/">TLA<sup>+</sup></a> is perhaps the most well-known practical specification language. It is supported by two model checkers (<a href="https://github.com/tlaplus/tlaplus">TLC</a> and <a href="https://github.com/apalache-mc/apalache">Apalache</a>) and an interactive theorem prover <a href="https://proofs.tlapl.us/">TLAPS</a>. We use the methodology of TLA<sup>+</sup> to reason about the Governance Protocol as a collection of interacting state machines over large state spaces. Since many engineers find the syntax of TLA<sup>+</sup> confusing, we use the surface syntax <a href="https://github.com/informalsystems/quint">Quint</a> to write the specifications. As co-authors of both <strong>Quint</strong> and the <strong>Apalache</strong> model checker – together with Gabriela Moreira, Shon Feder, Jure Kukovec, and others – we have a deep understanding of their internals and how to apply them to large-scale protocol verification. This expertise was essential for scaling our analysis to a system as complex as Aztec Governance.</p>

<h2 id="5-specification-decisions-and-challenges">5. Specification Decisions and Challenges</h2>

<p>With the goals and tools defined, the next step was to translate the Governance Protocol into a precise, analyzable specification.</p>

<h3 id="51-writing-the-specification">5.1. Writing the Specification</h3>

<p><strong>Modeling the states.</strong> Before specifying the contract behavior, we must decide how to model the contract states. We first define the shape of individual contract states. For example, below is the state of a <code>GovernanceProposer</code>.</p>

<pre><code class="language-ts">// GovernanceProposer contract state
type GovernanceProposerState = {
  // the state of the parent EmpireBase contract
  empireBase: EmpireBaseState,
  // mapping(uint256 proposalId =&gt; address proposer)
  proposalProposer: Uint256 -&gt; Address,
  // immutable config (set in constructor)
  REGISTRY: Address,
  GSE: Address
}
</code></pre>

<p>You can see that many concepts from Solidity (like mappings) are seamlessly expressed in Quint. The full protocol state – including all relevant contracts – is captured by <code>EvmState</code>. In our case, an EVM state is structured as follows:</p>

<pre><code class="language-ts">type EvmState = {
  block_timestamp: Uint256,
  // all possible instances of ERC20 used as assets
  assets: Address -&gt; ERC20State,
  // all possible instances of Governance
  governances: Address -&gt; GovernanceState,
  // all possible instances of GovernanceProposer
  governanceProposers: Address -&gt; GovernanceProposerState,
  // all possible instances of GSE
  gses: Address -&gt; GSEState,
  // all possible instances of Registry
  registries: Address -&gt; RegistryState,
  // all possible instances of RewardDistributor
  rewardDistributors: Address -&gt; RewardDistributorState,
  // all instances of Slasher
  slashers: Address -&gt; SlasherState,
  // all instances of SlashingProposer
  slashingProposers: Address -&gt; SlashingProposerState,
  // IEmperor(...).getCurrentSlot() for each rollup
  rollupSlot: IHaveVersion -&gt; int,
  // IEmperor(...).getCurrentProposer() for each rollup
  rollupProposer: IHaveVersion -&gt; Address,
  // mapping rollup addresses to their versions
  // Corresponds to _rollup.getVersion() call in Registry.sol:53
  ROLLUP_VERSIONS: IHaveVersion -&gt; Uint256,
  // mapping rollup address to the reward distributors that they create
  REGISTRY_REWARD_DISTRIBUTORS: IHaveVersion -&gt; IRewardDistributor
}
</code></pre>

<p>As you can see, we do not have to focus on nitty-gritty low-level details – like how storage is laid out in EVM. <strong><em>This frees us to focus on protocol logic and high-level correctness, rather than low-level implementation concerns. It also makes the reasoning problem more tractable for automated verification.</em></strong></p>

<p><strong>Modeling the contract functions.</strong> The contract functions are simply pure functions over the EVM state. For instance, we define the function <code>initiateWithdraw</code> in Quint as:</p>

<pre><code class="language-ts">// Governance.sol#L341
pure def Governance::initiateWithdraw(__evm_state: EvmState,
      __self: IGovernance, __msg_sender: Address,
      _to: Address, _amount: Uint256): Result[EvmState] = {
  val __state = __evm_state.governances.get(__self)
  val config = __state.configuration

  // ConfigurationLib.sol#L36:
  //   Timestamp.wrap(Timestamp.unwrap(_self.votingDelay) / 5) +
  //     _self.votingDuration + _self.executionDelay;
  val withdrawDelay = config.votingDelay / 5
      + config.votingDuration + config.executionDelay

  // L342: _initiateWithdraw(msg.sender, _to, _amount,
  //                         configuration.withdrawalDelay());
  Governance::_initiateWithdraw(__evm_state, __self, __msg_sender,
                                _to, _amount, withdrawDelay)
}
</code></pre>

<p>In Quint, we explicitly model all side-effects of the Solidity code, including exceptions and reverts. While it makes our specification more verbose, all branches and assignments become immediately visible at the code level – auditors do this in their heads all the time. For example, <code>_initiateWithdraw</code> computes and returns an updated <code>EvmState</code>, unless it reverts:</p>

<pre><code class="language-ts">// Governance.sol#L694
pure def Governance::_initiateWithdraw(__evm_state: EvmState,
      __self: IGovernance, _from: Address, _to: Address,
      _amount: Uint256, _delay: Timestamp): Result[EvmState] = {
  val __state = __evm_state.governances.get(__self)
  // L695: users[_from].sub(_amount);
  val fromAmount = __state.users.getOrElse(_from, checkpoints::constructor)
  val userTraceOrError = checkpoints::sub(__evm_state, fromAmount, _amount)
  if (isErr(userTraceOrError)) {
    err(__evm_state, userTraceOrError.err)
  } else {
    // L696: total.sub(_amount);
    val totalTraceOrError = checkpoints::sub(__evm_state, __state.total, _amount)
    if (isErr(totalTraceOrError)) {
      err(__evm_state, totalTraceOrError.err)
    } else {
      // L698: uint256 withdrawalId = withdrawalCount++;
      // L700: withdrawals[withdrawalId] = Withdrawal({...});
      val withdrawal = {
          amount: _amount,
          unlocksAt: __evm_state.block_timestamp + _delay,
          recipient: _to, claimed: false
      }

      val __state1 = {
        ...__state,
        users: __state.users.put(_from, userTraceOrError.v),  // L695
        total: totalTraceOrError.v,                           // L696
        withdrawals: __state.withdrawals.append(withdrawal),  // L700
      }
      ok({...__evm_state,
        governances: __evm_state.governances.put(__self, __state1)
      })
    }
  }
}
</code></pre>

<p><strong>Modeling transactions.</strong> We model transactions, e.g., initiated by externally-owned accounts (EOAs), via Quint <em>actions</em>:</p>

<pre><code class="language-ts">action governance_initiate_withdraw = {
  nondet _g = evm.governances.keys().oneOf()
  nondet _sender = ALL_SENDERS.oneOf()
  nondet _to = ALL_ADDRESSES.oneOf()
  nondet _amount = 0.to(MAX_UINT256).oneOf()
  val result = Governance::initiateWithdraw(evm, _g, _sender, _to, _amount)
  all {
    is_valid_sender(_sender) and isOk(result),
    evm' = result.v,
    // ...
  }
}
</code></pre>

<p>This directly controls the domains from which input parameters are drawn. When we run the Quint randomized simulator, non-deterministic values are sampled uniformly at random. When we run the Apalache model checker, it uses logic constraints in the <a href="https://github.com/Z3Prover/z3">Z3 SMT solver</a> to reason about all possible non-deterministic values at once.</p>

<h3 id="52-bootstrapping-the-formal-specification-with-ai">5.2. Bootstrapping the Formal Specification with AI</h3>

<p>The above specification looks a bit machine-generated. This is not far from the truth. We used an LLM to produce the initial specifications, given the source code in Solidity and the Quint data types.</p>

<p>Obviously, an LLM cannot make high-level modeling decisions, like how to structure the EVM state, or how best to turn Solidity into functional definitions – this <strong>requires years of practical experience</strong>. We developed a custom system prompt that gives the LLM clear instructions and examples for translating Solidity into Quint. (It’s an internal tool we refine and apply with clients when we bootstrap their specifications.)</p>

<p>Of course, as with all AI assistants, we had to carefully proofread the translation results. Also, we were fortunate to have the model checker <a href="https://github.com/apalache-mc/apalache">Apalache</a> on our side – it automatically pointed us to some inconsistencies in the translation. Compared to writing the specification by hand, this approach allowed us to bootstrap the project very quickly and to start evaluating the protocol early on.</p>

<h3 id="53-compositional-reasoning">5.3. Compositional Reasoning</h3>

<p>Some security researchers believe that formal verification does not scale to more than 1–2 smart contracts, or to exploit scenarios longer than 1–2 external calls deep. We have organized our specification in such a way that the verification tools can deal with the behavior of 10–20 smart contracts, and arbitrarily long transaction sequences. <strong>This level of scalability requires not just formal verification expertise, but a deep understanding of how model checkers and provers work internally</strong> and interact with protocol architecture. It builds directly on our prior formal verification work – including <a href="https://protocols-made-fun.com/zksync/matterlabs/quint/specification/modelchecking/2024/09/12/zksync-governance.html">zkSync Governance</a>, <a href="https://protocols-made-fun.com/consensus/matterlabs/quint/specification/modelchecking/2024/07/29/chonkybft.html">ChonkyBFT</a>, and <a href="https://arxiv.org/abs/2501.07958">Ethereum 3-slot-finality</a> – <strong>where we pushed verification tools to reason compositionally across complex systems</strong>. More on this in Section <a href="#73-inductive-invariants-making-verification-scale">7.3. Inductive Invariants: Making Verification Scale</a>.</p>

<h2 id="6-protocol-invariants">6. Protocol Invariants</h2>

<p>From Aztec’s documentation and source code, we extracted and formalized <strong>125 key invariants</strong> of the Governance Protocol. To get a taste of the invariants, here are a few examples in English (more of them are in the <a href="https://github.com/konnov/aztec-governance-formal-verification-2025q3/blob/e313681ade9f9e96d0e83a5120a670a1e1e07188/reports/Aztec-Governance-Protocol.pdf">report</a>):</p>

<ul>
  <li><strong>GOV-16</strong>: If the proposal has not been active yet, then no votes have been cast.</li>
  <li><strong>GOV-20:</strong> The timestamps in the <code>users</code> traces are ordered.</li>
  <li><strong>GOV-26</strong>: For each timestamp <code>t</code>, <code>total[t]</code> equals the sum of the users’ voting
power at <code>t</code>.</li>
  <li><strong>GP-02-01</strong>: For each submitted proposal in <code>proposalProposer</code>, there is round accounting for a corresponding executed proposal (i.e., submitted to Governance).</li>
  <li><strong>GP-08</strong>: A proposal cannot be executed without a quorum.</li>
  <li><strong>GSE-17:</strong> for each proposal that the <code>delegatee</code> has <code>powerUsed</code> on, Governance contains that proposal.</li>
  <li><strong>GSE-19:</strong> <code>powerUsed</code> cannot exceed the attester’s voting power at the time of the proposal’s <code>pendingThrough</code>.</li>
  <li><strong>GSE-23:</strong> <code>delegation.supply</code> at each checkpoint is the sum of all <code>delegation.ledgers[instance].supply</code> at that time.</li>
  <li><strong>SP-10</strong>: <code>lastSignalSlot</code> is in the valid range.
This range is <code>[round * ROUND_SIZE, (round + 1) * ROUND_SIZE)</code>.</li>
  <li><strong>SP-11</strong>: The number of signals is correct. It does not exceed <code>lastSignalSlot % ROUND_SIZE + 1</code>.</li>
</ul>
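<p>To make the arithmetic in <strong>SP-10</strong> and <strong>SP-11</strong> concrete, here is a small Python sketch. The value of <code>ROUND_SIZE</code> and the function names are our own illustration, not taken from the contracts:</p>

```python
ROUND_SIZE = 100  # hypothetical round size, for illustration only

def sp10_holds(round_num, last_signal_slot):
    # SP-10: lastSignalSlot lies in [round * ROUND_SIZE, (round + 1) * ROUND_SIZE)
    return round_num * ROUND_SIZE <= last_signal_slot < (round_num + 1) * ROUND_SIZE

def sp11_holds(last_signal_slot, num_signals):
    # SP-11: the signal count never exceeds lastSignalSlot % ROUND_SIZE + 1
    return num_signals <= last_signal_slot % ROUND_SIZE + 1

assert sp10_holds(3, 350)           # slot 350 belongs to round 3
assert not sp10_holds(3, 400)       # slot 400 already starts round 4
assert sp11_holds(350, 51)          # at most 350 % 100 + 1 == 51 signals
assert not sp11_holds(350, 52)
```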

<p><strong>Formalized invariants in Quint.</strong> We formalized all 125 invariants in Quint as
well. For example, the <code>Governance</code> contract should uphold the <strong>Solvency Invariant</strong>
(<a href="https://www.certora.com/blog/the-holy-grail">‘The Holy Grail’</a>, as coined by FV researchers at
<a href="https://www.certora.com/">Certora</a>):</p>

<pre><code class="language-ts">// GOV-28: Solvency: Governance holds enough balance to cover all future
// withdrawals.
pure def governance_solvency_inv(_evm: EvmState, ga: IGovernance): bool = {
  pure val g = _evm.governances.get(ga)
  and {
    // the withdrawals that happen in the future
    pure val payable = g.withdrawals.indices().fold(0, (sum, i) =&gt; {
      pure val withdrawal = g.withdrawals[i]
      sum + if (withdrawal.claimed) 0 else withdrawal.amount
    })
    // the users' total balance, plus payable, must not exceed the contract's balance
    pure val asset = _evm.assets.get(g.ASSET)
    g.total.latest() + payable &lt;= asset.balances.getOrElse(ga, 0)
  }
}
</code></pre>

<p>It turns out that the solvency invariant is actually violated under certain conditions. We will get back to it in Section <a href="#91-governance-insolvency">9.1. Governance Insolvency</a>.</p>

<p>With the key invariants defined, we started verifying them using Quint and Apalache.</p>

<h2 id="7-formal-verification-workflow">7. Formal Verification Workflow</h2>

<p>As soon as parts of the specification stabilized, we began verification – moving from randomized simulation to full symbolic and inductive reasoning.</p>

<h3 id="71-randomized-simulator-stuck-at-unproductive-inputs">7.1. Randomized Simulator: Stuck at Unproductive Inputs</h3>

<p>The <strong>Quint randomized simulator</strong> operates similarly to property-based testing for implementation languages: it assigns concrete values to <code>nondet</code> declarations and resolves non-deterministic control choices by selecting one branch at random.</p>

<p><strong>Limitations.</strong> We briefly experimented with this approach, but it proved ineffective for our purposes. The simulator’s uniform random sampling consistently failed to produce configurations that would even satisfy the protocol’s initial-state predicate:</p>

<pre><code class="language-shell">$ quint run --max-samples=100000 --max-steps=10  --invariant=past_signals \
    spec/slashing_proposer_machine.qnt
An example execution:

[ok] No violation found (768ms at 130208 traces/second).
Trace length statistics: max=0, min=0, average=0.00
</code></pre>
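<p>To see why uniform sampling gets stuck, consider a back-of-the-envelope sketch in Python. The guard and the concrete numbers are our own illustration: if an action requires <code>amount &lt;= balance</code> while <code>amount</code> is drawn uniformly from the full <code>uint256</code> range, the per-sample success probability is about <code>balance / 2^256</code>, which is astronomically small:</p>

```python
import random

MAX_UINT256 = 2**256 - 1
BALANCE = 10**24  # hypothetical balance: a million tokens with 18 decimals

def guard_satisfied():
    # Mimic the simulator: draw the amount uniformly from the full range
    # and check a typical action guard.
    amount = random.randint(0, MAX_UINT256)
    return amount <= BALANCE

random.seed(42)
hits = sum(guard_satisfied() for _ in range(100_000))
# The success probability is ~1e-54 per sample, so we expect zero
# productive samples, i.e., traces of length 0, just like in the run above.
print(hits)  # 0
```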

<p>We believe the randomized simulator could be improved in future versions. If you’d like to explore in more detail why this happens – and how it could be mitigated – check out our workshop <a href="https://blltprf.xyz/blog/25-min-solidity-fuzzer/"><em>25-Minute Solidity Fuzzer: Fuzzing Smarter, Not Harder</em></a>.</p>

<p>In its current form, however, it did not help us uncover issues. This led us to use the <strong>symbolic analysis tools in Apalache</strong>, which can reason over all possible inputs symbolically rather than sampling concrete ones.</p>

<h3 id="72-symbolic-random-walks-scaling-up">7.2. Symbolic Random Walks: Scaling Up</h3>

<p>With symbolic random walks (part of Apalache), we quickly checked several invariants. The following run revealed an issue: the system could receive outdated (“past”) signals when the canonical rollup was faulty:</p>

<pre><code class="language-shell">$ quint verify --random-transitions=true --max-steps=10 \
  --invariant=past_signals spec/slashing_proposer_machine.qnt
...
[violation] Found an issue (22181ms)
</code></pre>

<p>When an invariant is violated, Apalache produces a counterexample with all details needed to understand the issue. We omit it here because it is quite verbose.</p>

<p><strong>Limitations.</strong> Even though this approach proved to be quite useful in bootstrapping and debugging our specification, it reached its limits when we began dealing with multiple contracts. This limitation stems from the protocol’s scale – with over 40 external functions, many of which can be invoked at nearly any point in time, the number of possible symbolic paths grows combinatorially with path length. We then moved to proving <em>inductive invariants</em> automatically.</p>

<h3 id="73-inductive-invariants-making-verification-scale">7.3. Inductive Invariants: Making Verification Scale</h3>

<p>To scale our formal verification efforts further, we specified 125 invariants that together capture an arbitrary state of the Governance Protocol. For example, below is the invariant <code>gse_rollups_inv</code> that groups the invariants <code>GSE-28</code> to <code>GSE-32</code>:</p>

<pre><code class="language-ts">pure def gse_rollups_inv(evm: EvmState, gsea: IGSE): bool = {
  val gse = evm.gses.get(gsea)
  val chkpts = gse.rollups._checkpoints
  and {
    // GSE-28: `rollups` is an ordered checkpointed trace with ascending timestamps
    _trace_is_ordered(gse.rollups),
    chkpts.indices().forall(i =&gt; and {
      // GSE-29: `rollups` values are rollup addresses
      chkpts[i]._value.in(ROLLUP_ADDRESSES),
      // GSE-30: the bonus instance does not appear in the `rollups` history
      chkpts[i]._value != BONUS_INSTANCE_ADDRESS,
      // GSE-31: `rollups` values are registered in `instances`
      chkpts[i]._value.in(gse.instances.keys()),
    })
  }
}
</code></pre>

<p>The following command checks that all protocol invariants (<code>all_inv</code>) – including <code>gse_rollups_inv</code> – hold whenever the protocol is in a state that satisfies <code>all_inv</code> and one of the contracts makes a single step:</p>

<pre><code class="language-shell">$ ./scripts/quint-inductive.sh spec/invariant_model.qnt 31 32 5 100 all_inv
</code></pre>

<p>Beware that the above command launches over 900 verification runs in parallel (using at most 5 CPUs at once in the example above). This can easily overwhelm your laptop. If you want to reproduce our experiments, read the next section on our experimental setup.</p>

<p><strong>Scalable verification.</strong> The <strong>core technique</strong> that enables this level of scalability is the use of <strong>inductive invariants</strong>. Instead of exploring all possible symbolic paths of the specification from an initial state (this approach, used by most code-level symbolic tools, is called <em>symbolic execution</em>), we start from a much richer set of states – all states captured by the inductive invariant <code>all_inv</code> – and symbolically execute each external function exactly once from such a state. By assuming that <code>all_inv</code> holds in an arbitrary state and showing that it still holds after symbolically executing any single transaction, our check <strong>extends inductively to all possible executions</strong>.</p>
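<p>The inductive argument can be sketched in a few lines of Python on a toy system (made up for illustration; Apalache performs this step symbolically with SMT rather than by enumerating states): assume an arbitrary state satisfying the invariant, apply one action, and check that the invariant still holds.</p>

```python
# Toy system (illustration only): a counter with guarded incr/decr actions.
CAP = 5
STATES = range(0, CAP + 1)          # finite state space, so we can enumerate

def actions(s):
    """All single steps enabled in state s."""
    steps = []
    if s < CAP:
        steps.append(s + 1)         # incr has a guard: no overflow past CAP
    if s > 0:
        steps.append(s - 1)         # decr has a guard: no underflow
    return steps

def inv(s):
    return 0 <= s <= CAP            # candidate inductive invariant

# Base case: the initial state satisfies the invariant.
assert inv(0)

# Inductive step: from ANY state satisfying inv (not just reachable ones),
# every single action preserves inv. Together, the two checks cover all
# executions of any length.
assert all(inv(t) for s in STATES if inv(s) for t in actions(s))
print("inductive")
```

<p>Note that the inductive step quantifies over <em>all</em> states satisfying the invariant, not only the reachable ones – this is exactly what makes the check independent of execution length.</p>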

<p><strong>Note for sticklers.</strong> We still have to show that the initial states satisfy the inductive invariant. In our case, this is easy. Essentially, the initial state of the protocol is an “empty” EVM state where none of the governance contracts are deployed yet.</p>

<h3 id="74-witnesses-of-liveness-proving-the-protocol-can-progress">7.4. Witnesses of Liveness: Proving the Protocol Can Progress</h3>

<p>When verifying safety, there is always a risk that we introduce a bug in the specification that restricts the protocol behavior too much. This would still keep the protocol “safe” from the verification point of view, but, obviously, the protocol would not do as many useful things as it is meant to do. To avoid this pitfall, we introduce “falsy invariants” that instruct Apalache to generate a witness of the protocol reaching an “interesting” state. Below is an example that produces an execution leading to a state in which at least one governance proposal has been executed:</p>

<pre><code class="language-ts">// Check this invariant to find an example of having at least one executed proposal:
// quint verify --max-steps=0 --invariant=gov_proposals_executed_ex \
//   spec/invariant_model.qnt
val gov_proposals_executed_ex = {
  not(evm.governances.keys().forall(ga =&gt; {
    val g = evm.governances.get(ga)
    g.proposals.indices().exists(proposalId =&gt; {
      val proposal = g.proposals[proposalId]
      proposal.cachedState == ProposalState_Executed
    })
  }))
}
</code></pre>

<p>This ability to automatically generate an execution trace to an ‘interesting’ state is a <strong>superpower of symbolic model checkers like Apalache</strong> – a functionality that would be <strong>far more difficult</strong> to automate with an interactive theorem prover such as Lean or Rocq. (Provers have property-based testing tools, but, unlike <a href="https://github.com/apalache-mc/apalache">Apalache</a>, they are not tuned to finding bugs in distributed protocols.)</p>
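<p>The "falsy invariant" trick itself is easy to illustrate (the three-state proposal system below is made up): a checker that searches for a counterexample to "no interesting state is reachable" returns exactly the witness trace we want. Here it is as an explicit breadth-first search in Python:</p>

```python
from collections import deque

def find_witness(init, step, target):
    """BFS for a state satisfying `target`; return the trace leading to it.
    Equivalently: a counterexample to the falsy invariant `not target`."""
    seen, queue = {init}, deque([[init]])
    while queue:
        trace = queue.popleft()
        state = trace[-1]
        if target(state):
            return trace
        for nxt in step(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(trace + [nxt])
    return None

# Toy system: a proposal moves Pending -> Active -> Executed.
order = ["Pending", "Active", "Executed"]
step = lambda s: [order[order.index(s) + 1]] if s != "Executed" else []
print(find_witness("Pending", step, lambda s: s == "Executed"))
# prints: ['Pending', 'Active', 'Executed']
```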

<h2 id="8-experimental-setup-and-verification-runs">8. Experimental Setup and Verification Runs</h2>

<p><strong>Experimental setup.</strong> As mentioned above, checking the inductive invariant of our specification produces 992 verification tasks in total (one for each combination of an invariant and an external function call). Apalache decomposes invariant checking into smaller tasks, so we employ <a href="https://www.gnu.org/software/parallel/">GNU parallel</a> to <strong>massively parallelize the verification</strong>. We use two servers to run the experiments:</p>

<ol>
  <li>AMD Ryzen 9 5950X processor (16 physical, 32 logical cores), 128 GB memory</li>
  <li>2× Intel Xeon Platinum 8280 processor (56 physical, 112 logical cores total), pinned at 3.1 GHz, 240 GB memory</li>
</ol>

<p><strong>Verification Runs.</strong> Some of the verification tasks take a few minutes to check, and some of them take a few hours. This is caused by the nature of the SMT constraints. It is well-known that SMT solvers, including Z3, are challenged by non-linear integer arithmetic – in this project, they naturally appear, e.g., as part of Aztec’s vote tallying logic.</p>

<p>Instead of writing many words, we simply show you the plot below. It visualizes the running times of <a href="https://github.com/apalache-mc/apalache">Apalache</a> when checking the 992 verification tasks. The X-axis enumerates the verification conditions solved (roughly, individual constraints in the inductive invariant), sorted from fastest to slowest; each point corresponds to one verification condition. The Y-axis shows the running time per verification condition, formatted in human-readable units (milliseconds to hours). Notice the logarithmic scale!</p>

<p><img src="/assets/images/aztec-governance-verification-times.svg" alt="aztec-gov-plot-all" /></p>

<p>As we can see from the plot, over 85% of the verification conditions are checked in less than 10 minutes each, about 7% finish within several hours, and the remaining 8% require substantially more running time.</p>

<p><strong>Timeouts.</strong> As it happens with SMT solvers, 3% of our verification conditions time out. These are the runs at the end of the “hockey stick”. We capped the running time of Z3 at 12 hours. Since we are decomposing the inductive invariant into smaller pieces, these problematic conditions are well-localized. We have investigated these conditions. They all have to do with non-linear arithmetic.</p>

<p>Below is an example of such an invariant. Notice that the very last expression involves modulo over a non-constant value, since <code>ROUND_SIZE</code> is initialized in the contract constructor.</p>

<pre><code class="language-ts">// GovernanceProposer invariant on last signals and total signals
pure def governance_proposer_signal_inv(evm: EvmState,
                                        ga: IGovernanceProposer): bool = {
  val gp = evm.governanceProposers.get(ga)
  gp.empireBase.rounds.keys().forall(rollup =&gt; {
    gp.empireBase.rounds.get(rollup).keys().forall(round =&gt; {
      val rollupRounds = gp.empireBase.rounds.get(rollup)
      val accounting = rollupRounds.get(round)
      and {
        // GP-12: ...
        // ...
        // GP-13: The number of signals is in the right range
        // It does not exceed `lastSignalSlot % ROUND_SIZE + 1`.
        // This property is very hard for Z3. It is not falsified.
        and {
          gp.empireBase.ROUND_SIZE &lt;= MAX_ROUND_SIZE,
          totalSignalCount &gt;= 0,
          totalSignalCount &lt;= gp.empireBase.ROUND_SIZE,
          totalSignalCount &lt;=
            (accounting.lastSignalSlot % gp.empireBase.ROUND_SIZE) + 1,
        }
      }
    })
  })
}
</code></pre>

<p>We classify the small number of verification conditions that time out as <em>not falsified</em> rather than verified. Usually, we recommend verifying such conditions with a theorem prover such as Lean or Rocq. Another solution is to fix these non-constant values to known production configuration values to gain further confidence.</p>

<h2 id="9-findings">9. Findings</h2>

<p>Our verification of the Aztec Governance Protocol uncovered <strong>five Medium</strong>, <strong>three Low</strong>, and <strong>six Informational findings</strong>. Most arose from subtle cross-contract interactions that are difficult to identify through conventional testing, fuzzing, or simulation alone. We reported all issues to Aztec Labs, who acknowledged and/or fixed them in subsequent pull requests.</p>

<p>Below we explain one representative issue: a <strong>violation of the solvency invariant</strong>.</p>

<h3 id="91-governance-insolvency">9.1. Governance Insolvency</h3>

<p>Recall the solvency invariant from <a href="#6-protocol-invariants">Protocol Invariants</a>. When we check it, Apalache produces a counterexample. Below is the root cause of this issue:</p>

<pre><code class="language-ts">function deposit(address _beneficiary, uint256 _amount) external
        override(IGovernance) isDepositAllowed(_beneficiary) {
  ASSET.safeTransferFrom(msg.sender, address(this), _amount);
    // &lt;--- if msg.sender == address(this), then the balances do not change
  users[_beneficiary].add(_amount);
    // &lt;--- ...but the liabilities always get increased
  total.add(_amount);
  emit Deposit(msg.sender, _beneficiary, _amount);
}
</code></pre>

<p>In short, <strong>executing an approved governance proposal</strong> can invoke <code>Governance.deposit(...)</code>. Inside <code>deposit</code>, this performs an ERC-20 self-transfer – leaving token balances unchanged – while <strong>crediting <code>_beneficiary</code> and increasing <code>total</code></strong>. <code>Governance</code>’s liabilities go up, but its assets don’t – the contract becomes <strong>insolvent</strong>. The diagram below illustrates the problematic scenario.</p>

<p><img src="/assets/images/gov-insolvency.svg" alt="aztec-gov-plot-all" /></p>
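<p>The scenario can be replayed in a few lines of Python (a simplified toy model mirroring the Solidity snippet above; WETH-style token semantics are assumed, so a self-transfer needs no approval):</p>

```python
# Toy model of the insolvency scenario (simplified; illustration only).
class Token:
    def __init__(self):
        self.balance = {}

    def transfer_from(self, frm, to, amount):
        # WETH-style: no approval needed when frm is msg.sender; a
        # self-transfer (frm == to) leaves all balances unchanged.
        assert self.balance.get(frm, 0) >= amount
        self.balance[frm] = self.balance.get(frm, 0) - amount
        self.balance[to] = self.balance.get(to, 0) + amount

class Governance:
    def __init__(self, asset):
        self.asset = asset
        self.users = {}
        self.total = 0

    def deposit(self, sender, beneficiary, amount):
        self.asset.transfer_from(sender, "governance", amount)
        self.users[beneficiary] = self.users.get(beneficiary, 0) + amount
        self.total += amount  # liabilities grow unconditionally

asset = Token()
asset.balance["alice"] = 100
gov = Governance(asset)
gov.deposit("alice", "alice", 100)          # an honest deposit: assets == liabilities

# A governance proposal makes Governance call deposit() on itself:
gov.deposit("governance", "attacker", 100)  # self-transfer: balances unchanged

# Solvency invariant: assets must cover liabilities. It is now violated.
print(asset.balance["governance"], gov.total)
# prints: 100 200
```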

<p><strong>On ERC-20 approvals.</strong> Most ERC-20 token implementations would require <code>Governance</code> to execute an explicit token approval for the self-transfer before the call to <code>deposit()</code> (while executing the governance proposal) – and calls into <code>ASSET</code> are forbidden by the current <code>Governance</code> implementation. However, certain tokens like <a href="https://etherscan.io/address/0xc02aaa39b223fe8d0a0e5c4f27ead9083c756cc2#code">WETH</a> do <strong>not</strong> require approvals for <code>transferFrom()</code> if <code>from == msg.sender</code>.</p>

<p><strong>Resolution.</strong> We raised this finding with Aztec Labs who addressed it in PR <a href="https://github.com/AztecProtocol/aztec-packages/pull/16917">#16917</a> by forbidding Governance from calling <code>deposit()</code> itself. In addition, the <a href="https://etherscan.io/address/0xA27EC0006e59f245217Ff08CD52A7E8b169E62D2#code">current AZTEC token implementation</a> uses OpenZeppelin’s ERC-20 implementation, which does require explicit approval of self-transfers.</p>

<h3 id="92-other-findings">9.2. Other Findings</h3>

<p>For details on all our findings, refer to our <a href="https://github.com/konnov/aztec-governance-formal-verification-2025q3/blob/e313681ade9f9e96d0e83a5120a670a1e1e07188/reports/Aztec-Governance-Protocol.pdf">formal report</a>.</p>

<h2 id="10-conclusion-scalable-formal-verification-in-practice">10. Conclusion: Scalable Formal Verification in Practice</h2>

<p>Our formal verification of the Aztec Governance Protocol went far beyond a traditional audit. It was a <strong>compositional, protocol-level analysis</strong> using state-of-the-art tools and techniques that we helped create. We formally proved <strong>125 high-level invariants</strong> across a multi-contract system – reasoning over a search space beyond the reach of traditional testing and most formal verification tools. These invariants were automatically decomposed into 992 verification conditions, which let us further parallelize the verification task.</p>

<p>By combining <strong>inductive invariants, symbolic reasoning, and massive parallelization</strong> (321 CPU-days of compute), we showed that formal verification can scale to the complexity of modern, mission-critical smart contract systems. Our methodology enables <strong>exhaustive, automated reasoning</strong> about real-world governance mechanisms and other smart contract protocols.</p>

<p>For systems like Aztec Governance, where bugs are subtle but potentially catastrophic, <strong>deep understanding of the tools and underlying logic</strong> is essential. This project demonstrates that scalable, unbounded formal verification is not just theoretically possible – it’s practical today for mature, production-grade protocols.</p>

<h3 id="differential-testing-spec--implementation-conformance">Differential Testing: Spec / Implementation Conformance</h3>

<p>A natural next step would be to <strong>connect the formal protocol specification</strong> with the actual Solidity implementation to <strong>close the verification loop</strong> (known as <em>differential</em> or <em>conformance testing</em>). With our methodology, it suffices to check that each external function call in Solidity conforms to its formal specification in Quint. Traditionally, this is done by writing and proving pre- and post-conditions in Hoare logic – e.g., using <a href="https://www.certora.com/">Certora Prover</a>. We suggest that a <strong>more pragmatic approach</strong> is to <strong>fuzz external Solidity functions directly against the formal specification</strong>.</p>
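<p>A minimal sketch of this fuzzing idea in Python (both the "spec" and the "implementation" below are stand-ins for illustration; in practice the spec side would be driven through Apalache): replay identical random call sequences on both sides and compare the observable results.</p>

```python
import random

# Stand-ins for illustration only: a "spec" model and a buggy "implementation"
# of a small bounded counter exposing the same two external functions.
class SpecCounter:
    def __init__(self, cap):
        self.cap, self.n = cap, 0
    def incr(self):
        if self.n < self.cap:        # the spec enforces the capacity
            self.n += 1
        return self.n
    def reset(self):
        self.n = 0
        return self.n

class ImplCounter(SpecCounter):
    def incr(self):
        self.n += 1                  # bug: the capacity check is missing
        return self.n

def fuzz_conformance(num_runs, num_calls, rng):
    """Replay identical random call sequences on both sides; compare outputs."""
    for _ in range(num_runs):
        spec, impl = SpecCounter(3), ImplCounter(3)
        for _ in range(num_calls):
            op = rng.choice(["incr", "reset"])
            if getattr(spec, op)() != getattr(impl, op)():
                return "divergence found"
    return "conformant"

print(fuzz_conformance(100, 20, random.Random(0)))
```

<p>Every divergence comes with the concrete call sequence that produced it, which makes the result directly actionable and reproducible.</p>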

<p><strong>Enabling diff testing in Apalache:</strong> We have just implemented a new <a href="https://github.com/apalache-mc/apalache/tree/main/json-rpc">Apalache JSON-RPC API</a>, which enables interactive differential testing between implementation and specification. This delivers <strong>fast, actionable, and reproducible results</strong> while still providing a <strong>high level of assurance</strong> grounded in rigorous formal modelling.</p>]]></content><author><name>{&quot;igor&quot;=&gt;{&quot;name&quot;=&gt;&quot;Igor Konnov&quot;, &quot;url&quot;=&gt;&quot;https://konnov.phd&quot;, &quot;email&quot;=&gt;&quot;igor@konnov.phd&quot;}}</name></author><category term="quint" /><summary type="html"><![CDATA[Authors: Thomas Pani, Igor Konnov]]></summary></entry><entry><title type="html">Small scope hypothesis revisited</title><link href="https://protocols-made-fun.com/tlaplus/2025/12/02/small-scope.html" rel="alternate" type="text/html" title="Small scope hypothesis revisited" /><published>2025-12-02T00:00:00+00:00</published><updated>2025-12-02T00:00:00+00:00</updated><id>https://protocols-made-fun.com/tlaplus/2025/12/02/small-scope</id><content type="html" xml:base="https://protocols-made-fun.com/tlaplus/2025/12/02/small-scope.html"><![CDATA[<p><strong>Author:</strong> <a href="https://konnov.phd">Igor Konnov</a></p>

<p><strong>Tags:</strong> specification tlaplus tlc</p>

<p>A couple of weeks ago, I gave a talk at the internal Nvidia FM Week 2025. Many
thanks to <a href="https://github.com/lemmy">Markus Kuppe</a> for the organization and invitation! I am going to
write a longer blog post about interactive spec conformance testing with
Apalache later. Today, I want to talk a bit about the question posed by Markus
(to find the question, continue reading).</p>

<p>Let’s talk about the small scope hypothesis. As formulated by Jackson in the
<a href="https://dspace.mit.edu/bitstream/handle/1721.1/149864/MIT-LCS-TR-735.pdf">technical report</a> (1997), this hypothesis reads as follows:</p>

<p class="highlight-question"><strong><em>
    "...most errors can be demonstrated by counterexamples within a small scope."
</em></strong></p>

<p>As you will see below, my example fits into this hypothesis quite well. However,
having spoken to many engineers over the years, I believe that there is a
mismatch between what engineers understand by “small scope” and what
verification engineers understand by “small scope”.</p>

<p>In this blog post, I’ve decided to try a <strong>new format</strong>. Since everyone is using
LLMs nowadays, I will follow the protocol. I will present the example and the
problem of finding a small scope. Then, it is your turn to decide how this blog
post should continue. If someone gives me an interesting example or insight in a
<a href="#end">comment</a>, I will update this blog post accordingly.</p>

<h2 id="1-example-1-buggy-circular-buffer">1. Example 1: Buggy circular buffer</h2>

<h3 id="11-the-specification">1.1. The specification</h3>

<p>I started the talk with a TLA<sup>+</sup> specification of a <strong>buggy</strong> circular
buffer. You can find the full specification, the model checking models, and the
TLC configuration files <a href="https://github.com/konnov/cyclic-buffer-challenge/tree/main/tla">here</a>. The specification looks as follows:</p>

<pre><code class="language-tlaplus">----------------------------- MODULE BuggyCircularBuffer -----------------------------
(**
 * A very simple specification of a circular buffer with a bug.
 * Generated with ChatGPT and beautified by Igor Konnov, 2025.
 * ChatGPT learned abstraction so well that it omitted the actual buffer storage!
 *)
EXTENDS Integers

CONSTANTS
    \* Size of the circular buffer.
    \* @type: Int;
    BUFFER_SIZE,
    \* The set of possible buffer elements.
    \* @type: Set(Int);
    BUFFER_ELEMS

ASSUME BUFFER_SIZE &gt; 0

VARIABLES
    \* The integer buffer of size BUFFER_SIZE.
    \* @type: Int -&gt; Int;
    buffer,
    \* Index of the next element to POP.
    \* @type: Int;
    head,
    \* Index of the next free slot for PUSH.
    \* @type: Int;
    tail,
    \* Number of elements currently stored.
    \* @type: Int;
    count

\* Initial state
Init ==
  /\ buffer = [ i \in 0..(BUFFER_SIZE - 1) |-&gt; 0 ]
  /\ head = 0
  /\ tail = 0
  /\ count = 0

\* Buggy PUT: Advance tail, increment count, but no fullness check!
Put(x) ==
  Put::
  LET nextTail == (tail + 1) % BUFFER_SIZE IN
  /\ buffer' = [buffer EXCEPT ![tail] = x]
  /\ head' = head
  /\ tail' = nextTail
  /\ count' = count + 1

\* GET: Only allowed when count &gt; 0.
Get ==
  Get::
  LET nextHead == (head + 1) % BUFFER_SIZE IN
  /\ count &gt; 0
  /\ UNCHANGED buffer
  /\ head' = nextHead
  /\ tail' = tail
  /\ count' = count - 1

\* Either Put or Get may happen in any step.
Next ==
    \/ \E x \in BUFFER_ELEMS:
        Put(x)
    \/ Get

vars == &lt;&lt;buffer, head, tail, count&gt;&gt;

\* Complete specification
Spec == Init /\ [][Next]_vars

\* Safety property we *intend* to hold, but it is violated:
\* count must never exceed the buffer capacity.
SafeInv == count &lt;= BUFFER_SIZE

</code></pre>
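<p>For readers who prefer code to TLA<sup>+</sup>, here is a direct Python port of the buggy behavior (a sketch, not part of the repository): since <code>Put</code> never checks fullness, <code>count</code> can exceed the capacity.</p>

```python
BUFFER_SIZE = 2
buffer, head, tail, count = [0] * BUFFER_SIZE, 0, 0, 0

def put(x):
    global tail, count
    buffer[tail] = x
    tail = (tail + 1) % BUFFER_SIZE
    count += 1            # bug: no check that count < BUFFER_SIZE

def get():
    global head, count
    assert count > 0
    head = (head + 1) % BUFFER_SIZE
    count -= 1

# Three consecutive puts violate SafeInv: count <= BUFFER_SIZE.
for x in [0, 1, 0]:
    put(x)
print(count <= BUFFER_SIZE)
# prints: False
```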

<p>Since I wanted to experiment with different buffer sizes and potential buffer
elements, I have introduced two parameters in the specification:</p>

<ul>
  <li><code>BUFFER_SIZE</code> is the size of the cyclic buffer, and</li>
  <li><code>BUFFER_ELEMS</code> is the set of possible buffer elements.</li>
</ul>

<p>Now, my previous experience with introducing TLA<sup>+</sup> to engineers
suggests that there are two ways to set these parameters:</p>

<ol>
  <li>
    <p><strong>The Engineer’s way:</strong> Set the parameters to relatively small yet
 reasonable values. For example, <code>BUFFER_SIZE = 10</code> and <code>BUFFER_ELEMS = 0..255</code>.
 These are not the minimal possible values, but they kind of make sense: The
 buffer should hold up to 10 bytes. Obviously, <code>BUFFER_ELEMS</code> is
 set to the values of the smallest suitable type in the programming language of choice,
 e.g., <code>char</code> in C, or <code>u8</code> in Rust.</p>
  </li>
  <li>
    <p><strong>The Verification Engineer’s way:</strong> Start with the smallest possible values
 of the parameters, e.g., <code>BUFFER_SIZE = 2</code> and <code>BUFFER_ELEMS = {0, 1}</code>. The
 idea is to check the specification in the smallest possible scope first. If there
 are no bugs found, increase the parameters gradually until you reach the
 reasonable values.</p>
  </li>
</ol>

<h3 id="12-checking-the-specification-engineers-way">1.2. Checking the specification the Engineer’s way</h3>

<p>To check the specification the Engineer’s way, I have created the
TLA<sup>+</sup> model <a href="https://github.com/konnov/cyclic-buffer-challenge/blob/main/tla/MC10u8_BuggyCircularBuffer.tla">MC10u8_BuggyCircularBuffer.tla</a> with <code>BUFFER_SIZE = 10</code>
and <code>BUFFER_ELEMS = 0..255</code>. For technical reasons, we also need the TLC config
file <a href="https://github.com/konnov/cyclic-buffer-challenge/blob/main/tla/MC.cfg">MC.cfg</a>. Follow the links to see the details.  Further, I’ve run TLC on
this model to check the invariant <code>SafeInv</code>:</p>

<pre><code class="language-shell">$ java -cp tla2tools.jar "-XX:+UseParallelGC" tlc2.TLC \
  -config MC.cfg MC10u8_BuggyCircularBuffer.tla
</code></pre>

<p>I wanted to see how far TLC could go, so I gave it a machine with 128 GB of RAM
and 32 cores. TLC explored around 3 billion states in about 40 minutes. After
consuming 400 GB, it ran out of disk space and terminated. No bug was found. Is
this surprising? Not really. In this configuration, TLC has to enumerate
$(2^8)^{10} * 10 * 10 * 10 \approx 2^{90}$
states. (Thanks to <a href="https://blltprf.xyz/">Thomas Pani</a> for correcting the initially wrong estimate.)</p>

<p>Obviously, anyone who has used TLC for some time would have asked the same question
as Markus did:</p>

<p class="highlight-question"><strong><em>
  What about the small scope hypothesis? Can we use smaller parameters?
</em></strong></p>

<p>The answer to this question is basically the second approach, which I called the
Verification Engineer’s way.</p>

<div class="tip-box">
    <div class="tip-header"><strong>💡 Tip:</strong></div>
    <div class="tip"><p>Apalache finds an invariant violation in 3
seconds, when running bounded model checking with the command <code>check</code>.
However, I do not want to distract us from the main point of this blog post.</p>
</div>
</div>

<h3 id="13-checking-the-specification-verification-engineers-way">1.3. Checking the specification the Verification Engineer’s way</h3>

<p>This time, we use the instance <a href="https://github.com/konnov/cyclic-buffer-challenge/blob/main/tla/MC2u1_BuggyCircularBuffer.tla">MC2u1_BuggyCircularBuffer.tla</a>
that has <code>BUFFER_SIZE = 2</code> and <code>BUFFER_ELEMS = {0, 1}</code>.
Let’s run TLC on this instance to check the invariant <code>SafeInv</code>:</p>

<pre><code class="language-shell">$ java -cp tla2tools.jar "-XX:+UseParallelGC" tlc2.TLC \
  -config MC.cfg MC2u1_BuggyCircularBuffer.tla
...
Error: Invariant SafeInv is violated.
...
10 states generated, 10 distinct states found, 5 states left on queue.
</code></pre>

<p>Yay! After generating just 10 states, TLC has found a violation of the invariant!</p>

<p>So if we pick the right small scope, exhaustive model checking with TLC finds
the bug quite fast. In this example, it is hard to find a small scope that would
not reveal the bug. Of course, when we know that the bug exists, it is easy to
experiment with different values of the parameters and find the bug.</p>

<h3 id="14-checking-the-specification-randomly">1.4. Checking the specification randomly</h3>

<p>Surprisingly, if we forget about exhaustive enumeration, TLC finds an
invariant violation for <code>BUFFER_SIZE = 10</code> and <code>BUFFER_ELEMS = 0..255</code> in less
than a second. To do this, we run TLC with the option <code>-simulate</code>, which simply
picks successor states at random:</p>

<pre><code class="language-shell">$ java -cp tla2tools.jar "-XX:+UseParallelGC" tlc2.TLC \
  -simulate -config MC.cfg MC10u8_BuggyCircularBuffer.tla
...
Error: Invariant SafeInv is violated.
...
</code></pre>

<p>This effectiveness of randomized search is actually not a one-off thing.
The Quint simulator <a href="https://github.com/konnov/cyclic-buffer-challenge/tree/main/quint#randomized-simulation">finds the bug</a> in less than a second.
Similarly, the Rust property-based testing with <a href="https://github.com/konnov/cyclic-buffer-challenge/tree/main/rust/proptest">proptest</a> finds the bug
almost immediately.</p>
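<p>The phenomenon is easy to reproduce outside these tools as well. Here is a quick hand-rolled random walk in Python over the "Engineer's" parameters (a made-up harness mirroring the spec above, not one of the tools just mentioned):</p>

```python
import random

BUFFER_SIZE, ELEMS = 10, range(256)   # the "Engineer's" parameters

def random_walk(max_steps, rng):
    """One random trace; return the step at which SafeInv breaks, if any."""
    head = tail = count = 0
    buf = [0] * BUFFER_SIZE
    for step in range(max_steps):
        if count > 0 and rng.random() < 0.5:
            head = (head + 1) % BUFFER_SIZE      # Get
            count -= 1
        else:
            buf[tail] = rng.choice(ELEMS)        # Put: no fullness check
            tail = (tail + 1) % BUFFER_SIZE
            count += 1
        if count > BUFFER_SIZE:                  # SafeInv violated
            return step + 1
    return None

steps = random_walk(10_000, random.Random(0))
print(steps is not None)
# prints: True
```

<p>Even with the large parameters, a single random trace stumbles into the violation almost immediately – no tuning of the scope required.</p>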

<p>Interestingly, <strong>we did not have to tune the scope to be as tiny as possible</strong>,
as we did for exhaustive model checking. Maybe this is why some engineers want
to use property-based testing for every problem?</p>

<h2 id="2-thinking-about-the-small-scope-hypothesis">2. Thinking about the small scope hypothesis</h2>

<p>In Example 1, we indeed found several assignments to <code>BUFFER_SIZE</code> and
<code>BUFFER_ELEMS</code> that revealed an invariant violation. Actually, this bug is so
simple that almost any assignment to the parameters would reveal it. We could
even set <code>BUFFER_SIZE = 1</code> and <code>BUFFER_ELEMS = {0}</code> to find the bug! If you want
to push it further, think about whether <code>BUFFER_ELEMS = {}</code> would allow us to find an
invariant violation.</p>
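<p>A quick way to convince yourself is to sweep concrete parameter assignments with a small hand-written checker (a Python sketch, not TLC; note that <code>BUFFER_ELEMS</code> plays no role for <code>SafeInv</code>, since the invariant only mentions <code>count</code>):</p>

```python
def check(buffer_size, max_steps=50):
    """Depth-limited exhaustive search for a SafeInv violation (sketch).
    Buffer contents do not affect SafeInv, so BUFFER_ELEMS is irrelevant here."""
    frontier = {(0, 0, 0)}                      # (head, tail, count)
    for _ in range(max_steps):
        nxt = set()
        for head, tail, count in frontier:
            if count > buffer_size:             # SafeInv: count <= BUFFER_SIZE
                return "violation"
            nxt.add((head, (tail + 1) % buffer_size, count + 1))      # Put
            if count > 0:
                nxt.add(((head + 1) % buffer_size, tail, count - 1))  # Get
        frontier = nxt
    return "ok"

# Almost any concrete assignment reveals the bug:
for size in range(1, 5):
    print(size, check(size))   # every size reports "violation"
```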

<p>In fact, if we go back to <a href="https://alloytools.org/">Alloy</a>, the way Alloy restricts the scope is quite
different from what we did in Example 1. Alloy limits the number of elements of
each type in the specification. For example, if we had specified the circular
buffer in Alloy, we could restrict the search scope as follows:</p>

<ul>
  <li>
    <p>All integers, including <code>BUFFER_SIZE</code> and buffer indices, have a bit width of 4.</p>
  </li>
  <li>
    <p>The number of unique buffer elements is $2^8$.</p>
  </li>
</ul>

<p>As a result, Alloy would consider all possible values of <code>BUFFER_SIZE</code> from 0 to
15, all possible values of buffer elements from 0 to 255, as well as all
possible combinations of buffers of size up to 15. This is a much more flexible
way to restrict the search space. In the case of TLC, we did not have this
flexibility: We had to give concrete values to <code>BUFFER_SIZE</code> and <code>BUFFER_ELEMS</code>.</p>

<div class="tip-box">
    <div class="tip-header"><strong>💡 Tip:</strong></div>
    <div class="tip"><p>Apalache has data generators, which are closer to
the Alloy scopes in spirit, though they work slightly differently from Alloy.</p>
</div>
</div>

<p>Hence, we have to distinguish between small scopes and small parameter
assignments in TLC. After thinking about this question a bit more, I’ve asked
myself:</p>

<p class="highlight-question"><strong><em>
  Are there examples of specifications that have a small scope for a specific
  invariant violation, but it is hard to find concrete parameter assignments
  within this scope?
</em></strong></p>

<p>Even though my intuition says “yes” – there must be plenty of such examples – I
could not immediately come up with a non-artificial example. Off the top of my
head, I can think of the following directions to look for such examples:</p>

<ul>
  <li>
    <p>Examples from <strong>abstract interpretation</strong>. If we have non-trivial math with
  overflows and underflows, it might be hard to find concrete assignments that
  would trigger these overflows and underflows.</p>
  </li>
  <li>
    <p>Examples from <strong>graph theory</strong>. For instance, <a href="https://en.wikipedia.org/wiki/Planar_graph">non-planar graphs</a>
  must contain subgraphs that are subdivisions of $K_5$ or $K_{3,3}$ (see
  Kuratowski’s theorem on <a href="https://en.wikipedia.org/wiki/Planar_graph">planar graphs</a>). So if a bug shows up only in
  non-planar graphs, there must be a small scope that reveals the bug.  However,
  our concrete graph would have to contain a subdivision of $K_5$ or $K_{3,3}$,
  which is far from an arbitrary graph. Unfortunately, I do not know any
  concurrent or distributed algorithm that would have something to do with
  planar or non-planar graphs.</p>
  </li>
</ul>

<h2 id="3-your-turn">3. Your turn</h2>

<p>It is your turn to decide how this blog post should continue. If someone gives
me an interesting example or insight in a <a href="#end">comment</a>, I will
update this blog post accordingly.</p>

<!-- references -->]]></content><author><name>{&quot;igor&quot;=&gt;{&quot;name&quot;=&gt;&quot;Igor Konnov&quot;, &quot;url&quot;=&gt;&quot;https://konnov.phd&quot;, &quot;email&quot;=&gt;&quot;igor@konnov.phd&quot;}}</name></author><category term="tlaplus" /><summary type="html"><![CDATA[Author: Igor Konnov]]></summary></entry><entry><title type="html">Running TLC with non-standard modules</title><link href="https://protocols-made-fun.com/tlaplus/2025/10/09/tlc-with-modules.html" rel="alternate" type="text/html" title="Running TLC with non-standard modules" /><published>2025-10-09T00:00:00+00:00</published><updated>2025-10-09T00:00:00+00:00</updated><id>https://protocols-made-fun.com/tlaplus/2025/10/09/tlc-with-modules</id><content type="html" xml:base="https://protocols-made-fun.com/tlaplus/2025/10/09/tlc-with-modules.html"><![CDATA[<p><strong>Author:</strong> <a href="https://konnov.phd">Igor Konnov</a></p>

<p><strong>Tags:</strong> specification tlaplus apalache tlc</p>

<p>This must be my shortest blog post ever. I just wanted to run <a href="https://github.com/tlaplus/tlaplus">TLC</a> to check
a specification that uses <a href="https://apalache-mc.org/">Apalache</a> modules. For example, the typed version
of two-phase commit <a href="https://github.com/apalache-mc/apalache/blob/main/test/tla/MC3_TwoPhaseTyped.tla">MC3_TwoPhaseTyped.tla</a> uses the module <a href="https://apalache-mc.org/docs/lang/variants.html">Variants</a>.
Sure, TLC can do that, but it requires a small trick.</p>

<p>Let’s do it step by step for <a href="https://github.com/apalache-mc/apalache/blob/main/test/tla/MC3_TwoPhaseTyped.tla">MC3_TwoPhaseTyped.tla</a>. Say, we want to see
an example of all participants committing:</p>

<pre><code class="language-tlaplus">RMAllCommittedEx ==
    ~(\A rm \in RM: rmState[rm] = "committed")
</code></pre>

<p><strong>Step 1.</strong> Checkout the Apalache repository, if you don’t have it already:</p>

<pre><code class="language-shell">$ git clone git@github.com:apalache-mc/apalache.git
$ export APALACHE_HOME=$(pwd)/apalache
</code></pre>

<p><strong>Step 2.</strong> Download TLA<sup>+</sup> Tools:</p>

<pre><code class="language-shell">$ wget https://github.com/tlaplus/tlaplus/releases/download/v1.7.4/tla2tools.jar
$ export TLC_HOME=$(pwd)
</code></pre>

<p><strong>Step 3.</strong> Introduce a configuration file <code>MC3_TwoPhaseTyped.cfg</code> with the
following content:</p>

<pre><code>$ cd $APALACHE_HOME/test/tla
$ cat &gt;MC3_TwoPhaseTyped.cfg &lt;&lt;EOF
INIT Init
NEXT Next
INVARIANT RMAllCommittedEx
EOF
</code></pre>

<p><strong>Step 4.</strong> Run TLC with the option <code>-cp</code>, which extends the Java
classpath. TLC will look for non-standard modules in the specified directory,
that is, in the directory <code>${APALACHE_HOME}/src/tla</code>. <em>This is the trick!</em></p>

<pre><code class="language-shell">$ java -cp ${TLC_HOME}/tla2tools.jar:${APALACHE_HOME}/src/tla \
  "-XX:+UseParallelGC" tlc2.TLC \
  -config MC3_TwoPhaseTyped.cfg MC3_TwoPhaseTyped.tla
</code></pre>

<p>As expected, TLC finds an example of <code>RMAllCommittedEx</code>:</p>

<pre><code>Running breadth-first search Model-Checking...
...
State 11: &lt;Next line 16, col 1 to line 16, col 22 of module MC3_TwoPhaseTyped&gt;
/\ msgs = { [tag |-&gt; "Commit", value |-&gt; "0_OF_NIL"],
  [tag |-&gt; "Prepared", value |-&gt; "0_OF_RM"],
  [tag |-&gt; "Prepared", value |-&gt; "1_OF_RM"],
  [tag |-&gt; "Prepared", value |-&gt; "2_OF_RM"] }
/\ rmState = [0_OF_RM |-&gt; "committed", 1_OF_RM |-&gt; "committed", 2_OF_RM |-&gt; "committed"]
/\ tmState = "committed"
/\ tmPrepared = {"0_OF_RM", "1_OF_RM", "2_OF_RM"}
...
1119 states generated, 287 distinct states found, 7 states left on queue.
</code></pre>

<p><a name="end"></a></p>
<h2 id="bottom-line">Bottom line</h2>

<p>This is it! If you have any questions, please feel free to reach out. I’m
<a href="/contact/">happy to help</a>.</p>]]></content><author><name>{&quot;igor&quot;=&gt;{&quot;name&quot;=&gt;&quot;Igor Konnov&quot;, &quot;url&quot;=&gt;&quot;https://konnov.phd&quot;, &quot;email&quot;=&gt;&quot;igor@konnov.phd&quot;}}</name></author><category term="tlaplus" /><summary type="html"><![CDATA[Author: Igor Konnov]]></summary></entry><entry><title type="html">Proving completeness of an eventually perfect failure detector in Lean4</title><link href="https://protocols-made-fun.com/lean/2025/06/10/lean-epfd-completeness.html" rel="alternate" type="text/html" title="Proving completeness of an eventually perfect failure detector in Lean4" /><published>2025-06-10T00:00:00+00:00</published><updated>2025-06-10T00:00:00+00:00</updated><id>https://protocols-made-fun.com/lean/2025/06/10/lean-epfd-completeness</id><content type="html" xml:base="https://protocols-made-fun.com/lean/2025/06/10/lean-epfd-completeness.html"><![CDATA[<p><strong>Author:</strong> <a href="https://konnov.phd">Igor Konnov</a></p>

<p><strong>Tags:</strong> specification lean distributed proofs tlaplus</p>

<p>In the previous <a href="/lean/2025/05/10/lean-two-phase-proofs.html">blog post</a>, we looked at proving
consistency of Two-phase commit in <a href="https://github.com/leanprover/lean4">Lean 4</a>. This proof followed the
well-trodden path: We found an inductive invariant, quickly model-checked it
with <a href="https://apalache-mc.org/">Apalache</a> and proved its inductiveness in Lean. One of the immediate
questions that I got on X/Twitter <a href="x-liveness">was</a>: What about liveness?</p>

<p>Well, liveness of two-phase commit, as captured by the TLA<sup>+</sup> specification,
does not seem particularly interesting, as it mostly depends on fairness
of the resource managers and the transaction manager. (A real implementation may
be much trickier though.) I was looking for something a bit more challenging
but, at the same time, something that would not take months to reason about.
Since many Byzantine-fault tolerance algorithms work under partial synchrony, a
natural thing to do was to find a protocol that required partial synchrony.
<a href="https://en.wikipedia.org/wiki/Failure_detector">Failure detectors</a> seemed to be a good fit to me. These
algorithms are relatively small, but require plenty of reasoning about time.</p>

<p>Hence, I opened the book <a href="https://www.distributedprogramming.net/">DP2011</a> on Reliable and Secure Distributed Programming
by Christian Cachin, Rachid Guerraoui, and Luís Rodrigues and found the
pseudo-code of the eventually-perfect failure detector (EPFD). If you have never
heard of failure detectors, there are a few introductory lectures on YouTube,
e.g., <a href="https://www.youtube.com/watch?v=k_mlWOcWOSA">this one</a>. Writing a decent TLA<sup>+</sup>-like
specification of EPFD and its temporal properties in Lean took me about eight
hours. Since temporal properties require us to reason about infinite executions,
this took a bit of experimentation with Lean. Figuring out how to capture
<a href="#22-partial-synchrony">partial synchrony</a> and <a href="#25-specifying-fairness-and-fair-runs">fairness</a> was
the most interesting part of the exercise. I believe that this approach can be
reused in many other protocol specifications.  You can find the protocol
specification and the properties in <a href="https://github.com/konnov/leanda/blob/main/epfd/Epfd/Propositional.lean">Propositional.lean</a>. See <a href="#2-eventually-perfect-failure-detector-in-lean">Section
2</a> for detailed explanations.</p>

<p>To prove correctness of EPFD, we have to show that it satisfies strong
completeness and strong accuracy (see <a href="#24-specifying-the-temporal-properties">temporal properties</a>). I chose
to start with strong completeness. The proof in the book is just four (!) lines
long. In contrast, my proof of strong completeness in Lean is about 1
KLOC. It consists of 13 lemmas and 2 theorems. As one professor once asked me:
<em>Do you want to convince a machine that distributed algorithms are correct?</em>
Apparently, it takes more effort to convince a machine than to convince a human.
By a machine, I mean a proof assistant such as Lean, not an LLM, which would be
easy to convince of pretty much anything. The real reason is that we
take a lot of things about computations for granted, whereas Lean requires them
to be explained. For instance, if a process $q$ crashes when the global clock
value is $t$, it seems obvious that no process $p$ can receive a message from
$q$ with the timestamp above $t$. Not so fast, this has to be proven! For the
impatient, the complete proofs can be found in <a href="https://github.com/konnov/leanda/blob/main/epfd/Epfd/PropositionalProofs.lean">PropositionalProofs.lean</a>.
See <a href="#3-proving-strong-completeness-in-lean">Section 3</a> for detailed explanations.</p>

<p>It took me about 35 hours to finish the proof of strong completeness. I remember
having a working proof for a <em>pair</em> of processes $p$ and $q$ by the 25-hour
mark. However, the property in the book is formulated over all
processes, not just a pair. Proving the property over all processes took me
about 10 additional hours. This actually required more advanced Lean proof
mechanics and solving a few curious proof challenges with crashing processes,
e.g., <a href="#341-defining-the-crashed-processes">how we define them</a>. Also, bear in mind that this was
literally my first proof of temporal properties in Lean. I believe that the next
protocol would require less time.</p>

<p>Below is a diagram that illustrates the dependencies between the theorems
(green) and lemmas (yellow) that I had to prove, culminating in the theorem
<code>strong_completeness_on_states</code>. Notice that <code>forall_FG_implies_FG_forall</code> is not a
protocol-specific lemma. Rather, it is a general theorem about swapping universal quantifiers
and eventually-always, which could be reused in other proofs. Once I realized
that I had to apply this theorem twice in the proof of
<code>eventually_always_suspected_meet</code>, I finished the proof quickly. This is one
more instance of temporal logic helping us with high-level reasoning.</p>

<picture>
  <source srcset="/img/epfd-completeness-deps.png" type="image/png" />
  <img class="responsive-img" src="/img/epfd-completeness-deps.png" alt="Proof schema" />
</picture>
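
<p>For intuition, here is how the quantifier-swapping theorem could be stated.
This is only a sketch with an illustrative formulation; the actual statement in
<code>PropositionalProofs.lean</code> may differ, and the proof (pick the maximum
of the per-process witnesses, which exists since <code>Proc</code> is finite) is
omitted:</p>

<pre><code class="language-lean">-- Sketch: if every process p eventually-always satisfies P p,
-- then there is a single point after which P p holds for all p.
-- A proof would pick k as the maximum of the per-process witnesses,
-- which exists because Proc is a Fintype.
theorem forall_FG_implies_FG_forall
    {Proc : Type} [Fintype Proc] (P : Proc → ℕ → Prop)
    (h : ∀ p : Proc, ∃ k : ℕ, ∀ i : ℕ, P p (k + i)) :
    ∃ k : ℕ, ∀ i : ℕ, ∀ p : Proc, P p (k + i) := by
  sorry -- proof omitted in this sketch
</code></pre>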

<p>Surprisingly, even though strong completeness is usually thought of as a safety
counterpart of strong accuracy, the proof required quite a bit of reasoning
about temporal properties, not just state invariants. It also helped me a lot to
structure the proofs in terms of temporal formulas, rather than in terms of
arbitrary properties of computations. Of course, it would be interesting to see
how this proof compares to a proof in TLAPS, which is specifically designed to
reason about temporal properties.</p>

<p>If you look at how the lemma statements are written in <a href="#3-proving-strong-completeness-in-lean">Section
3</a>, you will see that they are all
<em>temporal formulas</em>, just written directly using quantifiers $\forall$ and
$\exists$ instead of modal operators like $\square$ and $\Diamond$. To emphasize
this similarity, I wrote alternative statements in TLA<sup>+</sup>. If you know
temporal logic or TLA<sup>+</sup>, just have a look at all these statements:</p>

\[\begin{align}
   \square (\forall m \in sent: m.ts \le clock) &amp; \\
 \square (p \notin crashed)
   \Rightarrow&amp; \square \Diamond (alive[p] = \emptyset) \\
 \square (p \notin crashed) \land \Diamond (q \in crashed)
   \Rightarrow&amp; \Diamond \square (q \notin alive[p]) \\
 \square (p \notin crashed) \land \Diamond (q \in crashed)
   \Rightarrow&amp; \Diamond \square (q \in suspected[p]) \\
 \square (\forall c \in \mathbb{N}: (q \in crashed \land clock = c)
   \Rightarrow&amp; \square (\forall m \in sent: (m.src = q) \Rightarrow m.ts \le c))
\end{align}\]

<p>Although a time investment of about a week to prove strong completeness of EPFD
may seem like a lot, this approach has certain benefits in comparison to using
tools like the explicit-state model checker <a href="https://github.com/tlaplus/tlaplus">TLC</a> or the symbolic model
checker <a href="https://apalache-mc.org/">Apalache</a>:</p>

<ol>
  <li>
    <p>Re-checking the proofs takes seconds. It’s also trivial to integrate
 proof-checking in the GitHub continuous integration.</p>
  </li>
  <li>
    <p>All tools require a spec to be massaged a bit. I always felt bad about not
 being able to formally show that these transformations are sound with a model
 checker. With Lean, it is usually easy.</p>
  </li>
  <li>
    <p>If you manage to decompose your proof goals into smaller lemmas, there is a
 sense of progress. Even though I had to prove 4-5 unexpected lemmas in this
 experiment, I could definitely say whether I was making progress or not. In the
 end, I only proved one lemma that happened to be redundant. With model
 checkers, both explicit and symbolic, it is often frustrating to wait for hours
 or days without clear progress.</p>
  </li>
</ol>

<p>Obviously, the downside of using an interactive theorem prover is that someone
has to write the proofs. For a customer, it may make a difference whether they
pay for 2–4 weeks of contract work, or for 1 week of contract work and then wait
3 weeks for a model checker. However, if time is critical, it makes sense to
invest in both approaches.</p>

<h2 id="table-of-contents">Table of contents</h2>

<ul>
  <li><a href="#1-eventually-perfect-failure-detector-in-pseudo-code">1. Eventually perfect failure detector in pseudo-code</a></li>
  <li><a href="#2-eventually-perfect-failure-detector-in-lean">2. Eventually perfect failure detector in Lean</a>
    <ul>
      <li><a href="#21-basic-type-definitions">2.1. Basic type definitions</a></li>
      <li><a href="#22-partial-synchrony">2.2. Partial synchrony</a></li>
      <li><a href="#23-specifying-the-actions">2.3. Specifying the actions</a></li>
      <li><a href="#24-specifying-the-temporal-properties">2.4. Specifying the temporal properties</a></li>
      <li><a href="#25-specifying-fairness-and-fair-runs">2.5. Specifying fairness and fair runs</a></li>
    </ul>
  </li>
  <li><a href="#3-proving-strong-completeness-in-lean">3. Proving strong completeness in Lean</a>
    <ul>
      <li><a href="#31-shorthand-temporal-definitions">3.1. Shorthand temporal definitions</a></li>
      <li><a href="#32-warming-up-with-simple-temporal-lemmas">3.2. Warming up with simple temporal lemmas</a></li>
      <li><a href="#33-proving-completeness-for-two-processes">3.3. Proving completeness for two processes</a>
        <ul>
          <li><a href="#331-main-lemma-eventually-q-is-always-suspected-by-p">3.3.1. Main lemma: Eventually q is always suspected by p</a></li>
          <li><a href="#332-eventually-q-is-never-alive-for-p">3.3.2. Eventually q is never alive for p</a></li>
          <li><a href="#333-non-crashing-p-resets-alive-infinitely-often">3.3.3. Non-crashing p resets alive infinitely often</a></li>
          <li><a href="#334-a-crashed-process-q-stops-sending-messages">3.3.4. A crashed process q stops sending messages</a></li>
          <li><a href="#335-no-message-sent-from-the-future">3.3.5. No message sent from the future</a></li>
          <li><a href="#336-other-lemmas">3.3.6. Other lemmas</a></li>
        </ul>
      </li>
      <li><a href="#34-from-2-to-n-processes">3.4. From 2 to N processes</a>
        <ul>
          <li><a href="#341-defining-the-crashed-processes">3.4.1. Defining the crashed processes</a></li>
          <li><a href="#342-where-do-the-suspected-sets-meet">3.4.2. Where do the suspected sets meet?</a></li>
        </ul>
      </li>
    </ul>
  </li>
</ul>

<h2 id="1-eventually-perfect-failure-detector-in-pseudo-code">1. Eventually perfect failure detector in pseudo-code</h2>

<p>To avoid any potential copyright issues, I am not copying the pseudo-code from
the book. If you want to see the original version, go check <a href="https://www.distributedprogramming.net/">DP2011</a><sup id="fnref:1"><a href="#fn:1" class="footnote" rel="footnote" role="doc-noteref">1</a></sup>,
Algorithm 2.7, p. 55. Below is an adapted version, which simplifies the events,
as we do not have to reason about the interactions between different protocol
layers in the proof. Every process $p \in \mathit{Procs}$ works as follows:</p>

<pre>
<code class="nohighlight">
<strong>upon</strong> Init <strong>do</strong>
  alive := Procs
  suspected := ∅
  delay := InitDelay
  set_timeout(delay)

<strong>upon</strong> Timeout <strong>do</strong>
  <strong>if</strong> alive ∩ suspected ≠ ∅ <strong>then</strong>
    delay := delay + InitDelay
  suspected := Procs \ alive
  <strong>send</strong> HeartbeatRequest <strong>to</strong> all p ∈ Procs
  alive := ∅
  set_timeout(delay)

<strong>upon receive</strong> HeartbeatRequest <strong>from</strong> q <strong>do</strong>
  <strong>send</strong> HeartbeatReply <strong>to</strong> q

<strong>upon receive</strong> HeartbeatReply <strong>from</strong> p <strong>do</strong>
  alive := alive ∪ {p}
</code>
</pre>

<p>Intuitively, the operation of a failure detector is very simple.  Initially, a
process $p$ considers all the processes alive and suspects no other process of
being crashed. Also, it sets a timer to $\mathit{InitDelay}$ time units.
Basically, nothing interesting happens in the time interval $[0,
\mathit{InitDelay})$, except that some processes may crash.</p>

<p>Once a timeout is triggered on a process $p$, it updates the set of the
suspected processes to the set of processes that have not sent a heartbeat to it
in the previous time interval (not alive), resets the set of the alive processes
and sends heartbeat requests to every process, including itself. Additionally,
if $p$ finds out that it prematurely suspected an alive process, it increases
its timeout window by $\mathit{InitDelay}$. Importantly, $p$ also sets a new
timeout for $delay$ time units.</p>

<p>Finally, whenever a process receives a heartbeat request, it sends a reply.
Whenever a process receives a heartbeat reply from a process $q$, it adds $q$
to the set of alive processes.</p>

<p>The algorithm looks deceivingly simple. However, the pseudo-code is missing
another piece of information, namely, how the distributed system behaves as a
whole. What does it mean for processes to crash? When are messages received, if
at all? It’s not even clear how to properly write this in pseudo-code.
Normally, academic papers leave this part to math definitions. Since we want to
prove correctness, we cannot avoid reasoning about the whole system. Instead of
appealing to intuition, we capture both the process behavior and the system
behavior in Lean.</p>

<h2 id="2-eventually-perfect-failure-detector-in-lean">2. Eventually perfect failure detector in Lean</h2>

<p>In contrast to <a href="/lean/2025/04/25/lean-two-phase.html">two-phase commit</a>, where we started with a
functional specification, I decided to start with the propositional
specification immediately. A functional specification would be closer to the
implementation details. Perhaps, we will write one in another blog post.</p>

<h3 id="21-basic-type-definitions">2.1. Basic type definitions</h3>

<p>Before specifying the behavior of the processes, we have to figure out the basic
types. You can find them in <a href="https://github.com/konnov/leanda/blob/main/epfd/Epfd/Basic.lean">Basic.lean</a>. First, we declare the type <code>Proc</code>
that we use for process identities:</p>

<pre><code class="language-lean">variable (Proc : Type) [Fintype Proc] [DecidableEq Proc] [Hashable Proc] [Repr Proc]

</code></pre>

<p>If you compare it with the type <code>RM</code> in <a href="/lean/2025/04/25/lean-two-phase.html">two-phase commit</a>, this
time, we require <code>Proc</code> to be a <code>Fintype</code>. By doing so, we avoid carrying
around the set of all processes. With <code>Fintype</code>, we can simply use
<code>Finset.univ</code> for the set of all processes!</p>
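
<p>As a tiny illustration (assuming the <code>variable</code> declaration above
is in scope), membership in <code>Finset.univ</code> holds for every process, via
Mathlib’s <code>Finset.mem_univ</code>:</p>

<pre><code class="language-lean">-- Every inhabitant of the finite type Proc is a member of Finset.univ
example (p : Proc) : p ∈ (Finset.univ : Finset Proc) := Finset.mem_univ p
</code></pre>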

<p>Next, we define the types of message tags and messages:</p>

<pre><code class="language-lean">/-- A message tag: `HeartbeatRequest` or `HeartbeatReply`. -/
inductive MsgTag where
  | HeartbeatRequest
  | HeartbeatReply
  deriving DecidableEq, Repr

/-- A message that is sent by one process (src) to another process (dst).
    Every message is equipped with a timestamp, which is equal to the
    clock value at the time of sending the message.
 -/
@[ext]
structure Msg where
  kind: MsgTag
  src: Proc
  dst: Proc
  timestamp: ℕ
  deriving DecidableEq, Repr

</code></pre>

<p>Now, most of it should be obvious, except, perhaps, for the field <code>timestamp</code>.
What is it? If we look at the original paper on failure detectors by <a href="https://dl.acm.org/doi/abs/10.1145/226643.226647">Chandra
and Toueg</a> (CT96), we’ll see that they assume the existence of a global
clock. The processes can’t read this clock, but the system definitions refer to
it. Hence, a message’s timestamp refers to the value of the global clock at the
moment the message is sent.</p>

<p>Finally, we define a protocol state with the following structure:</p>

<pre><code class="language-lean">structure ProtocolState where
  alive: Std.HashMap Proc (Finset Proc)
  suspected: Std.HashMap Proc (Finset Proc)
  delay: Std.HashMap Proc Nat
  nextTimeout: Std.HashMap Proc Nat
  sent: Finset (Msg Proc)
  rcvd: Finset (Msg Proc)
  clock: Nat
  crashed: Finset Proc

</code></pre>

<p>The first group of fields should be clear. They map process identities to the
corresponding values of the variables in the pseudo-code:</p>

<ul>
  <li><code>alive[p]!</code> stores the set of alive processes as observed by a process <code>p</code>,</li>
  <li><code>suspected[p]!</code> stores the set of suspected processes as observed by a process <code>p</code>,</li>
  <li><code>delay[p]!</code> stores the current value of delay by a process <code>p</code>,</li>
  <li><code>nextTimeout[p]!</code> stores the timestamp of the next timeout by a process <code>p</code>.
The timestamp refers to the global clock.</li>
</ul>

<p>The second group of fields is less obvious. They do not represent the process
states, but rather the rest of the global state of the distributed system:</p>

<ul>
  <li><code>sent</code> is the set of all messages sent by the processes,</li>
  <li><code>rcvd</code> is the set of all messages received by the processes,</li>
  <li><code>clock</code> is the value of the fictitious global clock,</li>
  <li><code>crashed</code> is the set of all processes that have crashed.</li>
</ul>

<p>While the second group of fields is needed to formally capture a state of the
distributed system, we notice that the processes cannot have access to those
fields. Otherwise, detecting failures would be trivial: we would just access the
field <code>crashed</code>.</p>

<p>If you find this representation of the global state surprising, it’s actually
quite common to reason about such a global snapshot of a distributed system in
TLA<sup>+</sup>. Here, we’re simply following the TLA<sup>+</sup> methodology,
albeit reproduced in Lean.</p>

<h3 id="22-partial-synchrony">2.2. Partial synchrony</h3>

<p>The algorithm is designed to work under partial synchrony. Unfortunately,
<a href="https://www.distributedprogramming.net/">DP2011</a> does not give us a precise definition of what this means. So we go
back to the paper by <a href="https://dl.acm.org/doi/abs/10.1145/42282.42283">Dwork, Lynch, and Stockmeyer</a> (DLS88) who
introduced partial synchrony. There are several kinds of partial synchrony in
the paper. We choose the one that is probably the most commonly used nowadays:
There is a period of time called global stabilization time (GST), after which
every correct process $p$ receives a message from a correct process $q$ no later
than $\mathit{MsgDelay}$ time units after it was sent by $q$. Both
$\mathit{GST}$ and $\mathit{MsgDelay}$ are unknown to the processes, and may
change from run to run. It is also important to fix the guarantees about the
messages that were sent before $\mathit{GST}$. We assume that they are received
by $\mathit{GST} + \mathit{MsgDelay}$ at the latest.</p>

<p>Now we can write a formal definition of what it means for a message to be
received on time under partial synchrony:</p>

<pre><code class="language-lean">/--
  Given a message that was sent at `timestamp`, can a process receive it at time `clock`.
  -/
def isMsgTimely (GST: Nat) (MsgDelay: Nat) (timestamp: Nat) (clock: Nat): Bool :=
  clock ≥ timestamp &amp;&amp; clock ≤ (max GST timestamp) + MsgDelay

</code></pre>
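
<p>To see the definition in action, pick some illustrative numbers, say
$\mathit{GST} = 10$ and $\mathit{MsgDelay} = 3$ (these values are mine, not part
of the specification):</p>

<pre><code class="language-lean">-- A message sent at time 5 (before GST = 10) must arrive by max(10, 5) + 3 = 13;
-- a message sent at time 20 (after GST) must arrive by 20 + 3 = 23.
#eval isMsgTimely 10 3 5 12   -- true:  12 ≥ 5 and 12 ≤ 13
#eval isMsgTimely 10 3 5 14   -- false: 14 &gt; 13
#eval isMsgTimely 10 3 20 23  -- true:  23 ≥ 20 and 23 ≤ 23
#eval isMsgTimely 10 3 20 19  -- false: a message cannot arrive before it is sent
</code></pre>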

<p>Both <a href="https://dl.acm.org/doi/abs/10.1145/42282.42283">DLS88</a> and <a href="https://www.distributedprogramming.net/">DP2011</a> mention that in practice partial synchrony means
that periods of asynchrony and synchrony alternate. <a href="https://dl.acm.org/doi/abs/10.1145/42282.42283">DLS88</a> also mention that,
for their consensus algorithms, one should be able to compute the time of
convergence after GST. I am not actually sure how this would work in the case of
failure detectors, as it is impossible to predict when a process
crashes. Hence, in our model, there is no alternation of asynchrony and synchrony.
After GST, communication becomes synchronous, in the sense that every message is
delivered not later than $\mathit{MsgDelay}$ time units after it was sent.</p>

<h3 id="23-specifying-the-actions">2.3. Specifying the actions</h3>

<p>Now we are ready to specify the actions of a distributed system that follows the
algorithm. You can find all definitions in <a href="https://github.com/konnov/leanda/blob/main/epfd/Epfd/Propositional.lean">Propositional.lean</a>.</p>

<p>We start with the protocol parameters and the variables <code>s</code> and <code>s'</code> that
we use throughout the definitions:</p>

<pre><code class="language-lean">-- The abstract type of processes
variable (Proc : Type) [Fintype Proc] [DecidableEq Proc] [Hashable Proc] [Repr Proc]

-- The initial delay Δ used by the processes
variable (InitDelay: ℕ)

-- The global stabilization time GST, unknown to the processes
variable (GST: ℕ)

-- The message delay after GST, unknown to the processes
variable (MsgDelay: ℕ)

-- The state `s` is a state of the protocol, explicitly added to all the functions.
variable (s: ProtocolState Proc)

-- The state `s'` is the "next" state of the protocol.
variable (s': ProtocolState Proc)

</code></pre>

<p>Below is the definition of receiving a heartbeat request:</p>

<pre><code class="language-lean">/--
  A process `dst` receives a heartbeat request from `src`.
  -/
def rcv_heartbeat_request (src: Proc) (dst: Proc) (timestamp: ℕ) :=
  let req := { kind := MsgTag.HeartbeatRequest, src, dst, timestamp }
  dst ∉ s.crashed
  ∧ req ∈ s.sent
  ∧ isMsgTimely GST MsgDelay timestamp s.clock
  ∧ s'.rcvd = s.rcvd ∪ { req }
  ∧ let reply :=
      { kind := MsgTag.HeartbeatReply, src := dst, dst := src, timestamp := s.clock }
    s'.sent = s.sent ∪ { reply }
  ∧ s'.crashed = s.crashed
  ∧ s'.clock = s.clock
  ∧ s'.alive = s.alive
  ∧ s'.suspected = s.suspected
  ∧ s'.delay = s.delay
  ∧ s'.nextTimeout = s.nextTimeout

</code></pre>

<p>As you can see, the definition of <code>rcv_heartbeat_request</code> captures the behavior
of the whole system, when <code>dst</code> handles a heartbeat request. In particular,
<code>dst</code> cannot be in the crashed state when it is receiving the message, the
message has to be timely, etc. Similar to TLA<sup>+</sup>, we specify that
certain fields preserve their values. Actually, we could update the structure
<code>s'</code> instead of writing down multiple equalities over the fields. However, it
would make the proofs more cumbersome. I could not find a simple way to express
something like TLA<sup>+</sup>’s <code>UNCHANGED</code> over multiple variables.</p>

<p>Interestingly, I accidentally swapped <code>src</code> and <code>dst</code> in the initial version of
<code>reply</code> in <code>rcv_heartbeat_request</code>. I only found that when trying to prove one
of the lemmas towards strong completeness.</p>

<p>Similar to <code>rcv_heartbeat_request</code>, we specify <code>rcv_heartbeat_reply</code>:</p>

<pre><code class="language-lean">/--
  A process `dst` receives a heartbeat reply from `src`.
  -/
def rcv_heartbeat_reply (src: Proc) (dst: Proc) (timestamp: ℕ) :=
  let reply := { kind := MsgTag.HeartbeatReply, src, dst, timestamp }
  dst ∉ s.crashed
  ∧ reply ∈ s.sent
  ∧ isMsgTimely GST MsgDelay timestamp s.clock
  ∧ s'.rcvd = s.rcvd ∪ { reply }
  ∧ let nextAlive := s.alive[dst]! ∪ { src }
    s'.alive = s.alive.insert dst nextAlive
  ∧ s'.sent = s.sent
  ∧ s'.crashed = s.crashed
  ∧ s'.clock = s.clock
  ∧ s'.suspected = s.suspected
  ∧ s'.delay = s.delay
  ∧ s'.nextTimeout = s.nextTimeout

</code></pre>

<p>The definition of <code>timeout</code> is the longest one, as a lot of things happen on
timeout:</p>

<pre><code class="language-lean">/--
  A process `p` timeouts.
  -/
def timeout (p: Proc) :=
    p ∉ s.crashed
  ∧ s.clock = s.nextTimeout[p]!
  -- if `p` suspects an alive process, increase the delay
  ∧ let nextDelay :=
      if s.alive[p]! ∩ s.suspected[p]! ≠ ∅
      then s.delay[p]! + InitDelay
      else s.delay[p]!
    s'.delay = s.delay.insert p nextDelay
  -- recompute the set of suspected processes
  ∧ let nextSuspected := Finset.univ \ s.alive[p]!
      /- q ∉ s.alive[p]! is equivalent to the original code:
        on q ∉ s.alive[p]! ∧ q ∉ s.suspected[p]! trigger Suspect q
        on q ∈ s.alive[p]! ∧ q ∈ s.suspected[p]! trigger Restore q
        else keep q ∈ s.suspected[p]!
       -/
    s'.suspected = s.suspected.insert p nextSuspected
  -- send heartbeat requests to all processes, including `p` itself
  ∧ s'.sent = s.sent ∪ Finset.univ.image (fun q =&gt; {
      kind := MsgTag.HeartbeatRequest, src := p, dst := q, timestamp := s.clock
    })
  -- set alive to empty and reset the timer
  ∧ s'.alive = s.alive.insert p ∅
  ∧ s'.nextTimeout = s.nextTimeout.insert p (s.clock + s.delay[p]!)
  -- everything else remains unchanged
  ∧ s'.rcvd = s.rcvd
  ∧ s'.crashed = s.crashed
  ∧ s'.clock = s.clock

</code></pre>

<p>As you can see, the sequential logic from the pseudo-code is compressed into
multiple equalities, very much in the spirit of TLA<sup>+</sup>. Our proofs are
complex enough, so it’s good that we do not have to deal with sequential
execution inside actions. If this is not convincing enough, we could write
sequential code and prove that it refines the corresponding propositional
definition.</p>

<p>We have defined the three actions, as in the pseudo-code (the definition of
<code>init</code> comes later). Are we done? Not quite. Since we’re specifying the behavior
of the entire distributed system, not just individual processes, we need two
more actions.</p>

<p>The first additional action is <code>crash</code>:</p>

<pre><code class="language-lean">/--
  A process `p` crashes. This action is not part of the protocol itself, but
  rather a part of the environment (or the adversary).
  -/
def crash (p: Proc) :=
    p ∉ s.crashed
  ∧ s'.crashed = s.crashed ∪ { p }
  ∧ s'.sent = s.sent
  ∧ s'.rcvd = s.rcvd
  ∧ s'.clock = s.clock
  ∧ s'.alive = s.alive
  ∧ s'.suspected = s.suspected
  ∧ s'.delay = s.delay
  ∧ s'.nextTimeout = s.nextTimeout

</code></pre>

<p>Yes, we have to specify what it means for a process to crash, as there is no
built-in semantics of crashing in Lean.</p>

<p>What is left? Remember that we had the fictitious global clock? We have to
advance it from time to time:</p>

<pre><code class="language-lean">/--
  The global system clock advances. We advance the clock by exactly one unit.
  If we had a rational clock, we would have to advance it by `delta` units.
  -/
def advance_clock :=
    s'.clock = s.clock + 1
  ∧ s'.crashed = s.crashed
  ∧ s'.sent = s.sent
  ∧ s'.rcvd = s.rcvd
  ∧ s'.alive = s.alive
  ∧ s'.suspected = s.suspected
  ∧ s'.delay = s.delay
  ∧ s'.nextTimeout = s.nextTimeout

</code></pre>

<p>I have cut a corner in the definition of <code>advance_clock</code> by incrementing the
clock by one, instead of advancing it by an arbitrary positive delta. This works since we declared the
clock to be a natural number rather than a rational or a real. Incrementing the
clock by one also simplifies the proofs a bit.</p>

<p>Finally, we define the initialization and the transition relation as follows:</p>

<pre><code class="language-lean">/--
  Initialize a map with the default value `v` for each process in `all`.
  -/
noncomputable def init_map {α: Type} (v: α) : Std.HashMap Proc α :=
  Finset.univ.toList.foldl (fun m p =&gt; m.insert p v) (Std.HashMap.emptyWithCapacity 0)

/--
  The initial state of the protocol.
  -/
def init: Prop :=
    s.crashed = ∅
  ∧ s.sent = ∅
  ∧ s.rcvd = ∅
  ∧ s.clock = 0
  ∧ s.alive = init_map Proc ∅
  ∧ s.suspected = init_map Proc ∅
  ∧ s.delay = init_map Proc InitDelay
  ∧ s.nextTimeout = init_map Proc InitDelay

/--
  The transition relation of the protocol.
  -/
def next: Prop :=
    advance_clock Proc s s'
  ∨ ∃ p: Proc,
        timeout Proc InitDelay s s' p
      ∨ crash Proc s s' p
      ∨ ∃ q: Proc, ∃ t: ℕ,
            rcv_heartbeat_request Proc GST MsgDelay s s' p q t
          ∨ rcv_heartbeat_reply Proc GST MsgDelay s s' p q t

</code></pre>

<p>Notice the <code>noncomputable</code> qualifier in front of <code>init_map</code>. Lean requires it,
as we are converting <code>Finset.univ</code> to a list. If we want to write an executable
specification, we have to work around this, perhaps, by passing the list of all
process identities to the initializer.</p>
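
<p>Such a workaround might look as follows. This is only a sketch:
<code>init_map_from</code> is a name I made up, and the caller would be
responsible for passing the complete list of process identities (assuming the
<code>variable</code> declarations for <code>Proc</code> are in scope):</p>

<pre><code class="language-lean">/-- A computable variant of `init_map` (hypothetical): fold over an
    explicitly given list of process identities instead of enumerating
    `Finset.univ`. -/
def init_map_from {α : Type} (procs : List Proc) (v : α) : Std.HashMap Proc α :=
  procs.foldl (fun m p =&gt; m.insert p v) (Std.HashMap.emptyWithCapacity procs.length)
</code></pre>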

<p>So far, our definitions have looked very much like a typical specification in
TLA<sup>+</sup>, even though we had to use Lean’s data structures, such as finite
sets and hash maps, instead of TLA<sup>+</sup>’s sets and functions. I believe
that there is an advantage in keeping this resemblance. First, if we choose to
translate this specification to TLA<sup>+</sup>, e.g., to use the model
checkers, it is not hard. (Actually, I did that; it was almost a no-brainer with
Copilot.) Second, we can reuse the standard specification idioms from
TLA<sup>+</sup>.</p>

<h3 id="24-specifying-the-temporal-properties">2.4. Specifying the temporal properties</h3>

<p>In <a href="/lean/2025/05/10/lean-two-phase-proofs.html">two-phase commit</a>, we were only concerned with state
invariants and, thus, only had to reason about lists of actions. In the case of
failure detectors, we have to reason about temporal properties. In general,
temporal properties require us to reason about infinite behaviors. Surprisingly,
it is quite easy to specify properties of infinite behaviors in Lean. We just
use a function <code>seq</code> from natural numbers to <code>ProtocolState</code>.</p>

<p>Here is how we specify strong completeness of the failure detector:</p>

<pre><code class="language-lean">def is_strongly_complete
    (Crashed: Finset Proc)
    (seq: ℕ → ProtocolState Proc): Prop :=
  (∀ p: Proc, p ∈ Crashed ↔ ∃ i: ℕ, p ∈ (seq i).crashed)
    → ∃ k: ℕ,
        ∀ i: ℕ,
          ∀ p q: Proc,
            p ∉ Crashed ∧ q ∈ Crashed → q ∈ (seq (k + i)).suspected[p]!

</code></pre>

<p>The above definition may seem a bit loaded. The left-hand side of <code>→</code> requires
that the set <code>Crashed</code> indeed contains exactly the processes that crash in the
run. The set <code>Crashed</code> happened to be hard to define. More on that later. The
right-hand side of <code>→</code> says that there is a point <code>k</code> in <code>seq</code> such that every
later state $seq (k + i)$ satisfies $q \in (seq (k +
i)).suspected[p]!$ for every correct $p$ and crashed $q$.</p>

<p>If you know temporal logic, e.g., as defined in TLA<sup>+</sup>, the right-hand
side of <code>→</code> could be written as follows (<code>&lt;&gt;</code> is usually called “eventually” and <code>[]</code>
is called “always”):</p>

<pre><code class="language-tla">&lt;&gt;[](∀ p q: Proc, p ∉ C ∧ q ∈ C → q ∈ suspected[p])
</code></pre>

<p>Similar to <code>is_strongly_complete</code>, this is how we specify strong accuracy:</p>

<pre><code class="language-lean">def is_eventually_strongly_accurate
    (Crashed: Finset Proc)
    (seq: ℕ → (ProtocolState Proc)) : Prop :=
  (∀ p: Proc, p ∈ Crashed ↔ ∃ i: ℕ, p ∈ (seq i).crashed)
    → ∃ k: ℕ,
        ∀ i: ℕ,
          ∀ p q: Proc,
            p ∉ Crashed ∧ q ∉ Crashed → q ∉ (seq (i + k)).suspected[p]!

</code></pre>

<p>Again, in temporal logic it would look like:</p>

<pre><code class="language-tla">&lt;&gt;[](∀ p q: Proc, p ∉ crashed ∧ q ∉ crashed → q ∉ suspected[p])
</code></pre>

<p><strong>Don’t we need a framework for temporal logic?</strong> Well, actually not. Instead of
<code>[]</code> and <code>&lt;&gt;</code>, we can simply use <code>∀</code> and <code>∃</code> over indices. There is even a
deeper connection between linear temporal logic and first-order logic with
ordering, shown by <a href="https://en.wikipedia.org/wiki/Hans_Kamp">Hans Kamp</a>. For example, see a recent <a href="https://drops.dagstuhl.de/storage/00lipics/lipics-vol016-csl2012/LIPIcs.CSL.2012.516/LIPIcs.CSL.2012.516.pdf">Proof of Kamp’s
theorem</a> by Alexander Rabinovich. Temporal formulas are often
easier to read. So I prefer to accompany properties in Lean with temporal
properties in the documentation.</p>
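<p>For example, the two quantifier patterns can be packaged as ordinary Lean
definitions. The names below are hypothetical and not used in the actual
development:</p>

<pre><code class="language-lean">-- "always P", i.e., `[] P` over a sequence indexed by ℕ
def always (P: ℕ → Prop): Prop := ∀ i: ℕ, P i

-- "eventually P", i.e., `&lt;&gt; P`
def eventually (P: ℕ → Prop): Prop := ∃ i: ℕ, P i

-- "eventually always P", i.e., `&lt;&gt;[] P`: from some point `k` on, `P` holds
def eventually_always (P: ℕ → Prop): Prop := ∃ k: ℕ, ∀ i: ℕ, P (k + i)
</code></pre>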

<h3 id="25-specifying-fairness-and-fair-runs">2.5. Specifying fairness and fair runs</h3>

<p>Now that we have described the system behavior, can we proceed with the proofs?
Not so fast. If you have ever tried to prove liveness, you know that we have to
restrict our analysis to <em>fair</em> system executions.</p>

<p>For instance, our definition of <code>next</code> allows the scheduler to always choose
<code>advance_clock</code> as the next action. We would then end up with a sequence of
states in which only the clock value increases. Is this an interesting
sequence? Not really. The other actions never even get a chance to be taken.
Such executions are usually called unfair. We want to restrict our liveness
analysis to fair executions.</p>

<p>To save you the guesswork, here are the three kinds of conditions that we
require of a fair execution in our failure detector:</p>

<ol>
  <li>
    <p>For every message $m$ that is sent, the message destination receives it on
 time, unless it crashes before the message $m$ expires. This is the constraint
 right from <a href="#22-partial-synchrony">partial synchrony</a>.</p>
  </li>
  <li>
    <p>If a process $p$ has a scheduled timeout, $p$ should process the timeout
 before the global clock advances too far, unless $p$ crashed before the timeout
 had to be handled. While this may sound obvious, this requirement is crucial
 for the failure detector.</p>
  </li>
  <li>
    <p>The global clock must advance infinitely often. Indeed, it is possible to
 construct a sequence of states that have the global clock increased only
 finitely many times. This effect is usually called <em>zenoness</em>, after <a href="https://en.wikipedia.org/wiki/Zeno%27s_paradoxes">Zeno’s
 paradoxes</a>. We want to avoid such executions. If you do not believe this
 is possible, look carefully at the definition <code>rcv_heartbeat_request</code>.  It can
 receive the same message multiple times! Sure, we could eliminate this behavior
 by receiving every message at most once. It would be harder to do that in a
 more complex protocol. Just requiring non-zenoness is much simpler.</p>
  </li>
</ol>

<p>Lean has no idea about distributed algorithms and fair executions. We can get
some inspiration from TLA<sup>+</sup>. Unfortunately, fairness in
TLA<sup>+</sup> is a bit too complicated. If we wanted to transfer this approach
to our proofs, we would have to figure out how to write our fairness constraints
with strong fairness, weak fairness, and <code>ENABLED</code>.</p>

<p>To avoid this complex ceremony, we recall that our actions have a very simple
structure. Essentially, every protocol state is constructed by executing one of
the six actions:</p>

<pre><code class="language-lean">inductive Action where
  | Init
  | AdvanceClock
  | Timeout(p: Proc)
  | Crash(p: Proc)
  | RcvHeartbeatRequest(src: Proc) (dst: Proc) (timestamp: ℕ)
  | RcvHeartbeatReply(src: Proc) (dst: Proc) (timestamp: ℕ)

</code></pre>

<p>Our key idea here is that we could explicitly force some of the actions to be
taken in a fair execution. To this end, we refine our transition relation <code>next</code>
with <code>next_a</code>:</p>

<pre><code class="language-lean">def next_a (a: @Action Proc): Prop :=
match a with
| Action.Init =&gt;
    s' = s -- dummy action
| Action.AdvanceClock =&gt;
    advance_clock Proc s s'
| Action.Timeout p =&gt;
    timeout Proc InitDelay s s' p
| Action.Crash p =&gt;
    crash Proc s s' p
| Action.RcvHeartbeatRequest src dst timestamp =&gt;
    rcv_heartbeat_request Proc GST MsgDelay s s' src dst timestamp
| Action.RcvHeartbeatReply src dst timestamp =&gt;
    rcv_heartbeat_reply Proc GST MsgDelay s s' src dst timestamp

</code></pre>

<p>Of course, we have to prove the equivalence of <code>next</code> and <code>next_a</code>. We also
have to account for the case <code>Init</code>, where we just require $s' = s$. This is
easy to do in Lean. Below is the theorem statement. Check
<a href="https://github.com/konnov/leanda/blob/main/epfd/Epfd/PropositionalProofs.lean">PropositionalProofs.lean</a> for the actual proof:</p>

<pre><code class="language-lean">theorem next_a_iff_next
    (s: ProtocolState Proc)
    (s': ProtocolState Proc):
      s' = s ∨ next Proc InitDelay GST MsgDelay s s'
        ↔ ∃ a: Action Proc, next_a Proc InitDelay GST MsgDelay s s' a := by

</code></pre>

<p>Now, instead of reasoning just about sequences of protocol states, we can
reason about sequences of states and actions. Formally, we introduce the
definition of a <em>trace</em> and related definitions:</p>

<pre><code class="language-lean">structure StateAction where
  s: ProtocolState Proc
  a: @Action Proc

/--
  A trace is an infinite sequence of pairs:
   - a state and
   - the action that produced the state from the previous one.

  The initial state is produced by the dummy action `Init`.
  We do not enforce the states to be connected by the `next` relation.
  See `is_path` and `is_run` for stronger conditions.
  -/
abbrev Trace := ℕ → StateAction Proc

/--
  Interpret a trace as a sequence of protocol states.
  -/
def states_of_trace (tr: Trace Proc) :=
  fun i: ℕ =&gt; (tr i).s

</code></pre>

<p>Not every trace can be produced by the failure detector protocol. We define what
it means for a trace to be a run of the protocol, not necessarily a fair one,
and what it means for a trace to be a fair run of the protocol:</p>

<pre><code class="language-lean">/--
  A trace is a path, if every pair of state-action pairs `((s_i, _), (s_{i+1},
  a_{i+1}))` is a transition via `next_a`. A path does not have to start with an
  initial state.
  -/
def is_path (tr: Trace Proc) : Prop :=
  ∀ i: ℕ,
    next_a Proc InitDelay GST MsgDelay (tr i).s (tr (i + 1)).s (tr (i + 1)).a

/--
  A trace is a (protocol) run, if it starts with an initial state,
  and it is a path.
  -/
def is_run (tr: Trace Proc) : Prop :=
  let s0 := (tr 0).s
  init Proc InitDelay s0
    ∧ is_path Proc InitDelay GST MsgDelay tr

/--
  Does a trace constitute a fair run of the protocol?
  -/
def is_fair_run (tr: Trace Proc) : Prop :=
  is_run Proc InitDelay GST MsgDelay tr
    ∧ is_reliable_communication Proc GST MsgDelay tr
    ∧ is_fair_timeout Proc tr
    ∧ is_fair_clock Proc tr

</code></pre>

<p>Having all these definitions, we proceed with our fairness constraints. The
simplest one is <code>is_fair_clock</code>:</p>

<pre><code class="language-lean">def is_fair_clock (tr: Trace Proc) : Prop :=
  ∀ i: ℕ,
    ∃ j: ℕ,
      j &gt; i ∧ (tr j).a = Action.AdvanceClock

</code></pre>

<p>Essentially, <code>is_fair_clock</code> says that we observe <code>AdvanceClock</code> in a trace
infinitely often.  In TLA<sup>+</sup>, it would be written as
<code>[]&lt;&gt;&lt;advance_clock&gt;_vars</code>. If you don’t know what this means, just skip it.</p>

<p>Further, we define <code>is_fair_timeout</code> as follows:</p>

<pre><code class="language-lean">def is_fair_timeout (tr: Trace Proc): Prop :=
  ∀ i: ℕ,
    ∀ p: Proc,
      ∃ k: ℕ,
        (p ∉ (tr (i + k - 1)).s.crashed → (tr (i + k)).a = Action.Timeout p)
          -- TODO: is this a bit too strong in the presence of is_fair_clock?
          ∧ (tr (i + k)).s.clock = (tr i).s.nextTimeout[p]!

</code></pre>

<p>Finally, <code>is_reliable_communication</code> has the longest definition:</p>

<pre><code class="language-lean">def is_reliable_communication (tr: Trace Proc) : Prop :=
  ∀ k: ℕ,
    ∀ m ∈ (tr k).s.sent,
      ∃ i: ℕ,
        let { s := s_j, a := a_j } := tr (k + i)
        isMsgTimely GST MsgDelay m.timestamp s_j.clock
          ∧ m.dst ∈ s_j.crashed
            ∨ match m.kind with
            | MsgTag.HeartbeatReply =&gt;
                a_j = Action.RcvHeartbeatReply m.src m.dst m.timestamp
            | MsgTag.HeartbeatRequest =&gt;
                a_j = Action.RcvHeartbeatRequest m.src m.dst m.timestamp

</code></pre>

<p>Now we are ready for the proofs!</p>

<h2 id="3-proving-strong-completeness-in-lean">3. Proving strong completeness in Lean</h2>

<p>The main theorem that we want to prove is <code>strong_completeness_on_states</code>, which
basically delegates the work to <code>strong_completeness</code> over fair traces:</p>

<pre><code class="language-lean">/--
  Show that the property `is_strongly_complete` holds on fair runs.
  -/
theorem strong_completeness_on_states
    (tr: Trace Proc)
    (h_is_fair_run: is_fair_run Proc InitDelay GST MsgDelay tr)
    (Crashed: Finset Proc):
      is_strongly_complete Proc Crashed (states_of_trace Proc tr) := by
  unfold is_strongly_complete states_of_trace
  intro h_crashed
  have h_is_crashing_set: is_crashing_set tr Crashed := by exact h_crashed
  exact strong_completeness InitDelay GST MsgDelay
    tr h_is_fair_run Crashed h_is_crashing_set

</code></pre>

<p>The figure below summarizes the lemmas (yellow) and theorems (green) that I had
to prove in order to show <code>strong_completeness_on_states</code>.</p>

<picture>
  <source srcset="/img/epfd-completeness-deps.png" type="image/png" />
  <img class="responsive-img" src="/img/epfd-completeness-deps.png" alt="Proof schema" />
</picture>

<p>If you want to understand the proofs, you should inspect
<a href="https://github.com/konnov/leanda/blob/main/epfd/Epfd/PropositionalProofs.lean">PropositionalProofs.lean</a> with the Lean extension for VSCode. I will only
give you human-readable summaries as well as my observations about how I wrote
these proofs.</p>

<h3 id="31-shorthand-temporal-definitions">3.1. Shorthand temporal definitions</h3>

<p>I found it convenient to define shorthands for the temporal properties that are
used throughout the proofs. For instance, below is the definition of
<code>never_crashes</code>:</p>

<pre><code class="language-lean">/-- A process `p` never crashes, i.e., `[](p ∉ crashed)`.  -/
def never_crashes (tr: Trace Proc) (p: Proc): Prop :=
  ∀ i: ℕ,
    p ∉ (tr i).s.crashed

</code></pre>

<p>This property can be visualized as follows:</p>

<table class="timeline-table">
  <tr>
    <th>Time →</th>
    <td>0</td>
    <td>1</td>
    <td>2</td>
    <td>3</td>
    <td>4</td>
    <td>5</td>
  </tr>
  <tr>
    <th>crashed:</th>
    <td>{}</td>
    <td>{}</td>
    <td>{}</td>
    <td>{}</td>
    <td>{}</td>
    <td>{}</td>
  </tr>
  <tr>
    <th></th>
    <td>✅</td>
    <td>✅</td>
    <td>✅</td>
    <td>✅</td>
    <td>✅</td>
    <td>✅</td>
  </tr>
</table>
<p class="timeline-valid">[](p ∉ crashed) holds</p>

<p>The negation of <code>never_crashes</code> is <code>eventually_crashes</code>:</p>

<pre><code class="language-lean">/-- A process `p` eventually crashes, i.e., `&lt;&gt;(p ∈ crashed)`.  -/
def eventually_crashes (tr: Trace Proc) (p: Proc): Prop :=
  ∃ i: ℕ,
    p ∈ (tr i).s.crashed

</code></pre>

<table class="timeline-table">
  <tr>
    <th>Time →</th>
    <td>0</td>
    <td>1</td>
    <td>2</td>
    <td>3</td>
    <td>4</td>
    <td>5</td>
  </tr>
  <tr>
    <th>crashed:</th>
    <td>{}</td>
    <td>{}</td>
    <td>{}</td>
    <td>{p}</td>
    <td>{p}</td>
    <td>{p}</td>
  </tr>
  <tr>
    <th></th>
    <td>❌</td>
    <td>❌</td>
    <td>❌</td>
    <td>✅</td>
    <td>✅</td>
    <td>✅</td>
  </tr>
</table>
<p class="timeline-valid">&lt;&gt;(p ∈ crashed) holds</p>

<p>We also need <code>eventually_never_alive</code>:</p>

<pre><code class="language-lean">/--
  Eventually, `p` never registers `q` as alive, i.e., `&lt;&gt;[](q ∉ alive[p])`.
  -/
def eventually_never_alive (tr: Trace Proc) (p q: Proc): Prop :=
  ∃ k: ℕ, ∀ i: ℕ,
    q ∉ (tr (k + i)).s.alive[p]!

</code></pre>

<table class="timeline-table">
  <tr>
    <th>Time →</th>
    <td>0</td>
    <td>1</td>
    <td>2</td>
    <td>3</td>
    <td>4</td>
    <td>5</td>
  </tr>
  <tr>
    <th>alive[p]:</th>
    <td>{q}</td>
    <td>{q}</td>
    <td>{q}</td>
    <td>{}</td>
    <td>{}</td>
    <td>{}</td>
  </tr>
  <tr>
    <th></th>
    <td>❌</td>
    <td>❌</td>
    <td>❌</td>
    <td>✅</td>
    <td>✅</td>
    <td>✅</td>
  </tr>
</table>
<p class="timeline-valid">&lt;&gt;[] (q ∉ alive[p]) holds</p>

<p>We also need <code>q_is_always_suspected</code> and <code>eventually_q_is_always_suspected</code>:</p>

<pre><code class="language-lean">/--
  `p` suspects `q` permanently from some point `k`, i.e.,
  `tr[k,...] ⊧ [](q ∈ suspected[p])`.
  -/
def q_is_always_suspected (tr: Trace Proc) (p q: Proc) (k: ℕ): Prop :=
  ∀ i: ℕ,
    q ∈ (tr (k + i)).s.suspected[p]!

/--
  Eventually, `p` suspects `q` permanently, i.e., `&lt;&gt;[](q ∈ suspected[p])`.
  -/
def eventually_q_is_always_suspected (tr: Trace Proc) (p q: Proc): Prop :=
  ∃ k: ℕ,
    q_is_always_suspected tr p q k

</code></pre>

<table class="timeline-table">
  <tr>
    <th>Time →</th>
    <td>0</td>
    <td>1</td>
    <td>2</td>
    <td>3</td>
    <td>4</td>
    <td>5</td>
  </tr>
  <tr>
    <th>suspected[p]:</th>
    <td>{}</td>
    <td>{}</td>
    <td>{}</td>
    <td>{q}</td>
    <td>{q}</td>
    <td>{q}</td>
  </tr>
  <tr>
    <th></th>
    <td>❌</td>
    <td>❌</td>
    <td>❌</td>
    <td>✅</td>
    <td>✅</td>
    <td>✅</td>
  </tr>
</table>
<p class="timeline-valid">&lt;&gt;[] (q ∈ suspected[p]) holds</p>

<p>Finally, we need the definition of the set of crashing processes:</p>

<pre><code class="language-lean">/--
  A set of processes `C` is a crashing set if every process in `C`
  eventually crashes, and every process not in `C` never crashes.
 -/
def is_crashing_set (tr: Trace Proc) (C: Finset Proc): Prop :=
  ∀ p: Proc, p ∈ C ↔ eventually_crashes tr p

</code></pre>

<h3 id="32-warming-up-with-simple-temporal-lemmas">3.2. Warming up with simple temporal lemmas</h3>

<p>Before discussing hard-to-prove lemmas, let’s have a look at a few very simple
ones. To start with, we can easily show that the global clock never decreases in
a single step. The proof is basically done by the <code>simp</code> tactic:</p>

<pre><code class="language-lean">/--
  A single step does not decrease the clock value. In temporal logic,
  `[](clock' ≥ clock)`.
  -/
lemma clock_is_monotonic_in_one_step
    (s: ProtocolState Proc) (s': ProtocolState Proc) (a: Action Proc)
    (h_next: next_a Proc InitDelay GST MsgDelay s s' a):
      s'.clock ≥ s.clock := by
  unfold next_a crash rcv_heartbeat_reply advance_clock
         rcv_heartbeat_request timeout at h_next
  cases a with
  | Init =&gt; simp at h_next; rw [h_next]
  | AdvanceClock | RcvHeartbeatRequest _ _ _
  | RcvHeartbeatReply _ _ _ | Timeout _ | Crash _ =&gt;
    simp [h_next]

</code></pre>

<p>Very much similar to <code>clock_is_monotonic_in_one_step</code>, we can prove that the set
of the crashed processes can only grow in a single step:</p>

<pre><code class="language-lean">/--
  A single step does not decrease the set of the crashed processes.
  In temporal logic, `[](crashed' ⊇ crashed)`.
  -/
lemma crashed_is_monotonic_in_one_step
    (s: ProtocolState Proc) (s': ProtocolState Proc) (a: Action Proc)
    (h_next: next_a Proc InitDelay GST MsgDelay s s' a):
      s'.crashed ⊇ s.crashed := by
  -- literally the same proof as above
  unfold next_a crash rcv_heartbeat_reply
         advance_clock rcv_heartbeat_request timeout at h_next
  cases a with
  | Init =&gt; simp at h_next; rw [h_next]
  | AdvanceClock | RcvHeartbeatRequest _ _ _
  | RcvHeartbeatReply _ _ _ | Timeout _ | Crash _ =&gt;
    simp [h_next]

</code></pre>

<p>We use these simple lemmas to prove that $clock$ never decreases in a fair run,
and once a process has crashed, it always remains crashed. In both cases, the
proof is done by simple induction over the indices in a fair run. For example,
here is the lemma <code>crashed_is_monotonic_in_fair_run</code>, together with its proof:</p>

<pre><code class="language-lean">/--
  The set `crashed` grows monotonically in a fair run.

  In temporal logic, `∀p: Proc, [](p ∈ crashed) → [](p ∈ crashed))`.
  -/
lemma crashed_is_monotonic_in_fair_run
    (tr: Trace Proc)
    (h_is_fair_run: is_fair_run Proc InitDelay GST MsgDelay tr)
    (p: Proc) (k: ℕ) (h_p_crashed: p ∈ (tr k).s.crashed) (i: ℕ):
      p ∈ (tr (k + i)).s.crashed := by
  induction i with
  | zero =&gt; exact h_p_crashed
  | succ i ih =&gt;
    unfold is_fair_run at h_is_fair_run
    rcases h_is_fair_run with ⟨ h_is_run, _ ⟩
    unfold is_run at h_is_run
    rcases h_is_run with ⟨ _, h_is_path ⟩
    unfold is_path at h_is_path
    specialize h_is_path (k + i)
    -- apply crashed_is_monotonic_in_one_step to the last step
    have h_last_step_mono :=
      crashed_is_monotonic_in_one_step InitDelay GST MsgDelay
        (tr (k + i)).s (tr (k + i + 1)).s (tr (k + i + 1)).a h_is_path
    exact h_last_step_mono ih

</code></pre>

<p>Further, we prove another useful lemma: Given a clock value $t$, eventually
the global clock reaches the value $t$:</p>

<pre><code class="language-lean">/--
  Every fair run covers every clock value `t`. Note that this requires fairness.
  Otherwise, the clock may not advance at all.

  In temporal logic, `∀t ∈ ℕ, &lt;&gt;(clock ≥ t)`.
  -/
lemma eventually_clock_is_t
    (tr: Trace Proc)
    (h_is_fair_run: is_fair_run Proc InitDelay GST MsgDelay tr)
    (t: ℕ):
      (∃ i: ℕ, (tr i).s.clock ≥ t) := by

</code></pre>

<p>You can check the <a href="https://github.com/konnov/leanda/blob/a96eb677d7f514e8d6ac1cdd6970643f2488b442/epfd/Epfd/PropositionalProofs.lean#L187-L233">full proof of
eventually_clock_is_t</a>.
It is not a long one, but it requires a bit of linear arithmetic to reason about
indices and clock constraints.</p>

<h3 id="33-proving-completeness-for-two-processes">3.3. Proving completeness for two processes</h3>

<p>Before we dive into the results for all processes, we focus on just two
processes in a fair run <code>tr</code>:</p>

<ul>
  <li>a process <code>p</code> that never crashes in <code>tr</code>,</li>
  <li>a process <code>q</code> that eventually crashes in <code>tr</code>.</li>
</ul>

<p>Since the Lean proofs are quite detailed, I provide the lemmas with short
human-readable proofs. I actually had to write proof schemas on paper before
developing the detailed proofs. The math-like proofs below are summaries of the
detailed Lean proofs, as my pen &amp; paper proofs had several flaws.</p>

<h4 id="331-main-lemma-eventually-q-is-always-suspected-by-p">3.3.1. Main lemma: Eventually q is always suspected by p</h4>

<p>To show strong completeness for $p$ and $q$, we prove the following key lemma:</p>

<p><strong>Lemma</strong> <code>eventually_crashes_implies_always_suspected</code>. If $q$ crashes at some
time $j$ and $p$ never crashes, then there exists $k$ such that for all $i \ge
k$, we have $q \in \text{suspected}[p]$ at $i$.</p>

<p>Using the TLA<sup>+</sup> notation, we could write this lemma in temporal logic:</p>

\[\square (p \notin crashed) \land \Diamond (q \in crashed)
  \Rightarrow \Diamond \square (q \in suspected[p])\]

<p>This is how the lemma is formulated in Lean:</p>

<pre><code class="language-lean">lemma eventually_crashes_implies_always_suspected
    (tr: Trace Proc)
    (h_is_fair_run: is_fair_run Proc InitDelay GST MsgDelay tr)
    (p q: Proc)
    (h_p_never_crashes: never_crashes tr p)
    (h_crashes: eventually_crashes tr q):
      eventually_q_is_always_suspected tr p q := by

</code></pre>

<p>You can check the <a href="https://github.com/konnov/leanda/blob/a96eb677d7f514e8d6ac1cdd6970643f2488b442/epfd/Epfd/PropositionalProofs.lean#L709-L722">detailed proof in
Lean</a>.
The proof is about 100 LOC long. I asked ChatGPT to summarize the proof in the
style of a mathematician’s proof. It looked very convincing, but the AI
hallucinated a lot, inventing additional lemmas and mixing up process names.
Instead, here is my proof summary, 100% organic:</p>

<p><strong>Proof.</strong> Since $q$ eventually crashes, we apply Lemma
<code>eventually_crashes_implies_never_alive</code> (see
<a href="#332-eventually-q-is-never-alive-for-p">below</a>) to show that there is an index
$k$ such that for all $i \ge k$, we have $q \notin alive[p]$ at $i$. Now, we may
still have $q \in suspected[p]$ at $k$. Hence, we apply the fairness constraint
<code>is_fair_timeout</code> to show that there is an index $j &gt; k$ such that $p$ times out
at $j$. By the definition of <code>timeout</code>, we have $q \in suspected[p]$ at $j + 1$,
as the action <code>timeout</code> updates <code>suspected[p]</code> with <code>Finset.univ \ alive[p]</code>,
and $q \notin alive[p]$ at $j$.</p>

<p>It remains to show that $q \in suspected[p]$ at an arbitrary $i &gt; j$. We do this
by induction on $i$. All actions except <code>Timeout</code> preserve the value of the
field <code>suspected</code>, so we have $q \in suspected[p]$. In the case of <code>Timeout r</code>, we
consider two cases: (1) $r \ne p$, and (2) $r = p$. When $r \ne p$, the value of
$suspected[p]$ does not change.  When $r = p$, we again invoke the conclusion
that $q \notin alive[p]$ at $i$. Similar to the above reasoning about the action
<code>timeout</code>, we conclude $q \in suspected[p]$ at all $i &gt; j$. $\blacksquare$</p>

<h4 id="332-eventually-q-is-never-alive-for-p">3.3.2. Eventually q is never alive for p</h4>

<p>As you have noticed, we invoked Lemma <code>eventually_crashes_implies_never_alive</code>.
This is how it looks in human-readable form:</p>

<p><strong>Lemma</strong> <code>eventually_crashes_implies_never_alive</code>. If $q$ crashes at some
time $j$ and $p$ never crashes, then there exists $k$ such that for all $i \ge
k$, we have $q \notin \text{alive}[p]$ at $i$.</p>

<p>Using the TLA<sup>+</sup> notation, we could write this lemma in temporal logic:</p>

\[\square (p \notin crashed) \land \Diamond (q \in crashed)
  \Rightarrow \Diamond \square (q \notin alive[p])\]

<p>This is how the lemma is formulated in Lean; the <a href="https://github.com/konnov/leanda/blob/a96eb677d7f514e8d6ac1cdd6970643f2488b442/epfd/Epfd/PropositionalProofs.lean#L538-L545">detailed
proof in Lean</a> is 170 LOC:</p>

<pre><code class="language-lean">lemma eventually_crashes_implies_never_alive
    (tr: Trace Proc)
    (h_is_fair_run: is_fair_run Proc InitDelay GST MsgDelay tr)
    (p q: Proc)
    (h_p_never_crashes: never_crashes tr p)
    (h_crashes: eventually_crashes tr q):
      eventually_never_alive tr p q := by

</code></pre>

<p><strong>Proof.</strong> Since $q$ eventually crashes, there is an index $i_{crash}$ such
that $q \in crashed$ at $i_{crash}$. Let us denote by $t_{crash}$ the clock value at
$i_{crash}$. However, $p$ may still receive heartbeats from $q$ that were sent
in the past. Hence, we choose the time point $t_{magic}$:</p>

\[t_{magic} = max(\mathit{GST} + \mathit{MsgDelay}, t_{crash} + \mathit{MsgDelay}) + 1\]

<p>We invoke Lemma <code>eventually_clock_is_t</code> to show that eventually the global clock
reaches the value $t_{magic}$. Further, we invoke Lemma
<code>eventually_alive_is_empty</code> (see
<a href="#333-non-crashing-p-resets-alive-infinitely-often">below</a>) to show that
eventually $alive[p] = \emptyset$ after that point. Hence, we have an index
$i_{empty}$ at which the following constraints hold:</p>

<ol>
  <li>
    <p>$alive[p] = \emptyset$ at $i_{empty}$,</p>
  </li>
  <li>
    <p>$q$ has crashed and no heartbeats from $q$ can arrive any longer, as the global
 clock is past $\mathit{GST} + \mathit{MsgDelay}$.</p>
  </li>
</ol>

<p>The rest of the proof goes by induction over $i \ge i_{empty}$.  We have already
shown the inductive base. The inductive step is proven by contradiction: Assume
that there is an index $i + 1$, where $alive[p] \ne \emptyset$.  We do case
analysis on the action that produces the state at $i + 1$. There are two
interesting cases:</p>

<ol>
  <li>
    <p>A process $r$ times out. If $r \ne p$, then $r$ keeps the value of
 $alive[p]$.  If $p$ times out, it resets $alive[p]$ to $\emptyset$. In both
 cases, $alive[p] = \emptyset$.</p>
  </li>
  <li>
    <p>A process $\mathit{dst}$ receives a heartbeat reply $m$ from a process
 $\mathit{src}$. The cases of $dst \ne p$ or $src \ne q$ are trivial, as the
 predicate $q \in alive[p]$ does not change in those cases. The case of $src =
 q$ and $dst = p$ is the hardest one. First, we apply Lemma
 <code>crashed_process_does_not_send</code> (see below) to show that the message timestamp
 $m.ts$ is not greater than $t_{crash}$. Second, we apply Lemma
 <code>clock_is_monotonic_in_fair_run</code> to show that $clock \ge t_{magic}$ at point
 $i$. Third, we apply the constraint <code>isMsgTimely</code> from the definition of
 <code>rcv_heartbeat_reply</code>. We arrive at the following combination of linear
 constraints that do not have a solution:</p>
  </li>
</ol>

\[\require{cases}
\begin{cases}
  \mathit{m.ts} &amp;\le t_{crash}\\
  clock &amp;\ge t_{magic}\\
  clock &amp;\ge m.ts\\
  clock &amp;\le max(GST, m.ts) + \mathit{MsgDelay}
\end{cases}\]

<p>This inductive argument finishes the proof. $\blacksquare$</p>

<h4 id="333-non-crashing-p-resets-alive-infinitely-often">3.3.3. Non-crashing p resets alive infinitely often</h4>

<p>We invoked Lemma <code>eventually_alive_is_empty</code> in the previous section. This is
how it looks in human-readable form:</p>

<p><strong>Lemma</strong> <code>eventually_alive_is_empty</code>. If $p$ never crashes, then for every
$k &gt; 0$,
there is $i \ge k$ such that we have $\text{alive}[p] = \emptyset$ at $i$.</p>

<p>Using the TLA<sup>+</sup> notation, we could write this lemma in temporal logic:</p>

\[\square (p \notin crashed)
  \Rightarrow \square \Diamond (alive[p] = \emptyset)\]

<p>This is how the lemma is formulated in Lean; the <a href="https://github.com/konnov/leanda/blob/a96eb677d7f514e8d6ac1cdd6970643f2488b442/epfd/Epfd/PropositionalProofs.lean#L241">detailed
proof in Lean</a> is 30 LOC:</p>

<pre><code class="language-lean">lemma eventually_alive_is_empty
    (tr: Trace Proc)
    (h_is_fair_run: is_fair_run Proc InitDelay GST MsgDelay tr)
    (p: Proc)
    (h_p_never_crashes: never_crashes tr p)
    (k: ℕ)
    (h_i_positive: k &gt; 0):
      ∃ i: ℕ, (tr (k + i)).s.alive[p]! = ∅ := by

</code></pre>

<p><strong>Proof.</strong> Since $p$ never crashes, we apply the fairness constraint
<code>is_fair_timeout</code> to the index $k$. Hence, there exists an index $i$,
so that $p$ times out at $k + i$. When processing the action <code>timeout</code>,
process $p$ resets $alive[p]$ to the empty set. $\blacksquare$</p>

<h4 id="334-a-crashed-process-q-stops-sending-messages">3.3.4. A crashed process q stops sending messages</h4>

<p>We invoked Lemma <code>crashed_process_does_not_send</code> in the proof of
<code>eventually_crashes_implies_never_alive</code>. This is how the lemma looks in
human-readable form:</p>

<p><strong>Lemma</strong> <code>crashed_process_does_not_send</code>. If $q$ is crashed at $k$ and the
clock value at $k$ equals some $c$, then for every $i \ge 0$ and every
message $m \in sent$ at $k + i$, if $m.src = q$, then $m.ts \le c$.</p>

<p>Using the TLA<sup>+</sup> notation, we could write this lemma in temporal logic:</p>

\[\square (\forall c \in \mathbb{N}: (q \in crashed \land clock = c)
  \Rightarrow \square (\forall m \in sent: (m.src = q) \Rightarrow m.ts \le c))\]

<p>This is how the lemma is formulated in Lean; the <a href="https://github.com/konnov/leanda/blob/a96eb677d7f514e8d6ac1cdd6970643f2488b442/epfd/Epfd/PropositionalProofs.lean#L419">detailed
proof in Lean</a> is 110 LOC:</p>

<pre><code class="language-lean">lemma crashed_process_does_not_send
    (tr: Trace Proc)
    (h_is_fair_run: is_fair_run Proc InitDelay GST MsgDelay tr)
    (p: Proc) (k: ℕ) (h_p_crashed: p ∈ (tr k).s.crashed):
      ∀ i: ℕ,
        ∀ m ∈ (tr (k + i)).s.sent,
          m.src = p → m.timestamp ≤ (tr k).s.clock := by

</code></pre>

<p>Although this lemma seems obvious, the proof is relatively long and mostly
technical.</p>

<p><strong>Proof.</strong> The proof is done by induction on $i$. It invokes two other lemmas:</p>

<ol>
  <li>
    <p>Lemma <code>crashed_is_monotonic_in_fair_run</code> that we discussed before, and</p>
  </li>
  <li>
    <p>Lemma <code>no_sent_from_the_future</code> (see
  <a href="#335-no-message-sent-from-the-future">below</a>), which states that no message
  can have a timestamp above the current value of the global clock.</p>
  </li>
</ol>

<p>The induction goes by case analysis on the action that is executed at point $i$.
There are two interesting cases that extend the set of sent messages $sent$:</p>

<ol>
  <li>
    <p>A timeout by process $r$. Since $q$ crashed at $k$, it remains crashed at
 $k+i$. Hence, $q$ cannot time out, and thus $r \ne q$. Therefore, the new
 heartbeat requests in $sent$ do not have $q$ as their source.</p>
  </li>
  <li>
    <p>Receiving a heartbeat request by process $r$. Again, $q$ is crashed at
 $k+i$, and $r \ne q$. Therefore, the new heartbeat replies in $sent$ do not
 have $q$ as their source. $\blacksquare$</p>
  </li>
</ol>
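<p>The shape of this induction is common to several proofs below. As a purely
illustrative, self-contained sketch (not code from the repository), the pattern
looks like this, where <code>tr</code> abstracts a trace and <code>P</code> a
state predicate that is preserved by every step:</p>

<pre><code class="language-lean">/--
  Toy version of the induction pattern: if `P` is preserved from every
  point of the trace to the next, then once it holds at `k`, it also
  holds at every later point `k + i`.
-/
theorem holds_forever {S : Type} (tr : Nat → S) (P : S → Prop)
    (h_step : ∀ n, P (tr n) → P (tr (n + 1))) :
    ∀ k i : Nat, P (tr k) → P (tr (k + i)) := by
  intro k i hk
  induction i with
  | zero =&gt; exact hk
  | succ j ih =&gt; exact h_step (k + j) ih
</code></pre>

<p>In the actual proof, the step case is not unconditional: it also performs the
case analysis on the executed action described above, which is where most of the
110 LOC go.</p>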

<h4 id="335-no-message-sent-from-the-future">3.3.5. No message sent from the future</h4>

<p>Lemma <code>no_sent_from_the_future</code> plays an important role in the proof of
<code>crashed_process_does_not_send</code>. This is how the lemma looks in
human-readable form:</p>

<p><strong>Lemma</strong> <code>no_sent_from_the_future</code>. For every point $i \ge 0$ and every
message $m \in sent$ at $i$, it holds that $m.ts \le clock$.</p>

<p>Using the TLA<sup>+</sup> notation, we could write this lemma in temporal logic:</p>

\[\square (\forall m \in sent: m.ts \le clock)\]

<p>This is how the lemma is formulated in Lean; the <a href="https://github.com/konnov/leanda/blob/a96eb677d7f514e8d6ac1cdd6970643f2488b442/epfd/Epfd/PropositionalProofs.lean#L303">detailed
proof in Lean</a>
is 110 LOC:</p>

<pre><code class="language-lean">lemma no_sent_from_the_future
    (tr: Trace Proc)
    (h_is_fair_run: is_fair_run Proc InitDelay GST MsgDelay tr):
      ∀ i: ℕ,
        ∀ m ∈ (tr i).s.sent,
          m.timestamp ≤ (tr i).s.clock := by

</code></pre>

<p><strong>Proof.</strong> The proof is done by induction on $i$ and case analysis on the
action executed at $i$. The proof is quite mechanical. We basically show that
either the set $sent$ does not change, whereas the value of $clock$ does not
decrease, or the new messages have their timestamp set to $clock$. Such messages
are sent in <code>timeout</code> and <code>rcv_heartbeat_request</code>. $\blacksquare$</p>

<p>Most likely, a human reader would immediately infer this lemma without much
thought. However, the lemma’s proof works only because we’re using the global
clock value $clock$ when sending messages. If we used local clocks for
timestamps, the lemma would no longer hold.</p>

<h4 id="336-other-lemmas">3.3.6. Other lemmas</h4>

<p>The proof of <code>no_sent_from_the_future</code> invokes another lemma called
<code>inductive_inv</code>. It provides a general proof scheme for inductive invariants in
the context of our protocol. I expected more proofs to use <code>inductive_inv</code>, but
it turned out that the other proofs required additional temporal reasoning beyond
simple inductive invariants.</p>

<p>Finally, we have another lemma called <code>when_clock_is_positive_step_is_non_init</code>.
It is just a technical lemma to work around a corner case that could not be
solved by the tactic <code>omega</code>. You can check <a href="https://github.com/konnov/leanda/blob/a96eb677d7f514e8d6ac1cdd6970643f2488b442/epfd/Epfd/PropositionalProofs.lean#L279">this
lemma</a>.
There is really nothing interesting in it.</p>

<p>Overall, as we go down the dependency tree of our lemmas, the proofs at the
top require quite a bit of creative thinking, whereas the proofs at the bottom,
such as <code>crashed_process_does_not_send</code> and <code>no_sent_from_the_future</code>, are quite
mechanical. They are long, but they do not require much thinking. It would be
great if those proofs could be derived automatically.</p>

<h3 id="34-from-2-to-n-processes">3.4. From 2 to N processes</h3>

<p>With the <a href="#331-main-lemma-eventually-q-is-always-suspected-by-p">main lemma</a>, we have proven strong completeness for a pair of processes.
Actually, we could just stop there. However, I decided to go the last mile and
prove strong completeness for arbitrary sets of processes, exactly as the
properties are written in <a href="https://www.distributedprogramming.net/">DP2011</a>. The last mile turned out to be harder than I
anticipated. Nevertheless, the findings and the proof technique are quite
interesting.</p>

<h4 id="341-defining-the-crashed-processes">3.4.1. Defining the crashed processes</h4>

<p>When I was writing the proof on paper, I wrote something along these lines:</p>

<blockquote>
  <p>Given a fair run, let us define the set <code>Crashed</code> that contains exactly those
processes that crash in the run.</p>
</blockquote>

<p>Hence, I tried to write a definition like this in Lean:</p>

<pre><code class="language-lean">def crashed_set (tr: Trace Proc) :=
  { p: Proc | ∃ i ∈ Nat, p ∈ (tr i).s.crashed }
</code></pre>

<p>Lean produced a somewhat obscure error: “failed to synthesize Membership ?m.10217 Type”.</p>

<p>So I thought, OK, it seems to be hard to define a potentially infinite set
that uses a proposition over an infinite sequence. Let’s try finite sets:</p>

<pre><code class="language-lean">def crashed_set (tr: Trace Proc) :=
  Finset.univ.filter (fun p =&gt; ∃ i ∈ ℕ, p ∈ (tr i).s.crashed)
</code></pre>

<p>The same error. What is going on? If we rewrite the definition like this, it
works (obviously, it does not do what we want though):</p>

<pre><code class="language-lean">def crashed_set (tr: Trace Proc) :=
  Finset.univ.filter (fun p =&gt; p ∈ (tr 0).s.crashed)
</code></pre>

<p>Ugh. The actual culprit is the expression <code>i ∈ Nat</code>: since <code>Nat</code>
is a type and not a set, Lean cannot synthesize a <code>Membership</code> instance
for it, hence the error. Even with <code>∃ i: ℕ, ...</code>, the
<code>Finset.filter</code> version would still be rejected, as <code>filter</code>
needs a decidable predicate, and an existential over an infinite sequence is not
decidable. It would be fine to use such a proposition in a statement, but not in
a computable definition. Well, this kind of makes sense. We cannot just compute
<code>crashed_set</code>, as we cannot predict when processes crash! Lean is a
bit strict about random mathy stuff.</p>

<p>Interestingly though, given a fair run, we should be able to define the set of
the crashed processes. This set is bounded from above by the finite set
<code>Finset.univ</code> of type <code>Finset Proc</code>. Also, as we showed in
<code>crashed_is_monotonic_in_fair_run</code>, the set of crashed processes can only grow,
not shrink. Hence, in theory, we should be able to define <code>crashed_set</code> as the
fixpoint of the operator that transforms $s_i$ into $s_{i+1}$ in our run. We
should be able to apply the <a href="https://en.wikipedia.org/wiki/Knaster%E2%80%93Tarski_theorem">Knaster–Tarski</a> theorem. Conversations with ChatGPT
about Knaster–Tarski in Lean opened a new rabbit hole.</p>

<p>This was getting too hard all of a sudden, so I decided that it was not
worth the effort. If I had a conversation like that with a customer, they would
tell me to stop right there. Hence, I decided that the properties should simply
have two parameters:</p>

<ul>
  <li>The set <code>(Crashed: Finset Proc)</code>, and</li>
  <li>a proof that it is exactly the set of the crashing processes.</li>
</ul>

<p>This is why our <a href="#24-specifying-the-temporal-properties">temporal properties</a> have these two parameters.</p>
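<p>For illustration, the membership predicate could plausibly look as follows.
This is only a hypothetical sketch in terms of the project’s <code>Trace</code>
and <code>Proc</code> types; the actual definition of
<code>is_crashing_set</code> in the repository may differ:</p>

<pre><code class="language-lean">-- hypothetical sketch, not the repository definition
def is_crashing_set (tr: Trace Proc) (Crashed: Finset Proc): Prop :=
  ∀ p: Proc, p ∈ Crashed ↔ ∃ i: ℕ, p ∈ (tr i).s.crashed
</code></pre>

<p>Since this is a proposition rather than a computable definition, Lean accepts
the existential quantifier over the infinite trace.</p>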

<h4 id="342-where-do-the-suspected-sets-meet">3.4.2. Where do the suspected sets meet?</h4>

<p>Recall that we have proven the main lemma for two processes, and it looks like
this:</p>

<pre><code class="language-lean">lemma eventually_crashes_implies_always_suspected
    (tr: Trace Proc)
    (h_is_fair_run: is_fair_run Proc InitDelay GST MsgDelay tr)
    (p q: Proc)
    (h_p_never_crashes: never_crashes tr p)
    (h_crashes: eventually_crashes tr q):
      eventually_q_is_always_suspected tr p q := by

</code></pre>

<p>The yet-to-prove theorem <code>strong_completeness</code> looks like this:</p>

<pre><code class="language-lean">theorem strong_completeness
    (tr: Trace Proc)
    (h_is_fair_run: is_fair_run Proc InitDelay GST MsgDelay tr)
    (Crashed: Finset Proc)
    (h_is_crashing_set: is_crashing_set tr Crashed):
      ∃ k: ℕ, ∀ i: ℕ, ∀ p q: Proc,
        (p ∉ Crashed ∧ q ∈ Crashed) → q ∈ (tr (k + i)).s.suspected[p]! := by

</code></pre>

<p>The challenge here is that the theorem claims the existence of a single index $k$
for all processes, whereas we get a different index for each pair when applying
the lemma. Intuitively, we should just be able to pick the maximum index among
them. This starts to smell like the above problem with the crashing sets. On the
other hand, choosing the maximum over the values of a finite set should be
possible. This made me think of <a href="https://en.wikipedia.org/wiki/Well-founded_relation">well-founded induction</a>. Intuitively, we should be
able to start with the empty set, add elements one by one, and pick the maximum
of two numbers at each inductive step: the maximum for the smaller set, and the
value for the new element. This is what <a href="https://leanprover-community.github.io/mathlib4_docs/Mathlib/Data/Finset/Insert.html"><code>Finset.induction</code></a>
can do for us!</p>
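<p>As a side note, a common upper bound over a finite set can also be obtained
directly via <code>Finset.sup</code>, without an explicit induction. The
following self-contained sketch (with assumed names <code>vs</code> and
<code>g</code>) shows the idea:</p>

<pre><code class="language-lean">import Mathlib

-- for every element of a finite set, `Finset.sup` dominates its bound
theorem exists_common_bound (vs: Finset ℕ) (g: ℕ → ℕ):
    ∃ k: ℕ, ∀ v ∈ vs, g v ≤ k :=
  ⟨vs.sup g, fun _ hv =&gt; Finset.le_sup hv⟩
</code></pre>

<p>Still, the inductive formulation generalizes better to the eventually-always
reasoning below, where we merge whole suffix properties rather than plain
numbers.</p>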

<p>Another way to see the issue is by comparing these two temporal formulas in
TLA<sup>+</sup>:</p>

\[\begin{align}
  \forall p, q:\ \Diamond \square (&amp;p \notin crashed \land q \in crashed
    \Rightarrow q \in suspected[p]) \tag{1}\\
  \Diamond \square (\forall p, q:\ &amp;p \notin crashed \land q \in crashed
    \Rightarrow q \in suspected[p]) \tag{2}
\end{align}\]

<p>Hence, to go from Equation (1) to Equation (2), we have to move two quantifiers
$\forall p$ and $\forall q$ inside $\Diamond \square (\dots)$. This observation
together with <code>Finset.induction</code> gave me this nice theorem:</p>

<pre><code class="language-lean">/--
  For a finite set of values `vs` and a state proposition `P`, show that we can
  swap a universal quantifier and eventually-always. In temporal logic `∀ v ∈
  vs, &lt;&gt;[](P v) → &lt;&gt;[](∀ v ∈ vs, P v)`.
  -/
theorem forall_FG_implies_FG_forall
    (P: TraceProp)
    (vs: Finset Val):
    (∀ v ∈ vs, ∃ k: ℕ, ∀ i: ℕ, P (k + i) v) →
      (∃ k: ℕ, ∀ i: ℕ, ∀ v ∈ vs, P (k + i) v) := by

</code></pre>

<p>The proof of the theorem is not hard, but it is 60 LOC. So you can <a href="https://github.com/konnov/leanda/blob/a96eb677d7f514e8d6ac1cdd6970643f2488b442/epfd/Epfd/TemporalLemmas.lean#L22">check it
online</a>.</p>

<p>By using this theorem twice, we finally arrive at the final lemma:</p>

<pre><code class="language-lean">/--
  For a set of crashing processes `C` and a trace `tr`, show that if for every
  crashing process `q` and every correct process `p`, it holds that `p`
  eventually suspects `q` forever, then there is a common time point `k` such
  that all correct processes suspect all crashed processes forever.

  In temporal logic, `∀ q ∈ Crashed, ∀ p ∈ Correct, &lt;&gt;[](q ∈ suspected[p]!)`
  implies `&lt;&gt;[] ∀ q ∈ Crashed, ∀ p ∈ Correct, q ∈ suspected[p]!`.
  -/
lemma eventually_always_suspected_meet
    (tr: Trace Proc)
    (Crashed: Finset Proc)
    (h_suspected:
      ∀ q ∈ Crashed,
        ∀ p ∈ Finset.univ \ Crashed,
          eventually_q_is_always_suspected tr p q):
      ∃ k: ℕ,
        ∀ i: ℕ,
          ∀ q ∈ Crashed,
            ∀ p ∈ Finset.univ \ Crashed,
              q ∈ (tr (k + i)).s.suspected[p]! := by
  -- fix the set of correct processes
  let Correct := Finset.univ \ Crashed
  -- we have to bubble up `∃ k: ℕ` two times
  -- bubble up `∃ k: ℕ` the first time
  have bubble_once: (q: Proc) → (h_q_crashed: q ∈ Crashed) →
      ∃ k: ℕ, ∀ i: ℕ, ∀ p ∈ Correct, q ∈ (tr (k + i)).s.suspected[p]! := by
    intro q h_q_crashed
    specialize h_suspected q h_q_crashed
    let P (i: ℕ) (p: Proc) := q ∈ (tr i).s.suspected[p]!
    exact forall_FG_implies_FG_forall P Correct h_suspected
  -- the predicate `P` to use in the next instance of `forall_FG_implies_FG_forall`
  let P (i: ℕ) (q: Proc) :=
    ∀ p ∈ Correct, q ∈ (tr i).s.suspected[p]!
  -- bubble up `∃ k: ℕ` the second time
  exact forall_FG_implies_FG_forall P Crashed bubble_once

</code></pre>

<p>With this lemma, we finally prove <code>strong_completeness</code>:</p>

<pre><code class="language-lean">theorem strong_completeness
    (tr: Trace Proc)
    (h_is_fair_run: is_fair_run Proc InitDelay GST MsgDelay tr)
    (Crashed: Finset Proc)
    (h_is_crashing_set: is_crashing_set tr Crashed):
      ∃ k: ℕ, ∀ i: ℕ, ∀ p q: Proc,
        (p ∉ Crashed ∧ q ∈ Crashed) → q ∈ (tr (k + i)).s.suspected[p]! := by

</code></pre>

<p>The
<a href="https://github.com/konnov/leanda/blob/a96eb677d7f514e8d6ac1cdd6970643f2488b442/epfd/Epfd/PropositionalProofs.lean#L850-L851">proof</a>
is just a technical application of <code>eventually_crashes_implies_always_suspected</code>
and <code>eventually_always_suspected_meet</code>.  It is 40 LOC of unfolding definitions
and repacking them into the right format.</p>

<h1 id="conclusions">Conclusions</h1>

<p>This was probably the longest blog post I have ever written. It almost feels
like an academic paper. I don’t expect many people to read all of it. If you
have read the whole blog post and reached the conclusions, leave me a comment! I
really want to know whether anyone manages to read the whole write-up.</p>

<p><a name="end"></a></p>

<h2 id="footnotes">Footnotes</h2>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1">
      <p>Christian Cachin, Rachid Guerraoui, and Luís Rodrigues. Introduction to Reliable and Secure Distributed Programming. Second Edition, Springer, 2011, XIX, 320 pages <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Igor Konnov</name></author><category term="lean" /><summary type="html"><![CDATA[Author: Igor Konnov]]></summary></entry><entry><title type="html">Proving consistency of two-phase commit in Lean4</title><link href="https://protocols-made-fun.com/lean/2025/05/10/lean-two-phase-proofs.html" rel="alternate" type="text/html" title="Proving consistency of two-phase commit in Lean4" /><published>2025-05-10T00:00:00+00:00</published><updated>2025-05-10T00:00:00+00:00</updated><id>https://protocols-made-fun.com/lean/2025/05/10/lean-two-phase-proofs</id><content type="html" xml:base="https://protocols-made-fun.com/lean/2025/05/10/lean-two-phase-proofs.html"><![CDATA[<p><strong>Author:</strong> <a href="https://konnov.phd">Igor Konnov</a></p>

<p><strong>Tags:</strong> specification lean distributed proofs tlaplus</p>

<p>In the previous <a href="/lean/2025/04/25/lean-two-phase.html">blog post</a>, we discussed specification,
randomized simulation, and property-based testing of Two-phase commit in <a href="https://github.com/leanprover/lean4">Lean
4</a>. The obvious question is whether we can use Lean for what it was
designed for, namely, proving correctness of the protocol. Yes, we can!
Here is our proof plan:</p>

<picture>
  <source srcset="/img/two-phase-proof-schema.png" type="image/webp" />
  <img class="responsive-img" src="/img/two-phase-proof-schema.png" alt="Our proof schema" />
</picture>

<p>In short, I have managed to write full proofs of consistency in Lean 4, starting
with a functional specification. Except for a few tricky spots, it was clear how to
proceed, though interactive proofs are tedious. In total, it took me 29 hours to
write the proofs, excluding the time that was needed to read the Lean manuals.
Together with specification and simulation from the previous <a href="/lean/2025/04/25/lean-two-phase.html">blog
post</a>, the whole effort required 45 hours.</p>

<p>I believe the proofs went quickly because the inductive invariant was already
correct, since we had found it with the model checker <a href="https://apalache-mc.org/">Apalache</a>. In fact, I
could probably reduce the proof times even further if I focused on minimizing
the inductive invariant. If the invariant had not been correct, though, the
process likely would not have gone as smoothly.</p>

<p>Let us have a look at the statistics in the table below.</p>

<table>
  <thead>
    <tr>
      <th>Files</th>
      <th style="text-align: right">LOC (excluding comments &amp; whitespace)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><a href="https://github.com/konnov/leanda/blob/main/twophase/Twophase/Functional.lean">Functional.lean</a> + <a href="https://github.com/konnov/leanda/blob/main/twophase/Twophase/System.lean">System.lean</a></td>
      <td style="text-align: right">139</td>
    </tr>
    <tr>
      <td><a href="https://github.com/konnov/leanda/blob/main/twophase/Twophase/Propositional.lean">Propositional.lean</a></td>
      <td style="text-align: right">90</td>
    </tr>
    <tr>
      <td><a href="https://github.com/konnov/leanda/blob/main/twophase/Twophase/PropositionalProofs.lean">PropositionalProofs.lean</a></td>
      <td style="text-align: right">275</td>
    </tr>
    <tr>
      <td><a href="https://github.com/konnov/leanda/blob/main/twophase/Twophase/InductiveProofs.lean">InductiveProofs.lean</a></td>
      <td style="text-align: right">1077</td>
    </tr>
  </tbody>
</table>

<p>The ratio of proofs (propositional and inductive) to the system code
(propositional) is about 15. This fits the empirical ratio in software
verification, where proofs are 10–20 times longer than the source code.</p>

<p>In this blog post, we have explored a “traditional” path of interactive theorem
proving, though we have <a href="#3-finding-an-inductive-invariant">cut corners</a> by
finding the inductive invariant with the model checker.</p>

<p>Another route to explore is to prove equivalence between our specification in
<a href="https://github.com/konnov/leanda/blob/main/twophase/Twophase/Propositional.lean">Propositional.lean</a> and the <a href="https://github.com/verse-lab/veil/">Veil</a> specification. The Veil examples already
contain a <a href="https://github.com/verse-lab/veil/blob/main/Examples/IvyBench/TwoPhaseCommit.lean">version of two-phase commit</a>, though it is slightly
different from the <a href="two-phase-tla">two-phase commit in TLA<sup>+</sup></a> and our
specification in Lean. Perhaps this is a good topic for another exercise.</p>

<p>Certainly, this is not the first exercise in using interactive theorem provers
to verify the safety of distributed algorithms. To name a few examples, there were
larger-scale efforts such as <a href="https://dl.acm.org/doi/10.1145/2815400.2815428">IronFleet</a>, <a href="https://github.com/uwplse/verdi">Verdi</a>, <a href="https://ilyasergey.net/papers/disel-popl18.pdf">Disel</a>, and
<a href="https://github.com/verse-lab/bythos">Bythos</a>.</p>

<h2 id="table-of-contents">Table of contents</h2>

<ol>
  <li><a href="#1-what-to-prove">What to prove?</a></li>
  <li><a href="#2-connecting-functional-and-propositional-specs">Connecting functional and propositional specs</a></li>
  <li><a href="#3-finding-an-inductive-invariant">Finding an inductive invariant</a></li>
  <li><a href="#4-proving-the-inductive-step-in-lean-4">Proving the inductive step in Lean 4</a></li>
  <li><a href="#5-proving-consistency-with-the-inductive-invariant">Proving consistency with the inductive invariant</a></li>
  <li><a href="#6-proving-the-inductive-base">Proving the inductive base</a></li>
</ol>

<h2 id="1-what-to-prove">1. What to prove?</h2>

<p>The task does not seem simple, though. How do we approach it? Our goal is to
prove the consistency of the protocol. Fortunately, we started with the
<a href="https://github.com/tlaplus/Examples/blob/master/specifications/transaction_commit/TwoPhase.tla">specification in TLA+</a>, so we can stand on the shoulders of
giants and reuse the TLA<sup>+</sup> methodology. Here is how consistency is
specified in TLA<sup>+</sup>:</p>

<pre><code class="language-tlaplus">TCConsistent ==  
  (*************************************************************************)
  (* A state predicate asserting that two RMs have not arrived at          *)
  (* conflicting decisions.                                                *)
  (*************************************************************************)
  \A rm1, rm2 \in RM : ~ /\ rmState[rm1] = "aborted"
                         /\ rmState[rm2] = "committed"

</code></pre>

<p>This invariant looks quite similar in Lean:</p>

<pre><code class="language-lean">def consistency (s: ProtocolState RM) : Prop :=
  ∀ rm₁ rm₂: RM,
    s.rmState[rm₁]? ≠ some RMState.Committed ∨ s.rmState[rm₂]? ≠ some RMState.Aborted

</code></pre>

<p>By following the TLA<sup>+</sup> methodology, to show that $TCConsistent$ is an
invariant of Two-phase commit, it suffices to find a state predicate $IndInv$
and prove three properties:</p>

<ol>
  <li>
    <p>The initial states satisfy the invariant $IndInv$, that is, $Init \Rightarrow
IndInv$.</p>
  </li>
  <li>
    <p>The transition relation preserves the invariant $IndInv$, that is,
$Next \land IndInv \Rightarrow IndInv'$.</p>
  </li>
  <li>
    <p>The invariant $IndInv$ implies the state invariant $TCConsistent$, that is,
$IndInv \Rightarrow TCConsistent$.</p>
  </li>
</ol>

<p>The invariant $\mathit{IndInv}$ is called an <em>inductive invariant</em>, since it
allows us to reason about all states that are reachable from $\mathit{Init}$ via
$\mathit{Next}$ <em>by induction</em>. The interesting thing is that it is sufficient
to prove this principle only once and reuse it for all specifications. This is
why we simply use this approach without re-proving it every time. The cool thing
about Lean is that we can still re-prove this inductive principle if we want to.</p>
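
<p>Since Lean lets us state the principle generically, here is a self-contained
sketch of it for an abstract state type <code>S</code> (an illustration, not
code from the repository):</p>

<pre><code class="language-lean">/--
  The inductive-invariant principle: if `IndInv` holds initially, is
  preserved by `Next`, and implies `Safe`, then `Safe` holds at every
  point of every run of the system.
-/
theorem inv_principle {S : Type}
    (Init : S → Prop) (Next : S → S → Prop) (IndInv Safe : S → Prop)
    (h_base : ∀ s, Init s → IndInv s)
    (h_step : ∀ s s', IndInv s → Next s s' → IndInv s')
    (h_safe : ∀ s, IndInv s → Safe s)
    (tr : Nat → S)
    (h_init : Init (tr 0))
    (h_next : ∀ i, Next (tr i) (tr (i + 1))) :
    ∀ i, Safe (tr i) := by
  have h_inv : ∀ i, IndInv (tr i) := by
    intro i
    induction i with
    | zero =&gt; exact h_base _ h_init
    | succ j ih =&gt; exact h_step _ _ ih (h_next j)
  exact fun i =&gt; h_safe _ (h_inv i)
</code></pre>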

<h2 id="2-connecting-functional-and-propositional-specs">2. Connecting functional and propositional specs</h2>

<p>Now we have to understand what to use as the initial predicate $Init$ and the
transition relation $Next$. If you have read the <a href="/lean/2025/04/25/lean-two-phase.html">previous blog
post</a>, you remember that we had two kinds of specifications:</p>

<ul>
  <li>
    <p>A functional specification in <a href="https://github.com/konnov/leanda/blob/main/twophase/Twophase/Functional.lean">Functional.lean</a> and <a href="https://github.com/konnov/leanda/blob/main/twophase/Twophase/System.lean">System.lean</a>, and</p>
  </li>
  <li>
    <p>A propositional specification in <a href="https://github.com/konnov/leanda/blob/main/twophase/Twophase/Propositional.lean">Propositional.lean</a>.</p>
  </li>
</ul>

<p>Luckily, <a href="https://ilyasergey.net/">Ilya Sergey</a> warned me that writing proofs at the functional level
is hard, so I did it at the propositional level. To connect the functional spec
and the propositional spec, we prove two theorems:</p>

<pre><code class="language-lean">theorem tp_init_correct (all: List RM) (s: ProtocolState RM):
    tp_init all s ↔ init all = s := by

</code></pre>

<pre><code class="language-lean">theorem tp_next_correct (s: ProtocolState RM) (s': ProtocolState RM):
    tp_next s s' ↔ ∃ a: Action, next s a = some s' := by

</code></pre>

<p>You can find the complete proofs in <a href="https://github.com/konnov/leanda/blob/main/twophase/Twophase/PropositionalProofs.lean">PropositionalProofs.lean</a>. Interestingly,
it took me just 2.5 hours to write the equivalence proofs for all
seven actions. That was fast because, once I wrote three proofs, the remaining
four were completely generated by Copilot! As I expected, these proofs were
relatively easy to write.</p>

<p>For instance, here is the function <code>tmCommit</code> in the functional specification:</p>

<pre><code class="language-lean">/--
  The transaction manager commits the transaction. Enabled iff the TM is in its initial
  state and every RM has sent a `Prepared` message.
 -/
def tmCommit :=
    if s.tmState = TMState.Init &amp;&amp; s.tmPrepared = s.all then some {
        s with
        tmState := TMState.Committed,
        msgs := s.msgs ∪ { Message.Commit }
    } else none

</code></pre>

<p>And here is the propositional version <code>tm_commit</code>, which looks very much like an
action in TLA<sup>+</sup>:</p>

<pre><code class="language-lean">/-- The proposition version of `tmCommit`. -/
def tm_commit: Prop :=
    s.tmState = TMState.Init
  ∧ s.tmPrepared = s.all
  ∧ s'.tmState = TMState.Committed
  ∧ s'.msgs = s.msgs ∪ { Message.Commit }
  ∧ s'.tmPrepared = s.tmPrepared
  ∧ s'.rmState = s.rmState
  ∧ s'.all = s.all

</code></pre>

<p>The theorem <code>tm_commit_correct</code> connects the two.</p>

<pre><code class="language-lean">theorem tm_commit_correct (s: ProtocolState RM) (s': ProtocolState RM):
    tm_commit s s' ↔ tmCommit RM s = some s' := by
  apply Iff.intro
  case mp =&gt;
    intro hrel
    simp [tm_commit] at hrel
    rcases hrel with ⟨ h_tmState, h_tmPrepared, h_tmState', h_msgs',
      h_tmPrepared', h_rmState', h_all' ⟩
    simp [tmCommit, h_tmState, h_tmPrepared]
    apply ProtocolState.ext
    repeat simp [*]

  case mpr =&gt;
    intro heq
    simp [tmCommit] at heq
    rcases heq with ⟨ ⟨ h_tmState, h_tmPrepared ⟩, h_seq ⟩
    unfold tm_commit
    simp [h_tmState, h_tmPrepared]
    cases h_seq
    repeat simp [*]

</code></pre>

<p>If you are like me, it is hard to make sense of this proof by just staring at
it, in contrast to pen &amp; paper proofs. If you want to understand the proof,
download the spec and go over the proof line by line with the <a href="https://marketplace.visualstudio.com/items?itemName=leanprover.lean4">Lean plugin</a>.
Of course, you would have to understand how the proofs are organized in Lean.
The book on <a href="https://lean-lang.org/theorem_proving_in_lean4/title_page.html">Theorem Proving in Lean 4</a> explains this.</p>

<p>It took me one more hour to prove the theorem <code>tp_next_correct</code>. However, when I
turned to <code>tp_init_correct</code>, I got carried away trying to prove a statement that
was too difficult. The proof involved several inductive arguments about hash
maps, and I ended up spending four hours wrestling with a challenging fact. Once
I sorted that out, it took only 30 minutes to write a simpler and more effective
proof.</p>

<p>Basically, the entire set of refinement proofs could be completed in a single
day! The fact that Copilot was able to fill in four out of seven cases suggests
that these proofs could be generalized into a broader lemma. This is evident
from the structure of the proofs. I decided to leave them as they are for now,
but we should consider making them more compact for easier maintenance.</p>

<p>For the remainder of this post, we will use only the propositional specification.</p>

<h2 id="3-finding-an-inductive-invariant">3. Finding an inductive invariant</h2>

<p>To follow the proof methodology, we have to find $IndInv$. We could try to use
our goal invariant <code>consistency</code> as a candidate for the invariant. However,
safety properties rarely work as inductive invariants. Intuitively, an inductive
invariant should generalize all reachable states, and <code>consistency</code> is too weak
for that role.</p>
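
<p>To see the difference on a tiny example, consider a system over two
variables with $Init \equiv x = 0 \land y = 0$ and $Next \equiv x' = x + y
\land y' = y$. The predicate $x = 0$ holds in every reachable state, yet it is
not inductive: the unreachable state $(x, y) = (0, 1)$ satisfies it and steps
to $x' = 1$. The following self-contained snippet (an illustration only)
demonstrates the failed inductive step in Lean:</p>

<pre><code class="language-lean">-- `x = 0` is not preserved by the step relation on its own
example : ¬ (∀ x y x' y' : Nat,
    x = 0 → (x' = x + y ∧ y' = y) → x' = 0) := by
  intro h
  -- instantiate with the unreachable state (x, y) = (0, 1)
  have h1 := h 0 1 1 1 rfl ⟨rfl, rfl⟩
  exact absurd h1 (by decide)
</code></pre>

<p>Strengthening the candidate to $x = 0 \land y = 0$ makes it inductive. This
is exactly the kind of strengthening that turns <code>consistency</code> into an
inductive invariant.</p>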

<p>How do we find $IndInv$? One approach would be to start with <code>True</code> and try to
prove the three properties. Once we understand why <code>True</code> is not good enough,
we add constraints. Repeat. In theory, this approach could work. In practice, it is
too hard, as we have to write the proofs by hand. We may finish 90%
of a proof just to find that our candidate for $IndInv$ was not good enough.</p>

<p>We do not have to go the hard way. Instead, we can just use a model checker for
TLA<sup>+</sup> to quickly iterate on a candidate for an inductive invariant for
a small set of resource managers. This is exactly what we did for our <a href="https://2019.splashcon.org/details/splash-2019-oopsla/7/TLA-Model-Checking-Made-Symbolic">paper at
OOPSLA19</a> at some point in 2018. I guess it should be available in
the <a href="https://zenodo.org/records/3370071">artifact</a>. In the modern version of <a href="https://apalache-mc.org/">Apalache</a>, the
inductive invariant looks like this:</p>

<pre><code class="language-tlaplus">IndInv ==
    /\ TPTypeOK
    /\ TCConsistent 
    /\ (\E rm \in RM: rmState[rm] = "committed") =&gt; tmState = "committed"
    /\ tmState = "committed" =&gt; /\ tmPrepared = RM
                                /\ \A rm \in RM: rmState[rm] \notin {"working", "aborted"}
                                /\ MkCommit \in msgs
    /\ tmState = "aborted" =&gt; MkAbort \in msgs
    /\ \A rm \in RM:
      /\ rm \in tmPrepared =&gt;
        /\ rmState[rm] /= "working"
        /\ MkPrepared(rm) \in msgs
      /\ rmState[rm] = "working" =&gt; MkPrepared(rm) \notin msgs
      /\ MkPrepared(rm) \in msgs =&gt; rmState[rm] /= "working" 
      /\ rmState[rm] = "aborted" =&gt;
        \/ MkAbort \in msgs
        \/ MkPrepared(rm) \notin msgs
    /\ MkAbort \in msgs =&gt;
        \* it is either the TM or an RM who was in the "working" state
        \/ tmState = "aborted"
        \/ \E rm \in RM:
          /\ rmState[rm] = "aborted"
          /\ rm \notin tmPrepared
          /\ MkPrepared(rm) \notin msgs                 
    /\ MkCommit \in msgs =&gt;
        /\ tmPrepared = RM
        /\ \/ tmState = "committed"
           \/ \E rm \in RM: rmState[rm] = "committed" 

</code></pre>

<p>By running Apalache, we can make sure that our invariant is inductive for three
resource managers:</p>

<pre><code class="language-sh">$ apalache-mc check --length=0  --init=Init --inv=IndInv MC3_TwoPhaseTypedInv.tla
...
Total time: 1.282 sec
$ apalache-mc check --length=1 --init=IndInv --inv=IndInv MC3_TwoPhaseTypedInv.tla
...
The outcome is: NoError
Total time: 1.621 sec
$ apalache-mc check --length=0  --init=IndInv --inv=TCConsistent MC3_TwoPhaseTypedInv.tla
...
The outcome is: NoError
Total time: 1.342 sec
</code></pre>

<p>We can do the same for 7, 20, and even 50 resource managers. However, as we
increase the number of resource managers, the model checker takes longer to
verify the properties. For example, checking inductiveness for 20 resource
managers takes about 8 seconds, compared to just 2 seconds for 3 resource
managers.</p>

<p>To make sure that the model checker is not just printing <code>"NoError"</code> but also
doing something useful, we replace <code>"committed"</code> with <code>"aborted"</code> in the last
line of <code>IndInv</code>.  In this case, the model checker immediately gives us a
counterexample to induction.</p>

<pre><code class="language-sh">$ apalache-mc check --length=1 --inv=IndInv --init=IndInv MC3_TwoPhaseTypedInv.tla
...
Check the trace in: [...]/violation1.tla
State 1: state invariant 5 violated.
Total time: 1.760 sec
</code></pre>

<p>Basically, this is how we speed up the guesswork of finding an inductive
invariant with the model checker.</p>

<p>Interestingly, we can remove the line <code>/\ TCConsistent</code>, and Apalache does not
complain: this constraint is redundant. It is actually great that we removed it,
as writing the interactive proofs with this constraint in place would have taken
much longer.</p>

<p>We write <code>IndInv</code> as a conjunction of six lemmas in Lean, as it is easier to reason
about the invariant this way:</p>

<pre><code class="language-lean">-- Our inductive invariant that we use to prove the consistency property.
def invariant (s: ProtocolState RM) : Prop :=
  lemma1 s ∧ lemma2 s ∧ lemma3 s ∧ lemma4 s ∧ lemma5 s ∧ lemma6 s

</code></pre>

<p>You can find all six lemmas in <a href="https://github.com/konnov/leanda/blob/main/twophase/Twophase/InductiveProofs.lean">InductiveProofs.lean</a>. For example, here
is how we write <code>lemma1</code>:</p>

<pre><code class="language-lean">def lemma1 (s: ProtocolState RM): Prop :=
  (∃ rm: RM, s.rmState[rm]? = some RMState.Committed) → s.tmState = TMState.Committed

</code></pre>

<p>If you look carefully at <code>lemma5</code> and <code>lemma6</code> in <a href="https://github.com/konnov/leanda/blob/main/twophase/Twophase/InductiveProofs.lean">InductiveProofs.lean</a>, you
will notice that they also have redundancies. I missed that, and it resulted in
a lot of extra work when writing the proofs. It would take only a few seconds to
check with Apalache whether we could remove these subformulas. More on that
later.</p>

<h2 id="4-proving-the-inductive-step-in-lean-4">4. Proving the inductive step in Lean 4</h2>

<p>Now that we know our invariant is inductive for small sets of resource managers,
we have good chances of proving the inductive step of <code>invariant</code>. The
transition relation <code>tp_next</code> is a disjunction of seven smaller subformulas such
as <code>tm_commit</code> and <code>tm_abort</code>. Further, <code>invariant</code> is a conjunction of six
lemmas.</p>

<p>Thus, our plan is very simple: Prove inductiveness for each of the seven actions
and each of the six lemmas. This gives us 42 facts to prove. While this sounds
like a lot of work, the good news is that proving 42 smaller lemmas is easier
than proving one huge theorem.</p>

<p>Actually, the above decomposition is not hand-waving. Our theorem
<code>invariant_is_inductive</code> is proven exactly by this decomposition. Below you can
see the first six cases:</p>

<pre><code class="language-lean">/--
 Showing that `invariant` is inductive, that is, it is preserved by the transition relation.
-/
theorem invariant_is_inductive (s: ProtocolState RM) (s': ProtocolState RM)
  (h_all: ∀ rm: RM, rm ∈ s.all) (h_inv: invariant s) (h_next: tp_next s s'):
    invariant s' := by
  unfold tp_next at h_next
  cases h_next
  case inl h_tm_commit =&gt;
    -- action tm_commit
    unfold invariant
    -- prove the lemmas one by one
    apply And.intro
    . exact invariant_is_inductive_tm_commit_lemma1 s s' h_tm_commit
    . apply And.intro
      . exact invariant_is_inductive_tm_commit_lemma2 s s' h_all h_inv h_tm_commit
      . apply And.intro
        . apply invariant_is_inductive_tm_commit_lemma3 s s' h_tm_commit
        . apply And.intro
          . exact invariant_is_inductive_tm_commit_lemma4 s s' h_inv h_tm_commit
          . apply And.intro
            . exact invariant_is_inductive_tm_commit_lemma5 s s' h_inv h_tm_commit
            . exact invariant_is_inductive_tm_commit_lemma6 s s' h_tm_commit

  case inr h_rest =&gt;

</code></pre>

<p>(If you know how to write the above decomposition nicer in Lean, please <a href="#end">leave a
comment</a> below.)</p>
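<p>One candidate that I have not checked against the actual proofs: Lean&#8217;s
anonymous-constructor notation <code>⟨..⟩</code> builds the nested conjunction in
one step, replacing the ladder of <code>And.intro</code>. A hedged sketch:</p>

<pre><code class="language-lean">-- a sketch, not checked: the anonymous constructor flattens the nested ∧
case inl h_tm_commit =&gt;
  unfold invariant
  exact ⟨invariant_is_inductive_tm_commit_lemma1 s s' h_tm_commit,
    invariant_is_inductive_tm_commit_lemma2 s s' h_all h_inv h_tm_commit,
    invariant_is_inductive_tm_commit_lemma3 s s' h_tm_commit,
    invariant_is_inductive_tm_commit_lemma4 s s' h_inv h_tm_commit,
    invariant_is_inductive_tm_commit_lemma5 s s' h_inv h_tm_commit,
    invariant_is_inductive_tm_commit_lemma6 s s' h_tm_commit⟩
</code></pre>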

<p>Now we have to prove 42 lemmas by <strong>writing the proofs</strong>. For example, here is
the proof for the action <code>tm_commit</code> and <code>lemma1</code>:</p>

<pre><code class="language-lean">-- Effort: 10m
lemma invariant_is_inductive_tm_commit_lemma1 (s: ProtocolState RM) (s': ProtocolState RM)
  (h_tm_commit: tm_commit s s'):
    lemma1 s' := by
    unfold lemma1
    intro h_committed
    exact show s'.tmState = TMState.Committed by
      unfold tm_commit at h_tm_commit
      simp [h_tm_commit]

</code></pre>

<p>As you can see, this was an easy one. It took me just 10 minutes to write it.
Actually, closer to the end of the proof, I was writing similar proofs in 1-2
minutes. You can find the remaining 41 proofs in <a href="https://github.com/konnov/leanda/blob/main/twophase/Twophase/InductiveProofs.lean">InductiveProofs.lean</a>. The
plot below shows the total proof efforts:</p>

<picture>
  <source srcset="/img/two-phase-inductive-proof.png" type="image/webp" />
  <img class="responsive-img" src="/img/two-phase-inductive-proof.png" alt="Proof efforts for inductiveness" />
</picture>

<p>There are several things to notice:</p>

<ol>
  <li>Over half of the lemmas took me less than 15 minutes each.</li>
  <li>Ten lemmas took me from 20 to 30 minutes.</li>
  <li>Six lemmas took me about 1 hour.</li>
  <li>There is one outlier that required over three hours.</li>
</ol>

<p>I am not exactly sure what happened with the lemma that took me so long. It was
the second one I had to prove, so I might have spent a lot of time just figuring
out the right tactics to use. If you look at the proof of
<a href="https://github.com/konnov/leanda/blob/199b26cb022dfa05c3e7c1576384dcea8a0bd648/twophase/Twophase/InductiveProofs.lean#L394-L429"><code>invariant_is_inductive_tm_commit_lemma2</code></a>,
it involves reasoning with universal quantifiers and some small proofs by
contradiction. If I were to write it again now, it probably would not take
nearly as much time.</p>

<p>Something interesting happened when I had proved about 80% of the lemmas: I
got stuck. This is not really surprising, as I had reached the core argument that
required reasoning about one resource manager being in the <code>Aborted</code> state and
another resource manager receiving a <code>Commit</code> message. Basically, I was
struggling with an argument for the constraints in the commented-out section
below:</p>

<pre><code class="language-lean">def lemma5 (s: ProtocolState RM) : Prop :=
  Message.Abort ∈ s.msgs → s.tmState = TMState.Aborted
  /- Added when discovering the invariant with the model checker.
      It was redundant and complicated the proof.
  (s.tmState = TMState.Aborted
    ∨ ∃ rm: RM,
        s.rmState[rm]? = some RMState.Aborted
      ∧ rm ∉ s.tmPrepared
      ∧ Message.Prepared rm ∉ s.msgs)
  -/

</code></pre>

<p>The commented-out disjunction is not wrong: its second disjunct implies the
first one. However, the second disjunct is much harder to reason about. The
solution? Just remove it.</p>

<p>Once I had removed the redundant part, the proof was quick and easy:</p>

<pre><code class="language-lean">lemma invariant_is_inductive_rm_rcv_commit_msg_lemma5 (s: ProtocolState RM) (s': ProtocolState RM)
  (rm: RM) (h_inv: invariant s) (h_rm_rcv_commit_msg: rm_rcv_commit_msg s s' rm): lemma5 s' := by
    unfold lemma5
    unfold rm_rcv_commit_msg at h_rm_rcv_commit_msg
    have h_unchanged_msgs: s'.msgs = s.msgs := by simp [h_rm_rcv_commit_msg]
    have h_unchanged_tm_state: s'.tmState = s.tmState := by simp [h_rm_rcv_commit_msg]
    simp [h_unchanged_tm_state, h_unchanged_msgs]
    rcases h_inv with ⟨_, _, _, _, h_lemma5_s, _⟩
    unfold lemma5 at h_lemma5_s
    exact h_lemma5_s

</code></pre>

<p>A similar redundancy was present in <code>lemma6</code>, so I removed it, too. By
removing these two redundancies, I cut the proofs in half! Small differences in
proof goals can have big implications for proof complexity.</p>

<pre><code class="language-lean">def lemma6 (s: ProtocolState RM) : Prop :=
  Message.Commit ∈ s.msgs →
    s.tmPrepared = s.all ∧ s.tmState = TMState.Committed
    /- Added when discovering the invariant with the model checker.
        It was redundant and complicated the proof.
        (s.tmState = TMState.Committed
      ∨ ∃ rm: RM, s.rmState[rm]? = some RMState.Committed)
      -/

</code></pre>

<h2 id="5-proving-consistency-with-the-inductive-invariant">5. Proving consistency with the inductive invariant</h2>

<p>Once we are sure that <code>invariant</code> is inductive, it is easy to prove that it
implies <code>consistency</code>. It took me just 15 minutes to write the proof:</p>

<pre><code class="language-lean">-- Proving that the inductive invariant implies the consistency property.
-- Effort: 15m
theorem invariant_implies_consistency (s: ProtocolState RM) (h_inv: invariant s):
    consistency s := by
  unfold consistency
  intro rm₁ rm₂
  by_contra h_committed_and_aborted -- assume the opposite
  simp at h_committed_and_aborted
  rcases h_committed_and_aborted with ⟨h_rm1_committed, h_rm2_aborted⟩
  have h_ex_committed: ∃ rm: RM, s.rmState[rm]? = some RMState.Committed := by use rm₁
  unfold invariant at h_inv
  rcases h_inv with ⟨h_lemma1_s, h_lemma2_s, _⟩
  unfold lemma1 at h_lemma1_s
  unfold lemma2 at h_lemma2_s
  have h_tm_committed: s.tmState = TMState.Committed := by exact h_lemma1_s h_ex_committed
  simp [h_tm_committed] at h_lemma2_s
  rcases h_lemma2_s with ⟨_, _, h_no_working_or_aborted⟩
  specialize h_no_working_or_aborted rm₂
  rcases h_no_working_or_aborted with ⟨_, h_rm2_not_aborted⟩
  rw [h_rm2_aborted] at h_rm2_not_aborted
  exact h_rm2_not_aborted rfl

</code></pre>

<h2 id="6-proving-the-inductive-base">6. Proving the inductive base</h2>

<p>Finally, we have to show that the initial states also satisfy <code>invariant</code>. Here
is just the header of the theorem:</p>

<pre><code class="language-lean">theorem init_implies_invariant (all: List RM) (s: ProtocolState RM)
    (h_all: ∀ rm: RM, rm ∈ s.all) (h_init: tp_init all s): invariant s := by

</code></pre>

<p>Initially, I thought that the proof would not be harder than the proof of the
inductive step, since there are fewer conditions to prove. Essentially, we have
to prove six lemmas for the initial states. In total, I think it took me about
three hours to finish the proofs.</p>

<p>This proof had an unexpected complication. I had to show that the initialization
of the hash map <code>rmState</code> sets all resource managers to the state <code>Working</code>.
Here is the header of this lemma:</p>

<pre><code class="language-lean">-- show that the initialization predicate sets all the resource managers to `Working`
lemma init_rm_state_post (all: List RM) (s: ProtocolState RM)
    (h_init: tp_init all s):
    ∀ rm ∈ s.all, s.rmState.get? rm = some RMState.Working := by

</code></pre>

<p>I observed similar issues when going through Isabelle tutorials many years ago,
so I remember that sometimes one had to generalize the proof goal. This is
exactly what happened here. In the end, the proof required this additional
lemma:</p>

<pre><code class="language-lean">-- An additional lemma to reason about map initialization.
-- Figuring out that we need this lemma was probably the hardest part of the proof.
lemma init_rm_keys (rm: RM):
    ∀ all: List RM,
      ∀ hashmap: Std.HashMap RM RMState,
        (all.foldl (fun m rm' =&gt; m.insert rm' RMState.Working) hashmap)[rm]? =
          if rm ∈ all then some RMState.Working else hashmap[rm]? := by

</code></pre>

<p>Interestingly, this is the only proof that explicitly uses the <code>induction</code>
tactic. Even though we have been reasoning about an inductive invariant, we did
not need to go into induction. Coming from TLA<sup>+</sup>, this is interesting.
I had to use <code>foldl</code> to initialize the hash map, since Lean does not seem to
have convenient primitives such as the function constructor.  In
TLA<sup>+</sup>, we would just use:</p>

<pre><code class="language-tlaplus">  [ rm ∈ RM |-&gt; "working"]
</code></pre>

<p>The function constructor does not introduce explicit iteration and actually has
the semantics of what I had to prove with the lemma <code>init_rm_state_post</code>.  (To
be precise, it also specifies the function domain.) Perhaps we could introduce
higher-level primitives in Lean 4 to deal with this.  If you know of a better
alternative to using <code>HashMap</code>, please let me know in the
<a href="#end">comments</a>.</p>
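<p>To make the idea concrete, here is a hypothetical helper (my own sketch, not
taken from the repository) that mimics the TLA<sup>+</sup> function constructor
by hiding the fold behind a reusable definition:</p>

<pre><code class="language-lean">-- hypothetical sketch: the analogue of [ k ∈ keys |-&gt; v ] in TLA+,
-- assigning the same value `v` to every key in `keys`
def constMap {K V : Type} [DecidableEq K] [Hashable K]
    (keys: List K) (v: V): Std.HashMap K V :=
  keys.foldl (fun m k =&gt; m.insert k v) ∅
</code></pre>

<p>One could then prove a lemma like <code>(constMap keys v)[k]? = if k ∈ keys
then some v else none</code> once and for all, instead of re-proving
<code>init_rm_keys</code> for each specification.</p>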

<p><a name="end"></a></p>]]></content><author><name>{&quot;igor&quot;=&gt;{&quot;name&quot;=&gt;&quot;Igor Konnov&quot;, &quot;url&quot;=&gt;&quot;https://konnov.phd&quot;, &quot;email&quot;=&gt;&quot;igor@konnov.phd&quot;}}</name></author><category term="lean" /><summary type="html"><![CDATA[Author: Igor Konnov]]></summary></entry><entry><title type="html">Specifying and simulating two-phase commit in Lean4</title><link href="https://protocols-made-fun.com/lean/2025/04/25/lean-two-phase.html" rel="alternate" type="text/html" title="Specifying and simulating two-phase commit in Lean4" /><published>2025-04-25T00:00:00+00:00</published><updated>2025-04-25T00:00:00+00:00</updated><id>https://protocols-made-fun.com/lean/2025/04/25/lean-two-phase</id><content type="html" xml:base="https://protocols-made-fun.com/lean/2025/04/25/lean-two-phase.html"><![CDATA[<p><strong>Author:</strong> <a href="https://konnov.phd">Igor Konnov</a></p>

<p><strong>Reviewers:</strong> <a href="https://thpani.net/">Thomas Pani</a></p>

<p><strong>Tags:</strong> specification lean distributed simulation pbt tlaplus</p>

<p>More and more people mention the <a href="https://github.com/leanprover/lean4">Lean</a> theorem prover in my bubble. Just the
last week <a href="https://ilyasergey.net/">Ilya Sergey</a> and co announced <a href="https://github.com/verse-lab/veil/">Veil</a>, an <a href="https://kenmcmil.github.io/ivy">IVy</a>-like
verification framework on top of Lean. Luckily, I heard a long tutorial on Lean
by <a href="https://sebasti.a.nullri.ch/">Sebastian Ullrich</a> and <a href="https://www.joachim-breitner.de/">Joachim Breitner</a> at <a href="https://www.soundandcomplete.org/vstte2024.html">VSTTE24</a>. The
interesting thing about Lean is that it’s not only a theorem prover, but also a
decent <a href="https://lean-lang.org/functional_programming_in_lean/title.html">programming language</a>.</p>

<p>So, I have decided to take on a case study of specifying a relatively simple yet
interesting distributed protocol in Lean. For this reason, I did not choose a
more complex consensus algorithm and instead settled on two-phase commit, for
the following reasons:</p>

<ul>
  <li>
    <p>It is an interesting distributed protocol with applications in databases.</p>
  </li>
  <li>
    <p>It has been <a href="https://github.com/tlaplus/Examples/blob/master/specifications/transaction_commit/TwoPhase.tla">specified in TLA<sup>+</sup></a> by Leslie Lamport,
 so we do not have to think about choosing the right level of abstraction.  We
 simply follow the same level of abstraction as in the TLA<sup>+</sup>
 specification.</p>
  </li>
</ul>

<p>It took me about <em>three hours</em> to translate the TLA<sup>+</sup> specification
into a <a href="https://github.com/konnov/leanda/tree/main/twophase/Twophase">Lean spec</a>, occasionally debating syntax and best
practices with ChatGPT and Copilot along the way. See the <a href="#22-functional-specification-in-lean">Functional
specification</a> and <a href="#23-system-level-specification-in-lean">System-level specification</a>. I had to
invent several patterns on the way, as specifying distributed algorithms in Lean
looks like terra incognita :dragon: (when compared with TLA<sup>+</sup>).
Importantly, I tried to stay close to the original TLA<sup>+</sup> specification
in spirit, but not in the dogmas. My goal was to specify the protocol in a way
that looks natural in Lean, instead of literally replicating the TLA<sup>+</sup>
idioms. In addition to that, I wanted the specification to be executable,
since I was not sure about writing complete proofs of correctness. When I was
writing this blog post, I realized that it is also possible to write a
<a href="#5-propositional-specification-in-lean">Propositional specification</a>.  This specification looks even closer
to the original TLA<sup>+</sup> specification, and also much more “natural”, if
you are really into TLA<sup>+</sup>, even though this version is not executable.</p>

<p>What is intriguing is that the resulting specification had the necessary
ingredients to be <strong>executable</strong>, since the Lean tools can compile a large
subset of the language into C. Obviously, it does not automatically generate a
distributed implementation of our distributed protocol. As we know from
<a href="https://konnov.phd/quint">Quint</a>, it is quite useful to have a randomized simulator, especially, if
there are no other automatic analysis tools around.</p>

<p>As the next step, I wrote a very simple randomized simulator to check the
properties against the specification — see <a href="#3-randomised-simulator-in-lean">Randomized
simulation</a>. It turned out to be even easier to implement in Lean than
I expected. After about <em>four hours</em> of work, I had the simulator running and
producing counterexamples to the properties that are expected to be violated.
Maybe this is because I knew exactly what was needed, having written the Quint
simulator a couple of years ago. Or maybe it is because Lean does not shy away
from the occasional need for imperative code. Of course, this is all done
through <a href="https://lean-lang.org/functional_programming_in_lean/monads.html">“monads”</a>, but they are relatively easy to use in Lean
— even if you are not quite ready to buy into the FP propaganda.
As a bonus point, <a href="#34-our-simulator-is-really-fast">this simulator is really fast</a>.</p>

<p>Of course, if you have been reading the <a href="https://news.ycombinator.com/">orange website</a>, you would tell me
that writing a simulator by hand is not the way to go. Instead, we should use
<a href="https://en.wikipedia.org/wiki/Software_testing#Property_testing">property-based testing</a> (PBT). Well, after the experiments with the
simulator, this is exactly what I did with <a href="https://github.com/leanprover-community/plausible">Plausible</a>. See <a href="#4-property-based-testing-in-lean">Property-based
testing</a>. To my disappointment, PBT took me about <em>six hours</em> of
work, delivering less impressive results in comparison to the simulator. I
wasted so much time figuring out the generators and trying to define my own
instances of <code>Sampleable</code> and <code>Shrinkable</code>, that I needed something <code>Drinkeable</code>
after that.  This was a bit unexpected for me. True, it was my first time using
PBT in Lean — though not my first time with PBT in general, since I had
prior experience using <a href="https://scalacheck.org/">ScalaCheck</a> in Scala for <a href="https://apalache-mc.org/">Apalache</a>. While PBT might
be the way to go in the long run, at the moment, I find Plausible a bit too hard
to use. To be fair, ScalaCheck was not trivial to figure out either. For some
reason, the PBT frameworks assume that everybody knows Haskell and
<a href="https://hackage.haskell.org/package/QuickCheck">QuickCheck</a>.</p>

<p>After having played with the simulator and the Plausible tests, I was quite
satisfied with my spec. (My experiments with more advanced language features did
not lead to serious improvements in the spec.) Obviously, the next question is
whether we want to prove that the protocol satisfies its state invariants.  At
the high-level, it is clear how to proceed: Discover an inductive invariant,
prove its inductiveness and show that the inductive invariant implies the state
invariants of the protocol. I must have an inductive invariant for the
TLA<sup>+</sup> spec of two-phase commit lying somewhere. Even if the inductive
invariant is lost, I am pretty sure that it is relatively easy to find it again
by iteratively running <a href="https://apalache-mc.org/">Apalache</a>. We have seen that this is doable
with <a href="/specification/modelchecking/tlaplus/apalache/2024/11/03/ben-or.html">Ben-Or’s Byzantine consensus</a>. Even <a href="https://github.com/tlaplus/tlaplus">TLC</a> should work for the two-phase
commit. At the lower levels, it is hard to predict how much effort it would take
to write complete proofs with Lean tactics. This is definitely something to do
in another sprint.</p>

<p>To sum it up, I quite liked the experience with Lean as a programming and
specification language. I will definitely add it to my set of available tools.
If someone wants to write executable protocol specs in Lean, say, instead of
TLA<sup>+</sup>, Quint, or Python, I would be <a href="/contact/">happy to help</a>.</p>

<p>The rest of the blogpost goes into details. If you are interested, keep reading.
If not, you are free to stop here.</p>

<h1 id="table-of-contents">Table of Contents</h1>

<ol>
  <li><a href="#1-a-brief-intro-to-two-phase-commit-and-the-tla-specification">A brief intro to Two-phase commit and the TLA<sup>+</sup> specification</a></li>
  <li><a href="#2-specification-in-lean">Specification in Lean</a>
    <ol>
      <li><a href="#21-data-structures">Data structures</a></li>
      <li><a href="#22-functional-specification-in-lean">Functional specification in Lean</a></li>
      <li><a href="#23-system-level-specification-in-lean">System-level specification in Lean</a></li>
    </ol>
  </li>
  <li><a href="#3-randomised-simulator-in-lean">Randomised simulator in Lean</a>
    <ol>
      <li><a href="#31-why">Why?</a></li>
      <li><a href="#32-what">What?</a></li>
      <li><a href="#33-how">How?</a></li>
      <li><a href="#34-our-simulator-is-really-fast">Our simulator is really fast!</a></li>
    </ol>
  </li>
  <li><a href="#4-property-based-testing-in-lean">Property-based testing in Lean</a></li>
  <li><a href="#5-propositional-specification-in-lean">Propositional specification in Lean</a></li>
</ol>

<h2 id="1-a-brief-intro-to-two-phase-commit-and-the-tla-specification">1. A brief intro to Two-phase commit and the TLA<sup>+</sup> specification</h2>

<p>I am giving only a very brief introduction to two-phase commit. It is quite a
famous protocol. <a href="https://www.microsoft.com/en-us/research/publication/consensus-on-transaction-commit/">Gray &amp; Lamport’04</a> (Sec. 3) explain the
protocol nicely and by using the same vocabulary, as in the TLA<sup>+</sup>
specification. Moreover, Leslie Lamport explains the protocol idea and the
specification in his <a href="https://youtu.be/U4mlGqXjtoA?t=117">Lecture 6</a> of the TLA<sup>+</sup> Course
on YouTube.</p>

<p>Here is the elevator pitch of two-phase commit: One transaction manager and $N$
resource managers have to agree on whether to commit or abort a transaction,
e.g., a database transaction. If they all decide to commit the transaction,
including the transaction manager, they all commit it. If at least one of them
decides to abort, all of them should abort the transaction.</p>

<p>The sequence diagram below shows the happy path: Everyone decides to commit
the transaction, and everybody does so:</p>

<picture>
  <source srcset="/img/two-phase-commit.png" type="image/webp" />
  <img class="responsive-img" src="/img/two-phase-commit.png" alt="A happy path where TM commits" />
</picture>

<p>The following diagram shows an unhappy scenario, when the transaction manager
declines the transaction, even though all resource managers were ready to commit:

<picture>
  <source srcset="/img/two-phase-abort.png" type="image/webp" />
  <img class="responsive-img" src="/img/two-phase-abort.png" alt="An unhappy path where TM aborts" />
</picture>

<p><strong>The TLA<sup>+</sup> specification.</strong> Let’s have a quick look at the
TLA<sup>+</sup> <a href="https://github.com/tlaplus/Examples/blob/master/specifications/transaction_commit/TwoPhase.tla">specification</a> of the two-phase commit. If you
don’t know TLA<sup>+</sup>, you can skip this part. The whole specification is
about 140 lines long. As usual, it starts with the declaration of the constants
and state variables:</p>

<pre><code class="language-tlaplus">CONSTANT RM \* The set of resource managers

VARIABLES
  rmState,       \* $rmState[rm]$ is the state of resource manager RM.
  tmState,       \* The state of the transaction manager.
  tmPrepared,    \* The set of RMs from which the TM has received $"Prepared"$
                 \* messages.
  msgs           
    (***********************************************************************)
    (* In the protocol, processes communicate with one another by sending  *)
    (* messages.  Since we are specifying only safety, a process is not    *)
    (* required to receive a message, so there is no need to model message *)
    (* loss.  (There's no difference between a process not being able to   *)
    (* receive a message because the message was lost and a process simply *)
    (* ignoring the message.)  We therefore represent message passing with *)
    (* a variable $msgs$ whose value is the set of all messages that have  *)
    (* been sent.  Messages are never removed from $msgs$.  An action      *)
    (* that, in an implementation, would be enabled by the receipt of a    *)
    (* certain message is here enabled by the existence of that message in *)
    (* $msgs$.  (Receipt of the same message twice is therefore allowed;   *)
    (* but in this particular protocol, receiving a message for the second *)
    (* time has no effect.)                                                *)

</code></pre>

<p>We have one specification parameter <code>RM</code> that fixes the set of the resource
managers. Further, we have four state variables: <code>rmState</code>, <code>tmState</code>,
<code>tmPrepared</code>, and <code>msgs</code>.</p>

<p>Since the classic TLA<sup>+</sup> is untyped, the specification comes with a
special predicate that captures the possible values that the state variables could
have:</p>

<pre><code class="language-tlaplus">Message ==
  (*************************************************************************)
  (* The set of all possible messages.  Messages of type $"Prepared"$ are  *)
  (* sent from the RM indicated by the message's $rm$ field to the TM\@.   *)
  (* Messages of type $"Commit"$ and $"Abort"$ are broadcast by the TM, to *)
  (* be received by all RMs.  The set $msgs$ contains just a single copy   *)
  (* of such a message.                                                    *)
  (*************************************************************************)
  [type : {"Prepared"}, rm : RM]  \cup  [type : {"Commit", "Abort"}]
   
TPTypeOK ==  
  (*************************************************************************)
  (* The type-correctness invariant                                        *)
  (*************************************************************************)
  /\ rmState \in [RM -&gt; {"working", "prepared", "committed", "aborted"}]
  /\ tmState \in {"init", "committed", "aborted"}
  /\ tmPrepared \subseteq RM
  /\ msgs \subseteq Message

</code></pre>

<p>There is also a <a href="https://github.com/apalache-mc/apalache/blob/main/test/tla/TwoPhaseTyped.tla">typed version</a> for the Apalache model checker,
as it needs types. If you are interested, go and check it.</p>

<p>As is typical, the specification contains a series of actions, which prescribe
the behavior of the resource managers and the behavior of the transaction
manager. For example, <code>RMPrepare(rm)</code> prescribes how a resource manager <code>rm</code>
transitions from its state <code>"working"</code> to the state <code>"prepared"</code>, while sending
the message <code>[ type |-&gt; "Prepared", rm |-&gt; rm ]</code>:</p>

<pre><code class="language-tlaplus">RMPrepare(rm) == 
  (*************************************************************************)
  (* Resource manager $rm$ prepares.                                       *)
  (*************************************************************************)
  /\ rmState[rm] = "working"
  /\ rmState' = [rmState EXCEPT ![rm] = "prepared"]
  /\ msgs' = msgs \cup {[type |-&gt; "Prepared", rm |-&gt; rm]}
  /\ UNCHANGED &lt;&lt;tmState, tmPrepared&gt;&gt;

</code></pre>

<p>The transaction manager receives that message from <code>rm</code> in the action
<code>TMRcvPrepared(rm)</code>:</p>

<pre><code class="language-tlaplus">TMRcvPrepared(rm) ==
  (*************************************************************************)
  (* The TM receives a $"Prepared"$ message from resource manager $rm$.    *)
  (*************************************************************************)
  /\ tmState = "init"
  /\ [type |-&gt; "Prepared", rm |-&gt; rm] \in msgs
  /\ tmPrepared' = tmPrepared \cup {rm}
  /\ UNCHANGED &lt;&lt;rmState, tmState, msgs&gt;&gt;

</code></pre>

<p>The individual actions such as <code>RMPrepare(rm)</code> and <code>TMRcvPrepared(rm)</code> are put
together with the next-state predicate:</p>

<pre><code class="language-tlaplus">TPNext ==
  \/ TMCommit \/ TMAbort
  \/ \E rm \in RM : 
       TMRcvPrepared(rm) \/ RMPrepare(rm) \/ RMChooseToAbort(rm)
         \/ RMRcvCommitMsg(rm) \/ RMRcvAbortMsg(rm)

</code></pre>
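<p>Looking ahead (a sketch of mine, not the literal repository code), the same
next-state disjunction can be phrased in Lean as a proposition over a pair of
states, in the spirit of the propositional specification discussed later. The
names <code>tm_commit</code>, <code>tm_abort</code>, and <code>rm_rcv_commit_msg</code>
appear in the proofs above; the remaining action names are my guesses:</p>

<pre><code class="language-lean">-- hypothetical sketch of the next-state relation, mirroring TPNext
def tp_next (s s': ProtocolState RM): Prop :=
  tm_commit s s' ∨ tm_abort s s'
  ∨ ∃ rm: RM,
      tm_rcv_prepared s s' rm ∨ rm_prepare s s' rm ∨ rm_choose_to_abort s s' rm
      ∨ rm_rcv_commit_msg s s' rm ∨ rm_rcv_abort_msg s s' rm
</code></pre>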

<h2 id="2-specification-in-lean">2. Specification in Lean</h2>

<p>You can find the whole <a href="https://github.com/konnov/leanda/tree/main/twophase/Twophase">specification in Lean</a> on GitHub. I am
presenting it in small pieces, to demonstrate the decisions that had to be made.</p>

<h2 id="21-data-structures">2.1. Data structures</h2>

<p>Let’s start with the data structures. Since Lean is typed, we have to understand
how to represent the parameter <code>RM</code> and the state variables <code>rmState</code>,
<code>tmState</code>, <code>tmPrepared</code>, and <code>msgs</code>.</p>

<p>When it comes to the parameter <code>RM</code>, which was declared with <code>CONSTANT RM</code> in
TLA<sup>+</sup>, we simply declare <code>RM</code> to be a type variable:</p>

<pre><code class="language-lean">-- The abstract type of resource managers.
variable (RM : Type) [DecidableEq RM] [Hashable RM] [Repr RM]

</code></pre>

<p>Since Lean does not make default assumptions about types, we also specify
what is required from the type <code>RM</code>:</p>

<ul>
  <li><code>RM</code> must have decidable equality, that is, we should be able to check <code>rm1 = rm2</code>
 for two instances <code>rm1: RM</code> and <code>rm2: RM</code>. We have <code>[DecidableEq RM]</code>.</li>
  <li><code>RM</code> must be usable as a key in a hash table, that is, we have <code>[Hashable RM]</code>.</li>
  <li><code>RM</code> must be convertible to a string, that is, we have <code>[Repr RM]</code>.</li>
</ul>
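<p>For example (my own illustration, not part of the specification), a concrete
type with three resource managers satisfies all three requirements simply by
deriving the type classes:</p>

<pre><code class="language-lean">-- hypothetical example: a concrete type of resource managers
inductive ThreeRMs where
    | rm1
    | rm2
    | rm3
    deriving DecidableEq, Hashable, Repr
</code></pre>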

<p>Now we have to understand how to deal with the types of the resource manager
state and the transaction manager state, which are simply written in
TLA<sup>+</sup> as <code>"init"</code>, <code>"working"</code>, <code>"prepared"</code>, etc. To this end, we
declare two types <code>RMState</code> and <code>TMState</code>, which are similar to <code>enum</code> types,
e.g., in Rust:</p>

<pre><code class="language-lean">/-- A state of a resource manager. -/
inductive RMState where
    | Working
    | Prepared
    | Committed
    | Aborted
    deriving DecidableEq, Repr

/-- A state of the transaction manager. -/
inductive TMState where
    | Init
    | Committed
    | Aborted
    deriving DecidableEq, Repr

</code></pre>

<p>Further, we should understand the type of messages, which are written in
TLA<sup>+</sup> like <code>[ type |-&gt; "Prepared", rm |-&gt; rm ]</code>. Again, Lean’s
inductive types fit in very nicely:</p>

<pre><code class="language-lean">/-- A message that sent by either the transaction manager or a resource manager. -/
inductive Message where
    | Commit
    | Abort
    | Prepared(rm: RM)
    deriving DecidableEq, Repr

</code></pre>

<p>Finally, how shall we represent the state variables <code>rmState</code>, <code>tmState</code>,
<code>tmPrepared</code>, and <code>msgs</code>? To this end, we simply declare the structure
<code>ProtocolState</code>:</p>

<pre><code class="language-lean">/-- A state of the Two-phase commit protocol. -/
structure ProtocolState where
    -- The set of the resource managers.
    -- It may differ from run to run, but remains constant during a run.
    all: Finset RM
    rmState: Std.HashMap RM RMState
    tmState: TMState
    tmPrepared: Finset RM
    msgs: Finset (Message RM)

</code></pre>

<p>As you can see, <code>ProtocolState</code> is the place where we had to make a number of
decisions:</p>

<ul>
  <li>
    <p>The state carries the set <code>all</code> of all resource managers. The transaction
 manager needs this set to decide whether it has received messages from all
 resource managers; see below. Perhaps there is some Lean magic that would do
 that automatically.</p>
  </li>
  <li>
    <p><code>rmState</code> is a <em>hash map</em> from the resource managers to <code>RMState</code>.
Recall that it was simply defined as a function in TLA<sup>+</sup>.</p>
  </li>
  <li>
    <p><code>tmPrepared</code> is a <em>finite set</em> of resource managers.</p>
  </li>
  <li>
    <p><code>msgs</code> is a <em>finite set</em> of messages.</p>
  </li>
</ul>

<div class="tip-box">
    <div class="tip-header"><strong>💡 Tip:</strong></div>
    <div class="tip"><p>Deciding on the field types of <code>ProtocolState</code>
affects the rest of the specification. We could use <code>AssocList</code> or <code>RBMap</code>
instead of <code>HashMap</code>, or <code>HashSet</code> instead of <code>Finset</code>. In the case of sets,
I’ve settled on <code>Finset</code>, to avoid leaking the abstraction. However, in the case of
<code>rmState</code>, I found <code>HashMap</code> a bit more convenient. In any case, it would be
good to avoid the implementation details at this level.</p>
</div>
</div>
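<p>To make this trade-off concrete, here is a hypothetical alternative layout (a
sketch of mine, not the one used in the repository) that stays closer to the
TLA<sup>+</sup> types, at the cost of executability, by assuming Mathlib-style
<code>Set</code>s and a total function for <code>rmState</code>:</p>

<pre><code class="language-lean">-- hypothetical sketch: a state that mirrors the TLA+ types;
-- `Set α` is `α → Prop`, so this version is not executable
structure ProtocolStateFn (RM : Type) where
    rmState: RM → RMState
    tmState: TMState
    tmPrepared: Set RM
    msgs: Set (Message RM)
</code></pre>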

<h3 id="22-functional-specification-in-lean">2.2. Functional specification in Lean</h3>

<p>Now that we have chosen our data structures, we can specify the behavior of the
two-phase commit. We do this in the module <a href="https://github.com/konnov/leanda/blob/main/twophase/Twophase/Functional.lean">Functional.lean</a>. For example,
this is how we specify the behavior of the transaction manager on receiving the
message <code>Prepared rm</code>, for a resource manager <code>rm</code>:</p>

<pre><code class="language-lean">/-- The transaction manager receives a `Prepared` message from a resource manager `rm`. -/
def tmRcvPrepared (rm: RM) :=
    if s.tmState = TMState.Init &amp;&amp; Message.Prepared rm ∈ s.msgs then some {
        s with tmPrepared := s.tmPrepared ∪ { rm },
    } else none

</code></pre>

<p>The above definition is very simple, but let’s go over it, just in case:</p>

<ul>
  <li>
    <p>We check whether the input state <code>s</code> (more on that below) has the field
 <code>tmState</code> set to <code>TMState.Init</code>.</p>
  </li>
  <li>
    <p>Further, we check whether the set of messages contains the message
 <code>Message.Prepared rm</code>. The operator $x ∈ S$ is set membership, which holds
 if and only if the set $S$ contains $x$ as an element.</p>
  </li>
  <li>
    <p>If both of the above conditions hold, we produce a new state
 that is like the state <code>s</code>, except that its field <code>tmPrepared</code> is set to <code>s.tmPrepared ∪
 { rm }</code>, that is, the set <code>s.tmPrepared</code> with the value <code>rm</code> added to it. Note
 that we return “some” value in this case, using the constructor <code>some</code> of the
 option type <code>Option ProtocolState</code>.</p>
  </li>
  <li>
    <p>If at least one of the conditions does not hold, we return the value
 <code>none</code>, to indicate that the action is not enabled for <code>rm</code> in
 this state.</p>
  </li>
</ul>
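
<p>The same guard-then-update pattern can be seen in isolation on a toy example
(the names <code>Counter</code> and <code>bump</code> are made up for illustration):</p>

<pre><code class="language-lean">-- A toy illustration of the guard-then-update pattern: the "action" `bump`
-- is enabled only when the guard holds, and returns `none` otherwise.
structure Counter where
  value: Nat
  deriving Repr

def bump (limit: Nat) (c: Counter): Option Counter :=
  if c.value &lt; limit
  then some { c with value := c.value + 1 }  -- enabled: produce the next state
  else none                                  -- disabled: no next state

#eval bump 3 { value := 2 }  -- some { value := 3 }
#eval bump 3 { value := 3 }  -- none
</code></pre>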

<p>Where does the parameter <code>s</code> come from? It is not declared in <code>tmRcvPrepared</code> at
all. This is an interesting feature of Lean. The parameter <code>s</code> is automatically
added to <code>tmRcvPrepared</code>. To achieve this, we wrap the
definitions in the section <code>defs</code> and declare <code>s</code> as a section variable there:</p>

<pre><code class="language-lean">section defs
-- The state `s` is a state of the protocol, explicitly added to all the functions.
variable (s: ProtocolState RM)

def tmRcvPrepared (rm: RM) :=
  ...

def tmCommit :=
  ...

...
end defs
</code></pre>

<p>Here is another example of <code>rmPrepare</code>:</p>

<pre><code class="language-lean">/-- Resource manager `rm` prepares. -/
def rmPrepare (rm: RM) :=
    if s.rmState.get? rm = RMState.Working then some {
        s with
        rmState := s.rmState.insert rm RMState.Prepared,
        msgs := s.msgs ∪ { Message.Prepared rm }
    } else none

</code></pre>

<p>Check the remaining definitions in <a href="https://github.com/konnov/leanda/blob/main/twophase/Twophase/Functional.lean">Functional.lean</a>.</p>

<div class="tip-box">
    <div class="tip-header"><strong>💡 Tip:</strong></div>
    <div class="tip"><p>We could translate the actions to more closely
match the original specification.</p>
</div>
</div>

<p>If you’ve written TLA<sup>+</sup> in the past, you probably expected the
definition of <code>tmRcvPrepared</code> to look like this:</p>

<pre><code class="language-lean">/-- The proposition version of `tmRcvPrepared`. -/
def tm_rcv_prepared (rm: RM): Prop :=
    s.tmState = TMState.Init
  ∧ Message.Prepared rm ∈ s.msgs
  ∧ s'.tmPrepared = s.tmPrepared ∪ { rm }

</code></pre>

<p>See the discussion about these two definitions in the <a href="#5-propositional-specification-in-lean">Propositional
specification</a>.</p>

<div class="tip-box">
    <div class="tip-header"><strong>💡 Tip:</strong></div>
    <div class="tip"><p>If you looked at Quint, our functional definitions
in Lean are quite similar to the <code>pure def</code> definitions of Quint.</p>
</div>
</div>

<h3 id="23-system-level-specification-in-lean">2.3. System-level specification in Lean</h3>

<p>Now we have the functional definitions of the resource managers and the
transaction manager. How do we put these things together to capture the
behavior of the distributed system? We do this in the module <a href="https://github.com/konnov/leanda/blob/main/twophase/Twophase/System.lean">System.lean</a>.</p>

<p>Recall that the TLA<sup>+</sup> specification does this via the predicates
<code>TPInit</code> and <code>TPNext</code>:</p>

<pre><code class="language-tlaplus">TPInit ==   
  (*************************************************************************)
  (* The initial predicate.                                                *)
  (*************************************************************************)
  /\ rmState = [rm \in RM |-&gt; "working"]
  /\ tmState = "init"
  /\ tmPrepared   = {}
  /\ msgs = {}

</code></pre>

<pre><code class="language-tlaplus">TPNext ==
  \/ TMCommit \/ TMAbort
  \/ \E rm \in RM : 
       TMRcvPrepared(rm) \/ RMPrepare(rm) \/ RMChooseToAbort(rm)
         \/ RMRcvCommitMsg(rm) \/ RMRcvAbortMsg(rm)

</code></pre>

<p>Initializing the system does not look hard. We do it like this:</p>

<pre><code class="language-lean">/-- initialize the state of all resource managers to `Working` -/
def init_rm_state (all: List RM) :=
  all.foldl
    (fun m rm =&gt; m.insert rm RMState.Working)
    (Std.HashMap.emptyWithCapacity 0)

/-- The initial state of the protocol -/
def init (all: List RM): ProtocolState RM := {
    all := all.toFinset,
    rmState := init_rm_state all,
    tmState := TMState.Init,
    tmPrepared := ∅,
    msgs := ∅
}

</code></pre>

<p>The definition of <code>init_rm_state</code> definitely looks less elegant than the
function constructor in TLA<sup>+</sup>, but it does its job: We iterate over
the resource managers <code>rm</code> and add pairs <code>(rm, RMState.Working)</code> to the hash
map.</p>
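
<p>As a side note, the same map can probably be built in one line (a sketch,
assuming that <code>Std.HashMap.ofList</code> is available in your toolchain):</p>

<pre><code class="language-lean">-- A sketch: build the map from a list of pairs in one go.
-- Assumes `Std.HashMap.ofList` is available in your version of the library.
def init_rm_state' (all: List RM): Std.HashMap RM RMState :=
  Std.HashMap.ofList (all.map (fun rm =&gt; (rm, RMState.Working)))
</code></pre>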

<p>What can we do about <code>TPNext</code>? This looks tricky: In every state, it should be
possible to select one out of seven actions, as well as the parameter <code>rm</code>.
This corresponds to control and data <strong>non-determinism</strong>, which poses a
challenge for a functional definition, since functions are <strong>deterministic</strong>.</p>

<p>Luckily, the literature on distributed computing can help us with this problem.
It is not unusual to define a system execution as a function of a <strong>schedule</strong>,
or as a function of an <strong>adversary</strong>. Basically, we imagine that there is an
external entity that tells us which steps to take.</p>

<p>Once we understand this trick, it becomes easy to use. We simply declare the
type <code>Action</code> like this:</p>

<pre><code class="language-lean">/--
Since Lean4 is not TLA+, it does not have a built-in syntax for actions.
Hence, we introduce the type that essentially turns control and data
non-determinism into inputs. This trick is usually called a "schedule"
or "adversary" in the literature.
-/
inductive Action where
  | TMCommit
  | TMAbort
  | TMRcvPrepared(rm: RM)
  | RMPrepare(rm: RM)
  | RMChooseToAbort(rm: RM)
  | RMRcvCommitMsg(rm: RM)
  | RMRcvAbortMsg(rm: RM)
  deriving DecidableEq, Repr

</code></pre>

<p>Having defined the <code>Action</code> type, we define the function <code>next</code> of a protocol
state <code>s</code> and an action <code>a</code>:</p>

<pre><code class="language-lean">/-- The transition function (!) of the protocol. Since we provide `next` with
the action as an argument, `next` is a function, not a relation. -/
def next s a :=
  match a with
  | Action.TMCommit =&gt; tmCommit RM s
  | Action.TMAbort =&gt; tmAbort _ s
  | Action.TMRcvPrepared rm =&gt; tmRcvPrepared _ s rm
  | Action.RMPrepare rm =&gt; rmPrepare _ s rm
  | Action.RMChooseToAbort rm =&gt; rmChooseToAbort _ s rm
  | Action.RMRcvCommitMsg rm =&gt; rmRcvCommitMsg _ s rm
  | Action.RMRcvAbortMsg rm =&gt; rmRcvAbortMsg _ s rm

</code></pre>

<p>It looks simple and clean. I would argue that this definition of <code>next</code> is more
elegant than the definition of <code>TPNext</code> in TLA<sup>+</sup>.</p>

<p>This is all we have in <a href="https://github.com/konnov/leanda/blob/main/twophase/Twophase/System.lean">System.lean</a>.</p>
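
<p>With <code>next</code> in hand, replaying a whole schedule is just a fold over the list of
actions, skipping the ones that are disabled (a sketch, mirroring the
<code>applySchedule</code> function used for property-based testing):</p>

<pre><code class="language-lean">-- A sketch: execute a schedule by folding `next` over the actions.
-- Disabled actions (those returning `none`) leave the state unchanged.
def run (s: ProtocolState RM) (schedule: List (@Action RM)): ProtocolState RM :=
  schedule.foldl (fun s a =&gt; (next s a).getD s) s
</code></pre>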

<h2 id="3-randomised-simulator-in-lean">3. Randomised simulator in Lean</h2>

<h3 id="31-why">3.1. Why?</h3>

<p>We have a formal specification in Lean. What is next? Obviously, our ultimate
goal is to <strong>verify</strong> protocol correctness. In particular, we would like to
verify consistency across the resource managers, for every intermediate state of
the protocol:</p>

<pre><code class="language-lean">-- the main invariant of the protocol, namely, that resource managers cannot disagree
def consistentInv (s: ProtocolState RM): Bool :=
  let existsAborted :=
    ∅ ≠ (Finset.filter (fun rm =&gt; s.rmState.get? rm = RMState.Aborted) s.all)
  let existsCommitted :=
    ∅ ≠ (Finset.filter (fun rm =&gt; s.rmState.get? rm = RMState.Committed) s.all)
  ¬existsAborted ∨ ¬existsCommitted

</code></pre>

<p>Basically, we want to show that it is impossible to reach a protocol state that
has a resource manager in the <code>Committed</code> state and a resource manager in the
<code>Aborted</code> state. Well, Lean offers us a lot of machinery for proving such
properties. However, this machinery requires someone to write the proofs. Even
though there is hope for large language models generating repetitive proofs,
there is little hope for automatically proving properties of completely
arbitrary algorithms.</p>

<p>Before going into a long-term effort of proving protocol properties, it is
usually a good idea to try to <strong>falsify</strong> the protocol properties. This is what
randomized simulations and model checkers can help us with. Even though such
tools would not be able to give us a complete proof of correctness, they are
quite useful in producing <strong>counterexamples</strong>. After all, if we want to prove a
property $p$ and an automatic tool gives us a proof of $\neg p$, that is, a
counterexample to $p$, we immediately know that the goal of proving $p$ is
hopeless. Sometimes, model checkers can give us slightly better guarantees, see
my recent blog post on <a href="/modelchecking/2025/04/08/value.html">value of model checking</a>.</p>

<p>In contrast to TLA<sup>+</sup>, which has two model checkers <a href="https://github.com/tlaplus/tlaplus">TLC</a> and
<a href="https://apalache-mc.org/">Apalache</a>, there is no model checker for Lean. Hence, the easiest approach to
falsify a property in Lean is by using randomized techniques. In this section,
we discuss <strong>randomized simulation</strong>. In the next section on <a href="#4-property-based-testing-in-lean">Property-based
testing</a>, we discuss <a href="https://github.com/leanprover-community/plausible">Plausible</a> — a PBT framework for Lean.</p>

<h3 id="32-what">3.2. What?</h3>

<p>Our goal in this section is to quickly turn our Lean specification into a simulator
that runs as follows:</p>

<pre><code class="language-sh">$ lake exec RunTwophase4 100000 20 consistentInv 123
</code></pre>

<p>The above command runs the simulator <code>RunTwophase4</code> that executes 100000 random
runs, each having up to 20 steps. In each run, the simulator checks the state
invariant <code>consistentInv</code> for each intermediate state. Since our simulator is
randomized, we have to supply the seed. In our example, we give 123 as the seed.
We could also supply the current time in seconds since Unix epoch by replacing
<code>123</code> with <code>$(date +%s)</code>.</p>
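
<p>The seed makes runs reproducible: the same seed produces the same pseudo-random
sequence, and hence the same schedules. Here is a small sketch using the same
primitives that the simulator uses:</p>

<pre><code class="language-lean">-- A sketch: `mkStdGen` with a fixed seed yields a deterministic sequence,
-- so a failing run can be replayed by rerunning with the same seed.
def firstTwo (seed: Nat): Nat × Nat :=
  let g := mkStdGen seed
  let (a, g') := randNat g 0 6
  let (b, _) := randNat g' 0 6
  (a, b)

#eval firstTwo 123 == firstTwo 123  -- true: same seed, same numbers
</code></pre>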

<p>Since the property <code>consistentInv</code> is not supposed to break, we also introduce
“falsy invariants” that should actually break. By checking these invariants,
we may convince ourselves that our specification is doing something useful.
Here are the interesting invariants that we define right in the simulator:</p>

<pre><code class="language-lean">-- This is a false invariant to demonstrate that TM can abort.
def noAbortEx (s: ProtocolState RM): Bool :=
  s.tmState ≠ TMState.Aborted

-- This is a false invariant to demonstrate that TM can commit.
def noCommitEx (s: ProtocolState RM): Bool :=
  s.tmState ≠ TMState.Committed

-- This is a false invariant for a trickier property:
-- Even though all resource managers are prepared, the TM can still abort.
def noAbortOnAllPreparedEx (s: ProtocolState RM): Bool :=
  s.tmState = TMState.Aborted → s.tmPrepared ≠ s.all

</code></pre>

<p>For example, this is how we find a schedule that makes the transaction manager
commit:</p>

<pre><code class="language-sh">$ /usr/bin/time -h lake exec RunTwophase4 1000000 20 noCommitEx 1745505753
❌ Counterexample found after 403736 trials
#0: Action.RMPrepare (RM.RM3)
#1: Action.RMPrepare (RM.RM4)
#2: Action.TMRcvPrepared (RM.RM3)
#3: Action.RMPrepare (RM.RM2)
#4: Action.TMRcvPrepared (RM.RM3)
#5: Action.TMRcvPrepared (RM.RM2)
#6: Action.TMRcvPrepared (RM.RM3)
#7: Action.RMPrepare (RM.RM1)
#8: Action.TMRcvPrepared (RM.RM4)
#9: Action.TMRcvPrepared (RM.RM1)
#10: Action.TMCommit
	2.06s real		1.88s user		0.53s sys
</code></pre>

<p>The simulator also finds an example in which all resource managers were ready
to commit a transaction, but the transaction manager decided to abort it:</p>

<pre><code class="language-sh">$ /usr/bin/time -h lake exec RunTwophase4 1000000 20 noAbortOnAllPreparedEx 1745505754
❌ Counterexample found after 97860 trials
#0: Action.RMPrepare (RM.RM3)
#1: Action.RMPrepare (RM.RM4)
#2: Action.RMPrepare (RM.RM2)
#3: Action.TMRcvPrepared (RM.RM2)
#4: Action.TMRcvPrepared (RM.RM4)
#5: Action.TMRcvPrepared (RM.RM3)
#6: Action.RMPrepare (RM.RM1)
#7: Action.TMRcvPrepared (RM.RM1)
#8: Action.TMAbort
	1.21s real		0.67s user		0.32s sys
</code></pre>

<p>Actually, the diagrams in the <a href="#1-a-brief-intro-to-two-phase-commit-and-the-tla-specification">Brief introduction</a> are generated
from the examples found by our simulator. As you can see, the
counterexamples are found in a few seconds, even though the simulator has to
enumerate hundreds of thousands of schedules.</p>

<h3 id="33-how">3.3. How?</h3>

<p>You can find the whole simulator in <a href="https://github.com/konnov/leanda/blob/main/twophase/Twophase4_run.lean">Twophase4_run.lean</a>.  The core of
our simulator is the following loop, written right in the entry point:</p>

<pre><code class="language-lean">def main (args: List String): IO UInt32 := do
  -- parse the arguments
  -- ...
  -- run a loop of `maxSamples`
  for trial in [0:maxSamples] do
    let mut state := /- initial_state (...) -/ 
    -- run a loop of `maxSteps` steps
    for _ in [0:maxSteps] do
      -- ...
      let action := mkAction -- (random parameters)
      match next state action with
      | some new_state =&gt;
        --- add to trace
      | none =&gt; -- do nothing
      -- check our invariant
      if ¬inv state then
        -- print the trace
        return 1
</code></pre>

<p>Before going into the details of the simulator loop, we have to introduce a bit
of boilerplate code. First, we have to define the type of the resource managers
<code>RM</code>. We do it by simply enumerating four constructors <code>RM1</code>, …, <code>RM4</code>:</p>

<pre><code class="language-lean">-- An instance of four resource managers.
inductive RM
  | RM1
  | RM2
  | RM3
  | RM4
  deriving Repr, DecidableEq, Hashable, Inhabited

</code></pre>

<p>Second, since our simulator is randomized, we have to understand how to randomly
produce schedules. To this end, we write the definition <code>mkAction</code>:</p>

<pre><code class="language-lean">def mkAction (action_no: Nat) (rm_no: Nat): @Action RM :=
  let rm := match rm_no with
    | 0 =&gt; RM.RM1
    | 1 =&gt; RM.RM2
    | 2 =&gt; RM.RM3
    | _ =&gt; RM.RM4
  match action_no with
  | 0 =&gt; Action.TMCommit
  | 1 =&gt; Action.TMAbort
  | 2 =&gt; Action.TMRcvPrepared rm
  | 3 =&gt; Action.RMPrepare rm
  | 4 =&gt; Action.RMChooseToAbort rm
  | 5 =&gt; Action.RMRcvCommitMsg rm
  | _ =&gt; Action.RMRcvAbortMsg rm

</code></pre>

<p>Having defined <code>mkAction</code>, we simply generate two random numbers via Lean’s
<code>randNat</code> that we supply to <code>mkAction</code>. Now we are ready to see the full
implementation of the simulator loop:</p>

<pre><code class="language-lean">  -- run a loop of `maxSamples`
  let mut rng := mkStdGen seed
  for trial in [0:maxSamples] do
    let mut state := init [ RM.RM1, RM.RM2, RM.RM3, RM.RM4 ]
    -- run a loop of `maxSteps` steps
    let mut trace: List (@Action RM) := []
    for _ in [0:maxSteps] do
      let (action_no, next_rng) := randNat rng 0 6
      let (rm_no, next_rng) := randNat next_rng 0 3
      rng := next_rng
      let action := mkAction action_no rm_no
      match next state action with
      | some new_state =&gt;
        state := new_state
        trace := action::trace
      | none =&gt; pure ()

      -- check our invariant
      if ¬inv state then
        IO.println s!"❌ Counterexample found after {trial} trials"
        for (a, i) in trace.reverse.zipIdx 0 do
          IO.println s!"#{i}: {repr a}"
        return 1

</code></pre>

<p>As you can see, the simulator is extremely simple! There is really not much to
explain here. The bulk of the work is done by Lean itself. Our job was simply to
produce schedules, that is, lists of <code>Action</code> values.</p>

<h3 id="34-our-simulator-is-really-fast">3.4. Our simulator is really fast!</h3>

<p>We have seen that in the cases when the invariant was violated — e.g.,
when checking <code>noAbortEx</code>, <code>noCommitEx</code>, or <code>noAbortOnAllPreparedEx</code> — our
simulator found examples in a few seconds. However, when no violation is found,
we do not know how many samples are enough to convince ourselves that the
invariant indeed holds. Hence, our simulator should be fast enough to crunch
plenty of runs.</p>

<p>Let us run the simulator on 1 million runs up to 30 steps each:</p>

<pre><code class="language-sh">$ /usr/bin/time -h -l lake exec RunTwophase4 1000000 30 consistentInv 1745505754
	10.93s real		10.34s user		0.39s sys
           422985728  maximum resident set size
...
</code></pre>

<p>Nice! The simulator came back in 11 seconds and consumed up to 423 MB of
memory. We can also run it for a larger number of samples. As we can see, the
simulator is quite robust. As expected, the running times are growing linearly,
and there are no visible memory leaks:</p>

<table>
  <thead>
    <tr>
      <th>Number of samples</th>
      <th>Real Time</th>
      <th>Max RSS (MB)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1M</td>
      <td>10.93s</td>
      <td>422</td>
    </tr>
    <tr>
      <td>10M</td>
      <td>1m43.42s</td>
      <td>423</td>
    </tr>
    <tr>
      <td>100M</td>
      <td>18m7.72s</td>
      <td>423</td>
    </tr>
  </tbody>
</table>

<p>Is it actually fast? I think it is really damn fast! What can we use as the
baseline? We can compare it with the Quint simulator. All we have to do is to
get the <a href="https://github.com/informalsystems/quint/blob/main/examples/classic/distributed/TwoPhaseCommit/two_phase_commit.qnt">Quint spec of two-phase commit</a>, which is also a
translation of the same TLA<sup>+</sup> spec that we used as the original
source. To make the comparison faithful, we have to add the instance of four
resource managers like this:</p>

<pre><code class="language-quint">module two_phase_commit_4 {
  import two_phase_commit(resourceManagers = Set("rm1", "rm2", "rm3", "rm4")).*
}
</code></pre>

<p>Now we can run the Quint simulator like this:</p>

<pre><code class="language-sh">$ /usr/bin/time -l -h quint run --max-samples 1000000 --max-steps=30 \
  --invariant=consistencyInv \
  examples/classic/distributed/TwoPhaseCommit/two_phase_commit.qnt --main two_phase_commit_4
...
	27m57.62s real		28m20.04s user		21.03s sys
           217235456  maximum resident set size
</code></pre>

<p>This is quite interesting. Although the Quint simulator uses half as much
memory, it is over 150 times slower when running one million samples. I believe there
are several reasons for this. First, Quint is essentially a JavaScript
interpreter, while our Lean simulator is transpiled to C and compiled with full
optimization for the target architecture. Second, the Quint simulator makes a
slightly greater effort to find satisfying assignments randomly. Nevertheless,
the performance gap is substantial.</p>

<p>Quint also has a new backend in Rust. Let’s try it, too:</p>

<pre><code class="language-sh">/usr/bin/time -l -h quint run --max-samples 1000000 --max-steps=30 \
  --invariant=consistencyInv examples/classic/distributed/TwoPhaseCommit/two_phase_commit.qnt \
  --main two_phase_commit_4 --backend=rust
...
	9m27.57s real		9m24.46s user		1.53s sys
           120913920  maximum resident set size
</code></pre>

<p>Nice! The Rust backend is three times faster and uses roughly half as much memory as the
JavaScript backend. Still, 10 minutes is nowhere close to the 11 seconds of our
simulator in Lean. Apparently, it’s really hard to compete with a transpiler to C,
even if the interpreter is written in Rust 🦀.</p>

<h2 id="4-property-based-testing-in-lean">4. Property-based testing in Lean</h2>

<p>Instead of writing our own simulator — even though it happened to be easy
in the case of two-phase commit — we employ <a href="https://github.com/leanprover-community/plausible">Plausible</a> in this section.
Plausible also applies randomization to check properties, but it does it in a
slightly different way. Basically, it generates data structures of a given type
at random up to some predefined bound and checks the property. What is
interesting about property-based testing is that, once the framework finds a
property violation, it tries to minimize the failing input.</p>

<p>You can find the whole PBT setup in <a href="https://github.com/konnov/leanda/blob/main/twophase/Twophase4_pbt.lean">Twophase4_pbt.lean</a> and
<a href="https://github.com/konnov/leanda/blob/main/twophase/Twophase4_pbt_failing.lean">Twophase4_pbt_failing.lean</a>. If you are using only standard types,
Plausible is pretty simple to use: Just write a quantified property, and Plausible
will check it. In our case, we have two user-defined data types, namely, <code>RM</code>
and <code>Action</code>. This is not unusual. In my experience, writing generators is the
hardest and the most time-consuming part of using PBT. Here is how we
define the custom type <code>RM</code> and its generator:</p>

<pre><code class="language-lean">-- An instance of four resource managers.
inductive RM
  | RM1
  | RM2
  | RM3
  | RM4
  deriving Repr, DecidableEq, Hashable, Inhabited

-- define a generator for RM
instance RM.shrinkable: Shrinkable RM where
  shrink := fun (_: RM) =&gt; []

def genRm: Gen RM := (Gen.elements [ RM.RM1, RM.RM2, RM.RM3, RM.RM4 ] (by decide))

instance : SampleableExt RM :=
  SampleableExt.mkSelfContained genRm

</code></pre>

<p>Basically, we define the generator <code>genRm</code> that randomly selects one of the four
values <code>RM1</code>, …, <code>RM4</code>. Since we do not know how to further minimize a value
of type <code>RM</code>, we simply define the instance of <code>Shrinkable</code> as returning the
empty list.</p>

<p>Similar to that, we define a generator for <code>Action RM</code>. This time, we delegate
the choice of <code>RM</code> values to <code>genRm</code>:</p>

<pre><code class="language-lean">-- define a generator for Action RM
instance Action.shrinkable: Shrinkable (@Action RM) where
  shrink := fun (_: @Action RM) =&gt; []

def genAction: Gen (@Action RM) :=
  Gen.oneOf #[
    Gen.elements [ Action.TMCommit, Action.TMAbort] (by decide),
    (do return (Action.TMRcvPrepared (← genRm))),
    (do return (Action.RMPrepare (← genRm))),
    (do return (Action.RMChooseToAbort (← genRm))),
    (do return (Action.RMRcvCommitMsg (← genRm))),
    (do return (Action.RMRcvAbortMsg (← genRm)))
  ]
  (by decide)

instance : SampleableExt (@Action RM) :=
  SampleableExt.mkSelfContained genAction

</code></pre>

<p>Once we have defined the generator for actions, it is very easy to define a
generator of schedules by applying Plausible’s standard list generator:</p>

<pre><code class="language-lean">def genSchedule: Gen (List (@Action RM)) :=
  Gen.listOf genAction

</code></pre>

<p>The good news is that defining the generators was the hardest part. At least,
it was the hardest part for me. Now, since Lean does not have any idea about
how our specification should be executed, we define two simple functions
<code>applySchedule</code> and <code>checkInvariant</code>:</p>

<pre><code class="language-lean">-- given a concrete schedule, inductively apply the schedule and check the invariant
def applySchedule (s: ProtocolState RM) (schedule: List (@Action RM))
    (inv: ProtocolState RM → Bool): ProtocolState RM :=
  schedule.foldl (fun s a =&gt; if inv s then (next s a).getD s else s) s

-- apply a schedule to the initial state
def checkInvariant (schedule: List (@Action RM)) (inv: ProtocolState RM → Bool): Bool :=
  let init_s := init [ RM.RM1, RM.RM2, RM.RM3, RM.RM4 ]
  let last_s := applySchedule init_s schedule inv
  inv last_s

</code></pre>

<p>Now we are ready to write our first property to be checked by Plausible. This
is how we check whether it’s possible to reach a state where the transaction
manager goes into <code>TMState.Aborted</code>:</p>

<pre><code class="language-lean">-- noAbortEx
example schedule:
    let inv := fun (s: ProtocolState RM) =&gt;
      s.tmState ≠ TMState.Aborted
    checkInvariant schedule inv
  := by plausible (config := { numInst := 3000, maxSize := 100 })

</code></pre>

<p>In the above definition we claim that, whatever schedule is generated,
<code>checkInvariant</code> does not return <code>false</code>. We configure Plausible to produce 3000
instances with data structures of size up to 100. In terms of our
simulator, this means 3000 random schedules of up to 100 steps each.</p>

<p>In this case, Plausible finds a counterexample and minimizes it:</p>

<pre><code>===================
Found a counter-example!
schedule := [Action.TMAbort]
issue: false does not hold
(1 shrinks)
-------------------
</code></pre>

<p>Similar to <code>noAbortEx</code>, we define two other properties:</p>

<pre><code class="language-lean">-- noCommitEx
example schedule:
    let inv := fun (s: ProtocolState RM) =&gt;
      s.tmState ≠ TMState.Committed
    checkInvariant schedule inv
  := by plausible (config := { numInst := 3000, maxSize := 100 })

-- noAbortOnAllPreparedEx
example schedule:
    let inv := fun (s: ProtocolState RM) =&gt;
      s.tmState = TMState.Aborted → s.tmPrepared ≠ { RM.RM1, RM.RM2, RM.RM3, RM.RM4 }
    checkInvariant schedule inv
  := by plausible (config := { numInst := 3000, maxSize := 100 })

</code></pre>

<p>Since neither property actually holds, we would hope to see both violated, if
we are lucky. In my case, Plausible does not find a counterexample to <code>noCommitEx</code>, but it
does find a counterexample to <code>noAbortOnAllPreparedEx</code>:</p>

<pre><code>===================
Found a counter-example!
schedule := [Action.RMPrepare (RM.RM3),
 Action.RMPrepare (RM.RM4),
 Action.TMRcvPrepared (RM.RM3),
 Action.RMPrepare (RM.RM1),
 Action.RMPrepare (RM.RM2),
 Action.TMRcvPrepared (RM.RM1),
 Action.TMRcvPrepared (RM.RM4),
 Action.TMRcvPrepared (RM.RM2),
 Action.TMAbort]
issue: false does not hold
(23 shrinks)
-------------------
</code></pre>

<p>Finally, we define the property <code>consistentInv</code>, which should not be violated:</p>

<pre><code class="language-lean">-- consistentInv
#eval Testable.check &lt;| ∀ (schedule: List (@Action RM)),
    let inv := fun (s: ProtocolState RM) =&gt;
      let existsAborted :=
        ∅ ≠ (Finset.filter (fun rm =&gt; s.rmState.get? rm = RMState.Aborted) s.all)
      let existsCommitted :=
        ∅ ≠ (Finset.filter (fun rm =&gt; s.rmState.get? rm = RMState.Committed) s.all)
      ¬existsAborted ∨ ¬existsCommitted
    checkInvariant schedule inv

</code></pre>

<p>So PBT seems to work and looks fine! The only thing that I could not figure out
is how to increase the number of samples to large values, e.g., millions of
samples, as we did in the simulator. Unfortunately, Plausible produces a stack
overflow already on 10K examples, even when run via <code>lake test</code>. This needs more
debugging.</p>

<h2 id="5-propositional-specification-in-lean">5. Propositional specification in Lean</h2>

<p>Finally, let’s discuss whether our specification style is the most natural one
when translating TLA<sup>+</sup> specifications. We used the functional approach
to define the individual behavior of the resource managers and the transaction manager in
<a href="https://github.com/konnov/leanda/blob/main/twophase/Twophase/Functional.lean">Functional.lean</a> and <a href="https://github.com/konnov/leanda/blob/main/twophase/Twophase/System.lean">System.lean</a>. When we did a similar trick in Quint,
some people told me that it’s not TLA<sup>+</sup>. I think it is easier
to see what they meant by this in Lean!</p>

<p>Instead of defining functions, we can write the specification as propositions.
As before, we define two variables: <code>s</code> for the current state, and <code>s'</code> for the
next state.</p>

<pre><code class="language-lean">-- The state `s` is the "current" state of the protocol.
variable (s: ProtocolState RM)
-- The state `s'` is the "next" state of the protocol.
variable (s': ProtocolState RM)

</code></pre>

<p>Now we can translate TLA<sup>+</sup>’s action <code>TMRcvPrepared</code> as a proposition:</p>

<pre><code class="language-lean">/-- The proposition version of `tmRcvPrepared`. -/
def tm_rcv_prepared (rm: RM): Prop :=
    s.tmState = TMState.Init
  ∧ Message.Prepared rm ∈ s.msgs
  ∧ s'.tmPrepared = s.tmPrepared ∪ { rm }

</code></pre>

<p>This looks very similar to TLA<sup>+</sup>! Moreover, we should be able to use
some definitions that do not have an executable implementation. For instance, we
can use quantifiers over sets, instead of filtering sets. I also suspect that it
would be easier to write proofs over such propositions rather than over
functional code.</p>
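
<p>For instance, the consistency invariant can be stated propositionally with
quantifiers over the set of resource managers (a sketch; the name
<code>consistent_inv</code> is ours):</p>

<pre><code class="language-lean">/-- A propositional sketch of `consistentInv`: no two resource managers
disagree. Unlike the executable version, it quantifies over `s.all` instead
of filtering it. -/
def consistent_inv (s: ProtocolState RM): Prop :=
  ∀ rm1 ∈ s.all, ∀ rm2 ∈ s.all,
    ¬(s.rmState.get? rm1 = RMState.Aborted
      ∧ s.rmState.get? rm2 = RMState.Committed)
</code></pre>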

<p>Can we find a connection between the functional definitions and the
propositions? Well, in Lean we can just write simple theorems like
<code>tm_rcv_prepared_correct</code>:</p>

<pre><code class="language-lean">theorem tm_rcv_prepared_correct (rm: RM):
    tm_rcv_prepared s s' rm ↔
      tmRcvPrepared RM s rm = some s' := by sorry


</code></pre>

<p>I do not know how to write a Lean proof of the above theorem yet. In principle,
it should be quite easy to write for someone proficient with Lean proofs.
Indeed, the functional definition and the proposition are very close in their
syntactic structure.</p>

<p>I find this connection especially appealing. This would let us stop arguing
about which level is the right one. We could simply write both functional and
propositional definitions and connect them via (hopefully!) simple proofs.</p>]]></content><author><name>{&quot;igor&quot;=&gt;{&quot;name&quot;=&gt;&quot;Igor Konnov&quot;, &quot;url&quot;=&gt;&quot;https://konnov.phd&quot;, &quot;email&quot;=&gt;&quot;igor@konnov.phd&quot;}}</name></author><category term="lean" /><summary type="html"><![CDATA[Author: Igor Konnov]]></summary></entry></feed>