<?xml version="1.0" encoding="utf-8"?>

<feed xmlns="http://www.w3.org/2005/Atom" >
  <generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator>
  <link href="https://protocols-made-fun.com/feed.xml?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" rel="self" type="application/atom+xml" />
  <link href="https://protocols-made-fun.com/?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" rel="alternate" type="text/html" />
  <updated>2026-05-29T11:56:54+00:00</updated>
  <id>https://protocols-made-fun.com/feed.xml</id>

  
  
  

  
    <title type="html">Protocols Made Fun</title>
  

  
    <subtitle>All things about protocol specification, testing, and verification. Creative Commons Attribution 4.0 International License.</subtitle>
  

  
    <author>
        <name>Igor Konnov</name>
      
        <email>igor@konnov.phd</email>
      
      
        <uri>https://konnov.phd?utm_source=protocols_made_fun&amp;utm_medium=referral&amp;utm_campaign=pmf_site</uri>
      
    </author>
  

  
  
  
  
  
  
    <entry>
      
      

      <title type="html">Extracting formal specifications from Apache ZooKeeper with AI tools and Apalache</title>
      <link href="https://protocols-made-fun.com/testing/model-checking/2026/05/26/zookeeper-testing.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" rel="alternate" type="text/html" title="Extracting formal specifications from Apache ZooKeeper with AI tools and Apalache" />
      <published>2026-05-26T00:00:00+00:00</published>
      <updated>2026-05-26T00:00:00+00:00</updated>
      <id>https://protocols-made-fun.com/testing/model-checking/2026/05/26/zookeeper-testing</id>
      
      
        <content type="html" xml:base="https://protocols-made-fun.com/testing/model-checking/2026/05/26/zookeeper-testing.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed"><![CDATA[<p><strong>Author:</strong> <a href="https://konnov.phd?utm_source=protocols_made_fun&amp;utm_medium=referral&amp;utm_campaign=pmf_site">Igor Konnov</a></p>

<p><strong>Date:</strong> May 26, 2026</p>

<p><em>This text is artisanally typed using a keyboard, with occasional suggestions by
Copilot. The figures are generated with ChatGPT 5.5. The plots are produced by
AI-generated scripts from the experimental data. By AI tools, I refer to Codex
GPT 5.4/5.5 and Claude Code Sonnet/Opus 4.6/4.7.</em></p>

<p><img class="zm-logo" src="https://protocols-made-fun.com/img/zk-testing/zm-logo.png?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" alt="ZooKeeper logo" /></p>

<p>Recently, I gave a talk on “<em>Interactive symbolic testing with TLA<sup>+</sup>,
Apalache, and LLMs</em>” at the <a href="https://conf.tlapl.us/2026-etaps/?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">TLA+ Community Meeting 2026</a>. If you
prefer watching talks, see <a href="https://www.youtube.com/watch?v=CQPhAfi-6Uk&amp;utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">the talk recording</a>. I talked
about the new <a href="https://github.com/apalache-mc/apalache/tree/main/json-rpc?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Apalache JSON-RPC</a> and how it can be used to test real
distributed protocols. As the first example, I presented the case study on
<a href="https://protocols-made-fun.com/tlaplus/2025/12/15/tftp-symbolic-testing.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">symbolic testing of the TFTP protocol</a>, which was published in
December 2025. As the second example, I presented a case study on symbolic
testing of <a href="https://zookeeper.apache.org/?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Apache ZooKeeper</a>, which is the subject of this
blog post. I also talked about this as ongoing work at <a href="https://www.tu.berlin/en/mtv/research/events/d-con-2026?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">D-CON 2026</a> (thanks to
<a href="https://www.tu.berlin/en/mtv/team/head/uwe-nestmann?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Uwe Nestmann</a> for inviting me!).</p>

<p>In the case of TFTP, <strong>the main hypothesis was that AI tools can accelerate the
process of writing test harnesses for protocol testing</strong>. In October 2025, I
used Copilot and Sonnet 4.5. The answer was “yes”, though the AI tools in 2025
required plenty of manual intervention and literally drained my energy. Back
then, I wrote the TLA<sup>+</sup> specification of TFTP by hand. I also had to
refine it manually, in about 20-25 iterations. As a reward, the test harness
helped me to find a few bugs in the real implementations. I still had to triage
the bugs manually though.</p>

<p><em>Footnote</em>: Actually, the real question for me was not whether AI tools could
help the engineers to write a test harness. In my experience, engineers avoid
writing test harnesses as much as they can. So the real question was whether the
AI tools could do the job that engineers avoid doing.</p>

<p>The next step was to ask the following question:</p>

<div style="font-size: 1.3em; text-align: center;">
<p style="font-size: 1.3em;"><strong>Can AI tools extract formal specifications
from the source code and write test harnesses?</strong></p>
</div>

<p>In March-April 2026, I ran Claude Code Sonnet/Opus 4.6 and Codex GPT 5.4 to
<strong>check this hypothesis on the example of Apache ZooKeeper</strong>.
This case study is the subject of this blog post. I already hinted at this work
in <a href="https://protocols-made-fun.com/testing/model-checking/2026/03/23/debug-as-code-generation.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Debug as Code Generation</a>.  To this end, I have been running
this loop:</p>

<figure>
    <a href="https://protocols-made-fun.com/img/zk-testing/extraction-loop.png?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" target="_blank" title="Click to open full-size"><picture>
      <img class="responsive-img" src="https://protocols-made-fun.com/img/zk-testing/extraction-loop.png?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" alt="Extraction loop" />
    </picture></a>
    <figcaption>Figure 1: Formal specification extraction loop with AI tools and Apalache.</figcaption>
  </figure>

<p>In general, it looks like <strong>the answer is “yes, but”</strong>. The extraction-checking
loop stopped finding new mismatches between the behavior of a running ZooKeeper
replica and the extracted formal specification and harness. So I do not have
new logs to feed into Codex and Claude Code, and it is a good time to reflect on
this.</p>

<p>Look carefully at Figure 1. Even though my AI agents had a lot of freedom in
coming up with their plans and implementing them, <strong>I did not let the agents run
wild</strong>.  I keep reading claims about “autonomous agents” and “agentic loops”,
where agents simulate unhealthy human management loops. I still had to read the
triage reports and implementation plans. Several times, had I not caught an
agent planning to introduce really bad workarounds in the specification, we
would have gone down the rabbit hole of slop. Every iteration had a separate
commit, so we could keep track of regressions in the specification and harness.
Having said that, I admit that my reviews were high-level and intuitive, not
GitHub-level reviews.</p>

<p>What I believe is the killer feature of this approach is that <strong>it does not need
any pre-existing test suites</strong>. We do not mutate the existing tests. The model
checker <strong>finds new tests</strong>, including timeouts, crashes, TCP disconnects, etc.
Moreover, this approach requires <strong>zero code instrumentation</strong>. We do not have
to add any hooks or logging to the implementation. <strong>The test harness operates at
the TCP boundary</strong>.</p>

<p><strong>Did I burn thousands of dollars on this?</strong> Not at all. I did this case study
with two lowest-tier subscriptions to Codex and Claude Code, which cost me
<strong>about $80 for two months</strong> in total. (Given the news about <a href="https://www.theregister.com/2026/04/22/anthropic_removes_claude_code_pro/?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Claude price
changes</a> and <a href="https://github.blog/news-insights/company-news/changes-to-github-copilot-individual-plans/?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Copilot price changes</a>,
this is becoming more expensive.) Most of the time went into running the testing
experiments on my workstation: AMD Ryzen 9 5950X processor (16 physical, 32
logical cores), 128 GB RAM. The cool thing about my testing architecture is that
the machine was running 10-15 episodes of 300 steps in parallel on 10-20 cores,
for a total of 30000-90000 steps in a single campaign. Hence, the AI tools had to
triage 1-30 counterexamples at once, before starting a new campaign.</p>

<p>Since we now live in a hype-driven world, I want to stress that <strong>this is still
an experiment</strong>. I am pretty sure that <a href="https://dl.acm.org/doi/abs/10.1145/3689031.3696069?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Ouyang et al.</a> had much
more time to write their TLA<sup>+</sup> specifications of ZooKeeper and to
conduct their experiments.</p>

<p>If you read the blog post carefully, you will probably find some points that
could be investigated further. I have decided to time-box this experiment and
report on it as it stands.</p>

<div class="tip-box">
    <div class="tip-header"><strong>💡 Tip:</strong></div>
    <div class="tip"><p>As I mentioned earlier, I stopped accompanying my
blog posts with complete artifacts. AI slop forks are real. It takes me time to
design and conduct the experiments on a beefy machine, as well as to find the
right format to interpret and explain the data. It only takes 10-15 minutes to
repackage the benchmarks and results with an AI tool, having the experimental
data. Hence, I am sharing my lab book with the customers and researchers, upon
request.</p>
</div>
</div>

<p><strong>Want to skip the long text?</strong> <a href="https://protocols-made-fun.com/testing/model-checking/2026/05/26/zookeeper-testing.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed#conclusions">Jump to the
conclusions</a>.</p>

<h2 id="1-the-effort">1. The effort</h2>

<p>This experiment took about two months, from March 2026 to April 2026. The git
repository has 336 commits in total. Except for several initial commits, each
new commit corresponds to a new iteration of the extraction-checking loop.</p>

<p>You can see the statistics in the figures below:</p>

<ul>
  <li>Figure 2 shows the number of commits per day.</li>
  <li>Figure 3 shows the number of lines added and deleted in the whole repository.</li>
  <li>Figure 4 shows the number of lines added and deleted in the specification files.</li>
  <li>Figure 5 shows the number of lines added and deleted in the test harness (zoomonkey).</li>
</ul>

<p>You can see that the commit volume decays with time. This is a sign of
convergence. The first week has the most commits and the most code added and
deleted. This was the bootstrapping phase. It’s also interesting to observe a
big splash around the first and second weeks of April. This is where we started to
reach a new class of behaviors that did not match the implementation.</p>

<figure>
    <a href="https://protocols-made-fun.com/img/zk-testing/git_stats_commits.png?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" target="_blank" title="Click to open full-size"><picture>
      <img class="responsive-img" src="https://protocols-made-fun.com/img/zk-testing/git_stats_commits.png?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" alt="Git stats" />
    </picture></a>
    <figcaption>Figure 2: Commit statistics in this experiment.</figcaption>
  </figure>

<figure>
    <a href="https://protocols-made-fun.com/img/zk-testing/git_stats_lines.png?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" target="_blank" title="Click to open full-size"><picture>
      <img class="responsive-img" src="https://protocols-made-fun.com/img/zk-testing/git_stats_lines.png?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" alt="Git stats" />
    </picture></a>
    <figcaption>Figure 3: Addition/deletion statistics in this experiment.</figcaption>
  </figure>

<figure>
    <a href="https://protocols-made-fun.com/img/zk-testing/git_stats_spec.png?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" target="_blank" title="Click to open full-size"><picture>
      <img class="responsive-img" src="https://protocols-made-fun.com/img/zk-testing/git_stats_spec.png?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" alt="Git stats" />
    </picture></a>
    <figcaption>Figure 4: Addition/deletion statistics in the specification files.</figcaption>
  </figure>

<figure>
    <a href="https://protocols-made-fun.com/img/zk-testing/git_stats_zoomonkey.png?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" target="_blank" title="Click to open full-size"><picture>
      <img class="responsive-img" src="https://protocols-made-fun.com/img/zk-testing/git_stats_zoomonkey.png?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" alt="Git stats" />
    </picture></a>
    <figcaption>Figure 5: Addition/deletion statistics in the test harness.</figcaption>
  </figure>

<h2 id="2-extracting-formal-specifications">2. Extracting formal specifications</h2>

<p>As I learned with <a href="https://protocols-made-fun.com/tlaplus/2025/12/15/tftp-symbolic-testing.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">TFTP testing</a>, AI tools need a good predefined
architecture. Hence, I spent some time capturing this architecture in
<code>AGENTS.md</code> and <code>CLAUDE.md</code>. The formal specification is composed of several
modules, each corresponding to a subprotocol of ZooKeeper:</p>

<ol>
  <li><strong>Module process</strong> captures the normal lifecycle of a ZooKeeper replica: <code>start</code>,
<code>on_started</code>, <code>on_stopped</code>. Crashes and restarts are handled by
the <strong>system</strong> module.</li>
  <li><strong>Module tcp</strong> captures the standard TCP lifecycle: <code>connect</code>, <code>accept</code>, <code>disconnect</code>,
<code>half_close</code>, <code>reset</code>, <code>refused</code>.</li>
  <li><strong>Module fle</strong> captures the Fast Leader Election protocol, which is used by ZooKeeper to
elect a leader among the replicas: <code>send_notification</code>, <code>rcv_notification</code>,
<code>become_leader</code>, <code>become_follower</code>, <code>restart_election</code>.</li>
  <li><strong>Module zab</strong> captures the ZooKeeper Atomic Broadcast protocol and its clients. It has 22
actions, including <code>proposal</code>, <code>ack_proposal</code>, <code>commit</code>, <code>diff</code>, <code>trunc</code>, <code>snap</code>,
<code>client_connect</code>, <code>client_ping</code>,  <code>client_create</code>, <code>client_set_data</code>, etc.</li>
  <li><strong>Module system</strong> composes the above modules and adds failures.</li>
</ol>

<p>These modules were written by the AI tools, by following the high-level
architecture, hands off the keyboard. To get the flavor of the specification,
look at one action from the specification of ZAB:</p>

<pre><code class="language-python">@action(inline=False)
def send_diff(c: Context[ZabState], leader: Expr, follower: Expr, next_turn: Expr):
    """Leader sends DIFF to a follower (requires quorum of epoch_acked)."""
    s = c.state
    c.assume(follower != leader)
    c.assume(_proc_up(s, leader))
    # The leader must have completed its election before registering in
    # epoch_leader.  Without this guard a FOLLOWING replica whose
    # fle_current_vote happens to be targeted by another follower can
    # act as a second leader for the same epoch, violating leadership2.
    c.assume(s.fle_role[leader] == LEADING)
    c.assume(_follower_targets_leader(s, follower, leader))
    c.assume(s.zab_sync[follower] == SYNC_EPOCH_ACKED)
    c.assume(_has_quorum_of_sync_state(s, leader, "epoch_acked"))
    c.assume(_can_send_diff(s, leader, follower))
    # Leader-initiated
    c.assume(_turn_matches_iut_actor(s, leader, next_turn))
    next_s = s.edit()
    next_s.zab_sync[follower] = SYNC_DIFF_SENT
    next_s.zab_state[leader] = SYNCHRONIZATION
    next_s.zab_accepted_epoch[leader] = s.zab_current_epoch[leader]
    next_s.zab_persisted_accepted_epoch[leader] = s.zab_current_epoch[leader]
    # By this point ACKEPOCH quorum has formed (see epoch_acked guard above),
    # which means ZK's Leader.lead() has already called setCurrentEpoch on
    # disk. Bump the currentEpoch shadow here to match that disk write.
    next_s.zab_persisted_current_epoch[leader] = s.zab_current_epoch[leader]
    next_s.epoch_leader[s.zab_current_epoch[leader]] = (
        s.epoch_leader[s.zab_current_epoch[leader]].union(Set(leader))  # type: ignore
    )
    s.zab_action = ZabAction.SendDiff(  # type: ignore
        ZabDiff(leader=leader, follower=follower)
    )
</code></pre>

<p>As you can see, this is Python code, not TLA<sup>+</sup>. I noticed that the AI
tools are quite good at writing Python. Hence, they write the specification in a
Python DSL, which is automatically translated to TLA<sup>+</sup>. The test
harness is also written in Python, and it uses the <a href="https://github.com/apalache-mc/apalache/tree/main/json-rpc?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Apalache JSON-RPC</a> to
interact with the model checker. <strong>If you are interested in the details of this
Python DSL, <a href="https://konnov.phd?pmf=20260427&amp;utm_source=protocols_made_fun&amp;utm_medium=referral&amp;utm_campaign=pmf_site">contact me</a></strong>.</p>

<p>The fragment of the above action in TLA<sup>+</sup> looks like <a href="https://gist.github.com/konnov/38af0cbd45b68da819cd76f70859ed94?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed#file-system-tla-L2272-L2311">this</a>.
The complete generated specification looks much more hairy. If you are still
curious, check its snapshot in <a href="https://gist.github.com/konnov/38af0cbd45b68da819cd76f70859ed94?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">this gist</a>.</p>

<p>In the table below, you can see the statistics on the formal specification.
Since the TLA<sup>+</sup> specification is generated from the Python code,
this specification is monolithic and has no submodules.</p>

<table>
  <thead>
    <tr>
      <th>Module</th>
      <th style="text-align: right">Actions</th>
      <th style="text-align: right">Python LOC</th>
      <th style="text-align: right">TLA<sup>+</sup> LOC</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>process</td>
      <td style="text-align: right">5</td>
      <td style="text-align: right">90</td>
      <td style="text-align: right">-</td>
    </tr>
    <tr>
      <td>tcp</td>
      <td style="text-align: right">6</td>
      <td style="text-align: right">220</td>
      <td style="text-align: right">-</td>
    </tr>
    <tr>
      <td>fle</td>
      <td style="text-align: right">5</td>
      <td style="text-align: right">1605</td>
      <td style="text-align: right">-</td>
    </tr>
    <tr>
      <td>zab</td>
      <td style="text-align: right">22</td>
      <td style="text-align: right">2256</td>
      <td style="text-align: right">-</td>
    </tr>
    <tr>
      <td>system</td>
      <td style="text-align: right">9</td>
      <td style="text-align: right">1603</td>
      <td style="text-align: right">-</td>
    </tr>
    <tr>
      <td><strong>Total</strong></td>
      <td style="text-align: right"><strong>47</strong></td>
      <td style="text-align: right"><strong>5774</strong></td>
      <td style="text-align: right"><strong>3065</strong></td>
    </tr>
  </tbody>
</table>

<h2 id="3-generating-the-test-harness">3. Generating the test harness</h2>

<p>The test harness is also written in Python. It is composed of several modules,
which are listed in the table below.</p>

<table>
  <thead>
    <tr>
      <th>Subsystem</th>
      <th>Modules</th>
      <th style="text-align: right">Lines</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Orchestration</td>
      <td>main.py, scheduler.py, runner.py</td>
      <td style="text-align: right">3462</td>
    </tr>
    <tr>
      <td>Validation</td>
      <td>oracle.py, serde.py</td>
      <td style="text-align: right">2493</td>
    </tr>
    <tr>
      <td>Networking / wire</td>
      <td>comms.py, client_wire.py, quorum_wire.py, election_wire.py</td>
      <td style="text-align: right">3063</td>
    </tr>
    <tr>
      <td>Data model / support</td>
      <td>events.py, queues.py, config.py, fixed_tree.py, log.py</td>
      <td style="text-align: right">867</td>
    </tr>
    <tr>
      <td>Tooling</td>
      <td>log_to_mermaid.py</td>
      <td style="text-align: right">414</td>
    </tr>
    <tr>
      <td><strong>Total</strong></td>
      <td> </td>
      <td style="text-align: right"><strong>10299</strong></td>
    </tr>
  </tbody>
</table>

<p>The interesting design choice here is that the test harness runs a <strong>single
replica of ZooKeeper</strong>. We call this replica <strong>implementation under test</strong>
(IUT). The whole distributed system exists only in the formal specification and
its behavior. This is conceptually similar to a <a href="https://en.wikipedia.org/wiki/Digital_twin?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Digital Twin</a> of the real
distributed system.</p>

<p>Since most of the behavior exists only in the specification, this approach is
sensitive to a <strong>quick and accurate</strong> choice of events. I believe that a random
simulator would not help us much here, as it would keep crunching through a very
large set of unproductive events.</p>

<p>This is where the new <a href="https://github.com/apalache-mc/apalache/tree/main/json-rpc?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Apalache JSON-RPC</a> comes into play. The test harness
chooses the next action to execute and asks the symbolic model checker to find
the action parameters that would enable it. It also calls Apalache to check the
state invariants and find out whether the implementation’s output matches the
specification. Since the complexity of SMT solving grows with the number of
steps very quickly, we use the new method <a href="https://github.com/apalache-mc/apalache/tree/main/json-rpc?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed#311-method-compact">compact</a> to prune the symbolic
context and keep it manageable.</p>
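<p>To make this exchange more concrete, here is a minimal sketch of how a
harness could frame a JSON-RPC 2.0 request. Only the method name <code>compact</code>
comes from the text above. The helper <code>make_request</code> and the parameter
<code>sessionId</code> are hypothetical and are not taken from the Apalache
documentation, so treat this as an illustration of the message envelope rather
than of the actual API.</p>

<pre><code class="language-python">import itertools
import json

# Monotonic request ids, so that responses can be matched to requests.
_ids = itertools.count(1)


def make_request(method, params):
    """Build a JSON-RPC 2.0 request object with a fresh numeric id."""
    return {"jsonrpc": "2.0", "id": next(_ids), "method": method, "params": params}


# "compact" is the method named in the post; "sessionId" is a made-up parameter.
request = make_request("compact", {"sessionId": "hypothetical-session"})
payload = json.dumps(request)
print(payload)
</code></pre>

<p>The serialized payload would then be posted to the JSON-RPC endpoint of the
model checker. The endpoint address and the real parameter names have to be
taken from the Apalache JSON-RPC documentation.</p>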

<h2 id="4-running-the-test-harness">4. Running the test harness</h2>

<p>Running the test harness looks quite boring:</p>

<pre><code class="language-sh">$ ./scripts/run-parallel.sh 8 -- --replicas 3 --episodes 20 --steps 500 \
  --failure-rate 0.2 --fle-rate 0.05 --fallback-rate 0.1 --decay 0.8 --crashes 0
</code></pre>

<p>It basically runs 20 episodes of 500 steps in parallel, with 3 replicas and a
number of parameters to control the test scenario generation. The script is
using <a href="https://www.gnu.org/software/parallel/?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">GNU Parallel</a> to run the episodes in parallel. Since each test runs <strong>a
single actual replica</strong> and simulates the rest of the distributed system with
the specification, running multiple experiments in parallel is easy. We only
have to make sure that different experiments get assigned different ports.</p>
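<p>One simple way to keep the ports apart is to give every episode its own
contiguous block of ports. The sketch below assumes that each replica needs a
client port, a quorum port, and an election port. The function
<code>ports_for_episode</code> and the base port 20000 are hypothetical. The actual
harness may allocate ports differently.</p>

<pre><code class="language-python">def ports_for_episode(episode, replicas, base_port=20000, ports_per_replica=3):
    """Return a disjoint block of ports for every replica of one episode."""
    # Each episode owns a contiguous block of replicas * ports_per_replica ports.
    start = base_port + episode * replicas * ports_per_replica
    ports = {}
    for replica in range(1, replicas + 1):
        first = start + (replica - 1) * ports_per_replica
        ports[replica] = {"client": first, "quorum": first + 1, "election": first + 2}
    return ports


print(ports_for_episode(episode=0, replicas=3)[1])
print(ports_for_episode(episode=1, replicas=3)[1])
</code></pre>

<p>Since the blocks of different episodes never overlap, two parallel
experiments cannot collide on a port, as long as nothing else on the machine
uses the chosen range.</p>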

<p>Every episode produces a detailed log of events. If it finds an invariant
violation or a mismatch between the behavior of the real replica and the
specification, it produces a trace in the <a href="https://apalache-mc.org/docs/adr/015adr-trace.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">ITF format</a>. These logs and traces
are read by the AI tools to triage the mismatches and to improve the
specification and the test harness.</p>
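<p>An ITF trace is a JSON document with a list of variable names under
<code>vars</code> and a list of states under <code>states</code>. A rough sketch of the
first thing a triage script might do is to report which variables changed at
every step. The helper <code>changed_vars</code> is hypothetical. The variable names
in the example are borrowed from this post, whereas the values are made up and
simplified to scalars. Real traces store per-replica values in richer
encodings.</p>

<pre><code class="language-python">import json


def changed_vars(trace):
    """Yield (step, names of the variables that differ from the previous state)."""
    states = trace["states"]
    for step in range(1, len(states)):
        prev, curr = states[step - 1], states[step]
        names = [v for v in trace["vars"] if prev.get(v) != curr.get(v)]
        yield step, names


# A tiny hand-written trace, not an output of the real harness.
example = json.loads("""
{
  "#meta": {"format": "ITF"},
  "vars": ["zab_accepted_epoch", "fle_role"],
  "states": [
    {"#meta": {"index": 0}, "zab_accepted_epoch": 0, "fle_role": "LOOKING"},
    {"#meta": {"index": 1}, "zab_accepted_epoch": 0, "fle_role": "LEADING"},
    {"#meta": {"index": 2}, "zab_accepted_epoch": 1, "fle_role": "LEADING"}
  ]
}
""")

for step, names in changed_vars(example):
    print(step, names)
</code></pre>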

<p>Similar to <a href="https://protocols-made-fun.com/tlaplus/2025/12/15/tftp-symbolic-testing.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">tftp-testing</a>, I have a script to convert the logs into a sequence
chart in Mermaid. However, for a system of this complexity, these diagrams are
hard to digest. Instead, I produce a high-level figure of the test campaign that
shows the events in all episodes in one big picture. See Figure 6. Click on it
to see the full-size version and examine it in detail.</p>

<figure>
    <a href="https://protocols-made-fun.com/img/zk-testing/episodes-summary.png?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" target="_blank" title="Click to open full-size"><picture>
      <img class="responsive-img" src="https://protocols-made-fun.com/img/zk-testing/episodes-summary.png?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" alt="Episodes summary" />
    </picture></a>
    <figcaption>Figure 6: Episodes summary of the test campaign (view the full-size image by clicking).</figcaption>
  </figure>

<p>If you look at the events in the figure, you will see that most episodes
have productive events such as ZAB proposals, commits, diffs, snapshots, client
operations, etc. However, a few episodes degrade into permanent leader election,
where the implementation-under-test keeps sending FLE notifications. Basically,
the two simulated replicas keep working together and exclude the IUT from the
quorum.</p>
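<p>The arithmetic behind this starvation scenario is simple. Here is a small
sketch, assuming strict-majority quorums and exactly one real replica per
episode. The function names are hypothetical and do not come from the
harness.</p>

<pre><code class="language-python">def quorum_size(replicas):
    """Smallest group that forms a strict majority of the ensemble."""
    return replicas // 2 + 1


def simulated_can_exclude_iut(replicas):
    """True if the simulated replicas alone reach a quorum without the IUT."""
    simulated = replicas - 1  # the harness runs exactly one real replica
    needed = quorum_size(replicas)
    return min(simulated, needed) == needed


print(quorum_size(3), simulated_can_exclude_iut(3))
</code></pre>

<p>With three replicas, the quorum size is two, so the two simulated replicas
can always make progress without the IUT. Whether the scheduler should penalize
such schedules is a design choice of the harness.</p>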

<h2 id="5-triaging-conformance-mismatches">5. Triaging conformance mismatches</h2>

<p>Back in October 2025, Copilot + Sonnet 4.5 were quite bad at triaging
specification mismatches. Now, the frontier models are quite good at it. This is
definitely an improvement in the frontier models.  I also believe that my effort
of defining a good architecture for the test harness paid off this time. Below
are fragments of a triage report by Claude Code Opus 4.7:</p>

<div class="ai-output-frame">

  <p>A single oracle-reported spec violation landed in the 2026-04-24 parallel
sweep. The dump files live at:</p>

  <ul>
    <li>logs/20260424_071307/episode_009_step_160_spec_violation.itf.json</li>
    <li>logs/20260424_071307/episode_009_step_160_spec_violation_trace.itf.json</li>
  </ul>

  <p>Configuration: inst03 (PERSIST_IUT_STATE=True), 3 replicas,
permutation {1: 2, 2: 3, 3: 1}, IUT is spec replica 1 / dynamic id 2.
Violation reason: output_queue_violation — the oracle could not validate
the IUT’s zab_follower_info output after 10 drain passes.</p>

  <p>Episode timeline (relevant subset)</p>

  <ol>
    <li>Replica 1 starts at 10:32; long FLE churn; becomes follower; emits
 zab_follower_info at 10:41 with accepted_epoch=0 — spec accepts
 (transition 31). No zab_leader_info ever validated in this episode.</li>
    <li>At 10:58 replica 1 is stopped (tester action, not crash). On-disk
 acceptedEpoch at that moment: still 0 (follower never received
 LEADERINFO, never persisted).</li>
    <li>At 11:06:52 replica 1 restarts. inst03 preserves the data directory,
 so on-disk acceptedEpoch is read back as 0.</li>
    <li>At 11:12:34 IUT emits fle_become_leader(r1) (transition 27). Spec
 speculatively bumps zab_accepted_epoch[1] to 1 via
 candidate_epoch_capped in spec/system.py:665.</li>
    <li>At 11:12:42 scheduler picks simulated→IUT zab_follower_info
 (transition 31 on the simulated follower side). Spec’s
 send_followerinfo (spec/zab.py:900) atomically observes leader-side
 quorum (1 IUT + 1 simulated ≥ majority of 3) and bumps
 zab_persisted_accepted_epoch[1] at spec/zab.py:972 — the comment
 at line 1068 acknowledges this is the earliest-disk-write model of
 Leader.getEpochToPropose().</li>
    <li>At 11:12:48 tcp_disconnect r1 ↔ r3 (quorum).</li>
    <li>At 11:12:50 IUT emits fle_restart_election(r1) (transition 30). Spec
 rolls back zab_accepted_epoch[1] to
 zab_persisted_accepted_epoch[1] per commit aee611c
 (spec/system.py:648-693) — which is already 1 due to step 5, so the
 rollback is a no-op on the live field and the shadow stays at 1.</li>
    <li>At 11:14:16 replica 1 (now FOLLOWING r2 in both spec and IUT) emits
 zab_follower_info(accepted_epoch=0). Spec expects 1. Blocked.</li>
  </ol>

  <p>Root cause</p>

  <p>The spec assumes Leader.getEpochToPropose()’s disk write has completed
 atomically the moment any FOLLOWERINFO arrival would establish quorum —
 that’s the design of the zab_persisted_accepted_epoch[leader] bump in
 send_followerinfo (spec/zab.py:972). In this trace the IUT’s leader
 thread either never processed the simulated FOLLOWERINFO at all (the
 incoming quorum connection to r1 was disrupted around that time — the
 TCP disconnect at 11:12:48 is between r1 and r3, but similar disruption
 to the r2–r1 quorum path around 11:12:42–48 would leave
 getEpochToPropose() blocked), or processed it but was interrupted
 before the disk write completed. Either way, the IUT’s on-disk
 acceptedEpoch stayed at 0 while the spec’s shadow went to 1.</p>

  <p>The aee611c rollback is correctly applied but operates on the live
 epoch field only; by design it restores live from shadow, so when the
 shadow itself is stale-high, restart_election can’t recover. This is
 the new Family-A sub-variant noted in the earlier audit: speculative
 advance of the persisted shadow, not just the live epoch.</p>

</div>

<p>Several things are impressive here:</p>

<ol>
  <li>
    <p><strong>The test harness stopped a replica and dropped a TCP connection at the
 right moments</strong>, so the replica did not have a chance to persist the new
 accepted epoch.  It did not happen often, but the parallel campaign was diverse
 enough to trigger this scenario. To be fair, the initial version of the test
 harness would not be able to trigger this scenario. <strong>I had to teach the AI
 tools to properly diversify the test scenarios</strong>.</p>
  </li>
  <li>
    <p><strong>Claude figured this out in a matter of minutes</strong>. It would be hard for me to
 figure this out.</p>
  </li>
  <li>
    <p><strong>It also proposed a fix.</strong></p>
  </li>
</ol>

<h2 id="6-checking-invariants-and-producing-examples">6. Checking invariants and producing examples</h2>

<p>Since the AI tools write the specification and the test harness, we have to
evaluate the quality of the specification and the harness together. To this end,
we do two things:</p>

<ol>
  <li>Add state invariants to evaluate safety.</li>
  <li>Add state examples to illustrate reachability of interesting states.</li>
</ol>

<h3 id="61-state-invariants">6.1. State invariants</h3>

<p>Luckily, ZooKeeper already has several <a href="https://github.com/Disalg-ICS-NJU/zookeeper-tla-spec?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">TLA<sup>+</sup>
specifications</a> for earlier versions. I let the AI tools
harvest these specifications for invariants.</p>

<p>For example, these are the shortest invariants these tools wrote:</p>

<pre><code class="language-python">@invariant
def leadership1(s: SystemState):
    return s.REPLICA.forall(
        lambda i: s.REPLICA.forall(
            lambda j: (
                _is_established_leader(s, i)
                &amp; _is_established_leader(s, j)
                &amp; (s.zab_accepted_epoch[i] == s.zab_accepted_epoch[j])
            ).implies(i == j)
        )
    )

@invariant
def leadership2(s: SystemState):
    return Set(Val(1), ..., Val(s.MAX_EPOCH)).forall(
        lambda epoch: s.epoch_leader[epoch].size &lt;= Val(1)
    )

@invariant
def fle_wait_finalize_sound(s: SystemState):
    return s.REPLICA.forall(
        lambda replica: (
            _fle_invariant_replica_live(s, replica) &amp; s.fle_wait_finalize[replica]
        ).implies(
            _fle_has_proposed_recv_quorum(s, replica)
            | _fle_has_local_ooe_quorum(s, replica)
        )
    )
</code></pre>

<p>Their TLA<sup>+</sup> translations look like this:</p>

<pre><code class="language-tla">Leadership1 ==
    \A i142 \in REPLICA: \A j143 \in REPLICA:
        (/\ /\ /\ (fle_role[i142] = "LEADING")
               /\ \/ (zab_state[i142] = "synchronization")
                  \/ (zab_state[i142] = "broadcast")
            /\ /\ (fle_role[j143] = "LEADING")
               /\ \/ (zab_state[j143] = "synchronization")
                  \/ (zab_state[j143] = "broadcast")
         /\ (zab_accepted_epoch[i142] = zab_accepted_epoch[j143])) =&gt; ((i142 = j143))

Leadership2 ==
    \A epoch144 \in (1)..(MAX_EPOCH): (Cardinality(epoch_leader[epoch144]) &lt;= 1)
</code></pre>

<p>The translation of <code>fle_wait_finalize_sound</code> is a bit longer; you can find it
in the gist under
<a href="https://gist.github.com/konnov/38af0cbd45b68da819cd76f70859ed94?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed#file-system-tla-L535-L559">FleWaitFinalizeSound</a>.</p>

<p>We have 11 invariants in total. The other 8 invariants are more complex. These
invariants are checked by the test harness with Apalache. We can also check them
against the generated TLA<sup>+</sup> specification.</p>
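
<p>To make the first invariant concrete, here is a minimal plain-Python sketch of
the same check over a dictionary snapshot. It does not use the DSL shown above:
the dictionary layout, the helper <code>is_established_leader</code>, and the
<code>"FOLLOWING"</code> role value are assumptions for illustration, while the
state names are taken from the TLA<sup>+</sup> translation.</p>

```python
from itertools import combinations

# Hypothetical plain-Python snapshot of the spec state. The field names mirror
# the invariant above, but this is not the DSL used by the harness.
ESTABLISHED = {"synchronization", "broadcast"}


def is_established_leader(state, i):
    """A replica is an established leader if it leads and has finished discovery."""
    return state["fle_role"][i] == "LEADING" and state["zab_state"][i] in ESTABLISHED


def leadership1(state):
    """No two distinct established leaders share an accepted epoch."""
    leaders = [i for i in state["replicas"] if is_established_leader(state, i)]
    return all(
        state["zab_accepted_epoch"][i] != state["zab_accepted_epoch"][j]
        for i, j in combinations(leaders, 2)
    )


# Two hand-made snapshots: one with a single leader, one with two leaders
# in the same epoch.
good = {
    "replicas": [1, 2, 3],
    "fle_role": {1: "FOLLOWING", 2: "FOLLOWING", 3: "LEADING"},
    "zab_state": {1: "broadcast", 2: "broadcast", 3: "broadcast"},
    "zab_accepted_epoch": {1: 1, 2: 1, 3: 1},
}
bad = dict(good, fle_role={1: "LEADING", 2: "FOLLOWING", 3: "LEADING"})
```

<p>In this sketch, <code>leadership1(good)</code> holds, whereas
<code>leadership1(bad)</code> fails because replicas 1 and 3 both lead in epoch 1.</p>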

<h3 id="62-reachability-examples">6.2. Reachability examples</h3>

<p>I usually write “falsy invariants” to check reachability of interesting states.
Again, the AI tools are quite good at writing such “examples”. For instance:</p>

<pre><code class="language-python">@example
def at_least_one_committed(s: SystemState):
    return s.REPLICA.exists(
        lambda replica: s.zab_last_committed[replica].index &gt;= Val(1)
    )
</code></pre>

<p>This example is translated to the following TLA<sup>+</sup> invariant:</p>

<pre><code class="language-tla">AtLeastOneCommitted ==
    ~(\E replica63 \in REPLICA: (zab_last_committed[replica63].index &gt;= 1))
</code></pre>
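
<p>The duality between an example and its negated invariant can be sketched in
plain Python as follows. This is only an illustration under assumptions: the
trace is a hypothetical list of dictionaries, and <code>first_witness</code> is
a made-up helper, not a function of the harness or of Apalache.</p>

```python
def first_witness(trace, example):
    """Return (step, state) for the first state satisfying `example`, or None.

    Checking the invariant `not example(state)` and reporting its first
    violation amounts to the same search.
    """
    for step, state in enumerate(trace):
        if example(state):
            return step, state
    return None


def at_least_one_committed(state):
    """Some replica has committed at least one entry."""
    return any(index >= 1 for index in state["zab_last_committed_index"].values())


# Hypothetical three-state trace over replicas 1..3.
trace = [
    {"zab_last_committed_index": {1: 0, 2: 0, 3: 0}},
    {"zab_last_committed_index": {1: 0, 2: 0, 3: 0}},
    {"zab_last_committed_index": {1: 0, 2: 0, 3: 1}},
]
witness = first_witness(trace, at_least_one_committed)
```

<p>A "violation" of the negated invariant is thus good news: it is a witness
that the interesting state is reachable.</p>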

<p>To see an instance of this example, I ran a test campaign with 10 episodes of
100 steps each. One of the episodes reached a state that satisfies
<code>at_least_one_committed</code>. Below is the summary of this episode by Claude Code
Opus 4.7:</p>

<div class="ai-output-frame">

  <p>Trace summary — example of at_least_one_committed</p>

  <p>Episode 5, step 86. Replica mapping: spec 1 ↔ dynamic 2 (the IUT), spec 2 ↔ dynamic 1, spec 3 ↔ dynamic 3. So the IUT
is spec replica 1 in the analysis below.</p>

  <p>What happened, in order:</p>

  <ol>
    <li>Initial election attempt fails. Spec 1 (IUT) briefly thinks it’s leader (fle_become_leader replica=1 at t≈+25s) but
 then has to fle_restart_election (t≈+45s). Spec 3 simultaneously becomes leader (fle_become_leader replica=3 at
 t≈+19s); spec 2 becomes follower early on.</li>
    <li>Stable leadership formed under spec 3. After the restart, IUT (spec 1) joins as follower of spec 3
 (fle_become_follower replica=1, then zab_follower_info {follower_id:1, leader_id:3}).</li>
    <li>ZAB recovery completes. The leader (spec 3) walks IUT through LEADERINFO → DIFF → NEWLEADER → UPTODATE (all with
 follower_id:1, leader_id:3). A second LEADERINFO/NEWLEADER/UPTODATE round (no ids logged) syncs the other simulated
 follower spec 2. All three replicas reach zab_state = “broadcast”.</li>
    <li>Write submitted to the leader. Step 85: a simulated client connects to spec 3 (the leader) and issues
 zab_client_create(“/p1”).</li>
    <li>Proposal phase (step 86):
      <ul>
        <li>Leader (spec 3) emits zab_proposal at zxid=1, epoch=1.</li>
        <li>A follower acks: zab_ack_proposal zxid=1 validates cleanly (transition 52).</li>
        <li>With its own self-ack the leader reaches quorum and commits locally.</li>
        <li>Oracle then detects the violated invariant — i.e. the dual @example fired.</li>
      </ul>
    </li>
  </ol>

  <p>Violation state (spec replica 3 only):</p>
  <ul>
    <li>zab_history[3] = [{create, /p1, zxid=1, epoch=1}]</li>
    <li>zab_last_committed[3] = {index:1, zxid:1}</li>
    <li>zab_committed_zxid = 1</li>
    <li>Replicas 1 (IUT) and 2 still have zab_last_committed.index = 0 and empty histories — the commit has not yet been
broadcast to them.</li>
  </ul>

</div>

<p>Below is the sequence diagram of this full episode. Click on it to see the full-size
version and examine it in detail. It is quite long, so feel free to scroll
through it.</p>

<figure>
    <a href="https://protocols-made-fun.com/img/zk-testing/at_least_one_committed.svg?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" target="_blank" title="Click to open full-size"><picture>
      <img class="responsive-img" src="https://protocols-made-fun.com/img/zk-testing/at_least_one_committed.svg?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" alt="At least one committed" />
    </picture></a>
    <figcaption>Figure 7: A trace where at least one replica has committed (view the full-size image by clicking).</figcaption>
</figure>

<p>The table below shows the example coverage in two test campaigns (20 episodes of
100 steps and 20 episodes of 200 steps).</p>

<table>
  <thead>
    <tr>
      <th>Example</th>
      <th style="text-align: right">Times found</th>
      <th style="text-align: right">Min step</th>
      <th style="text-align: right">Max step</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>at_least_one_committed</td>
      <td style="text-align: right">1</td>
      <td style="text-align: right">48</td>
      <td style="text-align: right">48</td>
    </tr>
    <tr>
      <td>at_least_two_committed</td>
      <td style="text-align: right">1</td>
      <td style="text-align: right">74</td>
      <td style="text-align: right">74</td>
    </tr>
    <tr>
      <td>some_follower_synced</td>
      <td style="text-align: right">4</td>
      <td style="text-align: right">26</td>
      <td style="text-align: right">43</td>
    </tr>
    <tr>
      <td>quorum_recovery_completed</td>
      <td style="text-align: right">7</td>
      <td style="text-align: right">21</td>
      <td style="text-align: right">185</td>
    </tr>
    <tr>
      <td>newleader_sent</td>
      <td style="text-align: right">3</td>
      <td style="text-align: right">23</td>
      <td style="text-align: right">42</td>
    </tr>
    <tr>
      <td>forwarded_request_sent</td>
      <td style="text-align: right">1</td>
      <td style="text-align: right">46</td>
      <td style="text-align: right">46</td>
    </tr>
    <tr>
      <td>forwarded_request_received</td>
      <td style="text-align: right">1</td>
      <td style="text-align: right">68</td>
      <td style="text-align: right">68</td>
    </tr>
    <tr>
      <td>forwarded_write_committed</td>
      <td style="text-align: right">1</td>
      <td style="text-align: right">74</td>
      <td style="text-align: right">74</td>
    </tr>
    <tr>
      <td>proposal_in_flight</td>
      <td style="text-align: right">2</td>
      <td style="text-align: right">46</td>
      <td style="text-align: right">47</td>
    </tr>
    <tr>
      <td>proposal_has_quorum_acks</td>
      <td style="text-align: right">1</td>
      <td style="text-align: right">74</td>
      <td style="text-align: right">74</td>
    </tr>
    <tr>
      <td>two_proposals_in_flight</td>
      <td style="text-align: right">1</td>
      <td style="text-align: right">72</td>
      <td style="text-align: right">72</td>
    </tr>
    <tr>
      <td>first_write_committed_on_quorum</td>
      <td style="text-align: right">1</td>
      <td style="text-align: right">48</td>
      <td style="text-align: right">48</td>
    </tr>
    <tr>
      <td>two_distinct_znodes_committed</td>
      <td style="text-align: right">1</td>
      <td style="text-align: right">74</td>
      <td style="text-align: right">74</td>
    </tr>
  </tbody>
</table>
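
<p>A table of this shape can be computed from a flat log of example hits. The
sketch below assumes a hypothetical log format of <code>(example, step)</code>
pairs; the three hits are the ones implied by the
<code>proposal_in_flight</code> and <code>at_least_one_committed</code> rows
above.</p>

```python
from collections import defaultdict


def summarize(hits):
    """Aggregate (example_name, step) pairs into per-example statistics."""
    steps_by_example = defaultdict(list)
    for name, step in hits:
        steps_by_example[name].append(step)
    return {
        name: {
            "times_found": len(steps),
            "min_step": min(steps),
            "max_step": max(steps),
        }
        for name, steps in steps_by_example.items()
    }


# Hits consistent with two rows of the table; the log format is an assumption.
hits = [
    ("proposal_in_flight", 46),
    ("proposal_in_flight", 47),
    ("at_least_one_committed", 48),
]
summary = summarize(hits)
```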

<p><a id="conclusions"></a></p>
<h2 id="7-conclusions">7. Conclusions</h2>

<p>Obviously, the AI tools change the way we should test distributed systems.
Interestingly, my conversations with these tools show that they have very little
understanding of distributed computations. Funnily enough, even though they have
all the knowledge about TCP/IP at their fingertips if you ask them the right way, they
cannot apply this knowledge efficiently. However, when they have plenty
of counterexamples to learn from, they improve the quality of the testing
harness very quickly. This is where the interplay of formal verification and AI
tools becomes really powerful. The model checker produces negative and positive
examples, and the AI tools learn from them and improve the specification and the
test harness.</p>

<p><strong>The good</strong>:</p>

<ul>
  <li>
    <p>It is actually <strong>possible to extract formal specifications from the source
 code</strong> of a real distributed system and to write test harnesses with AI tools.
 We have to keep in mind that this requires a verification loop, which uses a
 tool such as <a href="https://apalache-mc.org?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Apalache</a>.</p>
  </li>
  <li>
    <p>In this experiment, we extracted a <strong>modular specification that captures five
 protocols</strong>.</p>
  </li>
  <li>
    <p>If we do not try to one-shot the testing process and <strong>follow an iterative
 process with a clear pre-defined architecture</strong>, the AI tools actually help us.
 <strong>“Test it and make no mistakes” obviously does not work</strong>.</p>
  </li>
  <li>
    <p>Compared to <a href="https://protocols-made-fun.com/tlaplus/2025/12/15/tftp-symbolic-testing.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">tftp-testing</a>, the <strong>AI tools in 2026 are much better</strong> at
 triaging specification mismatches and producing fixes. The whole process is
 <strong>much less energy-draining now</strong>.</p>
  </li>
</ul>

<p><strong>The bad</strong>:</p>

<ul>
  <li>
    <p>I have <strong>no idea about the extracted specification</strong>. When I write a
 specification by hand, I internalize the protocol behavior.  Even after I
 forget the details, I can still come back and recover them from the spec.
 Here, it is much harder.</p>
  </li>
  <li>
    <p>If we focus on bug finding, it is fine to have a hard-to-understand
 specification. However, <strong>from the maintainability perspective, it is a big
 problem</strong>. This is probably why we see such a spike in security bugs, but not so
 much in real software products.</p>
  </li>
</ul>

<p>Even though the whole development is quite exciting, my main takeaway is that
<strong>writing formal specifications is still a human job</strong>. AI tools
can assist us in producing test harnesses and finding issues.</p>

<p>If you need help with writing formal specifications and producing test
harnesses, contact me. I can help you with that. It still takes time, expertise,
and effort to do it right. Also, coming up with the right architecture is not as
easy as it may seem. Of course, you can hire an intern and spend several months
learning from your own mistakes together. Or you can fast-forward it and hire
me.</p>

<h2 id="want-to-talk">Want to talk?</h2>

<!-- References -->]]></content>
      

      
      
      
      
      

      <author>
          <name>Igor Konnov</name>
        
          <email>igor@konnov.phd</email>
        
        
          <uri>https://konnov.phd?utm_source=protocols_made_fun&amp;utm_medium=referral&amp;utm_campaign=pmf_site</uri>
        
      </author>

      
        
          <category term="testing" />
        
          <category term="model-checking" />
        
      

      

      
      
        <summary type="html"><![CDATA[Author: Igor Konnov]]></summary>
      

      
      
    </entry>
  
    <entry>
      
      

      <title type="html">TLC breadth-first search vs random simulation</title>
      <link href="https://protocols-made-fun.com/model-checking/simulation/2026/04/30/tlc-vs-simulation.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" rel="alternate" type="text/html" title="TLC breadth-first search vs random simulation" />
      <published>2026-04-30T00:00:00+00:00</published>
      <updated>2026-04-30T00:00:00+00:00</updated>
      <id>https://protocols-made-fun.com/model-checking/simulation/2026/04/30/tlc-vs-simulation</id>
      
      
        <content type="html" xml:base="https://protocols-made-fun.com/model-checking/simulation/2026/04/30/tlc-vs-simulation.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed"><![CDATA[<p><strong>Author:</strong> <a href="https://konnov.phd?utm_source=protocols_made_fun&amp;utm_medium=referral&amp;utm_campaign=pmf_site">Igor Konnov</a></p>

<p><strong>Date:</strong> April 30, 2026</p>

<p>Recently, I wrote a blog post on <a href="https://protocols-made-fun.com/testing/model-checking/2026/03/09/random-walks.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">random walks</a> that compared the state
coverage of random walks for increasingly larger sets of experiments: 100
thousand, 1 million, 10 million, and even 100 million episodes. There, I used
custom-built simulators in Rust to randomly walk through the state spaces of
several TLA<sup>+</sup> benchmarks: two-phase commit, readers-writers, and
FPaxos.</p>

<p>A. Jesse Jiryu Davis noticed my blog post and <a href="https://groups.google.com/g/tlaplus/c/iFUAhlsIuQQ/m/t044etF6AwAJ?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">wrote a
message</a> on the TLA<sup>+</sup> Google group. Since the
random walks in the blog post are not exactly the same as the random simulation
in TLC, we both wondered how the TLC simulation mode compares to the model
checker in terms of coverage and running times. Markus Kuppe shared the options
to collect distinct state coverage in the TLC simulation mode, see the
<a href="https://groups.google.com/g/tlaplus/c/iFUAhlsIuQQ/m/t044etF6AwAJ?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">discussion</a>.</p>

<p>The <strong>important distinction</strong> between the <a href="https://protocols-made-fun.com/testing/model-checking/2026/03/09/random-walks.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">random walks</a> and the TLC
simulator is that <strong>a random walk chooses one successor state at each step</strong>,
whereas <strong>TLC enumerates all successor states and then chooses one of them
uniformly at random.</strong> Markus explains the rationale behind this design choice
in TLC:</p>

<blockquote>
  <p>…enumerating all successors is useful for more than just choosing
the next step: TLC can also check invariants on all generated successor states,
not only on the one that ends up being sampled. That is a meaningful benefit
when the goal is to catch bugs, not just drive a walk.</p>
</blockquote>
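
<p>The difference between the two stepping strategies can be sketched on a toy
transition system. Everything below is an assumption for illustration: the
counter system, the function names, and the choice to stutter when a randomly
picked action is disabled. It is neither TLC's code nor the Rust simulators
from the earlier post.</p>

```python
import random


# Toy transition system (an assumption for illustration): a counter in 0..3.
def inc(x):
    return x + 1 if x != 3 else None  # disabled at the upper bound


def dec(x):
    return x - 1 if x != 0 else None  # disabled at the lower bound


ACTIONS = [inc, dec]


def random_walk_step(state, rng):
    """Pick one action at random and compute only its successor."""
    action = rng.choice(ACTIONS)
    successor = action(state)
    return state if successor is None else successor  # stutter when disabled


def enumerate_and_pick_step(state, invariant, rng):
    """Compute all successors, check the invariant on each, then sample one."""
    successors = [s for s in (a(state) for a in ACTIONS) if s is not None]
    violations = [s for s in successors if not invariant(s)]
    return rng.choice(successors), violations


rng = random.Random(0)
# From state 2, the successors are 3 and 1. The invariant "x != 3" is reported
# as violated even if the sampled successor happens to be 1.
next_state, violations = enumerate_and_pick_step(2, lambda s: s != 3, rng)
```

<p>The second strategy pays for computing every successor at each step, but it
sees violations in states that the walk itself never visits.</p>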

<p>This new blog post explores this direction. To get more details about the
benchmarks, read the original blog post on <a href="https://protocols-made-fun.com/testing/model-checking/2026/03/09/random-walks.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">random walks</a>.</p>

<p>Look at two groups of figures below. They summarize the results of running
random walks on specifications of three prominent distributed protocols:
two-phase commit, readers-writers, and FPaxos (see <a href="https://protocols-made-fun.com/model-checking/simulation/2026/04/30/tlc-vs-simulation.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed#benchmarks">Benchmarks</a>).
The figures show the coverage achieved by the TLC simulation mode, with 100%
being the number of distinct states (reported by TLC). All running times are
given for an AMD Ryzen 9 5950X processor (16 physical, 32 logical cores), 128 GB
memory.</p>

<p>Importantly, the TLC simulations run on <strong>a single worker</strong>, not in
parallel. We do this in order to compute the state coverage
precisely. When you look at running times, keep in mind that TLC can run
multiple simulation workers in parallel.</p>

<div class="tip-box">
    <div class="tip-header"><strong>💡 Tip:</strong></div>
    <div class="tip"><p>In contrast to the previous blog posts, I do not
provide the artifacts for download. AI slop forks are real. It still takes me
several days to design and conduct the experiments on a beefy machine, as well
as to find the right format to interpret and plot the data. It only takes 10-15
minutes to repackage the benchmarks and results with an AI tool, having the
experimental data. Hence, I am sharing my lab book with the customers and
researchers, upon request.</p>
</div>
</div>

<h2 id="1-coverage-for-minimal-instances">1. Coverage for minimal instances</h2>

<p>In this set of experiments, we run <strong>TLC simulations</strong> for the minimal instances
of the benchmarks. We start with the <strong>meaningful default</strong> of 100,000
simulation runs, with at most 100 steps per run. (Mind that the successor set is
computed at each step.) As you can see from Figure 1, the coverage is close to
100%, but it’s not complete. Interestingly, 10 million runs give us 99.9%
coverage.</p>
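
<p>As a rough illustration of how such a coverage number is obtained, the sketch
below takes the distinct states found by breadth-first search as the 100%
baseline and divides the distinct states visited by bounded random runs by that
count. The two-counter system and all names are assumptions for illustration;
the actual experiments use TLC's own state counting.</p>

```python
import random
from collections import deque

N = 3  # toy bound, an assumption for illustration


def successors(state):
    """Each counter may grow independently until it reaches N."""
    a, b = state
    result = []
    if a != N:
        result.append((a + 1, b))
    if b != N:
        result.append((a, b + 1))
    return result


def reachable(init):
    """Breadth-first search: the 100% baseline of distinct states."""
    seen, queue = {init}, deque([init])
    while queue:
        for nxt in successors(queue.popleft()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def simulate(init, runs, max_steps, rng):
    """Distinct states visited by `runs` random runs of bounded length."""
    visited = {init}
    for _ in range(runs):
        state = init
        for _ in range(max_steps):
            options = successors(state)
            if not options:
                break  # deadlock: nothing left to explore on this run
            state = rng.choice(options)
            visited.add(state)
    return visited


total = reachable((0, 0))
visited = simulate((0, 0), runs=5, max_steps=100, rng=random.Random(1))
coverage = len(visited) / len(total)
```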

<div class="figure-grid">
  <figure>
    <a href="https://protocols-made-fun.com/img/tlc-vs-simulation/twophase-n2-coverage.png?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" target="_blank" title="Click to open full-size"><picture>
      <img class="responsive-img" src="https://protocols-made-fun.com/img/tlc-vs-simulation/twophase-n2-coverage.png?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" alt="Coverage of random walks for the two-phase commit benchmark with 2 resource managers" />
    </picture></a>
    <figcaption>Figure 1.a: Two-phase commit, 2 RMs.</figcaption>
  </figure>
  <figure>
    <a href="https://protocols-made-fun.com/img/tlc-vs-simulation/twophase-n3-coverage.png?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" target="_blank" title="Click to open full-size"><picture>
      <img class="responsive-img" src="https://protocols-made-fun.com/img/tlc-vs-simulation/twophase-n3-coverage.png?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" alt="Coverage of random walks for the two-phase commit benchmark with 3 resource managers" />
    </picture></a>
    <figcaption>Figure 1.b: Two-phase commit, 3 RMs.</figcaption>
  </figure>
  <figure>
    <a href="https://protocols-made-fun.com/img/tlc-vs-simulation/rw-inst3-coverage.png?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" target="_blank" title="Click to open full-size"><picture>
      <img class="responsive-img" src="https://protocols-made-fun.com/img/tlc-vs-simulation/rw-inst3-coverage.png?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" alt="Coverage of random walks for the readers-writers benchmark with 3 actors" />
    </picture></a>
    <figcaption>Figure 1.c: Readers-writers, 3 actors.</figcaption>
  </figure>
  <figure>
    <a href="https://protocols-made-fun.com/img/tlc-vs-simulation/fpaxos-inst2-coverage.png?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" target="_blank" title="Click to open full-size"><picture>
      <img class="responsive-img" src="https://protocols-made-fun.com/img/tlc-vs-simulation/fpaxos-inst2-coverage.png?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" alt="Coverage of random walks for the FPaxos benchmark with 2 acceptors" />
    </picture></a>
    <figcaption>Figure 1.d: FPaxos, 2 acceptors.</figcaption>
  </figure>
</div>

<h2 id="2-running-times-for-tiny-instances">2. Running times for tiny instances</h2>

<p><strong>All of the above benchmarks are quite small by the model checking
standards. They have tens of thousands of states. It takes TLC only 1-3 seconds
to explore the state space and check the invariants for each of these
benchmarks.</strong></p>

<p>Figure 2 shows the running times for the above simulation benchmarks. The dashed
lines show the running times for the TLC model checker to explore the complete
state space.  Additionally, the right-hand y-axis shows the slowdown factor of
the simulations compared to the model checker. For example, for the two-phase
commit benchmark with 3 resource managers, the model checker takes about 2
seconds, while 10 million simulations take about 2 hours, which is a slowdown
factor of about 10,000. As you can see, 10 million simulations take hours, where
the model checker needs several seconds.</p>

<div class="figure-grid">
  <figure>
    <a href="https://protocols-made-fun.com/img/tlc-vs-simulation/twophase-n2-runtime.png?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" target="_blank" title="Click to open full-size"><picture>
      <img class="responsive-img" src="https://protocols-made-fun.com/img/tlc-vs-simulation/twophase-n2-runtime.png?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" alt="Running times for the two-phase commit benchmark with 2 resource managers" />
    </picture></a>
    <figcaption>Figure 2.a: Two-phase commit, 2 RMs.</figcaption>
  </figure>
  <figure>
    <a href="https://protocols-made-fun.com/img/tlc-vs-simulation/twophase-n3-runtime.png?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" target="_blank" title="Click to open full-size"><picture>
      <img class="responsive-img" src="https://protocols-made-fun.com/img/tlc-vs-simulation/twophase-n3-runtime.png?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" alt="Running times for the two-phase commit benchmark with 3 resource managers" />
    </picture></a>
    <figcaption>Figure 2.b: Two-phase commit, 3 RMs.</figcaption>
  </figure>
  <figure>
    <a href="https://protocols-made-fun.com/img/tlc-vs-simulation/rw-inst3-runtime.png?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" target="_blank" title="Click to open full-size"><picture>
      <img class="responsive-img" src="https://protocols-made-fun.com/img/tlc-vs-simulation/rw-inst3-runtime.png?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" alt="Running times for the readers-writers benchmark with 3 actors" />
    </picture></a>
    <figcaption>Figure 2.c: Readers-writers, 3 actors.</figcaption>
  </figure>
  <figure>
    <a href="https://protocols-made-fun.com/img/tlc-vs-simulation/fpaxos-inst2-runtime.png?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" target="_blank" title="Click to open full-size"><picture>
      <img class="responsive-img" src="https://protocols-made-fun.com/img/tlc-vs-simulation/fpaxos-inst2-runtime.png?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" alt="Running times for the FPaxos benchmark with 2 acceptors" />
    </picture></a>
    <figcaption>Figure 2.d: FPaxos, 2 acceptors.</figcaption>
  </figure>
</div>

<h2 id="3-slightly-larger-instances">3. Slightly larger instances</h2>

<p>What happens if we take instances that are still small but have one or two
more participants? Figure 3 shows the results of running TLC simulations on
these instances.</p>

<p>As you can see, with the meaningful default of 100,000 random walks, we achieve
poor coverage on readers-writers and FPaxos, though the coverage on two-phase
commit is nearly 99%. So the TLC simulations achieve much better coverage on
two-phase commit than <a href="https://protocols-made-fun.com/testing/model-checking/2026/03/09/random-walks.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">random walks</a>, but they have comparable coverage on the
readers-writers and FPaxos benchmarks! You can also switch between this blog
post and <a href="https://protocols-made-fun.com/testing/model-checking/2026/03/09/random-walks.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">random walks</a> to see the difference in coverage between the two
approaches.</p>

<p>To stress the message of the previous blog post, these instances are <strong>not that
large by the model checking standards</strong>.</p>

<div class="figure-grid">
  <figure>
    <a href="https://protocols-made-fun.com/img/tlc-vs-simulation/twophase-n5-coverage.png?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" target="_blank" title="Click to open full-size"><picture>
      <img class="responsive-img" src="https://protocols-made-fun.com/img/tlc-vs-simulation/twophase-n5-coverage.png?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" alt="Coverage of random walks for the two-phase commit benchmark with 5 resource managers" />
    </picture></a>
    <figcaption>Figure 3.a: Two-phase commit, 5 RMs.</figcaption>
  </figure>
  <figure>
    <a href="https://protocols-made-fun.com/img/tlc-vs-simulation/rw-inst4-coverage.png?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" target="_blank" title="Click to open full-size"><picture>
      <img class="responsive-img" src="https://protocols-made-fun.com/img/tlc-vs-simulation/rw-inst4-coverage.png?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" alt="Coverage of random walks for the readers-writers benchmark with 4 actors" />
    </picture></a>
    <figcaption>Figure 3.b: Readers-writers, 4 actors.</figcaption>
  </figure>
  <figure>
    <a href="https://protocols-made-fun.com/img/tlc-vs-simulation/fpaxos-inst3-coverage.png?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" target="_blank" title="Click to open full-size"><picture>
      <img class="responsive-img" src="https://protocols-made-fun.com/img/tlc-vs-simulation/fpaxos-inst3-coverage.png?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" alt="Coverage of random walks for the FPaxos benchmark with 3 acceptors" />
    </picture></a>
    <figcaption>Figure 3.c: FPaxos, 3 acceptors.</figcaption>
  </figure>
  <figure>
    <a href="https://protocols-made-fun.com/img/tlc-vs-simulation/fpaxos-inst4-coverage.png?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" target="_blank" title="Click to open full-size"><picture>
      <img class="responsive-img" src="https://protocols-made-fun.com/img/tlc-vs-simulation/fpaxos-inst4-coverage.png?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" alt="Coverage of random walks for the FPaxos benchmark with 4 acceptors" />
    </picture></a>
    <figcaption>Figure 3.d: FPaxos, 4 acceptors.</figcaption>
  </figure>
</div>

<h2 id="4-running-times-for-larger-instances">4. Running times for larger instances</h2>

<p>Again, <strong>it takes the model checker TLC up to 10 minutes to enumerate all the
states and check the invariants for these instances</strong>, whereas we have been
<strong>running the simulations for hours!</strong></p>

<div class="figure-grid">
  <figure>
    <a href="https://protocols-made-fun.com/img/tlc-vs-simulation/twophase-n5-runtime.png?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" target="_blank" title="Click to open full-size"><picture>
      <img class="responsive-img" src="https://protocols-made-fun.com/img/tlc-vs-simulation/twophase-n5-runtime.png?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" alt="Running times for the two-phase commit benchmark with 5 resource managers" />
    </picture></a>
    <figcaption>Figure 4.a: Two-phase commit, 5 RMs.</figcaption>
  </figure>
  <figure>
    <a href="https://protocols-made-fun.com/img/tlc-vs-simulation/rw-inst4-runtime.png?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" target="_blank" title="Click to open full-size"><picture>
      <img class="responsive-img" src="https://protocols-made-fun.com/img/tlc-vs-simulation/rw-inst4-runtime.png?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" alt="Running times for the readers-writers benchmark with 4 actors" />
    </picture></a>
    <figcaption>Figure 4.b: Readers-writers, 4 actors.</figcaption>
  </figure>
  <figure>
    <a href="https://protocols-made-fun.com/img/tlc-vs-simulation/fpaxos-inst3-runtime.png?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" target="_blank" title="Click to open full-size"><picture>
      <img class="responsive-img" src="https://protocols-made-fun.com/img/tlc-vs-simulation/fpaxos-inst3-runtime.png?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" alt="Running times for the FPaxos benchmark with 3 acceptors" />
    </picture></a>
    <figcaption>Figure 4.c: FPaxos, 3 acceptors.</figcaption>
  </figure>
  <figure>
    <a href="https://protocols-made-fun.com/img/tlc-vs-simulation/fpaxos-inst4-runtime.png?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" target="_blank" title="Click to open full-size"><picture>
      <img class="responsive-img" src="https://protocols-made-fun.com/img/tlc-vs-simulation/fpaxos-inst4-runtime.png?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" alt="Running times for the FPaxos benchmark with 4 acceptors" />
    </picture></a>
    <figcaption>Figure 4.d: FPaxos, 4 acceptors.</figcaption>
  </figure>
</div>

<h2 id="5-conclusions">5. Conclusions</h2>

<p>I am not going to repeat the <a href="https://protocols-made-fun.com/testing/model-checking/2026/03/09/random-walks.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed#6-conclusions">conclusions from the previous blog
post</a>. They are still valid. The TLC simulation
mode achieves better coverage than the random walks on two-phase commit, but it
has comparable coverage on readers-writers and FPaxos. The running times of the
TLC simulator with a single worker are worse than those of the model checker and the
random walks.</p>

<h2 id="want-to-talk">Want to talk?</h2>

<!-- References -->]]></content>
      

      
      
      
      
      

      <author>
          <name>Igor Konnov</name>
        
          <email>igor@konnov.phd</email>
        
        
          <uri>https://konnov.phd?utm_source=protocols_made_fun&amp;utm_medium=referral&amp;utm_campaign=pmf_site</uri>
        
      </author>

      
        
          <category term="model-checking" />
        
          <category term="simulation" />
        
      

      

      
      
        <summary type="html"><![CDATA[Author: Igor Konnov]]></summary>
      

      
      
    </entry>
  
    <entry>
      
      

      <title type="html">Specification debugging as code generation</title>
      <link href="https://protocols-made-fun.com/testing/model-checking/2026/03/23/debug-as-code-generation.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" rel="alternate" type="text/html" title="Specification debugging as code generation" />
      <published>2026-03-23T00:00:00+00:00</published>
      <updated>2026-03-23T00:00:00+00:00</updated>
      <id>https://protocols-made-fun.com/testing/model-checking/2026/03/23/debug-as-code-generation</id>
      
      
        <content type="html" xml:base="https://protocols-made-fun.com/testing/model-checking/2026/03/23/debug-as-code-generation.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed"><![CDATA[<p><strong>Author:</strong> <a href="https://konnov.phd?utm_source=protocols_made_fun&amp;utm_medium=referral&amp;utm_campaign=pmf_site">Igor Konnov</a></p>

<p><strong>Date:</strong> March 23, 2026</p>

<p>This is an anecdote about another useful application of Codex and Claude Code in
the middle of a testing project. It is another example of using LLMs to make
distributed systems easier to test and debug, instead of generating piles of
slop.</p>

<h2 id="context">Context</h2>

<p>I am currently developing a test harness for an implementation of distributed
consensus; I cannot disclose the details yet. Think of the approach presented in
<a href="https://protocols-made-fun.com/tlaplus/2025/12/15/tftp-symbolic-testing.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">TFTP Symbolic Testing</a> but for a more complex distributed system.
It involves five state machines, each for a different subprotocol of the system.
The submachines are composed into a single machine. We generate the protocol
specifications and the test harness with Claude Code and Codex. This test
harness produces input events for the protocol implementation with <a href="https://apalache-mc.org?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Apalache</a>
and replays the output events from the implementation, checking that they
conform to the protocol specification with Apalache. Pretty cool. This is
<strong>AI-assisted protocol extraction and testing</strong>.</p>

<p>It took me a few days to bootstrap this project, designing the proper interfaces
and the harness architecture. After shepherding the AI tools for dozens of
iterations, I got a harness that communicates with the implementation. Whenever
it surfaces a mismatch, the LLM looks into the source code, the specification,
and the test harness, and investigates the mismatch. When it identifies the root
cause and proposes a fix, I take a careful look at the proposed fix, and if it
looks good, I apply it. Sometimes, the LLM identifies a completely wrong root
cause. However, after a few iterations, we end up with the right root cause and
the right fix. This is a very efficient way to extract the protocol
specification from the actual implementation, and it is much faster than doing
it manually.</p>

<p>This specification-testing-refinement loop worked quite well for multiple
iterations. Sometimes, I could even leave this agentic loop running for several
hours unattended, though protocol extraction often requires human supervision.
At some point, however, the harness and the implementation started to produce a
sequence of events that was rejected by the specification, or, to be more
precise, by the model checker following the specification. Codex and Claude
tried to identify the root cause and fix it. They introduced multiple fixes, but
the mismatch persisted. I basically lost a whole day looking at the LLM outputs
and prompting them. At some point, I looked at the git log and realized that
<strong>we were going in circles</strong>. What was worse, every fix was introducing a
workaround and generally the harness started to degrade. So we went down the
rabbit hole of slop. In the rest of this post, I will just call both Claude and
Codex “the LLM”, as I don’t remember which one did what, and it doesn’t matter
for the story.</p>

<p>At this point, I realized that the AI feedback loop stopped working, and the
human had to seriously intervene. The LLMs could describe what was happening, we
had a concrete state from the model checker, but a single transition that
<strong>must have been enabled</strong> in this state was not enabled. If you have worked with
formal verification tools, you know that <strong>this is the situation we all dread</strong>.
It is a clear sign of <strong>the specification being overconstrained</strong>. The model
checker is doing its job, and it is correctly rejecting the transition. The
issue is that the specification requires an impossible combination somewhere,
like <strong>x = 2 and x = 3</strong>. In this case, the model checker cannot produce a
counterexample, or anything meaningful, because the constraints are
contradictory. (There is a line of research on UNSAT cores, but it’s hard to
apply in practice in TLA<sup>+</sup>.)</p>

<p>If you wrote the specification yourself, you can usually stare at it and find
the combination of contradicting constraints. However, in this case, the
specification was written by the LLM! Of course, I looked at it. Things looked
fine. The LLM agreed with me that the transition should be enabled.</p>

<h2 id="debug-an-overconstrained-specification-like-a-human-would">Debug an overconstrained specification like a human would</h2>

<p>The LLMs were stuck. So I decided to explain to them how I usually debug
overconstrained specifications. First, I asked the LLM to use <code>git bisect</code> to
find the last working commit. It crunched for 10-15 minutes and found the last
working one. Comparing the git diffs did not help though.</p>

<p>The next usual step is to comment out some parts of the specification, and see
whether the transition becomes enabled. If it does, then we know that one of the
problematic constraints is in the commented-out part. We did this exercise for
about 1 hour. The agentic loop was amazing. The LLM was doing everything
automatically.  In the end, we still could not find the root cause.</p>

<p><strong>I was surprised how well Codex and Claude were running Apalache and
transforming the TLA<sup>+</sup> specification. They neither required skills,
nor MCP.</strong> They simply ran the model checker, parsed its output and parsed the
produced counterexamples. Being a CLI tool finally paid off for Apalache!</p>

<h2 id="turn-specification-debugging-into-code-generation">Turn specification debugging into code generation</h2>

<p>I could stare at the specification and look for a mismatch. In hindsight,
that would not help me, as the issue was outside of the subprotocol
specification. So I thought: LLMs fail to identify the issue, but they can
generate code in minutes. <strong>Can I turn this debugging problem into a code
generation problem?</strong></p>

<p>Hence, I told the LLM to take the pieces of my specification framework, extract
the random simulator, write an ad hoc simulator that drives the system into the
exact problematic state, do random exploration from there, and look for the
disabled assumptions. Since my framework is written in Python, it was quite an easy task
for the LLM. <strong>In 5 minutes, it ran the ad hoc simulator</strong>. <strong>In 2-3 more
minutes, we had the root cause!</strong> Indeed, there were two contradicting
constraints. It was hard to identify them by looking at the specification, as
one of them was in the subprotocol specification, and the other one was in the
system specification. Obviously, I extended <code>AGENTS.md</code> with an instruction to
avoid introducing constraints at both levels.</p>
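<p>To make the idea concrete, here is a minimal sketch of such a diagnostic. It
is not the actual ad hoc simulator from the project. The names <code>Guard</code>,
<code>blocked_guards</code>, and <code>explain</code>, as well as the toy constraints, are
hypothetical. The sketch evaluates every constraint separately on each candidate
successor state and reports which constraints reject it, together with the level
(subprotocol or system) where the constraint lives.</p>

```python
# Hypothetical sketch: Guard, blocked_guards, explain and the toy constraints
# are illustrative names, not taken from the framework described in the post.
from typing import Callable, Dict, NamedTuple

State = Dict[str, int]


class Guard(NamedTuple):
    name: str                        # human-readable label of the constraint
    level: str                       # where it lives: "subprotocol" or "system"
    holds: Callable[[State], bool]   # the constraint as an executable predicate


def blocked_guards(guards, state):
    """Return the guards that evaluate to False in the given state."""
    return [g for g in guards if not g.holds(state)]


def explain(guards, candidates):
    """For each candidate successor, list the (level, name) of rejecting guards."""
    report = {}
    for i, state in enumerate(candidates):
        report[i] = [(g.level, g.name) for g in blocked_guards(guards, state)]
    return report


# A toy version of the contradiction from the text: x = 2 and x = 3,
# with the two constraints living at different specification levels.
guards = [
    Guard("x = 2", "subprotocol", lambda s: s["x"] == 2),
    Guard("x = 3", "system", lambda s: s["x"] == 3),
]
candidates = [{"x": v} for v in range(5)]
report = explain(guards, candidates)

# If every candidate is rejected by at least one guard, the transition is
# disabled, and the report shows which guards to compare.
always_blocked = all(report[i] for i in report)
```

<p>In this toy run, the candidate with <code>x = 2</code> is rejected only by the
system-level guard, and the candidate with <code>x = 3</code> only by the
subprotocol-level guard. A pattern like this points at a pair of constraints
that cannot hold together.</p>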

<p>This approach worked, since my specification is not just a TLA<sup>+</sup>
specification. It is actually Python code. It can be executed. But it can also
generate a TLA<sup>+</sup> specification. As a result, the LLM can easily
interact with the Python code, wrap it into a large ad hoc simulator, and run
it. At the same time, the model checker uses the generated TLA<sup>+</sup>
specification to reason about it. Moreover, the LLM also benefits from having
two different perspectives in the form of the Python code and TLA<sup>+</sup>.</p>

<p>If you find this hint useful, leave a comment below.</p>

<h2 id="want-to-talk">Want to talk?</h2>

<!-- References -->]]></content>
      

      
      
      
      
      

      <author>
          <name>Igor Konnov</name>
        
          <email>igor@konnov.phd</email>
        
        
          <uri>https://konnov.phd?utm_source=protocols_made_fun&amp;utm_medium=referral&amp;utm_campaign=pmf_site</uri>
        
      </author>

      
        
          <category term="testing" />
        
          <category term="model-checking" />
        
      

      

      
      
        <summary type="html"><![CDATA[Author: Igor Konnov]]></summary>
      

      
      
    </entry>
  
    <entry>
      
      

      <title type="html">All you need is a simulator? Nope</title>
      <link href="https://protocols-made-fun.com/testing/model-checking/2026/03/09/random-walks.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" rel="alternate" type="text/html" title="All you need is a simulator? Nope" />
      <published>2026-03-09T00:00:00+00:00</published>
      <updated>2026-03-09T00:00:00+00:00</updated>
      <id>https://protocols-made-fun.com/testing/model-checking/2026/03/09/random-walks</id>
      
      
        <content type="html" xml:base="https://protocols-made-fun.com/testing/model-checking/2026/03/09/random-walks.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed"><![CDATA[<p><strong>Author:</strong> <a href="https://konnov.phd?utm_source=protocols_made_fun&amp;utm_medium=referral&amp;utm_campaign=pmf_site">Igor Konnov</a></p>

<p><strong>Date:</strong> March 09, 2026</p>

<p><strong>Punchline: Testing distributed protocols with random simulation and stateful
property-based testing (PBT) is not enough!</strong> Yes, running a simulator for days
is better than doing manual testing or just running unit tests. But <strong>you will
miss states, which may expose bugs</strong>. <strong>Even on very small systems.</strong> I have
been saying exactly this to many software engineers. Many times. However,
whiteboard arguments do not help. As humans, we have a great deal of trust in
probabilities, and our intuitive understanding of randomness is often wrong.
Hence, I am giving you concrete figures and plots in this blog post. I must
admit that my own intuition was also wrong: I expected fewer random walks to be
needed to achieve good coverage. For a quick glance, see <a href="https://protocols-made-fun.com/testing/model-checking/2026/03/09/random-walks.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed#quick-summary">Quick
summary</a>.</p>

<p>Achieving <strong>complete coverage with random walks is hard</strong>. This is especially
important to know <strong>if you are using them to produce test cases for your
implementation</strong>. It is also crucial to know, in case you generate an
implementation of a distributed protocol with AI tools and <strong>hope for random
walks/PBT to work as an ultimate guardrail</strong>.</p>

<p>Don’t get me wrong. I like PBT and simulators (having written the <a href="https://github.com/informalsystems/quint?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Quint</a>
simulator). I believe that these tools are must-have tools for testing.  See my
recent blog post on <a href="https://protocols-made-fun.com/pbt/2025/12/22/pbt-adversarial-llms.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Property-based testing, adversarial developers, and
LLMs</a>. However, they are not the only tools that we need to make sure
that our systems work as expected. This is especially true now, when we do not
have time to properly design and review the AI-generated code.</p>

<p><strong>Why now?</strong> It has always been difficult to compare search procedures that were
developed by different branches of computer science. Everyone wanted to promote
their technique as the ultimate winner. Want to compare property-based testing
and model checking? Bad luck. Different tools require different inputs. Some are
libraries for programming languages (like <a href="https://en.wikipedia.org/wiki/QuickCheck?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">QuickCheck</a>), some are tools for
specification languages (like <a href="https://github.com/tlaplus/tlaplus?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">TLC</a> and <a href="https://apalache-mc.org?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Apalache</a>).  Now it is much faster
to design frameworks and to experiment with multiple search procedures. It is also
easier to do reproducible experiments with LLMs. Good times, if you know how to
conduct experimental research.</p>

<div class="tip-box">
    <div class="tip-header"><strong>💡 Tip:</strong></div>
    <div class="tip"><p>In contrast to the previous blog posts, I do not
provide the artifacts for download. AI slop forks are real. It still takes me
several days to design and conduct the experiments on a beefy machine, as well
as to find the right format to interpret and plot the data. Even with the help
of the frontier models, though they are of great help. It only takes 10-15
minutes to repackage the benchmarks and results with an AI tool, having the
experimental data. Hence, I am sharing my lab book with the customers and
researchers, upon request.</p>
</div>
</div>

<p><a id="quick-summary"></a></p>

<h2 id="1-quick-summary-for-the-impatient-readers">1. Quick summary for the impatient readers</h2>

<p>Look at two groups of figures below. They summarize the results of running
random walks on specifications of three prominent distributed protocols:
two-phase commit, readers-writers, and FPaxos (see <a href="https://protocols-made-fun.com/testing/model-checking/2026/03/09/random-walks.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed#benchmarks">Benchmarks</a>).
The figures show the coverage achieved by random walks, with 100% being the
number of distinct states (reported by the model checker TLC). In addition, we
plot the running times of the random walks, with the values plotted against the
right y-axis. All running times are on an AMD Ryzen 9 5950X processor (16
physical, 32 logical cores), 128 GB memory.</p>

<h3 id="11-coverage-for-minimal-instances">1.1 Coverage for minimal instances</h3>

<p>In this set of experiments, we do random walks for the minimal instances of the
benchmarks. We start with the <strong>meaningful default</strong> of 100,000 random walks,
with at most 100 steps per walk. As you can see from Figure 1, we achieve
complete state coverage only for two-phase commit with two resource managers.
This is not surprising, since this instance has only 56 states. It’s
tiny! For two-phase commit with three resource managers and readers-writers with
three actors, we achieve 85-90% coverage. This is also in the reasonable range.
On <strong>FPaxos with two acceptors, we achieve 77.5% coverage with 100k random
walks</strong>. This is a bit worrying, since the state space is about 37k states.</p>

<p>The good news is that we can push all of the above benchmarks to achieve over
99% coverage. As you can see in the figures, it takes <strong>10 million random walks
to achieve 99% coverage</strong>. In addition to that, <strong>these runs require 1-2
hours</strong>.</p>

<p><strong>All of the above benchmarks are quite small by the model checking
standards. They have at most tens of thousands of states. It takes TLC only 1-4 seconds
to explore the state space and check the invariants for each of these
benchmarks.</strong></p>

<div class="figure-grid">
  <figure>
    <a href="https://protocols-made-fun.com/img/random-walks/twophase-n2-coverage.png?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" target="_blank" title="Click to open full-size"><picture>
      <img class="responsive-img" src="https://protocols-made-fun.com/img/random-walks/twophase-n2-coverage.png?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" alt="Coverage of random walks for the two-phase commit benchmark with 2 resource managers" />
    </picture></a>
    <figcaption>Figure 1.a: Two-phase commit, 2 RMs.</figcaption>
  </figure>
  <figure>
    <a href="https://protocols-made-fun.com/img/random-walks/twophase-n3-coverage.png?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" target="_blank" title="Click to open full-size"><picture>
      <img class="responsive-img" src="https://protocols-made-fun.com/img/random-walks/twophase-n3-coverage.png?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" alt="Coverage of random walks for the two-phase commit benchmark with 3 resource managers" />
    </picture></a>
    <figcaption>Figure 1.b: Two-phase commit, 3 RMs.</figcaption>
  </figure>
  <figure>
    <a href="https://protocols-made-fun.com/img/random-walks/rw-inst3-coverage.png?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" target="_blank" title="Click to open full-size"><picture>
      <img class="responsive-img" src="https://protocols-made-fun.com/img/random-walks/rw-inst3-coverage.png?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" alt="Coverage of random walks for the readers-writers benchmark with 3 actors" />
    </picture></a>
    <figcaption>Figure 1.c: Readers-writers, 3 actors.</figcaption>
  </figure>
  <figure>
    <a href="https://protocols-made-fun.com/img/random-walks/fpaxos-inst2-coverage.png?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" target="_blank" title="Click to open full-size"><picture>
      <img class="responsive-img" src="https://protocols-made-fun.com/img/random-walks/fpaxos-inst2-coverage.png?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" alt="Coverage of random walks for the FPaxos benchmark with 2 acceptors" />
    </picture></a>
    <figcaption>Figure 1.d: FPaxos, 2 acceptors.</figcaption>
  </figure>
</div>

<h3 id="12-slightly-larger-instances">1.2 Slightly larger instances</h3>

<p>What happens if we take the instances that are still small, but have 1-2
participants more? Figure 2 shows the results of doing random walks on these
instances.</p>

<p>As you can see, with the meaningful default of 100,000 random walks, we achieve
extremely poor coverage, about 25-30% on the benchmarks up to 2 million states.
<strong>On FPaxos with 4 acceptors, we achieve only 3% coverage after 100,000
random walks</strong>. Really bad!</p>

<p>To see how far we could push the coverage, we did the experiments with 10-100
million random walks. It is clear that <strong>in 1-2 hours of simulation we get to
60-80% coverage</strong>. It is good, but not great. When we push FPaxos with 3
acceptors to 100 million random walks, we get to 94.5% coverage. Nice, though it
took us 7.5 hours to get there. However, <strong>on FPaxos with 4 acceptors, we get a
poor coverage of 60.4% even with 100 million random walks, which took us 8.5
hours to run</strong>. This benchmark has about 11 million states. So it is reasonably
large, but, again, <strong>not that large by the model checking standards</strong>.</p>

<p>Again, <strong>it takes the model checker TLC up to 10 minutes to enumerate all the
states and check the invariants for these instances</strong>, whereas we have been
<strong>running the simulations for hours!</strong> This is especially
striking, given that we are <strong>running optimized simulators in Rust</strong>.</p>

<div class="figure-grid">
  <figure>
    <a href="https://protocols-made-fun.com/img/random-walks/twophase-n5-coverage.png?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" target="_blank" title="Click to open full-size"><picture>
      <img class="responsive-img" src="https://protocols-made-fun.com/img/random-walks/twophase-n5-coverage.png?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" alt="Coverage of random walks for the two-phase commit benchmark with 5 resource managers" />
    </picture></a>
    <figcaption>Figure 2.a: Two-phase commit, 5 RMs.</figcaption>
  </figure>
  <figure>
    <a href="https://protocols-made-fun.com/img/random-walks/rw-inst4-coverage.png?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" target="_blank" title="Click to open full-size"><picture>
      <img class="responsive-img" src="https://protocols-made-fun.com/img/random-walks/rw-inst4-coverage.png?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" alt="Coverage of random walks for the readers-writers benchmark with 4 actors" />
    </picture></a>
    <figcaption>Figure 2.b: Readers-writers, 4 actors.</figcaption>
  </figure>
  <figure>
    <a href="https://protocols-made-fun.com/img/random-walks/fpaxos-inst3-coverage.png?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" target="_blank" title="Click to open full-size"><picture>
      <img class="responsive-img" src="https://protocols-made-fun.com/img/random-walks/fpaxos-inst3-coverage.png?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" alt="Coverage of random walks for the FPaxos benchmark with 3 acceptors" />
    </picture></a>
    <figcaption>Figure 2.c: FPaxos, 3 acceptors.</figcaption>
  </figure>
  <figure>
    <a href="https://protocols-made-fun.com/img/random-walks/fpaxos-inst4-coverage.png?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" target="_blank" title="Click to open full-size"><picture>
      <img class="responsive-img" src="https://protocols-made-fun.com/img/random-walks/fpaxos-inst4-coverage.png?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" alt="Coverage of random walks for the FPaxos benchmark with 4 acceptors" />
    </picture></a>
    <figcaption>Figure 2.d: FPaxos, 4 acceptors.</figcaption>
  </figure>
</div>

<p><a id="benchmarks"></a></p>

<h2 id="2-the-benchmarks">2. The benchmarks</h2>

<p>As benchmarks, we use three specifications of distributed protocols. These are
prominent examples from the repository of <a href="https://github.com/tlaplus/Examples/?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">TLA+ Examples</a>:</p>

<ul>
  <li>
    <p><strong>Two-phase commit</strong>. This is the famous two-phase commit. The specification
 is explained in <a href="https://www.microsoft.com/en-us/research/publication/consensus-on-transaction-commit/?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Consensus on Transaction Commit</a> by Jim Gray
 and Leslie Lamport. You can check the TLA<sup>+</sup> specification in
 <a href="https://github.com/tlaplus/Examples/blob/master/specifications/transaction_commit/TwoPhase.tla?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">TwoPhase.tla</a>.</p>
  </li>
  <li>
    <p><strong>Readers-writers</strong>. This is a solution to the <a href="https://en.wikipedia.org/wiki/Readers%E2%80%93writers_problem?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Readers-Writers Problem</a>.
 The TLA<sup>+</sup> specification by Stephan Merz can be found in
 <a href="https://github.com/tlaplus/Examples/blob/master/specifications/ReadersWriters/ReadersWriters.tla?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">ReadersWriters.tla</a>.</p>
  </li>
  <li>
    <p><strong>FPaxos</strong>. This is <a href="https://fpaxos.github.io/?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Flexible Paxos</a> by Heidi Howard, Dahlia Malkhi, and
 Alexander Spiegelman. The TLA<sup>+</sup> specification can be found in
 <a href="https://github.com/fpaxos/fpaxos-tlaplus/blob/main/FPaxos.tla?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">FPaxos.tla</a>.</p>
  </li>
</ul>

<p>All of the above specifications are parameterized in the number of participating
processes. We consider several instances of each benchmark. To give you an idea
of their state space size (the number of reachable states), we compute the
figures with <a href="https://github.com/tlaplus/tlaplus?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">TLC</a>. The reachable states are called <em>distinct states</em> in TLC,
whereas <em>produced states</em> are the number of states that TLC generates during the
search. Another important metric is the <em>diameter</em> of the state space, which is
the length of the longest shortest path between any two reachable states (read
it again!).</p>

<p>As you can see from Table 1, these transition systems are not tiny, but they are
actually small by the model checking standards. Surprisingly, they are
sophisticated enough to challenge random walks! <strong>Distributed protocols are
hard.</strong></p>

<figure>

  <table>
    <thead>
      <tr>
        <th>Benchmark</th>
        <th>Instance</th>
        <th>Distinct states</th>
        <th>Produced states</th>
        <th>Diameter</th>
        <th>TLC times</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>Two-phase commit</td>
        <td>2 resource managers</td>
        <td>56</td>
        <td>154</td>
        <td>8</td>
        <td>1 sec</td>
      </tr>
      <tr>
        <td> </td>
        <td>3 resource managers</td>
        <td>288</td>
        <td>1,146</td>
        <td>11</td>
        <td>2 sec</td>
      </tr>
      <tr>
        <td> </td>
        <td>5 resource managers</td>
        <td>8,832</td>
        <td>58,146</td>
        <td>17</td>
        <td>2 sec</td>
      </tr>
      <tr>
        <td>Readers-writers</td>
        <td>2 readers/writers</td>
        <td>390</td>
        <td>935</td>
        <td>9</td>
        <td>2 sec</td>
      </tr>
      <tr>
        <td> </td>
        <td>3 readers/writers</td>
        <td>21,527</td>
        <td>59,674</td>
        <td>13</td>
        <td>2 sec</td>
      </tr>
      <tr>
        <td> </td>
        <td>4 readers/writers</td>
        <td>2,192,020</td>
        <td>7,069,237</td>
        <td>17</td>
        <td>1 min</td>
      </tr>
      <tr>
        <td>FPaxos</td>
        <td>2 acceptors</td>
        <td>36,953</td>
        <td>245,288</td>
        <td>19</td>
        <td>4 sec</td>
      </tr>
      <tr>
        <td> </td>
        <td>3 acceptors</td>
        <td>362,361</td>
        <td>2,697,682</td>
        <td>25</td>
        <td>21 sec</td>
      </tr>
      <tr>
        <td> </td>
        <td>4 acceptors</td>
        <td>11,279,393</td>
        <td>96,056,172</td>
        <td>31</td>
        <td>9 min</td>
      </tr>
    </tbody>
  </table>

  <figcaption>Table 1: The state space size of the benchmarks</figcaption>
</figure>

<p>In the experiments, I am using a custom framework to represent the above
<strong>specifications-as-code</strong> that makes it easy to experiment with
different search procedures. To make sure that these specifications faithfully
represent the original TLA<sup>+</sup> specifications, I do the following:</p>

<ol>
  <li>
    <p>do a code review (obviously),</p>
  </li>
  <li>
    <p>automatically translate the specifications to TLA<sup>+</sup> and check them
 with TLC,</p>
  </li>
  <li>
    <p>run a custom-tailored model checker to compute the number of distinct states
   and check the invariants.</p>
  </li>
</ol>

<p><a id="experimental-results"></a></p>

<p><a id="what-are-random-walks"></a></p>

<h2 id="3-what-are-random-walks-and-state-enumeration">3. What are random walks and state enumeration?</h2>

<p>I have mentioned random walks and state enumeration multiple times so far.
Let’s clarify what these terms mean. The concept of a random walk is intuitively
simple, though the details matter. Instead of looking at a large specification,
let’s look at a simple example of a system that models adding and removing
workers from a pool. This example is inspired by the example in <a href="https://learntla.com/topics/tips.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed#parameterize-your-actions">Parameterize
Your Actions</a> by Hillel Wayne. We add the variable
<code>count</code> to have a meaningful invariant. The specification is shown below. Even
if you do not know TLA<sup>+</sup>, it should be easy to understand.  If you
still have trouble understanding it, just ask an LLM; they are good at
explaining TLA<sup>+</sup> specifications.</p>

<figure>

  <pre><code class="language-tla">EXTENDS Integers, FiniteSets

CONSTANTS
    (* The set of workers to choose from. *)
    (* @type: Set(Int);                   *)
    Worker

VARIABLES
    (* The set of active workers.         *)
    (* @type: Set(Int);                   *)
    active,
    (* The number of active workers.      *)
    (* @type: Int;                        *)
    count

(* Add a worker w to the set of active workers, if it is not already active. *)
(* @type: (Int) =&gt; Bool;                                                     *)
Add(w) ≜ w ∉ active ∧ active' = active ∪ {w} ∧ count' = count + 1

(* Remove a worker w from the set of active workers, if it is active.        *)
(* @type: (Int) =&gt; Bool;                                                     *)
Remove(w) ≜ w ∈ active ∧ active' = active \ {w} ∧ count' = count - 1

(* Initialize the system with no active workers and a count of zero.         *)
Init ≜ active = {} ∧ count = 0

(* In a next state, either add a worker or remove a worker.                  *)
Next ≜ ∃ w ∈ Worker:
          Add(w) ∨ Remove(w)

(* An invariant: `count` matches the cardinality of the active set.          *)
Inv ≜ (count = Cardinality(active))
</code></pre>

  <figcaption>Figure 3: TLA<sup>+</sup> specification for the Workers example.</figcaption>
</figure>

<p>If we fix the set of workers to be <code>Worker = {1, 2}</code>, we get a nice labelled
transition system (LTS) of 4 states. The graphical representation of this LTS is
shown below.</p>

<figure>

  <div><a href="https://protocols-made-fun.com/img/random-walks-lts.svg?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" target="_blank" title="Click to open full-size">
<picture>
  <img class="responsive-img full-width-img" src="https://protocols-made-fun.com/img/random-walks-lts.svg?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" alt="LTS for the Workers specification of two workers" />
</picture>
</a></div>

  <figcaption>Figure 4: The labelled transition system for two workers.</figcaption>
</figure>

<p>TLA<sup>+</sup> does not have any built-in notion of randomness or
probabilities.  It is what is usually called a <em>qualitative</em> specification.
When evaluating <code>Next</code> in a state, we can only evaluate whether a specific
transition is possible under a specific choice of <code>w</code> and the action scheduling
decision (whether to execute <code>Add(w)</code> or <code>Remove(w)</code>). This is the standard
semantics under the definition of behaviors. We can enumerate all reachable
states for the above system by breadth-first search or depth-first search. This
is what the model checker TLC does (it uses breadth-first search). This is what
I will call <em>state enumeration</em> in this blog post.</p>
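<p>As an illustration, state enumeration for the Workers example can be sketched
in a few lines of Python. This is a hand-written translation of the specification
for <code>Worker = {1, 2}</code>, not the output of TLC or of the custom framework.
The function names and the way produced states are counted are assumptions of
the sketch.</p>

```python
# A minimal sketch of breadth-first state enumeration for the Workers example.
# The encoding of states as (active, count) pairs is an assumption of this sketch.
from collections import deque

WORKERS = (1, 2)
INIT = (frozenset(), 0)


def successors(state):
    """All successors of a state under Next: Add(w) or Remove(w) for some w."""
    active, count = state
    result = []
    for w in WORKERS:
        if w not in active:      # Add(w) is enabled
            result.append((active | {w}, count + 1))
        if w in active:          # Remove(w) is enabled
            result.append((active - {w}, count - 1))
    return result


def inv(state):
    """The invariant Inv: count equals the cardinality of the active set."""
    active, count = state
    return count == len(active)


def enumerate_states(init):
    """Breadth-first search that records the BFS depth of every reachable state."""
    depth = {init: 0}
    queue = deque([init])
    produced = 1                 # count the initial state as produced
    while queue:
        state = queue.popleft()
        assert inv(state), f"invariant violated in {state}"
        for nxt in successors(state):
            produced += 1        # every generated successor, duplicates included
            if nxt not in depth:
                depth[nxt] = depth[state] + 1
                queue.append(nxt)
    return depth, produced


depth, produced = enumerate_states(INIT)
distinct = len(depth)
```

<p>For two workers, the search visits the four reachable states, and the
invariant holds in each of them. Unlike a random walk, the search terminates
only after every reachable state has been visited.</p>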

<p>We could also interpret the choice of <code>w</code> and the action scheduling decision as
a random choice. Since the above specification is small, we can visualize it as
a <a href="https://en.wikipedia.org/wiki/Markov_decision_process?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Markov decision process
(MDP)</a>. The states are
the same as in the LTS, but we also attach probabilities to the transitions.</p>

<figure>

  <div><a href="https://protocols-made-fun.com/img/random-walks-mdp.svg?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" target="_blank" title="Click to open full-size">
<picture>
  <img class="responsive-img full-width-img" src="https://protocols-made-fun.com/img/random-walks-mdp.svg?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" alt="MDP for the Workers specification of two workers" />
</picture>
</a></div>

  <figcaption>Figure 5: The MDP for two workers.</figcaption>
</figure>

<p>Notice that we assign probabilities for choosing the value of <code>w</code> and for
choosing the action to execute: <code>Add(w)</code> or <code>Remove(w)</code>. For example, in the
initial state, we choose <code>w=1</code> with probability 0.5, then the action <code>Add(1)</code>
with probability 0.5, which gives us a transition to the state where <code>active =
{1}</code> and <code>count = 1</code> (with probability 0.25). However, if we choose <code>w=1</code> and
the action <code>Remove(1)</code>, we have to backtrack to the initial state, since the
precondition of <code>Remove(1)</code> is not satisfied.</p>
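<p>The arithmetic above can be reproduced with a small Python sketch. It is only an illustration under assumptions: the state is a pair of <code>active</code> and <code>count</code>, <code>Add(w)</code> is assumed to require that <code>w</code> is inactive, <code>Remove(w)</code> is assumed to require that <code>w</code> is active, and the names <code>step</code> and <code>one_step_distribution</code> are hypothetical.</p>

<pre><code class="language-python">from fractions import Fraction
from itertools import product

WORKERS = (1, 2)
ACTIONS = ("Add", "Remove")


def step(state, w, action):
    """Return the successor state, or None if the action is disabled.

    Assumed semantics (not copied from the specification): Add(w) needs w
    to be inactive, Remove(w) needs w to be active.
    """
    active, count = state
    if action == "Add" and w not in active:
        return (active | {w}, count + 1)
    if action == "Remove" and w in active:
        return (active - {w}, count - 1)
    return None


def one_step_distribution(state):
    """Probability of each next state under uniform choices of w and action."""
    p = Fraction(1, len(WORKERS) * len(ACTIONS))
    dist = {}
    for w, action in product(WORKERS, ACTIONS):
        succ = step(state, w, action)
        # A disabled transition sends the walk back to the current state.
        target = state if succ is None else succ
        dist[target] = dist.get(target, 0) + p
    return dist


init = (frozenset(), 0)
dist = one_step_distribution(init)
</code></pre>

<p>Under these assumptions, each of the two <code>Add</code> transitions from the initial state gets probability 1/4, and the two disabled <code>Remove</code> choices together form a self-loop with probability 1/2.</p>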

<p>A <em>random walk</em> is a path through the MDP. It is a sequence of states that we
get by making random choices at each step. In the above figure, you can see one
walk in blue and one walk in red. To avoid too many backward edges, we have a
retry budget, typically, 3-10 retries per step. We take this simple approach in
our custom framework. It is similar to what the randomized simulator in
<a href="https://github.com/informalsystems/quint?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Quint</a> is doing, though the Quint simulator is trying a bit more locally
before backtracking. Probabilities are basically used to produce various random
walks. There is no inherent statistical meaning to these probabilities in random
walks. This is very much how stateful property-based testing works, too, though
PBT frameworks usually use biased coins, instead of uniform ones.</p>
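<p>A random walk with a retry budget can be sketched in a few lines of Python. This is not the Rust framework used in the experiments. The Workers semantics in <code>step</code>, the helper names, and the decision to stop a walk once the budget is exhausted are all assumptions made for illustration.</p>

<pre><code class="language-python">import random

WORKERS = (1, 2)
ACTIONS = ("Add", "Remove")


def step(state, w, action):
    """Successor of state, or None if the (assumed) precondition fails."""
    active, count = state
    if action == "Add" and w not in active:
        return (active | {w}, count + 1)
    if action == "Remove" and w in active:
        return (active - {w}, count - 1)
    return None


def random_walk(init, max_steps, retries, rng):
    """One walk: at each step, retry disabled choices up to `retries` times."""
    walk = [init]
    for _ in range(max_steps):
        for _ in range(retries):
            succ = step(walk[-1], rng.choice(WORKERS), rng.choice(ACTIONS))
            if succ is not None:
                walk.append(succ)
                break
        else:
            # Retry budget exhausted (assumed policy): end this walk.
            break
    return walk


rng = random.Random(2026)
visited = set()
for _ in range(100):
    visited.update(random_walk((frozenset(), 0), 10, 5, rng))
</code></pre>

<p>The set <code>visited</code> is the state coverage of the 100 walks. Comparing its size with the number of states found by state enumeration gives the coverage ratio.</p>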

<div class="tip-box">
    <div class="tip-header"><strong>💡 Tip:</strong></div>
    <div class="tip"><p>TLC also supports random simulation, but it assigns
probabilities differently. Given a state, TLC first computes all successors of
the state and then chooses one of the successors uniformly at random. This would
give us a different MDP that filters out disabled transitions. Both approaches
have their merits and drawbacks. The approach of TLC requires us to enumerate
successors, unless we use reservoir sampling. It would actually work better on
the examples in this blog post, since they have many disabled transitions.
However, in systems that inject faults, this approach has an issue, as the
faulty transitions often dominate the search.</p>
</div>
</div>
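<p>The remark about reservoir sampling can be illustrated with a short Python sketch. It shows the general technique with a reservoir of size one, not the implementation of TLC. The helper <code>pick_uniform</code> is hypothetical, and the generator over <code>"abc"</code> merely stands in for lazily computed successor states.</p>

<pre><code class="language-python">import random


def pick_uniform(successors, rng):
    """Pick one item uniformly from an iterable without storing all of it.

    Reservoir sampling with a reservoir of size one: the i-th item replaces
    the current pick with probability 1/i.
    """
    chosen = None
    for i, succ in enumerate(successors, start=1):
        if rng.randrange(i) == 0:
            chosen = succ
    return chosen


rng = random.Random(0)
# A generator stands in for lazily computed successor states.
picks = [pick_uniform((s for s in "abc"), rng) for _ in range(3000)]
counts = {s: picks.count(s) for s in "abc"}
</code></pre>

<p>Note that this still generates every successor once. It only avoids keeping all of them in memory at the same time.</p>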

<h2 id="4-which-states-are-missing">4. Which states are missing?</h2>

<p>Since we can measure state coverage now, the next question is: What are these
states that we are missing? Maybe these states are not important at all. To
check that, I ran the random walks for the two-phase commit benchmark with 2
resource managers for 10,000 instead of 100,000 runs. Conveniently, exactly one
state was missing from the coverage. As our specifications are code, I just
asked Claude to instrument the search to experimentally evaluate the visit
frequencies per run for each reachable state. Figure 6 is quite detailed. Click
on it to see the full-size version.</p>

<figure>

  <div><a href="https://protocols-made-fun.com/img/two_phase_graph.svg?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" target="_blank" title="Click to open full-size">
    <picture>
      <img class="responsive-img full-width-img" src="https://protocols-made-fun.com/img/two_phase_graph.svg?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" alt="Experimental evaluation of the visit frequencies for each state in the two-phase commit benchmark with 2 resource managers" />
    </picture>
  </a></div>

  <figcaption>
    Figure 6: Reachability frequencies for the two-phase commit benchmark with
    2 resource managers.
  </figcaption>
</figure>

<p>As we can see, the missing state (with the frequency of 0) is the state where
the transaction manager aborts the transaction, one resource manager also
aborts the transaction, and the other resource manager is in the “prepared”
state. This is an interesting state in this protocol, as the other resource
manager still has the potential to commit the transaction, though it should not
do that.</p>

<p><strong>Bottom line:</strong> We may miss important states with random walks.</p>

<h2 id="5-more-coverage-plots">5. More coverage plots</h2>

<p>Figure 7 shows the coverage evolution for the large instances of the benchmarks.
With this, we can see how increasing the number of random walks helps to
increase the coverage.  It also demonstrates the growing volume of covered and
missing states.</p>

<p>I wanted to share these flame plots with you. I find them cool.</p>

<div class="figure-grid">
  <figure>
    <a href="https://protocols-made-fun.com/img/random-walks/twophase-n5-overlay.png?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" target="_blank" title="Click to open full-size"><picture>
      <img class="responsive-img" src="https://protocols-made-fun.com/img/random-walks/twophase-n5-overlay.png?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" alt="Overlaid coverage of random walks for the two-phase commit benchmark with 5 resource managers" />
    </picture></a>
    <figcaption>Figure 7.a: Overlaid coverage for two-phase commit, 5 RMs.</figcaption>
  </figure>
  <figure>
    <a href="https://protocols-made-fun.com/img/random-walks/rw-inst4-overlay.png?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" target="_blank" title="Click to open full-size"><picture>
      <img class="responsive-img" src="https://protocols-made-fun.com/img/random-walks/rw-inst4-overlay.png?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" alt="Overlaid coverage of random walks for the readers-writers benchmark with 4 actors" />
    </picture></a>
    <figcaption>Figure 7.b: Overlaid coverage for readers-writers, 4 actors.</figcaption>
  </figure>
  <figure>
    <a href="https://protocols-made-fun.com/img/random-walks/fpaxos-inst3-overlay.png?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" target="_blank" title="Click to open full-size"><picture>
      <img class="responsive-img" src="https://protocols-made-fun.com/img/random-walks/fpaxos-inst3-overlay.png?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" alt="Overlaid coverage of random walks for the FPaxos benchmark with 3 acceptors" />
    </picture></a>
    <figcaption>Figure 7.c: Overlaid coverage for FPaxos, 3 acceptors.</figcaption>
  </figure>
  <figure>
    <a href="https://protocols-made-fun.com/img/random-walks/fpaxos-inst4-overlay.png?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" target="_blank" title="Click to open full-size"><picture>
      <img class="responsive-img" src="https://protocols-made-fun.com/img/random-walks/fpaxos-inst4-overlay.png?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" alt="Overlaid coverage of random walks for the FPaxos benchmark with 4 acceptors" />
    </picture></a>
    <figcaption>Figure 7.d: Overlaid coverage for FPaxos, 4 acceptors.</figcaption>
  </figure>
</div>

<h2 id="6-conclusions">6. Conclusions</h2>

<p><strong>Random walks are not sufficient to achieve complete coverage
except for very small state spaces</strong>. Moreover, <strong>random walks take
significantly longer than the model checker</strong>. This is especially striking,
given that we are <strong>running optimized simulators in Rust</strong>. Another issue with
state coverage by random walks is that <strong>you would not even know that you
achieved complete coverage</strong>. You can measure the speed of discovering new
states, but understanding that the simulator has converged basically requires a
model checker.</p>

<p>Interestingly, random walks behaved badly on FPaxos with four acceptors. This is
a relatively benign benchmark, not having a state explosion like specifications
of Byzantine consensus protocols (BFT). In BFT, the minimal configurations
contain 4-6 replicas, depending on the protocol. Hence, <strong>we should expect a
significantly worse coverage by random walks on BFT</strong>.</p>

<p>Why do engineers keep running randomized experiments? Well, it is relatively
easy to write a simulator. (It is not that easy to write one that actually
works!) I have seen people playing with action distributions in the simulator,
just to drive the search towards “interesting” states. Whenever I asked where
the distributions came from, they could not explain it.
Simulators are deceptive. You have to understand what you are doing, or,
better, incorporate feedback. The most basic feedback is state coverage, though
we can implement more sophisticated feedback mechanisms.</p>

<p>From our experiments it may look like <strong>state enumeration is all we need</strong>. I
would argue that it is true <strong>as long as the set of reachable states fits into
memory</strong>. We do not have to store the states directly in memory: practical model
checkers store hashes of states. We can go as far as to store 2-3 bits per
state, assuming that collisions are acceptable (still better than random
walks!). Having a machine with 128 GB of memory, we can store roughly 50
billion states. This is way more than the number of states in our
benchmarks – tens of billions vs. thousands and millions.</p>
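<p>A back-of-the-envelope check of this estimate, as a Python sketch. It assumes decimal gigabytes and that all of the memory is available for the state set, which is optimistic.</p>

<pre><code class="language-python">MEMORY_BYTES = 128 * 10**9  # 128 GB, taken as decimal gigabytes
BITS_AVAILABLE = MEMORY_BYTES * 8

# Upper bounds on the number of states at a few bits per state.
states_at_3_bits = BITS_AVAILABLE // 3
states_at_2_bits = BITS_AVAILABLE // 2

# Bit budget per state if we store 50 billion states.
bits_per_state_at_50e9 = BITS_AVAILABLE / (50 * 10**9)

print(states_at_3_bits, states_at_2_bits, bits_per_state_at_50e9)
</code></pre>

<p>Under these assumptions, 2-3 bits per state would allow a few hundred billion states, and 50 billion states leave a budget of about 20 bits per state.</p>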

<p>There are cases where randomness may find bugs, where state enumeration gets
stuck:</p>

<ol>
  <li>
    <p><strong>Value domains are quite large.</strong> For example, if we choose values from the
 set of all 64-bit integers, it is not feasible to enumerate all successors even
 for a single state. A random walk can still make some progress without getting
 stuck. One can argue that choosing a value from the set $[0, 2^{64})$
 uniformly at random is shooting in the dark, but sometimes it helps us to find
 bugs, especially if the large set has just a few large equivalence classes.
 Arguably, one should be able to apply data abstraction in this case. Also,
 this is usually the moment when you should consider using a model checker that
 supports symbolic representation of states, like <a href="https://apalache-mc.org?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Apalache</a>.</p>
  </li>
  <li>
    <p><strong>Guided search.</strong> If we have a heuristic that guides the search towards
 interesting states, we can achieve better coverage with random walks faster.
 Maybe we use reinforcement learning to learn such a heuristic. Maybe we use an
 LLM to predict which actions are more likely to lead to interesting states.
 The main issue is that it is quite hard to find a direction for the search in
 the state space of distributed protocols.</p>
  </li>
</ol>

<h2 id="want-to-talk">Want to talk?</h2>

<!-- References -->]]></content>
      

      
      
      
      
      

      <author>
          <name>Igor Konnov</name>
        
          <email>igor@konnov.phd</email>
        
        
          <uri>https://konnov.phd?utm_source=protocols_made_fun&amp;utm_medium=referral&amp;utm_campaign=pmf_site</uri>
        
      </author>

      
        
          <category term="testing" />
        
          <category term="model-checking" />
        
      

      

      
      
        <summary type="html"><![CDATA[Author: Igor Konnov]]></summary>
      

      
      
    </entry>
  
    <entry>
      
      

      <title type="html">AI-generated shovels or second-order slop?</title>
      <link href="https://protocols-made-fun.com/llms/testing/2026/02/12/second-order-slop.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" rel="alternate" type="text/html" title="AI-generated shovels or second-order slop?" />
      <published>2026-02-12T00:00:00+00:00</published>
      <updated>2026-02-12T00:00:00+00:00</updated>
      <id>https://protocols-made-fun.com/llms/testing/2026/02/12/second-order-slop</id>
      
      
        <content type="html" xml:base="https://protocols-made-fun.com/llms/testing/2026/02/12/second-order-slop.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed"><![CDATA[<p><strong>Author:</strong> <a href="https://konnov.phd?utm_source=protocols_made_fun&amp;utm_medium=referral&amp;utm_campaign=pmf_site">Igor Konnov</a></p>

<p><strong>Date:</strong> February 12, 2026</p>

<p>tl;dr:</p>

<ul>
  <li>
    <p><em>AI coding tools now reduce development costs, but they also accelerate the
 creation of software that appears high-quality while hiding serious correctness
 and reliability risks.</em></p>
  </li>
  <li>
    <p><em>When both code and tests are autogenerated, traditional quality checks lose
 their signaling value, increasing the likelihood of costly failures, outages,
 and liability exposure in production systems.</em></p>
  </li>
  <li>
    <p><em>To responsibly capture productivity gains without undermining trust,
 organizations must pair AI-generated code with specification-driven development
 and automated validation techniques that verify real system behavior rather
 than surface-level compliance.</em></p>
  </li>
</ul>

<p>In 2025, we saw plenty of enthusiastic announcements about LLMs generating (or,
more correctly, <em>replicating</em>) relatively complex projects, like web
applications or video games. Apparently, many people got so tired of all this
that you could hear the words “AI slop” quite often. So often that some very
important people asked all of us not to call the output of their amazing tools
“slop”.</p>

<p>Anyhow, by the end of 2025, the amazing AI tools became <em>visibly more amazing</em>.
In early 2025, I was only using ChatGPT and Copilot to produce small code
snippets and scripts, as well as to search for design solutions. In the summer
of 2025, I used Copilot &amp; Sonnet to produce boilerplate code. Now, I am using
Claude Code and Copilot (both with Opus and Sonnet) to generate code and tests
as well as to fix linting errors (still the hardest task!). I still have to
define the core data structures, write non-standard code and explain in detail
what I want to achieve. It is still hit and miss (see the most notable examples
below). However, it becomes economically feasible for me to use these tools,
unless they get 10x more expensive. By the way, after finishing my experiment
with <a href="https://protocols-made-fun.com/tlaplus/2025/12/15/tftp-symbolic-testing.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Symbolic testing of TFTP</a>, I was still not sure whether I
wanted to use agentic tools every day. The feedback loop was energy draining. It
looks like the tools became better, and I’ve learnt how to give them more
focused and smaller-scoped tasks.</p>

<p>I have yet to see an AI-generated product that generates revenue. Are
there any examples, except the AI coding assistants themselves? In any case,
this is not what I wanted to write about. I wanted to write about something that
looks like a new phenomenon to me. We all have heard the saying: <em>When everyone
is digging for gold, sell shovels</em>. Over just a couple of weeks, there was an
unusual number of announcements about development tools that were generated with
AI. This is what I call <em>AI-generated shovels</em>.  These announcements bring so
much joy to AI influencers that it’s hard to find anything else. Do these tools
actually work though? On closer inspection, some of the shovels break on the first
try, and some work only under very specific conditions. Most likely, you
have seen some announcements, and you know what I am talking about. It is also
very likely that you have not seen all of the announcements that I have in mind.
Since we are talking about development tools, libraries, or even languages that
do not actually work, not web apps, it is not just slop, it is a <em>second-order
slop</em>!</p>

<p>I am not going to name names or point fingers. This is not the
point. What makes me seriously concerned about the second-order slop is that the
software development industry was cutting corners everywhere even before the AI
boom. “Move fast and break things!”, <a href="https://en.wikipedia.org/wiki/Minimum_viable_product?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">minimum-viable products</a> (or are they
solutions?), <a href="https://en.wikipedia.org/wiki/Product-market_fit?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">product-market-fit</a>, etc. A couple of years ago, I was joking
that I would rather not use an MVP compiler, operating system, or database.
Well, AI tools generate compilers. Here we are.</p>

<p><strong>Shovel ad!</strong> Since I have been working on pre-LLM shovels like <a href="https://apalache-mc.org?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Apalache</a>
and <a href="https://github.com/informalsystems/quint?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Quint</a> myself, I am in the shovel business, too! (Do you know that SMT
solvers were also considered AI?) Of course, I am developing new shovels, and
they are also AI-generated and AI-compatible, and they are the best in town, by
the way. So if you want to talk, <a href="https://protocols-made-fun.com/llms/testing/2026/02/12/second-order-slop.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed#want-to-talk">drop me a message</a>. To be fair,
my time tracker shows that I’ve burnt six weeks of my time on the latest shovel,
in addition to burning through and over my Copilot and Claude budgets, so it’s
not entirely AI-generated. Perhaps, a bit artisanal.</p>

<p><strong>Good shovel or slop?</strong> How do we distinguish a robust AI-generated shovel from
a second-order slop? In the pre-LLM years, I could just look at the test suite
and tell whether the team was serious or not. Those amazing days when blockchain
engineers would nod their heads to the question: <em>Do you have integration
tests?</em> They were proudly demonstrating a <em>single</em> integration test that was 3-5
KLOC long. Also, by looking at the code, you could sense whether it was written
just yesterday, or someone had time to think about it.</p>

<p>In 2026, the code may look professionally written and follow all the best
practices and still be completely broken. On top of that, LLMs generate
good-looking tests if you ask them. A lot of tests! The more tests you have,
the more tokens you have to pay for. Win-win. Moreover, the generated
tests may check that the code works, but this does not mean that the code does
what you expect. This happened to me (see below).</p>

<p>So when we evaluate an AI-generated shovel, we want to answer two questions:</p>

<ol>
  <li>
    <p>Does this shovel do what the authors claim it should do?</p>
  </li>
  <li>
    <p>Does this shovel work beyond a few simple tests?</p>
  </li>
</ol>

<p>These are not new questions. The testing and verification communities have been
trying to automate validation and verification for a long time. Interestingly,
these questions did not get much attention over the last two decades. It was
expected that open source projects and products by respectable companies were
“more or less” correct and complete. In my understanding, two factors
contributed to that:</p>

<ol>
  <li>
    <p>The code was written and reviewed by highly skilled engineers, for fun or
 profit.</p>
  </li>
  <li>
    <p>The projects were extensively tested with continuous integration tools.</p>
  </li>
</ol>

<p>Now, if an LLM generated the code just yesterday, and all tests pass, are we
good? It is hard to tell. If we follow the brand new <em>spec-driven development</em>,
we have a bunch of markdown files. Apparently, we should ask a few other LLMs to
check whether the implemented code matches the markdown specs. Something like
that.</p>

<p><strong>Can we do better?</strong> I believe we can. For example, if you are developing
a distributed system, do not generate it directly. First, write or AI-generate a
sequential reference implementation (e.g., in Python) or, even better, a formal
specification (e.g., in TLA<sup>+</sup>). Second, use this artifact to produce
the code for the actual distributed system.</p>

<p>Why does this help? For two reasons:</p>

<ol>
  <li>
    <p>It is easier to compare the reference implementation or specification
 against the markdown requirements than to compare the entire codebase.</p>
  </li>
  <li>
    <p>The reference implementation/specification is an actionable artifact.  Use
 it to produce tests for the distributed system. Instead of generating 10 KLOC
 tests once (and paying for loading them into the LLM context), automatically
 produce as many tests as you can. This is where property-based testing and
 model checking start to shine. See <a href="https://protocols-made-fun.com/tlaplus/2025/12/15/tftp-symbolic-testing.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Symbolic testing of TFTP</a>
 for an example.</p>
  </li>
</ol>
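<p>To make the second point concrete, here is a minimal Python sketch of using a reference implementation as a test oracle. It uses only the standard library. Both <code>ReferenceStore</code> and <code>LogStore</code> are hypothetical stand-ins (a sequential model and a system under test), not code from any of the projects mentioned here.</p>

<pre><code class="language-python">import random


class ReferenceStore:
    """Sequential reference model: a plain dictionary."""

    def __init__(self):
        self.data = {}

    def put(self, key, value):
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)


class LogStore:
    """Hypothetical system under test: an append-only log of writes."""

    def __init__(self):
        self.log = []

    def put(self, key, value):
        self.log.append((key, value))

    def get(self, key):
        # The latest write to the key wins.
        for k, v in reversed(self.log):
            if k == key:
                return v
        return None


def run_random_test(rng, steps):
    """Drive both stores with the same random operations and compare reads."""
    ref, sut = ReferenceStore(), LogStore()
    for _ in range(steps):
        key = rng.choice("abc")
        if rng.choice(("put", "get")) == "put":
            value = rng.randrange(100)
            ref.put(key, value)
            sut.put(key, value)
        elif ref.get(key) != sut.get(key):
            return False
    return True


rng = random.Random(42)
results = [run_random_test(rng, 50) for _ in range(200)]
</code></pre>

<p>Every run is a freshly generated test, so nothing has to be stored or loaded into an LLM context. A property-based testing framework or a model checker plays the same role with better input generation.</p>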

<p><strong>Examples of LLM hits and misses.</strong> If you <a href="https://www.linkedin.com/in/igor-konnov-at/?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">follow me on LinkedIn</a>, you may
have seen some of the examples. Below are the most curious instances that I would
regret missing in a code review (by Sonnet 4.5 and Opus 4.5):</p>

<ul>
  <li>
    <p><strong>The set minimum.</strong> When I asked an LLM to implement the search for the
 minimal element of a set by using its string representation (called <code>repr</code> in
 Python), it collected all set elements in a list, sorted them by <code>repr</code> and
 picked the first one. It looks like my requirement was slightly non-standard.</p>
  </li>
  <li>
    <p><strong>Sets with duplicates.</strong> An LLM has produced a unit test that constructed
 the data structure called “Set” from the list <code>[ V(1), V(2), V(3), V(1) ]</code> and
 asserted that the set cardinality was 4. The test passed, since <code>V</code> did not
 have equality defined, and two different instances of <code>V(1)</code> had different
 references. So it was doing the things right, but it was not doing the right
 things!</p>
  </li>
  <li>
    <p><strong>Performance bottleneck.</strong> An LLM translated my Python function into a Rust
 function. Perfect-looking code. However, instead of adding a big integer <code>x</code>
 to the big integer <code>y</code>, it used an iterator that made <code>y</code> increments of <code>x</code>.
 Almost like a theorem prover! A logically correct solution, but my Rust code
 was slower than the Python code. I only spotted it after running the profiler.
 Again, a slightly non-standard setup threw it off.</p>
  </li>
</ul>
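<p>The first two examples are easy to reproduce in plain Python. The sketch below is a reconstruction for illustration, not the original code: the classes <code>V</code> and <code>W</code> are hypothetical, and I assume that a single pass with <code>min</code> is what the set-minimum requirement called for.</p>

<pre><code class="language-python">from dataclasses import dataclass


class V:
    """A value class without __eq__ and __hash__ (identity semantics)."""

    def __init__(self, value):
        self.value = value


@dataclass(frozen=True)
class W:
    """The same value class with structural equality and hashing."""

    value: int


# Two distinct V(1) objects are different set elements, so the size is 4.
size_by_identity = len({V(1), V(2), V(3), V(1)})
# With structural equality, the duplicate collapses and the size is 3.
size_by_value = len({W(1), W(2), W(3), W(1)})

# The minimum by repr needs one linear pass, not a sort.
elements = {W(3), W(10), W(2)}
smallest = min(elements, key=repr)
</code></pre>

<p>A test asserting that the cardinality is 4 passes against <code>V</code>, which is exactly the problem. Note also that the minimum by <code>repr</code> is <code>W(10)</code>, since strings are compared lexicographically.</p>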

<h2 id="want-to-talk">Want to talk?</h2>

<!-- References -->]]></content>
      

      
      
      
      
      

      <author>
          <name>Igor Konnov</name>
        
          <email>igor@konnov.phd</email>
        
        
          <uri>https://konnov.phd?utm_source=protocols_made_fun&amp;utm_medium=referral&amp;utm_campaign=pmf_site</uri>
        
      </author>

      
        
          <category term="llms" />
        
          <category term="testing" />
        
      

      

      
      
        <summary type="html"><![CDATA[Author: Igor Konnov]]></summary>
      

      
      
    </entry>
  
    <entry>
      
      

      <title type="html">Property-based testing, adversarial developers, and LLMs</title>
      <link href="https://protocols-made-fun.com/pbt/2025/12/22/pbt-adversarial-llms.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" rel="alternate" type="text/html" title="Property-based testing, adversarial developers, and LLMs" />
      <published>2025-12-22T00:00:00+00:00</published>
      <updated>2025-12-22T00:00:00+00:00</updated>
      <id>https://protocols-made-fun.com/pbt/2025/12/22/pbt-adversarial-llms</id>
      
      
        <content type="html" xml:base="https://protocols-made-fun.com/pbt/2025/12/22/pbt-adversarial-llms.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed"><![CDATA[<p><strong>Author:</strong> <a href="https://konnov.phd?utm_source=protocols_made_fun&amp;utm_medium=referral&amp;utm_campaign=pmf_site">Igor Konnov</a></p>

<p><strong>Date:</strong> December 22, 2025</p>

<p>I present a simple example that illustrates how property-based testing (PBT) and
model checking can help us catch unexpected behaviors of LLMs when they are used
to generate code. The example is inspired by the <a href="https://youtu.be/IYzDFHx6QPY?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">talk on property-based
testing</a> by <a href="https://scottwlaschin.com/?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Scott Wlaschin</a>. If you are looking for a light
example that stresses the importance of writing good properties and having them
checked, this post is for you.</p>

<h2 id="1-adversarial-developer">1. Adversarial Developer</h2>

<p>A few days ago, I watched the <a href="https://youtu.be/IYzDFHx6QPY?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">talk on property-based testing</a> by
<a href="https://scottwlaschin.com/?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Scott Wlaschin</a>. He started the talk by introducing a persona that he called
the <strong>Enterprise Developer from Hell</strong>. This is basically someone who
implements a feature to satisfy the given requirements, but they do it in
creatively evil (or just stupid) and unexpected ways. I will call such a persona
an <strong>adversarial developer</strong> in the rest of this post.</p>

<p>Then, Scott<sup id="fnref:scott-talk"><a href="https://protocols-made-fun.com/pbt/2025/12/22/pbt-adversarial-llms.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed#fn:scott-talk" class="footnote" rel="footnote" role="doc-noteref">1</a></sup> showed how an adversarial developer could ruin a task as simple as
adding up two numbers. For example, if we give them two tests $2+2=4$ and $10+33
= 43$, they will implement exactly those cases by case distinction. I am not
going to repeat Scott’s talk. <a href="https://youtu.be/IYzDFHx6QPY?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Watch it</a>! It’s instructive and
entertaining.</p>
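<p>To give a flavor of the case distinction, here is my own minimal reconstruction of the idea in Python. It is not the code from the talk, and the fallback value <code>0</code> is an arbitrary choice.</p>

<pre><code class="language-python">def add(a, b):
    """An adversarial implementation that only knows the two test cases."""
    if (a, b) == (2, 2):
        return 4
    if (a, b) == (10, 33):
        return 43
    return 0  # wrong for almost all other inputs, but no test notices


# Both example-based tests pass.
assert add(2, 2) == 4
assert add(10, 33) == 43
</code></pre>

<p>Adding a third example does not help: the adversarial developer simply adds a third branch.</p>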

<p>Back in 2020, of course, Scott added that we are often adversarial developers
ourselves, and our peers are rarely that evil. It could be an enthusiastic
junior developer, who has just started and now wants to rewrite the whole code
base.  Now, we are a few weeks away from 2026, and <strong>we definitely have such a
peer</strong>!  It is called an LLM, or just AI, as the corporate marketers prefer.
LLMs are not necessarily evil, but they are definitely less predictable than
experienced human software engineers. I am not talking about <a href="https://genai.owasp.org/llmrisk/llm01-prompt-injection/?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">prompt
injection</a> here, which is another real issue with LLMs.</p>

<p>To be clear, in the rest of this text, I am talking about how an <strong>LLM could
behave like an adversarial developer when it generates code</strong>. It does not mean
that I ran one of the commercial LLMs and got those results.</p>

<h2 id="2-property-based-testing">2. Property-Based Testing</h2>

<p>The point of Scott’s talk was to show that a few data points (typical unit
tests) are insufficient to demonstrate correctness of the implementation.
A totally valid point!</p>

<p>In addition to the standard unit tests, we should also write the expected
properties of our implementation. The PBT frameworks test the code by producing
input values at random. For example, have a look at <a href="https://hypothesis.readthedocs.io/en/latest/?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Hypothesis</a>. While this
may seem to be a silly idea at first, property-based tests uncover tricky
bugs. Moreover, the input value distribution does not have to be uniform. Keep
reading to see how this helps us catch the adversarial developer.</p>

<p>Here are the three properties of addition that Scott used to defeat the
adversarial developer:</p>

<ul>
  <li><strong>identity</strong>: for every number $x$, we have $x + 0 = x$,</li>
  <li><strong>commutativity</strong>: for all numbers $x$ and $y$, we have $x + y = y + x$, and</li>
  <li><strong>associativity</strong>: for all numbers $x$, $y$, and $z$, we have
$(x + y) + z = x + (y + z)$.</li>
</ul>

<p>At this point of the talk, I was like: Wait a minute! <strong>I could continue this
game of the adversarial developer</strong>. Before doing this, let’s look at where we
are with respect to the code and the properties. Here is the obvious
implementation of integer addition in Python, since the language has built-in
support for unbounded integers:</p>

<pre><code class="language-python">def add(a: int, b: int) -&gt; int:
    return a + b

</code></pre>

<p>Here are the property-based tests in <a href="https://hypothesis.readthedocs.io/en/latest/?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Hypothesis</a>, generated by Claude Sonnet 4.5:</p>

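<pre><code class="language-python">from hypothesis import given, strategies as st

# add() is the function shown above; import it from wherever it lives.


@given(st.integers())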
def test_identity(a):
    """Test identity property: a + 0 = a."""
    assert add(a, 0) == a
    assert add(0, a) == a


@given(st.integers(), st.integers())
def test_commutativity(a, b):
    """Test commutativity property: a + b = b + a."""
    assert add(a, b) == add(b, a)


@given(st.integers(), st.integers(), st.integers())
def test_associativity(a, b, c):
    """Test associativity property: (a + b) + c = a + (b + c)."""
    assert add(add(a, b), c) == add(a, add(b, c))

</code></pre>

<p>You can find these and further examples in the <a href="https://github.com/konnov/pbt-example-summation?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">example repository</a>.</p>

<p>We run these tests with <code>pytest</code> to make sure that they all pass:</p>

<pre><code class="language-sh">$ git clone https://github.com/konnov/pbt-example-summation.git
$ cd pbt-example-summation/python
$ poetry run pytest tests/test_add.py \
  -k "test_identity or test_commutativity or test_associativity" --verbose
...
tests/test_add.py::test_identity PASSED                                        [ 33%]
tests/test_add.py::test_commutativity PASSED                                   [ 66%]
tests/test_add.py::test_associativity PASSED                                   [100%]
</code></pre>

<h2 id="3-symbolic-model-checking-with-apalache">3. Symbolic model checking with Apalache</h2>

<p>I decided to go even further and write a TLA<sup>+</sup> specification, to check
the three properties with <a href="https://apalache-mc.org?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Apalache</a> and <a href="https://github.com/Z3Prover/z3?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Z3</a> (only showing the relevant parts):</p>

<pre><code class="language-tlaplus">----------------------------------- MODULE Add --------------------------------
(*
 * A simple TLA+ specification of different kinds of addition.
 *
 * Igor Konnov, 2025
 *)
EXTENDS Integers

VARIABLE
    \* @type: Int;
    x,
    \* @type: Int;
    y,
    \* @type: Int;
    z

AddMath(a, b) == a + b

InitMath ==
    /\ x \in Int
    /\ y \in Int
    /\ z \in Int

Next == UNCHANGED &lt;&lt;x, y, z&gt;&gt;

Identity(F(_, _)) ==
    F(x, 0) = x

Commutativity(F(_, _)) ==
    F(x, y) = F(y, x)

Associativity(F(_, _)) ==
    F(F(x, y), z) = F(x, F(y, z))

InvMath ==
    /\ Identity(AddMath)
    /\ Commutativity(AddMath)
    /\ Associativity(AddMath)

</code></pre>

<p>With the above specification, we define a very simple state machine that
non-deterministically picks three integers <code>x</code>, <code>y</code>, and <code>z</code> with <code>InitMath</code>.
These variables do not change their values in the state machine, as you can see
from the definition of <code>Next</code>. We use <code>x</code>, <code>y</code>, and <code>z</code> to define three
properties of addition: <code>Identity</code>, <code>Commutativity</code>, and <code>Associativity</code>. As you
can see, these definitions are parameterized by the operator <code>F</code>, which is
<code>AddMath</code> for now.  Our invariant <code>InvMath</code> is simply the conjunction of the three
properties.</p>

<p>This is how we run Apalache to check the invariant:</p>

<pre><code class="language-sh">$ cd pbt-example-summation/tla-spec
$ docker pull ghcr.io/apalache-mc/apalache
$ docker run --rm -v `pwd`:/var/apalache ghcr.io/apalache-mc/apalache \
  check --init=InitMath --inv=InvMath --length=0 Add.tla
</code></pre>

<p>With the above command, we tell Apalache to check the invariant <code>InvMath</code>
in the initial states described by <code>InitMath</code>. The <code>--length=0</code> option tells
Apalache to unroll <code>Next</code> zero times, which is sufficient in our case, since the
state machine does not change the values of the state variables.</p>

<h2 id="4-playing-adversarial">4. Playing adversarial</h2>

<p>Ok, the code and the specification above seem to be correct. But what if our
friendly AI produced something unexpected?</p>

<h3 id="41-hallucinating-addition-over-32-bit-unsigned-integers">4.1. Hallucinating addition over 32-bit unsigned integers</h3>

<p>Since we are dealing with an adversarial developer, they could simply use a
different definition of addition. So far, we have been talking about unbounded
mathematical integers, which Python conveniently implements for us.</p>

<p>Now, the adversarial developer gives us this implementation:</p>

<pre><code class="language-python">def add32(a: int, b: int) -&gt; int:
    return (a + b) % (2**32)

</code></pre>

<p>This implementation is actually not wrong. An LLM could copy it from a code base
that emulates a 32-bit CPU architecture in Python. This is a bit of a stretch,
but possible.</p>

<p>Let’s add property-based tests for this implementation as well:</p>

<pre><code class="language-python"># Tests for add32 (32-bit natural numbers with wrapping)

@given(st.integers(min_value=0, max_value=2**32 - 1))
def test_add32_identity(a):
    """Test identity property for add32: a + 0 = a."""
    assert add32(a, 0) == a
    assert add32(0, a) == a


@given(st.integers(min_value=0, max_value=2**32 - 1), st.integers(min_value=0, max_value=2**32 - 1))
def test_add32_commutativity(a, b):
    """Test commutativity property for add32: a + b = b + a."""
    assert add32(a, b) == add32(b, a)


@given(
    st.integers(min_value=0, max_value=2**32 - 1),
    st.integers(min_value=0, max_value=2**32 - 1),
    st.integers(min_value=0, max_value=2**32 - 1)
)
def test_add32_associativity(a, b, c):
    """Test associativity property for add32: (a + b) + c = a + (b + c)."""
    assert add32(add32(a, b), c) == add32(a, add32(b, c))

</code></pre>

<p>These tests also pass:</p>

<pre><code class="language-sh">$ poetry run pytest tests/test_add.py \
  -k "test_add32_identity or test_add32_commutativity or test_add32_associativity" \
  --verbose
...
tests/test_add.py::test_add32_identity PASSED                                [ 33%]
tests/test_add.py::test_add32_commutativity PASSED                           [ 66%]
tests/test_add.py::test_add32_associativity PASSED                           [100%]
</code></pre>

<p>What is going on? Well, identity, commutativity, and associativity also hold for
32-bit integers with overflow semantics. <strong>If we let AI generate not only the
implementation but also the properties, we may end up with a correct
implementation, but not the one we wanted!</strong> In this case, imagine an LLM has
added the <code>@given</code> decorators that restrict the inputs to the range $[0,
2^{32})$, whereas we wanted unbounded integers! This is an example of the
classical question in requirements engineering:
<em>do we get things right</em> vs. <em>do we get the right things</em>.</p>

<p>Just to double-check that this is not random chance, I ran Apalache on the
TLA<sup>+</sup> specification above with <code>Add32</code> instead of <code>AddMath</code>:</p>

<pre><code class="language-shell">$ docker run --rm -v `pwd`:/var/apalache ghcr.io/apalache-mc/apalache \
  check --length=0 --init=Init32 --inv=Inv32 Add.tla
...
Checker reports no error up to computation length 0
Total time: 2.205 sec
</code></pre>

<p>Further, the SMT solver Z3 confirms that identity, commutativity, and
associativity hold for 32-bit integers with overflow semantics. This is provided
that we pick the integers from the range $[0, 2^{32})$, which we do with
<code>Init32</code>.</p>
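
<p>To build some intuition for why the bounded checks cannot tell the two
implementations apart, here is a minimal sketch that checks the three properties
exhaustively on a scaled-down version of the problem. The helpers
<code>make_add_mod</code> and <code>holds_on</code>, as well as the 4-bit modulus, are assumptions
made for illustration; they are not part of the example repository:</p>

<pre><code class="language-python">from itertools import product


def make_add_mod(modulus):
    """Return addition that wraps modulo `modulus` (hypothetical helper)."""
    def add_mod(a, b):
        return (a + b) % modulus
    return add_mod


def holds_on(add, domain):
    """Check identity, commutativity, and associativity for all inputs in `domain`."""
    identity = all(add(a, 0) == a and add(0, a) == a for a in domain)
    commutativity = all(add(a, b) == add(b, a) for a, b in product(domain, repeat=2))
    associativity = all(
        add(add(a, b), c) == add(a, add(b, c))
        for a, b, c in product(domain, repeat=3)
    )
    return identity, commutativity, associativity


# A 4-bit analogue of add32: small enough to enumerate every input.
add4 = make_add_mod(2**4)

# Inputs restricted to [0, 2^4): prints (True, True, True).
print(holds_on(add4, range(2**4)))
# Inputs from [0, 2^5): prints (False, True, True), since add4(16, 0) == 0.
print(holds_on(add4, range(2**5)))
</code></pre>

<p>Commutativity and associativity survive the wrapping on every input, so only
identity can expose the modulus, and only on inputs at or above it.</p>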

<p>However, <strong>this is not what we wanted initially</strong>. Let’s catch the adversarial
developer with the PBT tests that pick unbounded non-negative integers:</p>

<pre><code class="language-python">@given(st.integers(0))
def test_add32_unbounded_inputs_identity(a):
    """Test identity property for add32: a + 0 = a."""
    assert add32(a, 0) == a
    assert add32(0, a) == a

@given(st.integers(0), st.integers(0))
def test_add32_unbounded_inputs_commutativity(a, b):
    """Test commutativity property for add32: a + b = b + a."""
    assert add32(a, b) == add32(b, a)


@given(
    st.integers(0),
    st.integers(0),
    st.integers(0)
)
def test_add32_unbounded_inputs_associativity(a, b, c):
    """Test associativity property for add32: (a + b) + c = a + (b + c)."""
    assert add32(add32(a, b), c) == add32(a, add32(b, c))

</code></pre>

<p>This time, Hypothesis catches the issue with identity:</p>

<pre><code class="language-sh">$ poetry run pytest tests/test_add.py  --verbose \
  -k "test_add32_unbounded_inputs_identity or test_add32_unbounded_inputs_commutativity or test_add32_unbounded_inputs_associativity"
...
a = 4294967296

    @given(st.integers(0))
    def test_add32_unbounded_inputs_identity(a):
        """Test identity property for add32: a + 0 = a."""
&gt;       assert add32(a, 0) == a
E       assert 0 == 4294967296
E        +  where 0 = add32(4294967296, 0)
E       Falsifying example: test_add32_unbounded_inputs_identity(
E           a=4_294_967_296,
E       )
</code></pre>

<div class="tip-box">
    <div class="tip-header"><strong>💡 Tip:</strong></div>
    <div class="tip"><p>We can also implement <code>add64</code> that wraps
integers modulo $2^{64}$ and the PBT tests will catch the issue with
identity almost immediately.</p>
</div>
</div>

<h3 id="42-is-property-based-testing-a-magic-tool">4.2. Is property-based testing a magic tool?</h3>

<p>Let’s stop and think about our example. How did Hypothesis catch the issue over
an unbounded integer domain? Even if it were picking the integers from the interval
$[0, 2^{64})$, the chance of picking <code>4294967296</code> by uniform random sampling would be
pretty slim. Yet, Hypothesis keeps picking this number.</p>

<p>Well, the trick is that its input generator tries
the well-known “magic numbers” such as <code>0</code>, <code>1</code>, <code>-1</code>, <code>2**32</code>, <code>2**64</code>, etc.
In this sense, Hypothesis does not use uniform random sampling. See the
discussion on <a href="https://hypothesis.readthedocs.io/en/latest/explanation/domain.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">domain and distribution</a> in the Hypothesis documentation for
more details.</p>

<p>What if our adversarial developer hallucinated an implementation that stays
undetected by Hypothesis? This is what our next example is about.</p>

<h3 id="43-hallucinating-addition-over-256-bit-unsigned-integers">4.3. Hallucinating addition over 256-bit unsigned integers</h3>

<p>This time, the adversarial developer uses 256-bit unsigned integers with overflow
semantics:</p>

<pre><code class="language-python">def add256(a: int, b: int) -&gt; int:
    return (a + b) % (2**256)

</code></pre>

<p>If you think that using 256-bit integers is absurd, well, the Ethereum Virtual
Machine (EVM) does exactly that. So an LLM could have adapted the above code
from an EVM-related code base.</p>

<pre><code class="language-python">@given(st.integers(0))
@settings(max_examples=100000)
def test_add256_unbounded_inputs_identity(a):
    """Test identity property for add256: a + 0 = a."""
    assert add256(a, 0) == a
    assert add256(0, a) == a

@given(st.integers(0), st.integers(0))
@settings(max_examples=100000)
def test_add256_unbounded_inputs_commutativity(a, b):
    """Test commutativity property for add256: a + b = b + a."""
    assert add256(a, b) == add256(b, a)


@given(
    st.integers(0),
    st.integers(0),
    st.integers(0)
)
@settings(max_examples=100000)
def test_add256_unbounded_inputs_associativity(a, b, c):
    """Test associativity property for add256: (a + b) + c = a + (b + c)."""
    assert add256(add256(a, b), c) == add256(a, add256(b, c))
</code></pre>

<p>This time, the adversarial developer gets away with it, as all tests pass:</p>

<pre><code class="language-shell">$ poetry run pytest tests/test_add.py --verbose -k \
  "test_add256_unbounded_inputs_identity or test_add256_unbounded_inputs_commutativity or test_add256_unbounded_inputs_associativity" 
...
tests/test_add.py::test_add256_unbounded_inputs_identity PASSED                [ 33%]
tests/test_add.py::test_add256_unbounded_inputs_commutativity PASSED           [ 66%]
tests/test_add.py::test_add256_unbounded_inputs_associativity PASSED           [100%]
</code></pre>

<p>Why? Hypothesis does not try $2^{256}$ as a magic number. I gave it a budget of
100,000 examples, so it had a chance to try multiple powers of two, but it did
not try anything above $2^{256} - 1$.</p>
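
<p>The following toy generator illustrates the effect. It is <em>not</em> how
Hypothesis works internally; it only mimics the idea of mixing a fixed list of
boundary values with random draws. The names <code>toy_candidates</code> and
<code>find_identity_violation</code>, the list of boundaries, and the 128-bit cap on
random values are all assumptions made for this sketch:</p>

<pre><code class="language-python">import random


def add32(a, b):
    return (a + b) % (2**32)


def add256(a, b):
    return (a + b) % (2**256)


def toy_candidates(rng, boundaries, max_bits, count):
    """Yield boundary values first, then random values of varying bit length."""
    yield from boundaries
    for _ in range(count):
        bits = rng.randint(1, max_bits)
        yield rng.getrandbits(bits)


def find_identity_violation(add, candidates):
    """Return the first candidate `a` with add(a, 0) != a, or None."""
    for a in candidates:
        if add(a, 0) != a:
            return a
    return None


rng = random.Random(2025)
# Hypothetical list of "magic numbers": it stops at 2**64.
boundaries = [0, 1, 2**8, 2**16, 2**32, 2**64]

# The boundary 2**32 exposes add32: prints 4294967296.
print(find_identity_violation(add32, toy_candidates(rng, boundaries, 128, 1000)))
# No candidate reaches 2**256, so add256 goes undetected: prints None.
print(find_identity_violation(add256, toy_candidates(rng, boundaries, 128, 1000)))
</code></pre>

<p>However large the budget is, a generator like this one can only find what its
candidate space contains.</p>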

<p>We can make sure that the identity test indeed fails when we pass $2^{256}$
as an example:</p>

<pre><code class="language-python">@given(st.integers(0))
@settings(max_examples=100000)
@example(2**256)  # This should fail!
def test_add256_unbounded_inputs_identity(a):
    """Test identity property for add256: a + 0 = a."""
    assert add256(a, 0) == a
    assert add256(0, a) == a
</code></pre>

<h3 id="44-catching-the-adversarial-developer-with-apalache">4.4. Catching the adversarial developer with Apalache</h3>

<p>Here is how we modify the TLA<sup>+</sup> specification to use <code>Add256</code>:</p>

<pre><code class="language-tlaplus">Add256(a, b) == (a + b) % (2^256)

InitNat ==
    /\ x \in Nat
    /\ y \in Nat
    /\ z \in Nat

Inv256 ==
    /\ Identity(Add256)
    /\ Commutativity(Add256)
    /\ Associativity(Add256)

</code></pre>

<p>Apalache immediately finds the issue with identity when we run it with <code>Add256</code>:</p>

<pre><code class="language-shell">$ docker run --rm -v `pwd`:/var/apalache ghcr.io/apalache-mc/apalache \
  check --length=0 --init=InitNat --inv=Inv256 Add.tla
...
State 0: state invariant 0 violated.
Total time: 2.272 sec
</code></pre>

<p>If we check the counterexample, we see that the solver picks the value $2^{256}$
for <code>x</code>:</p>

<pre><code class="language-sh">$ head -n 14 _apalache-out/Add.tla/2025-12-22T15-20-31_8166638658721555415/violation.tla
---------------------------- MODULE counterexample ----------------------------
EXTENDS Add

(* Constant initialization state *)
ConstInit == TRUE

(* Initial state [_transition(0)] *)
State0 ==
  x
      = 115792089237316195423570985008687907853269984665640564039457584007913129639936
    /\ y = 0
    /\ z = 0

</code></pre>

<p><strong>This is not just luck and not a heuristic!</strong> Apalache delegates solving to
the SMT solver <a href="https://github.com/Z3Prover/z3?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Z3</a>, which solves integer constraints. If you want to make
sure that it’s not using magic numbers, go and change the modulus in
<code>Add256</code> to a large prime number, e.g., $2^{256} + 297$. Rerun the model
checker, and it will still find the issue with identity.</p>

<h3 id="45-for-the-curious-how-apalache-and-z3-work-together">4.5. For the curious: how Apalache and Z3 work together</h3>

<p>Our example is so simple that we can even go over the actual SMT constraints
that Apalache generates. Let’s run Apalache with the option <code>--debug</code>:</p>

<pre><code class="language-sh">$ docker run --rm -v `pwd`:/var/apalache ghcr.io/apalache-mc/apalache \
  check --debug \
  --length=0 --init=InitNat --inv=Inv256 Add.tla
</code></pre>

<p>Open the file
<code>_apalache-out/Add.tla/2025-12-23T09-31-46_16436938462355564409/log0.smt</code> (the
timestamp will be different on your machine). The log is pretty verbose. Here
are the crucial parts, which I’ve accompanied with explanations:</p>

<pre><code class="language-lisp">; the initial value of `x`
(declare-const $C$6 Int)
; `x` is a natural number
(assert (&gt;= $C$6 0))
; introduce a boolean variable for the identity property
(declare-const $C$9 Bool)
; Encode the identity property for `Add256` and `x`.
; The huge number is 2^256.
(assert (= $C$9
   (= (mod $C$6
           115792089237316195423570985008687907853269984665640564039457584007913129639936)
      $C$6)))
; assert that the identity property is violated
(declare-const $C$10 Bool)
(assert (= (not $C$10) $C$9))
(assert $C$10)
; check whether the above constraints have a solution
(check-sat)
</code></pre>

<p>If you want to understand what is going on, read the comments above. At first, I
was actually surprised that the SMT constraints did not contain addition at all.
Then I recalled that Apalache has a bunch of rewriting rules that simplify the
constraints. In this case, the symbolic model checker has applied the property
<code>a + 0 = a</code> internally to get rid of the addition (yeah, it is the identity
property!).  It was an equivalent transformation, so we are still left with the
modulo operator.</p>

<p>Essentially, we are asking Z3 to solve these constraints over integers:</p>

\[\left\{
\begin{aligned}
x &amp;\ge 0\\
x \bmod 2^{256} &amp;\neq x
\end{aligned}
\right.\]
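
<p>As a sanity check, the value that Apalache reported for <code>x</code> can be plugged
back into the two constraints from the SMT log, using plain Python integers.
This is only a sketch; the variable names are chosen here for illustration:</p>

<pre><code class="language-python"># The modulus from the SMT log, that is, 2^256.
MODULUS = 115792089237316195423570985008687907853269984665640564039457584007913129639936
assert MODULUS == 2**256

# The value of x from the counterexample in violation.tla.
x = 115792089237316195423570985008687907853269984665640564039457584007913129639936

# Constraint 1: x is a natural number (written without a comparison operator).
is_natural = x == abs(x)
# Constraint 2: identity is violated, that is, x mod 2^256 differs from x.
identity_violated = x % MODULUS != x

# Prints: True True
print(is_natural, identity_violated)

# Spot check: values in [0, MODULUS) satisfy identity. Prints: True
print(all(v % MODULUS == v for v in (0, 1, MODULUS - 1)))
</code></pre>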

<p>What is crucial here is that the SMT solver <strong>Z3 has a strict contract with the
user</strong>. When we give it a set of constraints and ask it to check their
satisfiability, it will apply sound algorithms to arrive at one of the three
answers:</p>

<ul>
  <li>
    <p><strong>sat</strong>: there is a solution to the above constraints, i.e., an assignment of
values to the variables that makes the constraints true,</p>
  </li>
  <li>
    <p><strong>unsat</strong>: there is no solution, and</p>
  </li>
  <li>
    <p><strong>unknown</strong>: the constraints are too hard, or it took the solver too long to
solve them.</p>
  </li>
</ul>

<p>In contrast to PBT, it is not just like “I tried a few random inputs and did not
find a bug”. If Z3 answers <code>sat</code>, there is indeed a solution to the constraints,
and the solver gives it to us as a model. If it returns <code>unsat</code>, there is no
solution. Whenever you see <code>unknown</code>, it’s a bad day. Sometimes, it also
indicates a bug in the solver itself, as I’ve <a href="https://github.com/Z3Prover/z3/issues?q=is%3Aissue+state%3Aclosed+author%3Akonnov&amp;utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">reported to the Z3 developers a
few times</a>. However, Z3 is pretty reliable in my experience, and
producing an <code>unknown</code> is an achievement, unless you set very tight timeouts,
or use tricky non-linear arithmetic.</p>

<p>If you are further interested in how Z3 actually solved the above constraints, a
simple answer is that it used something like the <a href="https://en.wikipedia.org/wiki/Simplex_algorithm?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Simplex algorithm</a>,
extended with techniques such as branch-and-bound to handle integer variables.
The constraints are linear in our case (the modulus is a constant), so Z3 could
apply such an algorithm to find a solution. Most likely, Z3 used a more recent
algorithm, but the idea is similar.</p>

<p>In case you really want to know how SMT solvers work under the hood, I recommend
starting with the book on <a href="https://www.decision-procedures.org/?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Decision Procedures</a> by Kroening and Strichman.</p>

<h2 id="5-conclusions">5. Conclusions</h2>

<p>We all have to <strong>learn how to write high-quality properties and understand the
boundaries of the “magic” tools</strong>. <strong>Even if you don’t use LLMs, your peers
will.</strong></p>

<p><strong>How do we learn to write good properties</strong>? We can play with property-based
testing. However, <strong>PBT tools are not reliable teachers</strong>. Because of its random
nature, a PBT tool may miss a bug on one run and find it on another run. Don’t
get me wrong. <strong>Property-based testing has its value</strong>, like many other testing
and verification techniques. However, <strong>PBT is not a silver bullet</strong>. It may
miss bugs, especially if the input generator does not cover the right input
space well enough.</p>

<p><strong>Shall we use interactive provers like <a href="https://lean-lang.org/?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Lean</a> and <a href="https://rocq-prover.org/?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Rocq</a></strong>? Learning how to
prove code correct definitely helps! However, <strong>these tools only tell us that the
proof does not go through</strong>. They would not give us a counterexample.
State-of-the-art provers also recommend using PBT for bug finding.</p>

<p>In my opinion, <strong>model checking is the best way to learn how to write good
properties</strong>. You can write as many properties as you like, and the model
checker will produce counterexamples for you, or not. Importantly, model checkers
come with a guarantee of not having a bug in their <em>search scope</em>, if they
terminate.  See my blog post on the <a href="https://protocols-made-fun.com/modelchecking/2025/04/08/value.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">value of model
checking</a> on that.</p>

<p>Usually, I recommend that people start with <a href="https://github.com/tlaplus/tlaplus?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">TLC</a>. It works by state enumeration
and is easy to understand. If your search scope is small, TLC is a good learning
tool. In our example, the search scope is astronomical. In this case,
<a href="https://apalache-mc.org?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Apalache</a> is there to help.</p>

<p>By the way, our example was so simple that we could encode it in <a href="https://github.com/Z3Prover/z3?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Z3</a>
directly via its Python bindings. We could also use other model checkers. If you do
that, let me know!</p>

<h2 id="6-bonus-hypothesis--crosshair">6. Bonus: Hypothesis + Crosshair</h2>

<p>Hypothesis offers an integration with <a href="https://crosshair.readthedocs.io/en/latest/?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Crosshair</a>, which is a symbolic
execution engine for Python using Z3. I did not explore this integration in
depth. Claude told me that it is sufficient to just add this import to the test:</p>

<pre><code class="language-python">import hypothesis_crosshair_provider
</code></pre>

<p>Well, this did not help me to find the violation of identity for <code>add256</code>.  If
you know how to make Crosshair work with Hypothesis, please let me know!</p>

<p>When we run Crosshair directly on the <code>add256</code> implementation, it finds the
issue with identity right away:</p>

<pre><code class="language-sh">$ poetry run crosshair check tests.test_add_crosshair.check_add256_identity
.../python/tests/test_add_crosshair.py:12: error: false when calling check_add256_identity(115792089237316195423570985008687907853269984665640564039457584007913129639936) (which returns False)
</code></pre>

<p>The Crosshair test looks as follows:</p>

<pre><code class="language-python">from crosshair.core_and_libs import standalone_statespace
from pbt_add import add256


def check_add256_identity(a: int) -&gt; bool:
    """
    Check identity property for add256: a + 0 = a.
    
    pre: a &gt;= 0
    post: _
    """
    return add256(a, 0) == a and add256(0, a) == a

</code></pre>

<!-- References -->

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:scott-talk">
      <p>I’ve never met Scott Wlaschin in real life, online or offline. I hope he would not mind me referring to him by his first name. <a href="https://protocols-made-fun.com/pbt/2025/12/22/pbt-adversarial-llms.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed#fnref:scott-talk" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content>
      

      
      
      
      
      

      <author>
          <name>Igor Konnov</name>
        
          <email>igor@konnov.phd</email>
        
        
          <uri>https://konnov.phd?utm_source=protocols_made_fun&amp;utm_medium=referral&amp;utm_campaign=pmf_site</uri>
        
      </author>

      
        
          <category term="pbt" />
        
      

      

      
      
        <summary type="html"><![CDATA[Author: Igor Konnov]]></summary>
      

      
      
    </entry>
  
    <entry>
      
      

      <title type="html">Interactive Symbolic Testing of TFTP with TLA+ and Apalache</title>
      <link href="https://protocols-made-fun.com/tlaplus/2025/12/15/tftp-symbolic-testing.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" rel="alternate" type="text/html" title="Interactive Symbolic Testing of TFTP with TLA+ and Apalache" />
      <published>2025-12-15T00:00:00+00:00</published>
      <updated>2025-12-15T00:00:00+00:00</updated>
      <id>https://protocols-made-fun.com/tlaplus/2025/12/15/tftp-symbolic-testing</id>
      
      
        <content type="html" xml:base="https://protocols-made-fun.com/tlaplus/2025/12/15/tftp-symbolic-testing.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed"><![CDATA[<p><strong>Author:</strong> <a href="https://konnov.phd?utm_source=protocols_made_fun&amp;utm_medium=referral&amp;utm_campaign=pmf_site">Igor Konnov</a></p>

<p><strong>Date:</strong> December 15, 2025</p>

<p><em>Note: I mostly stopped using LLMs for proof-reading my texts, so you know
it is not all generated. Enjoy my typos and weird grammar!</em></p>

<p><strong>Abstract.</strong> As promised in the <a href="https://protocols-made-fun.com/tlaplus/2025/12/02/small-scope.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">blog post on small-scope
hypothesis</a>, I am continuing with the main body of the talk that I
presented at the internal Nvidia FM Week 2025. This blog post is rather long. If
you do not want to read the whole post, here are the most exciting new
developments:</p>

<ul>
  <li>
    <p>A <strong>new JSON-RPC server API</strong> for <a href="https://apalache-mc.org?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Apalache</a>, which allows external tools
 and scripts to drive the symbolic execution of TLA<sup>+</sup> specifications
 and interact with the solver.  Read the section on <a href="https://protocols-made-fun.com/tlaplus/2025/12/15/tftp-symbolic-testing.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed#4-the-new-json-rpc-api-of-apalache">The new JSON-RPC API of
 Apalache</a>.</p>
  </li>
  <li>
    <p>A new approach to <strong>conformance testing of TLA<sup>+</sup> specifications and
 real implementations</strong>, called <strong>interactive symbolic testing</strong>. This approach
 is inspired by the work of <a href="https://www.mcmil.net/pubs/SIGCOMM19.pdf?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">McMillan and Zuck (2019)</a> on testing of the
 QUIC protocol with IVy and SMT. Read the section on <a href="https://protocols-made-fun.com/tlaplus/2025/12/15/tftp-symbolic-testing.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed#3-interactive-symbolic-testing-with-smt">Interactive symbolic
 testing with SMT</a>.</p>
  </li>
  <li>
    <p>A case study on <strong>testing multiple open-source implementations of TFTP</strong>,
 including unexpected (but not harmful) deviations from the protocol. This case
 study includes the experience report on using Claude to bootstrap the harness
 for testing TFTP implementations against the TLA<sup>+</sup> specification.
 Read the section on <a href="https://protocols-made-fun.com/tlaplus/2025/12/15/tftp-symbolic-testing.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed#7-bootstrapping-the-testing-harness-with-claude">Bootstrapping the testing harness with
 Claude</a> and <a href="https://protocols-made-fun.com/tlaplus/2025/12/15/tftp-symbolic-testing.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed#9-testing-against-adversarial-behavior">Testing against
 adversarial behavior</a>.  My point is
 <strong>not to brainwash you into LLMs</strong>, but to <strong>show what works for me and
 what does not</strong>.</p>
  </li>
  <li>
    <p>The specification and the test harness are <strong>openly available</strong>. Check the
 <a href="https://github.com/konnov/tftp-symbolic-testing?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">GitHub repository</a>.</p>
  </li>
</ul>

<p>In this blog post, I am using TLA<sup>+</sup>. The same tooling and results
equally apply to <a href="https://github.com/informalsystems/quint?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Quint</a>.</p>

<p><strong>Contents:</strong></p>

<ol>
  <li><a href="https://protocols-made-fun.com/tlaplus/2025/12/15/tftp-symbolic-testing.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed#1-introduction">Introduction</a></li>
  <li><a href="https://protocols-made-fun.com/tlaplus/2025/12/15/tftp-symbolic-testing.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed#2-model-based-testing-and-trace-validation">Model-based testing and trace validation</a></li>
  <li><a href="https://protocols-made-fun.com/tlaplus/2025/12/15/tftp-symbolic-testing.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed#3-interactive-symbolic-testing-with-smt">Interactive symbolic testing with SMT</a></li>
  <li><a href="https://protocols-made-fun.com/tlaplus/2025/12/15/tftp-symbolic-testing.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed#4-the-new-json-rpc-api-of-apalache">The new JSON-RPC API of Apalache</a></li>
  <li><a href="https://protocols-made-fun.com/tlaplus/2025/12/15/tftp-symbolic-testing.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed#5-case-study-tftp-protocol">Case study: TFTP protocol</a></li>
  <li><a href="https://protocols-made-fun.com/tlaplus/2025/12/15/tftp-symbolic-testing.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed#6-initial-tla-specification-of-tftp">Initial TLA<sup>+</sup> specification of TFTP</a></li>
  <li><a href="https://protocols-made-fun.com/tlaplus/2025/12/15/tftp-symbolic-testing.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed#7-bootstrapping-the-testing-harness-with-claude">Bootstrapping the testing harness with Claude</a></li>
  <li><a href="https://protocols-made-fun.com/tlaplus/2025/12/15/tftp-symbolic-testing.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed#8-debugging-the-tla-specification-with-the-implementation">Debugging the TLA<sup>+</sup> specification with the implementation</a></li>
  <li><a href="https://protocols-made-fun.com/tlaplus/2025/12/15/tftp-symbolic-testing.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed#9-testing-against-adversarial-behavior">Testing against adversarial behavior</a></li>
  <li><a href="https://protocols-made-fun.com/tlaplus/2025/12/15/tftp-symbolic-testing.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed#10-the-specification-as-a-differential-testing-oracle">The specification as a differential testing oracle</a></li>
  <li><a href="https://protocols-made-fun.com/tlaplus/2025/12/15/tftp-symbolic-testing.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed#11-prior-work">Prior Work</a></li>
  <li><a href="https://protocols-made-fun.com/tlaplus/2025/12/15/tftp-symbolic-testing.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed#12-conclusions">Conclusions</a></li>
</ol>

<h2 id="1-introduction">1. Introduction</h2>

<p>This work aims at demonstrating how to answer the following two questions with
<a href="https://apalache-mc.org?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Apalache</a>:</p>

<p class="highlight-question"><strong><em>
  1. How to test the actual implementation against its TLA<sup>+</sup> specification?
</em></strong></p>

<p class="highlight-question"><strong><em>
  2. How to test the TLA<sup>+</sup> specification against the actual implementation?
</em></strong></p>

<p>For a long time, these questions have been mostly ignored by the TLA<sup>+</sup>
community. Over the last 4-5 years, researchers started to look into these two
questions and found out that having a connection between the specification and
the implementation is much more useful than was initially thought. (The
engineers were telling me this all the time!) Check <a href="https://protocols-made-fun.com/tlaplus/2025/12/15/tftp-symbolic-testing.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed#11-prior-work">the prior work
section</a> for the papers and talks on this topic.  Roughly
speaking, the approaches follow the two ideas:</p>

<ul>
  <li>
    <p><strong>Model-based testing (MBT)</strong>. The TLA<sup>+</sup> specification is used to
 generate test cases that are then executed against the implementation. This is
 an answer to question 1 above. The state exploration is driven by the
 specification. Hence, we are testing whether the implementation matches the
 inputs and outputs, as produced by the specification.</p>
  </li>
  <li>
    <p><strong>Trace validation (TV)</strong>. The traces are collected from the implementation
 and checked against the TLA<sup>+</sup> specification. This is an answer to
 question 2 above. State exploration is driven by the implementation, e.g., by
 executing the existing test suites, or just by running the system for some
 time. Hence, we are testing whether the specification matches the inputs and
 outputs of the implementation. Alternatively, we may check whether the
 implementation states may be lifted to the specification states, in order to
 produce a feasible trace in the specification.</p>
  </li>
</ul>
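
<p>To make the two directions concrete, here is a toy sketch in Python. It has
nothing to do with the actual TLA<sup>+</sup> tooling: the “specification” is a
hypothetical predicate <code>spec_allows</code> over a bounded counter, and the
“implementation” is a hypothetical class <code>CounterImpl</code> that deliberately
refuses an action one step too early. All names are assumptions made for
illustration:</p>

<pre><code class="language-python">LIMIT = 3  # the specification lets the counter range over 0..LIMIT


def spec_allows(state, action):
    """Specification: 'inc' is enabled below LIMIT, 'reset' is always enabled."""
    if action == "inc":
        return state != LIMIT
    return action == "reset"


def spec_next(state, action):
    """Specification: the successor state after an enabled action."""
    return state + 1 if action == "inc" else 0


class CounterImpl:
    """Implementation under test; it refuses 'inc' one step too early."""

    def __init__(self):
        self.value = 0

    def apply(self, action):
        if action == "inc":
            if self.value == LIMIT - 1:
                return False  # the implementation refuses the action
            self.value += 1
        else:
            self.value = 0
        return True


def model_based_test(spec_trace):
    """MBT: replay actions produced by the specification against the implementation."""
    impl = CounterImpl()
    for step, action in enumerate(spec_trace):
        if not impl.apply(action):
            return step  # divergence: the implementation refused to replay
    return None


def validate_trace(impl_trace):
    """TV: check that actions observed in the implementation are allowed by the spec."""
    state = 0
    for step, action in enumerate(impl_trace):
        if not spec_allows(state, action):
            return step  # divergence: the specification cannot follow the trace
        state = spec_next(state, action)
    return None


# MBT finds the divergence at step 2: prints 2.
print(model_based_test(["inc", "inc", "inc"]))
# TV accepts a trace that the implementation can produce: prints None.
print(validate_trace(["inc", "inc", "reset", "inc"]))
</code></pre>

<p>In this toy example, only the specification-driven direction reveals the
mismatch, because the implementation never produces the trace that would
exercise the difference.</p>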

<p>If you re-read the description of MBT and TV above, you may notice that there
are two more dimensions of how to do testing:</p>

<ul>
  <li>
    <p><strong>State-based</strong>. In this case, we have to establish a relation between the
 implementation states and the specification states in each step of the trace.
 This is usually done by defining mapping functions, either from the implementation
 states to the specification states, or vice versa. Notice that mapping an
 implementation state to a specification state is usually much easier, as it
 involves <em>state abstraction</em> (e.g., dropping some variables). Mapping a
 specification state to an implementation state is more difficult, as it
 involves <em>state concretization</em>, e.g., choosing a representative concrete value
 for each abstract value in the specification state. For example, if the
 specification says $x \in [10, 20]$, then we have to choose a concrete value
 for $x$ in this range, e.g., at random.</p>
  </li>
  <li>
    <p><strong>Action-based</strong>. In this case, we have to establish a relation between the
 implementation actions and the specification actions. Again, we would need to
 define mappings. Interestingly, in my experience, defining action mappings is
 way easier than defining state mappings.</p>
  </li>
</ul>
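
<p>To make the difference between state abstraction and state concretization
concrete, here is a minimal Python sketch. The variable names (<code>x</code>,
<code>socket_fd</code>, <code>retries</code>) and both helper functions are
hypothetical and serve only as an illustration.</p>

```python
import random


def abstract_state(impl_state: dict) -> dict:
    """State abstraction: keep only the variables the specification talks about."""
    spec_vars = ("x",)  # hypothetical: the specification has a single variable x
    return {name: impl_state[name] for name in spec_vars}


def concretize_x(low: int, high: int, rng: random.Random) -> int:
    """State concretization: pick one representative value from [low, high]."""
    return rng.randint(low, high)


# An implementation state carries more detail than the specification needs.
impl_state = {"x": 12, "socket_fd": 7, "retries": 0}
print(abstract_state(impl_state))

# The specification only says that x is in [10, 20]; we choose a value at random.
print(concretize_x(10, 20, random.Random(0)))
```

<p>Abstraction is a deterministic projection, whereas concretization has to
make a choice, which is one reason why it is the harder direction.</p>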

<h2 id="2-model-based-testing-and-trace-validation">2. Model-based testing and trace validation</h2>

<h3 id="21-model-based-testing-in-one-picture">2.1. Model-based testing in one picture</h3>

<p>Without going into too many details, the following picture illustrates the main
idea of model-based testing. We generate an “interesting” trace with a model
checker, e.g., with <a href="https://apalache-mc.org?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Apalache</a>. This trace is fed to the test harness that:
(1) does action concretization, (2) executes the actions against the
implementation. The moment the implementation refuses to replay an action, we
know that there is a divergence. Notice that we often do not even need to query
the system for its current state, as we only care about the actions.</p>

<picture>
  <img class="responsive-img" src="https://protocols-made-fun.com/img/mbt.svg?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" alt="Model-based testing" />
</picture>
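
<p>The replay step of such a test harness can be sketched in a few lines of
Python. This is a toy under stated assumptions: the class <code>Counter</code>
stands in for the implementation, and the action names <code>A</code> and
<code>B</code> are hypothetical labels of specification actions.</p>

```python
class Counter:
    """A toy implementation under test: a counter restricted to -3..3."""

    def __init__(self):
        self.value = 0

    def inc(self):
        if self.value >= 3:
            raise ValueError("cannot increment above 3")
        self.value += 1

    def dec(self):
        if -3 >= self.value:
            raise ValueError("cannot decrement below -3")
        self.value -= 1


def replay(trace, sut):
    """Replay abstract actions; return the index of the first refused action, or None."""
    concretize = {"A": sut.inc, "B": sut.dec}  # action concretization
    for i, action in enumerate(trace):
        try:
            concretize[action]()
        except ValueError:
            return i  # the implementation refused to replay: a divergence
    return None


print(replay(["A", "A", "B"], Counter()))
print(replay(["A", "A", "A", "A"], Counter()))
```

<p>Notice that <code>replay</code> never reads <code>sut.value</code>. It only
observes whether each action is accepted.</p>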

<p>One downside of this approach is that the model checker can be quickly overwhelmed
by the many possible action interleavings unless the search scope is further
restricted. In my experience, the SMT solver Z3 slows down dramatically when it
must solve two problems simultaneously:</p>

<ol>
  <li>
    <p>Choose a sequence of actions (a schedule) to explore, and</p>
  </li>
  <li>
    <p>Find variable assignments (states) that produce a feasible trace for the
chosen schedule.</p>
  </li>
</ol>

<p>When a schedule is fixed, the SMT solver must solve far fewer constraints
because it mainly propagates values through the actions. If the solver must also
pick a schedule, it must backtrack along two axes: (1) schedules and (2) states.
This increases solving times in practice.</p>

<p>To mitigate this, Apalache lets you randomly sample schedules and execute them
symbolically. To enumerate different “interesting” schedules, the user can
define a view operator, which usually projects state variables to more abstract
values. The model checker will then produce traces projected onto those views.
This works significantly better for test generation in practice. However, this
exploration strategy is fixed and cannot be changed without modifying Apalache
itself.</p>
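
<p>To give a feeling for how views help to enumerate different schedules, here
is a small Python sketch. It is not the algorithm implemented in Apalache: it
executes a toy counter concretely instead of symbolically, and all function
names are hypothetical. It samples random schedules and keeps one trace per
distinct sequence of views.</p>

```python
import random


def step(x, action):
    """Return the successor of x under the action, or None if the action is disabled."""
    if action == "A" and 3 > x:
        return x + 1
    if action == "B" and x > -3:
        return x - 1
    return None


def view(x):
    """Project a state onto a coarse view: negative, zero, positive."""
    return (0 > x, x == 0, x > 0)


def sample_by_view(n_samples, length, rng):
    """Sample random schedules; keep one trace per distinct sequence of views."""
    seen = {}
    for _ in range(n_samples):
        x, states = 0, [0]
        for _ in range(length):
            nxt = step(x, rng.choice(["A", "B"]))
            if nxt is None:
                break  # the chosen action is disabled, stop this schedule
            x = nxt
            states.append(x)
        seen.setdefault(tuple(view(s) for s in states), states)
    return seen


for states in sample_by_view(50, 3, random.Random(0)).values():
    print(states)
```

<p>Two traces that differ only in the concrete values, but not in the views,
count as the same example, so the output stays small.</p>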

<h3 id="22-trace-validation-in-one-picture">2.2. Trace validation in one picture</h3>

<p>Trace validation is conceptually simpler than model-based testing. We simply
execute the system under test (SUT) and collect traces. These traces are then
mapped to the abstract states, if necessary, and checked against the
specification.</p>

<picture>
  <img class="responsive-img" src="https://protocols-made-fun.com/img/tv.svg?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" alt="Trace validation" />
</picture>
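
<p>In its simplest form, trace validation boils down to checking every pair of
consecutive states against the transition relation. Below is a minimal Python
sketch for a toy counter specification. The functions <code>init</code>,
<code>next_rel</code>, and <code>validate</code> are hypothetical; a real setup
delegates this check to a model checker.</p>

```python
def init(x):
    """Toy initial-state predicate."""
    return x == 0


def next_rel(x, x_next):
    """Toy transition relation: increment below 3 or decrement above -3."""
    inc = 3 > x and x_next == x + 1
    dec = x > -3 and x_next == x - 1
    return inc or dec


def validate(trace):
    """Return None if the spec allows the trace, else the index of the first bad state."""
    if not trace or not init(trace[0]):
        return 0
    for i in range(1, len(trace)):
        if not next_rel(trace[i - 1], trace[i]):
            return i
    return None


print(validate([0, 1, 2, 1]))  # a trace collected from a well-behaved run
print(validate([0, 1, 3]))     # the jump from 1 to 3 is not allowed
```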

<p>This approach has been tried in multiple projects that use the exhaustive-state
model checker <a href="https://github.com/tlaplus/tlaplus?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">TLC</a> as the back-end solver. See <a href="https://protocols-made-fun.com/tlaplus/2025/12/15/tftp-symbolic-testing.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed#10-prior-work">the prior work
section</a>.</p>

<p>Trace validation also has its challenges:</p>

<ol>
  <li>
    <p>We need a good test suite, in order to produce “interesting” traces.
 However, test cases are usually written for the happy-path scenarios. Hence,
 it is easy to miss handling of error cases and faults. <a href="https://www.youtube.com/watch?v=DO8MvouV29M&amp;utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Srinidhi Nagendra et
 al. (2025)</a> address this issue by fuzzing the tests.</p>
  </li>
  <li>
    <p>Someone has to instrument the SUT to trace the relevant events. In some
 cases, it is easy to do, e.g., by tracing message exchanges, as presented by
 <a href="https://www.youtube.com/watch?v=NZmON-XmrkI&amp;utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Markus Kuppe et al. (2024)</a>. In other cases, it may be quite difficult
 to do, e.g., when we want to dump the internal states of the SUT. In a
 concurrent system this may require a global lock and traversing large data
 structures. In a distributed system, this may further require a distributed
 snapshot or using vector clocks.</p>
  </li>
  <li>
    <p>We have to run the whole system to collect traces. It is hard to isolate one
 component, e.g., one network node.</p>
  </li>
</ol>

<h2 id="3-interactive-symbolic-testing-with-smt">3. Interactive symbolic testing with SMT</h2>

<p>As we can see, both model-based testing and trace validation in their above
formulation are non-interactive. They both require a complete trace to be
produced first, and <strong>there is no feedback loop between the specification and
the implementation</strong>.</p>

<p>There is a third way to do conformance testing that leverages SMT solvers, yet
receives feedback from the implementation during the testing. I will call it
<strong>interactive symbolic testing</strong>. I think the first time I heard about this
approach was from <a href="https://www.losa.fr/?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Giuliano Losa</a>, when he explained the paper by <a href="https://www.mcmil.net/pubs/SIGCOMM19.pdf?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Ken
McMillan and Leonore Zuck (2019)</a> to me. If you have not read this paper
yet, I highly recommend doing so. On the naming side, McMillan and Zuck call
their approach “specification-based testing”. I find this name to be a bit
non-descriptive, as MBT is also specification-based.</p>

<p>The idea is to generate an action with the SMT solver by following the
specification, execute it against the implementation, and then feed the results
back to the SMT solver to generate the next action. This way, we can
systematically explore the protocol specification while getting feedback from
the implementation.</p>

<p>The picture below illustrates this approach, by approximately following the
internal transition executor of Apalache.</p>

<picture>
  <img class="responsive-img" src="https://protocols-made-fun.com/img/symbolic-testing.svg?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" alt="Symbolic testing" />
</picture>
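
<p>The feedback loop can be sketched in Python as follows. This is a toy under
stated assumptions: the next action is picked by a random choice among the
enabled actions, where the real setting would ask the SMT solver, and
<code>BuggyCounter</code> is a hypothetical implementation with a seeded
bug.</p>

```python
import random

# Toy specification: each action maps a state to its successor, or None if disabled.
SPEC_ACTIONS = {
    "A": lambda x: x + 1 if 3 > x else None,
    "B": lambda x: x - 1 if x > -3 else None,
}


class BuggyCounter:
    """Toy implementation: it saturates at 2 instead of 3 (a seeded bug)."""

    def __init__(self):
        self.value = 0

    def apply(self, action):
        if action == "A":
            self.value = min(self.value + 1, 2)
        else:
            self.value = max(self.value - 1, -3)
        return self.value


def interactive_test(sut, steps, rng):
    """Alternate between the spec and the SUT; return a divergence or None."""
    x = 0
    for i in range(steps):
        enabled = [a for a, f in SPEC_ACTIONS.items() if f(x) is not None]
        action = rng.choice(enabled)   # in the real setting, the solver picks
        expected = SPEC_ACTIONS[action](x)
        observed = sut.apply(action)   # feedback from the implementation
        if observed != expected:
            return (i, action, expected, observed)
        x = observed                   # feed the result back to the spec side
    return None


print(interactive_test(BuggyCounter(), 20, random.Random(0)))
```

<p>The important part is the last assignment in the loop: the next action is
chosen from the state that the implementation actually reached, not from a
trace that was computed in advance.</p>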

<p>To implement this approach to testing with Apalache, we would have to find a way
for Apalache and the test harness to communicate. My experience with development
of Apalache shows that <strong>fixing exploration strategies inside the model checker
is not a good idea</strong>. People always want to tweak them a bit for their purposes.
Given this observation, <a href="https://blltprf.xyz?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Thomas Pani</a> and I have decided to implement a simple
server API for Apalache that would allow external tools to drive the symbolic
execution of TLA<sup>+</sup> specifications.</p>

<h2 id="4-the-new-json-rpc-api-of-apalache">4. The new JSON-RPC API of Apalache</h2>

<p><a href="https://blltprf.xyz?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Thomas</a> and I wanted to have a lightweight API that we could use
from any programming language without writing too much boilerplate code. At this
point, every engineer would whisper: hey, you need gRPC, I’ve got some. Well, we
tried gRPC in the integration of <a href="https://apalache-mc.org?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Apalache</a> with <a href="https://github.com/informalsystems/quint?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Quint</a>. It is hard to call
gRPC lightweight.</p>

<p>So we have decided to go with <a href="https://www.jsonrpc.org/?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">JSON-RPC</a> this time, which is a very simple
protocol that works over HTTP/HTTPS. Implementing a JSON-RPC server is quite
straightforward.  Since Apalache is written in Scala, which is JVM-compatible,
we can use the well-known and battle-tested libraries. Perhaps, a bit
unexpectedly for a Scala project, I’ve decided to implement this server with
<a href="https://jetty.org/?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Jetty</a> for serving the HTTP requests and <a href="https://github.com/FasterXML/jackson?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Jackson</a> for JSON serialization.
(The reason is that we have already burnt ourselves with fancy but poorly
supported libraries in Scala.) The resulting server is lightweight and fast.
Moreover, it can be tested with command-line tools like <a href="https://curl.se/?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">curl</a>.</p>

<p>The state-chart diagram of the Apalache JSON-RPC server for a single session is
shown below.</p>

<picture>
  <img class="responsive-img" src="https://protocols-made-fun.com/img/apalache-api.svg?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" alt="Apalache JSON-RPC API" />
</picture>

<p>To see a detailed description of this API, check <a href="https://github.com/apalache-mc/apalache/tree/main/json-rpc?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Apalache JSON-RPC</a>.  Just to
give you a taste of it, here is how you start the server without having
anything installed but Docker:</p>

<pre><code class="language-shell">$ docker pull ghcr.io/apalache-mc/apalache
$ docker run --rm -v /tmp:/var/apalache -p 8822:8822 \
    ghcr.io/apalache-mc/apalache:latest \
    server --server-type=explorer
</code></pre>

<p>Now, we create a new Apalache session with a TLA<sup>+</sup> specification (in a
separate tab):</p>

<pre><code class="language-shell">$ SPEC=`cat &lt;&lt;EOF | base64
---- MODULE Inc ----
EXTENDS Integers
VARIABLE
  \* @type: Int;
  x
Init == I:: x = 0
Next == (A:: (x &lt; 3 /\\ x' = x + 1)) \\/ (B:: (x &gt; -3 /\\ x' = x - 1))
Inv3 == Inv:: x /= 0
\* @type: () =&gt; &lt;&lt;Bool, Bool, Bool&gt;&gt;;
View == &lt;&lt;x &lt; 0, x = 0, x &gt; 0&gt;&gt;
=====================
EOF`
$ curl -X POST http://localhost:8822/rpc \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"loadSpec","params":{"sources": [ "'${SPEC}'" ],
       "invariants": ["Inv3"], "exports": ["View"]},"id":1}'
</code></pre>
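
<p>The same request can be issued from Python with the standard library only.
The sketch below mirrors the <code>curl</code> call above. The function names
<code>load_spec_request</code> and <code>post_rpc</code> are my own, and the
module text in the usage example is just a placeholder.</p>

```python
import base64
import json
import urllib.request


def load_spec_request(spec_text, invariants, exports, request_id=1):
    """Build the JSON-RPC 2.0 request body for loadSpec."""
    source = base64.b64encode(spec_text.encode("utf-8")).decode("ascii")
    return {
        "jsonrpc": "2.0",
        "method": "loadSpec",
        "params": {"sources": [source], "invariants": invariants, "exports": exports},
        "id": request_id,
    }


def post_rpc(body, url="http://localhost:8822/rpc"):
    """POST a JSON-RPC request and return the decoded JSON response."""
    data = json.dumps(body).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    request = urllib.request.Request(url, data=data, headers=headers)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Placeholder text: pass the contents of your .tla file here.
body = load_spec_request("---- MODULE Inc ----\n====", ["Inv3"], ["View"])
print(json.dumps(body, indent=2))
# With the server running, send it with: post_rpc(body)
```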

<p>Isn’t that amazing? No protobuf, no code generation, just pure shell and
readable JSON.</p>

<p>Having the specification loaded, we load the predicate <code>Init</code> into the solver
context, which is encoded as transition 0:</p>

<pre><code class="language-shell">$ curl -X POST http://localhost:8822/rpc \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"assumeTransition","params":{"sessionId":"1",
       "transitionId":0,"checkEnabled":true},"id":2}'
</code></pre>

<p>Assuming that the previous call returned <code>ENABLED</code>, we switch to the next
step, which applies the effect of <code>Init</code> to the current symbolic state:</p>

<pre><code class="language-shell">$ curl -X POST http://localhost:8822/rpc \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"nextStep","params":{"sessionId":"1"},"id":3}'
</code></pre>

<p>Now, we can check the invariant <code>Inv3</code> against all states that satisfy <code>Init</code>:</p>

<pre><code class="language-shell">$ curl -X POST http://localhost:8822/rpc \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"checkInvariant",
       "params":{"sessionId":"1","invariantId":0},"id":3}'
</code></pre>

<p>Since invariant <code>Inv3</code> is violated by the initial state, the server returns
<code>VIOLATED</code>, along with a counter-example trace:</p>

<pre><code class="language-json">{
  "jsonrpc": "2.0",
  "id": 3,
  "result": {
    "sessionId": "1",
    "invariantStatus": "VIOLATED",
    "trace": {
      "#meta": {
        "format": "ITF",
        "varTypes": { "x": "Int" },
        "format-description": "https://apalache-mc.org/docs/adr/015adr-trace.html",
        "description": "Created by Apalache on Thu Dec 11 16:56:47 CET 2025"
      },
      "vars": [ "x" ],
      "states": [ {
          "#meta": { "index": 0 },
          "x": { "#bigint": "0" }
      } ]
    }
  }
}
</code></pre>

<p>The trace is encoded in the <a href="https://apalache-mc.org/docs/adr/015adr-trace.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">ITF format</a>, which is a simple JSON-based
format for TLA<sup>+</sup> and Quint traces.</p>
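
<p>Reading such a trace in Python takes only a few lines. The sketch below
decodes a shortened copy of the trace above. It handles only the
<code>#bigint</code> encoding, lists, and plain objects, whereas the full
format has more encodings. The helper names are hypothetical.</p>

```python
import json

ITF_TEXT = """
{
  "#meta": { "format": "ITF", "varTypes": { "x": "Int" } },
  "vars": [ "x" ],
  "states": [ { "#meta": { "index": 0 }, "x": { "#bigint": "0" } } ]
}
"""


def decode_value(value):
    """Decode the encodings used here: big integers, lists, and plain objects."""
    if isinstance(value, dict):
        if set(value) == {"#bigint"}:
            return int(value["#bigint"])
        return {k: decode_value(v) for k, v in value.items()}
    if isinstance(value, list):
        return [decode_value(v) for v in value]
    return value


def states_of(trace):
    """Return one plain dict per state, restricted to the declared variables."""
    return [
        {name: decode_value(state[name]) for name in trace["vars"]}
        for state in trace["states"]
    ]


print(states_of(json.loads(ITF_TEXT)))
```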

<p>Had the invariant been violated on a deeper trace, we would have to assume more
transitions by calling <code>assumeTransition</code> and <code>nextStep</code> multiple times.</p>
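
<p>The call sequence above (<code>assumeTransition</code>, <code>nextStep</code>,
<code>checkInvariant</code>) can be wrapped into one Python function. This is a
sketch: the helper names and the injected <code>send</code> function are
hypothetical. It also ignores the reply to <code>assumeTransition</code>, whose
exact shape is not shown in this post; a real harness should check that the
transition is enabled before moving on.</p>

```python
def rpc(method, params, request_id):
    """Build a JSON-RPC 2.0 request body."""
    return {"jsonrpc": "2.0", "method": method, "params": params, "id": request_id}


def step_and_check(send, session_id, transition_id, invariant_id, first_id=1):
    """Assume one transition, move to the next step, and check one invariant."""
    send(rpc(
        "assumeTransition",
        {"sessionId": session_id, "transitionId": transition_id, "checkEnabled": True},
        first_id,
    ))
    send(rpc("nextStep", {"sessionId": session_id}, first_id + 1))
    reply = send(rpc(
        "checkInvariant",
        {"sessionId": session_id, "invariantId": invariant_id},
        first_id + 2,
    ))
    return reply["result"]["invariantStatus"]


def fake_send(request):
    """Stand-in transport for a dry run: it always reports a violation."""
    return {"result": {"invariantStatus": "VIOLATED"}}


print(step_and_check(fake_send, "1", 0, 0))
```

<p>Passing the transport as a parameter makes it easy to replace
<code>fake_send</code> with a function that posts the request to the running
server.</p>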

<p>If you want to access this API from Python right away, use two helper libraries:</p>

<ul>
  <li>
    <p><a href="https://github.com/konnov/apalache-rpc-client/?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">apalache-rpc-client</a> for interacting with the JSON-RPC server of
  Apalache, and</p>
  </li>
  <li>
    <p><a href="https://github.com/konnov/itf-py/?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">itf-py</a> for serializing and deserializing ITF traces.</p>
  </li>
</ul>

<h2 id="5-case-study-tftp-protocol">5. Case study: TFTP protocol</h2>

<p>To experiment with interactive symbolic testing and the new JSON-RPC API, I
wanted to choose a relatively simple network protocol that had multiple
implementations. After several sessions with ChatGPT, I ended up with the
Trivial File Transfer Protocol (TFTP) as a reasonable target for this small
project.</p>

<p>The Wikipedia page on <a href="https://en.wikipedia.org/wiki/Trivial_File_Transfer_Protocol?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">TFTP</a> gives us a good overview of the protocol. In
short, TFTP is a simple protocol to transfer files over UDP. It supports reading
and writing files from a remote server. It is mostly used for booting from the
network. The protocol is simple enough to be specified in TLA<sup>+</sup>
without too much effort, yet it has enough complexity to make the testing effort
interesting. Actually, I’ve only specified reading requests (RRQ) and no writing
requests (WRQ) to keep the scope manageable.</p>

<p>You can find more detailed specifications in the original <a href="https://www.rfc-editor.org/rfc/rfc1350?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">RFC 1350</a>, as well
as in its extensions <a href="https://www.rfc-editor.org/rfc/rfc2347?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">RFC 2347</a>, <a href="https://www.rfc-editor.org/rfc/rfc2348?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">RFC 2348</a>, and <a href="https://www.rfc-editor.org/rfc/rfc2349?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">RFC 2349</a>. RFC 1350
defines a simple non-negotiated version of the protocol. Below is an example of
such an interaction between the client and the server. Notice that the client
first sends a read request (RRQ) to the server on the control port 69, which
responds with the first data block (DATA) on a newly allocated ephemeral port.
The client acknowledges (ACK) the received data block on the same ephemeral
port.  This continues until the server sends the last data block, which is
smaller than the maximum block size (512 bytes by default).</p>

<picture>
  <img class="responsive-img" src="https://protocols-made-fun.com/img/rrq1350.svg?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" alt="Read request and transfer as per RFC 1350" />
</picture>
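
<p>On the wire, these packets are tiny. The following Python sketch encodes RRQ
and ACK and decodes DATA, using the opcodes of RFC 1350. The function names are
my own and are not taken from the test harness.</p>

```python
import struct

RRQ, DATA, ACK = 1, 3, 4  # opcodes from RFC 1350


def encode_rrq(filename, mode="octet"):
    """RRQ: opcode, then the file name and the mode as NUL-terminated strings."""
    return (
        struct.pack("!H", RRQ)
        + filename.encode("ascii") + b"\0"
        + mode.encode("ascii") + b"\0"
    )


def encode_ack(block_num):
    """ACK: opcode followed by the 16-bit block number."""
    return struct.pack("!HH", ACK, block_num)


def decode_data(packet):
    """Split a DATA packet into (block number, payload)."""
    opcode, block_num = struct.unpack("!HH", packet[:4])
    if opcode != DATA:
        raise ValueError("not a DATA packet")
    return block_num, packet[4:]


def is_last_block(payload, blksize=512):
    """A payload shorter than the block size ends the transfer."""
    return blksize > len(payload)


print(encode_rrq("file1"))
print(decode_data(struct.pack("!HH", DATA, 1) + b"hello"))
```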

<p>Further, <a href="https://www.rfc-editor.org/rfc/rfc2347?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">RFC 2347</a> defines an option negotiation phase that happens right after the
read request. The client and the server may negotiate options like block size,
timeout, and transfer size. <a href="https://www.rfc-editor.org/rfc/rfc2348?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">RFC 2348</a> defines the block size option, while
<a href="https://www.rfc-editor.org/rfc/rfc2349?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">RFC 2349</a> defines the transfer size option. Below is an example interaction with
option negotiation:</p>

<picture>
  <img class="responsive-img" src="https://protocols-made-fun.com/img/rrq2347.svg?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" alt="Read request and transfer as per RFC 2347" />
</picture>
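
<p>Options are encoded as pairs of NUL-terminated strings, as defined in RFC
2347. Here is a minimal Python sketch that encodes the options appended to a
request and decodes an OACK packet. It assumes that all option values are
decimal integers, which holds for block size, timeout, and transfer size. The
function names are hypothetical.</p>

```python
import struct

OACK = 6  # opcode from RFC 2347


def encode_options(options):
    """Encode options as NUL-terminated name/value strings, as appended to an RRQ."""
    out = b""
    for name, value in options.items():
        out += name.encode("ascii") + b"\0" + str(value).encode("ascii") + b"\0"
    return out


def decode_oack(packet):
    """Decode an OACK packet into a dictionary from option names to integers."""
    (opcode,) = struct.unpack("!H", packet[:2])
    if opcode != OACK:
        raise ValueError("not an OACK packet")
    fields = packet[2:].split(b"\0")[:-1]  # drop the empty piece after the final NUL
    names, values = fields[0::2], fields[1::2]
    # option names are case-insensitive, so normalize them to lower case
    return {
        n.decode("ascii").lower(): int(v.decode("ascii"))
        for n, v in zip(names, values)
    }


oack = struct.pack("!H", OACK) + encode_options({"blksize": 1024, "tsize": 0})
print(decode_oack(oack))
```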

<p>The cool thing about TFTP is that it has multiple open-source implementations of
TFTP clients and servers in different programming languages. Here are some of
them:</p>

<ul>
  <li>
    <p><a href="https://kernel.googlesource.com/pub/scm/network/tftp/tftp-hpa/?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">tftp-hpa</a> is the canonical implementation of TFTP for Linux (and UNIX?) in C.</p>
  </li>
  <li>
    <p><a href="https://github.com/madmartin/atftp?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">atftpd</a> is advanced TFTP, which is intended for fast boot in large
 clusters, also in C.</p>
  </li>
  <li>
    <p><a href="http://www.thekelleys.org.uk/dnsmasq/doc.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">dnsmasq</a> is a lightweight DNS and DHCP server that also includes a TFTP
 server, in C.</p>
  </li>
  <li>
    <p><a href="https://github.com/altugbakan/rs-tftpd?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">rs-tftpd</a> (Rust) is an implementation of a TFTP server in Rust.</p>
  </li>
  <li>
    <p><a href="https://github.com/pin/tftp?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">gotfpd</a> (Go) is an implementation of a TFTP server in Go.</p>
  </li>
  <li>
    <p>busybox also has its minimalistic implementation for file reads.</p>
  </li>
</ul>

<h2 id="6-initial-tla-specification-of-tftp">6. Initial TLA<sup>+</sup> specification of TFTP</h2>

<p>In the first stage of this experiment, I read the RFCs and wrote a
TLA<sup>+</sup> specification of the TFTP protocol. At that stage, I did not
introduce packet loss, duplication, or reordering. I just wanted to have a
simple working specification that I could use for testing the implementations.
<strong>This stage took me just two days.</strong> Well, I have been writing plenty of
TLA<sup>+</sup> specifications in the past.</p>

<p>You can check this initial specification in the <a href="https://github.com/konnov/tftp-symbolic-testing/tree/6fb00d1878b7e37a629868ac25b853d95b16cbdc?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">initial commit</a> of the
<a href="https://github.com/konnov/tftp-symbolic-testing?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">testing repo</a>. The main body of the specification lives in <code>tftp.tla</code>,
which imports several auxiliary modules:</p>

<ul>
  <li>
    <p><code>typedefs.tla</code> defines the types of the data structures and the basic
 constructors for these data structures. Since I am using Apalache, the
 specification needs type definitions. Luckily, these days, I just write the
 type definitions in comments and let Claude generate the auxiliary operators
 such as constructors and accessors. If you already have an untyped
 specification, Claude is good at figuring out the types in the agent mode. Just
 use <a href="https://github.com/apalache-mc/apalache/blob/main/prompts/type-annotation-assistant.md?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">this prompt</a>.</p>
  </li>
  <li>
    <p><code>util.tla</code> defines common utility definitions such as <code>Min</code>, <code>Max</code>, and
 option conversions.</p>
  </li>
</ul>

<p>Finally, <code>MC2_tftp.tla</code> defines a protocol instance of two clients and one
server. If you stumble upon the definitions that end with <code>View</code> there, ignore
them. They are not essential for this blog post. I used them to experiment with
more advanced symbolic exploration scripts.</p>

<p>If you are not familiar with TLA<sup>+</sup>, or your TLA<sup>+</sup> skills are
rusty, I recommend giving one of the definitions and this prompt to ChatGPT. It
actually explains TLA<sup>+</sup> quite well:</p>

<pre><code>Assume that I am a software engineer. I don't know TLA+ but know Golang or Rust.
Explain me this TLA+ snippet using my knowledge: ...
</code></pre>

<p>To see the kinds of actions this initial specification had, have a look at the
definition of <code>Next</code> in <code>tftp.tla</code>:</p>

<pre><code class="language-tlaplus">Next ==
    \* the actions by the clients
    \/  \E srcIp \in CLIENT_IPS, srcPort \in PORTS:
            \E filename \in DOMAIN FILES, timeout \in 1..255:
                \* "man tftpd": 65464 is the theoretical maximum for block size
                \* https://linux.die.net/man/8/tftpd
                \E tsize \in 0..FILES[filename], blksize \in 0..65464:
                    \* choose a subset of the options to request
                    \E optionKeys \in SUBSET OPTIONS_RFC2349:
                        LET options ==
                            mk_options(optionKeys, blksize, tsize, timeout)
                        IN
                        ClientSendRRQ(srcIp, srcPort, filename, options)
    \/  \E udp \in packets:
            \/ ClientRecvDATA(udp)
            \/ ClientRecvOACK(udp)
            \/ ClientRecvErrorAndCloseConn(udp)
    \/  \E ipPort \in DOMAIN clientTransfers:
            ClientTimeout(ipPort)
    \* the server
    \/  \E udp \in packets:
            \/ ServerRecvRRQ(udp)
            \/ ServerSendDATA(udp)
            \/ ServerRecvAckAndCloseConn(udp)
            \/ ServerRecvErrorAndCloseConn(udp)
    \/  \E ipPort \in DOMAIN serverTransfers:
            ServerTimeout(ipPort)
    \* handle the clock and timeouts
    \/  \E delta \in 1..255:
            AdvanceClock(delta)

</code></pre>

<div class="tip-box">
    <div class="tip-header"><strong>💡 Tip:</strong></div>
    <div class="tip"><p>Do not spend too much time on reading this
initial specification. I misunderstood several things about TFTP from the
RFCs, which I fixed later. Especially, the timeouts are completely wrong
in this initial version. Good that the actual implementations helped me to
find these mistakes!</p>
</div>
</div>

<p><strong>Falsy invariants</strong>. As I always do, I also specified “falsy invariants” to
produce interesting examples. For example, using the invariant
<code>RecvThreeDataBlocksEx</code> below, I can easily produce a trace where a client
receives three data blocks from the server.</p>

<pre><code class="language-tla">\* Check this falsy invariant to see an example of a client receiving 3 blocks.
RecvThreeDataBlocksEx ==
    ~(\E p \in DOMAIN clientTransfers:
        Len(clientTransfers[p].blocks) &gt;= 3)
</code></pre>

<p>If you want to try it right away without installing anything, just do this with
Docker:</p>

<pre><code class="language-shell">$ git clone git@github.com:konnov/tftp-symbolic-testing.git
$ cd tftp-symbolic-testing
$ git checkout 6fb00d1878b7e37a629868ac25b853d95b16cbdc
$ docker pull ghcr.io/apalache-mc/apalache
$ docker run --rm -v `pwd`:/var/apalache ghcr.io/apalache-mc/apalache \
  check --inv=RecvThreeDataBlocksEx MC2_tftp.tla
</code></pre>

<p><strong>Trace visualization.</strong>
Since Apalache emits traces in the <a href="https://apalache-mc.org/docs/adr/015adr-trace.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">ITF format</a>, which has a very simple
schema in JSON, it was easy for me to convince Claude to produce a Python script
that would convert ITF traces to human-readable sequence diagrams in Mermaid. Here is
just an example of such a trace produced by Apalache when checking the invariant
<code>RecvThreeDataBlocksEx</code> in Mermaid:</p>

<pre><code>sequenceDiagram
    participant ip10_0_0_3_port65000 as 10.0.0.3:65000
    participant ip10_0_0_1_port10000 as 10.0.0.1:10000
    participant ip10_0_0_1_port69 as 10.0.0.1:69

    ip10_0_0_3_port65000-&gt;&gt;ip10_0_0_1_port69: RRQ(file1, blksize=0, timeout=4)
    ip10_0_0_1_port10000-&gt;&gt;ip10_0_0_3_port65000: DATA(blk=1, 512B)
    ip10_0_0_3_port65000-&gt;&gt;ip10_0_0_1_port10000: ACK(blk=1)
    ip10_0_0_1_port10000-&gt;&gt;ip10_0_0_3_port65000: DATA(blk=2, 512B)
    ip10_0_0_3_port65000-&gt;&gt;ip10_0_0_1_port10000: ACK(blk=2)
    ip10_0_0_1_port10000-&gt;&gt;ip10_0_0_3_port65000: DATA(blk=3, 0B)
    ip10_0_0_3_port65000-&gt;&gt;ip10_0_0_1_port10000: ACK(blk=3)
</code></pre>

<p>This is what it looks like when rendered by <a href="https://www.mermaidchart.com/?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Mermaid</a>:</p>

<picture>
  <img class="responsive-img" src="https://protocols-made-fun.com/img/tftp3.svg?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" alt="Visualized trace of TFTP client receiving three data blocks" />
</picture>
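
<p>The rendering part of such a script is short. Below is a minimal Python
sketch, not the script that Claude generated. It assumes that the messages have
already been extracted from the ITF states into hypothetical tuples of the form
<code>(src_ip, src_port, dst_ip, dst_port, label)</code>; the extraction itself
depends on the variables of the specification.</p>

```python
def participant_id(ip, port):
    """Build an identifier such as ip10_0_0_3_port65000."""
    return "ip" + ip.replace(".", "_") + "_port" + str(port)


def to_mermaid(messages):
    """Render (src_ip, src_port, dst_ip, dst_port, label) tuples as a sequence diagram."""
    endpoints = []
    for src_ip, src_port, dst_ip, dst_port, _ in messages:
        for end in ((src_ip, src_port), (dst_ip, dst_port)):
            if end not in endpoints:
                endpoints.append(end)  # keep the order of first appearance
    lines = ["sequenceDiagram"]
    for ip, port in endpoints:
        lines.append(f"    participant {participant_id(ip, port)} as {ip}:{port}")
    lines.append("")
    for src_ip, src_port, dst_ip, dst_port, label in messages:
        src = participant_id(src_ip, src_port)
        dst = participant_id(dst_ip, dst_port)
        lines.append(f"    {src}->>{dst}: {label}")
    return "\n".join(lines)


print(to_mermaid([
    ("10.0.0.3", 65000, "10.0.0.1", 69, "RRQ(file1, blksize=0, timeout=4)"),
    ("10.0.0.1", 10000, "10.0.0.3", 65000, "DATA(blk=1, 512B)"),
    ("10.0.0.3", 65000, "10.0.0.1", 10000, "ACK(blk=1)"),
]))
```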

<p><strong>Note on abstractions.</strong> Similar to <a href="https://www.mcmil.net/pubs/SIGCOMM19.pdf?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">McMillan and Zuck</a>, I tried
to avoid unnecessary abstractions and approximations in the specification.  If
you look at the type definition of a TFTP packet in
<a href="https://github.com/konnov/tftp-symbolic-testing/blob/6fb00d1878b7e37a629868ac25b853d95b16cbdc/spec/typedefs.tla?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed"><code>typedefs.tla</code></a>,
you will see that all fields except <code>data</code> are modeled as strings and integers:</p>

<pre><code class="language-tlaplus">  // TFTP Packet Types
  @typeAlias: tftpPacket =
    // Read Request (RFC 1350, Figure 5-1, RFC 2347).
    // See RFCs 2348-2349 for the options.
      RRQ({ opcode: Int, filename: Str, mode: Str, options: Str -&gt; Int })
    // Write Request (RFC 1350, Figure 5-1, RFC 2347).
    // See RFCs 2348-2349 for the options.
    | WRQ({ opcode: Int, filename: Str, mode: Str, options: Str -&gt; Int })
    // Acknowledgment (RFC 1350, Figure 5-3)
    | ACK({ opcode: Int, blockNum: Int })
    // Option Acknowledgment (RFC 2347)
    | OACK({ opcode: Int, options: Str -&gt; Int })
    // Data packet (RFC 1350, Figure 5-2)
    // In our specification, we simply pass the length of data instead of the
    // data itself. The test harness should pass the actual data.
    | DATA({ opcode: Int, blockNum: Int, data: Int })
    // Error packet (RFC 1350, Figure 5-4)
    | ERROR({ opcode: Int, errorCode: Int, msg: Str })
  ;

</code></pre>

<p>Thinking about it now, I could even model <code>data</code> as a sequence of bytes, but it
was obvious to me that only the length of <code>data</code> matters for the protocol logic.</p>
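
<p>Since the specification tracks only the length of <code>data</code>, the
test harness has to translate between this length and actual bytes. Here is a
minimal Python sketch of both directions. The class <code>Data</code> and the
two functions are hypothetical and are not taken from the harness.</p>

```python
import struct
from dataclasses import dataclass

DATA_OPCODE = 3  # opcode of a DATA packet in RFC 1350


@dataclass(frozen=True)
class Data:
    """Abstract DATA packet: the specification tracks only the payload length."""
    block_num: int
    data: int  # payload length in bytes


def concretize_data(packet, fill=b"x"):
    """Turn an abstract DATA packet into wire bytes with a payload of the right length."""
    return struct.pack("!HH", DATA_OPCODE, packet.block_num) + fill * packet.data


def abstract_data(wire):
    """Turn wire bytes into an abstract DATA packet by dropping the payload contents."""
    opcode, block_num = struct.unpack("!HH", wire[:4])
    if opcode != DATA_OPCODE:
        raise ValueError("not a DATA packet")
    return Data(block_num, len(wire) - 4)


wire = concretize_data(Data(block_num=3, data=5))
print(wire)
print(abstract_data(wire))
```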

<h2 id="7-bootstrapping-the-testing-harness-with-claude">7. Bootstrapping the testing harness with Claude</h2>

<p>Now, we have the initial TLA<sup>+</sup> specification of TFTP and the standard
implementation <a href="https://kernel.googlesource.com/pub/scm/network/tftp/tftp-hpa/?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">tftp-hpa</a>, which is the default <code>tftpd</code> server on Linux.</p>

<p>I wanted to avoid running the TFTP server on my laptop. What if I accidentally
find a bug that corrupts my file system? So I have decided to run the server and
the client harnesses in Docker containers. This way, I could easily reset the
SUT and have an isolated network for the TFTP server and clients.</p>

<p>Below is the architecture of the test harness that I had in mind. It’s quite a
bit overengineered for testing TFTP. I also wanted to experiment with Docker
networking and managing multiple containers for potential future projects.</p>

<pre><code>┌──────────────────────────────────────────────────────────────────┐
│                        Host Machine                              │
│                                                                  │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │ harness.py                                                 │  │
│  │  - Coordinates symbolic execution                          │  │
│  │  - Manages Apalache server                                 │  │
│  │  - Controls Docker containers                              │  │
│  │  - Generates and saves test runs                           │  │
│  └────────┬────────────────────────┬──────────────────────────┘  │
│           │                        │                             │
│           ▼                        ▼                             │
│  ┌─────────────────┐     ┌──────────────────────────┐            │
│  │ Apalache Server │     │  Docker Manager          │            │
│  │  (port 8822)    │     │  - Network: 172.20.0.0/24│            │
│  └─────────────────┘     └──────────┬───────────────┘            │
│                                     │                            │
└─────────────────────────────────────┼────────────────────────────┘
                                      │
                         ┌────────────┴──────────────┐
                         │   Docker Network          │
                         │   (172.20.0.0/24)         │
                         │                           │
         ┌───────────────┼───────────────────────────┼─────────────┐
         │               │                           │             │
         ▼               ▼                           ▼             │
  ┌─────────────┐ ┌─────────────┐          ┌─────────────┐         │
  │ TFTP Server │ │  Client 1   │          │  Client 2   │         │
  │ 172.20.0.10 │ │ 172.20.0.11 │          │ 172.20.0.12 │         │
  │             │ │             │          │             │         │
  │ tftp-hpa    │ │ Python      │          │ Python      │         │
  │ Port: 69    │ │ TCP: 15001  │          │ TCP: 15002  │         │
  │ Data:1024-27│ │ (control)   │          │ (control)   │         │
  └─────────────┘ └─────────────┘          └─────────────┘         │
         ▲               │                           │             │
         │               │    UDP TFTP packets       │             │
         └───────────────┴───────────────────────────┘             │
                                                                   │
                         Docker Containers                         │
                                                                   │
                         tftp-test-harness:latest                  │
                                                                   │
└──────────────────────────────────────────────────────────────────┘
</code></pre>

<p><strong>LLMs will do the work?</strong> As you may have guessed, I had no interest in
writing the Docker files and the test harness from scratch. Having heard from so
many people that LLMs are so amazing, I decided to give Claude a try at
generating the test harness.</p>

<p>Hence, I spent about four hours writing a very detailed prompt for Claude that
explained what I wanted the test harness to look like (the above architecture
diagram was actually generated by Claude from my prompt).</p>

<p><strong>Pushing the button!</strong> So I ran Claude in agent mode with my prompt and
went for a coffee break. You can see the first generated version in <a href="https://github.com/konnov/tftp-symbolic-testing/commit/063da7d2b79c07dfb64225da852440c98b76c41e3?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">this
commit</a>.
The result looked so exciting and amazing until I looked at <code>CHECKLIST.md</code>:</p>

<pre><code>## Notes

- The framework is complete and production-ready
- Remaining work is mostly about connecting components
- Each task is independent and can be tackled separately
- Estimated effort: 4-8 hours for core integration (tasks 1-4)
- Additional 2-4 hours for polish and testing (tasks 5-8)
</code></pre>

<p>What is going on? Claude left me homework? I was also baffled by the hourly
estimates: Are these Claude hours or my hours? In hindsight, the estimate
was surprisingly accurate. It took me about 1.5 days to make this code do the
first test run that made the harness exchange UDP packets with the TFTP server.</p>

<p>Then I looked at <code>harness.py</code>, which was supposed to be “complete and
production-ready”. Guess what? The main loop was left as a TODO!</p>

<pre><code class="language-python">        # TODO: Implement actual TFTP operation execution
        # This would involve:
        # - Querying Apalache for the transition details
        # - Sending commands to the TFTP client in Docker
        # - Collecting UDP packet responses
        # - Parsing the responses
</code></pre>

<p>The overall structure was there, but the most important pieces were left as
TODOs. Fine. It did the tedious part at least. So I started to chat with Claude
again to implement the missing pieces. If you look at the commit history, you
will see plenty of spaghetti code generated by Claude. In the end, it became a
bit better after my guidance, but I had to rewrite it at some point.</p>

<p>Even though I am making jokes about LLMs here, I must say that Claude really
helped me to debug the Docker setup and produce the code for communicating over
UDP in modern Python. I could easily have lost a couple of days there.</p>

<p>Of course, the exploration logic was totally broken. After all, there is not
much for LLMs to learn from. We are doing something new here!</p>

<p><strong>1.5 days later.</strong> Something was working, but even the happy path was not
there. So I had to take baby steps with Claude. Here are just a few examples
from my Copilot chat:</p>

<blockquote>
  <p>Let’s implement sending the RRQ packet over the wire.</p>
</blockquote>

<blockquote>
  <p>…</p>
</blockquote>

<blockquote>
  <p>Now I am receiving a response like below. This is good! What I want you to
do next. Decode the response and construct the expected packet for the TLA+
specification. Save this as the expectation that we will use in the next step.</p>
</blockquote>

<blockquote>
  <p>…</p>
</blockquote>

<blockquote>
  <p>You should not construct a TLA+ expression. Rather, convert the packet to an
ITF value using itf-py.</p>
</blockquote>

<blockquote>
  <p>…</p>
</blockquote>

<blockquote>
  <p>Can you implement this case for receiving OACK from the server and sending ACK
by the client to the server.</p>
</blockquote>
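<p>To give an idea of what these baby steps boil down to, here is a minimal
sketch of encoding <code>RRQ</code> and <code>ACK</code> and decoding <code>OACK</code>. It is not the code
from the harness. The opcodes and the NUL-terminated layout come from RFC 1350
and RFC 2347, whereas the function names are my own choice for this
illustration.</p>

```python
import struct

OP_RRQ, OP_ACK, OP_OACK = 1, 4, 6  # opcodes from RFC 1350 and RFC 2347


def encode_rrq(filename, mode="octet", options=None):
    """Build an RRQ: a 2-byte opcode followed by NUL-terminated strings."""
    fields = [filename, mode]
    for name, value in (options or {}).items():
        fields += [name, str(value)]
    body = b"".join(f.encode("ascii") + b"\x00" for f in fields)
    return struct.pack("!H", OP_RRQ) + body


def encode_ack(block):
    """Build an ACK: a 2-byte opcode followed by a 2-byte block number."""
    return struct.pack("!HH", OP_ACK, block)


def decode_oack(data):
    """Parse an OACK into a dictionary that maps option names to values."""
    (opcode,) = struct.unpack("!H", data[:2])
    if opcode != OP_OACK:
        raise ValueError(f"not an OACK: opcode {opcode}")
    parts = data[2:].split(b"\x00")[:-1]  # drop the empty tail after the last NUL
    names = [p.decode("ascii").lower() for p in parts[0::2]]
    values = [p.decode("ascii") for p in parts[1::2]]
    return dict(zip(names, values))


rrq = encode_rrq("file1", options={"blksize": 1024})
options = decode_oack(b"\x00\x06blksize\x001024\x00")
```

<p>A dictionary like <code>options</code> is the kind of plain value that can then be
converted to an ITF value and saved as the expectation for the next step.</p>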

<p><strong>2 more days later.</strong> I had the happy path working. At this point, I was tired
of reading the harness logs. So I needed some form of visualization for each
run. Obviously, I wanted to have the same kind of Mermaid diagrams as before.</p>

<p>So I asked Claude to generate a script that would reconstruct the sequence
diagram from the harness logs. Well, it took me longer than expected. At some
point, Claude was producing quite convoluted log parsers with regular
expressions and Python loops. Of course, it took a human to define a simple log
format instead.</p>

<p>Below is an example of such a test run, visualized from the log by the generated
script in Mermaid:</p>

<picture>
  <img class="responsive-img" src="https://protocols-made-fun.com/img/tftp-happy.svg?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" alt="Trace visualization of the testing run" />
</picture>

<p>If you look at the above diagram carefully, you will notice that server responses
come in two flavors:</p>

<ol>
  <li>
    <p>The dashed arrows indicate that the client has received the UDP packet from
the UDP socket.</p>
  </li>
  <li>
    <p>The solid arrows indicate that the UDP packet was successfully replayed
with the TLA<sup>+</sup> specification.</p>
  </li>
</ol>
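<p>To illustrate why a simple log format pays off, here is a minimal sketch
that turns one JSON object per line into a Mermaid sequence diagram. The field
names <code>kind</code>, <code>src</code>, <code>dst</code>, and <code>label</code> are hypothetical. The actual log format
and script in the repository may look different.</p>

```python
import json


def to_mermaid(log_lines):
    """Render JSON-lines events as a Mermaid sequence diagram."""
    out = ["sequenceDiagram"]
    for line in log_lines:
        event = json.loads(line)
        # Dashed arrow: the packet was received from the UDP socket.
        # Solid arrow: the packet was sent or replayed with the specification.
        arrow = "-->>" if event["kind"] == "received" else "->>"
        out.append(f"    {event['src']}{arrow}{event['dst']}: {event['label']}")
    return "\n".join(out)


log = [
    '{"kind": "sent", "src": "client1", "dst": "server", "label": "RRQ"}',
    '{"kind": "received", "src": "server", "dst": "client1", "label": "OACK"}',
    '{"kind": "replayed", "src": "server", "dst": "client1", "label": "OACK"}',
]
diagram = to_mermaid(log)
```

<p>With a format like this, there is no need for regular expressions at all:
every log line is parsed by <code>json.loads</code>.</p>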

<h2 id="8-debugging-the-tla-specification-with-the-implementation">8. Debugging the TLA<sup>+</sup> specification with the implementation</h2>

<p>At that point, the tests started to produce actual interactions between the
TLA<sup>+</sup> specification (as solved by Apalache and Z3) and the real TFTP
server. This brought a lot of surprises! I am going to present some of them
below.</p>

<p>In this debugging session, I am keeping the scorecard of how many times the
TLA<sup>+</sup> specification was wrong versus how many times the implementation
(tftp-hpa) was wrong. The scorecard at this point looks like this:</p>

<table>
  <thead>
    <tr>
      <th>Spec bugs</th>
      <th>Implementation bugs</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>0</td>
      <td>0</td>
    </tr>
  </tbody>
</table>

<p>Actually, <a href="https://kernel.googlesource.com/pub/scm/network/tftp/tftp-hpa/?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">tftp-hpa</a> is quite a mature implementation, so I was not expecting
any bugs there. Keep reading to see what I found.</p>

<h3 id="81-sending-errors-on-read-request">8.1. Sending errors on read request</h3>

<p>The first surprise came from my misunderstanding of how exactly TFTP is supposed
to reply to a malformed read request (<code>RRQ</code>). Since a client sends <code>RRQ</code> to the
control port 69 of the server, I thought that the server would reply with an
error packet (<code>ERROR</code>) from port 69, instead of allocating a new ephemeral
port.</p>

<p>This is what <a href="https://www.rfc-editor.org/rfc/rfc2347?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">RFC 2347</a> says about option negotiation:</p>

<blockquote>
  <p>…the server should simply omit the option from the <code>OACK</code>, respond with an
alternate value, or send an <code>ERROR</code> packet, with error code 8, to terminate the
transfer.</p>
</blockquote>

<p>No explanation about the port from which the <code>ERROR</code> packet is sent. Well, my
understanding was wrong. The server always allocates a new ephemeral port for
sending the <code>ERROR</code> packet. This kind of makes sense, as the implementation simply
forks on a new request. One score to the implementation:</p>

<table>
  <thead>
    <tr>
      <th>Spec bugs</th>
      <th>Implementation bugs</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1</td>
      <td>0</td>
    </tr>
  </tbody>
</table>

<div class="tip-box">
    <div class="tip-header"><strong>💡 Tip:</strong></div>
    <div class="tip"><p>Actually, as I found later, the spec was not
always wrong, as the busybox implementation always uses port 69!</p>
</div>
</div>
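<p>On the client side of the harness, this means that a reply cannot be matched
by the control port alone. Below is a minimal sketch, assuming that the client
simply adopts the source port of the first reply as the server port of the
transfer. The class <code>ServerEndpoint</code> is hypothetical and is not taken from the
harness.</p>

```python
class ServerEndpoint:
    """Tracks which server port a single transfer talks to."""

    def __init__(self, server_ip, control_port=69):
        self.server_ip = server_ip
        self.control_port = control_port
        self.transfer_port = None  # learned from the first reply

    def accept(self, src_ip, src_port):
        """Return True if a reply from (src_ip, src_port) belongs to this transfer."""
        if src_ip != self.server_ip:
            return False
        if self.transfer_port is None:
            # First reply (DATA, OACK, or ERROR): adopt its source port,
            # whether it is a fresh ephemeral port or the control port.
            self.transfer_port = src_port
            return True
        return src_port == self.transfer_port


hpa_like = ServerEndpoint("172.20.0.10")
first_reply_ok = hpa_like.accept("172.20.0.10", 1025)   # ERROR from an ephemeral port
other_port_ok = hpa_like.accept("172.20.0.10", 1026)    # a different port later on

busybox_like = ServerEndpoint("172.20.0.10")
control_port_ok = busybox_like.accept("172.20.0.10", 69)  # ERROR from port 69
```

<p>This way, the same check covers both behaviors: a server that replies from a
new ephemeral port and a server that replies from port 69.</p>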

<h3 id="82-the-server-may-send-duplicate-packets">8.2. The server may send duplicate packets</h3>

<p>Well, I knew that, but was too lazy to write an action in the specification
that would handle duplicate packets. This is a typical shortcut when writing a
specification, since duplicate packets do not change the specification state and
are considered “stuttering” steps. The server implementation retransmitted a <code>DATA</code>
packet, which produced a deviation in the TLA<sup>+</sup> specification. Another
score to the implementation:</p>

<table>
  <thead>
    <tr>
      <th>Spec bugs</th>
      <th>Implementation bugs</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>2</td>
      <td>0</td>
    </tr>
  </tbody>
</table>

<p>Formally speaking, this action does not affect protocol safety, so it is
tempting to simply skip duplicates. However, in conformance testing, we have to
handle all possible actions of the implementation, even if they produce stuttering
steps in the theory of TLA<sup>+</sup>.</p>

<h3 id="83-input-output-conformance-does-not-work-with-udp">8.3. Input-output conformance does not work with UDP!</h3>

<p>The next issue was quite interesting. When I read the papers on input-output
conformance testing from the 1990s, there was always an assumption that the
system under test (SUT) is input-enabled. This means that the SUT can always
accept any input at any time and respond to it, possibly, with an error message.
This assumption makes sense for synchronous systems (such as vending machines?),
where the tester can wait for the SUT to be ready to accept the input.</p>

<p>However, TFTP is not like that at all. The client may send an <code>ERROR</code> packet at
any point in time, and the server does not have to reply to it! This is exactly
the deviating test run that the harness produced.</p>

<p>So instead of waiting for a reply from the server on each client action, the
test harness has to optimistically send the next UDP packet and then retrieve
the UDP packets from the server (remember that they live in Docker!).</p>

<p>This is where Claude was useful again. It helped me to collect the UDP packets
on the Docker client. Before taking the next step, the harness would retrieve
the buffered UDP packets from the Docker clients and replay these packets in the
TLA<sup>+</sup> specification, in arbitrary order.</p>

<p>This makes our testing approach a bit more sensitive to the timing of extracting
the buffered UDP packets, but it worked for TFTP.</p>
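<p>The idea of replaying buffered packets in arbitrary order can be sketched as
follows. This is a simplified illustration, not the harness code. The function
<code>step</code> is a hypothetical stand-in for asking the specification whether it
accepts a packet in the current state, and <code>toy_step</code> is a toy oracle that
expects block numbers in increasing order.</p>

```python
from itertools import permutations


def replay_any_order(state, packets, step):
    """Try every ordering of the buffered packets.

    `step(state, packet)` is a hypothetical oracle call: it returns the next
    state if the specification accepts the packet, and None otherwise.
    """
    for order in permutations(packets):
        current = state
        for packet in order:
            current = step(current, packet)
            if current is None:
                break  # this ordering deviates, try the next one
        else:
            return current, list(order)
    return None  # no ordering is accepted: report a deviation


def toy_step(expected_block, block):
    """Toy oracle: accept only the next expected block number."""
    return expected_block + 1 if block == expected_block else None


accepted = replay_any_order(1, [2, 1, 3], toy_step)
deviation = replay_any_order(1, [1, 3], toy_step)
```

<p>Trying all orderings is factorial in the number of buffered packets. I
assume here that the buffer holds only a handful of packets between two steps of
the harness.</p>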

<table>
  <thead>
    <tr>
      <th>Spec bugs</th>
      <th>Implementation bugs</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>3</td>
      <td>0</td>
    </tr>
  </tbody>
</table>

<h3 id="84-the-server-recycles-an-ephemeral-port-on-error">8.4. The server recycles an ephemeral port on ERROR</h3>

<p>Another interesting deviation happened when the server recycled an ephemeral
port. <a href="https://www.rfc-editor.org/rfc/rfc1350?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">RFC 1350</a> explains how the server allocates ephemeral ports:</p>

<blockquote>
  <p>In order to create a connection, each end of the connection chooses a TID for
itself, to be used for the duration of that connection. The TID’s chosen for a
connection should be randomly chosen, so that the probability that the same
number is chosen twice in immediate succession is very low.</p>
</blockquote>

<p>Well, in our test run, the event of low probability happened (actually, I gave
the TFTP server a small range of ports to use):</p>

<picture>
  <img class="responsive-img" src="https://protocols-made-fun.com/img/tftp-fix5.svg?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" alt="Recycling ephemeral ports on error" />
</picture>

<p>Actually, this theme of reusing the same ephemeral port happened multiple times
in the following debugging iterations. It is probably the most problematic
aspect of the protocol, as there is no notion of a session in TFTP. Another
score to the implementation:</p>

<table>
  <thead>
    <tr>
      <th>Spec bugs</th>
      <th>Implementation bugs</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>4</td>
      <td>0</td>
    </tr>
  </tbody>
</table>

<h3 id="85-the-server-recycles-an-ephemeral-port-on-success">8.5. The server recycles an ephemeral port on success</h3>

<p>Guess what? A very similar thing happened on a successful file transfer as well.
Here is a pruned version of the trace that shows this behavior (the initial
sequence of <code>RRQ</code>-<code>OACK</code>-<code>DATA</code>-<code>ACK</code> is omitted for brevity):</p>

<picture>
  <img class="responsive-img" src="https://protocols-made-fun.com/img/tftp-fix6.svg?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" alt="Recycling ephemeral ports on success" />
</picture>

<p>This behavior seems to be consistent with <a href="https://www.rfc-editor.org/rfc/rfc1350?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed#section-6">Section 6 of RFC 1350</a>, though it
seems to be ambiguous to me. Anyway, another score to the implementation:</p>

<table>
  <thead>
    <tr>
      <th>Spec bugs</th>
      <th>Implementation bugs</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>5</td>
      <td>0</td>
    </tr>
  </tbody>
</table>

<h3 id="86-mixing-the-protocol-versions">8.6. Mixing the protocol versions</h3>

<p>TFTP essentially has two versions: the original version defined in RFC 1350 and
the extended version with option negotiation defined in RFC 2347. In combination
with packet duplication, this produced a very interesting deviation. I’ve not
saved the full trace, but here is what happened. The server processes an RRQ
with options and sends an OACK, as per RFC 2347. After that, the TLA<sup>+</sup>
specification of the server receives an earlier RRQ without options and sends a
DATA packet in response, as per RFC 1350. This corrupts the internal state of
the server in the specification.</p>

<p>Obviously, this is caused by non-determinism in the TLA<sup>+</sup>
specification, which allows the protocol to behave according to both protocol
versions at the same time. I had to fix the specification by preventing the
server from behaving according to RFC 1350 when it receives an RRQ with options.
One score to the implementation:</p>

<table>
  <thead>
    <tr>
      <th>Spec bugs</th>
      <th>Implementation bugs</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>6</td>
      <td>0</td>
    </tr>
  </tbody>
</table>

<h3 id="87-more-deviations-on-the-specification-side">8.7. More deviations on the specification side</h3>

<p>At some point, I got tired of collecting the precise deviations. They still can
be recovered from the commit log though. Here are some of the further deviations
on the specification side that I fixed:</p>

<ul>
  <li>
    <p>The client must send <code>tsize = 0</code> in RRQ.</p>
  </li>
  <li>
    <p>The server should send default timeout if it’s not specified in the options.</p>
  </li>
  <li>
    <p>The server may send invalid (e.g., outdated) packets.</p>
  </li>
  <li>
    <p>My understanding of TFTP timeouts was wrong. I thought that a timeout was
 meant to close a transfer session. Instead, timeouts in TFTP are just
 triggering packet retransmissions. The number of retries is not specified in
 the RFCs. In practice, tftp-hpa seems to retry 5 times before giving up.</p>
  </li>
  <li>
    <p>The server specification should store transfers for triplets <code>(clientIP,
 clientPort, serverPort)</code> instead of pairs <code>(clientIP, clientPort)</code>.</p>
  </li>
</ul>
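<p>The last item in the list is easy to illustrate with two dictionaries. This
is a minimal sketch with made-up addresses and ports, not an excerpt from the
specification. It assumes that the server answers a read request and its delayed
duplicate from two different ephemeral ports.</p>

```python
# Two read requests from the same client socket, e.g., an RRQ and its delayed
# duplicate. The tuples are (client IP, client port, server port, file name).
# All values are hypothetical.
requests = [
    ("172.20.0.11", 5000, 1024, "file1"),
    ("172.20.0.11", 5000, 1025, "file1"),
]

by_pair = {}
by_triple = {}
for client_ip, client_port, server_port, filename in requests:
    # Keyed by a pair: the second transfer silently overwrites the first one.
    by_pair[(client_ip, client_port)] = {"server_port": server_port, "file": filename}
    # Keyed by a triple: both transfers are tracked independently.
    by_triple[(client_ip, client_port, server_port)] = {"file": filename}
```

<p>With pairs, a late packet from server port 1024 has no transfer left to be
matched against. With triples, it does.</p>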

<p>In the end, the implementation scored another 7 points, before tests started to
work.</p>

<table>
  <thead>
    <tr>
      <th>Spec bugs</th>
      <th>Implementation bugs</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>13</td>
      <td>0</td>
    </tr>
  </tbody>
</table>

<p>It looks like my TLA<sup>+</sup> specification was a bit sloppy, in comparison
to the mature implementation of <code>tftp-hpa</code>. I have not designed this protocol
and did not give much thought to it. Obviously, the engineers have spent much
more time thinking about its behavior. You can check the specification in
<a href="https://github.com/konnov/tftp-symbolic-testing?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">the repository</a>.</p>

<h2 id="9-testing-against-adversarial-behavior">9. Testing against adversarial behavior</h2>

<p>At some point I thought: My clients are too well-behaved! They never lose,
duplicate, or reorder packets. What if they start to misbehave within the
protocol boundaries? Would I be able to find bugs in the implementation? Yes, I
was. Keep reading.</p>

<p>Hence, I have added one more action that simply lets a client retransmit a
previously sent packet in <code>Next</code>:</p>

<pre><code class="language-tlaplus">    \/ \E udp \in packets:
        ClientSendDup(udp)

</code></pre>

<p>Below is the action <code>ClientSendDup</code>. It does not change the specification state
at all. However, it produces an action that retransmits a packet in the harness:</p>

<pre><code class="language-tlaplus">\* A client resends a duplicate packet that it sent in the past.
\* This is to test for the Sorcerer's Apprentice syndrome.
\* @type: $udpPacket =&gt; Bool;
ClientSendDup(_udp) ==
    ClientSendDup::
    /\ _udp.destIp = SERVER_IP
    /\ lastAction' = ActionRecvSend(_udp)
    /\ UNCHANGED &lt;&lt;packets, serverTransfers, clientTransfers, clock&gt;&gt;

</code></pre>

<p>You can find the complete specification <a href="https://github.com/konnov/tftp-symbolic-testing/tree/main/spec?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">here</a>.</p>

<p><strong>Protocol deviation.</strong> It mostly worked as expected. However, a few traces were
reporting deviations. Here is one of them. It’s pretty long. Look for an
explanation below.</p>

<picture>
  <img class="responsive-img" src="https://protocols-made-fun.com/img/tftp-malformed-ack.svg?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" alt="The implementation diverging from the specification" />
</picture>

<p>The last UDP packet is an acknowledgment for block 1 from the server. If
you think about the protocol, the server should never send an ACK in the
sessions associated with read requests (RRQ). ACK packets are only sent by the
clients.  Yet, this is what was happening. To double check this, I’ve asked
Claude to capture the traffic in pcap files in the Docker containers. Indeed,
Wireshark was showing the ACK packet from the server. Moreover, the packet was
malformed. It looked like the option acknowledgment (OACK) packet, but had the
first bytes of an ACK packet. Sounds like memory corruption!</p>

<p>Here is the core sequence of events that produced this behavior (a few details
removed):</p>

<ol>
  <li>
    <p>The client sends <code>RRQ("file1", blksize=NN)</code> to the server (172.20.0.10:69).</p>
  </li>
  <li>
    <p>The server sends a few OACK packets to the client.</p>
  </li>
  <li>
    <p>The client erroneously sends <code>ACK(1)</code> to the server, which is a duplicate
 packet from an earlier transfer. It could be simply a delayed packet though.</p>
  </li>
  <li>
    <p>The server responds with <code>ACK(1)</code> of length 64, which is basically the
 <code>OACK</code> packet with the first 4 bytes coming from <code>ACK(1)</code>.</p>
  </li>
</ol>

<p><strong>Investigation.</strong> Luckily, the source code is readily available. I’ve looked
into the function <code>tftp_sendfile</code> of <code>tftp-hpa</code> that handles read requests.
Indeed, the option negotiation loop sends the option acknowledgment packet
<code>OACK</code> and waits for an <code>ACK</code> from the client. There are two cases:</p>

<ul>
  <li>
    <p>When it receives an <code>ACK</code> for block 0, it breaks out of the loop and continues with sending data blocks. <strong>This is the happy path.</strong></p>
  </li>
  <li>
    <p>When it receives an acknowledgment for a block other than 0, it simply
 continues the loop, retransmitting <code>OACK</code>. The issue is that <strong>the code uses
 the same buffer</strong> for sending <code>OACK</code> and receiving <code>ACK</code> packets via different
 pointers! Hence, it later sends an <code>OACK</code> packet that is corrupted with the
 contents of the <code>ACK</code> packet. <strong>I don’t think I would have found this by code
 review!</strong></p>
  </li>
</ul>
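<p>The effect of this aliasing is easy to reproduce in a few lines of Python.
The sketch below is a simulation and not the code of <code>tftp-hpa</code>: two
<code>memoryview</code> objects play the role of the two pointers into the same buffer,
and the packet is much shorter than the 64 bytes I saw on the wire.</p>

```python
import struct

OP_ACK, OP_OACK = 4, 6  # opcodes from RFC 1350 and RFC 2347

# One buffer shared by the send path and the receive path.
buf = bytearray(struct.pack("!H", OP_OACK) + b"blksize\x001024\x00")
oack_len = len(buf)

send_view = memoryview(buf)  # plays the role of the pointer used for sending OACK
recv_view = memoryview(buf)  # plays the role of the pointer used for receiving ACK


def receive(packet):
    """Simulate receiving a packet into the shared buffer."""
    recv_view[:len(packet)] = packet


def retransmit():
    """Simulate retransmitting what the server believes is still the OACK."""
    return bytes(send_view[:oack_len])


first = retransmit()                    # a well-formed OACK
receive(struct.pack("!HH", OP_ACK, 1))  # a stale ACK(1) arrives
second = retransmit()                   # ACK(1) header followed by the OACK tail
```

<p>After the stale <code>ACK(1)</code> arrives, the retransmitted packet has the length of
the <code>OACK</code>, but its first 4 bytes are the opcode and the block number of
<code>ACK(1)</code>. This matches the shape of the malformed packet in the trace.</p>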

<p>Just for fun, I checked it with Claude. It could not identify this issue. The
trick is that the same buffer is pointed to by two different pointers, so Claude
is not clever enough to track this aliasing. When I explained the issue to
Claude, it was ecstatic: You have found a critical!</p>

<p>I’ve continued looking for the blast radius of this bug. Even though it is
something of a memory corruption, it cannot crash the server, as the code is still writing
to the same buffer, allocated by the server itself. All it can do is to produce
malformed packets. Hence, it could probably crash a sloppy client, but would not
do much harm to a well-behaved client and itself. Moreover, if a client crashes
in such a case, anybody else on the network could have sent the malformed ACK as
well.</p>

<p>So this is a bug (from the specification p.o.v.), but it does not result in a
vulnerability. In any case, it was a deviation from the protocol specification.
Finally, one point to the specification!</p>

<table>
  <thead>
    <tr>
      <th>Spec bugs</th>
      <th>Implementation bugs</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>13</td>
      <td>1</td>
    </tr>
  </tbody>
</table>

<p><strong>Contacting the author.</strong> To be on the safe side, before writing this blog
post, I’ve contacted the author of tftp-hpa. As I expected, he also replied that
TFTP is an unencrypted unauthenticated protocol, so we should not expect much
security there.</p>

<h2 id="10-the-specification-as-a-differential-testing-oracle">10. The specification as a differential testing oracle</h2>

<p>After finding the above implementation bug, I decided to test other TFTP
implementations as well. This is where Claude was super useful again. I just
asked it to generate Dockerfiles for other implementations, which it did
quickly. It turned out that a similar issue existed in another implementation. I
could not figure out the root cause in the source code of that other
implementation, as it is a bit harder to read than <code>tftp-hpa</code>. Hence, I am not
giving the details here.</p>

<p>Except for this second deviation, the other implementations worked fine. Overall,
the specification scored another point:</p>

<table>
  <thead>
    <tr>
      <th>Spec bugs</th>
      <th>Implementation bugs</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>13</td>
      <td>2</td>
    </tr>
  </tbody>
</table>

<p>Here is what I find really interesting. Whenever I talk to engineers about formal
specifications, they tell me that they would like to do <strong>differential testing</strong>
instead of writing specifications. Meaning that they would like to compare the
behavior of one implementation against another implementation. However,
differential testing is not magic. It requires test inputs to compare the
implementations. Hence, <strong>if the test suite is missing adversarial test cases,
both implementations may pass the tests</strong>, even though they are both wrong.</p>

<p>What we did here with the TLA<sup>+</sup> specification is something more than
just differential testing. First, we have debugged the specification against
<code>tftp-hpa</code>, so we have extracted its expected behavior into a relatively small
and precise formal specification. Second, we have used this specification to
produce the tests for another implementation!</p>
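<p>The difference between the two approaches can be shown on a toy example. In
the sketch below, <code>impl_a</code> and <code>impl_b</code> are made-up servers that share the same
bug during option negotiation, and <code>spec_allows</code> is a made-up oracle. None of
this is taken from the real implementations or from the specification.</p>

```python
def impl_a(ack_block):
    """Toy server A: replies to a stale ACK with a bogus ACK."""
    if ack_block == 0:
        return "DATA(1)"
    return f"ACK({ack_block})"


def impl_b(ack_block):
    """Toy server B: written independently, but with the same bug."""
    return "DATA(1)" if ack_block == 0 else "ACK(" + str(ack_block) + ")"


def spec_allows(ack_block, reply):
    """Toy oracle: on a stale ACK, only an OACK or silence (None) is allowed."""
    allowed = {"DATA(1)"} if ack_block == 0 else {"OACK", None}
    return reply in allowed


def differential(inputs):
    """Inputs on which the two implementations disagree."""
    return [x for x in inputs if impl_a(x) != impl_b(x)]


def conformance(impl, inputs):
    """Inputs on which an implementation deviates from the oracle."""
    return [x for x in inputs if not spec_allows(x, impl(x))]


happy_path = [0]
adversarial = [0, 1]
```

<p>On the happy path, both approaches report nothing. On the adversarial
inputs, <code>differential</code> still reports nothing, since both servers are wrong in
the same way, whereas <code>conformance</code> flags the stale <code>ACK(1)</code>.</p>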

<h2 id="11-prior-work">11. Prior Work</h2>

<p>In this section, I’ve collected the previous work on model-based testing and
trace validation with TLA<sup>+</sup>:</p>

<ul>
  <li>
    <p>Nagendra et al. Model guided fuzzing of distributed systems (2025).
Check <a href="https://www.youtube.com/watch?v=DO8MvouV29M&amp;utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">the talk</a>.</p>
  </li>
  <li>
    <p>Cirstea, Kuppe, Merz, Loillier. Validating Traces of Distributed Systems
Against TLA+ Specifications (2024). Check the
<a href="https://arxiv.org/abs/2404.16075?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">arxiv paper</a>.</p>
  </li>
  <li>
    <p>Chamayou et al. Validating System Executions with the TLA+ Tools (2024).
See <a href="https://www.youtube.com/watch?v=NZmON-XmrkI&amp;utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">the talk</a>.</p>
  </li>
  <li>
    <p>Halterman. Verifiability Gap: Why We Need More From Our Specs and
How We Can Get It (2020).
See <a href="https://www.youtube.com/watch?v=itcj9j2yWQo&amp;utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">the talk</a>.</p>
  </li>
  <li>
    <p>Davis et al. eXtreme Modelling in Practice (2020).
See <a href="https://www.youtube.com/watch?v=IIGzXX72weQ&amp;utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">the talk</a>.</p>
  </li>
  <li>
    <p>Kupriyanov, Konnov. Model-based testing with TLA+ and Apalache (2020).
See <a href="https://www.youtube.com/watch?v=aveoIMphzW8&amp;utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">the talk</a>.</p>
  </li>
  <li>
    <p>Pressler. Verifying Software Traces Against a Formal Specification with
TLA<sup>+</sup> and TLC (2018).
Check <a href="https://pron.github.io/files/Trace.pdf?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">the paper</a>.</p>
  </li>
</ul>

<p>I am pretty sure that this list is incomplete, so please let me know if you are
aware of any other relevant work.</p>

<h2 id="12-conclusions">12. Conclusions</h2>

<p>This was a lot of text! Thank you for reading it till the end. It may look like
this project took me an eternity to complete. In reality, <strong>it took me about two
weeks of part-time work</strong> to do it from the start to the end. On one hand, I
could probably have done some parts of it faster if I had not relied too much on Claude
for generating the test harness. On the other hand, <strong>Claude quickly generates
the code to start and stop services, parse their logs, etc.</strong> All the things
Docker were done by Claude, and I did not have to touch them. This is the work
that I find annoying and LLMs just do. In this experiment, I’ve burned all of my
monthly premium requests included in the Copilot plan. To be fair, I also had to
add a few features to the new Apalache API, as I was still experimenting with
it.</p>

<p>What I find interesting in the approach outlined here is that it presents a
(relatively) <strong>lightweight way of testing real-world protocols</strong>. Thinking of
fuzzing in this context, <strong>I don’t think a standard fuzzer would have found the
above deviations in TFTP</strong>. Indeed, the implementation was not crashing. Nor was
it accessing memory out of bounds. It was just producing malformed packets
occasionally. To detect this, <strong>we needed a test oracle</strong> that would tell us,
whether a deviation happened. Writing such an oracle manually would be tedious
and error-prone. Instead, we have used <strong>a formal specification as a precise and
unambiguous oracle</strong>. Unambiguous does not mean deterministic though. Our oracle
is non-deterministic, but it precisely defines the allowed behaviors of the
protocol.</p>

<p>In addition to that, <a href="https://kernel.googlesource.com/pub/scm/network/tftp/tftp-hpa/?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">tftp-hpa</a> is not just a piece of code that was written
by a startup over a weekend, or generated by an LLM. It is <strong>a very mature
project that has been written by professionals in the times when people had time
to think</strong>. They took care of the <a href="https://en.wikipedia.org/wiki/Sorcerer%27s_Apprentice_syndrome?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Sorcerer’s Apprentice Syndrome</a>. This is
why I was quite surprised to see an unexpected packet from the server.</p>

<p>On the Apalache side, we finally have a symbolic approach that <strong>scales much
better than bounded model checking</strong>! In my experiments with TFTP, the new JSON
RPC API was showing the signs of <strong>slowing down only after about 200 steps</strong> of
symbolic execution. This is a huge improvement over the previous approach, where
Apalache was slowing down after about 10-20 steps. It is easy to see why. We
feed the concrete responses from the implementation into the SMT context, which
immediately produces a lot of simplifications.</p>

<p>We can <strong>improve this even further to an essentially unlimited number of steps</strong>.
All that is needed is to keep the concrete trace on the harness side and
initialize the SMT context with the last state of the trace. We can do it every
step, or every <code>N</code> steps. The cool thing is that it all can be done outside of
Apalache, on the harness side! This opens the door to <strong>quick experimentation</strong>
with various strategies of mixing <strong>symbolic and concrete execution</strong>.</p>]]></content>
      

      
      
      
      
      

      <author>
          <name>Igor Konnov</name>
        
          <email>igor@konnov.phd</email>
        
        
          <uri>https://konnov.phd?utm_source=protocols_made_fun&amp;utm_medium=referral&amp;utm_campaign=pmf_site</uri>
        
      </author>

      
        
          <category term="tlaplus" />
        
      

      

      
      
        <summary type="html"><![CDATA[Author: Igor Konnov]]></summary>
      

      
      
    </entry>
  
    <entry>
      
      

      <title type="html">Formal Verification of the Aztec Governance Protocol</title>
      <link href="https://protocols-made-fun.com/quint/2025/12/09/aztec-governance.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" rel="alternate" type="text/html" title="Formal Verification of the Aztec Governance Protocol" />
      <published>2025-12-09T00:00:00+00:00</published>
      <updated>2025-12-09T00:00:00+00:00</updated>
      <id>https://protocols-made-fun.com/quint/2025/12/09/aztec-governance</id>
      
      
        <content type="html" xml:base="https://protocols-made-fun.com/quint/2025/12/09/aztec-governance.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed"><![CDATA[<p><strong>Authors:</strong> <a href="https://blltprf.xyz/?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Thomas Pani</a>, <a href="https://konnov.phd?utm_source=protocols_made_fun&amp;utm_medium=referral&amp;utm_campaign=pmf_site">Igor Konnov</a></p>

<p><strong>Date:</strong> December 9, 2025</p>

<h2 id="1-introduction">1. Introduction</h2>

<p>In August 2025, <a href="https://aztec-labs.com?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Aztec Labs</a> engaged <a href="https://blltprf.xyz/?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Thomas Pani</a> and <a href="https://konnov.phd?utm_source=protocols_made_fun&amp;utm_medium=referral&amp;utm_campaign=pmf_site">Igor Konnov</a> to formally specify and verify the new <strong>Aztec Governance Protocol</strong> – the core on-chain system that governs <a href="https://aztec.network/?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Aztec Network</a>.</p>

<p>Over the course of five weeks, we reviewed every line of code in scope and developed a <strong>precise formal specification, verified automatically</strong> with <a href="https://github.com/apalache-mc/apalache?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Apalache</a>. The result: scalable, massively parallel automated verification that explored the entire protocol state space to <strong>formally confirm correctness and uncover subtle, cross-contract issues</strong> that conventional audits or fuzzing can easily miss.</p>

<p>The team at Aztec Labs reviewed our findings and addressed all of them.</p>

<h3 id="at-a-glance-metrics">At-a-Glance Metrics</h3>

<p>For the impatient reader, here are some key figures:</p>

<table>
  <thead>
    <tr>
      <th> </th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>125 invariants</strong> specified across <strong>10 contracts</strong>, <strong>8 libraries</strong>, and <strong>8 interfaces</strong></td>
    </tr>
    <tr>
      <td><strong>992 verification conditions</strong> checked in total</td>
    </tr>
    <tr>
      <td><strong>72 physical cores / 368 GiB RAM</strong>, running for <strong>321 CPU-days</strong> (≈ 2 weeks)</td>
    </tr>
    <tr>
      <td><strong>Findings:</strong> <span style="background-color:#C48F00; color:#fff; padding:2px 6px; border-radius:6px;">5 Medium</span> • <span style="background-color:#2F7D32; color:#fff; padding:2px 6px; border-radius:6px;">3 Low</span> • <span style="background-color:#005A9E; color:#fff; padding:2px 6px; border-radius:6px;">6 Info</span>  <a href="https://protocols-made-fun.com/quint/2025/12/09/aztec-governance.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed#9-findings">(jump ahead)</a></td>
    </tr>
    <tr>
      <td><strong>Final complete verification run</strong> in <strong>576 CPU-hours</strong> (≈ 1 calendar day)</td>
    </tr>
    <tr>
      <td><strong>Contract size:</strong> <strong>~2 kLOC</strong> Solidity</td>
    </tr>
    <tr>
      <td><strong>Specification size:</strong> <strong>~4 kLOC</strong> Quint (incl. traceability comments)</td>
    </tr>
  </tbody>
</table>

<p>These runtimes are comparable to large-scale fuzzing campaigns – but with a crucial difference: <strong>formal verification explores every possible transaction symbolically</strong>, offering <em>exhaustive</em> reasoning rather than probabilistic coverage.</p>

<h3 id="highlights">Highlights</h3>

<p>Some of the key highlights from this article include:</p>

<ul>
  <li><strong>Representative issue: Governance Insolvency</strong> with root-cause analysis and fix (<a href="https://protocols-made-fun.com/quint/2025/12/09/aztec-governance.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed#91-governance-insolvency">§9.1</a>)</li>
  <li>Choosing the <strong>right tools</strong> (<a href="https://protocols-made-fun.com/quint/2025/12/09/aztec-governance.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed#4-choosing-the-right-tools">§4</a>)</li>
  <li><strong>Bootstrapping the formal specification with AI</strong> (<a href="https://protocols-made-fun.com/quint/2025/12/09/aztec-governance.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed#52-bootstrapping-the-formal-specification-with-ai">§5.2</a>)</li>
  <li><strong>Making verification scale</strong> with compositional reasoning and inductive invariants (<a href="https://protocols-made-fun.com/quint/2025/12/09/aztec-governance.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed#53-compositional-reasoning">§5.3</a>, <a href="https://protocols-made-fun.com/quint/2025/12/09/aztec-governance.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed#73-inductive-invariants-making-verification-scale">§7.3</a>)</li>
  <li><strong>Showing that the protocol can progress</strong>: witnesses of liveness (<a href="https://protocols-made-fun.com/quint/2025/12/09/aztec-governance.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed#74-witnesses-of-liveness-proving-the-protocol-can-progress">§7.4</a>)</li>
</ul>

<p>Our formal report can be accessed via <a href="https://github.com/konnov/aztec-governance-formal-verification-2025q3/blob/e313681ade9f9e96d0e83a5120a670a1e1e07188/reports/Aztec-Governance-Protocol.pdf?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">this link</a> and the specifications can be found via <a href="https://github.com/konnov/aztec-governance-formal-verification-2025q3?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">this link</a>.</p>

<h2 id="2-overview-of-aztec-governance">2. Overview of Aztec Governance</h2>

<p>Aztec Network’s governance is implemented as a suite of on-chain Solidity contracts. We summarize its multi-contract architecture, which required <strong>compositional analysis to verify</strong>, in the diagram below. The current implementation extends and formalizes the concepts from the <a href="https://forum.aztec.network/t/request-for-comments-aztec-governance/7413?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Aztec Governance RFC</a> – see <a href="https://docs.aztec.network/the_aztec_network/concepts/governance?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Aztec Governance</a> for the canonical documentation. This post reflects the protocol as of the <a href="https://github.com/AztecProtocol/aztec-packages/blob/8b10b2b220de38c9e2e2e2b7d05d7383701ba070?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">commit used in our engagement</a>. Parts of the codebase have evolved since our engagement.</p>

<p><img src="https://protocols-made-fun.com/assets/images/aztec-governance.webp?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" alt="Aztec diagram: contract architecture of Governance, GSE, Registry, Proposer, Slasher, and flows" /></p>

<p>(We highlight the key contracts, with no special meaning attached to the colors.)</p>

<p><strong>Rollups and Registry.</strong> Aztec Network manages a system of rollups recorded in the <a href="https://github.com/AztecProtocol/aztec-packages/blob/8b10b2b220de38c9e2e2e2b7d05d7383701ba070/l1-contracts/src/governance/Registry.sol?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Registry</a>, which directs inflationary rewards to a single, designated <em>canonical rollup</em>.</p>

<p><strong>GovernanceProposer.</strong> <a href="https://github.com/AztecProtocol/aztec-packages/blob/8b10b2b220de38c9e2e2e2b7d05d7383701ba070/l1-contracts/src/governance/proposer/GovernanceProposer.sol?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">GovernanceProposer</a> (derived from <a href="https://github.com/AztecProtocol/aztec-packages/blob/8b10b2b220de38c9e2e2e2b7d05d7383701ba070/l1-contracts/src/governance/proposer/EmpireBase.sol?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">EmpireBase</a>) forms the foundational layer of the voting system, implementing a round-based signaling mechanism to determine which proposals advance to <code>Governance</code>.</p>

<p><strong>Governance.</strong> The <a href="https://github.com/AztecProtocol/aztec-packages/blob/8b10b2b220de38c9e2e2e2b7d05d7383701ba070/l1-contracts/src/governance/Governance.sol?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Governance</a> contract manages the full proposal lifecycle, including submission, voting, and execution. Given its critical role in managing the Aztec Network, <code>Governance</code> incorporates access control, such as whitelisting beneficiaries that participate in voting. On top of that, it implements an emergency proposal mechanism which requires a substantial token lock.</p>

<p><strong>Governance Staking Escrow (GSE).</strong> Governance stakes and corresponding voting rights are managed by the <em>Governance Staking Escrow</em> (<a href="https://github.com/AztecProtocol/aztec-packages/blob/8b10b2b220de38c9e2e2e2b7d05d7383701ba070/l1-contracts/src/governance/GSE.sol?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">GSE</a>) contract. <code>GSE</code> enables seamless migration of staked assets to new canonical chains, addressing the “cold-start” problem by ensuring immediate operational support during network upgrades. Proposals made through <code>GovernanceProposer</code> tie back to <code>GSE</code> during execution, verifying that at least two-thirds of the total stake is allocated to the latest rollup.</p>

<p><strong>SlashingProposer and Slasher.</strong> The <a href="https://github.com/AztecProtocol/aztec-packages/blob/8b10b2b220de38c9e2e2e2b7d05d7383701ba070/l1-contracts/src/core/slashing/SlashingProposer.sol?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">SlashingProposer</a> contract, also derived from <a href="https://github.com/AztecProtocol/aztec-packages/blob/8b10b2b220de38c9e2e2e2b7d05d7383701ba070/l1-contracts/src/governance/proposer/EmpireBase.sol?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">EmpireBase</a>, uses the same round-based signaling mechanism to determine which slashing proposals are forwarded to the <a href="https://github.com/AztecProtocol/aztec-packages/blob/8b10b2b220de38c9e2e2e2b7d05d7383701ba070/l1-contracts/src/core/slashing/Slasher.sol?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Slasher</a> contract.</p>

<p><strong>Libraries.</strong> The main contracts are supported by a set of custom libraries that use storage-layout compression for gas optimization. These libraries enable the system to retrieve historical, checkpointed state for computing voting power, implement a custom checkpointed set data structure built on OpenZeppelin’s <code>Checkpoints.Trace224</code> library, implement the vote-tallying algorithm, and provide helper functions that encode the proposal lifecycle state machine.</p>

<h2 id="3-attack-surface">3. Attack Surface</h2>

<p>The attack surface of the Governance Protocol is significant – with <strong>over 40 external state-mutating functions across multiple contracts</strong> in scope. Moreover, problematic scenarios typically:</p>

<ul>
  <li>involve multiple contracts,</li>
  <li>exercise them over several transactions, and</li>
  <li>can even involve multiple instances of the same contract (e.g., several Rollups, several Governance contracts, etc.).</li>
</ul>

<p><strong>Reasoning about time.</strong> <code>Governance</code> and <code>GovernanceProposer</code> use the block timestamp to organize signaling and voting phases. <code>GovernanceProposer</code> slots are short (<strong>fractions of a minute</strong>), while <code>Governance</code> voting periods are much longer (<strong>minutes to days</strong>). We must therefore reason about long time horizons interrupted by short-lived events.</p>

<p><strong>Malicious external inputs.</strong> To make things harder, we also considered scenarios in which a canonical rollup produces <strong>erroneous readings</strong> from time to time, e.g., due to a fault. For example, what if a canonical rollup starts to return slot numbers from the past (or far in the future)?</p>

<p>This poses both a <strong>challenge and an opportunity</strong>: standard techniques such as fuzzing, random simulation, or bounded model checking would not get us far – the state and action spaces are prohibitively large.</p>

<h2 id="4-choosing-the-right-tools">4. Choosing the Right Tools</h2>

<p>With the attack surface in view, the next question was tooling: how to verify the protocol logic without drowning in bytecode.</p>

<h3 id="41-protocol-level-specification">4.1. Protocol-Level Specification</h3>

<p>From an engineer’s perspective, the ideal solution would be to verify correctness directly at the implementation level – that is, to automatically reason about the Solidity code itself. Tools such as <a href="https://www.certora.com/?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Certora Prover</a>, <a href="https://kontrol.runtimeverification.com/?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Kontrol</a>, <a href="https://github.com/a16z/halmos?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Halmos</a>, and <a href="https://hevm.dev/?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">HEVM</a> aim to do exactly this by automating formal reasoning over smart contracts. These tools are remarkable engineering achievements, but their task is inherently complex: among other things, they must reason precisely about stack behavior, memory, storage, and external calls – all the way down to the EVM bytecode.</p>

<p>Before diving into such low-level reasoning, however, we believe it is essential to <strong>ensure that the protocol’s logic is sound</strong>. If high-level properties of the protocol are violated, then verifying bit-level correctness provides limited value. Once the protocol logic is verified, attention can shift to the implementation.</p>

<p>In this project, we focused on <strong>specifying the Aztec Governance Protocol at the logic level</strong>. Our <strong>main objectives</strong> were to:</p>

<ul>
  <li>specify the high-level behavior of the protocol,</li>
  <li>identify its core invariants, and</li>
  <li>prove these invariants correct (or demonstrate violations through counterexamples).</li>
</ul>

<h3 id="42-languages-and-tools">4.2. Languages and Tools</h3>

<p>Several specification languages could serve this purpose. For instance, we could have expressed the protocol directly in an interactive theorem prover like <a href="https://lean-lang.org/?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Lean</a> or <a href="https://rocq-prover.org/?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Rocq</a>. However, both offer little automation, so only limited progress would have been feasible within our one-month timeframe. (Recently, Lean has seen exciting developments such as the <a href="https://lean-lang.org/doc/reference/latest/The--grind--tactic/?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed"><code>grind</code> tactic</a> and research by the <a href="https://verse-lab.github.io/?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">VERSE group</a>. We may explore these in a future engagement!)</p>

<p><strong>TLA<sup>+</sup> and its tooling.</strong> <a href="https://tlapl.us/?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">TLA<sup>+</sup></a> is perhaps the most well-known practical specification language. It is supported by two model checkers (<a href="https://github.com/tlaplus/tlaplus?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">TLC</a> and <a href="https://github.com/apalache-mc/apalache?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Apalache</a>) and an interactive theorem prover <a href="https://proofs.tlapl.us/?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">TLAPS</a>. We use the methodology of TLA<sup>+</sup> to reason about the Governance Protocol as a collection of interacting state machines over large state spaces. Since many engineers find the syntax of TLA<sup>+</sup> confusing, we use the surface syntax <a href="https://github.com/informalsystems/quint?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Quint</a> to write the specifications. As co-authors of both <strong>Quint</strong> and the <strong>Apalache</strong> model checker – together with Gabriela Moreira, Shon Feder, Jure Kukovec, and others – we have a deep understanding of their internals and how to apply them to large-scale protocol verification. This expertise was essential for scaling our analysis to a system as complex as Aztec Governance.</p>

<h2 id="5-specification-decisions-and-challenges">5. Specification Decisions and Challenges</h2>

<p>With the goals and tools defined, the next step was to translate the Governance Protocol into a precise, analyzable specification.</p>

<h3 id="51-writing-the-specification">5.1. Writing the Specification</h3>

<p><strong>Modeling the states.</strong> Before specifying the contract behavior, we must decide how to model the contract states. We first define the shape of individual contract states. For example, below is the state of a <code>GovernanceProposer</code>.</p>

<pre><code class="language-ts">// GovernanceProposer contract state
type GovernanceProposerState = {
  // the state of the parent EmpireBase contract
  empireBase: EmpireBaseState,
  // mapping(uint256 proposalId =&gt; address proposer)
  proposalProposer: Uint256 -&gt; Address,
  // immutable config (set in constructor)
  REGISTRY: Address,
  GSE: Address
}
</code></pre>

<p>You can see that many concepts from Solidity (like mappings) are seamlessly expressed in Quint. The full protocol state – including all relevant contracts – is captured by <code>EvmState</code>. In our case, an EVM state is structured as follows:</p>

<pre><code class="language-ts">type EvmState = {
  block_timestamp: Uint256,
  // all possible instances of ERC20 used as assets
  assets: Address -&gt; ERC20State,
  // all possible instances of Governance
  governances: Address -&gt; GovernanceState,
  // all possible instances of GovernanceProposer
  governanceProposers: Address -&gt; GovernanceProposerState,
  // all possible instances of GSE
  gses: Address -&gt; GSEState,
  // all possible instances of Registry
  registries: Address -&gt; RegistryState,
  // all possible instances of RewardDistributor
  rewardDistributors: Address -&gt; RewardDistributorState,
  // all instances of Slasher
  slashers: Address -&gt; SlasherState,
  // all instances of SlashingProposer
  slashingProposers: Address -&gt; SlashingProposerState,
  // IEmperor(...).getCurrentSlot() for each rollup
  rollupSlot: IHaveVersion -&gt; int,
  // IEmperor(...).getCurrentProposer() for each rollup
  rollupProposer: IHaveVersion -&gt; Address,
  // mapping rollup addresses to their versions
  // Corresponds to _rollup.getVersion() call in Registry.sol:53
  ROLLUP_VERSIONS: IHaveVersion -&gt; Uint256,
  // mapping rollup address to the reward distributors that they create
  REGISTRY_REWARD_DISTRIBUTORS: IHaveVersion -&gt; IRewardDistributor
}
</code></pre>

<p>As you can see, we do not have to focus on nitty-gritty low-level details – like how storage is laid out in EVM. <strong><em>This frees us to focus on protocol logic and high-level correctness, rather than low-level implementation concerns. It also makes the reasoning problem more tractable for automated verification.</em></strong></p>

<p><strong>Modeling the contract functions.</strong> The contract functions are simply pure functions over the EVM state. For instance, we define the function <code>initiateWithdraw</code> in Quint as:</p>

<pre><code class="language-ts">// Governance.sol#L341
pure def Governance::initiateWithdraw(__evm_state: EvmState,
      __self: IGovernance, __msg_sender: Address,
      _to: Address, _amount: Uint256): Result[EvmState] = {
  val __state = __evm_state.governances.get(__self)
  val config = __state.configuration

  // ConfigurationLib.sol#L36:
  //   Timestamp.wrap(Timestamp.unwrap(_self.votingDelay) / 5) +
  //     _self.votingDuration + _self.executionDelay;
  val withdrawDelay = config.votingDelay / 5
      + config.votingDuration + config.executionDelay

  // L342: _initiateWithdraw(msg.sender, _to, _amount,
  //                         configuration.withdrawalDelay());
  Governance::_initiateWithdraw(__evm_state, __self, __msg_sender,
                                _to, _amount, withdrawDelay)
}
</code></pre>
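
<p>To make the delay arithmetic concrete, here is a minimal Python sketch of the same formula. The function name and the configuration values are made up for illustration and are not Aztec’s actual parameters. We assume that floor division is the right counterpart of Solidity’s integer division on unsigned values.</p>

<pre><code class="language-python">def withdrawal_delay(voting_delay, voting_duration, execution_delay):
    # Mirrors the formula quoted from ConfigurationLib.sol#L36 above:
    #   votingDelay / 5 + votingDuration + executionDelay
    # Floor division stands in for Solidity's unsigned integer division.
    return voting_delay // 5 + voting_duration + execution_delay


# Hypothetical configuration in seconds (illustrative values only).
delay = withdrawal_delay(
    voting_delay=3600, voting_duration=86400, execution_delay=7200
)
print(delay)  # 720 + 86400 + 7200 = 94320
</code></pre>

<p>Note that the division truncates: a voting delay shorter than 5 time units contributes nothing to the withdrawal delay.</p>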

<p>In Quint, we explicitly model all side-effects of the Solidity code, including exceptions and reverts. While it makes our specification more verbose, all branches and assignments become immediately visible at the code level – auditors do this in their heads all the time. For example, <code>_initiateWithdraw</code> computes and returns an updated <code>EvmState</code>, unless it reverts:</p>

<pre><code class="language-ts">// Governance.sol#L694
pure def Governance::_initiateWithdraw(__evm_state: EvmState,
      __self: IGovernance, _from: Address, _to: Address,
      _amount: Uint256, _delay: Timestamp): Result[EvmState] = {
  val __state = __evm_state.governances.get(__self)
  // L695: users[_from].sub(_amount);
  val fromAmount = __state.users.getOrElse(_from, checkpoints::constructor)
  val userTraceOrError = checkpoints::sub(__evm_state, fromAmount, _amount)
  if (isErr(userTraceOrError)) {
    err(__evm_state, userTraceOrError.err)
  } else {
    // L696: total.sub(_amount);
    val totalTraceOrError = checkpoints::sub(__evm_state, __state.total, _amount)
    if (isErr(totalTraceOrError)) {
      err(__evm_state, totalTraceOrError.err)
    } else {
      // L698: uint256 withdrawalId = withdrawalCount++;
      // L700: withdrawals[withdrawalId] = Withdrawal({...});
      val withdrawal = {
          amount: _amount,
          unlocksAt: __evm_state.block_timestamp + _delay,
          recipient: _to, claimed: false
      }

      val __state1 = {
        ...__state,
        users: __state.users.put(_from, userTraceOrError.v),  // L695
        total: totalTraceOrError.v,                           // L696
        withdrawals: __state.withdrawals.append(withdrawal),  // L700
      }
      ok({...__evm_state,
        governances: __evm_state.governances.put(__self, __state1)
      })
    }
  }
}
</code></pre>
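
<p>For readers who do not know Quint, the same pattern can be sketched in Python: a function takes the current state and returns either an updated copy or an error together with the untouched state. This is a simplified illustration under our own assumptions. The names <code>Result</code>, <code>ok</code>, <code>err</code>, <code>initiate_withdraw</code>, and the error strings are hypothetical, and checkpointed traces are replaced by plain integers.</p>

<pre><code class="language-python">from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Result:
    # Either an updated state (error is None) or a revert with a message.
    state: dict
    error: Optional[str] = None


def ok(state):
    return Result(state)


def err(state, message):
    # A revert leaves the state exactly as it was before the call.
    return Result(state, message)


def initiate_withdraw(state, sender, to, amount, delay):
    users = state["users"]
    balance = users.get(sender, 0)
    # Checked subtraction: revert unless amount lies in 0..balance.
    if amount not in range(balance + 1):
        return err(state, "insufficient user balance")
    # Same check against the total voting power.
    if amount not in range(state["total"] + 1):
        return err(state, "insufficient total")
    withdrawal = {
        "amount": amount,
        "unlocks_at": state["now"] + delay,
        "recipient": to,
        "claimed": False,
    }
    # Build a new state instead of mutating the old one.
    return ok({
        **state,
        "users": {**users, sender: balance - amount},
        "total": state["total"] - amount,
        "withdrawals": state["withdrawals"] + [withdrawal],
    })


s0 = {"now": 100, "users": {"alice": 10}, "total": 10, "withdrawals": []}
print(initiate_withdraw(s0, "alice", "bob", 4, 50).state["total"])  # 6
print(initiate_withdraw(s0, "alice", "bob", 11, 50).error)
</code></pre>

<p>Because every call returns a fresh state, the original <code>s0</code> stays intact after both calls, which is what makes it easy to explore many alternative transactions from the same state.</p>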

<p><strong>Modeling transactions.</strong> We model transactions, e.g., initiated by externally-owned accounts (EOAs), via Quint <em>actions</em>:</p>

<pre><code class="language-ts">action governance_initiate_withdraw = {
  nondet _g = evm.governances.keys().oneOf()
  nondet _sender = ALL_SENDERS.oneOf()
  nondet _to = ALL_ADDRESSES.oneOf()
  nondet _amount = 0.to(MAX_UINT256).oneOf()
  val result = Governance::initiateWithdraw(evm, _g, _sender, _to, _amount)
  all {
    is_valid_sender(_sender) and isOk(result),
    evm' = result.v,
    // ...
  }
}
</code></pre>

<p>The <code>nondet</code> declarations directly control the domains from which input parameters are drawn. When we run the Quint randomized simulator, non-deterministic values are sampled uniformly at random. When we run the Apalache model checker, it uses logic constraints in the <a href="https://github.com/Z3Prover/z3?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Z3 SMT solver</a> to reason about all possible non-deterministic values at once.</p>

<h3 id="52-bootstrapping-the-formal-specification-with-ai">5.2. Bootstrapping the Formal Specification with AI</h3>

<p>The above specification looks a bit machine-generated. This is not far from the truth. We used an LLM to produce the initial specifications, given the source code in Solidity and the Quint data types.</p>

<p>Obviously, an LLM cannot make high-level modeling decisions, like how to structure the EVM state, or how best to turn Solidity into functional definitions – this <strong>requires years of practical experience</strong>. We developed a custom system prompt that gives the LLM clear instructions and examples for translating Solidity into Quint. (It’s an internal tool we refine and apply with clients when we bootstrap their specifications.)</p>

<p>Of course, as with all AI assistants, we had to carefully proofread the translation results. Also, we were fortunate to have the model checker <a href="https://github.com/apalache-mc/apalache?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Apalache</a> on our side – it automatically pointed us to some inconsistencies in the translation. Compared to writing the specification by hand, this approach allowed us to bootstrap the project very quickly and to start evaluating the protocol early on.</p>

<h3 id="53-compositional-reasoning">5.3. Compositional Reasoning</h3>

<p>Some security researchers believe that formal verification does not scale to more than 1–2 smart contracts, or to exploit scenarios longer than 1–2 external calls deep. We have organized our specification in such a way that the verification tools can deal with the behavior of 10–20 smart contracts, and arbitrarily long transaction sequences. <strong>This level of scalability requires not just formal verification expertise, but a deep understanding of how model checkers and provers work internally</strong> and interact with protocol architecture. It builds directly on our prior formal verification work – including <a href="https://protocols-made-fun.com/zksync/matterlabs/quint/specification/modelchecking/2024/09/12/zksync-governance.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">zkSync Governance</a>, <a href="https://protocols-made-fun.com/consensus/matterlabs/quint/specification/modelchecking/2024/07/29/chonkybft.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">ChonkyBFT</a>, and <a href="https://arxiv.org/abs/2501.07958?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Ethereum 3-slot-finality</a> – <strong>where we pushed verification tools to reason compositionally across complex systems</strong>. More on this in Section <a href="https://protocols-made-fun.com/quint/2025/12/09/aztec-governance.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed#73-inductive-invariants-making-verification-scale">7.3. Inductive Invariants: Making Verification Scale</a>.</p>

<h2 id="6-protocol-invariants">6. Protocol Invariants</h2>

<p>From Aztec’s documentation and source code, we extracted and formalized <strong>125 key invariants</strong> of the Governance Protocol. To get a taste of the invariants, here are a few examples in English (more of them are in the <a href="https://github.com/konnov/aztec-governance-formal-verification-2025q3/blob/e313681ade9f9e96d0e83a5120a670a1e1e07188/reports/Aztec-Governance-Protocol.pdf?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">report</a>):</p>

<ul>
  <li><strong>GOV-16:</strong> If the proposal has not been active yet, then no votes have been cast.</li>
  <li><strong>GOV-20:</strong> The timestamps in the <code>users</code> traces are ordered.</li>
  <li><strong>GOV-26:</strong> For each timestamp <code>t</code>, <code>total[t]</code> equals the sum of the users’ voting power at <code>t</code>.</li>
  <li><strong>GP-02-01:</strong> For each submitted proposal in <code>proposalProposer</code>, there is round accounting for a corresponding executed proposal (i.e., submitted to Governance).</li>
  <li><strong>GP-08:</strong> A proposal cannot be executed without a quorum.</li>
  <li><strong>GSE-17:</strong> For each proposal that the <code>delegatee</code> has <code>powerUsed</code> on, Governance contains that proposal.</li>
  <li><strong>GSE-19:</strong> <code>powerUsed</code> cannot exceed the attester’s voting power at the time of the proposal’s <code>pendingThrough</code>.</li>
  <li><strong>GSE-23:</strong> <code>delegation.supply</code> at each checkpoint is the sum of all <code>delegation.ledgers[instance].supply</code> at that time.</li>
  <li><strong>SP-10:</strong> <code>lastSignalSlot</code> is in the valid range. This range is <code>[round * ROUND_SIZE, (round + 1) * ROUND_SIZE)</code>.</li>
  <li><strong>SP-11:</strong> The number of signals is correct. It does not exceed <code>lastSignalSlot % ROUND_SIZE + 1</code>.</li>
</ul>
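
<p>To show how such English statements turn into checkable predicates, here is a small Python sketch of <strong>SP-10</strong> and <strong>SP-11</strong> as worded above. The value of <code>ROUND_SIZE</code> and the function names are our assumptions for illustration, and the sketch ignores any side conditions the real invariants may carry.</p>

<pre><code class="language-python">ROUND_SIZE = 10  # hypothetical round size, chosen for illustration


def sp_10(round_number, last_signal_slot):
    # SP-10: lastSignalSlot lies in the half-open interval
    # [round * ROUND_SIZE, (round + 1) * ROUND_SIZE).
    first_slot = round_number * ROUND_SIZE
    return last_signal_slot in range(first_slot, first_slot + ROUND_SIZE)


def sp_11(signal_count, last_signal_slot):
    # SP-11: the number of signals does not exceed
    # lastSignalSlot % ROUND_SIZE + 1.
    bound = last_signal_slot % ROUND_SIZE + 1
    return signal_count in range(bound + 1)


print(sp_10(3, 34), sp_10(3, 40))  # True False
print(sp_11(5, 34), sp_11(6, 34))  # True False
</code></pre>

<p>With <code>ROUND_SIZE = 10</code>, round 3 covers slots 30 to 39, and a last signal in slot 34 allows at most 5 signals.</p>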

<p><strong>Formalized invariants in Quint.</strong> We formalized all 125 invariants in Quint as
well. For example, the <code>Governance</code> contract should uphold the <strong>Solvency Invariant</strong>
(<a href="https://www.certora.com/blog/the-holy-grail?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">‘The Holy Grail’</a>, as coined by FV researchers at
<a href="https://www.certora.com/?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Certora</a>):</p>

<pre><code class="language-ts">// GOV-28: Solvency: Governance holds enough balance to cover all future
// withdrawals.
pure def governance_solvency_inv(_evm: EvmState, ga: IGovernance): bool = {
  pure val g = _evm.governances.get(ga)
  and {
    // the withdrawals that happen in the future
    pure val payable = g.withdrawals.indices().fold(0, (sum, i) =&gt; {
      pure val withdrawal = g.withdrawals[i]
      sum + if (withdrawal.claimed) 0 else withdrawal.amount
    })
    // the users' total balance plus payable does not exceed the contract's balance
    pure val asset = _evm.assets.get(g.ASSET)
    g.total.latest() + payable &lt;= asset.balances.getOrElse(ga, 0)
  }
}
</code></pre>
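
<p>As a worked instance of the solvency condition, the following Python sketch evaluates the same inequality on concrete numbers. The function name and the figures are hypothetical; the sketch only illustrates what the invariant compares, not the scenario in which it is violated.</p>

<pre><code class="language-python">def governance_solvency(total_power, withdrawals, contract_balance):
    # Sum of all withdrawals that have been initiated but not yet claimed.
    payable = sum(w["amount"] for w in withdrawals if not w["claimed"])
    # Amount by which the obligations exceed the contract's balance.
    shortfall = max(0, total_power + payable - contract_balance)
    return shortfall == 0


withdrawals = [
    {"amount": 30, "claimed": False},
    {"amount": 20, "claimed": True},  # already paid out, not counted
]
print(governance_solvency(60, withdrawals, 90))  # True: 60 + 30 fits in 90
print(governance_solvency(60, withdrawals, 89))  # False: short by 1
</code></pre>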

<p>As it turns out, the solvency invariant is violated under certain conditions. We return to it in Section <a href="https://protocols-made-fun.com/quint/2025/12/09/aztec-governance.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed#91-governance-insolvency">9.1. Governance Insolvency</a>.</p>

<p>With the key invariants defined, we started verifying them using Quint and Apalache.</p>

<h2 id="7-formal-verification-workflow">7. Formal Verification Workflow</h2>

<p>As soon as parts of the specification stabilized, we began verification – moving from randomized simulation to full symbolic and inductive reasoning.</p>

<h3 id="71-randomized-simulator-stuck-at-unproductive-inputs">7.1. Randomized Simulator: Stuck at Unproductive Inputs</h3>

<p>The <strong>Quint randomized simulator</strong> operates similarly to property-based testing for implementation languages: it assigns concrete values to <code>nondet</code> declarations and resolves non-deterministic control choices by selecting one branch at random.</p>

<p><strong>Limitations.</strong> We briefly experimented with this approach, but it proved ineffective for our purposes. The simulator’s uniform random sampling consistently failed to produce valid configurations that would even satisfy the protocol’s initial state:</p>

<pre><code class="language-shell">$ quint run --max-samples=100000 --max-steps=10 --invariant=past_signals \
    spec/slashing_proposer_machine.qnt
An example execution:

[ok] No violation found (768ms at 130208 traces/second).
Trace length statistics: max=0, min=0, average=0.00
</code></pre>
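
<p>As a rough intuition, and not a diagnosis of the run above, consider how rarely uniform sampling from a huge domain hits a narrow constraint. The Python sketch below draws values from <code>0</code> to <code>MAX_UINT256</code> and counts how many satisfy a made-up constraint (the value fits in 64 bits). The constraint and the function name are assumptions for illustration and are not taken from our specification.</p>

<pre><code class="language-python">import random

MAX_UINT256 = 2**256 - 1
SMALL_BOUND = 2**64  # hypothetical constraint: the value must fit in 64 bits


def count_hits(samples, seed=0):
    # Draw uniformly from 0..MAX_UINT256 and count the samples
    # that happen to satisfy the narrow constraint.
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        value = rng.randint(0, MAX_UINT256)
        if value in range(SMALL_BOUND):
            hits += 1
    return hits


print(count_hits(100_000))
</code></pre>

<p>Each sample satisfies the constraint with probability 2<sup>-192</sup>, so even 100,000 samples are practically guaranteed to miss it. A symbolic tool instead asks the solver for a satisfying value directly.</p>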

<p>We believe the randomized simulator could be improved in future versions. If you’d like to explore in more detail why this happens – and how it could be mitigated – check out our workshop <a href="https://blltprf.xyz/blog/25-min-solidity-fuzzer/?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed"><em>25-Minute Solidity Fuzzer: Fuzzing Smarter, Not Harder</em></a>.</p>

<p>In its current form, however, it did not help us uncover issues. This led us to use the <strong>symbolic analysis tools in Apalache</strong>, which can reason over all possible inputs symbolically rather than sampling concrete ones.</p>

<h3 id="72-symbolic-random-walks-scaling-up">7.2. Symbolic Random Walks: Scaling Up</h3>

<p>With symbolic random walks (part of Apalache), we quickly checked several invariants. The following run revealed an issue: the system could receive outdated (“past”) signals when the canonical rollup was faulty:</p>

<pre><code class="language-shell">$ quint verify --random-transitions=true --max-steps=10 \
  --invariant=past_signals spec/slashing_proposer_machine.qnt
...
[violation] Found an issue (22181ms)
</code></pre>

<p>When an invariant is violated, Apalache produces a counterexample with all details needed to understand the issue. We omit it here because it is quite verbose.</p>

<p><strong>Limitations.</strong> Even though this approach proved to be quite useful in bootstrapping and debugging our specification, it reached its limits when we began dealing with multiple contracts. This limitation stems from the protocol’s scale – with over 40 external functions, many of which can be invoked at nearly any point in time, the number of possible symbolic paths grows combinatorially with path length. We then moved to proving <em>inductive invariants</em> automatically.</p>

<h3 id="73-inductive-invariants-making-verification-scale">7.3. Inductive Invariants: Making Verification Scale</h3>

<p>To scale our formal verification efforts further, we specified 125 invariants that together capture any arbitrary state of the Governance Protocol. For example, below is the invariant <code>gse_rollups_inv</code> that groups the invariants <code>GSE-28</code> to <code>GSE-32</code>:</p>

<pre><code class="language-ts">pure def gse_rollups_inv(evm: EvmState, gsea: IGSE): bool = {
  val gse = evm.gses.get(gsea)
  val chkpts = gse.rollups._checkpoints
  and {
    // GSE-28: `rollups` is an ordered checkpointed trace with ascending timestamps
    _trace_is_ordered(gse.rollups),
    chkpts.indices().forall(i =&gt; and {
      // GSE-29: `rollups` values are rollup addresses
      chkpts[i]._value.in(ROLLUP_ADDRESSES),
      // GSE-30: the bonus instance does not appear in the `rollups` history
      chkpts[i]._value != BONUS_INSTANCE_ADDRESS,
      // GSE-31: `rollups` values are registered in `instances`
      chkpts[i]._value.in(gse.instances.keys()),
    })
  }
}
</code></pre>

<p>The following command checks that all protocol invariants (<code>all_inv</code>) – including <code>gse_rollups_inv</code> – hold whenever the protocol is in a state that satisfies <code>all_inv</code> and one of the contracts makes a single step:</p>

<pre><code class="language-shell">$ ./scripts/quint-inductive.sh spec/invariant_model.qnt 31 32 5 100 all_inv
</code></pre>

<p>Beware that the above command launches over 900 verification runs and executes them in parallel (in the example above, on at most 5 CPUs at once). This can easily overwhelm your laptop. If you want to reproduce our experiments, read the next section on our experimental setup.</p>

<p><strong>Scalable verification.</strong> The <strong>core technique</strong> that enables this level of scalability is the use of <strong>inductive invariants</strong>. Instead of exploring all possible symbolic paths of the specification from an initial state (this approach, used by most code-level symbolic tools, is called <em>symbolic execution</em>), we start with a much richer set of states (captured by the inductive invariant <code>all_inv</code>) and simply enumerate all possible external functions and make them execute exactly once from any state in the inductive invariant. By assuming that <code>all_inv</code> holds in an arbitrary state and showing that it still holds after symbolically executing any single transaction, our check <strong>extends inductively to all possible executions</strong>.</p>

<p><strong>Note for sticklers.</strong> We still have to show that the initial states satisfy the inductive invariant. In our case, this is easy. Essentially, the initial state of the protocol is an “empty” EVM state where none of the governance contracts are deployed yet.</p>
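
<p>The inductive check can be illustrated on a toy system. The sketch below is explicit-state Python rather than Apalache’s symbolic encoding, and the bounded counter with its <code>put</code> and <code>get</code> actions is a hypothetical stand-in for the contracts. It checks the base case on the initial states and then fires every action exactly once from every state that satisfies the candidate invariant, whether or not that state is reachable.</p>

<pre><code class="language-python">SIZE = 3
# A finite universe of candidate states (counter values) for this toy example.
UNIVERSE = range(-2, SIZE + 3)

def inv(count):
    """Candidate inductive invariant: the counter stays within 0..SIZE."""
    return count in range(SIZE + 1)

def put_guarded(count):
    """Put is enabled unless the buffer is full."""
    return [count + 1] if count != SIZE else []

def put_buggy(count):
    """Put without a fullness check."""
    return [count + 1]

def get(count):
    """Get is enabled only on a non-empty buffer."""
    return [count - 1] if count != 0 else []

def is_inductive(init_states, actions):
    """Return (True, None), or (False, (action, state, successor))."""
    # Base case: every initial state satisfies the invariant.
    for s in init_states:
        if not inv(s):
            return False, ("init", s, s)
    # Inductive step: start in ANY state that satisfies inv
    # and execute each action exactly once.
    for s in UNIVERSE:
        if not inv(s):
            continue
        for name, action in actions.items():
            for t in action(s):
                if not inv(t):
                    return False, (name, s, t)
    return True, None
</code></pre>

<p>With the guarded <code>put</code>, the call <code>is_inductive([0], {"put": put_guarded, "get": get})</code> succeeds. With <code>put_buggy</code>, it returns the one-step counterexample <code>("put", 3, 4)</code> without exploring any execution longer than a single step.</p>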

<h3 id="74-witnesses-of-liveness-proving-the-protocol-can-progress">7.4. Witnesses of Liveness: Proving the Protocol Can Progress</h3>

<p>When verifying safety, there is always a risk that we introduce a bug in the specification that restricts the protocol behavior too much. This would still keep the protocol “safe” from the verification point of view, but, obviously, the protocol would not do as many useful things as it is meant to do. To avoid this pitfall, we introduce “falsy invariants” that instruct Apalache to generate a witness of the protocol reaching an “interesting” state. Below is an example that produces an execution reaching a state in which at least one governance proposal has been executed:</p>

<pre><code class="language-ts">// Check this invariant to find an example of having at least one executed proposal:
// quint verify --max-steps=0 --invariant=gov_proposals_executed_ex \
//   spec/invariant_model.qnt
val gov_proposals_executed_ex = {
  not(evm.governances.keys().forall(ga =&gt; {
    val g = evm.governances.get(ga)
    g.proposals.indices().exists(proposalId =&gt; {
      val proposal = g.proposals[proposalId]
      proposal.cachedState == ProposalState_Executed
    })
  }))
}
</code></pre>

<p>This ability to automatically generate an execution trace to an “interesting” state is a <strong>superpower of symbolic model checkers like Apalache</strong> – a functionality that would be <strong>far more difficult</strong> to automate with an interactive theorem prover such as Lean or Rocq. (Provers have property-based testing tools, but these are not tuned to finding bugs in distributed protocols the way <a href="https://github.com/apalache-mc/apalache?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Apalache</a> is.)</p>
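
<p>The idea behind falsy invariants does not depend on the tool. Below is a minimal sketch in Python that uses explicit breadth-first search instead of Apalache’s symbolic search. The proposal lifecycle in <code>TRANSITIONS</code> is hypothetical; only the <code>Executed</code> state mirrors the example above. We state an invariant that we expect to be false, and the counterexample returned by the checker is exactly the witness we are after.</p>

<pre><code class="language-python">from collections import deque

# Hypothetical proposal lifecycle, made up for this sketch.
TRANSITIONS = {
    "Pending": ["Active", "Dropped"],
    "Active": ["Queued", "Rejected"],
    "Queued": ["Executed", "Expired"],
}

def never_executed(state):
    """A falsy invariant: we expect the checker to refute it."""
    return state != "Executed"

def find_violation(init, invariant):
    """Breadth-first search; returns a shortest trace to a violating state."""
    queue = deque([[init]])
    seen = {init}
    while queue:
        trace = queue.popleft()
        state = trace[-1]
        if not invariant(state):
            return trace
        for succ in TRANSITIONS.get(state, []):
            if succ not in seen:
                seen.add(succ)
                queue.append(trace + [succ])
    return None
</code></pre>

<p>Here, <code>find_violation("Pending", never_executed)</code> returns the trace from <code>Pending</code> through <code>Active</code> and <code>Queued</code> to <code>Executed</code>. A result of <code>None</code> would be the warning sign: the specification cannot reach the interesting state at all.</p>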

<h2 id="8-experimental-setup-and-verification-runs">8. Experimental Setup and Verification Runs</h2>

<p><strong>Experimental setup.</strong> As mentioned above, checking the inductive invariant of our specification produces 992 verification tasks in total (for the combinations of a specific invariant and an external function call). Apalache decomposes invariant checking into smaller tasks, so we employ <a href="https://www.gnu.org/software/parallel/?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">GNU parallel</a> to <strong>massively parallelize the verification</strong>. We use two servers to run the experiments:</p>

<ol>
  <li>AMD Ryzen 9 5950X processor (16 physical, 32 logical cores), 128 GB memory</li>
  <li>2× Intel Xeon Platinum 8280 processors (56 physical, 112 logical cores total), pinned at 3.1 GHz, 240 GB memory</li>
</ol>
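
<p>The shape of this decomposition can be sketched in a few lines of Python. This is not our actual setup, which relies on GNU parallel and shell scripts: <code>check_task</code> is a placeholder for a single model checker run, the full cross product of invariants and functions is an assumption of the sketch, and a thread pool with a cap on the number of workers plays the role of GNU parallel.</p>

<pre><code class="language-python">from concurrent.futures import ThreadPoolExecutor
from itertools import product

def check_task(invariant, function):
    """Placeholder for one verification run (hypothetical)."""
    # A real setup would invoke the model checker here and parse its verdict.
    return (invariant, function, "ok")

def run_all(invariants, functions, max_workers=5):
    """Check every (invariant, function) pair, at most max_workers at a time."""
    tasks = list(product(invariants, functions))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns the results in the order of the tasks.
        return list(pool.map(lambda task: check_task(*task), tasks))
</code></pre>

<p>For instance, <code>run_all(["GSE-28", "GSE-29"], ["deposit", "vote", "execute"])</code> produces six independent results. Since the tasks do not share state, a slow or timed-out task does not block the others. The function names <code>vote</code> and <code>execute</code> are made up for this example.</p>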

<p><strong>Verification Runs.</strong> Some of the verification tasks take a few minutes to check, and some of them take a few hours. This is caused by the nature of the SMT constraints. It is well known that SMT solvers, including Z3, are challenged by non-linear integer arithmetic – in this project, non-linear constraints naturally appear, e.g., as part of Aztec’s vote tallying logic.</p>

<p>Instead of writing many words, we simply show you the plot below. It visualizes the running times of <a href="https://github.com/apalache-mc/apalache?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Apalache</a> when checking the 992 verification tasks. The X-axis shows the number of verification conditions solved (roughly, individual constraints in the inductive invariant), sorted from fastest to slowest. Each point corresponds to one verification condition. The Y-axis represents the running time per verification condition, formatted in human-readable units (milliseconds to hours). Notice the logarithmic scale!</p>

<p><img src="https://protocols-made-fun.com/assets/images/aztec-governance-verification-times.svg?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" alt="aztec-gov-plot-all" /></p>

<p>As we can see from the plot, over 85% of the verification conditions are checked in less than 10 minutes each, about 7% are checked in several hours, and about 8% of the verification conditions require plenty of running time.</p>

<p><strong>Timeouts.</strong> As often happens with SMT solvers, 3% of our verification conditions time out. These are the runs at the end of the “hockey stick”. We capped the running time of Z3 at 12 hours. Since we are decomposing the inductive invariant into smaller pieces, these problematic conditions are well localized. We have investigated these conditions: they all have to do with non-linear arithmetic.</p>

<p>Below is an example of such an invariant. Notice that the very last expression involves modulo over a non-constant value, since <code>ROUND_SIZE</code> is initialized in the contract constructor.</p>

<pre><code class="language-ts">// GovernanceProposer invariant on last signals and total signals
pure def governance_proposer_signal_inv(evm: EvmState,
                                        ga: IGovernanceProposer): bool = {
  val gp = evm.governanceProposers.get(ga)
  gp.empireBase.rounds.keys().forall(rollup =&gt; {
    gp.empireBase.rounds.get(rollup).keys().forall(round =&gt; {
      val rollupRounds = gp.empireBase.rounds.get(rollup)
      val accounting = rollupRounds.get(round)
      and {
        // GP-12: ...
        // ...
        // GP-13: The number of signals is in the right range
        // It does not exceed `lastSignalSlot % ROUND_SIZE + 1`.
        // This property is very hard for Z3. It is not falsified.
        and {
          gp.empireBase.ROUND_SIZE &lt;= MAX_ROUND_SIZE,
          totalSignalCount &gt;= 0,
          totalSignalCount &lt;= gp.empireBase.ROUND_SIZE,
          totalSignalCount &lt;=
            (accounting.lastSignalSlot % gp.empireBase.ROUND_SIZE) + 1,
        }
      }
    })
  })
}
</code></pre>

<p>We classify the small number of verification conditions that time out as <em>not falsified</em> rather than verified. Usually, we recommend verifying such conditions with a theorem prover such as Lean or Rocq. Another solution is to fix these non-constant values to known production configuration values to gain further confidence.</p>

<h2 id="9-findings">9. Findings</h2>

<p>Our verification of the Aztec Governance Protocol uncovered <strong>five Medium</strong>, <strong>three Low</strong>, and <strong>six Informational findings</strong>. Most arose from subtle cross-contract interactions that are difficult to identify through conventional testing, fuzzing, or simulation alone. We reported all issues to Aztec Labs, who acknowledged and/or fixed them in subsequent pull requests.</p>

<p>Below we explain one representative issue: a <strong>violation of the solvency invariant</strong>.</p>

<h3 id="91-governance-insolvency">9.1. Governance Insolvency</h3>

<p>Recall the solvency invariant from <a href="https://protocols-made-fun.com/quint/2025/12/09/aztec-governance.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed#6-protocol-invariants">Protocol Invariants</a>. When we check it, Apalache produces a counterexample. Below is the root cause of this issue:</p>

<pre><code class="language-ts">function deposit(address _beneficiary, uint256 _amount) external
        override(IGovernance) isDepositAllowed(_beneficiary) {
  ASSET.safeTransferFrom(msg.sender, address(this), _amount);
    // &lt;--- if msg.sender == address(this), then the balances do not change
  users[_beneficiary].add(_amount);
    // &lt;--- ...but the liabilities always get increased
  total.add(_amount);
  emit Deposit(msg.sender, _beneficiary, _amount);
}
</code></pre>

<p>In short, <strong>executing an approved governance proposal</strong> can invoke <code>Governance.deposit(...)</code>. Inside <code>deposit</code>, this performs an ERC-20 self-transfer – leaving token balances unchanged – while <strong>crediting <code>_beneficiary</code> and increasing <code>total</code></strong>. <code>Governance</code>’s liabilities go up, but its assets don’t – the contract becomes <strong>insolvent</strong>. The diagram below illustrates the problematic scenario.</p>

<p><img src="https://protocols-made-fun.com/assets/images/gov-insolvency.svg?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" alt="gov-insolvency" /></p>

<p><strong>On ERC-20 approvals.</strong> Most ERC-20 token implementations would require Governance to execute an explicit token approval for the self-transfer before the call to <code>deposit()</code> (while executing the governance proposal). Calling <code>ASSET</code> is forbidden by the current <code>Governance</code> implementation. However, certain tokens like <a href="https://etherscan.io/address/0xc02aaa39b223fe8d0a0e5c4f27ead9083c756cc2?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed#code">WETH</a> do <strong>not</strong> require approvals for <code>transferFrom()</code> if <code>from == msg.sender</code>.</p>
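
<p>The scenario can be replayed in a small Python model. This is a sketch under stated assumptions, not the Solidity code: the <code>Token</code> class skips the allowance check when the caller moves its own funds (the WETH-like behavior described above), solvency is taken to mean that the token balance of Governance covers <code>total</code>, and the account names are made up.</p>

<pre><code class="language-python">GOV = "governance"

class Token:
    """A minimal ERC-20-like ledger with WETH-like transfer_from (assumption)."""
    def __init__(self, balances):
        self.balances = dict(balances)
        self.approved = set()  # (owner, spender) pairs with unlimited allowance

    def approve(self, owner, spender):
        self.approved.add((owner, spender))

    def transfer_from(self, caller, src, dst, amount):
        # No allowance check when the caller moves its own funds.
        if caller != src:
            assert (src, caller) in self.approved, "missing allowance"
        # The amount must lie between 0 and the balance of src.
        assert amount in range(self.balances.get(src, 0) + 1), "insufficient balance"
        self.balances[src] = self.balances.get(src, 0) - amount
        self.balances[dst] = self.balances.get(dst, 0) + amount

class Governance:
    def __init__(self, token):
        self.token = token
        self.users = {}
        self.total = 0

    def deposit(self, sender, beneficiary, amount):
        # Same order of effects as in deposit() above.
        self.token.transfer_from(GOV, sender, GOV, amount)
        self.users[beneficiary] = self.users.get(beneficiary, 0) + amount
        self.total += amount

    def shortfall(self):
        """Liabilities that are not covered by assets; 0 means solvent."""
        return max(0, self.total - self.token.balances.get(GOV, 0))

def replay():
    token = Token({"alice": 100})
    gov = Governance(token)
    token.approve("alice", GOV)
    gov.deposit("alice", "alice", 100)  # regular deposit
    before = gov.shortfall()
    gov.deposit(GOV, "mallory", 100)    # Governance calls deposit() on itself
    return before, gov.shortfall()
</code></pre>

<p>Calling <code>replay()</code> returns <code>(0, 100)</code>: after the regular deposit the model is solvent, and after the self-deposit the liabilities exceed the assets by the deposited amount.</p>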

<p><strong>Resolution.</strong> We raised this finding with Aztec Labs, who addressed it in PR <a href="https://github.com/AztecProtocol/aztec-packages/pull/16917?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">#16917</a> by forbidding Governance from calling <code>deposit()</code> itself. In addition, the <a href="https://etherscan.io/address/0xA27EC0006e59f245217Ff08CD52A7E8b169E62D2?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed#code">current AZTEC token implementation</a> uses OpenZeppelin’s ERC-20 implementation, which does require explicit approval of self-transfers.</p>

<h3 id="92-other-findings">9.2. Other Findings</h3>

<p>For details on all our findings, refer to our <a href="https://github.com/konnov/aztec-governance-formal-verification-2025q3/blob/e313681ade9f9e96d0e83a5120a670a1e1e07188/reports/Aztec-Governance-Protocol.pdf?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">formal report</a>.</p>

<h2 id="10-conclusion-scalable-formal-verification-in-practice">10. Conclusion: Scalable Formal Verification in Practice</h2>

<p>Our formal verification of the Aztec Governance Protocol went far beyond a traditional audit. It was a <strong>compositional, protocol-level analysis</strong> using state-of-the-art tools and techniques that we helped create. We formally proved <strong>125 high-level invariants</strong> across a multi-contract system – reasoning over a search space beyond the reach of traditional testing and most formal verification tools. These invariants were automatically decomposed into 992 verification conditions, which let us further parallelize the verification task.</p>

<p>By combining <strong>inductive invariants, symbolic reasoning, and massive parallelization</strong> (321 CPU-days of compute), we showed that formal verification can scale to the complexity of modern, mission-critical smart contract systems. Our methodology enables <strong>exhaustive, automated reasoning</strong> about real-world governance mechanisms and other smart contract protocols.</p>

<p>For systems like Aztec Governance, where bugs are subtle but potentially catastrophic, <strong>deep understanding of the tools and underlying logic</strong> is essential. This project demonstrates that scalable, unbounded formal verification is not just theoretically possible – it’s practical today for mature, production-grade protocols.</p>

<h3 id="differential-testing-spec--implementation-conformance">Differential Testing: Spec / Implementation Conformance</h3>

<p>A natural next step would be to <strong>connect the formal protocol specification</strong> with the actual Solidity implementation to <strong>close the verification loop</strong> (known as <em>differential</em> or <em>conformance testing</em>). With our methodology, it suffices to check that each external function call in Solidity conforms to its formal specification in Quint. Traditionally, this is done by writing and proving pre- and post-conditions in Hoare logic – e.g., using <a href="https://www.certora.com/?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Certora Prover</a>. We suggest that a <strong>more pragmatic approach</strong> is to <strong>fuzz external Solidity functions directly against the formal specification</strong>.</p>
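
<p>To make the idea concrete, here is a minimal sketch of differential testing in Python. Everything in it is hypothetical: <code>spec_step</code> stands in for the formal specification, the <code>Vault</code> class stands in for the implementation, and both are written in the same language for brevity. In the setting described above, the two sides would be a Quint specification and Solidity code.</p>

<pre><code class="language-python">import random

def spec_step(balance, op, amount):
    """Reference model: the next balance, or None if the call must revert."""
    if op == "deposit":
        return balance + amount
    if op == "withdraw":
        return balance - amount if amount in range(balance + 1) else None
    raise ValueError(op)

class Vault:
    """Stand-in for the implementation under test (hypothetical)."""
    def __init__(self, check_balance=True):
        self.balance = 0
        self.check_balance = check_balance

    def call(self, op, amount):
        """Execute one call; returns False if the call reverts."""
        if op == "withdraw":
            if self.check_balance and amount not in range(self.balance + 1):
                return False
            self.balance -= amount
        else:
            self.balance += amount
        return True

def diff_test(make_impl, num_steps=1000, seed=0):
    """Feed the same random calls to both sides; return the first mismatch."""
    rng = random.Random(seed)
    impl, expected = make_impl(), 0
    for step in range(num_steps):
        op, amount = rng.choice(["deposit", "withdraw"]), rng.randrange(10)
        ok = impl.call(op, amount)
        nxt = spec_step(expected, op, amount)
        if ok != (nxt is not None) or (ok and impl.balance != nxt):
            return step, op, amount
        if nxt is not None:
            expected = nxt
    return None
</code></pre>

<p>The call <code>diff_test(Vault)</code> returns <code>None</code>, since the implementation agrees with the model on every call. With the balance check switched off, <code>diff_test(lambda: Vault(check_balance=False))</code> reports the first <code>withdraw</code> call that the model rejects and the implementation accepts.</p>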

<p><strong>Enabling diff testing in Apalache:</strong> We have just implemented a new <a href="https://github.com/apalache-mc/apalache/tree/main/json-rpc?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Apalache JSON-RPC API</a>, which enables interactive differential testing between implementation and specification. This delivers <strong>fast, actionable, and reproducible results</strong> while still providing a <strong>high level of assurance</strong> grounded in rigorous formal modelling.</p>]]></content>
      

      
      
      
      
      

      <author>
          <name>Igor Konnov</name>
        
          <email>igor@konnov.phd</email>
        
        
          <uri>https://konnov.phd?utm_source=protocols_made_fun&amp;utm_medium=referral&amp;utm_campaign=pmf_site</uri>
        
      </author>

      
        
          <category term="quint" />
        
      

      

      
      
        <summary type="html"><![CDATA[Authors: Thomas Pani, Igor Konnov]]></summary>
      

      
      
    </entry>
  
    <entry>
      
      

      <title type="html">Small scope hypothesis revisited</title>
      <link href="https://protocols-made-fun.com/tlaplus/2025/12/02/small-scope.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" rel="alternate" type="text/html" title="Small scope hypothesis revisited" />
      <published>2025-12-02T00:00:00+00:00</published>
      <updated>2025-12-02T00:00:00+00:00</updated>
      <id>https://protocols-made-fun.com/tlaplus/2025/12/02/small-scope</id>
      
      
        <content type="html" xml:base="https://protocols-made-fun.com/tlaplus/2025/12/02/small-scope.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed"><![CDATA[<p><strong>Author:</strong> <a href="https://konnov.phd?utm_source=protocols_made_fun&amp;utm_medium=referral&amp;utm_campaign=pmf_site">Igor Konnov</a></p>

<p><strong>Tags:</strong> specification tlaplus tlc</p>

<p>A couple of weeks ago, I gave a talk at the internal Nvidia FM Week 2025. Many
thanks to <a href="https://github.com/lemmy?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Markus Kuppe</a> for the organization and invitation! I am going to
write a longer blog post about interactive spec conformance testing with
Apalache later. Today, I want to talk a bit about the question posed by Markus
(to find the question, continue reading).</p>

<p>Let’s talk about the small scope hypothesis. As formulated by Jackson in the
<a href="https://dspace.mit.edu/bitstream/handle/1721.1/149864/MIT-LCS-TR-735.pdf?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">technical report</a> (1997), this hypothesis reads as follows:</p>

<p class="highlight-question"><strong><em>
    "...most errors can be demonstrated by counterexamples within a small scope."
</em></strong></p>

<p>As you will see below, my example fits into this hypothesis quite well. However,
having spoken to many engineers over the years, I believe that there is a
mismatch between what engineers understand by “small scope” and what
verification engineers understand by “small scope”.</p>

<p>In this blog post, I’ve decided to try a <strong>new format</strong>. Since everyone is using
LLMs nowadays, I will follow the protocol. I will present the example and the
problem of finding a small scope. Then, it is your turn to decide how this blog
post should continue. If someone gives me an interesting example or insight in a
<a href="https://protocols-made-fun.com/tlaplus/2025/12/02/small-scope.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed#end">comment</a>, I will update this blog post accordingly.</p>

<h2 id="1-example-1-buggy-circular-buffer">1. Example 1: Buggy circular buffer</h2>

<h3 id="11-the-specification">1.1. The specification</h3>

<p>I started the talk with a TLA<sup>+</sup> specification of a <strong>buggy</strong> circular
buffer. You can find the full specification, the model checking models, and the
TLC configuration files <a href="https://github.com/konnov/cyclic-buffer-challenge/tree/main/tla?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">here</a>. The specification looks as follows:</p>

<pre><code class="language-tlaplus">----------------------------- MODULE BuggyCircularBuffer -----------------------------
(**
 * A very simple specification of a circular buffer with a bug.
 * Generated with ChatGPT and beautified by Igor Konnov, 2025.
 * ChatGPT learned abstraction so well that it omitted the actual buffer storage!
 *)
EXTENDS Integers

CONSTANTS
    \* Size of the circular buffer.
    \* @type: Int;
    BUFFER_SIZE,
    \* The set of possible buffer elements.
    \* @type: Set(Int);
    BUFFER_ELEMS

ASSUME BUFFER_SIZE &gt; 0

VARIABLES
    \* The integer buffer of size BUFFER_SIZE.
    \* @type: Int -&gt; Int;
    buffer,
    \* Index of the next element to POP.
    \* @type: Int;
    head,
    \* Index of the next free slot for PUSH.
    \* @type: Int;
    tail,
    \* Number of elements currently stored.
    \* @type: Int;
    count

\* Initial state
Init ==
  /\ buffer = [ i \in 0..(BUFFER_SIZE - 1) |-&gt; 0 ]
  /\ head = 0
  /\ tail = 0
  /\ count = 0

\* Buggy PUT: Advance tail, increment count, but no fullness check!
Put(x) ==
  Put::
  LET nextTail == (tail + 1) % BUFFER_SIZE IN
  /\ buffer' = [buffer EXCEPT ![tail] = x]
  /\ head' = head
  /\ tail' = nextTail
  /\ count' = count + 1

\* GET: Only allowed when count &gt; 0.
Get ==
  Get::
  LET nextHead == (head + 1) % BUFFER_SIZE IN
  /\ count &gt; 0
  /\ UNCHANGED buffer
  /\ head' = nextHead
  /\ tail' = tail
  /\ count' = count - 1

\* Either Put or Get may happen in any step.
Next ==
    \/ \E x \in BUFFER_ELEMS:
        Put(x)
    \/ Get

vars == &lt;&lt;buffer, head, tail, count&gt;&gt;

\* Complete specification
Spec == Init /\ [][Next]_vars

\* Safety property we *intend* to hold, but it is violated:
\* count must never exceed the buffer capacity.
SafeInv == count &lt;= BUFFER_SIZE

</code></pre>

<p>Since I wanted to experiment with different buffer sizes and potential buffer
elements, I have introduced two parameters in the specification:</p>

<ul>
  <li><code>BUFFER_SIZE</code> is the size of the cyclic buffer, and</li>
  <li><code>BUFFER_ELEMS</code> is the set of possible buffer elements.</li>
</ul>

<p>Now, my previous experience with introducing TLA<sup>+</sup> to engineers
suggests that there are two ways to set these parameters:</p>

<ol>
  <li>
    <p><strong>The Engineer’s way:</strong> Set the parameters to relatively small yet
 reasonable values. For example, <code>BUFFER_SIZE = 10</code> and <code>BUFFER_ELEMS = 0..255</code>.
 These are not the minimal possible values, but they kind of make sense: The
 buffer should hold up to 10 bytes. Obviously, <code>BUFFER_ELEMS</code> are
 set to the minimal possible type in their programming language of choice, e.g.,
 <code>char</code> in C, or <code>u8</code> in Rust.</p>
  </li>
  <li>
    <p><strong>The Verification Engineer’s way:</strong> Start with the smallest possible values
 of the parameters, e.g., <code>BUFFER_SIZE = 2</code> and <code>BUFFER_ELEMS = {0, 1}</code>. The
 idea is to check the specification in the smallest possible scope first. If there
 are no bugs found, increase the parameters gradually until you reach the
 reasonable values.</p>
  </li>
</ol>

<h3 id="12-checking-the-specification-engineers-way">1.2. Checking the specification Engineer’s way</h3>

<p>To check the specification the Engineer’s way, I have created the
TLA<sup>+</sup> model <a href="https://github.com/konnov/cyclic-buffer-challenge/blob/main/tla/MC10u8_BuggyCircularBuffer.tla?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">MC10u8_BuggyCircularBuffer.tla</a> with <code>BUFFER_SIZE = 10</code>
and <code>BUFFER_ELEMS = 0..255</code>. For technical reasons, we also need the TLC config
file <a href="https://github.com/konnov/cyclic-buffer-challenge/blob/main/tla/MC.cfg?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">MC.cfg</a>. Follow the links to see the details.  Further, I’ve run TLC on
this model to check the invariant <code>SafeInv</code>:</p>

<pre><code class="language-shell">$ java -cp tla2tools.jar "-XX:+UseParallelGC" tlc2.TLC \
  -config MC.cfg MC10u8_BuggyCircularBuffer.tla
</code></pre>

<p>I wanted to see how far TLC could go, so I gave it a machine with 128 GB of RAM
and 32 cores. TLC explored around 3 billion states in about 40 minutes.
After consuming 400 GB of disk space, it ran out of disk space and
terminated. No bug was found. Is this surprising? Not really. In this
configuration, TLC has to enumerate $(2^8)^{10} * 10 * 10 * 10 \approx 2^{90}$
states. (Thanks to <a href="https://blltprf.xyz/?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Thomas Pani</a> for correcting the initially wrong estimate.)</p>
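
<p>The estimate is easy to reproduce. The snippet below follows the same counting as in the text: the number of possible buffer contents, multiplied by <code>BUFFER_SIZE</code> once for each of <code>head</code>, <code>tail</code>, and <code>count</code>. The function name is my own.</p>

<pre><code class="language-python">import math

def state_space_size(buffer_size, num_elems):
    """Buffer contents times the values of head, tail, and count."""
    contents = num_elems ** buffer_size
    return contents * buffer_size * buffer_size * buffer_size

if __name__ == "__main__":
    n = state_space_size(10, 256)
    print(n, math.log2(n))
</code></pre>

<p>For <code>BUFFER_SIZE = 10</code> and 256 elements, this gives $2^{80} \cdot 10^3$, and its binary logarithm is just below 90. For <code>BUFFER_SIZE = 2</code> and two elements, the same formula gives only 32.</p>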

<p>Obviously, anyone who used TLC for some time would have asked the same question
as Markus did:</p>

<p class="highlight-question"><strong><em>
  What about the small scope hypothesis? Can we use smaller parameters?
</em></strong></p>

<p>The answer to this question is basically the second approach, which I called the
Verification Engineer’s way.</p>

<div class="tip-box">
    <div class="tip-header"><strong>💡 Tip:</strong></div>
    <div class="tip"><p>Apalache finds an invariant violation in 3
seconds, when running bounded model checking with the command <code>check</code>.
However, I do not want to distract us from the main point of this blog post.</p>
</div>
</div>

<h3 id="13-checking-the-specification-verification-engineers-way">1.3. Checking the specification Verification Engineer’s way</h3>

<p>This time, we use the instance <a href="https://github.com/konnov/cyclic-buffer-challenge/blob/main/tla/MC2u1_BuggyCircularBuffer.tla?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">MC2u1_BuggyCircularBuffer.tla</a>
that has <code>BUFFER_SIZE = 2</code> and <code>BUFFER_ELEMS = {0, 1}</code>.
Let’s run TLC on this instance to check the invariant <code>SafeInv</code>:</p>

<pre><code class="language-shell">$ java -cp tla2tools.jar "-XX:+UseParallelGC" tlc2.TLC \
  -config MC.cfg MC2u1_BuggyCircularBuffer.tla
...
Error: Invariant SafeInv is violated.
...
10 states generated, 10 distinct states found, 5 states left on queue.
</code></pre>

<p>Yay! Just after visiting 10 states, TLC has found a violation of the invariant!</p>

<p>So if we pick the right small scope, exhaustive model checking with TLC finds
the bug quite fast. In this example, it is hard to find a small scope that would
not reveal the bug. Of course, when we know that the bug exists, it is easy to
experiment with different values of the parameters and find the bug.</p>
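
<p>For readers who prefer code to model checkers, here is a rough Python analogue of this exhaustive search. It is a sketch, not a reimplementation of TLC: the state is the tuple of <code>buffer</code>, <code>head</code>, <code>tail</code>, and <code>count</code>, the successors follow <code>Put</code> and <code>Get</code> from the specification, and the search is plain breadth-first search.</p>

<pre><code class="language-python">from collections import deque

def successors(state, size, elems):
    buffer, head, tail, count = state
    # Buggy Put: no fullness check.
    for x in elems:
        new_buffer = buffer[:tail] + (x,) + buffer[tail + 1:]
        yield (new_buffer, head, (tail + 1) % size, count + 1)
    # Get: enabled only on a non-empty buffer.
    if count != 0:
        yield (buffer, (head + 1) % size, tail, count - 1)

def bfs_violation(size, elems):
    """Return (depth, state) for the first visited state that breaks SafeInv."""
    init = ((0,) * size, 0, 0, 0)
    queue, seen = deque([(init, 0)]), {init}
    while queue:
        state, depth = queue.popleft()
        # SafeInv: count stays within 0..size.
        if state[3] not in range(size + 1):
            return depth, state
        for succ in successors(state, size, elems):
            if succ not in seen:
                seen.add(succ)
                queue.append((succ, depth + 1))
    return None
</code></pre>

<p>For <code>bfs_violation(2, (0, 1))</code>, the search stops at depth 3, that is, after three <code>Put</code> steps in a row. Do not try it with a buffer of size 10 and 256 elements: the number of states to visit before reaching depth 11 is astronomically large.</p>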

<h3 id="14-checking-the-specification-randomly">1.4. Checking the specification randomly</h3>

<p>Surprisingly, if we forget about exhaustive enumeration, TLC finds an
invariant violation for <code>BUFFER_SIZE = 10</code> and <code>BUFFER_ELEMS = 0..255</code> in less
than a second. To do this, we run TLC with the option <code>-simulate</code>, which simply
picks successor states at random:</p>

<pre><code class="language-shell">$ java -cp tla2tools.jar "-XX:+UseParallelGC" tlc2.TLC \
  -simulate -config MC.cfg MC10u8_BuggyCircularBuffer.tla
...
Error: Invariant SafeInv is violated.
...
</code></pre>

<p>This effectiveness of randomized search is actually not a one-off thing.
The Quint simulator <a href="https://github.com/konnov/cyclic-buffer-challenge/tree/main/quint?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed#randomized-simulation">finds the bug</a> in less than a second.
Similarly, the Rust property-based testing with <a href="https://github.com/konnov/cyclic-buffer-challenge/tree/main/rust/proptest?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">proptest</a> finds the bug
almost immediately.</p>

<p>Interestingly, <strong>we did not have to tune the scope to be as tiny as possible</strong>,
as we did for exhaustive model checking. Maybe this is why some engineers want
to use property-based testing for every problem?</p>
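
<p>Here is a minimal sketch of such a randomized search in Python. It is a simplification of what the tools do: only <code>count</code> is tracked, since the invariant does not depend on the buffer contents, and the choice between <code>Put</code> and <code>Get</code> is uniform, which is my assumption rather than the exact strategy of TLC or Quint.</p>

<pre><code class="language-python">import random

def simulate(size, num_elems, max_steps=1000, max_runs=100, seed=0):
    """Random walks over the buggy buffer; returns (run, step) of a violation."""
    rng = random.Random(seed)
    for run in range(max_runs):
        count = 0
        for step in range(1, max_steps + 1):
            # Get is disabled on an empty buffer; otherwise flip a coin.
            action = "put" if count == 0 else rng.choice(["put", "get"])
            if action == "put":
                _elem = rng.randrange(num_elems)  # the value does not affect count
                count += 1
            else:
                count -= 1
            # SafeInv is violated as soon as count exceeds the capacity.
            if count == size + 1:
                return run, step
    return None
</code></pre>

<p>Even for <code>simulate(10, 256)</code>, a violation needs only eleven more <code>Put</code> steps than <code>Get</code> steps, and a random walk gets there quickly. If a simulator picked uniformly among all successor states instead, 256 of the 257 successors of a non-empty buffer state would be <code>Put</code> steps, so the violation would show up even sooner.</p>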

<h2 id="2-thinking-about-the-small-scope-hypothesis">2. Thinking about the small scope hypothesis</h2>

<p>In Example 1, we indeed found several assignments to <code>BUFFER_SIZE</code> and
<code>BUFFER_ELEMS</code> that revealed an invariant violation. Actually, this bug is so
simple that almost any assignment to the parameters would reveal it. We could
even set <code>BUFFER_SIZE = 1</code> and <code>BUFFER_ELEMS = {0}</code> to find the bug! If you want
to push it further, think about whether <code>BUFFER_ELEMS = {}</code> would allow us to find an
invariant violation.</p>

<p>In fact, if we go back to <a href="https://alloytools.org/?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Alloy</a>, the way Alloy restricts the scope is quite
different from what we did in Example 1. Alloy limits the number of elements of
each type in the specification. For example, if we had specified the circular
buffer in Alloy, we could restrict the search scope as follows:</p>

<ul>
  <li>
    <p>All integers, including <code>BUFFER_SIZE</code> and buffer indices, have the bit width
 of 4.</p>
  </li>
  <li>
    <p>The number of unique buffer elements is $2^8$.</p>
  </li>
</ul>

<p>As a result, Alloy would consider all possible values of <code>BUFFER_SIZE</code> from 0 to
15, all possible values of buffer elements from 0 to 255, as well as all
possible combinations of buffers of size up to 15. This is a much more flexible
way to restrict the search space.  In case of TLC, we did not have this
flexibility: We had to give concrete values to <code>BUFFER_SIZE</code> and <code>BUFFER_ELEMS</code>.</p>

<div class="tip-box">
    <div class="tip-header"><strong>💡 Tip:</strong></div>
    <div class="tip"><p>Apalache has data generators, which are closer to
the Alloy scopes in spirit, though they work slightly differently from Alloy.</p>
</div>
</div>

<p>Hence, we have to distinguish between small scopes and small parameter
assignments in TLC. After thinking about this question a bit more, I’ve asked
myself:</p>

<p class="highlight-question"><strong><em>
  Are there examples of specifications that have a small scope for a specific
  invariant violation, but it is hard to find concrete parameter assignments
  within this scope?
</em></strong></p>

<p>Even though my intuition says “yes, there must be plenty of such examples”, I
could not come up with non-artificial examples immediately. Off the top of my
head, I can think of the following directions to look for such examples:</p>

<ul>
  <li>
    <p>Examples from <strong>abstract interpretation</strong>. If we have non-trivial math with
  overflows and underflows, it might be hard to find concrete assignments that
  would trigger these overflows and underflows.</p>
  </li>
  <li>
    <p>Examples from <strong>graph theory</strong>. For instance, <a href="https://en.wikipedia.org/wiki/Planar_graph?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">non-planar graphs</a>
  must contain subgraphs that are subdivisions of $K_5$ or $K_{3,3}$ (see
  Kuratowski’s theorem on <a href="https://en.wikipedia.org/wiki/Planar_graph?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">planar graphs</a>). So if a bug shows up only in
  non-planar graphs, there must be a small scope that reveals the bug.  However,
  our concrete graph would have to contain a subdivision of $K_5$ or $K_{3,3}$,
  which is far from an arbitrary graph. Unfortunately, I do not know any
  concurrent or distributed algorithm that would have something to do with
  planar or non-planar graphs.</p>
  </li>
</ul>

<h2 id="3-your-turn">3. Your turn</h2>

<p>It is your turn to decide how this blog post should continue. If someone gives
me an interesting example or insight in a <a href="https://protocols-made-fun.com/tlaplus/2025/12/02/small-scope.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed#end">comment</a>, I will
update this blog post accordingly.</p>

<!-- references -->]]></content>
      

      
      
      
      
      

      <author>
          <name>Igor Konnov</name>
        
          <email>igor@konnov.phd</email>
        
        
          <uri>https://konnov.phd?utm_source=protocols_made_fun&amp;utm_medium=referral&amp;utm_campaign=pmf_site</uri>
        
      </author>

      
        
          <category term="tlaplus" />
        
      

      

      
      
        <summary type="html"><![CDATA[Author: Igor Konnov]]></summary>
      

      
      
    </entry>
  
    <entry>
      
      

      <title type="html">Running TLC with non-standard modules</title>
      <link href="https://protocols-made-fun.com/tlaplus/2025/10/09/tlc-with-modules.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed" rel="alternate" type="text/html" title="Running TLC with non-standard modules" />
      <published>2025-10-09T00:00:00+00:00</published>
      <updated>2025-10-09T00:00:00+00:00</updated>
      <id>https://protocols-made-fun.com/tlaplus/2025/10/09/tlc-with-modules</id>
      
      
        <content type="html" xml:base="https://protocols-made-fun.com/tlaplus/2025/10/09/tlc-with-modules.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed"><![CDATA[<p><strong>Author:</strong> <a href="https://konnov.phd?utm_source=protocols_made_fun&amp;utm_medium=referral&amp;utm_campaign=pmf_site">Igor Konnov</a></p>

<p><strong>Tags:</strong> specification tlaplus apalache tlc</p>

<p>This must be my shortest blog post ever. I just wanted to run <a href="https://github.com/tlaplus/tlaplus?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">TLC</a> to check
a specification that uses <a href="https://apalache-mc.org/?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Apalache</a> modules. For example, the typed version
of two-phase commit <a href="https://github.com/apalache-mc/apalache/blob/main/test/tla/MC3_TwoPhaseTyped.tla?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">MC3_TwoPhaseTyped.tla</a> uses the module <a href="https://apalache-mc.org/docs/lang/variants.html?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">Variants</a>.
Sure, TLC can do that, but it requires a small trick.</p>

<p>Let’s do it step by step for <a href="https://github.com/apalache-mc/apalache/blob/main/test/tla/MC3_TwoPhaseTyped.tla?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">MC3_TwoPhaseTyped.tla</a>. Say, we want to see
an example of all participants committing. To get one, we state the opposite as
an invariant, so that TLC reports a state violating it as a counterexample:</p>

<pre><code class="language-tlaplus">RMAllCommittedEx ==
    ~(\A rm \in RM: rmState[rm] = "committed")
</code></pre>

<p><strong>Step 1.</strong> Check out the Apalache repository, if you don’t have it already:</p>

<pre><code class="language-shell">$ git clone git@github.com:apalache-mc/apalache.git
$ export APALACHE_HOME=$(pwd)/apalache
</code></pre>

<p><strong>Step 2.</strong> Download TLA<sup>+</sup> Tools:</p>

<pre><code class="language-shell">$ wget https://github.com/tlaplus/tlaplus/releases/download/v1.7.4/tla2tools.jar
$ export TLC_HOME=$(pwd)
</code></pre>

<p><strong>Step 3.</strong> Create a configuration file <code>MC3_TwoPhaseTyped.cfg</code> next to the
specification, with the following content:</p>

<pre><code class="language-shell">$ cd $APALACHE_HOME/test/tla
$ cat &gt;MC3_TwoPhaseTyped.cfg &lt;&lt;EOF
INIT Init
NEXT Next
INVARIANT RMAllCommittedEx
EOF
</code></pre>

<p><strong>Step 4.</strong> Run TLC with the Java option <code>-cp</code>, which sets the Java
classpath. In addition to <code>tla2tools.jar</code>, we put the directory
<code>${APALACHE_HOME}/src/tla</code> on the classpath, and TLC looks for non-standard
modules there. <em>This is the trick!</em></p>

<pre><code class="language-shell">$ java -cp ${TLC_HOME}/tla2tools.jar:${APALACHE_HOME}/src/tla \
  "-XX:+UseParallelGC" tlc2.TLC \
  -config MC3_TwoPhaseTyped.cfg MC3_TwoPhaseTyped.tla
</code></pre>

<p>As expected, TLC reports a violation of <code>RMAllCommittedEx</code>, which is exactly the example we were looking for:</p>

<pre><code>Running breadth-first search Model-Checking...
...
State 11: &lt;Next line 16, col 1 to line 16, col 22 of module MC3_TwoPhaseTyped&gt;
/\ msgs = { [tag |-&gt; "Commit", value |-&gt; "0_OF_NIL"],
  [tag |-&gt; "Prepared", value |-&gt; "0_OF_RM"],
  [tag |-&gt; "Prepared", value |-&gt; "1_OF_RM"],
  [tag |-&gt; "Prepared", value |-&gt; "2_OF_RM"] }
/\ rmState = [0_OF_RM |-&gt; "committed", 1_OF_RM |-&gt; "committed", 2_OF_RM |-&gt; "committed"]
/\ tmState = "committed"
/\ tmPrepared = {"0_OF_RM", "1_OF_RM", "2_OF_RM"}
...
1119 states generated, 287 distinct states found, 7 states left on queue.
</code></pre>
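
<p>If you run TLC this way often, the command from Step 4 can be wrapped in a
small helper. The following is only a minimal sketch, not something shipped with
TLC or Apalache: the function names <code>tlc_classpath</code> and <code>run_tlc</code> are made up
for illustration, and the script assumes that <code>TLC_HOME</code> and <code>APALACHE_HOME</code>
are set as in Steps 1 and 2.</p>

<pre><code class="language-shell"># Hypothetical helpers, not part of TLC or Apalache.

# Print the classpath: the TLC jar, then the directory with the extra modules.
# $1 is the directory with tla2tools.jar, $2 is the directory with .tla modules.
tlc_classpath() {
  printf '%s:%s\n' "$1/tla2tools.jar" "$2"
}

# Run TLC on the spec $1, using $1.cfg as the config and $1.tla as the spec.
# Assumes that TLC_HOME and APALACHE_HOME are set as in Steps 1 and 2.
run_tlc() {
  if [ ! -f "${TLC_HOME}/tla2tools.jar" ]; then
    echo "tla2tools.jar not found in ${TLC_HOME}"
    return 1
  fi
  if [ ! -d "${APALACHE_HOME}/src/tla" ]; then
    echo "module directory not found: ${APALACHE_HOME}/src/tla"
    return 1
  fi
  java -cp "$(tlc_classpath "${TLC_HOME}" "${APALACHE_HOME}/src/tla")" \
    "-XX:+UseParallelGC" tlc2.TLC -config "$1.cfg" "$1.tla"
}
</code></pre>

<p>After loading these functions into your shell, calling
<code>run_tlc MC3_TwoPhaseTyped</code> from the directory <code>$APALACHE_HOME/test/tla</code> should
be equivalent to the command in Step 4. The two checks only save you from a
less readable Java error when one of the paths is wrong.</p>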

<p><a name="end"></a></p>
<h2 id="bottom-line">Bottom line</h2>

<p>This is it! If you have any questions, please feel free to reach out. I’m
<a href="https://protocols-made-fun.com/contact/?utm_source=protocols_made_fun&amp;utm_medium=feed&amp;utm_campaign=pmf_feed">happy to help</a>.</p>]]></content>
      

      
      
      
      
      

      <author>
          <name>Igor Konnov</name>
        
          <email>igor@konnov.phd</email>
        
        
          <uri>https://konnov.phd?utm_source=protocols_made_fun&amp;utm_medium=referral&amp;utm_campaign=pmf_site</uri>
        
      </author>

      
        
          <category term="tlaplus" />
        
      

      

      
      
        <summary type="html"><![CDATA[Author: Igor Konnov]]></summary>
      

      
      
    </entry>
  
</feed>
