Author: Igor Konnov

Date: December 22, 2025

I present a simple example that illustrates how property-based testing (PBT) and model checking can help us catch unexpected behaviors of LLMs when they are used to generate code. The example is inspired by the talk on property-based testing by Scott Wlaschin. If you are looking for a light example that stresses the importance of writing good properties and having them checked, this post is for you.

1. Adversarial Developer

A few days ago, I watched the talk on property-based testing by Scott Wlaschin. He started the talk by introducing a persona that he called the Enterprise Developer from Hell. This is basically someone who implements a feature to satisfy the given requirements, but they do it in creatively evil (or just stupid) and unexpected ways. I will call such a persona an adversarial developer in the rest of this post.

Then, Scott1 showed how an adversarial developer could ruin as simple task as adding up two numbers. For example, if we give them two tests $2+2=4$ and $10+33 = 43$, they will implement exactly those cases by case distinction. I am not going to repeat Scott’s talk. Watch it! It’s instructive and entertaining.

Back in 2020, of course, Scott added that we are often aversarial developers ourselves, and our peers are rarely that evil. It could be an enthusiastic junior developer, who has just started and now wants to rewrite the whole code base. Now, we are a few weeks away from 2026, and we definitely have such a peer! It is called an LLM, or just AI, as the corporate marketers prefer. LLMs are not necessarily evil, but they are definitely less predictable than experienced human software engineers. I am not talking about prompt injection here, which is another real issue with LLMs.

To be clear, in the rest of this text, I am talking about how an LLM could behave like an adversarial developer when it generates code. It does not mean that I ran one of the commercial LLMs and got those results.

2. Property-Based Testing

The point of Scott’s talk was to show that a few data points (typical unit tests) are insufficient to demonstrate correctness of the implementation. A totally valid point!

In addition to the standard unit tests, we should also write the expected properties of our implementation. The PBT frameworks test the code by producing input values at random. For example, have a look at Hypothesis. While this may seem to be a silly idea at first, property-based tests may uncover tricky bugs. Moreover the input value distribution does not have to be uniform. Keep reading to see how this helps us catch the adversarial developer.

Here are the three properties of addition that Scott used to defeat the adversarial developer:

  • identity: for any number $x$, $x + 0 = x$,
  • commutativity: for any numbers $x$ and $y$, $x + y = y + x$, and
  • associativity: for any numbers $x$, $y$, and $z$, $(x + y) + z = x + (y + z)$.

At this point of the talk, I was like: Wait a minute! I could continue this game of the adversarial developer. Before continuing with the game, let’s look at where we are with respect to the code and the properties. Here is the obvious implementation of integer addition in Python, since the language has built-in support for unbounded integers:

def add(a: int, b: int) -> int:
    return a + b

Here are the property-based tests in Hypothesis, generated by Claude Sonnet 4.5:

@given(st.integers())
def test_identity(a):
    """Test identity property: a + 0 = a."""
    assert add(a, 0) == a
    assert add(0, a) == a


@given(st.integers(), st.integers())
def test_commutativity(a, b):
    """Test commutativity property: a + b = b + a."""
    assert add(a, b) == add(b, a)


@given(st.integers(), st.integers(), st.integers())
def test_associativity(a, b, c):
    """Test associativity property: (a + b) + c = a + (b + c)."""
    assert add(add(a, b), c) == add(a, add(b, c))

You can find these and further examples in the example repository.

We run these tests with pytest to make sure that they all pass:

$ git clone https://github.com/konnov/pbt-example-summation.git
$ cd pbt-example-summation/python
$ poetry run pytest tests/test_add.py \
  -k "test_identity or test_commutativity or test_associativity" --verbose
...
tests/test_add.py::test_identity PASSED                                        [ 33%]
tests/test_add.py::test_commutativity PASSED                                   [ 66%]
tests/test_add.py::test_associativity PASSED                                   [100%]

3. Symbolic model checking with Apalache

I decided to go even further and write a TLA+ specification, to check the three properties with Apalache and Z3 (only show the relevant parts):

----------------------------------- MODULE Add --------------------------------
(*
 * A simple TLA+ specification of different kinds of addition.
 *
 * Igor Konnov, 2025
 *)
EXTENDS Integers

VARIABLE
    \* @type: Int;
    x,
    \* @type: Int;
    y,
    \* @type: Int;
    z

AddMath(a, b) == a + b

InitMath ==
    /\ x \in Int
    /\ y \in Int
    /\ z \in Int

Next == UNCHANGED <<x, y, z>>

Identity(F(_, _)) ==
    F(x, 0) = x

Commutativity(F(_, _)) ==
    F(x, y) = F(y, x)

Associativity(F(_, _)) ==
    F(F(x, y), z) = F(x, F(y, z))

InvMath ==
    /\ Identity(AddMath)
    /\ Commutativity(AddMath)
    /\ Associativity(AddMath)

With the above specification, we define a very simple state machine that non-deterministically picks three integers x, y, and z with InitMath. These variables do not change in the state machine, as you can see from the definition of Next. We use x, y, and z to define three properties of addition: Identity, Commutativity, and Associativity. As you can see, these definitions are parameterized by the operator F, which is Add for now. Our invariant InvMath is simply the conjunction of the three properties.

This is how we run Apalache to check the invariant:

$ cd pbt-example-summation/tla-spec
$ docker pull ghcr.io/apalache-mc/apalache
$ docker run --rm -v `pwd`:/var/apalache ghcr.io/apalache-mc/apalache \
  check --length=0 --init=InitMath --inv=InvMath Add.tla

With the above command, we tell Apalache to check the invariant InvMath starting from the initial state InitMath. The --length=0 option tells Apalache to unroll Next zero times, which is sufficient in our case, since the state machine does not change the variables.

4. Playing adversarial

Ok, the code and the specification above seem to be correct. But what if our friendly AI produced something unexpected?

4.1. Hallucinating addition over 32-bit unsigned integers

Since we are dealing with an adversarial developer, they could simply use a different definition of addition. So far, we have been talking about unbounded mathematical integers, which Python conveniently implements for us.

Now, the adversarial developer gives us this implementation:

def add32(a: int, b: int) -> int:
    return (a + b) % (2**32)

This implementation is actually not wrong. An LLM could copy it from a code base that emulates a 32-bit CPU architecture in Python. This is a bit of a stretch, but possible.

Let’s add property-based tests for this implementation as well:

# Tests for add32 (32-bit natural numbers with wrapping)

@given(st.integers(min_value=0, max_value=2**32 - 1))
def test_add32_identity(a):
    """Test identity property for add32: a + 0 = a."""
    assert add32(a, 0) == a
    assert add32(0, a) == a


@given(st.integers(min_value=0, max_value=2**32 - 1), st.integers(min_value=0, max_value=2**32 - 1))
def test_add32_commutativity(a, b):
    """Test commutativity property for add32: a + b = b + a."""
    assert add32(a, b) == add32(b, a)


@given(
    st.integers(min_value=0, max_value=2**32 - 1),
    st.integers(min_value=0, max_value=2**32 - 1),
    st.integers(min_value=0, max_value=2**32 - 1)
)
def test_add32_associativity(a, b, c):
    """Test associativity property for add32: (a + b) + c = a + (b + c)."""
    assert add32(add32(a, b), c) == add32(a, add32(b, c))

These tests also pass:

$ poetry run pytest tests/test_add.py \
  -k "test_add32_identity or test_add32_commutativity or test_add32_associativity" \
  --verbose

What is going on? Well, identity, commutativity, and associativity also hold for 32-bit integers with overflow semantics. If we let AI generate not only the implementation but also the properties, we may end up with a correct implementation, but not the one we wanted! In this case, imagine an LLM has added the @given decorators for the inputs to be in the range in $[0, 2^{32})$, whereas we wanted unbounded integers! This is an example of the classical question in requirements engineering: do we get things right vs. do we get the right things.

Just to double check that it is not the random chance, I ran Apalache on the TLA+ specification above with Add32 instead of Add:

$ docker run --rm -v `pwd`:/var/apalache ghcr.io/apalache-mc/apalache \
  check --length=0 --init=Init32 --inv=Inv32 Add.tla
...
Checker reports no error up to computation length 0
Total time: 2.205 sec

The SMT solver Z3 confirms that identity, commutativity, and associativity also hold for 32-bit integers with overflow semantics. This is provided that we pick the integers from the range $[0, 2^{32})$, which we do with Init32.

However, this is not what we wanted initially. Let’s catch the adversarial developer with the PBT tests that pick unbounded non-negative integers:

@given(st.integers(0))
def test_add32_unbounded_inputs_identity(a):
    """Test identity property for add32: a + 0 = a."""
    assert add32(a, 0) == a
    assert add32(0, a) == a

@given(st.integers(0), st.integers(0))
def test_add32_unbounded_inputs_commutativity(a, b):
    """Test commutativity property for add32: a + b = b + a."""
    assert add32(a, b) == add32(b, a)


@given(
    st.integers(0),
    st.integers(0),
    st.integers(0)
)
def test_add32_unbounded_inputs_associativity(a, b, c):
    """Test associativity property for add32: (a + b) + c = a + (b + c)."""
    assert add32(add32(a, b), c) == add32(a, add32(b, c))

This time, Hypothesis catches the issue with identity:

$ poetry run pytest tests/test_add.py  --verbose \
  -k "test_add32_unbounded_inputs_identity or test_add32_unbounded_inputs_commutativity or test_add32_unbounded_inputs_associativity"
...
a = 4294967296

    @given(st.integers(0))
    def test_add32_unbounded_inputs_identity(a):
        """Test identity property for add32: a + 0 = a."""
>       assert add32(a, 0) == a
E       assert 0 == 4294967296
E        +  where 0 = add32(4294967296, 0)
E       Falsifying example: test_add32_unbounded_inputs_identity(
E           a=4_294_967_296,
E       )
💡 Tip:

We can also implement add64 that wraps integers modulo $2^{64}$ and the PBT tests will catch the issue with identity almost immediately.

4.2. Is property-based testing a magic tool?

Let’s stop and think about our example. How did Hypothesis catch the issue over an unbounded integer domain? Even it was picking the integers from the interval $[0, 2^{64})$, the chance of picking 4294967296 by uniform random sampling is pretty slim. Yet, Hypothesis keeps picking this number.

Well, the trick is that its input generator tries the well-known “magic numbers” such as 0, 1, -1, 2**32, 2**64, etc. In this sense, Hypothesis does not use uniform random sampling. See the discussion on domain and distribution in the Hypothesis documentation for more details.

What if our adversarial developer hallucinated an implementation that stays undetected by Hypothesis? This is what our next example is about.

4.3. Hallucinating addition over 256-bit unsigned integers

This time, the adversarial developer uses 256-bit unsigned integers with overflow semantics:

def add256(a: int, b: int) -> int:
    return (a + b) % (2**256)

If you think that using 256-bit integers is absurd, well, the Ethereum Virtual Machine (EVM) does exactly that. So an LLM could have adapted the above code from an EVM-related code base.

@given(st.integers(0))
@settings(max_examples=100000)
def test_add256_unbounded_inputs_identity(a):
    """Test identity property for add256: a + 0 = a."""
    assert add256(a, 0) == a
    assert add256(0, a) == a

@given(st.integers(0), st.integers(0))
@settings(max_examples=100000)
def test_add256_unbounded_inputs_commutativity(a, b):
    """Test commutativity property for add256: a + b = b + a."""
    assert add256(a, b) == add256(b, a)


@given(
    st.integers(0),
    st.integers(0),
    st.integers(0)
)
@settings(max_examples=100000)
def test_add256_unbounded_inputs_associativity(a, b, c):
    """Test associativity property for add256: (a + b) + c = a + (b + c)."""
    assert add256(add256(a, b), c) == add256(a, add256(b, c))

This time, the adversarial developer gets away, all tests pass:

$ poetry run pytest tests/test_add.py --verbose -k \
  "test_add256_unbounded_inputs_identity or test_add256_unbounded_inputs_commutativity or test_add256_unbounded_inputs_associativity" 
...
tests/test_add.py::test_add256_unbounded_inputs_identity PASSED                [ 33%]
tests/test_add.py::test_add256_unbounded_inputs_commutativity PASSED           [ 66%]
tests/test_add.py::test_add256_unbounded_inputs_associativity PASSED           [100%]

Why? Hypothesis does not try $2^{256}$ as a magic number. I gave it the budget of 100,000 examples, so it had a chance to try multiple powers of two, but it did not try anything above $2^{256} - 1$.

4.4. Catching the adversarial developer with Apalache

Here is how we modify the TLA+ specification to use Add256:

Add256(a, b) == (a + b) % (2^256)

InitNat ==
    /\ x \in Nat
    /\ y \in Nat
    /\ z \in Nat

Inv256 ==
    /\ Identity(Add256)
    /\ Commutativity(Add256)
    /\ Associativity(Add256)

Apalache immediately finds the issue with identity when we run it with Add256:

$ docker run --rm -v `pwd`:/var/apalache ghcr.io/apalache-mc/apalache \
  check --length=0 --init=InitNat --inv=Inv256 Add.tla
...
State 0: state invariant 0 violated.
Total time: 2.272 sec

If we check the counterexample, we see that the solver picks quite a large number for x:

$ head -n 14 _apalache-out/Add.tla/2025-12-22T15-20-31_8166638658721555415/violation.tla
---------------------------- MODULE counterexample ----------------------------
EXTENDS Add

(* Constant initialization state *)
ConstInit == TRUE

(* Initial state [_transition(0)] *)
State0 ==
  x
      = 115792089237316195423570985008687907853269984665640564039457584007913129639936
    /\ y = 0
    /\ z = 0

This is not just luck and not an heuristic! Apalache delegates solving to the SMT solver Z3, which solves integer constraints. If you want to make sure that it’s not using magic numbers, go and change the modulo operator in Add256 to a large prime number, e.g., $2^{256} + 297$. Rerun the model checker, and it will still find the issue with identity.

4.5. For the curious: how Apalache and Z3 work together

Our example is so simple that we can even go over the actual SMT constraints that Apalache generates. Let’s run Apalache with the option --debug:

$ docker run --rm -v `pwd`:/var/apalache ghcr.io/apalache-mc/apalache \
  check --debug \
  --length=0 --init=InitNat --inv=Inv256 Add.tla

Open the file _apalache-out/Add.tla/2025-12-23T09-31-46_16436938462355564409/log0.smt (the timestamp will be different on your machine). The log is pretty verbose. Here are the crucial parts, which I’ve accompanied with explanations:

-- the initial value of `x`
(declare-const $C$6 Int)
-- `x` is a natural number
(assert (>= $C$6 0))
-- introduce a boolean variable for the identity property
(declare-const $C$9 Bool)
-- Encode the identity property for `Add256` and `x`.
-- The huge number is 2^256.
(assert (= $C$9
   (= (mod $C$6
           115792089237316195423570985008687907853269984665640564039457584007913129639936)
      $C$6)))
-- assert that the identity property is violated
(declare-const $C$10 Bool)
(assert (= (not $C$10) $C$9))
(assert $C$10)
-- check, whether the above constraints have a solution
(check-sat)

If you want to understand what is going on, read the comments above. At first, I was actually surprised that the SMT constraints did not contain the addition at all. Then I recalled that Apalache has a bunch of rewriting rules that simplify the constraints. In this case, the symbolic model checker has applied the property a + 0 = a internally to get rid of the addition (yeah, it is the identity property!). It was an equivalent transformation, so we are still left with the modulo operator.

Essentially, we are asking Z3 to solve these inequalities over integers:

\[\left\{ \begin{aligned} x &\ge 0\\ x &\pmod{2^{256}} \neq 0 \end{aligned} \right.\]

What is crucial here is that the SMT solver Z3 has the following contract. When we give it a set of constraints and ask it to check their satisfiability, it will apply sound algorithms to arrive at one of the three answers:

  • sat: there is a solution to the above constraints, i.e., an assignment of values to the variables that makes all the constraints true,

  • unsat: there is no solution, and

  • unknown: the constraints are too hard, or it took the solver too long to solve them.

In contrast to PBT, it is not just like “I tried a few random inputs and did not find a bug”. If Z3 answers sat, there is indeed a solution to the constraints, and the solver gives it to us as a model. If it returns unsat, there is no solution. Whenever you see unknown, it’s a bad day. Sometimes, it also indicates a bug in the solver itself, as I’ve found a few times. However, Z3 is pretty reliable in my experience, and producing an unknown is an achievement, unless you set very tight timeouts.

If you are further interested in how Z3 actually solved the above constraints, a simple answer is that it used something like the Simplex algorithm for integer linear programming. The constraints are linear in our case, so Z3 could apply this algorithm to find a solution. Most likely, Z3 used a more recent algorithm, but the idea is similar.

In case you really want to know how SMT solvers work under the hood, I recommend starting with the book on Decision Procedures by Kroening and Strichman.

5. Conclusions

We all have to learn how to write high-quality properties and understand the boundaries of the “magic” tools. Even if you don’t use LLMs, your peers will.

How to learn writing good properties? We can play with property-based testing. However, PBT tools are not reliable teachers. By their random nature, a PBT tool may miss a bug on one run and find it on another run. Don’t get me wrong. Property-based testing has its value, as many other testing and verification techniques. However, PBT is not a silver bullet. It may miss bugs, especially if the input generator does not cover the right input space well enough.

Shall we use interactive provers like Lean and Rocq? Learning how to prove code correct definitely helps! However, these tools only tell us that the proof does not go through. It would not give us a counterexample. State-of-the-art provers also recommend using PBT for bug finding.

In my opinion, model checking is the best way to learn how to write good properties. You can write as many properties as you like, and the model checker will produce you counterexamples, or not. Importantly, model checkers come with a guarantee of not having a bug in their search scope, if they terminate. See my blog post on the value of model checking on that.

Usually, I recommend people to start with TLC. It works by state enumeration and easy to understand. If your search scope is small, TLC is a good learning tool. In our example, the search scope is astronomical. In this case, Apalache is there to help.

By the way, our example was so simple, that we could encode it in Z3 directly via its python bindings. We could use other model checkers. If you do that, let me know!

6. Bonus: Hypothesis + Crosshair

Hypothesis offers an integration with Crosshair, which is a symbolic execution engine for Python using Z3. I did not explore this integration in depth. Claude told me that it is sufficient to just add this import to the test:

import hypothesis_crosshair_provider

Well, this did not help me to find the violation of identity for add256. If you know how to make it work, please let me know!

In addition, we can make sure that the identity test indeed fails when we a large value as an example:

@given(st.integers(0))
@settings(max_examples=100000)
 @example(2**256)  # This should fail!
def test_add256_unbounded_inputs_identity(a):
    """Test identity property for add256: a + 0 = a."""
    assert add256(a, 0) == a
    assert add256(0, a) == a
  1. I’ve never met Scott Wlaschin in real life, online or offline. I hope he would not mind me referring to him by his first name.