Interchain Security is a cornerstone protocol of the Cosmos Hub. We want to iterate on this protocol to add features, make it more useful and cheaper, and adapt it to changing requirements. However, it is critical that we do not compromise quality or robustness when we iterate, as the protocol is already in active use.
In the Hub team at Informal Systems, we are utilizing Model-Based Testing (MBT for short) to ensure this development is safe. MBT has successfully been used by other teams at Informal Systems before, for example in the context of Tendermint and IBC, as part of audits, and in the Atomkraft tool. Of course, the ideas themselves are even older.
The idea behind MBT is to make readable and verifiable reference implementations, called models, a first-class citizen of the development process. These models can augment plain-English specifications and make them useful artifacts instead of maintenance burdens. The advantage of models is that they are less detailed than the implementation, and thus easier to reason about and iterate on.
Our current understanding is that MBT is most useful for established and somewhat stable codebases, particularly when larger additions or adjustments become necessary.
Let’s outline a common situation for stable codebases, which we recently encountered while working on Interchain Security. The protocol was launched about a year ago and has been running in production without major issues, but we are now planning to make some major adjustments (like partial set security). This presents some challenges:
When the protocol was originally designed, the team wrote a large plain-English specification. The specification describes the protocol, outlines correctness properties (such as state invariants), and reasons about why the correctness properties are satisfied by the protocol.
This initial specification also includes quite a bit of pseudocode, though it isn’t detailed enough to directly translate to implementation code. With the specification in good shape, the team went on to implement the protocol. Over time, the ultimate source of truth shifted from the specification to the implementation - what’s in the code is how it’s supposed to work. As a project matures, the specification tends to lag behind when changes are made, becoming less truthful and less useful over time.
Eventually, the plain-English specification lost a lot of its initial utility, simply because it diverged too much from the implementation to be useful for most things. The pseudocode parts fall out of sync fastest, because they often try to be more detailed than the text around them, so even small changes to the implementation can make them obsolete - and unlike the actual code, there is no CI that tests the pseudocode. At some point, when a developer wants to understand how something actually works, they have to dig through the code, because the specification is outdated or simply wrong.

This is unfortunate, because the codebase is much harder to digest than the specification. For the code to work, it needs many details that don’t matter from a high-level point of view and don’t help in understanding the protocol itself. It is the difference between reading a hundred lines of English text and having to read thousands of lines of code.

When we want to make major changes to the protocol now, we have to understand whether the proposed changes affect any of the correctness properties that we initially assumed. This means we have to redo our reasoning for why these properties (still) hold. Usually, the only way to achieve this is to essentially recreate our specification so that it is up-to-date with the proposed changes. This is a lot of work, especially when the specification is already inconsistent with how things are actually implemented. And even once we have updated the specification, it quickly becomes outdated again as soon as we start actually implementing the changes.
Of course, one solution might be simply enforcing in our processes that the specification must be updated whenever the code changes. In fact, we already have such a policy, but it’s not very effective: it is easy to forget to update the specification, and nothing checks that the specification and implementation align, or even flags it when they don’t. In practice, this makes it very hard to keep the specification up-to-date.
In light of this, we would like to find ways to make it easier to align the specification and the implementation, and in particular to detect discrepancies between the two early. This is where MBT comes into play.
The cornerstone of MBT is the use of models. Models can be seen as high-level reference implementations. They are supposed to be easy to understand, and to abstract away a lot of the details that an actual implementation would need. The choice of language for the model is important - we should strongly prefer readability over performance. An example of a language that might be suited for this is Python (for example, see the Ethereum Execution Layer Specifications). In the Hub team, we instead chose Quint, a domain-specific language purpose-built for specifying distributed systems, which we prefer over a general-purpose programming language like Python due to Quint’s simpler semantics and higher level of abstraction. We believe Quint is readable, as well as reasonably concise. For a rough comparison: the implementation of Interchain Security in Golang has about 13k lines (excluding tests and automatically generated files); the plain-English specification has about 3.5k lines; the Quint model comes in at around 1.8k lines. This is obviously not an apples-to-apples comparison, but it shows that the reference model can be much more concise than either the implementation or the plain-English specification.
The model is supposed to partially replace, but also enhance, the English specification. For example, pseudocode in the plain-English docs should be linked to the code in the model. There are also approaches for interweaving the English and model code directly.
The huge advantage of the model over English text is that it is executable. This is very helpful while writing the model. It’s much easier to keep everything internally consistent when it’s possible to actually run and observe what happens. Despite this, the model remains relatively readable because we can omit unnecessary details by choosing the right level of abstraction. For example, we do not need to worry about how to interact with lower levels in the stack like the consensus layer when we are specifying our application logic.
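To make this concrete, here is a minimal, hypothetical sketch of what such an abstraction can look like in Quint. It is not taken from our actual model - the names (Validator, providerValidatorSet, VotingPowerChange) and the numbers are purely illustrative. Instead of modeling staking, delegation, and consensus in detail, the sketch simply lets the provider’s validator set change nondeterministically from step to step:

```quint
module abstraction_sketch {
  // Validators are identified by name; voting power is just an integer.
  type Validator = str

  // The provider chain's current validator set, mapping validators to power.
  var providerValidatorSet: Validator -> int

  // The validators that could ever be bonded (fixed for this sketch).
  pure val validators: Set[Validator] = Set("alice", "bob", "charlie")

  action init = all {
    providerValidatorSet' = Map("alice" -> 10, "bob" -> 5, "charlie" -> 1)
  }

  // Abstract away staking and consensus entirely: in one step, some
  // validator's voting power changes to an arbitrary value between 0 and 10.
  action VotingPowerChange = {
    nondet validator = validators.oneOf()
    nondet newPower = 0.to(10).oneOf()
    providerValidatorSet' = providerValidatorSet.put(validator, newPower)
  }

  action step = VotingPowerChange
}
```

With init and step defined at this level of abstraction, the model can already be executed, even though it never mentions the consensus layer or the staking module.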
The fact that the model is executable also helps ensure that the model and implementation do not diverge. We can generate possible executions of the model, and then check, for each model execution, whether the implementation visits the same states along the way. Quint provides a simulator that randomly generates such executions and outputs them in an easy-to-parse JSON format. Then we only need to write a driver, which reads these execution traces and executes them against the implementation. If the driver detects a difference between the model state and the actual execution, it flags it. Running this as part of our test suite allows us to detect any differences between the model and the implementation early.
Figure: A sketch of our MBT implementation.
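To illustrate what "an execution of the model" means, the following (again purely illustrative) snippet shows a single execution of the hypothetical sketch from above, written as a Quint run. The simulator generates many executions like this at random, and the driver replays the states they visit against the Go implementation:

```quint
// Inside the abstraction_sketch module from above: one concrete execution of
// the model, expressed as an initialization followed by two steps.
run exampleExecutionTest =
  init
    .then(VotingPowerChange)
    .then(VotingPowerChange)
```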
Even better, the execution traces serve as a test suite themselves, and can provide good coverage for very little code investment. In the case of Interchain Security, our driver has around 1.7k lines of code and provides roughly the same line coverage as our integration test suite with around 6k lines of code. (Each provides around 40% coverage; for context, all tests in the repository combined provide around 80% coverage.) Again, this is not an apples-to-apples comparison, but it shows that model-based testing is not only a way to ensure alignment between the model and implementation, but also a way to augment other testing efforts. I would not recommend relying solely on MBT for testing; instead, the extra test coverage is best seen as the proverbial cherry on top. MBT’s main advantage is ensuring compliance between the specification and the implementation in a way that other approaches have a hard time replicating.
Further, by choosing Quint as our language for the model, we get access to powerful tools for automated reasoning. We can formalize our correctness properties in Quint, and use model-checkers to automatically prove that they hold, or find counterexamples if they do not. Quint is integrated with Apalache, a symbolic bounded model checker, for this purpose.
Quint is designed to be verifiable, so it is much more promising to run a model checker on the higher-level Quint model than directly on the implementation. This is an advantage over languages that are meant for producing actual implementations, like Python or TypeScript. On the other hand, one of the design principles behind Quint is to use a syntax that intentionally stays close to mainstream programming languages where possible; in this way, Quint aims to improve adoption by engineers without a background in formal methods.
This is an example of a correctness property of Interchain Security, formalized in Quint:
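(The snippet below is a simplified, illustrative sketch rather than the property as it appears in our model; the state variables - runningConsumers, consumerValidatorSet, sentValidatorSets - are hypothetical names, and the real model tracks considerably more detail.)

```quint
module ccv_property_sketch {
  // A validator set maps validator names to voting power.
  type ValidatorSet = str -> int

  // Hypothetical state variables, heavily simplified:
  var runningConsumers: Set[str]                  // consumer chains that are currently running
  var consumerValidatorSet: str -> ValidatorSet   // the current validator set on each consumer
  var sentValidatorSets: str -> Set[ValidatorSet] // all validator sets the provider has sent to each consumer

  // Validator set replication: every running consumer chain is using a
  // validator set that the provider chain has sent to it at some point.
  val ValidatorSetReplication =
    runningConsumers.forall(consumer =>
      sentValidatorSets.get(consumer).contains(consumerValidatorSet.get(consumer))
    )
}
```

An invariant like this can be checked on every state the model checker explores; if it can be violated, Apalache produces a counterexample trace showing how.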
To conclude this section, model-checking and model-based testing are a very powerful combination. The process is as follows:
1. We have model-checked the model, and we are therefore sure that it satisfies the correctness properties.
2. We have tested that the model and the implementation align and behave equivalently on a large set of executions.
3. Therefore, we increase our confidence that the implementation also satisfies the correctness properties.
While working on Interchain Security, I have found the use of model-based testing highly promising, and I believe it will allow us to make adjustments to the protocol with greater confidence going forward. However, it is not without a cost in terms of time: it takes significant effort to create models. Still, writing specifications in plain English takes time, too, and I think the cost of creating a model can be amortized, as it is useful again and again when onboarding new developers and changing the protocol.

I don’t think MBT is as useful for projects that are still heavily in flux. Having models, especially ones in a verifiable language, is valuable when specifying a project at any stage of progress, but the effort of writing drivers and connecting the model and implementation might not be worth it while the implementation is still undergoing rapid changes. Instead, I think the sweet spot is projects that are established and stable enough that there is strong evidence the relatively heavy initial investment of writing drivers will amortize over time.
Divergence between the specification and implementation of critical systems is a major problem: it makes changes slow and painful, and contributes to the loss of knowledge over time. Model-based testing is a promising approach to keeping the specification and implementation aligned, and it can simultaneously improve test coverage for a relatively modest investment in terms of lines of code. It further enables the use of powerful automated reasoning techniques, which ordinarily have difficulty scaling to large and complex code bases.
If you want to learn more about the work we do on Model-Based Testing in the Hub team, or about the work we do on Quint at Informal Systems in general, please reach out - we’d love to connect!