For almost a year, Replicated Security has been successfully replicating the multi-billion dollar economic security of Cosmos Hub to the Neutron and Stride chains. A potential hindrance of Replicated Security, known as unbonding pausing, is that, in exceptional cases, the unbonding of tokens on Cosmos Hub can take longer than the expected unbonding period of 21 days. Such potential unbonding delays concern delegators because they might not be able to retrieve their tokens on time.
At Informal Systems, we investigated whether we can improve Replicated Security by eliminating unbonding pausing to avoid potential unbonding delays. In this blog post, we show the surprising result that it is impossible to eliminate unbonding pausing without compromising the security of light clients and hence the security of the Interchain.
In what follows, we first provide a high-level overview of how light clients operate. We then present Replicated Security and its use of a specific type of packet, namely VSCMaturedPackets, that can cause unbonding pausing. Finally, we show that there is no way to remove VSCMaturedPackets from Replicated Security without compromising the security of light clients.
The Inter-Blockchain Communication (IBC) protocol is a general message-passing protocol that allows different chains to communicate with each other. IBC consists of multiple components such as light clients, connections, and channels (for more information, see the official page of IBC). In this section, we provide only a high-level overview of light clients, which constitute the core of the IBC protocol.
IBC light clients, which we refer to simply as light clients, allow one chain A to send and receive messages to and from another chain B. Chain A can use a light client to verify that a message claimed to come from chain B indeed stems from chain B. Because of this, light clients form the basis on which IBC applications, such as token transfers, are built. Note that Replicated Security is itself an IBC application.
A light client on chain A consists of a bounded list of headers (from increasing but not necessarily contiguous heights) of chain B. We say that chain A tracks chain B because chain A keeps track of the headers of chain B. A header for a specific height H contains, among other things, the appHash (i.e., the hash of the root of all the stores of the chain) at height H, and the validator set of all the validators that signed the block with this appHash at height H. Note that in reality, the state of a light client is streamlined and is a list of consensus states (see ConsensusState), but for simplicity of presentation, we consider here that a light client consists of a list of headers. Additionally, the list of headers of a light client is bounded because older headers are pruned, as explained later on.
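To make this concrete, below is a minimal sketch in Go of what such a light-client state could look like. The types are hypothetical simplifications for illustration only, not the actual ibc-go data structures.

```go
// Hypothetical, simplified types for illustration; not the actual ibc-go types.
package lightclient

import "time"

type Validator struct {
	Address     string
	VotingPower int64
}

// Header is a simplified view of a block header of the tracked chain.
type Header struct {
	Height       uint64
	Time         time.Time   // BFT time of the block
	AppHash      []byte      // hash of the root of all the stores of the chain
	ValidatorSet []Validator // validators that signed the block at this height
}

// LightClient on chain A tracking chain B: a bounded list of headers from
// increasing (but not necessarily contiguous) heights; headers older than
// the trustingPeriod are pruned.
type LightClient struct {
	TrackedChainID string
	TrustingPeriod time.Duration
	Headers        []Header
}
```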
For example, Cosmos Hub has multiple light clients that track different chains. All those light clients reside in the state of Cosmos Hub. In the figure below we see an example where Cosmos Hub has a light client that tracks Stride, another light client that tracks Osmosis, etc. Creating a light client on Cosmos Hub is as simple as sending a transaction to the Cosmos Hub that provides an initial header through a MsgCreateClient message.
If we were to look inside a light client, for example the light client that tracks Stride, we can see that the client contains a list of headers as shown below.
Light clients are useful because we can use them to prove that something has happened on the chain they track. In the figure above, we have the header of Stride at height 3129 and hence we know the appHash of Stride at height 3129. Therefore, if a user wants to prove to Cosmos Hub that an action took place on Stride at height 3129 (e.g., tokens were escrowed, a message was sent, etc.), the user just sends the action and a Merkle proof to Cosmos Hub. We can look at the Merkle proof in combination with the action to verify that we end up with the same appHash as the one stored in the header at height 3129. If this is the case, then we can be certain that the action indeed took place on Stride. For simplicity, we omit here the off-by-one discrepancy between the appHash and the application state, namely that the appHash at height H + 1 refers to the application state after executing the transactions from the block at height H (see discussion).
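The sketch below illustrates the principle of such a proof check under a deliberately simplified Merkle-tree model (hypothetical types; real IBC uses ICS-23 commitment proofs over the chain's stores): hash the claimed action up a Merkle path and compare the result with the appHash stored in the header for that height.

```go
// Simplified illustration of proof verification against a stored appHash.
// Real IBC verification uses ICS-23 commitment proofs; this only shows the idea.
package lightclient

import (
	"bytes"
	"crypto/sha256"
)

// ProofStep is one level of a Merkle path: the sibling hash and its side.
type ProofStep struct {
	Sibling []byte
	Left    bool // true if the sibling is the left child at this level
}

// VerifyAction returns true if hashing the action up the proof path yields
// the appHash recorded in the light client's header for the given height.
func VerifyAction(action []byte, proof []ProofStep, appHash []byte) bool {
	leaf := sha256.Sum256(action)
	cur := leaf[:]
	for _, step := range proof {
		var combined []byte
		if step.Left {
			combined = append(append(combined, step.Sibling...), cur...)
		} else {
			combined = append(append(combined, cur...), step.Sibling...)
		}
		next := sha256.Sum256(combined)
		cur = next[:]
	}
	return bytes.Equal(cur, appHash)
}
```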
It should be apparent that light clients can be used to form the basis of communication between chains. For example, for two chains A and B to communicate with each other, we can set up a light client on chain A that tracks chain B and a light client on chain B that tracks chain A. Now, if chain A wants to send a message m to chain B, chain A just has to write m in its store and send a proof to chain B. Because chains do not have the ability to actually communicate directly with each other, communication is done through the use of specific messages, for example MsgUpdateClient and MsgReceivePacket. Those messages can be sent by any user to a chain but are usually sent by off-chain processes, called relayers, that connect to multiple chains and monitor their communication. A state-of-the-art IBC relayer is Hermes. When chain A wants to send a message to another chain B, a relayer picks this up and sends transactions with messages, such as MsgReceivePacket, over to chain B. Whenever we mention “user”, we mean a “relayer” in practical terms.
A careful reader might wonder how light clients get updated. For example, if a malicious user could update the Stride light client on Cosmos Hub with a bogus appHash, then that user could alter reality as seen by Cosmos Hub and “convince” the hub of erroneous things, such as that an account on Stride has more tokens than it has in reality.
In practice, updating a client entails appending one header to the list of headers of the client. To append a header, we have to verify that the new header is indeed a valid and not a bogus header. To do this, we have to make two assumptions: i) that all headers contained in the light client so far are valid, and ii) that at most ⅓ of the validators can be malicious; a classic Byzantine Fault Tolerance (BFT) assumption. Technically, the assumption is that at most ⅓ of the voting power is malicious, but for the sake of clarity we simply assume that at most ⅓ of the validators can be malicious.
Now, for a user to update a client with a new header \(H_n\), the user has to also provide a trusted header \(H_t\) that already exists in the light client. The light client only gets updated with \(H_n\) if the intersection \(validatorSet(H_n) \cap validatorSet(H_t)\) contains more than ⅓ of the validators. If this is the case, then we know that at least one correct validator that signed \(H_n\) also signed \(H_t\), and therefore \(H_n\) is also a valid header. In practice, to update a light client, someone has to send a MsgUpdateClient message to the chain; this message contains the client to be updated, the new header to be added, and the trusted header.
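Following the simplification in the text (counting validators rather than voting power), a minimal sketch of this check could look as follows. The types and the CanUpdate function are hypothetical illustrations, not the actual ibc-go light-client code.

```go
// Simplified update check: accept a new header only if enough validators of a
// trusted header also signed it. Counting validators instead of voting power,
// as in the text's simplification.
package lightclient

type Header struct {
	Height  uint64
	Signers []string // addresses of the validators that signed this header
}

// CanUpdate returns true if strictly more than trustLevel (e.g., 1/3) of the
// trusted header's validators also signed the new header.
func CanUpdate(trusted, newHeader Header, trustLevelNum, trustLevelDen int) bool {
	trustedSigners := make(map[string]bool, len(trusted.Signers))
	for _, v := range trusted.Signers {
		trustedSigners[v] = true
	}
	overlap := 0
	for _, v := range newHeader.Signers {
		if trustedSigners[v] {
			overlap++
		}
	}
	return overlap*trustLevelDen > len(trusted.Signers)*trustLevelNum
}
```

For example, CanUpdate(trusted, newHeader, 1, 3) implements the more-than-⅓ intersection rule described above.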
Note that in order to update a light client, a user can provide as a trusted header any header that is contained in the light client and not necessarily the latest one, something known as skipping verification. For example, in the figure below, we update the light client that tracks Stride with a new header from Stride at height 3378. When updating the light client we can say that we trust the header at height 89, and if the intersection \(validatorSet(H_{3378}) \cap validatorSet(H_{89})\) contains more than ⅓ of the validators, the header at height 3378 is included.
After the light client gets updated we would be in the following state.
Skipping verification is efficient because we do not need to store a header at height X - 1 to verify a header at height X but can use any header stored in the light client. Note that whether we update a light client with a ⅓ intersection or more depends on the trustLevel we set when we first create the light client (MsgCreateClient).
As mentioned earlier, we made two assumptions. The first is that all headers contained in the light client so far are valid. We can guarantee that this assumption holds when we first create a client using the MsgCreateClient message by providing a valid header in this message. We need some initial proof of trust for this, but afterwards, by induction, if every header we add is valid then all headers in the light client would be valid (with a caveat – see later on trustingPeriod). For the second (BFT) assumption to hold, we need some deterrents so that validators do not act maliciously. Those deterrents in proof-of-stake chains include: i) slashing validators that misbehave (e.g., double sign) by burning their staked tokens, and ii) the concept of token toxicity.
We can see an example where an IBC action takes place: a user first sends a MsgUpdateClient message to update a light client with a new header, and then the user sends a MsgReceivePacket message that contains a Merkle proof to prove to the chain that an action took place on the tracked chain.
A natural question that might arise is for how long we keep the headers stored in a light client. The headers of a light client are pruned after a period of time, named trustingPeriod. The trustingPeriod of a light client should be less than the unbonding period of the chain the client tracks and usually, although not necessarily, trustingPeriod is set to 2/3 of the unbonding period. The reason for this is the following: Assume a scenario where a set \(V_m\) of more than ⅓ malicious validators sign a header on the third of January and update the light client; then \(V_m\) unbond all their tokens and wait for the unbonding period to pass in order to receive their bonded tokens back. After the malicious validators \(V_m\) receive their tokens back, \(V_m\) updates the light client using as trusted header the one from the third of January. Suppose this header is still trusted, i.e., trustingPeriod > unbonding period. Naturally, there is a more than ⅓ intersection between the validator sets of the two headers because they were both signed by \(V_m\), and \(V_m\) consists of more than ⅓ of the validators. Because of this, we just added a bogus header to the light client.
For this reason, it is imperative that the trustingPeriod of a light client be less than the unbondingPeriod of the chain that the client tracks. This way, if some validators attempt to introduce a bogus header to the light client, their tokens will still be staked or not yet fully unbonded, and we would be able to punish such validators by slashing their tokens.
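A minimal sketch of these two rules, again with hypothetical types: the trustingPeriod must be strictly less than the unbonding period of the tracked chain, and headers older than the trustingPeriod are pruned and can no longer serve as trusted headers.

```go
// Simplified illustration of trustingPeriod handling.
package lightclient

import "time"

type Header struct {
	Height uint64
	Time   time.Time // BFT time of the block
}

type LightClient struct {
	TrustingPeriod time.Duration
	Headers        []Header
}

// ValidTrustingPeriod checks the rule trustingPeriod < unbondingPeriod of the
// tracked chain (commonly set to 2/3 of it).
func ValidTrustingPeriod(trustingPeriod, unbondingPeriod time.Duration) bool {
	return trustingPeriod < unbondingPeriod
}

// Prune drops headers whose age exceeds the trustingPeriod, so they can no
// longer be used as trusted headers for future updates.
func (lc *LightClient) Prune(now time.Time) {
	kept := lc.Headers[:0]
	for _, h := range lc.Headers {
		if now.Sub(h.Time) <= lc.TrustingPeriod {
			kept = append(kept, h)
		}
	}
	lc.Headers = kept
}
```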
Curious readers can dive deeper into the topic of light clients here.
Replicated Security allows one chain to provide its economic security to other chains. A specific instantiation of Replicated Security launched on Cosmos Hub and has successfully been securing the Neutron chain since May of 2023 and the Stride chain since June of 2023.
Through Replicated Security, Cosmos Hub (known as the provider chain) can provide its multi-billion dollar economic security to other chains (known as consumer chains) by replicating its validator set to the other chains. In practice, this means that the Cosmos Hub validator set is used to validate transactions on the consumer chains. If a validator misbehaves (e.g., double votes) on a consumer chain, then this validator gets slashed on the provider chain. Effectively, this means that the consumer chain receives the same economic security of billions of dollars of staked ATOM as the provider chain. Additionally, this way, consumer chains can get security without having to bootstrap their own validator sets. In return for its validation work, the Cosmos Hub validator set receives rewards sent from the consumer chains (e.g., Stride tokens).
The idea behind Replicated Security is rather simple. At a high level, Replicated Security works as follows: whenever a validator-set change takes place (e.g., undelegations, redelegations, etc.) on the provider chain, the provider chain sends a validator-set-change packet (VSCPacket) to all the consumer chains that it secures. This VSCPacket contains the updated validator set of the provider chain. When a consumer chain receives a VSCPacket, it applies the changes to its validator set (through the EndBlock of CometBFT). This way, the provider replicates its validator set, and hence its security, to the consumer chain. We depict the VSCPackets in the figure below.
After a consumer chain receives a VSCPacket, the consumer chain waits for an unbonding period and then sends back a VSCMaturedPacket to the provider chain. The provider chain can only unbond tokens after it has received a VSCMaturedPacket from all of its consumer chains as seen in the figure below.
The VSCMaturedPacket was introduced in the protocol to ensure the security of the consumer chain: if a malicious validator misbehaves on the consumer chain, the tokens of the malicious validator would not yet have unbonded on the provider chain, and hence the malicious validator can still be slashed.
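To illustrate the consumer-side mechanics just described, here is a minimal sketch in Go. The types and function names are hypothetical, not the actual interchain-security implementation: the consumer applies the validator updates carried by a VSCPacket and, once its unbonding period has elapsed, sends back the corresponding VSCMaturedPacket.

```go
// Hypothetical, simplified consumer-side logic for illustration only.
package consumer

import "time"

type ValidatorUpdate struct {
	PubKey string
	Power  int64 // 0 removes the validator
}

type VSCPacket struct {
	ValsetUpdateID uint64
	Updates        []ValidatorUpdate
}

type VSCMaturedPacket struct {
	ValsetUpdateID uint64
}

type Consumer struct {
	UnbondingPeriod time.Duration
	Validators      map[string]int64     // pubkey -> power
	receivedAt      map[uint64]time.Time // ValsetUpdateID -> receipt time
}

// OnRecvVSCPacket applies the provider's validator-set changes (these updates
// are handed to CometBFT in EndBlock) and records when the packet arrived.
func (c *Consumer) OnRecvVSCPacket(p VSCPacket, now time.Time) []ValidatorUpdate {
	for _, u := range p.Updates {
		if u.Power == 0 {
			delete(c.Validators, u.PubKey)
		} else {
			c.Validators[u.PubKey] = u.Power
		}
	}
	if c.receivedAt == nil {
		c.receivedAt = make(map[uint64]time.Time)
	}
	c.receivedAt[p.ValsetUpdateID] = now
	return p.Updates
}

// MaturedPackets returns the VSCMaturedPackets to send back to the provider:
// one for every VSCPacket received at least one unbonding period ago.
func (c *Consumer) MaturedPackets(now time.Time) []VSCMaturedPacket {
	var out []VSCMaturedPacket
	for id, t := range c.receivedAt {
		if now.Sub(t) >= c.UnbondingPeriod {
			out = append(out, VSCMaturedPacket{ValsetUpdateID: id})
			delete(c.receivedAt, id)
		}
	}
	return out
}
```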
Because of VSCMaturedPackets, the unbonding of tokens on Cosmos Hub could be delayed by more than the usual 21 days. This delay is called unbonding pausing. This can happen if, for example, a VSCMaturedPacket from a consumer chain is delayed, for various reasons, such as the consumer chain being down, relayers being down, etc. Although rare in practice, unbonding pausing has occurred in the past.
Note that Replicated Security contains many other components such as reward distribution, consumer-key assignments, etc. that we do not describe here because they are not necessary for understanding the inherent limitation of unbonding pausing.
There are two reasons why we want to remove VSCMaturedPackets from Replicated Security. One is to eliminate unbonding pausing: although it can only occur in exceptional circumstances, it would be better if it could never occur. Additionally, the use of VSCMaturedPackets substantially complicates the protocol and its implementation. As a result, if we were to remove VSCMaturedPackets, we would be able to eliminate thousands of lines of code, ending up with a simpler protocol, and we would be better equipped when working on new protocols, such as Partial Set Security.
In our investigation we found that there is no way to remove VSCMaturedPackets without compromising security. Specifically, we argue that there is no way to remove VSCMaturedPackets without compromising the security of light clients connected to consumer chains.
We first describe a useful inequality for thinking about the problem of removing VSCMaturedPackets and then present some assumptions we can make so that the inequality holds. Finally, we present a realistic example of a potential attack in a setting where we do not have VSCMaturedPackets.
A nice way to think about the VSCMaturedPackets-removal problem is through the following inequality (for simplicity in what follows we only consider one consumer chain):
DeliveryTimeOfVSCPacket + ConsumerTrustingPeriod + EvidenceSubmissionPeriod <= ProviderUnbondingPeriod
where:
DeliveryTimeOfVSCPacket is the time it takes for a VSCPacket to be delivered to the consumer chain from the point the corresponding validator-set updates were generated on the provider chain.
ConsumerTrustingPeriod usually corresponds to ⅔ of the ConsumerUnbondingPeriod. Naturally this is light-client specific but let us assume for simplicity that all consumer light clients have the same trusting period.
EvidenceSubmissionPeriod corresponds to the period of time we would like to have to be able to act on evidence. For example, the Cosmos Hub has an unbonding period of 21 days and Cosmos Hub light clients typically have a trusting period of 14 days, which means that we get 7 days to submit and act on light-client-attack evidence.
ProviderUnbondingPeriod is the unbonding period on the provider chain.
Below, we see a graphical depiction of those variables:
Note that we can slightly adjust EvidenceSubmissionPeriod and ConsumerTrustingPeriod but otherwise their values are rather predetermined. For instance, we would probably agree beforehand that we need at least X days of time to act on evidence and hence EvidenceSubmissionPeriod should be X. Also, reducing the ConsumerTrustingPeriod would make consumer light clients more prone to expirations because it would be harder to guarantee that they are getting updated during their trustingPeriod. To reactivate an expired light client, a governance proposal is needed, making the reactivation of clients a time-consuming ordeal.
In contrast, we have ample wiggle room to adjust ProviderUnbondingPeriod and DeliveryTimeOfVSCPacket.
The inequality must hold; otherwise, the security of 3rd-party-chain light clients connected to a consumer chain would get compromised. We can see this with a couple of examples in which we have fixed ProviderUnbondingPeriod = 21 days, ConsumerTrustingPeriod = 10 days, and EvidenceSubmissionPeriod = 5 days.
In the following figure, we see that in case DeliveryTimeOfVSCPacket = 8 days, the effective time we actually have to slash a malicious validator on a light-client attack (red-shaded area) is just 3 days (Days 19, 20, and 21), compared to the 5 days (= EvidenceSubmissionPeriod) we want. The green-shaded area corresponds to the trustingPeriod of light client LC.
If DeliveryTimeOfVSCPacket is even longer (e.g., 13 days), then we do not have any time to slash for a potential light-client attack, and the light clients are not secure. This could happen if a malicious validator V with > ⅓ voting power unbonds on Day 1 on the provider, issues a light-client update for a 3rd-party-chain light client LC on Day 12, and then, after the unbonding has completed on Day 21, issues another light-client update on Day 22 when V has fully unbonded. Because the trustingPeriod of LC is 10 days, validator V can successfully issue a bogus client update on Day 22 without getting slashed.
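A small sketch that plugs the numbers from these two examples into the inequality; the variable names simply mirror the terms defined above.

```go
// Check the inequality with the example numbers: the time left between the
// end of the consumer trustingPeriod (as seen through a 3rd-party light
// client) and the end of the provider unbonding period must be at least the
// EvidenceSubmissionPeriod.
package main

import "fmt"

func main() {
	const (
		providerUnbondingPeriod  = 21 // days
		consumerTrustingPeriod   = 10 // days
		evidenceSubmissionPeriod = 5  // days
	)
	for _, deliveryTimeOfVSCPacket := range []int{8, 13} {
		timeLeft := providerUnbondingPeriod - deliveryTimeOfVSCPacket - consumerTrustingPeriod
		fmt.Printf("DeliveryTimeOfVSCPacket = %2d days -> %d days left to act on evidence (want %d): inequality holds: %v\n",
			deliveryTimeOfVSCPacket, timeLeft, evidenceSubmissionPeriod,
			timeLeft >= evidenceSubmissionPeriod)
	}
}
```

With an 8-day delivery, only 3 days remain (instead of the desired 5); with a 13-day delivery, no time remains at all.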
We describe two approaches that guarantee that the inequality holds.
Replicated Security: If ProviderUnbondingPeriod is fixed, then a long DeliveryTimeOfVSCPacket would violate the inequality. The current Replicated Security approach states that if DeliveryTimeOfVSCPacket takes too long, then we increase ProviderUnbondingPeriod and hence we have “unbonding pausing.” This way the inequality remains true. Naturally, in a world where we want to remove VSCMaturedPackets because we want to eliminate “unbonding pausing,” any solution that increases ProviderUnbondingPeriod is unsatisfactory. This is because such a “solution” enforces an additional delay on all the unbondings on the provider chain by default and hence is not better than the current approach where, in some exceptional cases, some unbondings might get delayed.
Bound DeliveryTimeOfVSCPacket: Another proposal, made by AiB, is that when DeliveryTimeOfVSCPacket is too long (e.g., 8 days), the consumer chain would halt. By doing this, we satisfy the above inequality, in the sense that we put an upper limit on what DeliveryTimeOfVSCPacket can be.
The problem with this approach is what happens if DeliveryTimeOfVSCPacket takes longer than expected. In such a scenario, 3rd-party-chain light clients connected to the consumer chain would not be secure, as previously shown. Finally, note that halting a chain does not do much to protect light clients connected to this chain. A malicious validator can keep sending bogus updates to a light client tracking that chain even though the chain might have halted or might not be accepting user-generated transactions.
Naturally, whether the inequality holds or not depends on the protocol and what assumptions we make. In what follows, we describe assumptions starting from the stronger ones.
The stronger the assumptions are, the easier it is to design a protocol that works under such assumptions. However, such a protocol will not be as reliable as one that is designed under weaker assumptions. That is because, when those strong assumptions get violated, the protocol does not behave as expected. For instance, assume a setting, let us call it normal operation, where the following assumptions hold:
the time for a message to be delivered from a provider to a consumer chain is bounded (e.g., always less than 2 days);
there is no significant (e.g., < 60s) clock drift between the provider and the consumer chain when referring to their BFT times;
chains are live and keep producing blocks at regular intervals (e.g., every 10 seconds).
Effectively, the above assumptions combined bound DeliveryTimeOfVSCPacket. For example, if there is a clock drift and the consumer chain is lagging behind (e.g., 10 days behind), then this is perceived as being equivalent to a long delay in the delivery of a VSCPacket. Similarly, if the consumer chain is delayed and produces one block every 5 days, then the delivery of any VSCPacket is effectively delayed, etc.
In such a normal operation, we could simply remove VSCMaturedPackets from Replicated Security without any issue. Similarly, the proposal made by AiB would work fine as well. However, such a protocol would break if one of the above assumptions gets violated.
We do not believe that those assumptions, at least the first one, are reasonable and we are not really interested in a solution that “fails” when those assumptions do not hold anymore.
As a matter of fact, Replicated Security works under a weaker set of assumptions where we assume that the time it takes for a message to be delivered from the provider to the consumer is unbounded and that chains can halt for long periods of time.
We present a potential real-world profitable attack in a setting where we do not have VSCMaturedPackets. In this attack, an attacker can convince the Kujira chain that it has an arbitrary number of Neutron tokens (NTRNs) that the attacker then swaps with other tokens on Kujira and therefore makes money out of thin air.
For this attack to take place, we consider the Neutron consumer chain and the light client on Kujira that tracks Neutron. Specifically, the light client with clientId 07-tendermint-112, which from now on we refer to as the Kujira client. This Kujira client has a trustingPeriod of 17 days (i.e., 1468800s) and a trustLevel of ⅓ (i.e., to verify a light-client header, this client only needs an intersection of more than ⅓ of the voting power with a trusted header).
Additionally, for this attack to be possible, we consider a validator V with > ⅓ voting power on Cosmos Hub. The same attack could be performed by a set of validators that have >⅓ accumulated voting power but for simplicity we only consider a single malicious validator V with > ⅓ voting power.
We furthermore assume that Neutron halts for 5 days. It is not unrealistic for a consumer chain to halt for a prolonged period of time and Neutron has halted for ~9 hours in the past (see BFT time difference between blocks 1909053 and 1909054).
Now consider that when Neutron halts, validator V immediately unbonds all its stake on the Cosmos Hub. Additionally, assume that when Neutron is back on Day 6, it decides and executes a block X that does not yet contain the VSCPacket informing Neutron that V has unbonded. Only later, for example at block X + 1, does Neutron pick up this VSCPacket.
Now, the light client on Kujira that tracks Neutron could be updated with the header of block X, which still contains validator V in its validator set. Later, on Day 22 when V has unbonded everything, validator V can create a bogus light-client update using as trustedHeader that of block height X and update the Kujira client. This client update would get accepted because the trustingPeriod since block X has not expired yet. This can be seen in the diagram above, where the green-shaded area corresponds to the trusting period.
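A small sketch that checks the attack timeline with the numbers from the text (Day 1 unbonding, a 21-day provider unbonding period, a trusted header from Day 6, and a Kujira-client trustingPeriod of 17 days):

```go
// Verify the attack window: V's unbonding completes while the header of
// block X (which still includes V) is within the Kujira client's trustingPeriod.
package main

import "fmt"

func main() {
	const (
		unbondStartDay          = 1  // V unbonds when Neutron halts
		providerUnbondingPeriod = 21 // days
		blockXDay               = 6  // Neutron is back; block X still includes V
		kujiraTrustingPeriod    = 17 // days
	)
	unbondedDay := unbondStartDay + providerUnbondingPeriod // V can no longer be slashed
	trustedUntilDay := blockXDay + kujiraTrustingPeriod     // header of block X is still trusted

	fmt.Printf("V fully unbonded on Day %d; header of block X trusted until Day %d\n",
		unbondedDay, trustedUntilDay)
	// If V is fully unbonded while block X's header is still trusted,
	// V can push a bogus update to the Kujira client without risking a slash.
	fmt.Println("attack window exists:", unbondedDay < trustedUntilDay)
}
```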
By being able to send bogus light-client updates to the Kujira client, V can now pretend it just sent an arbitrary amount of NTRNs to Kujira by first sending a bogus MsgUpdateClient that contains a bogus appHash and then a MsgRecvPacket that contains a “proof” that an arbitrary number of NTRNs was sent over to Kujira. At this point, someone could swap the NTRNs with other tokens on Kujira.
For the above scenario to take place, a malicious validator would need to start unbonding when Neutron halts, but at this stage, such a validator cannot possibly know how long Neutron will be down. For example, if Neutron is only down for 10 hours, then the aforementioned attack on the Kujira client cannot take place.
Note, however, that a malicious validator with > ⅓ voting power can perform a similar kind of attack by unbonding all its tokens on Cosmos Hub and, at the same time, censoring all VSCPackets that are sent to Neutron for 5 days. After 5 days, it stops censoring VSCPackets.
In this blog post we gave a high-level overview of how IBC light clients operate. We presented Replicated Security and described unbonding pausing. Finally, we showed why we cannot eliminate unbonding pausing without compromising the security of light clients that are connected to consumer chains.
Thanks to Jae Kwon for first proposing to remove VSCMaturedPackets. Additionally, many thanks to Marius Poke, Josef Widder, Philip Offtermatt, Jehan Tremback, Aditya Sripal, Albert Le Batteux, and Thomas Bruyelle for all the constructive discussions related to unbonding pausing.