Informal Systems

2023-08-23

Stride Relaying Issue Post Mortem

Hermes Dev Team • 2023-08-23

On July 19th, the Hermes team noticed some chatter on Twitter from several relayer operators, such as IcyCRO, LavenderFive and GoldenStaking, mentioning that Hermes was being outpaced by Rly when relaying on Stride.

This intrigued us, because in the v1.5 release of Hermes we made some major performance improvements that greatly improved relaying on virtually all channels. So we decided to take a closer look. It turns out things weren’t quite what they seemed. We booted up some tooling to monitor transactions and packets sent to Stride. We used our ChainPulse tool, which collects information on IBC packets, the results of their execution, and who relayed them, and exposes it all as time series that Prometheus can collect. Combining that with a Grafana dashboard, we’ve got a full-on IBC network analysis station 😎

Between July 19th and August 14th, we identified 82 Hermes instances and 15 Rly instances relaying on Stride. Over that same period of time, a total of 2,008,945 IBC packets were sent to Stride by the combined relayer instances of Hermes and Rly.

But not every sent packet is equal! Sometimes relayers send redundant packets, because relaying is an off-chain race in the mempool. Redundant packets don’t have any effect on the state, so they’re just kind of wasteful. Out of those 2M sent packets during this time, it turned out that around 1.8M were actually redundant (had no effect!). From what we could tell, ~99% of the packets being relayed by Rly, and ~60% of the packets being relayed by Hermes, were redundant. 

⚠️ This means that ~90% of the IBC packets sent to Stride over that period of time were redundant.

💡 For comparison, on Osmosis, over the same period of time, only 0.2% of the packets sent to Osmosis were redundant.

Clearly something was up.
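As a quick sanity check on the ~90% figure above, the arithmetic can be sketched as follows. The total comes from the numbers quoted earlier; the redundant count is only the rounded "around 1.8M", so the result is approximate, and the helper name redundant_share is ours, purely for illustration.

```python
def redundant_share(redundant: int, total: int) -> float:
    """Fraction of sent packets that had no effect on chain state."""
    if total <= 0:
        raise ValueError("total must be positive")
    return redundant / total


total_sent = 2_008_945  # packets sent to Stride, July 19th to August 14th
redundant = 1_800_000   # "around 1.8M": a rounded figure, not an exact count

share = redundant_share(redundant, total_sent)
print(f"{share:.1%} of sent packets were redundant")
```

With these rounded inputs the share comes out just under 90%, which matches the headline number.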

The Underlying Issue

We began to investigate and quickly noticed an abnormally high number of redundant packets being sent to Stride. We got in touch with the team at Stride, and with the Strangelove team working on Rly, to share what we were seeing.

Ultimately we realized that txs containing only redundant packets were not being rejected by Stride with an "error 22: packet messages are redundant", as they should have been. We were able to confirm our hypothesis using an integration test, which showed that Stride v12 does not filter redundant IBC packets.

Meanwhile, we noticed this issue on the Stride repository, opened by Jorge Hernandez on May 29th: https://github.com/Stride-Labs/stride/issues/807

"Stride seems to be using the standard ante handler from cosmos sdk, which does not include the ibc ante handler which helps with removing redundant packets."

Aha! Indeed, the Stride app did not use the IBC-go ante handler, which is in charge of rejecting txs that contain only redundant packets with an error 22. Instead, it was using the stock Cosmos-SDK AnteHandler.

So this explained why relayers were submitting many more redundant packets than they would otherwise submit on Stride.
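To make the missing check concrete, here is a minimal, self-contained sketch of the kind of filtering the IBC ante handler performs. It is a toy model, not the IBC-go code: the Msg type, the message kind names and check_tx are hypothetical, and each packet message simply carries a flag saying whether it is redundant, whereas the real handler works that out from chain state. Letting a tx through when it also carries a non-IBC message is our reading of the IBC-go behaviour and should be treated as an assumption.

```python
from dataclasses import dataclass

# Hypothetical message kinds for this sketch; not the real IBC-go types.
PACKET_KINDS = {"recv_packet", "acknowledgement", "timeout"}
CLIENT_UPDATE = "update_client"


@dataclass
class Msg:
    kind: str
    redundant: bool = False  # True if another relayer already delivered this packet


class RedundantTxError(Exception):
    """Stands in for 'error 22: packet messages are redundant'."""


def check_tx(msgs: list[Msg]) -> bool:
    """Reject a tx whose packet messages are all redundant; accept otherwise."""
    packet_msgs = 0
    redundancies = 0
    for msg in msgs:
        if msg.kind in PACKET_KINDS:
            packet_msgs += 1
            if msg.redundant:
                redundancies += 1
        elif msg.kind != CLIENT_UPDATE:
            # Assumption: a tx that also carries a non-IBC message is let
            # through, so that message still gets processed.
            return True
    if packet_msgs > 0 and packet_msgs == redundancies:
        raise RedundantTxError("packet messages are redundant")
    return True


# A tx with at least one effective packet passes the check.
check_tx([Msg("update_client"), Msg("recv_packet", redundant=True), Msg("recv_packet")])

# A tx carrying only redundant packets is rejected before it reaches a block.
try:
    check_tx([Msg("update_client"), Msg("recv_packet", redundant=True)])
except RedundantTxError as err:
    print(f"rejected: {err}")
```

Without a check like this in the ante handler chain, the second tx above is accepted into the mempool and included in a block even though it changes nothing, which is what we were observing on Stride v12.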

A patch for this issue was submitted, and the fix has been included in Stride v13.0.0+. We re-targeted our integration test towards Stride v13 and were able to confirm that the bug has been fixed in Stride v13.0.0 and Stride v13.0.1. No more redundant IBC packets!

Stride v13 Upgrade

On August 19th, the Stride network underwent an upgrade to Stride v13.1.0, which includes the fix for the missing IBC ante handler described above.

Since then, the efficiency of both relayers has improved massively, as can be seen in the recent screenshots below from our ChainPulse tool:

image6.png

channel-0 <> Cosmos Hub (ICS-20 transfers)

image3.png

channel-5 <> Osmosis (ICS-20 transfers)

image2.png

channel-146 <> Cosmos Hub (Interchain Security)

As can be seen above, the efficiency of the relayers increased massively on these three high-traffic channels. That said, there appear to be some lingering issues with Rly on channel-146. But overall, the numbers already look much better across all three channels for both relayers.

Conclusion

What started as a claim that Rly was performing better than Hermes on Stride turned out to be the symptom of a bug that was causing ~99% of packets sent by Rly (and ~60% of packets sent by Hermes) to be redundant.

This investigation highlights the subtleties of IBC performance and the importance of precisely measuring what happens over the IBC network in order to explain high-level behaviour. We hope our ChainPulse tool will help the IBC community shed more light on interchain activity in order to simplify debugging, gain insights, and create a more transparent IBC ecosystem. See you out in the interchain 🫡