Informal Systems

2023-06-08

Finer-Grained Control Over Hermes' Performance with v1.5

Sean Chen • 2023-06-08

The newest major version of the Hermes relayer has been released! 🎉

Improving Hermes’ relaying performance was one of the main goals of this release. To that end, the Hermes team did a spectacular job triaging, profiling, ideating, implementing, and testing a number of different ways of improving Hermes’ performance.

As a result, relayer operators saw significant increases in the number of packets they were able to relay, especially on well-trafficked channels such as the Osmosis-Hub channel, through which many transactions are relayed.

New Performance-tuning Parameters

One realization that cropped up from talking to multiple relayer operators is that every blockchain network presents different characteristics that have an effect on relayer performance. Thus, there is no one-size-fits-all configuration that holds true in every relaying use-case; the relaying environment is a major consideration to take into account when tuning Hermes for relaying on a particular network.

Based on this, Hermes v1.5 introduces some additional configuration parameters that give relayer operators more fine-tuned control over their Hermes instances. The idea is that tuning these parameters will allow operators to further optimize Hermes for their particular relaying use-cases and performance needs.

Eliminating latency variability when broadcasting event batches

The first of these parameters is batch_delay. Before this parameter was introduced, batches of events were sent to the supervisor only when a NewBlock event was received. This introduced a variable amount of latency in the relaying process, as NewBlock events were not always received at consistent intervals. Hermes v1.5 changes this behavior so that event batches are sent after a specified delay as well as on NewBlock events. This eliminates the variability when sending event batches, as batches are now always broadcast within the batch_delay period at the latest.
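To make the batching rule concrete, here is a minimal Python sketch of the behavior described above. It is a simplified model and not Hermes' actual Rust implementation: the function name batch_events, the (timestamp, kind) event tuples, and the assumption that the delay is measured from the first event in a pending batch are all illustrative.

```python
def batch_events(events, batch_delay_ms):
    """Group (timestamp_ms, kind) events into batches.

    A pending batch is emitted when a NewBlock event arrives, or once
    batch_delay_ms has elapsed since the first event in that batch
    (assumed timing model, for illustration only).
    """
    batches = []
    pending = []
    deadline = None
    for ts, kind in events:
        # The delay timer would have fired before this event arrived,
        # so the pending batch has already been emitted.
        if pending and ts >= deadline:
            batches.append(pending)
            pending = []
        # The first event of a new batch starts the delay timer.
        if not pending:
            deadline = ts + batch_delay_ms
        pending.append((ts, kind))
        # A NewBlock event emits the batch immediately.
        if kind == "NewBlock":
            batches.append(pending)
            pending = []
    if pending:
        batches.append(pending)
    return batches


events = [(0, "SendPacket"), (100, "SendPacket"), (700, "SendPacket"), (900, "NewBlock")]
print(len(batch_events(events, 500)))   # 2
print(len(batch_events(events, 1000)))  # 1
```

In this model, the first two events no longer have to wait for the NewBlock at 900 ms when the delay is 500 ms; they go out in their own batch once the timer expires.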

One relayer operator who helped us test the effects of this parameter saw a pronounced increase in the number of packets they were able to relay on the Osmosis-Hub channel 141, one of the most well-trafficked channels in the ecosystem.

Decreasing the batch delay value, however, comes with the tradeoff of higher relaying costs in exchange for the benefit of faster event processing. This is because a lower delay value can cause events to be split across multiple batches, rather than including them all within a single batch. More batches being sent means more client updates will need to be performed, thus incurring additional costs that might not have been required if the events were batched together more efficiently.
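The cost side of this tradeoff can be sketched with a toy fee model. Everything here is an assumption made for illustration: the function name relaying_cost, the rule of exactly one client update per batch, and the fee values, which are arbitrary units rather than real gas prices.

```python
def relaying_cost(num_packets, num_batches, client_update_fee, packet_fee):
    """Total fee, assuming each batch requires exactly one client update."""
    return num_batches * client_update_fee + num_packets * packet_fee


# The same 12 packets, relayed in a single batch versus split across three.
single = relaying_cost(12, 1, client_update_fee=5, packet_fee=1)
split = relaying_cost(12, 3, client_update_fee=5, packet_fee=1)
print(single, split)  # 17 27
```

The packet fees are identical in both cases; the difference comes entirely from the two extra client updates.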

Currently, the default batch_delay is set to 500ms, which provides a good balance between speed and reliability, while still minimizing the number of client updates that need to be sent. In the case of this operator, they found that 250ms ended up striking the right balance between relaying latency and fees.

If you prioritize processing speed and can tolerate the potentially higher relaying costs, then setting a lower batch_delay value could be beneficial for your use case. In situations where relaying latency is not as important, such as if you’re running a backup relayer instance, then a higher batch_delay value would be more suitable.

Again, it should be noted that there is no magic number when it comes to batch_delay, or indeed any Hermes configuration parameter. It is important to consider the environment in which you are operating, and to experiment with these parameters to see which combination of values yields the performance you’re looking for.

Configuring whether a node should be trusted

Hermes v1.5 also introduces the ability to configure whether the full node that a Hermes instance is connected to should be trusted or not via the trusted_node parameter; by default, Hermes does not trust any full node it connects to. If a node is untrusted, that means the light client will perform an extra step to verify the headers included in the ClientUpdate message in order to ensure that they are valid.

This validation step can be skipped by setting trusted_node = true. This leads to faster processing of ClientUpdate messages, though it also comes with some downsides. Most notably, client updates may fail without the validation step. This risk is most prominent after any significant changes in validator sets. More specifically, if more than a third of the validators that validated the last trusted header have been swapped out, then client updates will very likely no longer be valid.
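The validator-set condition can be illustrated with a small Python sketch. This is a simplified model of a light client trust check, not Hermes' code: the function name has_sufficient_overlap and the validator names are hypothetical, and the two-thirds threshold is an assumption chosen to match the "more than a third swapped out" condition described above. Validators are weighted by voting power.

```python
from fractions import Fraction


def has_sufficient_overlap(trusted, signers, threshold=Fraction(2, 3)):
    """Check whether enough of the trusted voting power signed the new header.

    trusted: dict mapping validator name -> voting power at the last trusted header
    signers: set of validator names that signed the new header
    """
    total = sum(trusted.values())
    overlap = sum(power for name, power in trusted.items() if name in signers)
    # Cross-multiply to compare overlap / total > threshold without dividing.
    return overlap * threshold.denominator > total * threshold.numerator


trusted = {"val-a": 10, "val-b": 10, "val-c": 10, "val-d": 10}

# One of four validators replaced: 30 of 40 trusted power still signs.
print(has_sufficient_overlap(trusted, {"val-a", "val-b", "val-c", "val-e"}))  # True

# Two of four replaced: only 20 of 40 remains, so the update would be rejected.
print(has_sufficient_overlap(trusted, {"val-a", "val-b", "val-e", "val-f"}))  # False
```

With trusted_node = true, a check of this kind is not run by Hermes before submission, so an update like the second one would only fail once it reaches the chain.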

It should be noted that sending invalid client updates does not present a security vulnerability. The invalid updates will be reverted with future client updates, but this will cost additional fees, as well as incur additional latency on account of more transactions having to be sent.

Ultimately, we recommend treating nodes as untrusted unless you have full control of the full node that Hermes is connected to, in which case verifying headers is most likely redundant.

Improving Hermes’ packet clearing latency

We also uncovered a bug that occurred when Hermes would query the chain for packet acknowledgements, which happens whenever Hermes clears pending packets (both on-start, and periodically afterwards). Specifically, this query would oftentimes return so much data that it caused an error by overstepping the message decoding limit set by one of Hermes’ dependencies (4MB in this case).

The packet acknowledgement query asks the chain for an acknowledgement for each packet commitment on the counterparty chain. It turned out that in the case when there were no outstanding packet commitments, the query would actually return all packet acknowledgements on the chain. This was what was causing the query to return too much data. Even if this query did not end up overstepping the 4MB message decoding limit, it also resulted in slow start-up times, as Hermes would be fetching a bunch of unnecessary data from the chains.

The query now works correctly, i.e., if there are no outstanding packet commitments on the counterparty chain, then no packet acknowledgements are fetched. This has the effect of speeding up the packet clearing process, and thus the on-start scanning process when Hermes first boots up.
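The shape of this bug and its fix can be sketched in a few lines of Python. This is an illustration only, not the actual Hermes or chain code: the function names are hypothetical, and the idea that the chain-side query treats an empty list of sequences as "no filter" is an assumption about how such behavior can arise.

```python
def chain_query_acks(acks_on_chain, commitment_seqs):
    """Model of a query that treats an empty filter as 'no filter' (assumption)."""
    if not commitment_seqs:
        return dict(acks_on_chain)
    return {seq: ack for seq, ack in acks_on_chain.items() if seq in commitment_seqs}


def acks_to_clear(acks_on_chain, commitment_seqs):
    """Fixed behavior: with no outstanding commitments, fetch nothing."""
    if not commitment_seqs:
        return {}
    return chain_query_acks(acks_on_chain, commitment_seqs)


acks = {1: "ack-1", 2: "ack-2", 3: "ack-3"}
print(len(chain_query_acks(acks, [])))  # 3
print(len(acks_to_clear(acks, [])))     # 0
print(acks_to_clear(acks, [2]))         # {2: 'ack-2'}
```

On a busy channel the unfiltered result can be very large, which is how a response could exceed a decoding limit such as the 4MB one mentioned above.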

In Summary

Hermes v1.5 also includes a slew of other improvements to how it handles misbehavior evidence, as well as to its profiling capabilities. We won’t be going into details on those. If you’d like to learn more, please take a look at the Hermes v1.5 changelog for a comprehensive list of everything that is included in this exciting release.

If you’d like to keep up with the Hermes team’s work, please follow the project on GitHub, as well as Informal Systems’ Twitter, where we post updates on what every team, not just Hermes, is working on!