This post goes through the architecture of version 1 of the Hermes relayer. We'll cover each of the main components and discuss how they fit into the larger whole. The goal of this post is to give you, the reader, a clear sense of all of the different concerns that the Hermes relayer juggles.
Ultimately, we want you to walk away from this post with:
a) an understanding of the considerations that went into the design of the Hermes relayer, and b) a sense of how the implementation addresses those considerations.
In other words, we're looking to answer questions like "Why was Hermes designed this way?" and "What role(s) does component X play in the overarching relayer flow?"
You can follow along by navigating to this diagram, which gives a visual representation of the flow outlined in this post.
While it is not strictly necessary, readers with a general familiarity with the IBC protocol and how it functions will gain a richer understanding of the Hermes relayer and the role that it plays in the larger context of IBC. The Cosmos Developer Portal has a great synopsis of the protocol.
It includes a useful overview of relaying in general for those looking to understand why relaying is integral to the IBC protocol. At its most succinct, the job of Hermes is to relay packets between IBC-enabled chains. But a lot of considerations go into how Hermes performs this job efficiently, from both a performance and a cost standpoint.
In addition, ICS 018 is the official IBC relayer specification, which Hermes implements. The relaying algorithm that it lays out is quite simplified, but it is another good primer that details the high-level concerns of relayer implementations.
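To give a feel for what that high-level algorithm boils down to, here is a minimal sketch in Rust. The trait and types are illustrative stand-ins only; they are neither ICS 018's interfaces nor Hermes' actual abstractions.

```rust
// Illustrative sketch of the naive relaying loop described in ICS 018.
// All names here are hypothetical.

/// A packet sent on the source chain but not yet received on the destination.
struct PendingPacket {
    sequence: u64,
    data: Vec<u8>,
}

/// Minimal view of a chain from the relayer's perspective.
trait Chain {
    /// Packets sent on this chain that the counterparty has not yet received.
    fn pending_packets(&self) -> Vec<PendingPacket>;
    /// Proof that the packet commitment exists on this chain.
    fn commitment_proof(&self, sequence: u64) -> Vec<u8>;
    /// Submit a MsgRecvPacket-style message to this chain.
    fn submit_recv(&mut self, packet: PendingPacket, proof: Vec<u8>);
}

/// One pass of the naive algorithm: relay everything pending from `src` to `dst`.
fn relay_once(src: &impl Chain, dst: &mut impl Chain) {
    for packet in src.pending_packets() {
        let proof = src.commitment_proof(packet.sequence);
        dst.submit_recv(packet, proof);
    }
}
```

Hermes' job is essentially this loop made robust, bidirectional, and efficient; the rest of the post looks at the machinery that achieves that.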
Let's dive under the hood and take a look at the first major component of the relayer pipeline.
The supervisor singleton instance sits at the "opening" of the relayer pipeline. It subscribes to a node of each network that the relayer relays between and listens for events from all of these networks. The supervisor receives event batches emitted via a WebSocket connection from a source chain as a single event stream.
The supervisor also buckets events into one of three WorkerCmd variants, depending on the nature of the event. These variants are IbcEvents, ClearPendingPackets, and NewBlock.
This approach means that the different components (i.e., individual workers) that handle different types of events don't each need to subscribe to the WebSocket interface themselves; the supervisor handles de-multiplexing the event stream into a set of smaller streams that each feed into one or more workers.
Because the supervisor is centralized, it is also responsible for reconnecting to the WebSocket if an error disrupts the event stream.
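As a rough sketch (with illustrative stand-ins for Hermes' actual event source, command, and channel types), the supervisor loop looks something like this: pull a batch, translate it into worker commands, fan those out, and fall back to reconnecting plus a packet-clear when the stream breaks.

```rust
use std::sync::mpsc::Sender;

// Illustrative stand-ins; Hermes' real event, error, and command types differ.
#[derive(Clone)]
enum Event { NewBlock(u64), Ibc(String) }

#[derive(Clone)]
enum WorkerCmd { NewBlock(u64), IbcEvents(Vec<Event>), ClearPendingPackets }

trait EventSource {
    fn next_batch(&mut self) -> Result<Vec<Event>, std::io::Error>;
    fn reconnect(&mut self);
}

/// Supervisor loop: pull event batches off the WebSocket subscription,
/// translate them into worker commands, and reconnect if the stream breaks.
fn supervise(mut source: impl EventSource, workers: &[Sender<WorkerCmd>]) {
    loop {
        let cmds: Vec<WorkerCmd> = match source.next_batch() {
            Ok(batch) => batch
                .into_iter()
                .map(|ev| match ev {
                    Event::NewBlock(height) => WorkerCmd::NewBlock(height),
                    ibc => WorkerCmd::IbcEvents(vec![ibc]),
                })
                .collect(),
            // The stream was disrupted: reconnect and ask workers to clear
            // any packets that may have been lost in the meantime.
            Err(_) => {
                source.reconnect();
                vec![WorkerCmd::ClearPendingPackets]
            }
        };
        // Fan each command out to every interested worker.
        for cmd in cmds {
            for worker in workers {
                let _ = worker.send(cmd.clone());
            }
        }
    }
}
```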
The commands that the supervisor outputs represent the input for the part of the relayer called the packet command worker. Each channel that Hermes relays on is served by a packet command worker thread. The main job of the packet command worker is to schedule the execution of relaying jobs.
Note: The concept of a relaying job is referred to in the code as 'operational data'; however, 'relaying job' reads a bit easier. We'll use both terms interchangeably as appropriate.
There are two main types of relaying jobs:
Packet clearing jobs are scheduled in response to the worker receiving a ClearPendingPackets or NewBlock command.
Live relaying jobs, which you might consider the 'main' tasks of the relayer, are scheduled in response to the worker receiving IbcEvents worker commands.
The rest of this section will cover the different variants of worker commands and how the packet command worker responds to each of them, the process of packet clearing and why it is important, as well as operational data and how it is constructed.
The possible WorkerCmd variants that the supervisor can output are:
IbcEvents: This is the stock command that wraps most incoming events. It signifies that the inner events are to be propagated further down the relayer pipeline to be processed.
ClearPendingPackets: The supervisor emits this command usually in response to something going wrong with the WebSocket connection. In this situation, packets that the relayer was expecting may have been en route but were lost due to the broken connection. As a result, the relayer optimistically clears any packets that were outstanding or in the middle of being processed. This has the effect of soft-resetting the state of the relayer pipeline.
NewBlock: This command signifies that a new block was committed on the source chain. In response to this, if the clear_interval config value has been set to some numeric value X (with X > 0), then packets are cleared every X new blocks that are committed. This process readies the internal state of the relayer to process events for the newly-committed blocks.
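Putting these together, the packet command worker's reaction to each command can be pictured as a match over the incoming variants. The sketch below is purely illustrative; the names are hypothetical, and the real Hermes handlers carry considerably more state.

```rust
// Illustrative sketch of how a packet command worker might react to the
// three WorkerCmd variants. Types and field names are hypothetical.
#[derive(Clone)]
enum WorkerCmd {
    IbcEvents { events: Vec<String> },
    NewBlock { height: u64 },
    ClearPendingPackets,
}

struct PacketCmdWorker {
    /// Mirrors the clear_interval config value; 0 disables interval clearing.
    clear_interval: u64,
}

impl PacketCmdWorker {
    fn handle(&mut self, cmd: WorkerCmd) {
        match cmd {
            // Live relaying: turn the events into operational data.
            WorkerCmd::IbcEvents { events } => self.schedule_live_relaying(events),
            // Interval-based clearing: every clear_interval blocks.
            WorkerCmd::NewBlock { height } => {
                if self.clear_interval > 0 && height % self.clear_interval == 0 {
                    self.schedule_packet_clearing();
                }
            }
            // Explicit clearing, e.g. after a WebSocket disruption.
            WorkerCmd::ClearPendingPackets => self.schedule_packet_clearing(),
        }
    }

    fn schedule_live_relaying(&mut self, _events: Vec<String>) { /* build operational data */ }
    fn schedule_packet_clearing(&mut self) { /* query pending packets, build operational data */ }
}
```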
In the case of the first command, the IBC events are forwarded to be handled further down in the relayer pipeline. In the case that a packet clearing job was scheduled, IBC events are generated in response. These events now need to make their way further down the pipeline for further processing.
Let's now go into a bit more detail on the process of packet clearing.
The packet clearing process handles three types of packets:
recv packets, which are submitted by a relayer to a target chain on behalf of a chain that emitted a SendPacket event.
ack packets, which are submitted by a relayer on behalf of a chain receiving a recv packet; ack packets are bound for the chain that originated the message associated with the recv packet.
timeout packets, which are submitted by a relayer in order to notify a chain that a message that the chain had attempted to send was not received and instead timed out.
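As a rough illustration of these three kinds of packets (all names hypothetical, not Hermes' actual types), a relayer effectively chooses between three message shapes; for a packet emitted on the source chain, the choice between recv and timeout depends on whether the timeout window has elapsed.

```rust
// Illustrative sketch of the three message kinds produced by packet clearing
// and live relaying. Names are hypothetical.
enum RelayMessage {
    /// Delivered to the destination chain on behalf of a SendPacket event.
    Recv { sequence: u64, data: Vec<u8> },
    /// Delivered back to the originating chain once the destination chain
    /// has written an acknowledgement for the received packet.
    Ack { sequence: u64, acknowledgement: Vec<u8> },
    /// Delivered back to the originating chain when the packet was never
    /// received before its timeout elapsed.
    Timeout { sequence: u64 },
}

/// Decide what to relay for a packet sent on the source chain: a recv packet
/// while it is still within its timeout window, a timeout packet otherwise.
fn message_for_send_packet(sequence: u64, data: Vec<u8>, timed_out: bool) -> RelayMessage {
    if timed_out {
        RelayMessage::Timeout { sequence }
    } else {
        RelayMessage::Recv { sequence, data }
    }
}
```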
The diagram below illustrates the two main packet flows: the first showcases the packets that are sent in the case of a successful relay, while the second showcases the packets that are sent in the case of an unsuccessful relay, i.e., when the message timed out.
The job of the packet clearing process is to drive the relaying process forward by relaying pending packets, either through explicit packet clearing jobs or through live relaying jobs. The ClearPendingPackets and NewBlock worker commands each kick off the packet clearing process, resulting in pending packets making progress towards being sent to their destined target chains. In addition to these commands, packet clearing is also initiated in response to receiving IbcEvents commands.
As part of the packet clearing process, pending packets are fetched and transposed into operational data, which is the type that represents the concept of a relaying job. This is the last piece of the packet command worker that we need to cover.
In order to ensure that each IBC event makes its way to the correct relayer component, some metadata needs to be generated and attached to each event. These pieces of metadata-augmented IBC events are called operational data. This data type encapsulates both the events that originated the operational data, as well as the resulting message(s) that map to those events.
Operational data can be generated for either the source chain, in the form of ack packets and timeout packets, or the destination chain, in the form of recv packets. Depending on which chain a piece of operational data targets, the data is placed into one of two separate queues: one that holds operational data bound for the source chain, and the other that holds data bound for the destination chain.
Another way to think about operational data is that it is a self-sufficient (from the relayer's perspective) data type that captures the inputs to the relayer (events) along with their resulting outputs (transactions). This is the form that events take as they make their way through the rest of the relayer pipeline.
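A minimal sketch of what operational data looks like, assuming simplified field types; the actual Hermes struct also tracks proof heights, scheduling times, and other metadata.

```rust
// Illustrative sketch of the shape of operational data and the two queues
// described above. Field types are simplified.
enum OperationalDataTarget {
    /// Bound for the source chain (ack and timeout packets).
    Source,
    /// Bound for the destination chain (recv packets).
    Destination,
}

struct OperationalData {
    /// Which chain this relaying job is addressed to.
    target: OperationalDataTarget,
    /// The IBC events that originated this piece of operational data.
    events: Vec<String>,
    /// The messages derived from those events, ready to be assembled
    /// into transactions for the target chain.
    messages: Vec<Vec<u8>>,
}

/// One queue per target chain, consumed by the packet worker.
struct OperationalDataQueues {
    src_ops: std::collections::VecDeque<OperationalData>,
    dst_ops: std::collections::VecDeque<OperationalData>,
}
```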
These pieces of data then make their way to the packet worker, another logical abstraction where the bulk of the work that the relayer does takes place.
The packet worker comprises the "meat" of the relayer; this is where most of the work happens. As such, the work that the packet worker does can be further broken down into smaller sub-tasks: monitoring the queues that hold pieces of scheduled operational data, executing pieces of operational data that are ready, and finally processing pending transactions (though this last step needs to be explicitly toggled on). In other words, the packet worker is where events and transactions are processed into the actual data that the destination chain is expecting to receive.
The entry point of the packet worker is the 'refresh' component, which is responsible for monitoring the operational data queues; these queues act as input data streams for the packet worker.
Pieces of operational data are peeked from the front of this queue. Those pieces that are ready to be executed stay in the queue, waiting to be popped by the 'execute' component. However, not every piece of operational data translates into an execution. Some pieces are stale, meaning they perhaps arrived out of order, or arrived too late to be executed. Such pieces of operational data are removed from the queue and dropped, though not before a timeout event is generated and scheduled to be sent back to the source chain, notifying it that the message timed out.
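As a sketch, the refresh step can be pictured as retaining ready entries and dropping stale ones after scheduling a timeout, with the staleness check and timeout scheduling left abstract (all names here are illustrative):

```rust
use std::collections::VecDeque;

// Illustrative sketch of the 'refresh' step. Types and helpers are hypothetical.
struct OperationalData { /* see the sketch above */ }

fn refresh(
    queue: &mut VecDeque<OperationalData>,
    is_stale: impl Fn(&OperationalData) -> bool,
    mut schedule_timeout: impl FnMut(&OperationalData),
) {
    queue.retain(|op| {
        if is_stale(op) {
            // Notify the source chain that the corresponding message timed
            // out, then drop the stale entry from the queue.
            schedule_timeout(op);
            false
        } else {
            // Ready entries stay in the queue for the 'execute' component.
            true
        }
    });
}
```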
Once the 'refresh' component has checked that a piece of operational data is consistent with regard to its target chain, that piece of data is popped from its queue by the 'execute' component. This component's main responsibility is sending pieces of operational data to their target destination chain.
The process of sending a piece of operational data first involves assembling the actual messages that the target chain understands from the transaction(s) bundled in the piece of operational data. This assembling process produces a vector of Any values. The Any data type is an opaque set of bytes that the relayer doesn't understand and doesn't care about, since it is meant to be understood by the destination chain. This vector of Any data is then submitted to the runtime, the component that can talk to the target chain, which performs the final stretch of work necessary to get the messages to the destination chain. We will discuss the runtime component in more detail later on in the post.
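A minimal sketch of this execute step, assuming a hypothetical Runtime trait and a hand-rolled Any struct mirroring the protobuf Any shape (a type URL plus opaque bytes):

```rust
// Illustrative sketch of the 'execute' step: turn a piece of operational data
// into opaque Any messages and hand them to the runtime. Names are hypothetical.
struct Any {
    type_url: String,
    value: Vec<u8>,
}

struct OperationalData {
    /// (type URL, encoded message) pairs derived from the originating events.
    messages: Vec<(String, Vec<u8>)>,
}

trait Runtime {
    /// Submit assembled messages to the target chain; on success, returns
    /// the hashes of the transactions that were broadcast.
    fn submit(&mut self, msgs: Vec<Any>) -> Result<Vec<String>, String>;
}

fn execute(op: OperationalData, runtime: &mut impl Runtime) -> Result<Vec<String>, String> {
    let msgs: Vec<Any> = op
        .messages
        .into_iter()
        .map(|(type_url, value)| Any { type_url, value })
        .collect();
    runtime.submit(msgs)
}
```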
If the sending of the messages was successful, the runtime returns a vector of transaction hashes, each one corresponding to one of the transactions that was successfully sent (but not yet necessarily committed) to the destination chain. These hashes serve as promises from the target chain that the transactions the hashes correspond with will be committed (though these promises are not always fulfilled). If transaction confirmation has been enabled, then the relayer will check that submitted transactions actually went through. More specifically, the transaction hashes, which represent pending transactions, are queued up into separate queues, one corresponding with the source chain, the other corresponding with the destination chain, in order for the 'process pending transactions' component to handle.
If an error occurred when attempting to send a piece of operational data, one of two possible paths of recourse is taken. In some cases, the send operation is simply retried. In other cases, a timeout event is generated and scheduled to be delivered back to the source chain, notifying it that the original transaction timed out and failed to complete.
When the tx_confirmation option is toggled on in the Hermes configuration, the 'process pending transactions' component records and exposes metrics (assuming metrics have been toggled on) about transactions, so that Hermes users can better understand the outcomes of executions. These metrics can't be updated based on the transaction hashes alone, as we don't know for sure that those transactions actually got committed. This component uses those hashes to query the target chain in order to determine whether the corresponding transactions were committed or not. Once Hermes receives a response containing the state of a transaction, it updates the metrics accordingly.
The querying of transactions exhibits an error-handling pattern similar to the one that the 'execute' component uses when it sends transactions. If an error occurs when querying a transaction, the transaction hash is pushed to the back of the queue of transaction hashes to be retried later. If the query succeeds and a confirmation is received, the confirmed events are recorded in the telemetry state.
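A sketch of that confirmation loop, with a hypothetical querier and telemetry callback; treating a not-yet-found transaction as "retry later" is an assumption made for illustration.

```rust
use std::collections::VecDeque;

// Illustrative sketch of the 'process pending transactions' step.
// All names are hypothetical.
enum TxStatus { Committed, NotFound }

trait ChainQuerier {
    fn query_tx(&self, hash: &str) -> Result<TxStatus, String>;
}

fn process_pending(
    pending: &mut VecDeque<String>,
    chain: &impl ChainQuerier,
    mut record_confirmed: impl FnMut(&str),
) {
    if let Some(hash) = pending.pop_front() {
        match chain.query_tx(&hash) {
            // Confirmation received: update the telemetry state.
            Ok(TxStatus::Committed) => record_confirmed(&hash),
            // Not committed yet, or the query itself failed: requeue the
            // hash at the back to be retried later.
            Ok(TxStatus::NotFound) | Err(_) => pending.push_back(hash),
        }
    }
}
```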
The last major component of the relayer pipeline is the chain endpoint, which is the part of the relayer that is not chain-agnostic; each chain endpoint is particular to the type of chain that packets are being relayed to.
The chain endpoint is responsible for interfacing directly with a chain, and as such, these endpoints are bespoke for each type of chain. For instance, the Hermes v1 chain endpoint implementation primarily supports Cosmos SDK chains. Chains that don't adhere to the Cosmos SDK interface will require dedicated support and development in order to address this, at least in the v1 design of Hermes. Future versions of Hermes will make it much easier to support different types of chain interfaces, without requiring custom development or support.
The chain endpoint receives a vector of opaque Any data and creates new transactions from it. A transaction includes the actual message composed of Any data, as well as an associated packet: a recv packet in the case when a message was successfully relayed from the originating source chain to the intended destination chain, or an ack packet in the case when the original destination chain responds back to the source chain, acknowledging that it received the source chain's message.
In order for a transaction to be sent, it must be cryptographically signed, the amount of gas that will be required to send the transaction must be estimated, and then it must be broadcast to the target chain. This broadcast occurs in an asynchronous fashion such that the endpoint doesn't wait for transactions to be committed on the target chain; the relayer drops transactions off in the target chain's mempool and leaves it at that. At this point, unless transaction confirmation has been enabled, the relayer has no idea whether the transactions get committed on the target chain or not (if confirmation is required, the 'process pending transactions' component will circle back to check on these transactions).
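A minimal sketch of that send path, assuming a hypothetical Endpoint trait whose broadcast returns as soon as the transaction lands in the mempool:

```rust
// Illustrative sketch of the chain endpoint's send path: sign, estimate gas,
// then broadcast without waiting for a commit. All types are hypothetical.
struct UnsignedTx { msgs: Vec<Vec<u8>> }
struct SignedTx { bytes: Vec<u8>, gas_limit: u64 }
struct TxHash(String);

trait Endpoint {
    fn sign(&self, tx: &UnsignedTx) -> SignedTx;
    fn estimate_gas(&self, tx: &SignedTx) -> u64;
    /// Broadcast asynchronously: returns as soon as the node has accepted the
    /// transaction into its mempool, without waiting for it to be committed.
    fn broadcast_async(&self, tx: SignedTx) -> Result<TxHash, String>;
}

fn send_tx(endpoint: &impl Endpoint, tx: UnsignedTx) -> Result<TxHash, String> {
    let mut signed = endpoint.sign(&tx);
    // Attach the estimated gas before handing the transaction off.
    let gas = endpoint.estimate_gas(&signed);
    signed.gas_limit = gas;
    endpoint.broadcast_async(signed)
}
```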
That concludes our whirlwind tour of the Hermes v1 relayer lifecycle! We covered a lot of ground, and yet this post only scratched the surface of the inner workings of the Hermes relayer.
If you read all the way to the end and are interested in digging deeper, check out the Hermes guide. You can reach the Hermes developers via the IBC Gang Discord server or the Hermes GitHub repo 🙂