Informal Systems

2023-06-01

Hub Engineering Update for April & May 2023

Jehan Tremback • 2023-06-01

Hub Engineering Update & First Consumer Chain Launch

It’s time for an engineering update from Informal Systems’ Cosmos Hub team!

Neutron officially launched on May 11th as the first consumer chain to use Replicated Security. This update covers our engineering work for April and May 2023, including the Neutron launch.


The First ICS Chain! - Neutron Launch Retro

Consumer chains have been launched many times on testnets, and in fact are launched hundreds of times a week in our automated tests. Typically, the technical details of a consumer chain launch are too boring to write about. However, real life is inevitably different from testing. We had a few complications during the Neutron launch, but nothing that caused any permanent issues, and the issues shouldn’t recur in future launches. I’ll recap the launch and the complications here.

Note: when I say “we” in this post, I am not referring only to the Informal team. It was an effort of many teams led by Neutron, including us, the ICF, Hypha, Strangelove, CryptoCrew, and of course the entire validator set. 

Multisig issues reported and fixed - April 25th - May 4th

A couple of weeks before the launch, we began receiving reports that validators were having trouble using multisigs and Ledger devices to sign key assignment transactions. By default, validators use the same keys on consumer chains as they do on the Hub, but it is considered a good security practice to use a different key for consumer chains, and this is what key assignment transactions let you do.

It turned out that Ledger devices and multisigs were not able to sign key assignment transactions because of complications involving Amino, a data format that has been largely removed from Cosmos but is still used by Ledger devices and multisigs. Fixing this involved a small change to the format of key assignment transactions.

We finished this change around May 4, the week before launch, and sent out notice of an emergency upgrade to be performed on Monday, May 8th, the original scheduled date of the Neutron launch. This upgrade also contained a fix for a bug that could potentially halt the Hub.

Hub emergency upgrade - May 8th

The Neutron team wisely decided to postpone their launch for a few days, to Wednesday the 10th, to avoid complicating the upgrade. 

The upgrade on Monday went relatively smoothly, with minimal downtime and no major incidents. Immediately after the upgrade, validators were able to use the key assignment transaction from their multisigs and Ledger devices. Thanks to the team at Notional for consulting with Neutron on postponing the launch, and helping to coordinate the upgrade!

However, Neutron’s “spawn time” was on Monday, even though their actual launch was delayed. At the spawn time, the IBC client and genesis file for the consumer chain are created on the Hub, and after this point, they cannot be changed. All validators who had not been able to use the key assignment transaction before the fix were in the genesis file with their Hub keys.

This means that to run Neutron, they would need to put their Cosmos Hub keys on their Neutron validator. This is not a problem for the software, but for operational security, most validators would prefer not to be moving keys around on different computers.
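The effect of the spawn-time cutoff can be illustrated with a small sketch. This is not the Interchain Security code: the `Validator` type, the `genesis_keys` helper, and the key strings are hypothetical names made up for illustration. It only shows the rule described above, where a validator without a key assignment at spawn time ends up in the consumer genesis with their Hub key.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Validator:
    name: str
    hub_key: str  # key the validator uses on the Hub (hypothetical string form)


def genesis_keys(validators: List[Validator], assigned: Dict[str, str]) -> Dict[str, str]:
    """Snapshot taken at spawn time: the assigned consumer key if there is one,
    otherwise the validator's Hub key."""
    return {v.name: assigned.get(v.name, v.hub_key) for v in validators}


validators = [Validator("alice", "hub-key-a"), Validator("bob", "hub-key-b")]

# Only alice managed to assign a consumer key before spawn time.
snapshot = genesis_keys(validators, {"alice": "consumer-key-a"})

# bob assigns a key after spawn time. The snapshot is already fixed, so this
# only takes effect once the consumer chain is running and receives the update.
late_assignments = {"bob": "consumer-key-b"}

print(snapshot)
```

In this toy example bob would have to start the consumer chain with `hub-key-b`, which is the situation many validators found themselves in.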

Neutron launch - May 10th - May 11th

We did the Hub emergency upgrade so quickly because we didn’t want the validators who weren’t able to assign keys to be unfairly punished for downtime. However, we soon realized that we might have a bigger problem. A very large number of validators had not been able to assign keys before spawn time, and after spawn time, the genesis was locked in. Key assignments could still be done, but they would only take effect after Neutron had started.

For a replicated security consumer chain to start, validators holding at least 67% of the Hub’s voting power (strictly, more than two-thirds) need to be running it. This 67% would need to come from the validators who weren’t using multisigs or Ledgers, or from validators who were willing to temporarily run the consumer chain with their Hub keys. With a lot of outreach over the 10th, we were able to put together enough power to get the chain started, although it took many hours. Thanks to all the validators who helped us out here.
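As a rough sketch of the threshold involved, the check below assumes the usual rule that strictly more than two-thirds of voting power must be online, which is what the 67% figure rounds to. The `can_start` helper and the validator names and powers are hypothetical, chosen only for illustration.

```python
from typing import Dict, Set


def can_start(powers: Dict[str, int], ready: Set[str]) -> bool:
    """True if the validators in `ready` hold strictly more than 2/3 of total power.

    Integer arithmetic avoids floating-point rounding at the boundary.
    """
    total = sum(powers.values())
    online = sum(power for name, power in powers.items() if name in ready)
    return 3 * online > 2 * total


# Hypothetical voting powers, not real Hub data.
powers = {"a": 40, "b": 30, "c": 30}

print(can_start(powers, {"a", "b"}))  # 70 of 100 is above two-thirds
print(can_start(powers, {"b", "c"}))  # 60 of 100 is not
```

Note that holding exactly two-thirds is not enough under this rule, which is why a launch that is only just above the threshold is fragile: losing any one sizeable validator halts the chain.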

We decided to wait until the next day before starting the relayer, though. Once the relayer started, it would start catching up on all the replicated security packets that update the consumer chain’s validator set. These packets would update the keys for validators who had used the consumer key assignment transaction after Monday’s spawn time. This is a good thing, but our concern was that validators who had helped us out by using their Hub key instead of their consumer key could be knocked offline once the keys were updated. Since we were running on such a bare majority, this could cause a halt. We wanted to have as many people as possible online and well-rested to help deal with this.

On the 11th, we had some difficulties getting relayers to start, which are too boring to get into here. But once the relayers were up and running, packets started going through very quickly, and the chain got synced up. Neutron is now the first Replicated Security consumer chain, secured by the Cosmos Hub.

A note on unbonding 

Some users unbonded their staked ATOMs during the Neutron Launch window (May 8th - May 11th). These tokens will be fully unbonded on May 31st. 

Neutron officially became a consumer chain on May 8th. This is known as its spawn time. However, it did not start running until May 11th. When it started, it received all the VSC packets for unbondings that happened after the spawn time, and started the unbonding period countdown of 20 days. 20 days after the 11th is the 31st, and so this is when the unbonding period of all VSC packets received on the 11th will complete.
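The date arithmetic can be checked with a few lines of Python. This is only a worked example using the dates and periods given in this post (21 days on the Hub, 20 days on Neutron). The `completion_date` helper is a hypothetical name, and the sketch ignores the time it takes to relay packets between the chains.

```python
from datetime import date, timedelta

HUB_UNBONDING = timedelta(days=21)      # Cosmos Hub unbonding period
NEUTRON_UNBONDING = timedelta(days=20)  # Neutron unbonding period


def completion_date(unbond_started: date, consumer_received: date) -> date:
    """Earliest date on which both the Hub's and the consumer's periods have passed."""
    hub_done = unbond_started + HUB_UNBONDING
    consumer_done = consumer_received + NEUTRON_UNBONDING
    return max(hub_done, consumer_done)


# A user unbonds on May 8th, but Neutron only receives the packet on May 11th.
print(completion_date(date(2023, 5, 8), date(2023, 5, 11)))  # 2023-05-31

# Without the three-day gap, the Hub's own 21 days would have been the limit.
print(completion_date(date(2023, 5, 8), date(2023, 5, 8)))   # 2023-05-29
```

So for an unbonding started on May 8th, the delayed start of Neutron pushed completion from May 29th to May 31st.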

How does unbonding work with Replicated Security?

A core part of the Replicated Security protocol is the preservation of the consumer chain’s unbonding period in all circumstances. This is done in a simple way: 

  1. When a user sends an unbonding transaction, a record of the pending unbonding operation is created.

  2. This unbonding operation can complete when two conditions are met:

    1. 21 days have passed on the Cosmos Hub.

    2. Every consumer chain reports that their unbonding periods have also passed.

  3. At the end of the block in which the unbonding began, a validator set change packet (VSC packet) is sent to all consumer chains. This contains the power change caused by the unbonding, and any other power changes from that block.

  4. Once the consumer chain receives the validator set change packet, it begins its own unbonding period countdown. On Neutron, the unbonding period is 20 days.

  5. After this period has elapsed, the consumer chain sends a VSC matured packet to the Cosmos Hub.

  6. Now any corresponding unbonding operations can complete, once the Hub’s unbonding period has also passed.

This protocol guarantees that no matter what, the consumer chain’s unbonding period is respected.
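The two completion conditions from the steps above can be sketched as a small piece of bookkeeping. This is a simplified model, not the provider module’s actual implementation: the `UnbondingOp` class, its field names, and the `on_vsc_matured` method are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Set

HUB_UNBONDING = timedelta(days=21)


@dataclass
class UnbondingOp:
    """A pending unbonding operation on the Hub (hypothetical model)."""
    started: datetime
    # Consumer chains that have not yet reported a matured VSC packet.
    waiting_on: Set[str] = field(default_factory=set)

    def on_vsc_matured(self, consumer: str) -> None:
        """Record that a consumer chain's unbonding period has passed."""
        self.waiting_on.discard(consumer)

    def can_complete(self, now: datetime) -> bool:
        """Both conditions must hold: Hub period elapsed and no consumer pending."""
        hub_period_passed = now - self.started >= HUB_UNBONDING
        return hub_period_passed and not self.waiting_on


op = UnbondingOp(started=datetime(2023, 5, 8), waiting_on={"neutron"})

# 21 days have passed on the Hub, but Neutron has not reported yet.
print(op.can_complete(datetime(2023, 5, 29)))  # False

# Neutron's VSC matured packet arrives, so the operation can now complete.
op.on_vsc_matured("neutron")
print(op.can_complete(datetime(2023, 5, 31)))  # True
```

The Hub never releases tokens on its own timer alone, which is what makes the guarantee hold even when a consumer chain is slow or stopped.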

How can this be avoided in the future?

While this mechanism does guarantee the security of consumer chains to the highest degree, delays in unbonding can be inconvenient and should probably be avoided. It should be noted that a consumer chain cannot delay unbonding forever. If a consumer chain has delayed unbonding for more than 5 weeks, it will be removed from the Cosmos Hub and all unbondings will be released.

To avoid the type of minor delay discussed here, the easiest way is to reduce the unbonding period on consumer chains. To illustrate, if consumer chain unbonding periods were one week shorter than the Cosmos Hub’s unbonding period, then a consumer chain would need to be stopped for one week before any unbondings were delayed on the Hub. This theoretically reduces the security somewhat, but most likely not in a meaningful way.
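To make the trade-off concrete, here is a small worked calculation. It assumes the simplest case: the consumer chain stops right when the unbonding begins and receives the VSC packet as soon as it restarts, with no relaying latency. The `hub_delay` helper is a hypothetical name.

```python
from datetime import timedelta


def hub_delay(hub_period: timedelta, consumer_period: timedelta,
              consumer_downtime: timedelta) -> timedelta:
    """Extra waiting time on the Hub caused by the consumer chain being stopped."""
    extra = consumer_period + consumer_downtime - hub_period
    return max(timedelta(0), extra)


days = lambda n: timedelta(days=n)

# The Neutron launch: 20-day consumer period, chain started 3 days after spawn time.
print(hub_delay(days(21), days(20), days(3)).days)  # 2

# With a consumer period one week shorter than the Hub's, a 7-day stop costs nothing.
print(hub_delay(days(21), days(14), days(7)).days)  # 0
print(hub_delay(days(21), days(14), days(8)).days)  # 1
```

The slack is simply the difference between the two unbonding periods, so every day trimmed from the consumer’s period buys a day of tolerated downtime.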

We’re also studying the protocol to see if it may be possible to eliminate the mechanism of unbonding pausing altogether. This will likely require assuming that clocks are always synchronized, and it would make the security of consumer chains harder to quantify exactly, but it may be worth it.

Other Engineering Updates

Much of the work in April and May was focused on preparing for and launching the first consumer chain, Neutron. 

Soft opt-out

Neutron chose to launch with the soft opt-out feature discussed in the last update, which allows the smallest 5% of validators on the Hub to opt out of running the consumer chain. We implemented it in time for the Neutron launch: #833
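One way to picture the soft opt-out rule is the sketch below. It assumes that “smallest 5%” is measured by cumulative voting power, starting from the smallest validator, which is an interpretation and not something this post spells out. The `may_opt_out` helper and the validator names and powers are hypothetical.

```python
from typing import Dict, Set


def may_opt_out(powers: Dict[str, int], threshold_pct: int = 5) -> Set[str]:
    """Smallest validators whose combined power stays within threshold_pct of the total."""
    total = sum(powers.values())
    allowed: Set[str] = set()
    running = 0
    # Walk validators from smallest to largest; the name breaks ties deterministically.
    for name, power in sorted(powers.items(), key=lambda kv: (kv[1], kv[0])):
        running += power
        if running * 100 > total * threshold_pct:
            break
        allowed.add(name)
    return allowed


# Hypothetical voting powers summing to 100.
powers = {"a": 50, "b": 30, "c": 15, "d": 3, "e": 2}
print(sorted(may_opt_out(powers)))  # ['d', 'e']
```

Under this reading, the validators that may opt out together hold too little power to affect whether the consumer chain can make progress.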

Completed standalone to consumer migration code

The standalone to consumer migration will allow chains that are currently running standalone to become consumer chains. Stride will use this during their launch. We’ve been working on this with Stride over the past few months, and in April we merged it: #757, #794, #832

Oak Audit review

Oak Security was recently funded by the Cosmos Hub community pool to do an audit of Interchain Security. We received their initial report, reviewed it, and discussed it with the audit team. They found some issues where a malicious consumer chain could potentially halt the Cosmos Hub, but this is within the current security model of Replicated Security. We are fixing these issues and others to move towards an “untrusted consumer chain” paradigm, but this work is ongoing.

One of the issues that Oak Security identified was where an attacker could halt a consumer chain by creating millions of spam denoms (token types). This issue was very similar to one that we had already been investigating on the provider chain side of ICS. We can now confirm that these issues have been addressed and fixed on both the Cosmos Hub and Neutron.

Neutron testnet and fixes

In the Neutron testnet, we found a bug where multiple key assignment messages could cause a halt on the consumer (this was also found in the Oak audit a few weeks later). We fixed this issue on both the consumer and the provider: #850, #846

Multisig issue

At the end of April, we began receiving reports that key assignment wasn’t working for validators using multisigs and Ledger devices. We began work on a fix for this issue by slightly modifying the format of the key assignment transaction.

Preparing for emergency upgrade

We started preparing for an emergency Hub upgrade containing the denom and multisig fixes discussed above. Most of the work on this upgrade happened in May, though, and it is covered in the launch retro earlier in this post. In short: it went pretty smoothly.