On Nov 18, 8am PST we had the third Stake Wars call with around 10 people. We again opened registration for validators to the public through our customized form. However, due to insufficient testing of the form, some bugs went unnoticed until Sunday, which may caused some registrations to be lost. As a result, only five external validators actually signed up for stake wars this week at genesis.
The network started as expected during the call. However, very soon we realized that the new state sync code does not work properly — it panics on empty state. After fixing the bug, we restarted the network on Monday afternoon with the same set of validators. However, due to timezone difference none of the external validators was able to restart their nodes at the same time as us, which caused finality gadget to stall since there were fewer than 2/3 validators online.
Shortly afterwards, we noticed that it became difficult for nodes to sync to the stake wars network. After some debugging, we realized that the network forked and nodes had difficulty syncing because they incorrectly rejected known headers so that they cannot do state sync. We fixed the bug in header sync and changed adaptive waiting time for block production to reduce forkfulness. The network was restarted again on Tuesday night.
The new network ran normally for a couple days before Murphy’s law kicked in. This time, the network itself seemed fine but no transactions can go through, which means that no one can register account on our stake wars wallet or send a staking transaction. Looking into it we noticed that no chunks were included in the newly produced blocks.
At the same time, all our validators have been kicked out at that point, making it harder to debug. As a debug workaround, we figured out a way to impersonate the block producing node by customizing the node’s code to see why chunks were not included, which soon leads to the discovery of a simple but hard-to-notice bug in our block production logic.
Here is a summary of the issues we found from running stake wars this week:
- State sync crashes on empty state. The new state sync code incorrectly crashed on receiving an empty state, which is fixed in https://github.com/nearprotocol/nearcore/pull/1716.
- Incorrect rejection of old block headers during header sync, which is fixed in https://github.com/nearprotocol/nearcore/pull/1740.
- In block production, chunks might be incorrectly removed if not all chunks are ready when the block production function is invoked, which is fixed in https://github.com/nearprotocol/nearcore/pull/1765.
Overall, this week’s experience indicates that our stability has regressed, mostly due to the new code that was merged last week. Although it is hard to stabilize our code while still developing new features, we will focus on improving test coverage for more complex setups to make sure that new code doesn’t break the system. For example, we have now a nightly test infrastructure set up so that we can run expensive integration multi node tests nightly to test the stability of the system as a whole. Our goal is to see steady increase in the stability of the network over the next few weeks.