On Nov 11, 4pm PST we had the second Stake Wars call with around 20 people.
Due to the issues uncovered last week, we took more caution this time and chose to start the genesis in house and invite external people to join afterwards. The network was started several minutes before the call, and during the call, Peter demonstrated how to join the network and stake to become a validator.
The network went down at 14560 blocks at around 9:00pm PST on Monday. All validating nodes crashed almost simultaneously. From the logs we observed that the nodes had dramatic increase in memory and cpu usage right before the crash. Due to the lack of proper logging, we were not able to determine the cause of the crash and therefore decided to restart the network on Tuesday with more careful debugging and logging setup.
The network crashed again on Tuesday night, around 11pm PST. From the valgrind output we saw that a function that broadcasts message to the network used an unreasonable amount of memory. After some investigation and sifting through logs of more than 1GB we found that there is a subtle bug in our network code that would only be triggered under a specific circumstance. We quickly deployed a quick fix that prevents such bug in normal situations but are still working on fully fixing the issue in the byzantine setting. We restarted the network again on Wednesday and it has been running without any downtime since then.
Released v0.4.6 with updates (fully implemented finality gadget) and fixes described below.
- The aforementioned network bug. More specifically, when we receive account announcement we check whether it already exists in the routing table by doing an exact match. However, since account announcement has epoch id, and because a newly joined peer would rebroadcast the account announcement they receive from peers, if a node in the network announce their account for the next epoch at the same time, it causes the announcement to overwrite each other. Each overwrite would lead to a broadcasting to the entire network and therefore causes exponential growth of network messages, which causes the entire network to crash. The issue is temporarily fixed in https://github.com/nearprotocol/nearcore/pull/1688 and we are still working on fully fixing the issue.
- We also noticed that, even though we fixed some major memory leaks last week, nodes were still leaking memory slowly. Through inspection we found that some caches in ShardsManager were not properly implemented and caused the leak. This issue is fixed in https://github.com/nearprotocol/nearcore/pull/1706 (which is pending more testing).
- There is also some issue with rpc that makes wallet sometimes nonfunctional. Validators complained that they have to try multiple times to register one account name. This issue is fixed in https://github.com/nearprotocol/nearcore/pull/1699.
The experience this week shows that stability of our network is improving with each week. For next week, we will again open registration for stake wars to the public with a customized landing page that does input validation.
We also invite validator to start actually trying malicious or DDOS types of attacks to start testing non vanilla behaviors.