
Post Mortem Report: Nolus Protocol Service Interruption on March 23rd

By Nolus Team

Summary

The Nolus Protocol experienced a service interruption starting at 21:11 UTC on March 23rd, caused by an influx of requests to open or close lease positions from both users and the protocol’s liquidation engine. This unexpected surge created a bottleneck in the time alarms dispatcher, which is tasked with maintaining the liveness of ICA channels: it could not process the number of alarms required. The backlog quickly escalated, causing IBC relayers to generate a high volume of requests, which in turn resulted in consensus failures on Nolus and Osmosis nodes operated by third parties as well as by the Nolus core team. To mitigate the issue, access to the Nolus dApp was restricted while the team investigated and addressed the root cause of the disruption. By 16:00 UTC on March 26th, the system had been restored to full operational capacity, with all infrastructure revived and pending IBC packets processed.

Root Cause Analysis

The existing implementation of IBC’s ICS-27 standard presents a challenge for ordered ICS-27 channels when they encounter timeouts. The current IBC design triggers a callback to notify the relevant contract that the channel is closed before the channel’s status officially transitions to closed. As a result, the channel’s state is updated to ‘state_closed’ only after this callback has been executed. This creates a timing issue that can adversely affect subsequent operations, notably channel registration.
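The ordering problem can be illustrated with a toy model. This is a minimal sketch, not IBC or Nolus code: `Channel`, `try_reopen`, and both timeout functions are hypothetical names, and the rule that a reopen attempt only succeeds once the channel reads as closed is an assumption made for illustration.

```python
from enum import Enum


class ChannelState(Enum):
    OPEN = "open"
    CLOSED = "closed"


class Channel:
    def __init__(self):
        self.state = ChannelState.OPEN


def try_reopen(channel):
    """Hypothetical contract handler: a reopen is accepted only if the
    channel already reads as closed (assumption for this sketch)."""
    return channel.state is ChannelState.CLOSED


def timeout_callback_first(channel, on_close):
    """Ordering described above: notify the contract, then mark closed."""
    reopened = on_close(channel)          # contract still sees OPEN here
    channel.state = ChannelState.CLOSED   # state changes only afterwards
    return reopened


def timeout_state_first(channel, on_close):
    """Contrasting ordering: mark closed first, then notify the contract."""
    channel.state = ChannelState.CLOSED
    return on_close(channel)


print(timeout_callback_first(Channel(), try_reopen))  # False
print(timeout_state_first(Channel(), try_reopen))     # True
```

In the first ordering the contract is told about the closure while the channel still appears open, so anything it attempts inside the callback has to be deferred to a later point in time.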

To mitigate this issue, the Nolus dev team implemented an additional mechanism that dispatches time alarms so that Interchain Accounts (ICA) channels can be promptly reopened by the lease contract, keeping them active and functional. This approach is intended to compensate for the timing discrepancy in the current ICS-27 implementation, so that channel operations remain uninterrupted despite the limitations of the existing protocol design.

A significant increase in user requests, combined with the protocol’s liquidation engine operating at high capacity, created a bottleneck in the time alarms dispatcher. The dispatcher could not keep up with the number of alarms that had to be processed.
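The way such a bottleneck builds up can be sketched with a simple queue model. The function name and all numbers below are hypothetical and chosen only for illustration; they are not measurements from the incident.

```python
def simulate_backlog(arrivals_per_block, capacity_per_block):
    """Track undelivered alarms when a dispatcher handles at most
    `capacity_per_block` alarms in each block (fixed-capacity assumption)."""
    backlog = 0
    history = []
    for arrivals in arrivals_per_block:
        backlog += arrivals                          # new alarms become due
        backlog -= min(backlog, capacity_per_block)  # dispatcher drains what it can
        history.append(backlog)
    return history


# Two quiet blocks, a three-block surge, then quiet again (illustrative values).
print(simulate_backlog([5, 5, 40, 40, 40, 5], capacity_per_block=10))
# [0, 0, 30, 60, 90, 85]
```

With a fixed capacity the backlog grows for as long as the surge lasts and drains only slowly afterwards, which is the behavior a demand-based dispatcher is meant to avoid.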

Corrective and Preventive Measures

In response to this incident, several measures have been implemented:

  • Dynamic Scaling of Alarm Dispatcher: The alarms dispatcher has been enhanced to dynamically scale based on demand. While this is a temporary solution, it ensures the dispatcher’s operational capacity during periods of excessive alarm activity. It’s important to note that this solution can increase the transaction occupancy in the blockchain’s block space, potentially taking up to 30% of the available space in certain blocks.
  • IBC Implementation Update: The core issue has been acknowledged, and the proposed solution will not require the involvement of any time alarms. Nolus plans to adopt this update following its official release, which will directly address the underlying issue.
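One way to picture demand-based scaling with a bound on block space is the sketch below. It is an illustration under stated assumptions, not the Nolus dispatcher: the function name, the per-transaction batch size, and the gas figures are hypothetical, and treating the 30% figure as a hard ceiling is an assumption as well.

```python
def dispatch_txs_for_block(pending_alarms, alarms_per_tx, gas_per_tx,
                           block_gas_limit, max_block_share_pct=30):
    """Return how many dispatch transactions to submit for the next block.

    The count scales with the number of pending alarms but is capped so the
    dispatcher never uses more than `max_block_share_pct` percent of the
    block's gas (all parameters are hypothetical).
    """
    # Transactions needed to clear the backlog, rounded up.
    needed = (pending_alarms + alarms_per_tx - 1) // alarms_per_tx
    # Transactions that fit inside the dispatcher's share of the block.
    gas_budget = block_gas_limit * max_block_share_pct // 100
    affordable = gas_budget // gas_per_tx
    return min(needed, affordable)


# Quiet period: one transaction is enough.
print(dispatch_txs_for_block(10, 32, 2_000_000, 100_000_000))    # 1
# Surge: scale up, but stop at the block-space budget.
print(dispatch_txs_for_block(5000, 32, 2_000_000, 100_000_000))  # 15
```

Capping the share keeps room in each block for user transactions, at the cost of draining a large backlog over several blocks rather than one.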

The Nolus team extends its gratitude to the IBC team for their swift response in acknowledging the issue. Moving forward, Nolus is committed to closely monitoring the system’s performance and continuously improving its infrastructure. This commitment is aimed at preventing similar incidents and ensuring a smooth cross-chain experience.