On July 27th, 2020, between 10:27 UTC and 11:14 UTC, as well as between 22:51 UTC and 23:26 UTC, API calls in Standard or Enterprise environments may have failed. Affected APIs include clients joining a session and server REST API calls. This incident did not impact ongoing sessions for which no API calls were invoked.
A security agent running on the non-persistent database servers, which monitors and scans disk access, conflicted with the database over file resources. As part of its normal operation, the non-persistent database writes log files to disk. The security agent performs in-line anti-malware analysis of each file access. The non-persistent database is unusual in that it is single-threaded and opens and closes the log file for each log entry.
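To make the mechanism concrete, here is a minimal sketch of the logging pattern described above. The function names and file path are illustrative, not taken from the actual database codebase; the point is that one open/write/close cycle per entry generates a fresh file-access event every time, and each such event is exactly what an in-line anti-malware scanner intercepts.

```python
# Hypothetical sketch of a per-entry open/close logging pattern
# (names and paths are illustrative, not from the real system).

def log_entry(path, message):
    # One open/write/close cycle per entry: every call produces a new
    # file-access event that a security agent can scan in-line.
    with open(path, "a") as f:
        f.write(message + "\n")

# Because the database is single-threaded, every request waits behind
# this synchronous logging call; any delay the scanner adds to the
# open() is paid on every single log entry.
for i in range(3):
    log_entry("db.log", f"entry {i}")
```

A long-lived file handle (open once, write many times) would produce far fewer scannable open events, which is one reason this per-entry pattern interacted so badly with the agent.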
The agent's scan of the log files created a resource conflict with the database, which increased request latency and, in turn, the number of open connections that needed to be serviced.
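The link between added latency and open connections follows Little's law: concurrent connections ≈ arrival rate × request latency. A back-of-the-envelope sketch, with purely illustrative numbers (these are not measurements from the incident):

```python
# Little's law: L = lambda * W
# (concurrent items = arrival rate x time each item spends in the system).
# The numbers below are illustrative only, not measured incident values.

def concurrent_connections(requests_per_sec, latency_sec):
    return requests_per_sec * latency_sec

# At 1000 req/s, a 5 ms request holds ~5 connections open at once...
baseline = concurrent_connections(1000, 0.005)
# ...but if scanning pushes latency to 250 ms, the same traffic
# needs ~250 concurrent connections.
degraded = concurrent_connections(1000, 0.250)
print(baseline, degraded)  # 5.0 250.0
```

This is why a latency increase alone, with no change in traffic, can exhaust a connection pool and start timing requests out.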
This cascaded into requests timing out and triggered a failover to a new primary node. The new primary node exhibited the same behavior, compounding the effect. At that point the cluster could no longer handle all traffic and became entirely unresponsive.
The non-persistent database cluster management nodes were restarted at 10:46 UTC, and the platform resumed normal operations.
When the second incident occurred, the team was still investigating the first, and the security agent had not yet been identified as the root cause.
Restarting the non-persistent database cluster management node at 23:10 UTC did not resolve the second incident. Further investigation found excessive resource utilization by the security agent. The agent was disabled on all non-persistent database cluster nodes at 23:24 UTC, and normal operations resumed shortly after.
On July 27th, 2020, between 10:27 UTC and 11:14 UTC, as well as between 22:51 UTC and 23:26 UTC, users on all client SDKs may not have been able to connect to a session running in the Standard or Enterprise environments. This incident did not impact users who had already joined a session.
For reasons that are currently under investigation by internal and external teams, a third-party software component running on the nodes of the non-persistent database made them unresponsive.
This component was found to be consuming almost all of the cluster's computing power at the time, causing the cluster to fail.
The fix for the initial incident began rolling out at 10:46 UTC; restarting the API Gateway appeared to resolve the issue, and the platform resumed normal operations.
When the second incident occurred, the team was still investigating the first, and this software component had not yet been identified as the root cause.
While rebooting the API Gateways had resolved the first incident, the same steps did not resolve the second outage. During the second outage, it was discovered that the software component was consuming most of the API Gateway's resources. The component was quickly disabled across the database cluster, and normal operations resumed shortly after.