Between 18:18 UTC on 2024-02-01 and 17:30 UTC on 2024-02-22, customers experienced intermittent errors when trying to use the RDP Web Session feature of Safeguard Remote Access (SRA).
The event was triggered by a deployment of updated versions of Safeguard Remote Access product components at 14:18 UTC on 2024-02-01, which led to scaling issues in backend services caused by a memory leak.
The event was first reported by customers at 13:54 UTC on 2024-02-07. The team responded by implementing interim fixes to mitigate the service outage and allow customers to use the feature, while working on a permanent solution.
This incident impacted multiple SRA customers across the US and EU regions.
The overall incident can be separated into three distinct issues.
The first customer service requests noted that the first 4-5 attempts to launch an RDP web session in SRA failed, with subsequent attempts then working as expected.
We then opened a major incident on Statuspage for the Safeguard Remote Access product at 16:27 UTC on 2024-02-08.
Our Product team found changes in recent application framework updates. Code updates were made and the repaired component was re-deployed to production at 14:58 UTC on 2024-02-09.
Our Support team confirmed at 18:45 UTC on 2024-02-09 that the RDP web session feature was working. Our Operations team then changed the incident on Statuspage to 'Resolved' at 21:05 UTC on 2024-02-09.
This was believed to address the cause of the outage at that time.
Starting at 16:50 UTC on 2024-02-13, customers reported that the loss in RDP functionality had returned.
Our Product team found two components that were aggressively scaling towards their maximum operating limit.
As an interim hotfix, we increased the maximum target that the service could scale to, which allowed the service to continue functioning. We also changed the service responsible for performing scaling operations to rule out the original scaling service as a root cause.
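To illustrate why raising the scaling ceiling could keep the service running as an interim measure, the following is a minimal sketch of a capped scaling rule. The function name, parameters, and numbers are hypothetical and are not taken from the actual SRA scaling service.

```python
import math


def desired_instances(demand: float, per_instance_capacity: float,
                      min_instances: int, max_instances: int) -> int:
    """Return the instance count a simple capped autoscaler would target.

    All names and values here are illustrative assumptions.
    """
    if per_instance_capacity <= 0:
        raise ValueError("per_instance_capacity must be positive")
    if min_instances > max_instances:
        raise ValueError("min_instances must not exceed max_instances")
    # Instances needed to serve the demand, rounded up.
    needed = math.ceil(demand / per_instance_capacity)
    # Clamp the result to the configured floor and ceiling.
    return max(min_instances, min(needed, max_instances))


# With a ceiling of 10 the service is capped below what the demand requires.
capped = desired_instances(1200, 100, min_instances=2, max_instances=10)
# Raising the ceiling to 20 lets the service reach the needed count.
raised = desired_instances(1200, 100, min_instances=2, max_instances=20)
```

In this sketch, a higher ceiling only buys headroom. If the underlying demand for instances keeps growing, for example because of leaking memory, the new ceiling is eventually reached as well.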
The Operations team opened the second major incident on Statuspage for the Safeguard Remote Access product at 12:52 UTC on 2024-02-14.
At 17:46 UTC on 2024-02-14, following our hotfix, customers began to report that the RDP web session feature was operational.
The Product team continued to analyze service logs and found broken pipe errors caused by a bug in the implementation of two components that was introduced as part of a product release on 2024-02-01.
At 14:02 UTC on 2024-02-15, the two suspect components were rolled back to their last known-working version. The Product team also amended one of the components responsible for scaling.
Further monitoring showed proper scaling of backend components and a large drop in broken pipe errors.
The Operations team updated the incident on Statuspage to 'Resolved' at 16:17 UTC on 2024-02-15.
At 07:06 UTC on 2024-02-16, our Support team reported that after a few hours of successfully using the RDP web session feature, the same RDP session errors had returned.
Developers immediately started collecting monitoring data and began to reproduce the errors in our non-production environment.
The Operations team created the third major incident on Statuspage at 09:28 UTC on 2024-02-16.
After discussing with the Product team, the Operations team implemented a hotfix that restarted the Safeguard Remote Access service every 4 hours to eliminate the impact of memory errors. These scheduled restarts were communicated via the Statuspage at 10:58 UTC on 2024-02-16.
Customers then confirmed that Safeguard Remote Access RDP functionality had been restored.
The Operations team extended the interval between scheduled restarts from 4 hours to 6 hours. This change was reflected on the Statuspage at 18:51 UTC on 2024-02-16.
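As a rough illustration of the effect of the interval change, the sketch below computes a fixed-interval restart schedule. The function names and the starting time are hypothetical. Only the 4-hour and 6-hour intervals come from this report, and the actual restart mechanism is not described here.

```python
from datetime import datetime, timedelta, timezone


def restart_times(first_restart: datetime, interval_hours: int,
                  count: int) -> list[datetime]:
    """Return `count` restart times spaced `interval_hours` apart."""
    if interval_hours <= 0:
        raise ValueError("interval_hours must be positive")
    if count < 0:
        raise ValueError("count must not be negative")
    step = timedelta(hours=interval_hours)
    return [first_restart + i * step for i in range(count)]


def restarts_per_day(interval_hours: int) -> int:
    """Number of scheduled restarts that fit into a 24-hour period."""
    if interval_hours <= 0:
        raise ValueError("interval_hours must be positive")
    return 24 // interval_hours


# Hypothetical starting point; the real schedule is not stated in the report.
start = datetime(2024, 2, 16, 12, 0, tzinfo=timezone.utc)
four_hourly = restart_times(start, interval_hours=4, count=3)
six_hourly = restart_times(start, interval_hours=6, count=3)
```

Under these assumptions, moving from a 4-hour to a 6-hour interval reduces the number of scheduled restarts per day from six to four.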
After further monitoring of backend metrics, scaling parameters were updated to better mitigate the memory error and require fewer restarts.
A third-party component was identified as a root cause of the memory issues.
The Operations team deployed a known-working version of the third-party component at 18:00 UTC on 2024-02-19.
After further monitoring of backend metrics, the scheduled restarts were removed and the Operations team updated the incident on Statuspage to the “monitoring” status at 19:00 UTC on 2024-02-19.
After further tests, internal validation, and confirmation from customers, the Statuspage incident was resolved on 2024-02-22 at 17:30 UTC.