Remote Access RDP Web Session Outage

Incident Report for One Identity Starling

Postmortem

What happened?

Between 18:18 UTC on 2024-02-01 and 17:30 UTC on 2024-02-22, customers experienced intermittent errors when trying to use the RDP Web Session feature of Safeguard Remote Access (SRA).

The event was triggered by a deployment of updated Safeguard Remote Access product components at 14:18 UTC on 2024-02-01. The deployment caused scaling issues in backend services due to a memory leak.

The event was first reported by customers at 13:54 UTC on 2024-02-07. The team started working on the event by implementing interim fixes to mitigate the service outage and allow customers to use the feature, while working on a permanent solution.

This incident impacted multiple SRA customers across the US and EU regions.

What went wrong and why?

The overall incident can be separated into three distinct issues:

Session URL issues leading to resolution errors in all non-Chrome browsers.
Broken pipe errors caused by a 3rd party component.
Memory errors caused by a 3rd party component.

Session URL issues

The first customer service requests noted that the first 4-5 attempts to launch an RDP web session in SRA failed, with subsequent attempts then working as expected.

We then opened a major incident on Statuspage for the Safeguard Remote Access product at 16:27 UTC on 2024-02-08.

Our Product team found changes in recent application framework updates. Code updates were made and the repaired component was re-deployed to production at 14:58 UTC on 2024-02-09.

Our Support team confirmed at 18:45 UTC on 2024-02-09 that the RDP web session feature was working. Our Operations team then changed the incident on Statuspage to 'Resolved' at 21:05 UTC on 2024-02-09.

At that time, this was believed to have addressed the cause of the outage.

Broken Pipe Errors (Scaling issues of backend components)

Starting at 16:50 UTC on 2024-02-13, customers reported that the loss in RDP functionality had returned.

Our product team found two components that were aggressively scaling towards their maximum operating limit.

As an interim hotfix, we increased the maximum target that the service could scale to, which allowed the service to continue functioning. We also changed the service responsible for performing scaling operations to rule out the original scaling service as a root cause.
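As a rough illustration of why raising the maximum helped, the Python sketch below models a load-proportional scaler whose output is clamped to a configured ceiling. The function name, its parameters, and the proportional rule itself are assumptions made for illustration; this report does not describe the actual scaling service or its configuration.

```python
import math


def desired_replicas(current_replicas: int, observed_load: float,
                     target_load: float, min_replicas: int,
                     max_replicas: int) -> int:
    """Hypothetical proportional scaling rule, clamped to [min, max].

    Scales the replica count by the ratio of observed load to target
    load, then clamps the result so it never exceeds max_replicas.
    """
    if target_load <= 0:
        raise ValueError("target_load must be positive")
    if min_replicas > max_replicas:
        raise ValueError("min_replicas must not exceed max_replicas")
    raw = math.ceil(current_replicas * observed_load / target_load)
    return max(min_replicas, min(max_replicas, raw))


# With a ceiling of 10, the scaler is pinned at its limit even though
# the load calls for more capacity.
capped = desired_replicas(8, 90.0, 60.0, min_replicas=2, max_replicas=10)

# Raising the ceiling lets the same rule reach the capacity it asks for.
uncapped = desired_replicas(8, 90.0, 60.0, min_replicas=2, max_replicas=20)
```

In this model, a service pinned at its ceiling cannot absorb further load, which is consistent with raising the limit working as an interim measure while the underlying cause is investigated.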

The Operations team opened the second major incident on Statuspage for the Safeguard Remote Access product at 12:52 UTC on 2024-02-14.

At 17:46 UTC on 2024-02-14, following our hotfix, customers began to report that the RDP web session feature was operational.

The Product team continued to analyze service logs and found broken pipe errors caused by a bug in the implementation of two components that was introduced as part of a product release on 2024-02-01.

At 14:02 UTC on 2024-02-15, the two suspect components were rolled back to their last known-working version. The Product team also amended one of the components responsible for scaling.

Further monitoring showed proper scaling of backend components and a large drop in broken pipe errors.

The Operations team updated the incident on Statuspage to 'Resolved' at 16:17 UTC on 2024-02-15.

Memory Error

At 07:06 UTC on 2024-02-16, our support team reported that after a few hours of successfully using the RDP web session feature, the same RDP session errors had returned.

Developers immediately started collecting monitoring data and began to reproduce the errors in our non-production environment.

The Operations team created the third major incident on Statuspage at 09:28 UTC on 2024-02-16.

After discussing with the Product team, the Operations team implemented a hotfix that restarted the Safeguard Remote Access service every 4 hours to eliminate the impact of memory errors. These scheduled restarts were communicated via the Statuspage at 10:58 UTC on 2024-02-16.

Customers then confirmed that Safeguard Remote Access RDP functionality had been restored.

The Operations team extended the scheduled restarts to once every 6 hours. This change was reflected on the Statuspage at 18:51 UTC on 2024-02-16.
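The restart schedule can be sketched in Python as follows. This is a minimal illustration that assumes restart slots are aligned to 00:00 UTC, which matches the 0:00, 6:00, 12:00, 18:00 UTC slots announced for the 6-hour interval; the alignment of the earlier 4-hour schedule is not stated in this report. The function name `next_restart` is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def next_restart(now: datetime, interval_hours: int) -> datetime:
    """Return the first restart slot strictly after `now`.

    Assumes slots are aligned to 00:00 UTC and that the interval
    divides 24 evenly (for example 4 or 6 hours).
    """
    if interval_hours <= 0 or 24 % interval_hours != 0:
        raise ValueError("interval_hours must be a positive divisor of 24")
    now = now.astimezone(timezone.utc)
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    # Number of whole intervals already elapsed since midnight UTC.
    slots_passed = (now - midnight) // timedelta(hours=interval_hours)
    return midnight + timedelta(hours=interval_hours) * (slots_passed + 1)


# Example: the 6-hour schedule as of the 18:51 UTC update on 2024-02-16.
after_update = next_restart(
    datetime(2024, 2, 16, 18, 51, tzinfo=timezone.utc), 6
)
```

Moving from a 4-hour to a 6-hour interval reduces the number of daily restarts from six to four, at the cost of letting memory grow for longer between restarts.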

After further monitoring of backend metrics, scaling parameters were updated to better mitigate the memory error and require fewer restarts.

A 3rd party component was identified as a root cause of the memory issues.

The Operations team deployed a known-working version of the 3rd party component at 18:00 UTC on 2024-02-19.

After further monitoring of backend metrics, the scheduled restarts were removed and the Operations team updated the incident on Statuspage to the “monitoring” status at 19:00 UTC on 2024-02-19.

After further tests, internal validation, and confirmation from customers, the Statuspage incident was resolved on 2024-02-22 at 17:30 UTC.

How are we making incidents like this less likely or less impactful?

We are further improving our automated stress testing of RDP and SSH functionality in Safeguard Remote Access.
We are implementing additional monitoring that will specifically alert our Operations Team to failures of this nature.
We are building on our existing validation procedures for 3rd party services that our product uses.
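One way to picture the additional monitoring is a sliding-window check on session launch outcomes, sketched below in Python. Everything in it is hypothetical: the class name, the 5-minute window, the 50% threshold, and the minimum sample count are illustrative assumptions, not a description of the monitoring actually deployed.

```python
from collections import deque


class LaunchFailureMonitor:
    """Hypothetical sliding-window monitor for session launch failures."""

    def __init__(self, window_seconds: float = 300.0,
                 threshold: float = 0.5, min_samples: int = 5) -> None:
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.min_samples = min_samples
        self._events: deque = deque()  # (timestamp, succeeded) pairs

    def record(self, timestamp: float, succeeded: bool) -> None:
        """Store one launch attempt and drop attempts outside the window."""
        self._events.append((timestamp, succeeded))
        self._evict(timestamp)

    def _evict(self, now: float) -> None:
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()

    def failure_rate(self, now: float) -> float:
        """Fraction of failed attempts within the window ending at `now`."""
        self._evict(now)
        if not self._events:
            return 0.0
        failures = sum(1 for _, ok in self._events if not ok)
        return failures / len(self._events)

    def should_alert(self, now: float) -> bool:
        """Alert only when there are enough samples and the rate is high."""
        self._evict(now)
        return (len(self._events) >= self.min_samples
                and self.failure_rate(now) >= self.threshold)


# Pattern similar to the first reports: several failed launches, then a success.
monitor = LaunchFailureMonitor()
for t, ok in [(0, False), (10, False), (20, False), (30, False), (40, True)]:
    monitor.record(float(t), ok)
```

A check of this kind would flag the "first 4-5 attempts fail, later attempts succeed" pattern from launch telemetry rather than relying on customer reports.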

Posted Feb 23, 2024 - 09:03 PST

Resolved

This incident has been resolved.

Posted Feb 22, 2024 - 08:48 PST

Update

A fix for this issue has been deployed and we are monitoring.
We will provide further updates on or before Thursday, 22nd February, 1PM UTC.

Posted Feb 21, 2024 - 08:59 PST

Update

A fix for this issue has been deployed and we are monitoring.
We will provide further updates on or before Wednesday, 21st February, 5:00PM UTC.

Posted Feb 21, 2024 - 06:52 PST

Update

A fix for this issue has been deployed and we are monitoring.
We will provide further updates on or before Wednesday, 21st February, 3:00PM UTC.

Posted Feb 21, 2024 - 04:08 PST

Update

A fix for this issue has been deployed and we are monitoring.
We will provide further updates on or before Wednesday, 21st February, 12:00PM UTC.

Posted Feb 20, 2024 - 08:42 PST

Update

A fix for this issue has been deployed and we are monitoring.
We will provide further updates on or before Tuesday, 20th February, 5:00PM UTC.

Posted Feb 20, 2024 - 05:47 PST

Update

A fix for this issue has been deployed and we are monitoring.
We will provide further updates on or before Tuesday, 20th February, 2:00PM UTC.

Posted Feb 20, 2024 - 02:44 PST

Monitoring

A fix for this issue has been deployed and we are monitoring.
We will provide further updates on or before Tuesday, 20th February, 12:00PM UTC.

Posted Feb 19, 2024 - 11:00 PST

Update

We are continuing to work on a fix for this issue.
Connection resets will occur at 6:00 and 18:00 UTC.
We will provide further updates on or before 7PM UTC.

Posted Feb 19, 2024 - 09:04 PST

Update

We are continuing to work on a repair for this issue.
Connection resets will occur at 0:00, 6:00, 12:00, 18:00 UTC.
We will provide further updates on or before 5PM UTC.

Posted Feb 19, 2024 - 07:18 PST

Update

We are continuing to work on a repair for this issue.
Connection resets will occur at 0:00, 6:00, 12:00, 18:00 UTC.
We will provide further updates on or before 3PM UTC.

Posted Feb 19, 2024 - 04:27 PST

Update

We are continuing to work on a repair for this issue. In the meantime, we will be resetting connections at set intervals to prevent connections from failing for long periods of time. Existing connections will be reset at 0:00, 6:00, 12:00, 18:00 UTC.
We will provide further updates on or before the 19th of February at 1PM UTC.

Posted Feb 16, 2024 - 10:51 PST

Update

We are continuing to develop a fix for Starling Remote Access.

In the meantime, our hotfix remains in place. This hotfix still causes sessions to disconnect every 3-4 hours.

We will provide a further update at 7PM GMT.

Posted Feb 16, 2024 - 09:14 PST

Update

Investigation into a permanent fix for Starling Remote Access continues.

Our temporary hotfix is still in place and continues to ensure service while we investigate. However, disconnects every 3-4 hours are still expected.

We will provide a further update at or before 5PM GMT.

Posted Feb 16, 2024 - 06:47 PST

Identified

Posted Feb 16, 2024 - 04:57 PST

Update

A hotfix is currently in place that will allow full functionality of RDP Web Sessions in SRA. However, sessions may disconnect every 3-4 hours.

We are continuing to work on a permanent fix.

We will provide a further update at 1PM GMT.

Posted Feb 16, 2024 - 02:58 PST

Investigating

Further testing has shown that the functionality of RDP Web Sessions in Starling Remote Access is still partially impacted.

We are investigating and will provide a further update at 11AM GMT.

Posted Feb 16, 2024 - 01:28 PST

This incident affected: One Identity Starling EMEA (Remote Access) and One Identity Starling NA (Remote Access).