Safeguard On Demand outage

Incident Report for One Identity Starling

Postmortem

What Occurred
Some Safeguard On Demand Starling Edition (SGODSE) customers reported that they were unable to access the WebUI of their SPP/SPS appliances.

What went wrong and why?
A product update amended traffic flow rules in a way that blocked traffic between the app gateway in front of each existing customer instance and the servers hosting the SPP and SPS appliances. The impact was limited to the availability of the WebUI.

How are we making incidents like this less likely or less impactful?
We are improving our QA processes to include tests for this upgrade scenario in pre-production environments.

Timeline
2024-04-18 03:59AM PDT

  • An update to the internal service responsible for managing SGODSE instances was deployed.

2024-04-18 04:23AM PDT 

  • Automated testing and monitoring began reporting 502 Bad Gateway errors.
  • Customers reported 502 Bad Gateway errors shortly thereafter.
  • We began investigating.

2024-04-18 09:35AM PDT 

  • The issue was identified and we began fixing customer instances.

2024-04-18 12:45PM PDT 

  • All customers were fully operational.
Posted May 30, 2024 - 15:36 PDT

Resolved

This incident has now been resolved.

We will follow up on this incident with a root cause analysis (RCA).
Posted Apr 18, 2024 - 13:54 PDT

Identified

The issue has been identified and we are working on a fix.
Posted Apr 18, 2024 - 12:01 PDT

Update

We are continuing to investigate this issue.
The next update will be posted in 2 hours.
Posted Apr 18, 2024 - 10:51 PDT

Update

We are continuing to investigate this issue.
Posted Apr 18, 2024 - 07:45 PDT

Investigating

We are currently investigating this issue.
Posted Apr 18, 2024 - 05:59 PDT
This incident affected: On Demand EMEA (Safeguard On Demand), On Demand APAC (Safeguard On Demand), and On Demand NA (Safeguard On Demand).