Major outage across all services except notifications

Postmortem

07 August, 2025 at 4:18 AM

Description of the Incident

On 12 June 2025, a major outage of the Pulse Software system was reported, impacting all services and all customers. The critical response process was followed, including setting up a critical response team and sending internal communications. The microservice infrastructure had failed and did not recover automatically. The critical response team intervened by adding new infrastructure and removing the unhealthy infrastructure. This is the first complete failure of our microservices infrastructure. The outage lasted a total of 65 minutes.

Over the last six months, uptime for the Pulse Software Platform has been exceptional at 99.98%, with no maintenance windows. After this event, uptime dropped to 99.96%, which remains well above the 99.5% SLA.
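
The uptime figures can be sanity-checked with a short calculation. The sketch below is illustrative only: it assumes a six-month window of 182 days and that the full 65-minute outage falls inside that window, neither of which is stated in this report. The function names are hypothetical.

```python
# Assumptions (not from the report): a six-month window of 182 days, and the
# 65-minute outage falling entirely inside that window.
WINDOW_MINUTES = 182 * 24 * 60   # 262,080 minutes
UPTIME_BEFORE = 99.98            # percent, from the report
OUTAGE_MINUTES = 65              # from the report
SLA = 99.5                       # percent, from the report


def uptime_after_outage(uptime_before_pct, outage_minutes, window_minutes):
    """Return the uptime percentage after adding one more outage to the window."""
    prior_downtime = (100.0 - uptime_before_pct) / 100.0 * window_minutes
    total_downtime = prior_downtime + outage_minutes
    return 100.0 * (1.0 - total_downtime / window_minutes)


def sla_downtime_budget(sla_pct, window_minutes):
    """Return the minutes of downtime allowed in the window under the SLA."""
    return (100.0 - sla_pct) / 100.0 * window_minutes


uptime_after = uptime_after_outage(UPTIME_BEFORE, OUTAGE_MINUTES, WINDOW_MINUTES)
print(f"Uptime after the outage: {uptime_after:.2f}%")
print(f"SLA downtime budget: {sla_downtime_budget(SLA, WINDOW_MINUTES):.0f} minutes")
```

Under these assumptions the result rounds to 99.96%, matching the figure above, and a 99.5% SLA would allow roughly 1,310 minutes of downtime over the same window.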

Root Cause Analysis

Technical Cause:

The incident was traced to a very high peak workload on the change log auditing service. The service was overwhelmed and crashed the service infrastructure, preventing users from accessing the system. This is the first time the entire service has crashed at Pulse Software.

Timeline of Events

12 June 2025
14:50: Issue was reported internally within Pulse
14:50: Critical response team was set up
14:53: A series of uptime alerts were triggered
14:53: Critical response team found the microservice infrastructure was in an unhealthy state
14:54: Communications were sent internally about a major service disruption impacting all services and all customers
15:05: Redirected all sites to maintenance mode
15:10: Attempts to recover the infrastructure failed
15:11: Started to set up new infrastructure to support microservices
15:55: All services returned to normal operations

Post-Incident Review

· Add limits to the services to protect the infrastructure from failing completely

· Expedite the microservice scaling and resilience project and include it in the 2025 Q3 plan

· Change Log Audit export requires further analysis and performance optimisation
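
This report does not specify what form the service limits will take. As one possible illustration, the sketch below caps the number of in-flight requests and rejects anything beyond the cap, so that a peak workload is shed rather than allowed to overwhelm the service. The class name, the cap value, and the export_audit_log function are all hypothetical and are not part of the Pulse Software platform.

```python
import threading


class ConcurrencyLimit:
    """Reject work beyond a fixed number of in-flight requests (load shedding)."""

    def __init__(self, max_in_flight):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def run(self, task):
        # Non-blocking acquire: when no slot is free, shed the request
        # instead of queueing it and letting the backlog grow.
        if not self._slots.acquire(blocking=False):
            return ("rejected", None)
        try:
            return ("ok", task())
        finally:
            # Always free the slot, even if the task raised.
            self._slots.release()


def export_audit_log():
    """Hypothetical stand-in for an expensive change log audit export."""
    return "export complete"


limit = ConcurrencyLimit(max_in_flight=1)

# A second request arriving while the first is still running is rejected.
nested = limit.run(lambda: limit.run(export_audit_log))

# Once the first request finishes, the slot is free again.
later = limit.run(export_audit_log)
```

A rejected request would typically be returned to the caller as a retryable error. The cap itself would need to be tuned against measured capacity.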

Resolved

12 June, 2025 at 5:55 AM

A service node crashed and failed to automatically recover. Engineers had to manually recover the node.

Investigating

12 June, 2025 at 4:50 AM