Major outage across all services except notifications

Resolved
Major outage
Started 2 months ago, lasted about 1 hour

Affected

Pulse Web Service

Major outage from 4:50 AM to 5:55 AM

Forms Service

Major outage from 4:50 AM to 5:55 AM

Workflow Service

Major outage from 4:50 AM to 5:55 AM

Identity Service

Major outage from 4:50 AM to 5:55 AM

Public Job Sites

Major outage from 4:50 AM to 5:55 AM

File Service

Major outage from 4:50 AM to 5:55 AM

Updates
  • Postmortem

    Description of the Incident

    On 12th June 2025, a major outage of the Pulse Software system was reported. This impacted all services except notifications and all customers. The critical response process was followed, which included setting up a critical response team and sending communications internally. It was found that our microservice infrastructure had failed and did not automatically recover. The critical response team intervened, adding new infrastructure and removing the unhealthy infrastructure. This is the first time we have had a complete failure of our microservices infrastructure. The outage lasted a total of 65 minutes.

    Over the last six months, uptime for the Pulse Software Platform has been 99.98%, with no maintenance windows. After this event it has dropped to 99.96%, which remains well above our 99.5% SLA.
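
    As a rough cross-check of these figures (a minimal sketch; the six-month window length is an assumption, while the 65-minute duration, the 99.98% starting point and the 99.5% SLA are taken from this report):

        # Rough availability arithmetic for a 65-minute outage over a six-month window.
        # Assumes ~182.5 days in six months; figures are illustrative only.
        window_minutes = 182.5 * 24 * 60        # ~262,800 minutes in six months
        outage_minutes = 65
        prior_uptime = 99.98                    # percent, before the incident

        impact = outage_minutes / window_minutes * 100   # ~0.025 percentage points
        print(f"Outage impact: {impact:.3f} percentage points")
        print(f"Revised uptime: {prior_uptime - impact:.2f}%")  # ~99.96%, above the 99.5% SLA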


    Root Cause Analysis

    Technical Cause:

    The incident was traced to a very high peak workload on the change log auditing service. The service was overwhelmed and crashed the shared service infrastructure, preventing users from accessing the system. This is the first time the entire service has crashed at Pulse Software.

    Timeline of the Event

    • 13 June 2025

    • 14:50: Issue was reported internally within Pulse

    • 14:50: Critical response team set up

    • 14:53: A series of uptime alerts were triggered

    • 14:53: Critical response team found the microservice infrastructure was in an unhealthy state

    • 14:54: Communications were sent internally about a major service disruption impacting all services and all customers

    • 15:05: Redirected all sites to maintenance mode

    • 15:10: Attempts to recover the infrastructure failed

    • 15:11: Started to set up new infrastructure to support the microservices

    • 15:55: All services returned to normal operations


    Post-Incident Review

    • Add additional limits to the services that will protect the infrastructure from failing completely (a sketch follows this list)

    • Expedite the microservice scaling and resilience project and include it in the 2025 Q3 plan

    • The Change Log Audit export requires further analysis and performance optimisation
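
    A minimal sketch of what the first action above could look like, assuming a Python service; every name here (run_audit_export, MAX_CONCURRENT_EXPORTS, ExportRejected) is hypothetical and not taken from the Pulse codebase. The idea is to cap concurrent change log export work and shed excess load rather than letting a traffic peak take down shared infrastructure:

        import threading

        # Hypothetical guard in front of the change log audit export.
        # Caps concurrent exports so a peak in export traffic degrades gracefully
        # instead of crashing shared infrastructure.
        MAX_CONCURRENT_EXPORTS = 4
        _export_slots = threading.BoundedSemaphore(MAX_CONCURRENT_EXPORTS)

        class ExportRejected(Exception):
            """Raised when export capacity is reached; callers should retry later."""

        def _do_export(request):
            # Placeholder for the real export work.
            return f"exported {request}"

        def run_audit_export(request):
            # Fail fast when the cap is reached rather than queueing unbounded work.
            if not _export_slots.acquire(blocking=False):
                raise ExportRejected("Change log export capacity reached; try again later")
            try:
                return _do_export(request)
            finally:
                _export_slots.release()

    Per-tenant rate limits or queue-depth caps would serve the same purpose; the point is that the limit sits in front of the audit service so a failure there is contained rather than cascading.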

  • Resolved

    A service node crashed and failed to automatically recover. Engineers had to manually recover the node.

  • Investigating
    We are currently investigating this incident.