Pulse Software - Major outage – Incident details

Major outage

Resolved
Started 26 days ago
Lasted 14 minutes

Affected

Pulse Web Service

Major outage from 3:45 AM to 3:59 AM

Forms Service

Major outage from 3:45 AM to 3:59 AM

Workflow Service

Major outage from 3:45 AM to 3:59 AM

Identity Service

Major outage from 3:45 AM to 3:59 AM

Email Service

Major outage from 3:45 AM to 3:59 AM

Public Job Sites

Major outage from 3:45 AM to 3:59 AM

Updates
  • Postmortem

    Date and Time of Incident

    2025-07-17 13:45

    Incident Type

    Major Outage

    Reported By

    Reported Internally

    Location/System Affected

    All Australian Sites/Customers

    Prepared By

    Site Reliability Engineer

    Acknowledged By

     

     

    Description of the Incident

    On 17 June 2025 at approximately 13:43 AEST, a sudden spike in resource consumption was detected across our infrastructure. This surge posed a potential performance risk to all customers. In response, our team initiated standard mitigation procedures to stabilize the environment and maintain service quality.

    During this process, an engineer made a configuration error while addressing an overloaded server. The error took the platform offline for 16 minutes.

    Root Cause Analysis

    The incident was caused by a manual configuration process that should have been automated. The lack of automation introduced the possibility of human error, which ultimately led to the misconfiguration and temporary service disruption.

    Timeline of Events

    · 13:43 – Overloaded server identified

    · 13:45 – Configuration error occurred

    · 14:00 – Alternative Server Cluster configured

    · 14:01 – Platform restored and services resumed
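
As a quick check on the timeline above, the following minimal sketch computes the elapsed minutes between entries. It assumes all timestamps fall on the same day and share one time zone. The helper name `minutes_between` is illustrative and not part of any Pulse tooling.

```python
from datetime import datetime

# Timeline entries from the postmortem above (24-hour clock, same day).
timeline = [
    ("13:43", "Overloaded server identified"),
    ("13:45", "Configuration error occurred"),
    ("14:00", "Alternative Server Cluster configured"),
    ("14:01", "Platform restored and services resumed"),
]


def minutes_between(start: str, end: str) -> int:
    """Whole minutes from start to end, both given as HH:MM on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Time from the configuration error to restoration.
outage_minutes = minutes_between(timeline[1][0], timeline[3][0])
# Time from first detection of the overloaded server to restoration.
detection_to_restore = minutes_between(timeline[0][0], timeline[3][0])

print(outage_minutes)        # 16
print(detection_to_restore)  # 18
```

The 16-minute result matches the outage duration stated in the incident description.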

    Post incident review

    Following the incident, a comprehensive review was conducted. Key findings and actions include:

    · Automation Improvements: Plans are underway to automate the configuration process to eliminate manual intervention and reduce the risk of human error.

    · Monitoring Enhancements: Resource monitoring tools will be refined to provide earlier alerts and more granular diagnostics.
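
To illustrate the kind of safeguard an automated configuration process can provide, here is a minimal sketch of a pre-apply validation step. The postmortem does not describe the actual configuration error, so everything here is a hypothetical assumption: the `ClusterConfig` model, the `validate_change` and `apply_change` helpers, and the two rules (keep at least one active server, remove at most one server per change). It is not a description of Pulse's real tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterConfig:
    # Hypothetical model: names of the servers currently receiving traffic.
    active_servers: frozenset


def validate_change(current: ClusterConfig, proposed: ClusterConfig,
                    min_active: int = 1) -> list:
    """Return a list of problems with the proposed change (empty means safe)."""
    problems = []
    if len(proposed.active_servers) < min_active:
        problems.append(
            f"change leaves {len(proposed.active_servers)} active server(s); "
            f"at least {min_active} required"
        )
    removed = current.active_servers - proposed.active_servers
    if len(removed) > 1:
        problems.append("change removes more than one server at once")
    return problems


def apply_change(current: ClusterConfig, proposed: ClusterConfig) -> ClusterConfig:
    """Apply the change only if validation passes; otherwise raise ValueError."""
    problems = validate_change(current, proposed)
    if problems:
        raise ValueError("; ".join(problems))
    return proposed


current = ClusterConfig(frozenset({"server-a", "server-b"}))

# Taking one overloaded server out of rotation passes validation.
safe = apply_change(current, ClusterConfig(frozenset({"server-b"})))

# Removing every server is rejected before it can take the platform offline.
try:
    apply_change(current, ClusterConfig(frozenset()))
    rejected = False
except ValueError:
    rejected = True
```

The design choice is that validation runs on every change, so an unsafe edit fails loudly before it reaches production instead of relying on the operator to notice it.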

  • Resolved
    This incident has been resolved.
  • Investigating
    We are investigating a major outage impacting all services.