Pulse Software - Notice history

Pulse Web Service - Operational - 100% uptime (Jun 2025 to Aug 2025)

Forms Service - Operational - 100% uptime (Jun 2025 to Aug 2025)

Workflow Service - Operational - 100% uptime (Jun 2025 to Aug 2025)

Identity Service - Operational - 100% uptime (Jun 2025 to Aug 2025)

Email Service - Operational - 100% uptime (Jun 2025 to Aug 2025)

Public Job Sites - Operational - 100% uptime (Jun 2025 to Aug 2025)

File Service - Operational - 100% uptime (Jun 2025 to Aug 2025)

Notice history

Jul 2025

Major outage
  • Postmortem

    Date and Time of Incident

    2025-07-17 13:45

    Incident Type

    Major Outage

    Reported By

    Reported Internally

    Location/System Affected

    All Australian Sites/Customers

    Prepared By

    Site Reliability Engineer

    Acknowledged By

    Description of the Incident

    On 17 July 2025 at approximately 13:43 AEST, a sudden spike in resource consumption was detected across our infrastructure. This surge posed a potential performance risk to all customers. In response, our team initiated standard mitigation procedures to stabilize the environment and maintain service quality.

    During this process, an engineer made a configuration error while addressing an overloaded server. This error inadvertently caused the platform to go offline for a duration of 16 minutes.

    Root Cause Analysis

    The incident was caused by a manual configuration process that should have been automated. The lack of automation introduced the possibility of human error, which ultimately led to the misconfiguration and temporary service disruption.

    Timeline of Events

    • 13:43 – Overloaded server identified

    • 13:45 – Configuration error occurred

    • 14:00 – Alternative server cluster configured

    • 14:01 – Platform restored and services resumed

    Post-Incident Review

    Following the incident, a comprehensive review was conducted. Key findings and actions include:

    • Automation Improvements: Plans are underway to automate the configuration process to eliminate manual intervention and reduce the risk of human error

    • Monitoring Enhancements: Resource monitoring tools will be refined to provide earlier alerts and more granular diagnostics

  • Resolved
    This incident has been resolved.
  • Investigating

    A major outage is impacting all services.

Jun 2025

Major outage across all services except notifications
  • Postmortem

    Description of the Incident

    On 12th June 2025, a major outage of the Pulse Software platform was reported, impacting all services and all customers. The critical response process was followed, which included setting up a critical response team and sending communications internally. It was found that our microservice infrastructure had failed and did not automatically recover. The critical response team intervened, adding additional infrastructure and removing the unhealthy infrastructure. This is the first time we have had a complete failure of our microservices infrastructure. The outage lasted a total of 65 minutes.

    Over the last six months, uptime for the Pulse Software platform has been an exceptional 99.98%, with no maintenance windows. Following this event, uptime has dropped to 99.96%, which remains well above the 99.5% SLA.
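
    For context, the arithmetic behind these figures is sketched below, assuming a six-month measurement window of roughly 262,800 minutes (the exact window is not stated in this notice):

        # Rough uptime arithmetic only; the measurement window is an assumption.
        window_minutes = 182.5 * 24 * 60                 # ~six months ~= 262,800 minutes
        prior_downtime = (1 - 0.9998) * window_minutes   # ~52.6 minutes at 99.98% uptime
        new_uptime = 1 - (prior_downtime + 65) / window_minutes   # add this 65-minute outage
        sla_budget = 0.005 * window_minutes              # a 99.5% SLA allows ~1,314 minutes
        print(f"Uptime after incident: {new_uptime:.4%}")  # ~99.96%, still above the 99.5% SLA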

    Root Cause Analysis

    Technical Cause:

    The incident was traced to a very high peak workload impacting the change log auditing service. The service was overwhelmed, which crashed the service infrastructure and prevented users from accessing the system. This is the first time the entire service has crashed at Pulse Software.

    Timeline of Events

    • 13 June 2025

    • 14:50: Issue was reported internally within Pulse

    • 14:50: Critical response team set up

    • 14:53: A series of uptime alerts were triggered

    • 14:53: Critical response team found the microservice infrastructure was in an unhealthy state

    • 14:54: Communications were sent internally of a major service disruption impacting all services and all customers

    • 15:05: Redirected all sites to maintenance mode

    • 15:10: Attempts to recover the infrastructure failed

    • 15:11: Started to set up new infrastructure to support microservices

    • 15:55: All services returned to normal operations

    Post-Incident Review

    • Add additional limits to the services that will protect the infrastructure from failing completely (see the illustrative sketch after this list)

    • Expedite the microservice scaling and resilience project and include it in the 2025 Q3 plan

    • Change Log Audit export requires further analysis and performance optimisation
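
    As an illustration of the first action above, a minimal sketch of a per-service concurrency limit, assuming a Python service; the names and the limit value are hypothetical and do not describe Pulse Software's actual implementation:

        import threading

        # Hypothetical cap on concurrent change-log export jobs so that a workload
        # spike sheds load instead of taking down the shared service infrastructure.
        export_slots = threading.BoundedSemaphore(4)

        def run_change_log_export(job_id: str) -> bool:
            if not export_slots.acquire(blocking=False):
                return False      # reject the request rather than overwhelm the service
            try:
                # ... perform the export work for job_id here ...
                return True
            finally:
                export_slots.release()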

  • Resolved

    A service node crashed and failed to automatically recover. Engineers had to manually recover the node.

  • Investigating
    We are currently investigating this incident.

Jun 2025 to Aug 2025
