Date and Time of Incident | 2025-07-17 13:45 |
Incident Type | Major Outage |
Reported By | Reported Internally |
Location/System Affected | All Australian Sites/Customers |
Prepared By | Site Reliability Engineer |
Acknowledged By | |
Description of the Incident
On 17 July 2025 at approximately 13:43 AEST, a sudden spike in resource consumption was detected across our infrastructure. This surge posed a potential performance risk to all customers. In response, our team initiated standard mitigation procedures to stabilize the environment and maintain service quality.
During this process, an engineer made a configuration error while addressing an overloaded server. The error took the platform offline for 16 minutes, from 13:45 to 14:01 AEST.
Root Cause Analysis
The incident was caused by a manual configuration process that should have been automated. The lack of automation introduced the possibility of human error, which ultimately led to the misconfiguration and temporary service disruption.
Timeline of Events
· 13:43 – Overloaded server identified
· 13:45 – Configuration error occurred
· 14:00 – Alternative server cluster configured
· 14:01 – Platform restored and services resumed
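The durations implied by the timeline above can be checked with a short calculation. This is a minimal sketch: the dictionary keys and the helper name `minutes_between` are illustrative, and it assumes all four timestamps fall on the same day in the same time zone.

```python
from datetime import datetime

# Timeline entries copied from the report (all times AEST, same day).
# The key names are illustrative labels, not system identifiers.
timeline = {
    "overload_identified": "13:43",
    "configuration_error": "13:45",
    "alternative_cluster_configured": "14:00",
    "platform_restored": "14:01",
}


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes between two HH:MM times on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Outage window: from the configuration error to platform restoration.
outage_minutes = minutes_between(
    timeline["configuration_error"], timeline["platform_restored"]
)

# Full incident window: from first detection to platform restoration.
incident_minutes = minutes_between(
    timeline["overload_identified"], timeline["platform_restored"]
)

print(outage_minutes)    # 16
print(incident_minutes)  # 18
```

The 16-minute result matches the outage duration stated in the description; the 18-minute figure covers the full span from detection to restoration.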
Post-Incident Review
Following the incident, a comprehensive review was conducted. Key findings and actions include:
· Automation Improvements: Plans are underway to automate the configuration process to eliminate manual intervention and reduce the risk of human error
· Monitoring Enhancements: Resource monitoring tools will be refined to provide earlier alerts and more granular diagnostics
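As one possible illustration of what "earlier alerts" could mean in practice, the sketch below adds a warning tier below the critical threshold so that rising utilization is flagged sooner. The threshold values, function names, and sample readings are all hypothetical and are not taken from the monitoring tools used in this incident.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical thresholds, expressed as a fraction of capacity.
# These values are assumptions for illustration only.
WARN_THRESHOLD = 0.70
CRITICAL_THRESHOLD = 0.90


def classify_utilization(utilization: float) -> Optional[str]:
    """Return 'critical', 'warning', or None for a utilization in [0, 1]."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    if utilization >= CRITICAL_THRESHOLD:
        return "critical"
    if utilization >= WARN_THRESHOLD:
        return "warning"
    return None


def first_alert(samples: Sequence[float]) -> Optional[Tuple[int, str]]:
    """Return (index, level) of the first sample that raises an alert."""
    for index, utilization in enumerate(samples):
        level = classify_utilization(utilization)
        if level is not None:
            return index, level
    return None


# Made-up readings showing a steady climb in resource consumption.
readings = [0.55, 0.68, 0.74, 0.88, 0.95]
print(first_alert(readings))  # (2, 'warning')
```

With only a critical threshold, the first alert in this made-up series would fire at the fifth reading; the warning tier fires at the third, giving responders more time before a server becomes overloaded.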