Pulse Software - Some intermittent timeouts being reported – Incident details

Some intermittent timeouts being reported

Resolved
Degraded performance
Started 21 days ago
Lasted about 5 hours

Affected

Pulse Web Service

Degraded performance from 6:00 AM to 11:11 AM

Forms Service

Operational from 6:00 AM to 10:51 AM, Degraded performance from 10:51 AM to 11:11 AM

Workflow Service

Operational from 6:00 AM to 10:51 AM, Degraded performance from 10:51 AM to 11:11 AM

Identity Service

Operational from 6:00 AM to 10:51 AM, Degraded performance from 10:51 AM to 11:11 AM

Email Service

Operational from 6:00 AM to 10:51 AM, Degraded performance from 10:51 AM to 11:11 AM

Public Job Sites

Operational from 6:00 AM to 10:51 AM, Degraded performance from 10:51 AM to 11:11 AM

Updates
  • Postmortem

    Incident Summary

    A Microsoft component responsible for extracting data from Excel began causing application pool crashes. A crash in an application pool affects all users connected to that pool. Each crash caused temporary request timeouts for those users until the pool automatically recovered, typically within seconds. As a result, each disruption lasted less than a minute and affected only users tied to the impacted pool.

    Impact

    • Users connected to the affected app pool experienced brief timeouts (under one minute).

    • No lasting or widespread outages occurred, as app pools automatically restarted and recovered within seconds.

    Root Cause

    The failure was traced to a Microsoft Excel data extraction component, which triggered repeated crashes in the application pool.

    Resolution

    • All servers were patched and verified on the morning following the incident.

    • A longer-term remediation has been expedited: migrating the Excel data extraction functionality to a dedicated microservice. This change will reduce dependency on the main web server and improve overall system resilience.
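
    The isolation idea behind this remediation can be illustrated with a minimal sketch. It is not the actual Pulse implementation: it uses a child process as a stand-in for a separate service, and the names `extract_isolated` and `WORKER_SOURCE`, the simulated crash, and the 30-second timeout are all assumptions made for illustration. The point is only that a crash in the extraction step ends the child, not the caller.

```python
import subprocess
import sys

# Hypothetical stand-in for the extraction step. A real worker would parse
# the workbook; this one only simulates success or a hard crash.
WORKER_SOURCE = """
import os, sys
path = sys.argv[1]
if path.endswith("bad.xlsx"):
    os._exit(3)  # simulate a hard crash inside the extraction component
print("extracted " + path)
"""


def extract_isolated(path, timeout_s=30.0):
    """Run the extraction in a child process.

    Returns (ok, message). A crash or hang in the child is reported as a
    failure instead of taking down the calling process.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", WORKER_SOURCE, path],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return (False, "timed out")
    if proc.returncode != 0:
        return (False, f"worker exited with code {proc.returncode}")
    return (True, proc.stdout.strip())
```

    In this sketch a failed extraction surfaces as an error for the one request that triggered it, while other requests served by the caller are unaffected. A dedicated microservice would add a network boundary and independent scaling on top of the same principle.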

    Next Steps

    • Continue monitoring the updated servers for stability.

    • Complete and deploy the microservice migration to fully eliminate reliance on the problematic component.

    • Review alerting and recovery procedures to ensure faster detection and mitigation of future pool crashes.
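
    As one possible shape for the alerting review, the sketch below raises an alert when several pool crashes occur within a short window. The class name `CrashRateAlert`, the threshold of 3 crashes, and the 300-second window are hypothetical values chosen for illustration, not settings taken from the Pulse environment.

```python
from collections import deque


class CrashRateAlert:
    """Sliding-window counter: alert when crashes cluster in time."""

    def __init__(self, threshold=3, window_s=300.0):
        self.threshold = threshold  # crashes needed to trigger an alert (assumed value)
        self.window_s = window_s    # window length in seconds (assumed value)
        self._events = deque()      # timestamps of recent crashes, oldest first

    def record_crash(self, timestamp):
        """Record a crash at `timestamp` (seconds); return True if an alert should fire."""
        self._events.append(timestamp)
        # Drop crashes that have fallen out of the window.
        while self._events and timestamp - self._events[0] > self.window_s:
            self._events.popleft()
        return len(self._events) >= self.threshold
```

    Because each pool recovers within seconds, a single crash is easy to miss. Counting crashes over a window is one way to surface a repeating pattern like the one in this incident.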

  • Resolved
    This incident has been resolved.
  • Monitoring
    We implemented a fix and are currently monitoring the result.
  • Identified

    We have identified that the issue is related to imports. A fix will be deployed tonight.

  • Investigating
    We are currently investigating this incident.