At approximately 8:00 AM, our alerting system detected degraded performance in the production system; several customers reported the issue around the same time. A critical response team was assembled immediately to investigate.
The initial investigation identified the in-memory cache service as the primary bottleneck: it was operating beyond its expected capacity. The underlying cause of the increased load was not immediately clear, so to mitigate customer impact and allow additional time for analysis, we decided to scale up the cache service.
The scale-up alleviated the bottleneck almost immediately, and system performance returned to normal operating levels.
As a follow-up action, we will integrate the in-memory cache service into our APM observability platform, improving visibility into cache load and enabling earlier detection of similar issues.
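As an illustrative sketch only (the metric names, thresholds, and alert logic below are hypothetical, not the actual APM integration), the kind of signal such monitoring could surface might look like:

```python
from dataclasses import dataclass

# Hypothetical example: once the cache service exports metrics to the
# APM platform, an alert can fire on sustained over-capacity load
# instead of waiting for customer reports. Names and thresholds are
# illustrative assumptions, not real service configuration.

@dataclass
class CacheMetrics:
    used_bytes: int
    capacity_bytes: int
    hits: int
    misses: int

    @property
    def utilization(self) -> float:
        """Fraction of configured capacity currently in use."""
        return self.used_bytes / self.capacity_bytes

    @property
    def hit_ratio(self) -> float:
        """Hits as a fraction of total lookups (0.0 if no traffic)."""
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

def should_alert(m: CacheMetrics,
                 max_utilization: float = 0.85,
                 min_hit_ratio: float = 0.90) -> bool:
    """Flag the conditions seen in this incident: a cache operating
    beyond expected capacity, or a hit ratio low enough to suggest
    heavy eviction pressure."""
    return m.utilization > max_utilization or m.hit_ratio < min_hit_ratio
```

A cache at 95% utilization would trip `should_alert` well before performance degrades to customer-visible levels, which is the gap this follow-up is intended to close.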