Global E-Commerce Platform · E-Commerce
The platform operated 2,400 microservices across three cloud regions. To manage costs, their observability stack sampled approximately 12% of log volume, dropping the remaining 88% at the collection layer. When incidents occurred, SRE teams spent most of their time manually correlating partial signals across services, a process that averaged 40 minutes per incident and required senior engineers who understood the full service dependency graph.
LLM reasoning was applied to 100% of infrastructure events: logs, metrics, and traces across all 2,400 services. The model was trained to identify causal chains across service boundaries and surface root-cause hypotheses with confidence scores, delivered directly into the team's existing incident management workflow.
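The shape of such a hypothesis can be sketched as follows. This is a minimal illustration, not the product's actual schema; the class name, fields, and threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RootCauseHypothesis:
    # Hypothetical record shape; field names are illustrative only.
    incident_id: str
    suspected_service: str
    causal_chain: list   # ordered service hops, upstream -> downstream
    confidence: float    # model-assigned score in [0.0, 1.0]


def surfaced_hypotheses(hypotheses, threshold=0.7):
    """Keep only hypotheses confident enough to show on-call engineers,
    highest confidence first. The 0.7 cutoff is an assumed example value."""
    return sorted(
        (h for h in hypotheses if h.confidence >= threshold),
        key=lambda h: h.confidence,
        reverse=True,
    )
```

In a setup like this, low-confidence hypotheses stay out of the incident channel, so engineers only see candidates worth triaging.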
"We had tried an LLM pilot with GPT-4 at 12% coverage. The insights were impressive but the cost was absurd. This is the same quality of reasoning at a cost that actually works in production."
VP of Infrastructure Engineering
The most impactful early finding was a class of incidents the team had been categorizing as "intermittent database timeouts" for 14 months. Full-coverage LLM reasoning identified that all 23 incidents in that category shared a common upstream cause: a specific connection-pool behavior in a shared library used by 340 services. The fix was deployed in a single patch.
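The kind of grouping described above, collapsing many superficially distinct incidents onto one shared upstream cause, can be sketched in a few lines. The incident IDs and service names below are invented for illustration, assuming each incident already carries an inferred causal chain ordered upstream to downstream.

```python
from collections import defaultdict


def cluster_by_upstream(incidents):
    """Group incidents by the head of their causal chain (the inferred
    root cause). `incidents` is a list of (incident_id, chain) pairs."""
    clusters = defaultdict(list)
    for incident_id, chain in incidents:
        clusters[chain[0]].append(incident_id)
    return dict(clusters)


# Hypothetical data: two "database timeout" incidents trace back to the
# same shared library; an unrelated incident stays in its own cluster.
incidents = [
    ("INC-101", ["shared-connpool-lib", "orders-db", "checkout-svc"]),
    ("INC-117", ["shared-connpool-lib", "inventory-db", "cart-svc"]),
    ("INC-130", ["cdn-edge", "assets-svc"]),
]
```

With chains like these, the two timeout incidents fall into a single cluster keyed by the shared library, which is the signal that one patch can close the whole category.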