Incident-Ready Observability: What to Set Up Before You Need It
A practical checklist for logs, metrics, traces, and alerting that actually helps during incidents.
When an incident hits, the difference between a 10-minute fix and a 2-hour outage is usually not “more engineers”. It’s whether your observability gives clear answers fast.
This post is a practical baseline you can implement without rebuilding your entire platform.
The goal of observability during an incident
You need to answer four questions quickly:
- What is broken?
- Where is it broken?
- What changed?
- How do we stop the impact?
If your monitoring can’t help you answer these questions, it is just noise during an incident.
Baseline you should have in every environment
Logs
- Every request should have a correlation ID.
- Log format should be structured (JSON recommended).
- Include: service name, environment, request path, status code, latency, user or tenant identifier (if applicable).
- Centralize logs in one place with consistent retention.
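As a rough illustration of the fields listed above, here is a minimal Python sketch that uses only the standard library. The service name `checkout`, the environment `staging`, the tenant `acme`, and the helpers `build_logger` and `log_request` are hypothetical placeholders rather than part of any framework. A real service would read the incoming ID from whatever request header it has standardized on.

```python
import contextvars
import io
import json
import logging
import uuid

# Holds the correlation ID of the request currently being handled.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def __init__(self, service, environment):
        super().__init__()
        self.service = service
        self.environment = environment

    def format(self, record):
        entry = {
            "level": record.levelname,
            "message": record.getMessage(),
            "service": self.service,
            "environment": self.environment,
            "correlation_id": correlation_id.get(),
        }
        # Per-request fields arrive via logger.info(..., extra={"fields": {...}}).
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def build_logger(stream, service="checkout", environment="staging"):
    """Hypothetical helper: a logger that writes JSON lines to `stream`."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service, environment))
    logger = logging.getLogger(f"{service}.{environment}")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def log_request(logger, path, status, latency_ms, tenant, incoming_id=None):
    """Log one finished request, reusing the caller's correlation ID if given."""
    token = correlation_id.set(incoming_id or uuid.uuid4().hex)
    try:
        logger.info("request completed", extra={"fields": {
            "path": path, "status": status,
            "latency_ms": latency_ms, "tenant": tenant}})
    finally:
        correlation_id.reset(token)


buffer = io.StringIO()  # stand-in for stdout or a log shipper
logger = build_logger(buffer)
log_request(logger, "/cart", 200, 12.5, tenant="acme", incoming_id="abc123")
print(buffer.getvalue(), end="")
```

Reusing the incoming ID instead of always generating a new one is what lets you search the centralized logs for a single request across services.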
Metrics
Minimum set per service:
- Request rate (RPS)
- Error rate (4xx and 5xx, tracked separately)
- Latency (p50/p95/p99)
- Saturation (CPU, memory, queue depth)
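To make the request-side metrics concrete, the sketch below derives them from raw request samples in plain Python. It is only an illustration: in practice a metrics library or backend would compute these for you. The `summarize` and `percentile` helpers are hypothetical, the percentile uses the simple nearest-rank method, and client and server errors are kept apart so that a burst of 4xx responses does not hide a 5xx problem.

```python
import math


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(requests, window_seconds):
    """Summarize (status_code, latency_ms) tuples observed in one window."""
    total = len(requests)
    latencies = sorted(latency for _, latency in requests)
    return {
        "rps": total / window_seconds,
        "client_error_rate": sum(400 <= s < 500 for s, _ in requests) / total,
        "server_error_rate": sum(s >= 500 for s, _ in requests) / total,
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
    }


# Made-up sample: 100 requests in a 10-second window, latencies 1..100 ms.
statuses = [200] * 90 + [404] * 5 + [500] * 5
requests = [(status, float(i + 1)) for i, status in enumerate(statuses)]
summary = summarize(requests, window_seconds=10)
print(summary)
```

Saturation signals such as CPU, memory, and queue depth come from the host or runtime rather than from request samples, so they are left out of this sketch.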
Traces
Distributed tracing is optional until it’s not. If you have microservices or async flows, you want:
- trace ID propagated across services
- spans for external calls (DB, cache, HTTP dependencies)
- sampling rules you can adjust during incidents
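The first and last points can be sketched without any tracing library. The example below assumes the W3C Trace Context `traceparent` header layout (`version-traceid-parentid-flags`). The `Sampler` class and `outgoing_traceparent` function are hypothetical names for illustration, and a real setup would normally rely on a tracing SDK rather than hand-built headers.

```python
import secrets


class Sampler:
    """Head sampler whose rate can be changed at runtime, e.g. during an incident."""

    def __init__(self, rate):
        self.rate = rate  # fraction of new traces to keep, 0.0 to 1.0

    def should_sample(self, trace_id):
        # Deciding from the trace ID gives the same answer on every service.
        return int(trace_id[:8], 16) / 0x100000000 < self.rate


def outgoing_traceparent(incoming, sampler):
    """Build the traceparent header to send with a downstream call."""
    if incoming:
        # Continue the caller's trace and respect its sampling decision.
        _version, trace_id, _parent_id, flags = incoming.split("-")
    else:
        # No caller context: start a new trace and decide sampling here.
        trace_id = secrets.token_hex(16)
        flags = "01" if sampler.should_sample(trace_id) else "00"
    span_id = secrets.token_hex(8)  # new span, same trace
    return f"00-{trace_id}-{span_id}-{flags}"


sampler = Sampler(rate=0.01)  # quiet default
sampler.rate = 1.0            # turned up while an incident is open

incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
downstream = outgoing_traceparent(incoming, sampler)
fresh = outgoing_traceparent(None, sampler)
print(downstream)
print(fresh)
```

The important property is that the trace ID survives every hop while each service contributes its own span ID.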
Alerting that doesn’t create burnout
Good alerts are:
- actionable
- tied to impact
- routed to the right owner
Bad alerts are:
- “CPU > 80%” with no context
- flapping thresholds
- anything that pages without a runbook
A simple approach:
- Page on symptoms (error rate, latency)
- Create tickets on causes (CPU, memory, disk, scaling)
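One way to picture this routing rule is the small Python sketch below. Everything in it is an assumption made for illustration: the `Alert` fields, the signal names, the thresholds, the team names, and the `runbooks.example.com` URLs are all hypothetical. It also makes two choices the list above only implies: an alert must breach several evaluations in a row before it fires (to avoid flapping), and a symptom without a runbook is downgraded to a ticket instead of paging.

```python
from dataclasses import dataclass

# Signals users feel directly; everything else is treated as a cause.
SYMPTOMS = {"error_rate", "latency_p99"}


@dataclass
class Alert:
    signal: str
    samples: list      # most recent evaluations, oldest first
    threshold: float
    owner: str
    runbook_url: str = ""


def is_firing(alert, for_samples=3):
    """Require consecutive breaches so a single spike does not flap."""
    recent = alert.samples[-for_samples:]
    return len(recent) == for_samples and all(s > alert.threshold for s in recent)


def route(alert):
    """Return 'page', 'ticket', or 'none' for one alert."""
    if not is_firing(alert):
        return "none"
    if alert.signal in SYMPTOMS and alert.runbook_url:
        return "page"
    return "ticket"


errors = Alert("error_rate", [0.01, 0.06, 0.07, 0.09], 0.05,
               "checkout-team", "https://runbooks.example.com/checkout-errors")
cpu = Alert("cpu", [0.85, 0.90, 0.92], 0.80, "platform-team")
spike = Alert("latency_p99", [200, 900, 210], 500,
              "checkout-team", "https://runbooks.example.com/checkout-latency")

for alert in (errors, cpu, spike):
    print(alert.signal, "->", route(alert), "for", alert.owner)
```

Here the sustained error rate pages its owner, the sustained CPU breach becomes a ticket, and the one-off latency spike produces nothing.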
Add “change awareness”
Most incidents correlate with a recent change. Make sure you can see:
- deployments
- config changes
- feature flag changes
- infrastructure changes
At minimum, annotate dashboards with deploy events.
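If change events are recorded somewhere queryable, answering "what changed?" can be as simple as filtering by time. The sketch below assumes a hypothetical list of events with `at`, `kind`, and `description` fields; the timestamps and descriptions are made up, and the 30-minute lookback is an arbitrary default rather than a recommendation.

```python
from datetime import datetime, timedelta


def recent_changes(events, incident_start, lookback=timedelta(minutes=30)):
    """Return changes made shortly before the incident, newest first."""
    window_start = incident_start - lookback
    hits = [e for e in events if window_start <= e["at"] <= incident_start]
    return sorted(hits, key=lambda e: e["at"], reverse=True)


# Hypothetical change log covering the four kinds of change listed above.
events = [
    {"at": datetime(2024, 1, 15, 11, 0), "kind": "config",
     "description": "raised connection pool size"},
    {"at": datetime(2024, 1, 15, 14, 5), "kind": "feature_flag",
     "description": "enabled new checkout flow for 10% of tenants"},
    {"at": datetime(2024, 1, 15, 14, 22), "kind": "deployment",
     "description": "deployed checkout service"},
    {"at": datetime(2024, 1, 15, 15, 0), "kind": "infrastructure",
     "description": "resized cache cluster"},
]

incident_start = datetime(2024, 1, 15, 14, 30)
suspects = recent_changes(events, incident_start)
for event in suspects:
    print(event["at"].isoformat(), event["kind"], event["description"])
```

The same list of events is what you would render as annotations on a dashboard.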
A fast implementation plan (1–2 weeks)
- Standardize structured logs + correlation IDs
- Add golden signals dashboard for each service
- Implement basic alerting for error rate + latency
- Add deployment annotations
- Add tracing where debugging is currently painful
Photo source
Cover image: Unsplash — https://unsplash.com/photos/laptop-computer-on-table-beside-turned-on-monitor-4hbJ-eymZ1o