What Is Observability? A Practical Guide to Logs, Metrics, Traces, OpenTelemetry and SLOs
A practical guide to observability for IT systems, covering monitoring vs observability, logs, metrics, traces, OpenTelemetry, Prometheus, Grafana, SLI/SLO, alerting, dashboards, incident response and a 90-day rollout roadmap.
Quick summary
Observability is the ability to understand the internal state of a system from the data it emits, mainly logs, metrics, traces and increasingly profiles. In modern systems such as microservices, Kubernetes, serverless, API gateways, databases, queues, caches and CI/CD platforms, observability helps teams understand whether the system is healthy, where failures happen, who is affected and what should be fixed first.
OpenTelemetry describes observability as understanding a system’s internal state by examining its outputs, and it highlights telemetry signals such as traces, metrics and logs [1]. OpenTelemetry is an open-source observability framework made of APIs, SDKs, tools and integrations for creating, collecting, managing and exporting telemetry data [2].
Simple version: monitoring tells you something is wrong; observability helps you understand why.
Why observability matters
A modern request may travel through many layers:
Mobile app
↓
CDN / WAF
↓
API Gateway
↓
Auth service
↓
Business service
↓
Database + Redis + Queue
↓
Payment provider
↓
Notification service
When users report that checkout is slow, you need to answer:
- is the issue frontend or backend?
- which service adds latency?
- which database query is slow?
- is a queue backlogged?
- is the issue regional or global?
- did a deployment cause it?
- how many users are affected?
- should the team roll back, scale, fix a query or contact a provider?
Logs alone rarely answer these questions quickly. Observability connects data across layers.
Monitoring vs observability
| Criteria | Monitoring | Observability |
|---|---|---|
| Main question | Is the system broken? | Why is it broken? |
| Data | predefined metrics and alerts | logs, metrics, traces, events, profiles |
| Best for | known failures and known thresholds | unknown failures and distributed systems |
| Usage | dashboards and alerts | investigation, correlation, root cause analysis |
| Example | CPU > 90%, error rate > 5% | request is slow because service A calls DB query B after deploy C |
Monitoring is part of observability. You still need monitoring, but threshold-based monitoring alone is not enough for complex systems.
The three main signals: Logs, Metrics, Traces
OpenTelemetry calls logs, metrics and traces the main observability signals [3].
Logs
Logs are event records, usually text or structured events.
Good log example:
{
"timestamp": "2026-06-04T10:15:30Z",
"level": "error",
"service": "checkout-api",
"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
"user_id_hash": "u_12ab",
"order_id": "ord_9482",
"message": "payment authorization failed",
"provider": "stripe",
"error_code": "card_declined"
}
Logs are good for:
- detailed debugging;
- audit trails;
- exceptions;
- business events;
- security events;
- incident investigation.
Common mistakes:
- unstructured logs;
- inconsistent timestamps;
- logging secrets or passwords;
- too much noise;
- missing trace_id/request_id;
- vague messages like “something went wrong.”
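The log shape above can be produced with Python's standard `logging` module. The following is a minimal sketch, not a production setup: the `JsonFormatter` class, the hard-coded service name and the list of extra fields are assumptions chosen to mirror the example event.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    # Hypothetical set of context fields this sketch copies from the record.
    EXTRA_FIELDS = ("trace_id", "order_id", "provider", "error_code")

    def format(self, record: logging.LogRecord) -> str:
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        event = {
            # UTC ISO-8601 timestamps keep services consistent with each other.
            "timestamp": timestamp.isoformat(timespec="seconds").replace("+00:00", "Z"),
            "level": record.levelname.lower(),
            "service": "checkout-api",  # assumed service name
            "message": record.getMessage(),
        }
        # Values passed through `extra=` become attributes on the record.
        for key in self.EXTRA_FIELDS:
            if hasattr(record, key):
                event[key] = getattr(record, key)
        return json.dumps(event)


logger = logging.getLogger("checkout-api")
logger.setLevel(logging.INFO)
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)

logger.error(
    "payment authorization failed",
    extra={
        "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
        "order_id": "ord_9482",
        "provider": "stripe",
        "error_code": "card_declined",
    },
)
```

Keeping the context in separate fields, rather than inside the message string, is what makes the log searchable by `trace_id` or `error_code` later.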
Metrics
Metrics are numeric measurements over time. Prometheus describes itself as a monitoring and alerting system that collects and stores metrics as time series, with timestamps and key-value labels [4].
Metric examples:
http_requests_total{service="checkout-api",status="500"} 42
http_request_duration_seconds_bucket{service="checkout-api",le="0.5"} 1832
queue_depth{queue="payment-events"} 1200
cpu_usage_percent{pod="checkout-api-7d9"} 82
Metrics are good for:
- dashboards;
- alerts;
- trends;
- capacity planning;
- SLI/SLO tracking;
- autoscaling;
- before/after deployment comparison.
Common mistakes:
- high-cardinality labels;
- CPU-based alerts instead of user-impact alerts;
- missing service/version/region labels;
- no latency histograms;
- only infrastructure metrics, no business metrics.
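To show why latency histograms matter, here is a small sketch that estimates percentiles from cumulative `le` buckets by linear interpolation inside the matching bucket, similar in spirit to what PromQL's `histogram_quantile` function does. Only the `le="0.5"` count comes from the example above; the other bucket counts and the function name are assumptions for illustration.

```python
import math
from typing import List, Tuple


def estimate_quantile(q: float, buckets: List[Tuple[float, int]]) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) buckets.

    Buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            # The +Inf bucket has no width, so fall back to the last finite bound.
            if math.isinf(upper_bound) or count == lower_count:
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical cumulative counts for http_request_duration_seconds_bucket.
checkout_buckets = [
    (0.1, 900),
    (0.25, 1500),
    (0.5, 1832),
    (1.0, 1950),
    (math.inf, 2000),
]

p50 = estimate_quantile(0.50, checkout_buckets)  # falls in the 0.1-0.25 bucket
p95 = estimate_quantile(0.95, checkout_buckets)  # falls in the 0.5-1.0 bucket
print(f"p50={p50:.3f}s p95={p95:.3f}s")
```

The estimate is only as precise as the bucket boundaries, which is one reason to choose buckets around the latency targets you care about.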
Traces
Traces follow a request across services. A trace contains spans, and each span represents one operation.
Example:
Trace: checkout request
├── frontend → POST /checkout 1200ms
├── checkout-api → validate cart 30ms
├── checkout-api → auth-service 80ms
├── checkout-api → inventory-service 150ms
├── checkout-api → payment-provider 850ms
└── checkout-api → database write 40ms
Traces are good for:
- microservices;
- distributed systems;
- latency investigation;
- dependency maps;
- finding the slow service;
- debugging request-specific failures.
Common mistakes:
- no trace context propagation;
- only instrumenting the gateway;
- bad sampling strategy;
- missing attributes such as route, region or version;
- no correlation with logs.
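Trace context propagation usually means forwarding a header such as the W3C `traceparent` (`version-trace_id-parent_id-flags`) on every outgoing call. The sketch below is a simplified illustration that only accepts the common `00` layout; the helper names are hypothetical, and defaulting new traces to the sampled flag `01` is an assumption. In practice an OpenTelemetry SDK propagator would handle this for you.

```python
import re
import secrets
from typing import Dict, Optional

_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header: str) -> Optional[Dict[str, str]]:
    """Return the parsed fields, or None if the header is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # All-zero ids and version "ff" are invalid in W3C Trace Context.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {"trace_id": trace_id, "parent_id": parent_id, "flags": flags}


def child_traceparent(incoming: Optional[str]) -> str:
    """Build the header for an outgoing call, keeping the caller's trace_id."""
    context = parse_traceparent(incoming) if incoming else None
    trace_id = context["trace_id"] if context else secrets.token_hex(16)
    flags = context["flags"] if context else "01"  # assumption: sample new traces
    span_id = secrets.token_hex(8)  # new id for this service's own span
    return f"00-{trace_id}-{span_id}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing = child_traceparent(incoming)
print(outgoing)  # same trace_id, new span id
```

If any service in the chain drops this header, the trace splits into unrelated fragments, which is the "no trace context propagation" mistake listed above.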
SLI, SLO and SLA
The Google SRE Book defines an SLO as a target value or range for a service level, measured by an SLI [5].
| Term | Meaning | Example |
|---|---|---|
| SLI | Service Level Indicator, the measurement | 99th percentile latency, availability, error rate |
| SLO | Service Level Objective, the internal target | 99.9% successful requests over 30 days |
| SLA | Service Level Agreement, external commitment | service credits if availability drops below agreement |
Example:
SLI: ratio of 2xx/3xx HTTP responses to total requests
SLO: 99.9% successful requests over 30 days
SLA: customer receives credits if availability falls below 99.5%
SLOs help teams alert on user experience instead of every low-level signal.
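As a rough illustration of the example above, the SLI can be computed from per-status request counts and compared with the SLO target. The counts and the function name are made up for this sketch, and it follows the example's definition literally, so 4xx responses count against the SLI here.

```python
from typing import Dict, Optional


def availability_sli(status_counts: Dict[int, int]) -> Optional[float]:
    """Ratio of 2xx/3xx responses to all responses; None when there is no traffic."""
    total = sum(status_counts.values())
    if total == 0:
        return None
    good = sum(count for status, count in status_counts.items() if 200 <= status < 400)
    return good / total


SLO_TARGET = 0.999

# Hypothetical request counts for a 30-day window.
counts = {200: 98_950, 302: 1_000, 500: 42, 503: 8}

sli = availability_sli(counts)
meets_slo = sli is not None and sli >= SLO_TARGET
print(f"SLI={sli:.4%} meets SLO: {meets_slo}")
```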
What is an error budget?
If an availability SLO is 99.9% over 30 days, the service may fail 0.1% of requests, or be fully unavailable for about 43 minutes, within that window. That allowed amount of failure is the error budget.
Example:
SLO: 99.9% success rate
Error budget: 0.1% failed requests
If the error budget burns too fast, the team should:
- pause risky releases;
- prioritize reliability work;
- roll back instead of pushing forward;
- reduce change velocity;
- investigate root cause.
Error budgets help balance feature speed and reliability.
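The arithmetic behind an error budget is short enough to sketch directly. The function names and the request numbers below are assumptions for illustration; the 99.9% target and 30-day window come from the example.

```python
from datetime import timedelta


def allowed_downtime(slo: float, window: timedelta) -> timedelta:
    """Full-outage time the SLO tolerates within the window."""
    return window * (1 - slo)


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the request-based error budget still unspent (can go negative)."""
    allowed_failures = (1 - slo) * total_requests
    return 1 - failed_requests / allowed_failures


downtime = allowed_downtime(0.999, timedelta(days=30))
print(downtime)  # roughly 43 minutes for 99.9% over 30 days

# Hypothetical month so far: 1,000,000 requests, 250 of them failed.
remaining = budget_remaining(0.999, total_requests=1_000_000, failed_requests=250)
print(f"{remaining:.0%} of the error budget remains")
```

A negative result from `budget_remaining` means the budget is already overspent, which is the point at which the actions listed above usually apply.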
Good alerting
The Google SRE Book emphasizes that pages should be reserved for urgent, actionable conditions. Alerts that require no action create fatigue and get ignored [6].
Good alerts have:
- clear user impact;
- severity;
- runbook;
- owner;
- dashboard link;
- trace/log query link;
- SLO-based threshold when possible;
- deduplication for the same root cause.
Avoid alerts for:
- small CPU spikes with no user impact;
- warnings requiring no action;
- log noise;
- short-lived metric fluctuations;
- symptoms already covered by another alert.
Good alert example:
Checkout API 5xx rate exceeds 5% for 10 minutes
Impact: users cannot complete checkout
Action: check recent deployments, payment provider and database errors
Dashboard: ...
Runbook: ...
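The "for 10 minutes" part of that alert is what filters out short-lived spikes. A minimal sketch of the idea, assuming one error-rate sample per minute and a hypothetical `should_fire` helper:

```python
from typing import Sequence


def should_fire(error_rates: Sequence[float], threshold: float = 0.05, for_minutes: int = 10) -> bool:
    """Fire only if every one of the last `for_minutes` samples exceeds the threshold.

    `error_rates` holds one 5xx ratio per minute, oldest first.
    """
    if len(error_rates) < for_minutes:
        return False  # not enough history to call the condition sustained
    return all(rate > threshold for rate in error_rates[-for_minutes:])


sustained = [0.01] * 5 + [0.08] * 10  # ten bad minutes in a row
brief_spike = [0.01] * 12 + [0.20] * 2  # two bad minutes only

print(should_fire(sustained))    # True
print(should_fire(brief_spike))  # False
```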
What is OpenTelemetry?
OpenTelemetry, or OTel, is an open-source, vendor-neutral observability framework. It provides APIs, SDKs, the Collector and protocol support to generate and export telemetry data to backends such as Prometheus, Grafana, Jaeger, Tempo, Datadog, New Relic, Honeycomb, Elastic or internal systems [2].
Strengths:
- vendor-neutral;
- supports many languages;
- auto-instrumentation;
- manual instrumentation;
- OTLP protocol;
- OpenTelemetry Collector;
- semantic conventions for standard attributes.
OpenTelemetry is a Cloud Native Computing Foundation (CNCF) project [7].
What is OpenTelemetry Collector?
The OpenTelemetry Collector receives, processes and exports telemetry data. Because applications send telemetry to the Collector, they do not each need a direct integration with every backend.
Common flow:
App / Service
↓ OTLP
OpenTelemetry Collector
↓ processors: batch, filter, transform, sampling
Observability backends
├── Prometheus / Mimir for metrics
├── Loki / Elasticsearch for logs
└── Tempo / Jaeger for traces
Collector components:
| Component | Role |
|---|---|
| Receivers | receive telemetry from apps or agents |
| Processors | batch, filter, redact, transform, sample |
| Exporters | send telemetry to backends |
The OpenTelemetry docs describe the Collector as a way to receive, process and export telemetry through receivers, processors, exporters and connectors [8].
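The receiver, processor and exporter roles can be pictured with a toy pipeline. This is only a conceptual model in Python, not how the Collector is implemented or configured; the `Pipeline` class, the `/healthz` route and the batch size are all assumptions.

```python
from typing import Callable, Dict, Iterator, List

Span = Dict[str, object]
Processor = Callable[[List[Span]], List[Span]]
Exporter = Callable[[List[Span]], None]


def drop_health_checks(spans: List[Span]) -> List[Span]:
    """Filter processor: discard spans for a hypothetical /healthz route."""
    return [span for span in spans if span.get("http.route") != "/healthz"]


def in_batches(spans: List[Span], size: int) -> Iterator[List[Span]]:
    """Batch step: group spans so the exporter makes fewer calls."""
    for start in range(0, len(spans), size):
        yield spans[start:start + size]


class Pipeline:
    def __init__(self, processors: List[Processor], exporter: Exporter, batch_size: int = 2):
        self.processors = processors
        self.exporter = exporter
        self.batch_size = batch_size

    def receive(self, spans: List[Span]) -> None:
        """Receiver entry point: run processors in order, then export in batches."""
        for process in self.processors:
            spans = process(spans)
        for chunk in in_batches(spans, self.batch_size):
            self.exporter(chunk)


exported: List[List[Span]] = []  # stands in for a tracing backend
pipeline = Pipeline([drop_health_checks], exported.append, batch_size=2)
pipeline.receive([
    {"name": "GET /healthz", "http.route": "/healthz"},
    {"name": "POST /checkout", "http.route": "/checkout"},
    {"name": "GET /cart", "http.route": "/cart"},
    {"name": "POST /login", "http.route": "/login"},
])
print(len(exported), "batches exported")
```

The useful property is that filtering, batching or redaction changes in one place, without touching application code.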
Prometheus, Grafana, Loki and Tempo
| Tool | Role |
|---|---|
| Prometheus | time-series metrics collection and querying |
| Alertmanager | alert routing and deduplication |
| Grafana | dashboards and visualization |
| Loki | log aggregation |
| Tempo | distributed tracing backend |
| Mimir/Thanos | scalable long-term metrics storage |
| Jaeger | distributed tracing backend |
| OpenTelemetry Collector | telemetry collection, processing and export |
Prometheus is commonly used for metrics and alerting; Grafana for dashboards; Loki for logs; Tempo or Jaeger for traces [4][9][10][11].
Example observability architectures
Small-team architecture
App emits logs + metrics
↓
Prometheus scrapes metrics
↓
Grafana dashboard + Alertmanager
↓
Logs sent to Loki
Best for:
- monoliths;
- a few services;
- early-stage observability;
- smaller budgets.
Cloud-native architecture
Services instrumented with OpenTelemetry
↓ OTLP
OpenTelemetry Collector DaemonSet/Gateway
↓
Metrics → Prometheus/Mimir
Logs → Loki/Elastic
Traces → Tempo/Jaeger
↓
Grafana dashboards + alerts + incident runbooks
Best for:
- Kubernetes;
- microservices;
- multiple teams;
- log/metric/trace correlation.
What is instrumentation?
Instrumentation is adding code, libraries or agents so an application emits telemetry.
Types:
| Type | Description |
|---|---|
| Auto-instrumentation | agents/libraries automatically capture HTTP, DB, queue and framework data |
| Manual instrumentation | developers add spans, metrics and attributes for business logic |
Auto-instrumentation is good for starting quickly. Manual instrumentation adds business context.
Example manual span:
Span: checkout.calculate_discount
Attributes:
cart.items_count = 4
user.tier = "premium"
discount.rule = "campaign_2026_q2"
Do not put sensitive PII such as email, phone number, tokens, addresses or payment data into attributes.
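To make the shape of manual instrumentation concrete, here is a stand-in written with the standard library only. The `Span` class and `start_span` helper are hypothetical and are not the OpenTelemetry API; a real service would obtain spans from an OpenTelemetry tracer instead. The discount values are invented for the example.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator, List, Optional


class Span:
    """Hypothetical minimal span: a name, attributes and a duration."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.attributes: Dict[str, object] = {}
        self.duration_ms: Optional[float] = None

    def set_attribute(self, key: str, value: object) -> None:
        self.attributes[key] = value


FINISHED_SPANS: List[Span] = []  # stands in for an exporter


@contextmanager
def start_span(name: str) -> Iterator[Span]:
    span = Span(name)
    started = time.perf_counter()
    try:
        yield span
    finally:
        # Record the duration even if the wrapped code raises.
        span.duration_ms = (time.perf_counter() - started) * 1000
        FINISHED_SPANS.append(span)


def calculate_discount(items_count: int, user_tier: str) -> float:
    with start_span("checkout.calculate_discount") as span:
        span.set_attribute("cart.items_count", items_count)
        span.set_attribute("user.tier", user_tier)
        rule = "campaign_2026_q2" if user_tier == "premium" else "none"
        span.set_attribute("discount.rule", rule)
        return 0.10 if rule != "none" else 0.0  # made-up discount rates


calculate_discount(items_count=4, user_tier="premium")
print(FINISHED_SPANS[-1].name, FINISHED_SPANS[-1].attributes)
```

Note that the attributes describe the business decision (tier, rule, item count) without carrying any user-identifying data.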
Semantic conventions
Semantic conventions are standard names for OpenTelemetry attributes. They help services and tools use common fields such as service.name, http.request.method, url.path, db.system.name and server.address [12].
Benefits:
- reusable dashboards;
- consistent queries;
- better correlation;
- less chaos across teams.
Example:
service.name = checkout-api
deployment.environment.name = production
http.request.method = POST
http.route = /checkout
http.response.status_code = 500
db.system.name = postgresql
Cardinality
Cardinality means the number of unique values in a label or attribute. High-cardinality metrics can make backends expensive, slow and memory-heavy.
Bad example:
http_requests_total{user_id="123456", email="[email protected]", request_id="abc"} 1
Better example:
http_requests_total{service="checkout-api", route="/checkout", status="500"} 1
Rules:
- do not use user_id, email or request_id as metric labels;
- use route templates like /users/:id, not /users/123;
- use traces/logs for request-specific investigation;
- use metrics for aggregate trends.
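The route-template rule can be sketched as a small normalizer that runs before a path becomes a metric label. This version only replaces purely numeric path segments, which is an assumption; most web frameworks expose the matched route template directly, and that is usually the better source.

```python
import re

# Matches a path segment made only of digits, e.g. "/123" in "/users/123".
_NUMERIC_SEGMENT = re.compile(r"/\d+(?=/|$)")


def route_template(path: str) -> str:
    """Collapse numeric ids so many URLs map to one label value."""
    return _NUMERIC_SEGMENT.sub("/:id", path)


paths = ["/users/123", "/users/456", "/users/789/orders/42", "/checkout"]

# Each distinct label combination is a separate time series.
raw_series = {("checkout-api", path) for path in paths}
templated_series = {("checkout-api", route_template(path)) for path in paths}

print(len(raw_series), "series with raw paths")
print(len(templated_series), "series with route templates")
```

With raw paths the series count grows with the number of users; with templates it grows only with the number of routes.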
Dashboard design
A good dashboard answers operational questions.
A service dashboard should include:
- request rate;
- error rate;
- p50/p95/p99 latency;
- saturation: CPU, memory, queue depth, connection pool;
- dependency latency;
- deployment version;
- recent error logs;
- trace exemplars where available;
- SLO burn rate;
- runbook link.
Suggested layout:
Row 1: user impact
Row 2: service health
Row 3: dependencies
Row 4: infrastructure
Row 5: logs/traces links
RED and USE methods
RED method
For services and requests:
Rate: requests per second
Errors: error rate
Duration: latency
Best for APIs, microservices and gateways.
USE method
For resources:
Utilization: resource usage
Saturation: queued/waiting work
Errors: resource errors
Best for CPU, memory, disk, network and connection pools.
Using RED and USE together covers both user impact and infrastructure health.
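As an illustration of the RED side, the three numbers can be derived from a window of request records. The `RequestRecord` type, the sample data and the choice of a nearest-rank p95 are assumptions for this sketch; in production these values normally come from counters and histograms rather than raw records.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class RequestRecord:
    status: int
    duration_s: float


def red_summary(requests: List[RequestRecord], window_s: float) -> Dict[str, Optional[float]]:
    """Rate, Errors and Duration for one service over a time window."""
    total = len(requests)
    errors = sum(1 for request in requests if request.status >= 500)
    durations = sorted(request.duration_s for request in requests)
    p95: Optional[float] = None
    if total:
        rank = -(-95 * total // 100)  # ceil(0.95 * total) using integer math
        p95 = durations[rank - 1]     # nearest-rank percentile
    return {
        "rate_rps": total / window_s,
        "error_ratio": errors / total if total else 0.0,
        "p95_s": p95,
    }


# Hypothetical 10-second window: 18 fast successes, one slow 500, one slow success.
window = (
    [RequestRecord(200, 0.1)] * 18
    + [RequestRecord(500, 0.9)]
    + [RequestRecord(200, 1.2)]
)
summary = red_summary(window, window_s=10)
print(summary)
```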
Kubernetes observability
Kubernetes needs observability across layers:
| Layer | What to measure |
|---|---|
| Cluster | node health, API server, scheduler, etcd |
| Node | CPU, memory, disk, network |
| Pod | restarts, OOMKilled, CPU/memory, readiness |
| Deployment | replicas, rollout status, version |
| Service | request rate, error rate, latency |
| Ingress | 4xx/5xx, TLS, latency |
| Workload | business metrics, queue depth |
| Security | audit logs, policy violations |
Do not monitor only CPU and memory. Many user-facing incidents are caused by dependencies, databases, queues or external APIs.
Database observability
Track:
- query latency;
- slow queries;
- connection pool usage;
- locks/deadlocks;
- replication lag;
- cache hit ratio;
- disk usage;
- transaction rate;
- error rate;
- backup status;
- failover events.
If an API is slow, traces should show which database query or dependency caused the delay.
Frontend observability
Frontend also needs observability:
- page load time;
- Core Web Vitals;
- JavaScript errors;
- browser-side API latency;
- route-level error rate;
- device/browser breakdown;
- geographic latency;
- session replay where policy allows;
- feature flag impact.
A healthy backend does not always mean a good user experience.
Observability for AI/LLM applications
AI applications need extra telemetry:
- model name/version;
- provider;
- prompt tokens;
- completion tokens;
- estimated cost;
- provider latency;
- rate limits;
- retry/fallback count;
- tool calls;
- retrieval latency;
- vector search top-k;
- grounding/citation coverage;
- safety refusals/errors;
- user feedback.
Do not log raw prompts containing sensitive data unless redaction and policy are in place.
Telemetry data security
Telemetry often contains sensitive data. Control it carefully:
- never log passwords, tokens or API keys;
- hash or redact user identifiers;
- avoid PII in traces;
- sample sensitive data carefully;
- encrypt in transit and at rest;
- set retention policy;
- use RBAC for dashboards and logs;
- audit access to logs;
- separate production and development logs;
- redact at OpenTelemetry Collector when possible.
Observability should not become a data leak.
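Two of the controls above, redacting secrets and hashing user identifiers, can be sketched as a function applied to each event before it leaves the service. The key list, the `u_` prefix and the truncated HMAC are assumptions for illustration; the secret would need to come from a secret store rather than source code.

```python
import hashlib
import hmac
from typing import Dict

# Hypothetical list of field names that must never be stored.
SENSITIVE_KEYS = {"password", "token", "api_key", "authorization"}


def hash_identifier(value: str, secret: bytes) -> str:
    """Keyed hash so the same user correlates across events without exposing the id."""
    digest = hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return "u_" + digest[:12]


def redact(event: Dict[str, object], secret: bytes) -> Dict[str, object]:
    """Return a copy of the event that is safer to ship to a telemetry backend."""
    clean: Dict[str, object] = {}
    for key, value in event.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
        elif key == "user_id":
            clean["user_id_hash"] = hash_identifier(str(value), secret)
        else:
            clean[key] = value
    return clean


raw_event = {"user_id": "123456", "password": "hunter2", "order_id": "ord_9482"}
print(redact(raw_event, secret=b"example-only-secret"))
```

Using a keyed hash rather than a plain hash makes it harder to recover identifiers by hashing guessed values.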
30/60/90-day rollout roadmap
Days 1–30: foundation
- Standardize structured logging.
- Add request_id/trace_id to logs.
- Collect core metrics: rate, error, duration and saturation.
- Deploy Prometheus + Grafana or a managed backend.
- Create dashboards for 3–5 critical services.
- Create first user-impact alert.
- Write runbooks for important alerts.
- Set retention policy.
Days 31–60: traces and SLOs
- Add OpenTelemetry SDK or auto-instrumentation.
- Add OpenTelemetry Collector.
- Enable distributed tracing for critical requests.
- Define SLI/SLO for critical services.
- Add SLO burn-rate alerts.
- Correlate logs with traces.
- Add dependency metrics for DB, queues, caches and external APIs.
- Review metric cardinality.
Days 61–90: optimization and governance
- Add trace sampling policy.
- Add Collector redaction/filtering.
- Move dashboards to code.
- Review alerts regularly.
- Create incident postmortem template.
- Optimize logs/traces cost.
- Build service catalog with owner/runbook/SLO.
- Add synthetic checks for critical user journeys.
- Add frontend and AI workload observability where relevant.
Quick checklist
Logs
- Structured JSON logs.
- Service name, environment and version.
- trace_id/request_id.
- No secrets or sensitive PII.
- Correct log levels.
- Retention policy.
Metrics
- RED metrics for services.
- USE metrics for resources.
- Latency histograms.
- Controlled label cardinality.
- Important business metrics.
- SLO-based alerts.
Traces
- Trace context propagation.
- Gateway, service, DB, queue and external call spans.
- Good sampling policy.
- Standard span attributes.
- Trace-log correlation.
- No PII in spans.
Dashboards
- Service dashboard.
- User journey dashboard.
- Links to logs/traces.
- Deployment markers.
- SLO/burn rate.
- Owner and runbook.
Alerts
- Clear action.
- User impact.
- Severity.
- Runbook.
- Dedup/silence.
- Regular alert-noise review.
Common mistakes
- Dashboards without useful alerts.
- Too many noisy alerts.
- Logs without request_id/trace_id.
- High-cardinality metric labels.
- Measuring only CPU/memory, not user impact.
- Trace sampling that drops important errors.
- No runbooks.
- No service owner.
- Retention too long and too expensive.
- Secrets or PII in logs.
- No deployment-to-incident correlation.
- No SLOs.
Reference tooling
| Need | Tool |
|---|---|
| Metrics | Prometheus, Mimir, VictoriaMetrics |
| Dashboards | Grafana |
| Logs | Loki, Elasticsearch/OpenSearch |
| Traces | Tempo, Jaeger |
| Telemetry framework | OpenTelemetry |
| Collector | OpenTelemetry Collector |
| Alert routing | Alertmanager, Grafana Alerting, PagerDuty/Opsgenie |
| Synthetic monitoring | Grafana Synthetic Monitoring, Checkly, k6 |
| Runtime profiling | Parca, Pyroscope |
| Incident management | PagerDuty, Opsgenie, incident.io |
Tools matter less than correct data, useful alerts and good incident process.
Practical SLO examples
Checkout API
SLI: ratio of successful POST /checkout requests
SLO: 99.9% over 30 days
Exclude: 4xx requests caused by invalid user input
Alert: burn 2% of budget in 1 hour or 10% in 6 hours
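The alert line above can be translated into burn rates, meaning how many times faster than "exactly on budget" the service is consuming its error budget. A small sketch of that conversion, assuming the 30-day window from the SLO and hypothetical function names:

```python
from datetime import timedelta


def burn_rate(budget_fraction: float, alert_window: timedelta,
              slo_window: timedelta = timedelta(days=30)) -> float:
    """Burn rate that consumes `budget_fraction` of the budget within `alert_window`."""
    return budget_fraction * (slo_window / alert_window)


def error_ratio_threshold(rate: float, slo: float) -> float:
    """Error ratio that, if sustained, corresponds to the given burn rate."""
    return rate * (1 - slo)


fast = burn_rate(0.02, timedelta(hours=1))   # about 14.4x
slow = burn_rate(0.10, timedelta(hours=6))   # about 12x

print(f"fast burn: {fast:.1f}x, error ratio above {error_ratio_threshold(fast, 0.999):.2%}")
print(f"slow burn: {slow:.1f}x, error ratio above {error_ratio_threshold(slow, 0.999):.2%}")
```

For a 99.9% SLO, a burn rate of about 14.4 means roughly 1.44% of checkout requests failing over the last hour.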
Search service
SLI: proportion of search requests served in under 500ms
SLO: 99% of search requests meet latency target over 7 days
Alert: p99 latency > 500ms for 15 minutes and traffic > minimum threshold
Background job
SLI: proportion of payment-events processed within 5 minutes
SLO: 99.5% of events processed within 5 minutes over 30 days
Alert: queue lag > 5 minutes for 20 minutes
FAQ
What is observability?
Observability is the ability to understand a system’s internal state from emitted data such as logs, metrics, traces and profiles.
How is monitoring different from observability?
Monitoring asks “is something wrong?”. Observability helps answer “why is it wrong, where is it wrong, who is affected and what should we do?”.
What is OpenTelemetry?
OpenTelemetry is an open-source, vendor-neutral observability framework with APIs, SDKs, the Collector and protocol support for generating, collecting and exporting telemetry data [2].
Do I need logs, metrics and traces?
Yes, eventually. Metrics are best for alerts and trends; logs are best for details; traces are best for distributed request flows. Start with metrics and structured logs, then add tracing for critical paths.
Does Prometheus replace OpenTelemetry?
No. Prometheus is mainly a metrics monitoring and alerting system. OpenTelemetry is a framework for producing and transporting telemetry. They can work together.
Does every service need an SLO?
Not at first. Start with the most important user journeys and services with clear business impact.
Conclusion
Observability is a core operating capability for modern IT systems. It is not just installing Grafana or storing many logs. It is designing the right operational data so teams can quickly answer: what is wrong, who is affected, where is the cause and what action should be taken?
A practical start is structured logs, RED/USE metrics, service dashboards and actionable alerts with runbooks. Then add OpenTelemetry tracing for critical paths. As the system grows, introduce SLOs, error budgets, OpenTelemetry Collector, sampling, redaction and dashboard-as-code to make observability sustainable.
References
1. OpenTelemetry Docs. “Observability primer.” https://opentelemetry.io/docs/concepts/observability-primer/
2. OpenTelemetry Docs. “What is OpenTelemetry?” https://opentelemetry.io/docs/what-is-opentelemetry/
3. OpenTelemetry Docs. “Signals.” https://opentelemetry.io/docs/concepts/signals/
4. Prometheus Docs. “Overview.” https://prometheus.io/docs/introduction/overview/
5. Google SRE Book. “Service Level Objectives.” https://sre.google/sre-book/service-level-objectives/
6. Google SRE Book. “Monitoring Distributed Systems.” https://sre.google/sre-book/monitoring-distributed-systems/
7. CNCF. “OpenTelemetry.” https://www.cncf.io/projects/opentelemetry/
8. OpenTelemetry Docs. “Collector.” https://opentelemetry.io/docs/collector/
9. Grafana Docs. “Introduction.” https://grafana.com/docs/grafana/latest/introduction/
10. Grafana Loki Docs. https://grafana.com/docs/loki/latest/
11. Grafana Tempo Docs. https://grafana.com/docs/tempo/latest/
12. OpenTelemetry Docs. “Semantic conventions.” https://opentelemetry.io/docs/concepts/semantic-conventions/
Written by PixelRouter Editorial Team
We publish deep, authoritative guides on AI infrastructure, API gateway security, cloud financial management, and system optimizations for developers.