What Is Observability? A Practical Guide to Logs, Metrics, Traces, OpenTelemetry and SLOs

A practical guide to observability for IT systems, covering monitoring vs observability, logs, metrics, traces, OpenTelemetry, Prometheus, Grafana, SLI/SLO, alerting, dashboards, incident response and a 90-day rollout roadmap.

Published: Jun 5, 2026 | Updated: Jun 5, 2026 | Reading time: 11 min
Tags: Observability, OpenTelemetry, Logs Metrics Traces, Prometheus, Grafana, SLO, Incident Response, IT Monitoring

Quick summary

Observability is the ability to understand the internal state of a system from the data it emits, mainly logs, metrics, traces and increasingly profiles. In modern systems such as microservices, Kubernetes, serverless, API gateways, databases, queues, caches and CI/CD platforms, observability helps teams understand whether the system is healthy, where failures happen, who is affected and what should be fixed first.

OpenTelemetry describes observability as understanding a system’s internal state by examining its outputs, and it highlights telemetry signals such as traces, metrics and logs [1]. OpenTelemetry is an open-source observability framework made of APIs, SDKs, tools and integrations for creating, collecting, managing and exporting telemetry data [2].

Simple version: monitoring tells you something is wrong; observability helps you understand why.

Why observability matters

A modern request may travel through many layers:

Mobile app
  ↓
CDN / WAF
  ↓
API Gateway
  ↓
Auth service
  ↓
Business service
  ↓
Database + Redis + Queue
  ↓
Payment provider
  ↓
Notification service

When users report that checkout is slow, you need to answer:

  • is the issue frontend or backend?
  • which service adds latency?
  • which database query is slow?
  • is a queue backlogged?
  • is the issue regional or global?
  • did a deployment cause it?
  • how many users are affected?
  • should the team roll back, scale, fix a query or contact a provider?

Logs alone rarely answer these questions quickly. Observability connects data across layers.

Monitoring vs observability

| Criteria | Monitoring | Observability |
|---|---|---|
| Main question | Is the system broken? | Why is it broken? |
| Data | predefined metrics and alerts | logs, metrics, traces, events, profiles |
| Best for | known failures and known thresholds | unknown failures and distributed systems |
| Usage | dashboards and alerts | investigation, correlation, root cause analysis |
| Example | CPU > 90%, error rate > 5% | request is slow because service A calls DB query B after deploy C |

Monitoring is part of observability. You still need monitoring, but threshold-based monitoring alone is not enough for complex systems.

The three main signals: Logs, Metrics, Traces

OpenTelemetry calls logs, metrics and traces the main observability signals [3].

Logs

Logs are event records, usually text or structured events.

Good log example:

{
  "timestamp": "2026-06-04T10:15:30Z",
  "level": "error",
  "service": "checkout-api",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "user_id_hash": "u_12ab",
  "order_id": "ord_9482",
  "message": "payment authorization failed",
  "provider": "stripe",
  "error_code": "card_declined"
}

Logs are good for:

  • detailed debugging;
  • audit trails;
  • exceptions;
  • business events;
  • security events;
  • incident investigation.

Common mistakes:

  • unstructured logs;
  • inconsistent timestamps;
  • logging secrets or passwords;
  • too much noise;
  • missing trace_id/request_id;
  • vague messages like “something went wrong.”
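
The log shape above can be produced with the Python standard library alone. This is a minimal sketch, not a recommended logging stack: the `JsonFormatter` class, the `CONTEXT_FIELDS` allow-list and the `build_logger` helper are hypothetical names introduced here for illustration.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    # Hypothetical allow-list of context fields passed via `extra=`.
    CONTEXT_FIELDS = ("service", "trace_id", "order_id", "error_code")

    def format(self, record: logging.LogRecord) -> str:
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            # UTC ISO 8601 keeps timestamps consistent across services.
            "timestamp": timestamp.isoformat(timespec="seconds").replace("+00:00", "Z"),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
        }
        # Copy only known context fields so stray data is not logged by accident.
        for field in self.CONTEXT_FIELDS:
            value = getattr(record, field, None)
            if value is not None:
                entry[field] = value
        return json.dumps(entry)


def build_logger(name: str = "checkout-api") -> logging.Logger:
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger()
    log.error(
        "payment authorization failed",
        extra={
            "service": "checkout-api",
            "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
            "error_code": "card_declined",
        },
    )
```

An allow-list of context fields is one way to keep secrets out of logs by default: a field must be named explicitly before it can appear in the output.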

Metrics

Metrics are numeric measurements over time. Prometheus describes itself as a monitoring and alerting system that collects and stores metrics as time series, with timestamps and key-value labels [4].

Metric examples:

http_requests_total{service="checkout-api",status="500"} 42
http_request_duration_seconds_bucket{service="checkout-api",le="0.5"} 1832
queue_depth{queue="payment-events"} 1200
cpu_usage_percent{pod="checkout-api-7d9"} 82

Metrics are good for:

  • dashboards;
  • alerts;
  • trends;
  • capacity planning;
  • SLI/SLO tracking;
  • autoscaling;
  • before/after deployment comparison.

Common mistakes:

  • high-cardinality labels;
  • CPU-based alerts instead of user-impact alerts;
  • missing service/version/region labels;
  • no latency histograms;
  • only infrastructure metrics, no business metrics.
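
Latency histograms matter because percentiles can be estimated from cumulative bucket counts. The sketch below interpolates linearly inside the matching bucket, in the spirit of Prometheus's `histogram_quantile`; it is a simplified illustration, not the Prometheus implementation. The bucket counts are made up, apart from the `le="0.5"` count of 1832 taken from the example above.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q (0..1) from cumulative (upper_bound, count) buckets."""
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            if count == prev_count:
                return bound
            # Linear interpolation between the bucket's lower and upper bound.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative counts for http_request_duration_seconds_bucket.
BUCKETS = [(0.1, 900), (0.25, 1500), (0.5, 1832), (1.0, 1950), (math.inf, 2000)]

if __name__ == "__main__":
    print("p50:", histogram_quantile(0.50, BUCKETS))
    print("p95:", histogram_quantile(0.95, BUCKETS))
```

The estimate is only as precise as the bucket boundaries, so buckets should be placed around the latency targets you care about.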

Traces

Traces follow a request across services. A trace contains spans, and each span represents one operation.

Example:

Trace: checkout request
  frontend → POST /checkout                   1200ms
    ├── checkout-api → validate cart            30ms
    ├── checkout-api → auth-service             80ms
    ├── checkout-api → inventory-service       150ms
    ├── checkout-api → payment-provider        850ms
    └── checkout-api → database write           40ms
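
Reading the trace above amounts to asking which child span takes the largest share of the root span's time. A minimal sketch, assuming the frontend span is the root and the checkout-api spans are its children (the `Span` type and `slowest_child` helper are hypothetical):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    name: str
    duration_ms: int


def slowest_child(root, children):
    """Return the slowest child span and its share of the root duration."""
    slowest = max(children, key=lambda span: span.duration_ms)
    return slowest, slowest.duration_ms / root.duration_ms


ROOT = Span("frontend → POST /checkout", 1200)
CHILDREN = [
    Span("checkout-api → validate cart", 30),
    Span("checkout-api → auth-service", 80),
    Span("checkout-api → inventory-service", 150),
    Span("checkout-api → payment-provider", 850),
    Span("checkout-api → database write", 40),
]

if __name__ == "__main__":
    span, share = slowest_child(ROOT, CHILDREN)
    print(f"{span.name}: {share:.1%} of the request")
```

Here the payment provider call accounts for roughly 71% of the request, which points the investigation at the external dependency rather than the database.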

Traces are good for:

  • microservices;
  • distributed systems;
  • latency investigation;
  • dependency maps;
  • finding the slow service;
  • debugging request-specific failures.

Common mistakes:

  • no trace context propagation;
  • only instrumenting the gateway;
  • bad sampling strategy;
  • missing attributes such as route, region or version;
  • no correlation with logs.
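
Trace context propagation usually means forwarding a header such as the W3C `traceparent` on every outgoing call. The sketch below parses an incoming header and builds the header for a child call. It handles only the version `00` layout and is not a replacement for an OpenTelemetry propagator; the function names are hypothetical, and starting a new trace as sampled is an assumption of this sketch.

```python
import re
import secrets

# version - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Return the trace context as a dict, or None if the header is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero identifiers are invalid in the W3C format.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 1),
    }


def child_traceparent(incoming=None):
    """Build the header for an outgoing call, continuing the incoming trace if valid."""
    context = parse_traceparent(incoming) if incoming else None
    trace_id = context["trace_id"] if context else secrets.token_hex(16)
    # Assumption: a trace started here is sampled; otherwise keep the caller's decision.
    flags = "01" if context is None or context["sampled"] else "00"
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


if __name__ == "__main__":
    incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    print(child_traceparent(incoming))
```

The trace ID stays constant across services while each hop gets a fresh span ID, which is what lets a backend stitch spans into one trace and lets logs carry the same `trace_id`.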

SLI, SLO and SLA

The Google SRE Book defines an SLO as a target value or range for a service level, measured by an SLI [5].

| Term | Meaning | Example |
|---|---|---|
| SLI | Service Level Indicator, the measurement | 99th percentile latency, availability, error rate |
| SLO | Service Level Objective, the internal target | 99.9% successful requests over 30 days |
| SLA | Service Level Agreement, external commitment | service credits if availability drops below agreement |

Example:

SLI: ratio of 2xx/3xx HTTP responses to total requests
SLO: 99.9% successful requests over 30 days
SLA: customer receives credits if availability falls below 99.5%

SLOs help teams alert on user experience instead of every low-level signal.
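
The SLI in the example is a simple ratio, so it can be computed from per-status request counts. A minimal sketch, with made-up counts and hypothetical function names:

```python
def availability_sli(status_counts):
    """Ratio of 2xx/3xx responses to all responses, or None without traffic."""
    total = sum(status_counts.values())
    if total == 0:
        return None
    good = sum(count for status, count in status_counts.items() if 200 <= status < 400)
    return good / total


def meets_target(sli, target):
    """True when the measured SLI is at or above the target."""
    return sli >= target


if __name__ == "__main__":
    # Hypothetical counts for one measurement window.
    counts = {200: 9_980, 503: 20}
    sli = availability_sli(counts)
    print(f"SLI: {sli:.4f}")
    print("meets 99.9% SLO:", meets_target(sli, 0.999))
    print("meets 99.5% SLA:", meets_target(sli, 0.995))
```

With these counts the SLI is 99.8%: the internal SLO is missed while the external SLA still holds, which is the gap the two targets are meant to provide.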

What is an error budget?

If an availability SLO is 99.9% over 30 days, up to 0.1% of requests may fail, which in time terms is about 43 minutes of full downtime (0.1% of 43,200 minutes). That allowed amount of failure is the error budget.

Example:

SLO: 99.9% success rate
Error budget: 0.1% failed requests

If the error budget burns too fast, the team should:

  • pause risky releases;
  • prioritize reliability work;
  • roll back instead of pushing forward;
  • reduce change velocity;
  • investigate root cause.

Error budgets help balance feature speed and reliability.
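
The error budget arithmetic is short enough to script. This sketch converts an SLO target into allowed downtime and allowed failed requests, then reports how much budget is left; the request volumes are illustrative and the function names are hypothetical.

```python
def error_budget(slo_target, window_days, total_requests):
    """Translate an SLO target into allowed downtime and allowed failures."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    fraction = 1 - slo_target
    return {
        "allowed_downtime_minutes": fraction * window_days * 24 * 60,
        "allowed_failed_requests": fraction * total_requests,
    }


def budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the error budget still unspent (negative when overspent)."""
    allowed = error_budget(slo_target, 1, total_requests)["allowed_failed_requests"]
    if allowed == 0:
        return 1.0
    return 1 - failed_requests / allowed


if __name__ == "__main__":
    budget = error_budget(0.999, window_days=30, total_requests=10_000_000)
    print(f"downtime budget: {budget['allowed_downtime_minutes']:.1f} minutes")
    print(f"request budget: {budget['allowed_failed_requests']:.0f} failed requests")
    print(f"remaining: {budget_remaining(0.999, 10_000_000, 2_500):.0%}")
```

A negative remaining budget is a useful signal in its own right: it means the SLO is already violated for the window.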

Good alerting

The Google SRE Book emphasizes that monitoring systems should identify urgent conditions, but alerts should be operationally meaningful. Bad alerts create fatigue and get ignored [6].

Good alerts have:

  • clear user impact;
  • severity;
  • runbook;
  • owner;
  • dashboard link;
  • trace/log query link;
  • SLO-based threshold when possible;
  • deduplication for the same root cause.

Avoid alerts for:

  • small CPU spikes with no user impact;
  • warnings requiring no action;
  • log noise;
  • short-lived metric fluctuations;
  • symptoms already covered by another alert.

Good alert example:

Checkout API 5xx rate exceeds 5% for 10 minutes
Impact: users cannot complete checkout
Action: check recent deployments, payment provider and database errors
Dashboard: ...
Runbook: ...
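
The "for 10 minutes" part of the alert above is what filters out short-lived spikes. A minimal sketch of that hold-down logic follows, similar in spirit to the `for` clause of a Prometheus alerting rule but not its implementation; the sample format and function name are assumptions.

```python
def should_fire(samples, threshold=0.05, hold_seconds=600):
    """Fire only if the error ratio stayed above threshold for hold_seconds.

    `samples` is a time-ordered list of (unix_timestamp, error_ratio) pairs.
    """
    breach_started = None
    for timestamp, ratio in samples:
        if ratio > threshold:
            if breach_started is None:
                breach_started = timestamp
        else:
            # Any sample at or below the threshold resets the timer.
            breach_started = None
    if breach_started is None:
        return False
    return samples[-1][0] - breach_started >= hold_seconds


if __name__ == "__main__":
    steady = [(t, 0.08) for t in range(0, 601, 60)]
    dipped = [(t, 0.02 if t == 300 else 0.08) for t in range(0, 601, 60)]
    print("sustained breach fires:", should_fire(steady))
    print("breach with a dip fires:", should_fire(dipped))
```

The trade-off is detection delay: a longer hold reduces noise but postpones the page by the same amount.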

What is OpenTelemetry?

OpenTelemetry, or OTel, is an open-source, vendor-neutral observability framework. It provides APIs, SDKs, Collector and protocol support to generate and export telemetry data to backends such as Prometheus, Grafana, Jaeger, Tempo, Datadog, New Relic, Honeycomb, Elastic or internal systems [2].

Strengths:

  • vendor-neutral;
  • supports many languages;
  • auto-instrumentation;
  • manual instrumentation;
  • OTLP protocol;
  • OpenTelemetry Collector;
  • semantic conventions for standard attributes.

CNCF lists OpenTelemetry as a CNCF project [7].

What is OpenTelemetry Collector?

OpenTelemetry Collector receives, processes and exports telemetry data. Applications send telemetry to the Collector, so each one does not need its own integration with every backend.

Common flow:

App / Service
  ↓ OTLP
OpenTelemetry Collector
  ↓ processors: batch, filter, transform, sampling
Observability backends
  ├── Prometheus / Mimir for metrics
  ├── Loki / Elasticsearch for logs
  └── Tempo / Jaeger for traces

Collector components:

| Component | Role |
|---|---|
| Receivers | receive telemetry from apps or agents |
| Processors | batch, filter, redact, transform, sample |
| Exporters | send telemetry to backends |

The OpenTelemetry docs describe the Collector as a way to receive, process and export telemetry through receivers, processors, exporters and connectors [8].
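
The receive, process and export flow can be pictured as a small pipeline of functions. The following is a toy model in plain Python, written only to show the data flow; it is not the Collector's configuration format or API, and every name in it is hypothetical.

```python
def redact(keys):
    """Build a processor that masks the given keys in every record."""
    def process(batch):
        return [
            {k: ("[REDACTED]" if k in keys else v) for k, v in record.items()}
            for record in batch
        ]
    return process


def drop_debug(batch):
    """Processor that filters out debug-level records."""
    return [record for record in batch if record.get("level") != "debug"]


def run_pipeline(received, processors, exporters, batch_size=2):
    """Push received records through processors, then to every exporter."""
    records = list(received)
    exported = 0
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        for processor in processors:
            batch = processor(batch)
        if not batch:
            continue
        for exporter in exporters:
            exporter(batch)
        exported += len(batch)
    return exported


if __name__ == "__main__":
    backend = []  # stands in for a log backend
    incoming = [
        {"level": "info", "message": "ok", "authorization": "Bearer abc"},
        {"level": "debug", "message": "noise"},
        {"level": "error", "message": "failed"},
    ]
    count = run_pipeline(incoming, [drop_debug, redact({"authorization"})], [backend.extend])
    print(count, backend)
```

The point of the model is that filtering and redaction happen once, in the middle, instead of being reimplemented in every application.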

Prometheus, Grafana, Loki and Tempo

| Tool | Role |
|---|---|
| Prometheus | time-series metrics collection and querying |
| Alertmanager | alert routing and deduplication |
| Grafana | dashboards and visualization |
| Loki | log aggregation |
| Tempo | distributed tracing backend |
| Mimir/Thanos | scalable long-term metrics storage |
| Jaeger | distributed tracing backend |
| OpenTelemetry Collector | telemetry collection, processing and export |

Prometheus is commonly used for metrics and alerting; Grafana for dashboards; Loki for logs; Tempo or Jaeger for traces [4][9][10][11].

Example observability architectures

Small-team architecture

App emits logs + metrics
  ↓
Prometheus scrapes metrics
  ↓
Grafana dashboard + Alertmanager
  ↓
Logs sent to Loki

Best for:

  • monoliths;
  • a few services;
  • early-stage observability;
  • smaller budgets.

Cloud-native architecture

Services instrumented with OpenTelemetry
  ↓ OTLP
OpenTelemetry Collector DaemonSet/Gateway
  ↓
Metrics → Prometheus/Mimir
Logs    → Loki/Elastic
Traces  → Tempo/Jaeger
  ↓
Grafana dashboards + alerts + incident runbooks

Best for:

  • Kubernetes;
  • microservices;
  • multiple teams;
  • log/metric/trace correlation.

What is instrumentation?

Instrumentation is adding code, libraries or agents so an application emits telemetry.

Types:

| Type | Description |
|---|---|
| Auto-instrumentation | agents/libraries automatically capture HTTP, DB, queue and framework data |
| Manual instrumentation | developers add spans, metrics and attributes for business logic |

Auto-instrumentation is good for starting quickly. Manual instrumentation adds business context.

Example manual span:

Span: checkout.calculate_discount
Attributes:
  cart.items_count = 4
  user.tier = "premium"
  discount.rule = "campaign_2026_q2"

Do not put sensitive PII such as email, phone number, tokens, addresses or payment data into attributes.
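
One way to enforce that rule is to filter attributes before they are attached to a span. A minimal sketch, assuming a denylist keyed on the last segment of the attribute name; the list contents and the `safe_attributes` helper are hypothetical and would need to follow your own data policy.

```python
# Hypothetical denylist, matched against the last dot-separated segment of a key.
SENSITIVE_LEAVES = {
    "email", "phone", "phone_number", "token", "access_token",
    "postal_address", "card_number", "password",
}


def safe_attributes(attributes):
    """Return a copy of the attributes without keys that look sensitive."""
    safe = {}
    for key, value in attributes.items():
        leaf = key.rsplit(".", 1)[-1].lower()
        if leaf in SENSITIVE_LEAVES:
            continue
        safe[key] = value
    return safe


if __name__ == "__main__":
    candidate = {
        "cart.items_count": 4,
        "user.tier": "premium",
        "discount.rule": "campaign_2026_q2",
        "user.email": "user@example.com",
        "payment.card_number": "4242424242424242",
        "server.address": "db.internal",  # a network address, not PII: kept
    }
    print(safe_attributes(candidate))
```

Matching on the whole last segment rather than on substrings avoids dropping standard attributes such as `server.address` by accident.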

Semantic conventions

Semantic conventions are standard names for OpenTelemetry attributes. They help services and tools use common fields such as service.name, http.request.method, url.path, db.system.name and server.address [12].

Benefits:

  • reusable dashboards;
  • consistent queries;
  • better correlation;
  • less chaos across teams.

Example:

service.name = checkout-api
deployment.environment.name = production
http.request.method = POST
http.route = /checkout
http.response.status_code = 500
db.system.name = postgresql
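
Older instrumentation often still emits earlier attribute names, so a normalization step helps keep queries consistent. The sketch below maps a handful of legacy names to the names used in the example above. The mapping is a small subset based on my reading of the conventions and should be checked against the current semantic conventions; the helper name is hypothetical.

```python
# Partial legacy-to-current mapping; verify against the current conventions.
RENAMES = {
    "http.method": "http.request.method",
    "http.status_code": "http.response.status_code",
    "db.system": "db.system.name",
    "deployment.environment": "deployment.environment.name",
    "net.peer.name": "server.address",
}


def normalize_attributes(attributes):
    """Rename legacy keys; a value already set under the current name wins."""
    normalized = {}
    for key, value in attributes.items():
        current = RENAMES.get(key, key)
        if current in normalized and key != current:
            # The current name was already seen: ignore the legacy duplicate.
            continue
        normalized[current] = value
    return normalized


if __name__ == "__main__":
    legacy = {
        "service.name": "checkout-api",
        "http.method": "POST",
        "http.status_code": 500,
        "db.system": "postgresql",
    }
    print(normalize_attributes(legacy))
```

In practice this kind of renaming is often done centrally, for example in a Collector processor, so that individual services do not all need to be upgraded at once.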

Cardinality

Cardinality means the number of unique values in a label or attribute. High-cardinality metrics can make backends expensive, slow and memory-heavy.

Bad example:

http_requests_total{user_id="123456", email="user@example.com", request_id="abc"} 1

Better example:

http_requests_total{service="checkout-api", route="/checkout", status="500"} 1

Rules:

  • do not use user_id, email or request_id as metric labels;
  • use route templates like /users/:id, not /users/123;
  • use traces/logs for request-specific investigation;
  • use metrics for aggregate trends.
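
Two of these rules are easy to make concrete: turning raw paths into route templates, and estimating how many time series a label set can produce. A minimal sketch with hypothetical helper names and illustrative cardinalities:

```python
import re
from math import prod

# Path segments that look like numeric IDs or UUIDs.
_ID_SEGMENT = re.compile(
    r"^(\d+|[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})$",
    re.IGNORECASE,
)


def route_template(path):
    """Replace ID-like path segments with :id so the label stays low-cardinality."""
    segments = [":id" if _ID_SEGMENT.match(segment) else segment for segment in path.split("/")]
    return "/".join(segments)


def series_upper_bound(label_cardinalities):
    """Worst-case number of time series: the product of unique values per label."""
    return prod(label_cardinalities.values())


if __name__ == "__main__":
    print(route_template("/users/123/orders/42"))
    good = {"service": 20, "route": 50, "status": 8}
    print("without user_id:", series_upper_bound(good))
    print("with user_id:", series_upper_bound({**good, "user_id": 100_000}))
```

The product grows multiplicatively, which is why a single unbounded label such as `user_id` can turn thousands of series into hundreds of millions.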

Dashboard design

A good dashboard answers operational questions.

A service dashboard should include:

  • request rate;
  • error rate;
  • p50/p95/p99 latency;
  • saturation: CPU, memory, queue depth, connection pool;
  • dependency latency;
  • deployment version;
  • recent error logs;
  • trace exemplars where available;
  • SLO burn rate;
  • runbook link.

Suggested layout:

Row 1: user impact
Row 2: service health
Row 3: dependencies
Row 4: infrastructure
Row 5: logs/traces links

RED and USE methods

RED method

For services and requests:

Rate: requests per second
Errors: error rate
Duration: latency

Best for APIs, microservices and gateways.
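
The three RED numbers can be derived from a window of request records. A minimal sketch, assuming each record is a `(status_code, duration_seconds)` pair and that only 5xx responses count as errors; both the record shape and the function names are assumptions.

```python
def percentile(sorted_values, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    rank = -(-len(sorted_values) * pct // 100)  # ceiling division
    return sorted_values[max(rank, 1) - 1]


def red_summary(requests, window_seconds):
    """Compute rate, error ratio and p95 duration for one time window."""
    if not requests:
        return {"rate_per_second": 0.0, "error_ratio": 0.0, "p95_seconds": None}
    durations = sorted(duration for _, duration in requests)
    errors = sum(1 for status, _ in requests if status >= 500)
    return {
        "rate_per_second": len(requests) / window_seconds,
        "error_ratio": errors / len(requests),
        "p95_seconds": percentile(durations, 95),
    }


if __name__ == "__main__":
    # 20 hypothetical requests in a 10-second window; the two fastest ones failed.
    window = [(500 if i <= 2 else 200, i / 100) for i in range(1, 21)]
    print(red_summary(window, window_seconds=10))
```

In production these values normally come from counters and histograms in the metrics backend; computing them from raw records is shown here only to make the definitions explicit.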

USE method

For resources:

Utilization: resource usage
Saturation: queued/waiting work
Errors: resource errors

Best for CPU, memory, disk, network and connection pools.

Using RED and USE together covers both user impact and infrastructure health.

Kubernetes observability

Kubernetes needs observability across layers:

| Layer | What to measure |
|---|---|
| Cluster | node health, API server, scheduler, etcd |
| Node | CPU, memory, disk, network |
| Pod | restarts, OOMKilled, CPU/memory, readiness |
| Deployment | replicas, rollout status, version |
| Service | request rate, error rate, latency |
| Ingress | 4xx/5xx, TLS, latency |
| Workload | business metrics, queue depth |
| Security | audit logs, policy violations |

Do not monitor only CPU and memory. Many user-facing incidents are caused by dependencies, databases, queues or external APIs.

Database observability

Track:

  • query latency;
  • slow queries;
  • connection pool usage;
  • locks/deadlocks;
  • replication lag;
  • cache hit ratio;
  • disk usage;
  • transaction rate;
  • error rate;
  • backup status;
  • failover events.

If an API is slow, traces should show which database query or dependency caused the delay.

Frontend observability

Frontend also needs observability:

  • page load time;
  • Core Web Vitals;
  • JavaScript errors;
  • browser-side API latency;
  • route-level error rate;
  • device/browser breakdown;
  • geographic latency;
  • session replay where policy allows;
  • feature flag impact.

A healthy backend does not always mean a good user experience.

Observability for AI/LLM applications

AI applications need extra telemetry:

  • model name/version;
  • provider;
  • prompt tokens;
  • completion tokens;
  • estimated cost;
  • provider latency;
  • rate limits;
  • retry/fallback count;
  • tool calls;
  • retrieval latency;
  • vector search top-k;
  • grounding/citation coverage;
  • safety refusals/errors;
  • user feedback.

Do not log raw prompts containing sensitive data unless redaction and policy are in place.
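
A telemetry record for one model call can carry most of the fields listed above without including the prompt text. The sketch below is illustrative only: the model name, provider name and per-1,000-token prices are made up, and `llm_call_record` is a hypothetical helper.

```python
def llm_call_record(model, provider, prompt_tokens, completion_tokens,
                    latency_ms, input_price_per_1k, output_price_per_1k):
    """Build a telemetry record for one model call, without the raw prompt."""
    cost = (prompt_tokens / 1000 * input_price_per_1k
            + completion_tokens / 1000 * output_price_per_1k)
    return {
        "model": model,
        "provider": provider,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "latency_ms": latency_ms,
        "estimated_cost": round(cost, 6),
    }


if __name__ == "__main__":
    record = llm_call_record(
        model="example-model",        # hypothetical model name
        provider="example-provider",  # hypothetical provider
        prompt_tokens=1200,
        completion_tokens=300,
        latency_ms=840,
        input_price_per_1k=0.5,       # made-up price per 1,000 prompt tokens
        output_price_per_1k=1.5,      # made-up price per 1,000 completion tokens
    )
    print(record)
```

Recording token counts rather than text keeps cost and latency analysis possible while leaving prompt content out of the telemetry pipeline.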

Telemetry data security

Telemetry often contains sensitive data. Control it carefully:

  • never log passwords, tokens or API keys;
  • hash or redact user identifiers;
  • avoid PII in traces;
  • sample sensitive data carefully;
  • encrypt in transit and at rest;
  • set retention policy;
  • use RBAC for dashboards and logs;
  • audit access to logs;
  • separate production and development logs;
  • redact at OpenTelemetry Collector when possible.

Observability should not become a data leak.
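
Two of the controls above, pseudonymizing user identifiers and scrubbing credentials from messages, can be sketched with the standard library. The key handling is deliberately simplified (a real key would come from a secret store), the bearer-token pattern covers only one credential format, and the function names are hypothetical.

```python
import hashlib
import hmac
import re

# Matches "Bearer <token>" in free text; other credential formats need their own patterns.
_BEARER = re.compile(r"(?i)\bbearer\s+[a-z0-9._~+/=-]+")


def pseudonymize(user_id, key):
    """Return a stable, keyed hash so events correlate without exposing the ID."""
    digest = hmac.new(key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return "u_" + digest[:12]


def scrub(message):
    """Mask bearer tokens before a message is logged."""
    return _BEARER.sub("Bearer [REDACTED]", message)


if __name__ == "__main__":
    key = b"load-this-from-a-secret-store"  # placeholder key for the example
    print(pseudonymize("user-123", key))
    print(scrub("upstream call failed: Authorization: Bearer abc.def-123"))
```

A keyed hash (HMAC) is used instead of a plain hash so that someone with access to the logs cannot confirm a guessed user ID without also having the key.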

30/60/90-day rollout roadmap

Days 1–30: foundation

  • Standardize structured logging.
  • Add request_id/trace_id to logs.
  • Collect core metrics: rate, error, duration and saturation.
  • Deploy Prometheus + Grafana or a managed backend.
  • Create dashboards for 3–5 critical services.
  • Create first user-impact alert.
  • Write runbooks for important alerts.
  • Set retention policy.

Days 31–60: traces and SLOs

  • Add OpenTelemetry SDK or auto-instrumentation.
  • Add OpenTelemetry Collector.
  • Enable distributed tracing for critical requests.
  • Define SLI/SLO for critical services.
  • Add SLO burn-rate alerts.
  • Correlate logs with traces.
  • Add dependency metrics for DB, queues, caches and external APIs.
  • Review metric cardinality.

Days 61–90: optimization and governance

  • Add trace sampling policy.
  • Add Collector redaction/filtering.
  • Move dashboards to code.
  • Review alerts regularly.
  • Create incident postmortem template.
  • Optimize logs/traces cost.
  • Build service catalog with owner/runbook/SLO.
  • Add synthetic checks for critical user journeys.
  • Add frontend and AI workload observability where relevant.

Quick checklist

Logs

  • Structured JSON logs.
  • Service name, environment and version.
  • trace_id/request_id.
  • No secrets or sensitive PII.
  • Correct log levels.
  • Retention policy.

Metrics

  • RED metrics for services.
  • USE metrics for resources.
  • Latency histograms.
  • Controlled label cardinality.
  • Important business metrics.
  • SLO-based alerts.

Traces

  • Trace context propagation.
  • Gateway, service, DB, queue and external call spans.
  • Good sampling policy.
  • Standard span attributes.
  • Trace-log correlation.
  • No PII in spans.

Dashboards

  • Service dashboard.
  • User journey dashboard.
  • Links to logs/traces.
  • Deployment markers.
  • SLO/burn rate.
  • Owner and runbook.

Alerts

  • Clear action.
  • User impact.
  • Severity.
  • Runbook.
  • Dedup/silence.
  • Regular alert-noise review.

Common mistakes

  • Dashboards without useful alerts.
  • Too many noisy alerts.
  • Logs without request_id/trace_id.
  • High-cardinality metric labels.
  • Measuring only CPU/memory, not user impact.
  • Trace sampling that drops important errors.
  • No runbooks.
  • No service owner.
  • Retention too long and too expensive.
  • Secrets or PII in logs.
  • No deployment-to-incident correlation.
  • No SLOs.
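
The sampling mistake in particular has a simple guard: decide after the trace completes, always keep errors and slow requests, and sample only the healthy remainder. A minimal sketch of such a tail-based decision, with hypothetical thresholds and function name:

```python
import hashlib


def keep_trace(trace_id, has_error, duration_ms, base_rate=0.1, slow_ms=1000):
    """Keep every failed or slow trace; sample the rest deterministically."""
    if has_error or duration_ms >= slow_ms:
        return True
    # Hash the trace ID into [0, 1) so the same trace always gets the same decision.
    digest = hashlib.sha256(trace_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x1_0000_0000
    return bucket < base_rate


if __name__ == "__main__":
    trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
    print("error trace kept:", keep_trace(trace_id, has_error=True, duration_ms=20))
    print("slow trace kept:", keep_trace(trace_id, has_error=False, duration_ms=1500))
    print("healthy trace kept:", keep_trace(trace_id, has_error=False, duration_ms=20))
```

Because the decision needs the final status and duration, it has to run where complete traces are visible, which is typically a collector tier rather than the application itself.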

Reference tooling

| Need | Tool |
|---|---|
| Metrics | Prometheus, Mimir, VictoriaMetrics |
| Dashboards | Grafana |
| Logs | Loki, Elasticsearch/OpenSearch |
| Traces | Tempo, Jaeger |
| Telemetry framework | OpenTelemetry |
| Collector | OpenTelemetry Collector |
| Alert routing | Alertmanager, Grafana Alerting, PagerDuty/Opsgenie |
| Synthetic monitoring | Grafana Synthetic Monitoring, Checkly, k6 |
| Runtime profiling | Parca, Pyroscope |
| Incident management | PagerDuty, Opsgenie, incident.io |

Tools matter less than correct data, useful alerts and good incident process.

Practical SLO examples

Checkout API

SLI: ratio of successful POST /checkout requests
SLO: 99.9% over 30 days
Exclude: 4xx requests caused by invalid user input
Alert: burn 2% of budget in 1 hour or 10% in 6 hours
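
The alert line above can be translated into burn rates and error-ratio thresholds. A burn rate of 1 means the budget would be spent exactly at the end of the SLO window; the function names below are hypothetical.

```python
def burn_rate(budget_fraction, alert_window_hours, slo_window_days=30):
    """Burn rate at which `budget_fraction` of the budget is spent in the alert window."""
    return budget_fraction * slo_window_days * 24 / alert_window_hours


def error_ratio_threshold(rate, slo_target):
    """Error ratio that corresponds to a given burn rate."""
    return rate * (1 - slo_target)


if __name__ == "__main__":
    for fraction, hours in ((0.02, 1), (0.10, 6)):
        rate = burn_rate(fraction, hours)
        threshold = error_ratio_threshold(rate, 0.999)
        print(f"{fraction:.0%} in {hours}h: burn rate {rate:.1f}, "
              f"alert when error ratio > {threshold:.2%}")
```

For the 99.9% checkout SLO this gives a burn rate of 14.4 (error ratio above 1.44%) for the 1-hour window and 12 (error ratio above 1.2%) for the 6-hour window.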

Search service

SLI: proportion of search requests served in under 500ms
SLO: 99% of search requests meet the latency target over 7 days
Alert: p95 latency > 500ms for 15 minutes and traffic > minimum threshold

Background job

SLI: proportion of payment-events processed within 5 minutes
SLO: 99.5% of events processed within 5 minutes over 30 days
Alert: queue lag > 5 minutes for 20 minutes

FAQ

What is observability?

Observability is the ability to understand a system’s internal state from emitted data such as logs, metrics, traces and profiles.

How is monitoring different from observability?

Monitoring asks “is something wrong?”. Observability helps answer “why is it wrong, where is it wrong, who is affected and what should we do?”.

What is OpenTelemetry?

OpenTelemetry is an open-source, vendor-neutral observability framework with APIs, SDKs, Collector and protocol support for generating, collecting and exporting telemetry data [2].

Do I need logs, metrics and traces?

Yes, eventually. Metrics are best for alerts and trends; logs are best for details; traces are best for distributed request flows. Start with metrics and structured logs, then add tracing for critical paths.

Does Prometheus replace OpenTelemetry?

No. Prometheus is mainly a metrics monitoring and alerting system. OpenTelemetry is a framework for producing and transporting telemetry. They can work together.

Does every service need an SLO?

Not at first. Start with the most important user journeys and services with clear business impact.

Conclusion

Observability is a core operating capability for modern IT systems. It is not just installing Grafana or storing many logs. It is designing the right operational data so teams can quickly answer: what is wrong, who is affected, where is the cause and what action should be taken?

A practical start is structured logs, RED/USE metrics, service dashboards and actionable alerts with runbooks. Then add OpenTelemetry tracing for critical paths. As the system grows, introduce SLOs, error budgets, OpenTelemetry Collector, sampling, redaction and dashboard-as-code to make observability sustainable.

References

  1. OpenTelemetry Docs. “Observability primer.” https://opentelemetry.io/docs/concepts/observability-primer/
  2. OpenTelemetry Docs. “What is OpenTelemetry?” https://opentelemetry.io/docs/what-is-opentelemetry/
  3. OpenTelemetry Docs. “Signals.” https://opentelemetry.io/docs/concepts/signals/
  4. Prometheus Docs. “Overview.” https://prometheus.io/docs/introduction/overview/
  5. Google SRE Book. “Service Level Objectives.” https://sre.google/sre-book/service-level-objectives/
  6. Google SRE Book. “Monitoring Distributed Systems.” https://sre.google/sre-book/monitoring-distributed-systems/
  7. CNCF. “OpenTelemetry.” https://www.cncf.io/projects/opentelemetry/
  8. OpenTelemetry Docs. “Collector.” https://opentelemetry.io/docs/collector/
  9. Grafana Docs. “Introduction.” https://grafana.com/docs/grafana/latest/introduction/
  10. Grafana Loki Docs. https://grafana.com/docs/loki/latest/
  11. Grafana Tempo Docs. https://grafana.com/docs/tempo/latest/
  12. OpenTelemetry Docs. “Semantic conventions.” https://opentelemetry.io/docs/concepts/semantic-conventions/

Written by PixelRouter Editorial Team

We publish deep, authoritative guides on AI infrastructure, API gateway security, cloud financial management, and system optimizations for developers.
