What Is Observability? A Guide to Logs, Metrics, Traces, OpenTelemetry, and SLOs for IT Systems

An accessible guide to observability in IT systems: logs, metrics, traces, OpenTelemetry, Prometheus, Grafana, SLI/SLO, alerting, dashboards, incident response, and a 90-day rollout roadmap.

Published: Jun 5, 2026 · Updated: Jun 5, 2026 · Reading time: 13 min

Tags: Observability, OpenTelemetry, Logs Metrics Traces, Prometheus, Grafana, SLO, Incident Response, DevOps

Quick summary

Observability is the ability to understand a system's internal state from the external data it emits, chiefly logs, metrics, and traces, and increasingly profiles. In modern systems built on microservices, Kubernetes, serverless, API gateways, databases, queues, caches, and CI/CD, observability helps IT teams tell whether the system is healthy, where a fault lies, how users are affected, and what to fix first.

OpenTelemetry describes observability as the ability to understand a system's internal state by examining its outputs, and emphasizes three main groups of telemetry signals: traces, metrics, and logs.¹ OpenTelemetry is an open-source observability framework made up of APIs, SDKs, tools, and integrations for generating, collecting, managing, and exporting telemetry data.²

Put simply: monitoring tells you that something is wrong; observability helps you understand why.

Why does observability matter?

Today's IT systems are no longer a single simple server. One request may pass through:

Mobile app
  ↓
CDN / WAF
  ↓
API Gateway
  ↓
Auth service
  ↓
Business service
  ↓
Database + Redis + Queue
  ↓
Payment provider
  ↓
Notification service

When a user says “checkout is slow”, you need to answer:

is it slow in the frontend or the backend?
which service is the latency increase coming from?
which database query is slow?
is a queue backing up?
does the failure affect one region or all of them?
is a recent deployment involved?
what percentage of users is affected?
should you roll back, scale out, fix a query, or call the provider?

With only scattered logs, these questions are hard to answer quickly. Observability connects data from many layers so you can investigate an incident in context.

How do monitoring and observability differ?

Criterion	Monitoring	Observability
Main question	Is the system failing?	Why is the system failing?
Data	predefined metrics/alerts	logs, metrics, traces, events, profiles
Best for	known failures, known thresholds	novel failures, distributed systems, complex causes
Usage	dashboards and alerts	investigation, correlation, root cause analysis
Example	CPU > 90%, error rate > 5%	request is slow because service A runs DB query B after deploy C

Monitoring is one part of observability. You still need monitoring, but in a complex system, threshold-based monitoring alone is not enough.

The three main signals: Logs, Metrics, Traces

OpenTelemetry treats logs, metrics, and traces as the main signals of observability.³

Logs

Logs are records of events, either plain text or structured events.

Example of a good log:

{
  "timestamp": "2026-06-04T10:15:30Z",
  "level": "error",
  "service": "checkout-api",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "user_id_hash": "u_12ab",
  "order_id": "ord_9482",
  "message": "payment authorization failed",
  "provider": "stripe",
  "error_code": "card_declined"
}

Logs work well for:

detailed debugging;
audit;
exceptions;
business events;
security events;
reconstructing the timeline when investigating an incident.

Common mistakes:

unstructured logs;
logs without a standard timestamp format;
logs containing secrets, tokens, or passwords;
logging so much that cost balloons;
logs without a trace_id/request_id;
vague log messages such as “something went wrong”.
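
As a rough illustration of the points above, the sketch below emits one JSON object per log line using only Python's standard `logging` module. The `JsonFormatter` class, the `fields` convention for extra data, and the `SENSITIVE_KEYS` deny-list are assumptions made for this example, not part of any particular logging library.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    # Hypothetical deny-list; extend it to match your own secret field names.
    SENSITIVE_KEYS = {"password", "token", "api_key"}

    def format(self, record: logging.LogRecord) -> str:
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        event = {
            # ISO 8601 in UTC so logs from every service sort consistently.
            "timestamp": created.isoformat(timespec="seconds").replace("+00:00", "Z"),
            "level": record.levelname.lower(),
            "service": record.name,
            "message": record.getMessage(),
        }
        # Data passed as extra={"fields": {...}} becomes top-level keys.
        for key, value in getattr(record, "fields", {}).items():
            event[key] = "[REDACTED]" if key in self.SENSITIVE_KEYS else value
        return json.dumps(event)


def build_logger(service: str) -> logging.Logger:
    logger = logging.getLogger(service)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    return logger


if __name__ == "__main__":
    build_logger("checkout-api").error(
        "payment authorization failed",
        extra={"fields": {
            "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
            "error_code": "card_declined",
        }},
    )
```

Carrying the `trace_id` on every line is what later lets a log query jump to the matching trace.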

Metrics

Metrics are numeric measurements over time. Prometheus describes itself as a monitoring and alerting system that collects and stores metrics as time series, meaning each metric value is stored with a timestamp and optional key-value labels.⁴

Example metrics:

http_requests_total{service="checkout-api",status="500"} 42
http_request_duration_seconds_bucket{service="checkout-api",le="0.5"} 1832
queue_depth{queue="payment-events"} 1200
cpu_usage_percent{pod="checkout-api-7d9"} 82

Metrics work well for:

dashboards;
alerts;
trends;
capacity planning;
SLI/SLO;
autoscaling;
before/after deploy comparisons.

Common mistakes:

too many labels, which causes high cardinality;
alerting on CPU instead of user impact;
metrics not tagged with service/version/region;
no latency histogram;
measuring only infrastructure and no business metrics.
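
To show why a latency histogram matters, here is a minimal sketch of how a percentile can be estimated from cumulative buckets like the `http_request_duration_seconds_bucket` series above. It is a simplified take on the linear interpolation that Prometheus's `histogram_quantile` performs, not a faithful reimplementation; the function name and the sample bucket counts are made up for illustration.

```python
import math


def estimate_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate quantile q from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound and end with the +Inf bucket,
    mirroring the `le` label of a Prometheus histogram.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound) or count == lower_count:
                # Past the last finite boundary (or an empty bucket):
                # the best available answer is the lower boundary.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


if __name__ == "__main__":
    # Hypothetical counts: 50 requests <= 0.1s, 90 <= 0.5s, 99 <= 1s, 100 total.
    sample = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
    print(f"p95 is roughly {estimate_quantile(0.95, sample):.3f}s")
```

The estimate is only as precise as the bucket boundaries, which is why boundaries should sit near the latency targets you care about.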

Traces

Traces follow the path of a single request across multiple services. Each trace consists of spans, and each span is one processing step.

Example:

Trace: checkout request
  ├── frontend → POST /checkout             1200ms
  ├── checkout-api → validate cart            30ms
  ├── checkout-api → auth-service             80ms
  ├── checkout-api → inventory-service       150ms
  ├── checkout-api → payment-provider        850ms
  └── checkout-api → database write           40ms

Traces work well for:

microservices;
distributed systems;
latency investigation;
dependency maps;
finding the service that causes slowness;
analyzing failures for a specific request.

Common mistakes:

not propagating trace context;
instrumenting only the gateway and not the internal services;
sampling badly and losing important traces;
traces missing attributes such as user tier, region, route, and version;
traces that cannot be correlated with logs.
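
The first mistake, not propagating trace context, can be made concrete with a small sketch. It assumes the W3C Trace Context `traceparent` header (version `00`), which OpenTelemetry propagates by default; the helper names are hypothetical, and a real service would normally let its instrumentation library do this.

```python
import re
import secrets
from typing import Optional

# Format: version "00", 32-hex trace id, 16-hex parent span id, 2-hex flags.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace, for example at the edge of the system."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def child_traceparent(incoming: Optional[str]) -> str:
    """Build the header for an outgoing call.

    The trace id is kept so every hop lands in the same trace;
    only the span id changes to identify the current step.
    """
    match = TRACEPARENT_RE.match(incoming or "")
    if match is None or match.group(1) == "0" * 32:
        # Missing or invalid header: start a fresh trace instead of failing.
        return new_traceparent()
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


if __name__ == "__main__":
    inbound = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    print(child_traceparent(inbound))
```

If any service in the chain drops the header, everything downstream starts a new trace and the request appears as disconnected fragments.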

What are SLI, SLO, and SLA?

The Google SRE Book defines an SLO as a specific target for a service level, and an SLI as the indicator measured to evaluate that service level.⁵

Term	Meaning	Example
SLI	Service Level Indicator, the measured metric	99th percentile latency, availability, error rate
SLO	Service Level Objective, the internal target	99.9% of requests succeed over 30 days
SLA	Service Level Agreement, the commitment to customers	refund/credit if availability falls below the committed level

Example:

SLI: ratio of HTTP 2xx/3xx responses to total requests
SLO: 99.9% of requests succeed over 30 days
SLA: below 99.5%, customers receive a service credit

SLOs let you alert on user experience instead of alerting on every small event.

What is an error budget?

If the availability SLO is 99.9% over 30 days, you are allowed to fail 0.1% of requests, or the equivalent downtime (about 43 minutes in 30 days). That allowance is called the error budget.

Example:

SLO: 99.9% success rate
Error budget: 0.1% failed requests

If the error budget is burning too fast, the team should:

pause risky releases;
prioritize reliability work;
roll back rather than push further deploys;
reduce change velocity;
investigate the root cause.

An error budget balances development speed against stability.
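
The arithmetic behind an error budget is short enough to sketch. The function names and the traffic figure of 10 million requests per window are assumptions for illustration only.

```python
def error_budget(slo: float, window_days: int, total_requests: int) -> dict:
    """Translate an SLO into an allowance of downtime and failed requests."""
    budget_ratio = 1.0 - slo
    window_minutes = window_days * 24 * 60
    return {
        "budget_ratio": budget_ratio,
        "allowed_downtime_minutes": budget_ratio * window_minutes,
        "allowed_failed_requests": round(budget_ratio * total_requests),
    }


def budget_consumed(failed_requests: int, slo: float, total_requests: int) -> float:
    """Fraction of the error budget already spent (1.0 means fully spent)."""
    return failed_requests / ((1.0 - slo) * total_requests)


if __name__ == "__main__":
    # Hypothetical traffic: 10 million requests in the 30-day window.
    budget = error_budget(slo=0.999, window_days=30, total_requests=10_000_000)
    print(budget)
    print(f"consumed: {budget_consumed(2_500, 0.999, 10_000_000):.0%}")
```

With these inputs, a 99.9% SLO allows roughly 43.2 minutes of full downtime or 10,000 failed requests, and 2,500 failures so far means about a quarter of the budget is gone.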

Alerting: how to alert well

Google SRE warns that a monitoring system must detect conditions that need urgent intervention, but alerts must be operationally meaningful; poor alerts cause fatigue and get ignored.⁶

A good alert should have:

clear user impact;
clear severity;
a runbook;
an owner;
a dashboard link;
a trace/log query link;
a threshold based on an SLO where one exists;
no repeated firing for the same underlying cause.

Do not alert on:

a slight CPU increase that does not affect users;
warnings that need no action;
log noise;
brief metric fluctuations;
symptoms already covered by another alert.

Example of a good alert:

Checkout API 5xx rate exceeds 5% for 10 minutes
Impact: users cannot complete checkout
Action: check recent deployments, payment provider, database errors
Dashboard: ...
Runbook: ...
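
The “for 10 minutes” part of that alert is what filters out brief spikes. Below is a minimal sketch of that evaluation logic, assuming one sample per minute; the `Sample` type, `should_fire` function, and `min_requests` guard are hypothetical and only loosely mirror how a rule engine applies a hold duration.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One minute of traffic for a service (hypothetical shape)."""
    requests: int
    errors_5xx: int


def should_fire(samples: list[Sample], threshold: float = 0.05,
                for_minutes: int = 10, min_requests: int = 1) -> bool:
    """Fire only if every one of the last `for_minutes` samples breaches.

    Minutes with too little traffic never count as a breach, so a single
    failed request at 3 a.m. does not page anyone.
    """
    recent = samples[-for_minutes:]
    if len(recent) < for_minutes:
        return False
    floor = max(min_requests, 1)  # also avoids division by zero
    return all(
        s.requests >= floor and s.errors_5xx / s.requests > threshold
        for s in recent
    )


if __name__ == "__main__":
    sustained = [Sample(requests=1000, errors_5xx=60)] * 10
    print(should_fire(sustained))
```

A single recovered minute inside the window resets the condition, which trades a slightly slower alert for far fewer false pages.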

What is OpenTelemetry?

OpenTelemetry, or OTel, is an open-source, vendor-neutral observability framework. It provides APIs, SDKs, the Collector, and a protocol for generating telemetry data and sending it to backends such as Prometheus, Grafana, Jaeger, Tempo, Datadog, New Relic, Honeycomb, Elastic, or an in-house system.²

Strengths:

no lock-in to a single vendor;
support for many languages;
auto-instrumentation;
manual instrumentation;
OTLP for sending telemetry;
the OpenTelemetry Collector as an intermediate layer;
semantic conventions that standardize attribute names.

OpenTelemetry is a Cloud Native Computing Foundation (CNCF) project.⁷

What is the OpenTelemetry Collector?

The OpenTelemetry Collector is the component that receives, processes, and exports telemetry. It spares applications from sending data directly to every backend.

A common flow:

App / Service
  ↓ OTLP
OpenTelemetry Collector
  ↓ processors: batch, filter, transform, sampling
Observability backends
  ├── Prometheus / Mimir for metrics
  ├── Loki / Elasticsearch for logs
  └── Tempo / Jaeger for traces

The Collector has three main groups of components:

Component	Role
Receivers	accept telemetry from applications or agents
Processors	batch, filter, redact, transform, sample
Exporters	send telemetry to backends

The OpenTelemetry docs describe the Collector as a way to receive, process, and export telemetry data through receivers, processors, exporters, and connectors.⁸
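
The receiver, processor, exporter flow can be pictured with a toy model. This is not how the real Collector is written or configured; it is a sketch in plain Python whose function names and record shape are invented purely to show records flowing through a chain of stages.

```python
from typing import Callable, Iterable, Iterator

Record = dict
Processor = Callable[[Iterable[Record]], Iterator[Record]]


def drop_debug(records: Iterable[Record]) -> Iterator[Record]:
    """Filter processor: discard low-value debug records."""
    for record in records:
        if record.get("level") != "debug":
            yield record


def add_environment(records: Iterable[Record]) -> Iterator[Record]:
    """Transform processor: stamp every record with its environment."""
    for record in records:
        yield {**record, "deployment.environment.name": "production"}


def batch(records: Iterable[Record], size: int) -> Iterator[list]:
    """Batch processor: group records so the exporter makes fewer calls."""
    chunk = []
    for record in records:
        chunk.append(record)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def run_pipeline(received: Iterable[Record], processors: list,
                 export: Callable[[list], None], batch_size: int = 2) -> None:
    stream = received
    for processor in processors:
        stream = processor(stream)
    for chunk in batch(stream, batch_size):
        export(chunk)


if __name__ == "__main__":
    incoming = [{"level": "debug"}, {"level": "error"}, {"level": "info"}]
    run_pipeline(incoming, [drop_debug, add_environment], print)
```

Because applications only talk to the first stage, backends can be swapped or added by changing the export step without touching application code.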

What are Prometheus, Grafana, Loki, and Tempo used for?

Tool	Role
Prometheus	collects and queries time series metrics
Alertmanager	manages alerts from Prometheus
Grafana	dashboards and visualization
Loki	log aggregation, usually queried with LogQL
Tempo	distributed tracing backend
Mimir/Thanos	long-term metrics storage at large scale
Jaeger	widely used tracing backend
OpenTelemetry Collector	collects, processes, and exports telemetry

Prometheus fits metrics and alerting; Grafana is typically used to display dashboards; Loki stores logs; Tempo or Jaeger stores traces.⁴ ⁹ ¹⁰ ¹¹

Sample observability architectures

A small architecture for a small team

App emits logs + metrics
  ├── metrics → Prometheus (scrape) → Alertmanager
  └── logs → Loki
  ↓
Grafana dashboards over Prometheus + Loki

Suitable for:

a monolith;
a handful of services;
no need for deep tracing yet;
a small budget.

A cloud-native architecture

Services instrumented with OpenTelemetry
  ↓ OTLP
OpenTelemetry Collector DaemonSet/Gateway
  ↓
Metrics → Prometheus/Mimir
Logs    → Loki/Elastic
Traces  → Tempo/Jaeger
  ↓
Grafana dashboards + alerts + incident runbooks

Suitable for:

Kubernetes;
microservices;
multiple teams;
correlation across logs, metrics, and traces.

What is instrumentation?

Instrumentation means adding code or an agent so that an application emits telemetry.

There are two kinds:

Type	Description
Auto-instrumentation	an agent or library automatically captures HTTP, DB, queue, and framework calls
Manual instrumentation	developers create spans, metrics, and log attributes for business logic themselves

Auto-instrumentation gets you started quickly. Manual instrumentation gives telemetry its business context.

Example of a manual span:

Span: checkout.calculate_discount
Attributes:
  cart.items_count = 4
  user.tier = "premium"
  discount.rule = "campaign_2026_q2"

Do not write sensitive PII such as email addresses, phone numbers, tokens, postal addresses, or card numbers into attributes.
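
To make the idea of a manual span tangible, here is a toy stand-in built on the standard library. It is deliberately not the OpenTelemetry API: the `span` context manager, the `FINISHED_SPANS` list, and the `FORBIDDEN_SUFFIXES` guard are hypothetical names used only to show what a span records and how a PII check could sit in front of it.

```python
import time
from contextlib import contextmanager

FINISHED_SPANS: list = []

# Hypothetical guard: attribute names ending in these words are rejected.
FORBIDDEN_SUFFIXES = {"email", "phone", "token", "address", "card_number"}


@contextmanager
def span(name: str, attributes: dict):
    """Record the name, attributes, status, and duration of one step."""
    for key in attributes:
        if key.rsplit(".", 1)[-1] in FORBIDDEN_SUFFIXES:
            raise ValueError(f"attribute {key!r} looks like PII")
    record = {"name": name, "attributes": dict(attributes), "status": "ok"}
    start = time.perf_counter()
    try:
        yield record
    except Exception:
        record["status"] = "error"
        raise
    finally:
        record["duration_ms"] = (time.perf_counter() - start) * 1000
        FINISHED_SPANS.append(record)


if __name__ == "__main__":
    with span("checkout.calculate_discount", {
        "cart.items_count": 4,
        "user.tier": "premium",
        "discount.rule": "campaign_2026_q2",
    }):
        pass  # business logic would run here
    print(FINISHED_SPANS[-1])
```

In a real service the same shape comes from the tracing SDK, which also handles span ids, parent links, and export.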

What are semantic conventions?

Semantic conventions are OpenTelemetry's naming rules for attributes. The goal is for services, languages, and backends to share field names such as service.name, http.request.method, url.path, db.system.name, and server.address.¹²

Benefits:

dashboards are easier to reuse;
queries stay consistent;
correlation improves;
less chaos when many teams instrument in different ways.

Example:

service.name = checkout-api
deployment.environment.name = production
http.request.method = POST
http.route = /checkout
http.response.status_code = 500
db.system.name = postgresql

What is cardinality and why is it dangerous?

Cardinality is the number of distinct values a label or attribute can take. Metrics with very high cardinality can overload the backend, consume RAM, cost money, and slow queries down.

Bad example:

http_requests_total{user_id="123456", email="user@example.com", request_id="abc"} 1

The problem is that user_id, email, and request_id each take far too many distinct values, and every new combination creates a new time series.

Better example:

http_requests_total{service="checkout-api", route="/checkout", status="500"} 1

Rules:

do not use user_id, email, or request_id as metric labels;
a route should be a template such as /users/:id, not /users/123;
use traces and logs to investigate a specific request;
use metrics to see overall trends.
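
Two of these rules lend themselves to a short sketch: turning concrete paths into route templates, and estimating how many time series a label set can produce. The helper names and the label counts in the usage example are assumptions, and the series count is a worst-case upper bound (every combination actually occurring).

```python
import math
import re

_UUID = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$",
    re.IGNORECASE,
)


def route_template(path: str) -> str:
    """Replace numeric and UUID path segments with a placeholder."""
    segments = []
    for segment in path.split("/"):
        if segment.isdigit() or _UUID.match(segment):
            segments.append(":id")
        else:
            segments.append(segment)
    return "/".join(segments)


def max_series(label_cardinalities: dict) -> int:
    """Upper bound on time series: the product of distinct values per label."""
    return math.prod(label_cardinalities.values())


if __name__ == "__main__":
    print(route_template("/users/123"))
    # Hypothetical counts of distinct label values.
    print(max_series({"service": 20, "route": 50, "status": 5}))
    print(max_series({"service": 20, "route": 50, "status": 5, "user_id": 100_000}))
```

In this example, adding a `user_id` label with 100,000 values multiplies the potential series from 5,000 to 500 million.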

How should a dashboard be designed?

A good dashboard answers operational questions rather than just displaying charts.

A service dashboard should include:

request rate;
error rate;
latency p50/p95/p99;
saturation: CPU, memory, queue depth, connection pool;
dependency latency;
deployment version;
recent error logs;
trace exemplars, if available;
SLO burn rate;
a runbook link.

Suggested layout:

Row 1: user impact
Row 2: service health
Row 3: dependencies
Row 4: infrastructure
Row 5: logs/traces links

The RED and USE methods

RED method

For services and requests:

Rate: requests per second
Errors: error rate
Duration: latency

Suits APIs, microservices, and gateways.

USE method

For resources:

Utilization: how much of the resource is in use
Saturation: how much work is queued or waiting
Errors: resource errors

Suits CPU, memory, disk, network, and connection pools.

Combining RED and USE shows both user impact and infrastructure health.

Observability for Kubernetes

Kubernetes needs to be watched at several layers:

Layer	What to measure
Cluster	node health, API server, scheduler, etcd
Node	CPU, memory, disk, network
Pod	restart count, OOMKilled, CPU/memory, readiness
Deployment	replicas, rollout status, version
Service	request rate, error rate, latency
Ingress	4xx/5xx, TLS, latency
Workload	business metrics, queue depth
Security	audit logs, policy violations

Do not watch CPU and memory alone. Many user-facing incidents originate in a dependency, the database, a queue, or an external API.

Observability for databases

Database observability needs:

query latency;
slow queries;
connection pool usage;
locks/deadlocks;
replication lag;
cache hit ratio;
disk usage;
transaction rate;
error rate;
backup status;
failover events.

When an API is slow, traces should show which DB query the latency sits in, not merely say “request slow”.

Observability for the frontend

The frontend needs observability too:

page load time;
Core Web Vitals;
JavaScript errors;
API latency as seen from the browser;
route-level error rate;
device/browser breakdown;
geographic latency;
rage clicks/session replay, where policy allows;
feature flag impact.

A “healthy” backend does not guarantee a good user experience.

Observability for AI/LLM apps

AI applications need additional, domain-specific telemetry:

model name/version;
provider;
prompt tokens;
completion tokens;
estimated total cost;
latency per provider;
rate limits;
retry/fallback count;
tool calls;
retrieval latency;
vector search top-k;
grounding/citation coverage;
safety refusals/errors;
user feedback.

Do not log raw prompts that contain sensitive data unless a redaction policy is in place.

Securing observability data

Telemetry often contains sensitive data. Put these controls in place:

never log passwords, tokens, or API keys;
hash or redact user identifiers;
limit PII in traces;
sample or drop telemetry that carries sensitive data;
encryption in transit and at rest;
a retention policy;
RBAC for dashboards and logs;
audit access to logs;
separate production logs from dev;
filter/redact at the OpenTelemetry Collector where possible.

Observability must not become a data leak.
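
A few of the controls above (redacting secrets, hashing identifiers, masking stray email addresses) can be sketched as one scrubbing function applied before telemetry leaves the process. The key lists, the `scrub` name, and the salt handling are assumptions for illustration; a salted hash is pseudonymization, not anonymization, so treat the output as still sensitive.

```python
import hashlib
import re

# Hypothetical key lists; adapt them to your own field names.
SECRET_KEYS = {"password", "token", "api_key", "authorization"}
IDENTIFIER_KEYS = {"user_id", "email"}
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def scrub(attributes: dict, salt: str) -> dict:
    """Return a copy of `attributes` that is safer to ship to a backend."""
    clean = {}
    for key, value in attributes.items():
        if key.lower() in SECRET_KEYS:
            # Secrets are never useful in telemetry: drop the value entirely.
            clean[key] = "[REDACTED]"
        elif key.lower() in IDENTIFIER_KEYS:
            # Keep the ability to group by user without storing the raw id.
            digest = hashlib.sha256((salt + str(value)).encode()).hexdigest()
            clean[key] = digest[:12]
        elif isinstance(value, str):
            # Catch email addresses embedded in free-text messages.
            clean[key] = EMAIL_RE.sub("[EMAIL]", value)
        else:
            clean[key] = value
    return clean


if __name__ == "__main__":
    print(scrub({
        "user_id": 123456,
        "token": "abc123",
        "message": "receipt sent to jane@example.com",
        "status": 500,
    }, salt="rotate-me"))
```

Doing the same work centrally at the Collector adds a second line of defense for services that forget to scrub.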

A 30/60/90-day rollout roadmap

Days 1–30: foundations

Standardize structured logging.
Add request_id/trace_id to logs.
Collect basic metrics: rate, errors, duration, saturation.
Stand up Prometheus + Grafana or use a managed backend.
Build dashboards for the 3–5 most important services.
Create the first user-impact alert.
Write runbooks for the important alerts.
Set a sensible log retention period.

Days 31–60: traces and SLOs

Integrate the OpenTelemetry SDK/auto-instrumentation.
Add the OpenTelemetry Collector.
Enable distributed tracing for the main requests.
Define SLIs/SLOs for the important services.
Add SLO burn-rate alerts.
Correlate logs with traces.
Add dependency metrics: DB, queue, cache, external API.
Review cardinality.

Days 61–90: optimization and governance

A sampling policy for traces.
Redaction/filtering at the Collector.
Dashboards-as-code.
Regular alert reviews.
An incident postmortem template.
Cost optimization for logs and traces.
A service catalog with owner/runbook/SLO.
Synthetic checks for critical user journeys.
Observability for frontend and AI workloads, if any.

Quick rollout checklist

Logs

Structured JSON logs.
Service name, environment, and version present.
trace_id/request_id present.
No secrets or sensitive PII logged.
Correct log levels.
A retention policy.

Metrics

RED metrics for services.
USE metrics for resources.
Latency histograms.
Label cardinality under control.
Key business metrics.
Alerts based on SLOs.

Traces

Trace context propagated.
Gateway, services, DB, queue, and external calls instrumented.
Sensible sampling.
Standard span attributes.
Trace-log correlation.
No PII in spans.

Dashboard

Dashboards per service.
Dashboards per user journey.
Links to logs/traces.
Deployment markers.
SLO/burn rate.
An owner and a runbook.

Alerting

Alerts have a clear action.
Alerts are based on user impact.
Severity is set.
A runbook exists.
Dedup/silence is available.
Alert noise is reviewed regularly.

Common mistakes

Having dashboards but no good alerts.
Too many alerts, causing alert fatigue.
Logs without request_id/trace_id.
Metric label cardinality that is too high.
Measuring only CPU/memory and not user impact.
Trace sampling that drops error traces.
No runbook.
No owner for a service.
Retention that is too long and drives up cost.
Writing PII/secrets into logs.
Not linking deployments to incidents.
Not defining SLOs.
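
One mistake in the list above, trace sampling that drops error traces, is usually a sign that the keep-or-drop decision is made before the outcome of the request is known. The sketch below shows a tail-style decision made after the trace completes; the function name, the 1000 ms slow threshold, and the 5% baseline rate are assumptions, not recommended values.

```python
def keep_trace(trace_id: str, has_error: bool, duration_ms: float,
               slow_threshold_ms: float = 1000.0,
               baseline_rate: float = 0.05) -> bool:
    """Decide whether to store a finished trace.

    Error and slow traces are always kept; the rest are sampled at
    `baseline_rate` so normal traffic stays affordable.
    """
    if has_error or duration_ms >= slow_threshold_ms:
        return True
    # Derive the decision from the trace id so every component that sees
    # the same trace reaches the same answer without coordination.
    bucket = int(trace_id[:16], 16) / 2**64
    return bucket < baseline_rate


if __name__ == "__main__":
    trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
    print(keep_trace(trace_id, has_error=True, duration_ms=40))
    print(keep_trace(trace_id, has_error=False, duration_ms=40))
```

The trade-off is that the whole trace must be buffered until it finishes, which is why this kind of decision typically lives in a collector layer rather than in each service.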

Tooling reference

Need	Tool
Metrics	Prometheus, Mimir, VictoriaMetrics
Dashboard	Grafana
Logs	Loki, Elasticsearch/OpenSearch
Traces	Tempo, Jaeger
Telemetry framework	OpenTelemetry
Collector	OpenTelemetry Collector
Alert routing	Alertmanager, Grafana Alerting, PagerDuty/Opsgenie
Synthetic monitoring	Grafana Synthetic Monitoring, Checkly, k6
Runtime profiling	Parca, Pyroscope
Incident management	PagerDuty, Opsgenie, incident.io

Tools matter less than the right data, the right alerts, and the right response process.

Practical SLO templates

Checkout API

SLI: ratio of successful POST /checkout requests
SLO: 99.9% over 30 days
Excluded: 4xx requests caused by invalid user input
Alert: burning 2% of the budget in 1 hour or 10% of the budget in 6 hours
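
The alert line above can be converted into concrete error-rate thresholds. The sketch assumes the usual definition of burn rate (how many times faster than “exactly on budget” the budget is being spent); the function names are hypothetical.

```python
def burn_rate(budget_fraction: float, alert_window_hours: float,
              slo_window_days: int = 30) -> float:
    """Burn rate needed to spend `budget_fraction` within the alert window.

    A burn rate of 1.0 spends the whole budget exactly over the SLO window.
    """
    slo_window_hours = slo_window_days * 24
    return budget_fraction * slo_window_hours / alert_window_hours


def error_rate_threshold(rate: float, slo: float) -> float:
    """Sustained error ratio that corresponds to a given burn rate."""
    return rate * (1.0 - slo)


if __name__ == "__main__":
    for fraction, hours in [(0.02, 1), (0.10, 6)]:
        rate = burn_rate(fraction, hours)
        threshold = error_rate_threshold(rate, slo=0.999)
        print(f"{fraction:.0%} in {hours}h: burn rate {rate:.1f}, "
              f"error rate above {threshold:.2%}")
```

For the 99.9% checkout SLO, 2% of the budget in 1 hour corresponds to a burn rate of about 14.4 (a sustained error rate near 1.44%), and 10% in 6 hours to a burn rate of about 12 (near 1.2%).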

Search service

SLI: p95 search latency under 500 ms, evaluated per measurement window
SLO: 99% of measurement windows meet the p95 target over 7 days
Alert: p95 latency > 500 ms for 15 minutes while traffic is above a minimum threshold

Background job

SLI: payment-events processed within 5 minutes
SLO: 99.5% of events processed within 5 minutes over 30 days
Alert: queue lag > 5 minutes for 20 minutes

FAQ

What is observability?

Observability is the ability to understand a system's internal state from the data it emits, such as logs, metrics, traces, and profiles. It helps you investigate the cause of an incident instead of only knowing that the system is failing.

How does monitoring differ from observability?

Monitoring usually answers “is something failing?”. Observability goes deeper: “why is it failing, where, who is affected, and which fix should we pursue?”.

What is OpenTelemetry?

OpenTelemetry is an open-source, vendor-neutral observability framework that provides APIs, SDKs, the Collector, and a protocol for generating, collecting, and exporting telemetry data.²

Do I need logs, metrics, and traces all together?

Yes, but you do not have to roll them all out at once. Metrics are good for alerts and trends; logs are good for detail; traces are good for requests that cross many services. Start with metrics and structured logs, then add tracing for the important flows.

Does Prometheus replace OpenTelemetry?

No. Prometheus is primarily for metrics monitoring and alerting. OpenTelemetry is a framework for generating and transporting telemetry. The two can be used together.

Does every service need an SLO?

Not right away. Start with the most important service, the most important user journey, and the clearest business impact.

Conclusion

Observability is the foundation for operating modern IT systems. It is not just installing Grafana or storing lots of logs; it is designing operational data well enough to answer quickly: what is wrong with the system, who is affected, where the cause lies, and what action to take.

A practical way to start is to standardize structured logs, collect RED/USE metrics, build service dashboards, write alerts with runbooks, and then add OpenTelemetry tracing for the important flows. As the system grows, use SLOs, error budgets, the OpenTelemetry Collector, sampling, redaction, and dashboards-as-code to turn observability into a durable operational capability.

References

1. OpenTelemetry Docs. “Observability primer.” https://opentelemetry.io/docs/concepts/observability-primer/
2. OpenTelemetry Docs. “What is OpenTelemetry?” https://opentelemetry.io/docs/what-is-opentelemetry/
3. OpenTelemetry Docs. “Signals.” https://opentelemetry.io/docs/concepts/signals/
4. Prometheus Docs. “Overview.” https://prometheus.io/docs/introduction/overview/
5. Google SRE Book. “Service Level Objectives.” https://sre.google/sre-book/service-level-objectives/
6. Google SRE Book. “Monitoring Distributed Systems.” https://sre.google/sre-book/monitoring-distributed-systems/
7. CNCF. “OpenTelemetry.” https://www.cncf.io/projects/opentelemetry/
8. OpenTelemetry Docs. “Collector.” https://opentelemetry.io/docs/collector/
9. Grafana Docs. “Introduction.” https://grafana.com/docs/grafana/latest/introduction/
10. Grafana Loki Docs. https://grafana.com/docs/loki/latest/
11. Grafana Tempo Docs. https://grafana.com/docs/tempo/latest/
12. OpenTelemetry Docs. “Semantic conventions.” https://opentelemetry.io/docs/concepts/semantic-conventions/

Written by the PixelRouter Editorial Team

We publish in-depth, accurate articles on AI infrastructure, API security, cloud financial management, and system optimization for developers.
