7.11 Distributed System Tracing
You've mastered tracing within a single application. Your Django app is fully observable with Debug Toolbar. Your React components are profiled with DevTools. But now you're facing a different challenge: your system isn't a monolith anymore. It's five microservices communicating over HTTP, publishing to message queues, writing to shared databases, and caching aggressively. When something breaks, logs from individual services don't tell the complete story.
A user reports: "My checkout failed." You look at logs:
[api-gateway] POST /api/checkout 200 OK
[inventory] Reserved items for order_12345
[payment] Processing payment for $124.99
[payment] Payment successful: charge_xyz
[fulfillment] No logs found
[email] No logs found
The payment succeeded but fulfillment never happened. Which service failed? Was the message lost? Did fulfillment crash? Without correlation between these logs, you're guessing.
This is where distributed tracing becomes essential. Notice what's different from single-application tracing: you need to follow a single user request across multiple processes, potentially across multiple machines, with no shared memory or debugging session.
7.11.1 OpenTelemetry Basics
OpenTelemetry (OTel) is the industry-standard solution for distributed tracing. It supersedes earlier instrumentation layers such as OpenTracing, OpenCensus, the native Zipkin and Jaeger client libraries, and proprietary APM agents (the Jaeger and Zipkin backends themselves remain widely used, as you'll see later in this section). The key insight: instrument once with OTel, export to any backend.
Before we dive into configuration, let me show you what problem this solves with a concrete example.
The Problem Without Distributed Tracing
You have three services handling checkout:
# Service A: API Gateway
@app.post("/checkout")
def checkout(request):
user_id = request.user_id
items = request.json["items"]
# Call inventory service
inventory_response = requests.post(
"http://inventory:8001/reserve",
json={"user_id": user_id, "items": items}
)
# Call payment service
payment_response = requests.post(
"http://payment:8002/charge",
json={"user_id": user_id, "amount": inventory_response.json()["total"]}
)
return {"status": "success", "charge_id": payment_response.json()["charge_id"]}
# Service B: Inventory Service
@app.post("/reserve")
def reserve_items(request):
user_id = request.json["user_id"]
items = request.json["items"]
# Check inventory
for item in items:
stock = db.query("SELECT quantity FROM inventory WHERE id = ?", item["id"])
if stock < item["quantity"]:
raise ValueError(f"Insufficient stock for {item['id']}")
# Reserve items
total = calculate_total(items)
return {"status": "reserved", "total": total}
# Service C: Payment Service
@app.post("/charge")
def charge_payment(request):
user_id = request.json["user_id"]
amount = request.json["amount"]
# Process payment
charge = stripe.Charge.create(amount=amount, customer=user_id)
return {"status": "success", "charge_id": charge.id}
When checkout is slow, each service logs independently:
[api-gateway] Checkout request took 2.3s
[inventory] Reserve took 0.1s
[payment] Charge took 0.05s
Where did the extra 2.15 seconds go? Network latency? Queueing delays? You can't tell from these logs. They're not correlated.
The Solution With OpenTelemetry
With OTel, every request gets a unique trace ID that flows across all services:
[api-gateway] trace_id=abc123 span=checkout duration=2.3s
[inventory] trace_id=abc123 span=reserve parent=checkout duration=0.1s
[payment] trace_id=abc123 span=charge parent=checkout duration=0.05s
Now you can see the parent-child relationships and identify that 2.15 seconds is unaccounted for—likely network delays or queueing between services.
Installing Auto-Instrumentation
The fastest way to start with OTel is auto-instrumentation. This requires zero code changes to your application:
# Install OpenTelemetry packages
pip install opentelemetry-distro opentelemetry-exporter-otlp
# Auto-discover and install instrumentation for your frameworks
opentelemetry-bootstrap -a install
# Run your application with instrumentation
opentelemetry-instrument \
--traces_exporter console \
--metrics_exporter console \
python app.py
That's it. Your application now emits traces. Let's see what this does to our checkout code:
# Your original code - unchanged
@app.post("/checkout")
def checkout(request):
user_id = request.user_id
items = request.json["items"]
# This HTTP request is automatically traced
inventory_response = requests.post(
"http://inventory:8001/reserve",
json={"user_id": user_id, "items": items}
)
# So is this one
payment_response = requests.post(
"http://payment:8002/charge",
json={"user_id": user_id, "amount": inventory_response.json()["total"]}
)
return {"status": "success", "charge_id": payment_response.json()["charge_id"]}
Auto-instrumentation intercepts:
- HTTP requests (the requests, httpx, and aiohttp libraries)
- Database queries (psycopg2, pymongo, sqlalchemy)
- Framework handlers (Flask, Django, FastAPI)
- Async operations (asyncio tasks)
If you prefer to wire this up in code rather than wrap the process with opentelemetry-instrument, a short sketch follows this list.
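A minimal programmatic sketch, assuming the opentelemetry-instrumentation-flask and opentelemetry-instrumentation-requests packages are installed and that the service is a Flask app (the app here is illustrative):
from flask import Flask
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

app = Flask(__name__)

# Every request handled by this Flask app becomes a server span
FlaskInstrumentor().instrument_app(app)

# Every outgoing call made with the requests library becomes a client span
RequestsInstrumentor().instrument()
This is roughly what the opentelemetry-instrument wrapper does at startup, just made explicit so you can control which libraries get instrumented.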
When you run the instrumented service, the console exporter prints spans like this:
{
"name": "POST /checkout",
"context": {
"trace_id": "0x5e3e2e3c6e4d5f6a7b8c9d0e1f2a3b4c",
"span_id": "0x1a2b3c4d5e6f7a8b",
"trace_state": "[]"
},
"kind": "SpanKind.SERVER",
"parent_id": null,
"start_time": "2025-10-11T10:30:00.123456Z",
"end_time": "2025-10-11T10:30:02.456789Z",
"status": {
"status_code": "OK"
},
"attributes": {
"http.method": "POST",
"http.url": "/checkout",
"http.status_code": 200
},
"events": [],
"links": [],
"resource": {
"service.name": "api-gateway"
}
}
This is crucial: Notice the trace_id and span_id. These are automatically propagated to downstream services via HTTP headers. When the inventory service receives the request, it continues the same trace.
Tracing Across Service Boundaries
Here's how trace context propagates between services. Auto-instrumentation handles this automatically, but understanding the mechanism helps with debugging.
When API Gateway calls Inventory Service:
# API Gateway (auto-instrumented)
response = requests.post(
"http://inventory:8001/reserve",
json={"user_id": user_id, "items": items}
)
Auto-instrumentation adds headers to the request:
POST /reserve HTTP/1.1
Host: inventory:8001
traceparent: 00-5e3e2e3c6e4d5f6a7b8c9d0e1f2a3b4c-1a2b3c4d5e6f7a8b-01
tracestate:

{
  "user_id": 12345,
  "items": [...]
}
The traceparent header contains:
- 00: Version
- 5e3e2e3c...: Trace ID (same across all services)
- 1a2b3c4d...: Parent span ID (the checkout span)
- 01: Flags (sampled)
A minimal parsing sketch follows this list.
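Here is that parsing sketch, purely to make the format concrete; in real services the W3C Trace Context propagator built into OTel parses and validates this header for you:
def parse_traceparent(header: str) -> dict:
    # Format: version-traceid-parentid-flags, all lowercase hex
    version, trace_id, parent_span_id, flags = header.split("-")
    return {
        "version": version,                 # "00"
        "trace_id": trace_id,               # 32 hex chars, shared by the whole trace
        "parent_span_id": parent_span_id,   # 16 hex chars, the calling span
        "sampled": bool(int(flags, 16) & 0x01),  # low bit of the flags byte
    }

print(parse_traceparent("00-5e3e2e3c6e4d5f6a7b8c9d0e1f2a3b4c-1a2b3c4d5e6f7a8b-01"))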
The Inventory Service (also auto-instrumented) reads this header and creates a child span:
# Inventory Service (auto-instrumented)
@app.post("/reserve")
def reserve_items(request):
# OTel automatically extracts trace context from headers
# and creates a child span with the same trace_id
user_id = request.json["user_id"]
items = request.json["items"]
# ... your code unchanged
The resulting span:
{
"name": "POST /reserve",
"context": {
"trace_id": "0x5e3e2e3c6e4d5f6a7b8c9d0e1f2a3b4c", // Same trace_id!
"span_id": "0x9f8e7d6c5b4a3929",
"trace_state": "[]"
},
"parent_id": "0x1a2b3c4d5e6f7a8b", // Points to checkout span
"kind": "SpanKind.SERVER"
// ...
}
Notice the key relationships:
- The same trace_id appears across all services
- Each child span's parent_id points to its parent span
- Together these links build a tree of execution across services
If you ever need to do this propagation by hand, a short sketch follows.
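For transports that auto-instrumentation does not cover (a custom RPC layer, a homegrown queue), OTel's propagation API lets you inject and extract the context yourself. A minimal sketch; the URL and the handler function are hypothetical:
import requests
from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer(__name__)

# Caller side: copy the current trace context into outgoing headers
headers = {}
inject(headers)  # adds the traceparent (and tracestate) headers
requests.post("http://inventory:8001/reserve",
              json={"user_id": 12345, "items": []},
              headers=headers)

# Callee side: rebuild the context from incoming headers and continue the trace
def handle_reserve(incoming_headers):
    ctx = extract(incoming_headers)
    with tracer.start_as_current_span("reserve", context=ctx):
        pass  # handler logic goes here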
Manual Instrumentation for Custom Operations
Auto-instrumentation covers common libraries, but sometimes you need custom spans. Here's how to add them:
from opentelemetry import trace
tracer = trace.get_tracer(__name__)
@app.post("/checkout")
def checkout(request):
user_id = request.user_id
items = request.json["items"]
# Auto-instrumented HTTP call
inventory_response = requests.post(
"http://inventory:8001/reserve",
json={"user_id": user_id, "items": items}
)
# Custom span for business logic
with tracer.start_as_current_span("calculate_discount") as span:
span.set_attribute("user_id", user_id)
span.set_attribute("item_count", len(items))
discount = calculate_user_discount(user_id, items)
span.set_attribute("discount_amount", discount)
# Another custom span
with tracer.start_as_current_span("apply_loyalty_points") as span:
points = apply_loyalty_points(user_id, inventory_response.json()["total"])
span.set_attribute("points_earned", points)
payment_response = requests.post(
"http://payment:8002/charge",
json={"user_id": user_id, "amount": inventory_response.json()["total"] - discount}
)
return {"status": "success", "charge_id": payment_response.json()["charge_id"]}
Now your trace shows custom business logic spans alongside HTTP calls:
checkout (2.3s)
├── POST http://inventory:8001/reserve (0.1s)
│ └── SELECT FROM inventory (0.05s)
├── calculate_discount (0.8s)
│ └── SELECT FROM user_discounts (0.7s)
├── apply_loyalty_points (0.02s)
└── POST http://payment:8002/charge (0.05s)
You've now identified the slow operation: calculate_discount takes 0.8 seconds, mostly in database queries.
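Custom spans are also where failures should become visible: if the discount lookup throws, you want the span marked as failed so Jaeger highlights it. A minimal sketch using OTel's status and exception APIs; calculate_user_discount is the same hypothetical helper as in the code above:
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer(__name__)

def discounted_total(user_id, items, total):
    with tracer.start_as_current_span("calculate_discount") as span:
        try:
            discount = calculate_user_discount(user_id, items)
            span.set_attribute("discount_amount", discount)
            return total - discount
        except Exception as exc:
            span.record_exception(exc)                            # recorded as a span event
            span.set_status(Status(StatusCode.ERROR, str(exc)))   # marks the span as failed
            raise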
Jaeger and Zipkin Visualization
Console output is useful for development, but production needs a tracing backend. The two most popular open-source options are Jaeger and Zipkin.
Running Jaeger locally:
# Start Jaeger all-in-one (includes UI, collector, and storage)
docker run -d --name jaeger \
-p 16686:16686 \
-p 4317:4317 \
-p 4318:4318 \
jaegertracing/all-in-one:latest
Configure your app to export to Jaeger:
# app.py
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
# Set up tracing
trace.set_tracer_provider(TracerProvider())
otlp_exporter = OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True)
trace.get_tracer_provider().add_span_processor(BatchSpanProcessor(otlp_exporter))
# Your application code
# ...
Or using environment variables (simpler):
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
export OTEL_SERVICE_NAME=api-gateway
opentelemetry-instrument python app.py
Now visit http://localhost:16686 to see the Jaeger UI. Here's what you can do:
1. Search for traces:
- By service name: "Show me all traces from api-gateway"
- By operation: "Show me all /checkout requests"
- By duration: "Show me requests that took > 1 second"
- By tags: "Show me traces where user_id=12345"
2. Visualize trace timelines:
Trace ID: 5e3e2e3c...
Total Duration: 2.3s
api-gateway: POST /checkout (2.3s)
├─ [100ms network latency]
├─ inventory: POST /reserve (0.1s)
│ └─ db: SELECT FROM inventory (0.05s)
├─ [1.5s network latency] ← Problem identified!
├─ payment: POST /charge (0.05s)
│ └─ stripe: API call (0.04s)
└─ [50ms network latency]
The visualization immediately shows you that 1.5 seconds is spent between inventory and payment calls—likely a network issue or service queueing.
3. Analyze service dependencies:
Jaeger generates service dependency graphs showing which services call which:
api-gateway → inventory
api-gateway → payment
payment → stripe-api
inventory → postgres
This helps you understand architecture and identify critical paths.
Zipkin is similar but has a different UI:
# Run Zipkin
docker run -d -p 9411:9411 openzipkin/zipkin
# Export to Zipkin (requires the opentelemetry-exporter-zipkin package)
export OTEL_TRACES_EXPORTER=zipkin
export OTEL_EXPORTER_ZIPKIN_ENDPOINT=http://localhost:9411/api/v2/spans
Choose Jaeger or Zipkin based on preference—they're functionally equivalent for most use cases. Jaeger has better query capabilities; Zipkin has a simpler UI.
Common pitfalls with OpenTelemetry:
- Forgetting to propagate context: If you manually create threads or async tasks, you must propagate trace context yourself. Use contextvars or OTel's context API; a thread example follows this list.
- Over-instrumenting: Don't create spans for every function call. Focus on service boundaries, database calls, and business-critical operations.
- Not sampling in production: Tracing every request in high-traffic production adds overhead. Use sampling:
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased
# Sample 1% of traces by installing the sampler on the tracer provider
trace.set_tracer_provider(TracerProvider(sampler=TraceIdRatioBased(0.01)))
- Ignoring attributes: Add meaningful attributes to spans; they're searchable in Jaeger/Zipkin:
with tracer.start_as_current_span("process_order") as span:
    span.set_attribute("order_id", order_id)
    span.set_attribute("user_tier", user.tier)  # Searchable!
    span.set_attribute("item_count", len(items))
7.11.2 Structured Logging
OpenTelemetry provides tracing—the execution flow across services. But you also need logging—the detailed events within each span. The key is connecting traces and logs so you can jump from "this trace was slow" to "here are the log messages from that trace."
Correlation IDs and Request Tracing
The fundamental pattern: every log message should include the trace ID and span ID. This lets you filter logs by trace.
Manual correlation ID:
import logging
import uuid
from contextvars import ContextVar
# Store request_id in context-local storage
request_id_var = ContextVar('request_id', default=None)
logger = logging.getLogger(__name__)
@app.middleware("http")
async def add_request_id(request, call_next):
# Generate or extract request ID
request_id = request.headers.get('X-Request-ID', str(uuid.uuid4()))
request_id_var.set(request_id)
response = await call_next(request)
response.headers['X-Request-ID'] = request_id
return response
# Custom log filter that adds request_id to every log
class RequestIDFilter(logging.Filter):
def filter(self, record):
record.request_id = request_id_var.get()
return True
# Configure logging
handler = logging.StreamHandler()
handler.addFilter(RequestIDFilter())
handler.setFormatter(logging.Formatter(
'%(asctime)s [%(request_id)s] %(levelname)s %(message)s'
))
logger.addHandler(handler)
# Now all logs include request_id
@app.post("/checkout")
def checkout(request):
logger.info("Processing checkout") # Includes request_id automatically
# ...
Integration with OpenTelemetry:
OpenTelemetry provides trace context automatically. Connect it to logs:
import logging
from opentelemetry import trace
class TraceContextFilter(logging.Filter):
"""Add trace context to log records."""
def filter(self, record):
span = trace.get_current_span()
span_context = span.get_span_context()
record.trace_id = format(span_context.trace_id, '032x') if span_context.is_valid else '0'
record.span_id = format(span_context.span_id, '016x') if span_context.is_valid else '0'
return True
# Configure logging with trace context
handler = logging.StreamHandler()
handler.addFilter(TraceContextFilter())
handler.setFormatter(logging.Formatter(
'%(asctime)s [trace_id=%(trace_id)s span_id=%(span_id)s] %(levelname)s %(message)s'
))
logging.basicConfig(handlers=[handler], level=logging.INFO)
logger = logging.getLogger(__name__)
# Now logs are correlated with traces
@app.post("/checkout")
def checkout(request):
logger.info("Processing checkout")
# Output: 2025-10-11 10:30:00 [trace_id=5e3e2e3c... span_id=1a2b3c4d...] INFO Processing checkout
Log Aggregation for Execution Flow
With correlated logs, you can reconstruct execution flow from log aggregation systems. Popular options:
ELK Stack (Elasticsearch, Logstash, Kibana):
# Docker Compose for ELK
version: '3'
services:
elasticsearch:
image: docker.elastic.co/elasticsearch/elasticsearch:8.10.0
environment:
- discovery.type=single-node
ports:
- "9200:9200"
kibana:
image: docker.elastic.co/kibana/kibana:8.10.0
ports:
- "5601:5601"
depends_on:
- elasticsearch
Ship logs to Elasticsearch:
from pythonjsonlogger import jsonlogger
# Use JSON logging for easy parsing
handler = logging.StreamHandler()
formatter = jsonlogger.JsonFormatter(
'%(asctime)s %(name)s %(levelname)s %(trace_id)s %(span_id)s %(message)s'
)
handler.setFormatter(formatter)
logger.addHandler(handler)
# Logs are now JSON
logger.info("Processing checkout", extra={"user_id": 12345, "items": 3})
# Output: {"asctime": "2025-10-11T10:30:00", "trace_id": "5e3e2e3c...", "user_id": 12345, ...}
Query logs by trace ID in Kibana:
GET /logs/_search
{
"query": {
"match": {
"trace_id": "5e3e2e3c6e4d5f6a7b8c9d0e1f2a3b4c"
}
},
"sort": [{ "timestamp": "asc" }]
}
This returns all log messages from that trace, across all services, in chronological order. You can reconstruct the complete execution flow:
[api-gateway] Processing checkout user_id=12345
[api-gateway] Calling inventory service
[inventory] Reserving items for user_id=12345
[inventory] Query: SELECT * FROM inventory WHERE id IN (...)
[inventory] Reserved 3 items total=$124.99
[api-gateway] Inventory reserved, calling payment
[payment] Processing charge user_id=12345 amount=$124.99
[payment] Stripe API call initiated
[payment] Charge successful charge_id=ch_xyz
[api-gateway] Checkout complete order_id=12345
Structured log queries for debugging:
# Find all slow checkouts
GET /logs/_search
{
"query": {
"bool": {
"must": [
{ "match": { "operation": "checkout" }},
{ "range": { "duration": { "gte": 2000 }}}
]
}
}
}
# Find all failed payments
GET /logs/_search
{
"query": {
"bool": {
"must": [
{ "match": { "service": "payment" }},
{ "match": { "levelname": "ERROR" }}
]
}
}
}
This is the key insight: Structured logging + trace correlation gives you queryable execution history. You can ask questions like "show me all requests from user X that failed in the payment service" and get precise answers.
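You can ask the same questions programmatically. A sketch using the elasticsearch Python client, assuming the logs live in an index named "logs" with the field names shown above (service, levelname, user_id); the index name and host are assumptions:
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# All ERROR logs from the payment service for one user
resp = es.search(
    index="logs",
    query={
        "bool": {
            "must": [
                {"match": {"user_id": 12345}},
                {"match": {"service": "payment"}},
                {"match": {"levelname": "ERROR"}},
            ]
        }
    },
)
for hit in resp["hits"]["hits"]:
    print(hit["_source"]["message"])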
Production logging best practices:
- Always log in JSON: makes parsing and aggregation trivial
- Include trace context: put trace_id and span_id in every log line
- Use log levels correctly: DEBUG for verbose detail, INFO for key events, ERROR for failures
- Add structured fields: don't log f"User {user_id} checked out"; log logger.info("Checkout", extra={"user_id": user_id}) instead
- Don't log sensitive data: PII, passwords, and tokens should never appear in logs
- Sample high-volume logs: don't log every cache hit in production (a sampling filter sketch follows this list)
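For the last point, sampling can live in the logging pipeline itself. A minimal sketch of a logging.Filter that keeps every warning and error but only a fraction of lower-level records from a noisy logger; the 1% rate and the "cache" logger name are illustrative assumptions:
import logging
import random

class SampleFilter(logging.Filter):
    def __init__(self, rate: float):
        super().__init__()
        self.rate = rate

    def filter(self, record):
        if record.levelno >= logging.WARNING:
            return True  # never drop warnings or errors
        return random.random() < self.rate  # keep a random sample of the rest

cache_logger = logging.getLogger("cache")
cache_logger.addFilter(SampleFilter(0.01))  # keep roughly 1% of DEBUG/INFO cache logs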