
7.11 Distributed System Tracing

You've mastered tracing within a single application. Your Django app is fully observable with Debug Toolbar. Your React components are profiled with DevTools. But now you're facing a different challenge: your system isn't a monolith anymore. It's five microservices communicating over HTTP, publishing to message queues, writing to shared databases, and caching aggressively. When something breaks, logs from individual services don't tell the complete story.

A user reports: "My checkout failed." You look at logs:

[api-gateway]     POST /api/checkout 200 OK

[inventory]       Reserved items for order_12345

[payment]         Processing payment for $124.99

[payment]         Payment successful: charge_xyz

[fulfillment]     No logs found

[email]           No logs found

The payment succeeded but fulfillment never happened. Which service failed? Was the message lost? Did fulfillment crash? Without correlation between these logs, you're guessing.

This is where distributed tracing becomes essential. Notice what's different from single-application tracing: you need to follow a single user request across multiple processes, potentially across multiple machines, with no shared memory or debugging session.

7.11.1 OpenTelemetry Basics

OpenTelemetry (OTel) is the industry-standard solution for distributed tracing. It merged and replaced the earlier OpenTracing and OpenCensus projects, and it supersedes Jaeger's and Zipkin's native instrumentation libraries as well as proprietary APM agents. The key insight: instrument once with OTel, then export to any backend (Jaeger, Zipkin, or a commercial APM).

Before we dive into configuration, let me show you what problem this solves with a concrete example.

The Problem Without Distributed Tracing

You have three services handling checkout:

# Service A: API Gateway

@app.post("/checkout")

def checkout(request):

    user_id = request.user_id

    items = request.json["items"]



    # Call inventory service

    inventory_response = requests.post(

        "http://inventory:8001/reserve",

        json={"user_id": user_id, "items": items}

    )



    # Call payment service

    payment_response = requests.post(

        "http://payment:8002/charge",

        json={"user_id": user_id, "amount": inventory_response.json()["total"]}

    )



    return {"status": "success", "charge_id": payment_response.json()["charge_id"]}



# Service B: Inventory Service

@app.post("/reserve")

def reserve_items(request):

    user_id = request.json["user_id"]

    items = request.json["items"]



    # Check inventory

    for item in items:

        stock = db.query("SELECT quantity FROM inventory WHERE id = ?", item["id"])

        if stock < item["quantity"]:

            raise ValueError(f"Insufficient stock for {item['id']}")



    # Reserve items

    total = calculate_total(items)

    return {"status": "reserved", "total": total}



# Service C: Payment Service

@app.post("/charge")

def charge_payment(request):

    user_id = request.json["user_id"]

    amount = request.json["amount"]



    # Process payment

    charge = stripe.Charge.create(amount=amount, customer=user_id)

    return {"status": "success", "charge_id": charge.id}

When checkout is slow, each service logs independently:

[api-gateway] Checkout request took 2.3s

[inventory] Reserve took 0.1s

[payment] Charge took 0.05s

Where did the extra 2.15 seconds go? Network latency? Queueing delays? You can't tell from these logs. They're not correlated.

The Solution With OpenTelemetry

With OTel, every request gets a unique trace ID that flows across all services:

[api-gateway] trace_id=abc123 span=checkout duration=2.3s

[inventory]   trace_id=abc123 span=reserve parent=checkout duration=0.1s

[payment]     trace_id=abc123 span=charge parent=checkout duration=0.05s

Now you can see the parent-child relationships and identify that 2.15 seconds is unaccounted for—likely network delays or queueing between services.

Installing Auto-Instrumentation

The fastest way to start with OTel is auto-instrumentation. This requires zero code changes to your application:

# Install OpenTelemetry packages

pip install opentelemetry-distro opentelemetry-exporter-otlp



# Auto-discover and install instrumentation for your frameworks

opentelemetry-bootstrap -a install



# Run your application with instrumentation

opentelemetry-instrument \

    --traces_exporter console \

    --metrics_exporter console \

    python app.py

That's it. Your application now emits traces. Let's see what this does to our checkout code:

# Your original code - unchanged

@app.post("/checkout")

def checkout(request):

    user_id = request.user_id

    items = request.json["items"]



    # This HTTP request is automatically traced

    inventory_response = requests.post(

        "http://inventory:8001/reserve",

        json={"user_id": user_id, "items": items}

    )



    # So is this one

    payment_response = requests.post(

        "http://payment:8002/charge",

        json={"user_id": user_id, "amount": inventory_response.json()["total"]}

    )



    return {"status": "success", "charge_id": payment_response.json()["charge_id"]}

Auto-instrumentation intercepts calls in libraries it recognizes: incoming requests in web frameworks (Flask, Django, FastAPI), outgoing HTTP calls made with requests, and database client libraries such as psycopg2. Each intercepted operation becomes a span, with no changes to your code.

When you run this, console output shows traces:

{
  "name": "POST /checkout",

  "context": {
    "trace_id": "0x5e3e2e3c6e4d5f6a7b8c9d0e1f2a3b4c",

    "span_id": "0x1a2b3c4d5e6f7a8b",

    "trace_state": "[]"
  },

  "kind": "SpanKind.SERVER",

  "parent_id": null,

  "start_time": "2025-10-11T10:30:00.123456Z",

  "end_time": "2025-10-11T10:30:02.456789Z",

  "status": {
    "status_code": "OK"
  },

  "attributes": {
    "http.method": "POST",

    "http.url": "/checkout",

    "http.status_code": 200
  },

  "events": [],

  "links": [],

  "resource": {
    "service.name": "api-gateway"
  }
}

This is crucial: Notice the trace_id and span_id. These are automatically propagated to downstream services via HTTP headers. When the inventory service receives the request, it continues the same trace.

Tracing Across Service Boundaries

Here's how trace context propagates between services. Auto-instrumentation handles this automatically, but understanding the mechanism helps with debugging.

When API Gateway calls Inventory Service:

# API Gateway (auto-instrumented)

response = requests.post(

    "http://inventory:8001/reserve",

    json={"user_id": user_id, "items": items}

)

Auto-instrumentation adds headers to the request:

POST /reserve HTTP/1.1

Host: inventory:8001

traceparent: 00-5e3e2e3c6e4d5f6a7b8c9d0e1f2a3b4c-1a2b3c4d5e6f7a8b-01

tracestate:



{

  "user_id": 12345,

  "items": [...]

}

The traceparent header follows the W3C Trace Context format, version-trace_id-parent_span_id-trace_flags: a version (00), the 32-hex-character trace ID shared by the whole request, the 16-hex-character span ID of the calling span, and trace flags (01 means the trace is sampled).
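If you ever need to inspect a propagated header by hand (in a proxy, a test, or a log line), it splits cleanly on dashes. A quick sketch using the example value above:

# Split a W3C traceparent header into its four fields
traceparent = "00-5e3e2e3c6e4d5f6a7b8c9d0e1f2a3b4c-1a2b3c4d5e6f7a8b-01"
version, trace_id, parent_span_id, trace_flags = traceparent.split("-")

print(trace_id)        # 5e3e2e3c6e4d5f6a7b8c9d0e1f2a3b4c
print(parent_span_id)  # 1a2b3c4d5e6f7a8b
print(trace_flags)     # 01 -> sampled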

The Inventory Service (also auto-instrumented) reads this header and creates a child span:

# Inventory Service (auto-instrumented)

@app.post("/reserve")

def reserve_items(request):

    # OTel automatically extracts trace context from headers

    # and creates a child span with the same trace_id

    user_id = request.json["user_id"]

    items = request.json["items"]

    # ... your code unchanged

The resulting span:

{
  "name": "POST /reserve",

  "context": {
    "trace_id": "0x5e3e2e3c6e4d5f6a7b8c9d0e1f2a3b4c", // Same trace_id!

    "span_id": "0x9f8e7d6c5b4a3929",

    "trace_state": "[]"
  },

  "parent_id": "0x1a2b3c4d5e6f7a8b", // Points to checkout span

  "kind": "SpanKind.SERVER"

  // ...
}

Notice the key relationships: the trace_id is identical in both spans, tying them to the same user request; the inventory span gets its own span_id; and its parent_id equals the span_id of the API gateway's checkout span, which is how the tracing backend reconstructs the call tree.
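Auto-instrumentation does all of this for you over HTTP, but when a hop isn't covered (a hand-rolled queue consumer, a custom RPC layer), you can move the context yourself with OTel's propagation API. A minimal sketch; send_message, received_headers, handle, and payload are hypothetical stand-ins for your transport:

from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer(__name__)

# Producer side: copy the active trace context into a headers dict
carrier = {}
inject(carrier)  # carrier now holds {"traceparent": "00-...-...-01", ...}
send_message(body=payload, headers=carrier)  # hypothetical transport call

# Consumer side: rebuild the context from the received headers
ctx = extract(received_headers)
with tracer.start_as_current_span("process_message", context=ctx):
    handle(payload)  # runs as a child span in the same trace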

Manual Instrumentation for Custom Operations

Auto-instrumentation covers common libraries, but sometimes you need custom spans. Here's how to add them:

from opentelemetry import trace



tracer = trace.get_tracer(__name__)



@app.post("/checkout")

def checkout(request):

    user_id = request.user_id

    items = request.json["items"]



    # Auto-instrumented HTTP call

    inventory_response = requests.post(

        "http://inventory:8001/reserve",

        json={"user_id": user_id, "items": items}

    )



    # Custom span for business logic

    with tracer.start_as_current_span("calculate_discount") as span:

        span.set_attribute("user_id", user_id)

        span.set_attribute("item_count", len(items))



        discount = calculate_user_discount(user_id, items)

        span.set_attribute("discount_amount", discount)



    # Another custom span

    with tracer.start_as_current_span("apply_loyalty_points") as span:

        points = apply_loyalty_points(user_id, inventory_response.json()["total"])

        span.set_attribute("points_earned", points)



    payment_response = requests.post(

        "http://payment:8002/charge",

        json={"user_id": user_id, "amount": inventory_response.json()["total"] - discount}

    )



    return {"status": "success", "charge_id": payment_response.json()["charge_id"]}

Now your trace shows custom business logic spans alongside HTTP calls:

checkout (2.3s)

├── POST http://inventory:8001/reserve (0.1s)

│   └── SELECT FROM inventory (0.05s)

├── calculate_discount (0.8s)

│   └── SELECT FROM user_discounts (0.7s)

├── apply_loyalty_points (0.02s)

└── POST http://payment:8002/charge (0.05s)

You've now identified the slow operation: calculate_discount takes 0.8 seconds, mostly in database queries.
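Custom spans are also where failures should be recorded. When an exception escapes the with block, start_as_current_span records it and marks the span as errored by default; if you catch and handle the error yourself, do it explicitly so the trace still shows the failure. A minimal sketch reusing the calculate_discount span above; DiscountServiceError is a hypothetical exception type:

from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("calculate_discount") as span:
    try:
        discount = calculate_user_discount(user_id, items)
        span.set_attribute("discount_amount", discount)
    except DiscountServiceError as exc:  # hypothetical exception type
        span.record_exception(exc)  # attaches the exception as a span event
        span.set_status(Status(StatusCode.ERROR, str(exc)))  # flags the span as failed
        discount = 0  # fall back to no discount instead of failing checkout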

Jaeger and Zipkin Visualization

Console output is useful for development, but production needs a tracing backend. The two most popular open-source options are Jaeger and Zipkin.

Running Jaeger locally:

# Start Jaeger all-in-one (includes UI, collector, and storage)

docker run -d --name jaeger \

  -p 16686:16686 \

  -p 4317:4317 \

  -p 4318:4318 \

  jaegertracing/all-in-one:latest

Configure your app to export to Jaeger:

# app.py

from opentelemetry import trace

from opentelemetry.sdk.trace import TracerProvider

from opentelemetry.sdk.trace.export import BatchSpanProcessor

from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter



# Set up tracing

trace.set_tracer_provider(TracerProvider())

otlp_exporter = OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True)

trace.get_tracer_provider().add_span_processor(BatchSpanProcessor(otlp_exporter))



# Your application code

# ...

Or using environment variables (simpler):

export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317

export OTEL_SERVICE_NAME=api-gateway



opentelemetry-instrument python app.py

Now visit http://localhost:16686 to see the Jaeger UI. Here's what you can do:

1. Search for traces: filter by service, operation name, tags (span attributes), and minimum/maximum duration to find the requests you care about.

2. Visualize trace timelines:

Trace ID: 5e3e2e3c...

Total Duration: 2.3s



api-gateway: POST /checkout (2.3s)

├─ [100ms network latency]

├─ inventory: POST /reserve (0.1s)

│  └─ db: SELECT FROM inventory (0.05s)

├─ [1.5s network latency] ← Problem identified!

├─ payment: POST /charge (0.05s)

│  └─ stripe: API call (0.04s)

└─ [50ms network latency]

The visualization immediately shows you that 1.5 seconds is spent between inventory and payment calls—likely a network issue or service queueing.

3. Analyze service dependencies:

Jaeger generates service dependency graphs showing which services call which:

api-gateway → inventory

api-gateway → payment

payment → stripe-api

inventory → postgres

This helps you understand architecture and identify critical paths.

Zipkin is similar but has a different UI:

# Run Zipkin

docker run -d -p 9411:9411 openzipkin/zipkin



# Export to Zipkin (install the exporter first: pip install opentelemetry-exporter-zipkin)

export OTEL_TRACES_EXPORTER=zipkin

export OTEL_EXPORTER_ZIPKIN_ENDPOINT=http://localhost:9411/api/v2/spans

Choose Jaeger or Zipkin based on preference—they're functionally equivalent for most use cases. Jaeger has better query capabilities; Zipkin has a simpler UI.

Common pitfalls with OpenTelemetry:

  1. Forgetting to propagate context: If you manually create threads or async tasks, you must propagate trace context yourself. Use contextvars or OTel's context API (see the sketch at the end of this list).

  2. Over-instrumenting: Don't create spans for every function call. Focus on service boundaries, database calls, and business-critical operations.

  3. Not sampling in production: Tracing every request in high-traffic production adds overhead. Use sampling:

from opentelemetry import trace

from opentelemetry.sdk.trace import TracerProvider

from opentelemetry.sdk.trace.sampling import TraceIdRatioBased



# Sample 1% of traces, decided deterministically from the trace ID

sampler = TraceIdRatioBased(0.01)

trace.set_tracer_provider(TracerProvider(sampler=sampler))

  4. Ignoring attributes: Add meaningful attributes to spans; they're searchable in Jaeger/Zipkin:
with tracer.start_as_current_span("process_order") as span:

    span.set_attribute("order_id", order_id)

    span.set_attribute("user_tier", user.tier)  # Searchable!

    span.set_attribute("item_count", len(items))

7.11.2 Structured Logging

OpenTelemetry provides tracing—the execution flow across services. But you also need logging—the detailed events within each span. The key is connecting traces and logs so you can jump from "this trace was slow" to "here are the log messages from that trace."

Correlation IDs and Request Tracing

The fundamental pattern: every log message should include the trace ID and span ID. This lets you filter logs by trace.

Manual correlation ID:

import logging

import uuid

from contextvars import ContextVar



# Store request_id in context-local storage

request_id_var = ContextVar('request_id', default=None)



logger = logging.getLogger(__name__)



@app.middleware("http")

async def add_request_id(request, call_next):

    # Generate or extract request ID

    request_id = request.headers.get('X-Request-ID', str(uuid.uuid4()))

    request_id_var.set(request_id)



    response = await call_next(request)

    response.headers['X-Request-ID'] = request_id

    return response



# Custom log filter that adds request_id to every log

class RequestIDFilter(logging.Filter):

    def filter(self, record):

        record.request_id = request_id_var.get()

        return True



# Configure logging

handler = logging.StreamHandler()

handler.addFilter(RequestIDFilter())

handler.setFormatter(logging.Formatter(

    '%(asctime)s [%(request_id)s] %(levelname)s %(message)s'

))

logger.addHandler(handler)



# Now all logs include request_id

@app.post("/checkout")

def checkout(request):

    logger.info("Processing checkout")  # Includes request_id automatically

    # ...
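One gap in this middleware: the request ID only correlates logs across services if you also forward it on outbound calls. A small sketch of a hypothetical helper that reuses the request_id_var defined above:

import requests

def post_with_request_id(url, **kwargs):
    # Hypothetical helper: forward the current request ID to downstream services
    headers = kwargs.pop("headers", {})
    request_id = request_id_var.get()
    if request_id:
        headers["X-Request-ID"] = request_id
    return requests.post(url, headers=headers, **kwargs)

# Downstream services read X-Request-ID in their own middleware,
# so all their logs carry the same correlation ID.
inventory_response = post_with_request_id(
    "http://inventory:8001/reserve",
    json={"user_id": user_id, "items": items},
)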

Integration with OpenTelemetry:

OpenTelemetry provides trace context automatically. Connect it to logs:

import logging

from opentelemetry import trace



class TraceContextFilter(logging.Filter):

    """Add trace context to log records."""

    def filter(self, record):

        span = trace.get_current_span()

        span_context = span.get_span_context()



        record.trace_id = format(span_context.trace_id, '032x') if span_context.is_valid else '0'

        record.span_id = format(span_context.span_id, '016x') if span_context.is_valid else '0'

        return True



# Configure logging with trace context

handler = logging.StreamHandler()

handler.addFilter(TraceContextFilter())

handler.setFormatter(logging.Formatter(

    '%(asctime)s [trace_id=%(trace_id)s span_id=%(span_id)s] %(levelname)s %(message)s'

))



logging.basicConfig(handlers=[handler], level=logging.INFO)

logger = logging.getLogger(__name__)



# Now logs are correlated with traces

@app.post("/checkout")

def checkout(request):

    logger.info("Processing checkout")

    # Output: 2025-10-11 10:30:00 [trace_id=5e3e2e3c... span_id=1a2b3c4d...] INFO Processing checkout

Log Aggregation for Execution Flow

With correlated logs, you can reconstruct execution flow from log aggregation systems. Popular options:

ELK Stack (Elasticsearch, Logstash, Kibana):

# Docker Compose for ELK

version: '3'

services:

  elasticsearch:

    image: docker.elastic.co/elasticsearch/elasticsearch:8.10.0

    environment:

      - discovery.type=single-node

    ports:

      - "9200:9200"



  kibana:

    image: docker.elastic.co/kibana/kibana:8.10.0

    ports:

      - "5601:5601"

    depends_on:

      - elasticsearch

Ship logs to Elasticsearch:

from pythonjsonlogger import jsonlogger



# Use JSON logging for easy parsing

handler = logging.StreamHandler()

formatter = jsonlogger.JsonFormatter(

    '%(asctime)s %(name)s %(levelname)s %(trace_id)s %(span_id)s %(message)s'

)

handler.setFormatter(formatter)

logger.addHandler(handler)



# Logs are now JSON

logger.info("Processing checkout", extra={"user_id": 12345, "items": 3})

# Output: {"asctime": "2025-10-11T10:30:00", "trace_id": "5e3e2e3c...", "user_id": 12345, ...}

Query logs by trace ID in Kibana:

GET /logs/_search

{

  "query": {

    "match": {

      "trace_id": "5e3e2e3c6e4d5f6a7b8c9d0e1f2a3b4c"

    }

  },

  "sort": [{ "timestamp": "asc" }]

}

This returns all log messages from that trace, across all services, in chronological order. You can reconstruct the complete execution flow:

[api-gateway]  Processing checkout user_id=12345

[api-gateway]  Calling inventory service

[inventory]    Reserving items for user_id=12345

[inventory]    Query: SELECT * FROM inventory WHERE id IN (...)

[inventory]    Reserved 3 items total=$124.99

[api-gateway]  Inventory reserved, calling payment

[payment]      Processing charge user_id=12345 amount=$124.99

[payment]      Stripe API call initiated

[payment]      Charge successful charge_id=ch_xyz

[api-gateway]  Checkout complete order_id=12345

Structured log queries for debugging:

# Find all slow checkouts

GET /logs/_search

{

  "query": {

    "bool": {

      "must": [

        { "match": { "operation": "checkout" }},

        { "range": { "duration": { "gte": 2000 }}}

      ]

    }

  }

}



# Find all failed payments

GET /logs/_search

{

  "query": {

    "bool": {

      "must": [

        { "match": { "service": "payment" }},

        { "match": { "levelname": "ERROR" }}

      ]

    }

  }

}

This is the key insight: Structured logging + trace correlation gives you queryable execution history. You can ask questions like "show me all requests from user X that failed in the payment service" and get precise answers.
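For instance, that exact question translates into a query. A sketch using the elasticsearch-py client; the field names (user_id, service, levelname, asctime) follow the JSON logs above and may need adjusting to your mapping:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# "Show me all requests from user 12345 that failed in the payment service"
response = es.search(
    index="logs",
    query={
        "bool": {
            "must": [
                {"match": {"user_id": 12345}},
                {"match": {"service": "payment"}},
                {"match": {"levelname": "ERROR"}},
            ]
        }
    },
    sort=[{"asctime": "asc"}],
)

for hit in response["hits"]["hits"]:
    print(hit["_source"]["trace_id"], hit["_source"]["message"])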

Production logging best practices:

  1. Always log in JSON: Makes parsing and aggregation trivial

  2. Include trace context: trace_id and span_id in every log

  3. Use log levels correctly: DEBUG for verbose detail, INFO for key events, ERROR for failures

  4. Add structured fields: Don't log f"User {user_id} checked out", log logger.info("Checkout", extra={"user_id": user_id})

  5. Don't log sensitive data: PII, passwords, tokens should never appear in logs

  6. Sample high-volume logs: Don't log every cache hit in production