7.12 Production-Safe Tracing

You've learned to trace code in development with debuggers, profile it with py-spy, and understand distributed flows with OpenTelemetry. But production is different. You can't attach a debugger to a running production service—it pauses execution and ruins user experience. You can't enable verbose logging for every request—it overwhelms storage and degrades performance. You need production-safe tracing techniques.

7.12.1 Feature Flags for Instrumentation

The core pattern: instrumentation that can be enabled selectively for specific users, requests, or conditions without code deployment.

Enabling Tracing for Specific Users/Requests

Imagine you have a bug that only affects Premium tier users. You want verbose tracing for Premium users without impacting Basic users. Here's how:

```python
from feature_flags import is_enabled  # Using a library like LaunchDarkly or Unleash
import logging

logger = logging.getLogger(__name__)


@app.post("/checkout")
def checkout(request):
    user_id = request.user.id
    user_tier = request.user.tier

    # Enable detailed tracing based on feature flag
    verbose_tracing = is_enabled(
        'detailed-checkout-tracing',
        context={'user_id': user_id, 'tier': user_tier}
    )

    if verbose_tracing:
        # Note: setLevel changes this logger for the whole process, not just this
        # request; under concurrency, prefer per-request guards like the checks below.
        logger.setLevel(logging.DEBUG)
        logger.debug(f"Checkout initiated: user={user_id}, tier={user_tier}, items={request.json['items']}")

    # Your checkout logic
    inventory_response = reserve_inventory(request.json['items'])

    if verbose_tracing:
        logger.debug(f"Inventory reserved: {inventory_response}")

    payment_response = process_payment(user_id, inventory_response['total'])

    if verbose_tracing:
        logger.debug(f"Payment processed: {payment_response}")

    return {"status": "success"}
```

Now you can enable tracing from your feature flag dashboard:

```
Feature: detailed-checkout-tracing
Targeting: tier = "premium"
Status: Enabled for 100% of premium users
```

No code deployment needed. Premium users get verbose logs; Basic users don't see any performance impact.

More sophisticated targeting:

```python
from datetime import datetime

# Enable tracing for a specific user
is_enabled('tracing', context={'user_id': 12345})  # True only for user 12345

# Enable tracing for a percentage of requests
is_enabled('tracing', context={'rollout_percentage': 5})  # 5% of requests

# Enable tracing for specific conditions
is_enabled('tracing', context={
    'user_tier': 'premium',
    'request_path': '/checkout',
    'time_hour': datetime.now().hour  # Only during business hours
})
```
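If you don't run a flag service yet, the `is_enabled(name, context)` shape above is easy to approximate in-process. A minimal sketch, assuming flags live in a local dict with illustrative targeting rules (this is not LaunchDarkly's or Unleash's API):

```python
import random

# Hypothetical in-process flag store; real systems load rules from a dashboard or API.
FLAGS = {
    'detailed-checkout-tracing': {'tier': 'premium'},   # enable when context matches
    'tracing': {'rollout_percentage': 5},               # enable for ~5% of calls
}


def is_enabled(flag_name, context=None):
    rule = FLAGS.get(flag_name)
    if rule is None:
        return False
    context = context or {}

    # Percentage rollouts: enable for a random fraction of calls
    pct = rule.get('rollout_percentage')
    if pct is not None:
        return random.random() * 100 < pct

    # Targeting rules: every rule key must match the request context
    return all(context.get(key) == value for key, value in rule.items())
```

Real flag services add per-user stickiness, rule combinations, and audit logs on top of this shape.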

Performance Impact Mitigation

Feature flags add some overhead. Here's how to minimize it:

Pattern 1: Check once per request

@app.middleware("http")

async def tracing_middleware(request, call_next):

    # Check feature flag once at request start

    request.state.verbose_tracing = is_enabled(

        'detailed-tracing',

        context={'user_id': request.user.id if request.user else None}

    )

    response = await call_next(request)

    return response



@app.post("/checkout")

def checkout(request):

    if request.state.verbose_tracing:

        logger.debug("Checkout initiated")

    # ... rest of code

Pattern 2: Lazy evaluation

```python
import time
from contextlib import contextmanager


@contextmanager
def traced_operation(request, operation_name):
    """Only compute tracing data if tracing is enabled."""
    if not request.state.verbose_tracing:
        yield  # No-op if tracing disabled
        return

    start = time.time()
    logger.debug(f"Starting {operation_name}")
    try:
        yield
    finally:
        duration = time.time() - start
        logger.debug(f"{operation_name} completed in {duration:.3f}s")


@app.post("/checkout")
def checkout(request):
    with traced_operation(request, "inventory_check"):
        # This code always runs
        inventory = check_inventory(request.json['items'])
    # But tracing only happens if enabled
```

Pattern 3: Sampling with feature flags

```python
import random

import feature_flags  # Same hypothetical flag client as above
from opentelemetry import trace

tracer = trace.get_tracer(__name__)


def should_trace(request):
    """Probabilistic tracing based on feature flag."""
    sampling_rate = feature_flags.get_float('tracing-sample-rate', default=0.0)

    # Always trace if the user specifically requested it
    if request.headers.get('X-Enable-Tracing') == 'true':
        return True

    # Otherwise sample based on rate
    return random.random() < sampling_rate


@app.post("/checkout")
def checkout(request):
    if should_trace(request):
        with tracer.start_as_current_span("checkout"):
            # ... checkout logic
            pass
```

Security Considerations

Feature-flagged tracing introduces security risks of its own:

Risk 1: Logging sensitive data

```python
# DANGER: Might log credit card numbers if tracing is enabled
if verbose_tracing:
    logger.debug(f"Payment info: {request.json}")  # Contains card number!


# SAFE: Redact sensitive fields (top-level keys only; recurse for nested payloads)
def safe_log(data, sensitive_keys=('card_number', 'cvv', 'password')):
    return {k: '***REDACTED***' if k in sensitive_keys else v for k, v in data.items()}


if verbose_tracing:
    logger.debug(f"Payment info: {safe_log(request.json)}")
```

Risk 2: Unauthorized trace access

```python
# DANGER: Anyone can enable tracing via header
if request.headers.get('X-Enable-Tracing'):
    enable_tracing()


# SAFE: Verify authorization
def should_enable_tracing(request):
    trace_header = request.headers.get('X-Enable-Tracing')
    if not trace_header:
        return False

    # Verify signed token or API key
    if not verify_trace_token(trace_header, request.user):
        logger.warning(f"Unauthorized tracing attempt from user {request.user.id}")
        return False

    return True
```
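The `verify_trace_token` call above is left undefined; one minimal sketch, assuming the header carries an HMAC of the user ID signed with a server-side secret (the secret handling and token format are assumptions):

```python
import hashlib
import hmac
import os

# Assumed server-side secret; in practice, load it from a secret manager.
TRACE_TOKEN_SECRET = os.environ.get("TRACE_TOKEN_SECRET", "").encode()


def verify_trace_token(token, user):
    """Accept the token only if it is a valid HMAC of this user's ID."""
    if not TRACE_TOKEN_SECRET or user is None:
        return False
    expected = hmac.new(
        TRACE_TOKEN_SECRET, str(user.id).encode(), hashlib.sha256
    ).hexdigest()
    # compare_digest avoids leaking information through comparison timing
    return hmac.compare_digest(expected, token)
```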

Risk 3: DoS via verbose logging

```python
# DANGER: Attacker enables tracing for a high-traffic endpoint
if is_enabled('tracing', context={'endpoint': request.path}):
    # Logs every request detail
    # If the endpoint gets 1000 req/sec, this writes gigabytes of logs
    ...


# SAFE: Rate limit traced requests
# (rate_limit is a hypothetical decorator, not the PyPI `ratelimit` package;
# it caps how many requests get traced per time window)
@rate_limit(max_traced_requests=100, window_seconds=60)
def trace_if_enabled(request):
    return is_enabled('tracing', context={'endpoint': request.path})
```

Tracing authorization checklist:

- Redact sensitive fields (card numbers, CVVs, passwords) before anything is written to logs or traces
- Require a signed token or API key for per-request tracing, and log rejected attempts
- Rate-limit how many requests can be traced per time window
- Restrict who can flip the tracing feature flags in the first place

7.12.2 Sampling and Profiling in Production

Full tracing of every request isn't feasible in production. At 10,000 requests per second, you'd generate terabytes of trace data daily. The solution: sampling and on-demand profiling.
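For a rough sense of scale (assuming ~2 KB per trace, which varies widely in practice): 10,000 req/s Ă— 86,400 s/day Ă— 2 KB ≈ 1.7 TB of raw trace data per day, before any indexing or replication overhead.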

py-spy Attach Mode for Live Systems

py-spy is a sampling profiler that can attach to running Python processes without modifying code or restarting. This is the safest way to profile production.

```bash
# Find your process ID
ps aux | grep python
# Output: user  12345  ...  python app.py

# Attach py-spy and profile for 30 seconds
sudo py-spy record -p 12345 -o profile.svg --duration 30

# View the flamegraph
open profile.svg
```

The flamegraph shows what your application is doing:

```
[====app.checkout====][==db.query==][==serialize==]
  25% of time        30% of time    15% of time
```

You immediately see that 30% of time is spent in database queries—a potential optimization target.

When to use py-spy in production:

- One-off investigations: a sudden CPU spike, or a worker that appears hung
- Confirming a suspected hotspot before and after a fix
- Situations where you can't redeploy with extra instrumentation and need answers now

py-spy attach best practices:

```bash
# Sample at a lower rate for less overhead (default is 100 samples/sec)
sudo py-spy record -p 12345 -o profile.svg --rate 10 --duration 60

# Include subprocesses (e.g. worker processes) in the profile
sudo py-spy record -p 12345 -o profile.svg --subprocesses

# Output top functions instead of a flamegraph (for quick checks)
sudo py-spy top -p 12345
```

APM Tools Overview (New Relic, DataDog, Sentry)

When py-spy isn't enough—when you need continuous monitoring, alerting, or distributed tracing across many services—you need Application Performance Monitoring (APM) tools.

New Relic:

```bash
# Install
pip install newrelic

# Configure
newrelic-admin generate-config YOUR_LICENSE_KEY newrelic.ini

# Run your app with the New Relic agent
NEW_RELIC_CONFIG_FILE=newrelic.ini newrelic-admin run-program python app.py
```

New Relic provides:

- Transaction traces with per-segment timing (web requests, database calls, external services)
- Error analytics with alerting
- Database and slow-query analysis
- Dashboards and service maps for distributed systems

DataDog:

```bash
# Install
pip install ddtrace

# Run with DataDog tracing
ddtrace-run python app.py
```

DataDog provides similar capabilities to New Relic, plus:

- Infrastructure, host, and container monitoring alongside application traces
- Log management correlated with traces and metrics
- Synthetic monitoring and a large catalog of integrations

Sentry:

```python
import sentry_sdk

sentry_sdk.init(
    dsn="https://your-key@sentry.io/project-id",
    traces_sample_rate=0.1,  # Sample 10% of transactions
    profiles_sample_rate=0.1,  # Profile 10% of traced transactions
)


# Sentry automatically captures errors and traces
@app.post("/checkout")
def checkout(request):
    # If this raises an exception, Sentry captures it with full context
    process_order(request)
```
Sentry focuses on error tracking but also provides:

- Performance monitoring (transaction traces) and profiling
- Breadcrumbs showing the events leading up to an error
- Release tracking that ties errors and regressions to specific deploys

Key differences:

| Feature | New Relic | DataDog | Sentry |
| --- | --- | --- | --- |
| Focus | APM first | Infrastructure first | Error tracking first |
| Pricing | $99+/month | $15+/host/month | $26+/month |
| Best for | Deep transaction analysis | DevOps teams needing full stack observability | Error-driven debugging |
| Learning curve | Moderate | Steep | Easy |
| Trace retention | 8 days (standard) | 15 days | 90 days |

When to Pay for Commercial Solutions

You're a startup with 3 developers and 100 users. Do you need DataDog at $500/month? Probably not. Here's the decision matrix:

Use free/OSS tools when:

- Traffic is modest (well under ~100 req/sec) and a 5-10% sampling rate answers your questions
- Production incidents are rare and usually quick to debug
- The team has time to run and maintain Jaeger/Zipkin, metrics, and log storage itself

Pay for APM when:

- You have more than about one production incident per month that takes over an hour to debug
- You run many services and need one integrated view of traces, logs, and metrics
- You need alerting, longer trace retention, and on-call workflows out of the box
- Engineering time spent stitching tools together costs more than the subscription

The cost calculation:

```
Scenario: Production issue investigation

With OSS tools:
- 2 hours digging through separate logs/traces/metrics
- $200 developer time (2 hours Ă— $100/hr)
- $1000 business impact (downtime/lost sales)
Total: $1200

With APM tool:
- 15 minutes to identify issue (integrated view)
- $25 developer time (0.25 hours Ă— $100/hr)
- $250 business impact (faster resolution)
- $50 tool cost (monthly subscription / 30 days)
Total: $325

Savings per incident: $875
Break-even: 1 incident per month
```

This is crucial: If you have more than one production incident per month that takes > 1 hour to debug, APM tools pay for themselves. If incidents are rare and debugging is quick, stick with free tools.

Sampling strategies for production:

All APM tools support sampling to reduce costs and overhead. Here's how to configure it:

Head-based sampling (decision at trace start):

```python
# Sample based on trace ID (consistent across services)
def should_sample(trace_id):
    # Sample 10% of traces
    return int(trace_id, 16) % 10 == 0


# OpenTelemetry configuration
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased

sampler = TraceIdRatioBased(0.1)  # 10% sampling
```

Tail-based sampling (decision after trace completes):

```python
import random

# Sample all errors + 1% of successful requests
class SmartSampler:
    def should_sample(self, trace):
        # Always sample errors
        if trace.has_error:
            return True

        # Always sample slow requests
        if trace.duration > 1.0:  # > 1 second
            return True

        # Sample 1% of normal requests
        return random.random() < 0.01
```

Tail-based sampling requires buffering traces until completion, which adds memory overhead but gives better signal-to-noise ratio.
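A minimal sketch of that buffering, assuming completed traces arrive as objects with `trace_id`, `has_error`, and `duration` attributes (matching the `SmartSampler` above) and that `export_spans` stands in for your exporter:

```python
class TailSamplingBuffer:
    """Hold a trace's spans in memory until it completes, then decide once."""

    def __init__(self, sampler):
        self.sampler = sampler
        self.pending = {}  # trace_id -> list of finished spans

    def add_span(self, trace_id, span):
        self.pending.setdefault(trace_id, []).append(span)

    def finish_trace(self, trace):
        spans = self.pending.pop(trace.trace_id, [])
        if self.sampler.should_sample(trace):
            export_spans(spans)  # stand-in for the real exporter call
        # Otherwise the buffered spans are dropped, freeing the memory
```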

Adaptive sampling:

```python
import random

class AdaptiveSampler:
    def __init__(self, target_traces_per_second=100):
        self.target = target_traces_per_second
        self.current_rate = 0  # Must be updated from the observed trace throughput
        self.sample_ratio = 1.0

    def should_sample(self):
        # Adjust sampling to maintain the target rate
        if self.current_rate > self.target:
            self.sample_ratio *= 0.9  # Sample less
        elif self.current_rate < self.target:
            self.sample_ratio = min(1.0, self.sample_ratio * 1.1)  # Sample more

        return random.random() < self.sample_ratio
```

This maintains constant trace volume regardless of traffic changes.
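The class above never updates `current_rate` itself; one way to feed it, sketched here with a simple once-per-second counter (the window bookkeeping is an assumption, not part of the sampler):

```python
import time

sampler = AdaptiveSampler(target_traces_per_second=100)
_window_start = time.time()
_window_count = 0


def maybe_trace(request):
    """Return True if this request should be traced, keeping the sampler's rate fresh."""
    global _window_start, _window_count
    now = time.time()

    # Roll the one-second window and report the observed trace rate to the sampler
    if now - _window_start >= 1.0:
        sampler.current_rate = _window_count
        _window_start, _window_count = now, 0

    if sampler.should_sample():
        _window_count += 1
        return True  # Caller starts a span for this request
    return False
```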

Production profiling safety checklist:

- Keep profiling sessions short (30-60 seconds) and use a reduced sampling rate (e.g. --rate 10)
- Profile a single instance, not the whole fleet at once
- Watch the overhead: alert if tracing or profiling adds more than ~5ms to request latency
- Keep an off switch (a feature flag) so instrumentation can be disabled without a redeploy
- Redact PII before anything leaves the process

Real-world production tracing setup:

Here's a complete configuration for a mid-sized production system:

```python
import os

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBasedTraceIdRatio
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
import sentry_sdk

# Environment-based configuration
ENVIRONMENT = os.getenv('ENVIRONMENT', 'development')
SERVICE_NAME = os.getenv('SERVICE_NAME', 'api-gateway')

# Configure sampling based on environment
if ENVIRONMENT == 'production':
    SAMPLE_RATE = 0.01  # 1% sampling in production
    SENTRY_SAMPLE_RATE = 0.1  # 10% for Sentry (focuses on errors)
elif ENVIRONMENT == 'staging':
    SAMPLE_RATE = 0.1  # 10% in staging
    SENTRY_SAMPLE_RATE = 0.5
else:  # development
    SAMPLE_RATE = 1.0  # 100% in development
    SENTRY_SAMPLE_RATE = 1.0

# OpenTelemetry setup
sampler = ParentBasedTraceIdRatio(SAMPLE_RATE)
provider = TracerProvider(sampler=sampler)
otlp_exporter = OTLPSpanExporter(
    endpoint=os.getenv('OTEL_EXPORTER_ENDPOINT', 'http://localhost:4317'),
    insecure=ENVIRONMENT != 'production'
)
provider.add_span_processor(BatchSpanProcessor(otlp_exporter))
trace.set_tracer_provider(provider)

# Sentry setup (for error tracking + sampling)
if os.getenv('SENTRY_DSN'):
    sentry_sdk.init(
        dsn=os.getenv('SENTRY_DSN'),
        environment=ENVIRONMENT,
        traces_sample_rate=SENTRY_SAMPLE_RATE,
        profiles_sample_rate=SENTRY_SAMPLE_RATE * 0.5,  # Relative to traces_sample_rate (0.05 in production)
        before_send=redact_sensitive_data,  # Custom function to redact PII (sketched below)
    )

# Feature flag for emergency verbose tracing
from feature_flags import is_enabled


def should_enable_verbose_tracing(request):
    """Enable detailed tracing for specific requests."""
    # Check if emergency tracing is enabled
    if is_enabled('emergency-tracing'):
        return True

    # Check if a specific user requested it
    if request.headers.get('X-Trace-This-Request'):
        # Verify authorization
        if verify_trace_authorization(request):
            return True

    return False


# Middleware for conditional tracing
@app.middleware("http")
async def tracing_middleware(request, call_next):
    # Store the tracing preference for this request
    request.state.verbose_tracing = should_enable_verbose_tracing(request)

    # Mark the request so downstream processing can keep it
    # (attributes alone don't override the sampler's head-based decision)
    if request.state.verbose_tracing:
        span = trace.get_current_span()
        span.set_attribute("sampled.forced", True)
        span.set_attribute("sampled.reason", "verbose_tracing_enabled")

    response = await call_next(request)
    return response
```
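The Sentry setup above passes `before_send=redact_sensitive_data` without defining it; a minimal sketch of such a hook, recursively scrubbing an illustrative set of sensitive keys from the outgoing event (Sentry calls the hook with the event dict and a hint):

```python
SENSITIVE_KEYS = {'card_number', 'cvv', 'password', 'email'}


def redact_sensitive_data(event, hint):
    """Sentry before_send hook: scrub sensitive keys before the event leaves the process."""
    def scrub(value):
        if isinstance(value, dict):
            return {
                k: '***REDACTED***' if k in SENSITIVE_KEYS else scrub(v)
                for k, v in value.items()
            }
        if isinstance(value, list):
            return [scrub(item) for item in value]
        return value

    return scrub(event)
```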

This configuration gives you:

- Environment-aware sampling: 1% of traces in production, 10% in staging, 100% in development
- Non-blocking exports via the batch span processor
- Sentry error tracking with a PII-redaction hook and proportional profiling
- An emergency-tracing feature flag that can be flipped without a deploy
- A per-request trace override that only works for authorized callers

Common production tracing mistakes:

  1. Over-instrumenting: Creating spans for every function call
     Fix: Focus on service boundaries and slow operations

  2. No sampling: Tracing 100% of production traffic
     Fix: Start with 1-5% sampling, increase only if needed

  3. Blocking exports: Sending traces synchronously
     Fix: Always use async/batch span processors

  4. No PII redaction: Logging sensitive user data
     Fix: Redact credit cards, passwords, emails in traces

  5. Ignoring overhead: Not monitoring tracing performance impact
     Fix: Alert if tracing adds > 5ms to request latency

  6. No off switch: Can't disable tracing without redeployment
     Fix: Use feature flags for quick disable

  7. Alert fatigue: Creating alerts for every slow request
     Fix: Alert on p95/p99 latency, not individual requests
The production tracing philosophy:

  1. Sample smartly: Trace enough to catch issues, not so much you drown in data

  2. Focus on signals: Errors, slow requests, and unusual patterns matter most

  3. Automate response: Tracing should enable quick resolution, not just observation

  4. Trust but verify: APM tools are helpful but verify with logs and metrics

  5. Cost-conscious: Balance observability value against tool costs

  6. Privacy-first: Redact PII automatically, not manually

  7. Team-accessible: Everyone should be able to query traces, not just ops

Quick decision guide:

```
Question: How should I trace in production?

Is this an active incident?
├─ Yes → Enable verbose tracing via feature flag (for affected users only)
└─ No
   ├─ Do you have < 100 req/sec?
   │  ├─ Yes → 10% sampling with free tools (Jaeger/Zipkin)
   │  └─ No
   │     ├─ Do you have budget ($100+/month)?
   │     │  ├─ Yes → Use APM tool (Sentry/DataDog/New Relic) with 1-5% sampling
   │     │  └─ No → Use OSS with 1% sampling + py-spy for ad-hoc profiling
   │     └─ Is this a one-time investigation?
   │        ├─ Yes → Attach py-spy for 60 seconds
   │        └─ No → Implement gradual sampling increase (1% → 5% → 10%)
```

Remember: The goal of production tracing isn't to capture everything—it's to capture enough to understand and resolve issues quickly. Start minimal, increase only when specific needs justify the overhead.