7.12 Production-Safe Tracing
You've learned to trace code in development with debuggers, profile it with py-spy, and understand distributed flows with OpenTelemetry. But production is different. You can't attach a debugger to a running production service—it pauses execution and ruins user experience. You can't enable verbose logging for every request—it overwhelms storage and degrades performance. You need production-safe tracing techniques.
7.12.1 Feature Flags for Instrumentation
The core pattern: instrumentation that can be enabled selectively for specific users, requests, or conditions without code deployment.
Enabling Tracing for Specific Users/Requests
Imagine you have a bug that only affects Premium tier users. You want verbose tracing for Premium users without impacting Basic users. Here's how:
from feature_flags import is_enabled  # Using a library like LaunchDarkly or Unleash
import logging

logger = logging.getLogger(__name__)

@app.post("/checkout")
def checkout(request):
    user_id = request.user.id
    user_tier = request.user.tier

    # Enable detailed tracing based on feature flag
    verbose_tracing = is_enabled(
        'detailed-checkout-tracing',
        context={'user_id': user_id, 'tier': user_tier}
    )

    if verbose_tracing:
        logger.setLevel(logging.DEBUG)
        logger.debug(f"Checkout initiated: user={user_id}, tier={user_tier}, items={request.json['items']}")

    # Your checkout logic
    inventory_response = reserve_inventory(request.json['items'])
    if verbose_tracing:
        logger.debug(f"Inventory reserved: {inventory_response}")

    payment_response = process_payment(user_id, inventory_response['total'])
    if verbose_tracing:
        logger.debug(f"Payment processed: {payment_response}")

    return {"status": "success"}
Now you can enable tracing from your feature flag dashboard:
Feature: detailed-checkout-tracing
Targeting: tier = "premium"
Status: Enabled for 100% of premium users
No code deployment needed. Premium users get verbose logs; Basic users don't see any performance impact.
More sophisticated targeting:
from datetime import datetime

# Enable tracing for a specific user
is_enabled('tracing', context={'user_id': 12345})  # True only for user 12345

# Enable tracing for a percentage of requests
is_enabled('tracing', context={'rollout_percentage': 5})  # 5% of requests

# Enable tracing for specific conditions
is_enabled('tracing', context={
    'user_tier': 'premium',
    'request_path': '/checkout',
    'time_hour': datetime.now().hour  # Only during business hours
})
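The feature_flags module in these examples stands in for whichever flag client you use. If you don't run a commercial flag service, a minimal in-house helper might look like the sketch below; the FLAG_RULES dict and its rule format are illustrative assumptions, not part of any particular library.

import random

# Illustrative in-memory rule store; a real system would load rules from a
# flag service (LaunchDarkly, Unleash) or a database.
FLAG_RULES = {
    'detailed-checkout-tracing': {'tier': 'premium'},
    'tracing': {'rollout_percentage': 5},
}

def is_enabled(flag_name, context=None):
    rule = FLAG_RULES.get(flag_name)
    if rule is None:
        return False
    context = context or {}

    # Percentage rollouts: enable for roughly N% of calls
    # (not sticky per user -- real flag services hash the user ID instead)
    if 'rollout_percentage' in rule:
        return random.random() * 100 < rule['rollout_percentage']

    # Otherwise every rule key must match the request context
    return all(context.get(key) == value for key, value in rule.items())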
Performance Impact Mitigation
Feature flags add some overhead. Here's how to minimize it:
Pattern 1: Check once per request
@app.middleware("http")
async def tracing_middleware(request, call_next):
    # Check feature flag once at request start
    request.state.verbose_tracing = is_enabled(
        'detailed-tracing',
        context={'user_id': request.user.id if request.user else None}
    )
    response = await call_next(request)
    return response

@app.post("/checkout")
def checkout(request):
    if request.state.verbose_tracing:
        logger.debug("Checkout initiated")
    # ... rest of code
Pattern 2: Lazy evaluation
import time
from contextlib import contextmanager

@contextmanager
def traced_operation(request, operation_name):
    """Only compute tracing data if tracing is enabled."""
    if not request.state.verbose_tracing:
        yield  # No-op if tracing disabled
        return

    start = time.time()
    logger.debug(f"Starting {operation_name}")
    try:
        yield
    finally:
        duration = time.time() - start
        logger.debug(f"{operation_name} completed in {duration:.3f}s")

@app.post("/checkout")
def checkout(request):
    with traced_operation(request, "inventory_check"):
        # This code always runs
        inventory = check_inventory(request.json['items'])
        # But tracing only happens if enabled
Pattern 3: Sampling with feature flags
import random

def should_trace(request):
    """Probabilistic tracing based on feature flag."""
    sampling_rate = feature_flags.get_float('tracing-sample-rate', default=0.0)

    # Always trace if the caller specifically requested it
    if request.headers.get('X-Enable-Tracing') == 'true':
        return True

    # Otherwise sample based on rate
    return random.random() < sampling_rate

@app.post("/checkout")
def checkout(request):
    # tracer comes from opentelemetry.trace.get_tracer(__name__)
    if should_trace(request):
        with tracer.start_as_current_span("checkout"):
            # ... checkout logic (traced)
            pass
    else:
        # ... same checkout logic, just untraced
        pass
Security Considerations
Feature-flagged tracing introduces security risks of its own:
Risk 1: Logging sensitive data
# DANGER: Might log credit card numbers if tracing is enabled
if verbose_tracing:
    logger.debug(f"Payment info: {request.json}")  # Contains card number!

# SAFE: Redact sensitive fields
def safe_log(data, sensitive_keys=('card_number', 'cvv', 'password')):
    return {k: '***REDACTED***' if k in sensitive_keys else v for k, v in data.items()}

if verbose_tracing:
    logger.debug(f"Payment info: {safe_log(request.json)}")
Risk 2: Unauthorized trace access
# DANGER: Anyone can enable tracing via header
if request.headers.get('X-Enable-Tracing'):
    enable_tracing()

# SAFE: Verify authorization
def should_enable_tracing(request):
    trace_header = request.headers.get('X-Enable-Tracing')
    if not trace_header:
        return False

    # Verify signed token or API key
    if not verify_trace_token(trace_header, request.user):
        logger.warning(f"Unauthorized tracing attempt from user {request.user.id}")
        return False

    return True
Risk 3: DoS via verbose logging
# DANGER: Attacker enables tracing for a high-traffic endpoint
if is_enabled('tracing', context={'endpoint': request.path}):
    # Logs every request detail.
    # If the endpoint gets 1000 req/sec, this writes gigabytes of logs.
    ...

# SAFE: Rate limit traced requests
# (rate_limit is a placeholder decorator; use your rate-limiting library of choice)
@rate_limit(max_traced_requests=100, window_seconds=60)
def trace_if_enabled(request):
    return is_enabled('tracing', context={'endpoint': request.path})
Tracing authorization checklist:
- [ ] Traced data is redacted (no PII, credentials, or secrets)
- [ ] Tracing can only be enabled by authorized users/systems
- [ ] Tracing has rate limits to prevent DoS
- [ ] Trace data is stored securely and encrypted
- [ ] Trace retention policies automatically delete old data
- [ ] Audit logs track who enabled tracing, and when (see the sketch after this list)
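A minimal sketch of that last item, recording who enabled tracing, for whom, and when. The audit_logger destination and the feature_flags.enable call are assumptions; most flag providers expose an admin API or dashboard for this, and the audit record should land in storage your compliance process trusts.

import logging
from datetime import datetime, timezone

# Dedicated audit logger; in practice, route this to append-only storage
audit_logger = logging.getLogger("tracing.audit")

def enable_tracing_for_user(actor_id, target_user_id, reason):
    """Turn on verbose tracing for one user and leave an audit trail."""
    audit_logger.info(
        "tracing_enabled actor=%s target=%s reason=%s at=%s",
        actor_id, target_user_id, reason,
        datetime.now(timezone.utc).isoformat(),
    )
    # feature_flags.enable() is illustrative -- most flag SDKs expose an
    # admin API or dashboard for changing targeting rules.
    feature_flags.enable('detailed-checkout-tracing',
                         target={'user_id': target_user_id})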
7.12.2 Sampling and Profiling in Production
Full tracing of every request isn't feasible in production. At 10,000 requests per second, you'd generate terabytes of trace data daily. The solution: sampling and on-demand profiling.
py-spy Attach Mode for Live Systems
py-spy is a sampling profiler that can attach to running Python processes without modifying code or restarting. This is the safest way to profile production.
# Find your process ID
ps aux | grep python
# Output: user 12345 ... python app.py
# Attach py-spy and profile for 30 seconds
sudo py-spy record -p 12345 -o profile.svg --duration 30
# View the flamegraph
open profile.svg
The flamegraph shows what your application is doing:
[====app.checkout====][==db.query==][==serialize==]
25% of time 30% of time 15% of time
You immediately see that 30% of time is spent in database queries—a potential optimization target.
When to use py-spy in production:
- ✅ Investigating performance regressions ("Why is this suddenly slow?")
- ✅ Understanding baseline performance ("What does normal look like?")
- ✅ Identifying hot paths ("What code runs most?")
- ❌ Continuous monitoring (too much overhead—use APM tools instead)
- ❌ Debugging logical errors (use logging or feature-flagged tracing)
py-spy attach best practices:
# Sample at a lower rate for less overhead (default is 100 Hz)
sudo py-spy record -p 12345 -o profile.svg --rate 10 --duration 60

# Include child processes in the profile
sudo py-spy record -p 12345 -o profile.svg --subprocesses

# Output top functions instead of a flamegraph (for quick checks)
sudo py-spy top -p 12345
APM Tools Overview (New Relic, DataDog, Sentry)
When py-spy isn't enough—when you need continuous monitoring, alerting, or distributed tracing across many services—you need Application Performance Monitoring (APM) tools.
New Relic:
# Install
pip install newrelic
# Configure
newrelic-admin generate-config YOUR_LICENSE_KEY newrelic.ini
# Run your app with New Relic agent
NEW_RELIC_CONFIG_FILE=newrelic.ini newrelic-admin run-program python app.py
New Relic provides:
- Automatic transaction tracing (every web request, background job, database query)
- Distributed tracing across services
- Error tracking with stack traces
- Custom metrics and dashboards
- Alerting based on performance thresholds
DataDog:
# Install
pip install ddtrace
# Run with DataDog tracing
ddtrace-run python app.py
DataDog provides similar capabilities to New Relic, plus:
- Infrastructure monitoring (CPU, memory, disk)
- Log aggregation integrated with traces
- Real User Monitoring (RUM) for frontend performance
- Network performance monitoring
Sentry:
import sentry_sdk

sentry_sdk.init(
    dsn="https://your-key@sentry.io/project-id",
    traces_sample_rate=0.1,    # Sample 10% of transactions
    profiles_sample_rate=0.1,  # Profile 10% of traced transactions
)

# Sentry automatically captures errors and traces
@app.post("/checkout")
def checkout(request):
    # If this raises an exception, Sentry captures it with full context
    process_order(request)
Sentry focuses on error tracking but also provides:
- Performance monitoring (transaction tracing)
- Profiling (flamegraphs integrated with transactions)
- Release tracking (correlate errors with deployments)
- User feedback collection
Key differences:
| Feature | New Relic | DataDog | Sentry |
| ------------------- | ------------------------- | --------------------------------------------- | ---------------------- |
| Focus | APM first | Infrastructure first | Error tracking first |
| Pricing | $99+/month | $15+/host/month | $26+/month |
| Best for | Deep transaction analysis | DevOps teams needing full stack observability | Error-driven debugging |
| Learning curve | Moderate | Steep | Easy |
| Trace retention | 8 days (standard) | 15 days | 90 days |
When to Pay for Commercial Solutions
You're a startup with 3 developers and 100 users. Do you need DataDog at $500/month? Probably not. Here's the decision matrix:
Use free/OSS tools when:
- Traffic < 1000 requests/hour
- Team < 10 developers
- Services < 5
- You have time to maintain Jaeger/ELK/Prometheus
- Performance issues are rare
- You can tolerate some downtime for debugging
Pay for APM when:
- Traffic > 10,000 requests/hour
- Team > 10 developers
- Services > 10
- Time to resolution matters ($100/hour downtime cost)
- You need proactive alerting
- Multiple teams share the system
- Compliance requires detailed audit trails
The cost calculation:
Scenario: Production issue investigation
With OSS tools:
- 2 hours digging through separate logs/traces/metrics
- $200 developer time (2 hours × $100/hr)
- $1000 business impact (downtime/lost sales)
Total: $1200

With APM tool:
- 15 minutes to identify issue (integrated view)
- $25 developer time (0.25 hours × $100/hr)
- $250 business impact (faster resolution)
- $50 tool cost (monthly subscription / 30 days)
Total: $325

Savings per incident: $875
Break-even: 1 incident per month
This is crucial: If you have more than one production incident per month that takes > 1 hour to debug, APM tools pay for themselves. If incidents are rare and debugging is quick, stick with free tools.
Sampling strategies for production:
All APM tools support sampling to reduce costs and overhead. Here's how to configure it:
Head-based sampling (decision at trace start):
# Sample based on trace ID (consistent across services)
def should_sample(trace_id):
    # Sample 10% of traces
    return int(trace_id, 16) % 10 == 0

# OpenTelemetry configuration
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased
sampler = TraceIdRatioBased(0.1)  # 10% sampling
Tail-based sampling (decision after trace completes):
import random

# Sample all errors, all slow requests, and 1% of everything else
class SmartSampler:
    def should_sample(self, trace):
        # Always sample errors
        if trace.has_error:
            return True

        # Always sample slow requests
        if trace.duration > 1.0:  # > 1 second
            return True

        # Sample 1% of normal requests
        return random.random() < 0.01
Tail-based sampling requires buffering traces until completion, which adds memory overhead but gives better signal-to-noise ratio.
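A minimal sketch of that buffering, assuming span objects carry a trace_id and the trace object exposes has_error and duration as in SmartSampler above; the exporter interface is a placeholder.

from collections import defaultdict

class TailSamplingBuffer:
    """Hold finished spans until the whole trace completes, then decide."""

    def __init__(self, sampler, exporter):
        self.sampler = sampler                  # e.g. SmartSampler()
        self.exporter = exporter                # placeholder: anything with .export(spans)
        self.open_traces = defaultdict(list)    # trace_id -> finished spans

    def on_span_end(self, span):
        # Buffer every span; this is the memory overhead mentioned above
        self.open_traces[span.trace_id].append(span)

    def on_trace_end(self, trace):
        spans = self.open_traces.pop(trace.trace_id, [])
        # Only now do we know about errors and total duration
        if self.sampler.should_sample(trace):
            self.exporter.export(spans)
        # Unsampled traces are simply discarded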
Adaptive sampling:
import random

class AdaptiveSampler:
    def __init__(self, target_traces_per_second=100):
        self.target = target_traces_per_second
        self.current_rate = 0      # sampled traces/sec, updated by the caller
        self.sample_ratio = 1.0

    def should_sample(self):
        # Adjust sampling to maintain the target rate
        if self.current_rate > self.target:
            self.sample_ratio *= 0.9  # Sample less
        elif self.current_rate < self.target:
            self.sample_ratio = min(1.0, self.sample_ratio * 1.1)  # Sample more
        return random.random() < self.sample_ratio
This keeps trace volume roughly constant regardless of traffic changes.
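The class only adjusts its ratio; something has to feed it the measured rate. A usage sketch, assuming the caller updates current_rate once per second with the number of traces actually sampled:

import time

sampler = AdaptiveSampler(target_traces_per_second=100)
_sampled_count = 0
_window_start = time.time()

def maybe_trace_request():
    """Call once per request; returns True if this request should be traced."""
    global _sampled_count, _window_start
    now = time.time()
    if now - _window_start >= 1.0:
        # Report how many traces were sampled in the last second
        sampler.current_rate = _sampled_count
        _sampled_count = 0
        _window_start = now
    sampled = sampler.should_sample()
    if sampled:
        _sampled_count += 1
    return sampled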
Production profiling safety checklist:
- [ ] Sampling rate < 10% (1-5% is common)
- [ ] Rate limiting on profiler attachment (max 1 profile/hour)
- [ ] Alerts if profiling overhead exceeds a threshold
- [ ] Auto-disable profiling if CPU/memory spikes (see the sketch after this list)
- [ ] Trace data retention policy configured
- [ ] PII redaction enabled on all traces
- [ ] Team trained on interpreting APM data
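A sketch of the auto-disable item, assuming psutil is installed and that verbose tracing is toggled through the same (illustrative) feature-flag client used earlier:

import logging
import psutil

logger = logging.getLogger(__name__)

CPU_CEILING_PERCENT = 85
MEMORY_CEILING_PERCENT = 90

def tracing_guard():
    """Run periodically; disable verbose tracing if the host is under pressure."""
    cpu = psutil.cpu_percent(interval=1)          # CPU averaged over one second
    memory = psutil.virtual_memory().percent
    if cpu > CPU_CEILING_PERCENT or memory > MEMORY_CEILING_PERCENT:
        # feature_flags.disable() is illustrative; use your flag provider's API
        feature_flags.disable('detailed-tracing')
        logger.warning("Tracing auto-disabled: cpu=%.0f%% memory=%.0f%%", cpu, memory)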
Real-world production tracing setup:
Here's a complete configuration for a mid-sized production system:
import os
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBasedTraceIdRatio
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
import sentry_sdk

# Environment-based configuration
ENVIRONMENT = os.getenv('ENVIRONMENT', 'development')
SERVICE_NAME = os.getenv('SERVICE_NAME', 'api-gateway')

# Configure sampling based on environment
if ENVIRONMENT == 'production':
    SAMPLE_RATE = 0.01        # 1% sampling in production
    SENTRY_SAMPLE_RATE = 0.1  # 10% for Sentry (focuses on errors)
elif ENVIRONMENT == 'staging':
    SAMPLE_RATE = 0.1         # 10% in staging
    SENTRY_SAMPLE_RATE = 0.5
else:  # development
    SAMPLE_RATE = 1.0         # 100% in development
    SENTRY_SAMPLE_RATE = 1.0

# OpenTelemetry setup
sampler = ParentBasedTraceIdRatio(SAMPLE_RATE)
provider = TracerProvider(sampler=sampler)

otlp_exporter = OTLPSpanExporter(
    endpoint=os.getenv('OTEL_EXPORTER_ENDPOINT', 'http://localhost:4317'),
    insecure=ENVIRONMENT != 'production'
)
provider.add_span_processor(BatchSpanProcessor(otlp_exporter))
trace.set_tracer_provider(provider)

# Sentry setup (for error tracking + sampling)
if os.getenv('SENTRY_DSN'):
    sentry_sdk.init(
        dsn=os.getenv('SENTRY_DSN'),
        environment=ENVIRONMENT,
        traces_sample_rate=SENTRY_SAMPLE_RATE,
        profiles_sample_rate=SENTRY_SAMPLE_RATE * 0.5,  # Fraction of traced transactions that also get profiled
        before_send=redact_sensitive_data,  # Custom function to redact PII (sketched below)
    )

# Feature flag for emergency verbose tracing
from feature_flags import is_enabled

def should_enable_verbose_tracing(request):
    """Enable detailed tracing for specific requests."""
    # Check if emergency tracing is enabled
    if is_enabled('emergency-tracing'):
        return True

    # Check if a specific user requested it
    if request.headers.get('X-Trace-This-Request'):
        # Verify authorization
        if verify_trace_authorization(request):
            return True

    return False

# Middleware for conditional tracing
@app.middleware("http")
async def tracing_middleware(request, call_next):
    # Store tracing preference for this request
    request.state.verbose_tracing = should_enable_verbose_tracing(request)

    # Mark the span so downstream processors can force-keep this request
    if request.state.verbose_tracing:
        span = trace.get_current_span()
        span.set_attribute("sampled.forced", True)
        span.set_attribute("sampled.reason", "verbose_tracing_enabled")

    response = await call_next(request)
    return response
This configuration gives you:
- Normal operation: 1% sampling, minimal overhead
- Error tracking: All errors captured by Sentry
- Emergency debugging: Feature flag enables 100% sampling for troubleshooting
- Authorized tracing: Specific requests can be force-traced
- Environment-aware: Different sampling in dev/staging/production
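The Sentry setup above passes before_send=redact_sensitive_data. One possible shape for that hook, assuming the fields worth scrubbing live in the event's request payload (adjust the key set to your own data):

SENSITIVE_KEYS = {'card_number', 'cvv', 'password', 'authorization'}

def redact_sensitive_data(event, hint):
    """Sentry calls this for each event; return the (scrubbed) event to send it."""
    request_data = event.get('request', {}).get('data')
    if isinstance(request_data, dict):
        for key in list(request_data):
            if key.lower() in SENSITIVE_KEYS:
                request_data[key] = '***REDACTED***'
    return event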
Common production tracing mistakes:
- Over-instrumenting: Creating spans for every function call
  - Fix: Focus on service boundaries and slow operations
- No sampling: Tracing 100% of production traffic
  - Fix: Start with 1-5% sampling, increase only if needed
- Blocking exports: Sending traces synchronously
  - Fix: Always use async/batch span processors
- No PII redaction: Logging sensitive user data
  - Fix: Redact credit cards, passwords, emails in traces
- Ignoring overhead: Not monitoring tracing performance impact
  - Fix: Alert if tracing adds > 5ms to request latency
- No off switch: Can't disable tracing without redeployment
  - Fix: Use feature flags for quick disable
- Alert fatigue: Creating alerts for every slow request
  - Fix: Alert on p95/p99 latency, not individual requests (see the sketch after this list)
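A small sketch of that last fix: keep a rolling window of latencies and alert on the p99, not on individual slow requests. send_alert is a placeholder for your alerting hook.

from collections import deque

LATENCY_WINDOW = deque(maxlen=1000)   # last 1000 request durations, in seconds
P99_THRESHOLD_SECONDS = 1.0

def record_latency(duration_seconds):
    LATENCY_WINDOW.append(duration_seconds)
    if len(LATENCY_WINDOW) < 100:
        return  # not enough data for a stable percentile yet
    ordered = sorted(LATENCY_WINDOW)
    p99 = ordered[int(len(ordered) * 0.99) - 1]
    if p99 > P99_THRESHOLD_SECONDS:
        send_alert(f"p99 latency {p99:.3f}s exceeds {P99_THRESHOLD_SECONDS}s")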
The production tracing philosophy:
- Sample smartly: Trace enough to catch issues, not so much you drown in data
- Focus on signals: Errors, slow requests, and unusual patterns matter most
- Automate response: Tracing should enable quick resolution, not just observation
- Trust but verify: APM tools are helpful, but verify with logs and metrics
- Cost-conscious: Balance observability value against tool costs
- Privacy-first: Redact PII automatically, not manually
- Team-accessible: Everyone should be able to query traces, not just ops
Quick decision guide:
Question: How should I trace in production?
Is this an active incident?
├─ Yes → Enable verbose tracing via feature flag (for affected users only)
└─ No
   ├─ Do you have < 100 req/sec?
   │  ├─ Yes → 10% sampling with free tools (Jaeger/Zipkin)
   │  └─ No
   │     ├─ Do you have budget ($100+/month)?
   │     │  ├─ Yes → Use an APM tool (Sentry/DataDog/New Relic) with 1-5% sampling
   │     │  └─ No → Use OSS with 1% sampling + py-spy for ad-hoc profiling
   │     └─ Is this a one-time investigation?
   │        ├─ Yes → Attach py-spy for 60 seconds
   │        └─ No → Implement gradual sampling increase (1% → 5% → 10%)
Remember: The goal of production tracing isn't to capture everything—it's to capture enough to understand and resolve issues quickly. Start minimal, increase only when specific needs justify the overhead.