7.12 Production-Safe Tracing
You've learned to trace code in development with debuggers, profile it with py-spy, and understand distributed flows with OpenTelemetry. But production is different. You can't attach a debugger to a running production service—it pauses execution and ruins user experience. You can't enable verbose logging for every request—it overwhelms storage and degrades performance. You need production-safe tracing techniques.
7.12.1 Feature Flags for Instrumentation
The core pattern: instrumentation that can be enabled selectively for specific users, requests, or conditions without code deployment.
Enabling Tracing for Specific Users/Requests
Imagine you have a bug that only affects Premium tier users. You want verbose tracing for Premium users without impacting Basic users. Here's how:
from feature_flags import is_enabled  # Using a library like LaunchDarkly or Unleash
import logging

logger = logging.getLogger(__name__)

@app.post("/checkout")
def checkout(request):
    user_id = request.user.id
    user_tier = request.user.tier

    # Enable detailed tracing based on feature flag
    verbose_tracing = is_enabled(
        'detailed-checkout-tracing',
        context={'user_id': user_id, 'tier': user_tier}
    )

    if verbose_tracing:
        logger.setLevel(logging.DEBUG)
        logger.debug(f"Checkout initiated: user={user_id}, tier={user_tier}, items={request.json['items']}")

    # Your checkout logic
    inventory_response = reserve_inventory(request.json['items'])
    if verbose_tracing:
        logger.debug(f"Inventory reserved: {inventory_response}")

    payment_response = process_payment(user_id, inventory_response['total'])
    if verbose_tracing:
        logger.debug(f"Payment processed: {payment_response}")

    return {"status": "success"}
Now you can enable tracing from your feature flag dashboard:
Feature: detailed-checkout-tracing
Targeting: tier = "premium"
Status: Enabled for 100% of premium users
No code deployment needed. Premium users get verbose logs; Basic users don't see any performance impact.
More sophisticated targeting:
from datetime import datetime

# Enable tracing for a specific user
is_enabled('tracing', context={'user_id': 12345})  # True only for user 12345

# Enable tracing for a percentage of requests
is_enabled('tracing', context={'rollout_percentage': 5})  # 5% of requests

# Enable tracing for specific conditions
is_enabled('tracing', context={
    'user_tier': 'premium',
    'request_path': '/checkout',
    'time_hour': datetime.now().hour  # Only during business hours
})
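The feature_flags module in these examples stands in for whichever flag client you use. If you don't run a commercial flag service, a minimal in-house helper might look like the sketch below; the FLAG_RULES dict and its rule format are illustrative assumptions, not part of any particular library.

import random

# Illustrative in-memory rule store; a real system would load rules from a
# flag service (LaunchDarkly, Unleash) or a database.
FLAG_RULES = {
    'detailed-checkout-tracing': {'tier': 'premium'},
    'tracing': {'rollout_percentage': 5},
}

def is_enabled(flag_name, context=None):
    rule = FLAG_RULES.get(flag_name)
    if rule is None:
        return False
    context = context or {}

    # Percentage rollouts: enable for roughly N% of calls
    # (not sticky per user -- real flag services hash the user ID instead)
    if 'rollout_percentage' in rule:
        return random.random() * 100 < rule['rollout_percentage']

    # Otherwise every rule key must match the request context
    return all(context.get(key) == value for key, value in rule.items())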
Performance Impact Mitigation
Feature flags add some overhead. Here's how to minimize it:
Pattern 1: Check once per request
@app.middleware("http")
async def tracing_middleware(request, call_next):
    # Check feature flag once at request start
    request.state.verbose_tracing = is_enabled(
        'detailed-tracing',
        context={'user_id': request.user.id if request.user else None}
    )
    response = await call_next(request)
    return response

@app.post("/checkout")
def checkout(request):
    if request.state.verbose_tracing:
        logger.debug("Checkout initiated")
    # ... rest of code
Pattern 2: Lazy evaluation
import time
from contextlib import contextmanager

@contextmanager
def traced_operation(request, operation_name):
    """Only compute tracing data if tracing is enabled."""
    if not request.state.verbose_tracing:
        yield  # No-op if tracing disabled
        return

    start = time.time()
    logger.debug(f"Starting {operation_name}")
    try:
        yield
    finally:
        duration = time.time() - start
        logger.debug(f"{operation_name} completed in {duration:.3f}s")

@app.post("/checkout")
def checkout(request):
    with traced_operation(request, "inventory_check"):
        # This code always runs
        inventory = check_inventory(request.json['items'])
        # But tracing only happens if enabled
Pattern 3: Sampling with feature flags
import random

def should_trace(request):
    """Probabilistic tracing based on feature flag."""
    sampling_rate = feature_flags.get_float('tracing-sample-rate', default=0.0)

    # Always trace if the caller specifically requested it
    if request.headers.get('X-Enable-Tracing') == 'true':
        return True

    # Otherwise sample based on rate
    return random.random() < sampling_rate

@app.post("/checkout")
def checkout(request):
    # tracer comes from opentelemetry.trace.get_tracer(__name__)
    if should_trace(request):
        with tracer.start_as_current_span("checkout"):
            # ... checkout logic (traced)
            pass
    else:
        # ... same checkout logic, just untraced
        pass
Security Considerations
Feature-flagged tracing introduces security risks of its own:
Risk 1: Logging sensitive data
# DANGER: Might log credit card numbers if tracing is enabled
if verbose_tracing:
    logger.debug(f"Payment info: {request.json}")  # Contains card number!

# SAFE: Redact sensitive fields
def safe_log(data, sensitive_keys=('card_number', 'cvv', 'password')):
    return {k: '***REDACTED***' if k in sensitive_keys else v for k, v in data.items()}

if verbose_tracing:
    logger.debug(f"Payment info: {safe_log(request.json)}")
Risk 2: Unauthorized trace access
# DANGER: Anyone can enable tracing via header
if request.headers.get('X-Enable-Tracing'):
    enable_tracing()

# SAFE: Verify authorization
def should_enable_tracing(request):
    trace_header = request.headers.get('X-Enable-Tracing')
    if not trace_header:
        return False

    # Verify signed token or API key
    if not verify_trace_token(trace_header, request.user):
        logger.warning(f"Unauthorized tracing attempt from user {request.user.id}")
        return False

    return True
Risk 3: DoS via verbose logging
# DANGER: Attacker enables tracing for a high-traffic endpoint
if is_enabled('tracing', context={'endpoint': request.path}):
    # Logs every request detail.
    # If the endpoint gets 1000 req/sec, this writes gigabytes of logs.
    ...

# SAFE: Rate limit traced requests
# (rate_limit is a placeholder decorator; use your rate-limiting library of choice)
@rate_limit(max_traced_requests=100, window_seconds=60)
def trace_if_enabled(request):
    return is_enabled('tracing', context={'endpoint': request.path})
Tracing authorization checklist:
- [ ] Traced data is redacted (no PII, credentials, or secrets)
- [ ] Tracing can only be enabled by authorized users/systems
- [ ] Tracing has rate limits to prevent DoS
- [ ] Trace data is stored securely and encrypted
- [ ] Trace retention policies automatically delete old data
- [ ] Audit logs track who enabled tracing, and when (see the sketch after this list)
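A minimal sketch of that last item, recording who enabled tracing, for whom, and when. The audit_logger destination and the feature_flags.enable call are assumptions; most flag providers expose an admin API or dashboard for this, and the audit record should land in storage your compliance process trusts.

import logging
from datetime import datetime, timezone

# Dedicated audit logger; in practice, route this to append-only storage
audit_logger = logging.getLogger("tracing.audit")

def enable_tracing_for_user(actor_id, target_user_id, reason):
    """Turn on verbose tracing for one user and leave an audit trail."""
    audit_logger.info(
        "tracing_enabled actor=%s target=%s reason=%s at=%s",
        actor_id, target_user_id, reason,
        datetime.now(timezone.utc).isoformat(),
    )
    # feature_flags.enable() is illustrative -- most flag SDKs expose an
    # admin API or dashboard for changing targeting rules.
    feature_flags.enable('detailed-checkout-tracing',
                         target={'user_id': target_user_id})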
7.12.2 Sampling and Profiling in Production
Full tracing of every request isn't feasible in production. At 10,000 requests per second, you'd generate terabytes of trace data daily. The solution: sampling and on-demand profiling.
py-spy Attach Mode for Live Systems
py-spy is a sampling profiler that can attach to running Python processes without modifying code or restarting. This is the safest way to profile production.
# Find your process ID
ps aux | grep python
# Output: user 12345 ... python app.py
# Attach py-spy and profile for 30 seconds
sudo py-spy record -p 12345 -o profile.svg --duration 30
# View the flamegraph
open profile.svg
The flamegraph shows what your application is doing:
[====app.checkout====][==db.query==][==serialize==]
25% of time 30% of time 15% of time
You immediately see that 30% of time is spent in database queries—a potential optimization target.
When to use py-spy in production:
- ✅ Investigating performance regressions ("Why is this suddenly slow?")
- ✅ Understanding baseline performance ("What does normal look like?")
- ✅ Identifying hot paths ("What code runs most?")
- ❌ Continuous monitoring (too much overhead—use APM tools instead)
- ❌ Debugging logical errors (use logging or feature-flagged tracing)
py-spy attach best practices:
# Sample at a lower rate for less overhead (default is 100 Hz)
sudo py-spy record -p 12345 -o profile.svg --rate 10 --duration 60

# Include child processes in the profile
sudo py-spy record -p 12345 -o profile.svg --subprocesses

# Output top functions instead of a flamegraph (for quick checks)
sudo py-spy top -p 12345
APM Tools Overview (New Relic, DataDog, Sentry)
When py-spy isn't enough—when you need continuous monitoring, alerting, or distributed tracing across many services—you need Application Performance Monitoring (APM) tools.
New Relic:
# Install
pip install newrelic
# Configure
newrelic-admin generate-config YOUR_LICENSE_KEY newrelic.ini
# Run your app with New Relic agent
NEW_RELIC_CONFIG_FILE=newrelic.ini newrelic-admin run-program python app.py
New Relic provides:
- Automatic transaction tracing (every web request, background job, database query)
- Distributed tracing across services
- Error tracking with stack traces
- Custom metrics and dashboards
- Alerting based on performance thresholds
DataDog:
# Install
pip install ddtrace
# Run with DataDog tracing
ddtrace-run python app.py
DataDog provides similar capabilities to New Relic, plus:
- Infrastructure monitoring (CPU, memory, disk)
- Log aggregation integrated with traces
- Real User Monitoring (RUM) for frontend performance
- Network performance monitoring
Sentry:
import sentry_sdk

sentry_sdk.init(
    dsn="https://your-key@sentry.io/project-id",
    traces_sample_rate=0.1,    # Sample 10% of transactions
    profiles_sample_rate=0.1,  # Profile 10% of traced transactions
)

# Sentry automatically captures errors and traces
@app.post("/checkout")
def checkout(request):
    # If this raises an exception, Sentry captures it with full context
    process_order(request)
Sentry focuses on error tracking but also provides:
- Performance monitoring (transaction tracing)
- Profiling (flamegraphs integrated with transactions)
- Release tracking (correlate errors with deployments)
- User feedback collection
Key differences:
| Feature | New Relic | DataDog | Sentry |
| ------------------- | ------------------------- | --------------------------------------------- | ---------------------- |
| Focus | APM first | Infrastructure first | Error tracking first |
| Pricing | $99+/month | $15+/host/month | $26+/month |
| Best for | Deep transaction analysis | DevOps teams needing full stack observability | Error-driven debugging |
| Learning curve | Moderate | Steep | Easy |
| Trace retention | 8 days (standard) | 15 days | 90 days |
When to Pay for Commercial Solutions
You're a startup with 3 developers and 100 users. Do you need DataDog at $500/month? Probably not. Here's the decision matrix:
Use free/OSS tools when:
- Traffic < 1000 requests/hour
- Team < 10 developers
- Services < 5
- You have time to maintain Jaeger/ELK/Prometheus
- Performance issues are rare
- You can tolerate some downtime for debugging
Pay for APM when:
- Traffic > 10,000 requests/hour
- Team > 10 developers
- Services > 10
- Time to resolution matters ($100/hour downtime cost)
- You need proactive alerting
- Multiple teams share the system
- Compliance requires detailed audit trails
The cost calculation:
Scenario: Production issue investigation
With OSS tools:
- 2 hours digging through separate logs/traces/metrics
- $200 developer time (2 hours × $100/hr)
- $1000 business impact (downtime/lost sales)
Total: $1200

With APM tool:
- 15 minutes to identify issue (integrated view)
- $25 developer time (0.25 hours × $100/hr)
- $250 business impact (faster resolution)
- $50 tool cost (monthly subscription / 30 days)
Total: $325

Savings per incident: $875
Break-even: 1 incident per month
This is crucial: If you have more than one production incident per month that takes > 1 hour to debug, APM tools pay for themselves. If incidents are rare and debugging is quick, stick with free tools.
Sampling strategies for production:
All APM tools support sampling to reduce costs and overhead. Here's how to configure it:
Head-based sampling (decision at trace start):
# Sample based on trace ID (consistent across services)
def should_sample(trace_id):
    # Sample 10% of traces
    return int(trace_id, 16) % 10 == 0

# OpenTelemetry configuration
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased
sampler = TraceIdRatioBased(0.1)  # 10% sampling
Tail-based sampling (decision after trace completes):
import random

# Sample all errors, all slow requests, and 1% of everything else
class SmartSampler:
    def should_sample(self, trace):
        # Always sample errors
        if trace.has_error:
            return True

        # Always sample slow requests
        if trace.duration > 1.0:  # > 1 second
            return True

        # Sample 1% of normal requests
        return random.random() < 0.01
Tail-based sampling requires buffering traces until completion, which adds memory overhead but gives better signal-to-noise ratio.
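A minimal sketch of that buffering, assuming span objects carry a trace_id and the trace object exposes has_error and duration as in SmartSampler above; the exporter interface is a placeholder.

from collections import defaultdict

class TailSamplingBuffer:
    """Hold finished spans until the whole trace completes, then decide."""

    def __init__(self, sampler, exporter):
        self.sampler = sampler                  # e.g. SmartSampler()
        self.exporter = exporter                # placeholder: anything with .export(spans)
        self.open_traces = defaultdict(list)    # trace_id -> finished spans

    def on_span_end(self, span):
        # Buffer every span; this is the memory overhead mentioned above
        self.open_traces[span.trace_id].append(span)

    def on_trace_end(self, trace):
        spans = self.open_traces.pop(trace.trace_id, [])
        # Only now do we know about errors and total duration
        if self.sampler.should_sample(trace):
            self.exporter.export(spans)
        # Unsampled traces are simply discarded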
Adaptive sampling:
import random

class AdaptiveSampler:
    def __init__(self, target_traces_per_second=100):
        self.target = target_traces_per_second
        self.current_rate = 0      # sampled traces/sec, updated by the caller
        self.sample_ratio = 1.0

    def should_sample(self):
        # Adjust sampling to maintain the target rate
        if self.current_rate > self.target:
            self.sample_ratio *= 0.9  # Sample less
        elif self.current_rate < self.target:
            self.sample_ratio = min(1.0, self.sample_ratio * 1.1)  # Sample more
        return random.random() < self.sample_ratio
This keeps trace volume roughly constant regardless of traffic changes.
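The class only adjusts its ratio; something has to feed it the measured rate. A usage sketch, assuming the caller updates current_rate once per second with the number of traces actually sampled:

import time

sampler = AdaptiveSampler(target_traces_per_second=100)
_sampled_count = 0
_window_start = time.time()

def maybe_trace_request():
    """Call once per request; returns True if this request should be traced."""
    global _sampled_count, _window_start
    now = time.time()
    if now - _window_start >= 1.0:
        # Report how many traces were sampled in the last second
        sampler.current_rate = _sampled_count
        _sampled_count = 0
        _window_start = now
    sampled = sampler.should_sample()
    if sampled:
        _sampled_count += 1
    return sampled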
Production profiling safety checklist:
- [ ] Sampling rate < 10% (1-5% is common)
- [ ] Rate limiting on profiler attachment (max 1 profile/hour)
- [ ] Alerts if profiling overhead exceeds a threshold
- [ ] Auto-disable profiling if CPU/memory spikes (see the sketch after this list)
- [ ] Trace data retention policy configured
- [ ] PII redaction enabled on all traces
- [ ] Team trained on interpreting APM data
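A sketch of the auto-disable item, assuming psutil is installed and that verbose tracing is toggled through the same (illustrative) feature-flag client used earlier:

import logging
import psutil

logger = logging.getLogger(__name__)

CPU_CEILING_PERCENT = 85
MEMORY_CEILING_PERCENT = 90

def tracing_guard():
    """Run periodically; disable verbose tracing if the host is under pressure."""
    cpu = psutil.cpu_percent(interval=1)          # CPU averaged over one second
    memory = psutil.virtual_memory().percent
    if cpu > CPU_CEILING_PERCENT or memory > MEMORY_CEILING_PERCENT:
        # feature_flags.disable() is illustrative; use your flag provider's API
        feature_flags.disable('detailed-tracing')
        logger.warning("Tracing auto-disabled: cpu=%.0f%% memory=%.0f%%", cpu, memory)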
Real-world production tracing setup:
Here's a complete configuration for a mid-sized production system:
import os
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBasedTraceIdRatio
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
import sentry_sdk

# Environment-based configuration
ENVIRONMENT = os.getenv('ENVIRONMENT', 'development')
SERVICE_NAME = os.getenv('SERVICE_NAME', 'api-gateway')

# Configure sampling based on environment
if ENVIRONMENT == 'production':
    SAMPLE_RATE = 0.01        # 1% sampling in production
    SENTRY_SAMPLE_RATE = 0.1  # 10% for Sentry (focuses on errors)
elif ENVIRONMENT == 'staging':
    SAMPLE_RATE = 0.1         # 10% in staging
    SENTRY_SAMPLE_RATE = 0.5
else:  # development
    SAMPLE_RATE = 1.0         # 100% in development
    SENTRY_SAMPLE_RATE = 1.0

# OpenTelemetry setup
sampler = ParentBasedTraceIdRatio(SAMPLE_RATE)
provider = TracerProvider(sampler=sampler)

otlp_exporter = OTLPSpanExporter(
    endpoint=os.getenv('OTEL_EXPORTER_ENDPOINT', 'http://localhost:4317'),
    insecure=ENVIRONMENT != 'production'
)
provider.add_span_processor(BatchSpanProcessor(otlp_exporter))
trace.set_tracer_provider(provider)

# Sentry setup (for error tracking + sampling)
if os.getenv('SENTRY_DSN'):
    sentry_sdk.init(
        dsn=os.getenv('SENTRY_DSN'),
        environment=ENVIRONMENT,
        traces_sample_rate=SENTRY_SAMPLE_RATE,
        profiles_sample_rate=SENTRY_SAMPLE_RATE * 0.5,  # Fraction of traced transactions that also get profiled
        before_send=redact_sensitive_data,  # Custom function to redact PII (sketched below)
    )

# Feature flag for emergency verbose tracing
from feature_flags import is_enabled

def should_enable_verbose_tracing(request):
    """Enable detailed tracing for specific requests."""
    # Check if emergency tracing is enabled
    if is_enabled('emergency-tracing'):
        return True

    # Check if a specific user requested it
    if request.headers.get('X-Trace-This-Request'):
        # Verify authorization
        if verify_trace_authorization(request):
            return True

    return False

# Middleware for conditional tracing
@app.middleware("http")
async def tracing_middleware(request, call_next):
    # Store tracing preference for this request
    request.state.verbose_tracing = should_enable_verbose_tracing(request)

    # Mark the span so downstream processors can force-keep this request
    if request.state.verbose_tracing:
        span = trace.get_current_span()
        span.set_attribute("sampled.forced", True)
        span.set_attribute("sampled.reason", "verbose_tracing_enabled")

    response = await call_next(request)
    return response
This configuration gives you:
- Normal operation: 1% sampling, minimal overhead
- Error tracking: All errors captured by Sentry
- Emergency debugging: Feature flag enables 100% sampling for troubleshooting
- Authorized tracing: Specific requests can be force-traced
- Environment-aware: Different sampling in dev/staging/production
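The Sentry setup above passes before_send=redact_sensitive_data. One possible shape for that hook, assuming the fields worth scrubbing live in the event's request payload (adjust the key set to your own data):

SENSITIVE_KEYS = {'card_number', 'cvv', 'password', 'authorization'}

def redact_sensitive_data(event, hint):
    """Sentry calls this for each event; return the (scrubbed) event to send it."""
    request_data = event.get('request', {}).get('data')
    if isinstance(request_data, dict):
        for key in list(request_data):
            if key.lower() in SENSITIVE_KEYS:
                request_data[key] = '***REDACTED***'
    return event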
Common production tracing mistakes:
- Over-instrumenting: Creating spans for every function call
  - Fix: Focus on service boundaries and slow operations
- No sampling: Tracing 100% of production traffic
  - Fix: Start with 1-5% sampling, increase only if needed
- Blocking exports: Sending traces synchronously
  - Fix: Always use async/batch span processors
- No PII redaction: Logging sensitive user data
  - Fix: Redact credit cards, passwords, emails in traces
- Ignoring overhead: Not monitoring tracing performance impact
  - Fix: Alert if tracing adds > 5ms to request latency
- No off switch: Can't disable tracing without redeployment
  - Fix: Use feature flags for quick disable
- Alert fatigue: Creating alerts for every slow request
  - Fix: Alert on p95/p99 latency, not individual requests (see the sketch after this list)
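A small sketch of that last fix: keep a rolling window of latencies and alert on the p99, not on individual slow requests. send_alert is a placeholder for your alerting hook.

from collections import deque

LATENCY_WINDOW = deque(maxlen=1000)   # last 1000 request durations, in seconds
P99_THRESHOLD_SECONDS = 1.0

def record_latency(duration_seconds):
    LATENCY_WINDOW.append(duration_seconds)
    if len(LATENCY_WINDOW) < 100:
        return  # not enough data for a stable percentile yet
    ordered = sorted(LATENCY_WINDOW)
    p99 = ordered[int(len(ordered) * 0.99) - 1]
    if p99 > P99_THRESHOLD_SECONDS:
        send_alert(f"p99 latency {p99:.3f}s exceeds {P99_THRESHOLD_SECONDS}s")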
The production tracing philosophy:
- Sample smartly: Trace enough to catch issues, not so much you drown in data
- Focus on signals: Errors, slow requests, and unusual patterns matter most
- Automate response: Tracing should enable quick resolution, not just observation
- Trust but verify: APM tools are helpful, but verify with logs and metrics
- Cost-conscious: Balance observability value against tool costs
- Privacy-first: Redact PII automatically, not manually
- Team-accessible: Everyone should be able to query traces, not just ops
Quick decision guide:
Question: How should I trace in production?
Is this an active incident?
├─ Yes → Enable verbose tracing via feature flag (for affected users only)
└─ No
   ├─ Do you have < 100 req/sec?
   │  ├─ Yes → 10% sampling with free tools (Jaeger/Zipkin)
   │  └─ No
   │     ├─ Do you have budget ($100+/month)?
   │     │  ├─ Yes → Use an APM tool (Sentry/DataDog/New Relic) with 1-5% sampling
   │     │  └─ No → Use OSS with 1% sampling + py-spy for ad-hoc profiling
   │     └─ Is this a one-time investigation?
   │        ├─ Yes → Attach py-spy for 60 seconds
   │        └─ No → Implement gradual sampling increase (1% → 5% → 10%)
Remember: The goal of production tracing isn't to capture everything—it's to capture enough to understand and resolve issues quickly. Start minimal, increase only when specific needs justify the overhead.