Part VII: Avoiding Common Pitfalls
7.19 The Seven Deadly Sins of Code Tracing
You've spent three hours adding print() statements throughout a codebase, only to restart the server and realize you forgot to add one in the critical function. Or you've built an elegant AST instrumentation system over two weeks, only to discover Django Debug Toolbar would have answered your question in five minutes. These aren't isolated mistakes—they're patterns that trap developers repeatedly when tracing unfamiliar code.
This section catalogs the seven most common anti-patterns in execution tracing. Each represents a seductive path that feels productive in the moment but leads to wasted time, technical debt, or incomplete understanding. More importantly, we'll show you how to recognize when you're falling into these traps and what to do instead.
Sin 1: Print Statement Archaeology
The Scenario: You're trying to understand a Django form submission flow. You add print("In view") at the top of your view function. The terminal output is buried among Django's startup messages. You add print("=" * 50) to make it visible. You add print(f"Form data: {request.POST}") to see the data. You add prints in three more functions. Now you need to understand the order of execution, so you add timestamps. Then you realize you're not seeing output from one function—is it not executing, or is stdout buffering? You add sys.stdout.flush() calls. An hour has passed, and you're debugging your debugging instrumentation.
This is print statement archaeology—the practice of excavating program behavior by layering print statements throughout code like sedimentary deposits. Here's why it fails:
The Scalability Problem
Print debugging works beautifully for small, linear code paths. When you're testing a single function with clear inputs and outputs, a few strategic print statements give you exactly what you need:
def calculate_discount(price, coupon_code):
    print(f"Calculating discount for price={price}, coupon={coupon_code}")
    if coupon_code == "SAVE20":
        discount = price * 0.20
        print(f"Applied 20% discount: {discount}")
        return discount
    print("No discount applied")
    return 0.0
This is fine. The problem emerges when you're tracing through unfamiliar codebases with complex execution flows:
# You're trying to understand how user permissions are checked

# File: views.py
def create_post(request):
    print("=== CREATE POST VIEW ===")
    print(f"User: {request.user}")
    # ... 50 lines later, in a different file

# File: middleware.py
class PermissionMiddleware:
    def process_view(self, request, view_func, view_args, view_kwargs):
        print(f"[MIDDLEWARE] Checking permissions for {view_func.__name__}")
        print(f"[MIDDLEWARE] User groups: {request.user.groups.all()}")
        # Wait, why isn't this printing? Buffering? Wrong execution path?

# File: models.py
class Post(models.Model):
    def save(self, *args, **kwargs):
        print(f"[MODEL] Saving post, user={self.author}")
        # This prints AFTER the view returns? What?

# File: signals.py
@receiver(post_save, sender=Post)
def notify_followers(sender, instance, created, **kwargs):
    print("[SIGNAL] In notify_followers")
    # This isn't printing at all...
After thirty minutes, your terminal looks like this:
System check identified no issues (0 silenced).
December 03, 2024 - 15:23:45
Django version 4.2, using settings 'myproject.settings'
Starting development server at http://127.0.0.1:8000/
Quit the server with CONTROL-C.
=== CREATE POST VIEW ===
User: alice@example.com
[MIDDLEWARE] Checking permissions for create_post
[MIDDLEWARE] User groups: <QuerySet []>
[MODEL] Saving post, user=alice@example.com
[HTTP] "POST /posts/create/ HTTP/1.1" 302 0
=== CREATE POST VIEW ===
User: alice@example.com
[MIDDLEWARE] Checking permissions for create_post
Notice the problems:
- The signal handler never printed (or did it? Maybe it errored silently?)
- The middleware printed twice (why?)
- You have no idea about the 20 other functions that might have executed
- The output is already hard to parse, and you've only instrumented 4 locations
- Each server restart means reconfiguring your mental model from scratch
This is the key insight: Print debugging doesn't scale to understanding execution flow through complex systems. It's archaeology because you're excavating one layer at a time, and each new print statement you add obscures the previous context. You're constantly fighting against:
- Output interleaving: Multiple threads/processes mix their output
- Lost context: You can't see local variables in functions you didn't instrument
- Execution order confusion: Async operations, signals, and callbacks make timing unclear
- The modification burden: Every hypothesis requires editing multiple files and restarting
When Print Debugging Actually Works
Let me be clear: print debugging isn't evil. It's perfectly appropriate for:
- Debugging isolated functions where you control the inputs and the execution is synchronous
- Quick sanity checks like "Does this code path execute at all?"
- Production debugging where you can't attach a debugger (though structured logging is better)
- Scripts and data processing where you're transforming data through clear stages
Here's an example where print debugging is actually the right choice:
# You're debugging a data transformation script
def process_customer_data(csv_path):
    customers = load_csv(csv_path)
    print(f"Loaded {len(customers)} customers")

    valid_customers = [c for c in customers if validate_email(c['email'])]
    print(f"Filtered to {len(valid_customers)} valid customers")
    print(f"Rejected emails: {[c['email'] for c in customers if c not in valid_customers][:5]}")

    enriched = enrich_with_demographics(valid_customers)
    print(f"Enriched {len(enriched)} customers with demographics")
    return enriched
This works because:
- Execution is linear and synchronous
- You're primarily tracking data transformations, not control flow
- The printed information directly answers your question
- You can run this repeatedly with different inputs easily
When Debuggers Win
Now contrast that with trying to understand a web framework's request handling:
# You're trying to understand Django's authentication flow
# Using print statements:
def login_view(request):
    print("1. In login_view")
    print(f"2. Request method: {request.method}")
    if request.method == "POST":
        print("3. POST request received")
        form = AuthenticationForm(request, data=request.POST)
        print(f"4. Form created: {form}")
        if form.is_valid():
            print("5. Form is valid")
            user = form.get_user()
            print(f"6. Got user: {user}")
            # ... but what happens inside django.contrib.auth.login()?
            # And what middleware runs?
            # And how does the session get created?
You'd need dozens of print statements across multiple files you don't even own (Django's source code). Instead, with a debugger:
- Set a breakpoint at login_view
- Step into form.is_valid() to see the validation logic
- Step into login(request, user) to watch session creation
- Examine the call stack at any point to see exactly how you got there
- Inspect all local variables without adding any print statements
- See middleware execution automatically in the call stack
The debugger gives you a complete execution theater where you can pause time, rewind, inspect, and explore—all without modifying a single line of code.
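If you work in the terminal rather than an IDE, Python's built-in pdb gives you the same interactive access. A minimal sketch of the workflow, using the login_view example above:

def login_view(request):
    breakpoint()  # Python 3.7+: drops you into pdb right here
    ...

# At the (Pdb) prompt, the commands you need most:
#   n           run the next line in this function
#   s           step INTO the next call (e.g., form.is_valid())
#   w           print the call stack (middleware, view, and everything in between)
#   p request   print any local variable (here, the incoming request)
#   c           continue running until the next breakpoint

One session like this replaces every print statement you would otherwise have scattered across files you don't own.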
The Console.log Hell Phenomenon
JavaScript developers face an even more pernicious version of this trap. Because browser DevTools make console.log() so frictionless, it's easy to fall into what the community calls "console.log hell":
function handleCheckout(cartItems) {
  console.log("Starting checkout", cartItems);
  const total = calculateTotal(cartItems);
  console.log("Total calculated:", total);
  const discount = applyDiscount(total, user.couponCode);
  console.log("Discount applied:", discount);

  validatePaymentMethod(user.paymentMethod)
    .then((result) => {
      console.log("Payment validated:", result);
      return processPayment(total - discount);
    })
    .then((payment) => {
      console.log("Payment processed:", payment);
      return createOrder(cartItems, payment);
    })
    .then((order) => {
      console.log("Order created:", order);
      // Wait, why didn't this run?
    })
    .catch((error) => {
      console.log("ERROR:", error);
      // Which step failed? What was the state?
    });
}
After adding these logs, you refresh the browser and see:
Starting checkout [{...}, {...}]
Total calculated: 149.99
Discount applied: 14.99
Payment validated: {status: 'valid', ...}
ERROR: PaymentGatewayError: Card declined
You still don't know:
- What happened inside processPayment()?
- What was the cart state when the error occurred?
- What network requests fired?
- What other code might have modified the cart?
Chrome DevTools with breakpoints would show you:
- The exact line where the error occurred
- The call stack showing how you got there
- All local variables at the moment of failure
- Network requests correlated with code execution
- The ability to step backwards (with Performance recordings)
Notice this carefully: The difference isn't that debuggers are "better"—it's that print statements force you to predict what information you'll need before running the code. Debuggers let you explore interactively once you're already inside the failing execution. This is the crucial distinction.
The Transition Rule
Here's a practical rule for when to abandon print debugging:
If you've added more than 5 print statements, or you've restarted your program more than 3 times to add new prints, stop immediately and switch to a debugger.
This is the warning sign that you're trying to explore, not just verify. Exploration requires interactive tools.
A Better Alternative: Strategic Logging
If you can't or won't use a debugger, structured logging is vastly superior to print statements:
import logging
import structlog

logger = structlog.get_logger(__name__)

def process_payment(order):
    logger.info("payment.started",
                order_id=order.id,
                amount=order.total,
                user_id=order.user.id)
    try:
        result = payment_gateway.charge(order.total, order.user.payment_method)
        logger.info("payment.succeeded",
                    order_id=order.id,
                    transaction_id=result.transaction_id)
        return result
    except PaymentError as e:
        logger.error("payment.failed",
                     order_id=order.id,
                     error_type=type(e).__name__,
                     error_message=str(e))
        raise
This is better because:
- Logs persist across runs—you can analyze patterns over time
- Structured data is searchable and filterable
- Log levels separate debug exploration from production monitoring
- You can enable/disable logging without code changes
- Correlation IDs link related operations across services
But even this is still inferior to debuggers for understanding execution flow in development. Logging is for production observability and historical analysis. Debuggers are for interactive exploration and learning.
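To make the "no code changes" point concrete, here is a minimal sketch using only the standard library. The LOG_LEVEL environment variable is a naming assumption for illustration, not a convention your project necessarily uses:

import logging
import os

# Verbosity comes from the environment, so switching between quiet
# production output and chatty debug output never touches the code.
level_name = os.environ.get("LOG_LEVEL", "WARNING")
logging.basicConfig(level=getattr(logging, level_name.upper(), logging.WARNING))

logger = logging.getLogger(__name__)
logger.debug("Only shown when LOG_LEVEL=DEBUG")
logger.info("Shown at INFO and below")
logger.warning("Shown at the default level")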
Sin 2: Modification Without Version Control
The Scenario: You're exploring an unfamiliar React component to understand why it re-renders constantly. You add console.log statements. Then you add a temporary useEffect hook to log props changes. You comment out a few lines that you think might be causing issues. You add a test button to trigger the component directly. Two hours later, you've figured out the problem, but now you have a different problem: you can't remember all the changes you made. You use Ctrl+Z repeatedly, hoping to undo back to the original state. You accidentally undo too far and lose your actual fix. You spend another 30 minutes reconstructing what you changed.
This is modification without version control—treating your codebase as a scratchpad for exploration without creating restore points. It feels faster in the moment to "just add a quick console.log" without committing, but it creates cascading problems:
The Danger of "Temporary" Instrumentation
Let's be honest: nothing is more permanent than a temporary solution. Here's what actually happens with "temporary" debug code:
# Day 1: "I'll just add this temporarily to understand the flow"
def process_order(order_id):
    print(f"DEBUG: Processing order {order_id}")  # TODO: Remove this
    order = Order.objects.get(id=order_id)
    print(f"DEBUG: Order status = {order.status}")  # TODO: Remove
    if order.status == "pending":
        # Temporarily disabled to test the flow
        # send_confirmation_email(order)
        process_payment(order)
# Day 3: You've forgotten about the debug prints
# Day 7: Another developer sees your code in a PR
# "Why are there debug prints in production code?"
# Day 14: The confirmation email bug is reported
# "Orders aren't sending confirmation emails"
# You spend an hour debugging before realizing you commented out the email call
# Day 30: The debug prints are still there
# They've now been copied into two other functions by developers
# who thought they were intentional logging
This is the key insight: "Temporary" instrumentation becomes permanent because:
- You forget to remove it
- You're afraid to remove it (what if that breaks something?)
- Other developers don't know it's temporary
- The commented-out code creates ambiguity about intended behavior
Even worse, this pattern trains you to be careless with code modifications, which eventually leads to:
- Accidentally committing debug code to main branches
- Breaking production because you forgot to re-enable commented code
- Confusing teammates who don't know what's intentional
- Losing actual fixes mixed in with debug scaffolding
Git Stash Workflows for Exploration
Here's the professional approach to exploratory code changes. When you need to modify code to understand it, create a clear boundary between exploration and production work:
Pattern 1: The Stash-Explore-Clean Cycle
# You're about to add debug instrumentation to explore a bug
# First, make sure your working directory is clean
git status
# If you have uncommitted work, commit it or stash it separately,
# so the last commit is your restore point

# Now add all your debug prints, test hooks, etc.
# Edit freely; a clean checkout is one command away

# After you understand the issue, note your findings,
# then throw the exploration away and restore the clean state
git restore .   # or: git stash push -m "debug instrumentation" && git stash drop

# Now implement the actual fix cleanly
This works well for quick exploration, but it has a weakness: anything valuable you discover during exploration gets thrown away along with the debug scaffolding. A better approach:
Pattern 2: Exploration Branches
# Before modifying code to trace execution
git checkout -b trace/understanding-order-flow
# Now add all your instrumentation
# Edit debug/test_order_flow.py (create new files for test harnesses)
git add -A
git commit -m "Add debug instrumentation for order flow
- Added logging to process_order(), validate_payment(), send_confirmation()
- Created test script to trigger edge case with expired coupons
- Temporarily disabled email sending to isolate payment flow
"
# Continue exploring and committing your debug changes
# Each commit documents what you learned:
git commit -m "Discovered payment validation happens in middleware, not view"
# When you understand the problem, switch back
git checkout main
# Create a new branch for the actual fix
git checkout -b fix/order-payment-validation
# Implement the fix cleanly, without debug code
# You can reference the trace branch to remember what you learned
# After the fix is merged, delete the exploration branch
git branch -D trace/understanding-order-flow
Notice carefully what this pattern gives you:
- Preserves all your exploration work as documentation
- Separates "understanding the system" from "changing the system"
- Lets you share exploration commits with teammates ("Here's how I figured this out")
- Gives you perfect rollback at any point
- Makes it impossible to accidentally commit debug code to production
Using Feature Branches for Tracing Experiments
Sometimes your exploration is more extensive—you want to try multiple instrumentation approaches or test different hypotheses. Use feature branches with descriptive names:
# You're trying to understand why a Celery task is slow
git checkout -b trace/celery-task-performance
# Try approach 1: Add timing logs
# Edit tasks.py, add import time, time.time() calls
git commit -m "Approach 1: Manual timing logs"
# Test it, discover it's not granular enough
# Try approach 2: Use line_profiler
pip install line_profiler
# Add @profile decorators
git commit -m "Approach 2: line_profiler on process_batch_task"
# This reveals the bottleneck: database queries in a loop
# Try approach 3: Add Django Debug Toolbar to Celery worker
git commit -m "Approach 3: DDT in worker (requires custom config)"
# Now you have a complete exploration history
git log --oneline
# 3a7f9c1 Approach 3: DDT in worker (requires custom config)
# 8d2e4b3 Approach 2: line_profiler on process_batch_task
# 1f9a8c7 Approach 1: Manual timing logs
# Document your findings in the final commit
git commit -m "FINDINGS: N+1 query in process_batch_task
The task calls select_related() but not prefetch_related(),
causing 500+ individual queries for related objects.
Solution: Add prefetch_related('attachments', 'comments')
to the initial queryset.
"
# Now create a clean fix branch
git checkout main
git checkout -b fix/celery-n-plus-one-query
# Implement just the fix, with no debug code
The Benefits of This Approach:
- Your exploration becomes documentation: Future developers can see how you diagnosed the problem
- Experiments don't pollute main: Your git log stays clean
- You can resume exploration later: If the fix doesn't work, you have your instrumentation ready
- Teammates can reproduce your investigation: "Check out the trace/celery-task-performance branch to see how I debugged this"
- You never lose important work: Everything is committed and recoverable
A Practical Example: Comparing Approaches
Let's see this in action with a real scenario—understanding Django's form validation flow:
Bad Approach (No Version Control):
# You edit views.py directly
def submit_feedback(request):
    print("=== SUBMIT FEEDBACK ===")  # Added line 1
    print(f"Method: {request.method}")  # Added line 2
    if request.method == "POST":
        form = FeedbackForm(request.POST)
        print(f"Form errors: {form.errors}")  # Added line 3
        # print(f"Cleaned data: {form.cleaned_data}")  # This errored, left commented
        if form.is_valid():
            # Temporarily commented to test validation
            # feedback = form.save(commit=False)
            # feedback.user = request.user
            # feedback.save()
            print("Form valid!")  # Added line 4
            return redirect("feedback_list")
After 20 minutes of exploration, your file is a mess. You've made fixes mixed with debug code. You're afraid to undo because you might lose the fix. When you do git diff, it's chaos.
Good Approach (Exploration Branch):
git checkout -b trace/feedback-form-validation
# First commit: Add basic logging
# Edit views.py
git commit -m "Add logging to feedback submission flow"
# Second commit: Test form validation
# Modify form, add test data
git commit -m "Test form validation with invalid data"
# Third commit: Temporarily disable save to isolate validation
git commit -m "Disable save() to test validation in isolation"
# Fourth commit: Document findings
git commit -m "FINDINGS: EmailField validation rejects + in addresses
The form uses Django's EmailField, which validates using EmailValidator.
EmailValidator rejects addresses like 'user+tag@example.com' because
the default regex doesn't allow + in the local part.
Solution: Use custom validator or update regex in forms.py
"
# Now create clean fix
git checkout main
git checkout -b fix/email-validation-plus-sign
# Implement clean solution with no debug code
Your git log now shows a clear trail of investigation separate from the fix. Anyone reviewing your PR sees only the clean solution, but they can reference the trace branch to understand your reasoning.
The Version Control Rule for Exploration
Here's the rule you should tattoo on your hand (metaphorically):
Before adding ANY debug code, console.logs, commented lines, or test harnesses, create an exploration branch or stash your clean state. No exceptions.
This takes 5 seconds and saves hours of cleanup and confusion. Make it automatic:
# Add this function to your ~/.bashrc or ~/.zshrc
# (a plain alias can't place the description after the timestamp)
trace() { git checkout -b "trace/$(date +%Y%m%d-%H%M%S)-$1"; }

# Usage:
$ trace celery-debug
# Creates: trace/20241203-153045-celery-debug
Now your muscle memory can be: "Need to explore? Type trace <description>, then hack freely."
Sin 3: Premature Optimization Profiling
The Scenario: You've just joined a project and need to add a feature to the user authentication flow. Before understanding how the code works, you run a profiler because you heard "performance matters." You spend two hours analyzing a cProfile report, discovering that bcrypt.hashpw() takes 95% of execution time during login. You research faster hashing algorithms. You propose switching to Argon2. Your tech lead says: "We hash passwords once per login. This is intentional security. Did you understand the auth flow yet?" You haven't—you profiled before understanding.
This is premature optimization profiling—using performance tools when your actual goal is understanding what the code does, not how fast it runs. The confusion arises because profilers seem like they're showing execution flow, but they're actually answering a completely different question.
Tracing for Understanding vs. Tracing for Performance
Let's be crystal clear about the distinction:
Understanding questions (what debuggers and execution tracers answer):
- What code actually executes when I submit this form?
- In what order do these functions run?
- How does data flow from the view to the model?
- Why does this user get redirected to the admin page?
- Which middleware processes this request?
Performance questions (what profilers answer):
- Which functions consume the most CPU time?
- How many times is this function called?
- What's the memory footprint of this operation?
- Where are the bottlenecks in my hot path?
These are fundamentally different questions that require different tools. Using a profiler to understand execution flow is like using a telescope to read a book—technically possible, but wildly inefficient and frustrating.
The Wrong Question: "How Fast Is This?" vs. "What Does This Do?"
Here's a concrete example. You're trying to understand a Django view that creates user accounts:
def create_account(request):
    if request.method == "POST":
        form = UserCreationForm(request.POST)
        if form.is_valid():
            user = form.save()
            profile = Profile.objects.create(user=user)
            send_welcome_email(user)
            return redirect("dashboard")
    else:
        form = UserCreationForm()
    return render(request, "accounts/create.html", {"form": form})
If you run a profiler, you'll get output like:
127 function calls in 0.234 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.234 0.234 views.py:45(create_account)
1 0.012 0.012 0.187 0.187 hashers.py:67(make_password)
1 0.175 0.175 0.175 0.175 {method 'hashpw' of 'bcrypt'}
1 0.003 0.003 0.024 0.024 smtp.py:112(send_mail)
1 0.001 0.001 0.018 0.018 db/models.py:234(save)
15 0.002 0.000 0.015 0.001 {method 'execute' of 'sqlite3.Cursor'}
You learn that password hashing is slow, but you don't learn:
- That Django signals fire after user.save() to create a default user profile
- That the Profile.objects.create() call is actually redundant because a signal already created one
- That form.save() actually calls three different model save methods internally
- What data validation happens in form.is_valid()
- Why the welcome email sometimes isn't sent (an exception you haven't seen yet)
If you use a debugger instead, you:
- Set a breakpoint at if form.is_valid():
- Step into is_valid() to see the validation logic
- Step over form.save() and inspect user to see the created object
- Notice the signal receiver in the call stack
- Step into Profile.objects.create() and realize it fails with IntegrityError because the profile already exists
- Discover this has been silently caught and ignored
The debugger shows you what happens, which is what you need to understand the code. The profiler shows you how long things take, which is irrelevant until you know what's happening.
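For reference, the kind of receiver the debugger would reveal in that call stack looks something like this; a minimal sketch, assuming a Profile model with a one-to-one link to User:

from django.contrib.auth.models import User
from django.db.models.signals import post_save
from django.dispatch import receiver

from myapp.models import Profile  # your app's profile model (illustrative path)

@receiver(post_save, sender=User)
def create_default_profile(sender, instance, created, **kwargs):
    # Fires automatically after user.save() inside form.save(), which is why
    # the explicit Profile.objects.create(user=user) in the view then
    # collides with an already existing row.
    if created:
        Profile.objects.create(user=instance)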
When to Profile: After Understanding, Not Before
Here's the proper workflow:
Phase 1: Understanding (Day 1-2)
- Use debuggers to trace execution
- Use framework tools (Django Debug Toolbar) to see queries and middleware
- Build a mental model: "This view does X, then Y, then Z"
- Document the flow: "Request → middleware → view → form validation → save → signals → response"
Phase 2: Implementation (Day 3-5)
- Make your feature changes
- Ensure correctness
- Write tests
Phase 3: Performance Analysis (Day 6+, if needed)
- NOW profile if something feels slow
- Use profiling to identify bottlenecks in code you understand
- Optimize based on data, not assumptions
This is crucial: You can't optimize code you don't understand. If you discover a function takes 2 seconds, you need to know:
- What does this function do?
- Is this expected behavior or a bug?
- Is this even in the hot path for my use case?
- What dependencies does it have that might be causing the slowness?
All of these questions require understanding first, performance analysis second.
A Real Example: The N+1 Query Trap
Here's where premature profiling particularly fails. You're exploring a blog listing page:
def post_list(request):
    posts = Post.objects.all()[:20]
    return render(request, "posts/list.html", {"posts": posts})

# Template: list.html
{% for post in posts %}
  <h2>{{ post.title }}</h2>
  <p>By {{ post.author.name }}</p>  <!-- N+1 query here -->
  <p>{{ post.comments.count }} comments</p>  <!-- And here -->
{% endfor %}
If you profile this, you see:
500 function calls in 1.245 seconds
ncalls tottime percall cumtime percall filename:lineno(function)
420 0.156 0.000 0.892 0.002 db/backends/sqlite3/base.py:234(execute)
20 0.089 0.004 0.234 0.012 models.py:456(__getattribute__)
You learn that many database queries are slow, but you don't see:
- That each post.author.name triggers a separate query (20 queries)
- That each post.comments.count triggers another query (20 more)
- That these could be eliminated with select_related('author') and prefetch_related('comments')
If you use Django Debug Toolbar instead, you see:
42 queries in 1.2 seconds
DUPLICATE QUERIES (40):
SELECT * FROM auth_user WHERE id = 1 (20 times)
SELECT COUNT(*) FROM comments WHERE post_id = 1 (20 times)
RECOMMENDATIONS:
Consider using select_related('author')
Consider using prefetch_related('comments')
The Debug Toolbar shows you what's happening (N+1 queries) and how to fix it (use select_related). This is understanding-focused tooling that happens to show performance implications.
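Acting on those recommendations is a small change in the view; a sketch of the fixed queryset:

def post_list(request):
    # Author rows arrive via a JOIN and all comments are prefetched in one
    # extra query, instead of two extra queries per post.
    posts = (
        Post.objects
        .select_related("author")
        .prefetch_related("comments")[:20]
    )
    return render(request, "posts/list.html", {"posts": posts})

Depending on your Django version, the template's post.comments.count may still issue COUNT queries; switching it to post.comments.all|length (or annotating the count in the view) keeps everything inside the prefetched data.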
The "Just Checking Performance" Trap
Developers often fall into profiling prematurely because they think: "I'll just run a quick profile to see where the slowness is." This seems reasonable, but it leads to:
# You run: python -m cProfile -s cumtime manage.py runserver
# You see output... lots of output... 5,000 lines of function calls
# You pipe to a file: python -m cProfile -s cumtime manage.py runserver > profile.txt
# You open the file and try to make sense of it
# 30 minutes later, you've learned:
# - Django's startup does a lot of imports (not relevant)
# - Jinja templates compile slowly (not your code)
# - Some function in the ORM takes 0.003 seconds (so what?)
# What you haven't learned:
# - How the request flows through your application
# - What your code actually does
# - Where to make changes
This is the key insight: Profilers show you trees when you need to understand the forest. They're detail-oriented tools that require you to already know what you're looking for.
When Profiling IS Appropriate
Let me be clear: profiling is essential—at the right time. Use profilers when:
Scenario 1: You understand the code and have a performance problem
# You've built a data export feature
# Users report it takes 5 minutes for large exports
# You understand the code flow: query DB → transform → write CSV
# NOW profile to find the bottleneck:
from cProfile import Profile
profiler = Profile()
profiler.enable()
export_user_data(user_id=12345) # Known slow case
profiler.disable()
profiler.print_stats(sort='cumtime')
# Results show: 95% of time in pandas.DataFrame.apply()
# You know exactly what that does and can optimize it
Scenario 2: You've optimized and want to validate improvements
# Before optimization:
# 1000 rows exported in 45 seconds
# After adding vectorized operations:
# 1000 rows exported in 3 seconds
# Profile again to confirm and find next bottleneck
Scenario 3: You're analyzing production performance issues
# Production APM shows 95th percentile latency spike
# You know the code; now you need to find what changed
# Use production profiling (py-spy) to sample live traffic
In all these cases, you already understand what the code does. Profiling answers "where's the slowness?" not "what does this do?"
The Decision Tree
Here's a simple test:
Ask yourself: "Can I explain what this code does in plain English?"
- No: Use a debugger or execution tracer. Don't touch profilers yet.
- Yes, but it's slow: NOW profile.
If you find yourself staring at profiler output and thinking "I don't know what this function even does," stop immediately and switch to a debugger.
Profiling Antipattern Example
Finally, here's a perfect illustration of premature optimization profiling in action. A developer was exploring a Celery task:
@task
def process_uploaded_csv(file_path):
    # What does this actually do? Let's profile it!
    import cProfile
    profiler = cProfile.Profile()
    profiler.enable()

    # ... 200 lines of complex CSV processing ...

    profiler.disable()
    profiler.print_stats()
They spent three hours analyzing the profile output, discovering that pandas.read_csv() was slow. They researched faster CSV parsers. They proposed rewriting the task to use Dask for parallel processing.
Then someone asked: "What does this task actually do? What's the business logic?"
They couldn't answer. They had profiled the code without understanding it. When they finally stepped through with a debugger, they discovered:
- The task was supposed to process customer order data
- It was calling an external API for each row (the actual bottleneck, hidden in the profile as "socket.recv")
- The pandas operations were fast; the network I/O was slow
- The "solution" wasn't a faster CSV parser—it was batching API calls
This is crucial: The profiler showed them numbers, but not meaning. The debugger showed them the business logic, which revealed the real problem.
The lesson: Never profile code you don't understand. Profilers are for optimization, not exploration.
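For completeness, the shape of that real fix is simple. A sketch, assuming a hypothetical client object (api) that exposes a bulk lookup endpoint:

def enrich_rows(rows, batch_size=100):
    # One network round trip per batch of rows instead of one per row.
    enriched = []
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        results = api.lookup_batch([row["customer_id"] for row in batch])  # hypothetical bulk call
        for row, result in zip(batch, results):
            enriched.append({**row, **result})
    return enriched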
Sin 4: Tool Overengineering
The Scenario: You're trying to trace execution through a Flask application to understand how authentication works. You search for "python trace execution" and find ast.NodeTransformer. You think: "I could write an AST transformer that automatically instruments every function entry and exit!" You spend a day building it. It breaks on async functions. You fix that. It doesn't preserve line numbers for debugging. You fix that. After three days, you have a 500-line instrumentation system. Then a colleague shows you Flask-DebugToolbar, which does exactly what you needed in 5 minutes of setup.
This is tool overengineering—building custom solutions to problems that have been solved by the community dozens of times. It's seductive because writing tools feels like productive engineering work, but it's actually procrastination disguised as productivity.
Building Custom Solutions to Solved Problems
Let's examine the psychology of why this happens. You need to understand execution flow in an unfamiliar codebase. Your brain presents you with two paths:
Path A: Find and learn existing tools
- Search for "[framework] debugging tools"
- Read documentation
- Install a tool
- Learn how to use it
- Feel like a beginner who doesn't know things
Path B: Build your own tool
- Apply your existing programming skills
- Create something elegant and customized
- Learn about ASTs, decorators, or metaprogramming
- Feel clever and productive
- "This will only take a few hours"
Path B feels better in the moment. It lets you stay in your comfort zone (writing code) instead of entering the uncomfortable zone (learning new tools). But Path B is almost always the wrong choice for execution tracing.
The AST/Tree-sitter Trap Revisited
This pattern is so common in execution tracing that it deserves special attention. Developers discover Python's ast module or Tree-sitter for other languages and think: "I can automatically instrument any codebase!"
Here's how it typically starts:
# "I'll just write a quick script to add print statements to every function"
import ast
import inspect
class FunctionTracer(ast.NodeTransformer):
def visit_FunctionDef(self, node):
# Add print at function entry
print_call = ast.Expr(
value=ast.Call(
func=ast.Name(id='print', ctx=ast.Load()),
args=[ast.Constant(value=f"Entering {node.name}")],
keywords=[]
)
)
node.body.insert(0, print_call)
return node
# "This is so elegant! I'll just transform the source and..."
Three hours later, you're debugging why your transformer breaks on:
- Type hints
- Decorators
- Async functions
- Context managers
- Nested function definitions
- Lambda expressions
- Generator expressions
Six hours later, you've solved most of those, but now:
- Your line numbers don't match the original source
- The debugger can't set breakpoints correctly
- Stack traces are confusing
- You can't handle dynamic imports
- Third-party code isn't instrumented
Two days later, you have a fragile system that:
- Requires running a preprocessing step before every execution
- Breaks when dependencies update
- Confuses new team members
- Still doesn't handle edge cases
- Would take weeks to make production-ready
Meanwhile, if you had used sys.settrace(), you'd have a working solution in 20 minutes:
import sys

def trace_calls(frame, event, arg):
    if event == 'call':
        code = frame.f_code
        print(f"Calling {code.co_name} in {code.co_filename}:{frame.f_lineno}")
    return trace_calls

# Enable tracing
sys.settrace(trace_calls)

# Your code here
process_user_request(user_id=123)

# Disable tracing
sys.settrace(None)
This works immediately with:
- All Python constructs
- Third-party libraries
- Async code
- Correct line numbers
- No preprocessing required
- No maintenance burden
Or better yet, if you had installed Django Debug Toolbar, you'd see the entire request flow in a web UI with zero custom code.
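One practical refinement if you do stay with sys.settrace() for a while: filter the callback to your own files, otherwise library internals flood the terminal. A sketch (the PROJECT_ROOT path is a placeholder):

import sys

PROJECT_ROOT = "/path/to/your/project"  # placeholder; point this at your code

def trace_project_calls(frame, event, arg):
    # Only report calls that originate from files inside your project.
    if event == 'call' and frame.f_code.co_filename.startswith(PROJECT_ROOT):
        code = frame.f_code
        print(f"Calling {code.co_name} in {code.co_filename}:{frame.f_lineno}")
    return trace_project_calls

sys.settrace(trace_project_calls)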
The "Not Invented Here" Syndrome in Tooling
There's a specific cognitive bias at play here called "Not Invented Here" (NIH) syndrome. When you're facing a problem, there's psychological satisfaction in solving it yourself rather than using someone else's solution. In tooling, this manifests as:
Red flags that you're experiencing NIH:
- "I could build that in a weekend" — Maybe, but why? The existing tool was built over months and handles edge cases you haven't thought of.
- "But it doesn't do exactly what I need" — Are you sure? Have you read the documentation thoroughly? Have you asked the community if there's a way to extend it?
- "I want to learn how it works" — Noble goal, but do you need to learn right now, or do you need to solve your actual problem? Learning by reading source code is fine; learning by reimplementing from scratch is usually procrastination.
- "This will be more lightweight/faster/elegant" — Performance and elegance don't matter if you spend 10x longer building and maintaining your tool than using an existing one.
- "I don't want to add a dependency" — One well-maintained dependency is better than 500 lines of custom code that you have to maintain forever.
A Real-World Example: The Custom Django Tracer
I've seen this pattern play out repeatedly. Here's a composite of several real incidents:
Week 1: Developer needs to understand Django's middleware execution order for a security audit.
# "I'll write a decorator to trace middleware calls"
def trace_middleware(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        print(f"→ Middleware: {func.__name__}")
        result = func(*args, **kwargs)
        print(f"← Middleware: {func.__name__}")
        return result
    return wrapper
# Then manually add @trace_middleware to each middleware class
Problem: Django has 15+ built-in middleware classes plus third-party ones. Decorating them all is tedious and fragile.
Week 2: "I'll use metaclasses to auto-decorate all middleware!"
class TracedMiddlewareMeta(type):
    def __new__(mcs, name, bases, namespace):
        for attr_name, attr_value in namespace.items():
            if callable(attr_value) and not attr_name.startswith('_'):
                namespace[attr_name] = trace_middleware(attr_value)
        return super().__new__(mcs, name, bases, namespace)
# Now inject this metaclass into middleware classes...
Problem: This requires modifying Django's source or monkey-patching, which breaks on updates.
Week 3: "I'll use import hooks to intercept middleware imports and modify them at runtime!"
import sys
from importlib.abc import MetaPathFinder, Loader
from importlib.util import spec_from_loader

class MiddlewareTracer(MetaPathFinder, Loader):
    def find_spec(self, fullname, path, target=None):
        if 'middleware' in fullname:
            return spec_from_loader(fullname, self)
        return None

    def exec_module(self, module):
        # Modify the module's classes...
        pass

sys.meta_path.insert(0, MiddlewareTracer())
Problem: This is extremely fragile, hard to debug, and breaks in non-obvious ways.
Week 4: Team lead asks: "Why not just use Django Debug Toolbar?"
# settings.py
INSTALLED_APPS = [
    ...
    'debug_toolbar',
]
MIDDLEWARE = [
    'debug_toolbar.middleware.DebugToolbarMiddleware',
    ...
]
Result: Complete middleware execution visualization in the web UI, including:
- Execution order
- Time taken for each
- SQL queries per middleware
- Cache hits/misses
- Template rendering
Time to working solution: 5 minutes.
Time spent on custom solution: 4 weeks.
Maintenance burden of custom solution: Ongoing.
Maintenance burden of Debug Toolbar: Zero (community-maintained).
Notice carefully: The custom solution wasn't technically impossible—it was just a massive waste of time solving a problem that's already solved.
How to Recognize Tool Overengineering
Here's a checklist. If you answer "yes" to more than two of these, stop and find an existing tool:
- [ ] You've spent more than 2 hours building your instrumentation
- [ ] Your solution requires modifying the runtime environment (import hooks, metaclasses, AST transformation)
- [ ] You're handling edge cases specific to the language or framework
- [ ] You're thinking about "What if we need to extend this later?"
- [ ] You haven't searched "[framework name] execution tracing" or "[language] profiling tools"
- [ ] Your code is more than 50 lines
- [ ] You're excited about the elegance of your solution rather than solving the actual problem
- [ ] Someone asks "Does X tool do this?" and you haven't checked
The Search-First Protocol
Here's the process you should follow before building any tracing tool:
Step 1: Define the actual problem (30 seconds)
- Write down: "I need to [specific goal] in [specific context]"
- Example: "I need to see which functions execute when I submit this Django form"
Step 2: Search for existing solutions (5 minutes)
Google: "django trace function execution"
Google: "django debugging tools"
Google: "python execution flow visualization"
Check: Awesome lists (awesome-django, awesome-python)
Check: Framework documentation
Step 3: Evaluate the top 3 results (15 minutes)
- Install the tool
- Try the basic example
- Check if it solves your problem
- If not, try the next one
Step 4: Only if all else fails, consider building (but ask first!)
- Post on Stack Overflow: "How do I trace X in Y?"
- Ask on the framework's Discord/Slack
- Check framework issues for similar feature requests
Estimate: 80% of the time, Step 2 finds a solution. 15% of the time, Step 3 does. 5% of the time, you actually need custom tooling.
When Custom Tools Are Justified
Let me be clear: custom instrumentation isn't always wrong. It's appropriate when:
1. You have unique production requirements
# You need distributed tracing across custom services
# that don't speak HTTP (e.g., custom protocol over ZeroMQ)
# No existing tool handles this
2. You're building a product/framework
# You're creating a web framework and need to provide
# debugging tools for your users
# This is your core product, not a side quest
3. You need very specific metrics
# You need to track domain-specific metrics
# Example: "How many times does our pricing algorithm
# switch between calculation methods for each user?"
# This is business logic, not generic tracing
4. You've exhausted existing tools and documented why
# You've tried: debuggers, Debug Toolbar, py-spy, sys.settrace
# None of them work because [specific technical reason]
# You've asked the community and confirmed no solution exists
# You've documented this decision and the alternatives considered
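For the "very specific metrics" case (3) above, the minimum viable tool is usually tiny. A sketch with purely illustrative names (record_pricing_method and the metric it tracks are hypothetical, not from any real codebase):

import logging
from collections import Counter

logger = logging.getLogger(__name__)
pricing_method_switches = Counter()  # in-memory tally, per process

def record_pricing_method(user_id, method):
    # Domain-specific instrumentation: count and log which calculation
    # method the pricing algorithm picked for each user.
    pricing_method_switches[(user_id, method)] += 1
    logger.info("pricing.method_used user_id=%s method=%s", user_id, method)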
In these cases, build the minimum tool that solves your problem. We'll cover this in Section 7.10 "Building Custom Instrumentation (When Justified)."
The Ego Trap
Finally, let's talk about the emotional component. Building tools feels good because:
- It demonstrates technical sophistication
- It creates something that's "yours"
- It lets you procrastinate on the harder work of understanding unfamiliar code
- It gives you something to show in code review
Using existing tools feels less impressive:
- "I just installed Django Debug Toolbar" sounds simple
- There's no clever code to show off
- You might feel like you're not really solving the problem
This is crucial: Your job is to solve problems efficiently, not to demonstrate cleverness. The developer who installs Debug Toolbar and understands the codebase in 30 minutes is more valuable than the developer who spends 3 days building a custom tracer.
Ego-driven tooling is one of the most expensive forms of technical debt. It creates:
- Code that only you understand
- Dependencies that break when you're not around
- Maintenance burden for the entire team
- Onboarding friction for new developers
Problem-driven tooling (using existing solutions) creates:
- Shared understanding across the team
- Community support and updates
- Onboarding that's as simple as "Read the Debug Toolbar docs"
- Time to focus on actual business problems
Choose problem-driven tooling. Your future self (and your teammates) will thank you.
Sin 5: Ignoring Framework Tools
The Scenario: You're debugging a React application and trying to understand why a component re-renders 47 times on page load. You add console.log statements in useEffect hooks. You add counters. You create a custom hook to track render count. You spend two hours building a re-render detector. Then someone shows you the React DevTools Profiler, which shows you the entire render timeline with flame graphs, component trees, and the exact props that changed. It's been built into your browser the entire time.
This is ignoring framework tools—solving problems from scratch when the framework maintainers have built purpose-specific debugging tools that encode years of community knowledge. It's particularly frustrating because these tools are often invisible until someone points them out.
Every Mature Framework Has Debugging Tools
This is not an exaggeration. Every mature framework has dedicated debugging tools. Here's a non-exhaustive list:
Python Web Frameworks:
- Django → Django Debug Toolbar
- Flask → Flask-DebugToolbar, Werkzeug debugger
- FastAPI → Uvicorn reload, API docs UI
- Pyramid → pyramid_debugtoolbar
JavaScript Frameworks:
- React → React DevTools (Components + Profiler)
- Vue → Vue DevTools
- Angular → Angular DevTools
- Svelte → Svelte DevTools
- Next.js → Built-in error overlay, Fast Refresh
Backend Frameworks (Other Languages):
- Ruby on Rails → rails/web-console, rack-mini-profiler
- Laravel (PHP) → Laravel Debugbar, Telescope
- Spring Boot (Java) → Spring Boot Actuator, DevTools
- ASP.NET Core (C#) → Developer Exception Page, debugging middleware
Mobile Frameworks:
- React Native → React Native Debugger, Flipper
- Flutter → Flutter DevTools
This is the key insight: Framework authors know the common debugging challenges better than you do. They've built tools that solve the exact problems you're encountering. These tools are specifically designed for the framework's architecture and conventions.
Community Knowledge Embedded in Framework-Specific Tools
Framework tools aren't just convenient—they encode collective knowledge from thousands of developers. When you ignore them, you're ignoring years of accumulated wisdom about what problems actually matter.
Let's examine Django Debug Toolbar as an example. When you install it, you get:
SQL Panel:
- Shows every query with exact SQL
- Highlights duplicate queries (N+1 detection)
- Shows query execution time
- Provides EXPLAIN analysis
- Links to the code that triggered each query
Could you build this yourself? Sure. How long would it take?
- Query interception: 2-4 hours
- Duplicate detection: 2 hours
- Stack trace association: 4-8 hours (tricky!)
- EXPLAIN analysis: 2 hours
- UI for displaying it: 4-8 hours
- Total: ~20 hours minimum
And you'd still miss:
- Database-specific quirks (PostgreSQL vs MySQL vs SQLite)
- Edge cases with subqueries and joins
- Integration with Django's connection pooling
- Handling transactions correctly
- Proper cleanup to avoid memory leaks
Debug Toolbar handles all of this because hundreds of contributors have encountered and fixed these issues over a decade.
A Concrete Example: React DevTools vs Custom Logging
Let's see this in practice. You're debugging why a React component re-renders excessively:
Approach 1: Custom logging (the hard way)
function UserProfile({ userId, settings, onUpdate }) {
  // A ref, not state: updating state inside this effect would itself
  // trigger another render and loop forever.
  const renderCount = useRef(0);

  useEffect(() => {
    renderCount.current += 1;
    console.log(`UserProfile rendered ${renderCount.current} times`);
    console.log("Props:", { userId, settings, onUpdate });
  });

  useEffect(() => {
    console.log("userId changed:", userId);
  }, [userId]);

  useEffect(() => {
    console.log("settings changed:", settings);
  }, [settings]);

  useEffect(() => {
    console.log("onUpdate changed:", onUpdate);
  }, [onUpdate]);

  // ... actual component logic
}
After running this, your console shows:
UserProfile rendered 1 times
Props: {userId: 123, settings: {...}, onUpdate: ƒ}
userId changed: 123
settings changed: {theme: 'dark', ...}
onUpdate changed: ƒ onUpdate()
UserProfile rendered 2 times
Props: {userId: 123, settings: {...}, onUpdate: ƒ}
onUpdate changed: ƒ onUpdate()
UserProfile rendered 3 times
Props: {userId: 123, settings: {...}, onUpdate: ƒ}
onUpdate changed: ƒ onUpdate()
You learn that onUpdate changes, but you still don't know:
- Why onUpdate is changing (parent component creating new function each render?)
- Which parent component is causing the re-render
- What the performance impact actually is
- What other components are affected
Approach 2: React DevTools Profiler (the right way)
- Open React DevTools
- Go to the Profiler tab
- Click "Record"
- Interact with your app
- Click "Stop"
You see:
Render #1: UserProfile (0.8ms)
Props changed: none (initial render)
Render #2: UserProfile (0.3ms)
Parent: Dashboard
Props changed: onUpdate
Reason: Parent rendered with new inline function
Render #3: UserProfile (0.3ms)
Parent: Dashboard
Props changed: onUpdate
Reason: Parent rendered with new inline function
Render #4: UserProfile (0.3ms)
Parent: Dashboard
Props changed: onUpdate
Reason: Parent rendered with new inline function
Notice carefully: DevTools immediately shows you:
- The parent component causing the issue (Dashboard)
- The specific prop changing (onUpdate)
- The reason (new inline function)
- The performance impact (0.3ms per render—not actually a problem!)
- A flame graph showing the render hierarchy
The solution becomes obvious: Move the function definition outside the parent component or wrap it in useCallback:
// In Dashboard component
const onUpdate = useCallback((data) => {
  // handle update
}, []); // Dependencies array prevents recreation

return <UserProfile userId={userId} settings={settings} onUpdate={onUpdate} />;
Time with custom logging: 1-2 hours of adding logs, analyzing output, and guessing at solutions.
Time with React DevTools: 5 minutes to identify and fix the issue.
The Cost of Reinventing Framework-Aware Instrumentation
Framework tools have deep integration that's impossible to replicate quickly. Consider what Django Debug Toolbar does:
Integration points you'd have to reimplement:
- Middleware integration: Intercepts request/response cycle
- SQL interception: Hooks into Django's database cursor
- Template rendering: Monitors template system
- Cache instrumentation: Tracks cache hits/misses
- Signal monitoring: Tracks Django signals
- Static file tracking: Shows which static files loaded
- Request history: Maintains history of recent requests
- Settings panel: Shows active Django settings
- Headers panel: Displays HTTP headers
- Logging panel: Aggregates log output
Each of these requires deep knowledge of Django's internals. Replicating even half this functionality would take weeks and would break whenever Django updates.
The Framework Evolution Problem
Another critical point: frameworks evolve. When React introduced Concurrent Mode, it changed how rendering works fundamentally. React DevTools was updated to understand Concurrent Mode. Your custom render logger? It would be completely wrong because it assumes the old rendering model.
Similarly, Django 3.2 introduced async views. Debug Toolbar was updated to handle async properly. Your custom SQL logger would miss queries in async views or crash trying to access thread-local storage.
Framework tools evolve with the framework. Your custom tools don't, unless you're willing to dedicate ongoing maintenance time.
How to Discover Framework Tools
If you're not aware of debugging tools for your framework, here's how to find them:
Step 1: Check official documentation
Search: "[framework] debugging" in official docs
Example: "Django debugging" → leads to Debug Toolbar mention
Example: "React debugging" → leads to React DevTools
Step 2: Check awesome lists
GitHub: awesome-[framework]
Example: awesome-django has a "Debugging" section
Example: awesome-react lists all DevTools extensions
Step 3: Ask the community
"What debugging tools do you use for [framework]?"
Post on Reddit: r/django, r/reactjs
Post on Discord/Slack for the framework
Step 4: Look for browser extensions
Chrome Web Store: search "[framework] devtools"
Firefox Add-ons: search "[framework] dev"
The Installation Barrier
Sometimes developers avoid framework tools because installation seems complicated. Let's address this:
Django Debug Toolbar: Seems scary (middleware configuration, URL routing) but it's actually:
pip install django-debug-toolbar

# settings.py
INSTALLED_APPS = [
    ...
    'debug_toolbar',
]
MIDDLEWARE = [
    'debug_toolbar.middleware.DebugToolbarMiddleware',
    ...
]

# urls.py
if settings.DEBUG:
    import debug_toolbar
    urlpatterns = [
        path('__debug__/', include(debug_toolbar.urls)),
    ] + urlpatterns
Total time: 5 minutes. Total benefit: immeasurable.
React DevTools: Even simpler:
- Go to the Chrome Web Store
- Search "React Developer Tools"
- Click "Add to Chrome"
- Done
Total time: 30 seconds.
The fear of configuration is almost always worse than the actual configuration.
When Framework Tools Aren't Enough
There are legitimate cases where framework tools don't solve your problem:
Case 1: Production debugging
- Framework tools are usually dev-only
- Solution: Use APM tools (Sentry, New Relic, DataDog)
Case 2: Cross-framework tracing
- You have a Django backend + React frontend
- Solution: Distributed tracing (OpenTelemetry)
Case 3: Non-standard deployments
- Running in embedded systems, IoT devices
- Solution: Lightweight custom instrumentation
But even in these cases, start with framework tools in development to understand the code, then add production instrumentation as needed.
The Rule of Framework Tools
Here's the rule you should follow:
Before writing ANY custom tracing code, search for and try the framework's official debugging tools. Only after you've exhausted those tools should you consider alternatives.
This takes 10 minutes and could save you days of work.
Sin 6: Production Debugging Without Safety
The Scenario: Your API is timing out in production. Users are complaining. You think: "I'll just attach a debugger to see what's happening." You set breakpoints in the production process. The breakpoint hits. The entire web server freezes while you examine variables. All requests time out. The site goes down. You panic and restart the server, but now you have an outage to explain and still don't know what caused the original issue.
This is production debugging without safety—using development tools in production environments without understanding the consequences. It's one of the most dangerous pitfalls because the stakes are so high.
Never Run Debuggers in Production
Let me state this unequivocally: Do not attach debuggers to production processes. Not pdb, not the VS Code debugger, not Chrome DevTools remote debugging. Here's why:
What happens when a debugger hits a breakpoint:
- The process completely stops
- All threads freeze
- No new requests are processed
- Existing requests time out
- Load balancers may mark the instance as unhealthy
- Health checks fail
- Auto-scaling may kill and restart the instance
Even worse, if you're debugging:
- A database connection remains open, potentially holding locks
- Message queue consumers stop processing
- Scheduled tasks don't run
- Websocket connections drop
A Real Incident: A developer attached pdb to a production Django process to debug a mysterious authentication failure. The breakpoint hit during a background task. The task held a database lock. All subsequent requests that needed that table started queuing. The database connection pool exhausted. The entire application became unresponsive. The incident lasted 15 minutes and affected thousands of users—all because of one breakpoint.
The Illusion of "Quick Look"
You might think: "I'll just attach for a second to see one variable." This never works because:
- You can't predict when the breakpoint hits: You think you're debugging a rare code path, but it turns out to trigger on every request
- Production traffic is unpredictable: While you're attached, a traffic spike hits
- One question leads to another: "Wait, what's the value of this other variable?" Now you're stepping through code in production
- Panic paralysis: When things start breaking, you might freeze instead of cleanly detaching
Safe Alternatives for Production Tracing
So how do you debug production issues? Use tools specifically designed for production safety:
Alternative 1: Structured Logging with Feature Flags
Instead of breakpoints, add conditional logging:
import logging
import structlog

logger = structlog.get_logger(__name__)

def process_payment(order_id):
    order = Order.objects.get(id=order_id)

    # Feature flag check (LaunchDarkly, or similar)
    debug_enabled = feature_flags.is_enabled('debug-payment-flow', user_id=order.user_id)

    if debug_enabled:
        logger.info("payment.started",
                    order_id=order_id,
                    user_id=order.user_id,
                    amount=order.total)
    try:
        result = payment_gateway.charge(order)
        if debug_enabled:
            logger.info("payment.succeeded",
                        order_id=order_id,
                        transaction_id=result.transaction_id,
                        gateway_response=result.raw_response)
        return result
    except PaymentError as e:
        logger.error("payment.failed",
                     order_id=order_id,
                     error=str(e),
                     user_id=order.user_id)
        raise
This is safe because:
- Logging doesn't stop the process
- You can enable it for specific users/requests
- Log aggregation (ELK, Datadog) shows patterns
- No performance impact when disabled
Alternative 2: py-spy (Sampling Profiler)
py-spy is specifically designed for production use:
# Attach to running process without pausing it
py-spy top --pid 12345
# Sample execution for 60 seconds
py-spy record --pid 12345 --duration 60 --output profile.svg
# Detach automatically
Why py-spy is safe:
- Uses sampling (only checks the stack every 10ms)
- Doesn't stop the process
- Minimal performance overhead (~1-2%)
- No code changes required
- Can attach and detach without restarting
Compare this to a debugger, which stops the process completely.
Alternative 3: APM Tools (Application Performance Monitoring)
Tools like Sentry, New Relic, or Datadog provide production-safe observability:
import sentry_sdk

sentry_sdk.init(
    dsn="your-dsn-here",
    traces_sample_rate=0.1,  # Sample 10% of transactions
)

def checkout_flow(request):
    # Sentry automatically tracks:
    # - Exceptions with full stack traces
    # - Performance of each function
    # - Database queries
    # - External API calls
    with sentry_sdk.start_transaction(op="checkout", name="process_order"):
        order = create_order(request.user, request.cart)
        with sentry_sdk.start_span(op="payment", description="Process payment"):
            payment = process_payment(order)
        with sentry_sdk.start_span(op="email", description="Send confirmation"):
            send_confirmation_email(order)
        return order
APM tools give you:
- Distributed tracing across services
- Performance breakdowns
- Error tracking with context
- Real user monitoring
- All without stopping production processes
Alternative 4: Feature Flags for Temporary Instrumentation
You can safely add temporary instrumentation if you gate it behind feature flags:
import time
from contextlib import contextmanager

def handle_request(request):
    # Only instrument for internal testers
    if feature_flags.is_enabled('trace-request-flow', user=request.user):
        with detailed_instrumentation():
            return _process_request(request)
    else:
        return _process_request(request)

@contextmanager
def detailed_instrumentation():
    # This code only runs for flagged users
    start = time.time()
    metrics = {}
    yield metrics
    duration = time.time() - start
    logger.info("request.detailed_trace",
                duration=duration,
                **metrics)
This is safe because:
- Instrumentation only affects flagged users (often just you)
- You can disable it instantly if problems arise
- Other users get normal, uninstrumented code
- You can gradually roll out to more users
Feature Flags and Sampling Strategies
Let's dive deeper into safe production instrumentation patterns:
Pattern 1: User-based sampling
def should_trace(user_id):
    # Trace ~1% of users, deterministically. Integer IDs hash consistently;
    # for string IDs, prefer hashlib, because hash() is salted per process.
    return hash(user_id) % 100 < 1

def api_endpoint(request):
    trace_enabled = should_trace(request.user.id)
    if trace_enabled:
        with distributed_trace():
            return handle_request(request)
    else:
        return handle_request(request)
Pattern 2: Request-ID-based sampling
def api_endpoint(request):
    # Sample based on request ID (assumes a hex-prefixed ID such as a UUID)
    request_id = request.headers.get('X-Request-ID', '')
    try:
        trace_enabled = int(request_id.split('-')[0], 16) % 100 < 5  # 5% sample
    except ValueError:
        trace_enabled = False  # missing or malformed ID: skip tracing
    if trace_enabled:
        request.trace_id = request_id
        with detailed_logging():
            return handle_request(request)
    else:
        return handle_request(request)
Pattern 3: Canary deployment tracing
import os

# In your load balancer configuration:
#   Route 5% of traffic to instrumented (canary) instances
#   Route 95% to normal instances

# On instrumented instances only:
ENABLE_DETAILED_TRACING = os.environ.get('CANARY_INSTANCE') == 'true'

def handle_request(request):
    if ENABLE_DETAILED_TRACING:
        # Full instrumentation
        with comprehensive_tracing():
            return _handle_request(request)
    else:
        return _handle_request(request)
When Commercial Solutions Are Worth It
Production debugging is one area where paid tools often justify their cost:
Sentry ($26+/month):
- Automatic error tracking
- Performance monitoring
- Release tracking
- User impact analysis
Datadog ($15+/host/month):
- APM with distributed tracing
- Log aggregation
- Infrastructure monitoring
- Real-time alerting
New Relic ($25+/month):
- Full-stack observability
- Custom instrumentation
- Anomaly detection
- Performance baselines
These tools provide capabilities that would take months to build yourself:
- Automatic instrumentation for common frameworks
- Low overhead (< 5% performance impact)
- Production-safe operation
- No risk of accidentally stopping processes
- Built-in dashboards and alerting
The cost of one production outage (lost revenue, customer trust, developer time) almost always exceeds the annual cost of an APM tool.
The Production Debugging Decision Tree
Here's a flowchart for making safe production debugging decisions:
Issue reported in production
    ↓
Can you reproduce it in staging/dev?
├─ YES → Use normal debugging tools there
│        (debuggers, print statements, etc.)
│
└─ NO → Is it a performance issue?
        ├─ YES → Use py-spy or APM sampling
        │        (no code changes needed)
        │
        └─ NO → Is it an error/exception?
                ├─ YES → Check error tracking (Sentry)
                │        Add structured logging if needed
                │
                └─ NO → Is it user-specific?
                        ├─ YES → Enable feature flag for that user
                        │        Add conditional instrumentation
                        │
                        └─ NO → Roll out sampling-based tracing
                                (1-5% of traffic)
Notice the pattern: At no point do you attach a debugger to production. There's always a safer alternative.
The "Emergency" Exception That Isn't
Developers sometimes say: "But this is an emergency! Users are affected! I need to debug right now!"
This is crucial: The worse the emergency, the MORE important it is to use safe tools. Here's why:
Scenario: Payment processing is failing in production.
Bad response: Attach debugger → Process freezes → ALL payments fail → Outage escalates
Good response:
- Check error logs (2 minutes)
- Check APM traces (2 minutes)
- If needed, enable detailed logging for new payments (5 minutes; see the sketch below)
- Analyze logs (10 minutes)
- Deploy fix (20 minutes)
Total: 39 minutes, with continued (partial) service.
The debugger approach might seem faster ("just look at the variables!"), but:
- 5 minutes to attach and investigate
- 15-minute outage while attached
- 30 minutes to recover from the outage
- You still don't have full context
Total: 50 minutes, with complete service disruption.
Safe tools are faster in emergencies because they don't create additional incidents.
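To make step 3 of the good response concrete, here is a minimal sketch of what "enable detailed logging for new payments" can look like when the switch is configuration rather than a code change. The "payments" logger name and the PAYMENT_LOG_LEVEL environment variable are illustrative assumptions, not part of Django or any particular library:
import logging
import os

# Hypothetical startup hook (e.g., in settings): set PAYMENT_LOG_LEVEL=DEBUG
# on the affected instances to get verbose payment logs, then set it back.
logging.getLogger("payments").setLevel(
    os.environ.get("PAYMENT_LOG_LEVEL", "INFO")
)
Because the change is a config flip, it can be rolled back instantly and never pauses the process, which is exactly the property the debugger approach lacks.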
Teaching Teams About Production Safety
If you're working on a team, make production debugging safety part of your culture:
1. Document safe practices
Create a runbook:
# Production Debugging Runbook
## ❌ NEVER DO THESE:
- Attach pdb/debugger to production processes
- Add print statements to production code
- Modify production code without review
- Set breakpoints in production
## ✅ ALWAYS DO THESE:
- Check Sentry/error tracking first
- Use py-spy for performance issues
- Enable feature-flagged logging
- Test in staging before production
- Document your investigation
## Emergency Contacts:
- On-call engineer: [phone]
- Database team: [slack]
- Platform team: [slack]
2. Make safe tools easily accessible
# Add aliases to jump hosts
alias prod-profile='py-spy record --pid $(pgrep -f "web") --duration 30 --output /tmp/profile.svg'
alias prod-logs='tail -f /var/log/app/production.log | jq .'
alias prod-errors='curl https://sentry.io/api/latest-errors'
3. Conduct incident post-mortems that include debugging methods
After incidents, document:
- What debugging tools were used
- Whether they caused additional problems
- What tools should have been used
- Changes to make debugging safer
4. Practice production debugging in staging
Set up a staging environment that mirrors production constraints:
- Same instance types
- Similar traffic patterns
- Feature flags enabled
Practice using safe debugging tools there so they're familiar when you need them.
Code Review Checkpoints for Production Changes
When reviewing any code that touches production, check:
- [ ] No debugger imports (pdb, ipdb, breakpoint())
- [ ] No print statements (use logging instead)
- [ ] New logging is behind feature flags
- [ ] Instrumentation has negligible performance impact
- [ ] Error handling won't crash the process
- [ ] Monitoring/alerting is in place
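The first two checkboxes are easy to enforce automatically. Here is a minimal sketch of a pre-commit or CI check (a hypothetical check_prod_safety.py script, not part of any existing tool) that scans the Python files in a changeset for debugger imports and print statements:
import re
import sys

# Patterns for the first two checklist items; extend to taste.
FORBIDDEN = [
    (re.compile(r"^\s*(import|from)\s+(pdb|ipdb)\b"), "debugger import"),
    (re.compile(r"\bbreakpoint\(\)"), "breakpoint() call"),
    (re.compile(r"^\s*print\("), "print statement (use logging)"),
]

def check_file(path):
    problems = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            for pattern, label in FORBIDDEN:
                if pattern.search(line):
                    problems.append(f"{path}:{lineno}: {label}")
    return problems

if __name__ == "__main__":
    issues = [p for path in sys.argv[1:] for p in check_file(path)]
    if issues:
        print("\n".join(issues))
        sys.exit(1)
Wire it into CI by passing the changed files, for example: python check_prod_safety.py $(git diff --name-only origin/main -- '*.py').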
The Ultimate Production Safety Rule
Here's the rule that trumps all others:
If a debugging tool would stop, pause, or significantly slow production traffic, it is not safe for production. No exceptions.
This means:
- ❌ Debuggers (they stop the process)
- ❌ Heavy instrumentation (slows requests)
- ❌ Synchronous external calls for logging (blocks requests)
- ❌ Memory dumps during traffic (causes pauses)
- ✅ Sampling profilers (minimal overhead)
- ✅ Async logging (non-blocking; see the sketch below)
- ✅ Feature-flagged instrumentation (isolated impact)
- ✅ APM tools (designed for production)
When in doubt, ask: "What happens if this runs on every request during peak traffic?" If the answer is bad, don't do it.
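As an example of what the "async logging" checkmark means in practice, here is a minimal sketch using only the standard library's QueueHandler and QueueListener: request threads hand records to an in-memory queue, and a background thread does the slow I/O. The app.log file handler is just a stand-in for whatever slow sink you actually use:
import logging
import queue
from logging.handlers import QueueHandler, QueueListener

log_queue = queue.Queue(-1)                    # unbounded in-memory buffer
slow_handler = logging.FileHandler("app.log")  # stand-in for any slow sink

# Request threads log into the queue, which returns immediately...
root = logging.getLogger()
root.addHandler(QueueHandler(log_queue))
root.setLevel(logging.INFO)

# ...while a background thread drains the queue into the real handler.
listener = QueueListener(log_queue, slow_handler, respect_handler_level=True)
listener.start()

logging.getLogger(__name__).info("request handled")  # non-blocking for the caller
# At shutdown, call listener.stop() to flush any remaining records.
The key property is that the request path never waits on disk or the network.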
Sin 7: Tracing Without Documentation
The Scenario: You spend three days tracing through a complex OAuth authentication flow. You discover that django-allauth triggers seven middleware components, makes three database queries, fires five signal handlers, and issues two external HTTP requests—all for a single login. You understand it perfectly. You fix your bug. Two months later, a teammate asks: "How does our OAuth login work?" You try to remember... it's fuzzy. You have to trace through it again. Six months later, YOU need to modify it and have to retrace the entire flow because you've forgotten the details.
This is tracing without documentation—treating execution flow investigation as a one-time activity rather than creating artifacts that serve the team long-term. It's the most insidious sin because it feels like you've accomplished something, but the value evaporates the moment you move to the next task.
Execution Flow Diagrams Save Future Work
The primary deliverable of execution tracing shouldn't just be understanding in your head—it should be a diagram, document, or code comment that captures what you learned. Here's why:
Time investment without documentation:
- First trace: 3 hours
- Second trace (you, 6 months later): 2 hours
- Third trace (teammate): 3 hours
- Fourth trace (new team member): 4 hours
- Total team time: 12 hours
Time investment with documentation:
- First trace + documentation: 4 hours
- Future reference (anyone): 10 minutes
- Updates when code changes: 30 minutes per change
- Total team time over a year: ~6 hours
The documentation pays for itself after just two uses.
What to Document After Tracing
When you finish tracing an execution flow, create one or more of these artifacts:
1. Sequence diagrams for complex interactions
# OAuth Login Flow
## Sequence Diagram
User → Browser → Django View → django-allauth → OAuth Provider → Database
1. User clicks "Login with Google"
2. Browser: GET /accounts/google/login/
3. Django Middleware Chain:
   - SecurityMiddleware (check HTTPS)
   - SessionMiddleware (load session)
   - AuthenticationMiddleware (set request.user)
   - django-allauth: SocialAccountMiddleware
4. View: google_login (from django-allauth)
   - Generates OAuth state token
   - Saves state to session (DB write #1)
   - Redirects to Google
5. User authenticates at Google
6. Google: Redirects to /accounts/google/login/callback/
7. Django Middleware Chain (same as above)
8. View: google_callback (from django-allauth)
   - Validates state token
   - Exchanges code for access token (HTTP call #1)
   - Fetches user profile from Google (HTTP call #2)
   - Creates/updates SocialAccount (DB write #2)
   - Creates/updates User (DB write #3)
   - Fires signals:
     - pre_social_login
     - user_logged_in
     - social_account_updated
   - Creates session (DB write #4)
   - Redirects to dashboard
This documentation:
- Shows the complete flow in one place
- Identifies all database writes (useful for optimization)
- Identifies external HTTP calls (useful for reliability planning)
- Identifies signal handlers (useful for debugging side effects)
- Serves as a reference for anyone modifying the code
2. Architectural discovery comments
When you discover non-obvious architectural patterns during tracing, document them in the code:
def process_order(order_id):
    """
    Process a customer order through payment and fulfillment.

    EXECUTION FLOW (discovered 2024-12-03):
    ----------------------------------------
    This function triggers a complex chain:
    1. Creates Order instance (DB write)
    2. Fires post_save signal → inventory.reserve_items()
       - This is async via Celery
       - May fail silently if Redis is down (see ISSUE-1234)
    3. Calls payment_gateway.charge()
       - Synchronous HTTP call (timeout: 30s)
       - Retries 3x with exponential backoff
    4. If payment succeeds, fires order_paid signal
       - email.send_confirmation() - async
       - analytics.track_conversion() - async
       - shipping.create_label() - sync (blocking!)
    5. Returns Order instance

    GOTCHAS:
    - If shipping.create_label() fails, the payment is NOT rolled back
    - Signal handlers run even if this function raises an exception
    - The order exists in DB even if payment fails (status='pending')

    See docs/architecture/order-processing-flow.md for diagram.
    """
    order = Order.objects.create(user_id=user_id, ...)
    # ... implementation ...
Notice what this comment does:
- Documents the discovered flow, not just what the code obviously does
- Flags non-obvious behavior (async operations, failure modes)
- Points to architectural gotchas
- References more detailed documentation
- Includes a date so future readers know it might be outdated
3. README sections for complex subsystems
# Payment Processing System
## Architecture Overview
The payment system consists of three main components:
1. **Synchronous payment flow** (`payments/views.py`)
   - User-facing API for initiating payments
   - Handles Stripe checkout session creation
   - Returns immediately with session ID
2. **Webhook handling** (`payments/webhooks.py`)
   - Stripe sends webhooks for payment events
   - Processed asynchronously via Celery
   - Updates order status in database
3. **Reconciliation worker** (`payments/tasks.py`)
   - Runs hourly via cron
   - Compares our records with Stripe
   - Flags discrepancies for manual review
## Execution Flow for Successful Payment
User clicks "Pay"
→ POST /api/payments/checkout
→ Create Stripe checkout session (HTTP call)
→ Return session_id to frontend
→ Frontend redirects to Stripe
User completes payment at Stripe
→ Stripe sends webhook to /api/payments/webhook
→ Celery task: process_payment_webhook
→ Verify webhook signature
→ Update Order status → 'paid'
→ Fire order_completed signal
→ Send confirmation email (async)
→ Update inventory (async)
→ Create shipping label (sync)
→ Return 200 OK to Stripe
## Database Queries
Traced on 2024-12-03 with Django Debug Toolbar:
- Checkout creation: 3 queries (0.012s)
  - SELECT User
  - SELECT Cart items
  - INSERT Order
- Webhook processing: 5 queries (0.031s)
  - SELECT Order (with lock)
  - UPDATE Order status
  - INSERT PaymentTransaction
  - SELECT Inventory items
  - UPDATE Inventory (bulk)
## Known Issues
- **ISSUE-567**: Webhook processing can be slow during high traffic
  - Impact: Stripe may retry webhooks, causing duplicate processing
  - Mitigation: Idempotency key checking
- **ISSUE-892**: Inventory updates are not transactional with payments
  - Impact: Payment can succeed but inventory update fails
  - Workaround: Reconciliation worker catches these
## Debugging This System
1. **For payment failures**: Check Stripe Dashboard → Logs
2. **For webhook issues**: Check Celery logs: `tail -f logs/celery.log`
3. **For database issues**: Enable Django Debug Toolbar and check SQL panel
4. **For full trace**: Use `TRACE_PAYMENTS=true` env var (see feature flags)
This documentation:
- Provides multiple entry points (overview, execution flow, debugging)
- Includes specific metrics (query counts, timing)
- Links to known issues
- Tells future developers how to investigate further
README Updates for Complex Flows
When you trace a complex flow, update the project README or create a dedicated documentation file:
# Project Documentation
## Understanding Key Flows
New to the codebase? Start by understanding these critical execution paths:
### 1. User Registration Flow
- **Entry point**: `POST /api/auth/register`
- **Documentation**: See `docs/flows/user-registration.md`
- **Key files**: `users/views.py`, `users/models.py`, `emails/tasks.py`
- **Tracing tip**: Set breakpoint at `UserRegistrationView.post()`
### 2. Order Processing Flow
- **Entry point**: `POST /api/orders/create`
- **Documentation**: See `docs/flows/order-processing.md`
- **Key files**: `orders/views.py`, `payments/gateway.py`, `inventory/signals.py`
- **Tracing tip**: Enable Django Debug Toolbar, watch the SQL panel
### 3. Webhook Processing Flow
- **Entry point**: External webhooks from Stripe, SendGrid, etc.
- **Documentation**: See `docs/flows/webhooks.md`
- **Key files**: `webhooks/receivers.py`, `webhooks/tasks.py`
- **Tracing tip**: Use Celery logs with `--loglevel=debug`
## Debugging Guides
- [How to trace a request through the middleware stack](docs/debugging/middleware-tracing.md)
- [Understanding our Celery task architecture](docs/debugging/celery-tasks.md)
- [Database query optimization guide](docs/debugging/query-optimization.md)
Comment Conventions for Architectural Discoveries
Establish team conventions for documenting architectural insights:
Convention 1: Flow comments at entry points
def api_endpoint(request):
    """
    FLOW: This triggers a complex chain:
    1. Validates request → auth middleware
    2. Checks permissions → permission_classes
    3. Calls service layer → OrderService.create()
       - This creates DB records AND queues Celery tasks
    4. Returns response

    DISCOVERED: 2024-12-03
    The Celery tasks fire even if this function returns an error.
    This is intentional for analytics tracking.
    See ARCH-DECISION-003 for rationale.
    """
Convention 2: Gotcha comments at surprising behavior
def save_user_preferences(user, preferences):
    user.preferences = preferences
    user.save()
    # GOTCHA: The save() triggers a post_save signal that sends an email.
    # This means this function does I/O (SMTP call) even though it looks
    # like just a database save. If you need to save without emailing,
    # use save_without_signals() instead.
    #
    # Discovered during debugging session 2024-12-03.
    # See trace/user-preferences-flow branch for full investigation.
Convention 3: Performance notes from profiling
def generate_report(user_id, date_range):
    # PERFORMANCE: This function makes N+1 queries for user events.
    # Traced 2024-12-03: For 100 events, makes 101 queries (0.5s total).
    # TODO: Use prefetch_related('events') to optimize.
    # See profile-report-generation.svg for detailed profile.
    user = User.objects.get(id=user_id)
    events = user.events.filter(date__range=date_range)
    # ...
Visual Documentation Tools
Sometimes a diagram is worth a thousand words. Use tools to create visual documentation:
Mermaid diagrams in Markdown (GitHub renders these):
## Authentication Flow
```mermaid
sequenceDiagram
participant User
participant Browser
participant Django
participant OAuth
participant Database
User->>Browser: Click "Login with Google"
Browser->>Django: GET /accounts/google/login/
Django->>Database: Create OAuth state token
Django->>Browser: Redirect to Google
Browser->>OAuth: User authenticates
OAuth->>Browser: Redirect with code
Browser->>Django: GET /callback?code=xyz
Django->>OAuth: Exchange code for token
OAuth->>Django: Access token
Django->>OAuth: Fetch user profile
OAuth->>Django: Profile data
Django->>Database: Create/update user
Django->>Browser: Redirect to dashboard
```
Draw.io diagrams for complex architectures:
- Store as .drawio.png (PNG with embedded XML)
- Commit to repository
- Anyone with draw.io can edit
Excalidraw for hand-drawn style diagrams:
- Great for quick architectural sketches
- Export as SVG for vector quality
- Commit to /docs/diagrams/
The Documentation Decision Tree
When you finish tracing, ask:
Did I discover anything non-obvious?
├─ NO → Maybe just a brief comment is enough
│
└─ YES → Will someone else need to understand this?
         ├─ NO → Brief inline comment
         │
         └─ YES → Will this need updating as code changes?
                  ├─ NO → Inline comment + README mention
                  │
                  └─ YES → Separate documentation file + README link
Maintenance: Keeping Documentation Current
Documentation rots. Here's how to keep it fresh:
1. Link documentation to code in PRs
## Pull Request: Add OAuth2 support
### Changes
- Implemented OAuth2 login flow
- Added Google and GitHub providers
### Documentation Updates
- Updated `docs/flows/authentication.md` with OAuth flow
- Added comments in `auth/views.py` explaining middleware chain
- Updated README with OAuth setup instructions
### Testing
- Manually traced flow with Django Debug Toolbar
- Created `docs/flows/oauth-sequence-diagram.png`
2. Mark documentation with dates and authors
# Order Processing Flow
**Last Updated**: 2024-12-03 by @username
**Reviewed**: 2024-12-15 by @reviewer
**Next Review**: 2025-03-01
## Overview
...
3. Add "documentation debt" to technical debt
When you change code that has documentation:
def process_payment(order):
    # TODO(docs): This flow changed significantly. Update docs/flows/payment.md
    #   Old flow: Direct Stripe charge
    #   New flow: Stripe checkout session (async)
    ...
Then track these TODOs in your issue tracker.
The Documentation ROI
Let's be concrete about the return on investment:
Scenario: Complex microservice communication pattern
Without documentation:
- Developer 1 traces it: 4 hours
- Developer 2 traces it (6 months later): 3 hours
- Developer 3 traces it (new hire): 5 hours
- Developer 1 traces it again (forgot details): 2 hours
- Total: 14 hours
With documentation:
- Developer 1 traces + documents: 5 hours
- Developer 2 reads docs: 30 minutes
- Developer 3 reads docs: 45 minutes (less familiar with the codebase)
- Developer 1 references docs: 10 minutes
- Documentation updates (2x): 1 hour total
- Total: about 7.4 hours
Savings: about 6.6 hours (a 47% reduction)
And this assumes only 4 people need to understand it. On larger teams or longer-lived projects, the savings multiply.
The Golden Rule of Tracing
Here's the rule to internalize:
If it took you more than 30 minutes to understand an execution flow, document it. If it surprised you, definitely document it. If you think "I should write this down," do it immediately—you'll forget within an hour.
Your future self, your teammates, and your future teammates will thank you.