
## Part VIII: Mastery & Philosophy

### 7.21 From Tracing to Understanding

You've just spent three hours tracing execution flow through a Django application. Your notebook is filled with function names, call sequences, and database queries. You understand exactly what happens when a user clicks "Submit Order": the request hits the view, passes through six middleware components, triggers two signal handlers, executes twelve database queries, sends three emails via Celery, and returns a redirect response.

But here's the question that separates novice from master: Do you understand the system?

Execution traces show you the "what" and the "when." They don't automatically reveal the "why" or the "how well." A complete execution trace is like having a transcript of every word spoken in a meeting: you have data, but not necessarily insight. The mastery lies in transforming traces into understanding.

#### Execution flow is not architecture

Let's examine what you actually learn from tracing versus what you need to understand:

What tracing shows you:

1. LoginView.post() called

2. authenticate() called with username='john@example.com'

3. UserModel.objects.get(email='john@example.com') → SQL query

4. check_password() called

5. login() called → creates session

6. Session.objects.create() → SQL query

7. user_logged_in signal dispatched

8. update_last_login() signal handler called → SQL query

9. track_login_analytics() signal handler called → Redis write

10. redirect to /dashboard

This trace tells you what executes. It doesn't tell you why signals were chosen over direct calls, whether the analytics write blocks the response, what happens when any of these steps fail, or which of them dominate the response time.

This is the key insight: Execution traces are raw material, not finished understanding. Your job is to synthesize architectural insights from trace data.

Here's how expert developers transform traces into understanding:

Step 1: Identify patterns across traces

Don't trace one execution path; trace several variations: a successful login, a wrong password, a nonexistent user, a login with "remember me" checked.

Compare the traces. What changes? What stays constant? The invariants reveal core architecture; the variations reveal conditional logic and feature flags.
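One low-effort way to capture comparable traces is to drive the variations from a shell or test and count the queries each one runs. Here is a minimal sketch using Django's test client; the URL and credentials are illustrative, not taken from the traced app:

```python
from django.db import connection
from django.test import Client
from django.test.utils import CaptureQueriesContext

# Each variation exercises the same login endpoint with different inputs.
variations = {
    "valid login":    {"username": "john@example.com", "password": "correct-password"},
    "wrong password": {"username": "john@example.com", "password": "wrong-password"},
    "unknown user":   {"username": "ghost@example.com", "password": "anything"},
}

client = Client()
for label, credentials in variations.items():
    with CaptureQueriesContext(connection) as captured:
        response = client.post("/login/", credentials)
    # Compare status codes and query counts side by side to spot the invariants.
    print(f"{label}: status={response.status_code}, sql_queries={len(captured)}")
```

The printout alone explains nothing, but lining the runs up next to each other makes the invariant queries and the variation-specific ones obvious.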

Step 2: Map traces to architectural concepts

As you trace, categorize what you're seeing:

MIDDLEWARE CHAIN (architectural layer: cross-cutting concerns)

├── SecurityMiddleware → security boundaries

├── SessionMiddleware → state management strategy

├── AuthenticationMiddleware → identity layer

└── CsrfViewMiddleware → attack surface protection



VIEW LAYER (architectural layer: business logic)

├── LoginView → entry point for authentication flow

└── Signal handlers → event-driven side effects



DATA LAYER (architectural layer: persistence)

├── User.objects.get() → identity lookup strategy

└── Session.objects.create() → session storage mechanism

Notice how this reorganizes trace data into architectural layers. You're no longer thinking "line 47 calls line 89." You're thinking "the authentication layer coordinates three subsystems: identity verification, session management, and audit logging."

Step 3: Extract design decisions and their tradeoffs

For each significant pattern in your trace, ask: "Why was it designed this way? What does this optimize for?"

Example from the login trace above:

Design Decision: Using Django signals (user_logged_in) to trigger analytics and last-login updates.

What this optimizes for: decoupling (the login code doesn't need to know that analytics or last-login tracking exist) and extensibility (new side effects can be added by registering another handler, without touching the authentication flow).

What this trades away: visibility (nothing at the login call site shows these side effects), control over ordering, and failure isolation (a slow or broken handler runs inside the login request itself).
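For reference, the wiring behind that decision typically looks like this in Django. This is a sketch: `user_logged_in` is Django's real signal, but the handler body is illustrative.

```python
from django.contrib.auth.signals import user_logged_in
from django.dispatch import receiver

@receiver(user_logged_in)
def track_login_analytics(sender, request, user, **kwargs):
    # Runs synchronously inside the login request, yet nothing at the call
    # site (auth.login()) hints that it exists -- that's the visibility cost.
    analytics.record_login(user_id=user.pk)  # illustrative analytics client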

Now you're not just describing what happens; you're understanding why the system is shaped this way and what problems it might have.

Step 4: Build a mental model

The ultimate goal is a mental model that lets you predict behavior without tracing. After thoroughly tracing Django's authentication, you should be able to answer questions like: Where would a custom authentication backend plug in? Which steps run on every request versus only at login? What breaks if the session store is unavailable?

You know you've achieved understanding when you can answer architectural questions without running the debugger.

#### Building mental models from traces

Let's work through a concrete example of building a mental model from execution traces. You're exploring a FastAPI application that processes uploaded CSV files.

Initial trace (one execution):

[Trace 1: Upload success case]

POST /api/upload

→ upload_csv() endpoint handler

  → validate_file_extension()

  → parse_csv_with_pandas()

    → pd.read_csv()

  → validate_schema()

  → process_rows()

    → for each row: insert_record()

      → database INSERT

  → return {"status": "success", "rows": 1000}

From this single trace, you might conclude: "The app validates the file, parses it, validates the schema, and inserts rows." That's accurate but incomplete.

Building the model requires multiple traces:

[Trace 2: Invalid file extension]

POST /api/upload (file: data.txt)

→ upload_csv() endpoint handler

  → validate_file_extension() → raises ValidationError

→ Exception handler returns 400 response

[Process stops here]



[Trace 3: Malformed CSV]

POST /api/upload (file: malformed.csv)

→ upload_csv() endpoint handler

  → validate_file_extension() → OK

  → parse_csv_with_pandas() → raises pd.errors.ParserError

→ Exception handler returns 400 response

[Process stops here]



[Trace 4: Schema mismatch]

POST /api/upload (file: wrong_columns.csv)

→ upload_csv() endpoint handler

  → validate_file_extension() → OK

  → parse_csv_with_pandas() → OK, DataFrame created

  → validate_schema() → raises SchemaError

→ Exception handler returns 422 response

[Process stops here]



[Trace 5: Partial success scenario]

POST /api/upload (file: some_invalid_rows.csv)

→ upload_csv() endpoint handler

  → validate_file_extension() → OK

  → parse_csv_with_pandas() → OK

  → validate_schema() → OK

  → process_rows()

    → row 1: insert_record() → OK

    → row 2: insert_record() → IntegrityError (duplicate)

    → row 3: insert_record() → OK

    → ...

[All rows attempted, some failed]

→ return {"status": "partial", "rows": 800, "errors": 200}

Now you can build a more complete mental model:

Mental Model: CSV Upload Pipeline

┌──────────────────────────────────────┐

│  Request Entry Point                 │

│  POST /api/upload                    │

└──────────────┬───────────────────────┘

               │

┌──────────────▼───────────────────────┐

│  Validation Layer (fail-fast)        │

│  • Extension check (.csv only)       │

│  • CSV parsability (pandas)          │

│  • Schema match (column names/types) │

│  [Exit early on any failure]         │

└──────────────┬───────────────────────┘

               │

┌──────────────▼───────────────────────┐

│  Processing Layer (fail-tolerant)    │

│  • Row-by-row insertion              │

│  • Continues despite individual      │

│    row failures                      │

│  • Collects errors for reporting     │

└──────────────┬───────────────────────┘

               │

┌──────────────▼───────────────────────┐

│  Response Layer                      │

│  • Success: all rows inserted        │

│  • Partial: some rows failed         │

│  • Failure: validation failed        │

└──────────────────────────────────────┘

Key architectural insights from multiple traces:

  1. Two-phase error handling: Validation errors fail fast (nothing persisted), processing errors are fault-tolerant (partial success allowed)

  2. No transactions: Each row insertion is separate; if row 50 fails, rows 1-49 are already committed. This is a design choice, not a bug (see the sketch after this list).

  3. Synchronous processing: Large files will block the response. No background job queue.

  4. Client responsibility: The client receives error details and must decide whether to retry, edit the file, or accept partial success.
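A condensed sketch of the shape these traces imply is shown below (FastAPI-style; `parse_csv_with_pandas`, `validate_schema`, and `insert_record` are the helper names from the traces, everything else is an assumption):

```python
from fastapi import FastAPI, HTTPException, UploadFile

app = FastAPI()

@app.post("/api/upload")
async def upload_csv(file: UploadFile):
    # Phase 1: fail-fast validation -- any failure aborts before anything is persisted.
    if not file.filename.endswith(".csv"):
        raise HTTPException(status_code=400, detail="Only .csv files are accepted")
    dataframe = parse_csv_with_pandas(file)  # parser errors -> 400 via exception handler
    validate_schema(dataframe)               # schema errors -> 422 via exception handler

    # Phase 2: fault-tolerant processing -- each row is inserted independently,
    # so earlier rows stay committed even when later rows fail.
    inserted, errors = 0, 0
    for _, row in dataframe.iterrows():
        try:
            insert_record(row)               # IntegrityError in the traces (e.g. duplicates)
            inserted += 1
        except Exception:
            errors += 1

    status = "success" if errors == 0 else "partial"
    return {"status": status, "rows": inserted, "errors": errors}
```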

This mental model now lets you answer questions like: What happens if someone uploads a huge file? (The request blocks while rows insert one by one.) Can a client safely re-upload after a partial result? (Only if the database rejects the duplicates, because nothing is rolled back.)

This is crucial: You extracted architectural understanding that's not visible in any single trace. The mental model reveals design choices (two-phase error handling), constraints (no transactions), and limitations (synchronous processing).

#### When to stop tracing and start designing

There's a point of diminishing returns where additional tracing doesn't improve understanding; it just consumes time. Recognizing this inflection point is an essential skill.

Stop tracing when:

1. You can predict behavior without running the trace

Test yourself: Can you answer "what happens if..." questions without the debugger?

If you can accurately predict the execution flow and outcomes, you understand the system. Further tracing is just more reconnaissance; what you need now is design work.

2. You're seeing the same patterns repeatedly

You've traced five different endpoints and noticed they all run through the same middleware chain, validate input the same way, and follow the same query-then-serialize pattern.

You've identified the architectural pattern. Additional tracing of similar endpoints won't teach you anything new; you're just confirming what you already know.

3. Your questions shift from "what" to "why" and "should"

Early exploration questions: "What calls this function?" "Which queries does this view run?" "Where does this value get set?"

Later architectural questions: "Why are signals used here instead of direct calls?" "Should this work happen synchronously?" "Is this coupling acceptable as the system grows?"

When your questions shift from what the system does to whether the system should do it that way, you've moved beyond tracing. You need architecture review, not more execution traces.

4. You can draw the architecture diagram

Try sketching the system architecture from memory: the main layers, how a request flows between them, where data is persisted, and which work runs synchronously versus in the background.

If you can draw an accurate architecture diagram without referring to your traces, you've internalized the structure. Additional tracing provides detail but not clarity.

A practical example of knowing when to stop:

You're tracing a Node.js Express application's authentication middleware. Here's your progression:

Hour 1: Run the debugger, step through the authenticate() function. Discover it checks JWT tokens, queries the database for user details, and attaches req.user. This is productive tracing.

Hour 2: Trace the same authentication flow for five different routes. Notice it's identical each time; the middleware is truly generic. Starting to see diminishing returns.

Hour 3: Step through the JWT library's internal verification logic, trace the bcrypt password hashing algorithm's implementation. You've gone too deep.

At hour 2, you should have stopped tracing and started asking design questions: Should the per-request user lookup be cached? How should expired tokens be handled? Do individual routes need finer-grained permissions than the generic middleware provides?

These questions require design thinking, not more execution traces. Tracing the internals of the JWT library (hour 3) is actively harmful; you're learning irrelevant implementation details instead of addressing architectural concerns.

Notice this carefully: The skill of knowing when to stop tracing is as important as the skill of tracing itself. Expert developers trace just enough to build a mental model, then shift to higher-level architectural thinking. Novices often trace too much, getting lost in implementation details without building architectural understanding.

#### Documentation that captures execution insights

Execution traces are ephemeral: debugger sessions vanish when you close your laptop. The insights you gain are valuable only if you externalize them for your future self and your team.

But here's the trap: Most developers don't document traces at all, or they document them poorly. They'll paste a 500-line stack trace into a comment, or write cryptic notes like "Auth flow: middleware → view → db → signals." These don't help.

Effective documentation captures insights, not just traces. Here's how to do it right:

**Format 1: Execution Flow Diagrams (for complex sequences)**

When you've traced a multi-step process with branches, conditionals, and async operations, create a visual diagram that shows the decision points and data flows.

Example: Django Form Submission Documentation

## Order Submission Flow

When a user submits the order form (`POST /checkout/submit`), the execution follows this path:

User clicks submit

↓

CheckoutView.post()

↓

┌───▼─────────────────────────┐

│ Form Validation             │

│ • Payment info present?     │

│ • Shipping address valid?   │

│ • Inventory available?      │

└───┬─────────────────────────┘

    │ ✓ Valid

    ↓

┌───▼─────────────────────────┐

│ Payment Processing          │

│ Stripe API call (blocking)  │

│ Timeout: 30 seconds         │

└───┬─────────────────────────┘

    │ ✓ Charged

    ↓

┌───▼─────────────────────────┐

│ Order Creation              │

│ Database transaction:       │

│ • Create Order record       │

│ • Create OrderItem records  │

│ • Update inventory counts   │

│ • Create shipping label     │

└───┬─────────────────────────┘

    │ ✓ Committed

    ↓

┌───▼─────────────────────────┐

│ Post-Order Signals          │

│ (async signal handlers)     │

│ • Send confirmation email   │

│ • Update analytics          │

│ • Notify warehouse system   │

└───┬─────────────────────────┘

    │

↓

Redirect to /orders//confirmation

**Critical timing:** The Stripe API call blocks the response. Users see a loading spinner for 2-5 seconds.



**Error handling:**

- Form validation failures: Re-render form with errors (no state change)

- Payment failures: Roll back entire transaction, show error

- Post-order signal failures: Order still created, logged for retry



**Why this architecture:**

- Payment processed *before* order creation to avoid inventory reservation without payment

- All database changes in single transaction to ensure consistency

- Email/analytics after commit so they don't block order creation



**Known issues:**

- Blocking Stripe call causes poor UX on slow connections

- No automatic retry for failed warehouse notifications

- Large orders (>50 items) slow due to N+1 query pattern in OrderItem creation

Notice what this documentation captures:



1. **The happy path** (what executes when everything works)

2. **The timing characteristics** (what blocks the response)

3. **The error handling** (what happens when things fail)

4. **The architectural rationale** (why it's designed this way)

5. **The known limitations** (what should be improved)



This is not a code comment. It's architectural documentation derived from execution traces but enriched with understanding.



**Format 2: Narrative Walkthroughs (for subtle behaviors)**



Some execution insights are best captured as narrative explanations, especially when the behavior is surprising or non-obvious.



Example: Django ORM Query Behavior Documentation



```markdown

## User Profile Loading: Query Behavior



### The Naive Implementation



The `UserProfileView` appears to execute one database query:



```python

def get(self, request):

    user = request.user  # Already loaded by AuthenticationMiddleware

    profile = user.profile  # Related object access

    return render(request, 'profile.html', {'profile': profile})

```



You might expect this to be efficient since `request.user` is already loaded.



### What Actually Happens (discovered via Django Debug Toolbar)



**Query 1:** Middleware loads the user

```sql

SELECT * FROM auth_user WHERE id = ?

```



**Query 2:** Accessing `user.profile` triggers a join query

```sql

SELECT * FROM user_profile WHERE user_id = ?

```



**Query 3-12:** The template accesses `profile.recent_orders`, triggering:

```sql

SELECT * FROM orders WHERE user_id = ? ORDER BY created_at DESC LIMIT 10

```



**Query 13-22:** For each order, the template shows `order.items.count()`:

```sql

SELECT COUNT(*) FROM order_items WHERE order_id = ?

-- This runs once per order! (N+1 query)

```



**Total: 22 queries** for a page that conceptually needs 3.



### The Fix



Use `select_related()` and `prefetch_related()`:



```python

from django.db.models import Prefetch  # needed for the nested prefetch below

def get(self, request):
    # Note: prefetching a sliced queryset inside Prefetch() requires Django 4.2+.
    user = User.objects.select_related('profile').prefetch_related(
        Prefetch('orders',
                 queryset=Order.objects.prefetch_related('items')[:10])
    ).get(pk=request.user.pk)
    return render(request, 'profile.html', {'user': user})

```



Now: **3 queries total** (user+profile, orders, order_items in bulk).



### Key Lesson



Django's ORM executes queries **lazily**. Just because data seems "already loaded" doesn't mean it is. Template access patterns create hidden queries. Always trace the actual SQL, not the Python code.



Tools used: Django Debug Toolbar SQL panel + VS Code debugger to confirm query timing.

```



This narrative format works well when you need to contrast expectations versus reality, or when you're documenting a subtle behavior that surprised you.



**Format 3: Decision Records (for architectural choices)**



When tracing reveals an architectural decision, document it as an Architecture Decision Record (ADR) that explains *why* the system is structured this way.



Example:



```markdown

## ADR-007: Synchronous vs. Asynchronous Email Sending



### Context



User registration requires sending a confirmation email. We traced two possible implementation approaches:



**Option A: Synchronous (current)**

```

register_view()

→ create_user()

→ send_confirmation_email() ← blocks here (SMTP: ~500ms)

→ return response

```



**Option B: Asynchronous (with Celery)**

```

register_view()

→ create_user()

→ send_confirmation_email.delay() ← returns immediately

→ return response



[Separately, Celery worker executes email send]

```



### Decision



We chose **synchronous** email sending despite the 500ms latency.



### Rationale



**Traced execution revealed:**

- Registration form already takes 800ms (database writes, password hashing)

- Adding 500ms for email makes total response time 1.3s

- User expects to wait during registration (not perceived as slow)



**Why not async:**

- Would add Celery as infrastructure dependency (Redis/RabbitMQ broker)

- Would add operational complexity (monitoring worker health)

- Would add failure mode complexity (email fails silently, retry logic needed)

- Would add testing complexity (mocking async tasks)



**Trade-off analysis:**

- 500ms latency is acceptable for infrequent operation (registration)

- Simplicity beats performance when performance is acceptable

- Synchronous failures are easier to handle (show error immediately)



### Consequences



**Positive:**

- Simple architecture (no message broker, no workers)

- Failures surface immediately to users

- Easy to test and debug



**Negative:**

- Registration response blocked by email send

- If SMTP server is slow/down, registration is slow/broken

- Can't easily add more email notifications without blocking more



### Revisit Criteria



Reconsider async approach if:

- Registration volume exceeds 1000/hour

- We add multiple emails per registration (email + SMS + webhook)

- User feedback indicates registration feels slow



### Tracing Methodology



Tools used: Django Debug Toolbar (timing), VS Code debugger (execution flow), `py-spy` (profile of registration under load)



Execution traced: 2024-03-15

Decision made: 2024-03-18

Status: Accepted

```



This format captures the entire decision-making processβ€”not just *what* the code does, but *why* it does it that way and *when* we might change it.



**The documentation principle:** Document insights, not traces. Your future self doesn't need to see every function call. They need to understand:



1. **How the system works** (execution flow diagrams)

2. **Why it works that way** (architectural rationale)

3. **What could go wrong** (error cases and limitations)

4. **When to revisit** (conditions for reevaluation)



If your documentation allows someone to understand the system without running the debugger themselves, you've succeeded.



---



### 7.22 The Tracing Mindset



Mastery of execution tracing isn't just about knowing which debugger flag to pass or which tool to install. It's about cultivating a particular mindset: a way of approaching unfamiliar code that consistently leads to understanding rather than confusion.



Here are the four core principles that separate expert tracers from those who struggle:



#### Curiosity over assumptions



You open a Flask application for the first time. The README says: "Standard Flask app with Blueprint-based routing and SQLAlchemy for database access." You might assume you know how it works; you've built Flask apps before.



This assumption is your enemy.



Expert tracers approach every codebase with aggressive curiosity: *"I think I know how this works, but let me verify."* They don't trust their intuitions until they've confirmed them with evidence.



**The assumption trap looks like this:**



You're asked to add a feature to a Django REST Framework API. You assume:

- Views are class-based (because that's DRF's standard pattern)

- Authentication uses JWT tokens (because that's what your last project used)

- Serializers handle validation (because that's DRF's design)



You start writing code based on these assumptions. Three hours later, you're confused: your code doesn't work, and you can't figure out why.



Then you actually *trace* the execution. You discover:

- This project uses function-based views wrapped in DRF decorators

- Authentication is session-based, not JWT

- Validation happens in custom middleware, not serializers



**You wasted three hours because you assumed instead of verified.**



**Curiosity over assumptions means:**



**1. Question the obvious**



```python

# You see this in a Flask route:

@app.route('/api/users/<user_id>')

def get_user(user_id):

    user = User.query.get(user_id)

    return jsonify(user.to_dict())

```



Assumptions you might make:

- "user_id is an integer" (might be UUID string)

- "User.query.get() hits the database" (might be cached)

- "to_dict() returns all fields" (might filter sensitive data)

- "This requires authentication" (might be public endpoint)



Expert tracers don't assume; they verify. Set a breakpoint, examine `user_id`'s type, step into `User.query.get()`, inspect what `to_dict()` returns.



**2. Test edge cases during exploration**



Don't just trace the happy path. Actively try to break things during your exploration:



- What if `user_id` is invalid? (Trace the error path)

- What if the user doesn't exist? (Trace the None-handling)

- What if the database is slow? (Trace with artificial latency)



Curiosity means deliberately invoking edge cases to see how the system handles them. You learn more from failures than successes.
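One cheap way to do that deliberately is to fire the edge cases from a throwaway script while your breakpoints are set. A minimal sketch against the Flask route above; the `myapp` import is hypothetical:

```python
from myapp import app  # hypothetical module exposing the Flask app above

with app.test_client() as client:
    # Nonexistent user: does the route 404 cleanly, or crash on user.to_dict()?
    print(client.get("/api/users/999999").status_code)
    # Non-numeric ID: rejected by routing, or passed straight into the query?
    print(client.get("/api/users/not-a-number").status_code)
```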



**3. Follow unexpected discoveries**



You're tracing a Django view that saves a form. You notice a signal handler fires after save:



```python

@receiver(post_save, sender=OrderModel)

def notify_warehouse(sender, instance, created, **kwargs):

    if created:

        warehouse_api.notify_new_order(instance.id)

```



The incurious developer thinks: "Okay, it sends a warehouse notification" and moves on.



The curious developer asks:

- "Does this API call block the response? Let me step into it."

- "What happens if the warehouse API is down? Let me check error handling."

- "How long does this take? Let me profile it."

- "Is this called for every order, or only some? Let me check conditional logic."



One level of curiosity gives you a fact: "It notifies the warehouse."



Deep curiosity gives you understanding: "This synchronous API call adds 200ms latency to every order submission and will fail silently if the warehouse service is unavailable, causing orders to be created but never fulfilled."



**The curiosity habit:**



Develop the mental habit of following your confusion. When something surprises you during tracing (a function takes an unexpected path, a variable contains a weird value, execution jumps somewhere you didn't predict), *stop and investigate*.



Confusion is a gift. It marks the gap between your mental model and reality. Curious developers treat confusion as a treasure map: "I'm confused because I don't understand something. Let me trace this until I'm not confused anymore."



Assumptions are time bombs. They make you feel like you understand when you don't. Curiosity defuses them.



#### Tools over cleverness



You need to understand how authentication works in an inherited codebase. You're a clever developer. You might think:



"I'll write a Python script that uses AST parsing to trace all functions decorated with `@login_required`. Then I'll follow the decorator's implementation to see what it does. I'll output a call graph and analyze it."



Stop. You're being clever. Cleverness is seductive: it feels like you're demonstrating mastery by building sophisticated solutions.



**But here's what the expert does instead:**



1. Open VS Code

2. Set a breakpoint in the login view

3. Submit the login form

4. Step through execution with F10/F11

5. Observe what actually runs



**Time investment:**

- Clever approach: 3 hours writing AST parser, 1 hour debugging it, 30 minutes analyzing output = 4.5 hours

- Tool-based approach: 5 minutes setting up debugger, 15 minutes stepping through = 20 minutes



The expert gets the answer **13x faster** using existing tools instead of building custom ones.



**This is crucial:** The "tools over cleverness" principle doesn't mean you're not smart. It means you're smart enough to recognize that other smart people already solved this problem. Your cleverness should be directed at your actual work (building features, architecting systems), not at reinventing debugging tools.



**The cleverness trap manifests in several forms:**



**Form 1: Custom instrumentation when loggers exist**



```python

# Clever but unnecessary:

import ast

import inspect



class FunctionTracer(ast.NodeTransformer):

    def visit_FunctionDef(self, node):

        # Insert print statements at function entry/exit

        # ... 50 lines of AST manipulation ...

        return node



# Trace all functions in a module

trace_module('myapp.views')

```



Meanwhile, Python's built-in tools already do this:



```python

# Simple and reliable:

import sys



def trace_calls(frame, event, arg):

    if event == 'call':

        code = frame.f_code

        print(f"Calling {code.co_filename}:{code.co_name}")

    return trace_calls



sys.settrace(trace_calls)

```



Or just use the debugger and set breakpoints. The built-in approach is 5 lines instead of 50, and it actually works reliably.



**Form 2: Grep-based code analysis when debuggers show actual execution**



```bash

# Trying to understand what executes:

$ grep -r "def process_payment" .

$ grep -r "process_payment(" .

$ grep -r "from.*import.*process_payment" .

# ... 20 more grep commands trying to find usage patterns ...

```



This tells you where `process_payment` *could* be called. It doesn't tell you where it *is* called for your specific use case.



Instead: Set a breakpoint in `process_payment`, trigger your use case, see the call stack. Now you know *exactly* what calls it, in what order, with what data.



**Form 3: Writing test cases to understand behavior**



```python

# Trying to understand what User.authenticate() does:

def test_authenticate_with_valid_password():

    result = User.authenticate('john', 'password123')

    assert result is not None



def test_authenticate_with_invalid_password():

    result = User.authenticate('john', 'wrong')

    assert result is None



def test_authenticate_with_nonexistent_user():

    result = User.authenticate('nobody', 'password')

    assert result is None



# ... 10 more test cases to explore all branches ...

```



This is work. You're writing code to understand code.



Instead: Set a breakpoint at the top of `User.authenticate()`, call it once, step through the entire implementation. You see every branch, every query, every decision point in 5 minutes.
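If you'd rather not click around an IDE for a one-off question, the standard library can drop you straight into the implementation. A sketch assuming a `myapp.models.User` with the `authenticate` method discussed above:

```python
import pdb
from myapp.models import User  # hypothetical import path

# Opens the debugger at the first line of User.authenticate(); from there,
# 'step' and 'next' walk every branch, query, and decision point.
pdb.runcall(User.authenticate, "john", "password123")
```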



**The tool-first principle:**



Before writing any custom code for execution tracing, ask yourself:



1. **Can the debugger answer this?** (Answer: Yes, 90% of the time)

2. **Can framework-specific tools answer this?** (Django Debug Toolbar, React DevTools, etc.)

3. **Can built-in profilers answer this?** (`cProfile`, Chrome DevTools, etc.)

4. **Can system tools answer this?** (`strace`, `ltrace`, etc.)



Only after exhausting all existing tools should you consider building custom instrumentation. And even then, start with the absolute minimum (a context manager with logging, a simple decorator), not an AST parser.
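As a reference point, the "absolute minimum" usually looks something like this: a timing context manager built from the standard library, with nothing clever about it (names are illustrative):

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("exploration")

@contextmanager
def timed(label):
    # Log how long the wrapped block took; delete this helper once you're done exploring.
    start = time.perf_counter()
    try:
        yield
    finally:
        logger.info("%s took %.1f ms", label, (time.perf_counter() - start) * 1000)

# Usage during exploration:
# with timed("shipping cost lookup"):
#     costs = calculate_shipping(order)  # illustrative call
```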



**Why tools beat cleverness:**



- **Tools are maintained.** Your clever solution becomes your maintenance burden.

- **Tools are documented.** Your clever solution requires explanation.

- **Tools are familiar to your team.** Your clever solution requires training.

- **Tools solve the general problem.** Your clever solution solves only your specific case.

- **Tools are debugged by communities.** Your clever solution has bugs you'll discover later.



Your cleverness is valuable. Direct it toward problems that don't have existing solutions. Execution tracing is a solved problem; don't re-solve it.



#### Simplicity over elegance



You've traced a complex authentication flow and discovered it involves seven different components: middleware, decorators, signal handlers, custom validators, database queries, cache lookups, and API calls.



You could document this as an elegant object-oriented architecture diagram with abstract base classes, dependency injection patterns, and design pattern names. It would be beautiful. It would be useless.



**Or you could document it like this:**



```

Authentication Flow (Simple Truth)



1. Middleware checks if request has session cookie

2. If yes: Load user from cache (or database if cache miss)

3. If no: Set request.user = AnonymousUser

4. View decorator checks if request.user is authenticated

5. If not: Redirect to login page

6. If yes: Proceed to view

```



The simple version tells you what actually happens. The elegant version tells you what the architect wishes you'd appreciate about their design.



**Simplicity over elegance means:**



**1. Describe what happens, not what it represents**



Elegant documentation:

```

"The AuthenticationMiddleware implements the Strategy pattern, where different

authentication backends can be plugged in through the provider interface.

This demonstrates the Open/Closed Principle and separation of concerns."

```



Simple documentation:

```

"When a request comes in, AuthenticationMiddleware checks the session cookie.

If valid, it loads the user. If invalid, request.user is AnonymousUser.

To change auth methods, modify AUTHENTICATION_BACKENDS in settings.py."

```



The simple version tells someone how to work with the code. The elegant version tells them how clever the architecture is.



**2. Use the most obvious tool, not the most sophisticated one**



You need to see what SQL queries a Django view executes.



Elegant approach: Configure logging with JSON formatters, send to ELK stack, write Kibana query, visualize in dashboard.



Simple approach: Install Django Debug Toolbar, refresh the page, click SQL panel.



The elegant approach might be necessary in production. For local development exploration, it's overkill. Simplicity means choosing the tool that solves today's problem with today's effort, not building infrastructure for imagined future needs.
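For context, the "simple approach" really is a few lines of configuration. A typical minimal django-debug-toolbar setup for local development looks roughly like this; check the package's documentation for your version, since details shift between releases:

```python
# settings.py (development only); assumes INSTALLED_APPS and MIDDLEWARE
# are the lists defined earlier in this settings module.
INSTALLED_APPS += ["debug_toolbar"]
MIDDLEWARE = ["debug_toolbar.middleware.DebugToolbarMiddleware"] + MIDDLEWARE
INTERNAL_IPS = ["127.0.0.1"]  # the toolbar only renders for these client IPs
```

Add the package's URL include per its docs, refresh the page, and the SQL panel lists every query the view ran.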



**3. Accept imperfect understanding over comprehensive analysis**



You're tracing how a React component fetches data. You could:



- Trace through the entire Redux middleware chain

- Understand every Redux action and reducer

- Map the complete state tree transformations

- Document every side effect and async flow



Or you could observe: "When the component mounts, it calls `fetchUserData()`, which dispatches a Redux action that triggers an API call. The response updates the store, which re-renders the component with data."



The comprehensive approach gives you elegant, complete understanding. The simple approach gives you enough understanding to add the feature you need to add today.



**This is the key insight:** Perfect understanding is expensive. Good-enough understanding is cheap and usually sufficient. Simple tracers ask: "What's the minimum I need to understand to accomplish my goal?" Elegant tracers ask: "What's the complete mental model of this system?"



Most of the time, you need the first answer, not the second.



**4. Write straightforward code when instrumenting**



If you must add instrumentation, resist the urge to make it clever:



```python

# Elegant but confusing:

from functools import wraps

from typing import TypeVar, Callable

import inspect



F = TypeVar('F', bound=Callable)



def trace(level: int = 1):

    def decorator(func: F) -> F:

        sig = inspect.signature(func)

        @wraps(func)

        def wrapper(*args, **kwargs):

            bound = sig.bind(*args, **kwargs)

            print(f"{'  ' * level}{func.__name__}{bound.arguments}")

            return func(*args, **kwargs)

        return wrapper

    return decorator

```



```python

# Simple and clear:

def trace(func):

    def wrapper(*args, **kwargs):

        print(f"Calling {func.__name__}")

        result = func(*args, **kwargs)

        print(f"Finished {func.__name__}")

        return result

    return wrapper

```



The elegant version handles signatures, indentation levels, and type hints. The simple version prints when functions start and stop. Both tell you execution order. The simple version is 10 lines instead of 15, and anyone can understand it in 10 seconds.



**The simplicity test:**



If you can't explain your tracing approach in one sentence to a junior developer, it's too complex.



- Simple: "I set a breakpoint and stepped through the code."

- Simple: "I used Django Debug Toolbar to see the SQL queries."

- Too complex: "I wrote an AST transformer that injects logging decorators at parse time, preserving source maps for debugging."



**Why simplicity beats elegance:**



- **Simple solutions are maintainable.** You can hand them off or revisit them in six months.

- **Simple solutions are debuggable.** When they break, you can fix them quickly.

- **Simple solutions are transferable.** Your team can replicate them on other projects.

- **Simple solutions are disposable.** You can throw them away when they're no longer needed without guilt.



Elegant solutions feel good to create. Simple solutions feel good to use. Choose use over creation.



#### Understanding over instrumentation



You're exploring a codebase and you keep wanting to add logging, insert print statements, modify code to track execution. This impulse is natural but often counterproductive.



**The instrumentation impulse looks like this:**



```python

# Original code:

def process_order(order_id):

    order = Order.objects.get(id=order_id)

    process_payment(order)

    send_confirmation(order)

    return order



# After your "exploration":

def process_order(order_id):

    print(f">>> ENTERING process_order with {order_id}")

    order = Order.objects.get(id=order_id)

    print(f">>> LOADED ORDER: {order}")

    print(f">>> CALLING process_payment")

    process_payment(order)

    print(f">>> PAYMENT PROCESSED")

    print(f">>> CALLING send_confirmation")

    send_confirmation(order)

    print(f">>> CONFIRMATION SENT")

    return order

    print(f">>> EXITING process_order")  # Never executes!

```



You've modified the code to understand it. Now the code has two problems: the original logic you were trying to understand, and your instrumentation that you need to remove.



**Understanding over instrumentation means:**



**1. Observe before modifying**



Use tools that don't require code changes:

- Debuggers (set breakpoints, no code change needed)

- Framework tools (Django Debug Toolbar, React DevToolsβ€”no code change needed)

- Profilers (`py-spy`, Chrome DevToolsβ€”attach to running process, no code change needed)



If you can answer your question by observing rather than instrumenting, always observe.



**2. Temporary exploration stays in separate branches**



If you must modify code to explore, do it on a git branch:



```bash

# Don't do this in your main branch:

git add -u

git commit -m "added logging to understand order processing"



# Do this instead:

git checkout -b exploration/order-flow

# Add all your print statements, logging, temporary code

# Explore until you understand

git checkout main  # Return to clean code

# Delete the branch when done

```



The goal is understanding the code, not changing it. Your exploration artifacts should be ephemeral.



**3. Document insights, remove instrumentation**



After tracing and exploring, you understand the system. Now you have two choices:



**Bad:** Leave your instrumentation in place "just in case it's useful later."



```python

def process_order(order_id):

    # TODO: Remove this debug logging (added 2024-03-15)

    logger.debug(f"Processing order {order_id}")

    order = Order.objects.get(id=order_id)

    logger.debug(f"Order loaded: {order.status}")

    process_payment(order)

    logger.debug("Payment processed successfully")

    # ... more temporary logging ...

```



**Good:** Remove all instrumentation. Document your understanding instead.



```python

def process_order(order_id):

    """Process an order through payment and confirmation.



    Execution flow:

    1. Loads order from database

    2. Processes payment (blocking, ~500ms)

    3. Sends confirmation email (async via Celery)



    Note: Payment must succeed before confirmation sends.

    If payment fails, the order state remains unchanged.

    """

    order = Order.objects.get(id=order_id)

    process_payment(order)

    send_confirmation(order)

    return order

```



Your future self doesn't need to see print statements. They need to understand what the function does and why. Documentation captures understanding; instrumentation captures exploration process.



**4. Recognize when you're instrumenting because you're stuck**



Sometimes you add instrumentation not because you need it, but because you're frustrated:



"I can't figure out why this doesn't work. Let me add logging everywhere."



This is a warning sign. When you find yourself adding print statements to more than three places, stop. You're not exploring anymore; you're thrashing.



The reset protocol:

1. **Remove all your instrumentation** (git reset --hard or delete the branch)

2. **State your actual question clearly:** "I need to understand why payment processing fails for orders over $1000"

3. **Choose the right tool:** Use the debugger, set a breakpoint at the payment function, trigger with a $1001 order

4. **Observe, don't modify:** Step through execution until you see the failure



Understanding comes from asking the right question with the right tool. Instrumentation comes from asking vague questions with the wrong approach.



**The understanding-first checklist:**



Before adding any instrumentation code, ask:



- ✓ Can I answer this with the debugger? (Try it first)

- ✓ Can I answer this with framework tools? (Django Debug Toolbar, etc.)

- ✓ Can I answer this with profilers? (`py-spy`, Chrome DevTools, etc.)

- ✓ Can I answer this by reading existing logs? (Check production/development logs)

- ✓ Have I clearly stated what I'm trying to understand? (Write it down)



Only after checking all these should you write instrumentation code. And when you do, commit to removing it afterward.



**A real-world example:**



A developer was trying to understand why a Django view was slow. Their approach:



```python

import time



def my_view(request):

    start = time.time()

    queryset = get_queryset()

    print(f"Queryset: {time.time() - start:.3f}s")



    start = time.time()

    data = list(queryset)

    print(f"Evaluation: {time.time() - start:.3f}s")



    start = time.time()

    serializer = MySerializer(data, many=True)

    print(f"Serialization: {time.time() - start:.3f}s")



    # ... 20 more timing statements ...

```



They spent an hour adding timing statements, another hour analyzing the output.



The expert's approach:



```python

# No code changes. Open Django Debug Toolbar SQL panel.

# See: 47 queries, 12 duplicates, total time 1.2 seconds

# Conclusion: N+1 query problem in serializer

```



Django Debug Toolbar showed the problem immediately. No instrumentation needed. The expert got their answer in 2 minutes instead of 2 hours.



**This is crucial:** Instrumentation is a last resort, not a first instinct. Understanding comes from observation, not modification. The best tracers are those who can explore a codebase thoroughly while changing nothing.



---



### 7.23 Teaching Others to Trace



You've mastered execution tracing. You can navigate unfamiliar codebases efficiently, choose the right tools intuitively, and build accurate mental models quickly. Now you face a different challenge: teaching these skills to others.



Teaching tracing is hard because execution flow is invisible and dynamic. You can't point at a file and say "here's where it happens"; you have to demonstrate the process of discovery. But if you do it well, you multiply your impact: every developer you teach becomes more effective at understanding complex systems.



#### Onboarding new developers with tracing workflows



New developers joining your team face a steep learning curve. They need to understand not just the code, but how it executes. Traditional onboarding ("here's the README, browse the code, ask questions") leaves them floundering.



**Tracing-first onboarding works better.**



**Week 1: Guided Tracing Sessions**



Instead of code reading, do live tracing sessions where the new developer drives and you guide:



**Session 1: "Your First Request" (90 minutes)**



"You're going to trace a complete HTTP request from start to finish. I'll guide you, but you'll operate the tools."



1. **Setup (15 minutes)**

   - Install Django Debug Toolbar (or equivalent for your stack)

   - Configure VS Code debugger

   - Verify everything works with a simple breakpoint



2. **High-level overview (20 minutes)**

   - Submit a real request (e.g., login form)

   - Open Debug Toolbar, look at each panel together

   - You narrate: "See this? 6 SQL queries. This one runs twice; that's probably inefficient. Keep that in mind."



3. **Deep dive with debugger (40 minutes)**

   - Set breakpoint at the view entry point

   - Have them step through: "Press F10. What function are you in now? What does this do?"

   - When they're uncertain: "Let's check. Press F11 to step into that function."

   - Point out key moments: "Notice we just crossed into third-party code. This is where Django's session middleware runs."



4. **Documentation exercise (15 minutes)**

   - Together, sketch the execution flow on a whiteboard

   - Have them write a summary in their own words

   - Review and correct misunderstandings



**The key is active participation.** They're not watching you trace; they're tracing while you guide. They make the observations, you ask the questions: "What do you notice here? Why do you think it does that? What would happen if...?"



**Session 2: "Database Interactions" (90 minutes)**



Focus on ORM behavior and query patterns:



1. **Instrument a feature** (30 minutes)

   - Pick a feature that loads data (e.g., user dashboard)

   - Use Debug Toolbar to see all queries

   - Have them identify: Which queries are necessary? Which seem redundant?



2. **Diagnose N+1 queries** (30 minutes)

   - Show them a classic N+1 problem

   - Trace why it happens: "Step through this loop. Each iteration runs a query. Why?"

   - Fix it together with `select_related()` or `prefetch_related()`

   - Verify the fix: "How many queries now?"



3. **Compare approaches** (30 minutes)

   - Show the same data loading done two different ways

   - Trace both, compare query counts and timing

   - Discuss tradeoffs: "This approach is simpler but slower. Is that okay for this use case?"



**Session 3: "Authentication & Authorization" (90 minutes)**



Trace a security-critical flow:



1. **Trace the login process**

   - From form submission to session creation

   - Identify where password checking happens

   - See where session cookies are set



2. **Trace permission checks**

   - Access a protected resource

   - See where authentication middleware runs

   - Trace the decorator that checks permissions



3. **Explore failure modes**

   - Try invalid credentialsβ€”trace the error path

   - Try accessing forbidden resourcesβ€”see how rejection works



**The progression:** Each session builds on the previous. By week's end, they've traced enough flows to start recognizing patterns: "Oh, this is like the login flow we traced, but for password reset."



**Week 2: Semi-Guided Exploration**



Give them tracing assignments to complete independently:



**Assignment 1: "Trace a Bug Fix"**



"Bug #247 says password reset emails aren't sending. Trace the password reset flow and document:

1. Where the request enters

2. Where the email should be sent

3. Why it's not sending (use the debugger to find the failure point)

4. What the fix should be"



Review their work together. Did they identify the root cause? Did they trace efficiently?



**Assignment 2: "Feature Discovery"**



"We're going to add a feature similar to 'export to CSV.' First, trace the existing 'export to PDF' feature and document:

1. How it's triggered

2. What data it collects

3. How the PDF is generated

4. Where the response is constructed



Then use that knowledge to design the CSV export."



This teaches them to use tracing as a design tool, not just a debugging tool.



**Week 3: Independent Mastery**



By week three, they should trace independently. Give them real work that requires understanding unfamiliar code:



"Implement feature X. You'll need to understand how subsystem Y works. Trace it, document what you learn, then implement your feature."



Check in periodically: "Are you stuck? Show me how you're tracing. Let me see your approach."



**The onboarding outcome:**



After three weeks, new developers should:

- Set breakpoints and step through code without guidance

- Use framework-specific tools to understand system behavior

- Recognize common patterns (N+1 queries, middleware chains, signal handlers)

- Document what they discover for future reference

- Know when to ask for help (after attempting to trace, not before)



**This is crucial:** Traditional onboarding teaches *what* the code does. Tracing-first onboarding teaches *how to discover* what code does. The second skill is far more valuable because it works on any codebase, not just yours.



#### Pairing sessions with debuggers



Pair programming is common. Debug-pairing is rare but powerful. When two developers explore unfamiliar code together using a debugger, both learn faster than either would alone.



**Debug-pairing works differently than code-pairing:**



**Pair Programming Model:**

- Driver writes code

- Navigator reviews and suggests

- Roles switch every 15-30 minutes



**Debug-Pairing Model:**

- Driver controls debugger

- Navigator asks questions and predicts behavior

- Both observe execution together

- Roles switch when driver gets stuck or finds something interesting



**A debug-pairing session looks like this:**



**Example: Understanding a Celery Task Failure**



Driver and Navigator are both looking at the screen. Driver controls keyboard/mouse.



**Navigator:** "Okay, we know the task fails sometimes but not always. Let's trace a successful execution first, then a failing one. Where should we start?"



**Driver:** "I'll set a breakpoint at the task entry point." [Sets breakpoint in `process_upload_task`]



**Navigator:** "Good. Now trigger a successful upload. What do we expect to see?"



**Driver:** "We should hit the breakpoint, then probably see it call the parser, then the database save..." [Triggers upload]



**[Breakpoint hits]**



**Driver:** "Okay, we're at the task entry point. The `file_path` argument is `/tmp/upload_xyz.csv`." [Steps forward]



**Navigator:** "Wait, before you step, let's predict: What should happen next? I think it opens the file. Let's verify that assumption."



**Driver:** [Steps] "Yes, it calls `open(file_path)`. The file handle is... wait, it's None. Why is it None?"



**Navigator:** "The `open()` returned None? That doesn't seem right. Can you step into the `open()` call?"



**Driver:** [Steps into] "Oh, it's not the built-in `open()`. It's a custom function. Let me see what it does..."



**Navigator:** "Interesting. So they've wrapped file opening. What's it doing differently?"



**Driver:** [Reads code] "It's checking if the file exists before opening. If the file doesn't exist, it returns None instead of raising an exception."



**Navigator:** "Okay, that explains why `file_path` being wrong would cause silent failure. Let's check: what happens when `open()` returns None?"



**[They trace through and discover no error handling for None return]**



**Navigator:** "There's the bug. If the file doesn't exist, `open()` returns None, and then the next line tries to call `.read()` on None. Should crash, but maybe it's caught somewhere?"



**Driver:** [Continues tracing] "Yeah, there's a generic exception handler that logs and returns. So failures are logged but not obvious."



**Navigator:** "Perfect. Now let's trace the failing case. Can you trigger an upload that fails?"



**[They repeat the process and discover the file path is constructed incorrectly in certain cases]**
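The pattern the pair uncovered looks roughly like this (a reconstruction for illustration, not the project's actual code):

```python
import logging
import os

logger = logging.getLogger(__name__)

def open_upload(file_path):
    # The wrapper they found: returns None instead of raising when the file
    # is missing, so a bad path fails silently instead of loudly.
    if not os.path.exists(file_path):
        return None
    return open(file_path)

def process_upload_task(file_path):
    try:
        handle = open_upload(file_path)
        data = handle.read()  # AttributeError here when handle is None
        ...
    except Exception:
        # Generic handler: the failure is logged but never surfaced to the caller.
        logger.exception("upload processing failed")
```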



**Notice the dynamic:**



- **Navigator asks predictive questions:** "What should happen next?" This keeps both developers engaged and thinking ahead.

- **Both developers reason out loud:** Making thinking visible helps catch mistakes and misconceptions.

- **They pause to discuss:** When something unexpected happens, they stop and figure out why before continuing.

- **Roles are fluid:** When Driver is stuck ("I don't know what to check next"), Navigator suggests the next step. When Navigator is confused ("Wait, what just happened?"), Driver explains.



**Debug-pairing is especially valuable for:**



**1. Complex async code**



```python

# Tracing this alone is confusing:

async def process_batch(items):

    tasks = [process_item(item) for item in items]

    results = await asyncio.gather(*tasks, return_exceptions=True)

    return [r for r in results if not isinstance(r, Exception)]

```



With a pair:

- Driver steps through the execution

- Navigator asks: "So `gather` runs all tasks concurrently? Let's verify: can you check how many tasks are running right now?"

- They discover that exceptions in one task don't crash others

- They understand the exception handling pattern together



**2. Framework magic**



```python

# What actually happens here?

class MyView(ListView):

    model = User

    template_name = 'users.html'

```



Pair tracing reveals:

- Driver: "I'll set a breakpoint at the start of the view dispatch..."

- Navigator: "But there's no dispatch method in this class. Where should we break?"

- Driver: "Good point. Let me check the parent class... ListView inherits from... [traces inheritance]"

- Together they discover Django's CBV dispatch mechanism, template resolution, and queryset building (one way to reproduce that discovery yourself is sketched below)
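A low-friction way to reproduce that discovery is a throwaway `dispatch` override that drops into the debugger, deleted after the session (the `myapp` import is hypothetical):

```python
from django.views.generic import ListView
from myapp.models import User  # hypothetical import path

class MyView(ListView):
    model = User
    template_name = "users.html"

    # Temporary exploration hook; remove it once the pairing session is over.
    def dispatch(self, request, *args, **kwargs):
        breakpoint()  # stop here, then step into super() to watch Django's CBV machinery
        return super().dispatch(request, *args, **kwargs)
```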



**3. Distributed system interactions**



Tracing how a service calls another service:



- Driver traces the client code making the request

- Navigator checks the server logs to see what it receives

- They correlate timing: "You made the request at timestamp X, I see it arrived at timestamp X+200ms"

- They discover timeouts, retries, and error handling patterns



**Best practices for debug-pairing:**



**Do:**

- ✓ Pause to discuss unexpected behavior

- ✓ Predict what should happen before stepping

- ✓ Switch roles when energy drops or someone gets stuck

- ✓ Document discoveries immediately (shared notes document)

- ✓ Set time limits (90 minutes max per session)



**Don't:**

- ✗ Let one person passively watch

- ✗ Skip over "boring" parts; sometimes those reveal key insights

- ✗ Argue about what *should* happen; trace what *does* happen

- ✗ Try to fix bugs during the tracing session; understand first, fix later



**The pairing outcome:**



After a debug-pairing session, both developers should have shared understanding. Test this: Can each person independently explain what you discovered? If not, you moved too fast or one person disengaged.



#### Building team documentation from traces



Individual developers trace code all the time. The knowledge they gain often stays in their heads. Teaching others to trace means teaching them to externalize and share that knowledge.



**The problem:** Every developer traces the authentication flow when they first work on auth features. This is wasted effortβ€”one person should trace it thoroughly and document it for everyone.



**The solution:** Make "trace and document" an explicit part of your team's workflow.



**Documentation Pattern 1: Execution Flow Maps**



When a developer traces a complex flow, they create a flow map in your team wiki:



```markdown

## Shopping Cart Checkout Flow



Last traced: 2024-03-15 by @sarah



### Overview

The checkout process involves 4 services and 7 database tables.

Total execution time: 1.2-2.5 seconds depending on payment provider.



### Execution Sequence



```

[Frontend Cart Page]

    ↓ POST /api/checkout/initiate

[API Gateway]

    ↓ JWT validation (10ms)

[Checkout Service]

    ├─ Validate cart items still available (DB query, 50ms)

    ├─ Calculate shipping costs (external API, 200-800ms)

    ├─ Apply promo codes (Redis lookup, 5ms)

    └─ Create pending order (DB write, 20ms)

    ↓ POST to Payment Service

[Payment Service]

    ├─ Tokenize payment method (Stripe API, 300-1000ms)

    └─ Create payment intent (Stripe API, 200-500ms)

    ↓ Return payment intent ID

[Checkout Service]

    └─ Update order with payment intent (DB write, 15ms)

    ↓ Response to client

[Frontend]

    └─ Redirect to payment confirmation page

```



### Critical Details



**Database Transactions:**

- Order creation and payment intent are in separate transactions

- If payment service fails, the pending order remains in database

- Background job cleans up abandoned orders after 30 minutes



**Failure Modes:**

- Shipping API timeout: Falls back to standard shipping (logged for review)

- Payment service down: Shows error, allows retry

- Invalid promo code: Proceeds without discount, shows warning



**Performance Hotspots:**

- Shipping cost calculation is slowest step (external API)

- Payment tokenization adds 300-1000ms latency

- Consider caching shipping costs for zip code/weight combinations



### Tools Used

- Chrome DevTools Network tab (frontend timing)

- API Gateway logs (request tracing)

- Distributed tracing (Jaeger span view)

- VS Code debugger (checkout service internals)



### Open Questions

- Why do we create pending order before payment? (Risk of abandoned orders)

- Could payment tokenization happen asynchronously? (UX tradeoff)



### Related Documentation

- [Payment Service API Docs](link)

- [Cart Service Architecture](link)

```



**Notice what this provides:**



- **Concrete timing data** (future developers can set performance budgets)

- **Failure mode documentation** (helps with error handling and testing)

- **Performance insights** (guides optimization efforts)

- **Tracing methodology** (others can verify or update this)

- **Open questions** (prompts architectural discussion)



This isn't a code comment. It's team knowledge extracted from traces.



**Documentation Pattern 2: The "How I Traced This" Guide**



For particularly tricky traces, document the discovery process:



```markdown

## How to Trace WebSocket Connection Handling



**The Problem:** WebSocket connections sometimes drop unexpectedly. We need to understand the lifecycle.



**The Challenge:** WebSocket code spans multiple layers: Nginx, Django Channels, Redis, and business logic.



**The Tracing Approach:**



### Step 1: Client-Side Tracing

1. Open Chrome DevTools → Network → WS tab

2. Connect to WebSocket

3. Observe messages in both directions

4. Note: Connection stays open for ~60 seconds, then closes



### Step 2: Server-Side Entry Point

1. Set breakpoint in `consumers.py → ChatConsumer.connect()`

2. Trigger connection from browser

3. Step through authorization logic

4. Note: Connection is accepted, added to group



### Step 3: Message Flow

1. Send a chat message from browser

2. Breakpoint in `ChatConsumer.receive()`

3. Trace through to `channel_layer.group_send()`

4. Key insight: Message goes to Redis, then back to all consumers in group



### Step 4: Disconnect Tracing

1. Wait for automatic disconnect (~60s)

2. Breakpoint in `ChatConsumer.disconnect()`

3. Check `close_code`: It's 1000 (normal closure)

4. Check server logs: No errors

5. Check Nginx config: `proxy_read_timeout 60s` ← Found it!



### The Discovery

Nginx was closing idle WebSocket connections after 60 seconds. The close appeared mysterious because there were no errors—it was intentional timeout behavior.



**Fix:** Increase `proxy_read_timeout` or implement ping/pong heartbeat.



**Time Investment:**

- Without this guide: 3 hours (tried wrong approaches first)

- With this guide: 30 minutes (direct path to answer)



**Tools Needed:**

- Chrome DevTools

- VS Code Python debugger with Django Channels support

- Access to Nginx config

- Redis CLI (for observing pub/sub)

```



This guide teaches the discovery process, not just the conclusion. Future developers learn *how to think* about WebSocket tracing, not just the answer to this specific question.
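

If you choose the heartbeat option mentioned in the fix, the sketch below shows one way it could look in a Django Channels consumer. It is illustrative only: the `ChatConsumer` name comes from the guide above, and the 30-second interval is an assumption rather than a value from any real codebase.

```python
# Sketch only: a server-side heartbeat so idle WebSocket connections keep
# traffic flowing and Nginx's proxy_read_timeout never fires.
# ChatConsumer and the 30-second interval are illustrative assumptions.
import asyncio
import json

from channels.generic.websocket import AsyncWebsocketConsumer


class ChatConsumer(AsyncWebsocketConsumer):
    async def connect(self):
        await self.accept()
        # Start a background task that pings the client periodically.
        self._heartbeat = asyncio.create_task(self._send_pings())

    async def _send_pings(self):
        while True:
            await asyncio.sleep(30)  # stay well under the 60s proxy timeout
            await self.send(text_data=json.dumps({"type": "ping"}))

    async def disconnect(self, close_code):
        # Stop the heartbeat when the connection closes for any reason.
        if hasattr(self, "_heartbeat"):
            self._heartbeat.cancel()
```

Raising `proxy_read_timeout` in Nginx is the simpler fix; a heartbeat like this is mainly useful when you don't control the proxy configuration.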



**Documentation Pattern 3: Team Trace Library**



Create a shared collection of common tracing scenarios:



```

team-docs/

├── tracing-guides/

│   ├── authentication-flow.md

│   ├── api-request-lifecycle.md

│   ├── database-query-patterns.md

│   ├── background-job-execution.md

│   ├── websocket-connections.md

│   ├── email-sending-pipeline.md

│   └── cache-invalidation-flow.md

```



Each guide follows the same structure:

1. **What you'll learn** (specific questions answered)

2. **Tools needed** (debugger, framework tools, etc.)

3. **Step-by-step tracing process** (exactly what to do)

4. **Common pitfalls** (what to avoid)

5. **Expected outcomes** (what you should observe)

6. **Last verified** (date + person, so guides stay current)



**Making documentation a habit:**



Documentation only works if it's maintained. Make it part of your workflow:



**Code review requirement:** "If you traced something complex to implement this feature, add a tracing guide or update an existing one."



**Onboarding checklist:** "Complete three traces from the tracing guides library and update them with any changes you notice."



**Monthly rotation:** One team member per month is "documentation lead"—their job is to identify missing guides, update stale ones, and organize the library.



**The documentation outcome:**



When a new developer asks "How does X work?" the answer should be:



"Here's the tracing guide for X. Follow it, and you'll understand. If anything is unclear or outdated, please update the guide so the next person has better information."



Not: "Let me explain it to you..." or "Just read the code..."



Documentation transforms individual knowledge into team knowledge. Every trace you document is a trace future developers don't have to repeat.



#### Creating project-specific tracing guides



Every codebase has unique characteristics—custom frameworks, unusual architecture, domain-specific complexity. Generic tracing knowledge helps, but project-specific guides make developers productive faster.



**A project-specific tracing guide answers:**



1. **What are the entry points?** (Where does execution begin in this codebase?)

2. **What are the common patterns?** (How is this codebase structured?)

3. **What are the gotchas?** (What will confuse developers?)

4. **What tools work best?** (Framework-specific, project-specific tools)



**Example: Django E-commerce Project Tracing Guide**



```markdown

# Tracing Guide: ShopPlatform



## Project Architecture Overview



ShopPlatform is a Django 4.2 monolith with:

- 23 Django apps (authentication, cart, checkout, inventory, etc.)

- Celery for background jobs (Redis broker)

- PostgreSQL database

- Redis for caching and sessions

- Stripe for payments



## Quick Start: Your First Trace



**Goal:** Understand a complete purchase flow in 30 minutes.



**Setup (5 minutes):**

```bash

# Install debugging tools

pip install django-debug-toolbar ipdb



# Add to settings.py (already configured in development)

# INSTALLED_APPS includes 'debug_toolbar'

# INTERNAL_IPS = ['127.0.0.1']

```



**Trace Exercise (25 minutes):**



1. **Start the server with debugger support:**

   ```bash

   python manage.py runserver

   ```



2. **Navigate to http://localhost:8000** and add items to cart



3. **Open Django Debug Toolbar** (right side of page)

   - Note: Cart operations use Redis, not database (see Cache panel)



4. **Proceed to checkout:**

   - Watch SQL panel: Notice the inventory check queries

   - Watch Signal panel: See `pre_checkout` signal firing



5. **Set breakpoint in checkout view:**

   ```python

   # In apps/checkout/views.py

   class CheckoutView(View):

       def post(self, request):

           import ipdb; ipdb.set_trace()  # ← Add this

           # ... rest of method

   ```



6. **Submit checkout form:**

   - You'll drop into ipdb debugger

   - Type `n` to step through line by line

   - Type `s` to step into function calls

   - Type `c` to continue until next breakpoint



7. **Observe the flow:**

   - Form validation

   - Inventory reservation (database lock)

   - Payment processing (Stripe API call)

   - Order creation (database transaction)

   - Email sending (Celery task queued)



## Common Entry Points



### HTTP Requests

- **URL routing:** `config/urls.py` includes app URLs

- **App URLs:** Each app has `urls.py` (e.g., `apps/checkout/urls.py`)

- **Views:** Class-based views in `apps/*/views.py`



### Background Jobs

- **Task definitions:** `apps/*/tasks.py`

- **Task execution:** Celery worker (see logs with `celery -A config worker --loglevel=info`)



### Database Operations

- **Models:** `apps/*/models.py`

- **Signals:** `apps/*/signals.py` (many side effects happen here!)



### API Endpoints

- **DRF views:** `apps/api/views.py`

- **Serializers:** `apps/api/serializers.py`



## Project-Specific Patterns



### Pattern 1: Middleware-Heavy Architecture



We use 12 custom middleware components. To trace them:



```python

# In settings.py, MIDDLEWARE order matters:

MIDDLEWARE = [

    'django.middleware.security.SecurityMiddleware',

    'apps.core.middleware.RequestTimingMiddleware',  # ← Ours

    'apps.core.middleware.TenantMiddleware',         # ← Ours (multi-tenant)

    # ... etc

]

```



**Set breakpoint in any middleware's `__call__` method** to see when it executes in the request/response cycle.
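

For reference, all new-style Django middleware share the same shape, so the breakpoint placement is identical everywhere. The sketch below is illustrative only (it is not the real `RequestTimingMiddleware`); it just shows where `__call__` sits relative to the view:

```python
# Sketch of the standard middleware shape, not the real RequestTimingMiddleware.
import time


class ExampleTimingMiddleware:
    def __init__(self, get_response):
        self.get_response = get_response  # the next middleware, or the view

    def __call__(self, request):
        start = time.monotonic()                # breakpoint here = before the view runs
        response = self.get_response(request)  # everything deeper happens inside this call
        response["X-Request-Time"] = f"{time.monotonic() - start:.3f}s"
        return response                         # breakpoint here = after the view runs
```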



**Key insight:** `TenantMiddleware` modifies database routing. If you're confused about which database a query hits, check the current tenant.



### Pattern 2: Signal-Heavy Side Effects



We use Django signals extensively. Many actions trigger hidden side effects:



```python

# Example: Saving an Order triggers multiple signals

order.save()

# → post_save signal

#   → inventory_update signal handler (decrements stock)

#   → analytics_track signal handler (sends to analytics service)

#   → notification_send signal handler (queues email task)

```



**To trace signals:**

1. Check `apps/*/signals.py` for signal handlers

2. Use Django Debug Toolbar Signal panel to see what fires

3. Set breakpoints in signal handler functions
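

If you just want to see which models fire `post_save` during a request before committing to breakpoints, a temporary catch-all receiver works (sketch only, not project code; delete it when you're done):

```python
# Temporary tracing aid: logs every post_save that fires, from any model.
from django.db.models.signals import post_save
from django.dispatch import receiver


@receiver(post_save)
def trace_post_save(sender, instance, created, **kwargs):
    print(f"post_save: {sender.__name__} pk={instance.pk} created={created}")
```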



### Pattern 3: Async Celery Tasks

Many operations defer work to Celery:



```python

# In view code:

send_order_confirmation.delay(order.id)  # Returns immediately



# Actual execution happens in Celery worker process

```



**To trace Celery tasks:**



1. **View task registration:**

   ```bash

   # Show all registered tasks

   celery -A config inspect registered

   ```



2. **Run worker in foreground with debugging:**

   ```bash

   celery -A config worker --loglevel=debug --pool=solo

   ```



3. **Set breakpoint in task function:**

   ```python

   # In apps/orders/tasks.py

   @shared_task

   def send_order_confirmation(order_id):

       import ipdb; ipdb.set_trace()

       # ... task code

   ```



4. **Trigger the task** and watch worker console drop into debugger



**Key insight:** Task failures are logged but don't raise exceptions in the calling code. Check Celery logs for task failures.
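

To make that concrete, `.delay()` only hands back an `AsyncResult`; the caller has to ask about failure explicitly. A small sketch (assumes a Celery result backend is configured; `order` and `send_order_confirmation` are from the example above):

```python
# Sketch: the caller is never interrupted if the task later fails.
result = send_order_confirmation.delay(order.id)

print(result.id)      # task id, available immediately
print(result.status)  # PENDING / SUCCESS / FAILURE (needs a result backend)

# Only an explicit get() re-raises the task's exception in the caller:
# result.get(timeout=10)
```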



### Pattern 4: Database Connection Routing



We use multiple databases (primary + read replica). Connection routing is in `config/db_router.py`.



**To trace which database a query uses:**



```python

from django.db import connections



# In your debugging session:

print(connections['default'].connection)  # Primary

print(connections['replica'].connection)  # Read replica

```



Or use Django Debug Toolbar SQL panel—it shows the connection name for each query.
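

For orientation, a database router is just a class with a few hook methods. The sketch below shows the general shape only; the actual rules in `config/db_router.py` (including tenant-aware routing) differ:

```python
# Sketch only: a simple primary/replica router, not the real config/db_router.py.
class PrimaryReplicaRouter:
    def db_for_read(self, model, **hints):
        return "replica"   # reads go to the read replica

    def db_for_write(self, model, **hints):
        return "default"   # writes always hit the primary

    def allow_relation(self, obj1, obj2, **hints):
        return True        # both databases hold the same data

    def allow_migrate(self, db, app_label, model_name=None, **hints):
        return db == "default"  # only run migrations on the primary
```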



## Common Gotchas



### Gotcha 1: Cached User Object



```python

# This might not be fresh:

user = request.user



# User object is cached on request. If you modify user in database:

User.objects.filter(pk=user.pk).update(email='new@example.com')



# request.user still has old email! Refresh it:

user.refresh_from_db()

```



### Gotcha 2: Middleware That Modifies the Request



Our `TenantMiddleware` modifies `request.tenant` based on subdomain. If you're tracing and confused about where data comes from, check `request.tenant` first.



### Gotcha 3: Different Behavior in Tests vs. Development



Some behavior only happens with certain settings:



```python

# In production/development:

CELERY_TASK_ALWAYS_EAGER = False  # Tasks run async



# In tests:

CELERY_TASK_ALWAYS_EAGER = True   # Tasks run synchronously

```



If you're tracing task execution and it seems synchronous, check this setting.



### Gotcha 4: Django Debug Toolbar Doesn't Show on AJAX Requests



Debug Toolbar only appears on full page loads. For AJAX requests, you can still see the captured data: navigate to `/__debug__/`, which lists recent requests and their debug information.



### Gotcha 5: Lazy QuerySets



```python

# This doesn't hit the database:

users = User.objects.filter(is_active=True)



# This does:

list(users)  # or iterating, or len(), or bool(), etc.

```



**Use Django Debug Toolbar SQL panel** to see exactly when queries execute, not when QuerySets are created.



## Recommended Tracing Workflows



### Workflow 1: "I Don't Understand This Feature"



**Goal:** Understand how existing feature works.



1. **Identify entry point:**

   - Find the URL in browser DevTools Network tab

   - Locate view in `config/urls.py` and app URLconf



2. **Use Debug Toolbar first:**

   - Perform the action

   - Check SQL panel (database operations)

   - Check Signals panel (side effects)

   - Check Cache panel (cache reads/writes)



3. **Set breakpoint if needed:**

   - Identify the most interesting function from step 2

   - Add `import ipdb; ipdb.set_trace()`

   - Step through to understand logic



4. **Document findings:**

   - Add/update tracing guide for this feature

   - Update code comments if logic is subtle



### Workflow 2: "Why Is This Slow?"



**Goal:** Identify performance bottleneck.



1. **Measure first:**

   - Use Debug Toolbar Timing panel for total time

   - Check SQL panel for query count and time



2. **Profile if necessary:**

   ```bash

   # Install if not present

   pip install django-silk



   # Access at http://localhost:8000/silk/

   # Shows detailed profiling of every request

   ```



3. **Common culprits in this codebase:**

   - N+1 queries (check SQL panel for duplicate queries)

   - External API calls (Stripe, shipping providers)

   - Unoptimized QuerySets (missing `select_related`/`prefetch_related`; see the sketch after this list)

   - Large serialization (DRF serializers can be slow)
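

As promised above, here is what the `select_related` culprit typically looks like. The model names are illustrative (borrowed from the examples earlier in this guide):

```python
# N+1: one query for the orders, plus one extra query per order for its user.
orders = Order.objects.filter(status="pending")
emails = [order.user.email for order in orders]

# Fixed: a single JOINed query fetches each order's user up front.
orders = Order.objects.filter(status="pending").select_related("user")
emails = [order.user.email for order in orders]
```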



### Workflow 3: "This Celery Task Fails Sometimes"



**Goal:** Debug intermittent task failure.



1. **Check Celery logs:**

   ```bash

   # Worker logs show exception tracebacks

   tail -f logs/celery.log

   ```



2. **Add detailed logging:**

   ```python

   import logging

   from celery import shared_task

   logger = logging.getLogger(__name__)

   @shared_task

   def my_task(order_id):

       logger.info(f"Starting task for order {order_id}")

       try:

           order = Order.objects.get(id=order_id)

           logger.info(f"Order state: {order.status}")

           # ... rest of task

       except Exception as e:

           logger.exception(f"Task failed for order {order_id}")

           raise

   ```



3. **Test with --pool=solo:**

   ```bash

   # Runs tasks synchronously, easier to debug

   celery -A config worker --pool=solo --loglevel=debug

   ```



4. **Use ipdb in task code** (only works with solo pool)



### Workflow 4: "This Integration Is Broken"



**Goal:** Trace interaction with external service (Stripe, shipping API, etc.).



1. **Enable request logging:**

   ```python

   # In settings_dev.py

   LOGGING = {

       'version': 1,

       'handlers': {

           'console': {'class': 'logging.StreamHandler'},

       },

       'loggers': {

           'urllib3': {'level': 'DEBUG', 'handlers': ['console']},  # Logs HTTP requests

       },

   }

   ```



2. **Check request/response:**

   - Use `requests` library's event hooks (see the sketch after this list)

   - Or check service's dashboard (e.g., Stripe dashboard shows all API calls)



3. **Common integration issues:**

   - API key misconfiguration (check `.env` file)

   - Webhook signature verification failing

   - Request timeout (default: 30s for most external calls)
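

Here is the sketch referenced in step 2. It uses the `requests` library's built-in `response` hook; the logger name and session setup are illustrative, not existing project code:

```python
# Sketch only: log every outbound request/response through a shared session.
import logging

import requests

logger = logging.getLogger("integrations")


def log_response(response, *args, **kwargs):
    # The hook receives the Response; the original request hangs off it.
    logger.debug(
        "%s %s -> %s",
        response.request.method,
        response.request.url,
        response.status_code,
    )


session = requests.Session()
session.hooks["response"].append(log_response)
# Route external calls through `session` and each one will be logged.
```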



## Tool Configuration



### VS Code Debugger



We have `.vscode/launch.json` configured:



```json

{

  "configurations": [

    {

      "name": "Django Server",

      "type": "python",

      "request": "launch",

      "program": "${workspaceFolder}/manage.py",

      "args": ["runserver", "--noreload"],

      "django": true

    },

    {

      "name": "Celery Worker",

      "type": "python",

      "request": "launch",

      "module": "celery",

      "args": ["-A", "config", "worker", "--loglevel=info", "--pool=solo"]

    }

  ]

}

```



**Usage:**

- Press F5 → Select "Django Server" → Set breakpoints → Trigger request

- For tasks: F5 → Select "Celery Worker" → Set breakpoints in task code



### Django Debug Toolbar Customization



We have custom panels configured:



```python

# In settings.py

DEBUG_TOOLBAR_PANELS = [

    'debug_toolbar.panels.timer.TimerPanel',

    'debug_toolbar.panels.sql.SQLPanel',

    'debug_toolbar.panels.signals.SignalsPanel',

    'debug_toolbar.panels.cache.CachePanel',

    'apps.core.debug_panels.TenantPanel',  # Custom panel

]

```



**TenantPanel** shows current tenant context—useful for tracing multi-tenant issues.



## FAQs



**Q: How do I trace what happens when a model saves?**



A: Three approaches:



1. **Override save() method temporarily:**

   ```python

   def save(self, *args, **kwargs):

       import ipdb; ipdb.set_trace()

       super().save(*args, **kwargs)

   ```



2. **Use signal handlers:**

   ```python

   from django.db.models.signals import pre_save



   def trace_save(sender, instance, **kwargs):

       import ipdb; ipdb.set_trace()



   pre_save.connect(trace_save, sender=Order)

   ```



3. **Check Debug Toolbar Signals panel** for all signals that fire



**Q: How do I see what templates are rendered?**



A: Debug Toolbar Templates panel shows:

- Which templates were rendered

- Template inheritance chain

- Context variables passed to each template



**Q: Can I trace migrations?**



A: Yes:



```bash

# Run migration with verbose output

python manage.py migrate --verbosity=3

```

Or set a breakpoint inside the migration itself:

```python

# migrations/0042_add_field.py

def forwards(apps, schema_editor):

    import ipdb; ipdb.set_trace()

    # ... migration code

```



**Q: How do I trace authentication/authorization?**



A: Start at middleware:



```python

# apps/core/middleware.py

class AuthenticationMiddleware:

    def __init__(self, get_response):

        self.get_response = get_response

    def __call__(self, request):

        import ipdb; ipdb.set_trace()

        # Step through to see how request.user is populated

        return self.get_response(request)

```



Or check Django Debug Toolbar → Request/Response panel → User section.



**Q: How do I trace what queries Django ORM generates before executing them?**



A: Use `query` attribute:



```python

qs = Order.objects.filter(status='pending').select_related('user')

print(qs.query)  # Shows the SQL that will be generated

```



Or enable query logging:



```python

# In settings_dev.py

LOGGING = {

    'version': 1,

    'handlers': {

        'console': {'class': 'logging.StreamHandler'},

    },

    'loggers': {

        'django.db.backends': {

            'level': 'DEBUG',

            'handlers': ['console'],

        },

    },

}

```



## Getting Help



**If you're stuck tracing something:**



1. **Check this guide first** (you might be hitting a known gotcha)

2. **Ask in #engineering-help Slack channel** with:

   - What you're trying to understand

   - What you've already tried (tools used, breakpoints set)

   - What confused you or didn't work

3. **Pair with someone** who knows the area (see team expertise matrix)

4. **Document your solution** once you figure it out (update this guide!)



## Maintenance



**Last updated:** 2024-03-15 by @tom



**Update schedule:** Review this guide monthly. If you discover outdated info, fix it immediately (it's just a markdown file in the repo).



**What to update:**

- New patterns as architecture evolves

- New gotchas as you discover them

- Tool configurations when they change

- FAQ entries when new common questions emerge



**Version history:** See git history for this file.

```



**Notice what makes this guide valuable:**



1. **Project-specific context:** Not generic tracing advice, but specific to this codebase's patterns

2. **Practical workflows:** Step-by-step instructions for common scenarios

3. **Known gotchas:** Saves developers from common confusion

4. **Tool configurations:** Pre-configured setups they can use immediately

5. **Maintenance plan:** Keeps the guide from becoming stale



**Creating your own project-specific guide:**



**Step 1: Document common patterns (2-3 hours)**



Trace 3-5 representative features in your codebase:

- A typical CRUD operation

- A complex multi-step workflow

- An integration with external service

- A background job

- An error/exception case



Document the patterns you see repeatedly.



**Step 2: Identify entry points (1 hour)**



List all the ways execution enters your codebase:

- HTTP endpoints

- CLI commands

- Background jobs

- Webhooks

- Cron jobs

- Admin actions



Show developers where to set breakpoints for each.



**Step 3: Catalog gotchas (ongoing)**



Whenever you or a teammate gets confused while tracing, document it:

- What you expected

- What actually happened

- Why it happened

- How to recognize this pattern



**Step 4: Configure tools (1 hour)**



Set up and document:

- Debugger configuration (`.vscode/launch.json`, etc.)

- Framework-specific tools (Debug Toolbar, DevTools, etc.)

- Logging configuration for common debugging scenarios



**Step 5: Create FAQs (ongoing)**



When someone asks "How do I trace X?" document the answer in the FAQ section.



**The project-specific guide outcome:**



A new developer should be able to:

- Perform their first trace within 30 minutes of cloning the repo

- Understand common patterns without asking questions

- Avoid known gotchas that would waste hours

- Know exactly which tools to use for which scenarios



**This is the key insight:** Generic tracing knowledge (debuggers, profilers, framework tools) is necessary but not sufficient. Project-specific tracing knowledge (patterns, gotchas, configurations) is what makes developers productive on *your* codebase specifically.



Every hour you invest in creating and maintaining a project-specific tracing guide saves dozens of hours for your team. It's one of the highest-leverage pieces of documentation you can create.



---



**The Teaching Mastery Outcome:**



When you've successfully taught others to trace, you'll see:



1. **Reduced "how does this work?" questions** → Developers trace to find answers themselves

2. **Better architectural discussions** → Discussions based on observed behavior, not assumptions

3. **Faster onboarding** → New developers become productive in days, not weeks

4. **Improved documentation** → Execution insights captured in team docs, not individual heads

5. **Cultural shift** → "Let's trace it" becomes the default response to uncertainty



**This is crucial:** Teaching tracing isn't just about transferring technical skills. It's about cultivating a culture of empirical investigation over speculation, tools over cleverness, and shared understanding over individual knowledge.



The best teams are those where every developer can confidently say: "I don't know how this works yet, but I know exactly how to find out." That's the outcome of excellent tracing education.



You've built mastery when you can teach others not just *what* to do, but *how to think* about execution flow exploration. That's the final level: turning tracing from a set of techniques into a systematic, teachable approach to understanding complex systems.