Part IV: Advanced Techniques

7.10 Building Custom Instrumentation (When Justified)

You've spent three days tracing execution flow in your Django application. The Debug Toolbar shows SQL queries, the VS Code debugger reveals call stacks, and py-spy profiles your hot paths. But now you're facing a new problem: your system spans five microservices, and you need to understand how a single user request flows across all of them. Or perhaps you're debugging a production issue where traditional debuggers can't be attached. Or you need to collect custom metrics that don't exist in any standard tool.

This is the moment where custom instrumentation becomes justified. Notice what got you here: exhausting standard tools first, encountering a specific limitation, and having a clear requirement that no existing tool satisfies. This is not "I wonder how code flows" territory—that's what debuggers are for. This is "I need to trace requests across distributed systems in production" territory.

This is the key insight: Custom instrumentation is justified when standard tools physically cannot solve your problem, not when you haven't learned to use standard tools effectively. The vast majority of developers will never need what's in this section. If you're reading this because tracing a single-application codebase feels hard, go back to Part II and master the debugger first.

7.10.1 When Standard Tools Aren't Enough

Let's ground this in authentic scenarios where professionals actually build custom instrumentation.

Scenario 1: Distributed System Tracing

You're debugging a checkout flow that touches six services: API gateway → authentication → inventory → pricing → payment → fulfillment. A customer reports their payment succeeded but their order never arrived. Your logs show:

[api-gateway] POST /checkout user=12345 request_id=abc123

[auth-service] Validated token user=12345

[inventory-service] Reserved items cart=789

[pricing-service] Calculated total $124.99

[payment-service] Charge succeeded charge_id=ch_xyz

[fulfillment-service] ??? (no log entry)

Notice the problem: each service logs independently. There's no single tool that shows you "here's the complete flow of request abc123 across all systems." Django Debug Toolbar only sees one service. A debugger can't step across network boundaries. This is where distributed tracing becomes essential.
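The tracing tools that solve this all rest on one idea: propagate a single correlation ID across every hop. Here is a minimal sketch of that idea; all names (handle_incoming, outgoing_headers, log) are hypothetical, not from any framework:

```python
import uuid
from contextvars import ContextVar

request_id_var = ContextVar("request_id", default=None)

def handle_incoming(headers):
    """Adopt the caller's request ID, or mint a new one at the edge."""
    rid = headers.get("X-Request-ID") or uuid.uuid4().hex
    request_id_var.set(rid)
    return rid

def outgoing_headers():
    """Headers to attach to any downstream call made for this request."""
    return {"X-Request-ID": request_id_var.get()}

def log(service, message):
    """Every log line carries the ID, so one grep crosses service boundaries."""
    return f"[{service}] {message} request_id={request_id_var.get()}"
```

With this in each service, the fulfillment gap becomes visible: the last service that logged abc123 is where the request died.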

Scenario 2: Production Debugging Requirements

Your FastAPI service starts responding slowly at 10 PM every night. By the time you notice, it's back to normal. You can't attach a debugger to production. Regular profilers would need to run continuously, which adds overhead. You need something that:

  1. Runs in production at all times with negligible overhead

  2. Collects detail only under specific conditions, such as a time window or a slow request

  3. Records enough context to diagnose the problem after the fact

Standard tools don't offer this conditional, low-overhead instrumentation. This is where custom solutions become necessary.
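One common shape for this is threshold-gated logging: instrumentation that always runs but only speaks up when an operation is slow. A sketch, assuming nothing beyond the standard library (log_if_slow is our name, not a standard API):

```python
import time
import logging
from contextlib import contextmanager

logger = logging.getLogger(__name__)

@contextmanager
def log_if_slow(name, threshold_s=0.5):
    """Time an operation, but only log when it crosses the threshold."""
    start = time.time()
    try:
        yield
    finally:
        duration = time.time() - start
        if duration >= threshold_s:
            logger.warning(f"{name} took {duration:.3f}s (threshold {threshold_s}s)")
```

Wrap the suspect endpoints once; during the 10 PM slowdown, warnings with durations appear, and the rest of the day the instrumentation stays silent.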

Scenario 3: Custom Metrics Collection

Your team needs to answer questions like:

  1. How long does checkout take for users with large carts?

  2. Which third-party API calls dominate order-processing time?

  3. How often does each retry or fallback path actually trigger?

These aren't debugging questions—you understand the code. They're measurement questions that require instrumenting specific points in your application to collect business-relevant metrics. APM tools provide generic metrics (requests per second, error rates), but your questions are application-specific.
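To make this concrete, here is a sketch of the smallest useful in-process collector for such questions. A real system would export these to Prometheus or StatsD; the Metrics class and metric names here are hypothetical:

```python
from collections import defaultdict

class Metrics:
    """Counters for 'how often' questions, timing samples for 'how long'."""

    def __init__(self):
        self.counters = defaultdict(int)
        self.samples = defaultdict(list)

    def incr(self, name, value=1):
        self.counters[name] += value

    def observe(self, name, seconds):
        self.samples[name].append(seconds)

    def summary(self, name):
        data = self.samples[name]
        if not data:
            return {"count": 0}
        return {"count": len(data), "avg": sum(data) / len(data), "max": max(data)}

metrics = Metrics()

# Instrument the business-specific points generic APM tools can't see
metrics.incr("checkout.retries")
metrics.observe("checkout.duration_s", 0.42)
metrics.observe("checkout.duration_s", 0.58)
```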

The 80/20 Rule: Start Simple, Extend Only When Necessary

Here's what actually happens in professional environments: you start with the simplest instrumentation that could possibly work, and only add complexity when you hit specific limitations. Let me show you this progression with a real example.

You want to understand how long different parts of your Django view take:

Attempt 1: Print Timestamps (5 minutes to implement)

import time



def process_order(request):

    start = time.time()



    user = authenticate(request)

    print(f"Auth took {time.time() - start:.3f}s")



    items = fetch_cart_items(user)

    print(f"Fetch items took {time.time() - start:.3f}s")



    total = calculate_total(items)

    print(f"Calculate total took {time.time() - start:.3f}s")



    return {"total": total}

This works for simple cases. But you realize:

  1. Every print measures from the same start, so you're logging cumulative time, not per-step time

  2. print output goes to stdout, not your logging system, so it vanishes in production

  3. The timing code tangles with the business logic and has to be deleted by hand later

Attempt 2: Context Manager Logger (30 minutes to implement)

import time

import logging

from contextlib import contextmanager



logger = logging.getLogger(__name__)



@contextmanager

def timed_operation(operation_name):

    start = time.time()

    try:

        yield

    finally:

        duration = time.time() - start

        logger.info(f"{operation_name} took {duration:.3f}s")



def process_order(request):

    with timed_operation("authenticate"):

        user = authenticate(request)



    with timed_operation("fetch_items"):

        items = fetch_cart_items(user)



    with timed_operation("calculate_total"):

        total = calculate_total(items)



    return {"total": total}

This is better. Timings are cleaner, logged properly, and easy to aggregate. For many teams, this is where you stop—it's 80% of what you need.
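Because every line has the same "X took Ys" shape, aggregation is a few lines of parsing. A sketch, with hardcoded sample lines standing in for a real log file:

```python
import re
from collections import defaultdict

# Sample lines as produced by timed_operation; in practice, read from your logs
lines = [
    "authenticate took 0.120s",
    "fetch_items took 0.480s",
    "authenticate took 0.100s",
]

totals = defaultdict(list)
for line in lines:
    match = re.match(r"(\w+) took ([\d.]+)s", line)
    if match:
        totals[match.group(1)].append(float(match.group(2)))

# Average duration per operation
averages = {op: sum(v) / len(v) for op, v in totals.items()}
```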

Attempt 3: Decorator with Request Context (2 hours to implement)

But now you want to correlate timings with specific user requests and see them in your monitoring dashboard:

import time

import logging

from functools import wraps

from contextvars import ContextVar



logger = logging.getLogger(__name__)

request_id_var = ContextVar('request_id', default=None)



def timed(operation_name):

    def decorator(func):

        @wraps(func)

        def wrapper(*args, **kwargs):

            start = time.time()

            try:

                result = func(*args, **kwargs)

                return result

            finally:

                duration = time.time() - start

                request_id = request_id_var.get()

                logger.info(

                    f"operation={operation_name} "

                    f"duration={duration:.3f} "

                    f"request_id={request_id}"

                )

        return wrapper

    return decorator



@timed("authenticate")

def authenticate(request):

    # ... existing code

    pass



@timed("fetch_items")

def fetch_cart_items(user):

    # ... existing code

    pass

Notice the progression: each step solved a specific limitation of the previous approach. You didn't start by building the decorator system—you started with print statements, identified their limitations through actual use, then evolved.

This is crucial: Most teams stop at Attempt 2. The context manager is simple, maintainable, and solves 80% of use cases. Only move to Attempt 3 when you actually need request correlation and structured logging integration. Don't build Attempt 3 because it seems "more elegant"—build it because Attempt 2 demonstrably failed to meet a real requirement.

When custom instrumentation is NOT justified:

  1. You haven't yet mastered the debugger, profiler, and framework tools for your stack

  2. The behavior you want to see lives in a single application where standard tools already work

  3. You're curious how code flows (that's what debuggers are for)

When custom instrumentation IS justified:

  1. A request crosses service boundaries that no single tool can see

  2. You must observe production behavior where debuggers can't attach

  3. You need application-specific metrics that no standard tool collects

Before writing any custom instrumentation, complete this sentence: "Standard tools can't solve my problem because **___**." If you can't articulate a specific technical limitation, you probably don't need custom instrumentation yet.

7.10.2 Minimal Viable Instrumentation

Let's say you've determined custom instrumentation is justified. Perhaps you need to track how long specific operations take in production, and you want this data in your logs for alerting. Here's how to build the simplest thing that could possibly work.

Writing a Context Manager Logger (10 Lines)

Start with this complete implementation:

import time

import logging

from contextlib import contextmanager



logger = logging.getLogger(__name__)



@contextmanager

def trace_operation(name, **context):

    """Time an operation and log duration with context."""

    start = time.time()

    logger.info(f"Starting {name}", extra=context)

    try:

        yield

    except Exception as e:

        logger.error(f"{name} failed after {time.time() - start:.3f}s: {e}", extra=context)

        raise

    else:

        logger.info(f"{name} completed in {time.time() - start:.3f}s", extra=context)

That's it. Ten lines. Here's how you use it:

def process_payment(user_id, amount):

    with trace_operation("process_payment", user_id=user_id, amount=amount):

        # Your existing code here

        charge = stripe.Charge.create(amount=amount, customer=user_id)

        return charge.id

When this runs, you get logs like:

INFO Starting process_payment user_id=12345 amount=99.99

INFO process_payment completed in 0.234s user_id=12345 amount=99.99

If it fails:

INFO Starting process_payment user_id=12345 amount=99.99

ERROR process_payment failed after 0.156s: CardError('insufficient_funds') user_id=12345 amount=99.99
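One caveat about those sample logs: Python's default formatter ignores keys passed via extra, so out of the box the user_id and amount context would not appear in the output. Showing them requires a formatter that appends non-standard record attributes. A sketch (ContextFormatter is our name, not a stdlib class):

```python
import logging

# Attributes present on every LogRecord, computed once, so the formatter can
# treat anything else on the record as user-supplied context from `extra`.
_STANDARD_ATTRS = set(
    logging.LogRecord("", 0, "", 0, "", (), None).__dict__
) | {"message", "asctime"}

class ContextFormatter(logging.Formatter):
    """Append extra={...} keys as key=value pairs after the message."""

    def format(self, record):
        base = super().format(record)
        extras = {
            k: v for k, v in record.__dict__.items() if k not in _STANDARD_ATTRS
        }
        if extras:
            base += " " + " ".join(f"{k}={v}" for k, v in extras.items())
        return base
```

Attach it with handler.setFormatter(ContextFormatter("%(levelname)s %(message)s")) and the sample output above matches what you actually see.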

This is the key insight: This context manager gives you timing, structured logging, error handling, and context propagation in ten lines. You don't need decorators, AST modification, or complex frameworks. This solves 80% of custom instrumentation needs.

Notice what this does well:

  1. Times the operation without touching the code inside the with block

  2. Logs both start and completion, so an operation that hangs still leaves a "Starting" line behind

  3. Logs failures with their duration and exception, then re-raises so behavior is unchanged

  4. Carries arbitrary context (user_id, amount) through to every log line

When to use this pattern:

  1. You need durations for a handful of known operations in your production logs

  2. You want timing data you can alert on without adopting a new tool

  3. You can edit the code at the call sites you care about

Decorators That Don't Modify Behavior

Sometimes you want to instrument functions without wrapping them in context managers. Here's how to write decorators that trace without changing behavior:

import time

import logging

from functools import wraps



logger = logging.getLogger(__name__)



def trace_function(func):

    """Decorator that traces function execution."""

    @wraps(func)  # This is crucial - we'll explain why below

    def wrapper(*args, **kwargs):

        func_name = f"{func.__module__}.{func.__name__}"

        logger.info(f"Calling {func_name}")

        start = time.time()



        try:

            result = func(*args, **kwargs)

            duration = time.time() - start

            logger.info(f"{func_name} completed in {duration:.3f}s")

            return result

        except Exception as e:

            duration = time.time() - start

            logger.error(f"{func_name} failed after {duration:.3f}s: {e}")

            raise



    return wrapper

Usage:

@trace_function

def fetch_user_profile(user_id):

    response = requests.get(f"https://api.example.com/users/{user_id}")

    return response.json()



@trace_function

async def fetch_user_profile_async(user_id):

    async with httpx.AsyncClient() as client:

        response = await client.get(f"https://api.example.com/users/{user_id}")

        return response.json()

Wait, can the async version use the same decorator? It imports cleanly and seems to run, but there's a subtle bug: calling an async function only creates a coroutine object. The sync wrapper times that creation (microseconds), logs "completed" before anything has actually been awaited, and hands the coroutine back. The return value is still correct, but every timing is meaningless. Let me show you the correct pattern:

import asyncio

import time

import logging

from functools import wraps



logger = logging.getLogger(__name__)



def trace_function(func):

    """Decorator that traces both sync and async functions."""



    @wraps(func)

    async def async_wrapper(*args, **kwargs):

        func_name = f"{func.__module__}.{func.__name__}"

        logger.info(f"Calling {func_name}")

        start = time.time()



        try:

            result = await func(*args, **kwargs)

            duration = time.time() - start

            logger.info(f"{func_name} completed in {duration:.3f}s")

            return result

        except Exception as e:

            duration = time.time() - start

            logger.error(f"{func_name} failed after {duration:.3f}s: {e}")

            raise



    @wraps(func)

    def sync_wrapper(*args, **kwargs):

        func_name = f"{func.__module__}.{func.__name__}"

        logger.info(f"Calling {func_name}")

        start = time.time()



        try:

            result = func(*args, **kwargs)

            duration = time.time() - start

            logger.info(f"{func_name} completed in {duration:.3f}s")

            return result

        except Exception as e:

            duration = time.time() - start

            logger.error(f"{func_name} failed after {duration:.3f}s: {e}")

            raise



    # Return the appropriate wrapper based on function type

    if asyncio.iscoroutinefunction(func):

        return async_wrapper

    else:

        return sync_wrapper

This is crucial: When decorating async functions, your wrapper must also be async. The decorator detects the function type and returns the appropriate wrapper. This ensures the function behaves identically whether decorated or not.
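To sanity-check the dispatch, here is a condensed version of the same decorator (logging trimmed for brevity) together with a quick usage check:

```python
import asyncio
import time
import logging
from functools import wraps

logger = logging.getLogger(__name__)

def trace_function(func):
    """Condensed version of the dual sync/async wrapper above."""

    @wraps(func)
    async def async_wrapper(*args, **kwargs):
        start = time.time()
        try:
            return await func(*args, **kwargs)  # timing covers real execution
        finally:
            logger.info(f"{func.__name__} took {time.time() - start:.3f}s")

    @wraps(func)
    def sync_wrapper(*args, **kwargs):
        start = time.time()
        try:
            return func(*args, **kwargs)
        finally:
            logger.info(f"{func.__name__} took {time.time() - start:.3f}s")

    return async_wrapper if asyncio.iscoroutinefunction(func) else sync_wrapper

@trace_function
def add(a, b):
    return a + b

@trace_function
async def add_async(a, b):
    return a + b
```

Both variants return the same results they would undecorated, and @wraps keeps their names intact.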

Using Python's functools.wraps Correctly

Notice the @wraps(func) decorator on both wrappers above. This is not optional—it's critical for maintaining function metadata. Let me show you why:

def fetch_user_profile(user_id):

    """Fetch a user's profile from the API.



    Args:

        user_id: The unique identifier for the user



    Returns:

        dict: User profile data

    """

    # ... implementation



# Without @wraps

def bad_decorator(func):

    def wrapper(*args, **kwargs):

        return func(*args, **kwargs)

    return wrapper



# With @wraps

def good_decorator(func):

    @wraps(func)

    def wrapper(*args, **kwargs):

        return func(*args, **kwargs)

    return wrapper



# Compare the results

@bad_decorator

def bad_decorated_function(user_id):

    """Original docstring"""

    pass



@good_decorator

def good_decorated_function(user_id):

    """Original docstring"""

    pass



print(bad_decorated_function.__name__)   # 'wrapper' - broken!

print(bad_decorated_function.__doc__)    # None - lost docstring!



print(good_decorated_function.__name__)  # 'good_decorated_function' - correct!

print(good_decorated_function.__doc__)   # 'Original docstring' - preserved!

Why this matters:

  1. Debugging: Stack traces show wrapper instead of the actual function name

  2. Documentation: Auto-generated docs show the wrong signature and docstring

  3. Introspection: Tools that inspect function metadata get incorrect information

  4. Testing: Test error messages reference the wrong function names

Always use @wraps(func) in decorators. Always. No exceptions.

Cleanup and Removal Strategies

Custom instrumentation should be easy to remove. Here's the strategy that works in practice:

Pattern 1: Environment-Controlled Instrumentation

import os

import time

import logging

from contextlib import contextmanager



ENABLE_TRACING = os.getenv('ENABLE_TRACING', 'false').lower() == 'true'

logger = logging.getLogger(__name__)



@contextmanager

def trace_operation(name, **context):

    """Conditionally trace operations based on environment variable."""

    if not ENABLE_TRACING:

        yield  # No-op when disabled

        return



    start = time.time()

    logger.info(f"Starting {name}", extra=context)

    try:

        yield

    finally:

        logger.info(f"{name} completed in {time.time() - start:.3f}s", extra=context)

Now you can disable tracing without code changes:

# Production (no tracing overhead)

python app.py



# Debug production issue (enable tracing)

ENABLE_TRACING=true python app.py

Pattern 2: Feature Flag Integration

If you use feature flags (like LaunchDarkly or Unleash), integrate tracing with them:

import time

import logging

from contextlib import contextmanager

from feature_flags import is_enabled



logger = logging.getLogger(__name__)



@contextmanager

def trace_operation(name, **context):

    """Trace operations only when feature flag is enabled."""

    if not is_enabled('detailed-tracing', context.get('user_id')):

        yield

        return



    start = time.time()

    logger.info(f"Starting {name}", extra=context)

    try:

        yield

    finally:

        logger.info(f"{name} completed in {time.time() - start:.3f}s", extra=context)

This lets you enable tracing for specific users or requests without redeploying.

Pattern 3: Gradual Removal

When you're done with custom instrumentation, remove it gradually:

# Step 1: Disable but keep the code (day 1)

ENABLE_TRACING = False



# Step 2: Comment out the decorator (day 7, after confirming it's not needed)

# @trace_function

def fetch_user_profile(user_id):

    pass



# Step 3: Remove the wrapper completely (day 14, after confirming no issues)

def fetch_user_profile(user_id):

    # Just the original function

    pass



# Step 4: Remove the tracing utilities file (day 30)

# Delete tracing.py entirely

Notice this carefully: Don't remove instrumentation immediately after debugging. Leave it disabled for a week, commented for another week, then remove. This gives you an easy rollback path if the issue recurs.

Common mistakes with custom instrumentation:

  1. Not using functools.wraps: Breaks introspection and debugging

  2. Modifying function behavior: Adding retry logic, changing return values, etc.

  3. Catching exceptions without re-raising: Silences errors

  4. No off switch: Makes instrumentation permanent and hard to remove

  5. Complex configuration: Makes it hard to understand when tracing is active

The minimal instrumentation checklist:

  1. Uses functools.wraps on every wrapper

  2. Never changes behavior, return values, or exception flow

  3. Re-raises every exception it catches

  4. Has a single, obvious off switch

  5. Is simple enough to understand in one reading and remove in one commit

If your custom instrumentation doesn't meet these criteria, simplify it. The goal is observability, not complexity.

7.10.3 AST Modification: The Nuclear Option

You're about to enter dangerous territory. Abstract Syntax Tree (AST) modification is where you programmatically rewrite Python code before it runs. This is the technique that enables Python coverage.py to instrument every line, that powers code transformation tools like Black and autopep8, and that frameworks use to implement magical DSLs.

It's also the technique that causes the most maintenance nightmares when used inappropriately.

When AST Transformation is Actually Justified

Let me be direct: if you're reading this section to solve a tracing problem, you're probably wrong. But let me show you the rare cases where AST modification is the right tool.

Scenario 1: Building a Framework or Library

You're building a testing framework that needs to automatically capture coverage data for every function call. Users shouldn't need to decorate functions or wrap code. The framework must transparently instrument code. This is what coverage.py does—it modifies the AST to inject coverage tracking into every statement.

Scenario 2: DSL Implementation

You're building a workflow engine where users write Python-like code that compiles to a different execution model. For example, Apache Airflow's task dependencies could theoretically use AST transformation (though they use decorators instead). You need to transform Python syntax into a different semantic model.

Scenario 3: Whole-Codebase Transformations

You're migrating a massive codebase from Python 2 to Python 3 using automated tools like lib2to3. Or you're applying a security audit that inserts validation at every database query. The transformation is one-time or periodic, not runtime.

Notice the pattern: All justified uses involve either building infrastructure (frameworks, tools) or one-time transformations. You're never modifying AST at runtime for everyday application debugging. If you think you need AST modification to understand how your Django application works, stop—you need a debugger, not AST surgery.

Using ast.NodeTransformer Safely

Let's build a minimal AST transformer that instruments function calls for timing. This will show you both how it works and why it's complex:

import ast

import time



class FunctionCallTimer(ast.NodeTransformer):

    """Transform function definitions to add timing instrumentation."""



    def visit_FunctionDef(self, node):

        # First, recursively visit child nodes

        self.generic_visit(node)



        # Create timing instrumentation code

        # This is equivalent to:

        # _start_time = time.time()

        # try:

        #     [original function body]

        # finally:

        #     _end_time = time.time()

        #     print(f"{function_name} took {_end_time - _start_time:.3f}s")



        start_assign = ast.Assign(

            targets=[ast.Name(id='_start_time', ctx=ast.Store())],

            value=ast.Call(

                func=ast.Attribute(

                    value=ast.Name(id='time', ctx=ast.Load()),

                    attr='time',

                    ctx=ast.Load()

                ),

                args=[],

                keywords=[]

            )

        )



        end_assign = ast.Assign(

            targets=[ast.Name(id='_end_time', ctx=ast.Store())],

            value=ast.Call(

                func=ast.Attribute(

                    value=ast.Name(id='time', ctx=ast.Load()),

                    attr='time',

                    ctx=ast.Load()

                ),

                args=[],

                keywords=[]

            )

        )



        print_stmt = ast.Expr(

            value=ast.Call(

                func=ast.Name(id='print', ctx=ast.Load()),

                args=[

                    ast.JoinedStr(values=[

                        ast.Constant(value=f'{node.name} took '),

                        ast.FormattedValue(

                            value=ast.BinOp(

                                left=ast.Name(id='_end_time', ctx=ast.Load()),

                                op=ast.Sub(),

                                right=ast.Name(id='_start_time', ctx=ast.Load())

                            ),

                            conversion=-1,

                            format_spec=ast.Constant(value='.3f')

                        ),

                        ast.Constant(value='s')

                    ])

                ],

                keywords=[]

            )

        )



        # Wrap original body in try/finally

        try_node = ast.Try(

            body=node.body,

            handlers=[],

            orelse=[],

            finalbody=[end_assign, print_stmt]

        )



        # Replace function body with: start timing, try/finally

        node.body = [start_assign, try_node]



        return node



# Usage example

source_code = """

def calculate_total(items):

    total = sum(item.price for item in items)

    return total



def process_order(user_id):

    items = fetch_items(user_id)

    total = calculate_total(items)

    return total

"""



# Parse, transform, and compile

tree = ast.parse(source_code)

transformer = FunctionCallTimer()

transformed_tree = transformer.visit(tree)



# CRITICAL: Fix missing locations (line numbers)

ast.fix_missing_locations(transformed_tree)



# Compile and execute

code = compile(transformed_tree, filename='<ast>', mode='exec')

exec(code)

Look at how complex that is—40+ lines just to add timing to functions. Compare that to the decorator version:

import time

from functools import wraps



def timed(func):

    @wraps(func)

    def wrapper(*args, **kwargs):

        start = time.time()

        try:

            return func(*args, **kwargs)

        finally:

            print(f"{func.__name__} took {time.time() - start:.3f}s")

    return wrapper



@timed

def calculate_total(items):

    total = sum(item.price for item in items)

    return total

This is the key insight: AST transformation takes 40+ lines of complex code to accomplish what a 10-line decorator does more clearly. Unless you're building infrastructure that makes decorators impossible, use decorators.

Preserving Line Numbers and Debugging

Notice the critical line in the AST example above:

ast.fix_missing_locations(transformed_tree)

This is essential. When you modify the AST, Python doesn't automatically track line numbers for new nodes. Without this call, stack traces and debuggers show wrong line numbers or crash. Let me demonstrate the impact:

# Without fix_missing_locations

source = """

def bad_function():

    raise ValueError("something broke")

"""



tree = ast.parse(source)

# Transform the tree somehow, adding new nodes without location info

compiled = compile(tree, filename='test.py', mode='exec')  # DANGER



# If the new nodes lack lineno/col_offset, compile() fails outright with a TypeError.

# If locations were merely copied incorrectly, tracebacks point at the wrong

# source lines, debuggers can't set breakpoints correctly, and coverage tools

# report the wrong lines.

Always call ast.fix_missing_locations(tree) after transformation. This recalculates line numbers for all new nodes, preserving debuggability.
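You can see the failure mode directly: in CPython, compiling a tree whose hand-built nodes lack lineno and col_offset raises an error outright, while the same tree compiles after fixing locations:

```python
import ast

tree = ast.parse("x = 1")

# Append a hand-built statement with no location info set
tree.body.append(ast.Expr(value=ast.Constant(value=None)))

try:
    compile(tree, "<demo>", "exec")
    compiled_before_fix = True
except (TypeError, ValueError):
    compiled_before_fix = False  # missing lineno makes compile() fail

ast.fix_missing_locations(tree)  # copies locations onto the new nodes
code = compile(tree, "<demo>", "exec")  # now succeeds
```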

But even with proper line numbers, AST-transformed code is harder to debug:

# Original code

def calculate_total(items):

    return sum(item.price for item in items)



# What the debugger sees after AST transformation

def calculate_total(items):

    _start_time = time.time()

    try:

        return sum(item.price for item in items)

    finally:

        _end_time = time.time()

        print(f"calculate_total took {_end_time - _start_time:.3f}s")

When you set a breakpoint in your editor on the return line, the debugger is actually breaking in transformed code. If the transformation is complex, you'll see variables and logic that don't exist in your source file. This makes debugging confusing and error-prone.

The Maintenance Burden: A Cautionary Tale

Let me tell you a real story (anonymized) about a team that learned this lesson the hard way.

Day 1: The team needs to track execution time for all functions in their Django application. They consider decorators but decide AST transformation is "more elegant"—no need to modify every function.

Week 2: They build an import hook that transforms all application code at import time:

import sys

import ast

from importlib.abc import MetaPathFinder, Loader



class InstrumentingLoader(Loader):

    def exec_module(self, module):

        # Load source, parse AST, transform, compile, execute

        ...  # complex code omitted



class InstrumentingFinder(MetaPathFinder):

    def find_spec(self, name, path, target=None):

        # Intercept imports for app modules

        ...  # complex code omitted



# Install the import hook

sys.meta_path.insert(0, InstrumentingFinder())

This works! Every function is automatically timed. The team is proud of their elegant solution.

Month 2: Django upgrades from 3.2 to 4.0. The import hook breaks because Django's new async views use different AST structures. The team spends three days updating the transformer to handle async functions correctly.

Month 4: A new developer joins. They're confused why breakpoints don't work correctly. The debugger shows transformed code, not source code. The team writes documentation explaining this quirk. New developers now take 2-3 days longer to onboard.

Month 6: The Python version upgrades from 3.10 to 3.11. Python 3.11 changed the AST structure for exception handling. The transformer crashes on try/except blocks. The team spends another week updating it.

Month 8: The team needs to add distributed tracing. The AST transformer doesn't have request context—it just times functions. They realize they need to modify the transformer to thread context through, which requires transforming not just function definitions but all function calls. This would take weeks.

Month 9: The team rips out the AST transformer and replaces it with:

  1. A timed decorator applied explicitly to the functions that matter

  2. Middleware that assigns each request an ID and threads it through contextvars

  3. Structured log lines their monitoring system already understands

The replacement takes three days. The new system is easier to understand, debug, and maintain.

Total cost:

  1. Weeks of initial build time, plus roughly two more weeks of upgrade firefighting (Django 4.0, Python 3.11)

  2. An extra 2-3 onboarding days for every new developer

  3. Three days to build the replacement they could have started with

This is crucial: The maintenance burden of AST transformation compounds over time. Every Python upgrade, framework upgrade, or feature addition requires revisiting the transformer. Compare this to decorators, which work across Python versions without modification.

When NOT to use AST transformation:

  1. Everyday application debugging or tracing (use a debugger)

  2. Timing and logging that decorators or context managers can handle

  3. Anything your team must maintain without deep AST expertise

When you might use AST transformation:

  1. Building developer infrastructure such as coverage tools, formatters, or test frameworks

  2. One-time, whole-codebase migrations or security audits

  3. Implementing a DSL that compiles Python syntax into a different execution model

The decision criteria:

Ask yourself these questions:

  1. Are you building infrastructure used by many teams? (Yes → maybe justified)

  2. Is this transformation one-time or continuous? (One-time → maybe justified)

  3. Can decorators solve this? (Yes → don't use AST)

  4. Do you have capacity to maintain this through Python upgrades? (No → don't use AST)

  5. Will your team understand this in 6 months? (No → don't use AST)

If you answered "no" to #1 and "yes" to #3, absolutely do not use AST transformation. Use decorators.

Summary: AST modification is the nuclear option—powerful, dangerous, and justified in rare circumstances. In 99% of tracing scenarios, debuggers, decorators, and framework tools are simpler, more maintainable, and more appropriate. If you're considering AST transformation for everyday debugging, step back and use standard tools first. Save AST for when you're building the tools themselves, not using them.