7.15 Case Study 3: FastAPI Background Task Failure

The Setup: Your FastAPI application processes uploaded files in the background. Users upload a CSV, get an immediate response, and receive an email when processing completes. This worked perfectly in development, but in production, background tasks fail silently about 30% of the time. No exceptions in logs, no error emails, no indication of what went wrong.

The code looks straightforward:

from fastapi import FastAPI, BackgroundTasks, UploadFile
import asyncio

app = FastAPI()

@app.post("/upload")
async def upload_file(file: UploadFile, background_tasks: BackgroundTasks):
    # Save file to disk
    content = await file.read()
    filepath = f"/tmp/{file.filename}"
    with open(filepath, "wb") as f:
        f.write(content)

    # Process in background
    background_tasks.add_task(process_csv, filepath)
    return {"status": "processing"}

async def process_csv(filepath: str):
    async with DatabaseConnection() as db:
        data = parse_csv(filepath)
        await db.insert_many(data)
        await send_completion_email()

The initial confusion: You add logging to every step:

async def process_csv(filepath: str):
    print(f"Starting process_csv for {filepath}")
    async with DatabaseConnection() as db:
        print("Database connected")
        data = parse_csv(filepath)
        print(f"Parsed {len(data)} rows")
        await db.insert_many(data)
        print("Data inserted")
        await send_completion_email()
        print("Email sent")

In production logs, you see:

Starting process_csv for /tmp/data.csv
Database connected
Parsed 250 rows

Then... nothing. The task stops after parsing. No error, no exception, no "Data inserted" log. The process just vanishes.

You try wrapping everything in try-except:

async def process_csv(filepath: str):
    try:
        async with DatabaseConnection() as db:
            data = parse_csv(filepath)
            await db.insert_many(data)
            await send_completion_email()
    except Exception as e:
        print(f"Error: {e}")
        raise

Still no error appears in logs. The try-except never catches anything. How can code fail without raising an exception?

After 6 hours of adding more logging, inspecting database connections, checking file permissions, and restarting services, you're stuck. The task fails silently, and you have no way to see what's happening inside that async context manager.

This is the "silent failure in async code" problem—one of the most frustrating debugging scenarios in modern Python.

Problem: Background task fails silently

Let's solve this properly with the right tools in 30 minutes.

Phase 1: Understanding async task execution (10 minutes)

First, recognize the core issue: FastAPI background tasks run in the same event loop as your request handling, but they're not monitored the same way. If an exception occurs in a background task after the response is sent, there's no HTTP response to attach it to, and the exception might be suppressed depending on how the event loop handles it.
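This suppression is easy to reproduce with plain asyncio, no FastAPI required. A minimal sketch (the names `doomed` and `handler` are illustrative): a fire-and-forget task stores its exception on the task object instead of raising it anywhere visible.

```python
import asyncio

async def doomed():
    raise RuntimeError("boom")  # raised inside the task, but nobody awaits it

async def handler():
    task = asyncio.create_task(doomed())  # fire-and-forget, like a background task
    await asyncio.sleep(0.01)             # the "request" returns in the meantime
    return task                           # nothing was raised or printed here

finished = asyncio.run(handler())
# The exception is parked on the task object instead of propagating:
print(finished.exception())  # -> boom
```

Unless something retrieves the task's exception, the only hint you get is a "Task exception was never retrieved" warning at garbage collection, which is easy to miss in production logs.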

The key question: Is the task actually failing, or is it just not completing? Let's find out with py-spy.

Install py-spy (if not already installed):

pip install py-spy

py-spy is a sampling profiler that can attach to running Python processes without modifying code or restarting. It shows you exactly what functions are executing at any moment.

Start your FastAPI application in one terminal:

uvicorn main:app --host 0.0.0.0 --port 8000

Find the process ID:

ps aux | grep uvicorn
# Output: user  12345  0.5  0.1  ... python -m uvicorn main:app

The process ID is 12345 in this example.

Upload a file to trigger the background task:

curl -F "file=@test.csv" http://localhost:8000/upload

Immediately attach py-spy while the background task should be running:

sudo py-spy dump --pid 12345

py-spy dump takes a snapshot of all threads and their current call stacks. You see:

Thread 0x7f8b2c3d4700 (active): "MainThread"
    File "asyncio/base_events.py", line 1823, in _run_once
    File "asyncio/events.py", line 80, in _run
    File "starlette/background.py", line 42, in __call__
    File "main.py", line 18, in process_csv
        async with DatabaseConnection() as db:
    File "database.py", line 34, in __aenter__
        self.conn = await asyncpg.connect(...)
    File "asyncpg/connection.py", line 156, in connect
    File "asyncio/selector_events.py", line 829, in _read_ready
        # Waiting for connection...

This is the key insight: The background task is stuck waiting in DatabaseConnection().__aenter__(). It's not failing—it's hanging during the database connection attempt.

This explains why you never saw "Data inserted" in your logs. The code never got past the async with DatabaseConnection() line. But why does asyncpg.connect() hang instead of raising a timeout error?

Check the DatabaseConnection implementation:

class DatabaseConnection:
    def __init__(self):
        self.conn = None

    async def __aenter__(self):
        self.conn = await asyncpg.connect(
            host=os.getenv("DB_HOST"),
            database=os.getenv("DB_NAME"),
            user=os.getenv("DB_USER"),
            password=os.getenv("DB_PASSWORD")
            # Missing: timeout parameter!
        )
        return self.conn

    async def __aexit__(self, exc_type, exc_val, exc_tb):
        await self.conn.close()

The connection has no timeout. When the database is overloaded or network is slow, asyncpg.connect() waits indefinitely. In production, database connection pools might be exhausted, causing new connections to hang forever waiting for an available slot.

But there's a second problem: even if connection succeeded, you need to check if the async context manager is being used correctly.

Phase 2: Async context manager validation (10 minutes)

Add more detailed logging with exception handling that actually works for async:

import sys
import traceback

async def process_csv(filepath: str):
    try:
        print(f"Starting process_csv for {filepath}", flush=True)
        print("Creating DatabaseConnection...", flush=True)

        db_conn = DatabaseConnection()
        print("Entering context manager...", flush=True)

        async with db_conn as db:
            print("Database connected", flush=True)
            data = parse_csv(filepath)
            print(f"Parsed {len(data)} rows", flush=True)
            await db.insert_many(data)
            print("Data inserted", flush=True)
            await send_completion_email()
            print("Email sent", flush=True)

    except Exception as e:
        print(f"Exception caught: {type(e).__name__}: {e}", flush=True)
        traceback.print_exc(file=sys.stdout)
        # Don't just print - actually log somewhere persistent
        raise
    finally:
        print("process_csv complete (finally block)", flush=True)

Notice the flush=True parameter—this is crucial. Without it, print buffering might delay logs until the process ends, making you think code didn't execute when it actually did.
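If adding flush=True to every call becomes tedious, stdout buffering can be adjusted once at startup instead. A small sketch using the standard library's reconfigure method (available on text streams since Python 3.7); the process-wide equivalents are launching with `python -u` or setting PYTHONUNBUFFERED=1:

```python
import sys

# Force line-buffered stdout once, instead of flush=True on every print.
# (Process-wide alternatives: `python -u ...` or the PYTHONUNBUFFERED=1 env var.)
sys.stdout.reconfigure(line_buffering=True)

print("this line is flushed as soon as it ends")
```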

After adding this and uploading another file, you see:

Starting process_csv for /tmp/data.csv
Creating DatabaseConnection...
Entering context manager...

Still hangs at the same place. But now let's use py-spy in a different mode—continuous recording:

sudo py-spy record -o profile.svg --pid 12345 --duration 30

This records the call stack every 10 milliseconds for 30 seconds and generates a flamegraph. Upload a file, wait 30 seconds, then open profile.svg in a browser.

The flamegraph shows:

process_csv (100% of time)
  └─ DatabaseConnection.__aenter__ (100% of time)
      └─ asyncpg.connect (100% of time)
          └─ asyncio.selector_events._read_ready (100% of time)

The task spends 100% of its time waiting for the database connection. This confirms it's not a bug in your code logic—it's a connection timeout/configuration issue.

Phase 3: Finding the actual bug (10 minutes)

Now that you know the connection is the problem, add a timeout and see what error actually occurs:

async def __aenter__(self):
    try:
        self.conn = await asyncio.wait_for(
            asyncpg.connect(
                host=os.getenv("DB_HOST"),
                database=os.getenv("DB_NAME"),
                user=os.getenv("DB_USER"),
                password=os.getenv("DB_PASSWORD")
            ),
            timeout=10.0  # Add a 10 second connection timeout
        )
        return self.conn
    except asyncio.TimeoutError:
        print("Database connection timeout!")
        raise

Upload a file again. After 10 seconds, you see:

Starting process_csv for /tmp/data.csv
Creating DatabaseConnection...
Entering context manager...
Database connection timeout!
Exception caught: TimeoutError

Good! Now you're getting an actual error. But check the production configuration. You discover:

# production.env
DB_HOST=localhost
DB_NAME=production_db
DB_USER=app_user
DB_PASSWORD=secret123

Wait—DB_HOST=localhost? But in production, the database isn't running on localhost. It's running in a separate container. The environment variable is wrong!

The correct value should be DB_HOST=postgres-container (the Docker service name). The connection was trying to reach localhost:5432 where no database exists, hanging indefinitely waiting for a response that would never come.
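Misconfigurations like this are cheap to catch at startup rather than at first use. A minimal sketch (the helper `validate_db_config` is hypothetical, not part of FastAPI or asyncpg) that flags missing variables and the suspicious localhost case before anything tries to connect:

```python
import os

def validate_db_config(env=os.environ):
    """Hypothetical startup check: return a list of config problems
    instead of letting a bad value hang a background task later."""
    problems = []
    for key in ("DB_HOST", "DB_NAME", "DB_USER", "DB_PASSWORD"):
        if not env.get(key):
            problems.append(f"missing {key}")
    # Inside a container, 'localhost' almost never points at the database host
    if env.get("DB_HOST") in ("localhost", "127.0.0.1"):
        problems.append("DB_HOST is localhost -- suspicious inside a container")
    return problems

# At application startup, refuse to boot if anything looks wrong:
# problems = validate_db_config()
# if problems:
#     raise RuntimeError("; ".join(problems))
```

Failing fast at boot converts a 6-hour silent-hang investigation into an immediate, readable error.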

But here's the actual async gotcha. Even after fixing the host, you discover another issue. Look at this code again:

@app.post("/upload")
async def upload_file(file: UploadFile, background_tasks: BackgroundTasks):
    content = await file.read()
    filepath = f"/tmp/{file.filename}"
    with open(filepath, "wb") as f:  # Synchronous file I/O in an async function!
        f.write(content)

    background_tasks.add_task(process_csv, filepath)
    return {"status": "processing"}

The with open() is synchronous I/O that blocks the event loop. In production with many concurrent uploads, this blocks all other async tasks. Better approach:

import aiofiles

@app.post("/upload")
async def upload_file(file: UploadFile, background_tasks: BackgroundTasks):
    content = await file.read()
    filepath = f"/tmp/{file.filename}"

    async with aiofiles.open(filepath, "wb") as f:  # Async file I/O
        await f.write(content)

    background_tasks.add_task(process_csv, filepath)
    return {"status": "processing"}

The complete bug: The actual problem was multi-layered:

  1. Wrong database host in environment variables (localhost vs container name)

  2. No connection timeout, causing tasks to hang silently forever

  3. Synchronous I/O in async functions, blocking the event loop

  4. No structured exception logging for background tasks

All four issues combined to create silent failures that were nearly impossible to debug with print statements alone.

Tools used: py-spy + custom exception logging

py-spy gave you:

  1. Call stack snapshots showing exactly where code was stuck (asyncpg.connect)

  2. Flamegraph visualization showing time spent in each function (100% in connection)

  3. No code changes required—attach to running process

  4. Minimal overhead—sampling profiler, not tracing profiler

  5. Works in production—safe to use on live systems

Custom exception logging (done right) gave you:

  1. Async-aware exception handling—catches exceptions in async context

  2. Explicit flush=True—ensures logs appear immediately

  3. Traceback printing—shows full call stack when errors occur

  4. Finally blocks—confirms whether code completed or hung

Why print debugging failed:

Print statements showed:

Starting process_csv
Database connected  (never appeared)

This tells you the code stops somewhere between "starting" and "connected", but not why: print statements cannot see inside asyncpg.connect().

py-spy revealed the exact frame the task was stuck in—DatabaseConnection.__aenter__ waiting on the connection attempt—with no code changes at all.

The workflow that worked:

  1. Recognize silent failure pattern

  2. Attach py-spy to see where code is stuck

  3. Identify hanging operation (database connection)

  4. Add timeout to force error instead of hang

  5. See actual error message

  6. Discover configuration issue

Discovery: Async context manager not awaited correctly

Let's dig deeper into what "not awaited correctly" actually means, because this is a subtle async programming error that catches even experienced developers.

The problematic pattern:

# This looks correct but has a subtle bug
async def process_csv(filepath: str):
    async with DatabaseConnection() as db:
        # If an exception occurs here...
        data = parse_csv(filepath)  # Synchronous, might raise
        await db.insert_many(data)

If parse_csv() raises an exception, the async context manager's __aexit__ runs:

async def __aexit__(self, exc_type, exc_val, exc_tb):
    await self.conn.close()  # Closes the connection properly

But here's the gotcha: If the connection never completes in __aenter__, the context manager never fully "enters", so your code inside the async with block never runs. The task just hangs in __aenter__ forever.
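The hang is easy to reproduce in miniature. In this toy sketch (`SlowConnect` is illustrative), __aenter__ never finishes, so the body of the async with block never runs until a timeout cuts the whole thing off:

```python
import asyncio

class SlowConnect:
    async def __aenter__(self):
        await asyncio.sleep(3600)  # simulates asyncpg.connect() hanging
        return self

    async def __aexit__(self, exc_type, exc_val, exc_tb):
        return False

async def use_it():
    async with SlowConnect():
        print("body runs")  # never reached: execution is stuck in __aenter__

async def main():
    try:
        await asyncio.wait_for(use_it(), timeout=0.05)
    except asyncio.TimeoutError:
        return "stuck in __aenter__"

print(asyncio.run(main()))  # -> stuck in __aenter__
```

Wrapping the whole operation in asyncio.wait_for is the general defense: it turns an indefinite hang into a catchable TimeoutError.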

Another common mistake:

# Driving the async context manager protocol by hand
db = DatabaseConnection()  # This doesn't connect yet
conn = await db.__aenter__()  # Manual await - don't do this!

The proper way:

# Let Python handle the async context manager protocol
async with DatabaseConnection() as db:
    # Python automatically:
    # 1. Creates the DatabaseConnection instance
    # 2. Awaits its __aenter__() and assigns the result to 'db'
    # 3. Runs your code block
    # 4. Awaits __aexit__() on the same instance, even if an exception occurred
    pass

The fix for this case study:

import asyncio
import logging
import os

import asyncpg

class DatabaseConnection:
    def __init__(self, timeout=10.0):
        self.conn = None
        self.timeout = timeout

    async def __aenter__(self):
        try:
            # Wrap the connection attempt with a timeout
            self.conn = await asyncio.wait_for(
                asyncpg.connect(
                    host=os.getenv("DB_HOST", "localhost"),
                    database=os.getenv("DB_NAME"),
                    user=os.getenv("DB_USER"),
                    password=os.getenv("DB_PASSWORD"),
                    command_timeout=60  # Per-query timeout
                ),
                timeout=self.timeout  # Connection timeout
            )
            return self.conn
        except asyncio.TimeoutError as e:
            # Log properly for background tasks
            logging.error(f"Database connection timeout after {self.timeout}s")
            raise ConnectionError(
                f"Could not connect to database within {self.timeout} seconds"
            ) from e
        except Exception as e:
            logging.error(f"Database connection failed: {e}")
            raise

    async def __aexit__(self, exc_type, exc_val, exc_tb):
        if self.conn:
            try:
                await asyncio.wait_for(self.conn.close(), timeout=5.0)
            except asyncio.TimeoutError:
                # Force close if graceful close hangs
                self.conn.terminate()
        return False  # Don't suppress exceptions

Additional async gotchas discovered:

  1. Background tasks don't propagate exceptions by default:

# Exceptions in background tasks are swallowed!
background_tasks.add_task(process_csv, filepath)

# Better: wrap the real task in an exception handler
async def safe_process_csv(filepath):
    try:
        await process_csv(filepath)
    except Exception:
        logging.exception(f"Background task failed for {filepath}")
        # Send alert, store the error, etc.

background_tasks.add_task(safe_process_csv, filepath)

  2. Async functions must be passed, not called, when registering background tasks:

# WRONG: passes a coroutine object; the task never runs
background_tasks.add_task(process_csv(filepath))

# RIGHT: passes the function and its arguments; FastAPI awaits it
background_tasks.add_task(process_csv, filepath)

  3. Event loop differences between dev and prod:

  - Development: uvicorn --reload (restarts on code changes, masks some issues)
  - Production: uvicorn with multiple workers (a separate event loop per worker)

Time to fix: 30 minutes with tools, 6 hours without

Time breakdown with proper tools:

  1. Phase 1: py-spy snapshot to locate the hang (10 minutes)
  2. Phase 2: async-aware logging to confirm where execution stopped (10 minutes)
  3. Phase 3: timeout added, configuration error found and fixed (10 minutes)

Total: 30 minutes from silent failure to deployed fix.

Time spent without tools (the first 6 hours): adding more logging, inspecting database connections, checking file permissions, and restarting services.

Total: 6 hours of frustration with no clear progress.

Why the 12× time difference?

Without py-spy, you can only see your own print statements; everything happening inside a library call such as asyncpg.connect() is invisible, so debugging degenerates into guessing, adding logs, and restarting.

With py-spy, a single dump command shows the exact frame every thread is executing, so a hang points straight at the responsible call.

The key lesson: Async debugging requires runtime inspection tools. Print statements don't show you where execution is blocked inside library code, what the event loop is doing, or whether a task has failed or is merely waiting.

py-spy reveals the complete picture instantly.

When to use py-spy: processes that hang or spin without logging anything, production systems you can't restart or instrument, and any suspicion that code is stuck inside a library or system call.

When py-spy isn't enough: it shows where code is, not why. Inspecting variable values, stepping through logic errors, or following data through a pipeline still calls for a debugger or structured logging.

This case study demonstrates that modern Python async code requires modern debugging tools. The patterns that worked for synchronous Python (print debugging, try-except logging) are insufficient for async. Tools like py-spy bridge that gap.