Part I: Foundation - The Universal Pattern
Chapter 1: Introduction - Why This Matters
You've written this code before:
import os

for root, dirs, files in os.walk('/my/project'):
    for file in files:
        if file.endswith('.py'):
            print(os.path.join(root, file))
And you've probably written something like this:
def find_email(data):
    if isinstance(data, dict):
        if 'email' in data:
            return data['email']
        for value in data.values():
            result = find_email(value)
            if result:
                return result
    elif isinstance(data, list):
        for item in data:
            result = find_email(item)
            if result:
                return result
    return None
These look completely different. One walks through directories, the other searches nested JSON. But here's what most tutorials won't tell you: they're the same problem wearing different clothes.
The Hidden Commonality
Both are asking:
- Where am I right now?
- What's at this location?
- Where can I go next?
- Am I done, or should I keep looking?
Once you see this pattern, you'll recognize it everywhere:
- Parsing HTML tables with BeautifulSoup
- Walking an Abstract Syntax Tree to analyze code
- Traversing configuration files
- Navigating API responses
- Exploring database query results
The problem isn't that these tasks are hard. The problem is that every tutorial teaches them in isolation, as if os.walk() and JSON recursion and DOM traversal are separate skills you have to learn from scratch each time.
What You'll Actually Learn
This guide teaches you one mental model that works across all nested data structures. You'll learn:
- How to recognize traversal problems before you start coding
- When to use existing library features vs. when to write custom code
- The core patterns that work regardless of data type
- How to adapt the pattern to file systems, JSON, HTML, and code ASTs
Most importantly, you'll learn how to explore unfamiliar structures confidently, without frantically Googling "how to get nested value from JSON" for the hundredth time.
Who This Guide Is For
You should:
- Be comfortable with Python basics (functions, loops, dictionaries)
- Have seen os.walk() at least once, even if you don't fully understand it
- Know what recursion is, even if you're not great at writing it yet
- Have struggled with nested data before
You don't need to be an expert. If you've ever stared at a 500-line JSON response and wondered "how do I get that value out?", this guide is for you.
How to Use This Guide
If you're in a hurry: Read Chapters 2-3 (the pattern and decision framework), then skip to whatever domain you're working in (JSON, HTML, or ASTs).
If you want mastery: Read sequentially. Each chapter builds on the last, starting simple (file systems) and progressing to complex (AST analysis).
If you're stuck on something specific: Start with Chapter 23's toolkit, then jump to the relevant domain chapter.
The exercises are essential. Don't skip them. You can read about traversal patterns all day, but until you struggle with the design yourself—until you make the mistakes and fix them—the knowledge won't stick.
Let's begin.
Chapter 2: The Universal Traversal Pattern
Here's a deceptively simple piece of code:
for root, dirs, files in os.walk('/project'):
    for file in files:
        print(file)
Let me ask you: where is os.walk() right now? What directory is it looking at?
The answer changes with every iteration. That's the essence of traversal: moving through a structure, asking questions at each position.
The Four Questions Every Traversal Answers
Every traversal—whether it's directories, JSON, HTML, or code—answers these four questions:
1. WHERE AM I?
What's my current position in the structure?
# In os.walk:
root # Current directory path
# In JSON traversal:
current_dict # Current dictionary or list
# In BeautifulSoup:
current_tag # Current HTML element
# In Tree-sitter:
current_node # Current AST node
2. WHAT'S HERE?
What data exists at this position?
# In os.walk:
files # List of filenames in this directory
# In JSON:
data['name'] # Value at this key
# In BeautifulSoup:
tag.text # Text content of this element
# In Tree-sitter:
node.type # What kind of code construct is this?
3. WHERE CAN I GO?
What are my next possible positions?
# In os.walk:
dirs # Subdirectories to visit next
# In JSON:
data.values() # Nested dictionaries or lists
# In BeautifulSoup:
tag.children # Child elements
# In Tree-sitter:
node.children # Child nodes in the syntax tree
4. WHAT AM I LOOKING FOR?
When do I stop? What triggers success or failure?
# Examples:
if file.endswith('.py'): # Found a Python file
if 'email' in data: # Found the email field
if tag.name == 'table': # Found a table element
if node.type == 'function_definition': # Found a function
Let's see these questions in action with a real example:
def find_large_files(start_path, size_threshold):
    """Find all files larger than threshold."""
    large_files = []
    for root, dirs, files in os.walk(start_path):  # WHERE CAN I GO?
        # WHERE AM I: root is current directory
        for file in files:  # WHAT'S HERE?
            path = os.path.join(root, file)
            size = os.path.getsize(path)
            # WHAT AM I LOOKING FOR?
            if size > size_threshold:
                large_files.append((path, size))
    return large_files
Now here's the exact same pattern searching nested JSON:
def find_large_values(data, threshold, path=""):
    """Find all numeric values larger than threshold."""
    results = []
    # WHERE AM I: current data (dict, list, or value)
    if isinstance(data, dict):
        # WHERE CAN I GO: dictionary values
        for key, value in data.items():
            results.extend(
                find_large_values(value, threshold, f"{path}.{key}")
            )
    elif isinstance(data, list):
        # WHERE CAN I GO: list items
        for i, item in enumerate(data):
            results.extend(
                find_large_values(item, threshold, f"{path}[{i}]")
            )
    elif isinstance(data, (int, float)):
        # WHAT'S HERE: a number
        # WHAT AM I LOOKING FOR?
        if data > threshold:
            results.append((path, data))
    return results
Do you see it? Same four questions, different data structure.
The Pattern in Plain English
Every traversal follows this logic:
- Start somewhere (root directory, top-level dict, document root, etc.)
- Look at what's here (files, values, tags, nodes)
- Check if you've found what you want (matching criteria)
- Decide where to go next (subdirectories, nested dicts, child elements)
- Repeat until done (no more places to go, or found what you need)
That's it. That's the whole pattern.
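The five steps above can be collapsed into one generic function. This is an illustrative sketch, not library code: `get_children` and `matches` are hypothetical callables you supply for each data type, answering "where can I go?" and "what am I looking for?" respectively.

```python
def traverse(start, get_children, matches):
    """Generic traversal over any nested structure."""
    found = []
    to_visit = [start]                          # 1. Start somewhere
    while to_visit:                             # 5. Repeat until done
        current = to_visit.pop()                # 2. Look at what's here
        if matches(current):                    # 3. Check if you've found it
            found.append(current)
        to_visit.extend(get_children(current))  # 4. Decide where to go next
    return found

# Example: find all ints greater than 10 in arbitrarily nested lists
nested = [1, [20, [3, 42]], 7]
children = lambda x: x if isinstance(x, list) else []
is_big = lambda x: isinstance(x, int) and x > 10
print(sorted(traverse(nested, children, is_big)))  # [20, 42]
```

Swap in different `get_children` and `matches` functions and the same skeleton walks directories, JSON, or HTML.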
Recognizing Traversal Problems in the Wild
How do you know when you're facing a traversal problem? Look for these indicators:
"I need to find..."
- "...all Python files in this project"
- "...every occurrence of 'user_id' in this JSON"
- "...all the tables on this webpage"
- "...every function call in this codebase"
"I need to collect..."
- "...the total size of all files"
- "...all email addresses from nested data"
- "...all links grouped by section"
- "...all import statements"
"I don't know where it is..."
- "The data could be nested at any level"
- "Tables could appear anywhere in the HTML"
- "Functions could be in any file"
"I need to understand the structure..."
- "What fields exist in this API response?"
- "How is this HTML document organized?"
- "What's the shape of this configuration file?"
If you're saying any of these things, you're facing a traversal problem.
Exercise: Identifying the Pattern in Familiar Code
Look at this code and answer the four questions:
soup = BeautifulSoup(html, 'html.parser')
for link in soup.find_all('a'):
    href = link.get('href')
    if href and href.startswith('https'):
        print(href)
Questions:
- WHERE AM I on each iteration?
- WHAT'S HERE at each position?
- WHERE CAN I GO? (Is this going anywhere, or is it just collecting?)
- WHAT AM I LOOKING FOR?
Bonus: Rewrite this using your own recursion over the element tree, instead of letting find_all do the traversal for you.
Answers (try first before reading):
- WHERE AM I: Each <a> tag found by find_all
- WHAT'S HERE: The href attribute of the current link
- WHERE CAN I GO: We're not going anywhere—find_all already did the traversal for us
- WHAT AM I LOOKING FOR: HTTPS links
This is a key insight: sometimes the library already handles the traversal. You just need to filter what it gives you.
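For the bonus, here is one possible hand-rolled recursion. To keep the sketch self-contained it uses the standard library's ElementTree rather than BeautifulSoup, and the sample markup is invented; the recursion shape is the same either way.

```python
import xml.etree.ElementTree as ET

def find_https_links(element):
    """Recursively collect https hrefs from <a> tags at any depth."""
    links = []
    href = element.get('href', '')
    if element.tag == 'a' and href.startswith('https'):  # WHAT AM I LOOKING FOR
        links.append(href)
    for child in element:                                # WHERE CAN I GO
        links.extend(find_https_links(child))
    return links

html = """<div><p><a href="https://a.example">x</a></p>
<a href="http://plain.example">y</a>
<div><a href="https://b.example">z</a></div></div>"""
root = ET.fromstring(html)
print(find_https_links(root))  # ['https://a.example', 'https://b.example']
```

In real code you would still reach for find_all first; writing the recursion yourself is the exercise, not the recommendation.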
Chapter 3: The Two-Layer Decision Framework
Before writing any traversal code, you need to make two decisions. Get these wrong and you'll waste hours writing code that already exists, or fighting with recursion you don't need.
Layer 1: Does the Library Already Solve This?
This is the most important question, and the one most people skip.
Here's the mistake: You see nested data and immediately think, "I need to write a recursive function."
Here's the truth: Most well-designed libraries already provide traversal features. Your job is to find them.
The Exploration Protocol
When facing unfamiliar nested data, do this first:
# Step 1: What type is this?
print(type(data))
# Step 2: What can I do with it?
print(dir(data))
# Step 3: Read the documentation
help(data.some_method)
Let me show you what this looks like in practice:
# You have a BeautifulSoup object and need to find all tables
soup = BeautifulSoup(html, 'html.parser')
# WRONG: Start writing recursive traversal
# def find_tables(element):
# if element.name == 'table':
# ...
# RIGHT: Explore first
print(dir(soup)) # Oh, there's a .find_all() method
help(soup.find_all) # It searches descendants automatically!
# The solution is one line:
tables = soup.find_all('table')
You almost never need to manually traverse a BeautifulSoup tree. The library does it for you.
Common Library Patterns Across Domains
Different domains, same patterns:
| Domain | Library Feature | What It Does |
|---|---|---|
| File Systems | os.walk() | Recursively yields all directories and files |
| JSON | Direct indexing | data['a']['b']['c'] when you know the path |
| HTML | .find_all() | Searches entire tree for matching elements |
| HTML | .select() | CSS selectors for complex queries |
| AST | .children | Direct access to child nodes |
| AST | .child_by_field_name() | Semantic field access |
The pattern: Good libraries expose the structure in a way that makes traversal easy or unnecessary.
When to Stop and Use What's Provided
Use library features when:
- You're searching for all instances of something (find_all, os.walk)
- You know the path to the data (data['user']['email'])
- The library has semantic accessors (.child_by_field_name('name'))
- The documentation shows traversal examples
Layer 2: Do You Need Custom Traversal?
Sometimes the library doesn't solve your specific problem. But don't jump to custom traversal yet. First, identify which problem you actually have.
The Four Scenarios That Require Custom Code
Scenario 1: Context Tracking
Problem: You need to know where you are in the structure, not just what you found.
# Example: "Find all links, but label which section they're in"
# BeautifulSoup's find_all() loses section context
def extract_links_by_section(soup):
    sections = {}
    for section in soup.find_all('section'):
        section_name = section.get('id', 'unknown')
        sections[section_name] = [
            a.get('href') for a in section.find_all('a')
        ]
    return sections
Scenario 2: Conditional Navigation
Problem: Whether you explore a branch depends on what you find.
# Example: "Skip __pycache__ directories"
# os.walk() visits everything by default
for root, dirs, files in os.walk(start_path):
    # Modify dirs in-place to skip directories
    dirs[:] = [d for d in dirs if d != '__pycache__']
    # Process files...
Scenario 3: Custom Aggregation
Problem: You need to combine data in ways the library doesn't support.
# Example: "Sum the size of all Python files, grouped by directory"
# os.walk() gives you files, but doesn't aggregate
def size_by_directory(start_path):
    sizes = {}
    for root, dirs, files in os.walk(start_path):
        py_files = [f for f in files if f.endswith('.py')]
        sizes[root] = sum(
            os.path.getsize(os.path.join(root, f))
            for f in py_files
        )
    return sizes
Scenario 4: Unknown Structure
Problem: You don't know where the data is or what the structure looks like.
# Example: "Find all values for the key 'email', anywhere in this JSON"
# Direct access won't work because you don't know the path
def find_all_values(data, key):
    results = []
    if isinstance(data, dict):
        if key in data:
            results.append(data[key])
        for value in data.values():
            results.extend(find_all_values(value, key))
    elif isinstance(data, list):
        for item in data:
            results.extend(find_all_values(item, key))
    return results
The Cost-Benefit Analysis
Custom traversal has costs:
- More code to write and maintain
- More places for bugs to hide
- Often slower than optimized library methods
- Harder for others to understand
Only write custom traversal when:
- The library doesn't support your use case
- You've checked the documentation thoroughly
- The benefits outweigh the maintenance cost
Decision Tree: Library vs. Custom Implementation
START: I need to navigate nested data
├─ Do I know the exact path to the data?
│   ├─ YES → Use direct access: data['a']['b']['c']
│   └─ NO → Continue
│
├─ Am I searching for all instances of something?
│ ├─ File system → Use os.walk()
│ ├─ HTML → Use soup.find_all() or .select()
│ ├─ JSON/Dicts → Need custom traversal (Scenario 4)
│ └─ AST → Use node.children or custom traversal
│
├─ Do I need to track context while searching?
│ └─ YES → Custom traversal (Scenario 1)
│
├─ Does navigation depend on what I find?
│ └─ YES → Custom traversal (Scenario 2)
│
└─ Do I need custom aggregation?
└─ YES → Use library for traversal + custom processing (Scenario 3)
Exercise: Evaluating Real-World Scenarios
For each scenario below, decide: Library feature or custom traversal?
Scenario A:
# Find all .jpg files in a directory tree
# and print their full paths
Scenario B:
# Given JSON API response, extract the value
# at data['results'][0]['user']['email']
Scenario C:
# Find all <img> tags in HTML, but only
# those inside <article> elements, and
# group them by article title
Scenario D:
# Search nested JSON for ANY occurrence
# of the key "timestamp", regardless of depth
Scenario E:
# Walk directory tree but calculate
# total size only for .py files,
# skipping any 'test' directories
Answers:
A: Library (os.walk()) - it's exactly what it's designed for
B: Library (direct access) - you know the path
C: Custom traversal (Scenario 1) - context tracking (which article) + conditional collection (only imgs in articles)
D: Custom traversal (Scenario 4) - unknown structure requires recursion
E: Library + custom logic (Scenario 3) - use os.walk() but with custom filtering and aggregation
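Scenario E is worth sketching in full, since it combines two of the scenarios above: os.walk() does the traversal, the in-place dirs edit is the Scenario 2 pruning, and the sum is the Scenario 3 aggregation. The function name is invented for illustration.

```python
import os

def total_py_size(start_path):
    """Sum sizes of .py files, skipping any directory named 'test'."""
    total = 0
    for root, dirs, files in os.walk(start_path):
        # Scenario 2: prune 'test' directories before os.walk descends
        dirs[:] = [d for d in dirs if d != 'test']
        # Scenario 3: aggregate only the .py files
        total += sum(
            os.path.getsize(os.path.join(root, f))
            for f in files if f.endswith('.py')
        )
    return total
```

Notice how little custom code is needed once the library carries the traversal itself.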
What You've Learned
You now understand:
- The Universal Pattern - Four questions every traversal answers
- Problem Recognition - How to identify traversal problems
- The Exploration Protocol - Always check library features first
- The Four Scenarios - When you actually need custom code
- Decision Framework - How to choose your approach
In Part II, we'll start with the most familiar domain—file systems—and build your traversal intuition from there. You already know os.walk() exists. Now you'll learn how it works, and more importantly, how to think beyond it.