Part I: Foundation - The Universal Pattern
Chapter 1: Introduction - Why This Matters
You've written this code before:
import os

for root, dirs, files in os.walk('/my/project'):
    for file in files:
        if file.endswith('.py'):
            print(os.path.join(root, file))
And you've probably written something like this:
def find_email(data):
    if isinstance(data, dict):
        if 'email' in data:
            return data['email']
        for value in data.values():
            result = find_email(value)
            if result:
                return result
    elif isinstance(data, list):
        for item in data:
            result = find_email(item)
            if result:
                return result
    return None
These look completely different. One walks through directories, the other searches nested JSON. But here's what most tutorials won't tell you: they're the same problem wearing different clothes.
The Hidden Commonality
Both are asking:
- Where am I right now?
- What's at this location?
- Where can I go next?
- Am I done, or should I keep looking?
Once you see this pattern, you'll recognize it everywhere:
- Parsing HTML tables with BeautifulSoup
- Walking an Abstract Syntax Tree to analyze code
- Traversing configuration files
- Navigating API responses
- Exploring database query results
The problem isn't that these tasks are hard. The problem is that every tutorial teaches them in isolation, as if os.walk() and JSON recursion and DOM traversal are separate skills you have to learn from scratch each time.
What You'll Actually Learn
This guide teaches you one mental model that works across all nested data structures. You'll learn:
- How to recognize traversal problems before you start coding
- When to use existing library features vs. when to write custom code
- The core patterns that work regardless of data type
- How to adapt the pattern to file systems, JSON, HTML, and code ASTs
Most importantly, you'll learn how to explore unfamiliar structures confidently, without frantically Googling "how to get nested value from JSON" for the hundredth time.
Who This Guide Is For
You should:
- Be comfortable with Python basics (functions, loops, dictionaries)
- Have seen os.walk() at least once, even if you don't fully understand it
- Know what recursion is, even if you're not great at writing it yet
- Have struggled with nested data before
You don't need to be an expert. If you've ever stared at a 500-line JSON response and wondered "how do I get that value out?", this guide is for you.
How to Use This Guide
If you're in a hurry: Read Chapters 2-3 (the pattern and decision framework), then skip to whatever domain you're working in (JSON, HTML, or ASTs).
If you want mastery: Read sequentially. Each chapter builds on the last, starting simple (file systems) and progressing to complex (AST analysis).
If you're stuck on something specific: Start with Chapter 23's toolkit, then jump to the relevant domain chapter.
The exercises are essential. Don't skip them. You can read about traversal patterns all day, but until you struggle with the design yourself—until you make the mistakes and fix them—the knowledge won't stick.
Let's begin.
Chapter 2: The Universal Traversal Pattern
Here's a deceptively simple piece of code:
for root, dirs, files in os.walk('/project'):
    for file in files:
        print(file)
Let me ask you: where is os.walk() right now? What directory is it looking at?
The answer changes with every iteration. That's the essence of traversal: moving through a structure, asking questions at each position.
The Four Questions Every Traversal Answers
Every traversal—whether it's directories, JSON, HTML, or code—answers these four questions:
1. WHERE AM I?
What's my current position in the structure?
# In os.walk:
root # Current directory path
# In JSON traversal:
current_dict # Current dictionary or list
# In BeautifulSoup:
current_tag # Current HTML element
# In Tree-sitter:
current_node # Current AST node
2. WHAT'S HERE?
What data exists at this position?
# In os.walk:
files # List of filenames in this directory
# In JSON:
data['name'] # Value at this key
# In BeautifulSoup:
tag.text # Text content of this element
# In Tree-sitter:
node.type # What kind of code construct is this?
3. WHERE CAN I GO?
What are my next possible positions?
# In os.walk:
dirs # Subdirectories to visit next
# In JSON:
data.values() # Nested dictionaries or lists
# In BeautifulSoup:
tag.children # Child elements
# In Tree-sitter:
node.children # Child nodes in the syntax tree
4. WHAT AM I LOOKING FOR?
When do I stop? What triggers success or failure?
# Examples:
if file.endswith('.py'): # Found a Python file
if 'email' in data: # Found the email field
if tag.name == 'table': # Found a table element
if node.type == 'function_definition': # Found a function
Let's see these questions in action with a real example:
def find_large_files(start_path, size_threshold):
    """Find all files larger than threshold."""
    large_files = []
    for root, dirs, files in os.walk(start_path):  # WHERE CAN I GO?
        # WHERE AM I: root is current directory
        for file in files:  # WHAT'S HERE?
            path = os.path.join(root, file)
            size = os.path.getsize(path)
            # WHAT AM I LOOKING FOR?
            if size > size_threshold:
                large_files.append((path, size))
    return large_files
Now here's the exact same pattern searching nested JSON:
def find_large_values(data, threshold, path=""):
    """Find all numeric values larger than threshold."""
    results = []
    # WHERE AM I: current data (dict, list, or value)
    if isinstance(data, dict):
        # WHERE CAN I GO: dictionary values
        for key, value in data.items():
            results.extend(
                find_large_values(value, threshold, f"{path}.{key}")
            )
    elif isinstance(data, list):
        # WHERE CAN I GO: list items
        for i, item in enumerate(data):
            results.extend(
                find_large_values(item, threshold, f"{path}[{i}]")
            )
    elif isinstance(data, (int, float)):
        # WHAT'S HERE: a number
        # WHAT AM I LOOKING FOR?
        if data > threshold:
            results.append((path, data))
    return results
Do you see it? Same four questions, different data structure.
The Pattern in Plain English
Every traversal follows this logic:
- Start somewhere (root directory, top-level dict, document root, etc.)
- Look at what's here (files, values, tags, nodes)
- Check if you've found what you want (matching criteria)
- Decide where to go next (subdirectories, nested dicts, child elements)
- Repeat until done (no more places to go, or found what you need)
That's it. That's the whole pattern.
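The five steps above can be collapsed into one generic function. This is an illustrative sketch, not library code: `get_children` and `matches` are hypothetical callables you supply for each data type, answering "where can I go?" and "what am I looking for?" respectively.

```python
def traverse(start, get_children, matches):
    """Generic traversal over any nested structure."""
    found = []
    to_visit = [start]                          # 1. Start somewhere
    while to_visit:                             # 5. Repeat until done
        current = to_visit.pop()                # 2. Look at what's here
        if matches(current):                    # 3. Check if you've found it
            found.append(current)
        to_visit.extend(get_children(current))  # 4. Decide where to go next
    return found

# Example: find all ints greater than 10 in arbitrarily nested lists
nested = [1, [20, [3, 42]], 7]
children = lambda x: x if isinstance(x, list) else []
is_big = lambda x: isinstance(x, int) and x > 10
print(sorted(traverse(nested, children, is_big)))  # [20, 42]
```

Swap in different `get_children` and `matches` functions and the same skeleton walks directories, JSON, or HTML.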
Recognizing Traversal Problems in the Wild
How do you know when you're facing a traversal problem? Look for these indicators:
"I need to find..."
- "...all Python files in this project"
- "...every occurrence of 'user_id' in this JSON"
- "...all the tables on this webpage"
- "...every function call in this codebase"
"I need to collect..."
- "...the total size of all files"
- "...all email addresses from nested data"
- "...all links grouped by section"
- "...all import statements"
"I don't know where it is..."
- "The data could be nested at any level"
- "Tables could appear anywhere in the HTML"
- "Functions could be in any file"
"I need to understand the structure..."
- "What fields exist in this API response?"
- "How is this HTML document organized?"
- "What's the shape of this configuration file?"
If you're saying any of these things, you're facing a traversal problem.
Exercise: Identifying the Pattern in Familiar Code
Look at this code and answer the four questions:
soup = BeautifulSoup(html, 'html.parser')
for link in soup.find_all('a'):
    href = link.get('href')
    if href and href.startswith('https'):
        print(href)
Questions:
- WHERE AM I on each iteration?
- WHAT'S HERE at each position?
- WHERE CAN I GO? (Is this going anywhere, or is it just collecting?)
- WHAT AM I LOOKING FOR?
Bonus: Rewrite this using your own recursion over the element tree, instead of letting find_all do the traversal for you.
Answers (try first before reading):
- WHERE AM I: Each <a> tag found by find_all
- WHAT'S HERE: The href attribute of the current link
- WHERE CAN I GO: We're not going anywhere—find_all already did the traversal for us
- WHAT AM I LOOKING FOR: HTTPS links
This is a key insight: sometimes the library already handles the traversal. You just need to filter what it gives you.
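For the bonus, here is one possible hand-rolled recursion. To keep the sketch self-contained it uses the standard library's ElementTree rather than BeautifulSoup, and the sample markup is invented; the recursion shape is the same either way.

```python
import xml.etree.ElementTree as ET

def find_https_links(element):
    """Recursively collect https hrefs from <a> tags at any depth."""
    links = []
    href = element.get('href', '')
    if element.tag == 'a' and href.startswith('https'):  # WHAT AM I LOOKING FOR
        links.append(href)
    for child in element:                                # WHERE CAN I GO
        links.extend(find_https_links(child))
    return links

html = """<div><p><a href="https://a.example">x</a></p>
<a href="http://plain.example">y</a>
<div><a href="https://b.example">z</a></div></div>"""
root = ET.fromstring(html)
print(find_https_links(root))  # ['https://a.example', 'https://b.example']
```

In real code you would still reach for find_all first; writing the recursion yourself is the exercise, not the recommendation.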
Chapter 3: The Two-Layer Decision Framework
Before writing any traversal code, you need to make two decisions. Get these wrong and you'll waste hours writing code that already exists, or fighting with recursion you don't need.
Layer 1: Does the Library Already Solve This?
This is the most important question, and the one most people skip.
Here's the mistake: You see nested data and immediately think, "I need to write a recursive function."
Here's the truth: Most well-designed libraries already provide traversal features. Your job is to find them.
The Exploration Protocol
When facing unfamiliar nested data, do this first:
# Step 1: What type is this?
print(type(data))
# Step 2: What can I do with it?
print(dir(data))
# Step 3: Read the documentation
help(data.some_method)
Let me show you what this looks like in practice:
# You have a BeautifulSoup object and need to find all tables
soup = BeautifulSoup(html, 'html.parser')
# WRONG: Start writing recursive traversal
# def find_tables(element):
# if element.name == 'table':
# ...
# RIGHT: Explore first
print(dir(soup)) # Oh, there's a .find_all() method
help(soup.find_all) # It searches descendants automatically!
# The solution is one line:
tables = soup.find_all('table')
You almost never need to manually traverse a BeautifulSoup tree. The library does it for you.
Common Library Patterns Across Domains
Different domains, same patterns:
| Domain | Library Feature | What It Does |
|---|---|---|
| File Systems | os.walk() | Recursively yields all directories and files |
| JSON | Direct indexing | data['a']['b']['c'] when you know the path |
| HTML | .find_all() | Searches entire tree for matching elements |
| HTML | .select() | CSS selectors for complex queries |
| AST | .children | Direct access to child nodes |
| AST | .child_by_field_name() | Semantic field access |
The pattern: Good libraries expose the structure in a way that makes traversal easy or unnecessary.
When to Stop and Use What's Provided
Use library features when:
- You're searching for all instances of something (find_all, os.walk)
- You know the path to the data (data['user']['email'])
- The library has semantic accessors (.child_by_field_name('name'))
- The documentation shows traversal examples
Layer 2: Do You Need Custom Traversal?
Sometimes the library doesn't solve your specific problem. But don't jump to custom traversal yet. First, identify which problem you actually have.
The Four Scenarios That Require Custom Code
Scenario 1: Context Tracking
Problem: You need to know where you are in the structure, not just what you found.
# Example: "Find all links, but label which section they're in"
# BeautifulSoup's find_all() loses section context
def extract_links_by_section(soup):
    sections = {}
    for section in soup.find_all('section'):
        section_name = section.get('id', 'unknown')
        sections[section_name] = [
            a.get('href') for a in section.find_all('a')
        ]
    return sections
Scenario 2: Conditional Navigation
Problem: Whether you explore a branch depends on what you find.
# Example: "Skip __pycache__ directories"
# os.walk() visits everything by default
for root, dirs, files in os.walk(start_path):
    # Modify dirs in-place to skip directories
    dirs[:] = [d for d in dirs if d != '__pycache__']
    # Process files...
Scenario 3: Custom Aggregation
Problem: You need to combine data in ways the library doesn't support.
# Example: "Sum the size of all Python files, grouped by directory"
# os.walk() gives you files, but doesn't aggregate
def size_by_directory(start_path):
    sizes = {}
    for root, dirs, files in os.walk(start_path):
        py_files = [f for f in files if f.endswith('.py')]
        sizes[root] = sum(
            os.path.getsize(os.path.join(root, f))
            for f in py_files
        )
    return sizes
Scenario 4: Unknown Structure
Problem: You don't know where the data is or what the structure looks like.
# Example: "Find all values for the key 'email', anywhere in this JSON"
# Direct access won't work because you don't know the path
def find_all_values(data, key):
    results = []
    if isinstance(data, dict):
        if key in data:
            results.append(data[key])
        for value in data.values():
            results.extend(find_all_values(value, key))
    elif isinstance(data, list):
        for item in data:
            results.extend(find_all_values(item, key))
    return results
The Cost-Benefit Analysis
Custom traversal has costs:
- More code to write and maintain
- More places for bugs to hide
- Often slower than optimized library methods
- Harder for others to understand
Only write custom traversal when:
- The library doesn't support your use case
- You've checked the documentation thoroughly
- The benefits outweigh the maintenance cost
Decision Tree: Library vs. Custom Implementation
START: I need to navigate nested data
├─ Do I know the exact path to the data?
│   ├─ YES → Use direct access: data['a']['b']['c']
│   └─ NO → Continue
│
├─ Am I searching for all instances of something?
│ ├─ File system → Use os.walk()
│ ├─ HTML → Use soup.find_all() or .select()
│ ├─ JSON/Dicts → Need custom traversal (Scenario 4)
│ └─ AST → Use node.children or custom traversal
│
├─ Do I need to track context while searching?
│ └─ YES → Custom traversal (Scenario 1)
│
├─ Does navigation depend on what I find?
│ └─ YES → Custom traversal (Scenario 2)
│
└─ Do I need custom aggregation?
└─ YES → Use library for traversal + custom processing (Scenario 3)
Exercise: Evaluating Real-World Scenarios
For each scenario below, decide: Library feature or custom traversal?
Scenario A:
# Find all .jpg files in a directory tree
# and print their full paths
Scenario B:
# Given JSON API response, extract the value
# at data['results'][0]['user']['email']
Scenario C:
# Find all <img> tags in HTML, but only
# those inside <article> elements, and
# group them by article title
Scenario D:
# Search nested JSON for ANY occurrence
# of the key "timestamp", regardless of depth
Scenario E:
# Walk directory tree but calculate
# total size only for .py files,
# skipping any 'test' directories
Answers:
A: Library (os.walk()) - it's exactly what it's designed for
B: Library (direct access) - you know the path
C: Custom traversal (Scenario 1) - context tracking (which article) + conditional collection (only imgs in articles)
D: Custom traversal (Scenario 4) - unknown structure requires recursion
E: Library + custom logic (Scenario 3) - use os.walk() but with custom filtering and aggregation
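Scenario E is worth sketching in full, since it combines two of the scenarios above: os.walk() does the traversal, the in-place dirs edit is the Scenario 2 pruning, and the sum is the Scenario 3 aggregation. The function name is invented for illustration.

```python
import os

def total_py_size(start_path):
    """Sum sizes of .py files, skipping any directory named 'test'."""
    total = 0
    for root, dirs, files in os.walk(start_path):
        # Scenario 2: prune 'test' directories before os.walk descends
        dirs[:] = [d for d in dirs if d != 'test']
        # Scenario 3: aggregate only the .py files
        total += sum(
            os.path.getsize(os.path.join(root, f))
            for f in files if f.endswith('.py')
        )
    return total
```

Notice how little custom code is needed once the library carries the traversal itself.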
What You've Learned
You now understand:
- The Universal Pattern - Four questions every traversal answers
- Problem Recognition - How to identify traversal problems
- The Exploration Protocol - Always check library features first
- The Four Scenarios - When you actually need custom code
- Decision Framework - How to choose your approach
In Part II, we'll start with the most familiar domain—file systems—and build your traversal intuition from there. You already know os.walk() exists. Now you'll learn how it works, and more importantly, how to think beyond it.