Part VI: Advanced Patterns and Best Practices
Chapter 15: Problem Type Recognition
You've now worked with file systems, JSON, HTML, and ASTs. You've seen os.walk(), BeautifulSoup, and tree-sitter. But here's the deeper lesson: these aren't four separate skills. They're one skill wearing different hats.
The true mastery isn't memorizing each library's API. It's recognizing what kind of problem you're solving, then choosing the right tool for that problem type.
The Four Problem Types Revisited
Let's start simple. Every nested data problem falls into one of four categories:
1. Path Navigation
You know exactly where you're going. You have a map.
# You know the structure
user_email = api_response['data']['user']['contact']['email']
# You know the node type you want
function_body = func_node.child_by_field_name('body')
This is the easiest problem. Use direct access. Don't overcomplicate it.
2. Search/Collection
You're looking for something, but you don't know where it is. You need to explore.
# Find all functions named 'helper' anywhere in the code
# Find all JSON keys named 'price' at any nesting level
# Find all <a> tags with class 'external'
This needs traversal—either recursive or using a library method that does it for you.
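The Search/Collection pattern can be sketched in a few lines of plain Python. The helper below (find_key is a name chosen for illustration) recursively collects every value stored under a given key, regardless of nesting depth:

```python
def find_key(data, target):
    """Recursively collect every value stored under `target`."""
    found = []
    if isinstance(data, dict):
        for key, value in data.items():
            if key == target:
                found.append(value)
            found.extend(find_key(value, target))  # keys can repeat deeper down
    elif isinstance(data, list):
        for item in data:
            found.extend(find_key(item, target))
    return found

catalog = {"items": [{"price": 9.99}, {"bundle": {"price": 4.50}}]}
print(find_key(catalog, "price"))  # → [9.99, 4.5]
```

The same shape works for ASTs and HTML trees; only the "what are this node's children?" step changes.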
3. Structure Exploration
You don't even know what you're looking for yet. You're mapping the territory.
# What fields does this API response have?
# What node types appear in this AST?
# What's the hierarchy of this HTML?
You need to print, inspect, and understand before you can write real code.
4. Contextual Navigation
The hard one. What you do depends on where you are or where you've been.
# Extract table data, but only from the "Results" section
# Count function calls, but track which function they're inside
# Process JSON differently based on parent keys
This requires tracking state as you traverse.
A Real Example: Let's Classify Together
Imagine you're given this task:
"Parse Python files and find all function calls to
print(), but only report them if they're inside a function whose name starts withtest_."
Before you write any code, let's think through this:
What's the problem type?
- Not Path Navigation—we don't know where the test functions are
- Not Structure Exploration—we know what we're looking for
- It's Search/Collection (find all print calls) plus Context Tracking (which function am I in?)
This is Type 4: Contextual Navigation.
What does this tell us?
- We'll need to traverse the AST recursively
- We need to track state: "current function name"
- We'll need to check node types: function_definition and call
Now let's write it. First, let's explore what we're working with:
import tree_sitter_python as tspython
from tree_sitter import Language, Parser
# Setup (you'd do this once)
PY_LANGUAGE = Language(tspython.language())
parser = Parser(PY_LANGUAGE)
code = b"""
def test_basic():
print("Starting test")
result = calculate()
print(f"Result: {result}")
def calculate():
print("This shouldn't be reported")
return 42
"""
tree = parser.parse(code)
root = tree.root_node
# Let's explore the structure first
def print_tree(node, depth=0):
print(" " * depth + node.type)
for child in node.children:
print_tree(child, depth + 1)
print_tree(root)
Run this. Look at the output. You'll see function_definition nodes, call nodes, and the structure becomes clear.
Now we can build our solution:
def find_test_prints(node, source, current_function=None):
    """Find print() calls inside test_ functions."""
    results = []

    # Are we entering a new function?
    if node.type == 'function_definition':
        name_node = node.child_by_field_name('name')
        if name_node:
            func_name = source[name_node.start_byte:name_node.end_byte].decode('utf8')
            # Update context only if it's a test function
            if func_name.startswith('test_'):
                current_function = func_name
            else:
                current_function = None  # Don't track non-test functions

    # Are we at a print call?
    if node.type == 'call' and current_function:
        func_node = node.child_by_field_name('function')
        if func_node:
            func_text = source[func_node.start_byte:func_node.end_byte].decode('utf8')
            if func_text == 'print':
                line_num = source[:node.start_byte].decode('utf8').count('\n') + 1
                results.append({
                    'function': current_function,
                    'line': line_num,
                    'call': source[node.start_byte:node.end_byte].decode('utf8')
                })

    # Continue traversing
    for child in node.children:
        results.extend(find_test_prints(child, source, current_function))

    return results
# Use it
prints = find_test_prints(root, code)
for p in prints:
    print(f"{p['function']} (line {p['line']}): {p['call']}")
What did we just do?
- Recognized the problem type (contextual navigation)
- Explored the structure first
- Built a simple version that tracks one piece of context
- Used recursive traversal with state passing
This is the pattern. Problem recognition → exploration → implementation.
The Recognition Matrix
Here's how to quickly classify problems:
| If you need to... | Problem Type | Approach |
|---|---|---|
| Get a known field/path | Path Navigation | Direct access |
| Find all X anywhere | Search/Collection | Recursive traverse or library method |
| Understand structure | Structure Exploration | Print/inspect |
| Track "where am I?" | Contextual Navigation | Recursive + state parameter |
Exercise: Classification Practice
For each scenario, identify the problem type and the approach you'd use:
- "Extract all email addresses from a JSON API response"
- "Get the username from:
config['server']['auth']['username']" - "Find function definitions that contain more than 3 print statements"
- "List all possible keys that appear in this deeply nested JSON"
- "Extract all links from
<a>tags, but only those in the main article, not the sidebar"
Answers:
- Type 2 (Search) if structure unknown; Type 1 (Path) if you know emails are always at data[i]['email']
- Type 1 (Path Navigation) - direct access
- Type 4 (Contextual) - need to track "inside which function?" and count
- Type 3 (Structure Exploration) - you're mapping the territory
- Type 4 (Contextual) - need to know which section you're in
Chapter 16: Performance and Optimization
Let's talk about a mistake many developers make: they optimize too early.
Here's the truth: most traversal code doesn't need to be optimized. You're usually processing a file, an API response, or a single HTML page. These are small. Your first version will probably be fast enough.
But when it's not fast enough, you need to know what to do.
Start By Measuring
Never optimize without measuring. Here's a simple pattern:
import time
def time_it(func, *args):
    start = time.time()
    result = func(*args)
    elapsed = time.time() - start
    print(f"{func.__name__} took {elapsed:.3f} seconds")
    return result
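One refinement worth knowing: time.time() follows the wall clock, which can jump when the system adjusts it, while time.perf_counter() is monotonic and higher resolution. A variant of the helper using it, applied to a hypothetical workload (time_it_pc and count_newlines are illustrative names):

```python
import time

def time_it_pc(func, *args):
    """Like time_it, but with a monotonic high-resolution clock."""
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    print(f"{func.__name__} took {elapsed:.3f} seconds")
    return result

# Hypothetical workload to time
def count_newlines(text):
    return text.count("\n")

total = time_it_pc(count_newlines, "a\nb\nc\n")  # prints the timing, returns 3
```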
Or use Python's cProfile:
import cProfile
import pstats
profiler = cProfile.Profile()
profiler.enable()
# Your traversal code here
traverse_large_structure(data)
profiler.disable()
stats = pstats.Stats(profiler)
stats.sort_stats('cumulative')
stats.print_stats(10) # Top 10 slowest functions
Run this first. Find out where the time actually goes. You might be surprised.
Memory: The Generator Pattern
Let's say you're finding all Python files in a large directory. Here's the obvious way:
import os

def find_python_files_slow(root_dir):
    """Returns a list of all .py files"""
    results = []
    for dirpath, dirnames, filenames in os.walk(root_dir):
        for filename in filenames:
            if filename.endswith('.py'):
                results.append(os.path.join(dirpath, filename))
    return results

# This loads ALL files into memory at once
files = find_python_files_slow('/large/codebase')
for f in files:
    process(f)
If there are 10,000 files, you're storing 10,000 paths in memory at once.
Now watch this:
def find_python_files_fast(root_dir):
    """Yields .py files one at a time"""
    for dirpath, dirnames, filenames in os.walk(root_dir):
        for filename in filenames:
            if filename.endswith('.py'):
                yield os.path.join(dirpath, filename)

# This processes one file at a time
for f in find_python_files_fast('/large/codebase'):
    process(f)
The only change: yield instead of append + return. But the memory difference is huge.
When to use generators:
- Processing large file systems
- Working with huge JSON/XML files
- Any time you're collecting thousands of items
- When you might not need all results (early stopping)
When lists are fine:
- Small datasets (hundreds of items, not thousands)
- When you need the full list multiple times
- When you need to count items first (len(results))
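Generators also compose well with the standard library. itertools.islice lets you take just the first few results from a generator that might otherwise produce thousands (or, as in this sketch, infinitely many) items:

```python
import itertools

def numbers():
    """Infinite generator -- a list-building version could never finish."""
    n = 0
    while True:
        yield n
        n += 1

# Chain a filter and take only the first three matches;
# nothing past the third even gets generated.
evens = (n for n in numbers() if n % 2 == 0)
first_three = list(itertools.islice(evens, 3))
print(first_three)  # → [0, 2, 4]
```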
Early Stopping: Don't Do More Work Than Needed
Imagine finding the first function in a file:
# Slow: traverses the entire tree
def find_first_function_slow(node):
    results = []
    if node.type == 'function_definition':
        results.append(node)
    for child in node.children:
        results.extend(find_first_function_slow(child))
    return results[0] if results else None
This visits every node, even after finding what you want.
# Fast: stops when found
def find_first_function_fast(node):
    if node.type == 'function_definition':
        return node
    for child in node.children:
        result = find_first_function_fast(child)
        if result:  # Found it! Stop searching
            return result
    return None
The fast version stops as soon as it finds a match. If the first function is at line 10, it doesn't traverse lines 11-1000.
Depth Limiting: When You Don't Need to Go Deep
Sometimes you only care about top-level structure:
def find_top_level_functions(node, max_depth=2):
    """Only search the first few levels"""
    results = []
    def search(node, depth):
        if depth > max_depth:
            return
        if node.type == 'function_definition':
            results.append(node)
        for child in node.children:
            search(child, depth + 1)
    search(node, 0)
    return results
This is especially useful for deeply nested JSON or HTML where you know your target is near the top.
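The same idea applies directly to nested dicts. A sketch of a depth-limited key search (find_key_shallow is a hypothetical name): once depth exceeds the limit, whole subtrees are skipped without being visited.

```python
def find_key_shallow(data, target, max_depth=2, depth=0):
    """Collect values for `target`, but never descend past max_depth."""
    if depth > max_depth:
        return []  # prune the whole subtree
    found = []
    if isinstance(data, dict):
        for key, value in data.items():
            if key == target:
                found.append(value)
            found.extend(find_key_shallow(value, target, max_depth, depth + 1))
    elif isinstance(data, list):
        for item in data:
            found.extend(find_key_shallow(item, target, max_depth, depth + 1))
    return found

config = {"name": "app",
          "nested": {"name": "inner",
                     "deep": {"more": {"name": "buried"}}}}
print(find_key_shallow(config, "name", max_depth=1))  # → ['app', 'inner']
```

The "buried" value sits at depth 3, so with max_depth=1 it is never examined.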
Caching: When You Traverse Repeatedly
Let's say you're analyzing a codebase and need to find functions by name multiple times:
# Without caching: re-traverses every time
def find_function(root, name):
    # ... traverse tree looking for function with 'name' ...
    pass

# Called multiple times - slow!
find_function(root, 'initialize')
find_function(root, 'process')
find_function(root, 'cleanup')
Better approach: traverse once, cache results:
def build_function_map(root, source):
    """Traverse once, build a lookup dict"""
    functions = {}
    def traverse(node):
        if node.type == 'function_definition':
            name_node = node.child_by_field_name('name')
            if name_node:
                name = source[name_node.start_byte:name_node.end_byte].decode('utf8')
                functions[name] = node
        for child in node.children:
            traverse(child)
    traverse(root)
    return functions
# Build once
func_map = build_function_map(root, source)
# Look up many times - fast!
init_func = func_map.get('initialize')
process_func = func_map.get('process')
cleanup_func = func_map.get('cleanup')
The TreeCursor Memory Optimization
For very large ASTs, tree-sitter's TreeCursor can reduce memory:
def count_nodes_with_list(node):
    """Creates a list of children - uses more memory"""
    count = 1
    for child in node.children:  # Creates list of all children
        count += count_nodes_with_list(child)
    return count

def count_nodes_with_cursor(node):
    """Uses cursor - less memory"""
    cursor = node.walk()
    count = 1
    if cursor.goto_first_child():
        count += count_nodes_with_cursor(cursor.node)
        while cursor.goto_next_sibling():
            count += count_nodes_with_cursor(cursor.node)
    return count
But honestly? Only use TreeCursor if you're processing massive files (10,000+ line files) and profiling shows memory is an issue. The direct .children approach is clearer and fast enough for most cases.
Real-World Example: Optimizing a Slow Script
Let's walk through a real optimization:
# Original: Slow for large codebases
def find_complex_functions(directory):
results = []
# Issue 1: Traverses file system multiple times
for file in find_all_python_files(directory):
with open(file, 'rb') as f:
code = f.read()
tree = parser.parse(code)
# Issue 2: Traverses AST multiple times
functions = find_all_functions(tree.root_node, code)
for func in functions:
lines = count_lines(func, code)
calls = count_function_calls(func, code)
# Issue 3: Does work even for simple functions
if lines > 50 or calls > 10:
results.append({
'file': file,
'name': get_function_name(func, code),
'lines': lines,
'calls': calls
})
return results
Optimization 1: Single AST traversal
def analyze_function(node, source):
    """Do all analysis in one pass"""
    lines = source[node.start_byte:node.end_byte].decode('utf8').count('\n')
    # Count calls while traversing
    calls = 0
    def count_calls(n):
        nonlocal calls
        if n.type == 'call':
            calls += 1
        for child in n.children:
            count_calls(child)
    count_calls(node)
    return lines, calls
Optimization 2: Early filtering
# Check line count first (fast)
lines, calls = analyze_function(func, code)
if lines <= 50 and calls <= 10:
    continue  # Skip simple functions early
Optimization 3: Generator for files
def find_complex_functions(directory):
    for file in find_all_python_files(directory):  # Already a generator
        # Process one file at a time.
        # Memory usage stays constant.
        ...
Result: in a typical run, changes like these can make the script roughly an order of magnitude faster while using a fraction of the memory.
When Not to Optimize
Remember: premature optimization is the root of all evil. Don't optimize if:
- Your code runs in under a second
- You only run it occasionally
- The bottleneck is I/O (disk, network), not your code
- The "slow" version is much clearer
Write clear code first. Optimize only when you measure a real problem.
Exercise: Find the Bottleneck
Here's a slow function:
def extract_all_strings(root, source):
    """Extract all string literals from AST"""
    results = []
    def traverse(node):
        if node.type == 'string':
            text = source[node.start_byte:node.end_byte].decode('utf8')
            results.append(text)
        # Get all children, then filter
        for child in node.children:
            traverse(child)
    traverse(root)
    # Remove duplicates
    unique = []
    for s in results:
        if s not in unique:
            unique.append(s)
    return sorted(unique)
Your task: Identify the performance issues and rewrite this function.
Hints:
- The traversal itself is fine
- The duplicate removal is O(n²)
- Sorting at the end might not be necessary
- Consider using a set

Chapter 17: Error Handling and Robustness
Here's a hard truth: your first version will break on real data.
It doesn't matter how carefully you write your traversal code. Real-world data is messy:
- Missing fields
- Unexpected types
- Malformed HTML
- Encoding errors
- Permission issues
Let's learn to write code that doesn't crash when reality hits.
The Defensive Mindset
Compare these two approaches:
# Optimistic (crashes on messy data)
def get_user_email_bad(data):
return data['users'][0]['contact']['email']
# Defensive (handles missing data)
def get_user_email_good(data):
users = data.get('users')
if not users or len(users) == 0:
return None
contact = users[0].get('contact')
if not contact:
return None
return contact.get('email')
The good version is longer, yes. But it won't crash when:
- The users key is missing
- The users list is empty
- The first user has no contact
- The contact has no email
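For quick scripts, chained .get calls with or-defaults give the same safety more tersely, at some cost in readability. A sketch (get_user_email_terse is an illustrative name):

```python
def get_user_email_terse(data):
    """Same safety as the defensive version, via chained defaults."""
    users = data.get('users') or []          # missing or None -> empty list
    first = users[0] if users else {}        # empty list -> empty dict
    return (first.get('contact') or {}).get('email')

print(get_user_email_terse({}))  # → None
print(get_user_email_terse({'users': [{'contact': {'email': 'a@b.com'}}]}))
```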
But wait—there's a cleaner way for simple path navigation:
def get_nested_value(data, *keys, default=None):
    """Safely navigate nested dicts"""
    for key in keys:
        if isinstance(data, dict):
            data = data.get(key)
        elif isinstance(data, list) and isinstance(key, int):
            data = data[key] if 0 <= key < len(data) else None
        else:
            return default
        if data is None:
            return default
    return data

# Use it
email = get_nested_value(data, 'users', 0, 'contact', 'email', default='unknown')
Try-Except: When and How
Don't use try-except to hide logic errors. This is bad:
# Bad: Catching your own mistakes
try:
result = node.children[5].children[2].text # What if structure differs?
except:
result = None
Do use try-except for things outside your control:
# Good: Handling I/O errors
try:
with open(filepath, 'rb') as f:
content = f.read()
except PermissionError:
print(f"Cannot access {filepath}: permission denied")
return None
except FileNotFoundError:
print(f"File not found: {filepath}")
return None
# Good: Handling encoding issues
try:
text = content.decode('utf-8')
except UnicodeDecodeError:
# Try alternate encoding
text = content.decode('latin-1', errors='replace')
Be specific with exceptions:
# Bad: Too broad
try:
process_data(data)
except Exception: # Catches everything!
pass
# Good: Specific handling
try:
process_data(data)
except KeyError as e:
print(f"Missing required key: {e}")
except ValueError as e:
print(f"Invalid value format: {e}")
Handling Missing Data in Traversals
Let's build a robust JSON traversal:
def find_all_values(data, target_key):
    """Find all values for a key, handling any data type"""
    results = []
    def traverse(obj, path="root"):
        # Handle None
        if obj is None:
            return
        # Handle dictionaries
        if isinstance(obj, dict):
            for key, value in obj.items():
                if key == target_key:
                    results.append({
                        'path': f"{path}.{key}",
                        'value': value
                    })
                # Recurse into value
                traverse(value, f"{path}.{key}")
        # Handle lists
        elif isinstance(obj, list):
            for i, item in enumerate(obj):
                traverse(item, f"{path}[{i}]")
        # Primitives (strings, numbers, booleans) - base case.
        # No action needed, just return.
    traverse(data)
    return results
This handles:
- Nested dicts
- Lists at any level
- Mixed types
- None values
- Empty containers
HTML Traversal Robustness
BeautifulSoup is forgiving, but you still need to be careful:
def extract_article_text(soup):
    """Extract text, handling missing elements gracefully"""
    article = soup.find('article')
    if not article:
        print("Warning: No <article> tag found")
        return ""
    # Some elements might be text nodes, not tags
    paragraphs = []
    for elem in article.descendants:
        # Check if it's a tag (not a text node)
        if hasattr(elem, 'name') and elem.name == 'p':
            text = elem.get_text(strip=True)
            if text:  # Skip empty paragraphs
                paragraphs.append(text)
    return '\n\n'.join(paragraphs)
AST Traversal with Error Recovery
Tree-sitter is robust, but your code needs to handle edge cases:
def extract_function_info(node, source):
    """Extract function details with error handling"""
    if node.type != 'function_definition':
        raise ValueError(f"Expected function_definition, got {node.type}")

    # Name might be missing (syntax error in source)
    name_node = node.child_by_field_name('name')
    name = "unknown"
    if name_node:
        try:
            name = source[name_node.start_byte:name_node.end_byte].decode('utf-8')
        except UnicodeDecodeError:
            name = "decode_error"

    # Parameters might be missing
    params_node = node.child_by_field_name('parameters')
    param_count = 0
    if params_node:
        # Count parameter nodes
        param_count = sum(1 for child in params_node.children
                          if child.type == 'identifier')

    # Body is required, but might have errors
    body_node = node.child_by_field_name('body')
    line_count = 0
    if body_node:
        body_text = source[body_node.start_byte:body_node.end_byte].decode('utf-8', errors='replace')
        line_count = body_text.count('\n')

    return {
        'name': name,
        'params': param_count,
        'lines': line_count
    }
Logging for Debugging
When traversal fails, you need to know where and why:
import logging

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)

def traverse_with_logging(node, depth=0):
    """Traverse with detailed logging"""
    indent = "  " * depth
    logger.debug(f"{indent}Visiting: {node.type}")
    try:
        # Your traversal logic
        for child in node.children:
            traverse_with_logging(child, depth + 1)
    except Exception as e:
        logger.error(f"{indent}Error at {node.type}: {e}")
        logger.error(f"{indent}Position: {node.start_point}")
        raise
For file system traversal:
def traverse_directory_safe(root_dir):
    """Traverse with error reporting"""
    for dirpath, dirnames, filenames in os.walk(root_dir):
        # Handle permission errors on directories
        dirnames[:] = [d for d in dirnames
                       if os.access(os.path.join(dirpath, d), os.R_OK)]
        for filename in filenames:
            filepath = os.path.join(dirpath, filename)
            try:
                # Process file
                with open(filepath, 'r') as f:
                    content = f.read()
                process(content)
            except PermissionError:
                logger.warning(f"Permission denied: {filepath}")
            except UnicodeDecodeError:
                logger.warning(f"Encoding error: {filepath}")
            except Exception as e:
                logger.error(f"Unexpected error in {filepath}: {e}")
Exercise: Make It Bulletproof
Here's fragile code:
from bs4 import BeautifulSoup

def extract_prices(html):
    """Extract all prices from product listings"""
    soup = BeautifulSoup(html, 'html.parser')
    products = soup.find_all('div', class_='product')
    prices = []
    for product in products:
        price_text = product.find('span', class_='price').text
        # Expect format: "$19.99"
        price = float(price_text.replace('$', ''))
        prices.append(price)
    return prices
Your task: Rewrite this to handle:
- Missing price spans
- Different price formats ("$19.99", "19.99", "€19,99")
- Non-numeric text in price spans
- Empty product divs
Chapter 18: Designing Reusable Traversal Functions
You've written traversal code for specific problems. Now let's talk about writing traversal code that you'll use again and again.
The goal: write a function once, use it everywhere.
The Generic Traversal Pattern
Here's the skeleton of most traversal problems:
def generic_traverse(root, condition_func, action_func):
    """
    Traverse a tree structure.

    Args:
        root: Starting node/dict/object
        condition_func: (node) -> bool - should we process this?
        action_func: (node) -> Any - what to do if condition is true

    Returns:
        List of results from action_func
    """
    results = []
    def traverse(node):
        # Check condition
        if condition_func(node):
            result = action_func(node)
            if result is not None:
                results.append(result)
        # Get children (customize per data type)
        children = get_children(node)
        for child in children:
            traverse(child)
    traverse(root)
    return results
Now you can use it for different problems:
# Find all function definitions
functions = generic_traverse(
    root=ast_root,
    condition_func=lambda n: n.type == 'function_definition',
    action_func=lambda n: extract_function_info(n, source)
)

# Find all large files
large_files = generic_traverse(
    root=root_dir,
    condition_func=lambda path: os.path.isfile(path) and os.path.getsize(path) > 1_000_000,
    action_func=lambda path: path
)
The Visitor Pattern
For more complex cases, use the visitor pattern:
class ASTVisitor:
    """Base class for AST traversal with specific handlers"""

    def visit(self, node):
        """Main entry point"""
        method_name = f'visit_{node.type}'
        visitor = getattr(self, method_name, self.generic_visit)
        return visitor(node)

    def generic_visit(self, node):
        """Default: just traverse children"""
        for child in node.children:
            self.visit(child)

# Subclasses override specific visit_* methods
class FunctionCollector(ASTVisitor):
    def __init__(self, source):
        self.source = source
        self.functions = []

    def visit_function_definition(self, node):
        """Called automatically for function nodes"""
        name_node = node.child_by_field_name('name')
        if name_node:
            name = self.source[name_node.start_byte:name_node.end_byte].decode('utf8')
            self.functions.append(name)
        # Continue traversing children
        self.generic_visit(node)

# Use it
collector = FunctionCollector(source)
collector.visit(root)
print(collector.functions)
This scales beautifully:
class CodeAnalyzer(ASTVisitor):
    def __init__(self, source):
        self.source = source
        self.functions = []
        self.classes = []
        self.imports = []

    def visit_function_definition(self, node):
        # Handle functions
        self.functions.append(self.extract_function_info(node))
        self.generic_visit(node)

    def visit_class_definition(self, node):
        # Handle classes
        self.classes.append(self.extract_class_info(node))
        self.generic_visit(node)

    def visit_import_statement(self, node):
        # Handle imports
        self.imports.append(self.extract_import_info(node))
        # No need to traverse import children
Generator vs. List Returns
Return a list when:
- The result set is small
- You need to access results multiple times
- You need to count results first
Yield results when:
- The result set might be large
- Processing can stop early
- You're chaining operations
# List version: loads everything
def find_all_functions(root):
    functions = []
    # ... traverse and append ...
    return functions

# Generator version: yields one at a time
def find_all_functions(root):
    def traverse(node):
        if node.type == 'function_definition':
            yield node
        for child in node.children:
            yield from traverse(child)
    yield from traverse(root)

# Using the generator
for func in find_all_functions(root):
    if is_what_i_need(func):
        return func  # Can stop early!
Configuration Objects for Complex Traversals
When your traversal has many options, use a config object:
from dataclasses import dataclass
from typing import Callable, Optional
@dataclass
class TraversalConfig:
    """Configuration for file system traversal"""
    max_depth: Optional[int] = None
    follow_symlinks: bool = False
    include_hidden: bool = False
    file_filter: Optional[Callable[[str], bool]] = None
    error_handler: Optional[Callable[[str, Exception], None]] = None

def traverse_directory(root, config: TraversalConfig):
    """Flexible directory traversal"""
    def should_include(filename):
        if not config.include_hidden and filename.startswith('.'):
            return False
        if config.file_filter:
            return config.file_filter(filename)
        return True

    def handle_error(path, error):
        if config.error_handler:
            config.error_handler(path, error)
        else:
            print(f"Error: {path}: {error}")

    # ... implementation using config ...

# Use it
config = TraversalConfig(
    max_depth=3,
    include_hidden=False,
    file_filter=lambda f: f.endswith('.py'),
    error_handler=lambda path, err: logging.warning(f"Skipped {path}: {err}")
)
traverse_directory('/my/project', config)
Type Hints for Clarity
Type hints make your reusable functions easier to use:
from typing import TypeVar, Callable, List, Iterator, Optional, Any
T = TypeVar('T')  # Generic type variable

def find_all(
    root: Any,
    predicate: Callable[[Any], bool],
    get_children: Callable[[Any], Iterator[Any]]
) -> List[Any]:
    """
    Generic tree traversal.

    Args:
        root: The root node to start from
        predicate: Function that returns True for nodes to collect
        get_children: Function that returns an iterable of child nodes

    Returns:
        List of all nodes where predicate returns True

    Example:
        # Find all dicts with 'error' key
        errors = find_all(
            root=data,
            predicate=lambda x: isinstance(x, dict) and 'error' in x,
            get_children=lambda x: x.values() if isinstance(x, dict)
                         else x if isinstance(x, list)
                         else []
        )
    """
    results = []
    def traverse(node):
        if predicate(node):
            results.append(node)
        for child in get_children(node):
            traverse(child)
    traverse(root)
    return results
Documentation Matters
Good documentation turns your traversal function into a tool others (including future you) can use:
def extract_functions(
    root_node,
    source: bytes,
    include_nested: bool = True,
    min_lines: int = 0
) -> List[dict]:
    """
    Extract function information from a Python AST.

    This function traverses a tree-sitter AST and collects metadata
    about all function definitions found.

    Args:
        root_node: The root node of the AST (from tree.root_node)
        source: The original source code as bytes
        include_nested: If True, includes functions defined inside other
            functions or classes. If False, only top-level functions
            are returned.
        min_lines: Only return functions with at least this many lines

    Returns:
        A list of dictionaries, each containing:
        - 'name': Function name (str)
        - 'line_start': Starting line number (int)
        - 'line_end': Ending line number (int)
        - 'params': List of parameter names (List[str])
        - 'is_nested': Whether function is nested (bool)

    Example:
        >>> tree = parser.parse(source_code)
        >>> functions = extract_functions(tree.root_node, source_code)
        >>> for func in functions:
        ...     print(f"{func['name']} ({func['line_end'] - func['line_start']} lines)")
        calculate (5 lines)
        process_data (23 lines)

    Note:
        Functions with syntax errors may have incomplete information.
        The 'name' field will be 'unknown' if the name cannot be extracted.
    """
    # Implementation...
Building a Traversal Library
Let's put it all together into a reusable module:
# traversal_tools.py
"""
Reusable traversal utilities for trees, nested dicts, and file systems.
"""
from typing import Any, Callable, Iterator, List, Optional, TypeVar
from dataclasses import dataclass

T = TypeVar('T')

@dataclass
class TraversalResult:
    """Container for traversal results with metadata"""
    value: Any
    path: str
    depth: int

def traverse_dict(
    data: dict,
    predicate: Optional[Callable[[str, Any], bool]] = None,
    max_depth: Optional[int] = None
) -> Iterator[TraversalResult]:
    """
    Traverse nested dictionaries/lists yielding matching items.

    Args:
        data: The dictionary to traverse
        predicate: Function(key, value) -> bool. If None, yields all items.
        max_depth: Maximum depth to traverse. None for unlimited.

    Yields:
        TraversalResult objects for matching items
    """
    def traverse(obj, path="root", depth=0):
        if max_depth is not None and depth > max_depth:
            return
        if isinstance(obj, dict):
            for key, value in obj.items():
                current_path = f"{path}.{key}"
                if predicate is None or predicate(key, value):
                    yield TraversalResult(value, current_path, depth)
                yield from traverse(value, current_path, depth + 1)
        elif isinstance(obj, list):
            for i, item in enumerate(obj):
                current_path = f"{path}[{i}]"
                yield from traverse(item, current_path, depth + 1)
    yield from traverse(data)
def find_by_key(data: dict, target_key: str) -> List[Any]:
    """
    Find all values for a specific key anywhere in nested structure.

    This is a convenience wrapper around traverse_dict.

    Args:
        data: Nested dictionary/list structure
        target_key: The key to search for

    Returns:
        List of all values found for that key

    Example:
        >>> data = {'user': {'email': 'a@b.com'}, 'admin': {'email': 'c@d.com'}}
        >>> find_by_key(data, 'email')
        ['a@b.com', 'c@d.com']
    """
    results = []
    for result in traverse_dict(data, predicate=lambda k, v: k == target_key):
        results.append(result.value)
    return results
def find_by_value(data: dict, target_value: Any) -> List[str]:
    """
    Find all paths to a specific value in nested structure.

    Args:
        data: Nested dictionary/list structure
        target_value: The value to search for

    Returns:
        List of paths where the value was found

    Example:
        >>> data = {'a': 1, 'b': {'c': 1}}
        >>> find_by_value(data, 1)
        ['root.a', 'root.b.c']
    """
    paths = []
    for result in traverse_dict(data, predicate=lambda k, v: v == target_value):
        paths.append(result.path)
    return paths
class TreeVisitor:
    """
    Base class for custom tree traversals using the visitor pattern.

    Subclass this and override visit_* methods to handle specific node types.
    """

    def __init__(self, get_children: Callable[[Any], Iterator[Any]]):
        """
        Args:
            get_children: Function that returns children for any node
        """
        self.get_children = get_children

    def visit(self, node, depth=0):
        """
        Visit a node and all its descendants.

        Override visit_<type> methods in subclasses for custom handling.
        """
        # Try to call specific visitor method
        node_type = getattr(node, 'type', type(node).__name__)
        method_name = f'visit_{node_type}'
        visitor = getattr(self, method_name, self.generic_visit)
        visitor(node, depth)
        # Traverse children
        for child in self.get_children(node):
            self.visit(child, depth + 1)

    def generic_visit(self, node, depth):
        """
        Default visitor. Override this or specific visit_* methods.
        """
        pass
# Example usage:
"""
# Using traverse_dict:
for result in traverse_dict(data, max_depth=3):
    print(f"{result.path}: {result.value}")

# Using find_by_key:
emails = find_by_key(api_response, 'email')

# Using TreeVisitor:
class MyVisitor(TreeVisitor):
    def __init__(self):
        super().__init__(get_children=lambda node: node.children)
        self.function_count = 0

    def visit_function_definition(self, node, depth):
        self.function_count += 1

visitor = MyVisitor()
visitor.visit(ast_root)
print(visitor.function_count)
"""
Exercise: Build Your Own Utility
Create a reusable function that:
- Traverses any tree-like structure
- Collects statistics (node count by type, max depth, etc.)
- Can be configured to stop early
- Handles errors gracefully
- Has clear documentation and type hints
Here's a starting point:
from typing import Any, Callable, Dict, Optional
from dataclasses import dataclass

@dataclass
class TreeStats:
    """Statistics collected during tree traversal"""
    total_nodes: int = 0
    max_depth: int = 0
    node_type_counts: Optional[Dict[str, int]] = None

    def __post_init__(self):
        if self.node_type_counts is None:
            self.node_type_counts = {}

def analyze_tree(
    root: Any,
    get_type: Callable[[Any], str],
    get_children: Callable[[Any], list],
    max_nodes: Optional[int] = None
) -> TreeStats:
    """
    Analyze tree structure and collect statistics.

    Your implementation here...
    """
    pass
Bonus challenge: Add timing information and memory usage tracking.
Closing Thoughts on Part VI
You've learned to recognize problem types, optimize when needed, handle errors gracefully, and design reusable functions.
But here's the real lesson: these patterns are everywhere.
Once you see the traversal pattern in file systems, you'll see it in:
- Database query results
- Configuration hierarchies
- UI component trees
- Network packet structures
- Game scene graphs
The specifics change. The pattern doesn't.
In the next part, we'll see this pattern in action with real-world case studies—showing you how professionals approach these problems from start to finish.