Part II: Starting Point - File System Traversal
Chapter 4: Mastering os.walk() - Your Foundation
You've used os.walk() before. Maybe you copied it from Stack Overflow, maybe you understand it. But I want you to actually run this code right now:
```python
import os

for root, dirs, files in os.walk('.'):
    print(f"Root: {root}")
    print(f"Dirs: {dirs}")
    print(f"Files: {files}")
    print("---")
    if root.count(os.sep) > 2:  # Stop after a few levels
        break
```
Run it in any directory with subdirectories. Watch what happens.
Did you notice something? It prints one directory at a time. Not all at once. It's giving you directories as it discovers them.
This is the key insight most tutorials skip: os.walk() is a generator. It doesn't load your entire file system into memory and hand it to you. It explores as you consume it.
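You can see the laziness directly by driving the generator by hand. A minimal sketch (run it anywhere; the second next() returns None if there is nothing left to discover):

```python
import os

walker = os.walk('.')        # Nothing has been scanned yet
first = next(walker)         # The first directory is read only now
print(first)                 # The (root, dirs, files) tuple for '.' itself
second = next(walker, None)  # Scans the next directory on demand
print(second)
```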
How os.walk() Works Under the Hood
Let me show you what's really happening:
```python
# What os.walk() is doing internally (simplified)
def simplified_walk(start_path):
    # Get the contents of the current directory
    entries = os.listdir(start_path)

    # Separate entries into files and directories
    dirs = []
    files = []
    for entry in entries:
        full_path = os.path.join(start_path, entry)
        if os.path.isdir(full_path):
            dirs.append(entry)
        else:
            files.append(entry)

    # Yield the current directory's info
    yield start_path, dirs, files

    # Recursively walk each subdirectory
    for dir_name in dirs:
        dir_path = os.path.join(start_path, dir_name)
        yield from simplified_walk(dir_path)
```
See the pattern? It's our four questions again:
- WHERE AM I? → start_path (the current directory)
- WHAT'S HERE? → files (the files in this directory)
- WHERE CAN I GO? → dirs (the subdirectories to explore)
- WHAT AM I LOOKING FOR? → That's up to you!
Understanding the (root, dirs, files) Tuple
Let's be very clear about what each part means:
```python
for root, dirs, files in os.walk('/my/project'):
    # root: full path to the current directory
    #   Example: '/my/project/src/utils'

    # dirs: list of subdirectory NAMES (not full paths)
    #   Example: ['helpers', 'tests']

    # files: list of file NAMES (not full paths)
    #   Example: ['config.py', 'constants.py']

    # To get the full path to a file, join it with root:
    full_path = os.path.join(root, files[0])
    # Result: '/my/project/src/utils/config.py'
```
Common mistake: Treating files[0] as a path. It's not. It's just a name.
```python
# WRONG - won't work as expected
os.path.getsize(files[0])  # FileNotFoundError!

# RIGHT - join with root first
os.path.getsize(os.path.join(root, files[0]))
```
Why It's Memory-Efficient
Watch what happens with a huge directory:
```python
import os

count = 0
for root, dirs, files in os.walk('/'):
    count += len(files)
    if count > 100000:
        print(f"Processed {count} files")
        break  # Stop whenever we want
```
This can handle millions of files because os.walk() never loads them all at once. It yields one directory, you process it, it yields the next.
This is crucial: You can walk massive file systems without running out of memory.
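Don't take my word for it. Here's a quick way to measure it yourself with the standard library's tracemalloc (a sketch; the peak number will vary by machine and directory):

```python
import os
import tracemalloc

tracemalloc.start()
for root, dirs, files in os.walk('.'):
    pass  # Consume the entire walk without keeping anything
current, peak = tracemalloc.get_traced_memory()
print(f"Peak memory during the walk: {peak:,} bytes")
tracemalloc.stop()
```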
Common Patterns with os.walk()
Let's build your intuition with real patterns you'll use constantly.
Pattern 1: Finding Files by Extension
The most common use case:
```python
def find_python_files(start_path):
    """Find all .py files."""
    python_files = []
    for root, dirs, files in os.walk(start_path):
        for file in files:
            if file.endswith('.py'):
                full_path = os.path.join(root, file)
                python_files.append(full_path)
    return python_files
```
But wait: we can make this better. Let's think about what we're really doing:
```python
def find_files_by_extension(start_path, extension):
    """More reusable version."""
    # Ensure the extension starts with a dot
    if not extension.startswith('.'):
        extension = '.' + extension
    for root, dirs, files in os.walk(start_path):
        for file in files:
            if file.endswith(extension):
                yield os.path.join(root, file)

# Usage
for py_file in find_files_by_extension('/project', '.py'):
    print(py_file)  # Process one at a time
```
Do you see what changed? We made it a generator too. Now it's just as memory-efficient as os.walk() itself.
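Generators compose nicely, too. For example, the standard library's itertools.islice can cap the search at the first ten matches, and the walk stops as soon as it has them (a small sketch using the function above):

```python
from itertools import islice

# Take only the first ten matches; nothing past them is ever scanned
for py_file in islice(find_files_by_extension('/project', '.py'), 10):
    print(py_file)
```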
Pattern 2: Calculating Directory Sizes
Here's where it gets interesting:
```python
def calculate_sizes(start_path):
    """Calculate the total size of the files in each directory."""
    for root, dirs, files in os.walk(start_path):
        total_size = 0
        for file in files:
            try:
                full_path = os.path.join(root, file)
                total_size += os.path.getsize(full_path)
            except OSError:
                # File might have been deleted, or permission denied
                pass
        print(f"{root}: {total_size:,} bytes")
```
Notice the try-except? Real file systems are messy. Files disappear, permissions get denied. Robust code handles this.
But there's a problem with this code. Can you spot it?
It only shows the size of files directly in each directory, not the total size including subdirectories. If you want the total size including everything below, you need a different approach:
```python
def total_size_recursive(start_path):
    """Calculate total size including all subdirectories."""
    total = 0
    for root, dirs, files in os.walk(start_path):
        for file in files:
            try:
                full_path = os.path.join(root, file)
                total += os.path.getsize(full_path)
            except OSError:
                pass
    return total

# For per-directory totals, walk each top-level directory separately
start_path = '/my/project'
for item in os.listdir(start_path):
    item_path = os.path.join(start_path, item)
    if os.path.isdir(item_path):
        size = total_size_recursive(item_path)
        print(f"{item}: {size:,} bytes")
```
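If re-walking every subtree feels wasteful (deep directories get scanned repeatedly), there's a single-pass alternative: walk bottom-up and fold each directory's total into its parent. A sketch using os.walk()'s topdown=False option:

```python
def sizes_bottom_up(start_path):
    """Return {directory: total size, including subdirectories}."""
    totals = {}
    # topdown=False yields the deepest directories first
    for root, dirs, files in os.walk(start_path, topdown=False):
        total = 0
        for file in files:
            try:
                total += os.path.getsize(os.path.join(root, file))
            except OSError:
                pass  # Deleted file or permission problem: skip it
        # Subdirectories were visited first, so their totals already exist
        for d in dirs:
            total += totals.get(os.path.join(root, d), 0)
        totals[root] = total
    return totals
```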
Pattern 3: Skipping Directories (The Hidden Feature)
Here's something most people don't know: you can modify the dirs list in-place to control traversal.
```python
def find_files_skip_tests(start_path):
    """Find Python files, but skip test directories."""
    for root, dirs, files in os.walk(start_path):
        # THIS IS THE MAGIC: modify dirs in-place
        dirs[:] = [d for d in dirs if d not in ('tests', '__pycache__')]
        for file in files:
            if file.endswith('.py'):
                yield os.path.join(root, file)
```
Why does this work? os.walk() uses the dirs list to decide which subdirectories to visit next. By modifying it, you're telling os.walk() "don't go there."
Critical detail: You must use dirs[:] = not dirs =. The first modifies the list in-place, the second just reassigns your local variable and has no effect.
```python
# WRONG - doesn't work
dirs = [d for d in dirs if d != 'tests']

# RIGHT - modifies the actual list
dirs[:] = [d for d in dirs if d != 'tests']
```
Pattern 4: Building File Trees
Sometimes you want to represent the structure:
```python
def build_tree(start_path, max_depth=3):
    """Build a visual representation of the directory structure."""
    for root, dirs, files in os.walk(start_path):
        # Calculate depth by counting separators past the start path.
        # (Slice off the prefix rather than using str.replace, which
        # would also mangle names that happen to contain the start path.)
        level = root[len(start_path):].count(os.sep)
        if level > max_depth:
            dirs[:] = []  # Don't go deeper
            continue

        # Create indentation
        indent = '  ' * level
        print(f"{indent}{os.path.basename(root)}/")

        # Print files with extra indentation
        file_indent = '  ' * (level + 1)
        for file in files[:5]:  # Limit files shown
            print(f"{file_indent}{file}")
        if len(files) > 5:
            print(f"{file_indent}... and {len(files) - 5} more files")

# Try it:
# build_tree('.', max_depth=2)
```
When os.walk() Isn't Enough
os.walk() is powerful, but it has limitations:
Limitation 1: Symbolic Links
By default, os.walk() doesn't follow symbolic links:
```python
# This won't follow symlinks
for root, dirs, files in os.walk(start_path):
    pass

# This will
for root, dirs, files in os.walk(start_path, followlinks=True):
    pass
```
Warning: Following symlinks can lead to infinite loops if symlinks create cycles!
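If you do need followlinks=True, one defense is to remember each directory's real path and prune anything you've already visited. A minimal sketch:

```python
import os

def walk_follow_links_safely(start_path):
    """Follow symlinks, but never descend into the same real directory twice."""
    seen = set()
    for root, dirs, files in os.walk(start_path, followlinks=True):
        real = os.path.realpath(root)
        if real in seen:
            dirs[:] = []  # Already visited via another route: don't descend
            continue
        seen.add(real)
        yield root, dirs, files
```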
Limitation 2: You Want to Stop Early
Sometimes you just want to find one file, not all of them:
```python
def find_first_match(start_path, filename):
    """Find the first occurrence of a filename."""
    for root, dirs, files in os.walk(start_path):
        if filename in files:
            return os.path.join(root, filename)
    return None
```
This works, and because os.walk() is lazy, returning early really does stop the traversal: the remaining directories are never scanned. But if you need finer control over how a search stops and resumes, you might want custom traversal.
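One way to make the early stop explicit is to write the search as a generator and take just the first hit; next() with a default gives you the same "find one or None" behavior:

```python
import os

def iter_matches(start_path, filename):
    """Yield every path where filename occurs, lazily."""
    for root, dirs, files in os.walk(start_path):
        if filename in files:
            yield os.path.join(root, filename)

first = next(iter_matches('/project', 'config.py'), None)
```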
Limitation 3: Complex Filtering Logic
When your filtering logic depends on multiple factors:
```python
import time

def find_recent_large_python_files(start_path, min_size, days_old):
    """Find .py files that are large and recently modified."""
    cutoff_time = time.time() - (days_old * 86400)
    for root, dirs, files in os.walk(start_path):
        for file in files:
            if not file.endswith('.py'):
                continue
            full_path = os.path.join(root, file)
            try:
                st = os.stat(full_path)
                if st.st_size > min_size and st.st_mtime > cutoff_time:
                    yield full_path
            except OSError:
                pass
```
This still uses os.walk(), but now you're layering complex logic on top. Sometimes at this point, custom traversal is cleaner.
Exercise: Building a Smart File Finder
Write a function that finds files matching multiple criteria:
```python
def smart_find(start_path, **criteria):
    """
    Find files matching any combination of:

    - extension: file extension (e.g., '.py')
    - min_size: minimum file size in bytes
    - max_size: maximum file size in bytes
    - name_contains: substring that must be in the filename
    - skip_dirs: list of directory names to skip
    - max_depth: maximum depth to search

    Returns a list of matching file paths.

    Example usage:
        smart_find('/project',
                   extension='.py',
                   min_size=1000,
                   skip_dirs=['__pycache__', 'venv'],
                   max_depth=3)
    """
    # Your implementation here
    pass
```
Hints:
- Use os.walk()
- Remember to modify dirs[:] to skip directories
- Build up your conditions one at a time
- Handle errors gracefully
- Track depth by counting separators in the path
Try it yourself before looking at the solution.
Chapter 5: Beyond os.walk() - Custom File System Traversal
You've mastered os.walk(). But sometimes you need complete control. Let me show you when and why.
When You Actually Need Custom Traversal
Here's a real scenario: You're building a backup tool that needs to:
- Copy files, but skip certain patterns
- Track which files failed and why
- Resume from where it left off if interrupted
- Show progress with exact counts
os.walk() can't easily do all this. Time to write your own.
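To make that concrete, here's a sketch of just the resume requirement; the checkpoint filename and the copy_func callback are inventions of mine, not a real API. Even this one feature means threading state through the walk, and we haven't touched failure tracking or progress yet:

```python
import os

CHECKPOINT = 'backup_done.txt'  # hypothetical checkpoint file

def backup_resumable(start_path, copy_func):
    """Copy every file once, surviving interruption and rerun."""
    done = set()
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            done = {line.rstrip('\n') for line in f}
    with open(CHECKPOINT, 'a') as log:
        for root, dirs, files in os.walk(start_path):
            for name in files:
                path = os.path.join(root, name)
                if path in done:
                    continue  # Finished in a previous run
                copy_func(path)         # Caller-supplied copy step
                log.write(path + '\n')  # Record success only after the copy
                log.flush()             # Survive an abrupt interruption
```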
Writing Recursive Directory Traversal
Let's build up from the simplest possible version:
```python
def walk_directory_v1(path):
    """Simplest possible recursive walk."""
    print(path)
    for item in os.listdir(path):
        item_path = os.path.join(path, item)
        if os.path.isdir(item_path):
            walk_directory_v1(item_path)  # Recurse!
```
Run this. What happens?
If you hit a directory you can't read, it crashes with PermissionError; and if symlinks form a cycle, it recurses forever. Let's fix that:
```python
def walk_directory_v2(path):
    """More robust version."""
    try:
        items = os.listdir(path)
    except PermissionError:
        print(f"Permission denied: {path}")
        return  # Base case: can't read, so stop

    for item in items:
        item_path = os.path.join(path, item)
        try:
            if os.path.isdir(item_path) and not os.path.islink(item_path):
                walk_directory_v2(item_path)
            else:
                print(f"File: {item_path}")
        except OSError as e:
            print(f"Error accessing {item_path}: {e}")
```
Do you see the base cases?
- Can't read the directory → return
- Not a directory → print and continue
- Is a symlink → don't recurse (avoiding loops)
But this just prints. Let's make it useful:
```python
def walk_and_collect(path, results=None):
    """Collect all files with error tracking."""
    if results is None:
        results = {'files': [], 'errors': []}

    try:
        items = os.listdir(path)
    except PermissionError as e:
        results['errors'].append((path, str(e)))
        return results

    for item in items:
        item_path = os.path.join(path, item)
        try:
            if os.path.isdir(item_path) and not os.path.islink(item_path):
                walk_and_collect(item_path, results)  # Pass results down
            else:
                results['files'].append(item_path)
        except OSError as e:
            results['errors'].append((item_path, str(e)))

    return results

# Usage
result = walk_and_collect('/my/project')
print(f"Found {len(result['files'])} files")
print(f"Encountered {len(result['errors'])} errors")
```
Notice the pattern: We pass results through the recursion to collect information across all levels.
Understanding Base Cases and Termination
Every recursive function needs to answer: "When do I stop?"
In directory traversal, you stop when:
- You can't read the directory (permission error)
- You've reached a file (not a directory, nothing to recurse into)
- You hit a symbolic link (avoiding infinite loops)
- You've reached a depth limit (if you're tracking depth)
Let's add depth limiting:
```python
def walk_with_depth_limit(path, max_depth, current_depth=0):
    """Walk directories up to a maximum depth."""
    # BASE CASE 1: depth limit reached
    if current_depth > max_depth:
        return

    try:
        items = os.listdir(path)
    except PermissionError:
        # BASE CASE 2: can't read directory
        return

    indent = "  " * current_depth
    print(f"{indent}{os.path.basename(path)}/")

    for item in items:
        item_path = os.path.join(path, item)
        if os.path.isdir(item_path) and not os.path.islink(item_path):
            # Recurse with incremented depth
            walk_with_depth_limit(item_path, max_depth, current_depth + 1)
        else:
            # BASE CASE 3: it's a file
            print(f"{indent}  {item}")
```
Handling Permissions and Errors
Real file systems are hostile environments. Here's a robust pattern:
```python
import stat

def robust_walk(path, on_error=None):
    """
    Walk directories with comprehensive error handling.

    Args:
        path: Starting directory
        on_error: Callback function for errors: on_error(path, exception)
    """
    def default_error_handler(path, exc):
        print(f"Error at {path}: {exc}")

    error_handler = on_error or default_error_handler

    try:
        items = os.listdir(path)
    except OSError as e:
        error_handler(path, e)
        return

    for item in items:
        item_path = os.path.join(path, item)
        try:
            # One lstat call instead of separate isdir/islink checks.
            # lstat doesn't follow symlinks, so link cycles can't trap us.
            stat_info = os.lstat(item_path)
            if stat.S_ISDIR(stat_info.st_mode):
                # A recursive generator call must be re-yielded,
                # or its results are silently discarded
                yield from robust_walk(item_path, on_error)
            else:
                yield item_path, stat_info  # Yield the file with its metadata
        except OSError as e:
            error_handler(item_path, e)

# Usage with custom error handling
errors = []
def collect_errors(path, exc):
    errors.append((path, str(exc)))

for file_path, stat_info in robust_walk('/project', on_error=collect_errors):
    print(f"{file_path}: {stat_info.st_size} bytes")

print(f"\nEncountered {len(errors)} errors")
```
Iterative vs. Recursive Approaches
Recursion is elegant, but it has limits. CPython's default recursion limit is 1000. If you have a directory tree nested 1000+ levels deep (unlikely, but possible), recursion fails with a RecursionError.
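You can check (and cautiously raise) the limit yourself:

```python
import sys

print(sys.getrecursionlimit())   # 1000 by default on CPython
# sys.setrecursionlimit(5000)    # Possible, but very deep recursion can
#                                # still exhaust the C stack and crash
```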
Here's the iterative equivalent using a stack:
```python
def walk_iterative(start_path):
    """Iterative directory traversal using a stack."""
    # The stack holds directories still to visit
    stack = [start_path]

    while stack:
        current_path = stack.pop()
        try:
            items = os.listdir(current_path)
        except PermissionError:
            continue

        for item in items:
            item_path = os.path.join(current_path, item)
            try:
                if os.path.isdir(item_path) and not os.path.islink(item_path):
                    stack.append(item_path)  # Push instead of recursing
                else:
                    yield item_path
            except OSError:
                pass
```
The mental shift: Instead of calling the function recursively, we add directories to a stack and process them in a loop.
Which should you use?
| Recursive | Iterative |
|---|---|
| More intuitive | More verbose |
| Cleaner code | More manual control |
| Depth limited | No depth limit |
| Natural DFS | Can do BFS easily |
For most cases: Use recursive. It's clearer and depth limits rarely matter.
For extreme cases: Use iterative. When you need BFS or worry about stack depth.
Here's BFS (breadth-first) with iteration:
```python
from collections import deque

def walk_bfs(start_path):
    """Breadth-first traversal."""
    queue = deque([start_path])

    while queue:
        current_path = queue.popleft()  # FIFO instead of LIFO
        try:
            items = os.listdir(current_path)
        except PermissionError:
            continue

        for item in items:
            item_path = os.path.join(current_path, item)
            try:
                if os.path.isdir(item_path) and not os.path.islink(item_path):
                    queue.append(item_path)
                else:
                    yield item_path
            except OSError:
                pass
```
When does this matter? If you want to find files by how far they are from the root, not by which branch they're in.
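A small extension makes that explicit: carry the depth alongside each directory in the queue, and results come back ordered by distance from the root (a sketch building on walk_bfs above):

```python
import os
from collections import deque

def walk_bfs_with_depth(start_path):
    """Yield (depth, path) pairs, nearest-to-root first."""
    queue = deque([(start_path, 0)])
    while queue:
        current_path, depth = queue.popleft()
        try:
            items = os.listdir(current_path)
        except PermissionError:
            continue
        for item in items:
            item_path = os.path.join(current_path, item)
            try:
                if os.path.isdir(item_path) and not os.path.islink(item_path):
                    queue.append((item_path, depth + 1))
                else:
                    yield depth, item_path
            except OSError:
                pass
```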
Performance Considerations
Memory Usage
```python
# MEMORY INTENSIVE - loads everything into memory
def get_all_files(path):
    files = []
    for root, dirs, filenames in os.walk(path):
        for f in filenames:
            files.append(os.path.join(root, f))
    return files  # Returns potentially millions of items

all_files = get_all_files('/huge/directory')
for f in all_files:  # Only now can you start processing
    process(f)

# MEMORY EFFICIENT - generates one at a time
def iter_files(path):
    for root, dirs, filenames in os.walk(path):
        for f in filenames:
            yield os.path.join(root, f)

for f in iter_files('/huge/directory'):  # Start processing immediately
    process(f)
```
Rule of thumb: Use generators unless you specifically need the full list.
Early Stopping Strategies
Stop as soon as you find what you need:
```python
def find_first_large_file(path, size_threshold):
    """Find the first file over the threshold, then stop."""
    for root, dirs, files in os.walk(path):
        for file in files:
            full_path = os.path.join(root, file)
            try:
                if os.path.getsize(full_path) > size_threshold:
                    return full_path  # STOP IMMEDIATELY
            except OSError:
                pass
    return None
```
This is much faster than finding all large files when you only need one.
Caching Stat Calls
os.stat() is expensive. Don't call it multiple times:
```python
# SLOW - multiple stat calls per file
def slow_version(path):
    for root, dirs, files in os.walk(path):
        for file in files:
            full_path = os.path.join(root, file)
            size = os.path.getsize(full_path)    # stat call
            mtime = os.path.getmtime(full_path)  # another stat call
            # ...

# FAST - one stat call per file
def fast_version(path):
    for root, dirs, files in os.walk(path):
        for file in files:
            full_path = os.path.join(root, file)
            st = os.stat(full_path)  # ONE stat call
            size = st.st_size
            mtime = st.st_mtime
            # ...
```
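If you want to go further, os.scandir() (which os.walk() itself is built on since Python 3.5) can avoid many stat calls entirely, because the operating system returns type information along with the directory listing. A sketch:

```python
import os

def scandir_sizes(path):
    """Yield (path, size) for files in one directory, cheaply."""
    with os.scandir(path) as it:
        for entry in it:
            try:
                if entry.is_file(follow_symlinks=False):
                    # is_file() usually needs no extra system call,
                    # and stat() results are cached per entry
                    yield entry.path, entry.stat(follow_symlinks=False).st_size
            except OSError:
                pass
```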
Exercise: Implementing a Directory Tree Visualizer
Build a function that creates a visual tree like this:
```
project/
  src/
    main.py
    utils/
      helpers.py
      constants.py
  tests/
    test_main.py
  README.md
```
Requirements:
- Show directories with a / suffix
- Indent properly based on depth
- Sort directories before files
- Add an option to limit depth
- Add an option to show file sizes
- Handle errors gracefully
```python
def tree(start_path,
         max_depth=None,
         show_sizes=False,
         show_hidden=False):
    """
    Display a directory tree.

    Args:
        start_path: Root directory to display
        max_depth: Maximum depth to display (None = unlimited)
        show_sizes: Show file sizes in bytes
        show_hidden: Include hidden files/dirs (starting with .)
    """
    # Your implementation here
    pass

# Test it:
# tree('.', max_depth=2, show_sizes=True)
```
Hints:
- Track current depth in recursion
- Use os.path.basename() for names
- Sort: directories first, then alphabetically
- Pass depth through the recursive calls
- Format sizes nicely (consider the humanize library's style)
What You've Learned
You now understand:
- How os.walk() really works - generators, memory efficiency
- Common patterns - filtering, skipping, building trees
- When to write custom traversal - specific control needs
- Recursion mechanics - base cases, state passing, depth tracking
- Iterative alternatives - stack-based, BFS vs DFS
- Performance considerations - memory, early stopping, caching
In Part III, we'll take everything you've learned about file system traversal and apply it to JSON and nested dictionaries. You'll see the exact same patterns, just with different data structures. The mental model you've built here transfers directly.