Part II: Starting Point - File System Traversal

Chapter 4: Mastering os.walk() - Your Foundation

You've used os.walk() before. Maybe you copied it from Stack Overflow; maybe you even understand it. Either way, I want you to actually run this code right now:

import os

for root, dirs, files in os.walk('.'):
    print(f"Root: {root}")
    print(f"Dirs: {dirs}")
    print(f"Files: {files}")
    print("---")
    if root.count(os.sep) > 2:  # Stop after a few levels
        break

Run it in any directory with subdirectories. Watch what happens.

Did you notice something? It prints one directory at a time. Not all at once. It's giving you directories as it discovers them.

This is the key insight most tutorials skip: os.walk() is a generator. It doesn't load your entire file system into memory and hand it to you. It explores as you consume it.
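
You can prove it to yourself with next() (assuming the directory you run this in has at least one subdirectory; otherwise the second call raises StopIteration):

import os

walker = os.walk('.')
print(next(walker))   # only now is the first directory actually read
print(next(walker))   # the next directory is read on demand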

How os.walk() Works Under the Hood

Let me show you what's really happening:

# What os.walk() is doing internally (simplified)
def simplified_walk(start_path):
    # Get contents of current directory
    entries = os.listdir(start_path)

    # Separate into files and directories
    dirs = []
    files = []
    for entry in entries:
        full_path = os.path.join(start_path, entry)
        if os.path.isdir(full_path):
            dirs.append(entry)
        else:
            files.append(entry)

    # Yield current directory info
    yield start_path, dirs, files

    # Recursively walk each subdirectory
    for dir_name in dirs:
        dir_path = os.path.join(start_path, dir_name)
        yield from simplified_walk(dir_path)

See the pattern? It's our four questions again:

  1. WHERE AM I? → start_path (current directory)
  2. WHAT'S HERE? → files (files in this directory)
  3. WHERE CAN I GO? → dirs (subdirectories to explore)
  4. WHAT AM I LOOKING FOR? → That's up to you!

Understanding the (root, dirs, files) Tuple

Let's be very clear about what each part means:

for root, dirs, files in os.walk('/my/project'):
    # root:  Full path to current directory
    #        Example: '/my/project/src/utils'

    # dirs:  List of subdirectory NAMES (not full paths)
    #        Example: ['helpers', 'tests']

    # files: List of file NAMES (not full paths)
    #        Example: ['config.py', 'constants.py']

    # To get full path to a file (assuming files is non-empty):
    full_path = os.path.join(root, files[0])
    # Result: '/my/project/src/utils/config.py'

Common mistake: Treating files[0] as a path. It's not. It's just a name.

# WRONG - looks up the bare name in the current working directory
os.path.getsize(files[0])  # FileNotFoundError!

# RIGHT - join with root first
os.path.getsize(os.path.join(root, files[0]))

Why It's Memory-Efficient

Watch what happens with a huge directory:

import os

count = 0
for root, dirs, files in os.walk('/'):
    count += len(files)
    if count > 100000:
        print(f"Processed {count} files")
        break  # Stop whenever we want

This can handle millions of files because os.walk() never loads them all at once. It yields one directory, you process it, it yields the next.

This is crucial: You can walk massive file systems without running out of memory.
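
You can even see it directly: the walker is a small generator object whose size has nothing to do with the size of the tree beneath it.

import os
import sys

walker = os.walk('/')          # nothing has been read yet
print(sys.getsizeof(walker))   # a tiny, fixed size, no matter how big '/' is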

Common Patterns with os.walk()

Let's build your intuition with real patterns you'll use constantly.

Pattern 1: Finding Files by Extension

The most common use case:

def find_python_files(start_path):
    """Find all .py files."""
    python_files = []

    for root, dirs, files in os.walk(start_path):
        for file in files:
            if file.endswith('.py'):
                full_path = os.path.join(root, file)
                python_files.append(full_path)

    return python_files

But wait: we can make this better. Let's think about what we're really doing:

def find_files_by_extension(start_path, extension):
    """More reusable version."""
    # Ensure extension starts with a dot
    if not extension.startswith('.'):
        extension = '.' + extension

    for root, dirs, files in os.walk(start_path):
        for file in files:
            if file.endswith(extension):
                yield os.path.join(root, file)

# Usage
for py_file in find_files_by_extension('/project', '.py'):
    print(py_file)  # Process one at a time

Do you see what changed? We made it a generator too. Now it's just as memory-efficient as os.walk() itself.
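
Being a generator also means early stopping is free. For example, itertools.islice can take just the first few matches, and the rest of the tree is never visited:

import itertools

# Take the first 10 matches; everything beyond them is never walked
first_ten = list(itertools.islice(
    find_files_by_extension('/project', '.py'), 10))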

Pattern 2: Calculating Directory Sizes

Here's where it gets interesting:

def calculate_sizes(start_path):
    """Calculate total size of files in each directory."""

    for root, dirs, files in os.walk(start_path):
        total_size = 0

        for file in files:
            try:
                full_path = os.path.join(root, file)
                total_size += os.path.getsize(full_path)
            except OSError:
                # File might have been deleted, or permission denied
                pass

        print(f"{root}: {total_size:,} bytes")

Notice the try-except? Real file systems are messy. Files disappear, permissions get denied. Robust code handles this.

But there's a problem with this code. Can you spot it?

It only shows the size of files directly in each directory, not the total size including subdirectories. If you want the total size including everything below, you need a different approach:

def total_size_recursive(start_path):
    """Calculate total size including all subdirectories."""
    total = 0

    for root, dirs, files in os.walk(start_path):
        for file in files:
            try:
                full_path = os.path.join(root, file)
                total += os.path.getsize(full_path)
            except OSError:
                pass

    return total

# If you want per-directory totals, you need to walk each dir separately
start_path = '.'  # pick the directory whose children you want to summarize
for item in os.listdir(start_path):
    item_path = os.path.join(start_path, item)
    if os.path.isdir(item_path):
        size = total_size_recursive(item_path)
        print(f"{item}: {size:,} bytes")
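
There's also a one-pass alternative: os.walk() accepts topdown=False, which yields the deepest directories first, so by the time you see a directory you already know its children's totals. A sketch (cumulative_sizes is an illustrative name; subdirectories that couldn't be read simply count as 0):

import os

def cumulative_sizes(start_path):
    """Per-directory totals in one pass, walking bottom-up."""
    totals = {}
    for root, dirs, files in os.walk(start_path, topdown=False):
        size = 0
        for f in files:
            try:
                size += os.path.getsize(os.path.join(root, f))
            except OSError:
                pass
        # Children were yielded before this directory, so their totals exist
        size += sum(totals.get(os.path.join(root, d), 0) for d in dirs)
        totals[root] = size
    return totals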

Pattern 3: Skipping Directories (The Hidden Feature)

Here's something most people don't know: you can modify the dirs list in-place to control traversal.

def find_files_skip_tests(start_path):
    """Find Python files, but skip test directories."""

    for root, dirs, files in os.walk(start_path):
        # THIS IS THE MAGIC: Modify dirs in-place
        dirs[:] = [d for d in dirs if d != 'tests' and d != '__pycache__']

        for file in files:
            if file.endswith('.py'):
                yield os.path.join(root, file)

Why does this work? When walking top-down (the default), os.walk() uses the dirs list to decide which subdirectories to visit next. By modifying it, you're telling os.walk() "don't go there."

Critical detail: You must use dirs[:] =, not dirs =. The first modifies the list in-place; the second just rebinds your local variable and has no effect on the walk.

# WRONG - doesn't work
dirs = [d for d in dirs if d != 'tests']

# RIGHT - modifies the actual list
dirs[:] = [d for d in dirs if d != 'tests']

Pattern 4: Building File Trees

Sometimes you want to represent the structure:

def build_tree(start_path, max_depth=3):
    """Build a visual representation of directory structure."""

    for root, dirs, files in os.walk(start_path):
        # Calculate depth (assumes start_path has no trailing separator)
        level = root.replace(start_path, '').count(os.sep)

        if level > max_depth:
            dirs[:] = []  # Don't go deeper
            continue

        # Create indentation
        indent = '  ' * level
        print(f"{indent}{os.path.basename(root)}/")

        # Print files with extra indentation
        file_indent = '  ' * (level + 1)
        for file in files[:5]:  # Limit files shown
            print(f"{file_indent}{file}")

        if len(files) > 5:
            print(f"{file_indent}... and {len(files) - 5} more files")

# Try it:
# build_tree('.', max_depth=2)

When os.walk() Isn't Enough

os.walk() is powerful, but it has limitations:

Limitation 1: Symbolic Links

By default, os.walk() doesn't follow symbolic links:

# This won't follow symlinks
for root, dirs, files in os.walk(start_path):
    pass

# This will
for root, dirs, files in os.walk(start_path, followlinks=True):
    pass

Warning: Following symlinks can lead to infinite loops if symlinks create cycles!
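
If you genuinely need followlinks=True, one common guard is to remember the real path of every directory you visit and prune repeats. A sketch (walk_follow_links_safely is an illustrative name, not a standard function):

import os

def walk_follow_links_safely(start_path):
    """Follow symlinks, but never descend into the same real directory twice."""
    seen = set()
    for root, dirs, files in os.walk(start_path, followlinks=True):
        real = os.path.realpath(root)
        if real in seen:
            dirs[:] = []  # cut the cycle: don't descend any further
            continue
        seen.add(real)
        yield root, dirs, files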

Limitation 2: You Want to Stop Early

Sometimes you just want to find one file, not all of them:

def find_first_match(start_path, filename):
    """Find first occurrence of a filename."""
    for root, dirs, files in os.walk(start_path):
        if filename in files:
            return os.path.join(root, filename)
    return None

This works, but it's not obvious that os.walk() will stop when you return. It does (the abandoned generator is simply never resumed), but if you need more control, you might want custom traversal.

Limitation 3: Complex Filtering Logic

When your filtering logic depends on multiple factors:

def find_recent_large_python_files(start_path, min_size, days_old):
    """Find .py files that are large and recently modified."""
    import time
    cutoff_time = time.time() - (days_old * 86400)

    for root, dirs, files in os.walk(start_path):
        for file in files:
            if not file.endswith('.py'):
                continue

            full_path = os.path.join(root, file)
            try:
                stat = os.stat(full_path)
                if stat.st_size > min_size and stat.st_mtime > cutoff_time:
                    yield full_path
            except OSError:
                pass

This still uses os.walk(), but now you're layering complex logic on top. Sometimes at this point, custom traversal is cleaner.
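
One way to keep layered filters readable before you abandon os.walk() is to express each condition as a small predicate and compose them. A sketch with illustrative names and thresholds (iter_matching is not a library function):

import os
import time

def iter_matching(start_path, predicates):
    """Yield file paths whose (path, stat_result) pass every predicate."""
    for root, dirs, files in os.walk(start_path):
        for name in files:
            path = os.path.join(root, name)
            try:
                info = os.stat(path)
            except OSError:
                continue
            if all(pred(path, info) for pred in predicates):
                yield path

# Roughly the same query as find_recent_large_python_files:
cutoff = time.time() - 7 * 86400
predicates = [
    lambda p, s: p.endswith('.py'),
    lambda p, s: s.st_size > 10_000,
    lambda p, s: s.st_mtime > cutoff,
]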

Exercise: Building a Smart File Finder

Write a function that finds files matching multiple criteria:

def smart_find(start_path, **criteria):
    """
    Find files matching any combination of:
    - extension: file extension (e.g., '.py')
    - min_size: minimum file size in bytes
    - max_size: maximum file size in bytes
    - name_contains: substring that must be in filename
    - skip_dirs: list of directory names to skip
    - max_depth: maximum depth to search

    Returns list of matching file paths.

    Example usage:
    smart_find('/project',
               extension='.py',
               min_size=1000,
               skip_dirs=['__pycache__', 'venv'],
               max_depth=3)
    """
    # Your implementation here
    pass

Hints:

  1. Use os.walk()
  2. Remember to modify dirs[:] to skip directories
  3. Build up your conditions one at a time
  4. Handle errors gracefully
  5. Track depth by counting separators in the path

Try it yourself before looking at the solution.


Chapter 5: Beyond os.walk() - Custom File System Traversal

You've mastered os.walk(). But sometimes you need complete control. Let me show you when and why.

When You Actually Need Custom Traversal

Here's a real scenario: You're building a backup tool that needs to:

  1. Copy files, but skip certain patterns
  2. Track which files failed and why
  3. Resume from where it left off if interrupted
  4. Show progress with exact counts

os.walk() can't easily do all this. Time to write your own.

Writing Recursive Directory Traversal

Let's build up from the simplest possible version:

def walk_directory_v1(path):
    """Simplest possible recursive walk."""
    print(path)

    for item in os.listdir(path):
        item_path = os.path.join(path, item)
        if os.path.isdir(item_path):
            walk_directory_v1(item_path)  # Recurse!

Run this. What happens?

If it hits a directory it can't read, it crashes with PermissionError, and a symlink cycle will recurse until Python raises RecursionError. Let's fix that:

def walk_directory_v2(path):
    """More robust version."""
    try:
        items = os.listdir(path)
    except PermissionError:
        print(f"Permission denied: {path}")
        return  # Base case: can't read, so stop

    for item in items:
        item_path = os.path.join(path, item)

        try:
            if os.path.isdir(item_path) and not os.path.islink(item_path):
                walk_directory_v2(item_path)
            else:
                print(f"File: {item_path}")
        except OSError as e:
            print(f"Error accessing {item_path}: {e}")

Do you see the base cases?

  1. Can't read directory → return
  2. Not a directory → print and continue
  3. Is a symlink → don't recurse (avoiding loops)

But this just prints. Let's make it useful:

def walk_and_collect(path, results=None):
    """Collect all files with error tracking."""
    if results is None:
        results = {'files': [], 'errors': []}

    try:
        items = os.listdir(path)
    except PermissionError as e:
        results['errors'].append((path, str(e)))
        return results

    for item in items:
        item_path = os.path.join(path, item)

        try:
            if os.path.isdir(item_path) and not os.path.islink(item_path):
                walk_and_collect(item_path, results)  # Pass results down
            else:
                results['files'].append(item_path)
        except OSError as e:
            results['errors'].append((item_path, str(e)))

    return results

# Usage
result = walk_and_collect('/my/project')
print(f"Found {len(result['files'])} files")
print(f"Encountered {len(result['errors'])} errors")

Notice the pattern: We pass results through the recursion to collect information across all levels.
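
An alternative to threading an accumulator through every call is to make the recursive function a generator and splice child results in with yield from, the same trick simplified_walk used in Chapter 4. A sketch (iter_files_recursive is an illustrative name):

import os

def iter_files_recursive(path):
    """Same traversal, but yields files instead of collecting them."""
    try:
        items = os.listdir(path)
    except PermissionError:
        return  # unreadable directory: just stop this branch
    for item in items:
        item_path = os.path.join(path, item)
        if os.path.isdir(item_path) and not os.path.islink(item_path):
            yield from iter_files_recursive(item_path)  # splice child results in
        else:
            yield item_path

The trade-off: the errors list is gone, so this fits when you only care about the files.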

Understanding Base Cases and Termination

Every recursive function needs to answer: "When do I stop?"

In directory traversal, you stop when:

  1. You can't read the directory (permission error)
  2. You've reached a file (not a directory, nothing to recurse into)
  3. You hit a symbolic link (avoiding infinite loops)
  4. You've reached a depth limit (if you're tracking depth)

Let's add depth limiting:

def walk_with_depth_limit(path, max_depth, current_depth=0):
    """Walk directories up to a maximum depth."""

    # BASE CASE 1: Depth limit reached
    if current_depth > max_depth:
        return

    try:
        items = os.listdir(path)
    except PermissionError:
        # BASE CASE 2: Can't read directory
        return

    indent = "  " * current_depth
    print(f"{indent}{os.path.basename(path)}/")

    for item in items:
        item_path = os.path.join(path, item)

        if os.path.isdir(item_path) and not os.path.islink(item_path):
            # Recurse with incremented depth
            walk_with_depth_limit(item_path, max_depth, current_depth + 1)
        else:
            # BASE CASE 3: It's a file
            print(f"{indent}  {item}")

Handling Permissions and Errors

Real file systems are hostile environments. Here's a robust pattern:

import stat  # standard-library module providing S_ISDIR for st_mode

def robust_walk(path, on_error=None):
    """
    Walk directories with comprehensive error handling.

    Args:
        path: Starting directory
        on_error: Callback function for errors: on_error(path, exception)
    """

    def default_error_handler(path, exc):
        print(f"Error at {path}: {exc}")

    error_handler = on_error or default_error_handler

    try:
        items = os.listdir(path)
    except OSError as e:
        error_handler(path, e)
        return

    for item in items:
        item_path = os.path.join(path, item)

        try:
            # One os.lstat call instead of separate isdir/islink checks
            # (lstat doesn't follow symlinks, so link cycles can't trap us)
            stat_info = os.lstat(item_path)

            if stat.S_ISDIR(stat_info.st_mode):
                # yield from is essential here: a bare recursive call would
                # create the sub-generator and silently discard its files
                yield from robust_walk(item_path, on_error)
            else:
                yield item_path, stat_info  # Yield file with its metadata

        except OSError as e:
            error_handler(item_path, e)

# Usage with custom error handling
errors = []
def collect_errors(path, exc):
    errors.append((path, str(exc)))

for file_path, stat_info in robust_walk('/project', on_error=collect_errors):
    print(f"{file_path}: {stat_info.st_size} bytes")

print(f"\nEncountered {len(errors)} errors")

Iterative vs. Recursive Approaches

Recursion is elegant but has limits. Python's default recursion limit is 1,000 frames. If you have a directory nested that deep (unlikely but possible), recursion fails with RecursionError.
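
You can check the limit yourself, and even raise it, though the stack-based approach below is the safer fix:

import sys

print(sys.getrecursionlimit())   # 1000 on a default CPython build
# sys.setrecursionlimit(10_000)  # possible, but risky; prefer a stack instead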

Here's the iterative equivalent using a stack:

def walk_iterative(start_path):
    """Iterative directory traversal using a stack."""
    # Stack holds directories to visit
    stack = [start_path]

    while stack:
        current_path = stack.pop()

        try:
            items = os.listdir(current_path)
        except PermissionError:
            continue

        for item in items:
            item_path = os.path.join(current_path, item)

            try:
                # Skip symlinked dirs so cycles can't grow the stack forever
                if os.path.isdir(item_path) and not os.path.islink(item_path):
                    stack.append(item_path)  # Add to stack instead of recursing
                else:
                    yield item_path
            except OSError:
                pass

The mental shift: Instead of calling the function recursively, we add directories to a stack and process them in a loop.

Which should you use?

  Recursive           Iterative
  ---------           ---------
  More intuitive      More verbose
  Cleaner code        More manual control
  Depth limited       No depth limit
  Natural DFS         Can do BFS easily

For most cases: Use recursive. It's clearer and depth limits rarely matter.

For extreme cases: Use iterative. When you need BFS or worry about stack depth.

Here's BFS (breadth-first) with iteration:

from collections import deque

def walk_bfs(start_path):
    """Breadth-first traversal."""
    queue = deque([start_path])

    while queue:
        current_path = queue.popleft()  # FIFO instead of LIFO

        try:
            items = os.listdir(current_path)
        except PermissionError:
            continue

        for item in items:
            item_path = os.path.join(current_path, item)

            try:
                if os.path.isdir(item_path) and not os.path.islink(item_path):
                    queue.append(item_path)
                else:
                    yield item_path
            except OSError:
                pass

When does this matter? If you want to find files by how far they are from the root, not by which branch they're in.
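
Here's a sketch of that idea: a BFS variant (walk_bfs_with_depth is an illustrative name) that tags every file with its distance from the root, so the shallowest files come out first:

import os
from collections import deque

def walk_bfs_with_depth(start_path):
    """Breadth-first walk that reports each file's distance from the root."""
    queue = deque([(start_path, 0)])
    while queue:
        current_path, depth = queue.popleft()
        try:
            items = os.listdir(current_path)
        except PermissionError:
            continue
        for item in items:
            item_path = os.path.join(current_path, item)
            if os.path.isdir(item_path) and not os.path.islink(item_path):
                queue.append((item_path, depth + 1))
            else:
                yield depth, item_path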

Performance Considerations

Memory Usage

# MEMORY INTENSIVE - loads everything into memory
def get_all_files(path):
    files = []
    for root, dirs, filenames in os.walk(path):
        for f in filenames:
            files.append(os.path.join(root, f))
    return files  # Returns potentially millions of items

all_files = get_all_files('/huge/directory')
for f in all_files:  # Only now can you start processing
    process(f)

# MEMORY EFFICIENT - generates one at a time
def iter_files(path):
    for root, dirs, filenames in os.walk(path):
        for f in filenames:
            yield os.path.join(root, f)

for f in iter_files('/huge/directory'):  # Start processing immediately
    process(f)

Rule of thumb: Use generators unless you specifically need the full list.

Early Stopping Strategies

Stop as soon as you find what you need:

def find_first_large_file(path, size_threshold):
    """Find the first file over threshold, then stop."""
    for root, dirs, files in os.walk(path):
        for file in files:
            full_path = os.path.join(root, file)
            try:
                if os.path.getsize(full_path) > size_threshold:
                    return full_path  # STOP IMMEDIATELY
            except OSError:
                pass
    return None

This is much faster than finding all large files when you only need one.

Caching Stat Calls

os.stat() is expensive. Don't call it multiple times:

# SLOW - multiple stat calls per file
def slow_version(path):
    for root, dirs, files in os.walk(path):
        for file in files:
            full_path = os.path.join(root, file)
            size = os.path.getsize(full_path)  # stat call
            mtime = os.path.getmtime(full_path)  # stat call
            # ...

# FAST - one stat call per file
def fast_version(path):
    for root, dirs, files in os.walk(path):
        for file in files:
            full_path = os.path.join(root, file)
            stat = os.stat(full_path)  # ONE stat call
            size = stat.st_size
            mtime = stat.st_mtime
            # ...
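
You can go one step further with os.scandir(), which os.walk() itself has been built on since Python 3.5. Each DirEntry caches type information from the directory listing, so is_dir() usually costs no extra system call. A sketch of a scandir-based walk (scandir_walk is an illustrative name):

import os

def scandir_walk(path):
    """Yield (path, stat_result) for every file, with minimal stat traffic."""
    try:
        entries = os.scandir(path)
    except OSError:
        return
    with entries:
        for entry in entries:
            try:
                if entry.is_dir(follow_symlinks=False):  # often free: cached type
                    yield from scandir_walk(entry.path)
                else:
                    yield entry.path, entry.stat(follow_symlinks=False)
            except OSError:
                continue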

Exercise: Implementing a Directory Tree Visualizer

Build a function that creates a visual tree like this:

project/
  src/
    main.py
    utils/
      helpers.py
      constants.py
  tests/
    test_main.py
  README.md

Requirements:

  1. Show directories with / suffix
  2. Indent properly based on depth
  3. Sort directories before files
  4. Add option to limit depth
  5. Add option to show file sizes
  6. Handle errors gracefully

def tree(start_path,
         max_depth=None,
         show_sizes=False,
         show_hidden=False):
    """
    Display directory tree.

    Args:
        start_path: Root directory to display
        max_depth: Maximum depth to display (None = unlimited)
        show_sizes: Show file sizes in bytes
        show_hidden: Include hidden files/dirs (starting with .)
    """
    # Your implementation here
    pass

# Test it:
# tree('.', max_depth=2, show_sizes=True)

Hints:

  1. Track current depth in recursion
  2. Use os.path.basename() for names
  3. Sort: directories first, then alphabetically
  4. Pass depth through recursive calls
  5. Format sizes nicely (consider humanize library pattern)

What You've Learned

You now understand:

  1. How os.walk() really works - generators, memory efficiency
  2. The tuple structure - root, dirs, files and how to use them
  3. Common patterns - filtering, skipping, building trees
  4. When to write custom traversal - specific control needs
  5. Recursion mechanics - base cases, state passing, depth tracking
  6. Iterative alternatives - stack-based, BFS vs DFS
  7. Performance considerations - memory, early stopping, caching

In Part III, we'll take everything you've learned about file system traversal and apply it to JSON and nested dictionaries. You'll see the exact same patterns, just with different data structures. The mental model you've built here transfers directly.