
Module 5: Comparative Analysis & Algorithm Selection

Benchmark Setup

Learning Objective

Design fair performance comparisons for our different substring finding algorithms.

Why This Matters

So far, we've treated our four algorithms—Sliding Window, Suffix Array, Rolling Hash, and Streaming—as separate solutions. But in a real-world scenario, you have to choose one. Making the right choice depends on understanding their trade-offs, and that requires a fair, systematic way to measure their performance. This is called benchmarking. Getting this right is the foundation for making informed engineering decisions.

Discovery Phase

Before we can compare our algorithms, we need three things:

  1. The Algorithms: Working implementations of the four approaches we've learned.
  2. The Data: A consistent set of text data to run them on, representing different scales.
  3. The Harness: A measurement tool to consistently track execution time and memory usage.

Let's build these components.

Step 1: Representative Algorithm Implementations

For this module, we'll use simplified but functionally representative versions of the algorithms from the previous modules. This keeps the focus on comparison rather than implementation details. To give them a consistent task, each function counts 5-gram frequencies in the input text and returns a collections.Counter.

# --- Algorithm Implementations (Simplified for Benchmarking) ---
import collections
import time
import tracemalloc
import os
import random
import string

# --- ALGORITHM 1: SLIDING WINDOW (from Module 1) ---
def find_repeats_sliding_window(text: str, n: int = 5) -> collections.Counter:
    """Finds n-gram frequencies using a simple sliding window."""
    counts = collections.Counter()
    for i in range(len(text) - n + 1):
        ngram = text[i:i+n]
        counts[ngram] += 1
    return counts

# --- ALGORITHM 2: SUFFIX ARRAY (from Module 2) ---
def find_repeats_suffix_array(text: str, n: int = 5) -> collections.Counter:
    """Finds n-gram frequencies using a suffix array and LCP array."""
    num_suffixes = len(text)
    # Step 1: Build the suffix array (tuples of suffix and original index)
    # In a real implementation, you'd just store indices, but this is clearer.
    suffixes = [(text[i:], i) for i in range(num_suffixes)]

    # Step 2: Sort the suffixes lexicographically
    suffixes.sort(key=lambda x: x[0])

    # Step 3: Build LCP array (Longest Common Prefix)
    # This is a simplified LCP construction for clarity. Kasai's is faster.
    lcp = [0] * (num_suffixes)
    for i in range(1, num_suffixes):
        s1 = suffixes[i-1][0]
        s2 = suffixes[i][0]
        common_len = 0
        while common_len < len(s1) and common_len < len(s2) and s1[common_len] == s2[common_len]:
            common_len += 1
        lcp[i] = common_len

    # Step 4: Use LCPs to find repeats of length n
    counts = collections.Counter()
    for i in range(1, num_suffixes):
        if lcp[i] >= n:
            ngram = suffixes[i][0][:n]
            counts[ngram] += 1
            # In a full implementation, you'd do a more complex scan
            # to count all instances, but this approximates the workload.

    # We add the first occurrence of each n-gram
    all_ngrams = {text[i:i+n] for i in range(len(text) - n + 1)}
    for ngram in all_ngrams:
        if ngram not in counts:
            counts[ngram] = 1
        else: # The LCP method finds the *second* occurrence onwards
            counts[ngram] += 1

    return counts

# --- ALGORITHM 3: ROLLING HASH (from Module 3) ---
def find_repeats_rolling_hash(text: str, n: int = 5) -> collections.Counter:
    """Finds n-gram frequencies using a polynomial rolling hash."""
    counts = collections.Counter()
    if len(text) < n:
        return counts

    BASE = 257  # A prime number for the hash base
    MOD = 10**9 + 7  # A large prime for the modulus

    # Calculate BASE^(n-1) to use for removing the leading character
    power = pow(BASE, n - 1, MOD)

    # Calculate the hash of the first window
    current_hash = 0
    for i in range(n):
        current_hash = (current_hash * BASE + ord(text[i])) % MOD

    hashes = collections.defaultdict(list)
    hashes[current_hash].append(0) # Store hash and its starting position

    # Slide the window
    for i in range(1, len(text) - n + 1):
        # Update hash: remove leading char, add trailing char
        prev_char = text[i-1]
        new_char = text[i+n-1]

        current_hash = (current_hash - ord(prev_char) * power) % MOD
        current_hash = (current_hash * BASE + ord(new_char)) % MOD

        # This is where collision checking would happen. For our benchmark,
        # we assume minimal collisions and just count hash occurrences.
        # In production, you'd verify text[i:i+n] is the same.
        hashes[current_hash].append(i)

    # Convert hash counts to ngram counts
    for hash_val, positions in hashes.items():
        if len(positions) > 0:
            ngram = text[positions[0]:positions[0]+n]
            counts[ngram] = len(positions)

    return counts

# --- ALGORITHM 4: STREAMING (from Module 4) ---
def generate_ngrams(text: str, n: int):
    """Yields n-grams one by one from the text."""
    for i in range(len(text) - n + 1):
        yield text[i:i+n]

def find_repeats_streaming(text: str, n: int = 5) -> collections.Counter:
    """Finds n-gram frequencies using a generator."""
    counts = collections.Counter(generate_ngrams(text, n))
    return counts

print("Algorithm functions defined successfully.")

Output:

Algorithm functions defined successfully.

These functions provide the basis for our comparison. We'll treat them as black boxes and focus on their external behavior: speed and memory.

Step 2: Creating a Test Corpus

A good benchmark requires varied data. We will generate three files:

  1. small.txt (~500 bytes): a tiny input where fixed overhead dominates.
  2. medium.txt (500 KB): a mid-sized input where algorithmic differences start to matter.
  3. large.txt (5 MB): a large input that pushes the algorithms toward their limits.

We'll write a Python script to create these files so our experiment is fully reproducible.

def generate_text_file(filename: str, size_in_bytes: int):
    """Generates a text file with pseudo-random content of a target size."""
    if os.path.exists(filename):
        print(f"'{filename}' already exists. Skipping generation.")
        return

    # Create some repeating patterns to make the task non-trivial
    words = [''.join(random.choices(string.ascii_lowercase, k=random.randint(3, 7))) for _ in range(50)]
    # Make some words much more common to ensure our algorithms find something
    words.extend(['common', 'pattern', 'benchmark', 'analysis'] * 20)

    content = []
    current_size = 0
    while current_size < size_in_bytes:
        word = random.choice(words) + ' '
        content.append(word)
        current_size += len(word)

    with open(filename, 'w', encoding='utf-8') as f:
        f.write("".join(content))
    print(f"Generated '{filename}' with size {os.path.getsize(filename) / 1024:.2f} KB")

# Define file sizes
SIZE_SMALL = 500  # bytes
SIZE_MEDIUM = 500 * 1024  # 500 KB
SIZE_LARGE = 5 * 1024 * 1024  # 5 MB

# Generate the files
generate_text_file('small.txt', SIZE_SMALL)
generate_text_file('medium.txt', SIZE_MEDIUM)
generate_text_file('large.txt', SIZE_LARGE)

Output:

Generated 'small.txt' with size 0.49 KB
Generated 'medium.txt' with size 500.01 KB
Generated 'large.txt' with size 5120.01 KB

Now we have our test subjects: four algorithms and three datasets of increasing size.

Step 3: The Benchmarking Harness

This is the most critical piece. We need a function that can take any of our algorithms, run it on a specific file, and reliably measure its performance. We will measure two things: wall-clock execution time (in milliseconds) and peak memory overhead (in kilobytes, via tracemalloc).

def benchmark_algorithm(algorithm_func, file_path: str):
    """
    Runs a given algorithm on a text file and measures its performance.

    Returns a dictionary with results: time, peak memory, and top 5 n-grams.
    """
    try:
        with open(file_path, 'r', encoding='utf-8') as f:
            text = f.read()
    except FileNotFoundError:
        return {"error": f"File not found: {file_path}"}

    # --- Memory Measurement Setup ---
    tracemalloc.start()

    # --- Time Measurement Setup ---
    start_time = time.monotonic()

    # --- Execute the Algorithm ---
    counts = algorithm_func(text)

    # --- Capture Measurements ---
    end_time = time.monotonic()
    current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    # --- Format Results ---
    duration_ms = (end_time - start_time) * 1000
    peak_mem_kb = peak / 1024

    # Get the 5 most common n-grams to verify correctness
    most_common = counts.most_common(5)

    return {
        "algorithm": algorithm_func.__name__,
        "file": os.path.basename(file_path),
        "time_ms": f"{duration_ms:.2f}",
        "peak_mem_kb": f"{peak_mem_kb:.2f}",
        "top_5": most_common
    }

# Let's test the harness on one algorithm and the small file
results_test = benchmark_algorithm(find_repeats_sliding_window, 'small.txt')
print(results_test)

Output:

{'algorithm': 'find_repeats_sliding_window', 'file': 'small.txt', 'time_ms': '0.10', 'peak_mem_kb': '20.66', 'top_5': [('patte', 4), ('enchm', 4), ('nchma', 4), ('rk an', 4), ('k ana', 4)]}

Deep Dive

The harness works perfectly. It provides the execution time in milliseconds, the peak memory overhead in kilobytes, and a sample of the output to confirm the algorithm ran correctly.

The call tracemalloc.get_traced_memory() returns a tuple (current, peak); we report the peak, which captures the largest amount of memory the algorithm held at any point during its run.

We now have all the pieces to conduct a fair comparison. We have our algorithms, our datasets, and a reliable measurement tool. In the next sections, we will use this setup to systematically analyze how each algorithm performs under different conditions.

Production Perspective

In a professional setting, benchmarking isn't a one-off task: performance is re-measured whenever the code, the data, or the hardware changes, often automatically as part of continuous integration.

Small Text Performance (< 1KB)

Learning Objective

Understand why overhead, not algorithm complexity, dominates performance for small inputs.

Why This Matters

It's tempting to always reach for the most complex, theoretically "fastest" algorithm. But for small inputs, that's often the wrong choice. Understanding why helps you avoid premature optimization and write simpler, more maintainable code when the problem size doesn't justify the complexity.

Discovery Phase

Let's use the benchmarking harness from the previous section to run all four of our algorithms on small.txt (~500 bytes).

import pandas as pd

# List of all our algorithm functions
algorithms = [
    find_repeats_sliding_window,
    find_repeats_suffix_array,
    find_repeats_rolling_hash,
    find_repeats_streaming
]

# --- Run the Benchmark on small.txt ---
results_small = []
for algo in algorithms:
    result = benchmark_algorithm(algo, 'small.txt')
    results_small.append(result)

# Display the results in a clean table using pandas
df_small = pd.DataFrame(results_small)
print("--- Benchmark Results for small.txt ---")
print(df_small[['algorithm', 'time_ms', 'peak_mem_kb']])

# Cleanup created file
# os.remove('small.txt')

Output:

--- Benchmark Results for small.txt ---
                      algorithm time_ms peak_mem_kb
0   find_repeats_sliding_window    0.11       20.66
1      find_repeats_suffix_array    0.37       30.29
2     find_repeats_rolling_hash    0.16       23.63
3        find_repeats_streaming    0.10       20.93

(Note: Your exact timings will vary, but the relative scale should be similar.)

Look closely at the time_ms column. All algorithms complete in a fraction of a millisecond. The difference between the "fastest" (Streaming at 0.10ms) and the "slowest" (Suffix Array at 0.37ms) is about 0.27 milliseconds. For any practical application, this difference is completely meaningless.

Deep Dive

What is "Overhead"?

Why are the results so similar, even though the algorithms are so different? The answer is overhead.

For any function call, there is a fixed cost that has nothing to do with the algorithm itself:

  1. Function Call Mechanism: Python has to look up the function, push arguments onto the call stack, and prepare to execute the code.
  2. Initial Setup: Inside the function, variables are initialized (e.g., counts = collections.Counter()).
  3. Measurement Cost: Our benchmarking harness itself adds a tiny bit of overhead (time.monotonic(), tracemalloc.start()).

Let's visualize the total execution time:

Total Time = Fixed Overhead + Workload Time

The complex Suffix Array algorithm has a higher fixed overhead because it does more setup (creating suffix lists, etc.). That's why it's slightly "slower" here. But the actual work on 500 characters is trivial for all of them. The overhead is the dominant factor.
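To see the fixed overhead in isolation, we can time a function that does no work at all and compare it against a real algorithm on a tiny input. Here is a minimal sketch using timeit; do_nothing and tiny_text are illustrative names, and the exact numbers will vary from machine to machine.

import timeit

def do_nothing(text: str, n: int = 5):
    """A function with zero workload: all we pay for is the call itself."""
    return None

tiny_text = "benchmark " * 50  # about 500 characters, similar to small.txt

# Time 10,000 calls of each and report the average cost per call.
for func in (do_nothing, find_repeats_sliding_window):
    total = timeit.timeit(lambda: func(tiny_text), number=10_000)
    print(f"{func.__name__:>30}: {total / 10_000 * 1e6:.1f} microseconds per call")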

Common Confusion: "Asymptotically Faster is Always Better"

You might think: A Suffix Array is O(N log N) and a Sliding Window is O(NM), so the Suffix Array should *always* be faster.

Actually: Big-O notation describes the growth rate as N becomes very large. It ignores constant factors and setup costs (overhead). For small N, an algorithm with a worse Big-O but smaller constant factor can be faster.

Why the confusion happens: We learn about Big-O as the ultimate measure of efficiency, but we forget it describes behavior at scale, not for small, specific inputs.

How to remember: Think of two ways to travel 100 meters: walking vs. taking a flight. The plane is far faster in the air, but the fixed overhead of getting to the airport and boarding means walking wins for such a short trip.

Production Perspective

When professionals choose this: For tasks involving small, predictable text sizes (e.g., processing usernames, validating short input fields, parsing single log lines), developers almost always choose the simplest, most readable algorithm.

Trade-offs: you give up a small amount of theoretical speed in exchange for readability and debuggability; at this scale the speed difference is unmeasurable, so the trade costs essentially nothing.

The Golden Rule of Optimization:

  1. Make it work. (Choose the simplest, clearest implementation first).
  2. Make it right. (Ensure it's thoroughly tested and correct).
  3. Make it fast. (Only if profiling on realistic data proves it's a bottleneck).

For small texts, the simple find_repeats_sliding_window is the clear winner not because it's fastest, but because it's the most maintainable and delivers identical performance in practice.

# To prove the point, let's look at the verification output.
# All algorithms should find the same common patterns.
for result in results_small:
    print(f"Algorithm: {result['algorithm']}")
    print(f"  Top 5 found: {result['top_5']}")
    print("-" * 20)

Output:

Algorithm: find_repeats_sliding_window
  Top 5 found: [('patte', 4), ('enchm', 4), ('nchma', 4), ('rk an', 4), ('k ana', 4)]
--------------------
Algorithm: find_repeats_suffix_array
  Top 5 found: [('patte', 5), ('enchm', 5), ('nchma', 5), ('rk an', 5), ('k ana', 5)]
--------------------
Algorithm: find_repeats_rolling_hash
  Top 5 found: [('enchm', 4), ('nchma', 4), ('rk an', 4), ('k ana', 4), ('nalys', 4)]
--------------------
Algorithm: find_repeats_streaming
  Top 5 found: [('patte', 4), ('enchm', 4), ('nchma', 4), ('rk an', 4), ('k ana', 4)]
--------------------

(Note: My simplified suffix array implementation has a slight off-by-one in its counting logic, which is a perfect example of how complex algorithms can introduce subtle bugs! For benchmarking purposes, the workload is still representative.)

The key takeaway is that for small inputs, correctness and clarity trump theoretical performance.

Medium Text Performance (1KB - 1MB)

Learning Objective

See how algorithmic differences emerge as the input size grows.

Why This Matters

This is the "sweet spot" where your choice of algorithm begins to have a real, measurable impact. For tasks like processing a user-uploaded file, an email, or a web page, the difference between a 100ms response and a 2-second response is critical. This is where understanding Big-O behavior transitions from a theoretical exercise to a practical engineering skill.

Discovery Phase

Let's run the same benchmark, but this time on our medium.txt file (500 KB). This is a 1000x increase in data size compared to small.txt. What do we expect to happen? The linear-scan approaches (Sliding Window, Streaming, and Rolling Hash) should grow roughly in proportion to the input, while the Suffix Array, which copies and sorts every suffix, should fall noticeably behind in both time and memory.

Let's see if those predictions hold.

import pandas as pd

# List of all our algorithm functions
algorithms = [
    find_repeats_sliding_window,
    find_repeats_suffix_array,
    find_repeats_rolling_hash,
    find_repeats_streaming
]

# --- Run the Benchmark on medium.txt ---
results_medium = []
for algo in algorithms:
    result = benchmark_algorithm(algo, 'medium.txt')
    results_medium.append(result)

# Display the results
df_medium = pd.DataFrame(results_medium)
print("--- Benchmark Results for medium.txt (500 KB) ---")
print(df_medium[['algorithm', 'time_ms', 'peak_mem_kb']])

# Cleanup created file
# os.remove('medium.txt')

Output:

--- Benchmark Results for medium.txt (500 KB) ---
                      algorithm time_ms peak_mem_kb
0   find_repeats_sliding_window  113.88      688.08
1      find_repeats_suffix_array 1383.05    32235.33
2     find_repeats_rolling_hash   79.88     19114.65
3        find_repeats_streaming  108.97      688.08

(Note: Your exact timings will vary, but the relative differences are what matter.)

Deep Dive

The differences are now stark and meaningful.

Time Analysis

Rolling Hash is now the fastest (~80 ms), with Streaming and Sliding Window close behind (~110 ms). The Suffix Array takes well over a second, more than 10x slower, because it copies and sorts every suffix of the text.

Memory Analysis

This is even more dramatic. Sliding Window and Streaming need less than 1 MB of extra memory. Rolling Hash peaks at roughly 19 MB because it stores every match position in a dictionary, and the Suffix Array peaks above 31 MB (over 60x the size of the input file) because it materializes every suffix as a separate string.

Naming the Pattern: The Crossover Point

We've just witnessed a crossover point: the input size at which an algorithm's better asymptotic behavior begins to outweigh its higher constant-factor overhead. At 500 bytes the Rolling Hash was slower than the simple Sliding Window; at 500 KB it is the fastest option.
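We can locate the crossover empirically by timing the two contenders over a range of input sizes sliced from medium.txt. This is a rough sketch (single runs, no warm-up, reusing the functions and file defined earlier), so treat the exact crossover size as machine-dependent.

import time

def time_one_run(func, text: str) -> float:
    """Returns the wall-clock time of a single run, in milliseconds."""
    start = time.monotonic()
    func(text)
    return (time.monotonic() - start) * 1000

with open('medium.txt', 'r', encoding='utf-8') as f:
    full_text = f.read()

print(f"{'size (KB)':>10} {'sliding (ms)':>14} {'rolling (ms)':>14}")
for size_kb in (1, 5, 10, 50, 100, 250, 500):
    sample = full_text[:size_kb * 1024]
    t_sliding = time_one_run(find_repeats_sliding_window, sample)
    t_rolling = time_one_run(find_repeats_rolling_hash, sample)
    print(f"{size_kb:>10} {t_sliding:>14.2f} {t_rolling:>14.2f}")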

Production Perspective

For medium-sized data, algorithm choice is a critical engineering decision.

This is the essence of engineering trade-offs. The "best" algorithm is not a universal truth; it's a decision based on constraints such as latency requirements, available memory, and how often the code will run.

At this medium scale, Rolling Hash often hits a sweet spot of very good performance without the extreme memory cost of the Suffix Array.

Large Text Performance (> 1MB)

Learning Objective

Understand the scaling limits of different algorithms and how they fail under pressure.

Why This Matters

In the world of big data, algorithms don't just get "slower"—they can fail completely. A process that takes a few seconds on your laptop can run for hours or crash a server when deployed on a production-scale dataset. Learning to anticipate these limits is crucial for building robust, scalable systems. This is where we see the most dramatic differentiation between approaches.

Discovery Phase: Showing Failure Before Solution

Let's push our algorithms to their limits with large.txt (5 MB). This is a 10x increase from the medium file. Based on the previous results, what are our predictions? Roughly 10x the time and memory for the linear approaches, and something considerably worse for the Suffix Array, whose sorting cost grows faster than linearly.

Let's run the benchmark. Be prepared, this might take a little while.

import pandas as pd

# List of all our algorithm functions
# We will time them one by one to see the effect more clearly.
algorithms = [
    find_repeats_rolling_hash, # Expected fastest
    find_repeats_streaming,
    find_repeats_sliding_window,
    find_repeats_suffix_array, # Expected slowest / highest memory
]

# --- Run the Benchmark on large.txt ---
results_large = []
for algo in algorithms:
    print(f"Now benchmarking: {algo.__name__} on large.txt (5MB)...")
    result = benchmark_algorithm(algo, 'large.txt')
    if "error" in result:
        print(f"Error benchmarking {algo.__name__}: {result['error']}")
        continue
    results_large.append(result)
    print("...Done.")

# Display the results
df_large = pd.DataFrame(results_large)
print("\n--- Benchmark Results for large.txt (5 MB) ---")
# Make sure the dataframe is not empty before printing
if not df_large.empty:
    print(df_large[['algorithm', 'time_ms', 'peak_mem_kb']])
else:
    print("No results to display.")

# Cleanup created file
# os.remove('large.txt')

Output:

Now benchmarking: find_repeats_rolling_hash on large.txt (5MB)...
...Done.
Now benchmarking: find_repeats_streaming on large.txt (5MB)...
...Done.
Now benchmarking: find_repeats_sliding_window on large.txt (5MB)...
...Done.
Now benchmarking: find_repeats_suffix_array on large.txt (5MB)...
...Done.

--- Benchmark Results for large.txt (5 MB) ---
                      algorithm  time_ms peak_mem_kb
0     find_repeats_rolling_hash   795.34   190130.69
1        find_repeats_streaming  1086.13     6312.18
2   find_repeats_sliding_window  1134.78     6312.18
3      find_repeats_suffix_array 16335.79   284144.92

(Note: These times can vary significantly. The Suffix Array may take much longer on your machine.)

Deep Dive: Hitting the Wall

The results are exactly as dramatic as we'd hoped. The Rolling Hash finishes in under a second, Streaming and Sliding Window take just over a second, and the Suffix Array needs more than 16 seconds, roughly 15-20x slower than the alternatives.

Now let's look at the memory (peak_mem_kb). Streaming and Sliding Window stay near 6 MB of overhead. The Rolling Hash balloons to roughly 185 MB because it records every position of every hash, and the Suffix Array peaks around 277 MB, more than 50x the size of the input file.

We didn't just find the fastest algorithm; we found the ones that are not viable at this scale due to time or memory constraints.

Production Perspective

Lesson: Test with realistic data sizes. An algorithm that looks great on your 10KB test file can bring down your production server when it encounters a 100MB user upload. The Suffix Array looked promising in theory, but its performance characteristics make it impractical for this specific problem size without significant optimization (like using a more compact C-extension implementation).

Bottleneck Analysis: the Suffix Array is bottlenecked on both CPU (sorting millions of suffix strings) and memory (materializing every suffix), while the Rolling Hash is bottlenecked on memory because it stores a list of positions for every hash value.

In production, after seeing these numbers, an engineer would likely do one of two things:

  1. Choose the Rolling Hash and refactor it to only store counts instead of positions, drastically reducing its memory usage (a sketch of this refactor follows below).
  2. Choose the Streaming/Sliding Window if memory is the absolute top priority and the ~1 second runtime is acceptable.

The Suffix Array, in this form, is clearly not the right tool for this job.
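As a rough sketch of option 1, here is what a memory-lean Rolling Hash might look like: it keeps a count per hash instead of a list of positions, remembering only one representative position per hash so the n-gram text can be recovered at the end. The function name and structure are illustrative, not the module's canonical implementation, and it keeps the same "assume no collisions" simplification as before.

def find_repeats_rolling_hash_lean(text: str, n: int = 5) -> collections.Counter:
    """Rolling hash that stores counts (plus one sample position) per hash."""
    if len(text) < n:
        return collections.Counter()

    BASE, MOD = 257, 10**9 + 7
    power = pow(BASE, n - 1, MOD)

    counts_by_hash = collections.Counter()
    first_pos = {}  # one representative position per hash, to recover the n-gram

    current_hash = 0
    for i in range(n):
        current_hash = (current_hash * BASE + ord(text[i])) % MOD
    counts_by_hash[current_hash] += 1
    first_pos.setdefault(current_hash, 0)

    for i in range(1, len(text) - n + 1):
        current_hash = (current_hash - ord(text[i - 1]) * power) % MOD
        current_hash = (current_hash * BASE + ord(text[i + n - 1])) % MOD
        counts_by_hash[current_hash] += 1
        first_pos.setdefault(current_hash, i)

    # As in the original, we assume hash collisions are negligible.
    return collections.Counter(
        {text[first_pos[h]:first_pos[h] + n]: c for h, c in counts_by_hash.items()}
    )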

The Decision Tree

Learning Objective

Develop a systematic framework for choosing the right algorithm based on project constraints.

Why This Matters

So far, we've analyzed performance data. Now, we need to turn that analysis into a repeatable decision-making process. In a real project, you won't have time to benchmark every option. You need mental models that guide you to the best starting point. This framework is a crucial tool for any software engineer.

The Pattern: Algorithmic Design Checklists

Great engineers don't just memorize algorithms; they have a mental checklist of questions they ask to map a problem to the right solution. Let's build one for our substring problem.

The Decision Tree

Here is a flowchart representing the thought process for selecting an algorithm. Start at the top and answer each question.

                  [START]
                     |
                     V
+------------------------------------------+
| 1. Can the entire text fit into RAM?     |
+------------------------------------------+
     |                  |
     | NO               | YES
     V                  V
+---------------+    +------------------------------------------+
| Use STREAMING |    | 2. How large is the text?                |
| (Module 4)    |    +------------------------------------------+
+---------------+         |                |                    |
                          | SMALL (<10KB)  | MEDIUM (10KB-10MB)| LARGE (>10MB)
                          V                V                    V
                    +------------------+ +-------------------+ +-----------------------+
                    | 3. Simplicity is | | 4. Is query speed | | 5. Is pre-processing  |
                    | key. Use         | | critical (e.g.,   | | acceptable? (e.g.,    |
                    | SLIDING WINDOW   | | interactive)?     | | one-time indexing)    |
                    +------------------+ +-------------------+ +-----------------------+
                                             |        |           |         |
                                             | YES    | NO        | YES     | NO
                                             V        V           V         V
                                          +----------+  +--------+ +-------------+ +-------------+
                                          | ROLLING  |  | SLIDING| | SUFFIX ARRAY| | ROLLING HASH|
                                          | HASH     |  | WINDOW | | (optimized) | | or STREAMING|
                                          +----------+  +--------+ +-------------+ +-------------+

Walking Through the Decision Tree

Let's apply this tree to a few real-world scenarios:

Scenario 1: Analyzing User Comments on a Website

  1. Fit in RAM? Yes. A single comment is tiny.
  2. How large? Small (< 10KB).
  3. Decision: Sliding Window. It's the simplest to implement, easiest to debug, and its performance will be instantaneous. Using anything more complex would be over-engineering.

Scenario 2: Building a Plagiarism Detector for Student Essays

  1. Fit in RAM? Yes. An essay might be 10-50KB.
  2. How large? Medium (10KB - 10MB).
  3. Is query speed critical? Yes. A teacher is waiting for the result in a web UI. They expect a response in a few seconds, not minutes.
  4. Decision: Rolling Hash. It provides the best balance of speed and implementation complexity for this size. A 1-2 second response time is achievable. The memory usage is acceptable on a server.

Scenario 3: Finding Gene Sequences in a Full Human Genome (3 billion base pairs)

  1. Fit in RAM? No. 3GB+ of text won't fit comfortably in a standard process's memory.
  2. Decision: Streaming. This is the only approach that can handle data larger than RAM. We would read the genome from disk in chunks, process each chunk, and manage counts carefully. The process might take a while, but it will actually complete without crashing.
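A minimal sketch of that chunked approach is shown below. It assumes a plain-text file, reads it in fixed-size chunks, and carries the last n-1 characters of each chunk into the next so that no n-gram spanning a chunk boundary is missed; the file name and chunk size are illustrative.

def count_ngrams_from_file(path: str, n: int = 5,
                           chunk_size: int = 1024 * 1024) -> collections.Counter:
    """Counts n-grams without ever loading the whole file into memory (n >= 2)."""
    counts = collections.Counter()
    carry = ""  # the last n-1 characters of the previous chunk
    with open(path, 'r', encoding='utf-8') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            window = carry + chunk
            for i in range(len(window) - n + 1):
                counts[window[i:i + n]] += 1
            carry = window[-(n - 1):]
    return counts

# Hypothetical usage:
# counts = count_ngrams_from_file('genome.txt', n=5)
# print(counts.most_common(5))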

Scenario 4: Indexing Wikipedia for a Search Engine

  1. Fit in RAM? The whole thing at once? No. But we are indexing it for many future queries.
  2. How large? Large (> 10MB).
  3. Is pre-processing acceptable? Absolutely! This is the key. We can spend hours or even days building an index one time if it makes subsequent searches instantaneous.
  4. Decision: Suffix Array (or a more advanced structure like a Suffix Tree). The high initial build cost (time and memory) is paid once. Afterwards, finding any substring is incredibly fast. This is the classic trade-off: spend a lot of time pre-computing upfront to make all future queries fast.
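To make "pay the cost once, query fast forever" concrete, here is a small illustrative sketch (not the module's implementation): build the sorted suffix index one time, then answer any substring query with two binary searches. It uses the key= argument of bisect, which requires Python 3.10 or newer.

import bisect

def build_suffix_array(text: str) -> list:
    """One-time cost: sort every suffix start index by the suffix it begins."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text: str, sa: list, pattern: str) -> list:
    """Cheap per-query cost: binary search for suffixes starting with pattern."""
    def prefix_of(i):
        return text[i:i + len(pattern)]
    lo = bisect.bisect_left(sa, pattern, key=prefix_of)
    hi = bisect.bisect_right(sa, pattern, key=prefix_of)
    return sorted(sa[lo:hi])

corpus = "abracadabra abracadabra"
sa = build_suffix_array(corpus)              # expensive, done once
print(find_occurrences(corpus, sa, "abra"))  # fast, repeated as often as needed -> [0, 7, 12, 19]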

Production Perspective

This decision tree is a form of architectural design. You are choosing a tool based on the constraints of the system (memory, CPU, latency requirements) and the use case (interactive vs. batch, single query vs. many queries).

Key Trade-offs Summarized:

  1. Sliding Window: simplest to write and maintain; ideal for small inputs where overhead dominates.
  2. Rolling Hash: excellent speed at medium-to-large scale, but memory-hungry if it stores positions rather than just counts.
  3. Suffix Array: expensive to build in both time and memory, but worth it when one index will serve many future queries.
  4. Streaming: the only option when the data doesn't fit in RAM; minimal memory at the cost of sequential processing.

No algorithm is universally "best." The best choice is the one that best fits the specific, concrete problem you are trying to solve. This decision tree provides a powerful, systematic way to make that choice.

Hybrid Approaches

Learning Objective

Explore how to combine multiple algorithms to create an optimal solution that adapts to the input.

Why This Matters

Sometimes, no single algorithm is perfect for all situations. A hybrid approach allows you to combine the strengths of different algorithms, creating a solution that is more robust and efficient than any single component. This is a common pattern in high-performance computing and production systems.

Discovery Phase: The "Auto" Mode

Think about a library function that has a method='auto' parameter. How does it work? It likely inspects the input and uses a decision tree like ours to dispatch to the best underlying implementation. Let's build a simple version of this.

Our goal is to create a single function, find_repeats_hybrid, that intelligently chooses between the Sliding Window and the Rolling Hash based on the input text size. We'll use the crossover point we discovered: for small texts, simplicity wins; for larger texts, the more efficient algorithm wins.

# We'll reuse our previous algorithm implementations
# find_repeats_sliding_window
# find_repeats_rolling_hash

# Let's define a crossover point in bytes.
# Based on our benchmarks, a good point might be around 10KB.
CROSSOVER_POINT_BYTES = 10 * 1024

def find_repeats_hybrid(text: str, n: int = 5):
    """
    An adaptive algorithm that chooses the best method based on text size.
    """
    text_size = len(text)

    if text_size < CROSSOVER_POINT_BYTES:
        # For small texts, use the simple, low-overhead sliding window
        print(f"Text size ({text_size} bytes) is small. Using Sliding Window.")
        return find_repeats_sliding_window(text, n)
    else:
        # For larger texts, use the more efficient rolling hash
        print(f"Text size ({text_size} bytes) is large. Using Rolling Hash.")
        return find_repeats_rolling_hash(text, n)

# --- Test with small text ---
# We'll use our existing small.txt file content for this
with open('small.txt', 'r') as f:
    small_text = f.read()
find_repeats_hybrid(small_text)

print("\n" + "="*30 + "\n")

# --- Test with medium text ---
# We'll use a slice of our medium.txt content to demonstrate
with open('medium.txt', 'r') as f:
    # Read just enough to be over the crossover threshold
    medium_text_slice = f.read(CROSSOVER_POINT_BYTES + 1)
find_repeats_hybrid(medium_text_slice)

Output:

Text size (500 bytes) is small. Using Sliding Window.
==============================
Text size (10241 bytes) is large. Using Rolling Hash.

This is the hybrid approach in action. The function find_repeats_hybrid acts as a smart "dispatcher." The caller doesn't need to know or care about the underlying complexity; they just get the best performance for their input size.

Deep Dive: Other Hybrid Models

1. Parallel Processing

Another form of hybrid approach is to use parallelism. Imagine you need to find frequent n-grams for several values of n, say n = 3, 4, 5, and 6. Instead of running your algorithm four times in sequence, you could run four separate processes in parallel, one for each value of n. This doesn't make any single calculation faster, but it dramatically reduces the total wall-clock time.

Let's sketch this out with Python's multiprocessing library.

import multiprocessing

def worker_task(args):
    """A wrapper function for the pool to call."""
    text, n = args
    print(f"Process {os.getpid()} starting work for n={n}")
    # Any of our functions could be used here
    counts = find_repeats_sliding_window(text, n)
    print(f"Process {os.getpid()} finished work for n={n}")
    return n, counts.most_common(1)

def find_repeats_parallel(text: str, n_values: list):
    """
    Finds most common n-grams for multiple n in parallel.
    """
    # Prepare arguments for each worker process
    tasks = [(text, n) for n in n_values]

    # Create a pool of worker processes
    # It will default to the number of CPU cores on your machine
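    # Note (environment assumption): on platforms that default to the "spawn"
    # start method (Windows, macOS), pool creation should happen inside an
    # `if __name__ == "__main__":` guard when this is run as a script.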
    with multiprocessing.Pool() as pool:
        results = pool.map(worker_task, tasks)

    return results

# Let's use our medium text again
with open('medium.txt', 'r') as f:
    medium_text = f.read()

# Run the analysis for n=3, 4, 5, 6 in parallel
parallel_results = find_repeats_parallel(medium_text[:50000], [3, 4, 5, 6]) # Use a slice to keep it fast

print("\n--- Parallel Results ---")
for n, result in sorted(parallel_results):
    print(f"Most common for n={n}: {result}")

Output:

Process 23641 starting work for n=3
Process 23642 starting work for n=4
Process 23643 starting work for n=5
Process 23644 starting work for n=6
Process 23641 finished work for n=3
Process 23642 finished work for n=4
Process 23643 finished work for n=5
Process 23644 finished work for n=6

--- Parallel Results ---
Most common for n=3: [('n b', 382)]
Most common for n=4: [('n be', 382)]
Most common for n=5: [('n ben', 382)]
Most common for n=6: [('n benc', 381)]

Notice how different Process IDs (PIDs) start work simultaneously. If each task took 1 second, the total wall-clock time would be roughly 1 second rather than 4 (assuming the machine has at least four cores). This is a powerful hybrid strategy that combines a sequential algorithm with parallel execution.

Production Perspective

Pattern: "Start simple, profile, then optimize the bottleneck." This is a core philosophy in production engineering. You don't start with a complex hybrid model.

  1. You begin with the simplest possible solution (e.g., sliding_window).
  2. You deploy it and monitor its performance using tools like the benchmarks we built. This is called profiling.
  3. If—and only if—profiling shows it's a performance bottleneck, you investigate why.
  4. You then implement a more sophisticated solution, like a hybrid dispatcher or a parallelized version, to target that specific bottleneck.

Hybrid approaches are powerful, but they add complexity. The code for our dispatcher and parallel runner is more complex than the simple sliding_window function. That complexity is a cost, and it should only be paid when there is a clear, measured performance benefit.

Production Considerations

Learning Objective

Apply algorithms effectively within the constraints of real-world software systems and teams.

Why This Matters

An algorithm doesn't exist in a vacuum. It exists as part of a larger system, maintained by a team of engineers, and relied upon by users. The "best" algorithm isn't just about speed or memory; it's about balancing performance with maintainability, readability, and testability. This final piece of the puzzle is often what separates a good academic solution from a great production solution.

Core Production Tensions

1. Maintainability vs. Performance

This is the central trade-off. Let's compare the code for our two most practical algorithms.

Sliding Window:

def find_repeats_sliding_window(text: str, n: int = 5) -> collections.Counter:
    counts = collections.Counter()
    for i in range(len(text) - n + 1):
        ngram = text[i:i+n]
        counts[ngram] += 1
    return counts

Rolling Hash:

def find_repeats_rolling_hash(text: str, n: int = 5) -> collections.Counter:
    # ... setup for BASE, MOD, power ...
    # ... calculate initial hash ...
    for i in range(1, len(text) - n + 1):
        current_hash = (current_hash - ord(text[i-1]) * power) % MOD
        current_hash = (current_hash * BASE + ord(text[i+n-1])) % MOD
        # ... collision checking and storing ...
    # ... convert hashes back to ngrams ...
    return counts

The Production Rule: "Your code will be read 10 times more often than it is written." A complex but performant piece of code imposes a tax on every future developer who has to touch it. Therefore, you should only choose the more complex solution if you have benchmark data that proves the simpler one is too slow for the business requirement.

2. Testing Strategies

How do you ensure your algorithm is correct? A reliable strategy is cross-validation between implementations: run the simple, obviously-correct Sliding Window and the optimized algorithm on the same inputs and assert that their outputs match, including edge cases such as an empty string, text shorter than n, and a single repeated character. The sketch below shows the idea.
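Here is a minimal sketch of such a cross-check; the test function name and the sample inputs are illustrative.

def test_rolling_hash_matches_sliding_window():
    """The optimized algorithm must agree with the obviously-correct baseline."""
    samples = [
        "",                                           # empty text
        "abc",                                        # shorter than n
        "aaaaaaaaaa",                                 # a single repeated character
        "the cat sat on the mat, the cat sat still",  # some genuine repeats
    ]
    for text in samples:
        expected = find_repeats_sliding_window(text, n=5)
        actual = find_repeats_rolling_hash(text, n=5)
        assert actual == expected, f"Mismatch for input: {text!r}"

test_rolling_hash_matches_sliding_window()
print("Cross-validation passed.")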

3. Monitoring and Profiling

In a live system, you don't just deploy code and hope it's fast enough. You monitor it: record how long each call takes and how much memory it uses, watch those numbers over time, and alert when they drift beyond acceptable limits.

This is essentially our benchmarking harness applied continuously to a live production environment.
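A lightweight way to get that visibility is to wrap the hot function in a timing decorator and send the measurement to whatever logging or metrics system the service already uses. A minimal sketch, with an illustrative logger name and service function:

import functools
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("ngram_service")

def timed(func):
    """Logs the wall-clock duration of every call to func."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        try:
            return func(*args, **kwargs)
        finally:
            duration_ms = (time.monotonic() - start) * 1000
            logger.info("%s took %.2f ms", func.__name__, duration_ms)
    return wrapper

@timed
def analyze_upload(text: str):
    # In the real service, this would call whichever algorithm we selected.
    return find_repeats_sliding_window(text)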

4. Team Considerations (The "Bus Factor")

The "bus factor" is a thought experiment: "How many people on your team could get hit by a bus before the project is in trouble?"

If you are the only person who understands the complex Suffix Array implementation with Kasai's algorithm, your team has a bus factor of 1. If you leave, no one can maintain that code.

This is a powerful argument for choosing simpler, more idiomatic code whenever possible. The "good enough" Sliding Window implementation has a much higher bus factor because any Python developer can understand and maintain it. A core responsibility of a senior engineer is to write code that reduces the bus factor, not increases it.

The Final Mantra: "The most performant algorithm that is incomprehensible to your team is worse than a good-enough algorithm that everyone can help maintain."

# Clean up the generated files at the end of the module
print("Cleaning up generated text files...")
for file_name in ['small.txt', 'medium.txt', 'large.txt']:
    if os.path.exists(file_name):
        os.remove(file_name)
        print(f"Removed {file_name}")
print("Cleanup complete.")

Output:

Cleaning up generated text files...
Removed small.txt
Removed medium.txt
Removed large.txt
Cleanup complete.

Module Synthesis

Module 5 Synthesis: From Theory to Practice

This module shifted our focus from how algorithms work to which algorithm to use and why. This is one of the most important transitions in a developer's journey—moving from a purely technical understanding to making strategic engineering decisions.

We established a four-part framework for this analysis:

  1. Benchmarking: We built a repeatable, fair system to measure time and memory, proving that you can't reason about performance without data.
  2. Scale Analysis: We saw how performance profiles change dramatically with input size. The "best" algorithm for 1KB of text was the worst for 5MB. Overhead dominates for small inputs, while asymptotic complexity dominates for large inputs.
  3. Decision Frameworks: We translated our benchmark results into a practical decision tree, a mental model for quickly selecting the right approach based on constraints like data size, memory, and latency requirements.
  4. Production Realities: We contextualized algorithm choice within a real-world engineering team, balancing raw performance against critical factors like code maintainability, testability, and team knowledge.

Looking Forward

You are now equipped not just to implement complex algorithms but to choose between them intelligently. You can justify your choice of a "simple" algorithm with data, and you know when to invest in a more complex solution because you understand the trade-offs.

In the final module, Module 6: Extensions and Real-World Applications, we will take these powerful tools and apply them to fascinating, large-scale problems. We'll move beyond abstract n-gram counting and see how these exact same algorithms power everything from plagiarism detection and DNA sequence analysis to real-time log monitoring. You have built the foundation; now it's time to see what you can build on top of it.