Part III: JSON and Nested Dictionaries


Chapter 6: Navigating JSON Structures

From File Paths to Dictionary Keys

You already know how to navigate a file system. When you see a path like /home/user/documents/report.pdf, you understand there's a hierarchy: the home directory contains user, which contains documents, which contains report.pdf.

JSON structures work the same way, but instead of directories and files, you have dictionaries and values:

data = {
    "home": {
        "user": {
            "documents": {
                "report": "Here's the content"
            }
        }
    }
}

The mental shift is simple:

When you typed os.path.join('home', 'user', 'documents'), you were building a path. When you type data['home']['user']['documents'], you're doing the exact same thing—just with a different syntax.
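
Side by side, the parallel looks like this (a minimal sketch, reusing the data dictionary defined above):

import os

# File system: build a path, then follow it
path = os.path.join('home', 'user', 'documents')  # e.g. 'home/user/documents'

# JSON: each bracket descends one level, like one path component
documents = data['home']['user']['documents']
print(documents['report'])  # Here's the content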

Direct Access Patterns

Let's start with a realistic example. Here's a response from a weather API:

weather = {
    "city": "San Francisco",
    "current": {
        "temperature": 68,
        "conditions": "Partly cloudy",
        "wind": {
            "speed": 12,
            "direction": "NW"
        }
    },
    "forecast": [
        {"day": "Monday", "high": 72, "low": 58},
        {"day": "Tuesday", "high": 70, "low": 56}
    ]
}

Problem: You want the current temperature.

First approach (the obvious path):

temperature = weather["current"]["temperature"]
print(temperature)  # 68

This works perfectly when you know the data structure and you're confident the keys exist. Each bracket ["key"] is like moving down one directory level.

But what if the API doesn't always include current conditions? Your code crashes:

# Imagine 'current' is missing
weather = {"city": "San Francisco"}
temperature = weather["current"]["temperature"]  # KeyError!

Second approach (the safer path):

temperature = weather.get("current", {}).get("temperature")
print(temperature)  # None (instead of crashing)

Let's pause and understand what just happened. The .get() method does two things:

  1. It retrieves the value if the key exists
  2. It returns a default value (None, or whatever you specify) if the key doesn't exist

The pattern weather.get("current", {}) says: "Get me the 'current' dictionary, but if it doesn't exist, give me an empty dictionary instead." Then we immediately call .get("temperature") on that result. If current was missing, we're calling .get("temperature") on an empty dict, which safely returns None.
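
You can see both behaviors of .get() in isolation:

current = {"temperature": 68}

print(current.get("temperature"))     # 68 (key exists)
print(current.get("humidity"))        # None (missing key, implicit default)
print(current.get("humidity", 50))    # 50 (missing key, explicit default)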

Why is this better? Because your program keeps running. You can make decisions based on whether the data exists:

temperature = weather.get("current", {}).get("temperature")

if temperature is not None:
    print(f"It's {temperature}°F")
else:
    print("Temperature data unavailable")

Third approach (with meaningful defaults):

temperature = weather.get("current", {}).get("temperature", "Unknown")
wind_speed = weather.get("current", {}).get("wind", {}).get("speed", 0)

Now let's look at that forecast list. You want Tuesday's high temperature:

# Direct access (assuming you know Tuesday is at index 1)
tuesday_high = weather["forecast"][1]["high"]

# But what if forecast is empty or too short?
tuesday = weather.get("forecast", [])[1] if len(weather.get("forecast", [])) > 1 else None

Wait—that's getting messy. Let's refactor:

forecast = weather.get("forecast", [])
if len(forecast) > 1:
    tuesday_high = forecast[1]["high"]
else:
    tuesday_high = None

Better. We checked first, then accessed. This is defensive navigation: always verify before you dive deeper.
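
If you do this often, you can fold the check into a small helper. This is just a sketch; safe_index is a name introduced here for illustration, not a standard function:

def safe_index(seq, index, default=None):
    # Return seq[index] if the index is in range, else the default
    return seq[index] if 0 <= index < len(seq) else default

tuesday = safe_index(weather.get("forecast", []), 1)
tuesday_high = tuesday["high"] if tuesday else None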

When You Know the Schema

The approaches above work brilliantly when you know what the JSON looks like. This is common in three scenarios:

1. API responses with documentation

You're calling a GitHub API that returns:

{
    "repository": {
        "name": "my-project",
        "owner": {
            "login": "username"
        }
    }
}

The documentation tells you this structure is guaranteed, so you can confidently write:

owner = data["repository"]["owner"]["login"]

2. Configuration files you control

Your application has a config.json:

{
    "database": {
        "host": "localhost",
        "port": 5432
    }
}

You created this file, you know its structure, so direct access is fine:

db_host = config["database"]["host"]

3. Data you just created

user = {
    "name": "Alice",
    "preferences": {
        "theme": "dark"
    }
}

theme = user["preferences"]["theme"]  # You just made this, you know it's there

In all these cases, direct access with chained brackets is perfect. It's clear, it's fast, and it's exactly as safe as it needs to be.

But notice what's common: you know the path ahead of time. You know ["repository"]["owner"]["login"] will get you the username. This is like knowing the exact file path /home/user/documents/report.pdf before you look for it.

Exercise: Extracting Data from a Complex API Response

Here's a simplified response from a music API:

response = {
    "artist": {
        "name": "Miles Davis",
        "albums": [
            {
                "title": "Kind of Blue",
                "year": 1959,
                "tracks": [
                    {"name": "So What", "duration": 542},
                    {"name": "Freddie Freeloader", "duration": 591}
                ]
            },
            {
                "title": "Bitches Brew",
                "year": 1970,
                "tracks": [
                    {"name": "Pharaoh's Dance", "duration": 1220}
                ]
            }
        ]
    }
}

Your tasks:

  1. Extract the artist name (direct access)
  2. Get the title of the second album (list indexing)
  3. Find the duration of "So What" (nested access)
  4. Get the year of the first album, with a default of 0 if missing (safe access)
  5. Calculate the total duration of all tracks in "Kind of Blue"

Try solving these before looking ahead. Remember: you know the structure, so start with the straightforward approach. Only add safety checks where you think the data might genuinely be missing.

Solutions:

# 1. Artist name
artist_name = response["artist"]["name"]

# 2. Second album title
second_album = response["artist"]["albums"][1]["title"]

# 3. Duration of "So What" (first track of first album)
so_what_duration = response["artist"]["albums"][0]["tracks"][0]["duration"]

# 4. Year with default
first_album = response["artist"]["albums"][0]
year = first_album.get("year", 0)

# 5. Total duration of Kind of Blue
kind_of_blue = response["artist"]["albums"][0]
total = sum(track["duration"] for track in kind_of_blue["tracks"])
print(f"Total duration: {total} seconds")

Notice how natural this feels once you see the pattern. albums[0]["tracks"][0] reads like "first album, first track"—exactly what you're thinking in your head.

Now let's introduce a problem. What if the API sometimes doesn't include track durations?

# More defensive version
albums = response.get("artist", {}).get("albums", [])
tracks = albums[0].get("tracks", []) if albums else []
total = sum(track.get("duration", 0) for track in tracks)

See how we added .get() calls and an emptiness check only where things might be missing? The artist dict might not exist, the albums list might be absent or empty, tracks might not exist, duration might not exist. Indexing with [0] is only safe once we know the list is non-empty, which is why we check albums first.

This is the judgment you develop: use direct access where you're confident, add safety where you're not.


Chapter 7: Searching Nested JSON

The Problem: Unknown Structure

Everything in Chapter 6 assumed you knew the path: ["current"]["temperature"] or ["artist"]["albums"][0]. But what if you don't?

Here's a real scenario. You're given a complex configuration file:

config = {
    "version": "2.0",
    "database": {
        "host": "localhost",
        "credentials": {
            "username": "admin",
            "password": "secret123"
        }
    },
    "cache": {
        "redis": {
            "host": "localhost",
            "port": 6379
        }
    },
    "services": [
        {
            "name": "api",
            "host": "localhost",
            "port": 8000
        },
        {
            "name": "worker",
            "host": "192.168.1.100"
        }
    ]
}

Your task: Find all occurrences of "localhost" in this config.

You can see them: in database.host, cache.redis.host, and services[0].host. But how do you write code to find them without knowing where they are ahead of time?

This is where recursive traversal comes in. Let's build it step by step.

Let's start with the most obvious approach:

def find_value(data, target):
    # If this is the value we're looking for, we found it
    if data == target:
        return True

    # If it's a dictionary, check all values
    if isinstance(data, dict):
        for value in data.values():
            if find_value(value, target):
                return True

    # If it's a list, check all items
    if isinstance(data, list):
        for item in data:
            if find_value(item, target):
                return True

    return False

# Test it
print(find_value(config, "localhost"))  # True

This works! Let's understand what just happened:

  1. We check if the current data is what we're looking for
  2. If it's a dictionary, we recursively search all its values
  3. If it's a list, we recursively search all its items
  4. Otherwise, it's a primitive (string, number, etc.) and we're done

This is the same pattern as os.walk(), but adapted for dictionaries and lists instead of directories and files. Remember the four questions?
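
For comparison, here's roughly what the same search looks like on a file system, a sketch built on os.walk (find_file is a name introduced here for illustration):

import os

def find_file(root, target):
    # Visit every directory under root and check every filename,
    # just as find_value visits every value in the structure
    for dirpath, dirnames, filenames in os.walk(root):
        if target in filenames:
            return True
    return False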

But this function only returns True or False. What if you want to know where you found the value?

Second Attempt: Tracking the Path

def find_paths(data, target, current_path="root"):
    results = []

    # Check if current data matches
    if data == target:
        results.append(current_path)

    # Explore dictionary
    if isinstance(data, dict):
        for key, value in data.items():
            new_path = f"{current_path}.{key}"
            results.extend(find_paths(value, target, new_path))

    # Explore list
    if isinstance(data, list):
        for index, item in enumerate(data):
            new_path = f"{current_path}[{index}]"
            results.extend(find_paths(item, target, new_path))

    return results

# Test it
paths = find_paths(config, "localhost")
for path in paths:
    print(path)

Output:

root.database.host
root.cache.redis.host
root.services[0].host

Much better! Now we can see exactly where each occurrence lives. Let's trace through one path to understand the recursion:

  1. Start at root, data is the full config dict
  2. Not a match, but it's a dict, so explore its values
  3. Find key "database", recurse with path "root.database"
  4. That's a dict too, explore its values
  5. Find key "host", recurse with path "root.database.host"
  6. Value is "localhost" — that's a match! Add the path to results

Notice how each recursive call builds the path: "root" → "root.database" → "root.database.host". This is like building a file path as you descend directories.

But there's a subtle issue. Let's say you want to find all values for the key "host", not a specific value. Our current function won't work:

# This won't help us find all "host" keys
find_paths(config, "host")  # Only finds if the VALUE is "host"

We need a different function.

Third Attempt: Searching by Key

def find_by_key(data, target_key, current_path="root"):
    results = []

    if isinstance(data, dict):
        for key, value in data.items():
            new_path = f"{current_path}.{key}"

            # If this key matches, record it
            if key == target_key:
                results.append((new_path, value))

            # Keep searching deeper
            results.extend(find_by_key(value, target_key, new_path))

    if isinstance(data, list):
        for index, item in enumerate(data):
            new_path = f"{current_path}[{index}]"
            results.extend(find_by_key(item, target_key, new_path))

    return results

# Test it
hosts = find_by_key(config, "host")
for path, value in hosts:
    print(f"{path} = {value}")

Output:

root.database.host = localhost
root.cache.redis.host = localhost
root.services[0].host = localhost
root.services[1].host = 192.168.1.100

Perfect! Now we're finding keys, not values. And notice we return tuples: (path, value), so you get both the location and the data.

Let's pause and see what changed: instead of comparing each value against a target (data == target), we now compare each key (key == target_key), and we record the value alongside its path.

This distinction, searching by value versus searching by key, is crucial. In file systems, you typically search by filename (like a key). In JSON, you might search by either.
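
You can see the difference directly with the two functions we just wrote:

# By value: no value in config equals "host"
print(find_paths(config, "host"))  # []

# By key: four entries use "host" as a key
for path, value in find_by_key(config, "host"):
    print(path, "->", value)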

Building Reusable Search Functions

Let's create a more flexible tool. What if you want to find all numbers greater than 1000? Or all dictionaries that have a "name" key?

def find_matching(data, predicate, current_path="root"):
    """
    Find all values where predicate(value) returns True.

    Args:
        data: The structure to search
        predicate: A function that takes a value and returns bool
        current_path: Current location (for tracking)

    Returns:
        List of (path, value) tuples
    """
    results = []

    # Check current value
    if predicate(data):
        results.append((current_path, data))

    # Recurse into structure
    if isinstance(data, dict):
        for key, value in data.items():
            new_path = f"{current_path}.{key}"
            results.extend(find_matching(value, predicate, new_path))

    if isinstance(data, list):
        for index, item in enumerate(data):
            new_path = f"{current_path}[{index}]"
            results.extend(find_matching(item, predicate, new_path))

    return results

# Now you can search for anything
# Find all ports
ports = find_matching(config, lambda v: isinstance(v, int) and v > 1000)

# Find all "localhost" strings
localhosts = find_matching(config, lambda v: v == "localhost")

# Find all dictionaries with a "name" key
named_objects = find_matching(
    config,
    lambda v: isinstance(v, dict) and "name" in v
)

This is powerful! The predicate is a function you pass in that decides what matches. This pattern is called a callback function or predicate function.
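
The predicate doesn't have to be a lambda; any function of one argument works:

def is_private_ip(value):
    # Matches strings like "192.168.1.100"
    return isinstance(value, str) and value.startswith("192.168.")

print(find_matching(config, is_private_ip))
# [('root.services[1].host', '192.168.1.100')]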

Let's refactor one more time to make it even clearer:

def traverse_json(data, callback, current_path="root"):
    """
    Visit every value in a JSON structure.

    Args:
        data: The structure to traverse
        callback: Called for each value as callback(path, value)
        current_path: Current location
    """
    # Visit current value
    callback(current_path, data)

    # Recurse into children
    if isinstance(data, dict):
        for key, value in data.items():
            traverse_json(value, callback, f"{current_path}.{key}")

    elif isinstance(data, list):
        for index, item in enumerate(data):
            traverse_json(item, callback, f"{current_path}[{index}]")

# Use it
def print_all(path, value):
    print(f"{path}: {type(value).__name__} = {value}")

traverse_json(config, print_all)

Now you have a generic traversal function! You can plug in any logic:

# Collect all string values
strings = []
traverse_json(config, lambda p, v: strings.append(v) if isinstance(v, str) else None)

# Count depth
max_depth = [0]  # Using list to modify in callback
def track_depth(path, value):
    depth = path.count('.') + path.count('[')
    max_depth[0] = max(max_depth[0], depth)

traverse_json(config, track_depth)

Compare this to what we did in Chapter 6. There, we knew the path: config["database"]["host"]. Here, we're exploring every path to find what we need.

Exercise: Building a JSON Query Tool

Here's a nested dataset representing a small company:

company = {
    "name": "TechCorp",
    "employees": [
        {
            "name": "Alice",
            "role": "Engineer",
            "salary": 95000,
            "projects": ["API", "Database"]
        },
        {
            "name": "Bob",
            "role": "Designer",
            "salary": 85000,
            "projects": ["Website"]
        },
        {
            "name": "Charlie",
            "role": "Engineer",
            "salary": 105000,
            "projects": ["API", "Mobile"]
        }
    ],
    "departments": {
        "engineering": {
            "budget": 500000,
            "head": "Alice"
        },
        "design": {
            "budget": 200000,
            "head": "Bob"
        }
    }
}

Your tasks:

  1. Find all employees with salary > 90000
  2. Collect all unique project names
  3. Find the total of all budgets
  4. Find all dictionaries that have a "name" key

Try implementing these using the patterns we've built. Here's a starter:

# 1. High earners
high_earners = []
def check_salary(path, value):
    if isinstance(value, dict) and "salary" in value:
        if value["salary"] > 90000:
            high_earners.append(value["name"])

traverse_json(company, check_salary)
print("High earners:", high_earners)

Solutions:

# 1. High earners (improved version)
def find_high_earners(data, threshold):
    results = []
    def check(path, value):
        if isinstance(value, dict) and value.get("salary", 0) > threshold:
            results.append(value["name"])
    traverse_json(data, check)
    return results

print(find_high_earners(company, 90000))  # ['Alice', 'Charlie']

# 2. Unique projects
projects = set()
def collect_projects(path, value):
    if isinstance(value, list) and "projects" in path:
        projects.update(value)

traverse_json(company, collect_projects)
print(sorted(projects))  # ['API', 'Database', 'Mobile', 'Website']

# 3. Total budget
total = [0]
def sum_budgets(path, value):
    if "budget" in path and isinstance(value, int):
        total[0] += value

traverse_json(company, sum_budgets)
print(f"Total budget: ${total[0]}")  # $700000

# 4. Dictionaries with "name" key
named = []
def find_named(path, value):
    if isinstance(value, dict) and "name" in value:
        named.append((path, value["name"]))

traverse_json(company, find_named)
for path, name in named:
    print(f"{path}: {name}")

Notice how each solution follows the same structure:

  1. Define what you're collecting
  2. Write a callback that checks and collects
  3. Traverse the data with your callback
  4. Return or print the results

This is the power of separating traversal (visiting everything) from logic (deciding what to do at each node).


Chapter 8: Transforming Nested Data

Structure Reshaping Patterns

So far we've navigated and searched. Now let's transform data—take one structure and build a different one.

The problem: You have a nested API response and need to flatten it for a database or spreadsheet:

api_response = {
    "users": [
        {
            "id": 1,
            "name": "Alice",
            "address": {
                "city": "Portland",
                "state": "OR"
            }
        },
        {
            "id": 2,
            "name": "Bob",
            "address": {
                "city": "Seattle",
                "state": "WA"
            }
        }
    ]
}

Goal: Convert this into a flat list:

[
    {"id": 1, "name": "Alice", "city": "Portland", "state": "OR"},
    {"id": 2, "name": "Bob", "city": "Seattle", "state": "WA"}
]

First approach (when you know the structure):

flat_users = []
for user in api_response["users"]:
    flat_user = {
        "id": user["id"],
        "name": user["name"],
        "city": user["address"]["city"],
        "state": user["address"]["state"]
    }
    flat_users.append(flat_user)

This works great! It's explicit, readable, and handles this specific structure perfectly. But what if the address is sometimes missing?

Second approach (defensive):

flat_users = []
for user in api_response["users"]:
    address = user.get("address", {})
    flat_user = {
        "id": user.get("id"),
        "name": user.get("name"),
        "city": address.get("city"),
        "state": address.get("state")
    }
    flat_users.append(flat_user)

Better. Now if address is missing, we get an empty dict, and city and state become None instead of crashing.

But let's say the structure varies wildly—sometimes address is nested, sometimes it's flat, sometimes there's additional metadata. Writing custom code for every variation is tedious.

Let's build something more general.

Generic Flattening

Here's a function that flattens any nested dictionary:

def flatten_dict(data, parent_key='', separator='_'):
    """
    Flatten a nested dictionary.

    Example:
        {"a": {"b": 1}} -> {"a_b": 1}
    """
    items = []

    for key, value in data.items():
        new_key = f"{parent_key}{separator}{key}" if parent_key else key

        if isinstance(value, dict):
            # Recurse into nested dict
            items.extend(flatten_dict(value, new_key, separator).items())
        elif isinstance(value, list):
            # For simplicity, convert list to comma-separated string
            items.append((new_key, ', '.join(map(str, value))))
        else:
            items.append((new_key, value))

    return dict(items)

# Test it
user = {
    "id": 1,
    "name": "Alice",
    "address": {
        "city": "Portland",
        "state": "OR"
    }
}

flat = flatten_dict(user)
print(flat)
# {'id': 1, 'name': 'Alice', 'address_city': 'Portland', 'address_state': 'OR'}

Look at the recursion pattern:

  1. For each key-value pair, build a new key name (address_city)
  2. If the value is a dict, recurse with the new key as the parent
  3. Otherwise, add the flattened key-value pair

Let's trace through one path:

  1. flatten_dict(user) reaches the key "address", whose value is a dict
  2. It recurses with parent_key="address"
  3. Inside the recursion, the key "city" becomes new_key "address_city"
  4. "Portland" is a primitive, so ("address_city", "Portland") is appended

This is recursive traversal in action! Instead of collecting matches, we're building a new structure.

Now apply it to the full dataset:

flat_users = []
for user in api_response["users"]:
    flat_users.append(flatten_dict(user))

for user in flat_users:
    print(user)

Output:

{'id': 1, 'name': 'Alice', 'address_city': 'Portland', 'address_state': 'OR'}
{'id': 2, 'name': 'Bob', 'address_city': 'Seattle', 'address_state': 'WA'}

Grouping by Nested Values

Here's another common transformation: you have a flat list and want to group it by some nested value.

The data:

sales = [
    {"product": "Widget", "region": {"name": "West", "id": 1}, "amount": 100},
    {"product": "Gadget", "region": {"name": "East", "id": 2}, "amount": 150},
    {"product": "Widget", "region": {"name": "West", "id": 1}, "amount": 200},
    {"product": "Gadget", "region": {"name": "West", "id": 1}, "amount": 120},
]

Goal: Group sales by region name.

First approach (direct):

from collections import defaultdict

grouped = defaultdict(list)
for sale in sales:
    region_name = sale["region"]["name"]
    grouped[region_name].append(sale)

for region, items in grouped.items():
    total = sum(item["amount"] for item in items)
    print(f"{region}: ${total}")

Output:

West: $420
East: $150

Clean and straightforward! When you know the structure (sale["region"]["name"]), direct access is perfect.

But what if you want to make this generic—group by any nested path?

Second approach (generic):

def get_nested_value(data, path):
    """
    Get a value from a nested dict using a path like "region.name".
    """
    keys = path.split('.')
    value = data
    for key in keys:
        if isinstance(value, dict):
            value = value.get(key)
            if value is None:
                return None
        else:
            return None
    return value

def group_by_nested(items, path):
    """
    Group items by a nested key path.
    """
    grouped = defaultdict(list)
    for item in items:
        key = get_nested_value(item, path)
        if key is not None:
            grouped[key].append(item)
    return dict(grouped)

# Use it
by_region = group_by_nested(sales, "region.name")
by_product = group_by_nested(sales, "product")

print("By region:")
for region, items in by_region.items():
    print(f"  {region}: {len(items)} sales")

The get_nested_value function walks a path by splitting on dots: "region.name" becomes ["region", "name"], then accesses data["region"]["name"].

This is like our directory traversal, but instead of recursing into all children, we're following one specific path.
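
Applied to a single record:

print(get_nested_value(sales[0], "region.name"))   # West
print(get_nested_value(sales[0], "region.owner"))  # None (no such key)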

Handling Missing Data Gracefully

Real-world data is messy. Keys are missing, values are null, types are inconsistent. Let's handle that.

The problem:

messy_data = [
    {"name": "Alice", "age": 30, "email": "alice@example.com"},
    {"name": "Bob", "email": None},  # Missing age, null email
    {"name": "Charlie", "age": "unknown"},  # Age is a string
    {"age": 40, "email": "nobody@example.com"}  # Missing name
]

Goal: Extract ages as integers, defaulting to 0 for missing or invalid data.

Try-Except approach:

ages = []
for person in messy_data:
    try:
        age = int(person["age"])
        ages.append(age)
    except (KeyError, ValueError, TypeError):
        ages.append(0)

print(ages)  # [30, 0, 0, 40]

This works. Any error results in 0: Bob's record raises KeyError (no "age" key), Charlie's raises ValueError (int("unknown") fails), and a null age would raise TypeError. The try-except catches all three and provides a fallback.

Defensive checks approach:

ages = []
for person in messy_data:
    age = person.get("age")
    if isinstance(age, int):
        ages.append(age)
    elif isinstance(age, str) and age.isdigit():
        ages.append(int(age))
    else:
        ages.append(0)

print(ages)  # [30, 0, 0, 40]

This is more explicit. We check each condition: is it already an int? Is it a string that looks like a number? Otherwise, default to 0.

Which is better?

It depends on your situation. Try-except is concise and catches failure modes you didn't anticipate; defensive checks are more verbose but make every accepted case explicit, which is easier to reason about when debugging unfamiliar data.

In practice, I often start with defensive checks while developing (so I see exactly what's happening), then refactor to try-except in production (so the code is more concise and resilient).

Here's a hybrid that logs problems:

ages = []
for person in messy_data:
    age = person.get("age")
    try:
        ages.append(int(age))
    except (ValueError, TypeError):
        print(f"Warning: Invalid age for {person.get('name', 'unknown')}: {age}")
        ages.append(0)

Now you see what's wrong while still handling it gracefully.
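
Run against messy_data, this still produces [30, 0, 0, 40] but also prints:

Warning: Invalid age for Bob: None
Warning: Invalid age for Charlie: unknown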

Performance: When to Use Libraries

Everything we've built works, but Python has libraries that do this faster and more robustly. Let's compare.

Our flatten function:

result = [flatten_dict(user) for user in api_response["users"]]

Using pandas:

import pandas as pd

# Normalize nested JSON
df = pd.json_normalize(api_response["users"])
print(df)

Output:

   id   name address.city address.state
0   1  Alice     Portland            OR
1   2    Bob      Seattle            WA

Pandas creates a DataFrame with columns like address.city automatically. If you need a list of dicts:

result = df.to_dict('records')

When to use which? We'll come back to that with a decision tree at the end of this chapter. First, let's see two more libraries in action.

Using jsonpath-ng for complex queries:

from jsonpath_ng.ext import parse  # .ext is needed for filter expressions like [?(...)]

# Find all cities anywhere in the structure
expr = parse('$..city')
cities = [match.value for match in expr.find(api_response)]
print(cities)  # ['Portland', 'Seattle']

# Find all users with a specific city
expr = parse('$.users[?(@.address.city=="Portland")]')
portland_users = [match.value for match in expr.find(api_response)]

This is powerful when you don't know the exact structure but know what you're looking for. The $..city means "find 'city' at any depth," similar to our recursive search.

Using glom for transformations:

from glom import glom, Coalesce

spec = {
    'users': ('users', [{
        'id': 'id',
        'name': 'name',
        'city': Coalesce('address.city', default='Unknown'),
        'state': 'address.state'
    }])
}

result = glom(api_response, spec)

Glom lets you declare the transformation you want. Coalesce handles missing data by trying multiple paths or providing defaults.
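
With the spec above, the result should mirror our flattening goal:

print(result["users"][0])
# {'id': 1, 'name': 'Alice', 'city': 'Portland', 'state': 'OR'}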

So which should you use?

Here's my decision tree:

  1. Do you know the exact structure? → Use direct access (data["key"]["subkey"])
  2. Is it a one-time script? → Custom code is fine, don't overthink it
  3. Do you need to flatten for CSV/database? → Use pandas
  4. Are you searching unknown structures repeatedly? → Use jsonpath-ng
  5. Do you have complex transformation rules? → Use glom

Most importantly: start simple, add complexity only when needed. A straightforward loop with direct access beats an over-engineered solution that nobody understands.

Exercise: Normalizing Messy API Data

You're building a dashboard that aggregates data from multiple social media APIs. Each has a different structure:

twitter_data = {
    "posts": [
        {
            "id": "t1",
            "user": {
                "username": "alice",
                "followers": 1500
            },
            "content": "Hello world!",
            "metrics": {
                "likes": 42,
                "retweets": 7
            }
        },
        {
            "id": "t2",
            "user": {
                "username": "bob",
                "followers": 300
            },
            "content": "Great day!",
            "metrics": {
                "likes": 15,
                "retweets": 2
            }
        }
    ]
}

instagram_data = {
    "items": [
        {
            "post_id": "i1",
            "author": "charlie",
            "author_followers": 5000,
            "text": "Check this out!",
            "engagement": {
                "likes": 230,
                "comments": 12
            }
        },
        {
            "post_id": "i2",
            "author": "diana",
            "author_followers": 800,
            "text": "Amazing!",
            "engagement": {
                "likes": 67
                # Note: comments missing
            }
        }
    ]
}

Your goal: Normalize both into this unified format:

[
    {
        "id": "t1",
        "platform": "twitter",
        "username": "alice",
        "followers": 1500,
        "content": "Hello world!",
        "likes": 42,
        "engagement_score": 49  # likes + retweets
    },
    # ... more posts
]

Requirements:

  1. Handle missing fields (like Instagram comments) gracefully
  2. Calculate engagement_score (Twitter: likes + retweets, Instagram: likes + comments)
  3. Default to 0 for missing numeric values

Try building this yourself before looking at the solution.

Here's a starter approach:

def normalize_twitter(post):
    # Extract and transform one Twitter post
    pass

def normalize_instagram(item):
    # Extract and transform one Instagram item
    pass

# Then combine
all_posts = []
for post in twitter_data.get("posts", []):
    all_posts.append(normalize_twitter(post))

for item in instagram_data.get("items", []):
    all_posts.append(normalize_instagram(item))

Solution:

def normalize_twitter(post):
    metrics = post.get("metrics", {})
    user = post.get("user", {})

    return {
        "id": post.get("id"),
        "platform": "twitter",
        "username": user.get("username"),
        "followers": user.get("followers", 0),
        "content": post.get("content", ""),
        "likes": metrics.get("likes", 0),
        "engagement_score": metrics.get("likes", 0) + metrics.get("retweets", 0)
    }

def normalize_instagram(item):
    engagement = item.get("engagement", {})

    return {
        "id": item.get("post_id"),
        "platform": "instagram",
        "username": item.get("author"),
        "followers": item.get("author_followers", 0),
        "content": item.get("text", ""),
        "likes": engagement.get("likes", 0),
        "engagement_score": engagement.get("likes", 0) + engagement.get("comments", 0)
    }

# Combine everything
all_posts = []

for post in twitter_data.get("posts", []):
    all_posts.append(normalize_twitter(post))

for item in instagram_data.get("items", []):
    all_posts.append(normalize_instagram(item))

# Sort by engagement
all_posts.sort(key=lambda p: p["engagement_score"], reverse=True)

# Display top posts
for post in all_posts[:3]:
    print(f"{post['platform']}: {post['username']} - {post['engagement_score']} engagement")

Output:

instagram: charlie - 242 engagement
instagram: diana - 67 engagement
twitter: alice - 49 engagement

Notice the patterns we used:

Now let's make a deliberate mistake and fix it with a test. Here's a version without proper defaults:

def normalize_twitter_buggy(post):
    # BUG: assumes every nested field always exists
    return {
        "id": post["id"],
        "platform": "twitter",
        "username": post["user"]["username"],
        "followers": post["user"]["followers"],
        "content": post["content"],
        "likes": post["metrics"]["likes"],
        "engagement_score": post["metrics"]["likes"] + post["metrics"]["retweets"]
    }

# This will crash if any field is missing!
broken_post = {"id": "t3", "user": {}, "content": "Test"}
try:
    result = normalize_twitter_buggy(broken_post)
except KeyError as e:
    print(f"ERROR: Missing key {e}")

The error exposes the bug. Now we can write a test that would have caught this:

def test_normalize_twitter():
    # Test with minimal data
    minimal_post = {
        "id": "t1",
        "user": {},
        "content": "Test"
    }

    result = normalize_twitter(minimal_post)

    # Should not crash and should have defaults
    assert result["id"] == "t1"
    assert result["followers"] == 0
    assert result["likes"] == 0
    assert result["engagement_score"] == 0
    print("âś“ Test passed: handles missing fields")

test_normalize_twitter()

This demonstrates The Virtuous Flaw from our teaching principles: by first showing code that breaks, then fixing it, you understand why the defensive .get() calls are necessary. It's not just defensive programming for its own sake—it's preventing real errors you've now experienced.

Bringing It All Together

Let's review what we've learned in Part III:

When you know the structure: use direct bracket access (data["key"]["subkey"]), and reach for .get() with a default only where a field might genuinely be missing.

When you're searching: recurse through dicts and lists, track the path as you go, and match by value, by key, or by an arbitrary predicate.

When you're transforming: flatten nested dicts recursively, group records by a nested path, and handle messy data with defaults, type checks, or try-except.

The mental model:

Everything we did with os.walk() applies here, just with different syntax. The pattern is universal: identify where you are, look at what's here, decide where to go next.

In the next part, we'll apply this same pattern to HTML trees with BeautifulSoup. You'll see the pattern again: nodes with children, but this time with text content, attributes, and irregular structure. The core concepts—direct access vs. recursive search, tracking paths, handling missing data—remain exactly the same.

The key insight is this: once you recognize the pattern, you can navigate any tree structure. JSON today, HTML tomorrow, ASTs after that. The details change, but the fundamental approach is identical.


End of Part III