Part IV: HTML and DOM Structures
Chapter 9: BeautifulSoup - Irregular Tree Navigation
HTML as an Irregular Tree
Let's start with a simple observation: when you look at a file system, you see a predictable pattern—directories contain files and other directories. JSON follows clear rules too—objects contain key-value pairs, arrays contain items.
HTML? HTML is messier.
<div class="article">
<h1>Title Here</h1>
Some random text
<p>A paragraph</p>
<!-- A comment -->
<div class="meta">
<span>Author</span>
More floating text
</div>
</div>
Notice what's happening here. You have:
- Tags (elements like <div> and <p>)
- Text nodes (that "Some random text" just sitting there)
- Comments (which you probably want to ignore)
- Attributes (like class="article")
Unlike a file system where everything is either a file or a directory, HTML mixes different types of things at the same level. A <div> might contain other tags, bare text, and comments—all as siblings.
Why does this matter for traversal? Because you can't assume that everything you encounter during traversal has the same properties. Not everything has a .name attribute. Not everything has .children. This irregularity is what makes HTML traversal different.
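To see this mixing concretely, here is a minimal sketch (using the snippet above) that walks every node and reports what kind of thing it is. Don't worry about the BeautifulSoup details yet; we cover them next.
from bs4 import BeautifulSoup

html = """
<div class="article">
<h1>Title Here</h1>
Some random text
<p>A paragraph</p>
<!-- A comment -->
<div class="meta">
<span>Author</span>
More floating text
</div>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# Every node reports its type: Tag, NavigableString, or Comment
for node in soup.div.descendants:
    print(type(node).__name__, repr(str(node))[:40])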
BeautifulSoup's Traversal Arsenal
Let's explore the object first. Create a simple HTML document and parse it:
from bs4 import BeautifulSoup
html = """
<html>
<body>
<div class="container">
<h1>Welcome</h1>
<p class="intro">First paragraph</p>
<p>Second paragraph</p>
</div>
</body>
</html>
"""
soup = BeautifulSoup(html, 'html.parser')
Now, the exploration protocol:
print(type(soup)) # <class 'bs4.BeautifulSoup'>
print(type(soup.body)) # <class 'bs4.element.Tag'>
print(type(soup.h1.string)) # <class 'bs4.element.NavigableString'>
Already we see three different types. Let's explore what methods are available:
tag = soup.find('div')
print([m for m in dir(tag) if not m.startswith('_')])
You'll see dozens of methods. Let's focus on the ones that matter for traversal.
Finding Elements: .find() and .find_all()
The simplest way to navigate HTML is to ask: "Get me the thing I'm looking for."
# Find the first h1
h1 = soup.find('h1')
print(h1.text) # "Welcome"
# Find all paragraphs
paragraphs = soup.find_all('p')
for p in paragraphs:
print(p.text)
# Find by class (note: class_ because 'class' is a Python keyword)
intro = soup.find('p', class_='intro')
print(intro.text) # "First paragraph"
# Find by multiple criteria
intro_alt = soup.find('p', {'class': 'intro'})
What's happening here? BeautifulSoup is doing the traversal for you. It's walking the tree, checking each element, and returning matches. You didn't have to write any recursive code.
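One detail worth internalizing before moving on: the two methods fail differently when nothing matches. A quick sketch with the same soup as above:
# .find() returns None on a miss, so chaining off it can raise AttributeError
missing = soup.find('h2')
print(missing)               # None
# missing.text               # would raise AttributeError here

# .find_all() returns an empty list on a miss, so a loop simply runs zero times
print(soup.find_all('h2'))   # []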
CSS Selectors: .select()
If you're familiar with CSS, you can use selectors:
# Select by tag
paragraphs = soup.select('p')
# Select by class
intro = soup.select('.intro')
# Select by hierarchy
div_paragraphs = soup.select('div.container p')
# Select by attribute
intro_by_class = soup.select('p[class="intro"]')
This is powerful because CSS selectors are a rich language for describing patterns in HTML.
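A couple of related details, sketched with the same soup: .select() always returns a list, even for a single match, while .select_one() returns the first match or None.
# .select() always returns a list
print(soup.select('.intro'))          # [<p class="intro">First paragraph</p>]

# .select_one() returns one element, or None if nothing matches
intro = soup.select_one('p.intro')
print(intro.text)                     # "First paragraph"

# Child combinators and positional pseudo-classes work too
print(soup.select('div.container > p'))                          # both paragraphs
print(soup.select_one('div.container p:nth-of-type(2)').text)    # "Second paragraph"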
Manual Navigation: .children and .descendants
Sometimes you want to traverse manually. BeautifulSoup gives you two ways to see what's inside an element:
container = soup.find('div', class_='container')
# Direct children only
print("Children:")
for child in container.children:
print(f" {type(child).__name__}: {repr(child)[:50]}")
# All descendants (recursive)
print("\nDescendants:")
for descendant in container.descendants:
print(f" {type(descendant).__name__}: {repr(descendant)[:50]}")
The difference:
- .children gives you only the immediate children (one level down)
- .descendants gives you everything nested inside (recursive)
Run this and observe the output. You'll see that .children includes text nodes (those whitespace characters between tags), while .descendants walks the entire tree.
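If the difference still feels abstract, counting makes it concrete. A small sketch using the same container (the exact counts depend on whitespace, so they aren't hard-coded here):
children = list(container.children)
descendants = list(container.descendants)

print(len(children))      # the three tags plus the whitespace text nodes between them
print(len(descendants))   # strictly more: the same nodes plus everything nested inside

# NavigableString subclasses str, so this cleanly separates bare text from tags
text_nodes = [c for c in children if isinstance(c, str)]
print(len(text_nodes))    # the newline/indentation strings between the tags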
Going Up: .parent and .parents
You can also traverse upward:
p = soup.find('p', class_='intro')
# Immediate parent
print(p.parent.name) # "div"
# All parents up to the root
for parent in p.parents:
if hasattr(parent, 'name'):
print(parent.name)
This is useful when you find an element and need to know its context: "What section am I in? What container holds this?"
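When you know what you're climbing toward, .find_parent() (and its plural, .find_parents()) is usually more direct than walking .parents yourself. A brief sketch with the same soup:
# Jump straight from the intro paragraph to its nearest enclosing <div>
enclosing_div = p.find_parent('div')
print(enclosing_div.get('class'))        # ['container']

# You can also match on attributes while climbing
same = p.find_parent('div', class_='container')
print(same is enclosing_div)             # True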
The Text Node Gotcha
Here's a common mistake:
container = soup.find('div', class_='container')
for child in container.children:
    print(child.name)  # Careful: text nodes trip this up!
Why? Because .children includes text nodes (NavigableString objects), and text nodes aren't tags. Depending on your BeautifulSoup version, asking a text node for .name either raises an AttributeError or just gives you None; either way, it is not the tag-by-tag listing you wanted. Only Tag objects carry real names.
The fix:
for child in container.children:
if hasattr(child, 'name'): # Check if it's a Tag
print(child.name)
else:
print(f"Text: {repr(child.strip())}")
Or, more elegantly:
for child in container.children:
if child.name: # Tag names are truthy, text nodes are None
print(child.name)
This is crucial: Always remember that not everything in an HTML tree is a tag. When you traverse manually, check types or check for the presence of attributes before accessing them.
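If you want the check to be fully explicit, and independent of how a given BeautifulSoup version exposes .name on text nodes, test the node's type directly. Tag and NavigableString both live in bs4.element:
from bs4.element import NavigableString, Tag

for child in container.children:
    if isinstance(child, Tag):
        print("Tag:", child.name)
    elif isinstance(child, NavigableString):
        print("Text:", repr(child.strip()))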
Common Pattern: Finding All of Something
Let's build a simple but realistic example. You're scraping a page with product listings:
html = """
<div class="products">
<div class="product">
<h2>Widget A</h2>
<span class="price">$19.99</span>
</div>
<div class="product">
<h2>Widget B</h2>
<span class="price">$29.99</span>
</div>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
# The simple way: let BeautifulSoup do it
products = []
for product_div in soup.find_all('div', class_='product'):
name = product_div.find('h2').text
price = product_div.find('span', class_='price').text
products.append({'name': name, 'price': price})
print(products)
Notice what we're NOT doing: We're not writing recursive traversal code. We're using the library's traversal methods (.find_all() and .find()) to navigate the structure.
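A small refinement worth knowing here: .text concatenates nested text exactly as it appears, while .get_text() accepts strip and separator arguments, which helps when an element contains stray whitespace or several nested pieces. A sketch on the same product soup:
for product_div in soup.find_all('div', class_='product'):
    # strip=True trims surrounding whitespace from the extracted text
    name = product_div.find('h2').get_text(strip=True)
    price = product_div.find('span', class_='price').get_text(strip=True)
    print(name, price)                                        # "Widget A $19.99"

    # Or flatten a whole block at once, controlling how pieces are joined
    print(product_div.get_text(separator=' | ', strip=True))  # "Widget A | $19.99"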
Exercise: Extracting Structured Data from a Wikipedia Page
Here's your challenge: Given a Wikipedia article's HTML (or any article with headings and paragraphs), extract a structured representation where each section heading is a key, and the paragraphs under it are values.
Starting point:
html = """
<article>
<h2>Introduction</h2>
<p>First intro paragraph.</p>
<p>Second intro paragraph.</p>
<h2>History</h2>
<p>First history paragraph.</p>
<h3>Early Period</h3>
<p>Early history details.</p>
<h2>Modern Era</h2>
<p>Modern details.</p>
</article>
"""
soup = BeautifulSoup(html, 'html.parser')
# Your task: Build a dictionary like:
# {
# 'Introduction': ['First intro paragraph.', 'Second intro paragraph.'],
# 'History': ['First history paragraph.'],
# 'History > Early Period': ['Early history details.'],
# 'Modern Era': ['Modern details.']
# }
Hints:
- Start by finding all headings
- For each heading, collect following paragraphs until you hit the next heading
- Use .find_next_siblings() or traverse the siblings manually (a short illustration of the method follows below)
- Handle nested headings (h2 vs h3) by tracking context
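If .find_next_siblings() is unfamiliar, here is just the method in isolation (not a solution): it returns the siblings that come after an element, optionally filtered by tag name.
first_h2 = soup.find('h2')   # "Introduction"

# ALL following <p> siblings, not just those before the next heading;
# deciding where to stop is your job
print([p.text for p in first_h2.find_next_siblings('p')])

# .next_siblings lets you walk forward and break when you choose
for sibling in first_h2.next_siblings:
    if getattr(sibling, 'name', None) in ('h2', 'h3'):
        break   # reached the next heading; stop collecting
    # ... collect paragraphs here ...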
Try it yourself first. The act of figuring out which BeautifulSoup methods to use is the learning process.
Chapter 10: Custom HTML Traversal
When BeautifulSoup Isn't Enough
BeautifulSoup's built-in methods handle most cases. But sometimes you need something more:
Scenario 1: Context Tracking
You're scraping a page where the meaning of an element depends on which section it's in:
<div class="section" data-type="features">
<a href="/feature1">Feature Link</a>
</div>
<div class="section" data-type="pricing">
<a href="/price1">Pricing Link</a>
</div>
You want to classify links differently based on their containing section. BeautifulSoup can find all links, but it won't automatically tell you which section each belongs to.
Scenario 2: Custom Aggregation
You're building a table of contents and need to maintain the hierarchy of headings (h1, h2, h3) with proper nesting. BeautifulSoup finds the headings, but you need custom logic to build the nested structure.
Scenario 3: Stateful Processing
You're extracting data where later elements modify earlier ones. For example, a page might list items followed by modifiers:
<ul>
<li>Item A</li>
<li>Item B</li>
<li class="discount">20% off above items</li>
</ul>
You need to apply the discount information back to previous items.
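One way to handle this, sketched below (the class name and discount wording are just the example's): collect items as you go, and when a modifier shows up, apply it back to everything collected so far. The state variable is the list itself.
from bs4 import BeautifulSoup

html = """
<ul>
<li>Item A</li>
<li>Item B</li>
<li class="discount">20% off above items</li>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')

items = []   # state: every item seen so far in this list
for li in soup.find_all('li'):
    classes = li.get('class') or []
    if 'discount' in classes:
        # A modifier: apply it backwards to the items already collected
        for item in items:
            item['note'] = li.text.strip()
    else:
        items.append({'name': li.text.strip()})

print(items)
# [{'name': 'Item A', 'note': '20% off above items'},
#  {'name': 'Item B', 'note': '20% off above items'}]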
Building Context-Aware Traversal
Let's tackle Scenario 1 with a simple but effective pattern:
html = """
<article>
<section data-category="tech">
<h2>Technology</h2>
<a href="/ai">AI Article</a>
<a href="/blockchain">Blockchain Article</a>
</section>
<section data-category="science">
<h2>Science</h2>
<a href="/physics">Physics Article</a>
</section>
</article>
"""
soup = BeautifulSoup(html, 'html.parser')
# First approach: Use parent traversal
links_by_category = {}
for link in soup.find_all('a'):
# Find the containing section
section = link.find_parent('section')
if section and section.get('data-category'):
category = section['data-category']
if category not in links_by_category:
links_by_category[category] = []
links_by_category[category].append({
'text': link.text,
'href': link['href']
})
print(links_by_category)
What's happening: We find all links using BeautifulSoup, then for each link, we traverse upward to find its context (the containing section).
Alternative approach: Context-first traversal
links_by_category = {}
for section in soup.find_all('section'):
category = section.get('data-category')
if category:
links_by_category[category] = []
# Find all links within THIS section
for link in section.find_all('a'):
links_by_category[category].append({
'text': link.text,
'href': link['href']
})
Which is better? The second approach. Why? Because it maintains clear context as we traverse. We establish the section first, then process its contents. This is more readable and less error-prone.
The principle: When you need context, structure your traversal to establish context first, then process items within that context.
Handling Nested Hierarchies
Now let's tackle Scenario 2: building a nested table of contents from headings.
html = """
<article>
<h1>Main Title</h1>
<p>Intro</p>
<h2>First Section</h2>
<p>Content</p>
<h3>Subsection A</h3>
<p>More content</p>
<h3>Subsection B</h3>
<p>Even more</p>
<h2>Second Section</h2>
<p>Final content</p>
</article>
"""
soup = BeautifulSoup(html, 'html.parser')
We want to produce:
{
'title': 'Main Title',
'sections': [
{
'title': 'First Section',
'subsections': [
{'title': 'Subsection A'},
{'title': 'Subsection B'}
]
},
{
'title': 'Second Section',
'subsections': []
}
]
}
Let's write this iteratively, building our understanding step by step:
def build_toc_simple(soup):
"""First attempt: Just collect headings with levels"""
headings = []
for tag in soup.find_all(['h1', 'h2', 'h3']):
level = int(tag.name[1]) # Extract the number from 'h1', 'h2', etc.
headings.append({
'level': level,
'text': tag.text.strip()
})
return headings
# Test it
print(build_toc_simple(soup))
This gives us a flat list. Now let's add nesting logic:
def build_toc_nested(soup):
"""Second attempt: Build nested structure"""
result = {'sections': []}
current_h2 = None
for tag in soup.find_all(['h1', 'h2', 'h3']):
level = int(tag.name[1])
text = tag.text.strip()
if level == 1:
result['title'] = text
elif level == 2:
current_h2 = {'title': text, 'subsections': []}
result['sections'].append(current_h2)
elif level == 3:
if current_h2: # Make sure we have an h2 to attach to
current_h2['subsections'].append({'title': text})
return result
print(build_toc_nested(soup))
What changed? We added state tracking (current_h2). As we iterate through headings, we maintain context about which section we're currently filling.
The principle: When HTML structure implies hierarchy but elements are siblings in the DOM, use state variables to track where you are in your output structure.
Handling Missing Data Gracefully
Real HTML is messy. Attributes might be missing. Expected tags might not exist. Let's make our code robust:
def safe_find_text(element, selector, default="N/A"):
    """Find an element via a CSS selector and return its text, or the default."""
    found = element.select_one(selector) if hasattr(element, 'select_one') else None
    return found.text.strip() if found else default
def safe_get_attr(element, attr, default=None):
"""Get an attribute, or return default"""
return element.get(attr, default) if element else default
# Usage:
html = """
<div class="product">
<h2>Widget</h2>
<!-- Note: price is missing -->
</div>
<div class="product">
<h2>Gadget</h2>
<span class="price">$29.99</span>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
for product in soup.find_all('div', class_='product'):
name = safe_find_text(product, 'h2', 'Unnamed')
price = safe_find_text(product, 'span.price', 'Price not available')
print(f"{name}: {price}")
The pattern: Wrap your element access in functions that handle the None case. This prevents crashes and gives you control over default values.
Try-Except vs. Defensive Checks:
# Approach 1: Try-Except
try:
price = product.find('span', class_='price').text
except AttributeError:
price = 'N/A'
# Approach 2: Defensive Check
price_tag = product.find('span', class_='price')
price = price_tag.text if price_tag else 'N/A'
Which is better? The defensive check (Approach 2) is clearer in this case. You're explicitly asking "does this exist?" rather than attempting and catching. Use try-except when the exceptional case is truly rare or when you're catching multiple potential errors at once.
Practical Pattern: Table Extraction with Context
Tables are a common scraping target. Let's extract a table while maintaining context:
html = """
<table>
<thead>
<tr>
<th>Product</th>
<th>Price</th>
<th>Stock</th>
</tr>
</thead>
<tbody>
<tr>
<td>Widget A</td>
<td>$19.99</td>
<td>In Stock</td>
</tr>
<tr>
<td>Widget B</td>
<td>$29.99</td>
<td>Out of Stock</td>
</tr>
</tbody>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')
def extract_table(soup, table_selector='table'):
    # select_one accepts a CSS selector, so callers can pass something more specific later
    table = soup.select_one(table_selector)
    if not table:
        return []
# Get headers
    headers = []
    thead = table.find('thead')
    if thead:
        headers = [th.text.strip() for th in thead.find_all('th')]
# Get rows
rows = []
tbody = table.find('tbody')
if tbody:
for tr in tbody.find_all('tr'):
cells = [td.text.strip() for td in tr.find_all('td')]
if len(cells) == len(headers):
row_dict = dict(zip(headers, cells))
rows.append(row_dict)
return rows
data = extract_table(soup)
print(data)
What makes this robust?
- Checks if table exists before accessing
- Safely extracts headers
- Validates row length matches header count
- Returns structured dictionaries, not raw cells
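Real pages often omit <thead> and <tbody> entirely. A hedged variant, assuming the header is simply the first row and that header cells may be <th> or <td>:
def extract_table_loose(table):
    """Like extract_table, but tolerates tables without <thead>/<tbody>."""
    rows = table.find_all('tr')
    if not rows:
        return []
    # Assume the first row holds the headers, wherever it happens to live
    headers = [cell.text.strip() for cell in rows[0].find_all(['th', 'td'])]
    data = []
    for tr in rows[1:]:
        cells = [td.text.strip() for td in tr.find_all('td')]
        if len(cells) == len(headers):
            data.append(dict(zip(headers, cells)))
    return data

print(extract_table_loose(soup.find('table')))   # same result as extract_table here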
Exercise: Building a Web Scraper with Context
Here's your challenge: Build a scraper that extracts article information from a blog-style page, where articles are in different categories and you need to track which category each article belongs to.
html = """
<main>
<section class="category" data-name="Technology">
<h2>Tech Articles</h2>
<article>
<h3>AI Breakthrough</h3>
<p class="author">By Alice</p>
<p class="summary">Summary here...</p>
</article>
<article>
<h3>Quantum Computing</h3>
<p class="author">By Bob</p>
<p class="summary">Another summary...</p>
</article>
</section>
<section class="category" data-name="Science">
<h2>Science Articles</h2>
<article>
<h3>New Discovery</h3>
<!-- Note: author is missing here -->
<p class="summary">Science summary...</p>
</article>
</section>
</main>
"""
soup = BeautifulSoup(html, 'html.parser')
# Your task: Build a list of dictionaries like:
# [
# {
# 'category': 'Technology',
# 'title': 'AI Breakthrough',
# 'author': 'Alice',
# 'summary': 'Summary here...'
# },
# ...
# ]
Requirements:
- Extract all articles with their category context
- Handle missing authors gracefully (use "Unknown" as default)
- Return a clean list of dictionaries
- Make it robust—don't crash if structure varies
Approach:
- Iterate through category sections first (establish context)
- For each section, extract the category name
- Find all articles within that section
- Extract title, author (with fallback), and summary
- Build the output dictionary
Try writing this yourself before looking at solutions. The process of deciding which BeautifulSoup methods to use and how to structure your traversal is where learning happens.
Key Takeaways from Part IV:
- HTML is irregularly structured — not everything is a tag, and not everything has the same properties
- BeautifulSoup does most traversal for you — use .find(), .find_all(), and .select() before writing custom traversal
- Check types before accessing attributes — use hasattr() or check for .name to avoid crashes on text nodes
- Context-first traversal — when you need context, establish it first, then process children
- Defensive coding — wrap element access in safe functions that handle missing data
- State tracking — use variables to maintain context when building nested structures from flat DOM siblings
The pattern is the same as before: explore the object, use library methods when available, write custom logic only when needed, and always handle the messy reality of real-world data.