Part IV: HTML and DOM Structures
Chapter 9: BeautifulSoup - Irregular Tree Navigation
HTML as an Irregular Tree
Let's start with a simple observation: when you look at a file system, you see a predictable pattern—directories contain files and other directories. JSON follows clear rules too—objects contain key-value pairs, arrays contain items.
HTML? HTML is messier.
<div class="article">
<h1>Title Here</h1>
Some random text
<p>A paragraph</p>
<!-- A comment -->
<div class="meta">
<span>Author</span>
More floating text
</div>
</div>
Notice what's happening here. You have:
- Tags (elements like <div> and <p>)
- Text nodes (that "Some random text" just sitting there)
- Comments (which you probably want to ignore)
- Attributes (like class="article")
Unlike a file system where everything is either a file or a directory, HTML mixes different types of things at the same level. A <div> might contain other tags, bare text, and comments—all as siblings.
Why does this matter for traversal? Because you can't assume that everything you encounter during traversal has the same properties. Not everything has a .name attribute. Not everything has .children. This irregularity is what makes HTML traversal different.
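To see this mixing concretely, here is a minimal sketch (using the snippet above) that walks every node and reports what kind of thing it is. Don't worry about the BeautifulSoup details yet; we cover them next.
from bs4 import BeautifulSoup

html = """
<div class="article">
<h1>Title Here</h1>
Some random text
<p>A paragraph</p>
<!-- A comment -->
<div class="meta">
<span>Author</span>
More floating text
</div>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# Every node reports its type: Tag, NavigableString, or Comment
for node in soup.div.descendants:
    print(type(node).__name__, repr(str(node))[:40])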
BeautifulSoup's Traversal Arsenal
Let's explore the object first. Create a simple HTML document and parse it:
from bs4 import BeautifulSoup
html = """
<html>
<body>
<div class="container">
<h1>Welcome</h1>
<p class="intro">First paragraph</p>
<p>Second paragraph</p>
</div>
</body>
</html>
"""
soup = BeautifulSoup(html, 'html.parser')
Now, the exploration protocol:
print(type(soup)) # <class 'bs4.BeautifulSoup'>
print(type(soup.body)) # <class 'bs4.element.Tag'>
print(type(soup.h1.string)) # <class 'bs4.element.NavigableString'>
Already we see three different types. Let's explore what methods are available:
tag = soup.find('div')
print([m for m in dir(tag) if not m.startswith('_')])
You'll see dozens of methods. Let's focus on the ones that matter for traversal.
Finding Elements: .find() and .find_all()
The simplest way to navigate HTML is to ask: "Get me the thing I'm looking for."
# Find the first h1
h1 = soup.find('h1')
print(h1.text) # "Welcome"
# Find all paragraphs
paragraphs = soup.find_all('p')
for p in paragraphs:
print(p.text)
# Find by class (note: class_ because 'class' is a Python keyword)
intro = soup.find('p', class_='intro')
print(intro.text) # "First paragraph"
# Find by multiple criteria
intro_alt = soup.find('p', {'class': 'intro'})
What's happening here? BeautifulSoup is doing the traversal for you. It's walking the tree, checking each element, and returning matches. You didn't have to write any recursive code.
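One detail worth internalizing before moving on: the two methods fail differently when nothing matches. A quick sketch with the same soup as above:
# .find() returns None on a miss, so chaining off it can raise AttributeError
missing = soup.find('h2')
print(missing)               # None
# missing.text               # would raise AttributeError here

# .find_all() returns an empty list on a miss, so a loop simply runs zero times
print(soup.find_all('h2'))   # []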
CSS Selectors: .select()
If you're familiar with CSS, you can use selectors:
# Select by tag
paragraphs = soup.select('p')
# Select by class
intro = soup.select('.intro')
# Select by hierarchy
div_paragraphs = soup.select('div.container p')
# Select by attribute
intro_by_class = soup.select('p[class="intro"]')
This is powerful because CSS selectors are a rich language for describing patterns in HTML.
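A couple of related details, sketched with the same soup: .select() always returns a list, even for a single match, while .select_one() returns the first match or None.
# .select() always returns a list
print(soup.select('.intro'))          # [<p class="intro">First paragraph</p>]

# .select_one() returns one element, or None if nothing matches
intro = soup.select_one('p.intro')
print(intro.text)                     # "First paragraph"

# Child combinators and positional pseudo-classes work too
print(soup.select('div.container > p'))                          # both paragraphs
print(soup.select_one('div.container p:nth-of-type(2)').text)    # "Second paragraph"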
Manual Navigation: .children and .descendants
Sometimes you want to traverse manually. BeautifulSoup gives you two ways to see what's inside an element:
container = soup.find('div', class_='container')
# Direct children only
print("Children:")
for child in container.children:
print(f" {type(child).__name__}: {repr(child)[:50]}")
# All descendants (recursive)
print("\nDescendants:")
for descendant in container.descendants:
print(f" {type(descendant).__name__}: {repr(descendant)[:50]}")
The difference:
- .children gives you only the immediate children (one level down)
- .descendants gives you everything nested inside (recursive)
Run this and observe the output. You'll see that .children includes text nodes (those whitespace characters between tags), while .descendants walks the entire tree.
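If the difference still feels abstract, counting makes it concrete. A small sketch using the same container (the exact counts depend on whitespace, so they aren't hard-coded here):
children = list(container.children)
descendants = list(container.descendants)

print(len(children))      # the three tags plus the whitespace text nodes between them
print(len(descendants))   # strictly more: the same nodes plus everything nested inside

# NavigableString subclasses str, so this cleanly separates bare text from tags
text_nodes = [c for c in children if isinstance(c, str)]
print(len(text_nodes))    # the newline/indentation strings between the tags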
Going Up: .parent and .parents
You can also traverse upward:
p = soup.find('p', class_='intro')
# Immediate parent
print(p.parent.name) # "div"
# All parents up to the root
for parent in p.parents:
if hasattr(parent, 'name'):
print(parent.name)
This is useful when you find an element and need to know its context: "What section am I in? What container holds this?"
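When you know what you're climbing toward, .find_parent() (and its plural, .find_parents()) is usually more direct than walking .parents yourself. A brief sketch with the same soup:
# Jump straight from the intro paragraph to its nearest enclosing <div>
enclosing_div = p.find_parent('div')
print(enclosing_div.get('class'))        # ['container']

# You can also match on attributes while climbing
same = p.find_parent('div', class_='container')
print(same is enclosing_div)             # True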
The Text Node Gotcha
Here's a common mistake:
container = soup.find('div', class_='container')
for child in container.children:
    print(child.name)  # Careful: text nodes trip this up!
Why? Because .children includes text nodes (NavigableString objects), and text nodes aren't tags. Depending on your BeautifulSoup version, asking a text node for .name either raises an AttributeError or just gives you None; either way, it is not the tag-by-tag listing you wanted. Only Tag objects carry real names.
The fix:
for child in container.children:
if hasattr(child, 'name'): # Check if it's a Tag
print(child.name)
else:
print(f"Text: {repr(child.strip())}")
Or, more elegantly:
for child in container.children:
if child.name: # Tag names are truthy, text nodes are None
print(child.name)
This is crucial: Always remember that not everything in an HTML tree is a tag. When you traverse manually, check types or check for the presence of attributes before accessing them.
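If you want the check to be fully explicit, and independent of how a given BeautifulSoup version exposes .name on text nodes, test the node's type directly. Tag and NavigableString both live in bs4.element:
from bs4.element import NavigableString, Tag

for child in container.children:
    if isinstance(child, Tag):
        print("Tag:", child.name)
    elif isinstance(child, NavigableString):
        print("Text:", repr(child.strip()))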
Common Pattern: Finding All of Something
Let's build a simple but realistic example. You're scraping a page with product listings:
html = """
<div class="products">
<div class="product">
<h2>Widget A</h2>
<span class="price">$19.99</span>
</div>
<div class="product">
<h2>Widget B</h2>
<span class="price">$29.99</span>
</div>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
# The simple way: let BeautifulSoup do it
products = []
for product_div in soup.find_all('div', class_='product'):
name = product_div.find('h2').text
price = product_div.find('span', class_='price').text
products.append({'name': name, 'price': price})
print(products)
Notice what we're NOT doing: We're not writing recursive traversal code. We're using the library's traversal methods (.find_all() and .find()) to navigate the structure.
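A small refinement worth knowing here: .text concatenates nested text exactly as it appears, while .get_text() accepts strip and separator arguments, which helps when an element contains stray whitespace or several nested pieces. A sketch on the same product soup:
for product_div in soup.find_all('div', class_='product'):
    # strip=True trims surrounding whitespace from the extracted text
    name = product_div.find('h2').get_text(strip=True)
    price = product_div.find('span', class_='price').get_text(strip=True)
    print(name, price)                                        # "Widget A $19.99"

    # Or flatten a whole block at once, controlling how pieces are joined
    print(product_div.get_text(separator=' | ', strip=True))  # "Widget A | $19.99"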
Exercise: Extracting Structured Data from a Wikipedia Page
Here's your challenge: Given a Wikipedia article's HTML (or any article with headings and paragraphs), extract a structured representation where each section heading is a key, and the paragraphs under it are values.
Starting point:
html = """
<article>
<h2>Introduction</h2>
<p>First intro paragraph.</p>
<p>Second intro paragraph.</p>
<h2>History</h2>
<p>First history paragraph.</p>
<h3>Early Period</h3>
<p>Early history details.</p>
<h2>Modern Era</h2>
<p>Modern details.</p>
</article>
"""
soup = BeautifulSoup(html, 'html.parser')
# Your task: Build a dictionary like:
# {
# 'Introduction': ['First intro paragraph.', 'Second intro paragraph.'],
# 'History': ['First history paragraph.'],
# 'History > Early Period': ['Early history details.'],
# 'Modern Era': ['Modern details.']
# }
Hints:
- Start by finding all headings
- For each heading, collect following paragraphs until you hit the next heading
- Use .find_next_siblings() or traverse the siblings manually (a short illustration of the method follows below)
- Handle nested headings (h2 vs h3) by tracking context
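If .find_next_siblings() is unfamiliar, here is just the method in isolation (not a solution): it returns the siblings that come after an element, optionally filtered by tag name.
first_h2 = soup.find('h2')   # "Introduction"

# ALL following <p> siblings, not just those before the next heading;
# deciding where to stop is your job
print([p.text for p in first_h2.find_next_siblings('p')])

# .next_siblings lets you walk forward and break when you choose
for sibling in first_h2.next_siblings:
    if getattr(sibling, 'name', None) in ('h2', 'h3'):
        break   # reached the next heading; stop collecting
    # ... collect paragraphs here ...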
Try it yourself first. The act of figuring out which BeautifulSoup methods to use is the learning process.
Chapter 10: Custom HTML Traversal
When BeautifulSoup Isn't Enough
BeautifulSoup's built-in methods handle most cases. But sometimes you need something more:
Scenario 1: Context Tracking
You're scraping a page where the meaning of an element depends on which section it's in:
<div class="section" data-type="features">
<a href="/feature1">Feature Link</a>
</div>
<div class="section" data-type="pricing">
<a href="/price1">Pricing Link</a>
</div>
You want to classify links differently based on their containing section. BeautifulSoup can find all links, but it won't automatically tell you which section each belongs to.
Scenario 2: Custom Aggregation
You're building a table of contents and need to maintain the hierarchy of headings (h1, h2, h3) with proper nesting. BeautifulSoup finds the headings, but you need custom logic to build the nested structure.
Scenario 3: Stateful Processing
You're extracting data where later elements modify earlier ones. For example, a page might list items followed by modifiers:
<ul>
<li>Item A</li>
<li>Item B</li>
<li class="discount">20% off above items</li>
</ul>
You need to apply the discount information back to previous items.
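One way to handle this, sketched below (the class name and discount wording are just the example's): collect items as you go, and when a modifier shows up, apply it back to everything collected so far. The state variable is the list itself.
from bs4 import BeautifulSoup

html = """
<ul>
<li>Item A</li>
<li>Item B</li>
<li class="discount">20% off above items</li>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')

items = []   # state: every item seen so far in this list
for li in soup.find_all('li'):
    classes = li.get('class') or []
    if 'discount' in classes:
        # A modifier: apply it backwards to the items already collected
        for item in items:
            item['note'] = li.text.strip()
    else:
        items.append({'name': li.text.strip()})

print(items)
# [{'name': 'Item A', 'note': '20% off above items'},
#  {'name': 'Item B', 'note': '20% off above items'}]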
Building Context-Aware Traversal
Let's tackle Scenario 1 with a simple but effective pattern:
html = """
<article>
<section data-category="tech">
<h2>Technology</h2>
<a href="/ai">AI Article</a>
<a href="/blockchain">Blockchain Article</a>
</section>
<section data-category="science">
<h2>Science</h2>
<a href="/physics">Physics Article</a>
</section>
</article>
"""
soup = BeautifulSoup(html, 'html.parser')
# First approach: Use parent traversal
links_by_category = {}
for link in soup.find_all('a'):
# Find the containing section
section = link.find_parent('section')
if section and section.get('data-category'):
category = section['data-category']
if category not in links_by_category:
links_by_category[category] = []
links_by_category[category].append({
'text': link.text,
'href': link['href']
})
print(links_by_category)
What's happening: We find all links using BeautifulSoup, then for each link, we traverse upward to find its context (the containing section).
Alternative approach: Context-first traversal
links_by_category = {}
for section in soup.find_all('section'):
category = section.get('data-category')
if category:
links_by_category[category] = []
# Find all links within THIS section
for link in section.find_all('a'):
links_by_category[category].append({
'text': link.text,
'href': link['href']
})
Which is better? The second approach. Why? Because it maintains clear context as we traverse. We establish the section first, then process its contents. This is more readable and less error-prone.
The principle: When you need context, structure your traversal to establish context first, then process items within that context.
Handling Nested Hierarchies
Now let's tackle Scenario 2: building a nested table of contents from headings.
html = """
<article>
<h1>Main Title</h1>
<p>Intro</p>
<h2>First Section</h2>
<p>Content</p>
<h3>Subsection A</h3>
<p>More content</p>
<h3>Subsection B</h3>
<p>Even more</p>
<h2>Second Section</h2>
<p>Final content</p>
</article>
"""
soup = BeautifulSoup(html, 'html.parser')
We want to produce:
{
'title': 'Main Title',
'sections': [
{
'title': 'First Section',
'subsections': [
{'title': 'Subsection A'},
{'title': 'Subsection B'}
]
},
{
'title': 'Second Section',
'subsections': []
}
]
}
Let's write this iteratively, building our understanding step by step:
def build_toc_simple(soup):
"""First attempt: Just collect headings with levels"""
headings = []
for tag in soup.find_all(['h1', 'h2', 'h3']):
level = int(tag.name[1]) # Extract the number from 'h1', 'h2', etc.
headings.append({
'level': level,
'text': tag.text.strip()
})
return headings
# Test it
print(build_toc_simple(soup))
This gives us a flat list. Now let's add nesting logic:
def build_toc_nested(soup):
"""Second attempt: Build nested structure"""
result = {'sections': []}
current_h2 = None
for tag in soup.find_all(['h1', 'h2', 'h3']):
level = int(tag.name[1])
text = tag.text.strip()
if level == 1:
result['title'] = text
elif level == 2:
current_h2 = {'title': text, 'subsections': []}
result['sections'].append(current_h2)
elif level == 3:
if current_h2: # Make sure we have an h2 to attach to
current_h2['subsections'].append({'title': text})
return result
print(build_toc_nested(soup))
What changed? We added state tracking (current_h2). As we iterate through headings, we maintain context about which section we're currently filling.
The principle: When HTML structure implies hierarchy but elements are siblings in the DOM, use state variables to track where you are in your output structure.
Handling Missing Data Gracefully
Real HTML is messy. Attributes might be missing. Expected tags might not exist. Let's make our code robust:
def safe_find_text(element, selector, default="N/A"):
    """Find an element via a CSS selector and return its text, or the default."""
    found = element.select_one(selector) if hasattr(element, 'select_one') else None
    return found.text.strip() if found else default
def safe_get_attr(element, attr, default=None):
"""Get an attribute, or return default"""
return element.get(attr, default) if element else default
# Usage:
html = """
<div class="product">
<h2>Widget</h2>
<!-- Note: price is missing -->
</div>
<div class="product">
<h2>Gadget</h2>
<span class="price">$29.99</span>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
for product in soup.find_all('div', class_='product'):
name = safe_find_text(product, 'h2', 'Unnamed')
price = safe_find_text(product, 'span.price', 'Price not available')
print(f"{name}: {price}")
The pattern: Wrap your element access in functions that handle the None case. This prevents crashes and gives you control over default values.
Try-Except vs. Defensive Checks:
# Approach 1: Try-Except
try:
price = product.find('span', class_='price').text
except AttributeError:
price = 'N/A'
# Approach 2: Defensive Check
price_tag = product.find('span', class_='price')
price = price_tag.text if price_tag else 'N/A'
Which is better? The defensive check (Approach 2) is clearer in this case. You're explicitly asking "does this exist?" rather than attempting and catching. Use try-except when the exceptional case is truly rare or when you're catching multiple potential errors at once.
Practical Pattern: Table Extraction with Context
Tables are a common scraping target. Let's extract a table while maintaining context:
html = """
<table>
<thead>
<tr>
<th>Product</th>
<th>Price</th>
<th>Stock</th>
</tr>
</thead>
<tbody>
<tr>
<td>Widget A</td>
<td>$19.99</td>
<td>In Stock</td>
</tr>
<tr>
<td>Widget B</td>
<td>$29.99</td>
<td>Out of Stock</td>
</tr>
</tbody>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')
def extract_table(soup, table_selector='table'):
    # select_one accepts a CSS selector, so callers can pass something more specific later
    table = soup.select_one(table_selector)
    if not table:
        return []
# Get headers
    headers = []
    thead = table.find('thead')
    if thead:
        headers = [th.text.strip() for th in thead.find_all('th')]
# Get rows
rows = []
tbody = table.find('tbody')
if tbody:
for tr in tbody.find_all('tr'):
cells = [td.text.strip() for td in tr.find_all('td')]
if len(cells) == len(headers):
row_dict = dict(zip(headers, cells))
rows.append(row_dict)
return rows
data = extract_table(soup)
print(data)
What makes this robust?
- Checks if table exists before accessing
- Safely extracts headers
- Validates row length matches header count
- Returns structured dictionaries, not raw cells
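Real pages often omit <thead> and <tbody> entirely. A hedged variant, assuming the header is simply the first row and that header cells may be <th> or <td>:
def extract_table_loose(table):
    """Like extract_table, but tolerates tables without <thead>/<tbody>."""
    rows = table.find_all('tr')
    if not rows:
        return []
    # Assume the first row holds the headers, wherever it happens to live
    headers = [cell.text.strip() for cell in rows[0].find_all(['th', 'td'])]
    data = []
    for tr in rows[1:]:
        cells = [td.text.strip() for td in tr.find_all('td')]
        if len(cells) == len(headers):
            data.append(dict(zip(headers, cells)))
    return data

print(extract_table_loose(soup.find('table')))   # same result as extract_table here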
Exercise: Building a Web Scraper with Context
Here's your challenge: Build a scraper that extracts article information from a blog-style page, where articles are in different categories and you need to track which category each article belongs to.
html = """
<main>
<section class="category" data-name="Technology">
<h2>Tech Articles</h2>
<article>
<h3>AI Breakthrough</h3>
<p class="author">By Alice</p>
<p class="summary">Summary here...</p>
</article>
<article>
<h3>Quantum Computing</h3>
<p class="author">By Bob</p>
<p class="summary">Another summary...</p>
</article>
</section>
<section class="category" data-name="Science">
<h2>Science Articles</h2>
<article>
<h3>New Discovery</h3>
<!-- Note: author is missing here -->
<p class="summary">Science summary...</p>
</article>
</section>
</main>
"""
soup = BeautifulSoup(html, 'html.parser')
# Your task: Build a list of dictionaries like:
# [
# {
# 'category': 'Technology',
# 'title': 'AI Breakthrough',
# 'author': 'Alice',
# 'summary': 'Summary here...'
# },
# ...
# ]
Requirements:
- Extract all articles with their category context
- Handle missing authors gracefully (use "Unknown" as default)
- Return a clean list of dictionaries
- Make it robust—don't crash if structure varies
Approach:
- Iterate through category sections first (establish context)
- For each section, extract the category name
- Find all articles within that section
- Extract title, author (with fallback), and summary
- Build the output dictionary
Try writing this yourself before looking at solutions. The process of deciding which BeautifulSoup methods to use and how to structure your traversal is where learning happens.
Key Takeaways from Part IV:
- HTML is irregularly structured — not everything is a tag, and not everything has the same properties
- BeautifulSoup does most traversal for you — use .find(), .find_all(), and .select() before writing custom traversal
- Check types before accessing attributes — use hasattr() or check for .name to avoid crashes on text nodes
- Context-first traversal — when you need context, establish it first, then process children
- Defensive coding — wrap element access in safe functions that handle missing data
- State tracking — use variables to maintain context when building nested structures from flat DOM siblings
The pattern is the same as before: explore the object, use library methods when available, write custom logic only when needed, and always handle the messy reality of real-world data.