Part V: Abstract Syntax Trees (Tree-sitter)

Chapter 11: Introduction to Tree-sitter

What Is an AST?

When Python reads your code, it doesn't just see text. It sees structure. Consider this simple function:

def greet(name):
    return f"Hello, {name}"

To you and me, this is a function definition. To Python, it's a tree of connected concepts: a function node, with a name child, a parameters child, and a body child. That body contains a return statement, which contains a formatted string, which itself has parts.

This tree structure—where every piece of code is a node with a type and children—is called an Abstract Syntax Tree (AST).

Why ASTs Matter

Think about what a code editor can do: it can find all function definitions, highlight syntax correctly, rename variables safely, or analyze complexity. None of this works by searching text. It all works by understanding structure.

When you want to:

Find every function call in a codebase
Extract all class definitions with their methods
Analyze import patterns
Build documentation automatically
Detect code patterns or anti-patterns

You need to work with the AST, not raw text.

Meet Tree-sitter

Tree-sitter is a parser generator that builds ASTs from source code. Unlike Python's built-in ast module (which only works for Python), tree-sitter supports dozens of languages with the same interface. You can analyze Python, JavaScript, Rust, or Go using identical patterns.

Let's get started:

from tree_sitter import Parser, Language
import tree_sitter_python

# Create a parser for Python
parser = Parser()
parser.set_language(tree_sitter_python.language())

# Parse some code
source_code = b"""
def calculate(x, y):
    result = x + y
    return result
"""

tree = parser.parse(source_code)
print(tree.root_node)

Output:

<Node type=module, start_point=(0, 0), end_point=(4, 0)>

Notice we got a Node object. This is our entry point—the root of the tree.

Understanding Tree-sitter Nodes

Every node in a tree-sitter AST has these key attributes:

.type - A string describing what kind of node this is:

print(tree.root_node.type)  # 'module'

.children - A list of child nodes:

for child in tree.root_node.children:
    print(f"  {child.type}")
# function_definition

.start_byte and .end_byte - Where this node appears in the source:

func_node = tree.root_node.children[0]
print(func_node.start_byte, func_node.end_byte)  # 1, 51

.text - The actual source code for this node (as bytes):

print(func_node.text)
# b'def calculate(x, y):\n    result = x + y\n    return result'

Our First Traversal: Seeing the Structure

Let's write a simple function to visualize the tree. Notice how we'll approach this—not trying to build something perfect immediately, but starting with something straightforward that works:

def print_tree(node, indent=0):
    # Show this node
    print("  " * indent + f"{node.type}")

    # Show all children
    for child in node.children:
        print_tree(child, indent + 1)

# Try it
tree = parser.parse(source_code)
print_tree(tree.root_node)

Output:

module
  function_definition
    def
    identifier
    parameters
      (
      identifier
      ,
      identifier
      )
    :
    block
      expression_statement
        assignment
          identifier
          =
          binary_operator
            identifier
            +
            identifier
      return_statement
        return
        identifier

Look at this output carefully. Do you see the pattern? Every part of your code—keywords like def and return, operators like +, even punctuation like : and ,—becomes a node in the tree.

The Mental Shift

You've already learned this pattern with file systems and JSON:

File system: Directories contain files and subdirectories
JSON: Dictionaries contain keys with values (which might be more dictionaries)
AST: Nodes have types and contain child nodes

It's the same recursive structure we've been working with. The only new piece is that instead of directory names or dictionary keys, we now have node types like function_definition or binary_operator.

Exercise: Visualizing Your Own Code

Take this challenge:

Parse a simple Python file you wrote recently
Use print_tree() to visualize its structure
Find one pattern in the output that surprised you
Identify where in the tree a specific line of your code appears

Start with something small—maybe 10-15 lines. The goal is to build intuition about how code becomes structure.

Hint: When you look at the output, try to match each node back to a piece of your source code. What node type represents a function call? An if statement? A list comprehension?

Chapter 12: Tree-sitter Access Patterns

Now that we understand what an AST node is, let's explore how to navigate them. Tree-sitter gives us three distinct approaches, and choosing the right one matters.

The Simple Path: Direct Access

Most of the time, working with .children is all you need:

source = b"x = 5 + 3"
tree = parser.parse(source)
root = tree.root_node

# Navigate down the tree
assignment = root.children[0]  # expression_statement
actual_assignment = assignment.children[0]  # assignment
left_side = actual_assignment.children[0]  # identifier 'x'
operator = actual_assignment.children[1]  # '='
right_side = actual_assignment.children[2]  # binary_operator

print(left_side.text)  # b'x'
print(right_side.type)  # binary_operator

This works exactly like navigating nested lists or dictionaries. But there's a catch—what if the structure changes slightly?

source = b"x = y = 5"  # Chained assignment
tree = parser.parse(source)
# Now the structure is different!

Counting child positions (children[0], children[1]) makes code brittle. Let's see a better way.

Semantic Access: Field Names

Tree-sitter nodes often have named fields that describe what a child represents semantically:

source = b"""
def process(data):
    return data * 2
"""

tree = parser.parse(source)
func = tree.root_node.children[0]  # function_definition

# Instead of counting positions:
name_node = func.child_by_field_name('name')
params_node = func.child_by_field_name('parameters')
body_node = func.child_by_field_name('body')

print(name_node.text)  # b'process'
print(body_node.type)   # block

This is more robust because it doesn't depend on the exact order of punctuation nodes like def or :.

When to use each:

Use .children when exploring or when you need every node (including punctuation)
Use .child_by_field_name() when you know what semantic part you want

The Manual Approach: TreeCursor

Here's where many learners hit confusion. Tree-sitter has a .walk() method that returns a TreeCursor. But a TreeCursor is not a generator. It doesn't iterate. It's more like a finger pointing at a node, which you manually move around.

Let me show you what it does, and then we'll discuss when you'd actually want it:

cursor = tree.root_node.walk()

# TreeCursor starts at the root
print(cursor.node.type)  # module

# Manually move down to first child
if cursor.goto_first_child():
    print(cursor.node.type)  # function_definition

    # Move to next sibling (if there was one)
    if cursor.goto_next_sibling():
        print("Found a sibling")
    else:
        print("No sibling")

    # Move back up
    cursor.goto_parent()
    print(cursor.node.type)  # Back to module

Notice: you're calling methods to move a pointer. No loops, no iteration. You control every movement.

The Critical Confusion to Avoid

Many developers see .walk() and .children and try to combine them:

# ❌ DON'T DO THIS
cursor = root.walk()
for child in cursor.node.children:  # This defeats the purpose!
    print(child.type)

If you're using .children, you don't need .walk() at all. Just use the node directly:

# ✅ DO THIS INSTEAD
for child in root.children:
    print(child.type)

So When DO You Use TreeCursor?

TreeCursor exists for one main reason: memory efficiency when traversing enormous trees.

When you access .children, tree-sitter creates a Python list of all child nodes. For a huge file, this uses memory. A TreeCursor, by contrast, is just a pointer—it doesn't allocate child lists until you ask for them.

For typical scripts and files (under 1000 lines), this rarely matters. Use the simple .children approach. But if you're analyzing massive codebases or working in memory-constrained environments, TreeCursor gives you fine control.

Here's a realistic pattern if you need it:

def find_all_functions_with_cursor(root):
    """Memory-efficient function finder using TreeCursor"""
    functions = []

    def traverse(cursor):
        # Check current node
        if cursor.node.type == 'function_definition':
            name = cursor.node.child_by_field_name('name')
            if name:
                functions.append(name.text.decode('utf-8'))

        # Visit children
        if cursor.goto_first_child():
            traverse(cursor)  # Visit first child

            while cursor.goto_next_sibling():
                traverse(cursor)  # Visit siblings

            cursor.goto_parent()  # Go back up

    cursor = root.walk()
    traverse(cursor)
    return functions

But honestly? For most purposes, this is cleaner and works just as well:

def find_all_functions(node):
    """Simple recursive function finder"""
    functions = []

    if node.type == 'function_definition':
        name = node.child_by_field_name('name')
        if name:
            functions.append(name.text.decode('utf-8'))

    for child in node.children:
        functions.extend(find_all_functions(child))

    return functions

The Takeaway

You have three tools:

.children - Your default. Use this.
.child_by_field_name() - When you want semantic parts
.walk() + TreeCursor - Only when memory is tight

Don't overthink it. Start with .children for everything. If you find yourself writing brittle code that breaks when the tree structure changes slightly, switch to field names. If you're analyzing 10,000-file codebases and running out of memory, then explore TreeCursor.

Exercise: Comparing Access Patterns

Parse this code:

source = b"""
class DataProcessor:
    def __init__(self, name):
        self.name = name

    def process(self, data):
        return data.upper()
"""

Your challenge:

Find the class name using .children and counting
Find it again using .child_by_field_name('name')
Find all method names using recursive traversal with .children
(Optional) Implement the same search using TreeCursor

Which approaches felt natural? Which felt fragile?

Chapter 13: Building AST Analysis Tools

Now let's build something useful. We'll create three practical tools, each teaching a different lesson about working with ASTs.

Tool 1: Structure Explorer

When you encounter a new language or unfamiliar code pattern, the first question is: What does the tree structure look like?

Let's improve our earlier print_tree() function. This time, we'll add depth limiting and show useful information:

def explore_structure(node, source_bytes, depth=0, max_depth=3):
    """Print tree structure with source code snippets"""
    if depth > max_depth:
        return

    indent = "  " * depth

    # Show node type
    node_info = f"{indent}{node.type}"

    # For leaf nodes (no children), show the actual text
    if not node.children:
        text = node.text.decode('utf-8').replace('\n', '\\n')
        if len(text) > 30:
            text = text[:27] + "..."
        node_info += f" → '{text}'"

    print(node_info)

    # Recurse into children
    for child in node.children:
        explore_structure(child, source_bytes, depth + 1, max_depth)

# Try it
source = b"""
def calculate_total(items):
    total = sum(item.price for item in items)
    return total
"""

tree = parser.parse(source)
explore_structure(tree.root_node, source)

Output:

module
  function_definition
    def → 'def'
    identifier → 'calculate_total'
    parameters
      ( → '('
      identifier → 'items'
      ) → ')'
    : → ':'
    block

This is your first tool. Before writing any analysis code, run this to understand what you're working with.

Tool 2: Finding All Nodes of a Type

Now let's build something more powerful: a function that finds every node of a given type. This is the foundation for countless analysis tools.

We'll start with a straightforward recursive implementation:

def find_nodes_by_type(node, target_type):
    """Find all nodes matching a specific type"""
    results = []

    # Check if current node matches
    if node.type == target_type:
        results.append(node)

    # Search children
    for child in node.children:
        results.extend(find_nodes_by_type(child, target_type))

    return results

# Example: Find all function calls
source = b"""
print("Starting")
process_data(items)
result = calculate(x, y)
print("Done")
"""

tree = parser.parse(source)
calls = find_nodes_by_type(tree.root_node, 'call')

for call in calls:
    # Get the function name (first child of call is usually the function)
    func = call.children[0]
    print(f"Found call to: {func.text.decode('utf-8')}")

Output:

Found call to: print
Found call to: process_data
Found call to: calculate
Found call to: print

This pattern—finding all nodes of a type—is incredibly versatile. Want to find all classes? All imports? All string literals? Same function, different target_type.

Extracting Source Code: The Bytes Gotcha

Here's a subtle but important detail. Tree-sitter works with bytes, not strings. When you parse code, you must keep the original bytes:

source_code = b"def test():\n    return 42"  # bytes
tree = parser.parse(source_code)

func = tree.root_node.children[0]
print(func.text)  # b'def test():\n    return 42' - still bytes

# To get a string:
text_string = func.text.decode('utf-8')
print(text_string)  # Now it's a proper string

You'll also frequently need to extract source using byte positions:

# The original source bytes must be kept!
source = b"x = 5 + 3"
tree = parser.parse(source)

node = tree.root_node.children[0]  # expression_statement
start = node.start_byte
end = node.end_byte

# Extract from original source
extracted = source[start:end].decode('utf-8')
print(extracted)  # "x = 5 + 3"

Why does this matter? Because .text only works if the node came from a tree that still exists. If you're processing many files or building data to return later, you need to extract the text while you have both the node and the original source bytes.

Tool 3: Function Signature Extractor

Let's combine these ideas into something practical. We'll build a tool that extracts all function definitions with their signatures:

def extract_functions(node, source_bytes):
    """Extract all function definitions with their signatures"""
    functions = []

    if node.type == 'function_definition':
        # Get function name
        name_node = node.child_by_field_name('name')
        name = name_node.text.decode('utf-8') if name_node else '<unknown>'

        # Get parameters
        params_node = node.child_by_field_name('parameters')
        params = params_node.text.decode('utf-8') if params_node else '()'

        # Get full signature (from def to :)
        signature_end = node.start_byte
        for child in node.children:
            if child.text == b':':
                signature_end = child.end_byte
                break

        signature = source_bytes[node.start_byte:signature_end].decode('utf-8')

        functions.append({
            'name': name,
            'parameters': params,
            'signature': signature,
            'start_line': name_node.start_point[0] if name_node else 0
        })

    # Recurse into children
    for child in node.children:
        functions.extend(extract_functions(child, source_bytes))

    return functions

# Test it
source = b"""
def greet(name):
    return f"Hello, {name}"

def calculate_average(numbers):
    return sum(numbers) / len(numbers)

class Processor:
    def process(self, data):
        return data.upper()
"""

tree = parser.parse(source)
functions = extract_functions(tree.root_node, source)

for func in functions:
    print(f"Line {func['start_line'] + 1}: {func['signature']}")

Output:

Line 1: def greet(name):
Line 4: def calculate_average(numbers):
Line 8: def process(self, data):

Notice how we're building structured data from the tree. This is the pattern for most AST analysis: traverse, extract information, return useful data structures.

Exercise: Build a Function Documentation Extractor

Here's your challenge: extend the function extractor to also capture docstrings.

Hints:

A function's body is accessed via .child_by_field_name('body')
A docstring is an expression_statement containing a string
It appears as the first statement in the body's block

Your function should return something like:

{
    'name': 'greet',
    'parameters': '(name)',
    'docstring': '"""Say hello to someone"""',  # or None if no docstring
    'signature': 'def greet(name):'
}

Start by exploring the structure of a function with a docstring using our explore_structure() tool. Let the tree show you the path before you code it.

Chapter 14: Advanced AST Patterns

We've built basic traversal tools. Now let's tackle the challenging problems that reveal the real power—and real difficulty—of AST analysis.

The Parent Problem

Here's something that might surprise you: tree-sitter nodes don't know their parents.

func_node = tree.root_node.children[0]
print(func_node.parent)  # AttributeError: 'Node' object has no attribute 'parent'

Why does this matter? Consider this question: "Is this variable assignment inside a function or at module level?" You can't answer without tracking context as you traverse.

Let's solve this step by step. First, the naive approach that doesn't quite work:

def find_assignments(node):
    """Find assignments - but we lose context!"""
    assignments = []

    if node.type == 'assignment':
        left = node.children[0]
        var_name = left.text.decode('utf-8')
        assignments.append(var_name)

    for child in node.children:
        assignments.extend(find_assignments(child))

    return assignments

This finds assignments, but we don't know where they are. Let's fix that by tracking the parent path:

def find_assignments_with_context(node, parent_path=None):
    """Find assignments and track what scope they're in"""
    if parent_path is None:
        parent_path = []

    assignments = []

    # Track what scope we're entering
    current_path = parent_path.copy()
    if node.type == 'function_definition':
        name_node = node.child_by_field_name('name')
        if name_node:
            current_path.append(f"function:{name_node.text.decode('utf-8')}")
    elif node.type == 'class_definition':
        name_node = node.child_by_field_name('name')
        if name_node:
            current_path.append(f"class:{name_node.text.decode('utf-8')}")

    # Check for assignment
    if node.type == 'assignment':
        left = node.children[0]
        var_name = left.text.decode('utf-8')
        scope = ' > '.join(current_path) if current_path else 'module'
        assignments.append({
            'variable': var_name,
            'scope': scope,
            'line': node.start_point[0] + 1
        })

    # Recurse with updated context
    for child in node.children:
        assignments.extend(
            find_assignments_with_context(child, current_path)
        )

    return assignments

# Test it
source = b"""
x = 10

def process():
    y = 20

    def inner():
        z = 30
"""

tree = parser.parse(source)
assignments = find_assignments_with_context(tree.root_node)

for assign in assignments:
    print(f"Line {assign['line']}: {assign['variable']} in {assign['scope']}")

Output:

Line 1: x in module
Line 4: y in function:process
Line 7: z in function:process > function:inner

See the pattern? We maintain a parent_path list that we copy and extend as we descend into new scopes. This is how you track context in a tree without parent pointers.

Scoped Search: Finding Within Functions

Sometimes you don't want all nodes of a type—you want nodes within a specific scope. Let's build a function that finds all calls within a particular function:

def find_calls_in_function(root, function_name):
    """Find all function calls inside a specific function"""

    # First, find the target function
    def find_function(node):
        if node.type == 'function_definition':
            name_node = node.child_by_field_name('name')
            if name_node and name_node.text.decode('utf-8') == function_name:
                return node

        for child in node.children:
            result = find_function(child)
            if result:
                return result
        return None

    target_func = find_function(root)
    if not target_func:
        return []

    # Now find calls only within this function's body
    body = target_func.child_by_field_name('body')
    if not body:
        return []

    calls = []
    def collect_calls(node):
        if node.type == 'call':
            func = node.children[0]
            calls.append(func.text.decode('utf-8'))

        for child in node.children:
            collect_calls(child)

    collect_calls(body)
    return calls

# Test it
source = b"""
def process_data(items):
    filtered = filter_items(items)
    cleaned = clean_data(filtered)
    return save_result(cleaned)

def other_function():
    load_data()  # This shouldn't be found
"""

tree = parser.parse(source)
calls = find_calls_in_function(tree.root_node, 'process_data')
print("Calls in process_data:", calls)

Output:

Calls in process_data: ['filter_items', 'clean_data', 'save_result']

This two-phase approach—first find the scope, then search within it—is a common pattern in AST analysis.

Building Structured Data: The Import Analyzer

Let's build something more ambitious: a tool that extracts all imports and organizes them by type:

def analyze_imports(root, source_bytes):
    """Extract and categorize all import statements"""
    imports = {
        'standard': [],  # from import
        'simple': [],    # import x
        'aliased': []    # import x as y
    }

    def process_node(node):
        if node.type == 'import_statement':
            # Simple import: import os
            # or: import os, sys
            for child in node.children:
                if child.type == 'dotted_name':
                    imports['simple'].append(child.text.decode('utf-8'))
                elif child.type == 'aliased_import':
                    # import numpy as np
                    name = child.child_by_field_name('name')
                    alias = child.child_by_field_name('alias')
                    if name and alias:
                        imports['aliased'].append({
                            'module': name.text.decode('utf-8'),
                            'alias': alias.text.decode('utf-8')
                        })

        elif node.type == 'import_from_statement':
            # from os import path
            # from os.path import join, exists
            module_name = None
            items = []

            # Find what we're importing from
            name_node = node.child_by_field_name('module_name')
            if name_node:
                module_name = name_node.text.decode('utf-8')

            # Find what we're importing
            for child in node.children:
                if child.type == 'dotted_name':
                    items.append(child.text.decode('utf-8'))
                elif child.type == 'aliased_import':
                    name = child.child_by_field_name('name')
                    alias = child.child_by_field_name('alias')
                    if name:
                        item_text = name.text.decode('utf-8')
                        if alias:
                            item_text += f" as {alias.text.decode('utf-8')}"
                        items.append(item_text)

            if module_name:
                imports['standard'].append({
                    'from': module_name,
                    'items': items
                })

        # Recurse
        for child in node.children:
            process_node(child)

    process_node(root)
    return imports

# Test it
source = b"""
import os
import sys
from pathlib import Path
from collections import defaultdict, Counter
import numpy as np
from typing import List, Dict
"""

tree = parser.parse(source)
result = analyze_imports(tree.root_node, source)

print("Simple imports:", result['simple'])
print("From imports:", result['standard'])
print("Aliased imports:", result['aliased'])

Notice how we're building a data structure that's more useful than just a flat list of nodes. This is the goal of most AST work: transform tree structure into actionable information.

Performance: When to Stop Early

Sometimes you don't need to traverse the entire tree. If you're searching for one specific thing, stop when you find it:

def find_first_function(node):
    """Find the first function definition and stop"""
    if node.type == 'function_definition':
        return node  # Found it! Stop searching.

    for child in node.children:
        result = find_first_function(child)
        if result:  # If we found it in a child, bubble it up
            return result

    return None  # Didn't find it in this subtree

This early-exit pattern can dramatically speed up searches in large files.

Exercise: Build a Code Complexity Analyzer

Here's a challenging but rewarding exercise. Build a function that calculates cyclomatic complexity—essentially, how many decision points exist in a function.

Count these node types within each function:

if_statement: +1
elif_clause: +1
for_statement: +1
while_statement: +1
except_clause: +1

Your output should look like:

{
    'process_data': {'complexity': 3, 'line': 5},
    'simple_func': {'complexity': 1, 'line': 12}
}

Steps to consider:

Find all function definitions
For each function, search its body for decision nodes
Count them
Return structured results

This exercise combines everything: scoped search, context tracking, and building useful data from tree structure.

The next chapter will pull back from the details and help you recognize which problems need which patterns. But first, try that complexity analyzer. Let the struggle of designing it teach you what questions to ask of an AST.