Part V: Abstract Syntax Trees (Tree-sitter)
Chapter 11: Introduction to Tree-sitter
What Is an AST?
When Python reads your code, it doesn't just see text. It sees structure. Consider this simple function:
def greet(name):
return f"Hello, {name}"
To you and me, this is a function definition. To Python, it's a tree of connected concepts: a function node, with a name child, a parameters child, and a body child. That body contains a return statement, which contains a formatted string, which itself has parts.
This tree structure—where every piece of code is a node with a type and children—is called an Abstract Syntax Tree (AST).
Why ASTs Matter
Think about what a code editor can do: it can find all function definitions, highlight syntax correctly, rename variables safely, or analyze complexity. None of this works by searching text. It all works by understanding structure.
When you want to:
- Find every function call in a codebase
- Extract all class definitions with their methods
- Analyze import patterns
- Build documentation automatically
- Detect code patterns or anti-patterns
You need to work with the AST, not raw text.
Meet Tree-sitter
Tree-sitter is a parser generator that builds ASTs from source code. Unlike Python's built-in ast module (which only works for Python), tree-sitter supports dozens of languages with the same interface. You can analyze Python, JavaScript, Rust, or Go using identical patterns.
Let's get started:
from tree_sitter import Parser, Language
import tree_sitter_python
# Create a parser for Python
parser = Parser()
parser.set_language(tree_sitter_python.language())
# Parse some code
source_code = b"""
def calculate(x, y):
result = x + y
return result
"""
tree = parser.parse(source_code)
print(tree.root_node)
Output:
<Node type=module, start_point=(0, 0), end_point=(4, 0)>
Notice we got a Node object. This is our entry point—the root of the tree.
Understanding Tree-sitter Nodes
Every node in a tree-sitter AST has these key attributes:
.type - A string describing what kind of node this is:
print(tree.root_node.type) # 'module'
.children - A list of child nodes:
for child in tree.root_node.children:
print(f" {child.type}")
# function_definition
.start_byte and .end_byte - Where this node appears in the source:
func_node = tree.root_node.children[0]
print(func_node.start_byte, func_node.end_byte) # 1, 51
.text - The actual source code for this node (as bytes):
print(func_node.text)
# b'def calculate(x, y):\n result = x + y\n return result'
Our First Traversal: Seeing the Structure
Let's write a simple function to visualize the tree. Notice how we'll approach this—not trying to build something perfect immediately, but starting with something straightforward that works:
def print_tree(node, indent=0):
# Show this node
print(" " * indent + f"{node.type}")
# Show all children
for child in node.children:
print_tree(child, indent + 1)
# Try it
tree = parser.parse(source_code)
print_tree(tree.root_node)
Output:
module
function_definition
def
identifier
parameters
(
identifier
,
identifier
)
:
block
expression_statement
assignment
identifier
=
binary_operator
identifier
+
identifier
return_statement
return
identifier
Look at this output carefully. Do you see the pattern? Every part of your code—keywords like def and return, operators like +, even punctuation like : and ,—becomes a node in the tree.
The Mental Shift
You've already learned this pattern with file systems and JSON:
- File system: Directories contain files and subdirectories
- JSON: Dictionaries contain keys with values (which might be more dictionaries)
- AST: Nodes have types and contain child nodes
It's the same recursive structure we've been working with. The only new piece is that instead of directory names or dictionary keys, we now have node types like function_definition or binary_operator.
Exercise: Visualizing Your Own Code
Take this challenge:
- Parse a simple Python file you wrote recently
- Use
print_tree()to visualize its structure - Find one pattern in the output that surprised you
- Identify where in the tree a specific line of your code appears
Start with something small—maybe 10-15 lines. The goal is to build intuition about how code becomes structure.
Hint: When you look at the output, try to match each node back to a piece of your source code. What node type represents a function call? An if statement? A list comprehension?
Chapter 12: Tree-sitter Access Patterns
Now that we understand what an AST node is, let's explore how to navigate them. Tree-sitter gives us three distinct approaches, and choosing the right one matters.
The Simple Path: Direct Access
Most of the time, working with .children is all you need:
source = b"x = 5 + 3"
tree = parser.parse(source)
root = tree.root_node
# Navigate down the tree
assignment = root.children[0] # expression_statement
actual_assignment = assignment.children[0] # assignment
left_side = actual_assignment.children[0] # identifier 'x'
operator = actual_assignment.children[1] # '='
right_side = actual_assignment.children[2] # binary_operator
print(left_side.text) # b'x'
print(right_side.type) # binary_operator
This works exactly like navigating nested lists or dictionaries. But there's a catch—what if the structure changes slightly?
source = b"x = y = 5" # Chained assignment
tree = parser.parse(source)
# Now the structure is different!
Counting child positions (children[0], children[1]) makes code brittle. Let's see a better way.
Semantic Access: Field Names
Tree-sitter nodes often have named fields that describe what a child represents semantically:
source = b"""
def process(data):
return data * 2
"""
tree = parser.parse(source)
func = tree.root_node.children[0] # function_definition
# Instead of counting positions:
name_node = func.child_by_field_name('name')
params_node = func.child_by_field_name('parameters')
body_node = func.child_by_field_name('body')
print(name_node.text) # b'process'
print(body_node.type) # block
This is more robust because it doesn't depend on the exact order of punctuation nodes like def or :.
When to use each:
- Use
.childrenwhen exploring or when you need every node (including punctuation) - Use
.child_by_field_name()when you know what semantic part you want
The Manual Approach: TreeCursor
Here's where many learners hit confusion. Tree-sitter has a .walk() method that returns a TreeCursor. But a TreeCursor is not a generator. It doesn't iterate. It's more like a finger pointing at a node, which you manually move around.
Let me show you what it does, and then we'll discuss when you'd actually want it:
cursor = tree.root_node.walk()
# TreeCursor starts at the root
print(cursor.node.type) # module
# Manually move down to first child
if cursor.goto_first_child():
print(cursor.node.type) # function_definition
# Move to next sibling (if there was one)
if cursor.goto_next_sibling():
print("Found a sibling")
else:
print("No sibling")
# Move back up
cursor.goto_parent()
print(cursor.node.type) # Back to module
Notice: you're calling methods to move a pointer. No loops, no iteration. You control every movement.
The Critical Confusion to Avoid
Many developers see .walk() and .children and try to combine them:
# ❌ DON'T DO THIS
cursor = root.walk()
for child in cursor.node.children: # This defeats the purpose!
print(child.type)
If you're using .children, you don't need .walk() at all. Just use the node directly:
# âś… DO THIS INSTEAD
for child in root.children:
print(child.type)
So When DO You Use TreeCursor?
TreeCursor exists for one main reason: memory efficiency when traversing enormous trees.
When you access .children, tree-sitter creates a Python list of all child nodes. For a huge file, this uses memory. A TreeCursor, by contrast, is just a pointer—it doesn't allocate child lists until you ask for them.
For typical scripts and files (under 1000 lines), this rarely matters. Use the simple .children approach. But if you're analyzing massive codebases or working in memory-constrained environments, TreeCursor gives you fine control.
Here's a realistic pattern if you need it:
def find_all_functions_with_cursor(root):
"""Memory-efficient function finder using TreeCursor"""
functions = []
def traverse(cursor):
# Check current node
if cursor.node.type == 'function_definition':
name = cursor.node.child_by_field_name('name')
if name:
functions.append(name.text.decode('utf-8'))
# Visit children
if cursor.goto_first_child():
traverse(cursor) # Visit first child
while cursor.goto_next_sibling():
traverse(cursor) # Visit siblings
cursor.goto_parent() # Go back up
cursor = root.walk()
traverse(cursor)
return functions
But honestly? For most purposes, this is cleaner and works just as well:
def find_all_functions(node):
"""Simple recursive function finder"""
functions = []
if node.type == 'function_definition':
name = node.child_by_field_name('name')
if name:
functions.append(name.text.decode('utf-8'))
for child in node.children:
functions.extend(find_all_functions(child))
return functions
The Takeaway
You have three tools:
.children- Your default. Use this..child_by_field_name()- When you want semantic parts.walk()+ TreeCursor - Only when memory is tight
Don't overthink it. Start with .children for everything. If you find yourself writing brittle code that breaks when the tree structure changes slightly, switch to field names. If you're analyzing 10,000-file codebases and running out of memory, then explore TreeCursor.
Exercise: Comparing Access Patterns
Parse this code:
source = b"""
class DataProcessor:
def __init__(self, name):
self.name = name
def process(self, data):
return data.upper()
"""
Your challenge:
- Find the class name using
.childrenand counting - Find it again using
.child_by_field_name('name') - Find all method names using recursive traversal with
.children - (Optional) Implement the same search using TreeCursor
Which approaches felt natural? Which felt fragile?
Chapter 13: Building AST Analysis Tools
Now let's build something useful. We'll create three practical tools, each teaching a different lesson about working with ASTs.
Tool 1: Structure Explorer
When you encounter a new language or unfamiliar code pattern, the first question is: What does the tree structure look like?
Let's improve our earlier print_tree() function. This time, we'll add depth limiting and show useful information:
def explore_structure(node, source_bytes, depth=0, max_depth=3):
"""Print tree structure with source code snippets"""
if depth > max_depth:
return
indent = " " * depth
# Show node type
node_info = f"{indent}{node.type}"
# For leaf nodes (no children), show the actual text
if not node.children:
text = node.text.decode('utf-8').replace('\n', '\\n')
if len(text) > 30:
text = text[:27] + "..."
node_info += f" → '{text}'"
print(node_info)
# Recurse into children
for child in node.children:
explore_structure(child, source_bytes, depth + 1, max_depth)
# Try it
source = b"""
def calculate_total(items):
total = sum(item.price for item in items)
return total
"""
tree = parser.parse(source)
explore_structure(tree.root_node, source)
Output:
module
function_definition
def → 'def'
identifier → 'calculate_total'
parameters
( → '('
identifier → 'items'
) → ')'
: → ':'
block
This is your first tool. Before writing any analysis code, run this to understand what you're working with.
Tool 2: Finding All Nodes of a Type
Now let's build something more powerful: a function that finds every node of a given type. This is the foundation for countless analysis tools.
We'll start with a straightforward recursive implementation:
def find_nodes_by_type(node, target_type):
"""Find all nodes matching a specific type"""
results = []
# Check if current node matches
if node.type == target_type:
results.append(node)
# Search children
for child in node.children:
results.extend(find_nodes_by_type(child, target_type))
return results
# Example: Find all function calls
source = b"""
print("Starting")
process_data(items)
result = calculate(x, y)
print("Done")
"""
tree = parser.parse(source)
calls = find_nodes_by_type(tree.root_node, 'call')
for call in calls:
# Get the function name (first child of call is usually the function)
func = call.children[0]
print(f"Found call to: {func.text.decode('utf-8')}")
Output:
Found call to: print
Found call to: process_data
Found call to: calculate
Found call to: print
This pattern—finding all nodes of a type—is incredibly versatile. Want to find all classes? All imports? All string literals? Same function, different target_type.
Extracting Source Code: The Bytes Gotcha
Here's a subtle but important detail. Tree-sitter works with bytes, not strings. When you parse code, you must keep the original bytes:
source_code = b"def test():\n return 42" # bytes
tree = parser.parse(source_code)
func = tree.root_node.children[0]
print(func.text) # b'def test():\n return 42' - still bytes
# To get a string:
text_string = func.text.decode('utf-8')
print(text_string) # Now it's a proper string
You'll also frequently need to extract source using byte positions:
# The original source bytes must be kept!
source = b"x = 5 + 3"
tree = parser.parse(source)
node = tree.root_node.children[0] # expression_statement
start = node.start_byte
end = node.end_byte
# Extract from original source
extracted = source[start:end].decode('utf-8')
print(extracted) # "x = 5 + 3"
Why does this matter? Because .text only works if the node came from a tree that still exists. If you're processing many files or building data to return later, you need to extract the text while you have both the node and the original source bytes.
Tool 3: Function Signature Extractor
Let's combine these ideas into something practical. We'll build a tool that extracts all function definitions with their signatures:
def extract_functions(node, source_bytes):
"""Extract all function definitions with their signatures"""
functions = []
if node.type == 'function_definition':
# Get function name
name_node = node.child_by_field_name('name')
name = name_node.text.decode('utf-8') if name_node else '<unknown>'
# Get parameters
params_node = node.child_by_field_name('parameters')
params = params_node.text.decode('utf-8') if params_node else '()'
# Get full signature (from def to :)
signature_end = node.start_byte
for child in node.children:
if child.text == b':':
signature_end = child.end_byte
break
signature = source_bytes[node.start_byte:signature_end].decode('utf-8')
functions.append({
'name': name,
'parameters': params,
'signature': signature,
'start_line': name_node.start_point[0] if name_node else 0
})
# Recurse into children
for child in node.children:
functions.extend(extract_functions(child, source_bytes))
return functions
# Test it
source = b"""
def greet(name):
return f"Hello, {name}"
def calculate_average(numbers):
return sum(numbers) / len(numbers)
class Processor:
def process(self, data):
return data.upper()
"""
tree = parser.parse(source)
functions = extract_functions(tree.root_node, source)
for func in functions:
print(f"Line {func['start_line'] + 1}: {func['signature']}")
Output:
Line 1: def greet(name):
Line 4: def calculate_average(numbers):
Line 8: def process(self, data):
Notice how we're building structured data from the tree. This is the pattern for most AST analysis: traverse, extract information, return useful data structures.
Exercise: Build a Function Documentation Extractor
Here's your challenge: extend the function extractor to also capture docstrings.
Hints:
- A function's body is accessed via
.child_by_field_name('body') - A docstring is an
expression_statementcontaining astring - It appears as the first statement in the body's block
Your function should return something like:
{
'name': 'greet',
'parameters': '(name)',
'docstring': '"""Say hello to someone"""', # or None if no docstring
'signature': 'def greet(name):'
}
Start by exploring the structure of a function with a docstring using our explore_structure() tool. Let the tree show you the path before you code it.
Chapter 14: Advanced AST Patterns
We've built basic traversal tools. Now let's tackle the challenging problems that reveal the real power—and real difficulty—of AST analysis.
The Parent Problem
Here's something that might surprise you: tree-sitter nodes don't know their parents.
func_node = tree.root_node.children[0]
print(func_node.parent) # AttributeError: 'Node' object has no attribute 'parent'
Why does this matter? Consider this question: "Is this variable assignment inside a function or at module level?" You can't answer without tracking context as you traverse.
Let's solve this step by step. First, the naive approach that doesn't quite work:
def find_assignments(node):
"""Find assignments - but we lose context!"""
assignments = []
if node.type == 'assignment':
left = node.children[0]
var_name = left.text.decode('utf-8')
assignments.append(var_name)
for child in node.children:
assignments.extend(find_assignments(child))
return assignments
This finds assignments, but we don't know where they are. Let's fix that by tracking the parent path:
def find_assignments_with_context(node, parent_path=None):
"""Find assignments and track what scope they're in"""
if parent_path is None:
parent_path = []
assignments = []
# Track what scope we're entering
current_path = parent_path.copy()
if node.type == 'function_definition':
name_node = node.child_by_field_name('name')
if name_node:
current_path.append(f"function:{name_node.text.decode('utf-8')}")
elif node.type == 'class_definition':
name_node = node.child_by_field_name('name')
if name_node:
current_path.append(f"class:{name_node.text.decode('utf-8')}")
# Check for assignment
if node.type == 'assignment':
left = node.children[0]
var_name = left.text.decode('utf-8')
scope = ' > '.join(current_path) if current_path else 'module'
assignments.append({
'variable': var_name,
'scope': scope,
'line': node.start_point[0] + 1
})
# Recurse with updated context
for child in node.children:
assignments.extend(
find_assignments_with_context(child, current_path)
)
return assignments
# Test it
source = b"""
x = 10
def process():
y = 20
def inner():
z = 30
"""
tree = parser.parse(source)
assignments = find_assignments_with_context(tree.root_node)
for assign in assignments:
print(f"Line {assign['line']}: {assign['variable']} in {assign['scope']}")
Output:
Line 1: x in module
Line 4: y in function:process
Line 7: z in function:process > function:inner
See the pattern? We maintain a parent_path list that we copy and extend as we descend into new scopes. This is how you track context in a tree without parent pointers.
Scoped Search: Finding Within Functions
Sometimes you don't want all nodes of a type—you want nodes within a specific scope. Let's build a function that finds all calls within a particular function:
def find_calls_in_function(root, function_name):
"""Find all function calls inside a specific function"""
# First, find the target function
def find_function(node):
if node.type == 'function_definition':
name_node = node.child_by_field_name('name')
if name_node and name_node.text.decode('utf-8') == function_name:
return node
for child in node.children:
result = find_function(child)
if result:
return result
return None
target_func = find_function(root)
if not target_func:
return []
# Now find calls only within this function's body
body = target_func.child_by_field_name('body')
if not body:
return []
calls = []
def collect_calls(node):
if node.type == 'call':
func = node.children[0]
calls.append(func.text.decode('utf-8'))
for child in node.children:
collect_calls(child)
collect_calls(body)
return calls
# Test it
source = b"""
def process_data(items):
filtered = filter_items(items)
cleaned = clean_data(filtered)
return save_result(cleaned)
def other_function():
load_data() # This shouldn't be found
"""
tree = parser.parse(source)
calls = find_calls_in_function(tree.root_node, 'process_data')
print("Calls in process_data:", calls)
Output:
Calls in process_data: ['filter_items', 'clean_data', 'save_result']
This two-phase approach—first find the scope, then search within it—is a common pattern in AST analysis.
Building Structured Data: The Import Analyzer
Let's build something more ambitious: a tool that extracts all imports and organizes them by type:
def analyze_imports(root, source_bytes):
"""Extract and categorize all import statements"""
imports = {
'standard': [], # from import
'simple': [], # import x
'aliased': [] # import x as y
}
def process_node(node):
if node.type == 'import_statement':
# Simple import: import os
# or: import os, sys
for child in node.children:
if child.type == 'dotted_name':
imports['simple'].append(child.text.decode('utf-8'))
elif child.type == 'aliased_import':
# import numpy as np
name = child.child_by_field_name('name')
alias = child.child_by_field_name('alias')
if name and alias:
imports['aliased'].append({
'module': name.text.decode('utf-8'),
'alias': alias.text.decode('utf-8')
})
elif node.type == 'import_from_statement':
# from os import path
# from os.path import join, exists
module_name = None
items = []
# Find what we're importing from
name_node = node.child_by_field_name('module_name')
if name_node:
module_name = name_node.text.decode('utf-8')
# Find what we're importing
for child in node.children:
if child.type == 'dotted_name':
items.append(child.text.decode('utf-8'))
elif child.type == 'aliased_import':
name = child.child_by_field_name('name')
alias = child.child_by_field_name('alias')
if name:
item_text = name.text.decode('utf-8')
if alias:
item_text += f" as {alias.text.decode('utf-8')}"
items.append(item_text)
if module_name:
imports['standard'].append({
'from': module_name,
'items': items
})
# Recurse
for child in node.children:
process_node(child)
process_node(root)
return imports
# Test it
source = b"""
import os
import sys
from pathlib import Path
from collections import defaultdict, Counter
import numpy as np
from typing import List, Dict
"""
tree = parser.parse(source)
result = analyze_imports(tree.root_node, source)
print("Simple imports:", result['simple'])
print("From imports:", result['standard'])
print("Aliased imports:", result['aliased'])
Notice how we're building a data structure that's more useful than just a flat list of nodes. This is the goal of most AST work: transform tree structure into actionable information.
Performance: When to Stop Early
Sometimes you don't need to traverse the entire tree. If you're searching for one specific thing, stop when you find it:
def find_first_function(node):
"""Find the first function definition and stop"""
if node.type == 'function_definition':
return node # Found it! Stop searching.
for child in node.children:
result = find_first_function(child)
if result: # If we found it in a child, bubble it up
return result
return None # Didn't find it in this subtree
This early-exit pattern can dramatically speed up searches in large files.
Exercise: Build a Code Complexity Analyzer
Here's a challenging but rewarding exercise. Build a function that calculates cyclomatic complexity—essentially, how many decision points exist in a function.
Count these node types within each function:
if_statement: +1elif_clause: +1for_statement: +1while_statement: +1except_clause: +1
Your output should look like:
{
'process_data': {'complexity': 3, 'line': 5},
'simple_func': {'complexity': 1, 'line': 12}
}
Steps to consider:
- Find all function definitions
- For each function, search its body for decision nodes
- Count them
- Return structured results
This exercise combines everything: scoped search, context tracking, and building useful data from tree structure.
The next chapter will pull back from the details and help you recognize which problems need which patterns. But first, try that complexity analyzer. Let the struggle of designing it teach you what questions to ask of an AST.