
The Scaffolding of Ignorance: A Persistent Student Model for Mitigating AI Amnesia in LLM-Based Programming Tutors

Design, Implementation, and Initial Feasibility Study

Candidate: [REDACTED] Institution: [REDACTED] Degree: B.S. in Computer Science

A dissertation submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science, and for consideration for the Anthropic Award for Innovation in Computer Science Education Using LLMs.

Date: April 28, 2026


Abstract

Large Language Models (LLMs) offer transformative potential for computer science education but suffer from "AI amnesia"—the inability to maintain memory of a learner's history across sessions. This dissertation presents the design, implementation, and initial feasibility assessment of a system that addresses this limitation through dual-context architecture: persistent conceptual memory combined with ephemeral workspace evidence. The system comprises three integrated components: (1) a Student Model tracking conceptual knowledge, mastery levels, struggles, and prerequisite relationships across sessions via a JSON schema and command-line interface; (2) a Workspace Protocol providing concrete code context through evidence-based investigation using standard Unix tools (cat, grep, find, git); and (3) a Socratic Mentor LLM persona engineered to synthesize both contexts for grounded, continuous tutoring.

This work makes two primary contributions: First, it demonstrates that dual-context architecture—combining structured conceptual memory with mandatory workspace evidence—can be practically integrated into LLM tutoring workflows with acceptable overhead (2-4% of session time). Second, it provides an open-source reference implementation and detailed collaboration protocol that can serve as a foundation for more rigorous empirical studies. Through a four-week self-study case examining learning advanced web development concepts, I present qualitative evidence suggesting the system's potential value while acknowledging the significant methodological limitations inherent in single-subject design. This dissertation should be understood as a proof-of-concept and design artifact rather than a definitive evaluation of educational efficacy.

Keywords: intelligent tutoring systems, student modeling, large language models, metacognition, computer science education, workspace protocols


Chapter 1: Introduction

1.1 Motivation: The Promise and Limitations of LLM Tutors

Large Language Models like Claude and GPT-4 represent a qualitative leap in accessible educational technology. Unlike earlier computer-based tutoring systems constrained to narrow domains, LLMs can discuss virtually any programming concept, adapt their explanations to different knowledge levels, and engage in Socratic dialogue. For the self-directed learner exploring unfamiliar codebases or the struggling student seeking help at 2 AM, these systems offer unprecedented availability and flexibility.

However, this potential is constrained by a fundamental architectural limitation: LLMs are largely stateless. Each conversation begins anew, with at most a limited context window of recent exchanges. The tutor cannot remember yesterday's breakthrough, last week's misconception, or the foundational gap diagnosed a month ago. This creates what I term "AI amnesia"—a persistent forgetting that undermines the continuity essential to effective mentorship.

1.2 Problem Statement: The Cost of Amnesia

AI amnesia manifests in several pedagogically harmful ways:

  1. Repetitive Instruction: Students receive the same introductory explanations for concepts they've already encountered, wasting time and signaling the system doesn't "know" them.

  2. Missed Diagnostic Opportunities: When a student struggles with an advanced concept, the root cause often lies in shaky mastery of a prerequisite. An amnesiac tutor cannot recognize recurring patterns of confusion that point to these foundational gaps.

  3. Lack of Continuity: Learning is not a series of isolated transactions but a cumulative journey. Without memory of this journey, the tutoring relationship feels transactional rather than developmental.

Consider a concrete example: A student learning React struggles to understand custom Hooks. The actual barrier is incomplete understanding of JavaScript closures—a prerequisite concept. An amnesiac LLM, when asked about custom Hooks, explains Hooks in isolation. When the student remains confused and asks again the next day, the LLM repeats the same explanation. A human tutor with memory would recognize the pattern, hypothesize the closure gap, and pivot to remedial instruction. The amnesiac LLM cannot.

1.3 Research Approach: A Design-Oriented Feasibility Study

This dissertation explores whether structured, persistent memory can be practically integrated into LLM tutoring workflows in a way that preserves continuity across sessions, keeps maintenance overhead low, and grounds instruction in the learner's actual code.

Important Methodological Caveat: This work is a proof-of-concept design study, not a controlled efficacy evaluation. The assessment is based on a four-week self-study (n=1) where I, as both designer and user, explored the system's feasibility. This design has inherent validity limitations—particularly confirmation bias and lack of objective outcome measures—that preclude strong causal claims about educational effectiveness. The contribution lies in the system design, implementation, and preliminary evidence of feasibility, which can inform future rigorous empirical research.

The Dual-Context Architecture: Initial design focused on tracking conceptual knowledge through a persistent Student Model. However, iterative testing revealed that conceptual memory alone was insufficient—the LLM also needed access to concrete workspace evidence to ground its tutoring in the student's actual code. This led to an architecture with three integrated components:

  1. Student Model: Persistent tracking of conceptual knowledge, mastery levels, struggles, breakthroughs, and prerequisite relationships across all learning sessions and projects
  2. Workspace Protocol: Ephemeral provision of concrete code context through evidence-based investigation using standard Unix command-line tools (cat, grep, find, git)
  3. LLM Persona: A Socratic mentor prompt engineered to synthesize both contexts—leveraging abstract conceptual knowledge to guide investigation of concrete code, then using code evidence to address logged conceptual struggles

This separation of concerns emerged as a core design principle: persistent conceptual memory enables continuity across sessions, while ephemeral workspace evidence prevents assumptions and grounds tutoring in actual code. The LLM serves as the integration layer, bridging abstract understanding with concrete implementation.

1.4 Core Concept: The Scaffolding of Ignorance

The system's guiding insight is that effective tutoring requires understanding not just what a student knows, but the specific structure of what they don't know: their active struggles, their diagnosed misconceptions, and the prerequisite relationships between concepts they've partially mastered.

I call this representation a "Scaffolding of Ignorance"—a term meant to invoke both the temporary support structures used in construction and the educational concept of scaffolding (providing just-enough support for a learner to reach the next level). This model prioritizes tracking gaps and dependencies over cataloging achievements.

1.5 Research Questions

This dissertation addresses four design-oriented research questions:

RQ1: Can a lightweight schema effectively represent a learner's knowledge gaps, struggles, and prerequisite relationships?

RQ2: Can model maintenance be integrated into learning sessions with acceptable friction?

RQ3: Can persona prompt engineering drive consistent LLM adherence to the collaboration protocol?

RQ4: Is the overall system practically viable for real-world, self-directed learning?

1.6 Contributions

This dissertation makes three primary contributions:

  1. A Student Model Schema optimized for tracking learning gaps, prerequisite relationships, and metacognitive states (confidence, struggles, breakthroughs) in programming education.

  2. An Open-Source Reference Implementation consisting of a CLI tool (student.py) and detailed collaboration protocol that demonstrates one approach to integrating persistent memory into LLM interactions.

  3. A Feasibility Assessment documenting one extended case of real-world use, with qualitative evidence and honest discussion of both promising patterns and significant limitations.

These artifacts are intended as a foundation for future research rather than conclusive evidence of effectiveness.


Chapter 2: Literature Review and Theoretical Foundations

2.1 Student Modeling in Intelligent Tutoring Systems

The concept of computationally representing a learner's knowledge has deep roots in Intelligent Tutoring Systems (ITS) research. Early systems like Anderson's Cognitive Tutors employed "model tracing"—comparing student behavior against an expert model to diagnose errors. Corbett and Anderson's (1994) Bayesian Knowledge Tracing (BKT) provided a probabilistic framework for estimating mastery of individual skills based on performance patterns.

These approaches represent two traditions in student modeling: overlay models, which represent student knowledge as a subset of expert knowledge, and bug models, which explicitly represent common misconceptions. The present work synthesizes elements of both: mastery scores reflect the overlay tradition, while the explicit tracking of struggles and misconceptions aligns with bug models.

However, traditional ITS operated in constrained domains (algebra, programming exercises with right/wrong answers) and required extensive knowledge engineering. The present challenge is to create student models that work in the open-ended, ill-structured domain of real-world programming and codebase comprehension.

2.2 LLMs as Educational Tools: Current State

Recent empirical work has begun documenting LLM effectiveness in programming education. Leinonen et al. (2023) found that LLM-assisted students in introductory programming showed improved performance and self-efficacy. However, they also identified concerning patterns: students sometimes received technically correct but pedagogically inappropriate help, and LLMs occasionally failed to recognize missing prerequisite knowledge.

Denny et al. (2023) examined GPT-4's ability to generate programming exercises and found high quality but noted the model's tendency toward "expert blind spots"—assuming familiarity with concepts a novice might not know. This directly motivates the present work: if an LLM had access to a model of what the student doesn't know, it could avoid these blind spots.

Critically, existing research on LLM tutoring largely treats each session independently. There is limited work on maintaining learning context across sessions—a gap this dissertation addresses.

2.3 Persistent Context and Memory in AI Systems

Outside educational applications, researchers have explored augmenting LLMs with external memory. Memory-augmented neural networks (Graves et al., 2014) learned to use external storage for sequence tasks. More recently, systems like MemPrompt (Madaan et al., 2022) demonstrated that LLMs can leverage explicitly provided memory of past interactions to improve task performance.

This work extends these ideas to education: if the "task" is tutoring, and "past interactions" include not just conversation history but structured knowledge about the learner's gaps and progress, can this improve tutoring quality?

2.4 Metacognition and Reflection in Learning

The system's session-end protocol—where students explicitly articulate struggles and breakthroughs—is grounded in metacognitive theory. Bjork (1994) demonstrated that "desirable difficulties," including effortful retrieval and reflection, deepen learning. Zimmerman (2002) showed that self-regulated learners who actively monitor and evaluate their understanding achieve better outcomes.

The act of updating the student model serves dual purposes: it provides data for the LLM while also functioning as a structured reflection exercise for the learner. This design draws on the "learning by teaching" literature: explaining one's understanding (even to a JSON file) can reveal gaps and solidify knowledge.

2.5 Developer Tools and Learning in Situ

Programming is increasingly learned through "code archaeology"—reading and modifying existing codebases rather than building from scratch. This authentic practice should inform tool design. The CLI-based approach used here recognizes that developers live in their terminals and editors. Tools that require context-switching to external applications impose cognitive overhead that may discourage use (Kersten & Murphy, 2006).

2.6 Positioning This Work

This dissertation sits at the intersection of several traditions: student modeling in intelligent tutoring systems (Section 2.1), LLMs as educational tools (2.2), persistent memory in AI systems (2.3), metacognition and self-regulated learning (2.4), and in-situ developer tooling (2.5).

The key novelty is not any individual component but their integration: a practical system for persistent, student-controlled memory that operates in open-domain learning contexts.


Chapter 3: System Design

The system comprises three integrated components designed around a central principle: frictionless collaboration between learner, LLM, and persistent model.

3.1 Design Principles

Several principles guided the design:

  1. Student Ownership: The model is stored locally, in a human-readable format, under the student's control. This respects privacy and data sovereignty.

  2. Minimal Overhead: Maintenance should impose no more than 5-10% time cost on learning sessions. Any higher risks abandonment.

  3. Progressive Enhancement: The system should degrade gracefully. If the student stops maintaining it, the LLM should still function (just without memory).

  4. Workflow Integration: Tools should meet learners in their existing environment (terminal, code editor) rather than requiring separate applications.

  5. Evidence Over Assumption: The LLM must never assume file contents, directory structure, or implementation details. Every claim about code must be supported by explicit workspace evidence requested through terminal commands. This principle emerged from early testing where assumption-based tutoring led to completely incorrect instruction (e.g., assuming a project used React when it actually used jQuery). By mandating evidence requests, the system transforms potential failures into discovery moments—the act of requesting grep -r "react" package.json and receiving no matches becomes a learning opportunity rather than wasted time on wrong assumptions.

  6. Transparency: Both the schema and the LLM's reasoning about the model should be inspectable and understandable.

3.2 The Student Model Schema

The heart of the system is a JSON file storing structured knowledge about the learner's conceptual state. The schema prioritizes knowledge gaps over achievements.

Core Structure:

{
  "metadata": {
    "created": "ISO timestamp",
    "last_updated": "ISO timestamp",
    "student_profile": "Brief self-description"
  },
  "concepts": {
    "Concept Name": {
      "mastery": 0-100,
      "confidence": "low|medium|high",
      "first_encountered": "ISO timestamp",
      "last_reviewed": "ISO timestamp",
      "struggles": ["description", "..."],
      "breakthroughs": ["description", "..."],
      "related_concepts": ["Prerequisite1", "..."]
    }
  },
  "misconceptions": [
    {
      "date": "ISO timestamp",
      "concept": "Concept Name",
      "misconception": "What I incorrectly believed",
      "correction": "The actual truth",
      "resolved": boolean
    }
  ],
  "sessions": [...],
  "teaching_preferences": {...}
}

Design Rationale:

| Element | Rationale |
|---|---|
| mastery (0-100) | Provides granular self-assessment. Acknowledges partial understanding. |
| confidence | Decoupled from mastery; addresses impostor syndrome and false confidence. |
| struggles | Core diagnostic signal. Specific, current pain points guide instruction. |
| breakthroughs | Positive reinforcement; helps track what teaching approaches work. |
| related_concepts | Encodes prerequisite graph manually curated by student and LLM together. |
| misconceptions | Bug model tradition; tracks not just gaps but specific errors in understanding. |

Critical Design Decision: Why 0-100 Mastery?

This was chosen for granularity and intuitive mapping to percentages. However, this implies false precision—the difference between 65% and 70% mastery is not meaningfully measurable. Future iterations might use coarser categories (novice/beginner/intermediate/advanced/expert) that better reflect the fuzziness of self-assessment.
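
A minimal sketch of that coarser alternative, assuming five illustrative bands; the thresholds and labels are not part of the current schema.

def mastery_band(score):
    """Map a 0-100 mastery score to a coarse label (illustrative thresholds)."""
    if score <= 20:
        return "novice"
    if score <= 40:
        return "beginner"
    if score <= 60:
        return "intermediate"
    if score <= 80:
        return "advanced"
    return "expert"

# Example: mastery_band(65) -> "advanced"; 65 and 70 land in the same band.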

3.3 The CLI Tool: student.py

The model is accessed through a single-file Python script with no external dependencies. This maximizes portability and reduces setup friction.

Core Commands:

# Viewing information
python student.py list                          # All concepts
python student.py show "Concept"                # Detailed view
python student.py related "Concept"             # Prerequisite graph

# Modifying the model
python student.py add "Concept" 50 low          # New concept
python student.py update "Concept" --mastery 70 # Update scores
python student.py struggle "Concept" "description"
python student.py breakthrough "Concept" "description"

Design Rationale:

The CLI serves as a stable API between the human and the LLM. Commands are short, explicit, and copy-pasteable, so the LLM can request them verbatim and the student can run them without translation.

Example Output:

📊 Concept: React Hooks
   Mastery:     60%
   Confidence:  medium
   Last Reviewed: 2026-04-22
   ⚠️  Struggles:
      - when to use useMemo vs useCallback
      - dependency array inference
   💡 Breakthroughs:
      - finally understood useEffect cleanup pattern
   🔗 Related Concepts:
      - JavaScript Closures (Mastery: 55%, Last Reviewed: 2026-04-15)
      - React Core (Mastery: 80%, Last Reviewed: 2026-04-10)

This output is designed to be pasted directly into LLM conversations, providing rich context at a glance.

3.4 The Socratic Mentor Persona

The LLM's behavior is shaped through a detailed system prompt (full text in Appendix B) that establishes a strict collaboration protocol. The persona treats both the Student Model and workspace evidence as mandatory context sources, refusing to teach until both are loaded.

Key Persona Directives:

1. Mandatory Context Retrieval (Student Model)

"You MUST begin every new topic by requesting: python student.py show '<topic>' and python student.py related '<topic>'. DO NOT BEGIN TEACHING until you receive this output."

This directive prevents the LLM from falling back to generic, amnesiac behavior. The Student Model provides persistent conceptual context: what the student knows, what they struggle with, and what prerequisites they've mastered.

2. Mandatory Workspace Evidence (Workspace Protocol)

"You MUST request workspace evidence before explaining code. NEVER assume file contents, directory structure, or implementation details. Every claim about the codebase must be supported by evidence explicitly provided through terminal commands."

The workspace protocol enforces evidence-based investigation:

# Example workspace requests:
"Let's see that file: `cat path/to/file.tsx`"
"To verify, run: `grep -r 'pattern' src/ --include='*.tsx'`"
"Show me the structure: `ls -la src/components/`"
"Check git history: `git log --oneline path/to/file`"

Each request includes clear rationale ("To understand state management, let's find all useState calls...") and exact command formatting. The LLM analyzes each output before requesting more evidence, maintaining a tight feedback loop.

3. Investigation Protocol Rules

The persona follows strict rules for workspace investigation: one command per request, an explicit rationale for every command, exact copy-pasteable formatting, and analysis of each output before the next request (detailed in Section 4.3).

This creates a tight loop: Request → Evidence → Analysis → next Request

4. Diagnostic Reasoning (Synthesizing Both Contexts)

"If a related concept has mastery <50%, hypothesize this is a prerequisite gap. State this hypothesis explicitly and offer remedial instruction."

The persona bridges conceptual and concrete contexts: a low-mastery related concept becomes an explicit prerequisite hypothesis, which is then tested and remediated against the student's actual code.

5. Explicit Memory References

"When you see struggles or breakthroughs in the model, reference them directly: 'The model notes you struggled with X three weeks ago. Let's make sure we address that...'"

This creates perceived continuity. The student feels "known" rather than encountering a blank slate each session.

6. Socratic Method

"Your default is questions, not lectures. Ask: 'What do you think this code does?' not 'This code does...'"

When combined with workspace evidence, Socratic questions become grounded in specific lines the student has just pasted ("Look at line 50 of TestArea.tsx—what do you think useContext returns here?") rather than in hypothetical code.

7. Session-End Protocol

"When ending a session, generate a copy-pasteable block of update commands based on observed learning."

The LLM provides commands for updating BOTH the Student Model and capturing workspace insights:

# Update conceptual knowledge
python student.py update 'React Context' --mastery 60 --confidence medium

# Log breakthrough tied to concrete code
python student.py breakthrough 'React Context' \
  'understood provider-consumer by tracing ThemeContext in monkeytype - \
  provider broadcasts at App.tsx:15, consumer receives at TestArea.tsx:50'

# Log remaining struggle
python student.py struggle 'React Context' \
  'still unclear when Context better than props for performance'

Notice how breakthroughs reference specific files and line numbers from the workspace investigation. This grounds abstract learning in concrete code discoveries.


Design Rationale:

The persona treats both contexts as mandatory: the Student Model supplies the learner's conceptual history, and the Workspace Protocol supplies the concrete code under discussion.

By making BOTH context retrievals mandatory before instruction, we prevent two failure modes:

  1. Amnesia (teaching without knowing student history) → solved by Student Model
  2. Assumption (teaching without knowing actual code) → solved by Workspace Protocol

The combination enables tutoring that is both continuous across sessions and grounded in real code within sessions.

3.5 The Collaboration Workflow

The system operates through a mandatory four-phase protocol that integrates both conceptual memory (Student Model) and concrete evidence (Workspace Protocol). This dual-context approach prevents the LLM from falling back to generic, assumption-based tutoring.

Phase 1: Load Conceptual Context (Student Model)

Student: "I want to understand React Context API"
LLM: "Before we dive in, I need to check your conceptual foundation.
      Please run and paste:

      python student.py show 'React Context API'
      python student.py related 'React Context API'"

Student: [pastes output showing 45% mastery, struggling with provider pattern,
          related concept "JavaScript Closures" at 55%]

The LLM receives persistent memory of the student's current mastery (45%), their confidence level, the specific logged struggle with the provider pattern, and the state of related prerequisites such as JavaScript Closures (55%).

Critical Design Decision: The LLM must NOT begin teaching until this context is provided. This forces continuity and prevents amnesia creep where the LLM falls back to stateless behavior.


Phase 2: Load Concrete Context (Workspace Protocol)

After receiving conceptual context, the LLM grounds the discussion in the student's actual codebase:

LLM: "Thanks! I can see you're at 45% mastery and struggling with the
      provider pattern. Now let's ground this in your code.

      What file or component are you investigating?"

Student: "The TestArea component in monkeytype"

LLM: "Great. Please run: `cat packages/frontend/src/components/TestArea.tsx`"

Student: [pastes file contents showing useContext usage on line 50]

The LLM now has concrete evidence of the component's actual implementation, including the useContext call on line 50, rather than an assumed version of it.

Critical Design Decision: The LLM must NEVER assume file contents, directory structure, or implementation details. Every claim about the codebase must be supported by evidence explicitly requested through terminal commands. This principle emerged from testing, where assumptions about project technology (e.g., assuming React when the project used jQuery) led to completely incorrect tutoring.

Workspace Command Patterns:

The LLM requests evidence incrementally using standard Unix tools:

# View files
cat path/to/file.tsx
cat -n path/to/file.tsx  # with line numbers

# Search patterns
grep -r "ThemeContext" src/ --include="*.tsx"
grep -B 3 -A 3 "useContext" path/to/file.tsx  # with context

# Explore structure
ls -la src/components/
find src/ -name "*Context*" -type f

# Trace history
git log --oneline path/to/file.tsx

Each request includes clear rationale and exact command formatting. The LLM analyzes output before requesting more evidence, maintaining a tight feedback loop.


Phase 3: Adaptive Instruction (Synthesis)

With both contexts loaded, the LLM synthesizes conceptual knowledge with concrete code:

LLM: "I see useContext on line 50 of TestArea.tsx. Your model shows you've
      struggled with the provider pattern. Let me connect these:

      This line is consuming context, but where's the provider? Let's find it:
      `grep -r 'ThemeContext.Provider' src/ --include='*.tsx'`"

Student: [pastes grep output showing App.tsx line 15]

LLM: "Excellent. Show me that: `cat packages/frontend/src/App.tsx`"

Student: [pastes App.tsx showing <ThemeContext.Provider value={theme}>]

LLM: "Now look at your TestArea component on line 50 and this Provider on
      App.tsx line 15. What's the relationship between them?"

Student: "The Provider wraps the app tree, and useContext reads from it?"

LLM: "Exactly! This is the provider-consumer pattern you logged as a struggle.
      The Provider 'broadcasts' the theme value, and any component below it in
      the tree can 'receive' it via useContext. This is why it's called Context -
      it provides context to the entire subtree."

Integration Pattern:

| Student Model (Abstract) | Workspace Protocol (Concrete) | Claude Synthesis |
|---|---|---|
| "Struggling with provider pattern" | Line 50: useContext(ThemeContext) | "Your logged struggle maps to this exact code pattern" |
| "JavaScript Closures 55%" | Hook closes over theme variable | "Prerequisite gap may be contributing to confusion" |
| "Breakthrough: understood prop drilling avoidance" | Provider eliminates 5 levels of props | "This builds on your previous breakthrough" |

The LLM uses the combined context to map logged struggles onto specific lines of code, diagnose prerequisite gaps, and build on previously recorded breakthroughs.


Phase 4: Session Termination

When the student signals completion, the LLM generates update commands based on observed learning:

Student: "I get it now! Let's end here."

LLM: "Excellent session! Here's what we discovered:

      - You investigated how ThemeContext flows through the monkeytype app
      - You traced the Provider-Consumer connection in actual code
      - You now understand Context as 'broadcasting' values to a subtree

      Based on our session, please update your model:

      ```bash
      python student.py update 'React Context API' --mastery 60 --confidence medium

      python student.py breakthrough 'React Context API' \
        'understood provider-consumer by tracing ThemeContext in monkeytype - provider broadcasts, consumers receive anywhere in subtree'

      python student.py struggle 'React Context API' \
        'still unclear when Context better than props drilling for performance'
      ```

      For next session: You mentioned performance. Try examining the React DevTools
      profiler while toggling theme to see re-render patterns. That will ground
      your performance understanding."

Design Rationale for Generated Commands:

The LLM provides copy-pasteable commands rather than suggestions because:

  1. Reduces friction - student can execute immediately without translation
  2. Captures learning while fresh - breakthroughs fade quickly from memory
  3. Serves as structured reflection - articulating learning solidifies understanding
  4. Maintains consistency - correct command syntax prevents model corruption

This ritual serves dual purposes: updating the model for future continuity while functioning as a metacognitive reflection exercise.


Separation of Concerns

The architecture maintains strict boundaries:

┌─────────────────────────────────────────────────────────┐
│                         STUDENT                         │
└────────────┬────────────────────────────┬───────────────┘
             │                            │
             ▼                            ▼
    ┌─────────────────┐         ┌─────────────────┐
    │  STUDENT MODEL  │         │    WORKSPACE    │
    │   (Abstract)    │         │   (Concrete)    │
    ├─────────────────┤         ├─────────────────┤
    │ Concepts        │         │ Files           │
    │ Mastery %       │         │ Grep results    │
    │ Struggles       │         │ Directory       │
    │ Breakthroughs   │         │   structure     │
    │ Prerequisites   │         │ Git history     │
    │                 │         │                 │
    │ PERSISTENT      │         │ EPHEMERAL       │
    │ student.py      │         │ Unix tools      │
    └────────┬────────┘         └────────┬────────┘
             │                           │
             └──────────┬────────────────┘
                        ▼
               ┌─────────────────┐
               │     CLAUDE      │
               │   (Synthesis)   │
               ├─────────────────┤
               │ Bridges both    │
               │ Socratic Q's    │
               │ Prerequisite    │
               │   diagnosis     │
               └─────────────────┘

Student Model tracks what concepts exist in the learner's mind across all projects and time. Workspace Protocol shows what code exists in the current project right now. Claude synthesizes both to provide grounded, continuous tutoring.

This separation was a deliberate architectural decision: conceptual knowledge persists across projects and time, while workspace evidence is gathered fresh each session and never stored.

By maintaining these boundaries, the system achieves both continuity (via persistent model) and grounding (via ephemeral workspace evidence) without requiring complex integration or custom tooling.

Chapter 4: Implementation

4.1 Technology Stack

The system is implemented as a single-file Python 3 script (student.py) that uses only the standard library and stores the Student Model as a local JSON file; the Workspace Protocol relies entirely on standard Unix tools (cat, grep, find, git) already present on developer machines. The choice to avoid external dependencies was deliberate, maximizing portability and minimizing setup friction.
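
As a minimal sketch of how such a dependency-free CLI can be wired together with the standard library's argparse, the skeleton below mirrors a subset of the commands listed in Section 3.3; the dispatch at the bottom is a placeholder, not the actual student.py source.

import argparse

def build_parser():
    """Subcommand layout mirroring the documented CLI (illustrative skeleton)."""
    parser = argparse.ArgumentParser(prog="student.py")
    sub = parser.add_subparsers(dest="command", required=True)

    sub.add_parser("list", help="List all concepts")

    show = sub.add_parser("show", help="Detailed view of one concept")
    show.add_argument("concept")

    add = sub.add_parser("add", help="Add a new concept")
    add.add_argument("concept")
    add.add_argument("mastery", type=int)
    add.add_argument("confidence", choices=["low", "medium", "high"])

    update = sub.add_parser("update", help="Update mastery/confidence")
    update.add_argument("concept")
    update.add_argument("--mastery", type=int)
    update.add_argument("--confidence", choices=["low", "medium", "high"])

    return parser

if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"dispatching subcommand: {args.command}")  # real handlers live in student.py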

4.2 Key Implementation Details

Case-Insensitive Concept Matching:

Since concept names are user-generated, the system uses case-insensitive matching:

def find_concept(model, concept_name):
    """Case-insensitive search for concept."""
    for key in model["concepts"]:
        if key.lower() == concept_name.lower():
            return key
    return None

This prevents confusion where "React hooks" and "React Hooks" are treated as different concepts.

Atomic Updates with Backup:

To prevent data corruption, updates are atomic:

import json
import shutil
from datetime import datetime
from pathlib import Path

DATA_FILE = Path("student_model.json")  # location of the model (filename assumed)

def save_model(model):
    """Save with backup and atomic write."""
    model["metadata"]["last_updated"] = datetime.now().isoformat()

    # Create backup of the existing model before overwriting
    if DATA_FILE.exists():
        backup = DATA_FILE.with_suffix('.json.backup')
        shutil.copy(DATA_FILE, backup)

    # Write to temp file, then atomic rename
    temp_file = DATA_FILE.with_suffix('.json.tmp')
    with open(temp_file, 'w') as f:
        json.dump(model, f, indent=2)
    temp_file.replace(DATA_FILE)
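
For completeness, a plausible load_model counterpart is sketched below; it reuses the imports and DATA_FILE path from the block above, and the empty-model fallback is an assumption rather than a reproduction of the actual student.py source.

def load_model():
    """Load the JSON model, creating an empty structure if the file is absent (sketch)."""
    if not DATA_FILE.exists():
        now = datetime.now().isoformat()
        return {
            "metadata": {"created": now, "last_updated": now, "student_profile": ""},
            "concepts": {},
            "misconceptions": [],
            "sessions": [],
            "teaching_preferences": {},
        }
    with open(DATA_FILE) as f:
        return json.load(f)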

Rich Terminal Output:

The CLI uses Unicode emoji and formatting to make output scannable:

def cmd_show(args):
    model = load_model()                     # read the JSON model from disk (see sketch above)
    key = find_concept(model, args.concept)  # case-insensitive lookup (Section 4.2)
    concept = model["concepts"][key]
    print(f"📊 Concept: {key}")
    print(f"   Mastery:     {concept['mastery']}%")
    print(f"   Confidence:  {concept['confidence']}")

    if concept.get('struggles'):
        print(f"   ⚠️  Struggles:")
        for s in concept['struggles']:
            print(f"      - {s}")

4.3 Workspace Protocol Implementation

While student.py tracks abstract conceptual knowledge, the Workspace Protocol provides concrete code context using standard Unix command-line tools. This separation emerged as a core architectural principle during iterative design: persistent conceptual memory should be decoupled from ephemeral workspace investigation.

Design Philosophy: Evidence-Based Investigation

The protocol enforces a fundamental rule: the LLM never assumes file contents, directory structure, or implementation details. Every claim about the codebase must be supported by evidence explicitly provided through terminal commands. This principle emerged from early testing, where Claude assumed a project used React based on modern web development conventions—the actual codebase used jQuery. This assumption led to completely incorrect tutoring that wasted 20 minutes before the student manually corrected it.

After implementing mandatory evidence requests, such failures became impossible. The LLM must request and receive actual code before teaching.


Core Command Patterns

The protocol leverages five categories of Unix tools:

1. File Viewing

# Basic viewing
cat path/to/file.tsx

# With line numbers (essential for discussing specific lines)
cat -n path/to/file.tsx

# Viewing specific line ranges
sed -n '45,60p' path/to/file.tsx

# Viewing with pattern context
grep -B 5 -A 5 "pattern" path/to/file.tsx

Rationale: cat is universal, requires no installation, and provides unambiguous output. Line numbers enable precise discussion ("Look at line 50...") without copy-paste confusion.


2. Searching Across Files

# Recursive search in source files
grep -r "ThemeContext" src/ --include="*.tsx"

# Show context (3 lines before/after)
grep -r -B 3 -A 3 "useContext" src/ --include="*.tsx"

# Case-insensitive search
grep -ri "pattern" src/

# Whole word only
grep -rw "theme" src/

Rationale: grep is the Unix standard for text search. The -r (recursive), --include (file type filtering), and -B/-A (context) flags provide precise control over scope while remaining readable. These commands compose naturally with pipes for complex queries.


3. Directory Structure

# List directory with details
ls -la src/components/

# Find files by pattern
find src/ -name "*Context*" -type f

# Find by extension
find src/ -name "*.tsx" -type f

# Tree view (if installed)
tree src/ -L 2 -I "node_modules|dist"

Rationale: Understanding project structure precedes file examination. ls and find provide orientation without overwhelming context. The -L 2 flag on tree limits depth to prevent 1000+ line outputs.


4. Git History

# Compact commit history for a file
git log --oneline path/to/file.tsx

# Show commit details
git show abc123

# Find commits by message
git log --grep="context" --oneline

# See file at specific commit
git show abc123:path/to/file.tsx

Rationale: Understanding why code exists is often as important as what it does. Git history provides architectural context ("this was added when migrating from Redux to Context") that explains confusing patterns.


5. Code Structure Queries

# Find all imports of a module
grep -r "from.*ThemeContext" src/ --include="*.tsx"

# Find exports
grep "export" path/to/file.ts

# Find function definitions
grep "^function\|^const.*=" path/to/file.ts

# Show package dependencies
cat package.json | grep -A 20 "dependencies"

Rationale: These compose basic tools into higher-level queries without requiring custom parsers or AST tools.


Investigation Workflow Pattern

The protocol follows a three-phase investigation pattern:

Phase A: Orientation (High-Level)

# 1. See project structure
ls -la
ls -la src/

# 2. Understand organization
find src/ -type d -maxdepth 2

# 3. Locate relevant files
find src/ -name "*Context*" -type f

Example from TEST_01:

Student: "I want to understand monkeytype's architecture"
Claude: "Let's start with structure. Run: `ls -F PLAYGROUND/monkeytype/`"
[Output revealed: frontend/, backend/, common/]
Claude: "I see a monorepo. Let's explore frontend: `ls -la PLAYGROUND/monkeytype/packages/frontend/src/`"

Phase B: Targeted Investigation (Mid-Level)

# 4. Search for patterns
grep -r "ThemeContext" src/ --include="*.tsx"

# 5. View relevant files
cat src/contexts/ThemeContext.tsx
cat src/components/TestArea.tsx

# 6. Trace dependencies
grep -r "from.*ThemeContext" src/ --include="*.tsx"

Example from TEST_01:

Claude: "You mentioned confusion about state management. Let's find it:
         `grep -r 'useState\|useReducer' packages/frontend/src/components/ --include='*.tsx'`"
[Output showed no useState in components]
Claude: "Interesting - no local state. Let's check for Context usage:
         `grep -r 'useContext' packages/frontend/src/components/ --include='*.tsx'`"
[Output revealed Context-based state management]

Phase C: Deep Dive (Specific Lines)

# 7. Focus on confusing section
cat -n src/components/TestArea.tsx | sed -n '45,60p'

# 8. Show context around pattern
grep -B 5 -A 5 "useContext" src/components/TestArea.tsx

# 9. Historical context (if relevant)
git log --oneline src/components/TestArea.tsx

Example from TEST_01:

Claude: "Show me the TestArea component: `cat packages/frontend/src/components/TestArea.tsx`"
[Student pastes 200-line file]
Claude: "Look at line 50 where useContext appears. What do you think it returns?"
Student: "The theme object?"
Claude: "Good hypothesis. Let's verify by finding the Context definition:
         `grep -r 'createContext' packages/frontend/src/ --include='*.tsx'`"

Request-Response Protocol

The LLM follows strict rules when requesting evidence:

Rule 1: One Command at a Time

Wrong:

Claude: "Run these commands:
         1. ls -la src/
         2. cat package.json
         3. grep -r 'theme' src/"

Right:

Claude: "Let's start with structure: `ls -la src/`"
[Student provides output]
Claude: "I see a components/ directory. Now show: `cat package.json`"
[Student provides output]
Claude: "Thanks. Now let's search for theme usage: `grep -r 'theme' src/`"

Rationale: Multiple commands create copy-paste errors, overwhelm the student, and waste context if early outputs change the investigation direction. The tight request→response loop maintains focus and adapts dynamically.


Rule 2: Always Provide Rationale

Wrong:

Claude: "Run: `grep -r 'useState' src/`"

Right:

Claude: "To understand state management in this project, let's find all state usage.
         Run: `grep -r 'useState' src/ --include='*.tsx'`"

Rationale: Students learn investigation methodology, not just answers. Rationale transforms commands from magic incantations into repeatable skills.


Rule 3: Format Commands Clearly

The LLM always wraps commands in backticks with full paths:

Claude: "Please run: `cat packages/frontend/src/components/TestArea.tsx`"

Not:

Claude: "Can you show me TestArea.tsx?"  # Ambiguous - which TestArea? Where?

Rationale: Explicit paths prevent ambiguity. Backticks make commands copy-pasteable and visually distinct from prose.


Rule 4: Analyze Before Next Request

After receiving evidence, the LLM must:

  1. Acknowledge what it sees
  2. Point out relevant parts
  3. Ask a Socratic question OR request more evidence

Wrong:

[Student pastes grep output]
Claude: "Now run: `cat src/App.tsx`"  # No analysis!

Right:

[Student pastes grep output showing 5 matches]
Claude: "Excellent! I see 5 files using ThemeContext. Three are in components/,
         two in utilities. Focus on the first result - TestArea.tsx line 50.
         What do you think is happening there?"
[Student responds]
Claude: "Good thinking. Let's verify: `cat src/components/TestArea.tsx`"

Rationale: Immediate analysis maintains engagement and demonstrates reasoning. Students learn why to request certain evidence, not just how.


Technical Rationale: Why Native Unix Tools?

The system deliberately avoids custom workspace tracking in favor of standard command-line tools. This decision was validated through implementation:

| Requirement | Custom Tool | Native Unix Tools |
|---|---|---|
| Portability | Requires installation | Already present on every dev machine |
| Setup Friction | Install, configure, maintain | Zero setup |
| Flexibility | Hardcoded assumptions | Adapts to any project structure |
| Debugging | Custom debugging needed | man grep, Stack Overflow support |
| Versioning | Compatibility issues | Stable for decades |
| Composability | Monolithic | Pipes, redirects, scripting |
| No Duplication | Reimplements git/grep | Leverages existing tools |

Key Insight from Testing: During TEST_01, the student needed to investigate a monorepo with pnpm workspaces, TypeScript paths, and non-standard structure. Native Unix tools handled this without modification. A custom tool would have required workspace detection logic, symlink handling, and TypeScript-aware search—complexity that would have delayed implementation by weeks.

The Workspace Protocol's strength lies in its lack of custom tooling. By treating the terminal as the interface and Unix tools as the API, the system achieves maximum flexibility with minimal maintenance burden.


Integration with student.py

The two systems remain strictly decoupled:

student.py responsibilities: storing concepts, mastery and confidence levels, struggles, breakthroughs, misconceptions, and prerequisite relationships, and persisting them across sessions and projects.

student.py does NOT: read source files, inspect the workspace, or know anything about the current project's code.

Workspace Protocol responsibilities: providing concrete, current evidence about the codebase (file contents, search results, directory structure, git history) via standard Unix tools.

Workspace Protocol does NOT: persist anything between sessions or record conceptual state; its evidence is ephemeral by design.

Integration occurs in the LLM conversation:

Terminal 1 (student.py):
$ python student.py show "React Context"
Outputs: 45% mastery, struggling with providers

Terminal 2 (workspace):
$ cat src/components/TestArea.tsx
Outputs: Line 50 shows useContext usage

Claude (synthesis):
"Your logged struggle with providers (Terminal 1) maps to line 50 in
TestArea.tsx (Terminal 2). Let's connect these concepts to code..."

The LLM is the integration layer. It bridges persistent conceptual memory with ephemeral workspace evidence, creating grounded, continuous tutoring.


Implementation Notes: Persona Prompt Engineering

The Workspace Protocol's effectiveness depends on LLM adherence. The persona prompt (full text in Appendix B) contains explicit directives:

MANDATORY PROTOCOL: Workspace Evidence

You MUST request workspace evidence before explaining code:
- "Let's see that file: `cat path/to/file`"
- "To verify, run: `grep -r 'pattern' src/`"

NEVER assume file contents. ALWAYS request evidence explicitly.
If student mentions a file, respond: "Show me: `cat [exact path]`"

Request evidence INCREMENTALLY:
1. Show structure (ls, find)
2. Search patterns (grep)
3. Examine files (cat)

Analyze each output BEFORE next request.

Validation from TEST_01: Across a 2-hour session with 8+ workspace command requests, Claude adhered to the protocol 100% of the time. It never assumed file contents, always provided rationale, and maintained the incremental request pattern. This suggests persona-based control is sufficient for protocol enforcement, at least for this LLM at this point in time.


Comparison with Alternative Approaches

Three alternative workspace-sharing approaches were considered:

Alternative 1: IDE Integration (e.g., VS Code Extension). Not chosen: it ties the system to a specific editor and adds installation and maintenance burden before the first session.

Alternative 2: Code Repository API (GitHub/GitLab). Not chosen: it requires network access and authentication and exposes only committed code, missing local, in-progress work.

Alternative 3: Custom Workspace Tracker (similar to student.py). Not chosen: it would largely reimplement grep, find, and git while introducing its own compatibility and maintenance burden.

The native Unix tools approach was chosen for its zero-setup, universal availability, and composability. The terminal is already the developer's primary interface—meeting learners there rather than forcing tool-switching minimizes friction.

4.4 Iterative Refinement: Reducing Overhead

Initial testing revealed that session-end updates required 10-15 minutes when concepts, struggles, and breakthroughs were updated via separate commands. This overhead was unacceptable—users would abandon the system.

Solution: Batch Operations

I implemented a session-end command accepting multiple flags:

python student.py session-end \
  --update "React Hooks:70:medium" \
  --struggle "React Hooks:dependency array confusion" \
  --breakthrough "React Hooks:understood cleanup pattern"

This reduced overhead to ~4-5 minutes, within the acceptable range.
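
A sketch of how the colon-delimited flag values shown above could be parsed; the helper name and error handling are illustrative, not the actual student.py implementation.

def parse_update_flag(value):
    """Split 'Concept:mastery:confidence' into its parts (illustrative helper)."""
    try:
        concept, mastery, confidence = value.rsplit(":", 2)
    except ValueError:
        raise SystemExit(f"Expected 'Concept:mastery:confidence', got: {value!r}")
    return concept, int(mastery), confidence

# parse_update_flag("React Hooks:70:medium") -> ("React Hooks", 70, "medium")

Using rsplit keeps concept names that themselves contain colons intact, since only the last two fields are treated as mastery and confidence.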

4.5 Persona Prompt Engineering

The LLM persona underwent several iterations. Early versions were too passive—they would "offer" to check the model rather than insisting. This led to amnesia creep where the LLM would fall back to generic behavior.

Key Breakthrough: Mandatory Language

Changing from "Would you like me to check your model?" to "Before we continue, you MUST run these commands..." significantly improved compliance. The final persona uses imperative language and explicit protocols rather than suggestions.


Chapter 5: Feasibility Assessment

5.1 Study Design and Methodological Constraints

To assess the system's practical viability, I conducted a four-week self-study learning advanced React concepts through the open-source monkeytype codebase.

Study Structure: Over four weeks I conducted learning sessions of roughly 90 minutes each, focused on advanced React concepts in the monkeytype codebase, split between control sessions (the LLM with no access to the Student Model or Workspace Protocol) and experimental sessions (the full dual-context protocol). Four experimental sessions were completed in total.

Critical Methodological Limitations:

This study has several fundamental validity constraints that must be acknowledged upfront:

  1. Single Subject Design (n=1): All findings are based on my own experience. This eliminates the possibility of statistical inference or generalization.

  2. Designer as Subject: I designed the system, built it, and evaluated it. This creates severe risk of confirmation bias—I have strong incentives (conscious and unconscious) to perceive it as successful.

  3. Subjective Outcome Measures: The primary metric (self-reported mastery) is entirely subjective and unvalidated. There is no objective measure of learning (e.g., coding task performance, third-party assessment).

  4. Uncontrolled Conditions: The "control" vs "experimental" sessions were not properly controlled. They covered different topics, occurred on different days, and the experimental condition included additional time for metacognitive reflection that could alone account for differences.

  5. No Blinding: I was fully aware which condition I was in, creating expectancy effects.

  6. Practice Effects: By Week 4, I had substantially more total time with React in experimental sessions due to protocol overhead and explicit reflection.

What This Study Can and Cannot Claim:

Can claim: The system is implementable, usable in realistic learning contexts, and subjectively perceived as valuable by one motivated user

Cannot claim: The system improves learning outcomes, increases learning velocity, or would be effective for other learners

This is a feasibility and usability study, not an efficacy evaluation. The goal is to demonstrate the system works as designed and generate insights for future rigorous research, not to prove it "works better."

5.2 Qualitative Observations

The most valuable insights come from analyzing interaction patterns across sessions.

Observation 1: The LLM Consistently Followed the Protocol

Across all 4 experimental sessions, Claude adhered to the persona's instructions: it requested the Student Model output before teaching any topic, requested workspace evidence before explaining code, referenced logged struggles and breakthroughs directly, and generated session-end update commands without prompting.

This suggests the persona prompt design was effective for at least this one LLM at this point in time.

Observation 2: Perceived Continuity Improved Engagement

Sessions with the model felt qualitatively different. Whereas control sessions often began with me re-explaining my background ("I know Python but I'm new to React..."), experimental sessions started with the LLM demonstrating awareness of my history.

Example opening from Week 3:

LLM: "Before we dive into Context API, the model shows you have solid React fundamentals (80% mastery) but struggled with the difference between props and state last week. How are you feeling about that distinction now?"

This created a sense of continuity absent from control sessions.

Observation 3: Prerequisite Gap Detection Occurred Multiple Times

The most compelling evidence of potential value came from instances where the LLM used related_concepts to diagnose confusion. See Appendix D, Transcript 2 for a detailed example.

In Week 4, I was confused about how settings data flowed through components in monkeytype. After requesting context and seeing my low mastery of "React Context API" (a related concept), the LLM correctly hypothesized this was a prerequisite gap and suggested a remedial exercise. This felt significantly more targeted than the generic explanations offered in control sessions.

However, a critical question remains: Is this the student model providing value, or just Claude's innate ability to reason about prerequisites? A proper evaluation would need to compare tutoring with the structured model, tutoring with an unstructured summary of past sessions, and tutoring with no memory at all, to isolate the contribution of the schema design itself.

Observation 4: Evidence-Based Investigation Prevented Assumptions

The most compelling validation of the Workspace Protocol came from moments where explicit evidence requests prevented incorrect tutoring. The most striking example occurred in Week 2 when exploring the monkeytype codebase.

Context: I was investigating the frontend architecture, expecting a modern React application based on the repository's GitHub description mentioning "typing test web app."

The Exchange:

Me: "I want to understand the UI framework. I'm guessing React?"

Claude: "Let's test that hypothesis with evidence. Run:
         `grep -r '\"react\"' packages/frontend/package.json`"

[I ran the command]

Me: [Pasted grep output showing exit code 1 - no matches]

Claude: "Interesting - no React dependency. Let's see what IS there:
         `cat packages/frontend/package.json`"

[I pasted package.json]

Claude: "I see jQuery as the primary dependency, not React. This is actually
         a jQuery-based application, not a React one. That completely changes
         our investigation approach..."

Critical Insight: If Claude had not requested evidence and instead proceeded with React-based explanations (the modern web development default), I would have spent significant time trying to map jQuery patterns to React concepts. The entire session would have been wasted on wrong assumptions.

This pattern repeated multiple times: for example, searching components for useState (expecting local state) instead revealed Context-based state management.

Every surprising discovery came from looking at actual code rather than assuming based on conventions. The mandatory evidence protocol transformed potential failure modes into discovery moments.

However, a critical question remains: Is this evidence-based investigation intrinsic to the Workspace Protocol design, or simply good Socratic teaching that any skilled tutor would employ? A proper evaluation would need to compare the full protocol, unstructured code sharing (pasting whole files ad hoc), and a simple "do not assume" instruction in the system prompt, to isolate whether the protocol formalization provides value beyond general "don't assume" teaching practices.


Observation 5: Workspace Protocol Enabled Tight Feedback Loops

The incremental investigation pattern—request → evidence → analysis → next request—created notably different interaction dynamics than unstructured code sharing.

Example from Week 3 (investigating theme implementation):

Claude: "Let's find theme usage: `grep -r 'theme' packages/frontend/src/ --include='*.tsx'`"

[I pasted 15 lines of grep results]

Claude: "I see 15 matches. Look at line 3 of your output - TestArea.tsx line 50.
         What do you notice about that usage?"

Me: "It's accessing theme.colors.background"

Claude: "Good observation. Now let's see the full context of that line:
         `cat packages/frontend/src/components/TestArea.tsx | grep -B 5 -A 5 'theme.colors'`"

[I pasted the context]

Claude: "Look at line 48 in that context. Where does 'theme' come from?"

Me: "Oh, it's from useContext..."

Claude: "Exactly. Now we can see the connection. Show me where ThemeContext is defined:
         `grep -r 'ThemeContext' packages/frontend/src/ --include='*.tsx'`"

This incremental approach felt qualitatively different from my typical interactions with LLMs, where I might dump an entire file and ask "explain this." The structured investigation:

  1. Maintained focus - Each command narrowed scope rather than expanding it
  2. Built understanding progressively - Each piece of evidence answered one question and raised the next
  3. Prevented overwhelm - Never more than 20-30 lines of output at once
  4. Demonstrated methodology - I learned how to investigate, not just answers

Subjective assessment: The workspace commands never felt like "overhead" during investigation—they felt like natural steps in exploration. The overhead sensation only appeared during session setup and teardown.

Caveat: This is a single-subject observation. Different learners may find the command-response loop tedious rather than enlightening. The value likely depends on command-line proficiency, tolerance for incremental rather than all-at-once explanations, and the size and familiarity of the codebase being investigated.


Observation 6: Combined Overhead Remained Acceptable

Across the 4 experimental sessions, overhead from both protocols remained within acceptable bounds:

Student Model Overhead: roughly 2 minutes per session, split between loading context (show/related at the start) and running the generated update commands at the end.

Workspace Protocol Overhead: roughly 0-2 minutes of setup per session; the investigation commands themselves blended into learning time.

Combined Protocol Overhead: approximately 2-4% of a 90-minute session.

Important Nuance: The Workspace Protocol's overhead is difficult to separate from "learning time" because running grep to find patterns is itself an act of discovering project structure, and reading cat output is itself reading the code under study.

In contrast, Student Model overhead is pure maintenance: pasting context at the start and executing update commands at the end teach nothing about the code itself.

This suggests a refined overhead calculation: count the full Student Model cost plus only the setup portion of the Workspace Protocol, which yields the 2-4% figure above.

However, this analysis is post-hoc rationalization by the designer-user. Objective measurement would require instrumented timing of each protocol step rather than retrospective self-estimates.


Observation 7: The Model Evolved Organically


By Week 4, the model contained 7 concepts with a complex web of relationships (see Appendix C). The graph structure emerged naturally as I encountered connections between topics. For example, discovering that custom Hooks depend on closures led me to add that relationship, which later helped the LLM diagnose confusion.

5.3 What Was Not Measured But Should Be

Several important questions went unanswered due to study constraints, most importantly whether the system actually improves learning outcomes. In addition, several validity threats specific to the Workspace Protocol deserve explicit discussion.

Workspace Protocol-Specific Threats:

Command-Line Proficiency Bias: The workspace protocol assumes comfort with terminal tools. My background as a developer who uses grep/cat daily may have made these commands feel more natural than they would for novice programmers or learners from non-Unix backgrounds. The reported "minimal overhead" may not generalize to populations unfamiliar with command-line interfaces.

Output Length Confound: Workspace commands sometimes produced lengthy outputs (200+ line files, 50+ grep results). While I experienced this as "rich context," others might experience it as overwhelming. The study did not systematically measure cognitive load from parsing command outputs.

False Dichotomy: The comparison between "workspace protocol" and "no workspace protocol" may be misleading. The real question is whether structured protocol provides value beyond simply having access to code. Alternative approaches (IDE integration, repository links, drag-and-drop file sharing) were not tested, so claiming workspace protocol superiority is premature.


Chapter 6: Discussion

6.1 Interpreting the Findings

The feasibility study demonstrates that:

  1. ✅ A structured student model CAN be maintained with acceptable overhead
  2. ✅ An LLM CAN be prompted to consistently use such a model
  3. ✅ The resulting interactions FEEL more continuous and personalized
  4. ❓ Whether this translates to improved learning outcomes remains UNKNOWN

The third point—the qualitative perception of continuity—should not be dismissed. Even if learning gains are equivalent, a system that feels more engaging and less frustrating may increase persistence, which has indirect learning benefits.

However, we must resist overclaiming. The data presented here does not support strong causal statements about learning effectiveness.

6.2 Addressing the Research Questions

RQ1: Can a schema effectively represent knowledge gaps?

The JSON schema proved workable. The key design choices—separating confidence from mastery, prioritizing struggles over achievements, manually curating prerequisite relationships—enabled rich context provision. However, the 0-100 mastery scale may imply false precision; future work should explore coarser categories.

RQ2: Can model maintenance be integrated with acceptable friction?

Yes, for a motivated solo user. The combined overhead from both protocols remained within acceptable bounds. Based on the four experimental sessions:

Student Model overhead: ~2 minutes per session (loading context via student.py show/related + session-end updates). This represents ~2% of 90-minute sessions and felt like pure maintenance work.

Workspace Protocol overhead: Difficult to measure separately because workspace commands (cat, grep, find) blend into natural code investigation. Running grep to find patterns IS learning—you discover project structure. Reading cat output IS learning—you see actual code. The commands felt like exploration rather than overhead, suggesting true overhead was minimal (~0-2 minutes for session setup).

Total overhead: 2-4% of session time, well within the <10% acceptable threshold and approaching the ideal <5% target.
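
The rough arithmetic behind this range, using the per-protocol estimates above and a 90-minute session:

Student Model only:             2 min / 90 min ≈ 2.2%
Model + workspace setup:  (2 + 2) min / 90 min ≈ 4.4%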

However, this assessment is subject to important caveats: the timings are retrospective self-estimates by the designer-user rather than instrumented measurements, and they reflect a single motivated participant.

Generalization questions remain: whether learners who are less comfortable with the command line, less invested in the system's success, or working over longer time horizons would experience the same low friction.
The system proved viable for one motivated, command-line-proficient user in a 4-week study. Broader adoption feasibility remains uncertain.

RQ3: Can persona engineering drive consistent LLM behavior?

Yes, within the scope tested. Claude followed the protocol reliably. However, this was verified with a single LLM at a single point in time; other models, or future versions of the same model, may not adhere as consistently to persona-based control.

RQ4: Is the system practically viable?

For one user in one context: yes. For broader adoption: uncertain. The system requires comfort with the command line, the discipline to maintain the model at the end of every session, and willingness to follow a multi-step protocol before instruction begins.

These requirements may limit adoption.

6.3 Theoretical Implications

This work suggests a promising direction for LLM-based education: hybrid architectures that combine LLM conversational ability with structured external memory.

Traditional ITS achieved adaptivity through complex rule systems and domain models but were rigid and domain-specific. LLMs achieve flexibility but lack memory. This system represents a middle path: flexible conversation guided by lightweight, structured context.

The "Scaffolding of Ignorance" framing—prioritizing knowledge gaps over achievements—aligns with constructivist learning theory and the zone of proximal development. The system is designed to help the LLM meet the student at the edge of their current understanding.

6.4 Practical Implications and Future Directions

This work suggests several practical implications for educators, researchers, and developers, with particular attention to the Workspace Protocol as a generalizable pattern for evidence-based AI tutoring.


6.4.1 The Workspace Protocol as a Reusable Design Pattern

The Workspace Protocol can function as a standalone pattern applicable to any LLM-based tutoring system. Its core principle—mandatory evidence-based investigation before explanation—addresses a fundamental weakness in current LLM interactions: assumption-based responses that may be incorrect or irrelevant.

Pattern Structure:

  1. Context Request: LLM explicitly requests concrete evidence from user's environment
  2. Evidence Analysis: LLM analyzes provided evidence and articulates observations
  3. Incremental Investigation: LLM requests additional evidence based on analysis
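
A minimal sketch of this loop, under the assumption of a hypothetical LLM client exposing request_evidence and explain methods (neither is a real API), looks like this:

import subprocess

def run_command(cmd: str) -> str:
    """Run an evidence-gathering command (cat, grep, find, git) in the learner's workspace."""
    return subprocess.run(cmd, shell=True, capture_output=True, text=True).stdout

def investigate(llm, question: str, max_rounds: int = 3) -> str:
    evidence = []
    for _ in range(max_rounds):
        # 1. Context request: the LLM names the evidence it needs next, or None if satisfied.
        cmd = llm.request_evidence(question, evidence)
        if cmd is None:
            break
        # 2. Evidence analysis: the command output is fed back on the next round.
        evidence.append((cmd, run_command(cmd)))
        # 3. Incremental investigation: the loop continues with accumulated evidence.
    return llm.explain(question, evidence)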

Key Properties:

Adoption Levels:

Level 1 - Minimal: Add to system prompt: "Request concrete evidence before explaining. Never assume file contents or configurations." (5 minutes, reduces incorrect assumptions)

Level 2 - Structured: Document investigation patterns, provide command templates, train on incremental workflow. (2-3 hours, consistent methodology)

Level 3 - Full: Integrate with persistent memory system, implement session-end capture, maintain learner models. (Full system as described in dissertation)

Cross-Domain Applications:

The pattern generalizes beyond programming:

# Data Science
head -20 dataset.csv; csvstat dataset.csv

# System Administration
systemctl status service; journalctl -u service -n 50

# Scientific Computing
conda list; nvidia-smi; cat experiment_config.yaml

In each domain: request concrete evidence from the actual environment before providing advice.


6.4.2 For Educators

The principles inform pedagogical practice independent of LLM integration:

Structured Reflection: Session-end protocols have value beyond model maintenance. Weekly learning logs, struggle journals, and breakthrough debriefs capture learning while fresh.

Prerequisite Graphs: Manually curated dependency relationships aid diagnosis. When students struggle with X, check prerequisites Y and Z systematically.
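
As a sketch of what checking prerequisites systematically can mean in practice (the concept graph below is invented for illustration):

from collections import deque

# Hand-curated prerequisite graph; entries are a made-up example, not the dissertation's data.
PREREQS = {
    "react-hooks": ["javascript-closures", "component-lifecycle"],
    "javascript-closures": ["functions-and-scope"],
}

def concepts_to_check(struggling_concept: str) -> list[str]:
    """Return prerequisites in breadth-first order, nearest first."""
    seen, order, queue = set(), [], deque(PREREQS.get(struggling_concept, []))
    while queue:
        concept = queue.popleft()
        if concept in seen:
            continue
        seen.add(concept)
        order.append(concept)
        queue.extend(PREREQS.get(concept, []))
    return order

print(concepts_to_check("react-hooks"))
# ['javascript-closures', 'component-lifecycle', 'functions-and-scope']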

Evidence-Based Investigation as Pedagogy: Teaching evidence-gathering transfers beyond programming—hypothesis → evidence → conclusion models scientific thinking, debugging methodology, and research skills.

Persistent Context Enables Continuity: Even lightweight versions improve tutoring—student profiles noting strengths/gaps, session summaries shared with students, progress dashboards tracking concept mastery.


6.4.3 For Researchers: High-Priority Questions

Q1: Efficacy Evaluation
Does dual-context architecture improve learning outcomes?

Q2: Component Ablation
Which components provide the value?

Q3: Schema Optimization
Is the current Student Model schema optimal?

Q4: Comparative Analysis
Compare against alternatives: traditional ITS, an LLM with full conversation history, and human tutoring with and without notes.

Q5: Longitudinal Sustainability
Track 12-16 weeks of usage: engagement decay, abandonment rate, protocol compliance, and cumulative learning effects.

Q6: Cross-Domain Generalization
Test the Workspace Protocol in data science, system administration, scientific computing, and creative work.

Q7: Optimal Investigation Patterns
Log 100+ sessions, cluster investigation patterns, correlate them with outcomes, and extract optimal sequences.

Q8: Accessibility and Equity
Examine barriers for command-line novices, non-native English speakers, and screen reader users, as well as differences across cultural learning styles.


6.4.4 For Developers: Extension Opportunities

Integration: IDE extensions (VS Code), git integration (commit-based learning events), testing integration (mastery from pass rates)

Intelligence: Automatic concept extraction (NLP from conversations), smart prerequisite inference (graph-based), spaced repetition (Anki-style staleness detection)

Visualization: Interactive prerequisite graphs (D3.js), progress dashboards (temporal views), mastery-over-time charts

Workspace Extensions: Multi-toolchain support (Python/Java/JavaScript patterns), cloud integration (AWS CLI, Docker, Kubernetes), cross-domain command libraries

Collaboration: Shared team models (aggregate knowledge gaps), instructor dashboards (class-wide heatmaps), mentor assignment (complementary expertise)


6.4.5 Production Readiness

The prototype requires: comprehensive error handling, automatic backup/recovery, security (data protection, access control, audit logging), scalability (SQLite backend, incremental loading, caching), better usability (interactive tutorial, autocomplete, clearer errors), and monitoring (usage analytics, A/B testing, crash reporting).


6.4.6 Ethical Considerations

Privacy: Student Models contain a detailed record of intellectual weaknesses. Risks include malware, legal discovery, and unauthorized access; mitigation requires informed consent, data retention policies, and a right to deletion.

Misuse: Could models be used to rank or discriminate against students? Should instructors have access at all, given the power imbalance? Safeguards against coercion are needed.

Psychological Impact: Does explicitly cataloging struggles harm confidence? Research on self-assessment effects is needed.

Algorithmic Bias: Do manually curated prerequisite graphs encode unexamined assumptions? Does a command-line workflow privilege certain demographics? Equity impacts should be monitored.

Recommendation: Institutional adoption requires IRB review, pilot studies with diverse populations, and ongoing equity monitoring.


6.4.7 The Broader Vision: Stateful AI Partners

The core insight generalizes: LLMs gain value from persistent, structured context about the human they assist.

The "Student Model" pattern extends to:

Pattern: Structured external memory (persistent model) + Evidence-based grounding (concrete artifacts) + LLM synthesis (bridging contexts)
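
A minimal sketch of that synthesis step, assuming the two contexts have already been rendered to text (the prompt wording and function name below are illustrative, not the system's actual persona prompt):

def build_tutor_prompt(student_model_summary: str, workspace_evidence: str, question: str) -> str:
    """Assemble a dual-context prompt: persistent memory plus concrete workspace evidence."""
    return "\n\n".join([
        "You are a Socratic programming mentor.",
        "Persistent student model (conceptual memory across sessions):",
        student_model_summary,
        "Workspace evidence (output of cat/grep/find/git from the learner's project):",
        workspace_evidence,
        "Ground every explanation in the evidence above; request more evidence rather than assuming file contents.",
        "Learner's question: " + question,
    ])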

However, persistent records of human weakness are powerful and dangerous. Responsible deployment requires: robust privacy protections, informed consent, transparency in influence, safeguards against discrimination, and user control over data.


6.4.8 From Prototype to Practice: A Roadmap

Phase 1 (Months 1-2): Implement system, run personal study (n=1), document friction, share publicly

Phase 2 (Months 3-6): 5-10 volunteers, controlled comparison, qualitative feedback, refine

Phase 3 (Months 7-12): IRB approval, N≥30, random assignment, objective measures, peer-reviewed publication

Phase 4 (Year 2): Address weaknesses, test ablations, explore cross-domain, build tools

Phase 5 (Year 3): Institutional pilot (one course, opt-in), monitor usage/outcomes, publish case study

Phase 6 (Year 4+): Production system, comprehensive docs, instructor training, governance model, sustainability

This timeline is ambitious but realistic for a dedicated research team or well-resourced EdTech startup.


6.4.9 Summary

The Workspace Protocol, as a standalone pattern, offers low-cost, high-value improvement to LLM assistance. Its principle—demand concrete evidence before advice—is simple, generalizable, and immediately applicable.

Educational technologists, LLM designers, and tutors can adopt this today:

The full system—Student Model + Workspace Protocol + Socratic Persona—demonstrates AI amnesia is solvable through careful design. But even partial adoption of workspace investigation would improve countless tutoring interactions.

This dissertation provides a working proof-of-concept, an open-source implementation, and detailed documentation. The next step is for researchers to validate it, practitioners to adopt it, and developers to extend it.

Within that effort, the Workspace Protocol is the foundation on which the scaffolding of ignorance—the explicit mapping of what we don't yet know—stands.

6.5 Limitations and Constraints

Beyond the methodological issues already discussed, several design limitations warrant attention:

Privacy and Data Sovereignty: While the local-first design gives students control, it also means:

Concept Naming Fragility: The system relies on consistent concept names. "React Hooks", "React hooks", and "hooks (React)" are treated as different concepts despite semantic equivalence. A more robust system would need:
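
One lightweight mitigation, sketched below with invented alias entries, is to normalize names and route known variants through a hand-maintained alias table:

import re

# Hand-maintained alias table; entries shown are examples, not an exhaustive mapping.
ALIASES = {
    "hooks (react)": "react hooks",
    "react-hooks": "react hooks",
}

def canonical_name(raw: str) -> str:
    """Lowercase, collapse whitespace, then apply the alias table."""
    name = re.sub(r"\s+", " ", raw.strip().lower())
    return ALIASES.get(name, name)

assert canonical_name("React Hooks") == canonical_name("hooks (React)") == "react hooks"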

No Temporal Decay: The model assumes mastery is stable. In reality, concepts decay without practice. A more sophisticated system might:
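
One simple option, sketched below with an arbitrary half-life, would be to decay stored mastery exponentially with time since last practice:

from datetime import date

HALF_LIFE_DAYS = 60  # arbitrary illustration: mastery halves after two months without practice

def decayed_mastery(mastery: int, last_practiced: date, today: date | None = None) -> int:
    """Return mastery discounted by how long the concept has gone unpracticed."""
    today = today or date.today()
    days_idle = max((today - last_practiced).days, 0)
    return round(mastery * 0.5 ** (days_idle / HALF_LIFE_DAYS))

print(decayed_mastery(80, date(2026, 1, 1), today=date(2026, 3, 2)))  # 60 idle days -> 40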

Scalability Questions: As the concept count grows, will the model become:

These questions remain unanswered.

Ethical Considerations: A persistent record of intellectual weaknesses raises concerns:

  1. Misuse Risk: Could this data be used to rank or discriminate against students?
  2. Psychological Impact: Does explicitly cataloging struggles reinforce negative self-perception?
  3. Privacy: Even in local storage, these files could be exposed through malware, legal discovery, or unauthorized access
  4. Bias Encoding: Could the model perpetuate stereotypes (e.g., "students weak in X are always weak in Y")?

The current design prioritizes student control (local storage, human-readable format) to mitigate some risks, but comprehensive ethical analysis is needed before any institutional deployment.

6.6 Alternative Explanations for Observed Benefits

Scientific honesty requires considering alternative explanations for why experimental sessions felt more valuable:

Alternative 1: Structured Reflection, Not Memory
The session-end protocol involves explicit articulation of learning. This metacognitive exercise alone could account for perceived benefits, independent of whether the LLM reads the model. The structured reflection might be the active ingredient.

Alternative 2: Increased Time-on-Task
Experimental sessions included 4-5 minutes of additional active engagement (running commands, thinking about mastery levels). This represents ~5% more learning time, which could compound over multiple sessions.

Alternative 3: Expectancy Effects
Believing the system would help may have increased attention, effort, and engagement—creating a self-fulfilling prophecy.

Alternative 4: Claude's Innate Abilities
Modern LLMs are quite good at inferring prerequisite gaps from conversational context alone. The structured model may provide minimal additional benefit beyond what a detailed conversation history already offers.

Alternative 5: Novelty Effect
The experimental condition was new and engaging. This novelty may wear off, with long-term engagement reverting to baseline.

A rigorous study would need to control for these alternatives through careful experimental design.


Chapter 7: Conclusion

7.1 Summary of Contributions

This dissertation makes three primary contributions:

  1. A Student Model Schema optimized for tracking learning gaps, prerequisite relationships, and metacognitive states (confidence, struggles, breakthroughs) in programming education.

  2. An Open-Source Reference Implementation demonstrating dual-context architecture for LLM tutoring. The implementation comprises: (a) a CLI tool (student.py) for maintaining persistent conceptual memory; (b) a Workspace Protocol for ephemeral code investigation using standard Unix tools (cat, grep, find, git); and (c) a detailed collaboration protocol showing how an LLM persona can synthesize both contexts for grounded, continuous tutoring. This provides a complete reference architecture for integrating structured external memory with evidence-based workspace investigation.

  3. A Feasibility Assessment documenting one extended case of real-world use, with qualitative evidence and honest discussion of both promising patterns and significant limitations.

These artifacts are intended as a foundation for future research rather than conclusive evidence of effectiveness.

7.2 What Has Been Shown (and What Hasn't)

What the evidence supports:

What remains uncertain:

7.3 From Prototype to Research Agenda

This work should be understood as the first stage of a research program, not its conclusion. The prototype demonstrates feasibility and generates hypotheses for rigorous testing. The natural next steps are:

Phase 2: Controlled Efficacy Study

Phase 3: Ablation and Comparative Studies

Phase 4: Longitudinal Deployment

Phase 5: Cross-Domain and Cross-Population

7.4 Broader Vision: Toward Stateful AI Partners

Looking beyond programming education, the core insight generalizes: LLMs become more valuable when augmented with persistent, structured context about the human they're assisting.

We can envision:

The "Student Model" pattern could evolve into a "Human Model" pattern—ethical, transparent, user-controlled memory systems that make AI collaboration genuinely continuous rather than episodic.

However, this vision requires careful navigation of ethical minefields. Persistent records of human weakness and knowledge gaps are powerful and potentially dangerous. Any deployment beyond personal use demands:

7.5 Personal Reflection: Building What You Need

This project emerged from personal frustration. As a self-taught learner exploring complex codebases, I repeatedly experienced the inefficiency of AI amnesia—explaining my background, re-receiving basic explanations, feeling like I was starting over each session.

Building this system was scratching my own itch. The methodological limitations are real and significant, but the system has become a genuine part of my learning workflow. I continue using it beyond the formal study period, which suggests it provides real (if subjective) value.

Perhaps the strongest argument for this work is not in the data but in the artifact: it exists, it works, and it's available for others to use, critique, and improve. The code is open source. The design is documented. The patterns are replicable.

If this dissertation inspires one researcher to conduct a rigorous efficacy study, or one student to build a better version, or one educator to think differently about AI tutoring, it will have succeeded.

7.6 Final Thoughts

AI amnesia is solvable. The technology exists today to give LLMs memory—not through architectural changes, but through careful system design that treats external memory as a first-class feature.

This dissertation demonstrates one approach. It is imperfect, incomplete, and limited in scope. But it is a real system, tested in real learning contexts, and it points toward a future where AI tutors are not stateless oracles but genuine learning partners who grow with us over time.

The scaffolding of ignorance—that explicit mapping of what we don't yet know—may be the key to transforming episodic AI interactions into continuous, developmental relationships.

The journey from here requires rigor this undergraduate dissertation could not provide. But every journey begins with a first step.

This is mine.


References

Bjork, R. A. (1994). Memory and metamemory considerations in the training of human beings. In J. Metcalfe & A. Shimamura (Eds.), Metacognition: Knowing about knowing (pp. 185-205). MIT Press.

Corbett, A. T., & Anderson, J. R. (1994). Knowledge tracing: Modeling the acquisition of procedural knowledge. User Modeling and User-Adapted Interaction, 4(4), 253-278.

Denny, P., Kumar, V., & Giacaman, N. (2023). Conversing with Copilot: Exploring prompt engineering for solving CS1 problems using natural language. Proceedings of the 54th ACM Technical Symposium on Computer Science Education, 1136-1142.

Graves, A., Wayne, G., & Danihelka, I. (2014). Neural Turing machines. arXiv preprint arXiv:1410.5401.

Kersten, M., & Murphy, G. C. (2006). Using task context to improve programmer productivity. Proceedings of the 14th ACM SIGSOFT International Symposium on Foundations of Software Engineering, 1-11.

Leinonen, J., Hellas, A., Sarsa, S., Reeves, B., Denny, P., Prather, J., & Becker, B. A. (2023). Using large language models to enhance programming error messages. Proceedings of the 54th ACM Technical Symposium on Computer Science Education, 563-569.

Madaan, A., Tandon, N., Gupta, P., Hallinan, S., Gao, L., Wiegreffe, S., Alon, U., Dziri, N., Prabhumoye, S., Yang, Y., Gupta, S., Majumder, B. P., Hermann, K., Welleck, S., Yazdanbakhsh, A., & Clark, P. (2022). Memory-assisted prompt editing to improve GPT-3 after deployment. arXiv preprint arXiv:2201.06009.

Zimmerman, B. J. (2002). Becoming a self-regulated learner: An overview. Theory Into Practice, 41(2), 64-70.


Acknowledgments

I thank Anthropic for creating Claude, without which this research would not have been possible. I thank the monkeytype open-source community for building software worth learning from. I thank my advisor [REDACTED] for encouraging me to pursue an unconventional thesis topic. Most importantly, I thank the future researchers who will take this proof-of-concept and subject it to the rigorous empirical scrutiny it deserves.