It Starts with Skills
Every piece of content is generated for a specific skill. Skills are the atomic units of learning—each one maps to academic standards and has prerequisite relationships.
Skill
Standards Alignment
Understand addition of fractions as joining parts referring to the same whole.
Represent and solve addition of fractions with equal denominators.
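To make the shape of a skill record concrete, here is a minimal TypeScript sketch. The field names are illustrative assumptions rather than the platform's actual schema; the skill ID and standards codes come from the example item later in this section, and the prerequisite ID is hypothetical.

// Illustrative sketch of a skill record; field names are assumptions.
interface Skill {
  id: string;
  name: string;
  gradeLevel: string;                                   // "K" through "8"
  standards: { code: string; jurisdiction: string }[];  // academic standards alignment
  prerequisiteSkillIds: string[];                       // edges in the skill graph
}

const addLikeDenominators: Skill = {
  id: "czi-skill-4nf3a-add-like-denom",
  name: "Add fractions with like denominators",
  gradeLevel: "4",
  standards: [
    { code: "CCSS.Math.4.NF.B.3a", jurisdiction: "CCSS" },
    { code: "TEKS.4.3E", jurisdiction: "Texas" },
  ],
  prerequisiteSkillIds: ["czi-skill-3nf1-unit-fractions"], // hypothetical prerequisite ID
};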
Generation Output Example
The prompt asks the AI to first reason about pedagogy, then generate items. Here's what a complete generation output looks like—the pedagogical context followed by the item sequence.
First: Pedagogical Context (in generation output)
Chain-of-thought
{
  "pedagogicalContext": {
    "bigIdea": {
      "concept": "When denominators are the same, we're counting the same-sized pieces",
      "whyItMatters": "Foundation for all fraction operations and proportional reasoning",
      "realWorldConnection": "Combining parts of the same whole (pizza slices, measuring cups)"
    },
    "commonStruggles": [
      {
        "struggle": "Adding both numerators AND denominators",
        "cause": "Applying whole-number addition rules to fractions",
        "frequency": "Very common (60%+ of initial errors)"
      },
      {
        "struggle": "Not recognizing when denominators match",
        "cause": "Focusing on numerators first",
        "frequency": "Common (30%)"
      }
    ],
    "misconceptions": [
      {
        "error": "2/8 + 3/8 = 5/16",
        "thinking": "Add tops, add bottoms",
        "remediation": "Use visual model showing pieces don't change size"
      },
      {
        "error": "2/8 + 3/8 = 5",
        "thinking": "Ignoring the denominator entirely",
        "remediation": "Emphasize fraction notation meaning 'parts of a whole'"
      }
    ],
    "instructionalContext": {
      "priorKnowledge": ["Fraction equivalence", "Unit fractions", "Part-whole relationship"],
      "typicalSequence": "Visual models → symbolic → mixed representations",
      "effectiveStrategies": ["Fraction bars", "Number lines", "Same-denominator emphasis"]
    }
  },
  "items": [ /* 7 items follow, using the context above */ ]
}
Then: Item from the Sequence
{
  "id": "item-4nf3a-m1-001",
  "type": "MULTIPLE_CHOICE",

  // === CONTENT ===
  "stem": "What is 2/8 + 3/8?",
  "content": {
    "choices": [
      { "id": "a", "text": "5/16", "isCorrect": false },
      { "id": "b", "text": "5/8", "isCorrect": true },
      { "id": "c", "text": "6/8", "isCorrect": false },
      { "id": "d", "text": "5", "isCorrect": false }
    ],
    "explanation": "When adding fractions with the same denominator, add the numerators and keep the denominator: 2/8 + 3/8 = 5/8",
    "misconceptions": {
      "a": "Added both numerators AND denominators",
      "c": "Multiplied instead of adding",
      "d": "Added numerators but forgot the denominator"
    }
  },

  // === DIFFICULTY METADATA ===
  "difficulty": 0.45,            // 0-1 scale (IRT-style)
  "discrimination": 1.2,         // 0-2 scale (how well it separates learners)
  "dokLevel": 2,                 // Depth of Knowledge (1-4)
  "sequencePosition": "medium1", // Position in 7-item sequence

  // === GRADE & READABILITY ===
  "gradeLevel": "4",
  "readingLevel": 3.2,   // Flesch-Kincaid grade level
  "vocabularyTier": 1,   // 1=basic, 2=academic, 3=domain-specific

  // === SKILL & STANDARDS ===
  "skillId": "czi-skill-4nf3a-add-like-denom",
  "standards": [
    { "code": "CCSS.Math.4.NF.B.3a", "jurisdiction": "CCSS" },
    { "code": "TEKS.4.3E", "jurisdiction": "Texas" }
  ],

  // === SCAFFOLDING ===
  "hints": [
    { "text": "Look at the denominators. Are they the same?", "cost": 5 },
    { "text": "When denominators match, just add the top numbers.", "cost": 10 },
    { "text": "2 + 3 = 5, so the answer is 5/8.", "cost": 20 }
  ],

  // === VISUAL AIDS ===
  "visualAids": [
    {
      "widget": "fraction-bar",
      "purpose": "Show 2/8 and 3/8 as shaded portions",
      "config": {
        "numerator": 2,
        "denominator": 8,
        "showWhole": true
      }
    }
  ],

  // === GENERATION METADATA ===
  "generatedBy": "gemini-3-flash",
  "generatedAt": "2026-01-13T10:30:00Z",
  "promptVersion": "v2.3",
  "tokensUsed": { "input": 1250, "output": 680 },
  "cost": 0.0012,

  // === EVALUATION METADATA ===
  "evaluationScore": 2.6,
  "evaluationDetails": {
    "factualAccuracy": 3,
    "gradeAppropriateness": 3,
    "pedagogicalSoundness": 2,
    "jsonValidity": 3,
    "completeness": 2
  },
  "evaluationFeedback": "Good item. Consider adding a visual aid.",
  "secondaryEvaluation": {
    "model": "gpt-4o",
    "score": 2.7,
    "agreement": true
  },

  // === REVIEW STATUS ===
  "status": "IN_REVIEW",
  "reviewedBy": null,
  "reviewedAt": null,
  "reviewNotes": []
}
Item Metadata Reference
| Field | Type | Description |
|---|---|---|
| Skill & Standards | | |
| skillId | string | Primary skill this item assesses |
| standards | array | Mapped standards (CCSS, TEKS, etc.) |
| Difficulty | | |
| difficulty | 0.0-1.0 | IRT difficulty parameter |
| discrimination | 0.0-2.0 | How well item separates high/low performers |
| dokLevel | 1-4 | Webb's Depth of Knowledge |
| sequencePosition | string | easy1, easy2, medium1-3, hard1, hard2 |
| Readability | | |
| gradeLevel | string | Target grade (K, 1-8) |
| readingLevel | number | Flesch-Kincaid grade level of text |
| vocabularyTier | 1-3 | 1=basic, 2=academic, 3=domain-specific |
| Evaluation | | |
| evaluationScore | 0.0-3.0 | Average of 5 criteria scores |
| evaluationDetails | object | Per-criterion scores (0-3 each) |
| secondaryEvaluation | object | Cross-model verification (GPT-4o) |
Widget & Output Integration
Items can include visual aids that render as interactive widgets (web) or static SVGs (PDF). The same content works across all output formats.
Interactive Widgets
React components for web-based practice:
- • Number lines (draggable markers)
- • Fraction bars (interactive shading)
- • Ten frames (click to fill)
- • Coordinate planes (plot points)
- • Area models (dynamic grids)
- • Clocks (movable hands)
PDF Paper Widgets
Static SVG renders for worksheets:
- • Number lines (with answer blanks)
- • Counting scenes (sprites)
- • Fraction bars (empty for shading)
- • Coordinate grids (points to identify)
- • Ten frames (fill-in)
- • Base-ten blocks
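Both columns above can be driven by the same stored widget config. Here is a minimal TypeScript sketch of that idea; the render helpers and the data-attribute hydration scheme are illustrative assumptions, not the platform's actual components.

// Sketch: one widget config, two output targets (interactive web, static SVG for PDF).
type FractionBarConfig = { numerator: number; denominator: number; showWhole?: boolean };

function renderFractionBarSVG({ numerator, denominator }: FractionBarConfig): string {
  // Static SVG for worksheets: shading is baked into the markup.
  const cells = Array.from({ length: denominator }, (_, i) =>
    `<rect x="${i * 40}" width="40" height="30" fill="${i < numerator ? "#999" : "none"}" stroke="black"/>`
  ).join("");
  return `<svg width="${denominator * 40}" height="30">${cells}</svg>`;
}

function renderWidget(config: FractionBarConfig, format: "web" | "pdf"): string {
  if (format === "pdf") return renderFractionBarSVG(config);
  // Web: emit a placeholder the front end hydrates into the interactive React component.
  return `<div data-widget="fraction-bar" data-config='${JSON.stringify(config)}'></div>`;
}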
Item Types
9 supported response types:
- • Multiple Choice
- • Multiple Select
- • Numeric Entry
- • Short Answer
- • Fill in the Blank
- • True/False
- • Matching
- • Ordering
- • Constructed Response
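As a TypeScript union, the nine types might look like the sketch below. Only MULTIPLE_CHOICE, NUMERIC, and CONSTRUCTED_RESPONSE appear verbatim elsewhere in this document; the remaining enum spellings are assumptions.

// The nine response types as a union matching the item "type" field.
type ItemType =
  "MULTIPLE_CHOICE" | "MULTIPLE_SELECT" | "NUMERIC" |
  "SHORT_ANSWER" | "FILL_IN_THE_BLANK" | "TRUE_FALSE" |
  "MATCHING" | "ORDERING" | "CONSTRUCTED_RESPONSE";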
The Three-Stage Pipeline
AI Generation
Gemini 3 Flash generates items for each skill, including all metadata, hints, and widget configs.
- • 7-item difficulty sequences per skill
- • Standards-aligned via skill mapping
- • Widget configs for visual aids
Dual AI Evaluation
Primary evaluation scores 5 criteria. 10-20% get secondary evaluation by a different model for bias reduction.
- • Factual accuracy (answer correctness)
- • Grade appropriateness (reading level)
- • Pedagogical soundness (learning science)
Human Review
ALL content requires human review. Auto-approved items still get sampled. Teachers verify answers and pedagogy.
- • Verify answer correctness
- • Check age-appropriateness
- • Approve or request changes
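A minimal sketch of how the three stages could be orchestrated. The generate/evaluate/enqueue functions are assumed stand-ins for the real services, and the 2.5 auto-approve suggestion threshold is an assumption.

// Sketch of the three-stage flow; external services are declared only for type-checking.
type Evaluation = { score: number };
type Item = { id: string; evaluation?: Evaluation; secondaryEvaluation?: Evaluation };

declare function generateItems(skillId: string): Promise<{ items: Item[] }>;
declare function evaluatePrimary(item: Item): Promise<Evaluation>;
declare function evaluateSecondary(item: Item): Promise<Evaluation>;
declare function enqueueForReview(item: Item, opts: { autoApproveSuggested: boolean }): Promise<void>;

async function runPipeline(skillId: string): Promise<void> {
  // Stage 1: generate one 7-item difficulty sequence for the skill.
  const { items } = await generateItems(skillId);

  for (const item of items) {
    // Stage 2: primary evaluation on every item; a different model re-scores a 10-20% sample.
    item.evaluation = await evaluatePrimary(item);
    if (Math.random() < 0.15) item.secondaryEvaluation = await evaluateSecondary(item);

    // Stage 3: every item enters the human review queue. A high score only suggests
    // auto-approval (threshold assumed); humans still verify before publication.
    await enqueueForReview(item, { autoApproveSuggested: item.evaluation.score >= 2.5 });
  }
}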
Structured Generation Prompt (Stage 1 Detail)
The generation prompt requires the AI to think through pedagogy first, then generate items. This "chain-of-thought for item design" produces better distractors, hints, and explanations.
First: Think Through Pedagogy
- • Big Idea: Core concept and why it matters
- • Common Struggles: What students find hard
- • Instructional Context: How this is typically taught
- • Misconceptions: Predictable errors and their causes
- • Prerequisite Gaps: Missing knowledge that causes confusion
Then: Generate Items Using That Context
- • Distractors: Based on the misconceptions identified
- • Hints: Target the struggles listed
- • Explanations: Connect back to the big idea
- • Difficulty: 7-item sequence from easy1 to hard2
- • Scaffolding: Address prerequisite gaps
This pedagogical thinking happens within the prompt—the AI reasons about teaching before writing items. The context can optionally be extracted and saved as skill-level metadata for teacher guides.
Even items with high evaluation scores go to the review queue. "Auto-approve" means they're flagged as likely good, but humans still verify before publication. We check 100% of content for answer accuracy and pedagogical quality.
Enrichment: Format-Specific Transforms
After human review, approved items can be enriched with format-specific features. This keeps the generation prompt focused on pedagogy while separate transforms handle interoperability.
QTI 3.0 Compliance
Transform native items to 1EdTech QTI format:
- • responseDeclaration for answer keys
- • outcomeDeclaration for scoring
- • choiceInteraction for MC items
- • textEntryInteraction for numeric
- • Template processing for feedback
Accessibility (WCAG 2.1)
Add accessibility features automatically:
- • ARIA labels for interactive widgets
- • Alt text for generated diagrams
- • Screen reader hints for math notation
- • Keyboard navigation patterns
- • High contrast mode support
LMS Packaging
Package for external LMS delivery:
- • SCORM 2004 wrappers
- • LTI 1.3 resource links
- • QTI package manifests (imsmanifest.xml)
- • xAPI statement templates
- • Common Cartridge bundles
Localization
Adapt content for different locales:
- • Translation via specialized models
- • Locale-specific number formatting
- • Cultural context adaptation
- • Currency and measurement units
- • Right-to-left layout support
Generating QTI-compliant XML directly would complicate the generation prompt and reduce quality. By generating clean, pedagogically focused content first, we can run deterministic transforms to add format-specific features—and update those transforms without regenerating content.
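As an illustration of such a transform, here is a TypeScript sketch that turns the native multiple-choice shape shown earlier into an abbreviated QTI 3.0 choiceInteraction. It omits namespaces, titles, and response processing, so treat it as the shape of the approach rather than a conformant serializer.

// Sketch of a post-approval QTI transform for multiple-choice items (abbreviated).
type Choice = { id: string; text: string; isCorrect: boolean };
type NativeItem = { id: string; stem: string; content: { choices: Choice[] } };

function toQtiChoiceInteraction(item: NativeItem): string {
  const correct = item.content.choices.find(c => c.isCorrect);
  const choices = item.content.choices
    .map(c => `<qti-simple-choice identifier="${c.id}">${c.text}</qti-simple-choice>`)
    .join("\n      ");
  return `<qti-assessment-item identifier="${item.id}">
  <qti-response-declaration identifier="RESPONSE" cardinality="single" base-type="identifier">
    <qti-correct-response><qti-value>${correct?.id}</qti-value></qti-correct-response>
  </qti-response-declaration>
  <qti-outcome-declaration identifier="SCORE" cardinality="single" base-type="float"/>
  <qti-item-body>
    <qti-choice-interaction response-identifier="RESPONSE" max-choices="1">
      <qti-prompt>${item.stem}</qti-prompt>
      ${choices}
    </qti-choice-interaction>
  </qti-item-body>
</qti-assessment-item>`;
}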
Validation: LLM vs Code
Use LLMs for judgment, use code for computation. LLMs hallucinate arithmetic but reason well about pedagogy. Code computes perfectly but can't judge whether an explanation is clear.
Code-Based Validation
Deterministic, fast, 100% reliable
- • Answer verification: parse expression, compute result, verify match
- • Reading level: syllables, words, sentences → grade level formula
- • Schema validation: required fields, types, enum values
- • Render test: try rendering—does it work?
- • Prerequisite boundary: skill graph lookup for allowed concepts
- • Duplicate detection: semantic similarity to existing items
- • Structural rules: 4 choices, 3 hints, stem length limits
LLM-Based Evaluation
Judgment, reasoning, pedagogical expertise
- • Pedagogical soundness: does this teach the concept effectively?
- • Reading level nuance: beyond F-K (idioms, cultural refs, complexity)
- • Distractor quality: do these represent real misconceptions?
- • Skill alignment: does this actually test the intended skill?
- • Hint quality: progressive, not giving away the answer
- • Explanation clarity: would a student understand this?
- • Bias check: cultural, gender, socioeconomic awareness
Run code checks first (fast, cheap). If an item fails JSON validation or has a wrong answer, don't waste LLM tokens evaluating it. Only send items that pass code checks to LLM evaluation.
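A sketch of that ordering, with assumed helper names standing in for the individual checks:

// Cheap, deterministic code checks gate the expensive LLM evaluation.
type CheckResult = { pass: boolean; reason?: string };

declare function validateSchema(item: unknown): CheckResult;     // fields, types, enums
declare function verifyAnswer(item: unknown): CheckResult;       // parse stem, compute, compare
declare function checkReadingLevel(item: unknown): CheckResult;  // Flesch-Kincaid bound
declare function llmEvaluate(item: unknown): Promise<{ score: number }>;

async function evaluateItem(item: unknown) {
  for (const check of [validateSchema, verifyAnswer, checkReadingLevel]) {
    const result = check(item);
    // Fail fast: a wrong answer or invalid JSON never reaches the LLM judge.
    if (!result.pass) return { status: "REJECTED_BY_CODE", reason: result.reason };
  }
  // Only items that pass every code check spend tokens on LLM judgment.
  return { status: "EVALUATED", evaluation: await llmEvaluate(item) };
}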
Human Evaluation Methods
Direct Rating
Reviewer scores item on rubric criteria (1-3 scale).
Best for: Detailed feedback, training reviewers
Blind Comparison
Show a generated item alongside a published item (e.g., Smarter Balanced). The reviewer picks the better one without knowing which is which.
Best for: Calibration, measuring quality bar
Pass/Fail + Notes
Binary approve/reject with required explanation for rejections. Fast, captures blockers.
Best for: High-volume review, clear quality gate
Statistical Sampling for Population-Level Quality
You don't need to review every item. Random sampling + human ratings gives population-level quality estimates with confidence intervals.
Sample Size Guide
- • 30-50 items: rough quality estimate
- • 100 items: ±10% confidence interval
- • 400 items: ±5% confidence interval
Use Cases
- • Quality trends: track over time
- • A/B testing: compare prompt versions
- • Batch gates: reject if pass rate < 80%
- • Dashboards: report with error bars
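The math behind those sample sizes is a standard proportion confidence interval; a small sketch using the normal approximation at 95% confidence:

// Estimate the population pass rate from a random sample of human-reviewed items.
// At p = 0.5 this gives roughly ±10% for n = 100 and ±5% for n = 400, matching the guide above.
function passRateWithCI(passes: number, sampleSize: number) {
  const p = passes / sampleSize;
  const margin = 1.96 * Math.sqrt((p * (1 - p)) / sampleSize); // 95% CI half-width
  return { passRate: p, low: Math.max(0, p - margin), high: Math.min(1, p + margin) };
}

// Example: 82 of 100 sampled items pass review → roughly 82% ± 7.5%.
console.log(passRateWithCI(82, 100));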
Widget & Item Type Selection
How does the AI decide to use a number line vs. fraction bar? When is multiple choice better than numeric entry? These decisions happen at generation time, guided by the skill and prompt.
Item Type Selection
The prompt specifies which item types are appropriate for the skill:
Skill: Add fractions with like denominators
Allowed item types: MULTIPLE_CHOICE, NUMERIC
# NOT: CONSTRUCTED_RESPONSE (too open-ended)
- • Skill metadata defines allowed types
- • Difficulty level may influence (harder = more open)
- • 7-item sequence can mix types for variety
Widget Selection
Widgets are visual aids—the AI chooses based on what helps learning:
"This fraction addition problem would benefit
from a fraction-bar widget showing 2/8 + 3/8
as shaded portions of the same whole."
- • Prompt lists available widgets for the skill
- • AI selects + configures based on pedagogical fit
- • Config validated by code (will it render?)
Widget Selection by Skill Domain
| Domain | Common Widgets | When to Use |
|---|---|---|
| Fractions | fraction-bar, fraction-circle, number-line | Equivalence, comparison, operations |
| Place Value | base-ten-blocks, place-value-chart | Regrouping, expanded form |
| Counting (K-2) | ten-frame, counter-dots, rekenrek | Subitizing, number bonds |
| Multiplication | area-model, array, number-line | Visual multiplication, distributive property |
| Time | clock | Reading time, elapsed time |
| Geometry | shape-canvas, grid | Area, perimeter, transformations |
Widgets are interactive or structured (fraction bars, number lines). Illustrations are decorative or contextual (a picture of pizza, a cartoon character). For now, we focus on widgets. Illustrations may be added via image generation or stock assets in a future phase.
Context Engineering: Teaching the AI to Use Widgets
The generation prompt includes structured documentation so the AI knows how to select and configure widgets correctly.
# Widget Library (included in generation prompt)
## fraction-bar
PURPOSE: Visualize fractions as shaded portions of a rectangular bar
WHEN TO USE:
- Comparing fractions with same/different denominators
- Adding/subtracting fractions (show parts combining)
- Showing equivalence (same shaded area, different partitions)
WHEN NOT TO USE:
- Fractions > 1 (use multiple bars or number line instead)
- Very large denominators (>12 becomes hard to see)
CONFIG SCHEMA:
{
"numerator": number, // 0 to denominator
"denominator": number, // 2-12 recommended
"showLabels": boolean, // show fraction notation
"highlightNumerator": boolean
}
EXAMPLE - Good:
Skill: Compare fractions with like denominators
Widget: fraction-bar with numerator=3, denominator=8
Why: Shows 3/8 as visual area, easy to compare
EXAMPLE - Bad:
Skill: Add fractions 2/3 + 4/5
Widget: fraction-bar
Why: Different denominators need side-by-side bars or number line
---
## number-line
PURPOSE: Show numbers/fractions as positions on a continuous line
WHEN TO USE:
- Ordering/comparing multiple values
- Adding (jumps forward) or subtracting (jumps back)
- Fractions > 1 or mixed numbers
- Showing distance/difference between values
WHEN NOT TO USE:
- Part-whole relationships (fraction-bar is clearer)
- Very early fraction concepts (too abstract for K-1)
CONFIG SCHEMA:
{
"min": number,
"max": number,
"tickInterval": number,
"points": [{ "value": number, "label": string }],
"showJumps": boolean
}
What the Prompt Includes
- • PURPOSE: What the widget visualizes
- • WHEN TO USE: Pedagogical fit
- • WHEN NOT TO USE: Common mistakes
- • CONFIG SCHEMA: Valid parameters
- • GOOD/BAD EXAMPLES: Concrete usage
Skill-Specific Context
- • Available widgets filtered by skill
- • Recommended widget for this skill type
- • Config constraints (e.g., denominators 2-12)
- • Examples from same skill if available
Validation After Generation
Even with good prompt context, validate widget configs programmatically:
- • Schema validation: Are required fields present? Types correct?
- • Render test: Does the widget actually render without error?
- • Bounds check: Is numerator ≤ denominator? Is denominator reasonable?
- • Pedagogical sanity: Does config match the stem? (LLM check)
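For example, the schema and bounds checks for the fraction-bar config could look like this sketch. The field names follow the CONFIG SCHEMA documented in the prompt above; the validator itself is illustrative.

// Programmatic validation of a fraction-bar widget config (schema + bounds).
type FractionBarConfig = {
  numerator: number;       // 0 to denominator
  denominator: number;     // 2-12 recommended
  showLabels?: boolean;
  highlightNumerator?: boolean;
};

function validateFractionBar(config: FractionBarConfig): string[] {
  const errors: string[] = [];
  if (!Number.isInteger(config.numerator) || !Number.isInteger(config.denominator)) {
    errors.push("numerator and denominator must be integers");
  }
  if (config.denominator < 2 || config.denominator > 12) {
    errors.push("denominator outside the recommended 2-12 range");
  }
  if (config.numerator < 0 || config.numerator > config.denominator) {
    errors.push("numerator must be between 0 and the denominator");
  }
  return errors; // empty array = passes schema and bounds checks
}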
Quality Assurance Patterns
Beyond the three-stage pipeline, these patterns improve generation quality and catch issues early.
Reading Level Validation
Automatically check readability of generated content:
- • Flesch-Kincaid Grade Level - must be ≤ grade + 1
- • Vocabulary tier check - flag Tier 3 words for younger grades
- • Sentence complexity - shorter sentences for K-2
- • Word frequency analysis - prefer common words
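A sketch of the core check, using the standard Flesch-Kincaid formula and a deliberately rough syllable counter; a production validator would use a proper readability library.

// Flesch-Kincaid grade level: 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59
function fleschKincaidGrade(text: string): number {
  const sentences = Math.max(1, (text.match(/[.!?]+/g) || []).length);
  const words = text.trim().split(/\s+/);
  const syllables = words.reduce((sum, w) => sum + countSyllables(w), 0);
  return 0.39 * (words.length / sentences) + 11.8 * (syllables / words.length) - 15.59;
}

// Rough heuristic: count vowel groups; good enough for flagging, not for grading.
function countSyllables(word: string): number {
  const groups = word.toLowerCase().replace(/[^a-z]/g, "").match(/[aeiouy]+/g);
  return Math.max(1, groups ? groups.length : 0);
}

// Flag items whose reading level exceeds the target grade by more than one.
function readingLevelOk(stem: string, targetGrade: number): boolean {
  return fleschKincaidGrade(stem) <= targetGrade + 1;
}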
Answer Verification
Don't trust AI arithmetic—verify programmatically:
- • Parse the stem - extract the math expression
- • Compute the answer - use a math library
- • Verify correct choice - must match computed answer
- • Check distractors - must NOT equal correct answer
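A narrow sketch for the like-denominator skill used throughout this section; a real pipeline would parse expressions with a math library rather than a per-skill regex.

// Verify a like-denominator addition item: parse the stem, compute the sum
// independently, and confirm the keyed answer (and that no distractor equals it).
type Choice = { text: string; isCorrect: boolean };

function verifyLikeDenominatorItem(stem: string, choices: Choice[]): boolean {
  const m = stem.match(/(\d+)\/(\d+)\s*\+\s*(\d+)\/(\d+)/);
  if (!m) return false;                 // can't parse → route to manual review
  const [, n1, d1, n2, d2] = m.map(Number);
  if (d1 !== d2) return false;          // outside this skill's scope
  const expected = `${n1 + n2}/${d1}`;  // computed answer, independent of the AI

  const keyed = choices.find(c => c.isCorrect);
  const distractorsOk = choices.filter(c => !c.isCorrect).every(c => c.text !== expected);
  return keyed?.text === expected && distractorsOk;
}

// For the example item: verifyLikeDenominatorItem("What is 2/8 + 3/8?", choices) → true (5/8).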
Few-Shot Examples
Include 2-3 high-quality items as examples in the prompt:
Bootstrapping Phase (no approved items yet)
- • Hand-craft 5-10 gold standard items per skill type
- • Use published assessment items from open sources (e.g., Smarter Balanced, released state tests)
- • Adapt textbook examples to your format
- • Start with one skill, perfect it, expand
Steady State (approved items exist)
- • Same skill - shows expected format and difficulty
- • Similar skills - fallback when exact match unavailable
- • High-rated items - prioritize by eval score
- • Diverse examples - show range of contexts
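A sketch of that selection logic under the steady-state assumptions above; the types are illustrative, and the similar-skill list is assumed to come from the skill graph.

// Pick few-shot examples: prefer approved items from the same skill, ranked by
// evaluation score, then pad with high-rated items from similar skills.
type ApprovedItem = { skillId: string; evaluationScore: number; stem: string };

function pickFewShotExamples(
  skillId: string,
  similarSkillIds: string[],
  approved: ApprovedItem[],
  count = 3
): ApprovedItem[] {
  const byScore = (a: ApprovedItem, b: ApprovedItem) => b.evaluationScore - a.evaluationScore;
  const sameSkill = approved.filter(i => i.skillId === skillId).sort(byScore);
  if (sameSkill.length >= count) return sameSkill.slice(0, count);

  // Not enough same-skill examples yet: fall back to similar skills.
  const fallback = approved.filter(i => similarSkillIds.includes(i.skillId)).sort(byScore);
  return [...sameSkill, ...fallback].slice(0, count);
}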
Diversity Constraints
Require variety across the 7-item sequence:
- • Contexts - not all pizza (use baking, sports, money...)
- • Numbers - vary denominators, numerators, magnitudes
- • Representations - mix symbolic, visual, word problems
- • Question formats - find answer, find missing part, compare
Negative Examples in Prompts
Show the AI what NOT to generate:
- • Too easy - "What is 1/2 + 1/2?" (trivial)
- • Implausible distractors - options no one would choose
- • Tricky wording - "gotcha" questions that confuse
- • Above grade level - vocabulary or concepts too advanced
Prerequisite Boundary Check
Items shouldn't require skills beyond prerequisites:
- • Skill graph lookup - get prerequisite skill IDs
- • Concept extraction - identify concepts in item
- • Boundary validation - flag concepts from later skills
- • Simplification suggestions - how to remove the violation
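A sketch of the boundary validation step, assuming skills carry concept tags and prerequisite edges as structured metadata; the concept-extraction step itself is not shown.

// Flag concepts in an item that belong to skills beyond the target skill's prerequisites.
type SkillNode = { id: string; concepts: string[]; prerequisiteSkillIds: string[] };

function findBoundaryViolations(
  itemConcepts: string[],
  skill: SkillNode,
  skillGraph: Map<string, SkillNode>
): string[] {
  // Collect concepts allowed by the skill itself and everything upstream of it.
  const allowed = new Set(skill.concepts);
  const visited = new Set<string>([skill.id]);
  const queue = [...skill.prerequisiteSkillIds];
  while (queue.length > 0) {
    const id = queue.shift()!;
    if (visited.has(id)) continue;
    visited.add(id);
    const node = skillGraph.get(id);
    if (!node) continue;
    node.concepts.forEach(c => allowed.add(c));
    queue.push(...node.prerequisiteSkillIds);
  }
  // Anything left over comes from a later skill and should be flagged.
  return itemConcepts.filter(c => !allowed.has(c));
}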
Prompt Versioning
Track which prompt produced which items:
- • Store promptVersion - "v2.3" with each item
- • Quality tracking - correlate versions with eval scores
- • A/B testing - compare prompt variations
- • Rollback - revert if new prompt degrades quality
Tracked fields: promptVersion, generatedAt, modelId
Content Status Workflow
Multi-Model Strategy
Different AI models excel at different tasks. We use specialized models for each stage to maximize quality while minimizing cost.
| Task | Model | Why This Model |
|---|---|---|
| Content Generation | Gemini 3 Flash | Fast, excellent JSON output, lowest cost ($0.50/1M input) |
| Primary Evaluation | Gemini 3 Flash | Consistent scoring, catches factual and pedagogical errors |
| Secondary Evaluation | GPT-4o | Different perspective for bias reduction (10-20% sample) |
| Orchestration | Claude Opus 4.5 | Complex reasoning, architecture decisions, pedagogical judgment |
7-Item Difficulty Sequence
For each skill, we generate a 7-item sequence spanning the full difficulty spectrum. This enables adaptive practice that meets students where they are.
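The seven positions named in the item metadata run from easy1 to hard2. A small sketch of a sanity check that generated difficulty values actually rise across the sequence; the check itself is illustrative.

const SEQUENCE_POSITIONS = [
  "easy1", "easy2", "medium1", "medium2", "medium3", "hard1", "hard2",
] as const;
type SequencePosition = (typeof SEQUENCE_POSITIONS)[number];

// Ensure difficulty is non-decreasing from easy1 to hard2 across a generated sequence.
function difficultyIsMonotonic(
  items: { sequencePosition: SequencePosition; difficulty: number }[]
): boolean {
  const ordered = [...items].sort(
    (a, b) =>
      SEQUENCE_POSITIONS.indexOf(a.sequencePosition) -
      SEQUENCE_POSITIONS.indexOf(b.sequencePosition)
  );
  return ordered.every((item, i) => i === 0 || item.difficulty >= ordered[i - 1].difficulty);
}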
Evaluation Criteria
Every piece of generated content is scored on 5 criteria using a 0-3 scale.
Factual Accuracy
Mathematical and content correctness. No errors in problems or solutions.
Grade Appropriateness
Language complexity, vocabulary, and context fit the target grade level.
Pedagogical Soundness
Aligned with learning science. Proper scaffolding and progression.
JSON Validity
Schema compliance. All required fields present and correctly formatted.
Completeness
Includes all required components: hints, feedback, distractors with explanations.
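Per the metadata reference above, the overall evaluationScore is the average of the five criterion scores. A small sketch, matching the example item (3, 3, 2, 3, 2 → 2.6):

// Overall evaluation score: mean of the five 0-3 criterion scores, rounded to one decimal.
type EvaluationDetails = {
  factualAccuracy: number;
  gradeAppropriateness: number;
  pedagogicalSoundness: number;
  jsonValidity: number;
  completeness: number;
};

function overallScore(details: EvaluationDetails): number {
  const scores = Object.values(details);
  const mean = scores.reduce((sum, s) => sum + s, 0) / scores.length;
  return Math.round(mean * 10) / 10; // e.g. (3 + 3 + 2 + 3 + 2) / 5 = 2.6
}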
Scoring Thresholds
Human Review Process
Review Queue
Teachers and subject matter experts access content needing review through the Review Queue interface.
- ✓Filter by grade level and subject
- ✓See evaluation scores and flags
- ✓Preview items as students would see them
- ✓Add comments and feedback
Reviewer Actions
Reviewers can take several actions on content in the queue.
Quality Targets
Cost Efficiency
Human Review Interfaces
Multiple interfaces support different review workflows. Teachers and subject matter experts can choose the tool that fits their needs.
Content Review Dashboard
Comprehensive review interface for activities, items, lessons, and reading content. Approve, reject, or request revisions with detailed feedback.
Quick Item Review
Fast, keyboard-driven interface for reviewing items one at a time. Thumbs up/down with optional comments. Auto-saves drafts.
Worksheet Review
Review generated worksheets before publication. Preview PDFs, add comments, and manage approval workflow.
Widget Testing
Test and provide feedback on interactive math widgets. Submit issues directly to GitHub for the development team.
Motivational Design Demos
Content is only half the story. These demos showcase the student experience—how we use gamification, feedback, and adaptive difficulty to keep learners engaged.
Fluency Practice
Speed-based math fact practice with real-time feedback, streak tracking, and performance analytics.
Adaptive Difficulty
The system adjusts problem difficulty based on student performance in real time.
Item Types Demo
Experience all 9 assessment item types with hints, explanations, and visual feedback.
Built-in Gamification Features
Streak System
Consecutive correct answers build streaks with visual flame animations that grow with each success.
Speed Bonuses
Fast, accurate responses earn bonus XP. The speedometer shows real-time items-per-minute.
Correct Bursts
Particle explosions celebrate correct answers. Wrong answers get subtle shake feedback.
Performance Analytics
End-of-session results show accuracy, speed distribution, and identify problem areas.
Explore the Platform
Try the student experience or access the review queue.