How It Works

Content Production Pipeline

AI generates educational content at scale. Humans review before publication. Nothing reaches students without quality verification.

It Starts with Skills

Every piece of content is generated for a specific skill. Skills are the atomic units of learning—each one maps to academic standards and has prerequisite relationships.

Skill

Name: Add fractions with like denominators
Grade: 4
Domain: Number & Operations—Fractions
Prerequisites: Understand fraction equivalence, Unit fractions

Standards Alignment

CCSS.Math.4.NF.B.3a

Understand addition of fractions as joining parts referring to the same whole.

TEKS 4.3E

Represent and solve addition of fractions with equal denominators.

The generation prompt includes: skill name, description, grade level, domain, subject, prerequisite skills, and all mapped standard codes with descriptions.
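
As a rough sketch of how that context might be assembled (the field names here are illustrative, not the production schema):

// Illustrative shapes only -- not the production schema.
interface Standard {
  code: string;          // e.g. "CCSS.Math.4.NF.B.3a"
  jurisdiction: string;  // e.g. "CCSS", "Texas"
  description: string;
}

interface Skill {
  id: string;
  name: string;
  description: string;
  grade: string;
  domain: string;
  subject: string;
  prerequisites: string[];  // names of prerequisite skills
  standards: Standard[];    // all mapped standards with descriptions
}

// Flatten the skill record into the context block the generation prompt needs.
function buildPromptContext(skill: Skill): string {
  const standards = skill.standards
    .map((s) => `- ${s.code} (${s.jurisdiction}): ${s.description}`)
    .join("\n");
  return [
    `Skill: ${skill.name}`,
    `Description: ${skill.description}`,
    `Grade: ${skill.grade}`,
    `Domain: ${skill.domain}`,
    `Subject: ${skill.subject}`,
    `Prerequisites: ${skill.prerequisites.join(", ")}`,
    `Standards:\n${standards}`,
  ].join("\n");
}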

Generation Output Example

The prompt asks the AI to first reason about pedagogy, then generate items. Here's what a complete generation output looks like—the pedagogical context followed by the item sequence.

First: Pedagogical Context (in generation output)

Chain-of-thought
{
  "pedagogicalContext": {
    "bigIdea": {
      "concept": "When denominators are the same, we're counting the same-sized pieces",
      "whyItMatters": "Foundation for all fraction operations and proportional reasoning",
      "realWorldConnection": "Combining parts of the same whole (pizza slices, measuring cups)"
    },

    "commonStruggles": [
      {
        "struggle": "Adding both numerators AND denominators",
        "cause": "Applying whole-number addition rules to fractions",
        "frequency": "Very common (60%+ of initial errors)"
      },
      {
        "struggle": "Not recognizing when denominators match",
        "cause": "Focusing on numerators first",
        "frequency": "Common (30%)"
      }
    ],

    "misconceptions": [
      {
        "error": "2/8 + 3/8 = 5/16",
        "thinking": "Add tops, add bottoms",
        "remediation": "Use visual model showing pieces don't change size"
      },
      {
        "error": "2/8 + 3/8 = 5",
        "thinking": "Ignoring the denominator entirely",
        "remediation": "Emphasize fraction notation meaning 'parts of a whole'"
      }
    ],

    "instructionalContext": {
      "priorKnowledge": ["Fraction equivalence", "Unit fractions", "Part-whole relationship"],
      "typicalSequence": "Visual models → symbolic → mixed representations",
      "effectiveStrategies": ["Fraction bars", "Number lines", "Same-denominator emphasis"]
    }
  },

  "items": [ /* 7 items follow, using the context above */ ]
}

Then: Item from the Sequence

MULTIPLE_CHOICE · medium1
{
  "id": "item-4nf3a-m1-001",
  "type": "MULTIPLE_CHOICE",

  // === CONTENT ===
  "stem": "What is 2/8 + 3/8?",
  "content": {
    "choices": [
      { "id": "a", "text": "5/16", "isCorrect": false },
      { "id": "b", "text": "5/8", "isCorrect": true },
      { "id": "c", "text": "6/8", "isCorrect": false },
      { "id": "d", "text": "5", "isCorrect": false }
    ],
    "explanation": "When adding fractions with the same denominator,
      add the numerators and keep the denominator: 2/8 + 3/8 = 5/8",
    "misconceptions": {
      "a": "Added both numerators AND denominators",
      "c": "Multiplied instead of adding",
      "d": "Added numerators but forgot the denominator"
    }
  },

  // === DIFFICULTY METADATA ===
  "difficulty": 0.45,           // 0-1 scale (IRT-style)
  "discrimination": 1.2,        // 0-2 scale (how well it separates learners)
  "dokLevel": 2,                // Depth of Knowledge (1-4)
  "sequencePosition": "medium1", // Position in 7-item sequence

  // === GRADE & READABILITY ===
  "gradeLevel": "4",
  "readingLevel": 3.2,          // Flesch-Kincaid grade level
  "vocabularyTier": 1,          // 1=basic, 2=academic, 3=domain-specific

  // === SKILL & STANDARDS ===
  "skillId": "czi-skill-4nf3a-add-like-denom",
  "standards": [
    { "code": "CCSS.Math.4.NF.B.3a", "jurisdiction": "CCSS" },
    { "code": "TEKS.4.3E", "jurisdiction": "Texas" }
  ],

  // === SCAFFOLDING ===
  "hints": [
    { "text": "Look at the denominators. Are they the same?", "cost": 5 },
    { "text": "When denominators match, just add the top numbers.", "cost": 10 },
    { "text": "2 + 3 = 5, so the answer is 5/8.", "cost": 20 }
  ],

  // === VISUAL AIDS ===
  "visualAids": [
    {
      "widget": "fraction-bar",
      "purpose": "Show 2/8 and 3/8 as shaded portions",
      "config": {
        "numerator": 2,
        "denominator": 8,
        "showWhole": true
      }
    }
  ],

  // === GENERATION METADATA ===
  "generatedBy": "gemini-3-flash",
  "generatedAt": "2026-01-13T10:30:00Z",
  "promptVersion": "v2.3",
  "tokensUsed": { "input": 1250, "output": 680 },
  "cost": 0.0012,

  // === EVALUATION METADATA ===
  "evaluationScore": 2.6,
  "evaluationDetails": {
    "factualAccuracy": 3,
    "gradeAppropriateness": 3,
    "pedagogicalSoundness": 2,
    "jsonValidity": 3,
    "completeness": 2
  },
  "evaluationFeedback": "Good item. Consider adding a visual aid.",
  "secondaryEvaluation": {
    "model": "gpt-4o",
    "score": 2.7,
    "agreement": true
  },

  // === REVIEW STATUS ===
  "status": "IN_REVIEW",
  "reviewedBy": null,
  "reviewedAt": null,
  "reviewNotes": []
}

Item Metadata Reference

Field | Type | Description

Skill & Standards
skillId | string | Primary skill this item assesses
standards | array | Mapped standards (CCSS, TEKS, etc.)

Difficulty
difficulty | 0.0-1.0 | IRT difficulty parameter
discrimination | 0.0-2.0 | How well item separates high/low performers
dokLevel | 1-4 | Webb's Depth of Knowledge
sequencePosition | string | easy1, easy2, medium1-3, hard1, hard2

Readability
gradeLevel | string | Target grade (K, 1-8)
readingLevel | number | Flesch-Kincaid grade level of text
vocabularyTier | 1-3 | 1=basic, 2=academic, 3=domain-specific

Evaluation
evaluationScore | 0.0-3.0 | Average of 5 criteria scores
evaluationDetails | object | Per-criterion scores (0-3 each)
secondaryEvaluation | object | Cross-model verification (GPT-4o)
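
For readers who prefer types to tables, a minimal TypeScript sketch of the same metadata (an assumed shape, mirroring the example item above):

// Assumed shape -- mirrors the example item and the reference table above.
interface ItemMetadata {
  // Skill & standards
  skillId: string;
  standards: { code: string; jurisdiction: string }[];

  // Difficulty
  difficulty: number;         // 0.0-1.0, IRT-style difficulty
  discrimination: number;     // 0.0-2.0, separation of high/low performers
  dokLevel: 1 | 2 | 3 | 4;    // Webb's Depth of Knowledge
  sequencePosition: "easy1" | "easy2" | "medium1" | "medium2" | "medium3" | "hard1" | "hard2";

  // Readability
  gradeLevel: string;         // "K", "1" ... "8"
  readingLevel: number;       // Flesch-Kincaid grade level of the text
  vocabularyTier: 1 | 2 | 3;  // 1=basic, 2=academic, 3=domain-specific

  // Evaluation
  evaluationScore: number;    // 0.0-3.0, average of the five criteria
  evaluationDetails: Record<string, number>;
  secondaryEvaluation?: { model: string; score: number; agreement: boolean };
}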

Widget & Output Integration

Items can include visual aids that render as interactive widgets (web) or static SVGs (PDF). The same content works across all output formats.
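
As a sketch of that idea (not the actual renderer API), a single widget config can be dispatched to whichever renderer the output format requires:

// Sketch only: a fraction-bar config rendered interactively for the web or as static SVG for PDF.
interface FractionBarConfig {
  numerator: number;
  denominator: number;
  showWhole?: boolean;
}

type OutputFormat = "web" | "pdf";

function renderFractionBar(config: FractionBarConfig, format: OutputFormat): string {
  if (format === "web") {
    // Web: hand the config to the interactive React component (shown here as a JSX string placeholder).
    return `<FractionBar numerator={${config.numerator}} denominator={${config.denominator}} />`;
  }
  // PDF: emit a static SVG, one cell per partition, shading the numerator's cells.
  const width = 240, height = 32, cell = width / config.denominator;
  const cells = Array.from({ length: config.denominator }, (_, i) =>
    `<rect x="${i * cell}" y="0" width="${cell}" height="${height}" ` +
    `fill="${i < config.numerator ? "#999" : "none"}" stroke="#000"/>`
  ).join("");
  return `<svg width="${width}" height="${height}">${cells}</svg>`;
}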

Interactive Widgets

React components for web-based practice:

  • Number lines (draggable markers)
  • Fraction bars (interactive shading)
  • Ten frames (click to fill)
  • Coordinate planes (plot points)
  • Area models (dynamic grids)
  • Clocks (movable hands)

PDF Paper Widgets

Static SVG renders for worksheets:

  • Number lines (with answer blanks)
  • Counting scenes (sprites)
  • Fraction bars (empty for shading)
  • Coordinate grids (points to identify)
  • Ten frames (fill-in)
  • Base-ten blocks

Item Types

9 supported response types:

  • Multiple Choice
  • Multiple Select
  • Numeric Entry
  • Short Answer
  • Fill in the Blank
  • True/False
  • Matching
  • Ordering
  • Constructed Response

The Three-Stage Pipeline

1. AI Generation

Gemini 3 Flash generates items for each skill, including all metadata, hints, and widget configs.

  • 7-item difficulty sequences per skill
  • Standards-aligned via skill mapping
  • Widget configs for visual aids

2. Dual AI Evaluation

Primary evaluation scores 5 criteria. 10-20% get secondary evaluation by a different model for bias reduction.

  • Factual accuracy (answer correctness)
  • Grade appropriateness (reading level)
  • Pedagogical soundness (learning science)

3. Human Review

ALL content requires human review. Auto-approved items still get sampled. Teachers verify answers and pedagogy.

  • Verify answer correctness
  • Check age-appropriateness
  • Approve or request changes
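
Read together, the three stages form a short sequence of gates. A rough orchestration sketch (the stage functions are stubs, not the real implementations):

// Orchestration sketch; the stage functions are stubs standing in for real implementations.
type Status = "IN_REVIEW" | "NEEDS_REVISION" | "APPROVED";
interface GeneratedItem { id: string; evaluationScore: number; status: Status }

async function generateItems(skillId: string): Promise<GeneratedItem[]> {
  return []; // stub: Stage 1 -- call the generation model with the skill's prompt context
}
async function evaluateItem(item: GeneratedItem): Promise<number> {
  return 0;  // stub: Stage 2 -- primary evaluation, 0-3 average across the five criteria
}
async function secondaryEvaluate(item: GeneratedItem): Promise<void> {
  // stub: cross-model check run on a 10-20% sample
}

async function produceItemsForSkill(skillId: string): Promise<GeneratedItem[]> {
  const items = await generateItems(skillId);
  for (const item of items) {
    item.evaluationScore = await evaluateItem(item);
    if (Math.random() < 0.15) await secondaryEvaluate(item);
    // Route by score (see Scoring Thresholds below); "approved" items still get human sampling.
    item.status =
      item.evaluationScore >= 2.5 ? "APPROVED" :
      item.evaluationScore >= 2.0 ? "IN_REVIEW" : "NEEDS_REVISION";
  }
  return items; // Stage 3: everything passes through the human review queue before publication
}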

Structured Generation Prompt (Stage 1 Detail)

The generation prompt requires the AI to think through pedagogy first, then generate items. This "chain-of-thought for item design" produces better distractors, hints, and explanations.

First: Think Through Pedagogy

  • Big Idea: Core concept and why it matters
  • Common Struggles: What students find hard
  • Instructional Context: How this is typically taught
  • Misconceptions: Predictable errors and their causes
  • Prerequisite Gaps: Missing knowledge that causes confusion

Then: Generate Items Using That Context

  • Distractors: Based on the misconceptions identified
  • Hints: Target the struggles listed
  • Explanations: Connect back to the big idea
  • Difficulty: 7-item sequence from easy1 to hard2
  • Scaffolding: Address prerequisite gaps

This pedagogical thinking happens within the prompt—the AI reasons about teaching before writing items. The context can optionally be extracted and saved as skill-level metadata for teacher guides.

⚠️ All Content Requires Human Review

Even items with high evaluation scores go to the review queue. "Auto-approve" means they're flagged as likely good, but humans still verify before publication. We sample 100% of content for answer accuracy and pedagogical quality.

Enrichment: Format-Specific Transforms

After human review, approved items can be enriched with format-specific features. This keeps the generation prompt focused on pedagogy while separate transforms handle interoperability.

QTI 3.0 Compliance

Transform native items to 1EdTech QTI format:

  • responseDeclaration for answer keys
  • outcomeDeclaration for scoring
  • choiceInteraction for MC items
  • textEntryInteraction for numeric
  • Template processing for feedback

Accessibility (WCAG 2.1)

Add accessibility features automatically:

  • ARIA labels for interactive widgets
  • Alt text for generated diagrams
  • Screen reader hints for math notation
  • Keyboard navigation patterns
  • High contrast mode support

LMS Packaging

Package for external LMS delivery:

  • SCORM 2004 wrappers
  • LTI 1.3 resource links
  • QTI package manifests (imsmanifest.xml)
  • xAPI statement templates
  • Common Cartridge bundles

Localization

Adapt content for different locales:

  • Translation via specialized models
  • Locale-specific number formatting
  • Cultural context adaptation
  • Currency and measurement units
  • Right-to-left layout support

💡 Why Enrich Later?

Generating QTI-compliant XML directly would complicate the generation prompt and reduce quality. By generating clean, pedagogically focused content first, we can run deterministic transforms to add format-specific features—and update those transforms without regenerating content.

Validation: LLM vs Code

Use LLMs for judgment, use code for computation. LLMs hallucinate arithmetic but reason well about pedagogy. Code computes perfectly but can't judge whether an explanation is clear.

Code-Based Validation

Deterministic, fast, 100% reliable:

  • Answer correctness: parse the expression, compute the result, verify the match
  • Reading level (Flesch-Kincaid): syllables, words, sentences → grade level formula
  • JSON schema validation: required fields, types, enum values
  • Widget config validity: try rendering; does it work?
  • Prerequisite boundary: skill graph lookup for allowed concepts
  • Duplicate detection: semantic similarity to existing items
  • Format constraints: 4 choices, 3 hints, stem length limits

LLM-Based Evaluation

Judgment, reasoning, pedagogical expertise:

  • Pedagogical soundness: does this teach the concept effectively?
  • Grade-appropriate language: beyond Flesch-Kincaid (idioms, cultural references, complexity)
  • Distractor quality: do these represent real misconceptions?
  • Skill alignment: does this actually test the intended skill?
  • Hint scaffolding: progressive, not giving away the answer
  • Explanation clarity: would a student understand this?
  • Bias and sensitivity: cultural, gender, socioeconomic awareness

Validation Pipeline Order

Run code checks first (fast, cheap). If an item fails JSON validation or has a wrong answer, don't waste LLM tokens evaluating it. Only send items that pass code checks to LLM evaluation.
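
A minimal sketch of that ordering (the individual checks are placeholders):

// Cheap deterministic checks run first; the LLM evaluator only sees items that survive them.
interface CheckResult { passed: boolean; reason?: string }
type CodeCheck = (item: unknown) => CheckResult;

async function validateThenEvaluate(
  item: unknown,
  codeChecks: CodeCheck[],
  llmEvaluate: (item: unknown) => Promise<number>,
): Promise<{ score?: number; failedOn?: string }> {
  for (const check of codeChecks) {
    const result = check(item);
    // Fail fast: no LLM tokens are spent on structurally broken or factually wrong items.
    if (!result.passed) return { failedOn: result.reason ?? "code check failed" };
  }
  return { score: await llmEvaluate(item) };
}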

Human Evaluation Methods

Direct Rating

Reviewer scores item on rubric criteria (1-3 scale).

Best for: Detailed feedback, training reviewers

Blind Comparison

Show generated item vs. published item (e.g., Smarter Balanced). Reviewer picks better one without knowing which is which.

Best for: Calibration, measuring quality bar

Pass/Fail + Notes

Binary approve/reject with required explanation for rejections. Fast, captures blockers.

Best for: High-volume review, clear quality gate

Statistical Sampling for Population-Level Quality

You don't need to review every item. Random sampling combined with human ratings gives population-level quality estimates with confidence intervals; a short sketch of the math follows the lists below.

Sample Size Guide

  • 30-50 items: rough quality estimate
  • 100 items: ±10% confidence interval
  • 400 items: ±5% confidence interval

Use Cases

  • Quality trends: track over time
  • A/B testing: compare prompt versions
  • Batch gates: reject if pass rate < 80%
  • Dashboards: report with error bars
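
The sample-size guidance above follows from the normal approximation for a proportion: the 95% half-width is 1.96 * sqrt(p * (1 - p) / n), which is largest at p = 0.5. A quick sketch:

// 95% confidence interval for a pass rate estimated from a random sample.
function passRateInterval(passed: number, sampled: number): { rate: number; halfWidth: number } {
  const p = passed / sampled;
  return { rate: p, halfWidth: 1.96 * Math.sqrt((p * (1 - p)) / sampled) };
}

// Worst case (p = 0.5): n = 100 gives a half-width of about 0.10, n = 400 about 0.05,
// which is where the "±10%" and "±5%" figures come from.
passRateInterval(50, 100);  // { rate: 0.5, halfWidth: ≈0.098 }
passRateInterval(200, 400); // { rate: 0.5, halfWidth: ≈0.049 }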

Widget & Item Type Selection

How does the AI decide to use a number line vs. fraction bar? When is multiple choice better than numeric entry? These decisions happen at generation time, guided by the skill and prompt.

Item Type Selection

The prompt specifies which item types are appropriate for the skill:

# In generation prompt:
Skill: Add fractions with like denominators
Allowed item types: MULTIPLE_CHOICE, NUMERIC
# NOT: CONSTRUCTED_RESPONSE (too open-ended)
  • Skill metadata defines allowed types
  • Difficulty level may influence the choice (harder items skew more open-ended)
  • 7-item sequence can mix types for variety

Widget Selection

Widgets are visual aids—the AI chooses based on what helps learning:

# AI reasoning in generation:
"This fraction addition problem would benefit
from a fraction-bar widget showing 2/8 + 3/8
as shaded portions of the same whole."
  • Prompt lists available widgets for the skill
  • AI selects + configures based on pedagogical fit
  • Config validated by code (will it render?)

Widget Selection by Skill Domain

Domain | Common Widgets | When to Use
Fractions | fraction-bar, fraction-circle, number-line | Equivalence, comparison, operations
Place Value | base-ten-blocks, place-value-chart | Regrouping, expanded form
Counting (K-2) | ten-frame, counter-dots, rekenrek | Subitizing, number bonds
Multiplication | area-model, array, number-line | Visual multiplication, distributive property
Time | clock | Reading time, elapsed time
Geometry | shape-canvas, grid | Area, perimeter, transformations

🎨 Illustrations vs. Widgets

Widgets are interactive or structured (fraction bars, number lines). Illustrations are decorative or contextual (a picture of pizza, a cartoon character). For now, we focus on widgets. Illustrations may be added via image generation or stock assets in a future phase.

Context Engineering: Teaching the AI to Use Widgets

The generation prompt includes structured documentation so the AI knows how to select and configure widgets correctly.

# Widget Library (included in generation prompt)

## fraction-bar
PURPOSE: Visualize fractions as shaded portions of a rectangular bar
WHEN TO USE:
  - Comparing fractions with same/different denominators
  - Adding/subtracting fractions (show parts combining)
  - Showing equivalence (same shaded area, different partitions)
WHEN NOT TO USE:
  - Fractions > 1 (use multiple bars or number line instead)
  - Very large denominators (>12 becomes hard to see)

CONFIG SCHEMA:
{
  "numerator": number,      // 0 to denominator
  "denominator": number,    // 2-12 recommended
  "showLabels": boolean,    // show fraction notation
  "highlightNumerator": boolean
}

EXAMPLE - Good:
Skill: Compare fractions with like denominators
Widget: fraction-bar with numerator=3, denominator=8
Why: Shows 3/8 as visual area, easy to compare

EXAMPLE - Bad:
Skill: Add fractions 2/3 + 4/5
Widget: fraction-bar
Why: Different denominators need side-by-side bars or number line

---

## number-line
PURPOSE: Show numbers/fractions as positions on a continuous line
WHEN TO USE:
  - Ordering/comparing multiple values
  - Adding (jumps forward) or subtracting (jumps back)
  - Fractions > 1 or mixed numbers
  - Showing distance/difference between values
WHEN NOT TO USE:
  - Part-whole relationships (fraction-bar is clearer)
  - Very early fraction concepts (too abstract for K-1)

CONFIG SCHEMA:
{
  "min": number,
  "max": number,
  "tickInterval": number,
  "points": [{ "value": number, "label": string }],
  "showJumps": boolean
}

What the Prompt Includes

  • PURPOSE: What the widget visualizes
  • WHEN TO USE: Pedagogical fit
  • WHEN NOT TO USE: Common mistakes
  • CONFIG SCHEMA: Valid parameters
  • GOOD/BAD EXAMPLES: Concrete usage

Skill-Specific Context

  • Available widgets filtered by skill
  • Recommended widget for this skill type
  • Config constraints (e.g., denominators 2-12)
  • Examples from same skill if available

Validation After Generation

Even with good prompt context, validate widget configs programmatically:

  • Schema validation: Are required fields present? Types correct?
  • Render test: Does the widget actually render without error?
  • Bounds check: Is numerator ≤ denominator? Is denominator reasonable?
  • Pedagogical sanity: Does config match the stem? (LLM check)
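
A sketch of the schema and bounds checks for the fraction-bar config documented above (the render test and the LLM sanity check live elsewhere in the pipeline):

// Validate a fraction-bar config against the schema and bounds from the widget documentation.
interface FractionBarWidgetConfig {
  numerator: number;
  denominator: number;
  showLabels?: boolean;
  highlightNumerator?: boolean;
}

function validateFractionBarConfig(config: unknown): string[] {
  const errors: string[] = [];
  const c = config as Partial<FractionBarWidgetConfig>;

  if (typeof c?.numerator !== "number") errors.push("numerator must be a number");
  if (typeof c?.denominator !== "number") errors.push("denominator must be a number");
  if (errors.length > 0) return errors;

  // Bounds from the prompt documentation: 0 <= numerator <= denominator, denominator 2-12.
  if (!Number.isInteger(c.numerator!) || c.numerator! < 0) errors.push("numerator must be a non-negative integer");
  if (!Number.isInteger(c.denominator!) || c.denominator! < 2 || c.denominator! > 12)
    errors.push("denominator should be an integer from 2 to 12");
  else if (c.numerator! > c.denominator!) errors.push("numerator must not exceed denominator");
  return errors;
}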

Quality Assurance Patterns

Beyond the three-stage pipeline, these patterns improve generation quality and catch issues early.

Reading Level Validation

Automatically check readability of generated content:

  • Flesch-Kincaid Grade Level - must be ≤ grade + 1
  • Vocabulary tier check - flag Tier 3 words for younger grades
  • Sentence complexity - shorter sentences for K-2
  • Word frequency analysis - prefer common words
Example: Grade 3 item should have reading level ≤ 4.0
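
For the Flesch-Kincaid check, the standard formula is 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59. A rough sketch (the syllable counter is a heuristic, which is fine for flagging):

// Flesch-Kincaid grade level check: flag stems that read above the target grade + 1.
function countSyllables(word: string): number {
  // Heuristic: count vowel groups, dropping a trailing silent "e".
  const w = word.toLowerCase().replace(/[^a-z]/g, "");
  if (w.length === 0) return 0;
  const groups = w.replace(/e$/, "").match(/[aeiouy]+/g);
  return Math.max(1, groups ? groups.length : 1);
}

function fleschKincaidGrade(text: string): number {
  const sentences = Math.max(1, (text.match(/[.!?]+/g) ?? []).length);
  const words = text.split(/\s+/).filter(Boolean);
  const wordCount = Math.max(1, words.length);
  const syllables = words.reduce((sum, w) => sum + countSyllables(w), 0);
  return 0.39 * (wordCount / sentences) + 11.8 * (syllables / wordCount) - 15.59;
}

function readingLevelOk(stem: string, grade: number): boolean {
  return fleschKincaidGrade(stem) <= grade + 1; // e.g. a grade 3 item must be <= 4.0
}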

Answer Verification

Don't trust AI arithmetic—verify programmatically:

  • Parse the stem - extract the math expression
  • Compute the answer - use a math library
  • Verify correct choice - must match computed answer
  • Check distractors - must NOT equal correct answer
Critical: Wrong answers destroy student trust
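
A narrow sketch for the like-denominator addition items above (a real pipeline would parse expressions with a math library; this hand-rolls the one case):

// Verify a like-denominator fraction-addition item: compute the sum from the stem,
// confirm the flagged correct choice matches it, and confirm no distractor does.
interface Choice { id: string; text: string; isCorrect: boolean }

function parseFraction(s: string): { n: number; d: number } | null {
  const m = s.trim().match(/^(\d+)\s*\/\s*(\d+)$/);
  return m ? { n: Number(m[1]), d: Number(m[2]) } : null;
}

function verifyLikeDenominatorItem(stem: string, choices: Choice[]): boolean {
  // Expect a stem containing "a/d + b/d", e.g. "What is 2/8 + 3/8?".
  const m = stem.match(/(\d+)\s*\/\s*(\d+)\s*\+\s*(\d+)\s*\/\s*(\d+)/);
  if (!m || m[2] !== m[4]) return false;
  const expected = { n: Number(m[1]) + Number(m[3]), d: Number(m[2]) };

  const matchesAnswer = (c: Choice): boolean => {
    const f = parseFraction(c.text);
    return f !== null && f.n * expected.d === expected.n * f.d; // cross-multiply for equivalence
  };
  const correct = choices.filter((c) => c.isCorrect);
  const distractors = choices.filter((c) => !c.isCorrect);
  return correct.length === 1 && matchesAnswer(correct[0]) && distractors.every((c) => !matchesAnswer(c));
}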

Few-Shot Examples

Include 2-3 high-quality items as examples in the prompt (a selection sketch follows the lists below):

Bootstrapping Phase (no approved items yet)

  • Hand-craft 5-10 gold standard items per skill type
  • Use published assessment items from open sources (e.g., Smarter Balanced, released state tests)
  • Adapt textbook examples to your format
  • Start with one skill, perfect it, expand

Steady State (approved items exist)

  • Same skill - shows expected format and difficulty
  • Similar skills - fallback when exact match unavailable
  • High-rated items - prioritize by eval score
  • Diverse examples - show range of contexts
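
A sketch of that steady-state selection (field names assumed):

// Pick few-shot examples: same skill first, then similar skills, highest evaluation score first.
interface ApprovedItem { skillId: string; relatedSkillIds: string[]; evaluationScore: number }

function pickFewShotExamples(targetSkillId: string, pool: ApprovedItem[], count = 3): ApprovedItem[] {
  const byScore = (a: ApprovedItem, b: ApprovedItem) => b.evaluationScore - a.evaluationScore;
  const sameSkill = pool.filter((i) => i.skillId === targetSkillId).sort(byScore);
  const similarSkill = pool
    .filter((i) => i.skillId !== targetSkillId && i.relatedSkillIds.includes(targetSkillId))
    .sort(byScore);
  // Exact-skill examples first; fall back to similar skills when there aren't enough.
  return [...sameSkill, ...similarSkill].slice(0, count);
}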

Diversity Constraints

Require variety across the 7-item sequence:

  • Contexts - not all pizza (use baking, sports, money...)
  • Numbers - vary denominators, numerators, magnitudes
  • Representations - mix symbolic, visual, word problems
  • Question formats - find answer, find missing part, compare
Why: Prevents pattern-matching without understanding

Negative Examples in Prompts

Show the AI what NOT to generate:

  • Too easy - "What is 1/2 + 1/2?" (trivial)
  • Implausible distractors - options no one would choose
  • Tricky wording - "gotcha" questions that confuse
  • Above grade level - vocabulary or concepts too advanced
Format: "DON'T: [bad example]. WHY: [reason]"

Prerequisite Boundary Check

Items shouldn't require skills beyond prerequisites:

  • Skill graph lookup - get prerequisite skill IDs
  • Concept extraction - identify concepts in item
  • Boundary validation - flag concepts from later skills
  • Simplification suggestions - how to remove the violation
Example: "Add like denominators" item shouldn't require simplifying

Prompt Versioning

Track which prompt produced which items:

  • Store promptVersion - "v2.3" with each item
  • Quality tracking - correlate versions with eval scores
  • A/B testing - compare prompt variations
  • Rollback - revert if new prompt degrades quality
Stored: promptVersion, generatedAt, modelId

Content Status Workflow

DRAFT → IN_REVIEW → APPROVED or NEEDS_REVISION → PUBLISHED → ARCHIVED

DRAFT: Initial state for newly generated content. Not visible to students.
IN_REVIEW: Content scored 2.0-2.5 on evaluation. Waiting for human review.
NEEDS_REVISION: Content scored below 2.0. Flagged for regeneration with feedback.
APPROVED: Content scored 2.5+ (auto-approved) or passed human review.
PUBLISHED: Live and visible to students. Can be assigned in activities.
ARCHIVED: Retired content. Hidden from students but preserved for analysis.
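
A sketch of the allowed transitions as data (an assumed encoding of the workflow above, not the actual implementation):

// Allowed status transitions for generated content.
type ContentStatus = "DRAFT" | "IN_REVIEW" | "NEEDS_REVISION" | "APPROVED" | "PUBLISHED" | "ARCHIVED";

const allowedTransitions: Record<ContentStatus, ContentStatus[]> = {
  DRAFT: ["IN_REVIEW", "APPROVED", "NEEDS_REVISION"], // routed by evaluation score
  IN_REVIEW: ["APPROVED", "NEEDS_REVISION"],          // outcome of human review
  NEEDS_REVISION: ["DRAFT"],                          // regenerated with feedback
  APPROVED: ["PUBLISHED"],
  PUBLISHED: ["ARCHIVED"],
  ARCHIVED: [],
};

function canTransition(from: ContentStatus, to: ContentStatus): boolean {
  return allowedTransitions[from].includes(to);
}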

Multi-Model Strategy

Different AI models excel at different tasks. We use specialized models for each stage to maximize quality while minimizing cost.

Task | Model | Why This Model
Content Generation | Gemini 3 Flash | Fast, excellent JSON output, lowest cost ($0.50/1M input)
Primary Evaluation | Gemini 3 Flash | Consistent scoring, catches factual and pedagogical errors
Secondary Evaluation | GPT-4o | Different perspective for bias reduction (10-20% sample)
Orchestration | Claude Opus 4.5 | Complex reasoning, architecture decisions, pedagogical judgment

7-Item Difficulty Sequence

For each skill, we generate a 7-item sequence spanning the full difficulty spectrum. This enables adaptive practice that meets students where they are.

Position | Difficulty | Description
easy1 | 0.20-0.30 | Direct application, single step
easy2 | 0.25-0.35 | Direct application, minimal variation
medium1 | 0.40-0.50 | Two-step or unfamiliar context
medium2 | 0.45-0.55 | Strategy selection required
medium3 | 0.50-0.60 | Multiple steps or representations
hard1 | 0.65-0.75 | Non-routine problem, DOK 2-3
hard2 | 0.70-0.80 | Transfer to new context

Evaluation Criteria

Every piece of generated content is scored on 5 criteria using a 0-3 scale.

Factual Accuracy

Mathematical and content correctness. No errors in problems or solutions.

Grade Appropriateness

Language complexity, vocabulary, and context fit the target grade level.

Pedagogical Soundness

Aligned with learning science. Proper scaffolding and progression.

JSON Validity

Schema compliance. All required fields present and correctly formatted.

Completeness

Includes all required components: hints, feedback, distractors with explanations.

Scoring Thresholds

Score ≥ 2.5: AUTO-APPROVE (flagged as likely good; still sampled by human reviewers before publication)
Score 2.0-2.5: HUMAN REVIEW (sent to the teacher review queue)
Score < 2.0: REGENERATE (flagged with improvement feedback)

Human Review Process

Review Queue

Teachers and subject matter experts access content needing review through the Review Queue interface.

  • Filter by grade level and subject
  • See evaluation scores and flags
  • Preview items as students would see them
  • Add comments and feedback

Reviewer Actions

Reviewers can take several actions on content in the queue.

👍 Approve: Content is ready for publication
👎 Request Changes: Add comment explaining issues
💬 Comment: Add notes for improvement

Quality Targets

  • JSON Parse Rate: 100% (automated recovery from truncation)
  • Auto-Approve Rate: >80% (content scoring ≥ 2.5)
  • Human Review Pass: >95% (items approved after review)
  • Regeneration Rate: <15% (content needing revision)

Cost Efficiency

  • ~$0.39 per skill (generation + evaluation + regeneration)
  • ~$4.70 per full grade level (182 skills with all content)
  • ~$11 for the full K-8 curriculum (10,000+ pieces of content)

Human Review Interfaces

Multiple interfaces support different review workflows. Teachers and subject matter experts can choose the tool that fits their needs.

Motivational Design Demos

Content is only half the story. These demos showcase the student experience—how we use gamification, feedback, and adaptive difficulty to keep learners engaged.

Built-in Gamification Features

🔥 Streak System: Consecutive correct answers build streaks with visual flame animations that grow with each success.

Speed Bonuses: Fast, accurate responses earn bonus XP. The speedometer shows real-time items-per-minute.

🎉 Correct Bursts: Particle explosions celebrate correct answers. Wrong answers get subtle shake feedback.

📊 Performance Analytics: End-of-session results show accuracy and speed distribution and identify problem areas.

Explore the Platform

Try the student experience or access the review queue.