How It Works

Content Production Pipeline

AI generates educational content at scale. Humans review before publication. Nothing reaches students without quality verification.

It Starts with Skills

Every piece of content is generated for a specific skill. Skills are the atomic units of learning—each one maps to academic standards and has prerequisite relationships.

Skill

Name: Add fractions with like denominators
Grade: 4
Domain: Number & Operations—Fractions
Prerequisites: Understand fraction equivalence, Unit fractions

Standards Alignment

CCSS.Math.4.NF.B.3a

Understand addition of fractions as joining parts referring to the same whole.

TEKS 4.3E

Represent and solve addition of fractions with equal denominators.

The generation prompt includes: skill name, description, grade level, domain, subject, prerequisite skills, and all mapped standard codes with descriptions.
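
As a rough sketch of how that context might be assembled (the field names here are illustrative, not the production schema):

// Illustrative shapes only -- not the production schema.
interface Standard {
  code: string;          // e.g. "CCSS.Math.4.NF.B.3a"
  jurisdiction: string;  // e.g. "CCSS", "Texas"
  description: string;
}

interface Skill {
  id: string;
  name: string;
  description: string;
  grade: string;
  domain: string;
  subject: string;
  prerequisites: string[];  // names of prerequisite skills
  standards: Standard[];    // all mapped standards with descriptions
}

// Flatten the skill record into the context block the generation prompt needs.
function buildPromptContext(skill: Skill): string {
  const standards = skill.standards
    .map((s) => `- ${s.code} (${s.jurisdiction}): ${s.description}`)
    .join("\n");
  return [
    `Skill: ${skill.name}`,
    `Description: ${skill.description}`,
    `Grade: ${skill.grade}`,
    `Domain: ${skill.domain}`,
    `Subject: ${skill.subject}`,
    `Prerequisites: ${skill.prerequisites.join(", ")}`,
    `Standards:\n${standards}`,
  ].join("\n");
}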

Generation Output Example

The prompt asks the AI to first reason about pedagogy, then generate items. Here's what a complete generation output looks like—the pedagogical context followed by the item sequence.

First: Pedagogical Context (in generation output)

Chain-of-thought
{
  "pedagogicalContext": {
    "bigIdea": {
      "concept": "When denominators are the same, we're counting the same-sized pieces",
      "whyItMatters": "Foundation for all fraction operations and proportional reasoning",
      "realWorldConnection": "Combining parts of the same whole (pizza slices, measuring cups)"
    },

    "commonStruggles": [
      {
        "struggle": "Adding both numerators AND denominators",
        "cause": "Applying whole-number addition rules to fractions",
        "frequency": "Very common (60%+ of initial errors)"
      },
      {
        "struggle": "Not recognizing when denominators match",
        "cause": "Focusing on numerators first",
        "frequency": "Common (30%)"
      }
    ],

    "misconceptions": [
      {
        "error": "2/8 + 3/8 = 5/16",
        "thinking": "Add tops, add bottoms",
        "remediation": "Use visual model showing pieces don't change size"
      },
      {
        "error": "2/8 + 3/8 = 5",
        "thinking": "Ignoring the denominator entirely",
        "remediation": "Emphasize fraction notation meaning 'parts of a whole'"
      }
    ],

    "instructionalContext": {
      "priorKnowledge": ["Fraction equivalence", "Unit fractions", "Part-whole relationship"],
      "typicalSequence": "Visual models → symbolic → mixed representations",
      "effectiveStrategies": ["Fraction bars", "Number lines", "Same-denominator emphasis"]
    }
  },

  "items": [ /* 7 items follow, using the context above */ ]
}

Then: Item from the Sequence

MULTIPLE_CHOICE · medium1
{
  "id": "item-4nf3a-m1-001",
  "type": "MULTIPLE_CHOICE",

  // === CONTENT ===
  "stem": "What is 2/8 + 3/8?",
  "content": {
    "choices": [
      { "id": "a", "text": "5/16", "isCorrect": false },
      { "id": "b", "text": "5/8", "isCorrect": true },
      { "id": "c", "text": "6/8", "isCorrect": false },
      { "id": "d", "text": "5", "isCorrect": false }
    ],
    "explanation": "When adding fractions with the same denominator,
      add the numerators and keep the denominator: 2/8 + 3/8 = 5/8",
    "misconceptions": {
      "a": "Added both numerators AND denominators",
      "c": "Multiplied instead of adding",
      "d": "Added numerators but forgot the denominator"
    }
  },

  // === DIFFICULTY METADATA ===
  "difficulty": 0.45,           // 0-1 scale (IRT-style)
  "discrimination": 1.2,        // 0-2 scale (how well it separates learners)
  "dokLevel": 2,                // Depth of Knowledge (1-4)
  "sequencePosition": "medium1", // Position in 7-item sequence

  // === GRADE & READABILITY ===
  "gradeLevel": "4",
  "readingLevel": 3.2,          // Flesch-Kincaid grade level
  "vocabularyTier": 1,          // 1=basic, 2=academic, 3=domain-specific

  // === SKILL & STANDARDS ===
  "skillId": "czi-skill-4nf3a-add-like-denom",
  "standards": [
    { "code": "CCSS.Math.4.NF.B.3a", "jurisdiction": "CCSS" },
    { "code": "TEKS.4.3E", "jurisdiction": "Texas" }
  ],

  // === SCAFFOLDING ===
  "hints": [
    { "text": "Look at the denominators. Are they the same?", "cost": 5 },
    { "text": "When denominators match, just add the top numbers.", "cost": 10 },
    { "text": "2 + 3 = 5, so the answer is 5/8.", "cost": 20 }
  ],

  // === VISUAL AIDS ===
  "visualAids": [
    {
      "widget": "fraction-bar",
      "purpose": "Show 2/8 and 3/8 as shaded portions",
      "config": {
        "numerator": 2,
        "denominator": 8,
        "showWhole": true
      }
    }
  ],

  // === GENERATION METADATA ===
  "generatedBy": "gemini-3-flash",
  "generatedAt": "2026-01-13T10:30:00Z",
  "promptVersion": "v2.3",
  "tokensUsed": { "input": 1250, "output": 680 },
  "cost": 0.0012,

  // === EVALUATION METADATA ===
  "evaluationScore": 2.6,
  "evaluationDetails": {
    "factualAccuracy": 3,
    "gradeAppropriateness": 3,
    "pedagogicalSoundness": 2,
    "jsonValidity": 3,
    "completeness": 2
  },
  "evaluationFeedback": "Good item. Consider adding a visual aid.",
  "secondaryEvaluation": {
    "model": "gpt-4o",
    "score": 2.7,
    "agreement": true
  },

  // === REVIEW STATUS ===
  "status": "IN_REVIEW",
  "reviewedBy": null,
  "reviewedAt": null,
  "reviewNotes": []
}

Item Metadata Reference

Field | Type | Description

Skill & Standards
skillId | string | Primary skill this item assesses
standards | array | Mapped standards (CCSS, TEKS, etc.)

Difficulty
difficulty | 0.0-1.0 | IRT difficulty parameter
discrimination | 0.0-2.0 | How well item separates high/low performers
dokLevel | 1-4 | Webb's Depth of Knowledge
sequencePosition | string | easy1, easy2, medium1-3, hard1, hard2

Readability
gradeLevel | string | Target grade (K, 1-8)
readingLevel | number | Flesch-Kincaid grade level of text
vocabularyTier | 1-3 | 1=basic, 2=academic, 3=domain-specific

Evaluation
evaluationScore | 0.0-3.0 | Average of 5 criteria scores
evaluationDetails | object | Per-criterion scores (0-3 each)
secondaryEvaluation | object | Cross-model verification (GPT-4o)
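
For readers who prefer types to tables, a minimal TypeScript sketch of the same metadata (an assumed shape, mirroring the example item above):

// Assumed shape -- mirrors the example item and the reference table above.
interface ItemMetadata {
  // Skill & standards
  skillId: string;
  standards: { code: string; jurisdiction: string }[];

  // Difficulty
  difficulty: number;         // 0.0-1.0, IRT-style difficulty
  discrimination: number;     // 0.0-2.0, separation of high/low performers
  dokLevel: 1 | 2 | 3 | 4;    // Webb's Depth of Knowledge
  sequencePosition: "easy1" | "easy2" | "medium1" | "medium2" | "medium3" | "hard1" | "hard2";

  // Readability
  gradeLevel: string;         // "K", "1" ... "8"
  readingLevel: number;       // Flesch-Kincaid grade level of the text
  vocabularyTier: 1 | 2 | 3;  // 1=basic, 2=academic, 3=domain-specific

  // Evaluation
  evaluationScore: number;    // 0.0-3.0, average of the five criteria
  evaluationDetails: Record<string, number>;
  secondaryEvaluation?: { model: string; score: number; agreement: boolean };
}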

Widget & Output Integration

Items can include visual aids that render as interactive widgets (web) or static SVGs (PDF). The same content works across all output formats.
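
As a sketch of that idea (not the actual renderer API), a single widget config can be dispatched to whichever renderer the output format requires:

// Sketch only: a fraction-bar config rendered interactively for the web or as static SVG for PDF.
interface FractionBarConfig {
  numerator: number;
  denominator: number;
  showWhole?: boolean;
}

type OutputFormat = "web" | "pdf";

function renderFractionBar(config: FractionBarConfig, format: OutputFormat): string {
  if (format === "web") {
    // Web: hand the config to the interactive React component (shown here as a JSX string placeholder).
    return `<FractionBar numerator={${config.numerator}} denominator={${config.denominator}} />`;
  }
  // PDF: emit a static SVG, one cell per partition, shading the numerator's cells.
  const width = 240, height = 32, cell = width / config.denominator;
  const cells = Array.from({ length: config.denominator }, (_, i) =>
    `<rect x="${i * cell}" y="0" width="${cell}" height="${height}" ` +
    `fill="${i < config.numerator ? "#999" : "none"}" stroke="#000"/>`
  ).join("");
  return `<svg width="${width}" height="${height}">${cells}</svg>`;
}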

Interactive Widgets

React components for web-based practice:

  • Number lines (draggable markers)
  • Fraction bars (interactive shading)
  • Ten frames (click to fill)
  • Coordinate planes (plot points)
  • Area models (dynamic grids)
  • Clocks (movable hands)

PDF Paper Widgets

Static SVG renders for worksheets:

  • Number lines (with answer blanks)
  • Counting scenes (sprites)
  • Fraction bars (empty for shading)
  • Coordinate grids (points to identify)
  • Ten frames (fill-in)
  • Base-ten blocks

Item Types

9 supported response types:

  • Multiple Choice
  • Multiple Select
  • Numeric Entry
  • Short Answer
  • Fill in the Blank
  • True/False
  • Matching
  • Ordering
  • Constructed Response

The Three-Stage Pipeline

1. AI Generation

Gemini 3 Flash generates items for each skill, including all metadata, hints, and widget configs.

  • 7-item difficulty sequences per skill
  • Standards-aligned via skill mapping
  • Widget configs for visual aids

2. Dual AI Evaluation

Primary evaluation scores 5 criteria. 10-20% get secondary evaluation by a different model for bias reduction.

  • Factual accuracy (answer correctness)
  • Grade appropriateness (reading level)
  • Pedagogical soundness (learning science)

3. Human Review

ALL content requires human review. Auto-approved items still get sampled. Teachers verify answers and pedagogy.

  • Verify answer correctness
  • Check age-appropriateness
  • Approve or request changes
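
Read together, the three stages form a short sequence of gates. A rough orchestration sketch (the stage functions are stubs, not the real implementations):

// Orchestration sketch; the stage functions are stubs standing in for real implementations.
type Status = "IN_REVIEW" | "NEEDS_REVISION" | "APPROVED";
interface GeneratedItem { id: string; evaluationScore: number; status: Status }

async function generateItems(skillId: string): Promise<GeneratedItem[]> {
  return []; // stub: Stage 1 -- call the generation model with the skill's prompt context
}
async function evaluateItem(item: GeneratedItem): Promise<number> {
  return 0;  // stub: Stage 2 -- primary evaluation, 0-3 average across the five criteria
}
async function secondaryEvaluate(item: GeneratedItem): Promise<void> {
  // stub: cross-model check run on a 10-20% sample
}

async function produceItemsForSkill(skillId: string): Promise<GeneratedItem[]> {
  const items = await generateItems(skillId);
  for (const item of items) {
    item.evaluationScore = await evaluateItem(item);
    if (Math.random() < 0.15) await secondaryEvaluate(item);
    // Route by score (see Scoring Thresholds below); "approved" items still get human sampling.
    item.status =
      item.evaluationScore >= 2.5 ? "APPROVED" :
      item.evaluationScore >= 2.0 ? "IN_REVIEW" : "NEEDS_REVISION";
  }
  return items; // Stage 3: everything passes through the human review queue before publication
}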

Structured Generation Prompt (Stage 1 Detail)

The generation prompt requires the AI to think through pedagogy first, then generate items. This "chain-of-thought for item design" produces better distractors, hints, and explanations.

First: Think Through Pedagogy

  • Big Idea: Core concept and why it matters
  • Common Struggles: What students find hard
  • Instructional Context: How this is typically taught
  • Misconceptions: Predictable errors and their causes
  • Prerequisite Gaps: Missing knowledge that causes confusion

Then: Generate Items Using That Context

  • Distractors: Based on the misconceptions identified
  • Hints: Target the struggles listed
  • Explanations: Connect back to the big idea
  • Difficulty: 7-item sequence from easy1 to hard2
  • Scaffolding: Address prerequisite gaps

This pedagogical thinking happens within the prompt—the AI reasons about teaching before writing items. The context can optionally be extracted and saved as skill-level metadata for teacher guides.

⚠️ All Content Requires Human Review

Even items with high evaluation scores go to the review queue. "Auto-approve" means they're flagged as likely good, but humans still verify before publication. We sample 100% of content for answer accuracy and pedagogical quality.

Enrichment: Format-Specific Transforms

After human review, approved items can be enriched with format-specific features. This keeps the generation prompt focused on pedagogy while separate transforms handle interoperability.

QTI 3.0 Compliance

Transform native items to 1EdTech QTI format:

  • responseDeclaration for answer keys
  • outcomeDeclaration for scoring
  • choiceInteraction for MC items
  • textEntryInteraction for numeric
  • Template processing for feedback

Accessibility (WCAG 2.1)

Add accessibility features automatically:

  • ARIA labels for interactive widgets
  • Alt text for generated diagrams
  • Screen reader hints for math notation
  • Keyboard navigation patterns
  • High contrast mode support

LMS Packaging

Package for external LMS delivery:

  • SCORM 2004 wrappers
  • LTI 1.3 resource links
  • QTI package manifests (imsmanifest.xml)
  • xAPI statement templates
  • Common Cartridge bundles

Localization

Adapt content for different locales:

  • Translation via specialized models
  • Locale-specific number formatting
  • Cultural context adaptation
  • Currency and measurement units
  • Right-to-left layout support

💡 Why Enrich Later?

Generating QTI-compliant XML directly would complicate the generation prompt and reduce quality. By generating clean, pedagogically focused content first, we can run deterministic transforms to add format-specific features—and update those transforms without regenerating content.

Validation: LLM vs Code

Use LLMs for judgment, use code for computation. LLMs hallucinate arithmetic but reason well about pedagogy. Code computes perfectly but can't judge whether an explanation is clear.

Code-Based Validation

Deterministic, fast, 100% reliable:

  • Answer correctness: parse the expression, compute the result, verify the match
  • Reading level (Flesch-Kincaid): syllables, words, sentences → grade level formula
  • JSON schema validation: required fields, types, enum values
  • Widget config validity: try rendering; does it work?
  • Prerequisite boundary: skill graph lookup for allowed concepts
  • Duplicate detection: semantic similarity to existing items
  • Format constraints: 4 choices, 3 hints, stem length limits

LLM-Based Evaluation

Judgment, reasoning, pedagogical expertise:

  • Pedagogical soundness: does this teach the concept effectively?
  • Grade-appropriate language: beyond Flesch-Kincaid (idioms, cultural references, complexity)
  • Distractor quality: do these represent real misconceptions?
  • Skill alignment: does this actually test the intended skill?
  • Hint scaffolding: progressive, not giving away the answer
  • Explanation clarity: would a student understand this?
  • Bias and sensitivity: cultural, gender, socioeconomic awareness

Validation Pipeline Order

Run code checks first (fast, cheap). If an item fails JSON validation or has a wrong answer, don't waste LLM tokens evaluating it. Only send items that pass code checks to LLM evaluation.
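
A minimal sketch of that ordering (the individual checks are placeholders):

// Cheap deterministic checks run first; the LLM evaluator only sees items that survive them.
interface CheckResult { passed: boolean; reason?: string }
type CodeCheck = (item: unknown) => CheckResult;

async function validateThenEvaluate(
  item: unknown,
  codeChecks: CodeCheck[],
  llmEvaluate: (item: unknown) => Promise<number>,
): Promise<{ score?: number; failedOn?: string }> {
  for (const check of codeChecks) {
    const result = check(item);
    // Fail fast: no LLM tokens are spent on structurally broken or factually wrong items.
    if (!result.passed) return { failedOn: result.reason ?? "code check failed" };
  }
  return { score: await llmEvaluate(item) };
}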

Human Evaluation Methods

Direct Rating

Reviewer scores item on rubric criteria (1-3 scale).

Best for: Detailed feedback, training reviewers

Blind Comparison

Show generated item vs. published item (e.g., Smarter Balanced). Reviewer picks better one without knowing which is which.

Best for: Calibration, measuring quality bar

Pass/Fail + Notes

Binary approve/reject with required explanation for rejections. Fast, captures blockers.

Best for: High-volume review, clear quality gate

Statistical Sampling for Population-Level Quality

You don't need to review every item. Random sampling combined with human ratings gives population-level quality estimates with confidence intervals; a short sketch of the math follows the lists below.

Sample Size Guide

  • 30-50 items: rough quality estimate
  • 100 items: ±10% confidence interval
  • 400 items: ±5% confidence interval

Use Cases

  • Quality trends: track over time
  • A/B testing: compare prompt versions
  • Batch gates: reject if pass rate < 80%
  • Dashboards: report with error bars
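
The sample-size guidance above follows from the normal approximation for a proportion: the 95% half-width is 1.96 * sqrt(p * (1 - p) / n), which is largest at p = 0.5. A quick sketch:

// 95% confidence interval for a pass rate estimated from a random sample.
function passRateInterval(passed: number, sampled: number): { rate: number; halfWidth: number } {
  const p = passed / sampled;
  return { rate: p, halfWidth: 1.96 * Math.sqrt((p * (1 - p)) / sampled) };
}

// Worst case (p = 0.5): n = 100 gives a half-width of about 0.10, n = 400 about 0.05,
// which is where the "±10%" and "±5%" figures come from.
passRateInterval(50, 100);  // { rate: 0.5, halfWidth: ≈0.098 }
passRateInterval(200, 400); // { rate: 0.5, halfWidth: ≈0.049 }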

Widget & Item Type Selection

How does the AI decide to use a number line vs. fraction bar? When is multiple choice better than numeric entry? These decisions happen at generation time, guided by the skill and prompt.

Item Type Selection

The prompt specifies which item types are appropriate for the skill:

# In generation prompt:
Skill: Add fractions with like denominators
Allowed item types: MULTIPLE_CHOICE, NUMERIC
# NOT: CONSTRUCTED_RESPONSE (too open-ended)
  • Skill metadata defines allowed types
  • Difficulty level may influence the choice (harder items skew more open-ended)
  • 7-item sequence can mix types for variety

Widget Selection

Widgets are visual aids—the AI chooses based on what helps learning:

# AI reasoning in generation:
"This fraction addition problem would benefit
from a fraction-bar widget showing 2/8 + 3/8
as shaded portions of the same whole."
  • Prompt lists available widgets for the skill
  • AI selects + configures based on pedagogical fit
  • Config validated by code (will it render?)

Widget Selection by Skill Domain

Domain | Common Widgets | When to Use
Fractions | fraction-bar, fraction-circle, number-line | Equivalence, comparison, operations
Place Value | base-ten-blocks, place-value-chart | Regrouping, expanded form
Counting (K-2) | ten-frame, counter-dots, rekenrek | Subitizing, number bonds
Multiplication | area-model, array, number-line | Visual multiplication, distributive property
Time | clock | Reading time, elapsed time
Geometry | shape-canvas, grid | Area, perimeter, transformations

🎨 Illustrations vs. Widgets

Widgets are interactive or structured (fraction bars, number lines). Illustrations are decorative or contextual (a picture of pizza, a cartoon character). For now, we focus on widgets. Illustrations may be added via image generation or stock assets in a future phase.

Context Engineering: Teaching the AI to Use Widgets

The generation prompt includes structured documentation so the AI knows how to select and configure widgets correctly.

# Widget Library (included in generation prompt)

## fraction-bar
PURPOSE: Visualize fractions as shaded portions of a rectangular bar
WHEN TO USE:
  - Comparing fractions with same/different denominators
  - Adding/subtracting fractions (show parts combining)
  - Showing equivalence (same shaded area, different partitions)
WHEN NOT TO USE:
  - Fractions > 1 (use multiple bars or number line instead)
  - Very large denominators (>12 becomes hard to see)

CONFIG SCHEMA:
{
  "numerator": number,      // 0 to denominator
  "denominator": number,    // 2-12 recommended
  "showLabels": boolean,    // show fraction notation
  "highlightNumerator": boolean
}

EXAMPLE - Good:
Skill: Compare fractions with like denominators
Widget: fraction-bar with numerator=3, denominator=8
Why: Shows 3/8 as visual area, easy to compare

EXAMPLE - Bad:
Skill: Add fractions 2/3 + 4/5
Widget: fraction-bar
Why: Different denominators need side-by-side bars or number line

---

## number-line
PURPOSE: Show numbers/fractions as positions on a continuous line
WHEN TO USE:
  - Ordering/comparing multiple values
  - Adding (jumps forward) or subtracting (jumps back)
  - Fractions > 1 or mixed numbers
  - Showing distance/difference between values
WHEN NOT TO USE:
  - Part-whole relationships (fraction-bar is clearer)
  - Very early fraction concepts (too abstract for K-1)

CONFIG SCHEMA:
{
  "min": number,
  "max": number,
  "tickInterval": number,
  "points": [{ "value": number, "label": string }],
  "showJumps": boolean
}

What the Prompt Includes

  • PURPOSE: What the widget visualizes
  • WHEN TO USE: Pedagogical fit
  • WHEN NOT TO USE: Common mistakes
  • CONFIG SCHEMA: Valid parameters
  • GOOD/BAD EXAMPLES: Concrete usage

Skill-Specific Context

  • Available widgets filtered by skill
  • Recommended widget for this skill type
  • Config constraints (e.g., denominators 2-12)
  • Examples from same skill if available

Validation After Generation

Even with good prompt context, validate widget configs programmatically:

  • Schema validation: Are required fields present? Types correct?
  • Render test: Does the widget actually render without error?
  • Bounds check: Is numerator ≤ denominator? Is denominator reasonable?
  • Pedagogical sanity: Does config match the stem? (LLM check)
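
A sketch of the schema and bounds checks for the fraction-bar config documented above (the render test and the LLM sanity check live elsewhere in the pipeline):

// Validate a fraction-bar config against the schema and bounds from the widget documentation.
interface FractionBarWidgetConfig {
  numerator: number;
  denominator: number;
  showLabels?: boolean;
  highlightNumerator?: boolean;
}

function validateFractionBarConfig(config: unknown): string[] {
  const errors: string[] = [];
  const c = config as Partial<FractionBarWidgetConfig>;

  if (typeof c?.numerator !== "number") errors.push("numerator must be a number");
  if (typeof c?.denominator !== "number") errors.push("denominator must be a number");
  if (errors.length > 0) return errors;

  // Bounds from the prompt documentation: 0 <= numerator <= denominator, denominator 2-12.
  if (!Number.isInteger(c.numerator!) || c.numerator! < 0) errors.push("numerator must be a non-negative integer");
  if (!Number.isInteger(c.denominator!) || c.denominator! < 2 || c.denominator! > 12)
    errors.push("denominator should be an integer from 2 to 12");
  else if (c.numerator! > c.denominator!) errors.push("numerator must not exceed denominator");
  return errors;
}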

Quality Assurance Patterns

Beyond the three-stage pipeline, these patterns improve generation quality and catch issues early.

Reading Level Validation

Automatically check readability of generated content:

  • Flesch-Kincaid Grade Level - must be ≤ grade + 1
  • Vocabulary tier check - flag Tier 3 words for younger grades
  • Sentence complexity - shorter sentences for K-2
  • Word frequency analysis - prefer common words
Example: Grade 3 item should have reading level ≤ 4.0
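
For the Flesch-Kincaid check, the standard formula is 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59. A rough sketch (the syllable counter is a heuristic, which is fine for flagging):

// Flesch-Kincaid grade level check: flag stems that read above the target grade + 1.
function countSyllables(word: string): number {
  // Heuristic: count vowel groups, dropping a trailing silent "e".
  const w = word.toLowerCase().replace(/[^a-z]/g, "");
  if (w.length === 0) return 0;
  const groups = w.replace(/e$/, "").match(/[aeiouy]+/g);
  return Math.max(1, groups ? groups.length : 1);
}

function fleschKincaidGrade(text: string): number {
  const sentences = Math.max(1, (text.match(/[.!?]+/g) ?? []).length);
  const words = text.split(/\s+/).filter(Boolean);
  const wordCount = Math.max(1, words.length);
  const syllables = words.reduce((sum, w) => sum + countSyllables(w), 0);
  return 0.39 * (wordCount / sentences) + 11.8 * (syllables / wordCount) - 15.59;
}

function readingLevelOk(stem: string, grade: number): boolean {
  return fleschKincaidGrade(stem) <= grade + 1; // e.g. a grade 3 item must be <= 4.0
}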

Answer Verification

Don't trust AI arithmetic—verify programmatically:

  • Parse the stem - extract the math expression
  • Compute the answer - use a math library
  • Verify correct choice - must match computed answer
  • Check distractors - must NOT equal correct answer
Critical: Wrong answers destroy student trust
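
A narrow sketch for the like-denominator addition items above (a real pipeline would parse expressions with a math library; this hand-rolls the one case):

// Verify a like-denominator fraction-addition item: compute the sum from the stem,
// confirm the flagged correct choice matches it, and confirm no distractor does.
interface Choice { id: string; text: string; isCorrect: boolean }

function parseFraction(s: string): { n: number; d: number } | null {
  const m = s.trim().match(/^(\d+)\s*\/\s*(\d+)$/);
  return m ? { n: Number(m[1]), d: Number(m[2]) } : null;
}

function verifyLikeDenominatorItem(stem: string, choices: Choice[]): boolean {
  // Expect a stem containing "a/d + b/d", e.g. "What is 2/8 + 3/8?".
  const m = stem.match(/(\d+)\s*\/\s*(\d+)\s*\+\s*(\d+)\s*\/\s*(\d+)/);
  if (!m || m[2] !== m[4]) return false;
  const expected = { n: Number(m[1]) + Number(m[3]), d: Number(m[2]) };

  const matchesAnswer = (c: Choice): boolean => {
    const f = parseFraction(c.text);
    return f !== null && f.n * expected.d === expected.n * f.d; // cross-multiply for equivalence
  };
  const correct = choices.filter((c) => c.isCorrect);
  const distractors = choices.filter((c) => !c.isCorrect);
  return correct.length === 1 && matchesAnswer(correct[0]) && distractors.every((c) => !matchesAnswer(c));
}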

Few-Shot Examples

Include 2-3 high-quality items as examples in the prompt (a selection sketch follows the lists below):

Bootstrapping Phase (no approved items yet)

  • Hand-craft 5-10 gold standard items per skill type
  • Use published assessment items from open sources (e.g., Smarter Balanced, released state tests)
  • Adapt textbook examples to your format
  • Start with one skill, perfect it, expand

Steady State (approved items exist)

  • Same skill - shows expected format and difficulty
  • Similar skills - fallback when exact match unavailable
  • High-rated items - prioritize by eval score
  • Diverse examples - show range of contexts
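
A sketch of that steady-state selection (field names assumed):

// Pick few-shot examples: same skill first, then similar skills, highest evaluation score first.
interface ApprovedItem { skillId: string; relatedSkillIds: string[]; evaluationScore: number }

function pickFewShotExamples(targetSkillId: string, pool: ApprovedItem[], count = 3): ApprovedItem[] {
  const byScore = (a: ApprovedItem, b: ApprovedItem) => b.evaluationScore - a.evaluationScore;
  const sameSkill = pool.filter((i) => i.skillId === targetSkillId).sort(byScore);
  const similarSkill = pool
    .filter((i) => i.skillId !== targetSkillId && i.relatedSkillIds.includes(targetSkillId))
    .sort(byScore);
  // Exact-skill examples first; fall back to similar skills when there aren't enough.
  return [...sameSkill, ...similarSkill].slice(0, count);
}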

Diversity Constraints

Require variety across the 7-item sequence:

  • Contexts - not all pizza (use baking, sports, money...)
  • Numbers - vary denominators, numerators, magnitudes
  • Representations - mix symbolic, visual, word problems
  • Question formats - find answer, find missing part, compare
Why: Prevents pattern-matching without understanding

Negative Examples in Prompts

Show the AI what NOT to generate:

  • Too easy - "What is 1/2 + 1/2?" (trivial)
  • Implausible distractors - options no one would choose
  • Tricky wording - "gotcha" questions that confuse
  • Above grade level - vocabulary or concepts too advanced
Format: "DON'T: [bad example]. WHY: [reason]"

Prerequisite Boundary Check

Items shouldn't require skills beyond prerequisites:

  • Skill graph lookup - get prerequisite skill IDs
  • Concept extraction - identify concepts in item
  • Boundary validation - flag concepts from later skills
  • Simplification suggestions - how to remove the violation
Example: "Add like denominators" item shouldn't require simplifying

Prompt Versioning

Track which prompt produced which items:

  • Store promptVersion - "v2.3" with each item
  • Quality tracking - correlate versions with eval scores
  • A/B testing - compare prompt variations
  • Rollback - revert if new prompt degrades quality
Stored: promptVersion, generatedAt, modelId

Content Status Workflow

DRAFT → IN_REVIEW → APPROVED or NEEDS_REVISION → PUBLISHED → ARCHIVED

DRAFT: Initial state for newly generated content. Not visible to students.
IN_REVIEW: Content scored 2.0-2.5 on evaluation. Waiting for human review.
NEEDS_REVISION: Content scored below 2.0. Flagged for regeneration with feedback.
APPROVED: Content scored 2.5+ (auto-approved) or passed human review.
PUBLISHED: Live and visible to students. Can be assigned in activities.
ARCHIVED: Retired content. Hidden from students but preserved for analysis.
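
A sketch of the allowed transitions as data (an assumed encoding of the workflow above, not the actual implementation):

// Allowed status transitions for generated content.
type ContentStatus = "DRAFT" | "IN_REVIEW" | "NEEDS_REVISION" | "APPROVED" | "PUBLISHED" | "ARCHIVED";

const allowedTransitions: Record<ContentStatus, ContentStatus[]> = {
  DRAFT: ["IN_REVIEW", "APPROVED", "NEEDS_REVISION"], // routed by evaluation score
  IN_REVIEW: ["APPROVED", "NEEDS_REVISION"],          // outcome of human review
  NEEDS_REVISION: ["DRAFT"],                          // regenerated with feedback
  APPROVED: ["PUBLISHED"],
  PUBLISHED: ["ARCHIVED"],
  ARCHIVED: [],
};

function canTransition(from: ContentStatus, to: ContentStatus): boolean {
  return allowedTransitions[from].includes(to);
}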

Multi-Model Strategy

Different AI models excel at different tasks. We use specialized models for each stage to maximize quality while minimizing cost.

Task | Model | Why This Model
Content Generation | Gemini 3 Flash | Fast, excellent JSON output, lowest cost ($0.50/1M input)
Primary Evaluation | Gemini 3 Flash | Consistent scoring, catches factual and pedagogical errors
Secondary Evaluation | GPT-4o | Different perspective for bias reduction (10-20% sample)
Orchestration | Claude Opus 4.5 | Complex reasoning, architecture decisions, pedagogical judgment

7-Item Difficulty Sequence

For each skill, we generate a 7-item sequence spanning the full difficulty spectrum. This enables adaptive practice that meets students where they are.

Position | Difficulty | Description
easy1 | 0.20-0.30 | Direct application, single step
easy2 | 0.25-0.35 | Direct application, minimal variation
medium1 | 0.40-0.50 | Two-step or unfamiliar context
medium2 | 0.45-0.55 | Strategy selection required
medium3 | 0.50-0.60 | Multiple steps or representations
hard1 | 0.65-0.75 | Non-routine problem, DOK 2-3
hard2 | 0.70-0.80 | Transfer to new context

Evaluation Criteria

Every piece of generated content is scored on 5 criteria using a 0-3 scale.

Factual Accuracy

Mathematical and content correctness. No errors in problems or solutions.

Grade Appropriateness

Language complexity, vocabulary, and context fit the target grade level.

Pedagogical Soundness

Aligned with learning science. Proper scaffolding and progression.

JSON Validity

Schema compliance. All required fields present and correctly formatted.

Completeness

Includes all required components: hints, feedback, distractors with explanations.

Scoring Thresholds

Score ≥ 2.5: AUTO-APPROVE (flagged as likely good; still sampled by human reviewers before publication)
Score 2.0-2.5: HUMAN REVIEW (sent to the teacher review queue)
Score < 2.0: REGENERATE (flagged with improvement feedback)

Human Review Process

Review Queue

Teachers and subject matter experts access content needing review through the Review Queue interface.

  • Filter by grade level and subject
  • See evaluation scores and flags
  • Preview items as students would see them
  • Add comments and feedback

Reviewer Actions

Reviewers can take several actions on content in the queue.

👍 Approve: Content is ready for publication
👎 Request Changes: Add comment explaining issues
💬 Comment: Add notes for improvement

Quality Targets

  • JSON Parse Rate: 100% (automated recovery from truncation)
  • Auto-Approve Rate: >80% (content scoring ≥ 2.5)
  • Human Review Pass: >95% (items approved after review)
  • Regeneration Rate: <15% (content needing revision)

Cost Efficiency

  • ~$0.39 per skill (generation + evaluation + regeneration)
  • ~$4.70 per full grade level (182 skills with all content)
  • ~$11 for the full K-8 curriculum (10,000+ pieces of content)

Human Review Interfaces

Multiple interfaces support different review workflows. Teachers and subject matter experts can choose the tool that fits their needs.

Motivational Design Demos

Content is only half the story. These demos showcase the student experience—how we use gamification, feedback, and adaptive difficulty to keep learners engaged.

Built-in Gamification Features

🔥 Streak System: Consecutive correct answers build streaks with visual flame animations that grow with each success.

Speed Bonuses: Fast, accurate responses earn bonus XP. The speedometer shows real-time items-per-minute.

🎉 Correct Bursts: Particle explosions celebrate correct answers. Wrong answers get subtle shake feedback.

📊 Performance Analytics: End-of-session results show accuracy and speed distribution and identify problem areas.

Explore the Platform

Try the student experience or access the review queue.