Data Schema

Explains the exact structure, fields, and formats of trajectories in the dataset.

Trajectory

A trajectory is the atomic unit of the dataset: one complete human execution of a task.

Execution Lifecycle

| Stage | Description |
| --- | --- |
| Task Assignment | Demonstrator receives task (via TID) |
| Start Trigger | 3-second cooldown after entering TID → recording begins |
| Execution | Full human interaction captured (video + events) |
| End Trigger | Manual stop via OS-level shortcut |
| Submission | User either submits or discards |
| Storage | Only submitted trajectories are stored |
  • One trajectory = one task = one continuous recording
  • Rehearsals are never recorded
  • All stored trajectories are post-submission

Trajectory Types

| Type | Definition |
| --- | --- |
| Successful | Task meets intent and passes human audit |
| Failed | Includes all rejected recordings (incomplete, deviation, audit failures) |
  • Failed ≠ Partial
  • Failed trajectories may still achieve the goal but violate quality constraints.

Data Type

The dataset spans multiple levels of task complexity to support both low-level and long-horizon model training.

| Type | Description | Typical Length |
| --- | --- | --- |
| Elementary | Single UI interaction (click, toggle, select) | 1 step |
| Atomic | Small sequence within a single tool | 2–3 steps |
| Multi-step | Procedural sequence within a bounded context | 4–15 steps |
| Workflow | Long-horizon tasks involving retries, backtracking, and unbounded context switching | 15+ steps |
| Games | Interactive environments requiring dynamic decision making under changing state | Variable length |
  • All types share identical schema
  • Only sequence length and task complexity vary

Action Space

Defines the complete set of interaction primitives captured in the dataset, representing all possible actions a human can perform in different environments.

| Category | Action Type | Description |
| --- | --- | --- |
| Mouse | Mouse Move / Trajectory | Continuous cursor movement across screen coordinates |
| Mouse | Mouse Click | Mouse press and release at a coordinate (includes click count) |
| Mouse | Drag & Drop | Press → drag → release sequence across coordinates |
| Mouse | Mouse Scroll | Scroll action with direction (up/down/left/right) |
| Keyboard | Key Press (Input Text) | Text input captured between input_text_start and input_text_end |
| Keyboard | Functional Key Press | Non-character keys (Enter, Tab, Escape, arrows, function keys) |
| Keyboard | Combo Key Press | Multi-key combinations (e.g., Ctrl + A, Cmd + Shift + P) |
| Keyboard | Modifier Key Press | Modifier-only inputs (Ctrl, Shift, Alt, Cmd combinations) |
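
The action space above can be sketched as a small enumeration, which may help when filtering steps by category. The class names, member names, and the `actions_in` helper are illustrative assumptions; only the labels are taken from the table.

```python
from enum import Enum


class ActionCategory(Enum):
    MOUSE = "Mouse"
    KEYBOARD = "Keyboard"


class ActionType(Enum):
    # Each value is (label as listed in the action space table, category).
    MOUSE_MOVE = ("Mouse Move / Trajectory", ActionCategory.MOUSE)
    MOUSE_CLICK = ("Mouse Click", ActionCategory.MOUSE)
    DRAG_AND_DROP = ("Drag & Drop", ActionCategory.MOUSE)
    MOUSE_SCROLL = ("Mouse Scroll", ActionCategory.MOUSE)
    INPUT_TEXT = ("Key Press (Input Text)", ActionCategory.KEYBOARD)
    FUNCTIONAL_KEY = ("Functional Key Press", ActionCategory.KEYBOARD)
    COMBO_KEY = ("Combo Key Press", ActionCategory.KEYBOARD)
    MODIFIER_KEY = ("Modifier Key Press", ActionCategory.KEYBOARD)

    def __init__(self, label, category):
        self.label = label
        self.category = category


def actions_in(category):
    """Return every action type that belongs to the given category."""
    return [action for action in ActionType if action.category is category]
```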

Action Space Representation

Actions are represented at two levels: raw events and semantic (grouped) actions.

Raw Event Log (Low-Level)

System-captured signals recorded at millisecond precision, preserving exact human behavior.

| Action | Events |
| --- | --- |
| Mouse Move | Mouse moves to (x, y) |
| Mouse Click | Mouse press (left/right) at (x, y); Mouse release (left/right) at (x, y) |
| Mouse Scroll | Mouse scrolls (direction) at (x, y) |
| Mouse Drag | Mouse press (left/right) → drag to (x, y) → Mouse release |
| Key Press | Key press and Key release |

Semantic Actions

Semantic abstraction over raw events.

| Semantic Action | Raw Events |
| --- | --- |
| Mouse Move | mouseover_start → mouse moves to (x, y) → mouseover_end |
| Mouse Click | Mouse press at (x, y); Mouse release (left/right) at (x, y); click_count (1 = single, 2 = double, 3 = triple, at the same coordinates ±2 pixels) |
| Mouse Drag | Mouse press (left/right) → drag_start → drag to (x, y) → drag_end → Mouse release (left/right) |
| Mouse Scroll | Mouse press (left/right) → scroll_start → scroll_end → Mouse release (left/right) |
| Key Press (Input Text) | input_text_start → input_text_end |
| Functional Key Press | key_down → key_up |
| Combo Key Press | combo_key_down → combo_key_up |
| Modifier Key Press | modifier_keys_down → modifier_keys_up |
  • Deterministic (fixed) grouping rules, with edge-case handling for missing OS signals
  • Both raw and grouped representations are always preserved
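
As an illustration of deterministic grouping, the sketch below folds raw press/release pairs into Mouse Click actions with a click_count of 1 to 3, using the ±2 pixel tolerance from the table. It is not the dataset's actual grouping code: the `RawEvent` and `Click` types are hypothetical, and the 500 ms multi-click window is an assumption, since no time window is specified here. Drags and missing-signal edge cases are ignored.

```python
from dataclasses import dataclass

PIXEL_TOLERANCE = 2  # from the table: same coordinates ±2 pixels
MAX_GAP_MS = 500     # assumption: the schema does not state a time window


@dataclass
class RawEvent:
    kind: str    # "press" or "release"
    button: str  # "left" or "right"
    x: int
    y: int
    t_ms: int


@dataclass
class Click:
    button: str
    x: int
    y: int
    t_start_ms: int
    t_end_ms: int
    click_count: int = 1


def group_clicks(events):
    """Group press/release pairs into clicks, merging repeats at the same spot."""
    clicks = []
    pending = None  # most recent press that has not been released yet
    for ev in events:
        if ev.kind == "press":
            pending = ev
        elif ev.kind == "release" and pending is not None and ev.button == pending.button:
            prev = clicks[-1] if clicks else None
            repeats_prev = (
                prev is not None
                and prev.button == ev.button
                and prev.click_count < 3
                and abs(prev.x - pending.x) <= PIXEL_TOLERANCE
                and abs(prev.y - pending.y) <= PIXEL_TOLERANCE
                and pending.t_ms - prev.t_end_ms <= MAX_GAP_MS
            )
            if repeats_prev:
                # Same place, soon enough: extend the previous click.
                prev.click_count += 1
                prev.t_end_ms = ev.t_ms
            else:
                clicks.append(Click(ev.button, pending.x, pending.y, pending.t_ms, ev.t_ms))
            pending = None
    return clicks
```

Because the rules are fixed, running the function twice over the same raw log yields the same grouped actions, which is the property the first bullet describes.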

Reasoning Trace

Reasoning is layered on top of grouped actions.

Pipeline

Human execution
Raw events
Grouped actions (steps)
Human reasoning (per step)
Human validation

Derived Signals

| Level | Description |
| --- | --- |
| Raw Thought | Human-written reasoning per step |
| Reasoning | Cleaned, grouped explanation of raw thoughts |
| Intent | Higher-level, multi-level abstraction across reasoning steps |

State-Action Pair

A state is the visual UI snapshot or frame aligned to a human interaction. The dataset is fundamentally structured as state → action → next state, with all modalities (video, raw event logs, Semantic Actions, and frames) synchronized on a shared millisecond timeline.

  • Extracted from 60 FPS video
  • Captured only at grouped-action boundaries, each with an absolute, millisecond-precision timestamp
  • State–action alignment is event-triggered rather than time-sampled, yielding a one-to-one, lossless mapping that enables exact trajectory replayability without temporal inference
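
Since video and events share one millisecond timeline, an action timestamp can be related to a position in the 60 FPS video with simple arithmetic. The sketch below is one way to do that; the function name, the `video_start_ts_ms` parameter, and the floor convention are assumptions rather than part of the schema.

```python
FPS = 60  # video capture rate stated in this schema


def frame_index(action_ts_ms, video_start_ts_ms, fps=FPS):
    """Index of the video frame on screen at an absolute millisecond timestamp."""
    offset_ms = action_ts_ms - video_start_ts_ms
    if offset_ms < 0:
        raise ValueError("action precedes the start of the recording")
    # Integer arithmetic avoids floating-point drift on long recordings.
    return offset_ms * fps // 1000
```

For example, an action 500 ms into the recording falls on frame 30, and one exactly 1000 ms in falls on frame 60.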

Frame Types

Frames are stored in .webp format by default (configurable to .jpg, .png) and preserve native device resolution without normalization.

| Platform | Interaction Type | Frame Capture |
| --- | --- | --- |
| macOS / Windows | Mouse move, drag, scroll | Start + End |
| macOS / Windows | Click, keypress, combo, modifier | End |
| Android | Tap, scroll, pinch, drag | Start + End |
| Android | Key, system events | End |
  • Start frame = UI state immediately before action
  • End frame = UI state captured immediately at action completion
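
The capture rules above amount to a lookup from platform and interaction type to the frames expected for a step. A minimal sketch follows; the dictionary keys and the `frames_for` helper are hypothetical names chosen for illustration, not identifiers used in the dataset.

```python
START_AND_END = ("start", "end")
END_ONLY = ("end",)

# Mirrors the frame capture table; key spellings are assumptions.
FRAME_CAPTURE = {
    "desktop": {  # macOS / Windows
        "mouse_move": START_AND_END,
        "drag": START_AND_END,
        "scroll": START_AND_END,
        "click": END_ONLY,
        "keypress": END_ONLY,
        "combo": END_ONLY,
        "modifier": END_ONLY,
    },
    "android": {
        "tap": START_AND_END,
        "scroll": START_AND_END,
        "pinch": START_AND_END,
        "drag": START_AND_END,
        "key": END_ONLY,
        "system_event": END_ONLY,
    },
}

PLATFORM_GROUP = {"macos": "desktop", "windows": "desktop", "android": "android"}


def frames_for(os_name, interaction):
    """Return which frames ("start", "end") are captured for an interaction."""
    group = PLATFORM_GROUP[os_name.lower()]  # KeyError for an unknown OS
    return FRAME_CAPTURE[group][interaction]  # KeyError for an unknown interaction
```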

Metadata Schema

Metadata is embedded inside StructuredTrajectory.json.

Core Fields

| Field | Description |
| --- | --- |
| task_id | Unique identifier of each task executed |
| instruction | Task prompt |
| tool_name | Tool used |
| category | Domain classification |
| OS | Windows / macOS / Android |
| resolution | Native screen resolution |
| duration | Task duration |
| action_count | Raw event count |
| grouped_action_count | Number of steps |
| FPS | Video capture rate (60 FPS) |
| frame_count | Total number of frames |
| trajectory_status | Successful / Failed |
| rejection_reason | Only for failed trajectories |
| video_url | Source recording |
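
A consumer might sanity-check these fields after loading a trajectory. The sketch below assumes the metadata has already been read into a flat dictionary keyed by the field names above; where the metadata sits inside StructuredTrajectory.json, and the exact stored status strings, are not specified here, so `check_metadata` is a hypothetical helper rather than a reference validator.

```python
# Fields expected on every trajectory; rejection_reason is conditional.
REQUIRED_FIELDS = (
    "task_id", "instruction", "tool_name", "category", "OS", "resolution",
    "duration", "action_count", "grouped_action_count", "FPS",
    "frame_count", "trajectory_status", "video_url",
)


def check_metadata(meta):
    """Return a list of problems found in a metadata dictionary (empty if none)."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in meta]
    # Assumption: a failed trajectory is marked with a status that reads "failed".
    is_failed = str(meta.get("trajectory_status", "")).strip().lower() == "failed"
    if meta.get("rejection_reason") and not is_failed:
        problems.append("rejection_reason set on a non-failed trajectory")
    return problems
```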

Data Formats & File Structure

Each export is a master folder containing multiple trajectories.

Directory Structure

<Master Folder>/
  ├── <TID>-<Tool>-<Prompt>/
        ├── <TID>.mp4
        ├── <TID>-EventLogs.txt
        ├── <TID>-StructuredTrajectory.json
        ├── Frames/
              ├── <TID>-Frame-Step-1-before.jpg
              ├── <TID>-Frame-Step-1-after.jpg
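
The layout above can be turned into candidate file paths with `pathlib`. This is a sketch under a few assumptions: the folder name is passed in as-is because the way `<Tool>` and `<Prompt>` are rendered is not specified, the frame extension is a parameter, and not every step necessarily has a "before" frame, so the frame list is a set of candidates rather than a guarantee.

```python
from pathlib import Path


def trajectory_paths(master, folder_name, tid, step_count, frame_ext="jpg"):
    """Build the candidate file paths for one trajectory folder."""
    root = Path(master) / folder_name
    frames = [
        root / "Frames" / f"{tid}-Frame-Step-{step}-{phase}.{frame_ext}"
        for step in range(1, step_count + 1)
        for phase in ("before", "after")
    ]
    return {
        "video": root / f"{tid}.mp4",
        "event_logs": root / f"{tid}-EventLogs.txt",
        "structured": root / f"{tid}-StructuredTrajectory.json",
        "frames": frames,
    }
```

Checking `path.exists()` on each returned entry is one way to spot an incomplete export before parsing it.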

Files per Trajectory

| File | Description |
| --- | --- |
| Video (.mp4) | Full execution, 60 FPS |
| Event Logs | Raw events (.txt / .csv / .json) |
| StructuredTrajectory.json | Grouped actions + metadata + reasoning |
| Frames | Action-aligned screenshots |

Customization

  • Formats, metadata, and schemas of the different files are configurable per enterprise delivery.