Data Schema
This section describes the exact structure, fields, and formats of trajectories in the dataset.
Trajectory
A trajectory is the atomic unit of the dataset: one complete human execution of a task.
Execution Lifecycle
| Stage | Description |
|---|---|
| Task Assignment | Demonstrator receives task (via TID) |
| Start Trigger | 3-second cooldown after entering TID → recording begins |
| Execution | Full human interaction captured (video + events) |
| End Trigger | Manual stop via OS-level shortcut |
| Submission | User either submits or discards |
| Storage | Only submitted trajectories are stored |
- One trajectory = one task = one continuous recording
- Rehearsals are never recorded
- All stored trajectories are post-submission
Trajectory Types
| Type | Definition |
|---|---|
| Successful | Task meets intent and passes human audit |
| Failed | Includes all rejected recordings (incomplete, deviation, audit failures) |
- Failed ≠ Partial
- Failed trajectories may still achieve the goal but violate quality constraints.
Data Type
The dataset spans multiple levels of task complexity to support both low-level and long-horizon model training.
| Type | Description | Typical Length |
|---|---|---|
| Elementary | Single UI interaction (click, toggle, select) | 1 step |
| Atomic | Small sequence within a single tool | 2–3 steps |
| Multi-step | Procedural sequence within a bounded context | 4–15 steps |
| Workflow | Long-horizon tasks involving retries, backtracking, and unbounded context switching | 15+ steps |
| Games | Interactive environments with dynamic decision making required under changing state | Variable length |
- All types share identical schema
- Only sequence length and task complexity vary
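As a rough illustration, the typical lengths in the table can be turned into a heuristic that guesses a complexity type from a step count. This is only a sketch: the function name `complexity_from_steps` is hypothetical, the table gives typical rather than strict lengths, and because both "4–15" and "15+" appear in the table, treating exactly 15 steps as Multi-step is an assumption. Games are left out because their length is variable.

```python
def complexity_from_steps(step_count: int) -> str:
    """Guess a complexity type from the number of grouped steps.

    Assumption: exactly 15 steps is treated as Multi-step, since the
    table lists both "4–15" and "15+". Games cannot be inferred from
    length alone and are not handled here.
    """
    if step_count < 1:
        raise ValueError("step_count must be at least 1")
    if step_count == 1:
        return "Elementary"
    if step_count <= 3:
        return "Atomic"
    if step_count <= 15:
        return "Multi-step"
    return "Workflow"
```

In practice the type is presumably assigned with the task rather than derived from length, so a heuristic like this is mainly useful for sanity checks.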
Action Space
Defines the complete set of interaction primitives captured in the dataset, representing all possible actions a human can perform in different environments.
| Category | Action Type | Description |
|---|---|---|
| Mouse | Mouse Move / Trajectory | Continuous cursor movement across screen coordinates |
| Mouse | Mouse Click | Mouse press and release at a coordinate (includes click count) |
| Mouse | Drag & Drop | Press → drag → release sequence across coordinates |
| Mouse | Mouse Scroll | Scroll action with direction (up/down/left/right) |
| Keyboard | Key Press (Input Text) | Text input captured between input_text_start and input_text_end |
| Keyboard | Functional Key Press | Non-character keys (Enter, Tab, Escape, arrows, function keys) |
| Keyboard | Combo Key Press | Multi-key combinations (e.g., Ctrl + A, Cmd + Shift + P) |
| Keyboard | Modifier Key Press | Modifier-only inputs (Ctrl, Shift, Alt, Cmd combinations) |
Action Space Representation
Actions are represented at two levels: raw events and semantic (grouped) actions.
Raw Event Log (Low-Level)
System-captured input signals, recorded at millisecond precision and preserving exact human behavior.
| Action | Events |
|---|---|
| Mouse Move | Mouse moves to (x, y) |
| Mouse Click | Mouse press (left/right) at (x, y); Mouse release (left/right) at (x, y) |
| Mouse Scroll | Mouse scrolls (direction) at (x, y) |
| Mouse Drag | Mouse press (left/right) → drag to (x, y) → Mouse release |
| Key Press | Key press and Key release |
Semantic Actions
Semantic abstraction over raw events.
| Semantic Actions | Raw Events |
|---|---|
| Mouse Move | mouseover_start → mouse moves to (x, y) → mouseover_end |
| Mouse Click | Mouse press at (x, y); Mouse release (left/right) at (x, y); click_count (1 = single, 2 = double, 3 = triple, at the same coordinates ±2 pixels) |
| Mouse Drag | Mouse press (left/right) → drag_start → drag to (x, y) → drag_end → Mouse release (left/right) |
| Mouse Scroll | Mouse press (left/right) → scroll_start → scroll_end → Mouse release (left/right) |
| Key Press (Input Text) | input_text_start → input_text_end |
| Functional Key Press | key_down → key_up |
| Combo Key Press | combo_key_down → combo_key_up |
| Modifier Key Press | modifier_keys_down → modifier_keys_up |
- Deterministic (fixed) grouping rules, with edge-case handling for missing OS signals
- Both raw and grouped representations are always preserved
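To make the click grouping rule concrete, the sketch below merges consecutive clicks into a single grouped click with a `click_count` of up to 3 when they land within ±2 pixels of each other. The class names, field names, and the 500 ms multi-click window (`MAX_GAP_MS`) are assumptions for illustration; the source states only the ±2 pixel tolerance and the 1/2/3 click counts, not the actual grouping implementation.

```python
from dataclasses import dataclass

PIXEL_TOLERANCE = 2   # from the table: same coordinates within ±2 pixels
MAX_GAP_MS = 500      # assumption: multi-click time window, not stated in the source


@dataclass
class Click:
    t_ms: int         # timestamp of the press, in milliseconds
    x: int
    y: int
    button: str       # "left" or "right"


@dataclass
class GroupedClick:
    start_ms: int
    x: int
    y: int
    button: str
    click_count: int  # 1 = single, 2 = double, 3 = triple


def group_clicks(clicks: list[Click]) -> list[GroupedClick]:
    """Merge time-ordered clicks into grouped clicks with a click count."""
    grouped: list[GroupedClick] = []
    last_ms = 0
    for c in clicks:
        prev = grouped[-1] if grouped else None
        if (
            prev is not None
            and prev.click_count < 3
            and prev.button == c.button
            and abs(c.x - prev.x) <= PIXEL_TOLERANCE
            and abs(c.y - prev.y) <= PIXEL_TOLERANCE
            and c.t_ms - last_ms <= MAX_GAP_MS
        ):
            # Same spot, same button, close in time: extend the group.
            prev.click_count += 1
        else:
            grouped.append(GroupedClick(c.t_ms, c.x, c.y, c.button, 1))
        last_ms = c.t_ms
    return grouped
```

Because the rule depends only on the event stream and fixed thresholds, the same raw log always produces the same grouped output, which is the property the deterministic grouping bullet describes.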
Reasoning Trace
Reasoning is layered on top of grouped actions.
Pipeline
Derived Signals
| Level | Description |
|---|---|
| Raw Thought | Human-written reasoning per step |
| Reasoning | Cleaned, grouped explanation of raw thoughts |
| Intent | Higher-level abstraction, at multiple levels, across reasoning steps |
State-Action Pair
A state is the visual UI snapshot or frame aligned to a human interaction. The dataset is fundamentally structured as state → action → next state, with all modalities (video, raw event logs, Semantic Actions, and frames) synchronized on a shared millisecond timeline.
- Extracted from 60 FPS video
- Captured only at grouped-action boundaries, each with an absolute, millisecond-precision timestamp
- State–action alignment is event-triggered rather than time-sampled, yielding a one-to-one, lossless mapping that enables exact trajectory replayability without temporal inference
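Because every modality shares one millisecond timeline, an action-boundary timestamp can be mapped back to a video frame index with simple arithmetic. The sketch below assumes timestamps are absolute and that the recording's start time is known; the function name `frame_index_at` and the parameter `video_start_ms` are hypothetical.

```python
FPS = 60  # video capture rate stated in the source


def frame_index_at(event_ms: int, video_start_ms: int, fps: int = FPS) -> int:
    """Return the 0-based index of the video frame shown at event_ms.

    Assumption: both timestamps are absolute milliseconds on the same clock.
    """
    offset_ms = event_ms - video_start_ms
    if offset_ms < 0:
        raise ValueError("event precedes the start of the recording")
    # Integer arithmetic avoids floating-point rounding at frame boundaries.
    return (offset_ms * fps) // 1000
```

At 60 FPS each frame covers roughly 16.7 ms, so two events less than a frame apart can resolve to the same frame index.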
Frame Types
Frames are stored in .webp format by default (configurable to .jpg, .png) and preserve native device resolution without normalization.
| Platform | Interaction Type | Frame Capture |
|---|---|---|
| macOS / Windows | Mouse move, drag, scroll | Start + End |
| macOS / Windows | Click, keypress, combo, modifier | End |
| Android | Tap, scroll, pinch, drag | Start + End |
| Android | Key, system events | End |
- Start frame = UI state immediately before action
- End frame = UI state captured immediately at action completion
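The capture rules in the table can be expressed as a small lookup. This is a sketch only: the interaction identifiers (`"mouse_move"`, `"tap"`, and so on) and the function name `frames_for` are hypothetical labels chosen for illustration, not names taken from the dataset.

```python
DESKTOP_PLATFORMS = {"macos", "windows"}

# Interactions that get both a start and an end frame, per the table.
DESKTOP_START_END = {"mouse_move", "drag", "scroll"}
ANDROID_START_END = {"tap", "scroll", "pinch", "drag"}


def frames_for(platform: str, interaction: str) -> tuple[str, ...]:
    """Return which frames are captured for an interaction on a platform."""
    name = platform.lower()
    if name in DESKTOP_PLATFORMS:
        start_end = DESKTOP_START_END
    elif name == "android":
        start_end = ANDROID_START_END
    else:
        raise ValueError(f"unknown platform: {platform}")
    # Everything else (clicks, key presses, system events) is end-only.
    return ("start", "end") if interaction in start_end else ("end",)
```

For example, `frames_for("macOS", "drag")` yields both frames, while `frames_for("Windows", "click")` yields only the end frame.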
Metadata Schema
Metadata is embedded inside StructuredTrajectory.json.
Core Fields
| Field | Description |
|---|---|
| task_id | Unique identifier of each task executed |
| instruction | Task prompt |
| tool_name | Tool used |
| category | Domain classification |
| OS | Windows / macOS / Android |
| resolution | Native screen resolution |
| duration | Task duration |
| action_count | Raw event count |
| grouped_action_count | Number of steps |
| FPS | Video capture rate (60 FPS) |
| frame_count | Total number of frames |
| trajectory_status | Successful / Failed |
| rejection_reason | Only for failed trajectories |
| video_url | Source recording |
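A consumer might sanity-check the core fields before using a trajectory. The sketch below assumes the metadata has already been loaded into a flat dictionary keyed by the field names in the table; where it sits inside StructuredTrajectory.json, the literal status and OS strings, and the function name `validate_metadata` are all assumptions.

```python
REQUIRED_FIELDS = {
    "task_id", "instruction", "tool_name", "category", "OS", "resolution",
    "duration", "action_count", "grouped_action_count", "FPS",
    "frame_count", "trajectory_status", "video_url",
}
VALID_STATUS = {"Successful", "Failed"}       # assumed literal values
VALID_OS = {"Windows", "macOS", "Android"}    # assumed literal values


def validate_metadata(meta: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = [f"missing field: {name}"
                for name in sorted(REQUIRED_FIELDS - set(meta))]
    status = meta.get("trajectory_status")
    if status is not None and status not in VALID_STATUS:
        problems.append(f"unexpected trajectory_status: {status!r}")
    if "OS" in meta and meta["OS"] not in VALID_OS:
        problems.append(f"unexpected OS: {meta['OS']!r}")
    # rejection_reason applies only to failed trajectories.
    has_reason = bool(meta.get("rejection_reason"))
    if status == "Failed" and not has_reason:
        problems.append("failed trajectory without rejection_reason")
    if status == "Successful" and has_reason:
        problems.append("successful trajectory with rejection_reason")
    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to audit a whole export in a single pass.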
Data Formats & File Structure
Each export is a master folder containing multiple trajectories.
Directory Structure
<Master Folder>/
├── <TID>-<Tool>-<Prompt>/
│   ├── <TID>.mp4
│   ├── <TID>-EventLogs.txt
│   ├── <TID>-StructuredTrajectory.json
│   └── Frames/
│       ├── <TID>-Frame-Step-1-before.jpg
│       ├── <TID>-Frame-Step-1-after.jpg
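The naming pattern in the tree can be reproduced with `pathlib` to locate the files of one trajectory. This is a sketch under the assumption that folder and file names follow the pattern exactly as shown; the function names are hypothetical, and the frame extension is a parameter because the frame format is configurable.

```python
from pathlib import Path


def trajectory_paths(master: Path, tid: str, tool: str, prompt: str) -> dict[str, Path]:
    """Build the expected paths for one trajectory folder."""
    root = master / f"{tid}-{tool}-{prompt}"
    return {
        "video": root / f"{tid}.mp4",
        "event_logs": root / f"{tid}-EventLogs.txt",
        "structured": root / f"{tid}-StructuredTrajectory.json",
        "frames_dir": root / "Frames",
    }


def frame_path(frames_dir: Path, tid: str, step: int, when: str,
               ext: str = "jpg") -> Path:
    """Build the path of a step's before/after frame."""
    if when not in ("before", "after"):
        raise ValueError("when must be 'before' or 'after'")
    return frames_dir / f"{tid}-Frame-Step-{step}-{when}.{ext}"
```

A caller could then combine the two, for example `frame_path(trajectory_paths(Path("export"), "T1", "Excel", "sort")["frames_dir"], "T1", 1, "before")`.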
Files per Trajectory
| File | Description |
|---|---|
| Video (.mp4) | Full execution, 60 FPS |
| Event Logs | Raw events (.txt / .csv / .json) |
| StructuredTrajectory.json | Grouped actions + metadata + reasoning |
| Frames | Action-aligned screenshots |
Customization
- Formats, metadata, and schema of the different files are configurable per enterprise delivery.