Data Schema
A General Data demonstration is a single, human-executed task captured as a time-aligned, multi-modal record.
Each demonstration corresponds to one unique task, identified by a Task ID (TID), and represents the full execution of that task by a human operator.
Demonstration Unit
Each task includes:
- Task Prompt: Natural-language instruction presented to the human
- Action Logs: Millisecond-precision interaction events
- Video Recording: Continuous screen capture at 60 FPS
- Frame Images: Pre-action and post-action UI states
- Metadata: Environment, execution, and quality signals
- Chain-of-Thought (optional): Human reasoning behind the steps taken
Temporal Alignment Guarantee
All modalities are synchronized using millisecond-precision timestamps.
- On Windows and macOS, mouse and keyboard events are aligned with their corresponding pre-action and post-action frames
- On Android, touch events follow the same timestamp-based alignment
- Video, frames, and action logs share a unified timeline
This alignment enables direct use of the data for:
- Behavior cloning
- Offline reinforcement learning
- Vision–language–action model training
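As a concrete illustration of consuming the shared timeline, the sketch below matches an action timestamp to its nearest pre- and post-action frames. This is a minimal sketch: the tuple layout and field order are assumptions made for the example, not part of the schema.

```python
from bisect import bisect_right

def nearest_frames(action_ts_ms, frames):
    """Return the (pre, post) frames bracketing one action timestamp.

    `frames` is a list of (timestamp_ms, frame_type, path) tuples sorted
    by timestamp; this layout is illustrative, not normative.
    """
    timestamps = [f[0] for f in frames]
    i = bisect_right(timestamps, action_ts_ms)
    pre = frames[i - 1] if i > 0 else None         # last frame at or before the action
    post = frames[i] if i < len(frames) else None  # first frame after the action
    return pre, post
```

Because all modalities share one millisecond-precision timeline, the same lookup works unchanged for mouse, keyboard, and touch events.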
Data Type
The dataset captures user interactions across a wide range of task lengths and difficulty levels, supporting both precise UI control and long-horizon workflow learning.
Task Taxonomy
| Task Type | Action Count | Scope | Primary Signal |
|---|---|---|---|
| Atomic | 1 | Single UI primitive | Perception-to-action mapping |
| Elementary | 2–3 | Single tool | UI building blocks |
| Multi-step | 4–8 | Tightly scoped context | Procedural flows |
| Workflow | 10+ | Multi-tool | Planning and recovery |
Atomic Tasks
Single-step UI interactions such as clicking a button, selecting a menu item, or toggling a control.
Used to learn accurate perception-to-action mapping.
Elementary Tasks
Short sequences of 2 to 3 actions within a single tool, such as changing a setting or applying simple formatting.
Serve as reusable UI primitives.
Multi-step Tasks
Sequences of 4 to 8 actions within a single tool or tightly scoped context.
Capture procedural flows and short decision loops.
Workflow Tasks
Tasks involving 10 or more actions, often spanning multiple tools.
Include planning, context switching, and valid recovery paths that reflect real-world usage.
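Taken together, the taxonomy reduces to a simple mapping from action count to task type. The sketch below encodes the ranges from the table above; note that a count of 9 falls between the published ranges, and folding it into Workflow is this sketch's assumption, not a dataset rule.

```python
def task_type(action_count: int) -> str:
    """Classify a demonstration by action count per the taxonomy table."""
    if action_count < 1:
        raise ValueError("a demonstration has at least one action")
    if action_count == 1:
        return "Atomic"
    if action_count <= 3:
        return "Elementary"
    if action_count <= 8:
        return "Multi-step"
    return "Workflow"  # 10+ per the table; 9 folded in by assumption
```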
Action Definitions
Actions are logged as low-level, platform-native primitives, preserving the exact structure of human input.
macOS / Windows Actions
Mouse Movement
- mouseover_start
- mouseover_end
Mouse Clicks
- mouse_down_left
- mouse_up_left
- click_count
  - 1 = single
  - 2 = double
  - 3 = triple (same coordinates)
Drag Actions
- mouse_down_left
- drag_start
- drag_end
- mouse_up_left
Scroll Actions
- scroll_start
- scroll_end
- direction: up | down | left | right
Keyboard Input
- key_down
- key_up
- input_text_start
- input_text_end
Rules
- Combo keys use explicit concatenation, e.g. CtrlLeft+AltRight+KeyA
- Functional keys are logged explicitly
- Modifier keys are logged separately and precisely
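For illustration, a concatenated combo string can be split back into modifiers and keys as below. The modifier prefix set is an assumption drawn from the example above, not an exhaustive specification.

```python
MODIFIER_PREFIXES = ("Ctrl", "Alt", "Shift", "Meta")  # assumed, not exhaustive

def parse_combo(combo: str):
    """Split a concatenated combo such as 'CtrlLeft+AltRight+KeyA'
    into (modifiers, keys)."""
    parts = combo.split("+")
    modifiers = [p for p in parts if p.startswith(MODIFIER_PREFIXES)]
    keys = [p for p in parts if not p.startswith(MODIFIER_PREFIXES)]
    return modifiers, keys

print(parse_combo("CtrlLeft+AltRight+KeyA"))
# (['CtrlLeft', 'AltRight'], ['KeyA'])
```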
Android / iOS
Touch Actions
- tap
- double_tap
- long_press (duration in milliseconds)
Scroll Gestures
- scroll_up
- scroll_down
- scroll_left
- scroll_right
Pinch Gestures
- pinch_zoom_in
- pinch_zoom_out
Two-point coordinates are captured for both gestures.
Drag Actions
- drag_start
- drag_end
Text Input
- input_text (values masked; character length preserved)
Device-Level Events
- orientation_lock
- recent_button
- back_button
- home_button
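As an illustration, raw mobile events of the kinds above might deserialize into records like the following. The key names are assumptions made for readability; the normative field list appears later under Action Object.

```python
# Illustrative mobile event records; key names are assumed, not normative.
long_press = {
    "action_type": "long_press",
    "timestamp_ms": 152_340,
    "coordinates": (540, 1180),
    "duration_ms": 800,  # long_press logs its duration in milliseconds
}
pinch = {
    "action_type": "pinch_zoom_in",
    "timestamp_ms": 153_010,
    "coordinates": [(420, 900), (660, 1300)],  # two touch points captured
}
text = {
    "action_type": "input_text",
    "timestamp_ms": 154_225,
    "masked_value": "*****",  # values are masked
    "char_length": 5,         # character length is preserved
}
```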
Files Included (Per Task)
Each task generates a complete, deterministic set of files:
- Prompt File (.txt): Task instruction and optional setup context
- Video File (.mp4): Continuous screen recording at 60 FPS
- Raw Action Log (.txt / .csv): Machine-readable event stream with timestamps
- Grouped Action Log (.txt): Human-readable grouping of actions
- Frame Images (.webp): Pre-action and post-action UI snapshots
- Chain-of-Thought File (.txt / .csv): Human reasoning when applicable
- Metadata File: Structured task summary and QA signals
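A minimal integrity check over one task directory might look as follows, assuming only the file extensions listed above; the deterministic naming scheme itself (see File Structure) is not reproduced in this sketch.

```python
from pathlib import Path

EXPECTED_SUFFIXES = {".txt", ".mp4", ".webp"}  # minimal set every task should contain

def check_task_dir(task_dir: Path) -> dict:
    """Group files in one Task ID directory by extension and flag any
    expected extension that is absent. Only suffixes are checked because
    the exact naming scheme is not reproduced here."""
    by_suffix: dict[str, list[str]] = {}
    for f in sorted(task_dir.iterdir()):
        by_suffix.setdefault(f.suffix, []).append(f.name)
    missing = sorted(EXPECTED_SUFFIXES - by_suffix.keys())
    return {"files": by_suffix, "missing_suffixes": missing}
```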
Task Object
The task object is the top-level record for a complete demonstration.
Core Fields
- Task ID: Unique identifier linking all files
- Prompt: Natural-language task instruction
- Context: Optional setup or prerequisites
- Environment:
  - OS: Windows | macOS | Android | iOS (in progress)
  - Screen resolution: width × height (pixels)
  - Device type: Desktop | Laptop | Mobile
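Read as a typed record, the core fields might look like the dataclass sketch below; the class and field names are illustrative, not a published API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Environment:
    os: str             # "Windows" | "macOS" | "Android" | "iOS"
    screen_width: int   # pixels
    screen_height: int  # pixels
    device_type: str    # "Desktop" | "Laptop" | "Mobile"

@dataclass
class Task:
    task_id: str        # TID linking all files of the demonstration
    prompt: str         # natural-language task instruction
    environment: Environment
    context: Optional[str] = None  # optional setup or prerequisites
```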
Supported Execution Environments (Desktop)
macOS
- macOS 12 Monterey, 13 Ventura, 14 Sonoma, 15 Sequoia
- Screen resolution: 1920 × 1080 or higher
- Devices: MacBook, iMac, Mac mini, Mac Studio
- Architecture: Apple Silicon or Intel (64-bit)
Windows
- Windows 10, Windows 11
- Screen resolution: 1920 × 1080
- Devices: Desktop or Laptop
- Architecture: 64-bit required
Action Object
An action object represents a single user interaction; action objects are stored in sequential order within a task.
Fields
- Action Type: mouse_move, mouse_down, mouse_up, click, scroll, drag_start, drag_end, keypress, text_input, tap, double_tap, long_press, pinch_zoom_in, pinch_zoom_out, orientation_change, back_button, home_button
- Timestamp: Millisecond-precision event time
- Coordinates: Screen-relative (x, y), when applicable
- Key Code: Normalized keyboard identifier
- Click Count: 1 | 2 | 3
- Direction: Up | Down | Left | Right
- Duration (ms): For timed actions such as long_press
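A possible typed rendering of the action fields, with names assumed for the example; optional fields default to None because they apply only to some action types.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Action:
    action_type: str                               # e.g. "click", "scroll", "long_press"
    timestamp_ms: int                              # millisecond-precision event time
    coordinates: Optional[Tuple[int, int]] = None  # screen-relative (x, y)
    key_code: Optional[str] = None                 # normalized keyboard identifier
    click_count: Optional[int] = None              # 1 | 2 | 3
    direction: Optional[str] = None                # "up" | "down" | "left" | "right"
    duration_ms: Optional[int] = None              # for timed actions such as long_press
```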
Frame Object
Frames capture the visual UI state aligned to action completion.
- Frames are generated only for grouped human actions
- Not generated for every raw event
Fields
- Frame ID
- Relative Timestamp
- Resolution (width × height)
- Image (.webp, RGB)
- Frame Type
  - Pre-action
  - Post-action
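The same fields as a sketch record, with illustrative names:

```python
from dataclasses import dataclass

@dataclass
class Frame:
    frame_id: str
    relative_timestamp_ms: int  # offset from task start
    width: int                  # resolution, pixels
    height: int
    image_path: str             # .webp, RGB
    frame_type: str             # "pre_action" | "post_action"
```

Because frames are emitted only for grouped actions, each grouped action maps to at most one pre-action and one post-action Frame.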
Video Object
- Full screen recording of the entire task
- Captured at 60 FPS
- Time-aligned with action logs and frames
- Used for playback, verification, QA, and training
Metadata Object
Metadata summarizes execution context and quality signals.
- Task ID
- Environment (OS, device, resolution, UI theme)
- Tooling (tool name and version)
- Task Summary (duration, action count, FPS)
- Lag Percentage (typically 10–35%)
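An illustrative metadata record follows; the keys mirror the fields above, but the concrete values and the serialization format are assumptions, not specified by this section.

```python
# Illustrative metadata record; all concrete values are hypothetical.
metadata = {
    "task_id": "TID-000123",  # hypothetical identifier
    "environment": {
        "os": "Windows 11",
        "device": "Laptop",
        "resolution": "1920x1080",
        "ui_theme": "dark",
    },
    "tooling": {"name": "recorder", "version": "1.0"},  # assumed values
    "task_summary": {"duration_ms": 48_200, "action_count": 12, "fps": 60},
    "lag_percentage": 22.5,  # typically 10-35%
}
```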
File Structure
- One directory per Task ID
- Deterministic naming
- Files ordered by timestamp
- No shared files across tasks
Usage Notes
The dataset supports multiple learning and evaluation paradigms using the same task data.
Behavior Cloning
Models learn to predict human actions from observed UI state.
Offline Reinforcement Learning
Full action sequences and timing are preserved.
Vision–Language–Action Models
Prompts, frames, and actions are time-aligned for instruction-following.
Analysis and Inspection
Consistent structure enables replay, validation, debugging, and workflow analysis.
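As a sketch of the behavior-cloning use, the helper below turns one demonstration into (prompt, pre-action frame) → action training samples. Every name in it is illustrative; it assumes frames have already been paired with actions on the shared timeline.

```python
def to_bc_samples(prompt, actions, frames_by_action):
    """Build behavior-cloning samples from one demonstration.

    `actions` is the ordered action list; `frames_by_action` maps an
    action index to its (pre, post) frame pair. Names are illustrative,
    not part of the schema.
    """
    samples = []
    for i, action in enumerate(actions):
        pre, _post = frames_by_action[i]
        samples.append({"prompt": prompt, "observation": pre, "label": action})
    return samples
```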