Data Schema
A General Data demonstration is a single, human-executed task captured as a time-aligned, multi-modal record.
Each demonstration corresponds to one unique task, identified by a Task ID (TID), and represents the full execution of that task by a human operator.
Demonstration Unit
Each task includes:
- Task Prompt: Natural-language instruction presented to the human
- Action Logs: Millisecond-precision interaction events
- Video Recording: Continuous screen capture at 60 FPS
- Frame Images: Pre-action and post-action UI states
- Metadata: Environment, execution, and quality signals
- Chain-of-Thought (optional): Human reasoning behind the steps taken
Temporal Alignment Guarantee
All modalities are synchronized using millisecond-precision timestamps.
- On Windows and macOS, mouse and keyboard events are aligned with their corresponding pre-action and post-action frames
- On Android, touch events follow the same timestamp-based alignment
- Video, frames, and action logs share a unified timeline
This alignment enables direct use of the data for:
- Behavior cloning
- Offline reinforcement learning
- Vision–language–action model training
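As a concrete illustration of consuming the shared timeline, the sketch below matches an action timestamp to its nearest pre- and post-action frames. This is a minimal sketch: the tuple layout and field order are assumptions made for the example, not part of the schema.

```python
from bisect import bisect_right

def nearest_frames(action_ts_ms, frames):
    """Return the (pre, post) frames bracketing one action timestamp.

    `frames` is a list of (timestamp_ms, frame_type, path) tuples sorted
    by timestamp; this layout is illustrative, not normative.
    """
    timestamps = [f[0] for f in frames]
    i = bisect_right(timestamps, action_ts_ms)
    pre = frames[i - 1] if i > 0 else None         # last frame at or before the action
    post = frames[i] if i < len(frames) else None  # first frame after the action
    return pre, post
```

Because all modalities share one millisecond-precision timeline, the same lookup works unchanged for mouse, keyboard, and touch events.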
Data Type
The dataset captures user interactions across a wide range of task lengths and difficulty levels, supporting both precise UI control and long-horizon workflow learning.
Task Taxonomy
| Task Type | Action Count | Scope | Primary Signal |
|---|---|---|---|
| Atomic | 1 | Single UI primitive | Perception-to-action mapping |
| Elementary | 2–3 | Single tool | UI building blocks |
| Multi-step | 4–8 | Tightly scoped context | Procedural flows |
| Workflow | 10+ | Multi-tool | Planning and recovery |
Atomic Tasks
Single-step UI interactions such as clicking a button, selecting a menu item, or toggling a control.
Used to learn accurate perception-to-action mapping.
Elementary Tasks
Short sequences of 2 to 3 actions within a single tool, such as changing a setting or applying simple formatting.
Serve as reusable UI primitives.
Multi-step Tasks
Sequences of 4 to 8 actions within a single tool or tightly scoped context.
Capture procedural flows and short decision loops.
Workflow Tasks
Tasks involving 10 or more actions, often spanning multiple tools.
Include planning, context switching, and valid recovery paths that reflect real-world usage.
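Taken together, the taxonomy reduces to a simple mapping from action count to task type. The sketch below encodes the ranges from the table above; note that a count of 9 falls between the published ranges, and folding it into Workflow is this sketch's assumption, not a dataset rule.

```python
def task_type(action_count: int) -> str:
    """Classify a demonstration by action count per the taxonomy table."""
    if action_count < 1:
        raise ValueError("a demonstration has at least one action")
    if action_count == 1:
        return "Atomic"
    if action_count <= 3:
        return "Elementary"
    if action_count <= 8:
        return "Multi-step"
    return "Workflow"  # 10+ per the table; 9 folded in by assumption
```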
Action Definitions
Actions are logged as low-level, platform-native primitives, preserving the exact structure of human input.
macOS / Windows Actions
Mouse Movement
- mouseover_start
- mouseover_end
Mouse Clicks
- mouse_down_left
- mouse_up_left
- click_count
  - 1 = single
  - 2 = double
  - 3 = triple (same coordinates)
Drag Actions
- mouse_down_left
- drag_start
- drag_end
- mouse_up_left
Scroll Actions
- scroll_start
- scroll_end
- direction: up | down | left | right
Keyboard Input
- key_down
- key_up
- input_text_start
- input_text_end
Rules
- Combo keys use explicit concatenation, e.g. CtrlLeft+AltRight+KeyA
- Functional keys are logged explicitly
- Modifier keys are logged separately and precisely
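For illustration, a concatenated combo string can be split back into modifiers and keys as below. The modifier prefix set is an assumption drawn from the example above, not an exhaustive specification.

```python
MODIFIER_PREFIXES = ("Ctrl", "Alt", "Shift", "Meta")  # assumed, not exhaustive

def parse_combo(combo: str):
    """Split a concatenated combo such as 'CtrlLeft+AltRight+KeyA'
    into (modifiers, keys)."""
    parts = combo.split("+")
    modifiers = [p for p in parts if p.startswith(MODIFIER_PREFIXES)]
    keys = [p for p in parts if not p.startswith(MODIFIER_PREFIXES)]
    return modifiers, keys

print(parse_combo("CtrlLeft+AltRight+KeyA"))
# (['CtrlLeft', 'AltRight'], ['KeyA'])
```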
Android / iOS
Touch Actions
- tap
- double_tap
- long_press (duration in milliseconds)
Scroll Gestures
- scroll_up
- scroll_down
- scroll_left
- scroll_right
Pinch Gestures
- pinch_zoom_in
- pinch_zoom_out
Two-point coordinates are captured for both gestures.
Drag Actions
- drag_start
- drag_end
Text Input
- input_text (values masked; character length preserved)
Device-Level Events
- orientation_lock
- recent_button
- back_button
- home_button
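As an illustration, raw mobile events of the kinds above might deserialize into records like the following. The key names are assumptions made for readability; the normative field list appears later under Action Object.

```python
# Illustrative mobile event records; key names are assumed, not normative.
long_press = {
    "action_type": "long_press",
    "timestamp_ms": 152_340,
    "coordinates": (540, 1180),
    "duration_ms": 800,  # long_press logs its duration in milliseconds
}
pinch = {
    "action_type": "pinch_zoom_in",
    "timestamp_ms": 153_010,
    "coordinates": [(420, 900), (660, 1300)],  # two touch points captured
}
text = {
    "action_type": "input_text",
    "timestamp_ms": 154_225,
    "masked_value": "*****",  # values are masked
    "char_length": 5,         # character length is preserved
}
```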
Files Included (Per Task)
Each task generates a complete, deterministic set of files:
- Prompt File (.txt): Task instruction and optional setup context
- Video File (.mp4): Continuous screen recording at 60 FPS
- Raw Action Log (.txt / .csv): Machine-readable event stream with timestamps
- Grouped Action Log (.txt): Human-readable grouping of actions
- Frame Images (.webp): Pre-action and post-action UI snapshots
- Chain-of-Thought File (.txt / .csv): Human reasoning when applicable
- Metadata File: Structured task summary and QA signals
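A minimal integrity check over one task directory might look as follows, assuming only the file extensions listed above; the deterministic naming scheme itself (see File Structure) is not reproduced in this sketch.

```python
from pathlib import Path

EXPECTED_SUFFIXES = {".txt", ".mp4", ".webp"}  # minimal set every task should contain

def check_task_dir(task_dir: Path) -> dict:
    """Group files in one Task ID directory by extension and flag any
    expected extension that is absent. Only suffixes are checked because
    the exact naming scheme is not reproduced here."""
    by_suffix: dict[str, list[str]] = {}
    for f in sorted(task_dir.iterdir()):
        by_suffix.setdefault(f.suffix, []).append(f.name)
    missing = sorted(EXPECTED_SUFFIXES - by_suffix.keys())
    return {"files": by_suffix, "missing_suffixes": missing}
```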
Task Object
The task object is the top-level record for a complete demonstration.
Core Fields
- Task ID: Unique identifier linking all files
- Prompt: Natural-language task instruction
- Context: Optional setup or prerequisites
- Environment:
  - OS: Windows | macOS | Android | iOS (in progress)
  - Screen resolution: width × height (pixels)
  - Device type: Desktop | Laptop | Mobile
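Read as a typed record, the core fields might look like the dataclass sketch below; the class and field names are illustrative, not a published API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Environment:
    os: str             # "Windows" | "macOS" | "Android" | "iOS"
    screen_width: int   # pixels
    screen_height: int  # pixels
    device_type: str    # "Desktop" | "Laptop" | "Mobile"

@dataclass
class Task:
    task_id: str        # TID linking all files of the demonstration
    prompt: str         # natural-language task instruction
    environment: Environment
    context: Optional[str] = None  # optional setup or prerequisites
```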
Supported Execution Environments (Desktop)
macOS
- macOS 12 Monterey, 13 Ventura, 14 Sonoma, 15 Sequoia
- Screen resolution: 1920 × 1080 or higher
- Devices: MacBook, iMac, Mac mini, Mac Studio
- Architecture: Apple Silicon or Intel (64-bit)
Windows
- Windows 10, Windows 11
- Screen resolution: 1920 × 1080
- Devices: Desktop or Laptop
- Architecture: 64-bit required
Action Object
An action object represents a single user interaction; action objects are stored in sequential order within a task.
Fields
- Action Type: mouse_move, mouse_down, mouse_up, click, scroll, drag_start, drag_end, keypress, text_input, tap, double_tap, long_press, pinch_zoom_in, pinch_zoom_out, orientation_change, back_button, home_button
- Timestamp: Millisecond-precision event time
- Coordinates: Screen-relative (x, y), when applicable
- Key Code: Normalized keyboard identifier
- Click Count: 1 | 2 | 3
- Direction: Up | Down | Left | Right
- Duration (ms): For timed actions such as long_press
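A possible typed rendering of the action fields, with names assumed for the example; optional fields default to None because they apply only to some action types.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Action:
    action_type: str                               # e.g. "click", "scroll", "long_press"
    timestamp_ms: int                              # millisecond-precision event time
    coordinates: Optional[Tuple[int, int]] = None  # screen-relative (x, y)
    key_code: Optional[str] = None                 # normalized keyboard identifier
    click_count: Optional[int] = None              # 1 | 2 | 3
    direction: Optional[str] = None                # "up" | "down" | "left" | "right"
    duration_ms: Optional[int] = None              # for timed actions such as long_press
```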
Frame Object
Frames capture the visual UI state aligned to action completion.
- Frames are generated only for grouped human actions
- Not generated for every raw event
Fields
- Frame ID
- Relative Timestamp
- Resolution (width × height)
- Image (.webp, RGB)
- Frame Type
  - Pre-action
  - Post-action
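The same fields as a sketch record, with illustrative names:

```python
from dataclasses import dataclass

@dataclass
class Frame:
    frame_id: str
    relative_timestamp_ms: int  # offset from task start
    width: int                  # resolution, pixels
    height: int
    image_path: str             # .webp, RGB
    frame_type: str             # "pre_action" | "post_action"
```

Because frames are emitted only for grouped actions, each grouped action maps to at most one pre-action and one post-action Frame.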
Video Object
- Full screen recording of the entire task
- Captured at 60 FPS
- Time-aligned with action logs and frames
- Used for playback, verification, QA, and training
Metadata Object
Metadata summarizes execution context and quality signals.
- Task ID
- Environment (OS, device, resolution, UI theme)
- Tooling (tool name and version)
- Task Summary (duration, action count, FPS)
- Lag Percentage (typically 10–35%)
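An illustrative metadata record follows; the keys mirror the fields above, but the concrete values and the serialization format are assumptions, not specified by this section.

```python
# Illustrative metadata record; all concrete values are hypothetical.
metadata = {
    "task_id": "TID-000123",  # hypothetical identifier
    "environment": {
        "os": "Windows 11",
        "device": "Laptop",
        "resolution": "1920x1080",
        "ui_theme": "dark",
    },
    "tooling": {"name": "recorder", "version": "1.0"},  # assumed values
    "task_summary": {"duration_ms": 48_200, "action_count": 12, "fps": 60},
    "lag_percentage": 22.5,  # typically 10-35%
}
```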
File Structure
- One directory per Task ID
- Deterministic naming
- Files ordered by timestamp
- No shared files across tasks
Usage Notes
The dataset supports multiple learning and evaluation paradigms using the same task data.
Behavior Cloning
Models learn to predict human actions from observed UI state.
Offline Reinforcement Learning
Full action sequences and timing are preserved.
Vision–Language–Action Models
Prompts, frames, and actions are time-aligned for instruction-following.
Analysis and Inspection
Consistent structure enables replay, validation, debugging, and workflow analysis.
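As a sketch of the behavior-cloning use, the helper below turns one demonstration into (prompt, pre-action frame) → action training samples. Every name in it is illustrative; it assumes frames have already been paired with actions on the shared timeline.

```python
def to_bc_samples(prompt, actions, frames_by_action):
    """Build behavior-cloning samples from one demonstration.

    `actions` is the ordered action list; `frames_by_action` maps an
    action index to its (pre, post) frame pair. Names are illustrative,
    not part of the schema.
    """
    samples = []
    for i, action in enumerate(actions):
        pre, _post = frames_by_action[i]
        samples.append({"prompt": prompt, "observation": pre, "label": action})
    return samples
```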