Data Schema

A General Data demonstration is a single, human-executed task captured as a time-aligned, multi-modal record.

Each demonstration corresponds to one unique task, identified by a Task ID (TID), and represents the full execution of that task by a human operator.

Demonstration Unit

Each task includes:

  • Task Prompt: Natural-language instruction presented to the human
  • Action Logs: Millisecond-precision interaction events
  • Video Recording: Continuous screen capture at 60 FPS
  • Frame Images: Pre-action and post-action UI states
  • Metadata: Environment, execution, and quality signals
  • Chain-of-Thought (optional): Human reasoning behind the steps taken

Temporal Alignment Guarantee

All modalities are synchronized using millisecond-precision timestamps.

  • On Windows and macOS, mouse and keyboard events are aligned with their corresponding pre-action and post-action frames
  • On Android, touch events follow the same timestamp-based alignment
  • Video, frames, and action logs share a unified timeline

This alignment enables direct use of the data for:

  • Behavior cloning
  • Offline reinforcement learning
  • Vision–language–action model training
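
As a minimal sketch of how the shared timeline can be consumed, the snippet below pairs an action with its nearest pre-action and post-action frames by timestamp. The record shapes (frame timestamps as a sorted list of milliseconds) are assumptions for illustration, not part of the schema.

```python
import bisect

def find_frames(action_ts: int, frame_times: list[int]) -> tuple[int | None, int | None]:
    """Return indices of the pre-action and post-action frames for an
    action timestamp, given frame timestamps sorted in milliseconds.
    Pre-action: last frame at or before the action; post-action: first
    frame after it. Returns None at the sequence boundaries."""
    i = bisect.bisect_right(frame_times, action_ts)
    pre = i - 1 if i > 0 else None
    post = i if i < len(frame_times) else None
    return pre, post

# Example: frames at 0 ms, 120 ms, 480 ms; a click logged at 130 ms
# pairs with the 120 ms pre-action frame and the 480 ms post-action frame.
print(find_frames(130, [0, 120, 480]))  # (1, 2)
```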

Data Type

The dataset captures user interactions across a wide range of task lengths and difficulty levels, supporting both precise UI control and long-horizon workflow learning.

Task Taxonomy

| Task Type | Action Count | Scope | Primary Signal |
| --- | --- | --- | --- |
| Atomic | 1 | Single UI primitive | Perception-to-action mapping |
| Elementary | 2–3 | Single tool | UI building blocks |
| Multi-step | 4–8 | Tightly scoped context | Procedural flows |
| Workflow | 10+ | Multi-tool | Planning and recovery |

Atomic Tasks

Single-step UI interactions such as clicking a button, selecting a menu item, or toggling a control.
Used to learn accurate perception-to-action mapping.

Elementary Tasks

Short sequences of 2 to 3 actions within a single tool, such as changing a setting or applying simple formatting.
Serve as reusable UI primitives.

Multi-step Tasks

Sequences of 4 to 8 actions within a single tool or tightly scoped context.
Capture procedural flows and short decision loops.

Workflow Tasks

Tasks involving 10 or more actions, often spanning multiple tools.
Include planning, context switching, and valid recovery paths that reflect real-world usage.
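
Because the taxonomy is defined purely by action count, a classifier is a direct translation of the table. Note that a count of 9 falls between the published Multi-step (4–8) and Workflow (10+) ranges, so this sketch returns None rather than guessing.

```python
def classify_task(action_count: int) -> str | None:
    """Map an action count to its taxonomy tier, following the table above."""
    if action_count == 1:
        return "atomic"
    if 2 <= action_count <= 3:
        return "elementary"
    if 4 <= action_count <= 8:
        return "multi-step"
    if action_count >= 10:
        return "workflow"
    return None  # 0 or 9: not covered by the published ranges
```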

Action Definitions

Actions are logged as low-level, platform-native primitives, preserving the exact structure of human input.

macOS / Windows Actions

Mouse Movement

  • mouseover_start
  • mouseover_end

Mouse Clicks

  • mouse_down_left
  • mouse_up_left
  • click_count
    • 1 = single
    • 2 = double
    • 3 = triple (same coordinates)

Drag Actions

  • mouse_down_left
  • drag_start
  • drag_end
  • mouse_up_left

Scroll Actions

  • scroll_start
  • scroll_end
  • direction: up | down | left | right

Keyboard Input

  • key_down
  • key_up
  • input_text_start
  • input_text_end

Rules

  • Combo keys use explicit concatenation (a parsing sketch follows this list)
    Example: CtrlLeft+AltRight+KeyA
  • Functional keys are logged explicitly
  • Modifier keys are logged separately and precisely
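
Because combo keys are logged as a single '+'-delimited string, splitting one back into modifiers and a final key is straightforward. This is a sketch based only on the example above; the full set of key names is not specified here.

```python
def split_combo(combo: str) -> tuple[list[str], str]:
    """Split an explicitly concatenated combo into (modifiers, key).

    >>> split_combo("CtrlLeft+AltRight+KeyA")
    (['CtrlLeft', 'AltRight'], 'KeyA')
    """
    *modifiers, key = combo.split("+")
    return modifiers, key
```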

Android / iOS Actions

Touch Actions

  • tap
  • double_tap
  • long_press (duration in milliseconds)

Scroll Gestures

  • scroll_up
  • scroll_down
  • scroll_left
  • scroll_right

Pinch Gestures

  • pinch_zoom_in
  • pinch_zoom_out
    (two-point coordinates captured)

Drag Actions

  • drag_start
  • drag_end

Text Input

  • input_text
    (masked values with character length)

Device-Level Events

  • orientation_lock
  • recent_button
  • back_button
  • home_button

Files Included (Per Task)

Each task generates a complete, deterministic set of files:

  • Prompt File (.txt)
    Task instruction and optional setup context

  • Video File (.mp4)
    Continuous screen recording at 60 FPS

  • Raw Action Log (.txt / .csv)
    Machine-readable event stream with timestamps

  • Grouped Action Log (.txt)
    Human-readable grouping of actions

  • Frame Images (.webp)
    Pre-action and post-action UI snapshots

  • Chain-of-Thought File (.txt / .csv)
    Human reasoning when applicable

  • Metadata File
    Structured task summary and QA signals
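
Since the file set per task is deterministic, a completeness check can be written directly against it. The concrete file names below are assumptions for illustration; only the categories and extensions come from the list above.

```python
from pathlib import Path

# Hypothetical deterministic names; only the extensions are documented above.
EXPECTED_FILES = [
    "prompt.txt",            # task prompt
    "recording.mp4",         # screen recording
    "actions_raw.csv",       # raw action log
    "actions_grouped.txt",   # grouped action log
    "metadata.json",         # metadata file (serialization assumed)
]

def missing_files(task_dir: Path) -> list[str]:
    """Return expected files absent from a task directory (frame images and
    the optional chain-of-thought file vary per task, so they are not checked)."""
    return [name for name in EXPECTED_FILES if not (task_dir / name).exists()]
```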

Task Object

The task object is the top-level record for a complete demonstration.

Core Fields

  • Task ID
    Unique identifier linking all files

  • Prompt
    Natural-language task instruction

  • Context
    Optional setup or prerequisites

  • Environment

    • OS: Windows | macOS | Android | iOS (in progress)
    • Screen resolution: width × height (pixels)
    • Device type: Desktop | Laptop | Mobile
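
A minimal sketch of the core fields as a typed record; the field names and types are assumptions, and only the concepts come from the schema above.

```python
from dataclasses import dataclass

@dataclass
class Task:
    task_id: str                 # unique identifier linking all files
    prompt: str                  # natural-language task instruction
    context: str | None          # optional setup or prerequisites
    os: str                      # "Windows" | "macOS" | "Android" | "iOS"
    resolution: tuple[int, int]  # (width, height) in pixels
    device_type: str             # "Desktop" | "Laptop" | "Mobile"
```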

Supported Execution Environments (Desktop)

macOS

  • macOS 12 Monterey, 13 Ventura, 14 Sonoma, 15 Sequoia
  • Screen resolution: 1920 × 1080 or higher
  • Devices: MacBook, iMac, Mac mini, Mac Studio
  • Architecture: Apple Silicon or Intel (64-bit)

Windows

  • Windows 10, Windows 11
  • Screen resolution: 1920 × 1080
  • Devices: Desktop or Laptop
  • Architecture: 64-bit required

Action Object

An action object represents a single user interaction; actions are stored in sequential order within a task.

Fields

  • Action Type
    mouse_move, mouse_down, mouse_up, click, scroll, drag_start, drag_end,
    keypress, text_input, tap, double_tap, long_press,
    pinch_zoom_in, pinch_zoom_out,
    orientation_change, back_button, home_button

  • Timestamp
    Millisecond-precision event time

  • Coordinates
    Screen-relative (x, y), when applicable

  • Key Code
    Normalized keyboard identifier

  • Click Count
    1 | 2 | 3

  • Direction
    Up | Down | Left | Right

  • Duration (ms)
    For timed actions
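
The same fields expressed as a typed record; optional fields apply only to the action types that use them. Field names here are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Action:
    action_type: str                   # e.g. "click", "scroll", "long_press"
    timestamp_ms: int                  # millisecond-precision event time
    x: int | None = None               # screen-relative coordinates,
    y: int | None = None               #   when applicable
    key_code: str | None = None        # normalized keyboard identifier
    click_count: int | None = None     # 1 | 2 | 3
    direction: str | None = None       # "up" | "down" | "left" | "right"
    duration_ms: int | None = None     # for timed actions
```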

Frame Object

Frames capture the visual UI state immediately before and after each grouped action.

  • Frames are generated only for grouped human actions
  • Not generated for every raw event

Fields

  • Frame ID
  • Relative Timestamp
  • Resolution (width × height)
  • Image (.webp, RGB)
  • Frame Type
    • Pre-action
    • Post-action

Video Object

  • Full screen recording of the entire task
  • Captured at 60 FPS
  • Time-aligned with action logs and frames
  • Used for playback, verification, QA, and training
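
Because the recording runs at a fixed 60 FPS on the same timeline as the action log, an event timestamp maps to a video frame index by simple arithmetic. A sketch, assuming the video starts at timeline zero:

```python
FPS = 60

def video_frame_index(timestamp_ms: int) -> int:
    """Map a timeline timestamp to the enclosing video frame.
    At 60 FPS each frame spans 1000/60 ≈ 16.67 ms, so an action
    logged at 130 ms falls in frame 130 * 60 // 1000 = 7."""
    return timestamp_ms * FPS // 1000
```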

Metadata Object

Metadata summarizes execution context and quality signals.

  • Task ID
  • Environment (OS, device, resolution, UI theme)
  • Tooling (tool name and version)
  • Task Summary (duration, action count, FPS)
  • Lag Percentage (typically 10–35%)
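
An illustrative metadata record follows; the keys mirror the bullets above but are assumptions, since the concrete serialization is not specified here.

```python
# Hypothetical metadata record mirroring the fields listed above.
metadata = {
    "task_id": "TID-000123",            # hypothetical ID format
    "environment": {
        "os": "Windows",
        "device": "Laptop",
        "resolution": [1920, 1080],
        "ui_theme": "dark",
    },
    "tooling": {"name": "ExampleTool", "version": "1.2.3"},
    "task_summary": {"duration_ms": 48210, "action_count": 14, "fps": 60},
    "lag_percentage": 22.5,             # typically 10-35%
}
```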

File Structure

  • One directory per Task ID
  • Deterministic naming
  • Files ordered by timestamp
  • No shared files across tasks

Usage Notes

The dataset supports multiple learning and evaluation paradigms using the same task data.

Behavior Cloning

Models learn to predict human actions from observed UI state.
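
As a sketch, supervised pairs can be assembled by joining each grouped action with its pre-action frame on the shared timeline; the record shapes (lists of dicts with a "timestamp_ms" key) are assumptions.

```python
import bisect

def behavior_cloning_pairs(actions, frames):
    """Yield (pre-action frame, action) supervised training pairs.

    `actions` and `frames` are assumed to be lists of dicts with a
    "timestamp_ms" key, each sorted by time; the pre-action frame is
    the last frame at or before the action's timestamp."""
    frame_times = [f["timestamp_ms"] for f in frames]
    for action in actions:
        i = bisect.bisect_right(frame_times, action["timestamp_ms"])
        if i > 0:
            yield frames[i - 1], action
```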

Offline Reinforcement Learning

Complete action sequences and their timing are preserved, so full trajectories can be consumed as offline training data.

Vision–Language–Action Models

Prompts, frames, and actions are time-aligned for instruction-following.

Analysis and Inspection

Consistent structure enables replay, validation, debugging, and workflow analysis.