Data Schema: Action Logs, Frames, and Metadata

This page describes the exact structure, fields, and file formats of the trajectories in the dataset.

Trajectory

A trajectory is the atomic unit of the dataset: one complete human execution of a task.

Execution Lifecycle

Stage	Description
Task Assignment	Demonstrator receives task (via TID)
Start Trigger	Recording begins after a 3-second delay once the TID is entered
Execution	Full human interaction captured (video + events)
End Trigger	Manual stop via OS-level shortcut
Submission	User either submits or discards
Storage	Only submitted trajectories are stored

One trajectory = one task = one continuous recording
Rehearsals are never recorded
All stored trajectories are post-submission

Trajectory Types

Type	Definition
Successful	Task meets intent and passes human audit
Failed	Includes all rejected recordings (incomplete, deviation, audit failures)

Failed ≠ Partial
Failed trajectories may still achieve the goal but violate quality constraints.

Data Type

The dataset spans multiple levels of task complexity to support both low-level and long-horizon model training.

Type	Description	Typical Length
Elementary	Single UI interaction (click, toggle, select)	1 step
Atomic	Small sequence within a single tool	2–3 steps
Multi-step	Procedural sequence within a bounded context	4–15 steps
Workflow	Long-horizon tasks involving retries, backtracking, and unbounded context switching	15+ steps
Games	Interactive environments that require dynamic decision-making under changing state	Variable length

All types share identical schema
Only sequence length and task complexity vary
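
As a rough illustration of the length ranges in the table above, the sketch below maps a step count to the task type whose typical length it falls in. The function name is hypothetical, and the mapping is only a heuristic: the table gives typical lengths, not a classification rule, and Games are left out because their length is variable. The table lists 15 steps under both Multi-step and Workflow; treating 15 as Multi-step here is an assumption.

```python
def typical_type_for_length(step_count: int) -> str:
    """Return the task type whose typical length range contains step_count.

    Heuristic only: Games are excluded because their length is variable.
    """
    if step_count < 1:
        raise ValueError("a trajectory contains at least one step")
    if step_count == 1:
        return "Elementary"
    if step_count <= 3:
        return "Atomic"
    if step_count <= 15:
        # Assumption: the table lists 15 under both ranges; it is treated
        # as Multi-step here.
        return "Multi-step"
    return "Workflow"


if __name__ == "__main__":
    for n in (1, 3, 10, 40):
        print(n, typical_type_for_length(n))
```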

Action Space

Defines the complete set of interaction primitives captured in the dataset, representing all possible actions a human can perform in different environments.

Category	Action Type	Description
Mouse	Mouse Move / Trajectory	Continuous cursor movement across screen coordinates
Mouse	Mouse Click	Mouse press and release at a coordinate (includes click count)
Mouse	Drag & Drop	Press → drag → release sequence across coordinates
Mouse	Mouse Scroll	Scroll action with direction (up/down/left/right)
Keyboard	Key Press (Input Text)	Text input captured between input_text_start and input_text_end
Keyboard	Functional Key Press	Non-character keys (Enter, Tab, Escape, arrows, function keys)
Keyboard	Combo Key Press	Multi-key combinations (e.g., Ctrl + A, Cmd + Shift + P)
Keyboard	Modifier Key Press	Modifier-only inputs (Ctrl, Shift, Alt, Cmd combinations)

Action Space Representation

Actions are represented at two levels: raw events and semantic (grouped) actions.

Raw Event Log (Low-Level)

System-captured signals recorded at millisecond precision, preserving exact human behavior.

Action	Events
Mouse Move	Mouse moves to (x, y)
Mouse Click	Mouse press (left/right) at (x, y); Mouse release (left/right) at (x, y)
Mouse Scroll	Mouse scrolls (direction) at (x, y)
Mouse Drag	Mouse press (left/right) → drag to (x, y) → Mouse release
Key Press	Key press and Key release

Semantic Actions

Semantic abstraction over raw events.

Semantic Actions	Raw Events
Mouse Move	`mouseover_start` → mouse moves to (x, y) → `mouseover_end`
Mouse Click	Mouse press (left/right) at (x, y); Mouse release (left/right) at (x, y); `click_count` (1 = single, 2 = double, 3 = triple, at the same coordinates ±2 pixels)
Mouse Drag	Mouse press (left/right) → `drag_start` → drag to (x, y) → `drag_end` → Mouse release (left/right)
Mouse Scroll	Mouse press (left/right) → `scroll_start` → `scroll_end` → Mouse release (left/right)
Key Press (Input Text)	`input_text_start` → `input_text_end`
Functional Key Press	`key_down` → `key_up`
Combo Key Press	`combo_key_down` → `combo_key_up`
Modifier Key Press	`modifier_keys_down` → `modifier_keys_up`

Grouping rules are deterministic (fixed), with edge-case handling for missing OS signals
Both raw events and grouped actions are always preserved
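
To make the grouping idea concrete, here is a minimal sketch of how raw press/release events could be folded into Mouse Click actions with a `click_count`. It is not the dataset's actual grouping code. The `RawEvent` and `Click` classes, the `kind` labels, and the `MAX_GAP_MS` threshold are all hypothetical; only the ±2 pixel tolerance and the 1/2/3 click counts come from the table above. Drags and missing-signal edge cases are out of scope.

```python
from dataclasses import dataclass


@dataclass
class RawEvent:
    t_ms: int     # timestamp in milliseconds
    kind: str     # "press" or "release" (hypothetical labels)
    button: str   # "left" or "right"
    x: int
    y: int


@dataclass
class Click:
    t_start_ms: int
    t_end_ms: int
    button: str
    x: int
    y: int
    click_count: int


PIXEL_TOLERANCE = 2   # from the schema: same coordinates ±2 pixels
MAX_GAP_MS = 500      # assumption: the schema does not state a time window


def group_clicks(events):
    """Fold press/release pairs into clicks, merging rapid repeats."""
    clicks = []
    pending = None  # press event still waiting for its release
    for ev in events:
        if ev.kind == "press":
            pending = ev
        elif ev.kind == "release" and pending is not None and ev.button == pending.button:
            prev = clicks[-1] if clicks else None
            if (
                prev is not None
                and prev.button == ev.button
                and prev.click_count < 3
                and pending.t_ms - prev.t_end_ms <= MAX_GAP_MS
                and abs(pending.x - prev.x) <= PIXEL_TOLERANCE
                and abs(pending.y - prev.y) <= PIXEL_TOLERANCE
            ):
                # Same spot, soon after the previous click: double or triple click.
                prev.click_count += 1
                prev.t_end_ms = ev.t_ms
            else:
                clicks.append(
                    Click(pending.t_ms, ev.t_ms, ev.button, pending.x, pending.y, 1)
                )
            pending = None
    return clicks
```

Because the rule depends only on the event stream and fixed thresholds, the same raw log always yields the same grouped actions, which is the property the deterministic grouping rules rely on.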

Reasoning Trace

Reasoning is layered on top of grouped actions.

Pipeline

Human execution → Raw events → Grouped actions (steps) → Human reasoning (per step) → Human validation

Derived Signals

Level	Description
Raw Thought	Human-written reasoning per step
Reasoning	Cleaned, grouped explanation of raw thoughts
Intent	Higher-level abstraction spanning multiple reasoning steps

State-Action Pair

A state is the visual UI snapshot or frame aligned to a human interaction. The dataset is fundamentally structured as state → action → next state, with all modalities (video, raw event logs, Semantic Actions, and frames) synchronized on a shared millisecond timeline.

Extracted from 60 FPS video
Captured only at grouped-action boundaries, each with an absolute, millisecond-precision timestamp
State–action alignment is event-triggered rather than time-sampled, yielding a one-to-one, lossless mapping that enables exact trajectory replayability without temporal inference
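
The state → action → next state structure can be sketched as follows. The `Step` and `Transition` classes and their field names are hypothetical and do not reflect the actual layout of StructuredTrajectory.json. The sketch also assumes that, when a step has no start frame of its own, the previous step's end frame serves as its state.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    index: int
    action: str                        # e.g. "Mouse Click"
    t_end_ms: int                      # timestamp of the end frame
    end_frame: str                     # frame captured at action completion
    start_frame: Optional[str] = None  # present only for some action types


@dataclass
class Transition:
    state: Optional[str]
    action: str
    next_state: str


def to_transitions(steps: List[Step]) -> List[Transition]:
    """Build (state, action, next_state) triples from ordered steps."""
    transitions = []
    prev_end = None
    for step in steps:
        # Assumption: fall back to the previous end frame when the step
        # has no start frame of its own.
        state = step.start_frame if step.start_frame is not None else prev_end
        transitions.append(Transition(state, step.action, step.end_frame))
        prev_end = step.end_frame
    return transitions
```

Since each transition is tied to a grouped-action boundary rather than a sampling interval, no frame has to be matched to an action by timestamp proximity.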

Frame Types

Frames are stored in .webp format by default (configurable to .jpg, .png) and preserve native device resolution without normalization.

Platform	Interaction Type	Frame Capture
macOS / Windows	Mouse move, drag, scroll	Start + End
macOS / Windows	Click, keypress, combo, modifier	End
Android	Tap, scroll, pinch, drag	Start + End
Android	Key, system events	End

Start frame = UI state immediately before action
End frame = UI state captured immediately at action completion
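
The capture rules in the table above amount to a small lookup. The sketch below encodes them; the dictionary keys (`"mouse_move"`, `"system_event"`, and so on) and the function name are hypothetical labels chosen for illustration, not identifiers from the dataset.

```python
START_AND_END = ("start", "end")
END_ONLY = ("end",)

# Mirrors the Frame Capture table; key names are illustrative assumptions.
FRAME_POLICY = {
    "desktop": {  # macOS / Windows
        "mouse_move": START_AND_END,
        "drag": START_AND_END,
        "scroll": START_AND_END,
        "click": END_ONLY,
        "keypress": END_ONLY,
        "combo": END_ONLY,
        "modifier": END_ONLY,
    },
    "android": {
        "tap": START_AND_END,
        "scroll": START_AND_END,
        "pinch": START_AND_END,
        "drag": START_AND_END,
        "key": END_ONLY,
        "system_event": END_ONLY,
    },
}


def frames_for(os_name: str, interaction: str):
    """Return which frames ("start", "end") are captured for an interaction."""
    name = os_name.lower()
    if name in ("macos", "windows"):
        platform = "desktop"
    elif name == "android":
        platform = "android"
    else:
        raise ValueError(f"unknown OS: {os_name!r}")
    try:
        return FRAME_POLICY[platform][interaction]
    except KeyError:
        raise ValueError(f"unknown interaction for {os_name}: {interaction!r}") from None
```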

Metadata Schema

Metadata is embedded inside StructuredTrajectory.json.

Core Fields

Field	Description
task_id	Unique identifier of each task executed
instruction	Task prompt
tool_name	Tool used
category	Domain classification
OS	Windows / macOS / Android
resolution	Native screen resolution
duration	Task duration
action_count	Raw event count
grouped_action_count	Number of steps
FPS	Video capture rate (60 FPS)
frame_count	Total number of frames
trajectory_status	Successful / Failed
rejection_reason	Only for failed trajectories
video_url	Source recording
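
A consumer of the metadata might run a basic sanity check like the one sketched below. The field names come from the table above; the function name, the assumption that metadata is available as a flat dictionary, and the literal status strings are illustrative assumptions and may differ in a given delivery.

```python
REQUIRED_FIELDS = [
    "task_id", "instruction", "tool_name", "category", "OS", "resolution",
    "duration", "action_count", "grouped_action_count", "FPS",
    "frame_count", "trajectory_status", "video_url",
]

# Assumption: status is stored as one of these literal strings.
VALID_STATUSES = ("Successful", "Failed")


def validate_metadata(meta: dict) -> list:
    """Return a list of human-readable problems; empty means no issue found."""
    problems = []
    for field in REQUIRED_FIELDS:
        if field not in meta:
            problems.append(f"missing field: {field}")
    status = meta.get("trajectory_status")
    if status is not None and status not in VALID_STATUSES:
        problems.append(f"unexpected trajectory_status: {status!r}")
    # rejection_reason applies only to failed trajectories.
    if status == "Successful" and meta.get("rejection_reason"):
        problems.append("rejection_reason set on a successful trajectory")
    return problems
```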

Data Formats & File Structure

Each export is a master folder containing multiple trajectories.

Directory Structure

<Master Folder>/
└── <TID>-<Tool>-<Prompt>/
    ├── <TID>.mp4
    ├── <TID>-EventLogs.txt
    ├── <TID>-StructuredTrajectory.json
    └── Frames/
        ├── <TID>-Frame-Step-1-before.jpg
        └── <TID>-Frame-Step-1-after.jpg
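
The naming pattern in the tree above can be turned into path helpers, as in the following sketch. The function names are hypothetical, and the default `.txt` and `.jpg` extensions simply follow the example tree; actual deliveries may use other formats.

```python
from pathlib import Path


def trajectory_paths(folder: Path, tid: str) -> dict:
    """Expected file locations inside one trajectory folder."""
    return {
        "video": folder / f"{tid}.mp4",
        "event_logs": folder / f"{tid}-EventLogs.txt",
        "structured": folder / f"{tid}-StructuredTrajectory.json",
        "frames_dir": folder / "Frames",
    }


def frame_path(folder: Path, tid: str, step: int, when: str, ext: str = "jpg") -> Path:
    """Path of the before/after frame for a given step number."""
    if when not in ("before", "after"):
        raise ValueError("when must be 'before' or 'after'")
    return folder / "Frames" / f"{tid}-Frame-Step-{step}-{when}.{ext}"


if __name__ == "__main__":
    root = Path("export") / "T1-Tool-Prompt"  # placeholder folder name
    print(trajectory_paths(root, "T1")["structured"])
    print(frame_path(root, "T1", 1, "before"))
```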

Files per Trajectory

File	Description
Video (.mp4)	Full execution, 60 FPS
Event Logs	Raw events (.txt / .csv / .json)
StructuredTrajectory.json	Grouped actions + metadata + reasoning
Frames	Action-aligned screenshots

Customization

Formats, metadata, and schemas of the individual files are configurable per enterprise delivery.