Temporal Semantics
Explains how time is represented, aligned, and validated across all modalities in the dataset, ensuring that every trajectory is replayable, temporally consistent, and aligned with real human execution.
Time Representation
All events use absolute timestamps at millisecond precision, sourced from the local OS clock of the recording device.
- No relative time normalization
- No resampling or interpolation of missing states
This preserves real interaction timing, including pacing and execution gaps.
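As a minimal sketch of what absolute timestamps allow, the snippet below reads pacing and execution gaps straight from the timestamps, with no rebasing or resampling. The event tuples are hypothetical, and treating the values as Unix-epoch milliseconds is an assumption of this example. The actual log schema is not specified here.

```python
from datetime import datetime, timezone

# Hypothetical events: (absolute timestamp in milliseconds, event name).
# Interpreting the values as Unix-epoch milliseconds is an assumption.
events = [
    (1_764_000_000_000, "mouse_move"),
    (1_764_000_000_350, "mouse_click"),
    (1_764_000_002_900, "key_press"),
]

def gaps_ms(evts):
    """Return the elapsed milliseconds between consecutive events."""
    return [b[0] - a[0] for a, b in zip(evts, evts[1:])]

def to_utc(ts_ms):
    """Convert an absolute millisecond timestamp to an aware UTC datetime."""
    return datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)

print(gaps_ms(events))  # [350, 2550]
```

Because nothing is normalized, the 2,550 ms pause before the key press survives exactly as it was recorded.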
State-Action Pair
State-action pairs define how grouped interactions are aligned with observable UI states.
State is not defined independently; it is materialized only at the boundary of a grouped action. A semantic action is a high-level interaction derived from low-level OS events or states (e.g., mouse move, key press, scroll) and represents a meaningful unit of execution.
Frame generation is tied directly to these boundaries:
- Continuous actions (e.g., mouse move, drag, scroll) → Frames generated at both Start and End states
- Discrete actions (e.g., click, key press) → Frame generated at the End state
This forms the State-Action Pair:
Stateₜ, Actionₜ → Stateₜ₊₁
Where:
- Stateₜ = frame at action start (if applicable)
- Actionₜ = semantic action event
- Stateₜ₊₁ = frame at action completion
- No intermediate frames are used
- No time-based sampling is applied
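One way to picture the State-Action Pair as a record is sketched below. The class names, field names, and frame file naming are hypothetical and are not the dataset's actual schema. The `continuous` flag stands in for whichever rule decides that an action has a start frame.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Frame:
    timestamp_ms: int  # absolute timestamp of the captured state
    path: str          # hypothetical location of the frame image

@dataclass(frozen=True)
class StateActionPair:
    action_type: str
    state_t: Optional[Frame]  # frame at action start, None if not applicable
    state_t1: Frame           # frame at action completion

def build_pair(action_type, start_ms, end_ms, continuous):
    """Materialize states only at the boundaries of one grouped action."""
    start = Frame(start_ms, f"frame_{start_ms}.png") if continuous else None
    end = Frame(end_ms, f"frame_{end_ms}.png")
    return StateActionPair(action_type, start, end)

drag = build_pair("mouse_drag", 1_000, 1_400, continuous=True)
click = build_pair("mouse_click", 1_900, 1_900, continuous=False)
print(drag.state_t is not None, click.state_t is None)  # True True
```

Note that no frame exists between `state_t` and `state_t1`, which mirrors the absence of intermediate frames and time-based sampling.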
Frame Sampling Semantics
Frame generation follows the State–Action pairing defined in the previous section. Each grouped action produces frames at its boundary states, based on the nature of the action.
- Actions that evolve over time generate frames at both boundaries
- Actions that produce a single outcome generate a frame only at completion
The tables below specify this mapping for each action type across platforms.
OS: Windows / macOS
| Action Type | Frame Generation |
|---|---|
| Mouse Move / Trajectory | Start + End |
| Mouse Drag | Start + End |
| Mouse Scroll | Start + End |
| Mouse Click | End |
| Key Press (Input Text) | End |
| Functional Key Press | End |
| Combo Key Press | End |
| Modifier Key Press | End |
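The table above can be expressed as a simple lookup. This is only an illustrative sketch: the dictionary name and helper function are hypothetical, while the action type labels and boundaries are copied from the table.

```python
# Frame boundaries per action type, transcribed from the Windows / macOS table.
FRAME_RULES = {
    "Mouse Move / Trajectory": ("start", "end"),
    "Mouse Drag": ("start", "end"),
    "Mouse Scroll": ("start", "end"),
    "Mouse Click": ("end",),
    "Key Press (Input Text)": ("end",),
    "Functional Key Press": ("end",),
    "Combo Key Press": ("end",),
    "Modifier Key Press": ("end",),
}

def frame_timestamps(action_type, start_ms, end_ms):
    """Return the timestamps at which frames are generated for one action."""
    boundaries = {"start": start_ms, "end": end_ms}
    return [boundaries[name] for name in FRAME_RULES[action_type]]

print(frame_timestamps("Mouse Scroll", 2_000, 2_450))  # [2000, 2450]
print(frame_timestamps("Mouse Click", 3_100, 3_100))   # [3100]
```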
Properties
- Frames are taken from video recorded at 60 FPS (~16.67 ms per video frame). An event that falls between two video frames is aligned to the nearest video frame, while the event logs preserve millisecond-level ordering.
- Every frame corresponds to a grouped action boundary; no intermediate frames are generated afterwards.
- Native screen resolution is preserved (no normalization).
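The nearest-frame alignment can be sketched with integer arithmetic, as below. The function names and the `video_start_ms` parameter are hypothetical, and rounding exact ties upward is an assumption, since the tie-breaking rule is not stated.

```python
FPS = 60  # video frame rate stated above

def nearest_frame_index(event_ms, video_start_ms, fps=FPS):
    """Index of the video frame closest to an event timestamp.

    Computes floor(offset * fps / 1000 + 1/2) with integers only, so exact
    ties round up (an assumption of this sketch).
    """
    offset = event_ms - video_start_ms
    if offset < 0:
        raise ValueError("event precedes the start of the video")
    return (2 * offset * fps + 1000) // 2000

def frame_time_ms(index, video_start_ms, fps=FPS):
    """Timestamp (possibly fractional) at which a video frame was captured."""
    return video_start_ms + index * 1000 / fps

# Two events 4 ms apart land on the same video frame; their relative order
# is still recoverable from the millisecond timestamps in the event log.
print(nearest_frame_index(1_020, 1_000))  # 1
print(nearest_frame_index(1_024, 1_000))  # 1
```

With a frame interval of about 16.67 ms, the aligned frame is never more than roughly 8.33 ms away from the event timestamp.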
Latency & Delay Modeling
Raw event logs capture the full temporal structure of execution, including idle periods and UI response delays as they naturally occur on the system.
However, since frames are generated only at state–action boundaries, periods without interaction do not produce states and are excluded from the structured representation. As a result, idle time and UI delays are preserved in raw signals but are not explicitly materialized in the learning layer.
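Although idle periods produce no states, they can still be recovered from the timestamps of neighbouring actions. The sketch below assumes each grouped action is available as a hypothetical `(start_ms, end_ms)` tuple, sorted by start time.

```python
def idle_gaps_ms(actions):
    """Milliseconds of inactivity between consecutive grouped actions.

    `actions` is a list of (start_ms, end_ms) tuples sorted by start time.
    Overlapping actions yield a gap of 0 rather than a negative value.
    """
    return [max(0, nxt[0] - cur[1]) for cur, nxt in zip(actions, actions[1:])]

# Hypothetical trajectory: a drag, a click, then a scroll after a long pause.
actions = [(1_000, 1_400), (1_900, 1_900), (5_200, 5_600)]
print(idle_gaps_ms(actions))  # [500, 3300]
```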
Recording occurs on the local system, which avoids network-induced latency in the capture itself. Trajectories in which system or network delays cause UI lag significant enough to distort execution timing are rejected during QA.
Multi-Modal Synchronization
All modalities are aligned on a shared millisecond timeline derived from the same OS-level clock, so there is no temporal drift between them:
- Video (60 FPS recording)
- Frames (generated at grouped action boundaries)
- Raw Event Logs (OS-level signals)
- Semantic Actions (grouped actions)
Frame timestamps are directly mapped to video timestamps, ensuring frame ↔ video consistency.
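Because every modality shares one clock, per-modality streams can be interleaved into a single timeline by timestamp alone. The following is a minimal sketch with hypothetical record tuples of the form `(timestamp_ms, modality, payload)`. It uses the standard library's `heapq.merge`, which expects each input stream to be sorted already.

```python
import heapq

# Hypothetical, already time-sorted streams for one drag action.
raw_events = [(1_000, "raw", "mouse_down"), (1_400, "raw", "mouse_up")]
semantic_actions = [(1_000, "semantic", "drag start"), (1_400, "semantic", "drag end")]
frames = [(1_000, "frame", "state_t"), (1_400, "frame", "state_t+1")]

def merge_timeline(*streams):
    """Interleave sorted streams into one list ordered by timestamp."""
    return list(heapq.merge(*streams, key=lambda record: record[0]))

timeline = merge_timeline(raw_events, semantic_actions, frames)
for timestamp_ms, modality, payload in timeline:
    print(timestamp_ms, modality, payload)
```

No clock-offset correction step appears in the sketch because the modalities are derived from the same clock.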
Note: Legacy data (pre-Nov 2025) is recorded at centisecond precision. All current data is recorded at millisecond precision.