Temporal Semantics
Temporal Semantics ensures precise alignment between actions, visual states, and time in General Data demonstrations. All interactions are first captured in a raw action log that records every low-level mouse, keyboard, and touch event in exact execution order using millisecond-precision timestamps.
These raw events are then consolidated into grouped actions, where multiple related events are combined into a single interaction unit representing one user intent. For example, a click is represented by mouse down and mouse up, and a drag is represented by drag start and drag end.
Pre-action and post-action frames are generated only for grouped actions, not for every raw event. This event-driven design ensures consistent action–frame–video alignment while preserving true timing and ordering for inspection and model training.
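The consolidation step described above can be sketched as follows. This is a minimal illustration, not the pipeline's actual implementation: the names `RawEvent`, `GroupedAction`, `PAIRS`, and `group_events`, and the event kind strings, are assumptions introduced for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RawEvent:
    kind: str   # e.g. "mouse_down", "mouse_up", "drag_start", "drag_end"
    t_ms: int   # millisecond timestamp from the raw action log


@dataclass
class GroupedAction:
    name: str
    sub_events: List[RawEvent]  # internal sub-events stay visible

    @property
    def start_ms(self) -> int:
        return self.sub_events[0].t_ms

    @property
    def end_ms(self) -> int:
        return self.sub_events[-1].t_ms


# Hypothetical table: opening event -> (closing event, grouped action name).
PAIRS = {
    "mouse_down": ("mouse_up", "click"),
    "drag_start": ("drag_end", "drag"),
}


def group_events(events: List[RawEvent]) -> List[GroupedAction]:
    """Fold an ordered raw log into grouped actions without altering timestamps."""
    grouped: List[GroupedAction] = []
    pending = None  # (closing kind, action name, collected sub-events)
    for ev in events:
        if pending is not None:
            closing, name, subs = pending
            subs.append(ev)
            if ev.kind == closing:
                grouped.append(GroupedAction(name, subs))
                pending = None
        elif ev.kind in PAIRS:
            closing, name = PAIRS[ev.kind]
            pending = (closing, name, [ev])
        else:
            # Events with no pairing rule form a single-event action.
            grouped.append(GroupedAction(ev.kind, [ev]))
    if pending is not None:
        raise ValueError(f"unterminated {pending[1]} in raw log")
    return grouped


log = [
    RawEvent("mouse_down", 1000), RawEvent("mouse_up", 1080),
    RawEvent("key_press", 1500),
    RawEvent("drag_start", 2000), RawEvent("mouse_move", 2050),
    RawEvent("drag_end", 2400),
]
actions = group_events(log)
```

Here the six raw events collapse into three grouped actions (a click, a key press, and a drag), so pre-action and post-action frames would be needed for three units rather than six.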
Capture Strategy
- Frames are captured on every grouped interaction event, not at fixed intervals
- Frame capture is event-driven, triggered by mouse, keyboard, drag, scroll, or touch actions
- Video is recorded at 60 FPS, but supervision frames are selected based on action timing
- No fixed-FPS frame sampling is used for action–state alignment
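One way to picture action-timed frame selection is to map each action boundary onto the 60 FPS video timeline. The sketch below assumes that supervision frames are looked up in the recorded video and that the video's start time is known in the same millisecond clock as the action log; the function name `frame_indices` and its parameters are hypothetical.

```python
def frame_indices(start_ms: int, end_ms: int, video_start_ms: int = 0, fps: int = 60):
    """Return (pre_idx, post_idx) for one grouped action.

    pre_idx  is the last video frame at or before the action start.
    post_idx is the first video frame at or after the action end.
    Integer arithmetic avoids floating-point rounding at frame boundaries.
    """
    pre_offset = start_ms - video_start_ms
    post_offset = end_ms - video_start_ms
    pre_idx = (pre_offset * fps) // 1000          # floor division
    post_idx = -((-post_offset * fps) // 1000)    # ceiling division
    return pre_idx, post_idx


click_frames = frame_indices(1000, 1080)   # (60, 65)
drag_frames = frame_indices(2000, 2400)    # (120, 144)
```

The selected indices depend only on when the action happened, so a burst of quick actions yields closely spaced frames and an idle period yields none.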
State Representation
- Pre-action frame is captured immediately before an interaction begins
- Post-action frame is captured immediately after the interaction completes
- For grouped actions, frames are tied to explicit start and end events
- Each frame is linked to the exact action timestamp in milliseconds
- Visual state, action event, and timestamp share the same Task ID (TID)
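The linkage between frames, actions, and timestamps could be represented with records like the ones below. The field names (`tid`, `role`, `t_ms`) and the helper `frames_match_action` are illustrative assumptions, not a documented schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    tid: str        # Task ID shared by the action and its frames
    kind: str       # e.g. "click", "drag"
    start_ms: int
    end_ms: int


@dataclass(frozen=True)
class Frame:
    tid: str
    role: str       # "pre" or "post"
    t_ms: int       # action boundary this frame is tied to


def frames_match_action(action: Action, pre: Frame, post: Frame) -> bool:
    """True when both frames share the action's TID and sit on its boundaries."""
    return (
        pre.tid == action.tid
        and post.tid == action.tid
        and pre.role == "pre"
        and post.role == "post"
        and pre.t_ms == action.start_ms
        and post.t_ms == action.end_ms
    )


drag = Action(tid="T-001", kind="drag", start_ms=2000, end_ms=2400)
ok = frames_match_action(
    drag,
    Frame(tid="T-001", role="pre", t_ms=2000),
    Frame(tid="T-001", role="post", t_ms=2400),
)
```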
Action Timing
- All actions are logged with millisecond-level timestamps
- Time gaps between actions preserve natural human latency
- No smoothing, interpolation, or time normalization is applied
- Consecutive actions are never synthetically manipulated or simulated
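Because timestamps are kept as logged, the latency between actions can be read directly from the log. A small sketch, assuming each action is given as a hypothetical `(start_ms, end_ms)` pair in logged order:

```python
def inter_action_gaps(actions):
    """Return the gap in ms between each action's end and the next action's start."""
    return [nxt[0] - cur[1] for cur, nxt in zip(actions, actions[1:])]


gaps = inter_action_gaps([(1000, 1080), (1500, 1500), (2000, 2400)])
# gaps == [420, 500]: the uneven human pauses are preserved, not evened out
```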
Known Edge Cases
Some UI behaviors do not update instantly or synchronously with user actions:
- UI Animations: Visual changes may continue after an action is performed, such as button animations or transitions.
- Loading Delays: Delays between user action and UI updates due to processing or loading.
- Asynchronous UI Updates: Background updates without direct user actions.
- Network-dependent State Changes: Variable delays based on connectivity and server response time.
Enforcement Rules
- Every action must have:
  - A start timestamp
  - An end timestamp
  - A pre-action frame of the grouped action
  - A post-action frame of the grouped action
- Timestamps must be strictly monotonic within a task
- Grouped actions must expose all internal sub-events
- Direction, duration, coordinates, and key identity must be explicitly logged
- Missing frames, logs, or timestamps result in automatic rejection
- Frame–action misalignment beyond defined thresholds invalidates the task
- No inferred, derived, or interpolated timestamps are permitted
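Several of these rules lend themselves to an automated check. The sketch below is one possible validator, under stated assumptions: the dictionary keys are hypothetical, the 20 ms threshold is a placeholder because the actual threshold is not given here, and "strictly monotonic" is interpreted as strictly increasing action start times with no end before its start.

```python
REQUIRED = ("start_ms", "end_ms", "pre_frame", "post_frame", "sub_events")
MAX_MISALIGN_MS = 20  # placeholder value; the real threshold is defined elsewhere


def validate_task(actions, max_misalign_ms=MAX_MISALIGN_MS):
    """Return a list of rule violations; an empty list means the task passes."""
    violations = []
    prev_start = None
    for i, a in enumerate(actions):
        # Missing frames, timestamps, or sub-events are grounds for rejection.
        missing = [k for k in REQUIRED if a.get(k) in (None, [], "")]
        if missing:
            violations.append(f"action {i}: missing {', '.join(missing)}")
            continue
        if a["end_ms"] < a["start_ms"]:
            violations.append(f"action {i}: end precedes start")
        if prev_start is not None and a["start_ms"] <= prev_start:
            violations.append(f"action {i}: timestamps not strictly monotonic")
        prev_start = a["start_ms"]
        # Frame-action misalignment beyond the threshold invalidates the task.
        if abs(a["pre_frame"]["t_ms"] - a["start_ms"]) > max_misalign_ms:
            violations.append(f"action {i}: pre-action frame misaligned")
        if abs(a["post_frame"]["t_ms"] - a["end_ms"]) > max_misalign_ms:
            violations.append(f"action {i}: post-action frame misaligned")
    return violations


click = {
    "start_ms": 1000, "end_ms": 1080,
    "pre_frame": {"t_ms": 1000}, "post_frame": {"t_ms": 1083},
    "sub_events": ["mouse_down", "mouse_up"],
}
report = validate_task([click])  # no violations for this action
```

Returning every violation, rather than stopping at the first, makes it easier to inspect why a task was rejected.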