Temporal Semantics: State–Action Alignment

Temporal Semantics ensures precise alignment between actions, visual states, and time in General Data demonstrations. All interactions are first captured in a raw action log that records every low-level mouse, keyboard, and touch event in exact execution order using millisecond-precision timestamps.

These raw events are then consolidated into grouped actions, each combining multiple related events into a single interaction unit that represents one user intent. For example, a click groups a mouse-down and a mouse-up event, and a drag groups a drag-start and a drag-end event.
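The consolidation step can be sketched as follows. This is a minimal illustration, not the actual pipeline: the `RawEvent` and `GroupedAction` types, the `PAIRS` table, and the event names are all hypothetical, and the sketch assumes raw events arrive in execution order and that actions do not nest.

```python
from dataclasses import dataclass, field


@dataclass
class RawEvent:
    kind: str          # hypothetical names, e.g. "mouse_down", "drag_end"
    timestamp_ms: int  # millisecond timestamp from the raw action log


@dataclass
class GroupedAction:
    action: str                                     # e.g. "click" or "drag"
    start_ms: int                                   # timestamp of the opening event
    end_ms: int                                     # timestamp of the closing event
    sub_events: list = field(default_factory=list)  # every raw event in the group


# Assumed pairing table: opening event -> (closing event, grouped action name)
PAIRS = {
    "mouse_down": ("mouse_up", "click"),
    "drag_start": ("drag_end", "drag"),
}


def group_events(events):
    """Consolidate ordered raw events into grouped actions."""
    grouped = []
    pending = []  # sub-events of the action currently being assembled
    for event in events:
        if not pending:
            if event.kind in PAIRS:
                pending = [event]
            continue  # events outside any known pair are ignored in this sketch
        pending.append(event)
        closing_kind, action_name = PAIRS[pending[0].kind]
        if event.kind == closing_kind:
            grouped.append(
                GroupedAction(action_name, pending[0].timestamp_ms,
                              event.timestamp_ms, pending)
            )
            pending = []
    return grouped
```

Keeping the full `sub_events` list on each grouped action preserves the original raw timestamps, so the grouped view never replaces the raw log.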

Pre-action and post-action frames are generated only for grouped actions, not for every raw event. This event-driven design ensures consistent action–frame–video alignment while preserving true timing and ordering for inspection and model training.

Capture Strategy

Frames are captured on every grouped interaction event, not at fixed intervals
Frame capture is event-driven, triggered by mouse, keyboard, drag, scroll, or touch actions
Video is recorded at 60 FPS, but supervision frames are selected based on action timing
No fixed-FPS frame sampling is used for action–state alignment
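One way to picture action-timed frame selection against a 60 FPS recording is sketched below. The mapping rule is an assumption rather than the documented behavior: the pre-action frame is taken as the last video frame at or before the action start, and the post-action frame as the first video frame at or after the action end. The function name and the `video_start_ms` parameter are hypothetical.

```python
FPS = 60  # recording rate stated in the capture strategy


def supervision_frames(start_ms, end_ms, video_start_ms=0, fps=FPS):
    """Return (pre_frame_index, post_frame_index) for one grouped action.

    Assumed rule: pre = last frame at or before start_ms,
    post = first frame at or after end_ms.
    """
    if start_ms < video_start_ms or end_ms < start_ms:
        raise ValueError("action timestamps must lie inside the recording")
    # Integer arithmetic avoids floating-point rounding at frame boundaries.
    pre = (start_ms - video_start_ms) * fps // 1000         # floor division
    post = -(-(end_ms - video_start_ms) * fps // 1000)      # ceiling division
    return pre, post
```

For a click starting at 1000 ms and ending at 1080 ms, this picks frame 60 as the pre-action frame and frame 65 as the post-action frame. The frames in between are never sampled, which is the point of driving selection from action timing instead of a fixed interval.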

State Representation

Pre-action frame is captured immediately before an interaction begins
Post-action frame is captured immediately after the interaction completes
For grouped actions, frames are tied to explicit start and end events
Each frame is linked to the exact action timestamp in milliseconds
Visual state, action event, and timestamp share the same Task ID (TID)

Action Timing

All actions are logged with millisecond-level timestamps
Time gaps between actions preserve natural human latency
No smoothing, interpolation, or time normalization is applied
Consecutive actions are never synthetically manipulated or simulated

Known Edge Cases

Some UI behaviors do not update instantly or synchronously with user actions:

UI Animations: Visual changes, such as button animations or transitions, may continue after an action is performed.
Loading Delays: Processing or loading may delay the UI update that follows a user action.
Asynchronous UI Updates: The UI may change in the background without any direct user action.
Network-dependent State Changes: Delays vary with connectivity and server response time.

Enforcement Rules

Every action must have:
- A start timestamp
- An end timestamp
- A pre-action frame of the grouped action
- A post-action frame of the grouped action
Timestamps must be strictly monotonically increasing within a task
Grouped actions must expose all internal sub-events
Direction, duration, coordinates, and key identity must be explicitly logged
Missing frames, logs, or timestamps result in automatic rejection
Frame–action misalignment beyond defined thresholds invalidates the task
No inferred, derived, or interpolated timestamps are permitted
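A few of these rules lend themselves to an automated check. The sketch below is illustrative only: the `ActionRecord` fields and the `validate_task` function are hypothetical names. It also assumes one reading of the monotonicity rule, namely that each action must start strictly after the previous action ended and must not end before it starts. Threshold-based misalignment checks are left out because the thresholds are not specified here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ActionRecord:
    start_ms: Optional[int]    # start timestamp in milliseconds
    end_ms: Optional[int]      # end timestamp in milliseconds
    pre_frame: Optional[str]   # hypothetical: ID or path of the pre-action frame
    post_frame: Optional[str]  # hypothetical: ID or path of the post-action frame


REQUIRED = ("start_ms", "end_ms", "pre_frame", "post_frame")


def validate_task(actions):
    """Return a list of rule violations; an empty list means the task passes."""
    errors = []
    previous_end = None
    for i, action in enumerate(actions):
        missing = [name for name in REQUIRED if getattr(action, name) is None]
        if missing:
            errors.append(f"action {i}: missing {', '.join(missing)}")
            continue  # timing checks need both timestamps
        if action.end_ms < action.start_ms:
            errors.append(f"action {i}: ends before it starts")
        if previous_end is not None and action.start_ms <= previous_end:
            errors.append(f"action {i}: start is not after the previous end")
        previous_end = action.end_ms
    return errors
```

Under the rejection rule above, any non-empty result would be enough to reject the task; returning every violation rather than stopping at the first makes the rejection easier to inspect.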