Data Provenance

Data provenance ensures every demonstration is authentic, technically accurate, auditable, and easy to verify. Each task follows a controlled lifecycle, from prompt design and human execution to automated validation and human review, so researchers can reliably trace and reproduce the full interaction history used for model training.

Prompt Creation

A prompt is the written task instruction given to the recorder. It specifies what goal to achieve, what constraints to follow, and what outcome counts as completion, so the task can be executed and evaluated consistently.

How Prompts Are Created

  • Prompts are written by trained prompt authors using standardized guidelines
  • Each prompt is designed to be tool-realistic, with measurable completion criteria
  • Prompts are aligned to supported environments (Windows, macOS, Android; iOS in progress) and real tool behavior

Prompt Design Approach

  • Feature-based Prompts: Tasks are created around specific product features (e.g., filters, exports, formatting, sharing, settings) to ensure coverage of key UI functionality.

  • Workflow-based Prompts: Tasks represent real end-to-end user goals that require multiple steps, such as preparing a report, publishing content, or completing a transaction.

  • Complexity-tier Coverage: Prompts are created across atomic, elementary, multi-step, and workflow-level difficulty to ensure coverage from micro-actions to long-horizon sequences.

Prompt Quality Checks

  • Prompts are checked for ambiguity, missing prerequisites, and unclear success definitions
  • Prompts are verified to avoid sensitive content and reduce the risk of capturing PII during execution
  • Each approved prompt is mapped to a unique Task ID (TID) to ensure traceability through the full lifecycle
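
The mapping from an approved prompt to its Task ID can be pictured as a small record, as in the minimal Python sketch below. All field names and values here are illustrative assumptions, not the platform's actual schema.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class ApprovedPrompt:
        tid: str                # unique Task ID carried through the full lifecycle
        instruction: str        # the written task instruction given to the recorder
        environment: str        # e.g. "Windows", "macOS", "Android"
        complexity_tier: str    # "atomic", "elementary", "multi-step", or "workflow"
        success_criteria: str   # measurable definition of completion

    prompt = ApprovedPrompt(
        tid="TID-000123",
        instruction="Export the quarterly report as a PDF and save it to the shared folder.",
        environment="Windows",
        complexity_tier="multi-step",
        success_criteria="A PDF of the report exists in the shared folder.",
    )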

Demonstration Creation and Execution

Demonstrations are produced through live human execution in real software, under controlled guidelines that ensure consistency across recorders and full traceability per task.

The dataset includes over 1,500 hours of real human computer use and approximately 6 million recorded actions, reflecting genuine user behavior, natural pacing, and valid task execution paths.

Who Performs the Task

  • Recorded by trained human operators with varied professional backgrounds, including engineers, software developers, CAD professionals, and other domain specialists
  • Task assignment is based on tool familiarity and task complexity, so each task is matched to an operator with the appropriate skill level

How Execution Is Controlled

  • Every task is tied to a unique Task ID (TID) and mapped to a single prompt
  • Prompts span atomic, elementary, multi-step, and workflow-level tasks to cover multiple complexity tiers
  • Operators follow standardized task guidelines to ensure consistent execution across tools and environments
  • Tasks are performed using isolated tool accounts, preventing cross-task contamination and ensuring consistent system states

Where It Happens

  • Executed in real operating system environments: Windows, macOS, and Android
    (iOS dataset curation in progress)
  • Coverage spans more than 120 tools, including Office Suite, Design and Creativity, CRM, Cloud Administration, Communication, and others
  • Recorded as a 60 FPS screen video and processed into frame images for frame-level supervision
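
As one illustration of how a recording could be processed into frame images, the Python sketch below splits a screen-capture video into per-frame PNGs with OpenCV. The file paths and naming pattern are hypothetical, and this is not a description of the platform's actual tooling.

    import cv2  # OpenCV: one possible way to split a recording into frame images

    # Hypothetical paths and naming; for illustration only.
    video_path = "TID-000123.mp4"
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)   # expected to be ~60 for these recordings

    frame_index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        cv2.imwrite(f"TID-000123_frame_{frame_index:06d}.png", frame)
        frame_index += 1
    cap.release()
    print(f"Extracted {frame_index} frames at {fps:.1f} FPS")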

What Is Explicitly Disallowed

  • Scripted automation
  • Macros
  • Simulated UI replays
  • Post-hoc fabricated logs

All demonstrations are generated through authentic human execution at scale, with measurable coverage and depth.

Human Review

Human QA operates as a multi-layer audit system designed to ensure accurate results.

Task Completion Correctness

  • One hundred percent of tasks undergo domain expert review
  • Reviewers confirm full completion against prompt requirements
  • Ten percent of accepted tasks undergo an inspector audit as a secondary quality check
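
A minimal sketch of how a ten percent inspector sample might be drawn from accepted tasks; the function name and rate handling are illustrative assumptions, not the platform's actual sampling procedure.

    import random

    def sample_for_inspector_audit(accepted_tids, rate=0.10, seed=None):
        # Draw roughly `rate` of the accepted Task IDs for a secondary audit.
        rng = random.Random(seed)
        k = max(1, round(len(accepted_tids) * rate)) if accepted_tids else 0
        return rng.sample(accepted_tids, k)

    audit_batch = sample_for_inspector_audit(["TID-000123", "TID-000124", "TID-000125"], seed=7)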

Action Intent Validity

  • Reviewers verify that actions align with the intended meaning and objective of the task, not just the final result
  • Confirms that the correct UI elements, controls, and workflows were intentionally selected
  • Flags cases where the correct outcome is achieved through accidental, inefficient, or semantically incorrect actions
  • Ensures training data reflects purposeful decision-making rather than coincidental success

UI Interpretation Errors

  • Evaluates whether the operator correctly understood and interpreted UI elements as presented on screen
  • Flags interactions with incorrect menus, dialogs, buttons, or visual controls
  • Identifies visual grounding mistakes even when task completion appears successful
  • Prevents models from learning incorrect UI associations or misaligned visual reasoning

Edge Case Handling

Corrective behaviors such as retries, undo operations, and backtracking are preserved and explicitly reviewed.

These behaviors are present across tens of thousands of long-horizon tasks, enabling resilience-focused learning.
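
To make this concrete, the sketch below flags two simple corrective patterns (undo keystrokes and repeated clicks on the same target) in a parsed action log. The event fields and values are assumptions for illustration, not the dataset's actual log schema.

    def find_corrective_actions(actions):
        # Flag undo keystrokes and immediate retries of the same click target.
        flagged = []
        for i, action in enumerate(actions):
            if action.get("type") == "key" and action.get("keys") in ("ctrl+z", "cmd+z"):
                flagged.append((i, "undo"))
            elif i > 0 and action.get("type") == "click" and action.get("target") == actions[i - 1].get("target"):
                flagged.append((i, "retry"))
        return flagged

    log = [
        {"type": "click", "target": "Export"},
        {"type": "click", "target": "Export"},   # retried click
        {"type": "key", "keys": "ctrl+z"},        # undo
    ]
    print(find_corrective_actions(log))  # [(1, 'retry'), (2, 'undo')]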

Automatic Validation

Automated system checks operate as a single validation layer for every recorded task, enforcing statistical and temporal consistency.

Timestamp Monotonicity

A timestamp represents the exact time, recorded in milliseconds, at which an action occurs. All recorded steps are verified to ensure timestamps move forward in time, confirming that actions were performed in a real, continuous sequence and not copied or recorded out of order.
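
A minimal sketch of such a check, assuming each parsed action carries a millisecond timestamp field (the field name is illustrative):

    def timestamps_are_monotonic(actions):
        # True when every timestamp is >= the one before it, i.e. time only moves forward.
        ts = [a["timestamp_ms"] for a in actions]
        return all(earlier <= later for earlier, later in zip(ts, ts[1:]))

    actions = [{"timestamp_ms": 1000}, {"timestamp_ms": 1016}, {"timestamp_ms": 1033}]
    assert timestamps_are_monotonic(actions)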

Action-frame Alignment

Each task includes:

  • Frame-level annotations
  • FPS tracking
  • Total frame counts

Alignment is validated against action logs to ensure pixel-to-event consistency.
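
One way to picture this validation: map each logged action timestamp to the frame it should correspond to and confirm that frame exists within the recorded range. The sketch below assumes a fixed capture rate and illustrative field names.

    def actions_align_with_frames(actions, fps, total_frames, start_ms=0):
        # Map each action timestamp onto a frame index and confirm that frame was captured.
        for action in actions:
            frame_index = int((action["timestamp_ms"] - start_ms) * fps / 1000)
            if frame_index < 0 or frame_index >= total_frames:
                return False
        return True

    actions = [{"timestamp_ms": 500}, {"timestamp_ms": 4980}]
    assert actions_align_with_frames(actions, fps=60, total_frames=300)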

Missing File Detection

Each task is programmatically verified for the presence of:

  • One video file (.mp4)
  • Three action log formats (.txt, .csv, .json)
  • One frame annotation file (.csv)

Tasks missing any required file are automatically rejected.
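
A minimal sketch of the presence check, assuming all of a task's files sit in one directory (the layout and naming are assumptions): one .mp4, at least one .txt and one .json action log, and two .csv files (the .csv action log plus the frame annotation file).

    from collections import Counter
    from pathlib import Path

    def task_has_required_files(task_dir):
        # Count files by extension and compare against the required inventory.
        counts = Counter(p.suffix for p in Path(task_dir).iterdir() if p.is_file())
        return (
            counts[".mp4"] == 1       # one screen recording
            and counts[".txt"] >= 1   # action log (.txt)
            and counts[".json"] >= 1  # action log (.json)
            and counts[".csv"] >= 2   # action log (.csv) + frame annotation (.csv)
        )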

Corrupted File Detection

Each task undergoes automated checks to ensure data integrity and prevent inclusion of incomplete or unreliable demonstrations.

Validation metrics include:

  • FPS Consistency
    Detects dropped frames, unstable capture rates, or encoding issues that can break frame-level supervision

  • Lag Percentage
    Measures idle time between actions to flag recordings with low activity or abnormal pauses

  • Time Deviation
    Captures delays in recording initiation that can misalign early actions with the visual timeline

Tasks that fail integrity thresholds are rejected or re-recorded before inclusion in the dataset.
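
A rough sketch of how these three metrics could be computed from frame and action timestamps; the gap threshold and field semantics are illustrative assumptions, not the platform's actual rejection criteria.

    def integrity_metrics(frame_times_ms, action_times_ms, scheduled_start_ms):
        # FPS consistency: effective capture rate derived from the frame timestamps.
        duration_s = (frame_times_ms[-1] - frame_times_ms[0]) / 1000
        effective_fps = (len(frame_times_ms) - 1) / duration_s if duration_s > 0 else 0.0

        # Lag percentage: share of the recording spent in long idle gaps between actions
        # (here, gaps over 5 seconds; the real threshold is an assumption).
        gaps = [b - a for a, b in zip(action_times_ms, action_times_ms[1:])]
        active_span = action_times_ms[-1] - action_times_ms[0]
        lag_pct = 100 * sum(g for g in gaps if g > 5000) / active_span if active_span > 0 else 0.0

        # Time deviation: delay between the scheduled start and the first captured frame.
        time_deviation_ms = frame_times_ms[0] - scheduled_start_ms

        return {"fps": effective_fps, "lag_pct": lag_pct, "time_deviation_ms": time_deviation_ms}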

QA Outcomes

Each task results in one of three tracked outcomes:

  • Accept
    The demonstration meets all automated validation and human review criteria

  • Reject
    The demonstration fails validation or review checks

  • Re-run
    The demonstration is rejected with feedback and re-recorded before re-evaluation
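
Represented as a simple status value, the three outcomes might look like the sketch below; this is an illustration only, not the platform's internal labels.

    from enum import Enum

    class QAOutcome(Enum):
        ACCEPT = "accept"   # passes automated validation and human review
        REJECT = "reject"   # fails validation or review checks
        RERUN = "re-run"    # rejected with feedback, re-recorded, then re-evaluated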

Data Lineage

The platform enforces deterministic, end-to-end lineage so every demonstration can be reconstructed, verified, and audited from the source prompt to the final task.

Canonical Lineage Chain

Prompt → Action Logs → Frame Logs and Frame Images → Video → Final Task

What Is Linked Per Task

For each demonstration, all components are connected through a Task ID (TID), which acts as the primary identifier across the entire lifecycle.

  • Prompt and context setup define the task objective and prerequisites
  • Action logs record all interaction events with precise timestamps and execution order
  • Frame logs and annotations link actions to specific visual states
  • Frame images and video provide a continuous visual record
  • Metadata captures execution environment, tool identifiers, task duration, action counts, and quality metrics
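
A minimal sketch of how these components might be keyed by the TID in a single record; the field names and types are assumptions for illustration, not the platform's actual schema.

    from dataclasses import dataclass, field

    @dataclass
    class TaskLineage:
        tid: str                # primary identifier across the lifecycle
        prompt: str             # task objective and prerequisites
        action_logs: list       # paths to the .txt / .csv / .json action logs
        frame_annotations: str  # path to the frame annotation .csv
        frame_images_dir: str   # directory of extracted frame images
        video_path: str         # path to the .mp4 screen recording
        metadata: dict = field(default_factory=dict)  # environment, tool IDs, duration, action counts, quality metrics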

Lineage Guarantees

  • One-to-one Traceability
    No file exists without a valid Task ID and upstream references

  • Temporal Integrity
    Action and frame timestamps are monotonic and consistent

  • Alignment Integrity
    Logged actions are verifiably aligned with frame and video records

  • Completeness Enforcement
    Missing or corrupted components are automatically detected and rejected

Privacy

PII Excluded by Default

Personally Identifiable Information (PII) is excluded by default.

Prompts and recording setups are designed to avoid capturing sensitive or user-specific data, including names, contact details, credentials, personal messages, and identification numbers.

In rare cases where limited personal information may be inherently present, it is captured only with explicit consent and curated or masked to preserve privacy while maintaining technical usefulness.