Data Provenance

Data provenance ensures every demonstration is authentic, technically accurate, auditable, and easy to verify. Each task follows a controlled lifecycle, from prompt design and human execution to automated validation and human review, so researchers can reliably trace and reproduce the full interaction history used for model training.

Prompt Creation

A prompt is the written task instruction given to the recorder. It specifies what goal to achieve, what constraints to follow, and what outcome counts as completion, so the task can be executed and evaluated consistently.

How Prompts Are Created

  • Prompts are written by trained prompt authors using standardized guidelines
  • Each prompt is designed to be tool-realistic, with measurable completion criteria
  • Prompts are aligned to supported environments (Windows, macOS, Android; iOS in progress) and real tool behavior

Prompt Design Approach

  • Feature-based Prompts: Tasks are created around specific product features (e.g., filters, exports, formatting, sharing, settings) to ensure coverage of key UI functionality.

  • Workflow-based Prompts: Tasks represent real end-to-end user goals that require multiple steps, such as preparing a report, publishing content, or completing a transaction.

  • Complexity-tier Coverage: Prompts are created across atomic, elementary, multi-step, and workflow-level difficulty to ensure coverage from micro-actions to long-horizon sequences.

Prompt Quality Checks

  • Prompts are checked for ambiguity, missing prerequisites, and unclear success definitions
  • Prompts are verified to avoid sensitive content and reduce the risk of capturing PII during execution
  • Each approved prompt is mapped to a unique Task ID (TID) to ensure traceability through the full lifecycle
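
The mapping from an approved prompt to its Task ID can be pictured as a small record, as in the minimal Python sketch below. All field names and values here are illustrative assumptions, not the platform's actual schema.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class ApprovedPrompt:
        tid: str                # unique Task ID carried through the full lifecycle
        instruction: str        # the written task instruction given to the recorder
        environment: str        # e.g. "Windows", "macOS", "Android"
        complexity_tier: str    # "atomic", "elementary", "multi-step", or "workflow"
        success_criteria: str   # measurable definition of completion

    prompt = ApprovedPrompt(
        tid="TID-000123",
        instruction="Export the quarterly report as a PDF and save it to the shared folder.",
        environment="Windows",
        complexity_tier="multi-step",
        success_criteria="A PDF of the report exists in the shared folder.",
    )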

Demonstration Creation and Execution

Demonstrations are produced through live human execution in real software, under controlled guidelines that ensure consistency across recorders and full traceability per task.

The dataset includes over 1,500 hours of real human computer use and approximately 6 million recorded actions, reflecting genuine user behavior, natural pacing, and valid task execution paths.

Who Performs the Task

  • Recorded by trained human operators with varied professional backgrounds, including engineers, software developers, CAD professionals, and other domain specialists
  • Task assignment is based on tool familiarity and task complexity, so each task is matched to an operator with the appropriate skill level

How Execution Is Controlled

  • Every task is tied to a unique Task ID (TID) and mapped to a single prompt
  • Prompts span atomic, elementary, multi-step, and workflow-level tasks to cover multiple complexity tiers
  • Operators follow standardized task guidelines to ensure consistent execution across tools and environments
  • Tasks are performed using isolated tool accounts, preventing cross-task contamination and ensuring consistent system states

Where It Happens

  • Executed in real operating system environments: Windows, macOS, and Android
    (iOS dataset curation in progress)
  • Coverage spans more than 120 tools, including Office Suite, Design and Creativity, CRM, Cloud Administration, Communication, and others
  • Recorded as a 60 FPS screen video and processed into frame images for frame-level supervision
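
As one illustration of how a recording could be processed into frame images, the Python sketch below splits a screen-capture video into per-frame PNGs with OpenCV. The file paths and naming pattern are hypothetical, and this is not a description of the platform's actual tooling.

    import cv2  # OpenCV: one possible way to split a recording into frame images

    # Hypothetical paths and naming; for illustration only.
    video_path = "TID-000123.mp4"
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)   # expected to be ~60 for these recordings

    frame_index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        cv2.imwrite(f"TID-000123_frame_{frame_index:06d}.png", frame)
        frame_index += 1
    cap.release()
    print(f"Extracted {frame_index} frames at {fps:.1f} FPS")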

What Is Explicitly Disallowed

  • Scripted automation
  • Macros
  • Simulated UI replays
  • Post-hoc fabricated logs

All demonstrations are generated through authentic human execution at scale, with measurable coverage and depth.

Human Review

Human QA operates as a multi-layer audit system designed to ensure accurate results.

Task Completion Correctness

  • One hundred percent of tasks undergo domain expert review
  • Reviewers confirm full completion against prompt requirements
  • Ten percent of accepted tasks undergo an inspector audit as a secondary quality check
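
A minimal sketch of how a ten percent inspector sample might be drawn from accepted tasks; the function name and rate handling are illustrative assumptions, not the platform's actual sampling procedure.

    import random

    def sample_for_inspector_audit(accepted_tids, rate=0.10, seed=None):
        # Draw roughly `rate` of the accepted Task IDs for a secondary audit.
        rng = random.Random(seed)
        k = max(1, round(len(accepted_tids) * rate)) if accepted_tids else 0
        return rng.sample(accepted_tids, k)

    audit_batch = sample_for_inspector_audit(["TID-000123", "TID-000124", "TID-000125"], seed=7)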

Action Intent Validity

  • Reviewers verify that actions align with the intended meaning and objective of the task, not just the final result
  • Confirms that the correct UI elements, controls, and workflows were intentionally selected
  • Flags cases where the correct outcome is achieved through accidental, inefficient, or semantically incorrect actions
  • Ensures training data reflects purposeful decision-making rather than coincidental success

UI Interpretation Errors

  • Evaluates whether the operator correctly understood and interpreted UI elements as presented on screen
  • Flags interactions with incorrect menus, dialogs, buttons, or visual controls
  • Identifies visual grounding mistakes even when task completion appears successful
  • Prevents models from learning incorrect UI associations or misaligned visual reasoning

Edge Case Handling

Corrective behaviors such as retries, undo operations, and backtracking are preserved and explicitly reviewed.

These behaviors are present across tens of thousands of long-horizon tasks, enabling resilience-focused learning.
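
To make this concrete, the sketch below flags two simple corrective patterns (undo keystrokes and repeated clicks on the same target) in a parsed action log. The event fields and values are assumptions for illustration, not the dataset's actual log schema.

    def find_corrective_actions(actions):
        # Flag undo keystrokes and immediate retries of the same click target.
        flagged = []
        for i, action in enumerate(actions):
            if action.get("type") == "key" and action.get("keys") in ("ctrl+z", "cmd+z"):
                flagged.append((i, "undo"))
            elif i > 0 and action.get("type") == "click" and action.get("target") == actions[i - 1].get("target"):
                flagged.append((i, "retry"))
        return flagged

    log = [
        {"type": "click", "target": "Export"},
        {"type": "click", "target": "Export"},   # retried click
        {"type": "key", "keys": "ctrl+z"},        # undo
    ]
    print(find_corrective_actions(log))  # [(1, 'retry'), (2, 'undo')]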

Automatic Validation

Automated system checks operate as a single validation layer for every recorded task, enforcing statistical and temporal consistency.

Timestamp Monotonicity

A timestamp represents the exact time, recorded in milliseconds, at which an action occurs. All recorded steps are verified to ensure timestamps move forward in time, confirming that actions were performed in a real, continuous sequence and not copied or recorded out of order.
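
A minimal sketch of such a check, assuming each parsed action carries a millisecond timestamp field (the field name is illustrative):

    def timestamps_are_monotonic(actions):
        # True when every timestamp is >= the one before it, i.e. time only moves forward.
        ts = [a["timestamp_ms"] for a in actions]
        return all(earlier <= later for earlier, later in zip(ts, ts[1:]))

    actions = [{"timestamp_ms": 1000}, {"timestamp_ms": 1016}, {"timestamp_ms": 1033}]
    assert timestamps_are_monotonic(actions)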

Action-frame Alignment

Each task includes:

  • Frame-level annotations
  • FPS tracking
  • Total frame counts

Alignment is validated against action logs to ensure pixel-to-event consistency.
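
One way to picture this validation: map each logged action timestamp to the frame it should correspond to and confirm that frame exists within the recorded range. The sketch below assumes a fixed capture rate and illustrative field names.

    def actions_align_with_frames(actions, fps, total_frames, start_ms=0):
        # Map each action timestamp onto a frame index and confirm that frame was captured.
        for action in actions:
            frame_index = int((action["timestamp_ms"] - start_ms) * fps / 1000)
            if frame_index < 0 or frame_index >= total_frames:
                return False
        return True

    actions = [{"timestamp_ms": 500}, {"timestamp_ms": 4980}]
    assert actions_align_with_frames(actions, fps=60, total_frames=300)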

Missing File Detection

Each task is programmatically verified for the presence of:

  • One video file (.mp4)
  • Three action log formats (.txt, .csv, .json)
  • One frame annotation file (.csv)

Tasks missing any required file are automatically rejected.
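
A minimal sketch of the presence check, assuming all of a task's files sit in one directory (the layout and naming are assumptions): one .mp4, at least one .txt and one .json action log, and two .csv files (the .csv action log plus the frame annotation file).

    from collections import Counter
    from pathlib import Path

    def task_has_required_files(task_dir):
        # Count files by extension and compare against the required inventory.
        counts = Counter(p.suffix for p in Path(task_dir).iterdir() if p.is_file())
        return (
            counts[".mp4"] == 1       # one screen recording
            and counts[".txt"] >= 1   # action log (.txt)
            and counts[".json"] >= 1  # action log (.json)
            and counts[".csv"] >= 2   # action log (.csv) + frame annotation (.csv)
        )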

Corrupted File Detection

Each task undergoes automated checks to ensure data integrity and prevent inclusion of incomplete or unreliable demonstrations.

Validation metrics include:

  • FPS Consistency
    Detects dropped frames, unstable capture rates, or encoding issues that can break frame-level supervision

  • Lag Percentage
    Measures idle time between actions to flag recordings with low activity or abnormal pauses

  • Time Deviation
    Captures delays in recording initiation that can misalign early actions with the visual timeline

Tasks that fail integrity thresholds are rejected or re-recorded before inclusion in the dataset.
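
A rough sketch of how these three metrics could be computed from frame and action timestamps; the gap threshold and field semantics are illustrative assumptions, not the platform's actual rejection criteria.

    def integrity_metrics(frame_times_ms, action_times_ms, scheduled_start_ms):
        # FPS consistency: effective capture rate derived from the frame timestamps.
        duration_s = (frame_times_ms[-1] - frame_times_ms[0]) / 1000
        effective_fps = (len(frame_times_ms) - 1) / duration_s if duration_s > 0 else 0.0

        # Lag percentage: share of the recording spent in long idle gaps between actions
        # (here, gaps over 5 seconds; the real threshold is an assumption).
        gaps = [b - a for a, b in zip(action_times_ms, action_times_ms[1:])]
        active_span = action_times_ms[-1] - action_times_ms[0]
        lag_pct = 100 * sum(g for g in gaps if g > 5000) / active_span if active_span > 0 else 0.0

        # Time deviation: delay between the scheduled start and the first captured frame.
        time_deviation_ms = frame_times_ms[0] - scheduled_start_ms

        return {"fps": effective_fps, "lag_pct": lag_pct, "time_deviation_ms": time_deviation_ms}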

QA Outcomes

Each task results in one of three tracked outcomes:

  • Accept
    The demonstration meets all automated validation and human review criteria

  • Reject
    The demonstration fails validation or review checks

  • Re-run
    The demonstration is rejected with feedback and re-recorded before re-evaluation
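
Represented as a simple status value, the three outcomes might look like the sketch below; this is an illustration only, not the platform's internal labels.

    from enum import Enum

    class QAOutcome(Enum):
        ACCEPT = "accept"   # passes automated validation and human review
        REJECT = "reject"   # fails validation or review checks
        RERUN = "re-run"    # rejected with feedback, re-recorded, then re-evaluated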

Data Lineage

The platform enforces deterministic, end-to-end lineage so every demonstration can be reconstructed, verified, and audited from the source prompt to the final task.

Canonical Lineage Chain

Prompt → Action Logs → Frame Logs and Frame Images → Video → Final Task

What Is Linked Per Task

For each demonstration, all components are connected through a Task ID (TID), which acts as the primary identifier across the entire lifecycle.

  • Prompt and context setup define the task objective and prerequisites
  • Action logs record all interaction events with precise timestamps and execution order
  • Frame logs and annotations link actions to specific visual states
  • Frame images and video provide a continuous visual record
  • Metadata captures execution environment, tool identifiers, task duration, action counts, and quality metrics
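
A minimal sketch of how these components might be keyed by the TID in a single record; the field names and types are assumptions for illustration, not the platform's actual schema.

    from dataclasses import dataclass, field

    @dataclass
    class TaskLineage:
        tid: str                # primary identifier across the lifecycle
        prompt: str             # task objective and prerequisites
        action_logs: list       # paths to the .txt / .csv / .json action logs
        frame_annotations: str  # path to the frame annotation .csv
        frame_images_dir: str   # directory of extracted frame images
        video_path: str         # path to the .mp4 screen recording
        metadata: dict = field(default_factory=dict)  # environment, tool IDs, duration, action counts, quality metrics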

Lineage Guarantees

  • One-to-one Traceability
    No file exists without a valid Task ID and upstream references

  • Temporal Integrity
    Action and frame timestamps are monotonic and consistent

  • Alignment Integrity
    Logged actions are verifiably aligned with frame and video records

  • Completeness Enforcement
    Missing or corrupted components are automatically detected and rejected

Privacy

PII Excluded by Default

Personally Identifiable Information (PII) is excluded by default.

Prompts and recording setups are designed to avoid capturing sensitive or user-specific data, including names, contact details, credentials, personal messages, and identification numbers.

In rare cases where limited personal information may be inherently present, it is captured only with explicit consent and curated or masked to preserve privacy while maintaining technical usefulness.