Data Provenance: Collection, QA, and Lineage

This page explains how each trajectory is created through human execution, validated through automated and human QA, and kept fully traceable from prompt to action logs, frames, and video.

Data Source

All data is generated through live human execution in real software environments.

Captured across:

Windows, macOS (desktop)
Android (mobile), iOS (coming soon)
Browsers

Each trajectory reflects actual human interaction, including UI state transitions and decision-making.

Task Design

Task prompts are not randomly generated. They are designed to cover tool functionality, reflect realistic task distributions, and exclude synthetic or arbitrary instructions.

They are created through a structured pipeline:

Tool capability mapping (function-level breakdown of the tool)
Workflow decomposition into interaction sequences
Prompt creation based on real user goals
Internal review before assignment
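
One way to check the coverage goal above is to compare the functions in a tool's capability map against the functions each prompt exercises. The sketch below is illustrative only: the names `uncovered_functions`, `capability_map`, and `prompt_functions`, and the sample tool functions and prompt IDs, are assumptions and do not describe the actual pipeline.

```python
def uncovered_functions(capability_map, prompt_functions):
    """Return tool functions that no prompt exercises yet, sorted for stable output."""
    covered = set()
    for functions in prompt_functions.values():
        covered.update(functions)
    return sorted(set(capability_map) - covered)


# Hypothetical capability map for one tool (function-level breakdown).
capability_map = ["insert_table", "apply_filter", "export_pdf"]

# Hypothetical mapping from prompt ID to the functions that prompt exercises.
prompt_functions = {
    "prompt-001": ["insert_table", "apply_filter"],
    "prompt-002": ["apply_filter"],
}

gaps = uncovered_functions(capability_map, prompt_functions)
print(gaps)  # → ['export_pdf']
```

A non-empty result would flag functions that still need a prompt before the internal review step.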

Human Operator Remote Network

Data is generated by a global network of 24,000+ Human Operators. The network's breadth across domains and geographies brings natural diversity to the dataset.

It is a skill-aligned execution network designed for high-fidelity interaction data generation.

Operators are:

Distributed across remote and controlled environments
Assigned tasks based on domain relevance and tool familiarity

Qualification & Execution Pipeline

Human Operators are selected and validated through a structured, multi-stage qualification process aligned to domain expertise and tool complexity.

Stage 1: Shortlisting

Candidates are filtered based on domain familiarity and tool proficiency, aligned to task categories and complexity.

Stage 2: Task Evaluation (3–5 Tasks)

Each candidate completes 3–5 tasks of varying complexity within their specialization to assess:

Execution accuracy
Adherence to recording protocols
Interaction consistency in advanced workflows

Stage 3: Final Validation (Conditional)

Selected candidates undergo a live assessment to validate expertise, authenticity, and real-time problem-solving ability.

Onboarding

Only candidates meeting or exceeding gold-standard thresholds are approved and onboarded into the data production pipeline.

Data Collection Methodology

All data is captured and audited through centralized, proprietary in-house software. The software enforces real human execution and standardizes capture across operators, tools, and environments, so that recorded signals stay consistent and natural behavior is preserved within a reproducible structure.

Recording System

60 FPS screen recording
Synchronized event logging
Absolute timestamp alignment (ms precision)
Native resolution capture
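
Because events carry absolute millisecond timestamps and video is recorded at 60 FPS, an event can be mapped to the frame on screen when it occurred. The sketch below shows that arithmetic under stated assumptions: the function name `frame_index`, its parameters, and the premise that frame 0 starts exactly at the recording start time are illustrative and not taken from the recording software.

```python
FPS = 60  # frame rate stated in the recording specification


def frame_index(event_ms: int, recording_start_ms: int, fps: int = FPS) -> int:
    """Return the index of the frame displayed at event_ms.

    Assumes frame 0 begins at recording_start_ms and frames are evenly spaced.
    Integer arithmetic avoids floating-point rounding at frame boundaries.
    """
    elapsed_ms = event_ms - recording_start_ms
    if elapsed_ms < 0:
        raise ValueError("event precedes recording start")
    return elapsed_ms * fps // 1000


# Hypothetical absolute timestamps in milliseconds.
start = 1_700_000_000_000
print(frame_index(start + 500, start))  # → 30
```

At 60 FPS each frame spans about 16.7 ms, so millisecond-precision timestamps are fine enough to assign every event to a single frame.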

Data Collection Flow

Task received: Task ID (TID) assigned to operator
Task review: Operator reads and understands task
TID entered: Operator inputs TID into recording agent
3-second cooldown: Capture triggered automatically
Live execution recorded: Screen, events, timestamps captured
Recording stopped: Via system shortcut
Review outcome: Discarded (rejected trajectory) or Submitted (accepted trajectory)
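
The flow above can be read as a small state machine with two terminal outcomes. The following is a minimal sketch of that reading, not the recording agent's implementation: the `Step` names, the `ALLOWED` table, and `validate_flow` are hypothetical, and the review step is folded into the transition out of the stopped state.

```python
from enum import Enum, auto


class Step(Enum):
    TASK_RECEIVED = auto()
    TASK_REVIEW = auto()
    TID_ENTERED = auto()
    COOLDOWN = auto()
    RECORDING = auto()
    STOPPED = auto()
    DISCARDED = auto()   # rejected trajectory
    SUBMITTED = auto()   # accepted trajectory


# Allowed next steps for each step; terminal steps have none.
ALLOWED = {
    Step.TASK_RECEIVED: {Step.TASK_REVIEW},
    Step.TASK_REVIEW: {Step.TID_ENTERED},
    Step.TID_ENTERED: {Step.COOLDOWN},
    Step.COOLDOWN: {Step.RECORDING},
    Step.RECORDING: {Step.STOPPED},
    Step.STOPPED: {Step.DISCARDED, Step.SUBMITTED},
    Step.DISCARDED: set(),
    Step.SUBMITTED: set(),
}


def validate_flow(steps):
    """True if steps start at TASK_RECEIVED, follow ALLOWED, and end in a terminal step."""
    if not steps or steps[0] is not Step.TASK_RECEIVED:
        return False
    for current, nxt in zip(steps, steps[1:]):
        if nxt not in ALLOWED[current]:
            return False
    return not ALLOWED[steps[-1]]
```

For example, a sequence that jumps from `TID_ENTERED` straight to `RECORDING` fails validation because it skips the cooldown.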

Quality Assurance

Each task submission is evaluated through a combination of system-level validation and human audit.

System Validation

Deterministic start and stop triggers
Synchronized capture of video, event logs, and Semantic Actions
Temporal alignment across all modalities

Human Validation

Data completeness (video, logs, frames)
Action coherence (sequence is logically consistent without redundant or invalid steps)
UI-state alignment (actions correspond to the visible interface)
Absence of PII or policy violations

Outcomes

Type	Meaning
Successful trajectory	A trajectory that satisfies all automated validation checks and passes human QA for task completion, action correctness, and UI-state alignment.
Failed trajectory	A trajectory that fails one or more automated validation checks or human QA criteria for task completion, action correctness, UI-state alignment, or PII and policy compliance.
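
The two outcomes in the table can be expressed as a single rule: a trajectory is successful only if every automated check and every human QA criterion passes. The sketch below illustrates that rule under assumptions: `QAResult`, `classify`, and the check names used in the example are hypothetical and are not the audit tool's actual schema.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class QAResult:
    automated: dict[str, bool]  # check name -> passed (hypothetical names)
    human: dict[str, bool]      # criterion name -> passed (hypothetical names)


def classify(result: QAResult) -> tuple[str, list[str]]:
    """Return (outcome, failed_checks); any single failure fails the trajectory."""
    failed = [f"automated:{name}" for name, ok in result.automated.items() if not ok]
    failed += [f"human:{name}" for name, ok in result.human.items() if not ok]
    return ("failed" if failed else "successful"), failed


result = QAResult(
    automated={"temporal_alignment": True, "start_stop_triggers": True},
    human={"task_completion": True, "ui_state_alignment": True, "pii_free": False},
)
print(classify(result))  # → ('failed', ['human:pii_free'])
```

Returning the list of failed checks alongside the outcome keeps the rejection reason available for lineage records.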

Data Lineage

The process is designed so that every trajectory is fully traceable from task definition to final dataset inclusion.

Lineage Chain

Prompt → Human Operators → Trajectory Execution → QA → Dataset

Traceability

Every trajectory is uniquely linked through:

Task ID (TID)
Task Prompt
Action trace (video + logs)
QA outcome and rejection reason
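
The four traceability fields above suggest a simple per-trajectory record keyed by TID. The following is a minimal sketch assuming a flat record with file paths for the action trace; `LineageRecord`, `index_by_tid`, all field names, and the outcome labels are hypothetical and do not reflect the actual storage schema.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LineageRecord:
    tid: str                                # Task ID (TID)
    prompt: str                             # task prompt text
    video_path: str                         # action trace: video
    log_path: str                           # action trace: event logs
    qa_outcome: str                         # "successful" or "failed" (assumed labels)
    rejection_reason: Optional[str] = None  # expected when qa_outcome is "failed"

    def __post_init__(self) -> None:
        if self.qa_outcome not in ("successful", "failed"):
            raise ValueError(f"unknown QA outcome: {self.qa_outcome!r}")
        if self.qa_outcome == "failed" and not self.rejection_reason:
            raise ValueError("a failed trajectory needs a rejection reason")


def index_by_tid(records):
    """Build a TID -> record lookup, rejecting duplicate TIDs."""
    index = {}
    for record in records:
        if record.tid in index:
            raise ValueError(f"duplicate TID: {record.tid}")
        index[record.tid] = record
    return index
```

Enforcing unique TIDs at indexing time is one way to keep the link from prompt to action trace to QA outcome unambiguous.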

Privacy & Compliance

Privacy is enforced at the level of task design, data collection, and validation, ensuring that all trajectories remain usable for research without exposing sensitive information.

By design, prompts and recording setups avoid the capture of Personally Identifiable Information (PII), including names, contact details, credentials, personal communications, identification numbers, and any user-specific data.

In rare cases where limited personal information may inherently appear during task execution, it is captured only with explicit consent from the operator and is subsequently curated or masked to prevent any exposure of individual identity while preserving the technical integrity of the data.

Controls

Privacy controls are embedded directly into the pipeline:

Prompts are constructed to avoid PII entry
Trajectories containing sensitive data are rejected during QA
Reasoning traces are reviewed prior to final submission
Only compliant data is included in final exports

Data Handling & Security

All data is stored within AWS-managed infrastructure with enforced access control and secure storage policies.

During trajectory recording, all generated files (video, event logs, frames) are encrypted at source, preventing operator-level access to raw data. No persistent local copies are exposed to operators beyond the controlled recording interface.

The audit team accesses data only through controlled, surface-level interfaces that are limited strictly to auditing workflows, with no direct interaction with the underlying data files. Data collection is conducted under explicit operator consent, obtained prior to participation and applicable across the full recording and submission lifecycle.

Bias & Coverage

The dataset is designed for coverage and realism, not controlled uniformity. Variability is preserved where it contributes to learning signal, and constrained where it degrades data quality.

Coverage

Coverage is driven across multiple axes:

Tools across domains
Full task complexity spectrum
Cross-platform environments (Windows, macOS, Android, web)
Diverse operator base (24K+ global network)

This results in natural variation across:

Interaction styles
Screen resolutions
UI configurations (e.g., dark/light themes)

Observed Biases

Office suite and productivity tools are more heavily represented due to higher usage and task availability
Differences in operator expertise and approach introduce variation in how tasks are performed
Reasoning traces reflect human interpretation; multiple valid approaches may exist for the same task, and a single reasoning path may not be optimal or unique
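
A representation bias such as the one noted for office and productivity tools can be quantified as each category's share of all trajectories. The sketch below shows one way to compute it; `category_shares` and the sample category labels are illustrative assumptions, and the numbers are made up for the example rather than drawn from the dataset.

```python
from collections import Counter


def category_shares(categories):
    """Return each category's fraction of the total number of trajectories."""
    counts = Counter(categories)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: count / total for category, count in counts.items()}


# Hypothetical tool-category label for each trajectory.
labels = ["office", "office", "design", "development"]
print(category_shares(labels))  # → {'office': 0.5, 'design': 0.25, 'development': 0.25}
```

Consumers of the dataset could use shares like these to reweight or subsample categories when uniform representation matters for their use case.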

Behavioral Preservation

The dataset intentionally retains natural execution patterns, including:

Retries
Backtracking
Corrections
Alternative valid paths

These are treated as signals, not noise, as they reflect real interaction behavior.