Data Provenance
Explains how each trajectory is created through human execution, validated through automated and human QA, and fully traceable from prompt to action logs, frames, and video.
Data Source
All data is generated through live human execution in real software environments.
Captured across:
- Windows, macOS (desktop)
- Android (mobile), iOS (coming soon)
- Browsers
Each trajectory reflects actual human interaction, including UI state transitions and decision-making.
Task Design
Task prompts are not randomly generated. They are designed to cover tool functionality, reflect realistic task distributions, and exclude synthetic or arbitrary instructions.
They are created through a structured pipeline:
- Tool capability mapping (function-level breakdown of the tool)
- Workflow decomposition into interaction sequences
- Prompt creation based on real user goals
- Internal review before assignment
Human Operator Remote Network
Data is generated by a global network of 24,000+ Human Operators; this cross-domain and geographic breadth ensures natural diversity within the dataset.
It is a skill-aligned execution network designed for high-fidelity interaction data generation.
Operators are:
- Distributed across remote and controlled environments
- Assigned tasks based on domain relevance and tool familiarity
Qualification & Execution Pipeline
Human Operators are selected and validated through a structured, multi-stage qualification process aligned to domain expertise and tool complexity.
Stage 1: Shortlisting
Candidates are filtered based on domain familiarity and tool proficiency, aligned to task categories and complexity.
Stage 2: Task Evaluation (3–5 Tasks)
Each candidate completes 3–5 tasks of varying complexity within their specialization to assess:
- Execution accuracy
- Adherence to recording protocols
- Interaction consistency in advanced workflows
Stage 3: Final Validation (Conditional)
Selected candidates undergo a live assessment to validate expertise, authenticity, and real-time problem-solving ability.
Onboarding
Only candidates meeting or exceeding gold-standard thresholds are approved and onboarded into the data production pipeline.
Data Collection Methodology
All data is captured and audited through centralized, proprietary in-house software. The software enforces real human execution and standardizes capture across operators, tools, and environments, so that all recorded signals share a consistent, reproducible structure while natural behavior is preserved.
Recording System
- 60 FPS screen recording
- Synchronized event logging
- Absolute timestamp alignment (ms precision)
- Native resolution capture
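As an illustration of what 60 FPS capture with millisecond timestamps allows, the sketch below maps a logged event to the video frame that was on screen when it occurred. The function name, its parameters, and the assumption that frame 0 begins exactly at the recording start timestamp are hypothetical and are not taken from the recording software itself.

```python
FPS = 60  # frame rate stated in the Recording System list


def frame_index(event_ms: int, start_ms: int, fps: int = FPS) -> int:
    """Return the index of the frame displayed at an event's timestamp.

    Assumes (hypothetically) that frame 0 starts at `start_ms` and that
    frames are evenly spaced at `fps` frames per second.
    """
    if event_ms < start_ms:
        raise ValueError("event precedes recording start")
    elapsed_ms = event_ms - start_ms
    # Integer arithmetic avoids floating-point drift on long recordings.
    return elapsed_ms * fps // 1000
```

For example, an event logged 500 ms after the start of capture falls in frame 30 under these assumptions.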
Data Collection Flow
1. Task received: TID assigned to operator
2. Task review: Operator reads and understands task
3. TID entered: Operator inputs TID into recording agent
4. 3-second cooldown: Capture triggered automatically
5. Live execution recorded: Screen, events, timestamps captured
6. Recording stopped: Via system shortcut
7. Review outcome: Discarded (rejected trajectory) or Submitted (accepted trajectory)
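The flow above is strictly sequential until the final review, where it branches into two terminal outcomes. One way to picture this is as a small state machine. The sketch below is illustrative only: the `Stage` names, the `TRANSITIONS` table, and the helper functions are hypothetical and do not describe the actual recording agent.

```python
from enum import Enum, auto


class Stage(Enum):
    TASK_RECEIVED = auto()
    TASK_REVIEW = auto()
    TID_ENTERED = auto()
    COOLDOWN = auto()
    RECORDING = auto()
    STOPPED = auto()
    DISCARDED = auto()   # terminal: rejected trajectory
    SUBMITTED = auto()   # terminal: accepted trajectory


# Allowed next stages for each stage (hypothetical encoding of the flow).
TRANSITIONS = {
    Stage.TASK_RECEIVED: {Stage.TASK_REVIEW},
    Stage.TASK_REVIEW: {Stage.TID_ENTERED},
    Stage.TID_ENTERED: {Stage.COOLDOWN},
    Stage.COOLDOWN: {Stage.RECORDING},
    Stage.RECORDING: {Stage.STOPPED},
    Stage.STOPPED: {Stage.DISCARDED, Stage.SUBMITTED},
    Stage.DISCARDED: set(),
    Stage.SUBMITTED: set(),
}


def advance(current: Stage, nxt: Stage) -> Stage:
    """Move to `nxt` only if the flow permits it from `current`."""
    if nxt not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.name} -> {nxt.name}")
    return nxt


def run(path: list) -> Stage:
    """Walk a sequence of stages and return the final one."""
    stage = path[0]
    for nxt in path[1:]:
        stage = advance(stage, nxt)
    return stage
```

Encoding the flow as a transition table makes it easy to reject out-of-order steps, such as starting a recording before the TID has been entered.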
Quality Assurance
Each task submission is evaluated through a combination of system-level validation and human audit.
System Validation
- Deterministic start and stop triggers
- Synchronized capture of video, event logs, and Semantic Actions
- Temporal alignment across all modalities
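A temporal alignment check of this kind could, for instance, verify that every logged event falls inside the video's time span and that events are in chronological order. The following is a minimal sketch under those assumptions; the function name, its inputs, and the specific rules are hypothetical, and the production validation may check different or additional conditions.

```python
from __future__ import annotations


def check_alignment(event_ms: list[int], video_start_ms: int,
                    video_end_ms: int) -> list[str]:
    """Return a list of alignment problems; an empty list means no issue found."""
    problems = []
    if video_end_ms <= video_start_ms:
        problems.append("video has non-positive duration")
    for i, t in enumerate(event_ms):
        # Every event should occur while the video was being captured.
        if not video_start_ms <= t <= video_end_ms:
            problems.append(f"event {i} at {t} ms falls outside the video span")
        # Event timestamps should never move backwards.
        if i > 0 and t < event_ms[i - 1]:
            problems.append(f"event {i} is earlier than event {i - 1}")
    return problems
```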
Human Validation
- Data completeness (video, logs, frames)
- Action coherence (sequence is logically consistent without redundant or invalid steps)
- UI-state alignment (actions correspond to the visible interface)
- Absence of PII or policy violations
Outcomes
| Type | Meaning |
|---|---|
| Successful trajectory | A trajectory that satisfies all automated validation checks and passes human QA for task completion, action correctness, and UI-state alignment. |
| Failed trajectory | A trajectory that fails one or more automated validation checks or human QA criteria for task completion, action correctness, UI-state alignment, or PII and policy compliance. |
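The two outcomes amount to a conjunction: a trajectory succeeds only if every automated and human check passes. The sketch below expresses that rule; the `QAResult` fields and the `classify` function are hypothetical names chosen to mirror the criteria listed in the table, not an actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QAResult:
    # Field names are illustrative; they mirror the criteria in the table.
    automated_checks_passed: bool
    task_completed: bool
    actions_correct: bool
    ui_state_aligned: bool
    pii_free: bool


def classify(result: QAResult) -> str:
    """Return the outcome type; a single failed criterion fails the trajectory."""
    passed = all((
        result.automated_checks_passed,
        result.task_completed,
        result.actions_correct,
        result.ui_state_aligned,
        result.pii_free,
    ))
    return "Successful trajectory" if passed else "Failed trajectory"
```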
Data Lineage
The process is designed so that every trajectory is fully traceable from task definition to final dataset inclusion.
Lineage Chain
Prompt → Human Operators → Trajectory Execution → QA → Dataset
Traceability
Every trajectory is uniquely linked through:
- Task ID (TID)
- Task Prompt
- Action trace (video + logs)
- QA outcome and rejection reason
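To make the traceability fields concrete, the sketch below models one lineage record and a lookup index keyed by TID. It is an assumption-laden illustration: the class, the field names, the `"accepted"`/`"rejected"` outcome values, and the rule that a rejection must carry a reason are hypothetical and do not represent the dataset's actual export schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LineageRecord:
    tid: str                      # Task ID (TID)
    prompt: str                   # Task Prompt
    video_path: str               # Action trace: video
    log_path: str                 # Action trace: event logs
    qa_outcome: str               # hypothetical values: "accepted" or "rejected"
    rejection_reason: Optional[str] = None

    def __post_init__(self):
        # Assumed rule: a rejected trajectory always records why.
        if self.qa_outcome == "rejected" and not self.rejection_reason:
            raise ValueError("rejected trajectories need a rejection reason")


def build_index(records):
    """Index records by TID, refusing duplicates so each link stays unique."""
    index = {}
    for record in records:
        if record.tid in index:
            raise ValueError(f"duplicate TID: {record.tid}")
        index[record.tid] = record
    return index
```

With such an index, any trajectory in the dataset can be traced back to its prompt, recordings, and QA decision from the TID alone.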
Privacy & Compliance
Privacy is enforced at the level of task design, data collection, and validation, ensuring that all trajectories remain usable for research without exposing sensitive information.
By design, prompts and recording setups avoid the capture of Personally Identifiable Information (PII), including names, contact details, credentials, personal communications, identification numbers, and any user-specific data.
In rare cases where limited personal information may inherently appear during task execution, it is captured only with explicit consent from the operator and is subsequently curated or masked to prevent any exposure of individual identity while preserving the technical integrity of the data.
Controls
Privacy controls are embedded directly into the pipeline:
- Prompts are constructed to avoid PII entry
- Trajectories containing sensitive data are rejected during QA
- Reasoning traces are reviewed prior to final submission
- Only compliant data is included in final exports
Data Handling & Security
All data is stored within AWS-managed infrastructure with enforced access control and secure storage policies.
During trajectory recording, all generated files (video, event logs, frames) are encrypted at source, preventing operator-level access to raw data. No persistent local copies are exposed to operators beyond the controlled recording interface.
The audit team accesses data only through controlled, surface-level interfaces that are limited strictly to auditing workflows, without direct interaction with the underlying data files.
Data collection is conducted under explicit operator consent, obtained prior to participation and applicable across the full recording and submission lifecycle.
Bias & Coverage
The dataset is designed for coverage and realism, not controlled uniformity. Variability is preserved where it contributes to learning signal, and constrained where it degrades data quality.
Coverage
Coverage is driven across multiple axes:
- Tools across domains
- Full task complexity spectrum
- Cross-platform environments (Windows, macOS, Android, web)
- Diverse operator base (24K+ global network)
This results in natural variation across:
- Interaction styles
- Screen resolutions
- UI configurations (e.g., dark/light themes)
Observed Biases
- Office suite and productivity tools are more heavily represented due to higher usage and task availability
- Differences in operator expertise and approach introduce variation in how tasks are performed
- Reasoning traces reflect human interpretation; multiple valid approaches may exist for the same task, and a single reasoning path may not be optimal or unique
Behavioral Preservation
The dataset intentionally retains natural execution patterns, including:
- Retries
- Backtracking
- Corrections
- Alternative valid paths
These are treated as signals, not noise, as they reflect real interaction behavior.