Data Collection
Empyr organizes data into modular collections that represent verifiable, structured knowledge extracted from real web interactions. Each collection is standardized through a public schema, linked to a content-addressable store, and governed by token-weighted curation and verification.
Data Collection Types
1. Interaction Data
Captures safe, anonymized behavioral patterns that help agents learn how users navigate and act on the web.
- Click and scroll heatmaps
- Page dwell-time metrics
- Hover, focus, and intent signals
- Interaction chains (action → outcome traces)
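The interaction chains above can be pictured as ordered action records tied to an outcome. A minimal sketch follows; every field name (`session_id`, `dwell_ms`, `outcome`, and so on) is an illustrative assumption, not part of any published Empyr schema.

```python
from dataclasses import dataclass, asdict

# Illustrative shape of an anonymized action -> outcome trace.
# All field names are assumptions for this sketch, not the Empyr spec.
@dataclass
class InteractionStep:
    action: str        # e.g. "click", "scroll", "hover"
    target: str        # anonymized element identifier
    dwell_ms: int      # time spent before the next action

@dataclass
class InteractionChain:
    session_id: str    # pseudonymous session key, no user identity
    steps: list        # ordered InteractionStep entries
    outcome: str       # e.g. "purchase", "abandon"

chain = InteractionChain(
    session_id="anon-7f3a",
    steps=[
        InteractionStep("click", "nav-search", 420),
        InteractionStep("scroll", "results-list", 1800),
        InteractionStep("click", "item-card-3", 650),
    ],
    outcome="purchase",
)
print(asdict(chain)["outcome"])  # the chain ties the action sequence to one outcome
```

Keeping the trace as a flat, pseudonymous sequence is what makes it safe to aggregate while still preserving the action → outcome signal agents train on.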
2. Structural Data
Provides machine-readable web layouts that enable agents to interpret complex or dynamic pages.
- DOM and CSS structure maps
- Element labeling and hierarchy graphs
- Accessibility metadata
- Script and dependency inventories
3. Functional Data
Describes web forms, tool calls, and dynamic workflows that browser agents must understand to execute actions.
- Form schemas and validation rules
- API call traces from approved extensions
- Function parameter mapping
- Tool invocation sequences and success metrics
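To make "form schemas and validation rules" concrete, here is a hedged sketch of what such a functional-data record and a consumer-side check could look like. The `form_id` value, field names, and rule keys are hypothetical, chosen only for illustration.

```python
# Hypothetical form schema carrying its own validation rules.
# Identifiers ("checkout-v2", "email", "qty") are illustrative, not from Empyr.
form_schema = {
    "form_id": "checkout-v2",
    "fields": [
        {"name": "email", "type": "email", "required": True},
        {"name": "qty", "type": "int", "required": True, "min": 1},
    ],
}

def validate(schema: dict, submission: dict) -> list:
    """Return a list of validation errors; an empty list means valid."""
    errors = []
    for f in schema["fields"]:
        val = submission.get(f["name"])
        if f.get("required") and val is None:
            errors.append(f"{f['name']}: missing")
        elif f["type"] == "int" and isinstance(val, int) and val < f.get("min", val):
            errors.append(f"{f['name']}: below minimum")
    return errors

print(validate(form_schema, {"email": "a@b.co", "qty": 0}))  # ['qty: below minimum']
```

An agent that holds such a record can pre-validate its input before attempting a live form submission, which is the workflow-execution use case this data type serves.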
4. Content and Product Data
Normalizes structured page content for agents working in commerce, research, or discovery.
- Product and service listings with attribute maps
- Pricing and stock updates
- Embedded schema.org and microdata extraction
- Summarized text and structured snippets
5. Contextual Data
Provides the situational metadata needed to evaluate reliability and freshness.
- Timestamped source provenance
- Domain reputation scoring
- Temporal decay and refresh-rate tracking
- Geographic and categorical tagging
6. Synthetic and Enriched Data
Combines verified original data with synthetic augmentation for model training and benchmarking.
- AI-generated fill-ins for sparse fields
- Semantic clustering outputs
- Ground-truth alignment with external datasets
- Benchmark sets for tool and agent evaluation
Schema Registry
Each collection is defined through a machine-readable schema registered on-chain.
- Schemas define allowed fields, datatypes, privacy class, and license.
- Changes require DAO approval through governance.
- Curators and verifiers use the schema to automate validation and reward distribution.
- Buyers use schemas for seamless integration with their models or pipelines.
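A registry entry and a curator-side conformance check might look like the sketch below. The keys mirror the properties named above (allowed fields, privacy class, license); the concrete values and the `conforms` helper are assumptions made for illustration.

```python
# Sketch of a schema-registry entry and an automated curator check.
# Entry keys follow the properties listed above; values are illustrative.
registry = {
    "FormSchema-v1": {
        "fields": {"form_id": str, "field_count": int},  # allowed fields + datatypes
        "privacy_class": "public",
        "license": "CC0",
    }
}

def conforms(collection: str, record: dict) -> bool:
    """Curator check: record carries exactly the allowed fields, correctly typed."""
    schema = registry[collection]["fields"]
    return set(record) == set(schema) and all(
        isinstance(record[k], t) for k, t in schema.items()
    )

print(conforms("FormSchema-v1", {"form_id": "checkout-v2", "field_count": 5}))  # True
print(conforms("FormSchema-v1", {"form_id": "checkout-v2"}))                    # False
```

Because the schema itself is the source of truth, the same registry entry drives validation for curators and integration for buyers, with no out-of-band coordination.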
Access and Monetization
Empyr supports multiple access modes and pricing models to serve diverse users.
- Live Streams: Continuous updates for training or real-time inference.
- Snapshot Packages: Fixed datasets for offline training or auditing.
- Query-on-Demand: Pay-per-query API with caching and pagination.
Access is purchased through data credits, which are converted from tokens and burned when used. Pricing per dataset and access mode is set dynamically by governance based on demand, cost, and verification complexity.
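The convert-then-burn flow for data credits can be sketched as simple ledger accounting. The conversion rate and dataset price here are made-up numbers; in Empyr both are set dynamically by governance.

```python
# Minimal accounting sketch of data credits: minted from tokens, burned on use.
# The rate and price values are illustrative assumptions, not protocol constants.
class CreditLedger:
    def __init__(self, rate: float):
        self.rate = rate          # credits minted per token converted
        self.burned = 0.0         # running total of destroyed credits

    def convert(self, tokens: float) -> float:
        """Convert tokens into a spendable credit balance."""
        return tokens * self.rate

    def spend(self, balance: float, price: float) -> float:
        """Burn `price` credits from `balance`; return the remaining balance."""
        if price > balance:
            raise ValueError("insufficient credits")
        self.burned += price      # spent credits are destroyed, not recirculated
        return balance - price

ledger = CreditLedger(rate=10.0)
balance = ledger.convert(5.0)           # 5 tokens -> 50 credits
balance = ledger.spend(balance, 12.5)   # one snapshot purchase
print(balance, ledger.burned)           # 37.5 12.5
```

Burning on use is the design choice that ties dataset demand directly to token supply: every query or snapshot purchase permanently retires the credits it consumed.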
Quality Assurance
To maintain reliability across collections:
- Producers sign submissions with key-based attestations.
- Curators run automated linting, deduplication, and semantic checks.
- Verifiers conduct random challenge audits backed by stake.
- Failed audits trigger stake slashing and public dispute records.
- A rolling leaderboard and reputation score incentivize long-term honest behavior.
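The first step above, a key-based attestation over a submission, can be sketched as hash-then-sign. A real deployment would use asymmetric signatures; HMAC stands in here only so the example stays dependency-free, and the key and payload are placeholders.

```python
import hashlib
import hmac

# Sketch of a producer attestation: sign the content hash of a submission.
# HMAC is a stand-in for the asymmetric signatures a real deployment would use.
def attest(secret: bytes, payload: bytes) -> dict:
    digest = hashlib.sha256(payload).hexdigest()   # content address of the bundle
    sig = hmac.new(secret, digest.encode(), hashlib.sha256).hexdigest()
    return {"hash": digest, "sig": sig}

def verify(secret: bytes, payload: bytes, att: dict) -> bool:
    """Recompute the attestation and compare signatures in constant time."""
    expected = attest(secret, payload)
    return hmac.compare_digest(expected["sig"], att["sig"])

key = b"producer-key"                        # illustrative key material
record = b'{"form_id": "checkout-v2"}'
att = attest(key, record)
print(verify(key, record, att), verify(key, b"tampered", att))  # True False
```

Because verifiers recompute the content hash during challenge audits, any post-submission tampering breaks the attestation and becomes slashable evidence.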
Example Collections
| Collection | Description | Update Frequency | Access | Status |
|---|---|---|---|---|
| FormSchema-v1 | Browser form structures, field names, input types | Daily | Snapshot, Stream | Active |
| ToolTrace-v1 | Function and API call sequences from browser extensions | Realtime | Stream | Beta |
| PageMap-v1 | DOM and layout extraction for major commerce and news sites | Weekly | Snapshot | Active |
| ClickFlow-v1 | Anonymized interaction chains with page identifiers | Daily | Stream | Draft |
Licensing and Provenance
Each dataset object is embedded with:
- License fingerprint (CC0, commercial, or private)
- Contributor signature
- Hash pointer to raw source bundle
- Timestamp and collection ID
License compliance and provenance trails are enforced at the protocol level, creating transparent traceability for buyers and auditors.
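The four embedded fields above amount to a provenance envelope around each object. A minimal sketch, with a stand-in string for the contributor signature and illustrative values throughout:

```python
import hashlib
import time

# Sketch of the provenance envelope embedded in each dataset object.
# Field names follow the list above; all values are illustrative.
def wrap(raw: bytes, license_id: str, contributor_sig: str, collection_id: str) -> dict:
    return {
        "license": license_id,                        # license fingerprint, e.g. "CC0"
        "contributor_sig": contributor_sig,           # stand-in for a real signature
        "raw_hash": hashlib.sha256(raw).hexdigest(),  # hash pointer to source bundle
        "timestamp": int(time.time()),
        "collection_id": collection_id,
    }

obj = wrap(b"<source bundle>", "CC0", "sig:anon-7f3a", "PageMap-v1")
print(sorted(obj))  # all five provenance fields are present
```

Because the envelope travels with the object, an auditor can re-hash the raw bundle and check the license fingerprint without any off-protocol lookups.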