Data Collection
Empyr organizes data into modular collections that represent verifiable, structured knowledge extracted from real web interactions. Each collection is standardized through a public schema, linked to a content-addressable store, and governed by token-weighted curation and verification.
Data Collection Types
1. Interaction Data
Captures safe, anonymized behavioral patterns that help agents learn how users navigate and act on the web.
- Click and scroll heatmaps
- Page dwell-time metrics
- Hover, focus, and intent signals
- Interaction chains (action → outcome traces)
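The interaction chains above can be pictured as ordered action records tied to an outcome. A minimal sketch follows; every field name (`session_id`, `dwell_ms`, `outcome`, and so on) is an illustrative assumption, not part of any published Empyr schema.

```python
from dataclasses import dataclass, asdict

# Illustrative shape of an anonymized action -> outcome trace.
# All field names are assumptions for this sketch, not the Empyr spec.
@dataclass
class InteractionStep:
    action: str        # e.g. "click", "scroll", "hover"
    target: str        # anonymized element identifier
    dwell_ms: int      # time spent before the next action

@dataclass
class InteractionChain:
    session_id: str    # pseudonymous session key, no user identity
    steps: list        # ordered InteractionStep entries
    outcome: str       # e.g. "purchase", "abandon"

chain = InteractionChain(
    session_id="anon-7f3a",
    steps=[
        InteractionStep("click", "nav-search", 420),
        InteractionStep("scroll", "results-list", 1800),
        InteractionStep("click", "item-card-3", 650),
    ],
    outcome="purchase",
)
print(asdict(chain)["outcome"])  # the chain ties the action sequence to one outcome
```

Keeping the trace as a flat, pseudonymous sequence is what makes it safe to aggregate while still preserving the action → outcome signal agents train on.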
2. Structural Data
Provides machine-readable web layouts that enable agents to interpret complex or dynamic pages.
- DOM and CSS structure maps
- Element labeling and hierarchy graphs
- Accessibility metadata
- Script and dependency inventories
3. Functional Data
Describes web forms, tool calls, and dynamic workflows that browser agents must understand to execute actions.
- Form schemas and validation rules
- API call traces from approved extensions
- Function parameter mapping
- Tool invocation sequences and success metrics
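To make "form schemas and validation rules" concrete, here is a hedged sketch of what such a functional-data record and a consumer-side check could look like. The `form_id` value, field names, and rule keys are hypothetical, chosen only for illustration.

```python
# Hypothetical form schema carrying its own validation rules.
# Identifiers ("checkout-v2", "email", "qty") are illustrative, not from Empyr.
form_schema = {
    "form_id": "checkout-v2",
    "fields": [
        {"name": "email", "type": "email", "required": True},
        {"name": "qty", "type": "int", "required": True, "min": 1},
    ],
}

def validate(schema: dict, submission: dict) -> list:
    """Return a list of validation errors; an empty list means valid."""
    errors = []
    for f in schema["fields"]:
        val = submission.get(f["name"])
        if f.get("required") and val is None:
            errors.append(f"{f['name']}: missing")
        elif f["type"] == "int" and isinstance(val, int) and val < f.get("min", val):
            errors.append(f"{f['name']}: below minimum")
    return errors

print(validate(form_schema, {"email": "a@b.co", "qty": 0}))  # ['qty: below minimum']
```

An agent that holds such a record can pre-validate its input before attempting a live form submission, which is the workflow-execution use case this data type serves.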
4. Content and Product Data
Normalizes structured page content for agents working in commerce, research, or discovery.
- Product and service listings with attribute maps
- Pricing and stock updates
- Embedded schema.org and microdata extraction
- Summarized text and structured snippets
5. Contextual Data
Provides the situational metadata needed to evaluate reliability and freshness.
- Timestamped source provenance
- Domain reputation scoring
- Temporal decay and refresh-rate tracking
- Geographic and categorical tagging
6. Synthetic and Enriched Data
Combines verified original data with synthetic augmentation for model training and benchmarking.
- AI-generated fill-ins for sparse fields
- Semantic clustering outputs
- Ground-truth alignment with external datasets
- Benchmark sets for tool and agent evaluation
Schema Registry
Each collection is defined through a machine-readable schema registered on-chain.
- Schemas define allowed fields, datatypes, privacy class, and license.
- Changes require DAO approval through governance.
- Curators and verifiers use the schema to automate validation and reward distribution.
- Buyers use schemas for seamless integration with their models or pipelines.
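A registry entry and a curator-side conformance check might look like the sketch below. The keys mirror the properties named above (allowed fields, privacy class, license); the concrete values and the `conforms` helper are assumptions made for illustration.

```python
# Sketch of a schema-registry entry and an automated curator check.
# Entry keys follow the properties listed above; values are illustrative.
registry = {
    "FormSchema-v1": {
        "fields": {"form_id": str, "field_count": int},  # allowed fields + datatypes
        "privacy_class": "public",
        "license": "CC0",
    }
}

def conforms(collection: str, record: dict) -> bool:
    """Curator check: record carries exactly the allowed fields, correctly typed."""
    schema = registry[collection]["fields"]
    return set(record) == set(schema) and all(
        isinstance(record[k], t) for k, t in schema.items()
    )

print(conforms("FormSchema-v1", {"form_id": "checkout-v2", "field_count": 5}))  # True
print(conforms("FormSchema-v1", {"form_id": "checkout-v2"}))                    # False
```

Because the schema itself is the source of truth, the same registry entry drives validation for curators and integration for buyers, with no out-of-band coordination.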
Access and Monetization
Empyr supports multiple access modes and pricing models to serve diverse users.
- Live Streams: Continuous updates for training or real-time inference.
- Snapshot Packages: Fixed datasets for offline training or auditing.
- Query-on-Demand: Pay-per-query API with caching and pagination.
Access is purchased through data credits, which are converted from tokens and burned when used. Pricing per dataset and access mode is set dynamically by governance based on demand, cost, and verification complexity.
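The convert-then-burn flow for data credits can be sketched as simple ledger accounting. The conversion rate and dataset price here are made-up numbers; in Empyr both are set dynamically by governance.

```python
# Minimal accounting sketch of data credits: minted from tokens, burned on use.
# The rate and price values are illustrative assumptions, not protocol constants.
class CreditLedger:
    def __init__(self, rate: float):
        self.rate = rate          # credits minted per token converted
        self.burned = 0.0         # running total of destroyed credits

    def convert(self, tokens: float) -> float:
        """Convert tokens into a spendable credit balance."""
        return tokens * self.rate

    def spend(self, balance: float, price: float) -> float:
        """Burn `price` credits from `balance`; return the remaining balance."""
        if price > balance:
            raise ValueError("insufficient credits")
        self.burned += price      # spent credits are destroyed, not recirculated
        return balance - price

ledger = CreditLedger(rate=10.0)
balance = ledger.convert(5.0)           # 5 tokens -> 50 credits
balance = ledger.spend(balance, 12.5)   # one snapshot purchase
print(balance, ledger.burned)           # 37.5 12.5
```

Burning on use is the design choice that ties dataset demand directly to token supply: every query or snapshot purchase permanently retires the credits it consumed.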
Quality Assurance
To maintain reliability across collections:
- Producers sign submissions with key-based attestations.
- Curators run automated linting, deduplication, and semantic checks.
- Verifiers conduct random challenge audits backed by stake.
- Failed audits trigger stake slashing and public dispute records.
- A rolling leaderboard and reputation score incentivize long-term honest behavior.
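The first step above, a key-based attestation over a submission, can be sketched as hash-then-sign. A real deployment would use asymmetric signatures; HMAC stands in here only so the example stays dependency-free, and the key and payload are placeholders.

```python
import hashlib
import hmac

# Sketch of a producer attestation: sign the content hash of a submission.
# HMAC is a stand-in for the asymmetric signatures a real deployment would use.
def attest(secret: bytes, payload: bytes) -> dict:
    digest = hashlib.sha256(payload).hexdigest()   # content address of the bundle
    sig = hmac.new(secret, digest.encode(), hashlib.sha256).hexdigest()
    return {"hash": digest, "sig": sig}

def verify(secret: bytes, payload: bytes, att: dict) -> bool:
    """Recompute the attestation and compare signatures in constant time."""
    expected = attest(secret, payload)
    return hmac.compare_digest(expected["sig"], att["sig"])

key = b"producer-key"                        # illustrative key material
record = b'{"form_id": "checkout-v2"}'
att = attest(key, record)
print(verify(key, record, att), verify(key, b"tampered", att))  # True False
```

Because verifiers recompute the content hash during challenge audits, any post-submission tampering breaks the attestation and becomes slashable evidence.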
Example Collections
| Collection | Description | Update Frequency | Access | Status |
|---|---|---|---|---|
| FormSchema-v1 | Browser form structures, field names, input types | Daily | Snapshot, Stream | Active |
| ToolTrace-v1 | Function and API call sequences from browser extensions | Realtime | Stream | Beta |
| PageMap-v1 | DOM and layout extraction for major commerce and news sites | Weekly | Snapshot | Active |
| ClickFlow-v1 | Anonymized interaction chains with page identifiers | Daily | Stream | Draft |
Licensing and Provenance
Each dataset object is embedded with:
- License fingerprint (CC0, commercial, or private)
- Contributor signature
- Hash pointer to raw source bundle
- Timestamp and collection ID
License compliance and provenance trails are enforced at the protocol level, creating transparent traceability for buyers and auditors.
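The four embedded fields above amount to a provenance envelope around each object. A minimal sketch, with a stand-in string for the contributor signature and illustrative values throughout:

```python
import hashlib
import time

# Sketch of the provenance envelope embedded in each dataset object.
# Field names follow the list above; all values are illustrative.
def wrap(raw: bytes, license_id: str, contributor_sig: str, collection_id: str) -> dict:
    return {
        "license": license_id,                        # license fingerprint, e.g. "CC0"
        "contributor_sig": contributor_sig,           # stand-in for a real signature
        "raw_hash": hashlib.sha256(raw).hexdigest(),  # hash pointer to source bundle
        "timestamp": int(time.time()),
        "collection_id": collection_id,
    }

obj = wrap(b"<source bundle>", "CC0", "sig:anon-7f3a", "PageMap-v1")
print(sorted(obj))  # all five provenance fields are present
```

Because the envelope travels with the object, an auditor can re-hash the raw bundle and check the license fingerprint without any off-protocol lookups.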