Abstract pattern in red and blue. Via Arthur Mazi / Unsplash.

Why Content Needs a Fingerprint, Not Just a Watermark

May 18, 2026, 28-minute read

ISCC, FAIA, CommonsDB, and O'Reilly's "Answers" platform are quietly building the infrastructure for verifiable content in the AI age. If your organization runs on valuable structured information and cares about its long-term integrity, this is for you.

The content economy is breaking. AI platforms increasingly answer user queries directly, bypassing the websites, articles, and videos that publishers spent years producing. When someone asks an AI assistant about EU regulation, the assistant synthesizes an answer from multiple sources. The user gets what they need. The publishers whose reporting grounded that answer get nothing.

This structural shift doesn't just affect revenue. It erodes the entire chain of provenance, attribution, and trust that holds professional content together. In a world where AI agents consume, remix, and redistribute content at machine speed, three questions become urgent: How do you prove what's yours? How do you declare what AI did to it? And how do you set the terms under which it can be used?

The European Stack: ISCC + FAIA + CommonsDB

A new infrastructure is emerging from Europe that addresses all three questions. It is built around open standards, anchored in EU trust frameworks, already running in production – and it provides the foundation that content producers need to participate in the AI-mediated information economy rather than be consumed by it.

ISCC: The Content Fingerprint

The International Standard Content Code (ISCC) is the foundation layer. Unlike traditional identifiers that are assigned by authorities, an ISCC is derived from the content itself using open-source algorithms published as an ISO standard. Anyone with access to a file can generate the same ISCC independently, without coordination, registration, or fees.

This is a fundamental architectural choice. The ISCC cannot be stripped because it doesn't live inside the file – it is computed from the file. If someone strips all metadata, re-encodes a video, converts a document to a different format, or screenshots an image, the ISCC can still be regenerated from the modified content and matched against declarations of ISCC codes and metadata that have been made accessible in a registry. Similarity-preserving hashes handle minor variations, so near-duplicate detection works across transformations.
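
To make the idea of a content-derived, similarity-preserving identifier concrete, here is a minimal sketch of a simhash over word shingles. It is an illustration only and not the ISO 24138 algorithm: the function names, the 64-bit length, and the three-word shingle size are assumptions chosen for the example.

```python
import hashlib


def _shingles(text: str, size: int = 3) -> list[str]:
    # Normalise case and whitespace, then slide a window of `size` words.
    words = text.lower().split()
    if len(words) < size:
        return [" ".join(words)]
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]


def toy_fingerprint(text: str, bits: int = 64) -> int:
    """Illustrative simhash; NOT the ISCC algorithm from ISO 24138."""
    counts = [0] * bits
    for shingle in _shingles(text):
        digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=bits // 8).digest()
        value = int.from_bytes(digest, "big")
        for bit in range(bits):
            counts[bit] += 1 if (value >> bit) & 1 else -1
    # A bit is set when the majority of shingles voted for it.
    return sum(1 << bit for bit in range(bits) if counts[bit] > 0)


def hamming_distance(a: int, b: int) -> int:
    # Number of differing bits; small values suggest near-duplicates.
    return bin(a ^ b).count("1")


original = "The ISCC is computed from the content itself so anyone can regenerate it"
reformatted = "  the iscc IS computed from the   content itself so anyone can regenerate it "
print(hamming_distance(toy_fingerprint(original), toy_fingerprint(reformatted)))  # prints 0
```

Nothing is stored in the text itself: the fingerprint is recomputed from whatever copy is at hand, and copies that differ only in case or spacing land on the same value. Real edits change a few bits rather than all of them, which is what makes near-duplicate matching possible.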

The ISCC works across all media types (text, images, audio, video) using the same identification system. For a news publisher, this means the same standard covers press photos, articles, video reports, and podcasts. For an industrial company, it covers manuals, training videos, compliance documents, and service updates.

The standard is developed and maintained as open-source infrastructure by the ISCC Foundation (a Dutch non-profit) and standardised as ISO 24138:2024. Reference implementations (in Python and Rust) are available via the ISCC's GitHub repositories.

A key ecosystem partner building declaration, provenance, and rights infrastructure on top of ISCC is Liccium.

FAIA: Fair AI Attribution

The Fair AI Attribution Framework (FAIA), developed collaboratively by Liccium, the GO FAIR Foundation, and Leiden University with funding from SIDN Fund and TopsectorICT (Netherlands), adds AI transparency declarations on top of ISCC fingerprints. It addresses a gap that no other framework currently fills with this precision:

Structured AI involvement flags. Every piece of content can be classified as Human-Created Content (HCC), AI-Assisted Content (AAC), or AI-Generated Content (AIG). This directly maps to the distinctions the EU AI Act (Article 50) requires.

Granular activity codes. Beyond the flags, FAIA defines how AI was used: co-creation, contribution, enhancement, refinement, transformation, or analysis. A journalist who uses AI to transcribe an interview and then writes the article from scratch can declare that precisely. A media company that uses AI for translation but maintains human editorial control can document that distinction. The codes are media-agnostic and designed to integrate with domain-specific vocabularies such as the STM Association's classification for academic publishing or IPTC standards for news photography.

System attribution. Optionally, declarations can specify which AI model and provider were used (e.g., "Claude 3.5 Sonnet, Anthropic" or "ChatGPT 5.2, OpenAI"). This supports training data filtering – AI developers can identify and exclude content generated by specific systems – and regulatory compliance.

Machine-readable rights signaling. FAIA declarations can also include opt-out statements for AI training, licensing conditions, and usage permissions. This is the bridge between content provenance and the emerging infrastructure for machine-to-machine content commerce.

All FAIA declarations are digitally signed, timestamped, expressed as JSON-LD manifests aligned with established vocabularies (Schema.org, Dublin Core, PROV), and published to a federated peer-to-peer registry. The trust model is anchored in eIDAS-qualified certificates for organizations or W3C Verifiable Credentials for individuals – European trust infrastructure, governed under European regulation.
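
As a rough illustration of what such a declaration carries, the sketch below models the involvement flag, activity codes, system attribution, and an opt-out as a small Python structure serialised to JSON. All field names here are hypothetical, the ISCC value is a placeholder, and the output is plain JSON rather than a real JSON-LD manifest, since the actual FAIA vocabulary is not reproduced in this article.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class AIInvolvement(str, Enum):
    HCC = "Human-Created Content"
    AAC = "AI-Assisted Content"
    AIG = "AI-Generated Content"


@dataclass
class Declaration:
    iscc: str                                             # content fingerprint
    involvement: AIInvolvement                            # one of the three flags
    activities: list[str] = field(default_factory=list)   # e.g. "refinement"
    ai_system: Optional[str] = None                       # optional system attribution
    ai_training_opt_out: bool = False                     # rights signal
    declared_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        payload = asdict(self)
        payload["involvement"] = self.involvement.name    # store the short code
        return json.dumps(payload, indent=2, sort_keys=True)


declaration = Declaration(
    iscc="ISCC:EXAMPLE-PLACEHOLDER",                      # placeholder, not a real code
    involvement=AIInvolvement.AAC,
    activities=["refinement"],
    ai_system="Claude 3.5 Sonnet, Anthropic",
    ai_training_opt_out=True,
)
print(declaration.to_json())
```

A production declaration would additionally be signed and timestamped by a trusted party; this sketch only shows the shape of the data.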

CommonsDB: The Proof It Works at Scale

CommonsDB is concrete evidence that this architecture functions in production. Funded by the European Commission and led by Open Future in collaboration with Liccium, Europeana Foundation, Wikimedia Sverige, and the Institute for Information Law (IViR), CommonsDB is a prototype registry of rights declarations for public domain and openly licensed works.

As of May 2026, the registry holds over 1,750,000 declarations from Europeana and Wikimedia Sweden, with a target of 5 million declarations by mid-2026. The workflow is operational: institutions generate ISCCs from their content, sign declarations with verified credentials, and submit them via API. The CommonsDB Explorer at registry.commonsdb.org allows anyone to upload a file and check whether matching declarations exist.
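
The lookup step, checking whether a file matches an existing declaration, can be sketched as a nearest-match search over fingerprints. This is a toy in-memory model under stated assumptions: the 64-bit integer fingerprints, the entry fields, the institution names, and the distance threshold are all invented for illustration and do not describe the CommonsDB API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegistryEntry:
    fingerprint: int   # hypothetical 64-bit similarity hash
    declarant: str
    rights: str


def find_matches(registry, query, max_distance=4):
    """Return entries within `max_distance` differing bits of `query`, closest first."""
    matches = []
    for entry in registry:
        distance = bin(entry.fingerprint ^ query).count("1")
        if distance <= max_distance:
            matches.append((distance, entry))
    return [entry for _, entry in sorted(matches, key=lambda pair: pair[0])]


registry = [
    RegistryEntry(0xF0F0F0F0F0F0F0F0, "Example Museum", "public domain"),
    RegistryEntry(0x0123456789ABCDEF, "Example Archive", "CC BY-SA 4.0"),
]

# One bit differs from the first entry, as a re-encoded copy might.
query = 0xF0F0F0F0F0F0F0F1
hits = find_matches(registry, query)
print([hit.declarant for hit in hits])  # prints ['Example Museum']
```

A linear scan is fine for a demonstration; a registry holding millions of declarations would need an index built for approximate matching.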

This is not a whitepaper concept. It is running infrastructure, processing real content from major European cultural institutions, with a public API and a governance study addressing long-term sustainability within the EU data space.

How This Compares to C2PA

The most widely discussed provenance framework is the Coalition for Content Provenance and Authenticity (C2PA), backed by Adobe, Amazon, Google, and other big players. C2PA embeds cryptographically signed "Content Credentials" directly into media files, recording provenance – who created the content, which tools were used, how it was modified.

C2PA is well-engineered and gaining real traction. Camera manufacturers are beginning to sign images at capture. Generative AI platforms attach credentials to outputs. A Conformance Program launched in mid-2025 to formalize trust. The standard is being submitted to ISO.

The two approaches are complementary but differ in architecturally significant ways, as this tabular overview shows:

Feature	C2PA	ISCC + FAIA
Provenance binding	Embedded in the file (breaks when metadata is stripped)	Derived from content itself (survives metadata stripping, re-encoding, format conversion)
AI transparency	Records tool/edit actions; no structured AI involvement taxonomy	Three-level flags (HCC/AAC/AIG) + granular activity codes + system attribution
Trust infrastructure	C2PA Trust List under Linux Foundation / Joint Development Foundation governance	eIDAS-qualified certificates + W3C Verifiable Credentials (EU-governed)
Declaration model	Manifest embedded in or attached to the asset	External JSON-LD manifests bound to ISCC fingerprints via publicly accessible, federated registries
Registry	No central registry; relies on individual manifest repositories (in the media file)	Federated P2P registries with local node operation for high-volume users
Rights signaling for AI	Not natively addressed	Supports opt-out declarations for AI training, licensing conditions, usage permissions
Standards basis	C2PA specification (industry consortium)	ISO 24138:2024 (ISCC), JSON-LD, Schema.org, PROV, eIDAS
Live deployment	Camera/platform integrations (Adobe, Google, Samsung, etc.)	CommonsDB: 1,750,000+ declarations (Europeana, Wikimedia); FAIA registry in proof-of-concept (faia.io, optout.directory)

Three structural differences deserve particular attention:

Metadata Fragility vs. Content-Derived Identity

C2PA credentials are embedded in the file container. When platforms strip metadata – which most social media platforms routinely do – the provenance chain breaks. C2PA proposes "soft bindings" and external manifest repositories based on proprietary watermarking technology as workarounds, but the fundamental vulnerability remains. ISCC sidesteps this entirely: the identifier is computed from the content, so it can be regenerated by anyone at any time.

Trust Governance

C2PA's Conformance Program governs a curated Trust List of approved Certification Authorities. This governance sits under the Linux Foundation's Joint Development Foundation structure. For European organizations operating under eIDAS, GDPR, and the EU AI Act, this creates a dependency on a trust infrastructure that they don't control and that is incompatible with the European trust model. FAIA's trust model uses existing European trust infrastructure directly: eIDAS-qualified certificates that European organizations already use for other regulatory purposes.

AI Attribution Granularity

C2PA records what happened to a file but does not provide a structured vocabulary for how AI was involved in content creation. In a regulatory environment where the EU AI Act mandates disclosure of AI-generated content, this granularity gap matters. FAIA was designed specifically to close it.

The approaches are not necessarily mutually exclusive – Liccium's declaration structure can reference C2PA assertions, and both can coexist. But they address different layers of the problem with different architectural assumptions.

"O'Reilly Answers" As A Proof of Concept

The infrastructure described above – content identification, AI attribution, rights signaling – is one part of the picture. The other part is what happens when content is actually consumed by AI systems and monetized at the point of retrieval.

Florent Daudens, writing in his newsletter AI in the News in February 2026, describes three layers that must exist for AI-era content monetization to work: rights and permissions (machine-readable), access and enforcement (programmable gateways for AI agents), and payment and value exchange (automated micropayments).

Shuwei Fang's analysis in Radically Informed describes the same structural shift: when content becomes infinitely replicable at near-zero cost, value migrates from artifacts to capabilities: verification, sense-making, structured knowledge.

Now O'Reilly Media, in partnership with Miso Technologies (miso.ai), has built what is arguably the most complete working implementation of this full stack. It demonstrates that the model Daudens and Fang describe is not theoretical; it is already generating revenue.

Layer 1: Rights and Attribution – Per-Answer Royalties

Since 2020, O'Reilly and Miso have been building O'Reilly Answers, a RAG-based system that answers technical questions exclusively based on knowledge from O'Reilly's curated library. The system uses open-source models (currently Llama 3) fine-tuned as a pipeline of specialized workers for research, reasoning, and writing – not a single massive model trained on authors' content.

The critical innovation is forensic attribution. Miso performed deep chunking and metadata-mapping of every book in O'Reilly's catalog, paragraph by paragraph, recording source title, chapter, section, and proximity to code and figures. Every answer includes citations and deep links to source material. And because the attribution pipeline is precise, O'Reilly can calculate the contribution percentage of each author's work in each answer and pay royalties accordingly.
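
How a contribution percentage might turn into a payout can be shown with a small worked example. O'Reilly's actual formula is not public, so the proportional split, the per-chunk weights, and the per-answer pool below are assumptions made purely for illustration.

```python
from collections import defaultdict


def royalty_split(cited_chunks, pool_cents):
    """Split `pool_cents` across authors in proportion to their cited chunk weight."""
    weight_by_author = defaultdict(float)
    for author, weight in cited_chunks:
        weight_by_author[author] += weight
    total = sum(weight_by_author.values())
    if total == 0:
        return {}
    return {
        author: pool_cents * weight / total
        for author, weight in weight_by_author.items()
    }


# Hypothetical answer citing four chunks from three authors.
cited = [("Author A", 3.0), ("Author B", 1.0), ("Author A", 1.0), ("Author C", 3.0)]
shares = royalty_split(cited, pool_cents=8.0)
print(shares)  # prints {'Author A': 4.0, 'Author B': 1.0, 'Author C': 3.0}
```

The point of the sketch is the dependency: without paragraph-level source metadata there are no weights to sum, and without weights there is nothing to pay out against.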

This is not a symbolic gesture: The system already serves 2.5 million paying subscribers. According to O'Reilly's chief product officer Julie Baron, the partnership with Miso is "really lucrative." Authors receive ongoing royalties based on actual usage of their knowledge – not a one-time licensing fee for archive access, but per-interaction compensation that scales with value delivered.

Tim O'Reilly himself has been vocal about the model. In his April 2025 essay "Copyright-Aware AI: Let's Make It So," he wrote that if O'Reilly can build a system that tracks usage and pays royalties with far more limited resources than OpenAI, then there is no excuse for larger AI companies to follow a different (i.e. exploitative) approach. He has called for MCP extensions that enable copyright-aware negotiation between AI systems and content providers.

Layer 2: Access via MCP – Content Inside the Developer's Workflow

In November 2025, O'Reilly launched an MCP server that integrates directly into Cursor, Claude Code, and VS Code. A developer debugging a Kubernetes issue at 2am no longer needs to leave their IDE, navigate to oreilly.com, and search manually. They ask a question inside their development environment and get an answer grounded in O'Reilly content – with citations and deep links to source material.

This is the shift Daudens describes: from content as page (requiring a human to navigate to a website) to content as API (available wherever the user is working, consumed by AI agents on behalf of humans). O'Reilly's MCP server is available to enterprise customers, with planned expansions to granular chapter- and section-level search, full conversational interactions, skill mapping, and hands-on practice – all accessible programmatically through AI tools.
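
At the wire level, "content as API" means the developer's tool sends a structured request instead of a person loading a page. MCP is built on JSON-RPC 2.0, and tools are invoked with the `tools/call` method. The sketch below builds such a message; the tool name `search_library` and its arguments are hypothetical and do not describe O'Reilly's actual server.

```python
import json
from itertools import count

_request_ids = count(1)


def build_tool_call(tool_name: str, arguments: dict) -> str:
    """Serialise an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    message = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(message)


# Hypothetical tool name and argument schema, for illustration only.
request = build_tool_call(
    "search_library",
    {"query": "Kubernetes pod stuck in CrashLoopBackOff"},
)
print(request)
```

In practice an MCP client library handles transport, initialization, and tool discovery; the message shape is shown here only to make clear that the "reader" is now a program.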

The customer surface expands dramatically. O'Reilly's traditional customer was someone who bought a book or a platform subscription. Their new customer is a developer who never leaves their code editor. The knowledge that used to be trapped in a book, behind a link, within a chapter, is now conversational, contextual, and available exactly where the developer is working.

Layer 3: Payment – Royalties That Flow with Usage

The payment layer at O'Reilly is currently internal; royalties flow to authors within the platform's subscription model. But the architecture demonstrates an important general principle: every answer has an attribution trail, every attribution has a value calculation, every value calculation produces a royalty payment.

The emerging external payment infrastructure – particularly the x402 protocol developed by Coinbase and integrated with Stripe in February 2026, with Cloudflare building native support – would extend this model beyond a single platform. When an AI agent requests content, the server responds with a price. The agent pays automatically in fractions of a cent, receives proof of payment, and gets the content. No invoices, no subscription management, no human in the loop.
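
The request, price, pay, retry loop can be simulated in a few lines without any network or real payment. This is a toy model of the flow described above, not the x402 wire format: the field names, the price, and the proof structure are assumptions, and the only real element is HTTP status 402 (Payment Required).

```python
from http import HTTPStatus

PRICE_MICRO_USD = 500  # hypothetical price: 0.05 cents per request


def serve(path, payment_proof=None):
    """Toy content server: quote a price first, deliver once a valid proof is attached."""
    paid = payment_proof is not None and payment_proof.get("amount_micro_usd", 0) >= PRICE_MICRO_USD
    if not paid:
        return HTTPStatus.PAYMENT_REQUIRED, {"price_micro_usd": PRICE_MICRO_USD}
    return HTTPStatus.OK, {"content": f"body of {path}", "receipt": payment_proof["id"]}


def agent_fetch(path, budget_micro_usd):
    """Toy agent: retry with a payment proof if the quoted price fits the budget."""
    status, body = serve(path)
    if status is HTTPStatus.PAYMENT_REQUIRED:
        price = body["price_micro_usd"]
        if price > budget_micro_usd:
            return None  # too expensive; a real agent might negotiate or skip
        # Stand-in for a settled payment; no money moves in this sketch.
        proof = {"id": "demo-payment-1", "amount_micro_usd": price}
        status, body = serve(path, payment_proof=proof)
    return body if status is HTTPStatus.OK else None


result = agent_fetch("/articles/eu-regulation", budget_micro_usd=1000)
print(result["receipt"])  # prints demo-payment-1
```

The budget check is the part that replaces the human in the loop: the agent decides per request whether the content is worth the quoted price.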

O'Reilly has built the proof of concept within a closed ecosystem. The open infrastructure (RSL for machine-readable licensing, MCP for structured access, x402 for automated payments, ISCC/FAIA for content identity and rights declarations) would make this model available to any publisher, any content producer, any organization with structured knowledge.

The Case for Opening up the O'Reilly Pattern

O'Reilly has proven the concept works. But O'Reilly solved the problem for one publisher, within one platform, for one content domain. The real challenge is what happens when you try to scale this pattern across an entire ecosystem.

Consider the following three problems that O'Reilly's solution cannot address:

The Interoperability Problem

O'Reilly's attribution system tracks how their content is used within their platform. But what happens when an AI agent needs to reason across multiple knowledge sources – O'Reilly for technical documentation, Reuters for financial data, DW for geopolitical analysis, and a medical publisher for clinical guidelines?

Each publisher building their own proprietary fingerprinting and attribution system creates the same fragmentation that plagued digital rights management for two decades. How does content from one database speak to content from another? ISCC provides the answer: a single, open, ISO-standardized identifier that works the same way regardless of who generated it, which platform hosts the content, or what media type it is.

When every publisher fingerprints their content with the same open standard, cross-platform attribution becomes technically trivial rather than requiring bilateral integration agreements between thousands of parties.

The Contamination Problem

AI systems are already being trained on AI-generated content, often unknowingly. This leads to what researchers call "model collapse" – a progressive degradation of model quality as errors, biases, and artifacts from synthetic content are amplified across training generations. The problem is invisible without systematic labeling: when AI-generated text is indistinguishable from human-created text, how does a dataset curator know what they're training on?

FAIA's three-level classification (Human-Created, AI-Assisted, AI-Generated) directly addresses this. When declarations are bound to content fingerprints and published to a queryable registry, AI developers can systematically filter their training corpora – prioritizing human-created content, assessing AI-assisted material case by case, and excluding fully synthetic content where required.
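
A dataset curator's filter built on these flags could be as simple as the following sketch. The fingerprints are placeholder strings, the flag lookup is a plain dictionary standing in for a registry query, and the policy of dropping undeclared items is an assumption of this example rather than something FAIA prescribes.

```python
def filter_training_corpus(documents, flag_by_fingerprint):
    """Split documents into (keep, review) lists by declared AI involvement."""
    keep, review = [], []
    for fingerprint, text in documents:
        flag = flag_by_fingerprint.get(fingerprint)  # None when nothing is declared
        if flag == "HCC":
            keep.append(text)
        elif flag == "AAC":
            review.append(text)
        # "AIG" and undeclared items are excluded in this sketch.
    return keep, review


documents = [
    ("fp-1", "reported feature"),
    ("fp-2", "machine-translated brief"),
    ("fp-3", "synthetic filler"),
    ("fp-4", "unknown origin"),
]
flags = {"fp-1": "HCC", "fp-2": "AAC", "fp-3": "AIG"}

keep, review = filter_training_corpus(documents, flags)
print(keep, review)  # prints ['reported feature'] ['machine-translated brief']
```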

Without this infrastructure, the training data supply chain remains opaque, and model collapse becomes a structural risk for the entire AI ecosystem.

The Unauthorized Use Problem

O'Reilly can control access within their platform. But across the open web, the situation is far less orderly. Miso's Project Sentinel has documented that even when publishers explicitly block AI crawlers through robots.txt, workaround prompts can retrieve their articles 90% of the time through proxy scrapers and third-party "content partners" the publisher never agreed to.

Machine-readable rights declarations bound to content fingerprints (and not to file metadata that can be stripped or robots.txt that can be circumvented) provide the only durable mechanism for publishers to signal their terms. When a declaration states that content is human-created, available for AI retrieval under specific licensing conditions, and not available for training, that declaration persists regardless of how many times the content is copied, converted, or redistributed. It travels with the content's identity, not with its container.

The Shared Infrastructure That Makes It Scale

These are not problems any single publisher can solve alone. They require shared infrastructure: open standards for identification, a common vocabulary for AI attribution, federated registries for rights declarations. This is what ISCC, FAIA, and the CommonsDB architecture provide.

Content identity (ISCC) replaces proprietary chunking metadata with an open, ISO-standardized fingerprint that works across publishers, platforms, and media types – the common language that makes cross-platform attribution possible.

AI attribution and rights declarations (FAIA) make it possible for any organization to declare provenance, AI involvement, and usage terms in a machine-readable format – not just within one platform, but across the entire ecosystem. They give AI developers the information they need to maintain training data quality, and they give regulators the verifiable records that the EU AI Act demands.

Federated registries provide the infrastructure for querying declarations at scale – what O'Reilly's internal attribution database does for one publisher, but open, distributed, and available to anyone.

The applications extend wherever structured knowledge has value. A media outlet covering financial regulation could have their verified, attributed content available inside a compliance officer's workflow – the same way O'Reilly's technical content is available inside a developer's IDE. A health publisher could make their evidence base queryable in a clinician's tools. A broadcaster like DW could make their multilingual interview archives, policy analyses, and expert briefings available as structured, queryable knowledge for AI retrieval – with provenance, AI transparency, and licensing terms attached and verifiable.

The general pattern would look like this:

→ Identify content.
→ Declare provenance and rights.
→ Make it accessible to AI agents.
→ Get compensated when it's used.

O'Reilly has proven it works for one publisher. The open standards make it work for thousands.

What to Do Now

For Media Producers and Journalists

Content provenance is becoming a requirement, not an option. The EU AI Act creates disclosure obligations. FAIA provides a practical path to compliance anchored in European trust infrastructure. Start by checking out the CommonsDB Explorer at registry.commonsdb.org to see how ISCC-based declarations work in practice.

For Developers and Tech Teams

The Liccium Declaration API is operational at https://dev.liccium.com. ISCC generation is available as open-source Python and Rust implementations. Integration into existing content pipelines follows a straightforward pattern: generate ISCC from content, construct declaration metadata, sign with organizational certificate or verifiable credential, submit via API. The O'Reilly MCP server implementation offers a concrete reference for how structured content access works in developer tools.
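
The four steps of that pattern can be walked through in a self-contained sketch. Every piece here is a stand-in and labeled as such: a plain SHA-256 replaces the ISCC (it is not similarity-preserving), an HMAC with a shared secret replaces a certificate-based signature, the field names are hypothetical, and the final request body is only built, not sent, because the Liccium API's endpoints and schema are not described in this article.

```python
import hashlib
import hmac
import json


def content_id(data: bytes) -> str:
    # Step 1 placeholder: a plain SHA-256, which is NOT a similarity-preserving ISCC.
    return "demo:" + hashlib.sha256(data).hexdigest()


def build_declaration(data: bytes, declarant: str, ai_involvement: str) -> dict:
    # Step 2: construct declaration metadata (hypothetical field names).
    return {
        "content_id": content_id(data),
        "declarant": declarant,
        "ai_involvement": ai_involvement,
    }


def sign(declaration: dict, secret: bytes) -> dict:
    # Step 3 stand-in: HMAC over canonical JSON instead of a certificate signature.
    canonical = json.dumps(declaration, sort_keys=True, separators=(",", ":")).encode("utf-8")
    signature = hmac.new(secret, canonical, hashlib.sha256).hexdigest()
    return {"declaration": declaration, "signature": signature}


def verify(envelope: dict, secret: bytes) -> bool:
    expected = sign(envelope["declaration"], secret)["signature"]
    return hmac.compare_digest(expected, envelope["signature"])


envelope = sign(build_declaration(b"article text", "Example Newsroom", "HCC"), b"demo-secret")
request_body = json.dumps(envelope)  # Step 4: what a client would submit to a declaration API
print(verify(envelope, b"demo-secret"))  # prints True
```

Swapping each stand-in for the real component (an ISCC library, an eIDAS certificate or verifiable credential, the registry's API client) leaves the overall shape of the pipeline unchanged.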

For Content Strategists and Decision Makers

The organizations that structure their knowledge now (with verifiable provenance, machine-readable rights, and AI transparency declarations) will be positioned to participate in the emerging content commerce infrastructure. O'Reilly demonstrates this is not speculative: their knowledge now reaches customers they never had access to before. The question for every organization with valuable structured information is whether they want to start building toward this – or wait until the infrastructure passes them by.

The Clock Is Ticking on AI Transparency

The FAIA whitepaper (to be published in June 2026) notes that the next 12 to 24 months will determine whether AI transparency becomes a verifiable layer of the digital ecosystem – or an afterthought applied once opacity is already entrenched. The window of opportunity is narrowing. The time to build something sustainable is now.

Author

Mirko Lorenz