TL;DR
AI training data is becoming a scarce and priced asset, according to Thorsten Meyer AI’s Control Series, as public web text approaches practical limits and copyright cases reshape access. The confirmed record includes Epoch AI’s token estimates, Anthropic’s $1.5 billion authors settlement and a growing market for licensed and proprietary datasets; the exact timeline and legal rules remain unsettled.
AI’s competitive fight is shifting toward control of training data as public web text nears practical limits and major datasets are being priced, licensed, litigated or guarded as proprietary assets, according to Thorsten Meyer AI’s latest Control Series installment.
The report cites Epoch AI’s estimate that the public internet contains about 300 trillion tokens of high-quality text and that frontier models are already training on datasets approaching that ceiling. Epoch AI projects that the stock of public human text could be fully used between 2026 and 2032, with a median around 2028. Elon Musk made a broader claim in early 2025, saying the cumulative sum of human knowledge had been essentially exhausted for AI training.
The confirmed legal shift is clearest in Anthropic’s $1.5 billion settlement with authors over pirated books. The case drew a line between training on legally acquired books, which the judge described as fair use, and downloading millions of pirated books, which was not protected. The settlement was reported at about $3,000 per work across roughly 500,000 titles and required Anthropic to destroy the pirated files. It covers past piracy claims, not future training practices or model outputs.
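The reported settlement figures can be checked against each other with simple arithmetic. The short Python sketch below uses only the approximate numbers quoted above (about $3,000 per work, roughly 500,000 titles, a $1.5 billion total). The constant and function names are illustrative and do not come from the report.

```python
# Figures as reported in the article; the per-work payment and the
# title count are both approximate.
PAYMENT_PER_WORK_USD = 3_000
REPORTED_TITLES = 500_000
REPORTED_TOTAL_USD = 1_500_000_000


def implied_total(per_work: int, titles: int) -> int:
    """Return the total payout implied by a flat per-work payment."""
    return per_work * titles


total = implied_total(PAYMENT_PER_WORK_USD, REPORTED_TITLES)
print(f"Implied total: ${total:,}")
print(f"Matches reported total: {total == REPORTED_TOTAL_USD}")
```

The per-work and title figures multiply out to the headline total, so the three reported numbers are consistent with one another. Because the inputs are rounded, the check says nothing about the exact final payout.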
The source material also points to a market shift away from scrape-first data collection. The New York Times case against OpenAI remains in discovery, while News Corp and other publishers have moved toward licensing arrangements. On the commercial side, Meta’s reported $14.3 billion deal for 49% of Scale AI was cited as a warning for companies that share valuable data with vendors that could later compete with them.
Data: The One Thing You Can’t Rent
The free part of “all human knowledge” is running out. As compute and models commoditize, the corpus you can’t replicate becomes the moat — so data is being fenced, priced, and, in places, treated as a national asset.
Data was supposed to be the abundant input. It’s the scarce one. It’s also the chokepoint you can actually own, so guard your proprietary data, and don’t hand it to a provider who can become your competitor (the lesson Scale AI’s customers learned when they fled). For nations: license it like Ukraine, keeping the model and the leverage.
Data Becomes the Moat
The shift matters because compute can be rented, but unique datasets cannot be copied on demand. Thorsten Meyer AI notes that H100 rental prices have fallen 60% to 75% from their peak, making raw computing access less of a differentiator than it was during the first wave of large model growth. If models and chips become easier to obtain, the dataset beneath a model becomes a more durable source of advantage.
For AI companies, licensed content and expert-generated data raise costs and may favor incumbents with large cash reserves. For publishers, authors and rights holders, the same shift creates leverage in licensing talks. For enterprises, the message is more direct: proprietary records, workflows, customer interactions and expert judgments may be strategic assets, not free inputs for a vendor’s future model.
Scraping Gives Way to Licensing
AI developers long relied on large crawls of public web material, then sorted out legal challenges later. The new pressure comes from two directions at once: public text may be nearing saturation for frontier training, and rights holders are demanding payment or control.
The report also says the most valuable remaining datasets are no longer broad public text collections. They are paywalled archives, enterprise records, expert-authored judgments, real-world sensor data and military or sovereign information. The move toward reinforcement learning and reasoning models has increased demand for lawyers, physicists, surgeons and other specialists who can define what a good answer looks like.
“Data was supposed to be the abundant input. It’s the scarce one.”
— Thorsten Meyer AI’s Control Series

Unsettled Rules for Training Data
It is not yet clear when usable public text will hit a hard practical limit, or whether better algorithms and synthetic data will delay that point. Synthetic data is already in use: the report cites Nvidia’s $320 million Gretel acquisition and Microsoft’s training on large synthetic datasets. Research has warned, however, that repeated training on machine-generated material can compound errors in hard-to-check areas.
The legal picture is also incomplete. Anthropic’s settlement resolved specific past piracy claims, while cases involving OpenAI and other companies continue. Future rules for model outputs, derivative works, data retention and enterprise data ownership are still developing.

Licensing Deals Set the Pace
The next phase is likely to be shaped by licensing contracts, court rulings and stricter data-governance terms. Companies will need to decide which data can be shared with AI providers, who keeps rights to models trained on it and whether vendors can reuse that material for other customers.
For governments and defense users, the report points to Ukraine’s handling of battlefield data as a sign that some datasets may be treated as sovereign assets rather than commercial inputs.

Key Questions
What is the main news in this development?
The main development is that AI training data is becoming a priced and protected asset as public web text approaches practical limits and legal pressure pushes companies toward licensing.
Has AI already run out of data?
No. The claim is narrower: Epoch AI estimates that high-quality public human text could be fully used for frontier training sometime between 2026 and 2032. That is a projection, not a confirmed endpoint.
What did the Anthropic settlement resolve?
Anthropic settled authors’ piracy claims for $1.5 billion and agreed to destroy pirated files. The settlement did not resolve future training rules or questions about model outputs.
Why can’t companies rely only on synthetic data?
Synthetic data can help, but the source material cites the risk of model collapse when systems train repeatedly on machine-generated content, especially where answers are hard to verify.
What should businesses take from this?
Businesses should treat proprietary data as a strategic asset and review contracts before sharing it with AI vendors, especially terms covering training rights, retention, reuse and derivative models.
Source: Thorsten Meyer AI