Monday, June 29, 2026

We're Running Out of Internet to Train AI On. Here's What Comes Next

 

There's a problem the AI labs don't put on the keynote slides. The big language models are running low on fuel.

They learned to write by reading almost everything humans have ever published. Books, code, forums, the whole internet. That well is close to dry. You can't double the size of human writing on demand, and the easy answer - train models on text other models wrote - quietly poisons them. Quality degrades. The industry has a name for it now: model collapse. (I unpacked that failure mode here → Habsburg Jaw in making - Model Collapse in AI.)

So the question that actually decides the next decade isn't "how much bigger can the models get." It's "where does the next training data come from when the internet runs out?"

The most serious answer going right now is this: stop collecting data. Start generating experience. That's what world models do.

A different kind of model

A language model predicts the next word. A world model predicts the next state of an environment: what happens when you turn the wheel, drop the glass, or brake on ice. It learns physics and cause and effect, then lets a machine rehearse actions inside its own simulation before doing them for real.
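To make that concrete, here is a minimal Python sketch of a world model for a toy environment. Everything in it is an illustrative assumption, not how any production system works: the ten-cell corridor, the `real_step` function, and the `TabularWorldModel` class are hypothetical names invented for this example. The model watches transitions, learns which state follows which action, and then rehearses a plan entirely inside its own learned table.

```python
import random
from collections import Counter, defaultdict


def real_step(state, action):
    """The 'real world' (hypothetical): a corridor of cells 0..9 with walls at both ends."""
    return max(0, min(9, state + action))


class TabularWorldModel:
    """Learns next-state predictions purely from observed transitions."""

    def __init__(self):
        # (state, action) -> how often each next state was observed
        self.counts = defaultdict(Counter)

    def observe(self, state, action, next_state):
        self.counts[(state, action)][next_state] += 1

    def predict(self, state, action):
        """Return the most frequently observed next state, or None if never seen."""
        seen = self.counts.get((state, action))
        if not seen:
            return None
        return seen.most_common(1)[0][0]

    def rehearse(self, state, plan):
        """Roll a plan forward inside the model without touching the real world."""
        for action in plan:
            state = self.predict(state, action)
            if state is None:
                return None  # the model has no experience to draw on here
        return state


# Collect experience from the real environment once.
rng = random.Random(0)
model = TabularWorldModel()
for _ in range(500):
    s = rng.randrange(10)
    a = rng.choice((-1, 1))
    model.observe(s, a, real_step(s, a))

# Rehearse a plan in imagination, then compare it with reality.
plan = [1, 1, 1, 1, 1]
imagined = model.rehearse(2, plan)

actual = 2
for a in plan:
    actual = real_step(actual, a)

print("imagined:", imagined, "actual:", actual)
```

Real world models replace the lookup table with a neural network and the corridor with video or sensor streams, but the loop is the same: observe, learn the dynamics, rehearse before acting.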

The idea is older than the hype. David Ha and Jürgen Schmidhuber published a paper called "World Models" back in 2018, where an agent learned to play a game inside its own dreamed-up version of it. What's new is that the models finally got good enough to matter outside a lab and that the data wall gave everyone an urgent reason to care.

2026 is when it left the lab

Watch what shipped in the last six months, because the timing isn't a coincidence:

🔹 Google DeepMind released Project Genie to the public in January. Type a sentence, walk around a playable 3D world in real time. By May, it connected to Street View.

🔹 Nvidia launched Cosmos 3 on June 1 - an open model built specifically to train robots and self-driving cars inside generated worlds, with a coalition of robotics companies around it.

🔹 Waymo built its own world model in February to create the dangerous driving situations it can't safely film on real roads. Wayve shipped GAIA-3 for the same reason.

🔹 Fei-Fei Li's World Labs opened Marble to the public. Odyssey raised $310M.

And the loudest signal: Yann LeCun bet his next chapter on the claim that scaling language models is a dead end, and that models which learn the structure of the world are the real path forward.

These aren't six companies chasing a demo. They're six answers to the same shortage.

Why this is a business story, not a robotics one

Here's the move that matters for anyone running a company. Real-world data is slow, expensive, and often impossible to gather - a billion driving miles, a million robot grasps, the rare disaster you can't stage. A world model lets you manufacture that experience. Simulate the edge cases by the million, overnight, for the cost of compute instead of years and lives.

If that sounds abstract, connect it to something you already use: the digital twin. A digital twin tells you what your factory or supply chain looks like right now. A world model is the layer that predicts what it does next - so you can test a change a thousand ways before spending a dollar in the real world.

That's why this spreads well past cars and robots: manufacturing lines, warehouse logistics, supply-chain shocks, financial scenarios, anywhere with humanoids or autonomy on the roadmap. The common thread is rehearsal - deciding after you've seen it play out, not before.

The divide it creates

For three years, the AI advantage was about generating content faster. The next advantage is quieter and harder to copy: generating experience. The companies that learn to simulate their own reality will train, test, and decide faster than the ones still waiting to collect data from the real one. That gap compounds the way cloud and data infrastructure did - invisibly, until a competitor is simply moving at a speed you can't match and you can't quite explain why.

LLMs hit a data wall. World models walk around it by building their own.

So the question worth sitting with: when the easy data runs out, will your company still be waiting to collect more - or generating what it needs?

 


Saturday, June 20, 2026

Your AI Isn't Stuck on Technology, It's Stuck on You

 

One of the largest banks in the US ran 47 AI pilots last year. I asked a senior exec there how many had changed the real number on the P&L. He thought about it, then said: "One. Maybe."

Forty-seven experiments. One result. That gap is the whole story of AI in 2026, and almost nobody is naming it correctly.

For three years, the questions were about technology. Which model? Build or buy? How do we do RAG? Are we behind? Fair questions in 2023. They've quietly stopped mattering. The models are good enough, cheap enough, and available to everyone, including your competitors. The barrier to building something with AI has fallen to roughly zero.

So why is the promised return still stuck in pilot purgatory?

Because deploying AI is easy. Changing how a company works is hard. And the second one is the actual job.

AI doesn't fix your company. It exposes it

The part executives don't want to hear: AI amplifies whatever was already there. Run it on top of a sharp, fast-deciding organization, and it compounds the speed. Run it on top of unclear ownership and slow approvals, and all you've done is generate confusion faster.

The pilot that summarizes contracts works fine in the demo. Then it dies in the org. Legal doesn't trust the output, nobody decides who's accountable when it's wrong, and the old manual process still runs in parallel "just to be safe." None of that is a data-science problem. You can't prompt your way out of an org chart.

That bank's 47 pilots weren't a technology achievement. They were a museum. Lots of impressive exhibits, nothing in production.

Three places the work actually breaks

Decisions move at committee speed while information moves at machine speed. AI now hands a team a forecast in minutes. Then that forecast waits two weeks for a Thursday steering meeting. When insight is instant and the decision is slow, the speed you paid for evaporates inside your own approval chain.

You're measuring effort in a world that no longer rewards it. Hours worked. Headcount. Tickets closed. Number of pilots launched. These tell you a team is busy. If one person now does what five used to, "busy" is the wrong thing to count. Cycle time, decision velocity, cost avoided, a customer kept. Those are the numbers that moved, and most dashboards don't track them.

The work sits between your departments, but your org chart doesn't. The useful AI workflows cut across marketing and analytics, finance and forecasting, product and support. Your structure still has walls exactly where the value wants to flow. So it doesn't flow.

The one comparison worth holding onto

When factories first wired up to electricity, productivity barely moved. The owners had swapped the steam engine for an electric motor and changed nothing else. Same layout, same workflow, same building designed around one central power source.

The gains came decades later, when a new generation of managers redesigned the whole floor around what electricity made possible: machines anywhere, smaller flexible lines, a different shape of work entirely. The technology had been ready for years. The management caught up late.

We are at the "wired up but unchanged" stage with AI. The motor is bolted in. The factory floor hasn't been touched.

What this asks of you

This is the uncomfortable shift. Your job stops being "what can AI do for us" and becomes "what has to change in how we run for any of that to land."

That means deciding, on purpose, which decisions stay human, which become AI-assisted, and which you'll let a system make on its own - and who is accountable for each. It means killing work, not just speeding it up; half your reports and approvals exist only because automation didn't exist. It means promoting the operator who can redesign a process, not only the engineer who can fine-tune a model.

None of that is exciting. It's slower and more political than buying another tool. Which is exactly why it's the real moat. Anyone can buy the model. Few people are willing to take on the management work around it.

The biggest risk to most executives right now isn't getting out-innovated. It's getting out-managed by a competitor running the same models you have - just inside a company built to actually use them.

So the question I'd put to you: where is your AI actually stuck - the technology, or the way your organization decides, measures, and owns the work? Be honest about which one you've been funding.

Monday, June 15, 2026

Habsburg Jaw in making: Model Collapse in AI

 


AI model collapse is a degenerative process in machine learning in which successive generations of a model degrade in quality because they are trained on synthetic (AI-generated) data rather than original human-created data.

On November 1st, 1700, an entire dynasty of kings came to a crashing end with the death of Charles II of Spain. He was physically and mentally disabled and disfigured. A large tongue made his speech difficult to understand; he was bald by the age of 35, and he died senile and wracked by epileptic seizures. He had two wives, but being impotent, he had no children and thus no heirs. What else do you expect after 16 generations of inbreeding? Training each new model on the output of the last one is the machine version of the same closed bloodline.

How It Happens

To understand model collapse, you have to understand how AI models process probability.

When a generative AI (like an LLM or an image generator) creates content, it assigns probabilities to the possible next words or pixels based on its training, then picks from the likeliest candidates. Because it favors the "most likely" answer, its outputs naturally gravitate toward the average or the median of what it has learned.
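A small Python sketch shows why favoring the likeliest answer squeezes out the unusual ones. The word list and probabilities are made up for illustration, and `apply_temperature` and `tail_mass` are hypothetical helper names. Lowering the sampling temperature is one common way a decoder leans toward likely outputs; here, a temperature of 0.5 cuts the probability left for the two rare words from 0.10 to roughly 0.015.

```python
def apply_temperature(probs, temperature):
    """Rescale a distribution; temperature < 1 sharpens it toward the likeliest items."""
    weights = {word: p ** (1.0 / temperature) for word, p in probs.items()}
    total = sum(weights.values())
    return {word: w / total for word, w in weights.items()}


def tail_mass(probs, common_words):
    """Probability left over for everything outside the common words."""
    return sum(p for word, p in probs.items() if word not in common_words)


# Hypothetical next-word probabilities after "I saw a ..."
next_word = {"cat": 0.60, "dog": 0.30, "axolotl": 0.08, "quokka": 0.02}
common = {"cat", "dog"}

# Greedy decoding always returns the single likeliest word.
greedy_choice = max(next_word, key=next_word.get)

# Low-temperature sampling keeps some randomness but starves the tail.
sharpened = apply_temperature(next_word, temperature=0.5)

print("greedy choice:", greedy_choice)
print("tail mass before:", round(tail_mass(next_word, common), 4))
print("tail mass after: ", round(tail_mass(sharpened, common), 4))
```

Text produced this way mentions axolotls and quokkas far less often than the training data did, which is exactly the skew the next generation of models inherits.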

Generation 0 Training Data

The training data set originates from human-written books, articles, websites, images, audio, video, code, and human conversations.

The model learns the full spectrum of human expression, including rare, weird, and highly diverse "edge cases" (the tails of the statistical distribution).

Generation 1 Training Data

The training data set mostly originates from human-generated sources, with a small amount of machine-generated data. That machine-generated data enters the training set both knowingly and unknowingly.

The model is still good, but it begins inheriting biases and omissions.

Generation 2+ Training Data

Most of the training data now originates from machines. The proportion of human-generated data shrinks significantly with each generation.

Training data sets become generic, and different models feed each other. As a result, rare information disappears and mistakes get reinforced.


Every real-world human dataset has a distribution - a spread of outputs across many possible styles, topics, phrasings, edge cases, and rare examples. When a model trains on this and generates new content, it approximates that distribution but doesn't reproduce it perfectly. The tails - the unusual, the rare, the surprising - get underweighted. When the next model trains on those outputs, it further smooths out the tails. Over enough generations, you converge on an overly smooth, narrowly peaked distribution. The model becomes increasingly "average" and loses the ability to represent rare-but-real things.
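The loop described above can be simulated in a few lines. This is a deliberately simplified sketch, not a claim about how any real model is trained: the "model" is just a Gaussian fitted to its training data, each generation trains only on samples drawn from the previous generation's fit, and the function name, sample size, and generation count are all assumptions chosen for illustration.

```python
import random
import statistics


def recursive_training(generations=300, sample_size=20, seed=0):
    """Fit a Gaussian, sample from the fit, refit on those samples, and repeat."""
    rng = random.Random(seed)
    # Generation 0: "human" data drawn from the true distribution N(0, 1).
    data = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
    spreads = []
    for _ in range(generations):
        # "Train" the model: estimate mean and spread from the current data.
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)
        spreads.append(sigma)
        # The next generation sees only what this model generates.
        data = [rng.gauss(mu, sigma) for _ in range(sample_size)]
    return spreads


spreads = recursive_training()
for g in (0, 50, 100, 200, 299):
    print(f"generation {g:3d}: fitted std = {spreads[g]:.6f}")
```

Each fit is slightly off, and the errors do not cancel out. With a finite sample, the estimated spread tends to drift downward over many generations, so the distribution narrows toward a single point: the "overly smooth, narrowly peaked" outcome in miniature.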

Types of Model Collapse

Distribution Collapse – Loss of Tail Distribution

The model forgets rare but important patterns.

Example: The original data contains 1,000 bird species, but AI-generated data mostly discusses common birds. As a result, future models forget the uncommon species.
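The bird example can be sketched as a resampling loop in Python. The numbers are assumptions made for illustration: 1,000 species whose mention frequency follows a Zipf-like curve, and a corpus of 5,000 mentions per generation, where each new corpus is generated from the frequencies observed in the previous one. `simulate_species_loss` is a hypothetical name.

```python
import random
from collections import Counter


def simulate_species_loss(n_species=1000, corpus_size=5000, generations=10, seed=0):
    """Track how many species are still mentioned after each generation."""
    rng = random.Random(seed)
    species = list(range(n_species))
    # Zipf-like start: species ranked r is mentioned in proportion to 1 / r.
    weights = [1.0 / (rank + 1) for rank in species]
    surviving = [n_species]
    for _ in range(generations):
        # Generate a corpus from the current frequencies.
        corpus = rng.choices(species, weights=weights, k=corpus_size)
        counts = Counter(corpus)
        # The next model only knows the frequencies it saw in that corpus.
        weights = [counts.get(s, 0) for s in species]
        surviving.append(len(counts))
    return surviving


surviving = simulate_species_loss()
print(surviving)
```

Once a species draws zero mentions in one generation, its weight is zero forever after, so the count of surviving species can only go down. Rare species are the first to go.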

Error Amplification

Small errors become accepted facts.

Example: Model A hallucinates a historical date, and AI-generated articles repeat it. Model B is trained on Model A's output and learns the wrong date as truth.

Diversity Collapse

Because each model is trained on other models' output, outputs across models become increasingly similar. The result is the same writing style, the same explanations, the same examples, and reduced creativity.

Capability Collapse

The model loses nuanced reasoning abilities.

Examples:

  • Less robust coding
  • Poor edge-case handling
  • Weak scientific reasoning

Symptoms of Model Collapse

The symptoms include:

  • Loss of Variance: The model output becomes incredibly repetitive. In image generation, for example, all faces might start to look like the same generic, heavily averaged face.
  • Disappearance of the "Tails": The model forgets about rare concepts, subcultures, or complex vocabulary.
  • Compounding Hallucinations: If Generation 1 makes a slight factual error, Generation 2 treats that error as a fact and amplifies it. After a few more generations, the model can drift far from reality.
  • Perceptual Blindness: The model loses its grasp of the original distribution of the data, making it very hard for it to "recover" even if human data is reintroduced later.
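The first symptom, loss of variance, is the easiest one to measure. One simple diagnostic for text is the share of distinct n-grams in a batch of outputs, sometimes called distinct-n. The sketch below is a minimal version that assumes whitespace tokenization; the function name and sample sentences are invented for illustration, and a falling score across model generations is a warning sign, not proof of collapse.

```python
def distinct_n(texts, n=2):
    """Share of n-grams in a batch of texts that are unique (1.0 = no repetition)."""
    total = 0
    unique = set()
    for text in texts:
        tokens = text.lower().split()
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0


# Hypothetical outputs from an early model and a later, more repetitive one.
diverse = [
    "the heron waits in the reeds",
    "a kestrel hovers over the verge",
]
repetitive = [
    "the bird is nice",
    "the bird is nice",
    "the bird is good",
]

print("diverse:   ", round(distinct_n(diverse), 3))
print("repetitive:", round(distinct_n(repetitive), 3))
```

Tracking a score like this on a fixed set of prompts, generation after generation, gives an early signal that outputs are converging.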

Is Synthetic Data Always Bad?

The simple answer is no, but....

High-quality synthetic data can be extremely valuable for rare edge cases, safety training, mathematical proofs, coding exercises, and data augmentation.

The danger arises when synthetic data dominates the training data set, when there is no grounding in real-world human data, and when quality controls are weak.

How the AI Industry is Fighting Back

Because high-quality, original human data is a finite resource (often referred to as "peak data"), the industry is actively developing defenses against model collapse:

  • Data Provenance and Watermarking: Developers are working on invisible watermarks for AI text, code, and images. These would let future web scrapers automatically filter out synthetic data and train only on verified human data.
  • Strict Data Curation: AI labs are moving away from "scrape everything" approaches. They are employing massive teams of humans to curate small, incredibly high-quality datasets of original human work.
  • Synthetic Data Filtering: While training exclusively on synthetic data causes collapse, researchers have found that training on a carefully curated mix of human data and high-quality synthetic data can actually be beneficial. The key is ensuring the synthetic data is strictly verified for accuracy.
  • Alternative Modalities: Since text and image data are easily polluted, some researchers are looking toward training models on physical world data (like video, robotics, and sensor data), which is much harder for an AI to fake.
  • Architectural or training innovations: Techniques that better preserve tails or use reinforcement to avoid plateaus.
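The effect of keeping human data in the mix can be illustrated with a toy simulation. This is a sketch under strong assumptions, not a result from any real training run: the "model" is a Gaussian fitted to its training data, the true human distribution is N(0, 1), and `train_generations` and `real_fraction` are hypothetical names. One run trains every generation purely on the previous model's output; the other replaces half of each training set with fresh human samples.

```python
import random
import statistics


def train_generations(real_fraction, generations=300, sample_size=20, seed=0):
    """Return the fitted spread after repeated training on a real/synthetic mix."""
    rng = random.Random(seed)
    n_real = int(sample_size * real_fraction)
    n_synthetic = sample_size - n_real
    # Generation 0 trains on human data only.
    data = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
    sigma = statistics.pstdev(data)
    for _ in range(generations):
        # "Train" the model on the current data set.
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)
        # Build the next data set: model output plus fresh human samples.
        synthetic = [rng.gauss(mu, sigma) for _ in range(n_synthetic)]
        fresh_real = [rng.gauss(0.0, 1.0) for _ in range(n_real)]
        data = synthetic + fresh_real
    return sigma


synthetic_only = train_generations(real_fraction=0.0)
anchored = train_generations(real_fraction=0.5)

print(f"synthetic only: fitted std = {synthetic_only:.6f}")
print(f"half human data: fitted std = {anchored:.6f}")
```

In this toy setup the synthetic-only run loses almost all of its spread, while the anchored run keeps being pulled back toward the real distribution. It says nothing about the right ratio for a real model, only that fresh grounding data changes the dynamics.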

Summary

Model collapse is a serious bottleneck for AI scaling. It suggests that AI cannot replace the need for human creativity and original data. To keep advancing, AI models will need a steady diet of genuine, messy, diverse human input. If the internet becomes an echo chamber of AI talking to AI, the technology will stagnate and degrade.