Neural Lens

A Gentle Introduction to World Models

Yusuf — Sun, 17 May 2026 13:03:13 GMT

Background and History

The model family at the foundation of the AI products revolution is Large Language Models. These models fundamentally operate in the language/token space, learning very complex and high-dimensional semantic meanings from sets of tokens using the enormous amount of text data available on the Internet. These language-centric models have gotten us very far: clever post-training techniques have managed to instill conversational styles, personas, and, more importantly, reasoning and agentic capabilities into AI models. However, one key element that is claimed to be missing from these models is a fundamental understanding of the world and physical reality. Even though one could argue that we are seeing some emerging physical understanding capabilities with the addition of video as an input modality to language models, it is still highly debated whether this approach, essentially just encoding video data into the token embedding space, is anywhere close to teaching models the physical reality of the world. This kind of understanding is considered necessary for both the physical embodiment of AI and a potential superintelligence that would be expected to drive new scientific discoveries on its own.

The field of world models exists to address this gap, aiming to make AI models learn the realities of the world directly. The roots of the concept can be traced back to the very beginnings of Artificial Intelligence as a field in the 20th century. The idea is inspired by human learning and cognitive psychology, which suggest that humans build internal mental representations of the world and use these mental models to guide their actions. While this fundamental concept had been explored historically, the most well-known usage of the term “world model” begins with Jürgen Schmidhuber. He first used the term in a paper published in 1990, within a basic reinforcement learning setting. Then, in the modern era of the field, David Ha, alongside Schmidhuber, cemented the term in the modern deep learning lexicon with a paper titled simply “World Models,” demonstrating a model that could successfully learn from its own “dreams” using its world model in a model-based reinforcement learning setting.

Since then, there have been many works exploring world models, and the term has been used more and more frequently, positioning world models as one of the next frontiers in AI research. However, even though there are many survey papers attempting to categorize world models, I’ll be honest with you, the field is a little confusing. There is no common architecture or approach that defines the category, or even a common problem statement. Many different sets of techniques and models claim to introduce a world model while addressing entirely different problems. At the highest level, the concept is similar, but once you start digging into it, you find that many quite different models are all being called “world models,” despite being technically very distinct from one another.

In this post, I will try my best to introduce the field as clearly as I can and give you the broader picture of what world models actually are.


Generative World Simulators

The most prominent, and perhaps most intuitive, category of world models is generative world simulators: systems explicitly designed to synthesize rich, interactive environments. Broadly, there are two main subcategories:

Video-based interactive world generation: Genie, Odyssey
Static and persistent 3D world generation: Marble

Genie: Video-Based Dynamic Interactive World Generation

Genie is Google DeepMind’s frontier foundation world model. Their latest release is Genie 3, and it’s partially accessible to the public via Project Genie. The model autoregressively generates high-frame-rate video. The main difference from a typical video generation model like Sora is that Genie is interactive: it accepts action inputs from users, who can interact with the generated video (or so-called “world”) in real time.

The model consists of three key parts:

Spatiotemporal video tokenizer, which encodes input videos into latent space for more efficient processing. This means the model doesn’t operate on raw pixels directly. Instead, it processes frames as compressed latent tokens that are later decoded back to pixel space.
Latent action model, which learns to extract actions from the relationship between consecutive video frame pairs. This is used to train the dynamics model. At inference time, real user action inputs replace this latent model.
Autoregressive dynamics model, which predicts the next frame given the action input and past video tokens.
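How these parts fit together at inference time can be sketched as a toy loop. Everything below is a hypothetical stand-in and not Genie’s actual architecture or API: a “frame” is a short row of pixel values, `encode` and `decode` play the role of the tokenizer, and `dynamics_step` simply shifts the latent tokens according to the user’s action. The latent action model is only needed during training, so the sketch leaves it out.

```python
from typing import List

Frame = List[int]   # toy "frame": a row of pixel intensities
Tokens = List[int]  # toy latent: one token per pair of pixels

def encode(frame: Frame) -> Tokens:
    """Toy tokenizer: compress each pair of pixels into one token (their sum)."""
    return [frame[i] + frame[i + 1] for i in range(0, len(frame), 2)]

def decode(tokens: Tokens) -> Frame:
    """Toy decoder: expand each token back into two pixels (lossy)."""
    frame: Frame = []
    for t in tokens:
        frame.extend([t // 2, t - t // 2])
    return frame

def dynamics_step(history: List[Tokens], action: int) -> Tokens:
    """Toy dynamics model: predict the next latent from past latents and an action.

    Here the "prediction" just shifts the latest latent right (+1) or left (-1).
    """
    last = history[-1]
    shift = action % len(last)
    return last[-shift:] + last[:-shift] if shift else list(last)

def rollout(first_frame: Frame, actions: List[int]) -> List[Frame]:
    """Autoregressive generation: each predicted latent is fed back as history."""
    history = [encode(first_frame)]
    frames = []
    for action in actions:
        next_tokens = dynamics_step(history, action)
        history.append(next_tokens)          # the model conditions on its own output
        frames.append(decode(next_tokens))   # pixels are only produced at the end
    return frames

frames = rollout([4, 4, 0, 0, 0, 0, 0, 0], actions=[1, 1, -1])
for frame in frames:
    print(frame)
```

The point of the sketch is the data flow: the loop never touches pixels until decoding, and every new latent depends on the user’s action plus what the model itself generated earlier.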

Genie 3 impressively runs at 720p with 24 frames per second. Unfortunately, we don’t have many details on how they achieved this. The last published paper covers the original architecture for Genie 1, and they have apparently changed quite a lot since then, such as switching from MaskGIT to a diffusion-type architecture, which we can only infer from small clues in their announcement pages. Beyond that, we don’t have much technical knowledge on their frontier world models.

Genie does not have any architectural layer that explicitly models the physics of the world. Physical behavior is rather an emergent capability that comes purely from video training. Another emergent capability is long-term memory: even though it’s not perfect, the model remembers parts of the world it has already generated. This is important because the generated world should stay coherent throughout the duration of an interaction.

The main vision with Genie is providing a physical training surface for AI agents, giving them an infinite amount of world data to interact with. This is considered a necessary step to prepare AI agents for the physical world. In my opinion, even digital AI agents designed to orchestrate the physical layer would need a better physical training foundation, compared to the language-abstracted understanding of current models.

Another example of a world model in this category is Odyssey-2 Max. Odyssey’s team was recruited from big companies like DeepMind, OpenAI, Tesla, and Waymo, and they are also working towards building frontier world models. They reportedly use a similar autoregressive latent diffusion-style architecture, but they haven’t published any technical details, so these claims are mostly speculation at this point.

Honorable mention: Decart’s Oasis

World Labs’ Marble: Persistent & Static 3D World Generation

Another lab considered to be at the forefront of the world model endeavour is World Labs, founded by Fei-Fei Li, a legendary figure in AI and computer vision, also referred to as the “Godmother of AI” by some.

World Labs’ approach to world models is also quite different. Instead of generating real-time video output like the Genie series, their model Marble generates static 3D worlds as Gaussian splats. You can think of Gaussian splats as a representation of a true 3D environment, which Marble creates from image and text inputs and which you can explore by simply flying through it. This provides a real environment for agents to explore, but currently lacks interactivity: the world doesn’t dynamically change in response to agent actions. There is a feature to expand the generated world, but it isn’t truly dynamic: the addition is statically processed at the edge of the existing world when requested.

One short-to-medium-term advantage of this approach is that it’s more feasible with existing simulation environments, especially for robotics. You can export the generated world and use it in any simulation environment with a proper physics engine, like NVIDIA Isaac Sim or MuJoCo. This is possible because Marble simultaneously generates collider meshes alongside the world itself. You can generate a wide range of possible worlds with Marble quite easily, eliminating the bottleneck of manually curated simulation environments, and with the help of powerful simulation tools, you can train more generalizable robotics agents out of the box.

Of course, there are other use cases too, like creative industries. Filmmakers experiment with it to empower their storytelling, and studios can use these worlds for more immersive VR experiences. It’s available as a product today for anyone to experiment with.

Another model worth mentioning from World Labs is RTFM: Real-Time Frame Model. It’s quite different from their current flagship Marble, and has more in common with Google’s vision with Genie. It’s a neural renderer that generates new frames from a history of frames, similar to Genie. However, there are important differences. It’s not an end-to-end foundation model like Genie. It’s better described as a learned neural renderer with higher fidelity, such as better reflections, given some prior like Gaussian splats or 2D frames that describe the environment. It renders novel views from these priors in real time.

It also relies on an autoregressive diffusion transformer-style architecture, but has one strong feature: persistence. One of the main limitations of fully autoregressive video models like Genie is that memory is bounded by compute, because it is only an emergent capability of the model. RTFM, on the other hand, maps frames to poses in a 3D environment, giving the memory of frames a spatial structure. It only considers nearby posed frames to render new views, so the context for each new frame stays compact while the total memory is unbounded. This means you will always see the same frames when you turn back, no matter how long you’ve spent in that world, enabling true long-horizon interactions.
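The idea of giving frame memory a spatial structure can be illustrated with a small pose-indexed store. This is only a sketch under loud assumptions: `PosedFrameMemory`, the 2D `Pose`, and the string frame IDs are all hypothetical, and RTFM’s real retrieval mechanism has not been described at this level of detail.

```python
import math
from typing import List, Tuple

Pose = Tuple[float, float]  # hypothetical 2D camera position; real poses are richer

class PosedFrameMemory:
    """Stores every frame together with the pose it was seen from."""

    def __init__(self) -> None:
        self._entries: List[Tuple[Pose, str]] = []

    def add(self, pose: Pose, frame_id: str) -> None:
        self._entries.append((pose, frame_id))

    def nearby(self, query: Pose, k: int = 2) -> List[str]:
        """Return the k stored frames whose poses are closest to the query pose."""
        ranked = sorted(self._entries, key=lambda e: math.dist(e[0], query))
        return [frame_id for _, frame_id in ranked[:k]]

memory = PosedFrameMemory()
for i in range(4):
    memory.add((float(i), 0.0), f"room_a_{i}")
context_before = memory.nearby((0.4, 0.0))

# Wander far away for a long time, then come back to the same spot.
for i in range(1000):
    memory.add((100.0 + i, 0.0), f"far_away_{i}")
context_after = memory.nearby((0.4, 0.0))

print(context_before, context_after)
```

However many frames are added elsewhere, the renderer’s context for a given spot stays the same size and contains the same frames, which is the property the paragraph above calls persistence.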

One key limitation though is that it’s not designed for action inputs, so it’s not truly dynamic in terms of the world and agent interactions. Considering the view Fei-Fei Li laid out in her influential essay, she also believes in the importance of interactivity, specifically, actions affecting the next states of the environment. I believe they are probably working on a more action-oriented model that we might hear more about.

Honorable mention: HunyuanWorld 2.0

Learned Internal World Models

Closer to the concept of internal mental models extracted from the world is another family of world models that doesn’t rely on generating environments, which I classify as “Learned Internal World Models”. As described in the first section, the current frontier models have limited understanding of the physical world. The models in this category focus on giving generalized AI agents a cognitive scaffolding with physical world understanding.

Of course there are many research subfields for this, but in the scope of this article, I think it’s sufficient to provide the two predominant approaches.

Model-Based Reinforcement Learning

In reinforcement learning, algorithms can be separated into two main categories: model-free and model-based. This distinction lies at the heart of how an agent learns in an environment. Model-free algorithms rely purely on interactions with the environment: in order to learn a policy (or value function) for the different future states, an agent needs to actually interact with the environment and collect rollout trajectories for those states. Model-based algorithms, on the other hand, rely on a dynamics/transition model of the environment. The architecture introduces a learned model of the environment (or world), and the agent uses this learned dynamics model to imagine future trajectories without requiring actual action steps in that environment.

Model-free algorithms like DQN, PPO, GRPO, or RLVR have been the dominant RL paradigm so far with their scalable optimization across domains like games, robotics simulations, or LLM post-training. Even though these algorithms are very sample-inefficient, they perform quite well given enough compute when environment steps are not expensive. The downside is that it’s more difficult for them to generalize to different situations.

The model-based approach, on the other hand, is more sample-efficient thanks to the flexibility of predicting future states with a learned world model, but it has historically suffered from compounding model error over longer-horizon imagined trajectories. There are generally two phases in this approach: first, the world model is trained on real rollouts to learn the transitions in the environment; then, the policy is learned from the world model’s imagined trajectories and rewards. In theory, having a model of the world also leads to more generalizable agents. The sample efficiency of this approach makes it especially important for physical/spatial intelligence applications like embodied AI. The success of the model-based approach relies heavily on an accurate and reliable world model, and there have been some significant recent advancements in this area.
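The two phases can be made concrete with a deliberately tiny example. The corridor environment, the lookup-table “world model”, and all names below are hypothetical, and the policy is improved with plain value iteration over the learned model, a stand-in for the actor-critic training on imagined trajectories that real agents use.

```python
from typing import Dict, List, Tuple

N_STATES, GOAL, GAMMA = 5, 4, 0.9
ACTIONS = (-1, +1)

def real_step(state: int, action: int) -> Tuple[int, float]:
    """The real (expensive) environment: a corridor with a reward at the far end."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    return next_state, (1.0 if next_state == GOAL else 0.0)

# Phase 1: learn the world model from real transitions.
# Here the "model" is just a lookup table filled by trying every action once.
world_model: Dict[Tuple[int, int], Tuple[int, float]] = {}
real_steps_used = 0
for state in range(N_STATES - 1):  # the goal state is terminal
    for action in ACTIONS:
        world_model[(state, action)] = real_step(state, action)
        real_steps_used += 1

def imagined_return(state: int, action: int, values: List[float]) -> float:
    """Query the learned model instead of the real environment."""
    next_state, reward = world_model[(state, action)]
    return reward + GAMMA * values[next_state]

# Phase 2: improve the policy purely inside the learned model.
values = [0.0] * N_STATES
for _ in range(50):
    for state in range(N_STATES - 1):
        values[state] = max(imagined_return(state, a, values) for a in ACTIONS)
policy = {s: max(ACTIONS, key=lambda a: imagined_return(s, a, values))
          for s in range(N_STATES - 1)}

# Only now does the agent act in the real environment again.
state, steps_to_goal = 0, 0
while state != GOAL:
    state, _ = real_step(state, policy[state])
    steps_to_goal += 1
print(policy, steps_to_goal, real_steps_used)
```

All 50 planning sweeps cost zero real environment steps, which is where the sample efficiency comes from. With a learned neural model instead of an exact table, the same loop would also inherit the model’s errors, which is the compounding-error problem mentioned above.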

In this post, I specifically want to talk about the Dreamer model series. The Dreamer model family is a series of model-based reinforcement learning agents that learn a world model from experience and use this world model to train an action policy with imagined latent future trajectories. The latest one is Dreamer 4, published in September 2025, which outperforms other reinforcement learning algorithms in a variety of different environments with minimal parameter tuning.

Dreamer 4 is a major upgrade to its predecessor Dreamer 3. Dreamer 3 was built on Recurrent State-Space Models for its main world modeling block. RSSM mainly includes two types of state:

Deterministic state h_t, implemented as a GRU (a type of recurrent neural network), which represents the compressed memory of the previous states and actions.
Stochastic state z_t, a categorical state predicted from the deterministic state by a simple MLP layer, which adds a representation of uncertainty on top of it. This is what the model imagines the state to be during inference, and it is also called the prior.

During training, there is also an additional step in which a simple CNN encodes the actual frames and, together with the deterministic state (the output of the GRU), produces the next state. This is called the posterior, and it is used to compute the loss that trains the prior.
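The relationship between the deterministic state, the prior, and the posterior can be sketched in a few lines. This is a toy under stated assumptions, not Dreamer 3’s implementation: `recurrent_update` is a hand-written stand-in for the GRU, `prior` and `posterior` are stand-ins for the MLP and CNN heads, and the sketch uses a KL divergence between posterior and prior as the training signal, so treat the exact loss form as an assumption.

```python
import math
from typing import List

def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p: List[float], q: List[float]) -> float:
    """KL(p || q) for two categorical distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def recurrent_update(h: List[float], z: List[float], action: float) -> List[float]:
    """Stand-in for the GRU: mixes old memory with the last state and action."""
    return [0.5 * hi + 0.5 * math.tanh(zi + action) for hi, zi in zip(h, z)]

def prior(h: List[float]) -> List[float]:
    """Prior over z_t: predicted from the deterministic state alone."""
    return softmax(h)

def posterior(h: List[float], obs_features: List[float]) -> List[float]:
    """Posterior over z_t: additionally sees features of the real frame."""
    return softmax([hi + oi for hi, oi in zip(h, obs_features)])

h_prev, z_prev, action = [0.0, 0.0, 0.0], [1.0, 0.0, 0.0], 0.5
h_t = recurrent_update(h_prev, z_prev, action)

imagined = prior(h_t)                        # what the model expects to see
observed = posterior(h_t, [2.0, 0.0, 0.0])   # what the real frame suggests
loss = kl_divergence(observed, imagined)     # pushes the prior toward the posterior
print(round(loss, 4))
```

During imagination only `recurrent_update` and `prior` are needed, since no real frames are available. The posterior exists purely to teach the prior what actually happened.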

This world model setup is then used to train an actor-critic RL agent from imagined offline trajectories. The world model training itself still requires online training and environment interaction, but the agent policy training happens offline with imagined trajectories as a model-based RL setup. This setup alone made good progress and surpassed some model-free RL agent training approaches in some standard RL environments.

Dreamer 4 introduces a complete architectural overhaul, replacing V3’s RSSM with two transformer-based components: a causal tokenizer that compresses high-resolution frames into spatial latent tokens, and a dynamics model that predicts future tokens using a diffusion-based objective (shortcut forcing) within a block-causal transformer.

Unlike the previous version, the training pipeline also doesn’t require online world model training. The training happens in three stages:

World model pre-training from mostly unlabeled offline video data with a transformer-based architecture. It still requires some action-labeled training data for action conditioning.
Agent finetuning by inserting new task-conditioned tokens as an additional modality to the dynamics transformer to predict actions and rewards.
Rolling out the world model for agent policy training on imagined future trajectories. Dreamer 4 introduces a new policy optimization algorithm, PMPO, compared to the previous version’s REINFORCE.

Overall, this architectural shift trades V3’s compact GRU-based world model for a scalable foundation-model-style architecture that can learn from large offline video datasets. This pure offline training pipeline is a major step forward, especially for domains where online interaction is expensive, slow, or unsafe, such as physical robotics, where deploying a partially trained policy risks damaging the robot and its environment.

Honorable mention: DIAMOND

V-JEPA: Self-Supervised Representation Learning

Probably the most prominent figure in the AI field who believes LLMs are a dead end is none other than Yann LeCun, Turing Award winner, and former Meta VP and Chief AI Scientist. LeCun is widely credited as one of the pioneers of deep learning, best known for developing convolutional neural networks (CNNs), the architecture that transformed computer vision and laid the groundwork for modern AI. After over a decade at Meta’s AI Research lab, he left to found Advanced Machine Intelligence (AMI) Labs, betting his reputation and a billion-dollar seed round on a fundamentally different path to intelligence in an increasingly LLM-pilled space.

He lays out his vision in his 2022 paper “A Path Towards Autonomous Machine Intelligence” proposing Joint-Embedding Predictive Architecture (JEPA) as the cornerstone of a new cognitive architecture for more efficient and generalized intelligence. His central argument is that today’s foundation models are both architecturally misguided and fundamentally disconnected from physical reality. According to him, there are two main problems with the current AI landscape:

Current foundation models are trained primarily on language and text, which are high-level lossy abstractions of reality rather than reality itself. To achieve true general intelligence, models need to be exposed to raw, unfiltered sensory data, the kind of continuous physical experience that humans rely on long before they ever learn a word. This is a sentiment also shared broadly across the world models field, which similarly argues that grounding intelligence in physical reality, rather than language, is the path forward.
Current world model approaches are computationally inefficient because they attempt to reconstruct every pixel-level detail of visual frames. This is both expensive and counterproductive: humans don’t memorize every visual detail they encounter, but instead build mental models that retain only meaningful and useful information. LeCun argues similarly: world models should not be burdened with pixel-level reconstruction, but instead learn compact, meaningful representations in latent space that capture the causal structure of the world.

V-JEPA is the first practical step to bring this vision to reality. The most recent release is V-JEPA 2.1. In this article, I will focus on the fundamentals instead of the differences between versions. In simple terms, V-JEPA’s training uses two encoders implemented as Vision Transformers (ViT). One receives masked video frames and encodes their latent features, while a predictor network is trained to predict the latent features of the masked regions, as extracted by a second encoder that receives the full, unmasked frames. This second encoder is an exponential moving average (EMA) of the first, introducing an asymmetry that prevents representation collapse. There is no decoder and no pixel reconstruction, just self-supervised learning entirely in representation space.
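The two ingredients described above, a loss computed only in latent space and a target encoder that follows the online encoder through an EMA, can be sketched with scalar “encoders”. All functions and numbers here are hypothetical toys: a real encoder is a Vision Transformer, the L1 distance is an assumed choice of loss, and the “gradient step” is faked by simply changing the online weights.

```python
from typing import List

def encode(patches: List[float], weights: List[float]) -> List[float]:
    """Toy encoder: one scalar latent per patch, latent = scale * patch + bias."""
    scale, bias = weights
    return [scale * p + bias for p in patches]

def predict_masked(context_latents: List[float], n_masked: int, gain: float) -> List[float]:
    """Toy predictor: guess every masked latent from the mean of the visible ones."""
    mean_context = sum(context_latents) / len(context_latents)
    return [gain * mean_context] * n_masked

def latent_l1_loss(predicted: List[float], target: List[float]) -> float:
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)

def ema_update(target: List[float], online: List[float], decay: float) -> List[float]:
    """The target encoder never gets gradients; it slowly tracks the online encoder."""
    return [decay * t + (1.0 - decay) * o for t, o in zip(target, online)]

video_patches = [0.2, 0.4, 0.6, 0.8]
visible, masked = [0, 1], [2, 3]
online_weights, target_weights = [1.0, 0.0], [1.0, 0.0]

# Online encoder only sees the visible patches.
context = encode([video_patches[i] for i in visible], online_weights)
# Target encoder sees the full clip; we keep its latents at the masked positions.
full_latents = encode(video_patches, target_weights)
targets = [full_latents[i] for i in masked]

predicted = predict_masked(context, len(masked), gain=1.0)
loss = latent_l1_loss(predicted, targets)  # compared in latent space, no pixels

online_weights = [1.2, 0.1]  # pretend an optimizer step changed the online encoder
target_weights = ema_update(target_weights, online_weights, decay=0.9)
print(round(loss, 3), [round(w, 3) for w in target_weights])
```

Nothing in this loop ever reconstructs a patch: the only comparison happens between predicted and target latents, and the target side changes only through the slow EMA.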

The key distinction is that V-JEPA operates purely in latent space. Unlike vision-language models, the encoder is trained solely to extract correct latent representations of the world, no language bias, no pixel reconstruction. This is V-JEPA’s core strength: the model is not distracted by irrelevant details, and its sole goal is learning to extract meaningful representations of the world, making it both more efficient and more focused than its alternatives.

The base model provides a strong foundation for world understanding, but to achieve true world-model behavior where the model predicts future states of the world, the authors fine-tuned the base architecture to produce V-JEPA-2-AC, an action-conditioned predictor. Since the encoders are already trained, they are kept frozen to preserve the rich representations learned during pretraining. The predictor is then further trained with one encoder receiving the current frames and the other receiving the target future frames. This enables training a model capable of next-state prediction given action inputs, making it well-suited for applications like robot planning.

Although this vision is promising, we have yet to see this architecture produce a competitive real-world application. With significant funding and a dedicated lab now behind it, it will be exciting to see whether V-JEPA and the broader JEPA philosophy can deliver genuinely capable models, and whether LeCun’s bet will carve out a new frontier in AI.

Honorable mentions: LeWorldModel, DINO-WM

World Understanding as Emergent Capability

There is also another view in the AI research community, claiming that we don’t actually need an explicit architectural design for physical world understanding. This emergence hypothesis states that world understanding can naturally emerge from scale, given sufficiently large and rich training data. There are two sides to this:

World understanding emerging from multi-modal foundation models
World understanding emerging from video diffusion models

Today, we indeed see these models showing some level of emergent capability in world and physics understanding. There is an active debate around whether this reflects genuine learned understanding, or sophisticated but unreliable mimicry. I will not dive into that debate or take a stand in this article. One thing worth noting is that neither of these model families includes any action-conditioned design, unlike the architectures we covered in earlier chapters, where acting in the world was central to the learning objective.

For this article, let’s briefly explore how world understanding plays out in each of these two model families.

Multi-Modal Foundation Models

These are the models we know, love, and use daily, like ChatGPT, Gemini, Claude, etc. They have fueled the AI boom in recent years. At their core, these are Large Language Models (LLMs), which are mostly about next-token prediction. This autoregressive mechanism has given rise to a remarkable level of intelligence, or at least a convincing mimicry of it (which can be argued to be functionally equivalent). For additional modalities like image or video inputs, these models use the same core language engine, encoding and tokenizing visual inputs into the same token space.

Despite these models being anchored in language, we see similar empirical evidence of world understanding (or good mimicry of understanding, you get the idea) emerging with visual inputs too. The models seem to demonstrate spatial reasoning and scene understanding with convincing natural language explanations.

Video Diffusion Models

Video diffusion models are models specifically designed to generate videos from text, image, and/or video input, like OpenAI’s Sora or DeepMind’s Veo. The architecture relies on the diffusion mechanism, which means that, unlike autoregressive language models, these video models don’t predict the next frame from previously predicted frames. Instead, they predict the full clip sequence all at once with iterative denoising.

The backbone architecture is the Diffusion Transformer (DiT). The diffusion happens in latent space rather than raw pixel space. Since the size of video data is huge, processing at the pixel level is not feasible. Instead, the model first encodes video frames into a spatiotemporal sequence of patches, performs the diffusion process in that compressed latent space, and then decodes the result back into pixels. This encoder-decoder component is called a Variational Autoencoder (VAE). Think of it as a sort of lossy compression: the model compresses the large input into a more manageable representation, does the heavy lifting there, then decompresses it back into the output video.
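The encode, denoise, decode flow can be sketched with a toy in which a clip is just a list of numbers. Every piece is a hypothetical stand-in: `encode` and `decode` mimic the VAE by averaging and repeating neighbouring frames, and `toy_denoiser` replaces the trained DiT by returning the conditioning latents directly, which a real model would have to infer from the noisy input and the prompt.

```python
import random
from typing import List

def encode(frames: List[float]) -> List[float]:
    """Toy VAE encoder: compress every two frames into one latent (their mean)."""
    return [(frames[i] + frames[i + 1]) / 2 for i in range(0, len(frames), 2)]

def decode(latents: List[float]) -> List[float]:
    """Toy VAE decoder: expand each latent back into two frames (lossy)."""
    return [value for value in latents for _ in range(2)]

def toy_denoiser(noisy: List[float], condition: List[float]) -> List[float]:
    """Stand-in for the DiT: returns its guess of the clean latent clip."""
    return list(condition)

def sample_clip(condition: List[float], steps: int = 10, seed: int = 0) -> List[float]:
    rng = random.Random(seed)
    # The whole clip starts as noise; there is no frame-by-frame generation.
    latents = [rng.gauss(0.0, 1.0) for _ in condition]
    for remaining in range(steps, 0, -1):
        guess = toy_denoiser(latents, condition)
        # Move every latent a fraction of the way toward the current guess.
        latents = [x + (g - x) / remaining for x, g in zip(latents, guess)]
    return latents

reference = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4]
condition = encode(reference)           # 8 frames -> 4 latents
clip = decode(sample_clip(condition))   # denoise in latent space, then decode
print([round(v, 2) for v in clip])
```

Two things carry over to real systems: the iterative refinement updates all latents of the clip together at every step, and the expensive loop runs on the compressed representation rather than on pixels.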

The goal of these models is simply to generate output video from various types of inputs. However, given enough scale and data, we see that these models can generate coherent and somewhat physically accurate video sequences. Even though the goal is not explicitly to understand the world or its physics, the models appear to learn to simulate physics well enough to generate better videos. This mirrors the pattern we saw with multi-modal LLMs: world understanding implicitly emerges as a byproduct of a broader training objective, given sufficient scale and data.

Honorable mentions: Runway Gen-4.5, Kling

Domain-Specific Models

So far, we’ve explored world models primarily through the lens of research: architectures designed to learn, predict, and reason about the dynamics of an environment. World models are increasingly showing up in industry too, deployed within specific domains where the ability to simulate the future has direct commercial value. The same pattern we’ve seen throughout this series holds here as well: references to “world models” in industry are plentiful, but the term is used loosely, without a clear architectural definition or common umbrella. Rather than trying to fit each of these systems neatly into our taxonomy, the goal here is simply to give you a broader sense of how the field is taking shape in practice with some other examples from the industry.

Robotics, and humanoid robotics in particular, is one domain where this is playing out in interesting ways. The dominant paradigm in the field today is Vision-Language-Action (VLA) models: systems that combine vision encoders, language understanding, and action generation into a single trainable stack. VLAs are built on top of Vision-Language Models, and as we discussed earlier in this series, LLM-based architectures can develop a form of emergent world understanding through scale and training. Although dedicated world-model-style architectures are not as popular as VLAs in the robotics domain yet, there is one good example worth mentioning. 1X, the company behind the NEO humanoid robot, released a world model, 1XWM, in early 2026 that uses video prediction grounded in real-world physics as the basis for robot decision-making, an approach that, in principle, may offer better adaptation to novel situations compared to classical policy-based methods. Whether this translates to meaningfully better real-world performance remains an open question, but it represents a distinct bet on world-model-style reasoning as a path toward more general robotic behavior.

Autonomous driving has also been one of the active domains for world models, with systems purpose-built for vehicle simulation, safety evaluation, and synthetic data generation. The core motivation is practical: testing in the real world is expensive, slow, and for safety-critical edge cases, dangerous. Wayve’s GAIA-3 is a representative example, a 15-billion-parameter latent diffusion model that takes real recorded driving sequences and re-drives them with parameterized variations, keeping the rest of the scene consistent. The result is a controlled evaluation environment where you can, for instance, alter the ego vehicle’s trajectory to produce a collision while everything else in the scene remains unchanged. On the other hand, Waymo has taken a different route, building its world model on top of Google DeepMind’s Genie 3, an interesting case of channeling broad foundation model knowledge into domain-specific outputs, including sensor modalities like lidar that the base model was never trained on.

Then there is NVIDIA Cosmos, which sits somewhat outside the taxonomy we’ve been using throughout this series. The models we’ve discussed are world models in a direct sense: they learn environment dynamics to support prediction, imagination, or planning. Cosmos is better described as a platform for building such systems. It bundles a suite of open-weight foundation models (Cosmos Predict, Cosmos Transfer, Cosmos Reason) together with a data curation pipeline, tokenizers, and a fine-tuning framework, giving developers in robotics, autonomous driving, and industrial AI a pre-trained starting point they can adapt to their own sensors, environments, and tasks. Companies across both domains are already using Cosmos for synthetic data generation and model evaluation. Whether it becomes broadly foundational for physical AI remains to be seen, but as a platform play it is notably different in kind from the research models so far explored in this post.

Conclusion

In this post, I’ve tried to categorize and explain the different approaches to world models as the field stands today. As I mentioned at the start, the term covers a wide and sometimes confusing range of work, but I hope laying out these categories made the broader picture a little easier to navigate. It’s a fast-moving space with many open questions, and it will be exciting to follow where each of these directions leads from here.


Beyond the Hype: The AI Divergence

Yusuf — Sun, 17 Aug 2025 19:53:24 GMT

Lights Out: The Beginning

In 2022, ChatGPT didn't just pass the Turing test, it made the test irrelevant overnight. But what happened next reveals deeper insights about where AI is really headed.

I have been following the AI space for more than six years now. Before ChatGPT's launch, there was much less public interest in AI and machine learning compared to today. Back then, AI meant many different things, and there were numerous specialized techniques and models used for very different purposes. The large language model revolution has fundamentally shifted this landscape. Now, the focus is on general foundation models that are trained on massive datasets with enormous amounts of compute power. The following years continued to reinforce this trend: we now have highly capable models that can handle a wide variety of tasks, and fine-tuning these general models for specialized applications has shown strong results.

The ChatGPT moment was truly transformative, and fundamentally changed how the world views AI. Following the unprecedented public attention after its launch, the race to build more powerful AI models intensified dramatically, with companies and investors pouring hundreds of billions of dollars into the competition. Today, we're experiencing what may be the greatest hype cycle in tech history, with countless companies and startups launching new things every week, claiming the world will never be the same.

Now, it's fair to say that we're in a much different world. These models have already changed a lot of things. But let's first take a step back and try to understand how the current AI paradigm really works.

Understanding the Current Paradigm

The core of these models is language. Language is something humans created to convey thoughts and knowledge to each other. It's the main way we express our thoughts and feelings, projecting the electrical signals in our brains down to interpretable expressions that we can build upon. Large language models are simply trained to generate text output from text input, based on language. The key idea to understand is that these models are auto-regressive models. This basically means they only predict the next token, given a sequence of past tokens.
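The next-token loop is easy to see in miniature. The sketch below is a hypothetical toy, not how an LLM is built: it uses a bigram count table that only looks at the single previous token, whereas a real model is a neural network conditioning on the whole preceding sequence. The generation loop itself has the same shape, though: predict one token, append it, repeat.

```python
from collections import Counter, defaultdict
from typing import Dict, List

def train_bigram(tokens: List[str]) -> Dict[str, Counter]:
    """Count which token follows which in the training text."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def generate(counts: Dict[str, Counter], prompt: List[str], max_new_tokens: int) -> List[str]:
    """Greedy autoregressive decoding: each new token is appended and fed back in."""
    output = list(prompt)
    for _ in range(max_new_tokens):
        options = counts.get(output[-1])
        if not options:
            break  # the model has never seen anything follow this token
        output.append(options.most_common(1)[0][0])
    return output

corpus = "the cat sat on the mat and the cat sat on the rug".split()
model = train_bigram(corpus)
print(" ".join(generate(model, ["the"], max_new_tokens=4)))
```

Scaling this idea up means replacing the count table with a large network and the tiny corpus with a large share of the Internet's text, but the output is still produced one token at a time.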

This sounds very simple and not particularly smart. However, it turns out to be incredibly powerful when you scale the models by training with larger networks and feeding them more and more data. We had two key enablers to support this scale: powerful specialized hardware production (Nvidia GPUs) and the Internet. This is the bitter lesson of AI research: scaling data and computation takes over in the end.

This foundational phase, which develops the core language capabilities, is called pre-training. After this phase, in simple terms, we get a very powerful text engine that has learned incredibly complex connections between tokens from the world's text data. The quality of this data is the key factor in this engine’s effectiveness. Humans make mistakes, and the internet is full of humans with different biases and views. Still, this gives you a "good enough" text engine that can project the world's information down into the format we humans understand: language.

However, this text engine alone isn't particularly useful to us, because it can produce irrelevant tokens in unhelpful formats. It has no awareness of what humans expect from a chatbot or what would be considered a smart response to a difficult question. It's also not effective and accurate enough for questions that demand precise answers, like coding and math problems. This is where post-training comes in, to give this text engine an intuition about how to respond appropriately.

Here, supervised fine-tuning and reinforcement learning ground the general models using an extensive set of human-labeled data. This evolves the models from being naive token generators into impressive AI assistants (or some sort of sophisticated information engines) that can mimic expert-level knowledge across a wide range of tasks. The models feel smarter and much more useful after this stage.

The next performance multiplier for these models came from a more algorithmic enhancement. For cognitively demanding tasks, the breakthrough that further improved model capabilities was inference-time compute. This emerged after the realization that performance could be improved simply by adding “Let’s think step-by-step” to the model prompts. Increasing inference time, which gives the models more room to tackle complex problems, along with human examples of step-by-step reasoning, proved to be very powerful. This probably isn't genuine reasoning by the models, but rather sophisticated mimicry of reasoning. Nonetheless, it enables these information engines to generate much better sequences of context for solving a given problem, which significantly improves performance in some challenging domains.
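As a rough illustration of how small the original prompting trick is, the sketch below builds the same question with and without the step-by-step cue. The `build_prompt` helper and the "Q:/A:" layout are assumptions made for this example, not any particular vendor's format, and no model is actually called here.

```python
def build_prompt(question, step_by_step=False):
    """Format a question, optionally adding the step-by-step cue."""
    prompt = f"Q: {question}\nA:"
    if step_by_step:
        # The cue nudges the model to write out intermediate steps,
        # spending more tokens (inference-time compute) before the answer.
        prompt += " Let's think step-by-step."
    return prompt


question = "A train leaves at 9:40 and arrives at 11:15. How long is the trip?"
print(build_prompt(question))
print(build_prompt(question, step_by_step=True))
```

The only difference is a few extra words, but they change what the model generates next: a chain of intermediate steps that becomes part of the context for the final answer.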

The Divergence: Two Paths Forward

There's no widely accepted definition of AGI as of today. As a baseline, we can say that so-called AGI needs to be capable of performing a significant percentage of economically valuable work at a human level without guidance. But the details remain subjective. There are many different timelines for reaching this stage in AI, which I believe is mostly due to the lack of a clear definition: everyone interprets it differently. I think AGI will become more of a marketing term than a technical one in the coming years, and there won't be consensus on whether we've actually achieved it.

Whatever it is called, AGI or a "country of geniuses", I don't think this framing is the right way to analyze current AI progress. With the current AI paradigm, I see two main projects (sticking to the digital realm and excluding robotics):

- An engineering-heavy project of building an automation engine (autonomous agents) that can mimic expert-level human intelligence across a wide range of digital tasks.

- A science-heavy project of building powerful AI with the ability to self-evolve, potentially leading to breakthrough scientific discoveries and accelerated progress on many fronts.

We've reached current model capabilities through many scientific advancements. But today, in my opinion, there is scientific saturation within the current paradigm: it seems we now have most of the essential tools and techniques needed to build highly effective automation engines. Most frontier labs appear to be investing more heavily in the first project, automating economically valuable tasks by engineering the models into agentic workflows, most likely due to the enormous economic potential.

This engineering project is mostly about scaling and engineering context, scaling reinforcement learning to obtain better reward models with more human-labeled data, reducing costs, optimizing performance, building infrastructure, and designing tooling and systems around it. Within a few years, we'll quite likely see more capable tools that threaten to automate an increasing number of white-collar jobs.

Before the end of this decade, we'll likely see some jobs become fully automated, while others will be drastically different from how they are today. The economic value distribution of labor and skills will also shift very rapidly. To survive this new wave, we all need to become more adaptable and embrace change rather than resist it. Some of the things we like doing or existing skills we're good at may no longer be as economically valuable as before. The solution lies in willingness to change and learn continuously. Current AI models are also amazing tools for learning new things, so the key is to keep looking for ways to stay relevant by using these tools, not avoiding them.

That being said, I don't think the current engineering project will deliver the intelligence level required to solve novel problems without any human in the loop. R&D will remain safe from this automation wave until we achieve new breakthroughs through the scientific project.

I don't believe the current AI paradigm is sufficient to reach superior intelligence levels. Frontier labs continue improving models on benchmarks, and there are new algorithmic advances, but most appear to be incremental optimizations that enhance specific capabilities within existing systems. While this is fine, I believe that without fundamental architectural breakthroughs on the scale of the transformer, the current approach won't be sufficient to build self-evolving intelligence by scaling alone. The current recipe with autoregressive models seems to work pretty well at building systems that can mimic near-expert-level intelligence across a range of tasks, but I don't see this surpassing the intelligence threshold necessary to drive new discoveries.

I think there are three major bottlenecks with the current paradigm:

Data inefficiency and scarcity: Current models are extremely data-hungry, and we're running out of high-quality training data (the fossil fuel of AI). Synthetic data approaches are short-term workarounds that I don't see as a robust long-term solution. Synthetic data can only recombine existing knowledge, not generate truly novel insights. The rumored heavy use of synthetic data in GPT-5, coupled with the lack of dramatic performance improvements, suggests we're already hitting this ceiling. Current architectures are so data-inefficient that relying on vast amounts of human-generated content for every application, especially novel domains, isn't realistic. While this approach might be enough for powerful AI assistants that can deliver world knowledge with impressive accuracy, it becomes problematic for autonomous agents that need to discover new things and design reliable systems. Humans are remarkably sample-efficient learners. We need AI techniques that can learn faster and adapt to new situations with minimal data.
Lack of continual learning: Current AI models lack sophisticated memory structures and cannot learn continuously over extended periods. These systems might adapt to your style and preferences during a single session with effort, but you start from scratch each time. While there are primitive memory implementations, like ChatGPT's user memory or Claude Code's summaries, we need far more robust continual learning capabilities. Whether for personal or professional use, current systems fall dramatically short of human-level adaptation when it comes to understanding individual styles, preferences, or evolving situations in long-term work. Understanding someone's communication style, finding product-market fit, grasping client preferences and customer dynamics, or developing strategies for system design or research directions: all of these require years of continual learning that current systems simply cannot handle.
Lack of environment design for real-world experience: After training on all existing human knowledge, AI models need new frontiers for improvement. They need sophisticated world models, the ability to take meaningful actions, and genuine experience in complex environments. The next feedback loop must come from real-world experiences, a largely unsolved problem. Building businesses, solving engineering challenges, or advancing scientific discoveries all require feedback loops grounded in reality: customer responses, real-world test results, laboratory experiment outcomes. These are all bound to the world's timeline, and there's no clear path for AI to interact effectively with this complexity and establish proper reward systems. This constraint means we'll be limited by real-world and human timelines, making fast takeoff scenarios implausible. Welcome to the era of experience.

Whatever the future brings, we're certainly living in both exciting and worrying times. Many fundamental questions remain open. Nobody knows the trajectory of AI progress or what we'll ultimately achieve in the years ahead. But one thing is certain: even if the "science-heavy project" yields no meaningful breakthroughs in the short term, the "engineering-heavy project" alone will be profoundly transformative, demanding urgent answers to critical questions about humanity's future within the next few years.

I plan to keep writing about AI, from technical deep dives to its broader implications. Your engagement helps shape these discussions, so please share your thoughts in the comments, and if you'd like to follow along, consider subscribing.