Walter De Brouwer

Chomsky's paradox "Colorless green ideas sleep furiously" is a grammatically perfect sentence, yet completely nonsensical. It showed that syntax and meaning are distinct, just as LLMs now generate flawless text without "human" understanding.

I had no idea what to study at university. Math seemed attractive—not because I was particularly good at it, but because it promised freedom from rote memorization. However, my conviction was weak, and I let my friends' choices sway me into linguistics at the University of Ghent, Belgium.

There I stumbled, almost by accident, onto my true passion: Chomskyan linguistics, the blueprint of theoretical computer science, which had branched off from computer engineering (Purdue had founded the first U.S. computer-science department back in 1962, and by 1980 the field was everywhere). Pink Floyd had just unleashed "Another Brick in the Wall (Part II)," and Pac‑Man was chomping its way into arcades. The future felt boundless.

Noam Chomsky did not merely analyze language. He asked what kind of machine could possibly generate it. That question still haunts us. Today, AI splits into three camps, each with a different answer about the road ahead.

Three Camps

CAMP 1: The Pessimists argue we've entered a zone of diminishing returns. They point to empirical scaling laws, like those from Kaplan et al.'s "Scaling Laws for Neural Language Models" (arXiv 2001.08361, 2020), which show gains decelerating as a power law of compute. Recent research, such as "The Race to Efficiency" (arXiv, 2025), explicitly discusses these "diminishing returns" and proposes efficiency as a necessary countermeasure. Even if scaling never fully plateaus, data scarcity or energy limits could become the greater barriers.

CAMP 2: The Pragmatists accept that raw scale is becoming cost-prohibitive, but point to algorithmic efficiency doubling, on par with Moore's Law for hardware, as a path toward a higher global optimum. The Chinchilla paper (arXiv 2203.15556, 2022) demonstrated that under a fixed compute budget (FLOPs), optimal performance comes from balancing model size and training data: train smaller models on more tokens. This insight has shaped models like Llama 2 and 3.
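The Chinchilla trade-off can be sketched in a few lines. This is only a back-of-the-envelope illustration, assuming the common approximation that training compute C ≈ 6·N·D (parameters times tokens) and a compute-optimal ratio of roughly 20 tokens per parameter, one popular reading of the paper; the exact exponents in the paper differ slightly.

```python
# Back-of-the-envelope Chinchilla-style compute allocation.
# Assumptions (not from this post): training compute C ~ 6 * N * D,
# and a compute-optimal ratio D/N ~ 20 tokens per parameter,
# a common simplification of Hoffmann et al. (2022).

def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Split a FLOP budget into model size N (params) and data D (tokens)."""
    # Solve C = 6 * N * D with D = r * N  =>  N = sqrt(C / (6 * r))
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Example: a 1e24-FLOP training budget.
n, d = chinchilla_optimal(1e24)
print(f"params ~ {n:.2e}, tokens ~ {d:.2e}")
```

Under these assumptions, a 1e24-FLOP budget favors a model of roughly 9e10 parameters trained on about 1.8e12 tokens, i.e., smaller than GPT-3-era models but trained on far more data, which is exactly the Pragmatists' point.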

CAMP 3: The Optimists, led by voices like Eric Schmidt (Business Insider, Nov 2024: "there's no evidence that the scaling laws … have begun to stop"), counter that scaling still has headroom. Schmidt's view echoes optimists like Sam Altman, who push for ever more compute (e.g., via dedicated supercomputers), and the authors of the recent AI Action Plan of the USA.

Today's AI optimists occupy the equivalent of Everest's Camp 3: exhilarated by their altitude but facing the mountain's steepest challenges ahead. At 7,200 meters, climbers must decide whether to push higher or preserve their oxygen; most are on supplemental bottles and fighting extreme fatigue. Reaching Camp 3 is far more demanding than arriving at Camp 2, and by this point many who cannot handle the altitude turn back. Yet Camp 3 is still well below the summit: the higher camps, the South Col, and the summit push all remain.

That is where we are today. Raw scaling is expensive (data centers consume massive energy). We either need incremental tweaks (pruning, quantization, sparsity, LoRA, distillation, custom silicon) or another kind of mountain altogether to reach AGI.
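Of the incremental tweaks just listed, post-training quantization is the easiest to make concrete. Below is a minimal sketch of symmetric int8 weight quantization using only numpy; production toolchains (per-channel scales, GPTQ/AWQ-style calibration) are far more careful, so treat this as an illustration of where the "4x smaller" figure comes from, not a recipe.

```python
import numpy as np

# Minimal symmetric int8 post-training quantization of a weight matrix.
# The headline saving: float32 (4 bytes/weight) -> int8 (1 byte/weight),
# plus a single float scale per tensor.

def quantize_int8(w):
    scale = np.abs(w).max() / 127.0                      # map max |w| to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)       # stand-in weights
q, s = quantize_int8(w)
err = np.abs(w - dequantize(q, s)).max()
print(f"4x memory saving, max abs error {err:.5f}")
```

The maximum round-trip error is bounded by half the scale step, which is why quantization to 8 bits typically costs little accuracy while cutting memory fourfold.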

FOUR MOUNTAINS TOWARD A TRUE GLOBAL MAXIMUM

1. Hybrid Architectures & Symbolic-Neural Integration

The Chomsky hierarchy is a classification of formal grammars in formal language theory (a significant field of discrete mathematics) comprising four types, ordered by generative power. The hierarchy is nested: the language classes satisfy regular (Type-3) ⊂ context-free (Type-2) ⊂ context-sensitive (Type-1) ⊂ recursively enumerable (Type-0), but not vice versa.

Now here is a controversial idea. Admittedly, Chomsky's hierarchy is a mathematical classification of grammars and automata. But those of us who lived through the beginnings of computer science find the analogy with what is happening now striking.

• Agentic AI: Type-0, unrestricted / Turing-complete. OpenAI, Microsoft, Anthropic, AWS, and Google (Vertex) have all come out with agentic AI.

• Mixture-of-Experts: Type-1, context-sensitive. Mistral AI, Databricks, Meta AI, DeepSeek, and xAI have all come out with MoE models.

• Artificial languages: Type-2, context-free grammars. Parsers for programming languages. Python is the number-one coding language largely because of its libraries (PyTorch, TensorFlow).

• Tokenizers: Type-3, regular grammars. Regex and BPE are standard nowadays.
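To make the gap between two rungs of the hierarchy concrete, here is a toy contrast (the function names are mine, purely illustrative): the regular language (ab)* is recognizable with finite state alone, while aⁿbⁿ, the textbook context-free language, needs unbounded counting that no finite automaton can supply.

```python
# Toy illustration of two rungs of the Chomsky hierarchy.

def matches_ab_star(s):
    """Two-state DFA for the regular (Type-3) language (ab)*."""
    state = 0                          # 0: expect 'a', 1: expect 'b'
    for ch in s:
        if state == 0 and ch == "a":
            state = 1
        elif state == 1 and ch == "b":
            state = 0
        else:
            return False
    return state == 0                  # accept only after complete "ab" pairs

def matches_an_bn(s):
    """Counter-based recognizer for the context-free (Type-2) language a^n b^n."""
    half = len(s) // 2                 # the counter: how many a's to expect
    return (len(s) % 2 == 0
            and s[:half] == "a" * half
            and s[half:] == "b" * half)

print(matches_ab_star("ababab"))       # True
print(matches_an_bn("aaabbb"))         # True
print(matches_an_bn("aabbb"))          # False
```

The DFA gets by with two states forever; the aⁿbⁿ recognizer must count arbitrarily high, which is exactly the extra memory (a stack) that separates Type-2 from Type-3.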

Fusing these paradigms deliberately—symbolic planners with sparse routing, dense neural backbones with iterative samplers—may unlock capabilities that pure scaling cannot reach.

Another paradigm worth integrating is diffusion models as compositional samplers, recently shown to enrich structure prediction in protein folding when combined with attention-based architectures in AlphaFold 3 (Nature, 2024).

2. Algorithmic & Hardware Efficiency. Even if parameter counts can't keep doubling, efficiency gains are real:

• Linearized attention: a mathematical trick for computing attention that reduces complexity from quadratic to linear (Performers, arXiv 2009.14794, 2020)

• Sparse experts (Switch Transformers, arXiv 2101.03961, 2021): scale to trillions of parameters at constant per-token compute by routing tokens to a few experts (e.g., Mixtral)

• Pruning/quantization: can shrink GPT-class models roughly 4x in memory with minimal accuracy loss

• Custom silicon: Graphcore/Cerebras claim 2–5x speedups; NVIDIA's GPUs follow Moore's Law-ish trends.
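The linearized-attention bullet above reduces to a small piece of algebra: with a positive feature map φ, attention can be computed as φ(Q)(φ(K)ᵀV) without ever forming the n×n score matrix. The sketch below is a toy version in the spirit of linear transformers and Performers, using the simple φ(x) = elu(x) + 1 of Katharopoulos et al. rather than the random features the Performer paper actually uses.

```python
import numpy as np

# Toy linear attention. Softmax attention costs O(n^2 * d); with a
# kernel feature map phi, phi(Q) @ (phi(K).T @ V) costs O(n * d^2).

def elu_plus_one(x):
    # elu(x) + 1: equals x + 1 for x > 0 and exp(x) for x <= 0; always positive.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    q, k = elu_plus_one(Q), elu_plus_one(K)   # (n, d) positive features
    kv = k.T @ V                              # (d, d_v), computed once
    z = q @ k.sum(axis=0)                     # (n,) normalizer per query
    return (q @ kv) / z[:, None]              # never materializes n x n

rng = np.random.default_rng(0)
n, d = 512, 32
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
out = linear_attention(Q, K, V)
print(out.shape)                              # (512, 32)
```

Because the weights are positive and normalized, each output row is still a convex combination of value rows, just like softmax attention, but the n×n matrix never exists, which is the whole trick.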

3. Truly Multimodal Foundations. So far most foundation models focus on one modality. But converged models that jointly process vision, audio, structured data, and RL signals could exploit cross-modal synergies that simple ensembles miss. Imagine a single model that could read satellite imagery to anticipate supply-chain disruptions, listen to corporate earnings calls for sentiment shifts, and simulate market reactions via embedded reinforcement learners.

Recent advances focus on scaling multimodal capabilities, improving efficiency, and integrating more modalities (e.g., audio, video, 3D data). Models like OpenAI's GPT-4o and Supergrok push the boundaries of real-time multimodal processing. But training unified models is compute-intensive and data-hungry.

4. Beyond the Manifold: Towards Machine Understanding. Optimists declare that "Machine Learning is solved" and "Deep Learning is Nobel-worthy." Deep learning did, in fact, earn Hopfield and Hinton the 2024 Physics Nobel, but ML is far from solved: hallucinations, bias, and brittle generalization persist. ML and DL models remain more art than science.

Exploration's greatest gift is the "fuite en avant", the spectacular flight forward. Transformers were a paradigm shift, not just an "incremental tweak." I have found that when research stalls in incremental progress without clear solutions, it often helps to abandon it for a while and chase bolder goals to renew one's nerve.

In any case, the business stakes couldn't be higher: those who master the next wave of architectures and efficiency tricks will lead the market—while the rest risk being stranded on yesterday's peak. That is the saddest thing I can think of, to spend a lifetime climbing a mountain only to find out that it was the wrong one.

So there is really no choice, whether you are an optimist or not. If radical shifts don't pan out, we might see an "AI winter" reprise, but I doubt it. I think we deserve a small party.

The view from Camp 3 is spectacular, but the summit—true machine understanding—still beckons through thinning air. Bring your own bottle.
