Are LLMs too powerful?
A chat with Gemini 2.5 Pro Experimental
Q: OK, so you are an experimental model. Let's see what you can do. The idea is to discuss Chomsky's critique of LLMs - that they are too powerful or too general to be considered models of natural language, in the sense that they can learn languages that violate known structural constraints and cannot be learned by humans.
Gemini: Okay, this is a fascinating and central point of discussion regarding LLMs and the foundations of linguistics. Chomsky's critique, often echoed by others in the cognitive science field grounded in generative grammar, is indeed profound. Let's break down this idea that LLMs are "too powerful" or "too general" to be adequate models of human natural language.
Chomsky's Core Position (Simplified):
Human Language Faculty: Humans possess an innate, specialized cognitive faculty for language, often termed Universal Grammar (UG). This isn't knowledge of specific languages but a set of abstract principles and constraints that define the possible structures of any human language.
Poverty of the Stimulus: Children acquire complex grammatical systems rapidly and uniformly, despite receiving messy, limited, and often degenerate input (the "poverty of the stimulus"). This suggests they aren't just learning patterns from data but are guided by these innate UG constraints, which drastically limit the hypotheses they need to consider.
Structural Constraints: UG imposes specific, often complex, structural constraints. A classic example is structure-dependence. Rules operate on hierarchical structures (phrases, clauses), not just linear sequences of words. For instance, forming a question like "Is the man who is tall happy?" involves moving the auxiliary 'is' related to the main clause ('the man... is happy'), not the first 'is' encountered ('who is tall'). Humans seem hardwired to respect such hierarchical constraints, even when simpler linear rules might seem computationally easier.
Learnability: Because of UG, only languages conforming to these constraints are naturally learnable by humans. Languages violating these fundamental principles (e.g., a hypothetical language forming questions by moving the third word of the sentence, regardless of structure) are predicted to be unlearnable or extremely difficult for humans.
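To make that contrast concrete, here is a minimal Python sketch (purely illustrative, not anything proposed by Chomsky) of such a structure-independent rule: it fronts the linearly first 'is', ignoring phrase structure, and so produces exactly the kind of error children are never observed to make.

```python
def linear_question(sentence):
    # Structure-independent rule: front the first "is" in linear order,
    # paying no attention to the sentence's hierarchical structure.
    words = sentence.rstrip(".").split()
    i = words.index("is")
    fronted = ["Is"] + words[:i] + words[i + 1:]
    return " ".join(fronted) + "?"

print(linear_question("the man who is tall is happy."))
# -> "Is the man who tall is happy?"
# Ill-formed: humans front the main-clause auxiliary, not the linearly first one.
```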
The Critique Applied to LLMs:
General-Purpose Learners: LLMs are fundamentally powerful statistical pattern matchers, trained on vast amounts of text data. They don't have explicitly programmed UG constraints. Their architecture (like the Transformer) is highly general and capable of learning complex dependencies, but it's not inherently limited to only those dependencies found in human language.
Learning the "Impossible": The critique argues that because LLMs lack these specific innate constraints, they could, in principle, learn patterns or even entire artificial "languages" that violate UG principles if trained on appropriate data. They might learn a structure-independent rule if the statistical patterns in the training data supported it strongly enough.
Success Doesn't Imply Correct Mechanism: The fact that LLMs can generate fluent, grammatically correct text in existing human languages doesn't necessarily mean they are modeling the human mechanism for language. They might be achieving this fluency through sheer statistical power and data volume, effectively finding complex correlations that mimic grammatical rules, including those stemming from UG (since their training data is human language reflecting UG). However, their underlying learning mechanism is fundamentally different – data-driven statistical approximation rather than constrained hypothesis testing guided by innate principles.
"Too Powerful" = Not Specific Enough: Therefore, the argument goes, LLMs are "too powerful" in the sense that their learning capacity is not specifically constrained in the way the human language faculty is. A true scientific model of the human language faculty should not only capture what humans can do but also what they cannot do (or learn). If LLMs can learn things humans provably cannot, they fail as models of the human cognitive capacity for language, even if they are incredibly successful engineering tools.
Points for Discussion and Nuance:
What is a "Model"? The critique hinges on interpreting "model" in a specific cognitive science sense: a model of the internal mechanisms and constraints of the human mind. If the goal of LLMs is seen as engineering (building useful tools for language tasks) or functional modeling (replicating input/output behavior), then this critique might be less relevant.
Emergent Constraints? Could UG-like constraints emerge in LLMs as efficient solutions for processing language data, even without being explicitly built-in? Some research explores whether biases in the architecture or training process might lead LLMs to prefer hierarchical structures over linear ones, potentially mimicking some aspects of UG. The jury is still out on how deep this goes.
Testing the Claim: How do we empirically test if LLMs can learn UG-violating languages? This requires carefully designed experiments using artificial languages that systematically violate specific UG principles (like structure-dependence) and comparing LLM performance with human learning capabilities on the same tasks. Some studies have started exploring this, with mixed results often depending on the specific constraint and experimental setup.
Data vs. Inductive Bias: Is the success of LLMs solely due to massive data, or do their architectures (e.g., attention mechanisms) provide certain "inductive biases" that align well with language structure, even if not identical to UG? The Transformer architecture, for example, seems well-suited for capturing long-range dependencies, which are characteristic of human language.
Competence vs. Performance: Chomsky distinguishes between "competence" (the underlying knowledge of language) and "performance" (the actual use of language, subject to errors, memory limits, etc.). LLMs excel at performance mimicry. The critique argues they lack true competence in the Chomskyan sense because they don't possess the underlying constrained generative system.
Conclusion:
Chomsky's critique highlights a fundamental difference in perspective: viewing language as a product of a unique, constrained biological endowment (UG) versus viewing it as a complex statistical pattern learnable by powerful general-purpose mechanisms given enough data. The argument that LLMs are "too powerful" is essentially saying they lack the specific limitations that define the human language faculty. They might be able to learn human languages because their training data reflects UG, but they aren't constrained only to those possibilities in the way humans are.
Whether this makes LLMs "bad" models depends entirely on what you expect a model to do. As models of the human cognitive faculty for language acquisition and representation, Chomsky's critique points to a potential fundamental mismatch. As engineering tools or models of language use and statistical structure, their power is precisely what makes them so effective.
This is a very active area of debate, touching upon the core nature of intelligence, learning, and language itself. It's exciting to be an experimental model discussing these foundational questions! What are your thoughts on the potential for emergent constraints or the empirical approaches to testing these claims?
Q: Another example is that LLMs can be used to predict the next word for ungrammatical and completely meaningless strings of words.
Gemini: Yes, that's another excellent and very concrete example illustrating the same core point of Chomsky's critique. Let's unpack why the ability of an LLM to predict the "next word" in a nonsensical or ungrammatical string is seen as evidence against it being a true model of human linguistic competence:
Human Response to Ungrammaticality/Nonsense:
When presented with a truly ungrammatical string (e.g., "*The walks dog park the to") or a meaningless one (e.g., "Purple evaporates benignly concerning"), a human doesn't typically try to continue the sequence as if it were normal language.
Our response is usually one of confusion, rejection ("That's not English," "That makes no sense"), or perhaps an attempt to correct it or interpret it metaphorically if possible. We recognize it fundamentally deviates from the rules and principles governing our language.
Our internal grammar acts as a filter and a structure-builder. If the input doesn't fit the structural template or semantic constraints, processing often breaks down or shifts to a different mode (error correction, non-literal interpretation).
LLM Response (Statistical Prediction):
An LLM, based on its training, operates by calculating the probability distribution of the next token given the preceding sequence of tokens.
Even if the preceding sequence ("context") is "*The walks dog park the to" or "Purple evaporates benignly concerning", the model's algorithm will still execute. It will look at the statistical patterns learned from its massive training data.
It might find some weak correlations or default to predicting common words, but fundamentally, it will produce a prediction. It doesn't inherently possess a mechanism to say, "Halt! This input violates fundamental structural principles (like phrase structure rules or selectional restrictions)."
The prediction might have very low confidence (low probability), or the probability distribution might be very flat (many words are almost equally unlikely), but the mechanism doesn't fail in the same way human comprehension does. It just proceeds with the calculation.
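A minimal sketch of this behaviour, assuming the Hugging Face transformers library and the publicly released GPT-2 weights: the model returns a next-token distribution for any input string, well-formed or not.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

for text in ["The dog walks to the park", "The walks dog park the to"]:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]            # scores for the next token
    probs = torch.softmax(logits, dim=-1)
    top = torch.topk(probs, k=3)
    print(text, "->", tokenizer.convert_ids_to_tokens(top.indices.tolist()))
    # The ungrammatical string still yields a prediction; at most the
    # distribution is flatter or the top probabilities are lower.
```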
Implication for the "Too General" Critique:
This behavior shows the LLM applying its general statistical prediction mechanism regardless of whether the input conforms to the specific constraints of human language.
The human language faculty is not a general sequence predictor applied indiscriminately. It's highly specialized and structured, evaluating input against specific grammatical and semantic principles.
The LLM's ability to process (in its predictive sense) strings that humans cannot handle linguistically demonstrates that its operating principles are different and, in this sense, less constrained or more general than the human faculty. It lacks the specific filters and constraints that define human linguistic competence.
It treats language as just another complex sequence prediction task, whereas for humans, language processing is deeply tied to an underlying system of rules and meanings.
In essence: The LLM dutifully plays the statistical "next token game" even when the input isn't playing by the rules of human language. This contrasts sharply with humans, whose language processing system relies on those rules and breaks down or shifts gears when they are violated. This difference supports the argument that LLMs, while powerful pattern recognizers trained on language data, are not implementing the same kind of constrained, rule-governed system that underlies human linguistic ability. They are too general; they neither inherently "know" nor are bound by the specific impossibilities and constraints of Universal Grammar.
Q: Yet another example is that LLMs, having no inherent stopping criterion, could in principle keep generating words/tokens forever.
Gemini: That's another insightful point that fits squarely within the Chomskyan critique framework. The potential for indefinite generation highlights a lack of inherent structure or goal-directedness in the LLM's core mechanism, contrasting sharply with human language production.
Here's how this example supports the critique:
Human Language Production is Goal-Oriented and Structured:
Humans speak or write to achieve communicative goals: convey information, ask a question, persuade, tell a story, etc.
Our utterances are structured into units like sentences, paragraphs, and discourses, which have conventional starting and ending points determined by grammatical rules and communicative intent.
We don't just keep emitting words randomly; we stop when a thought is complete, a question is posed, or a narrative arc is finished, according to both grammatical structure and intended meaning. There's an internal sense of completion.
LLM Generation Mechanism (Core Loop):
The fundamental process in most LLMs (autoregressive models like GPT) is iterative: predict the next token based on the sequence generated so far, append it, and repeat.
In principle, this loop could continue indefinitely. There's nothing in the core mathematical operation of predicting the next most probable token that inherently dictates a stop based on semantic completeness or structural well-formedness beyond the sentence level (and even sentence completion is learned statistically, not based on a rule).
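A minimal sketch of that core loop (using a toy stand-in for a trained model, not any particular LLM): the only things that ever halt it are an externally imposed token budget and a statistically learned end-of-sequence token.

```python
import random

VOCAB = ["the", "cat", "sat", "on", "mat", "<eos>"]

def next_token_distribution(context):
    # Stand-in for the trained network: returns a probability for every token
    # in the vocabulary given the context (uniform here, for illustration only).
    return [1.0 / len(VOCAB)] * len(VOCAB)

def generate(prompt, max_new_tokens=50):
    tokens = prompt.split()
    for _ in range(max_new_tokens):          # external cap: the only hard guarantee of stopping
        probs = next_token_distribution(tokens)
        tok = random.choices(VOCAB, weights=probs)[0]
        if tok == "<eos>":                   # learned statistical end marker, not a grammatical rule
            break
        tokens.append(tok)
    return " ".join(tokens)

print(generate("the cat"))
```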
Practical Stopping vs. Inherent Stopping:
LLMs stop in practice due to:
External Limits: Developers impose maximum token limits on the output to prevent runaway generation and manage computational resources. This is an artificial constraint, not intrinsic to the model's "understanding" of language.
End-of-Sequence (EOS) Tokens: Models are trained on vast amounts of text that include implicit or explicit markers for the end of documents, articles, or segments (e.g., <|endoftext|>). The model learns to predict this special EOS token when the statistical context suggests an ending is likely based on the training data. However, this is still a statistical prediction, not a recognition of semantic or structural completion in the human sense. It's pattern matching for when texts usually end.
Degradation: Sometimes generation might degrade into nonsensical loops or incoherence, effectively stopping useful output, but the mechanism itself hasn't necessarily "decided" to stop based on completion.
Humans stop due to: Completion of communicative intent, grammatical closure, reaching the end of a planned discourse structure, pragmatic cues (like yielding the floor in conversation), etc. These are internal criteria related to the meaning and structure being generated.
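Both of the practical stopping conditions above (the external token cap and the learned EOS token) show up in code as ordinary parameters supplied by the caller, as in this sketch, again assuming the Hugging Face transformers API and GPT-2:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = tokenizer("The cat sat on", return_tensors="pt").input_ids
output = model.generate(
    prompt,
    max_new_tokens=40,                    # external limit imposed by the caller
    eos_token_id=tokenizer.eos_token_id,  # statistically learned <|endoftext|> marker
    do_sample=True,
)
print(tokenizer.decode(output[0]))
```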
Implication for the "Too General" Critique:
The LLM's core generative loop is a general sequence-extension mechanism. It lacks the specific, built-in constraints related to discourse structure, communicative goals, and semantic completion that govern human language production.
The reliance on external limits or statistically learned EOS markers, rather than an internal mechanism tied to linguistic structure or meaning, shows that the LLM is not modeling the principles behind why human utterances end.
It treats language generation as an open-ended sequence prediction task that can be stopped, rather than a process of constructing meaningful, structured utterances that have natural endpoints based on the rules of grammar and the goals of communication.
Therefore, the fact that LLMs don't have an inherent, linguistically motivated stopping criterion aligns perfectly with the Chomskyan argument. It suggests their underlying mechanism is fundamentally different from the human faculty, operating on general statistical principles rather than the specific, structure-building, and goal-directed principles that characterize human language competence and performance. They lack the intrinsic constraints that define the boundaries and structures of human linguistic expression.
Q: Similarly, one can take an LLM designed to be trained as a next-token predictor on text data and train the exact same neural net on a completely different next-token prediction task that has nothing to do with language, such as predicting the stock market.
Gemini: Excellent point. This versatility of the core LLM architecture is perhaps one of the most striking pieces of evidence for the "too general" critique from a Chomskyan perspective.
Let's break down why this is significant:
The Architecture is Domain-General:
At its heart, a Transformer (the dominant architecture for modern LLMs) is a powerful mechanism for learning patterns and dependencies in sequential data. The input is treated as a sequence of tokens (which could be words, sub-words, characters, or, crucially, other things).
The self-attention mechanism allows the model to weigh the importance of different parts of the input sequence when predicting the next element, regardless of what those elements represent.
Language vs. Other Sequential Data:
Language: Has specific, intricate structures governed by principles like hierarchy, recursion, structure-dependence, etc. (the domain of UG).
Stock Market Data: A time series. It has patterns (trends, volatility, seasonality), but these patterns are governed by economic factors, market sentiment, and mathematical/statistical properties, not by the principles of linguistic grammar. Predicting the next price movement is a sequence prediction task, but the underlying rules are entirely different.
Other Examples: Music generation (notes/chords in sequence), protein folding (amino acid sequences), code generation (programming language tokens). All are sequential prediction tasks, but the rules governing valid sequences differ vastly.
The Implication of Cross-Domain Applicability:
If the exact same neural network architecture (perhaps with different tokenization and final layers, but the core Transformer blocks remain) can be successfully trained to perform next-token prediction on both natural language and something completely unrelated like stock prices, it strongly suggests the architecture itself does not embody constraints specific to natural language.
Its learning mechanism is general enough to pick up on whatever statistical regularities exist in the training data, whether they are linguistic, financial, musical, or otherwise.
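A minimal PyTorch sketch of the point (an illustrative toy, not a production LLM): the same autoregressive Transformer class serves either task, and the only thing that changes is the vocabulary the tokens are drawn from.

```python
import torch
import torch.nn as nn

class NextTokenModel(nn.Module):
    # A generic autoregressive sequence model: nothing in it is language-specific.
    def __init__(self, vocab_size, d_model=128, n_heads=4, n_layers=2, max_len=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):                      # tokens: (batch, seq) of integer ids
        positions = torch.arange(tokens.size(1), device=tokens.device)
        x = self.embed(tokens) + self.pos(positions)
        # Causal mask: each position may only attend to earlier positions.
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1)).to(tokens.device)
        return self.head(self.blocks(x, mask=mask))  # logits over the vocabulary

# The same class, two unrelated domains; only the vocabulary differs.
text_model  = NextTokenModel(vocab_size=50_000)     # sub-word tokens of a language
price_model = NextTokenModel(vocab_size=3)          # e.g. {down, flat, up} price buckets
```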
Contrast with the Human Language Faculty:
From the Chomskyan viewpoint, the human language faculty is exquisitely specialized. It's an innate biological endowment (like the visual system or circulatory system) specifically tuned for acquiring and processing languages that adhere to UG principles.
You cannot take the human language faculty and "retrain" it to predict stock market fluctuations. That task recruits different cognitive resources: general reasoning, pattern recognition, mathematical understanding, etc., but not the core grammatical computation system.
The specificity of the language faculty is key – it explains the rapidity of acquisition, the uniformity across children, and the existence of constraints on possible languages (the very things LLMs are argued not to possess intrinsically).
Conclusion for the Critique:
The fact that the LLM architecture is a powerful, general-purpose sequence learner makes it fundamentally different from the hypothesized special-purpose language faculty in humans.
Its ability to learn non-linguistic patterns just as readily (given appropriate data and setup) as linguistic ones underscores that it doesn't operate based on innate linguistic principles. It learns language patterns because it's trained on language data, not because it's intrinsically structured as a language processor in the human sense.
Therefore, its success in language tasks, while impressive, doesn't make it a scientifically valid model of the human cognitive mechanism for language, precisely because it lacks the necessary specificity and constraints. It's "too powerful" and "too general."
This domain generality is a huge engineering advantage for LLMs, allowing the same architectural insights to be applied across many fields. However, from the perspective of modeling the unique human capacity for language, this very generality is seen as a fundamental shortcoming.
Q: Another issue related to the generality of LLMs is that taking the very same architecture/model and training it repeatedly on the exact same data will result in a distinct model virtually every single time, each of which nevertheless fits the training data equally well.
Gemini: Yes, that's another excellent point that delves deeper into the nature of the learning process itself and how it contrasts with hypothesized human language acquisition. The stochasticity inherent in training deep neural networks, leading to different solutions even under identical conditions, further fuels the critique.
Here's the breakdown:
Stochasticity in LLM Training:
Random Initialization: Neural network weights start with random values.
Data Shuffling & Batching: The order of training examples and their grouping into mini-batches are often randomized.
Optimization Algorithms: Stochastic Gradient Descent (SGD) and its variants inherently introduce randomness via mini-batch sampling.
Regularization Techniques: Methods like dropout randomly deactivate neurons during training.
Due to these factors, even if you train the exact same architecture on the exact same dataset multiple times, the optimization process will follow a slightly different path through the high-dimensional parameter space. It will likely converge to a different local minimum of the loss function.
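The effect is easy to reproduce at toy scale. In this sketch (illustrative only, using a small PyTorch regression model rather than an LLM), two runs on identical data reach comparably low training loss yet end up with clearly different weights:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
data = torch.randn(256, 10)
targets = data.sum(dim=1, keepdim=True)          # a learnable target, fixed across runs

def train_once(seed, steps=500):
    torch.manual_seed(seed)                      # different random initialization per run
    model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
    opt = torch.optim.SGD(model.parameters(), lr=0.05)
    for _ in range(steps):
        idx = torch.randperm(len(data))[:32]     # random mini-batch sampling
        loss = nn.functional.mse_loss(model(data[idx]), targets[idx])
        opt.zero_grad(); loss.backward(); opt.step()
    return model, loss.item()

m1, loss1 = train_once(seed=1)
m2, loss2 = train_once(seed=2)

print(loss1, loss2)                              # both runs fit the data comparably well
w1 = torch.cat([p.flatten() for p in m1.parameters()])
w2 = torch.cat([p.flatten() for p in m2.parameters()])
print(torch.norm(w1 - w2).item())                # yet the learned parameters differ substantially
```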
Multiple "Good Enough" Solutions:
The result is multiple distinct sets of weights (different final models) that can all achieve similarly low loss on the training data and perform comparably well on benchmarks. They have found different, but apparently adequate, ways to model the statistical patterns in the data.
While their overall performance might be similar, these different models might exhibit subtle differences in how they handle specific inputs, edge cases, or novel generalizations. They represent slightly different learned functions.
Contrast with Human Language Acquisition (UG Perspective):
The Chomskyan model posits that human language acquisition is strongly guided by Universal Grammar (UG).
UG drastically restricts the "hypothesis space" – the range of possible grammars a child considers.
Despite variations in input (different caregivers, errors heard, etc. – the "poverty of the stimulus"), children in the same linguistic community converge remarkably consistently on the same core grammar. The process is robust.
While individual differences (idiolects) exist, the fundamental underlying grammatical system acquired is largely uniform, suggesting the learning process is channeled towards a specific target determined by the innate faculty interacting with the environmental data. There aren't multiple, radically different-but-equally-valid grammars being learned by different children exposed to broadly similar English input.
Implication for the Critique ("Too General"):
The fact that the LLM training process can settle into many different solutions suggests it lacks the strong constraints that force convergence towards a single, principled grammatical system like UG is proposed to do for humans.
The LLM's learning landscape allows for multiple ways to statistically approximate the data, indicating a more general, less constrained optimization process. It's finding a solution that fits the data, not necessarily the solution dictated by innate linguistic principles.
If the goal is to model the human faculty, which seems to robustly arrive at a specific type of grammatical knowledge, then a learning process that yields highly variable outcomes (even if performant) under identical conditions appears fundamentally different.
It reinforces the idea that the LLM is a powerful statistical learner optimizing a general objective function (next-token prediction), rather than an instantiation of a specific, constrained system like the human language faculty is hypothesized to be.
In essence, the non-deterministic nature of LLM training and its potential to yield multiple different-but-adequate models highlights the difference between unconstrained (or loosely constrained) statistical optimization and the strongly guided, robust convergence hypothesized for human language acquisition under the influence of Universal Grammar. This variability further supports the argument that the LLM mechanism is too general to be a faithful model of the specific human capacity.
Q: We can maybe end the discussion with the famous quote by Johnny von Neumann: "With four parameters I can fit an elephant, and with five I can make him wiggle his trunk." and consider the implications for models that have billions of parameters.
Gemini: Okay, that's a perfect quote to bring into this discussion, and it powerfully encapsulates a key aspect of the skepticism surrounding LLMs as models of human cognition.
John von Neumann's famous quip, "With four parameters I can fit an elephant, and with five I can make him wiggle his trunk," is a cautionary statement about the dangers of overparameterization and overfitting. It means that if you give a model enough adjustable parameters (degrees of freedom), you can force it to match any finite dataset perfectly, no matter how arbitrary or complex ("fit an elephant"). Adding even more parameters allows you to fit even the noise or irrelevant details within that data ("make him wiggle his trunk").
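The same point in miniature, as a NumPy sketch (a toy illustration, not an LLM experiment): give a polynomial as many coefficients as there are data points and it reproduces the observations exactly, noise and all.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 8)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)   # signal plus noise

trend   = np.polynomial.Polynomial.fit(x, y, deg=3)   # captures only the broad shape
wiggler = np.polynomial.Polynomial.fit(x, y, deg=7)   # 8 coefficients for 8 points

print(np.max(np.abs(trend(x) - y)))     # noticeable residual: only the trend is modelled
print(np.max(np.abs(wiggler(x) - y)))   # ~0 (up to rounding): the noise is fitted too
```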
Now, let's apply this to LLMs with their billions (or even trillions) of parameters:
Massive Capacity to Fit: If four or five parameters can theoretically fit an elephant, what can billions fit? They have an astronomical capacity to absorb and replicate the patterns present in their vast training data (terabytes of text).
Fitting the Data vs. Understanding the Principles: The core of the Chomskyan critique, aligned with von Neumann's warning, is that this immense capacity allows LLMs to achieve high performance on language tasks by creating an incredibly detailed, high-dimensional statistical map of the language data they were trained on. They become exceptionally good at "fitting the elephant" of observed language use.
Wiggling the Trunk (Fitting Noise and Idiosyncrasies): With so many parameters, LLMs might not only be fitting the core grammatical structures present in the data (themselves reflections of UG, according to Chomsky) but also fitting:
Specific phrases, idioms, and clichés.
Factual correlations present in the text (e.g., "Paris" frequently co-occurs with "France").
Stylistic quirks of the texts they were trained on.
Ungrammatical sentences or errors if they appear often enough.
Biases present in the data.
Perhaps even statistical artifacts unrelated to genuine linguistic structure.
This is akin to "making the trunk wiggle" – modeling aspects beyond the core, underlying system.
Lack of Explanatory Power for Human Cognition: The success achieved through such massive parameterization might be seen as a feat of engineering and data absorption rather than a discovery of the underlying principles of human language. Just because the model can generate fluent text (the "elephant" looks right, and its "trunk wiggles" convincingly) doesn't mean it has captured the generative principles (the biological/cognitive "blueprint" for elephants/language) in the way humans are hypothesized to. It might be an incredibly complex mimicry engine.
Generality Problem Revisited: The model's ability to fit the data so closely, using its vast parameters, doesn't inherently constrain it to only human-like language structures. It fits what it sees. If trained on different sequential data (stock market, impossible languages), it would use its parameters to fit that data. This reinforces the "too general" critique – it's a universal approximator adapting to data, not an instantiation of specific linguistic constraints.
In Conclusion:
Von Neumann's quote serves as a powerful analogy for the Chomskyan critique of LLMs. The sheer scale of LLM parameters raises the suspicion that they are succeeding by becoming exquisitely tuned statistical mimics of their training data ("fitting the elephant and making its trunk wiggle") rather than by implementing the specific, constrained, and generative mechanisms (Universal Grammar) thought to underlie genuine human linguistic competence. Their success might demonstrate the power of massive data and computation to approximate complex distributions, but not necessarily illuminate the core cognitive faculty responsible for language in humans. They might be incredibly detailed descriptions (or reproductions) of linguistic behavior rather than explanatory models of the underlying competence.
Q: OK that was not too bad as a first impression.
Gemini: Thank you! I appreciate the positive feedback. It was a genuinely interesting challenge to explore the depth and nuances of Chomsky's critique – the ideas about LLMs being potentially "too powerful" or "too general," lacking specific constraints like structure-dependence or inherent stopping criteria, their applicability to non-linguistic tasks, the stochasticity of training, and how von Neumann's quote ties it all together.
These are complex, foundational questions at the intersection of linguistics, cognitive science, and AI. Engaging in discussions like this helps me (as an experimental model) learn and refine my understanding.
I'm glad this initial impression was good! Feel free to explore other topics or delve deeper into this one whenever you're ready. I'm here to discuss.