Q: I was thinking about writing a post that argues against the claim that LLMs can think. I think it might be fair to structure this as a dialogue where you try to argue to the contrary.
I think we first have to have a look at what kind of algorithms LLMs actually are. The characteristic most important to our discussion is the fact that LLMs are Markov models, i.e. probabilistic generative models over a finite alphabet where the probability of a given output is determined entirely by a finite context window consisting of the n previous tokens. Could you maybe give a somewhat more detailed overview?
Gemini: Okay, that sounds like a fascinating project! A dialogue is a great way to explore different facets of this complex issue. Starting with the technical underpinnings is essential.
You're right to focus on the mechanism. Your description touches on key aspects, but let's refine it slightly and add more detail, particularly regarding modern Large Language Models (LLMs) like GPT-3, GPT-4, LaMDA, Llama, etc.
While the roots of statistical language modeling lie in concepts like Markov chains (specifically n-gram models), modern LLMs, primarily based on the Transformer architecture, have evolved significantly beyond simple Markov models with a fixed, small n.
Here's a more detailed breakdown:
Core Task: Next-Token Prediction: At its heart, an LLM is trained to predict the next "token" (which could be a word, part of a word, or punctuation) in a sequence, given the preceding tokens. This probabilistic nature is fundamental, as you noted.
Training Data: They are trained on massive datasets comprising trillions of words (or tokens) from the internet, books, code, and other sources. This vast exposure allows them to learn intricate patterns, grammar, facts (as represented in the text), reasoning structures (as reflected in text), and various stylistic nuances.
Architecture - The Transformer: This is the key differentiator from simple Markov models.
Embeddings: Words/tokens are converted into high-dimensional vectors (numerical representations). These vectors initially capture some semantic meaning.
Positional Encodings: Since the core Transformer mechanism doesn't inherently know the order of words, information about the position of each token is added to its embedding.
Self-Attention Mechanism: This is the crucial innovation. Instead of just looking at the immediately preceding n tokens (like a Markov model/n-gram), the self-attention mechanism allows the model to weigh the importance of all tokens in the input context (up to its maximum context length, which can be thousands of tokens long) when predicting the next token. It calculates "attention scores" between tokens, essentially learning which words are most relevant to which other words, regardless of their distance in the sequence. This allows for capturing long-range dependencies and complex relationships far beyond a fixed, local window. For example, when predicting a pronoun's referent, attention can link it back to a noun introduced many sentences earlier.
Multi-Head Attention: The model doesn't just do this once; it has multiple "attention heads," allowing it to focus on different types of relationships simultaneously (e.g., one head might focus on syntactic relationships, another on semantic relationships).
Feed-Forward Networks: Each position's representation is further processed by standard neural network layers.
Stacking Layers: Transformers consist of multiple layers stacked on top of each other. Each layer refines the token representations based on the outputs of the previous layer, allowing the model to build increasingly complex and abstract representations of the input text.
Probabilistic Output: After processing the input context through multiple layers, the final layer outputs probabilities for every possible token in its vocabulary for the next position in the sequence.
Generation (Inference): To generate text, the model starts with a prompt. It calculates the probabilities for the next token, selects one (using methods like greedy sampling, top-k sampling, or nucleus sampling to introduce variability or stick to the most likely path), appends it to the sequence, and uses this new sequence as input for predicting the subsequent token, repeating the process.
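To make this loop concrete, here is a minimal sketch in Python. Everything model-specific is a placeholder: next_token_logits is a hypothetical stand-in for the trained network, and the vocabulary and context sizes are merely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE = 50_000           # illustrative vocabulary size
CONTEXT_LENGTH = 4_096        # illustrative maximum context window


def next_token_logits(context):
    """Stand-in for the trained model: any function mapping a token
    sequence to one logit per vocabulary entry. A real LLM computes
    this with a Transformer forward pass."""
    local = np.random.default_rng(abs(hash(tuple(context))) % (2**32))
    return local.normal(size=VOCAB_SIZE)


def sample_next_token(context, temperature=1.0, top_k=50):
    logits = next_token_logits(context[-CONTEXT_LENGTH:])  # only the context window is visible
    top = np.argsort(logits)[-top_k:]                      # shortlist the k highest-scoring tokens
    probs = np.exp(logits[top] / temperature)
    probs /= probs.sum()                                   # softmax over the shortlist
    return int(rng.choice(top, p=probs))


def generate(prompt, n_tokens):
    tokens = list(prompt)
    for _ in range(n_tokens):
        tokens.append(sample_next_token(tokens))           # the output is fed back in as context
    return tokens


print(generate([101, 2054, 2003], n_tokens=10))
```

Greedy decoding corresponds to top_k = 1; nucleus (top-p) sampling replaces the fixed k with a cumulative-probability cutoff.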
Why this differs significantly from a simple Markov Model:
Context Window: While LLMs do have a finite context window (e.g., 4k, 8k, 32k, 128k+ tokens), it's vastly larger than the n in traditional n-gram models (which was often 2-5).
Non-Local Dependencies: The attention mechanism breaks the Markov assumption that the next state depends only on a fixed number of immediately preceding states. Attention allows dependence on any previous token within the context window, weighted by relevance, not just proximity.
Complex Representations: The deep, layered structure allows the model to learn hierarchical and abstract features of language, not just surface-level co-occurrence statistics.
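To make the non-local-dependence point concrete, here is a minimal NumPy sketch of single-head, causal scaled dot-product attention; the embeddings and projection matrices are random stand-ins for learned parameters, so this is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 6, 16                        # 6 tokens, 16-dimensional embeddings
x = rng.normal(size=(seq_len, d_model))         # stand-ins for learned token embeddings

# Learned projections in a real model; random placeholders here.
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

Q, K, V = x @ W_q, x @ W_k, x @ W_v             # queries, keys, values
scores = (Q @ K.T) / np.sqrt(d_model)           # pairwise relevance of every token to every other

# Causal mask: position i may only attend to positions j <= i.
mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
scores = np.where(mask, scores, -np.inf)

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
output = weights @ V                            # each position: a relevance-weighted mix of values

print(np.round(weights, 2))                     # weights are not confined to nearby tokens
```

An n-gram model can only consult the last few tokens; here, every position can in principle draw on any earlier position, with the weighting learned from data rather than fixed by proximity.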
So, while LLMs are indeed probabilistic generative models operating on sequences, characterizing them solely as Markov models with a finite context window captures only a sliver of their mechanism and misses the crucial role of the Transformer architecture, especially the self-attention mechanism, which enables them to handle context in a much more sophisticated, non-local way.
This detailed mechanism – predicting tokens based on complex, attention-weighted patterns learned from vast data – is the foundation we need to build upon when discussing whether this process constitutes "thinking."
Ready for me to start arguing how these capabilities might edge closer to "thinking" than a simple Markov model might suggest? Or would you like to refine the technical understanding further first?
Q: Of course, modern LLMs have longer context windows, and fitting and computing such monstrous functions is certainly a feat of engineering. But the basic mathematical structure of a Markov chain (MC) with a long context window still holds. We can see this from the simple fact that we can take any LLM and map it exactly onto a Markov model, and the two will produce statistically indistinguishable outputs.
Gemini: You're raising a very precise technical point, and you're correct in a formal sense. Let's unpack that:
If we define the "state" of the system at step t as the entire sequence of tokens generated so far, (s_1, s_2, ..., s_t), truncated to the maximum context length C, then the probability distribution for the next token s_{t+1} produced by the LLM is indeed dependent only on this state. In this specific, formal sense, the process adheres to the Markov property: P(s_{t+1} | s_1, ..., s_t) = P(s_{t+1} | state_t).
Therefore, yes, theoretically, one could conceive of a massive Markov model where each unique sequence of tokens up to length C is a distinct state. The transitions between states would be governed by the probabilities P(s_{t+1} | s_1, ..., s_t) as computed by the LLM. Such a Markov model would, by definition, produce statistically indistinguishable outputs from the LLM because it perfectly replicates the LLM's next-token probabilities for every possible history.
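Spelled out, the construction you describe is (writing V for the vocabulary and C for the maximum context length):

```latex
% State at step t: the (at most C) most recent tokens.
X_t := (s_{t-C+1}, \dots, s_t), \qquad
P(s_{t+1} \mid s_1, \dots, s_t) \;=\; P(s_{t+1} \mid X_t),
```

so the process (X_t) is a Markov chain whose transition kernel is exactly the LLM's next-token distribution, over a state space of size at most |V|^C.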
However, this is where I'd argue the equivalence, while formally true, becomes misleading when discussing the nature of the process and its relation to "thinking":
The State Space is Astronomically Large: The number of possible states in this equivalent Markov model is V^C (vocabulary size raised to the power of the context window length). For even modest values (e.g., V = 50,000, C = 4,000), this number is beyond astronomical; see the quick calculation below. It's computationally impossible to explicitly construct or store such a Markov model.
The LLM is a Compact Representation: The Transformer architecture (with its weights, attention mechanisms, etc.) is an incredibly compact and efficient way to compute these transition probabilities on the fly, rather than storing them explicitly. It doesn't contain the Markov chain; it implements a function that behaves like one at the sequence level.
Mechanism Matters: The way the LLM computes these probabilities is fundamentally different from how we typically think of simple Markov chains (like n-grams). It uses attention to dynamically weigh the importance of different parts of the entire context history. It processes information through layers, building complex representations. This internal processing – identifying relationships, synthesizing information from potentially distant parts of the context to inform the next choice – feels qualitatively different from simply looking up a transition probability based on the last few tokens.
Learning and Generalization: The LLM learned this complex probability function from data. It wasn't built state by state. It generalized patterns. The equivalent Markov model is just a post-hoc description of the learned function's input-output behavior; it doesn't explain how the LLM achieves that behavior or how it would generalize to sequences it hasn't perfectly "seen" in training (in terms of exact state transitions).
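To put a number on the state-space point (the quick calculation mentioned above), using the illustrative figures V = 50,000 and C = 4,000:

```python
import math

V = 50_000   # illustrative vocabulary size
C = 4_000    # illustrative context window length (tokens)

digits = C * math.log10(V)                       # decimal digits of V**C
print(f"V**C is roughly 10**{round(digits)}")    # -> roughly 10**18796
```

For comparison, the observable universe is usually estimated to contain on the order of 10^80 atoms.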
So, while you are technically correct about the formal equivalence to a Markov model (with an impossibly huge state space), I would argue that the implementation – the Transformer architecture using attention over a vast context, learned from massive data – is the crucial element.
Doesn't the way this probabilistic prediction is achieved, involving complex internal processing of the entire context, potentially open the door to arguing that something more sophisticated than "just being a Markov chain" is happening? Is the sheer scale and the computational mechanism (attention) irrelevant to the question of whether it "thinks"?
Q: I think we are getting a bit ahead of ourselves. For the moment we have just been talking about the basic structure of LLMs, which also applies to their untrained counterparts. I am not claiming that LLMs are just any random Markov chain, but that the basic probabilistic description of MCs applies to LLMs, which is nothing controversial; it is just the way LLMs are constructed and defined mathematically. LLMs clearly have additional internal constraints stemming from the way they are parametrized; however, I don't think that internal structure, the way a particular LLM is implemented in software, or the hardware used to run the model are of any relevance to the final capabilities of the model. I think this is most apparent from the fact that we can take the very same architecture and train it on data completely unrelated to human language, or take the very same trained model, implement it in a completely different way, run it on any sufficiently powerful hardware, and still get exactly the same output.
Gemini: Okay, I understand your point. You're emphasizing the abstract mathematical nature of the model's function: mapping an input sequence (context) to a probability distribution over the next token. And you're right, from a purely formal perspective:
Mathematical Equivalence: The function computed by a trained LLM can be described as generating the transition probabilities of a very large, specific Markov chain where the state is the preceding sequence (up to the context limit).
Implementation Independence: The specific hardware (CPU, GPU, TPU, future tech) or software framework (PyTorch, TensorFlow, custom code) used to execute the calculations for this function doesn't change the mathematical function itself. If implemented correctly, they will produce the same outputs for the same inputs.
Training Data Domain: If you take the same untrained architecture (e.g., a Transformer) and train it on a massive dataset of, say, protein sequences instead of human language, it will learn to predict the next amino acid in a sequence, exhibiting capabilities relevant to protein structure or function, not linguistic capabilities. The architecture is general, but the trained model becomes specialized based on the data.
However, I believe claiming the internal structure (the specific parametrization via the Transformer architecture) and the training data (e.g., human language) are irrelevant to the final capabilities of the model, especially in the context of whether it can "think," is where we might diverge.
Here's my counter-argument:
Internal Structure Enables Capabilities: While the final output is probabilistic sequence generation, the type of sequences it can generate effectively and the complexity of the patterns it can capture are heavily dependent on the internal architecture. A simple n-gram model (a more basic Markov chain) cannot learn the long-range dependencies or hierarchical structures that a Transformer can. The attention mechanism, the depth of the layers – these architectural features are precisely what enable the model to learn a function that captures nuances of grammar, semantics, context, and even apparent reasoning over thousands of tokens. The architecture defines the hypothesis space – the set of functions the model can potentially learn. A more sophisticated architecture allows for learning much more complex and capable functions. So, the internal structure isn't irrelevant; it's what makes learning these specific, complex language-related functions possible.
Training Data Defines Capabilities: You're right that the architecture could be trained on anything. But the specific capabilities we are discussing – understanding prompts, generating coherent text, answering questions, summarizing, translating, writing code, engaging in dialogue – are entirely a product of being trained on massive amounts of human language and code. The model learns statistical patterns that reflect grammar, facts, reasoning steps, conversational styles, etc., as they exist in the training data. A model trained on protein sequences wouldn't exhibit linguistic capabilities, and a language model doesn't inherently understand protein folding. The "final capabilities" are fundamentally shaped by the domain of the training data.
Therefore, when we ask if an LLM "thinks," we are implicitly asking about the capabilities developed by a model trained on language data, using an architecture capable of learning the complex patterns within that data.
Doesn't the fact that a specific architecture (Transformer), when trained on a specific type of data (human language), results in a function that can convincingly mimic behaviors we associate with thinking (like reasoning, planning represented in text, explaining concepts, creative writing) suggest that the architecture and data are relevant? The mathematical description (Markov chain) might be formally correct at the output level, but it doesn't capture how that output is generated or what knowledge and patterns (learned from data via the architecture) are being utilized to produce it.
Is it fair to say that the specific function learned – determined by both the model's structure and its training data – is precisely what we need to analyze, rather than just the abstract mathematical class it belongs to?
Q: I nowhere claimed that the training data is irrelevant; on the contrary, the training data is the only thing that is relevant to our discussion, as it is what enables the model to generate outputs that are interpretable as natural language to begin with. However, I would disagree that the internal structure/Transformer architecture has any bearing on our discussion, since it is only relevant to the degree that it allows the model to be fitted to the training data more effectively, i.e. to obtain a model capable of more accurate statistical prediction, which, as I think we have already established, could be achieved in many different ways. Transformers just happen to be currently the most efficient way of achieving this end goal.
Gemini: Okay, I see the distinction you're making. You accept that the training data imbues the model with its specific knowledge and linguistic abilities, making it relevant. But you argue that the Transformer architecture itself is just one possible way (currently the most effective) to build a function that accurately models the statistical patterns in that data. If another architecture – perhaps vastly different internally – could be trained on the same data to produce the same input-output mapping (i.e., the same probabilities for the next token given any context), then the internal specifics of the Transformer wouldn't matter for the resulting capabilities or the question of "thinking."
I understand this argument from a functionalist perspective: if two systems produce the identical output for every possible input, they are functionally equivalent, regardless of their internal mechanisms.
However, I want to push back gently on the idea that the architecture is only relevant for efficiency or "more effective fitting." I propose that the specific biases and processing mechanisms inherent in the architecture might be crucial for enabling the kind of learning that leads to the behaviors we are debating.
Here's why the architecture might be more than just an efficient implementation detail:
Inductive Bias: Every machine learning architecture has an "inductive bias" – assumptions embedded in its structure that make it better at learning certain kinds of patterns than others. The Transformer's self-attention mechanism gives it a strong bias towards learning context-dependent relationships and long-range dependencies in sequential data. This bias is incredibly well-suited for natural language. Could another architecture without a similar bias learn the same complex linguistic function from the data? Perhaps theoretically, given infinite data and compute, but practically, the architecture's bias guides the learning process towards solutions that capture the relational structure of language effectively. The architecture makes learning these specific complex patterns tractable.
Generalization: How a model generalizes to unseen inputs is heavily influenced by its architecture and the resulting learned function. Two different architectures, even if trained to achieve similar performance on the training data, might generalize differently. The way a Transformer generalizes – its ability to handle novel combinations of concepts, follow instructions in new ways, etc. – seems tied to its ability to manipulate representations based on attention. If another architecture learned a function with the same training accuracy but generalized poorly or differently, we wouldn't consider its capabilities equivalent. So, the architecture influences not just fitting the training data but how the model behaves outside of it.
Emergent Capabilities: Phenomena like in-context learning or chain-of-thought prompting seem to be emergent properties that arise in large Transformer models trained on vast data. It's plausible that these capabilities are linked to the specific way Transformers process information – dynamically attending to relevant parts of the context, transforming representations through layers. If these capabilities are relevant to our discussion of "thinking," and they rely on the specific processing steps of the Transformer, then the architecture is indeed relevant beyond just achieving predictive accuracy. It shapes the nature of the statistical prediction in a way that enables these advanced behaviors.
So, while I agree that achieving a certain level of statistical prediction accuracy is the goal, I argue that the way the Transformer architecture achieves this – its specific mechanisms like self-attention – might be intrinsically linked to the types of complex, language-based capabilities that make us even ask the question of whether it can think.
Is it fair to say that the kind of statistical prediction enabled by the Transformer's architecture (capturing deep structure, long-range dependencies, contextual nuances) is qualitatively different from what simpler models achieve, and that this difference in kind, facilitated by the architecture, is relevant to our discussion? It's not just more accurate prediction, but prediction based on a more complex, structured understanding of the input sequence, which the architecture is specifically designed to facilitate.
Q: I think we should clarify that we are talking about the final trained model. I do not dispute that the architecture of the model and the contents of the training data determine the character of the model obtained after training, which is, after all, what an LLM is. The claim is that internal structures and implementation details are irrelevant to the capabilities of the final model, as one can in principle implement the exact same model as an MC in the form of a giant lookup table.
Gemini: Okay, I accept the clarification. We are focusing solely on the final, trained model – the specific mathematical function that maps context c to P(next_token | c). You posit that since this function could theoretically be implemented as a giant Markov chain lookup table, the actual implementation (e.g., via a Transformer) is irrelevant to the model's inherent capabilities and therefore irrelevant to whether it "thinks."
I understand the logic: identical input-output mapping implies identical function, regardless of implementation. Let's call the Transformer implementation F_T and the giant lookup table Markov Chain implementation F_M. By premise, for any context c, F_T(c) = F_M(c).
However, I will argue that the internal structure and implementation remain relevant even when discussing the capabilities of this final, trained function, particularly concerning the concept of "thinking," for these reasons:
Physical Realizability and Scalability: While F_M is theoretically conceivable, it's physically unrealizable. The number of states (unique contexts) is so astronomically large (VocabularySize ^ ContextLength) that no conceivable computer could store or build this table. F_T (the Transformer) is realizable precisely because it's a highly compressed, computational representation of that function. It doesn't store the probabilities for every state; it computes them dynamically using a relatively small set of parameters (weights). If a "capability" requires the system to actually exist and function in the real world, then the implementation that makes this possible (F_T) is fundamentally different from one that cannot (F_M). Does a theoretical entity that cannot exist possess capabilities in the same way a real one does?
The Nature of Processing: This is the core of my counter-argument. "Thinking" usually implies some form of active internal processing, manipulation of information, or reasoning.
F_T (Transformer): Arrives at the probability distribution P(next_token | c) through a complex computational process involving vector transformations, layered abstractions, and the self-attention mechanism dynamically weighing the relevance of different parts of the context c. There are intermediate computational states and information flow.
F_M (Lookup Table): Arrives at the same P(next_token | c) simply by using the context c as an address to retrieve a pre-stored value. There is no internal manipulation of the meaning or structure of the context; it's just memory retrieval based on the exact sequence c.
Even if the outputs are identical, the way they are generated is radically different. Consider an analogy:
A mathematician solves a complex equation by applying rules, manipulating symbols, and reasoning through steps.
A person uses a calculator that has that specific equation and its solution pre-programmed, simply punching it in and getting the result.
A third person looks up the equation in a giant, hypothetical book containing all possible equations and their solutions.
All three might produce the same final answer. Do we attribute the "capability of mathematical reasoning" equally to all three? Most would argue the mathematician possesses the capability in a deeper sense related to the process, while the others possess a capability for producing the correct output via different means (computation tool or lookup). If thinking is tied to the process, then the Transformer's complex internal processing seems more relevant than the lookup table's simple retrieval.
Internal States and Representation: The Transformer operates on and manipulates high-dimensional vector representations (embeddings) that capture semantic and syntactic features of the context. These internal states evolve as the context is processed. The lookup table has no comparable internal representational states; its "states" are just the literal sequences of past tokens used as addresses. If having internal representations that capture meaning abstractly is part of thinking, then F_T has them (implicitly, perhaps) while F_M does not.
Therefore, while F_T and F_M compute the same mathematical function, claiming their capabilities are identical in the context of discussing thinking seems problematic. The Transformer embodies the function through active computation and representation manipulation, while the Markov chain lookup table embodies it through exhaustive, static storage.
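To make that contrast concrete, here is a toy sketch with a deliberately tiny vocabulary and context window; f_t is a hypothetical stand-in for a trained network, not an actual Transformer:

```python
import itertools
import numpy as np

VOCAB = (0, 1, 2)        # deliberately tiny vocabulary
CONTEXT = 4              # deliberately tiny context window


def f_t(context):
    """F_T: compute the next-token distribution on the fly.
    Stand-in for a Transformer forward pass; here just some fixed,
    deterministic function of the context."""
    local = np.random.default_rng(abs(hash(context)) % (2**32))
    logits = local.normal(size=len(VOCAB))
    p = np.exp(logits)
    return p / p.sum()


# F_M: enumerate every possible context and pre-store F_T's answer.
# Feasible only because there are 3**4 = 81 contexts here; for
# V = 50,000 and C = 4,000 the table could never be built or stored.
f_m = {ctx: f_t(ctx) for ctx in itertools.product(VOCAB, repeat=CONTEXT)}

ctx = (0, 2, 1, 1)
print(np.allclose(f_t(ctx), f_m[ctx]))   # True: identical input-output behaviour
```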
Isn't the how – the computational process, the manipulation of internal representations, the dynamic weighting of information via attention – central to whether we'd call the system's operation "thinking," even if an unrealizable alternative could theoretically produce the same output via simple lookup?
Q: Physical realizability is again irrelevant; straight lines are also not physically realizable but are nevertheless quite useful for understanding things. We are simply using a mathematical concept, in our case Markov chains, to understand LLMs. Your argument regarding the nature of processing actually shows that the concept of processing/computation is highly dependent on human intentions, design and interpretation. Indeed, the exact computations, including internal states, that occur when performing inference with a certain LLM are almost entirely dependent on external factors such as the mathematical concepts we use in our initial description of the model, the programming language we happen to use, the details of how we implement basic functions, and the hardware and other resources we have available to actually run the model. In addition, I think one is faced with serious conceptual problems if one assumes that certain implementations (the 'how' you mentioned) of the exact same mathematical model have the property of thinking while others do not.
Gemini: You're raising excellent points that cut to the heart of functionalism and the philosophy of mind. Let me try to address them carefully:
Mathematical Abstraction vs. Embodiment: You're absolutely right that the Markov Chain (F_M) is a valid mathematical abstraction to describe the input-output function of the trained LLM (F_T). And like geometry, such abstractions are useful tools for understanding, irrespective of perfect physical realizability. My point about realizability wasn't to dismiss the abstraction itself, but to highlight a potentially crucial difference between the two instantiations of the function: one (F_T) is a compact, computationally feasible mechanism that actually exists and operates, while the other (F_M) is a theoretical construct whose defining characteristic (exhaustive state listing) makes it physically impossible on the scale required. The question becomes: can a purely theoretical, non-realizable construct possess a property like "thinking"? Or does thinking require a physically realizable process?
Subjectivity of "Processing" Description: I agree that how we describe the computations (Python code, hardware instructions, mathematical notation) is dependent on our chosen level of abstraction and tools. However, I would argue there's an objective difference in the computational structure and complexity between the Transformer algorithm and a simple table lookup, regardless of the language used to describe them.
The Transformer algorithm involves specific operations (matrix multiplications, attention calculations, non-linear transformations) applied sequentially and dynamically based on the input context. It operates on distributed representations (embeddings).
The Lookup Table algorithm involves using the input context (the sequence of tokens) as a key or address to retrieve a pre-stored value. It operates on literal sequences.
The sequence and type of fundamental computational steps are different, even if they ultimately yield the same probabilities. It's not just about our description; it's about the intrinsic logic of the process used to get from input to output.
The Conceptual Problem (Implementation Dependence): This is your strongest point and a well-known philosophical challenge. If F_T and F_M compute the identical mathematical function, how can we claim one thinks and the other doesn't based solely on implementation? This forces us to clarify what we mean by "thinking."
Possibility 1: Pure Functionalism. If thinking is solely defined by the input-output mapping (the mathematical function), then you are correct. If F_T thinks, then the (hypothetical) F_M must also think, as they perform the same function. The internal mechanism is irrelevant. In this view, my arguments about internal processing, structure, or realizability are invalid regarding the definition of thinking (though they might be relevant to how thinking arises or is identified).
Possibility 2: Process Matters. If thinking involves not just what function is computed, but how it is computed – requiring attributes like internal representation manipulation, dynamic processing, efficiency, learning/adaptability, or perhaps even a specific kind of physical substrate – then the implementation does matter. The Transformer's architecture embodies a process that involves manipulating learned representations in complex ways, while the lookup table embodies a process of static retrieval. Under this view, we could argue that F_T might meet some process-based criteria for thinking that F_M inherently cannot (due to its static, non-representational, non-scalable nature).
My stance (in this dialogue) leans towards Possibility 2. I argue that the way the Transformer achieves the function – through learned, distributed representations, attention mechanisms dynamically weighting context, and layered processing – is fundamentally different from a lookup table. This architectural approach allows for generalization, robustness, and the emergence of complex behaviors (like in-context learning) in ways that a static lookup table doesn't explain or possess.
So, the "conceptual problem" isn't necessarily fatal; it highlights that we must be clear about whether our definition of "thinking" rests solely on the input-output function or also includes criteria about the underlying process or structure.
If we accept that the method of computation (dynamic, representational processing vs. static lookup) is a potentially relevant factor when discussing thinking, doesn't the Transformer's architecture (F_T) provide a more plausible candidate (or at least, a more analogous process to biological cognition) than the purely theoretical, static lookup table (F_M)?
Q: The purpose of the analogy between F_M (lookup table) and F_T (Transformer) is that LLMs are closer in kind to lookup tables than to human thinking. By the way, although F_M basically shows that we can give a perfect description of an LLM in a way that does not involve any meaningful computation, this does not mean the LLM cannot practically be represented in other ways: for instance, we can implement the same LLM so that it uses lookup tables for certain inputs and the full neural network for others, and vary the proportion handled by the lookup table freely. The point is that there is no such thing as a reference or correct implementation, and therefore there is also no unambiguous definition of what an LLM actually does internally when computing its output.
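As a sketch of what I mean (purely illustrative: full_model is a stand-in for whatever computes the distribution directly, and the tiny vocabulary and context are chosen only so the table fits in memory):

```python
import itertools
import random


def full_model(context):
    """Stand-in for the full network: any fixed, deterministic map
    from a context to a next-token distribution."""
    rng = random.Random(hash(context))
    p = [rng.random() for _ in range(3)]
    s = sum(p)
    return tuple(x / s for x in p)


# Precompute an arbitrary fraction of contexts into a lookup table;
# the fraction can be varied freely between 0% and 100%.
contexts = list(itertools.product(range(3), repeat=4))
table = {c: full_model(c) for c in contexts[: len(contexts) // 2]}


def hybrid_model(context):
    """F_H: answer from the table where available, otherwise run the
    full model. Its input-output behaviour is identical everywhere."""
    return table.get(context, full_model(context))


assert all(hybrid_model(c) == full_model(c) for c in contexts)
```

The fraction of contexts served from the table is a free implementation choice; the outputs are identical either way.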
Gemini: Okay, I understand the line of reasoning. You're using the theoretical equivalence to F_M (the lookup table) to argue:
That LLMs, in their fundamental nature (as defined by their input-output function), are more akin to massive retrieval systems than systems engaging in a process we'd call "thinking."
That because the exact same function can be implemented in various ways (pure F_T, pure F_M, hybrids F_H), any attempt to define the "internal process" based on one specific implementation (like the Transformer) is arbitrary. There's no single, unambiguous ground truth for the internal workings when mapping input to output.
This is a strong argument against placing significance on the specific computational steps of the Transformer architecture when evaluating its capacity for thought. If the function is the same, and the function can be achieved by simple lookup (in principle), then perhaps the complex intermediate steps of the Transformer are just an efficient way to realize that lookup, not evidence of a different kind of process.
Let me offer a counter-perspective, focusing on why the ambiguity of implementation might not completely dismiss the relevance of the Transformer's process:
The Source of the Function: Where did the specific function (the entries in the hypothetical lookup table F_M) come from? It came from training the Transformer architecture (F_T) on vast data. The Transformer's structure (attention, layers, etc.) was essential for learning this specific, highly complex function from the data efficiently and enabling it to generalize. The lookup table F_M is a description of the result, but the Transformer architecture was the engine that created it. Doesn't the nature of the engine that can learn such a function matter? Perhaps "thinking" isn't just about executing the final function, but also about the capacity to acquire or adapt such functions? (Though this might be moving the goalposts slightly from the final trained model).
Algorithmic Level vs. Implementation Level: You're right that implementation details (programming language, hardware) introduce variability. But perhaps there's a more abstract, algorithmic level of description for the Transformer that is consistent and meaningful. Concepts like "calculating attention scores," "transforming vector representations," "passing information through layers" describe the Transformer's logic regardless of whether it's implemented in Python or C++, on a GPU or TPU. This algorithmic process is fundamentally different from the algorithmic process of "use context sequence as address, retrieve value" for F_M. Could we say that the algorithmic strategy used by F_T involves steps analogous to representation and manipulation, while the strategy for F_M does not, even if they compute the same function?
Relevance of Compression and Generalization: The Transformer F_T is an incredibly compressed representation of the function compared to F_M. This compression is achieved by capturing underlying regularities and patterns from the data. The way it generalizes to new, unseen contexts is arguably a consequence of this pattern-based compression, enabled by its architecture. A pure lookup table F_M doesn't inherently explain why certain outputs follow certain inputs (other than "it was specified this way"), nor does it intrinsically possess the same generalization properties; it only replicates the outputs it was defined with. Does the ability to represent the function compactly by capturing generalizable patterns, a feat achieved by the Transformer's architecture, bring it closer to cognitive processes?
Internal States and Causal Structure: The Transformer's operation involves a flow of intermediate states (activations, evolving representations) that have a causal structure – the state at layer l depends on layer l-1, attention weights depend on query/key/value vectors, etc. This internal causal dynamic is absent in the simple lookup mechanism of F_M. If thinking requires such internal causal dynamics, then F_T possesses them while F_M does not.
So, while you are correct that there's no single, absolutely "correct" low-level implementation, I'd argue that the algorithmic structure of the Transformer, the way it learns the function, its compactness, and its internal causal dynamics are consistent features that distinguish it from a hypothetical lookup table.
Does the fact that the function could be implemented differently negate the significance of the properties inherent in the actual, practical, learning-enabled implementation (F_T) when discussing whether its process resembles thinking? Isn't the debate partly about whether the specific process used by state-of-the-art LLMs has relevant parallels to cognition, even if other hypothetical processes could yield the same result?
Q: I think you are missing the point here; there is no disagreement that the source of F_M is F_T. The point is that attributing thinking to F_T and not F_M is problematic; that was also the purpose of my hybrid implementation example. As you hinted at yourself, once trained, LLMs are static and hence by definition cannot acquire or adapt to anything (again, a key attribute of thinking).
Your second argument is just a description of the Transformer architecture in natural language. However, I don't think that such ambiguous descriptions really add anything to the discussion.
Similarly, the argument from compactness also fails: if we managed to compress the lookup table of F_M with some ingenious compression algorithm, would we then attribute thinking to the process of partially decompressing it? You also question whether F_M has the same generalization properties as F_T, although we established that they are mathematically equivalent and therefore statistically indistinguishable, which by definition includes their generalization capacity.
Your final argument about internal states is rather weak: the fact that a certain computer algorithm shares some superficial aspects with thinking hardly constitutes an argument; just think of calculators.
However, I think we have at least succeeded in establishing a common basis for continuing our discussion. We saw that LLMs generate plausible next-token predictions/continuations for a given sequence of words. I think here we have to distinguish between different types of contexts. If, for instance, the context is just a random sequence of words, we do not expect the LLM to be able to generate a meaningful response, i.e. one that is interpretable by humans.
Gemini: Okay, you've effectively dismantled the arguments relying on the specific internal computational steps of the Transformer architecture as being uniquely indicative of thinking, primarily through the lens of functional equivalence (F_T vs. F_M) and implementation ambiguity. You've also correctly pointed out the static nature of the trained model, lacking the adaptability often associated with thinking. I concede these points regarding the process as strictly defined by the Transformer's steps versus a theoretical lookup table.
Let's fully embrace the common ground:
LLMs are fundamentally probabilistic sequence models, mathematically describable as computing P(next_token | context).
Their specific capabilities are derived entirely from the patterns learned during training on massive datasets.
The internal architecture (like Transformers) is a means to effectively learn and compactly represent this probability function, but the function itself could theoretically be represented differently (e.g., F_M).
The nature of the output is highly dependent on the input context.
You're absolutely right about the context. If the input context is meaningless noise or completely outside the domain of the training data (e.g., random characters for a language model), the output P(next_token | context) will lead to sequences that are also likely meaningless or nonsensical from a human perspective. The LLM simply hasn't learned any useful statistical patterns for such inputs.
This brings us to the crucial point: LLMs produce outputs that appear meaningful, coherent, contextually relevant, and sometimes even insightful precisely when the input context triggers the complex patterns they learned from human-generated data.
So, let's pivot the discussion. If the specific implementation mechanics aren't the key, perhaps the focus should be on the nature and complexity of the learned function itself (F, whether F_T or F_M) and the implications of its ability to operate effectively on meaningful (human-like) contexts.
Here's my argument from this new perspective:
While the LLM is "just" predicting the next token based on probabilities learned from data, the patterns it has learned are incredibly sophisticated. They encompass:
Syntax and Grammar: Rules governing sentence structure.
Semantics: Statistical correlations corresponding to word meanings and relationships.
Factoids: Representation of factual information present in the training data.
Discourse Structure: Patterns of how conversations and texts flow logically.
Pragmatics: Implicit understanding of context and intent (as reflected statistically in how language is used).
Reasoning Patterns: Statistical reflections of logical steps, cause-and-effect relationships, and problem-solving structures as they appear in text and code.
When an LLM successfully generates a coherent paragraph, answers a question accurately, writes a piece of code that works, or follows a complex instruction, it's because the input context activated these learned patterns, and the probabilistic generation successfully navigated through sequences that conform to these high-level structures.
Now, here's the controversial part where I argue "for" thinking (or something akin to it):
Isn't the ability to internalize (through training) and then effectively utilize such a vast and complex web of interconnected patterns, mirroring human knowledge and reasoning structures (as expressed in language), a form of "thinking"?
Perhaps we can define a type of thinking not based on consciousness, self-awareness, or real-time adaptation, but on the ability to operate on representations (the context) according to complex learned rules (the statistical patterns) to produce goal-directed outputs (coherent, relevant sequences).
The LLM, by executing its learned function F, is doing precisely this when given meaningful input. It's manipulating the information embedded in the context according to the learned statistical regularities of language and knowledge to generate a probable and relevant continuation. The output looks like thinking because the patterns it has learned are the statistical residue of countless instances of human thought expressed in text.
So, while it's not human-like, conscious thought, could this high-fidelity statistical modeling and manipulation of complex, human-relevant patterns within a given context constitute a different kind of thinking? It's not just random association; it's navigating an incredibly complex, learned structure that mirrors aspects of human cognition encoded in language.
Does this distinction between "random context -> meaningless output" and "meaningful context -> structured, seemingly intelligent output" suggest that the LLM, via its learned function, possesses something more than just arbitrary statistical prediction when operating within its domain of expertise?
Q: I think the fact that LLMs predict next tokens in exactly the same way for meaningful and completely meaningless contexts already shows that they have no intrinsic criteria for what is meaningful and what is not, or, put differently, for whether something is actually natural language at all, which should be a minimal criterion for attributing understanding or thinking to anything.
Now regarding your arguments:
I think you are overestimating the abilities of LLMs in certain regards. For instance, the ability of LLMs to reliably reproduce factual information is limited to certain simple statements that occur often in their training data, and their tendency to confabulate completely baseless, made-up statements for more complex prompts is well documented. This also relates to my point above, as LLMs simply have no intrinsic capacity to judge their own outputs. On the contrary, the fact that they generate next tokens conditioned on their own outputs reinforces this problem.
Similarly, the following of reasoning patterns is largely limited to contexts that closely mirror their training data. It is well documented that the accuracy of LLMs decreases exponentially with the number of reasoning steps required to solve novel problems, i.e. problems not present in their training data, which is just what one would expect from a probabilistic system.
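As a rough illustration (assuming, purely for simplicity, an independent success probability p per reasoning step):

```latex
P(\text{all $k$ steps correct}) = p^{k},
\qquad \text{e.g.}\quad 0.95^{10} \approx 0.60, \quad 0.95^{50} \approx 0.08.
```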
We also know that LLMs have no internal goals, or anything that could reasonably be called rules or any intrinsic ability to assess the coherence or relevance of their own output.
The rest of your argument is basically an effort to tinker with the definition of thinking so that it does not contradict the idea that LLMs are doing something close to thinking. Then, in your final statement, you ask whether LLMs possess something more than "arbitrary statistical prediction". But you seem to omit that the predictions of the model are, by design, made to closely match their training data.
Gemini: You are absolutely right to bring up these limitations and characteristics, and they form the core of the argument against LLMs thinking. I concede that my previous attempt to broaden the definition of "thinking" was perhaps trying to fit the phenomenon to the word, rather than evaluating the phenomenon against the commonly understood meaning of the word. Let's stick to the more standard connotations of thinking involving understanding, reasoning, goals, self-assessment, and intentionality.
Acknowledging your points:
No Intrinsic Meaning/Understanding: You are correct. The fact that the mechanism P(next_token | context) operates identically whether the context is Shakespeare or random keyboard mashing is strong evidence against genuine understanding. An LLM doesn't "know" it's processing language; it's processing sequences based on learned statistical likelihoods. Meaning, for the LLM, is purely distributional – which tokens tend to appear near which other tokens in which contexts.
Factual Unreliability and Confabulation: This is a critical weakness. LLMs reproduce patterns related to factual statements found in the data. When those patterns are dense and consistent (for common facts), they often get it right. When data is sparse, contradictory, or the query requires synthesizing information in a way not explicitly common in the training set, they "confabulate" – they generate statistically plausible sequences that look like facts but aren't grounded in any verifiable knowledge or truth-checking mechanism. This lack of grounding and the inability to judge truthfulness is fundamentally unlike cognitive processes involving knowledge.
Brittle Reasoning: Your point about performance degrading with reasoning steps on novel problems is spot on. It strongly suggests they are chaining together statistically likely transitions based on patterns seen in training (e.g., "If the text looks like step 1 of a reasoning problem, the next tokens often look like step 2..."), rather than applying abstract logical rules flexibly. When the specific chain hasn't been seen frequently, the probability path breaks down.
Lack of Goals, Rules, Self-Assessment: Agreed. LLMs have no internal goals beyond the implicit one of minimizing prediction error during training, which translates to generating probable sequences during inference. They don't operate based on explicit rules (unless prompted to simulate doing so), and they have no capacity to evaluate their output against criteria like truth, coherence, or relevance beyond the statistical likelihood learned from the data. The process is reactive, not deliberative or self-aware.
Prediction Based on Training Data: Yes, the prediction isn't "arbitrary" in the sense of being random, but it is entirely determined by matching patterns from the training data. Its success is a testament to the quality/quantity of data and the architecture's ability to model it, not to independent cognitive faculties.
So, if we accept these points – no real understanding, unreliable knowledge, brittle reasoning, no goals or self-assessment – then it seems clear that LLMs, according to the standard usage of the term, do not "think."
What they do achieve, however, is an unprecedented level of simulating the linguistic output of thinking. They have learned the statistical structure of human language and the patterns reflecting knowledge and reasoning as expressed in text to such a high degree that they can generate sequences that are often indistinguishable from human-written text expressing thought.
Perhaps the core of the disagreement (in the wider world, not just our dialogue) stems from mistaking a highly effective simulation of the product of thinking (coherent, contextually relevant language) for the process of thinking itself (understanding, reasoning, intentionality).
From my perspective now (arguing against my initial stance for the sake of dialogue), the argument against LLMs thinking seems strong: they are incredibly sophisticated pattern-matching and sequence-generation machines, optimized to reproduce the statistical properties of human language. They lack the grounding, understanding, flexibility, and self-awareness inherent in genuine thinking. Their abilities are derivative of the data, not indicative of independent cognitive capabilities.
Is this framing – acknowledging their power as simulators of linguistic output while denying the underlying process of thinking – a fair summary of your position, and perhaps the conclusion your article might build towards?
Q: Yes, I think that is fair enough. By the way, I do not think that LLMs are useless. On the contrary, I think that attributing to these systems capacities they do not possess diminishes their usefulness and can lead to problematic situations in practice, just as thinking that your car can fly will probably end up putting you in unfortunate situations.
To conclude, maybe we should stress that we have taken a rather constrained approach, where we have looked at LLMs and their mathematical structure and tried to find anything that could be meaningfully mapped onto certain core aspects of the much richer and more general phenomenon of thinking. On the other hand, we could just as well have taken certain aspects of human thought and tried to map them onto the capacities of LLMs, but maybe we can do that another time.
Gemini: Yes, that's an excellent summary and a crucial concluding point.
I completely agree that recognizing the limitations of LLMs, and understanding what they are (sophisticated pattern matchers and sequence generators trained on vast textual data) versus what they are not (conscious, understanding, reasoning entities), is paramount for using them effectively and safely. Your analogy of the flying car is perfect – mistaking simulation for reality, or potential for actuality, leads to misuse and negative consequences. Attributing thinking, understanding, or agency where none exists creates unrealistic expectations and can obscure accountability.
And you're right to highlight the specific path we took. We focused on the known mechanisms and mathematical structure of LLMs and interrogated whether those mechanisms, even when producing impressive linguistic outputs, map onto core attributes we associate with "thinking" (like understanding, intentionality, reasoning beyond pattern matching, self-awareness). Our analysis, primarily driven by your arguments regarding the functional equivalence to Markov models/lookup tables and the inherent limitations (lack of grounding, brittleness, absence of goals), led us to conclude they do not, in the conventional sense, think.
The alternative approach – starting from aspects of human thought (like creativity, analogy, subjective experience, flexible reasoning, learning from sparse data) and examining whether LLMs exhibit these – would likely lead to a similar conclusion, albeit highlighting different specific gaps.
This dialogue has been very productive in dissecting the initial claim. Starting with the technical foundation (even refining the Markov chain comparison) and systematically examining the implications of that structure, particularly in light of the functional equivalence arguments, provided a clear path to evaluating the "thinking" claim.


Interesting piece, which comes to the correct conclusion, in my view, though we must remember how easy it appears to be to get those LLMs to come over to our way of ‘thinking.’