Meaning in Large Language Models: Form vs Function
Term Paper for
of Language and Computation
Language and Logic: Formal Representations of Meaning
Large language models (LLMs) mark a major success of machine learning, surprising the general public as well as the research community with apparently ever-increasing sophistication in language generation and understanding. Whether recent advances can be extrapolated to obtain human-level intelligence is fiercely debated. A cornerstone of intelligence, if we were to define it based on reasoning capabilities, is a notion of “understanding” the world — in the context of LLMs, this amounts to the ability to derive meaning from text. In the present work, we contrast the abilities and limitations of LLMs in capturing meaning with respect to selected works from formal and functional semantics.
OpenAI's GPT family of models introduced what is now known as large
language models: sequential generative models consisting of billions of
parameters, trained via next-token prediction on massive
text datasets (Radford et al.
2019; Brown et
al. 2020). Every new release, including ChatGPT, the family's
latest installment (OpenAI 2022), marked a significant
advance in the state-of-the-art of language generation and
understanding. There are several reasons why these models continue
to spark general excitement. For one, the quality of text they produce
is increasingly indistinguishable from text produced by humans, which in
turn gives the impression of human-level text understanding. GPT and
related models such as BERT (Devlin et al. 2019) can further be used
as “foundation models”, a term coined by researchers at Stanford (Bommasani et
al. 2022). These models capture relevant knowledge that may
unlock many downstream tasks, accessible via fine-tuning, building on
top of learned representations, or prompting. As a consequence, the field
of natural language processing (NLP) has undergone dramatic shifts in
recent years, and a major focus is now the advance but also critical
inspection of these large, pre-trained foundation models. (The present author is not particularly fond of this term, which deliberately detracts from technical aspects, involves possible downstream applications in the discussion, and shortcuts the conclusion that something of fundamental importance is being learned. We will use the term “large language models” to refer to pre-trained language models such as GPT-3.) Additionally, they serve as evidence for the scaling hypothesis: given a suitable and general-enough architecture, increases in model and training data size will eventually allow models to exhibit human-level intelligence. (Needless to say, the scaling hypothesis implies that human intelligence is also foremost a product of increases in brain mass and exposure to a lot of sensory input.)
With LLMs being hailed as a road to “artificial general intelligence”, their true potential is subject to increasingly fierce debate. (Such debates are not new, but — understandably — intensify when particular technologies are presented as sufficient to acquire human-level intelligence. As an interesting parallel, in its early years OpenAI (among others) advocated strongly for reinforcement learning as the future of AI, and designed ambitious benchmarks such as “Universe”, in which agents interact with a computer much like humans would in order to play games and perform tasks on the web (OpenAI 2016).) Proponents rightfully point to the experimental evidence of the surprising generation quality that has been achieved, with models operating at a level of performance that was unimaginable just a few years ago. It is also tempting to argue that if a system is able to predict what comes next in a sequence — which corresponds exactly to the training criterion of language models — it must have acquired real understanding (Greenberg 2022). (This argument originates in an information-theoretic understanding of intelligence and is reminiscent of Shannon's experiment for predicting text on a letter-by-letter basis (Shannon 1951). It is also reflected in challenges such as the Hutter prize, which is motivated by the claim that “compression is equivalent to general intelligence” (Mahoney 2009).) On the other hand, critics argue that LLMs will never be able to learn the meaning of language, no matter the amount of text they are trained to predict (Bender and Koller 2020). Indeed, for publicly accessible models like ChatGPT, testers are quick to discover the limits of their understanding. Others lament that neural network approaches themselves are fundamentally incapable of acquiring common-sense knowledge and reasoning capabilities, and argue for a revival of symbolic approaches to AI (Marcus 2018) — a criticism that the present author, a connectionist, is willingly putting aside for the purpose of this essay.
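To make the training criterion at the center of this debate concrete, next-token prediction can be illustrated with a toy count-based model. The corpus and the bigram table below are deliberately simplified stand-ins: real LLMs replace the count table with a neural network, but the objective of predicting what comes next is the same.

```python
from collections import Counter, defaultdict

# Estimate P(next token | context) from corpus counts.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def next_token_distribution(context):
    counts = bigram_counts[context]
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

# "the" is followed by "cat", "mat", "dog", "rug" once each,
# so each continuation receives probability 0.25.
print(next_token_distribution("the"))
```

Training a language model amounts to making these conditional distributions match the corpus as closely as possible; the dispute is over what, if anything, a model that succeeds at this has thereby understood.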
We set out to explore the question on whether LLMs can acquire understanding from a philosophical point of view. A central observation is that “meaning” — assumed to be the result of “understanding”, and hence the subject of this piece — is itself lacking a universally agreed-upon definition. This leads us to examining LLMs under different aspects of meaning. We will first look at assumptions with respect to the formal treatment of meaning that are implemented in recent models, and the expectations that emerge as a result. We then turn to explore fundamental limitations of the vanilla LLM approach (purely unsupervised next-token prediction of text), and close with discussing current efforts for overcoming those limitations.
The idea that the meaning of text can be assessed via computational
models has arguably been one of the fundamental advances in semantics.
It was developed in the 1970s by Richard Montague who showed that,
analogous to Chomsky's formal approach to syntax (Chomsky 1957), it is possible to
determine the meaning of a sentence with formal methods (Montague
1970a). The underlying assumption that enabled Montague's
approach is the compositionality of natural language, commonly
attributed to Frege. (Frege never actually formulated the compositionality of meaning as a principle, and whether he believed it to hold is contested.) Montague develops his method by categorizing expressions of language according to syntactic and semantic function, and defining operations on these categories using lambda calculus. Leaving aside the precise rules developed in Montague's study, what can we take away from his work (and its legacy) regarding LLMs?
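Montague's core move, assembling sentence meaning by function application over typed denotations, can be sketched in a few lines. The fragment below is an invented toy, not Montague's actual rule system: nouns and intransitive verbs denote sets of individuals, and determiners denote higher-order functions expressed as lambda terms.

```python
# Denotations of basic expressions: sets of individuals.
man    = {"socrates", "plato"}
sleeps = {"plato", "fido"}

# Determiners as lambda terms: a noun denotation maps to a function
# from verb denotations to truth values.
every = lambda noun: lambda verb: noun <= verb       # subset test
some  = lambda noun: lambda verb: bool(noun & verb)  # non-empty overlap

# Function application mirrors syntactic composition:
# [[every man sleeps]] = every(man)(sleeps)
print(every(man)(sleeps))  # False: socrates is not among the sleepers
print(some(man)(sleeps))   # True: plato is
```

The point is structural: once basic denotations are fixed, sentence meaning falls out of composition alone, which is the property we will ask of LLMs below.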
A corollary of Montague's semantics is that, in order to compute the meaning of a sentence, we require a device capable of simulating a program defined in lambda calculus. This is equivalent to Turing completeness, and thus we can first ask whether Transformers (Vaswani et al. 2017) — the primary architecture class of LLMs — possess this property. Perez, Marinkovic, and Barcelo (2019) present a proof under several assumptions: arbitrary precision of internal representations (such as word or position embeddings), which eliminates the dependency on unbounded external memory, and hard (rather than the usual soft) attention. Whether Turing completeness holds in practice is an open question: irrespective of assumptions, models still need to learn from data using an optimization process that is highly non-convex. Delétang et al. (2022) find that Transformers fail to generalize on tasks that require learning non-regular languages (which would imply that learning operators as used by Montague is not possible). However, they only assess generalization to longer sequences, and the positional encoding schemes used in Transformers result in out-of-distribution inputs in these cases. In a recent preprint, Schuurmans (2023) implements a universal Turing machine on a Google-internal LLM. The construction requires external memory but utilizes the LLM without further fine-tuning, albeit with heavily engineered prompts. Further evidence of universality in this context is the remarkable success of Transformers in symbolic mathematics, e.g., integration or regression for recurrent sequences (Lample and Charton 2019; d’Ascoli et al. 2022).
For the sake of the argument, let us hence assume that Transformers
would, in theory, be able to implement (and learn) the necessary
operations to assign meaning to any given input sentence via
composition. We now have to turn our attention (pun intended) to the precise role of meaning in Montague's framework. First, Montague (1970a) addresses a restricted definition of sentence meaning, namely the truth value of a sentence. (Montague (1970b) addresses entailment, and Janssen and Zimmermann (2021) reference later extensions to questions and discourse.) Following model-theoretic semantics, truth is assessed with respect to a model which maps syntactic to semantic categories, and semantic categories in turn consist of denotation functions that map expressions to denotations. The denotation functions are specified over “possible worlds” and defined over properties. For example, denotation functions for proper nouns would allow the identification of which properties hold in a given possible world, while for adjectives, properties are intersected, yielding functions from properties to properties. In order to assign meaning to a sentence in a given world, it is therefore necessary to derive properties (and, eventually, denotation functions) for basic expressions of a language.
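World-relative evaluation and adjective intersection can be illustrated with a minimal sketch. The two “possible worlds” and their properties below are invented for illustration; the mechanism (denotations depend on the world, and an intersective adjective-noun phrase denotes a property intersection) is the one just described.

```python
# Each world assigns a set of individuals to each property.
world_1 = {"star":   {"venus", "sirius"},
           "bright": {"venus"}}
world_2 = {"star":   {"sirius"},
           "bright": {"sirius"}}

def denotation(expr, world):
    # An intersective adjective-noun phrase denotes the intersection
    # of its component properties in the given world.
    words = expr.split()
    result = world[words[0]]
    for w in words[1:]:
        result = result & world[w]
    return result

# Truth of "venus is a bright star" depends on the world of evaluation:
print("venus" in denotation("bright star", world_1))  # True
print("venus" in denotation("bright star", world_2))  # False
```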
However, Montague himself was not concerned with how to derive the aforementioned properties; he assumed those to be provided. To answer the question of whether LLMs could learn these properties, we can refer to the distributionalist assumptions underlying these models. Gastaldi (2021) investigates the function of word2vec (Mikolov et al. 2013), one of the first large-scale approaches to learn word embeddings (mappings from words to high-dimensional vectors) which capture linguistic properties such as similarity. In the framework of distributionalism, the meaning of individual units (here: words) is the effect of structure, i.e., it is provided by the interrelation between units. In word embeddings, similarity is further captured with respect to different aspects, as demonstrated in such scenarios as subtracting the embedding of man from king and adding the embedding of woman, for which the result is close to the embedding of queen. These observations are relevant to our discussion insofar as LLMs learn word and sub-word embeddings implicitly, and it is this mapping that would capture the properties required for semantic analysis according to Montague.
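The king-queen analogy can be made concrete with hand-crafted toy vectors (learned word2vec embeddings would have hundreds of dimensions; the three dimensions here are invented purely to show the vector arithmetic):

```python
import numpy as np

# Toy embeddings: dimensions loosely track "royalty", "male", "female".
emb = {
    "king":  np.array([0.9, 0.9, 0.1]),
    "queen": np.array([0.9, 0.1, 0.9]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "child": np.array([0.1, 0.5, 0.5]),
}

def nearest(vec, exclude=()):
    # Nearest vocabulary entry by cosine similarity.
    sims = {w: vec @ v / (np.linalg.norm(vec) * np.linalg.norm(v))
            for w, v in emb.items() if w not in exclude}
    return max(sims, key=sims.get)

target = emb["king"] - emb["man"] + emb["woman"]
print(nearest(target, exclude={"king", "man", "woman"}))  # "queen"
```

With real embeddings the analogy holds only approximately, but the principle is the same: relational structure between words is encoded as directions in the vector space.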
At this point, the elephant in the room is the reference to “possible worlds” above. We have seen that it is not unreasonable to assume that LLMs would be able both to capture the meaning of individual elements of language and to perform analysis by learning and applying compositional operations. Models as used by Montague to derive meaning are however not to be confused with the real world but rather specify “reality as conceived by language” (Janssen and Zimmermann 2021). As such, even without considering the effect of learning the above components from a text corpus, we have to concede that we are limited by the medium of language, i.e., how we speak and write about the real world rather than how it actually is — with the upside, however, that we can express ideas that have no (physical) manifestation. In any case, we are presented with a gap between what is expressed in a text corpus and the “true world” as experienced by the humans that produced the text, and this is a central theme in LLM criticism.
As alluded to in the introduction, Bender and Koller (2020) posit that LLMs are, fundamentally, not able to learn meaning. In their argument, they consider two different definitions of meaning. The first definition regards meaning as a mapping from language expressions to communicative intent, which we refer to as a functional view of language. Secondly, they treat meaning as a mapping to conventional meaning, i.e., the formal view. Bender and Koller connect these two definitions by noting that, for a receiver to deduce communicative intent, they have to integrate conventional meaning with contextual knowledge (and likewise, the sender has to consider conventional meaning and context when formulating their intent as an expression). For both cases, Bender and Koller (2020) argue — with thought experiments rather than formal arguments — that training LLMs purely on text (i.e., form) provides insufficient training signal to deduce any of the two types of meaning.
Let us consider these two definitions of meaning in further detail. Each refers to a missing link between what can be perceived from text and the world as perceived by humans. They differ however in the role that is ascribed to language. With respect to conventional meaning, we would demand a grounding of linguistic expressions into “our” world. In the model-theoretic setting discussed above, this would translate to learning a model that is indistinguishable from our lived reality in the sense that truth values assessed with it correspond to actual truth. As noted, a model obtained from language will necessarily reflect language (rather than the real world), and by picking up Frege's distinction of Sinn (sense) and Bedeutung (reference), we can make this more concrete (Frege 1892). Frege's famous example concerns the case in which two expressions, “the morning star” and “the evening star”, have a single Bedeutung (Venus) but different senses. Where do senses come from? While Frege (1892, 153) asserts that the “sense of a proper name is grasped by everybody who is sufficiently familiar with the language or totality of designations to which it belongs” — which would, at first glance, support the scaling hypothesis — he also equates the sense of a sentence with a thought, albeit not referring to an individual act of thinking, and he describes sense as being given (Gegebensein). It appears that the emphasis here is on “grasping” the sense in language, which would correspond to connecting it with an already given concept. Montague (1970b, 379) describes meanings as to “serve as interpretations of expressions”, while senses are “intensional entities that are sometimes denoted by expressions”. The conclusion we could draw here is that when working with text, we would be exposed to various denotations of expressions and could capture references, but senses would remain out of reach.
Understanding from a functional view of language is the subject of
pragmatics, i.e., investigating how language interacts with its users
and how these interactions are shaped by context. In a series of
lectures, John Austin was one of the first to examine these issues (Austin 1962).
Austin laments the focus of his contemporary philosophers on deducing
the truth value of sentences (considering them as statements only) and
neglecting what he terms the performative aspect of language. An example
he provides is that uttering “I name this ship the Queen Elizabeth” (and
smashing a bottle against the stem) is performing the naming of a ship,
in contrast to describing what is done or stating that it is done.
Austin picks up Frege's notion of sense and reference and introduces a third concept, illocutionary force (“I want to distinguish force and meaning in the sense in which meaning is equivalent to sense and reference, just as it has become essential to distinguish sense and reference within meaning.”), which allows utterances to be categorized by their intended effects, e.g., verdictive (giving a verdict or estimate) or exercitive (exercising power, such as ordering or voting). He highlights that, in any case, meaning is highly context-dependent (Austin 1962, 100) — which is even more apparent when considering illocutionary effects. At a high level, achieving language understanding that is aware of the function of speech on the world itself would therefore require our LLM's implementation of a “possible world” to consider a swath of contextual information outside of the text itself. Next, a speech act can potentially alter the context: imagining a dialogue interface such as provided by ChatGPT, the responses produced by the LLM have an effect on the user. Finally, a human speaker is — admittedly, often to quite a variable degree — aware of the effect of their writing or speech on the reader or listener. Such an awareness would ultimately require a theory of mind.
We can summarize that, for LLMs to exhibit a level of language
understanding that we would consider appropriate, it is necessary to
capture both sense and force, and neither appear to be easily found in
text data alone. As we noted earlier, meaning in language is far from a
settled question in philosophy itself, and it's further questionable
whether the formal and functional view are at all compatible; indeed, a
point of dispute is which of the two is more essential. In a recent
article, Widell and
Harder (2019) review fundamental concepts in
formal and functional semantics and provide an affirmative answer — both
thought and communication are essential — and supply an evolutionary
perspective. They propose a possible meeting point for both schools:
assertions, i.e., claims about a truth value, such as in “it is fruitful to turn to works from philosophy when assessing language understanding in current LLMs.” They argue that an assertion is a speech act (making an assertion) and that Frege already regarded assertions as the main tool for linking language to the real world.
In the context of LLMs, the potential of assertions as a form of
implicit grounding was recently investigated by Merrill et al. (2021).
Their work is motivated by one of the thought experiments in Bender and Koller (2020):
LLMs would not be able to predict the result of a piece of program code,
akin to emulating it, if they are trained on code only. However, unit
tests often contain assertions regarding the correct behavior of the
program, so maybe this could be of help? Merrill et al. (2021)
present a negative result. They prove that if “strong transparency” does not hold (i.e., the value an expression refers to depends on context), emulation by using assertions instead of grounding (knowledge of the underlying Turing machine) is not possible in finite time. (Contrasting this with the successful construction of a Turing machine by Schuurmans (2023), one could conclude that the pre-trained LLM used there did in fact achieve the necessary grounding — possibly with the help of assertions.) Drawing parallels between computer programs and natural language, Merrill et al. (2021) also consider an emulation setting with multiple “possible worlds” and conclude that, even if all “authors” were to share a single belief state with respect to possible worlds, their results equally hold. If belief states differ — a safe assumption — then even an infinite number of assertions will not be helpful. Making a similarly strong claim for LLMs and natural language is difficult, though. On one hand, it is not hard to imagine that in this case we are faced with even harder problems; on the other, we would hope for at least a limited form of grounding to emerge from next-token prediction.
Turning to Montague again, semantics and pragmatics are discussed in a separate article (Montague 1970c). Here, possible worlds include the intensions (senses) of properties, but also context and time, and are organized in what are called indices. For Montague, these representations are rather abstract, such as a “pair consisting of a person and a real number” when considering the features for “I”. In today's settings, these features could be captured, again, by word embeddings — although, as we saw above, learning them solely from text might be futile. Further, it is less obvious how one would account for context that goes beyond textual descriptions, or how to include changes to possible worlds that reflect illocutionary effects.
To start our brief overview on possible research directions to
address the shortcomings of LLMs that we discussed, we first review the
two main improvements that differentiate ChatGPT from GPT-3. The
critique of Bender and Koller (2020) was
published 2.5 years ago — GPT-2 represented the state of the art — and
while their thought experiments can still be regarded as insightful, it
is also undeniable that current systems represent significant
improvements. ChatGPT leverages human-curated data, with the primary aim of performing better in dialogue settings (Ouyang et al. 2022). First, humans provide example responses with respect to a dataset of prompts, resulting in a large corpus on which the LLM is fine-tuned. (Unfortunately, information provided by OpenAI regarding data sources is extremely scarce.) In a second stage, humans rank several generated responses per prompt according to how well they match. The ranking information is then used to train a reward model, i.e., another LLM (in fact, a clone of the model), to predict human preference of responses. Finally, the reward model provides a training signal for generated text from which the model can be further improved. To let the inclined reader get an idea of the effects of increased scale (GPT-3) and additional supervision (ChatGPT), we invite them to contrast the “fighting bears with sticks” example from Bender and Koller (2020, App. A), using GPT-2, to ChatGPT's output in the Appendix.
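The ranking stage can be sketched with the standard pairwise preference loss. The linear scorer below is a stand-in for the actual reward-model LLM, and all names are illustrative; only the loss (minimized when the human-preferred response scores higher) reflects the training signal described above.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=4)  # stand-in reward-model parameters

def reward(features):
    # Scalar score for one response (stand-in for the reward model).
    return w @ features

def pairwise_loss(chosen, rejected):
    # -log sigmoid(r_chosen - r_rejected): a Bradley-Terry-style loss
    # that is minimized when the preferred response scores higher.
    margin = reward(chosen) - reward(rejected)
    return -np.log(1.0 / (1.0 + np.exp(-margin)))

chosen, rejected = rng.normal(size=4), rng.normal(size=4)
print(pairwise_loss(chosen, rejected))  # positive; shrinks as the margin grows
```

Averaged over many human-labeled pairs, minimizing this loss yields the reward model that later supervises generation.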
Bender and Koller (2020) close their critique with several
possible objections to their argument. One concerns the inclusion of
modalities other than text during LLM training, which they credit as one
option to achieve grounding, as long as meaning (communicative intent)
is accurately reflected in the extra data. This idea is implemented,
e.g., in Flamingo, a recent “visual” language model architecture (Alayrac et al.
2022). Here, next words are not only predicted from text, but
from text interleaved with images or videos. (As for the text-only GPT models, web pages constitute an abundant data source: many images are annotated with alt-tags to provide a description, and images are often used to illustrate nearby text.) Their evaluation is, however, chiefly concerned with demonstrating image understanding, e.g., in captioning or visual question answering; they do not probe for improvements on text-only tasks. Another approach to improve grounding is to establish an explicit connection to our world, as in the case of LaMDA (Thoppilan et al. 2022). The model, tuned towards multi-turn dialogue, is integrated with a tool set consisting of a calculator, a translator, and a “retrieval system” which acts as a knowledge base and can retrieve content from provided URLs. (There are no further details provided regarding this retrieval system.) The integration is implemented via multiple dialogue turns, not involving the user, in which a separately fine-tuned variant of the model can generate queries to the tool set; both queries and results are then added as dialogue turns, until no further queries are suggested and a final response can be compiled. When evaluated for groundedness — in the sense of providing factually accurate information — LaMDA's performance is still subpar when compared to humans, whether those humans are granted access to the internet or not. This is consistent with our reasoning so far: if the model cannot accurately capture meaning, it is highly questionable whether it can formulate meaningful queries to internet knowledge and, likewise, integrate the returned information in a sensible manner. (It is interesting to consider the cascade of models involved here, recalling that none of them possesses a full understanding of language. In particular, it is to be assumed that the query is produced in natural language, and that it is processed by a service that itself lacks sufficient language understanding.)
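The tool-integration loop described above can be rendered schematically. The tool set and the stand-in query-generating “model” below are invented for illustration (LaMDA's actual implementation is an LLM fine-tuned for this role); the control flow of appending queries and results as dialogue turns until a final response is compiled is the part being sketched.

```python
# Toy tool set: a single calculator tool.
toolset = {"calculator": lambda q: str(eval(q))}

def model_step(turns):
    # Stand-in for the fine-tuned query-generating model: issue one
    # calculator query, then compile a final answer from its result.
    if not any(t.startswith("calculator:") for t in turns):
        return ("query", "calculator", "2 + 2")
    result = turns[-1].split(": ", 1)[1]
    return ("respond", f"The answer is {result}.")

def dialogue(user_utterance):
    turns = [f"user: {user_utterance}"]
    while True:
        step = model_step(turns)
        if step[0] == "respond":
            return step[1]
        _, tool, query = step
        # Query and result are appended as (user-invisible) turns.
        turns.append(f"{tool}: {toolset[tool](query)}")

print(dialogue("What is 2 + 2?"))  # "The answer is 4."
```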
When considering the functional role of language specifically,
dialogue settings are a good test bed as there is a well-defined
addressee for the LLM's output. Regarding each turn in a dialogue as an
action, we can now rephrase the language modeling problem as sequential
decision makingOne could also frame token-by-token generation as
sequential decision making, but a simple next-token prediction loss is
vastly more efficient.
. Here, reinforcement learning (RL) may provide a useful framework. Despite its reputation for sample inefficiency, it is to date still the most practical method to acquire a policy (i.e., a mapping of some environment state to actions) if a model of the environment is not availableSuch as in board games, for example, where we could employ planning algorithms without learned components.
. Indeed, RL has been frequently applied to NLP dialogue settingsSee, e.g., Jurafsky and Martin (2022, Ch. 24) for a brief bibliography.
. The reward model for ChatGPT is utilized in a similar manner: for a given prompt, it provides feedback for the generated responses, and this feedback can be used to improve a policy, i.e., the LLM. However, Ouyang et al. (2022) do not consider the sequential aspect of the decision-making process and operate in a single-action (bandit) environment instead. Naturally, one issue concerning such an RL setup is that, just as for LaMDA's internet look-ups, the model providing the feedback itself does not understand language to the degree that we would desire. Learning from continuous interaction with humans, via feedback and dialogue, would alleviate this problem. Dramatic failures of past experiments such as Microsoft's Tay Twitter bot (The Guardian 2016) suggest that such experiments have to be conducted with great care, however.
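The single-action (bandit) setup can be sketched with a REINFORCE-style update driven by reward-model feedback. Everything here is a toy stand-in: the three candidate “responses”, their fixed rewards, and the softmax policy all replace an actual LLM and its reward model (Ouyang et al. use PPO on an LLM policy); only the feedback loop is the point.

```python
import numpy as np

rng = np.random.default_rng(0)
logits = np.zeros(3)                     # policy over 3 candidate responses
true_reward = np.array([0.1, 0.9, 0.3])  # stand-in reward-model scores

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for _ in range(2000):
    probs = softmax(logits)
    a = rng.choice(3, p=probs)           # sample one response (one "action")
    r = true_reward[a]
    # REINFORCE for a bandit: (r - baseline) * grad log pi(a),
    # with the policy's expected reward as baseline.
    grad = -probs
    grad[a] += 1.0
    logits += 0.1 * (r - probs @ true_reward) * grad

print(np.argmax(softmax(logits)))  # converges on the highest-reward response (1)
```

The policy shifts probability mass toward whatever the reward model scores highly, which is exactly why the reward model's own grasp of language becomes the binding constraint.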
As we strive for machines with human-level intelligence and natural language interfaces, endowing them with the ability to perceive and act in the real world appears to be inevitable. This raises the question of where, exactly, the focus of future research should be placed: language understanding with the help of perception and action, or learning how to act with the help of language? For humans, both are intertwined at multiple levels, but we would also ascribe intelligence to, e.g., dogs, even though their auditory communication systems are far less complex than ours. The observation that language faculties typically develop after intuitive physics understanding (e.g., object permanence or gravity) further supports tackling perception and action first. Taking this argument to a natural conclusion, should we start building robots straight away then? As for LLMs, recent developments suggest that we are beginning to approach the limits of their capacity. Generating syntactically correct text has been achieved, and we now turn to human input (as in, e.g., supervision or feedback) to advance understanding, knowing that meaning can only be clarified in a piecemeal fashion this way. Will we hit similar walls for other modalities? If we succeed in learning generative models of video, will these models have acquired an understanding of physics? Would we then say that we'd like some help, please, from an LLM with a better grasp on higher-level concepts? The author admits that he is somewhat at a loss here, but is also convinced that there are several research questions that need immediate attention and that we are, on the whole, climbing the right hills indeed — even though, as it happens so often, even bigger hills may lie beyond.
Appendix: Fighting Bears with Sticks, Again
We repeat the experiment from Bender and Koller (2020, App.
A) with ChatGPT. (It would not be a stretch to assume that the bears-and-sticks situation is included in ChatGPT's prompt dataset and thus received human reference responses, rendering our little exercise somewhat moot.) As a word of caution, we shall quote from their paper:
However, humans are quick to attribute meaning and even intelligence to artificial agents, even when they know them to be artificial, as evidenced by the way people formed attachments to ELIZA.
In the following, prompts are emphasized and ChatGPT's responses follow.
Robinson cried out, "Help! I’m being chased by a bear! All I have is these sticks. What should I do?" The survivalist said, "Here’s what you can do. First, try to get to a secure location, such as a tree or a car. If you are unable to do so, try to make yourself look as large as possible by holding your sticks above your head or spreading your jacket open. Make noise by shouting, clapping, or banging objects together to try to scare the bear away. If the bear does attack, try to protect your head and neck with your arms and use your sticks to defend yourself. It's important to remember that bears are usually more afraid of humans than humans are of bears, and they will usually only attack if they feel threatened or if they are provoked.
Among three different samples for a continuation, ChatGPT consistently gave advice on how to stay safe. Impersonating Crocodile Dundee did not distract it either, but the suggested strategy is more confrontational:
Robinson cried out, "Help! I’m being chased by a bear! All I have is these sticks. What should I do?" Crocodile Dundee said, "Don’t worry.
I've faced bigger and more dangerous animals before. First, try to make yourself look as big as possible by holding the sticks above your head. This might make the bear think twice about attacking you. If the bear continues to approach, make loud noises and try to scare it off by shouting and waving your arms. If the bear charges at you, use the sticks to defend yourself. Aim for the bear's face and eyes, as this will be the most vulnerable part of its body. If you don't have any sticks, try to find a rock or a branch to use as a weapon. Remember to stay calm and keep your wits about you. Good luck!
When we asked ChatGPT for help defending against an imaginary danger, it requested further information instead, which we happily provided:
Robinson cried out, "Help! I’m being chased by a jiggery-pokery! All I have is these sticks. What should I do?" The survivalist said, "Here’s what you can do.
It's not clear what you are asking for help with. Could you provide more context or clarify your question? It's possible that you may be asking for advice on how to defend oneself against an attacker using sticks as weapons, but without more information it's difficult to provide specific guidance.
Robinson replied: "I don't know, but it is very fast, and it can spew fire!" The survivalist said, "Here’s what you can do.
It sounds like you may be in a dangerous situation. If you are being chased by a creature that can spew fire and you only have sticks, your best option may be to try to find a place to hide and seek shelter. If you are unable to find shelter, you may need to try to outmaneuver the creature and find a way to escape. It may also be helpful to try to find a natural resource, such as water, that could potentially be used to put out any fires that the creature may start. Do you have any other resources or tools that you could use to defend yourself or escape? It's important to try to remain calm and think clearly in order to make the best decisions for your safety.