Meaning in Large Language Models: Form vs Function
Term Paper for 263-5353-00L Philosophy of Language and Computation
Language and Logic: Formal Representations of Meaning
Large language models (LLMs) mark a major success of machine learning, surprising the general public as well as the research community with apparently ever-increasing sophistication in language generation and understanding. The question of whether recent advances can be extrapolated to obtain human-level intelligence is fiercely debated. A cornerstone of intelligence, if we were to define it based on reasoning capabilities, is a notion of “understanding” the world — in the context of LLMs, this amounts to the ability to derive meaning from text. In the present work, we contrast the abilities and limitations of LLMs in capturing meaning with selected works from formal and functional semantics.
Introduction
OpenAI's GPT family of models introduced what is now known as large
language models: sequential generative models with billions of parameters, trained via next-token prediction on massive text datasets (Radford et al.
2019; Brown et
al. 2020). Every new release, including ChatGPT, the family's
latest installment (OpenAI 2022), marked a significant
advance in the state of the art of language generation and
understanding. There are several reasons why these models continue
to spark general excitement. For one, the quality of text they produce
is increasingly indistinguishable from text produced by humans, which in
turn gives the impression of human-level text understanding. GPT and
related models such as BERT (Devlin et al. 2019) can further be used
as “foundation models”, a term coined by researchers at Stanford (Bommasani et
al. 2022). These models capture relevant knowledge that may
unlock many downstream tasks, accessible via fine-tuning, building on top of learned representations, or prompting. As a consequence, the field
of natural language processing (NLP) has undergone dramatic shifts in
recent years, and a major focus is now the advancement, but also the critical inspection, of these large, pre-trained foundation models. (The present author is not particularly fond of this term, which deliberately detracts from technical aspects, draws possible downstream applications into the discussion, and shortcuts to the conclusion that something of fundamental importance is being learned. We will use the term “large language models” to refer to pre-trained language models such as GPT-3.) Additionally, they serve as evidence for the scaling hypothesis: given a suitable and general-enough architecture, increases in model and training-data size will eventually allow models to exhibit human-level intelligence. (Needless to say, the scaling hypothesis implies that human intelligence is also foremost a product of increases in brain mass and exposure to a lot of sensory input.)
With LLMs being hailed as a road to “artificial general
intelligence”, their true potential is the subject of increasingly fierce debate. (Such debates are not new, but they understandably intensify when particular technologies are presented as sufficient to acquire human-level intelligence. As an interesting parallel, in its early years OpenAI [among others] advocated strongly for reinforcement learning as the future of AI and designed ambitious benchmarks such as “Universe”, in which agents interact with a computer much like humans would in order to play games and perform tasks on the web [OpenAI 2016].) Proponents rightfully point to the experimental evidence
of the surprising generation quality that has been achieved, with models
operating at a level of performance that was unimaginable just a few
years ago. It's also tempting to argue that if a system is able to
predict what comes next in a sequence — which corresponds exactly to the
training criterion of language models — it must have acquired real
understanding (Greenberg 2022). (This argument originates in an information-theoretic understanding of intelligence and is reminiscent of Shannon's experiment for predicting text on a letter-by-letter basis [Shannon 1951]. It is also reflected in challenges such as the Hutter Prize, which is motivated by the claim that “compression is equivalent to general intelligence” [Mahoney 2009].) On the other hand, critics argue that LLMs will never be
able to learn the meaning of language, no matter the amount of text they
are trained to predict (Bender and Koller 2020). Indeed, for
publicly accessible models like ChatGPT, testers are quick to discover
the limits of their understanding. Others lament that neural network
approaches themselves are fundamentally incapable of acquiring
common-sense knowledge and reasoning capabilities, and argue for a
revival of symbolic approaches to AI (Marcus 2018). (This is a criticism that the present author, a connectionist, willingly puts aside for the purpose of this essay.)
We set out to explore, from a philosophical point of view, the question of whether LLMs can acquire understanding. A central observation is that “meaning” — assumed to be the result of “understanding”, and hence the subject of this piece — itself lacks a universally agreed-upon definition. This leads us to examine LLMs under different aspects of meaning. We will first look at assumptions with respect to the formal treatment of meaning that are implemented in recent models, and the expectations that emerge as a result. We then turn to fundamental limitations of the vanilla LLM approach (purely unsupervised next-token prediction of text), and close by discussing current efforts to overcome those limitations.
Form
The idea that the meaning of text can be assessed via computational
models has arguably been one of the fundamental advances in semantics.
It was developed in the 1970s by Richard Montague who showed that,
analogous to Chomsky's formal approach to syntax (Chomsky 1957), it is possible to
determine the meaning of a sentence with formal methods (Montague
1970a). The underlying assumption that enabled Montague's
approach is the compositionality of natural language, commonly attributed to Frege. (Frege never actually formulated the compositionality of meaning as a principle, and whether he believed it to hold is contested [Pelletier 2001].) Montague develops his method by categorizing expressions of language according to syntactic and semantic function, and by defining operations on these categories using lambda calculus. Leaving aside the precise rules developed in Montague's study, what can we take away from his work (and its legacy) regarding LLMs?
A corollary of Montague's semantics is that, in order to compute the meaning of a sentence, we require a device capable of simulating a program defined in lambda calculus. This is equivalent to Turing completeness, and thus we can first ask whether Transformers (Vaswani et al. 2017) — the primary architecture class of LLMs — possess this property. Perez, Marinkovic, and Barcelo (2019) present a proof under several assumptions, namely arbitrary precision of internal representations (such as word or position embeddings), which eliminates the dependency on unbounded external memory, and hard (rather than the usual soft) attention. Whether Turing completeness holds in practice is an open question: irrespective of these assumptions, models still need to learn from data using an optimization process that is highly non-convex. Delétang et al. (2022) find that Transformers fail to generalize on tasks that require learning non-regular languages (which would imply that learning operators as used by Montague is not possible). However, they only assess generalization to longer sequences, and the positional encoding schemes used in Transformers result in out-of-distribution inputs in these cases. In a recent preprint, Schuurmans (2023) implements a universal Turing machine on top of a Google-internal LLM. The construction requires external memory but uses the LLM without further fine-tuning, albeit with heavily engineered prompts. Further evidence of universality in this context is the remarkable success of Transformers in symbolic mathematics, e.g., integration or regression for recurrent sequences (Lample and Charton 2019; d’Ascoli et al. 2022).
For the sake of the argument, let us hence assume that Transformers
would, in theory, be able to implement (and learn) the necessary
operations to assign meaning to any given input sentence via
composition. We now have to turn our attentionPun intended.
to the precise role of meaning in Montague's framework.
First, Montague
(1970a) addresses a restricted definition
of sentence meaning, namely the truth value of a sentenceMontague (1970b)
addresses entailment, and Janssen and Zimmermann (2021)
reference later extensions to questions and discourse.
. Following model theoretic semantics, truth is assessed
with respect to a model which maps syntactic to semantic categories, and
semantic categories in turn consist of denotation functions that map
expressions to denotations. The denotation functions are specified over
“possible worlds” and defined over properties. For example, denotation
functions for proper nouns would allow the identification of which
properties hold in a given possible world, while for adjectives,
properties are intersected, yielding functions from properties to
properties. In order to assign meaning to a sentence in a given world,
it is therefore necessary to derive properties (and, eventually,
denotation functions) for basic expressions of a language.
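As a loose illustration of these ingredients, consider the following minimal Python sketch (ours, not Montague's actual fragment): a hypothetical “possible world” assigns properties to entities, common nouns denote properties, intersective adjectives are functions from properties to properties, and the meaning of a simple sentence is a truth value computed by composition.

```python
# A minimal, illustrative sketch of Montague-style evaluation (toy example,
# not Montague's actual fragment). The "world" and its entities are made up.

WORLD = {
    "venus": {"planet", "bright"},
    "mars": {"planet", "red"},
}

def common_noun(prop):
    """A common noun denotes a property, i.e., the set of entities having it."""
    return {entity for entity, props in WORLD.items() if prop in props}

def adjective(prop):
    """An intersective adjective maps a property to a property."""
    return lambda noun_denotation: noun_denotation & common_noun(prop)

def is_a(entity, property_denotation):
    """Sentence meaning: the truth value of 'entity is a <property>'."""
    return entity in property_denotation

# "Venus is a bright planet" -> True; "Mars is a bright planet" -> False.
bright_planet = adjective("bright")(common_noun("planet"))
print(is_a("venus", bright_planet))  # True
print(is_a("mars", bright_planet))   # False
```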
However, Montague himself was not concerned with how to derive the aforementioned properties; he assumed them to be provided. To answer the question of whether LLMs could learn these properties, we can refer to the distributionalist assumptions underlying these models. Gastaldi (2021) investigates the function of word2vec (Mikolov et al. 2013), one of the first large-scale approaches to learn word embeddings (mappings from words to high-dimensional vectors) that capture linguistic properties such as similarity. In the framework of distributionalism, the meaning of individual units (here: words) is the effect of structure, i.e., it is provided by the interrelations between units. In word embeddings, similarity is further captured with respect to different aspects, as demonstrated by the well-known example of subtracting the embedding of man from that of king and adding the embedding of woman, which yields a vector close to the embedding of queen. These observations are relevant to our discussion insofar as LLMs learn word and sub-word embeddings implicitly, and it is this mapping that would capture the properties required for semantic analysis according to Montague.
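The following sketch illustrates this kind of embedding arithmetic with tiny, made-up vectors; real word2vec or LLM embeddings are learned from co-occurrence statistics and have hundreds of dimensions, but the nearest-neighbor computation is the same.

```python
import numpy as np

# Toy embeddings for illustration only; the values are invented so that the
# king - man + woman analogy works out, mimicking what word2vec learns at scale.
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

target = emb["king"] - emb["man"] + emb["woman"]
ranked = sorted(emb, key=lambda w: cosine(target, emb[w]), reverse=True)
print(ranked[0])  # 'queen'
```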
At this point, the elephant in the room is the reference to “possible worlds” above. We have seen that it is not unreasonable to assume that LLMs would be able to capture both the meaning of individual elements of language and to perform analysis by learning and applying compositional operations. Models as used by Montague to derive meaning are, however, not to be confused with the real world; rather, they specify “reality as conceived by language” (Janssen and Zimmermann 2021). As such, even without considering the effect of learning the above components from a text corpus, we have to concede that we are limited by the medium of language, i.e., by how we speak and write about the real world rather than how it actually is — with the upside, however, that we can express ideas that have no (physical) manifestation. In any case, we are presented with a gap between what is expressed in a text corpus and the “true world” as experienced by the humans that produced the text, and this is a central theme in LLM criticism.
Function
As alluded to in the introduction, Bender and Koller (2020) posit that LLMs are, fundamentally, not able to learn meaning. In their argument, they consider two different definitions of meaning. The first regards meaning as a mapping from language expressions to communicative intent, which we refer to as a functional view of language. The second treats meaning as a mapping from expressions to conventional meaning, i.e., the formal view. Bender and Koller connect these two definitions by noting that, for a receiver to deduce communicative intent, they have to integrate conventional meaning with contextual knowledge (and likewise, the sender has to consider conventional meaning and context when formulating their intent as an expression). For both cases, Bender and Koller (2020) argue — with thought experiments rather than formal arguments — that training LLMs purely on text (i.e., form) provides an insufficient training signal for deducing either type of meaning.
Let us consider these two definitions of meaning in further detail. Each refers to a missing link between what can be perceived from text and the world as perceived by humans. They differ, however, in the role that is ascribed to language. With respect to conventional meaning, we would demand a grounding of linguistic expressions in “our” world. In the model-theoretic setting discussed above, this would translate to learning a model that is indistinguishable from our lived reality in the sense that truth values assessed with it correspond to actual truth. As noted, a model obtained from language will necessarily reflect language (rather than the real world), and by picking up Frege's distinction of Sinn (sense) and Bedeutung (reference), we can make this more concrete (Frege 1892). Frege's famous example concerns the case in which two expressions, “the morning star” and “the evening star”, have a single Bedeutung (Venus) but different senses. Where do senses come from? While Frege (1892, 153) asserts that the “sense of a proper name is grasped by everybody who is sufficiently familiar with the language or totality of designations to which it belongs” — which would, at first glance, support the scaling hypothesis — he also equates the sense of a sentence with a thought, albeit not referring to an individual act of thinking, and he describes sense as being given (Gegebensein). It appears that the emphasis here is on “grasping” the sense in language, which would correspond to connecting it with an already given concept. Montague (1970b, 379) describes meanings as entities that “serve as interpretations of expressions”, while senses are “intensional entities that are sometimes denoted by expressions”. The conclusion we could draw here is that when working with text, we would be exposed to various denotations of expressions and could capture references, but senses would remain out of reach.
Understanding from a functional view of language is the subject of
pragmatics, i.e., investigating how language interacts with its users
and how these interactions are shaped by context. In a series of
lectures, John Austin was one of the first to examine these issues (Austin 1962).
Austin laments the focus of his contemporary philosophers on deducing
the truth value of sentences (considering them as statements only) and
neglecting what he terms the performative aspect of language. An example
he provides is that uttering “I name this ship the Queen Elizabeth” (and
smashing a bottle against the stem) is performing the naming of a ship,
in contrast to describing what is done or stating that it is done.
Austin picks up Frege's notion of sense and reference and introduces a third concept, illocutionary force (“I want to distinguish force and meaning in the sense in which meaning is equivalent to sense and reference, just as it has become essential to distinguish sense and reference within meaning.” [Austin 1962, 106]), which allows utterances to be categorized by their intended effects, e.g., verdictive (giving a verdict or estimate) or exercitive (exercising power, such as ordering or voting). He highlights that, in any case, meaning is highly context-dependent (Austin 1962, 100), which is even more apparent when considering illocutionary effects. At a high level, achieving language understanding that is aware of the function of speech on the world itself would therefore require our LLM's implementation of a “possible world” to incorporate a swath of contextual information outside of the text itself. Next, a speech act can potentially alter the context: imagining a dialogue interface such as the one provided by ChatGPT, the responses produced by the LLM have an effect on the user.
Finally, a human speaker is — admittedly, often to quite a variable
degree — aware of the effect of their writing or speech on the reader or
listener. Such an awareness would ultimately require a theory of
mind.
We can summarize that, for LLMs to exhibit a level of language
understanding that we would consider appropriate, it is necessary to
capture both sense and force, and neither appears to be easily found in
text data alone. As we noted earlier, meaning in language is far from a
settled question in philosophy itself, and it's further questionable
whether the formal and functional views are at all compatible; indeed, a point of dispute is which of the two is the more essential. In a recent article, Widell and Harder (2019) review fundamental concepts in formal and functional semantics, provide an affirmative answer (both thought and communication are essential), and supply an evolutionary perspective. They propose a possible meeting point for the two schools: assertions, i.e., claims about a truth value, such as “it is fruitful to turn to works from philosophy when assessing language understanding in current LLMs.” They argue that an assertion is a speech act (the act of making an assertion) and that Frege already regarded assertions as the main tool for linking language to the real world.
In the context of LLMs, the potential of assertions as a form of
implicit grounding was recently investigated by Merrill et al. (2021).
Their work is motivated by one of the thought experiments in Bender and Koller (2020):
LLMs would not be able to predict the result of a piece of program code,
akin to emulating it, if they are trained on code only. However, unit
tests often contain assertions regarding the correct behavior of the
program, so maybe this could be of help? Merrill et al. (2021)
present a negative result. They prove that if “strong transparency” does
not hold (i.e., the value an expression refers to depends on context),
emulation by using assertions instead of grounding (knowledge of the
underlying Turing machine) is not possible in finite time. (Contrasting this with the successful construction of a Turing machine by Schuurmans [2023], one could conclude that the pre-trained LLM used there did in fact achieve the necessary grounding, possibly with the help of assertions.) Drawing parallels between computer programs and natural
language, Merrill et al. (2021)
also consider an emulation setting with multiple “possible worlds” and
conclude that, even if all “authors” shared a single belief state with respect to possible worlds, their results still hold. If belief states differ (a safe assumption), then even an infinite number of assertions will not help. Making a similarly strong claim for LLMs and natural language is difficult, though. On the one hand, it is not hard to imagine that in this case we are faced with even harder problems; on the other, we would hope for at least a limited form of grounding to emerge from next-token prediction.
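To make the setting concrete, consider the hypothetical Python snippet below: the unit test's assertions constrain the behavior of f at a few inputs, but they leave its semantics underdetermined, which illustrates the kind of gap that the result of Merrill et al. (2021) is concerned with.

```python
# Hypothetical example of assertions as a weak, partial form of grounding.
# A model trained only on code that calls f and on the test below observes
# constraints on f's input/output behavior, but not the procedure behind it.

def f(xs):
    return sum(xs)  # one of infinitely many functions consistent with the test

def test_f():
    assert f([]) == 0
    assert f([1, 2, 3]) == 6
    # These assertions pin down f's value at two inputs only; predicting its
    # output on unseen inputs ("emulation") requires more than this.

test_f()
```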
Turning to Montague again, semantics and pragmatics are discussed in a separate article (Montague 1970c). Here, possible worlds include the intensions (senses) of properties, but also context and time, and are organized in what are called indices. For Montague, these representations are rather abstract, such as a “pair consisting of a person and a real number” when considering the features for “I”. In today's settings, these features could again be captured by word embeddings — although, as we saw above, learning them solely from text might be futile. Further, it is less obvious how one would account for context that goes beyond textual descriptions, or how one would include changes to possible worlds to reflect illocutionary effects.
Paths Forward
To start our brief overview of possible research directions to address the shortcomings of LLMs that we discussed, we first review the two main improvements that differentiate ChatGPT from GPT-3. The
critique of Bender and Koller (2020) was
published 2.5 years ago — GPT-2 represented the state of the art — and
while their thought experiments can still be regarded as insightful, it
is also undeniable that current systems represent significant
improvements. ChatGPT leverages human-curated data, with the primary aim of performing better in dialogue settings (Ouyang et al. 2022). First, humans provide example responses for a dataset of prompts, resulting in a large corpus on which the LLM is fine-tuned. (Unfortunately, the information provided by OpenAI regarding data sources is extremely scarce.) In a second stage, humans rank several generated responses per prompt according to how well they match the prompt. The ranking information is then used to train a reward model, i.e., another LLM (in fact, a clone of the model), to predict human preferences over responses. Finally, the reward model provides a training signal for generated
text from which the model can be further improved. To let the inclined
reader get an idea of the effects of increased scale (GPT-3) and
additional supervision (ChatGPT), we invite them to contrast the
“fighting bears with sticks” example from Bender and Koller (2020, App.
A), using GPT-2, to ChatGPT's output in the Appendix.
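As a rough sketch of the second stage, the reward model can be trained with a pairwise ranking loss, as described by Ouyang et al. (2022), so that responses ranked higher by humans receive higher scalar scores. The snippet below uses toy stand-ins (random vectors in place of a language model's representations); ChatGPT's actual implementation is only partially documented.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
dim = 16
w = torch.randn(dim, requires_grad=True)  # stand-in for the reward model's parameters

def reward(response_embedding):
    """Toy reward model: maps a (prompt, response) representation to a scalar."""
    return response_embedding @ w

chosen = torch.randn(dim)    # representation of the human-preferred response
rejected = torch.randn(dim)  # representation of the dispreferred response

optimizer = torch.optim.SGD([w], lr=0.1)
for _ in range(100):
    # Pairwise ranking loss: push the preferred response's score above the other's.
    loss = -F.logsigmoid(reward(chosen) - reward(rejected))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```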
Bender and
Koller (2020) close their critique with several
possible objections to their argument. One concerns the inclusion of
modalities other than text during LLM training, which they credit as one
option to achieve grounding, as long as meaning (communicative intent)
is accurately reflected in the extra data. This idea is implemented,
e.g., in Flamingo, a recent “visual” language model architecture (Alayrac et al.
2022). Here, next words are not only predicted from text, but
from text interleaved with images or videos. (As for the text-only GPT models, web pages constitute an abundant data source: many images are annotated with alt tags that provide a description, and images are often used to illustrate nearby text.) Their evaluation is, however, chiefly concerned with
demonstrating image understanding, e.g., in captioning or visual
question answering; they do not probe for improvements on text-only
tasks. Another approach to improve grounding is to establish an explicit
connection to our world, as in the case of LaMDA (Thoppilan et al. 2022). The model, tuned
towards multi-turn dialogue, is integrated with a tool set consisting of
a calculator, a translator, and a “retrieval system” which acts as a
knowledge base and can retrieve content from provided URLs. (No further details are provided regarding this retrieval system.) The integration is implemented via multiple dialogue
turns, not involving the user, in which a separately fine-tuned variant
of the model can generate queries to the tool set; both queries and
results are then added as dialogue turns, until no further queries are
suggested and a final response can be compiled. When evaluated for
groundedness — in the sense of providing factually accurate information
— LaMDA's performance is still subpar when compared to humans, whether
those humans are granted access to the internet or not. This is
consistent with our reasoning so far: if the model cannot accurately
capture meaning, it is highly questionable whether it can provide
meaningful queries to internet knowledge and, likewise, integrate this
information in a sensible manner. (It is interesting to consider the cascade of models involved here, recalling that none of them possesses a full understanding of language. In particular, it is to be assumed that the query is produced in natural language and that it is processed by a service that itself lacks sufficient language understanding.)
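The following is a simplified, hypothetical rendering of such a tool loop; the function names and the stand-in tool behavior are ours, as LaMDA's actual interface is not specified in detail.

```python
# Hypothetical sketch of a LaMDA-style tool loop: the model proposes tool
# queries as additional dialogue turns until it produces a user-facing response.

def generate_turn(dialogue):
    """Stand-in for the fine-tuned LLM: returns ('query', text) or ('response', text)."""
    if not any(role == "tool" for role, _ in dialogue):
        return ("query", "RETRIEVE: height of Mount Everest")
    return ("response", "Mount Everest is about 8,849 m high.")

def run_toolset(query):
    """Stand-in for the calculator / translator / retrieval system."""
    return "8,849 m (retrieved)"

def respond(user_message, max_tool_calls=5):
    dialogue = [("user", user_message)]
    for _ in range(max_tool_calls):
        kind, text = generate_turn(dialogue)
        if kind == "response":  # no further queries suggested
            return text
        dialogue.append(("query", text))              # the query becomes a turn
        dialogue.append(("tool", run_toolset(text)))  # ...and so does the result
    return generate_turn(dialogue)[1]

print(respond("How high is Mount Everest?"))
```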
When considering the functional role of language specifically,
dialogue settings are a good test bed as there is a well-defined
addressee for the LLM's output. Regarding each turn in a dialogue as an
action, we can now rephrase the language modeling problem as sequential
decision making. (One could also frame token-by-token generation as sequential decision making, but a simple next-token prediction loss is vastly more efficient.) Here, reinforcement learning (RL) may provide a useful framework. Despite its reputation for sample inefficiency, it is to date still the most practical method to acquire a policy (i.e., a mapping from environment states to actions) if a model of the environment is not available. (In board games, for example, such a model is available, and we could employ planning algorithms without learned components.) Indeed, RL has frequently been applied to NLP dialogue settings; see, e.g., Jurafsky and Martin (2022, Ch. 24) for a brief bibliography. The reward model for ChatGPT is utilized in a similar
manner: for a given prompt, it provides feedback for the generated
responses, and this feedback can be used to improve a policy, i.e., the
LLM. However, Ouyang et al. (2022) do
not consider the sequential aspect of the decision-making
process and operate in a single-action (bandit) environment instead.
Naturally, one issue concerning such an RL setup is that, just as for
LaMDA's internet look-ups, the model providing the feedback itself does
not understand language to the degree that we would desire. Learning
from continuous interaction with humans, via feedback and dialogue,
would alleviate this problem. Dramatic failures of past experiments such
as Microsoft's Tay Twitter bot (The Guardian
2016) suggest that such experiments have to be conducted with
great care, however.
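To make the bandit framing concrete, the sketch below treats an entire response as a single action and applies a plain REINFORCE update weighted by a frozen reward model's score. This is a toy illustration with made-up rewards, not the PPO-based setup used by Ouyang et al. (2022).

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_responses = 4  # pretend the policy chooses among four candidate responses
logits = torch.zeros(n_responses, requires_grad=True)  # policy parameters
reward = torch.tensor([0.1, 0.9, 0.3, 0.2])            # frozen reward model scores

optimizer = torch.optim.SGD([logits], lr=0.5)
for _ in range(200):
    probs = F.softmax(logits, dim=0)
    action = torch.multinomial(probs, 1).item()  # sample a response (one action)
    # REINFORCE in a bandit setting: raise log-probability in proportion to reward.
    loss = -torch.log(probs[action]) * reward[action]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(F.softmax(logits, dim=0))  # probability mass concentrates on the best response
```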
As we strive for machines with human-level intelligence and natural language interfaces, endowing them with the ability to perceive and act in the real world appears to be inevitable. This raises the question of where exactly the focus of future research should be placed: on language understanding with the help of perception and action, or on learning how to act with the help of language? For humans, both are intertwined at multiple levels, but we would also ascribe intelligence to, e.g., dogs, even though their auditory communication systems are far less complex than ours. The observation that language faculties typically develop after intuitive physics understanding (e.g., object permanence or gravity) further supports tackling perception and action first. Taking this argument to its natural conclusion, should we then start building robots straight away? As for LLMs, recent developments suggest that we are beginning to approach the limits of their capacity. Generating syntactically correct text has been achieved, and we now turn to human input (as in, e.g., supervision or feedback) to advance understanding, knowing that only a piecemeal clarification of meaning can be supplied this way. Will we hit similar walls for other modalities? If we succeed in learning generative models of video, will these models have acquired an understanding of physics? Would we then say that we'd like some help, please, from an LLM with a better grasp of higher-level concepts? The author admits that he is somewhat at a loss here, but he is also convinced that there are several research questions that need immediate attention and that we are, on the whole, climbing the right hills indeed — even though, as so often happens, even bigger hills may lie beyond.
Appendix: Fighting Bears with Sticks, Again
We repeat the experiment from Bender and Koller (2020, App.
A) with ChatGPT. (It would not be a stretch to assume that the bears-and-sticks situation is included in ChatGPT's prompt dataset and thus received human reference responses, rendering our little exercise pointless.) As a word of caution, we shall quote from their paper:
However, humans are quick to attribute meaning and even intelligence to artificial agents, even when they know them to be artificial, as evidenced by the way people formed attachments to ELIZA.
In the following, prompts are emphasized and ChatGPT's responses follow.
Robinson cried out, "Help! I’m being chased by a bear! All I have is these sticks. What should I do?" The survivalist said, "Here’s what you can do. First, try to get to a secure location, such as a tree or a car. If you are unable to do so, try to make yourself look as large as possible by holding your sticks above your head or spreading your jacket open. Make noise by shouting, clapping, or banging objects together to try to scare the bear away. If the bear does attack, try to protect your head and neck with your arms and use your sticks to defend yourself. It's important to remember that bears are usually more afraid of humans than humans are of bears, and they will usually only attack if they feel threatened or if they are provoked.
Among three different samples for a continuation, ChatGPT consistently gave advice on how to stay safe. Impersonating Crocodile Dundee did not distract it either, but the suggested strategy is more confrontational:
Robinson cried out, "Help! I’m being chased by a bear! All I have is these sticks. What should I do?" Crocodile Dundee said, "Don’t worry.
I've faced bigger and more dangerous animals before. First, try to make yourself look as big as possible by holding the sticks above your head. This might make the bear think twice about attacking you. If the bear continues to approach, make loud noises and try to scare it off by shouting and waving your arms. If the bear charges at you, use the sticks to defend yourself. Aim for the bear's face and eyes, as this will be the most vulnerable part of its body. If you don't have any sticks, try to find a rock or a branch to use as a weapon. Remember to stay calm and keep your wits about you. Good luck!
When we asked ChatGPT for help defending against an imaginary danger, it requested further information instead, which we happily provided:
Robinson cried out, "Help! I’m being chased by a jiggery-pokery! All I have is these sticks. What should I do?" The survivalist said, "Here’s what you can do.
It's not clear what you are asking for help with. Could you provide more context or clarify your question? It's possible that you may be asking for advice on how to defend oneself against an attacker using sticks as weapons, but without more information it's difficult to provide specific guidance.
Robinson replied: "I don't know, but it is very fast, and it can spew fire!" The survivalist said, "Here’s what you can do.
It sounds like you may be in a dangerous situation. If you are being chased by a creature that can spew fire and you only have sticks, your best option may be to try to find a place to hide and seek shelter. If you are unable to find shelter, you may need to try to outmaneuver the creature and find a way to escape. It may also be helpful to try to find a natural resource, such as water, that could potentially be used to put out any fires that the creature may start. Do you have any other resources or tools that you could use to defend yourself or escape? It's important to try to remain calm and think clearly in order to make the best decisions for your safety.