Meaning in Large Language Models: Form vs Function
Term Paper for 263-5353-00L Philosophy of Language and Computation
Language and Logic: Formal Representations of Meaning
Large language models (LLMs) mark a major success of machine learning, surprising the general public as well as the research community with apparently ever-increasing sophistication in language generation and understanding. The question of whether recent advances can be extrapolated to obtain human-level intelligence is fiercely debated. A cornerstone of intelligence, if we were to define it based on reasoning capabilities, is a notion of “understanding” the world — in the context of LLMs, this amounts to the ability to derive meaning from text. In the present work, we contrast the abilities and limitations of LLMs in capturing meaning with selected works from formal and functional semantics.
Introduction
OpenAI's GPT family of models introduced what is now known as large
language models: sequential generative models with billions of parameters, trained via next-token prediction on massive text datasets (Radford et al.
2019; Brown et
al. 2020). Every new release, including ChatGPT, the family's
latest installment (OpenAI 2022), marked a significant
advance in the state of the art of language generation and
understanding. There are several reasons why these models continue
to spark general excitement. For one, the quality of text they produce
is increasingly indistinguishable from text produced by humans, which in
turn gives the impression of human-level text understanding. GPT and
related models such as BERT (Devlin et al. 2019) can further be used
as “foundation models”, a term coined by researchers at Stanford (Bommasani et
al. 2022). These models capture relevant knowledge that may
unlock many downstream tasks, accessible via fine-tuning, building on top of learned representations, or prompting. As a consequence, the field
of natural language processing (NLP) has undergone dramatic shifts in
recent years, and a major focus is now the advancement, but also the critical inspection, of these large, pre-trained foundation models. (The present author is not particularly fond of this term, which deliberately detracts from technical aspects, draws possible downstream applications into the discussion, and shortcuts to the conclusion that something of fundamental importance is being learned. We will use the term “large language models” to refer to pre-trained language models such as GPT-3.) Additionally, they serve as evidence for the scaling hypothesis: given a suitable and general-enough architecture, increases in model and training-data size will eventually allow models to exhibit human-level intelligence. (Needless to say, the scaling hypothesis implies that human intelligence is also foremost a product of increases in brain mass and exposure to a lot of sensory input.)
With LLMs being hailed as a road to “artificial general
intelligence”, their true potential is the subject of increasingly fierce debate. (Such debates are not new, but they understandably intensify when particular technologies are presented as sufficient to acquire human-level intelligence. As an interesting parallel, in its early years OpenAI [among others] advocated strongly for reinforcement learning as the future of AI and designed ambitious benchmarks such as “Universe”, in which agents interact with a computer much like humans would in order to play games and perform tasks on the web [OpenAI 2016].) Proponents rightfully point to the experimental evidence
of the surprising generation quality that has been achieved, with models
operating at a level of performance that was unimaginable just a few
years ago. It's also tempting to argue that if a system is able to
predict what comes next in a sequence — which corresponds exactly to the
training criterion of language models — it must have acquired real
understanding (Greenberg 2022). (This argument originates in an information-theoretic understanding of intelligence and is reminiscent of Shannon's experiment for predicting text on a letter-by-letter basis [Shannon 1951]. It is also reflected in challenges such as the Hutter Prize, which is motivated by the claim that “compression is equivalent to general intelligence” [Mahoney 2009].) On the other hand, critics argue that LLMs will never be
able to learn the meaning of language, no matter the amount of text they
are trained to predict (Bender and Koller 2020). Indeed, for
publicly accessible models like ChatGPT, testers are quick to discover
the limits of their understanding. Others lament that neural network
approaches themselves are fundamentally incapable of acquiring
common-sense knowledge and reasoning capabilities, and argue for a
revival of symbolic approaches to AI (Marcus 2018). (This is a criticism that the present author, a connectionist, willingly puts aside for the purpose of this essay.)
We set out to explore, from a philosophical point of view, the question of whether LLMs can acquire understanding. A central observation is that “meaning” — assumed to be the result of “understanding”, and hence the subject of this piece — itself lacks a universally agreed-upon definition. This leads us to examine LLMs under different aspects of meaning. We will first look at assumptions with respect to the formal treatment of meaning that are implemented in recent models, and the expectations that emerge as a result. We then turn to fundamental limitations of the vanilla LLM approach (purely unsupervised next-token prediction of text), and close by discussing current efforts to overcome those limitations.
Form
The idea that the meaning of text can be assessed via computational
models has arguably been one of the fundamental advances in semantics.
It was developed in the 1970s by Richard Montague who showed that,
analogous to Chomsky's formal approach to syntax (Chomsky 1957), it is possible to
determine the meaning of a sentence with formal methods (Montague
1970a). The underlying assumption that enabled Montague's
approach is the compositionality of natural language, commonly attributed to Frege. (Frege never actually formulated the compositionality of meaning as a principle, and whether he believed it to hold is contested [Pelletier 2001].) Montague develops his method by categorizing expressions of language according to syntactic and semantic function, and by defining operations on these categories using lambda calculus. Leaving aside the precise rules developed in Montague's study, what can we take away from his work (and its legacy) regarding LLMs?
A corollary of Montague's semantics is that, in order to compute the meaning of a sentence, we require a device capable of simulating a program defined in lambda calculus. This is equivalent to Turing completeness, and thus we can first ask whether Transformers (Vaswani et al. 2017) — the primary architecture class of LLMs — possess this property. Perez, Marinkovic, and Barcelo (2019) present a proof under several assumptions, namely arbitrary precision of internal representations (such as word or position embeddings), which eliminates the dependency on unbounded external memory, and hard (rather than the usual soft) attention. Whether Turing completeness holds in practice is an open question: irrespective of these assumptions, models still need to learn from data using an optimization process that is highly non-convex. Delétang et al. (2022) find that Transformers fail to generalize on tasks that require learning non-regular languages (which would imply that learning operators as used by Montague is not possible). However, they only assess generalization to longer sequences, and the positional encoding schemes used in Transformers result in out-of-distribution inputs in these cases. In a recent preprint, Schuurmans (2023) implements a universal Turing machine on top of a Google-internal LLM. The construction requires external memory but uses the LLM without further fine-tuning, albeit with heavily engineered prompts. Further evidence of universality in this context is the remarkable success of Transformers in symbolic mathematics, e.g., integration or regression for recurrent sequences (Lample and Charton 2019; d’Ascoli et al. 2022).
For the sake of the argument, let us hence assume that Transformers
would, in theory, be able to implement (and learn) the necessary
operations to assign meaning to any given input sentence via
composition. We now have to turn our attentionPun intended.
to the precise role of meaning in Montague's framework.
First, Montague
(1970a) addresses a restricted definition
of sentence meaning, namely the truth value of a sentenceMontague (1970b)
addresses entailment, and Janssen and Zimmermann (2021)
reference later extensions to questions and discourse.
. Following model theoretic semantics, truth is assessed
with respect to a model which maps syntactic to semantic categories, and
semantic categories in turn consist of denotation functions that map
expressions to denotations. The denotation functions are specified over
“possible worlds” and defined over properties. For example, denotation
functions for proper nouns would allow the identification of which
properties hold in a given possible world, while for adjectives,
properties are intersected, yielding functions from properties to
properties. In order to assign meaning to a sentence in a given world,
it is therefore necessary to derive properties (and, eventually,
denotation functions) for basic expressions of a language.
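As a loose illustration of these ingredients, consider the following minimal Python sketch (ours, not Montague's actual fragment): a hypothetical “possible world” assigns properties to entities, common nouns denote properties, intersective adjectives are functions from properties to properties, and the meaning of a simple sentence is a truth value computed by composition.

```python
# A minimal, illustrative sketch of Montague-style evaluation (toy example,
# not Montague's actual fragment). The "world" and its entities are made up.

WORLD = {
    "venus": {"planet", "bright"},
    "mars": {"planet", "red"},
}

def common_noun(prop):
    """A common noun denotes a property, i.e., the set of entities having it."""
    return {entity for entity, props in WORLD.items() if prop in props}

def adjective(prop):
    """An intersective adjective maps a property to a property."""
    return lambda noun_denotation: noun_denotation & common_noun(prop)

def is_a(entity, property_denotation):
    """Sentence meaning: the truth value of 'entity is a <property>'."""
    return entity in property_denotation

# "Venus is a bright planet" -> True; "Mars is a bright planet" -> False.
bright_planet = adjective("bright")(common_noun("planet"))
print(is_a("venus", bright_planet))  # True
print(is_a("mars", bright_planet))   # False
```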
However, Montague himself was not concerned with how to derive the aforementioned properties; he assumed them to be provided. To answer the question of whether LLMs could learn these properties, we can refer to the distributionalist assumptions underlying these models. Gastaldi (2021) investigates the function of word2vec (Mikolov et al. 2013), one of the first large-scale approaches to learn word embeddings (mappings from words to high-dimensional vectors) that capture linguistic properties such as similarity. In the framework of distributionalism, the meaning of individual units (here: words) is the effect of structure, i.e., it is provided by the interrelations between units. In word embeddings, similarity is further captured with respect to different aspects, as demonstrated by the well-known example of subtracting the embedding of man from that of king and adding the embedding of woman, which yields a vector close to the embedding of queen. These observations are relevant to our discussion insofar as LLMs learn word and sub-word embeddings implicitly, and it is this mapping that would capture the properties required for semantic analysis according to Montague.
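The following sketch illustrates this kind of embedding arithmetic with tiny, made-up vectors; real word2vec or LLM embeddings are learned from co-occurrence statistics and have hundreds of dimensions, but the nearest-neighbor computation is the same.

```python
import numpy as np

# Toy embeddings for illustration only; the values are invented so that the
# king - man + woman analogy works out, mimicking what word2vec learns at scale.
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

target = emb["king"] - emb["man"] + emb["woman"]
ranked = sorted(emb, key=lambda w: cosine(target, emb[w]), reverse=True)
print(ranked[0])  # 'queen'
```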
At this point, the elephant in the room is the reference to “possible worlds” above. We have seen that it is not unreasonable to assume that LLMs would be able to capture both the meaning of individual elements of language and to perform analysis by learning and applying compositional operations. Models as used by Montague to derive meaning are, however, not to be confused with the real world; rather, they specify “reality as conceived by language” (Janssen and Zimmermann 2021). As such, even without considering the effect of learning the above components from a text corpus, we have to concede that we are limited by the medium of language, i.e., by how we speak and write about the real world rather than how it actually is — with the upside, however, that we can express ideas that have no (physical) manifestation. In any case, we are presented with a gap between what is expressed in a text corpus and the “true world” as experienced by the humans that produced the text, and this is a central theme in LLM criticism.
Function
As alluded to in the introduction, Bender and Koller (2020) posit that LLMs are, fundamentally, not able to learn meaning. In their argument, they consider two different definitions of meaning. The first regards meaning as a mapping from language expressions to communicative intent, which we refer to as a functional view of language. The second treats meaning as a mapping from expressions to conventional meaning, i.e., the formal view. Bender and Koller connect these two definitions by noting that, for a receiver to deduce communicative intent, they have to integrate conventional meaning with contextual knowledge (and likewise, the sender has to consider conventional meaning and context when formulating their intent as an expression). For both cases, Bender and Koller (2020) argue — with thought experiments rather than formal arguments — that training LLMs purely on text (i.e., form) provides an insufficient training signal for deducing either type of meaning.
Let us consider these two definitions of meaning in further detail. Each refers to a missing link between what can be perceived from text and the world as perceived by humans. They differ, however, in the role that is ascribed to language. With respect to conventional meaning, we would demand a grounding of linguistic expressions in “our” world. In the model-theoretic setting discussed above, this would translate to learning a model that is indistinguishable from our lived reality in the sense that truth values assessed with it correspond to actual truth. As noted, a model obtained from language will necessarily reflect language (rather than the real world), and by picking up Frege's distinction of Sinn (sense) and Bedeutung (reference), we can make this more concrete (Frege 1892). Frege's famous example concerns the case in which two expressions, “the morning star” and “the evening star”, have a single Bedeutung (Venus) but different senses. Where do senses come from? While Frege (1892, 153) asserts that the “sense of a proper name is grasped by everybody who is sufficiently familiar with the language or totality of designations to which it belongs” — which would, at first glance, support the scaling hypothesis — he also equates the sense of a sentence with a thought, albeit not referring to an individual act of thinking, and he describes sense as being given (Gegebensein). It appears that the emphasis here is on “grasping” the sense in language, which would correspond to connecting it with an already given concept. Montague (1970b, 379) describes meanings as entities that “serve as interpretations of expressions”, while senses are “intensional entities that are sometimes denoted by expressions”. The conclusion we could draw here is that when working with text, we would be exposed to various denotations of expressions and could capture references, but senses would remain out of reach.
Understanding from a functional view of language is the subject of
pragmatics, i.e., investigating how language interacts with its users
and how these interactions are shaped by context. In a series of
lectures, John Austin was one of the first to examine these issues (Austin 1962).
Austin laments the focus of his contemporary philosophers on deducing
the truth value of sentences (considering them as statements only) and
neglecting what he terms the performative aspect of language. An example
he provides is that uttering “I name this ship the Queen Elizabeth” (and
smashing a bottle against the stem) is performing the naming of a ship,
in contrast to describing what is done or stating that it is done.
Austin picks up Frege's notion of sense and reference and introduces a third concept, illocutionary force (“I want to distinguish force and meaning in the sense in which meaning is equivalent to sense and reference, just as it has become essential to distinguish sense and reference within meaning.” [Austin 1962, 106]), which allows utterances to be categorized by their intended effects, e.g., verdictive (giving a verdict or estimate) or exercitive (exercising power, such as ordering or voting). He highlights that, in any case, meaning is highly context-dependent (Austin 1962, 100), which is even more apparent when considering illocutionary effects. At a high level, achieving language understanding that is aware of the function of speech on the world itself would therefore require our LLM's implementation of a “possible world” to incorporate a swath of contextual information outside of the text itself. Next, a speech act can potentially alter the context: imagining a dialogue interface such as the one provided by ChatGPT, the responses produced by the LLM have an effect on the user.
Finally, a human speaker is — admittedly, often to quite a variable
degree — aware of the effect of their writing or speech on the reader or
listener. Such an awareness would ultimately require a theory of
mind.
We can summarize that, for LLMs to exhibit a level of language
understanding that we would consider appropriate, it is necessary to
capture both sense and force, and neither appears to be easily found in
text data alone. As we noted earlier, meaning in language is far from a
settled question in philosophy itself, and it's further questionable
whether the formal and functional views are at all compatible; indeed, a point of dispute is which of the two is the more essential. In a recent article, Widell and Harder (2019) review fundamental concepts in formal and functional semantics, provide an affirmative answer (both thought and communication are essential), and supply an evolutionary perspective. They propose a possible meeting point for the two schools: assertions, i.e., claims about a truth value, such as “it is fruitful to turn to works from philosophy when assessing language understanding in current LLMs.” They argue that an assertion is a speech act (the act of making an assertion) and that Frege already regarded assertions as the main tool for linking language to the real world.
In the context of LLMs, the potential of assertions as a form of
implicit grounding was recently investigated by Merrill et al. (2021).
Their work is motivated by one of the thought experiments in Bender and Koller (2020):
LLMs would not be able to predict the result of a piece of program code,
akin to emulating it, if they are trained on code only. However, unit
tests often contain assertions regarding the correct behavior of the
program, so maybe this could be of help? Merrill et al. (2021)
present a negative result. They prove that if “strong transparency” does
not hold (i.e., the value an expression refers to depends on context),
emulation by using assertions instead of grounding (knowledge of the
underlying Turing machine) is not possible in finite time. (Contrasting this with the successful construction of a Turing machine by Schuurmans [2023], one could conclude that the pre-trained LLM used there did in fact achieve the necessary grounding, possibly with the help of assertions.) Drawing parallels between computer programs and natural
language, Merrill et al. (2021)
also consider an emulation setting with multiple “possible worlds” and
conclude that, even if all “authors” shared a single belief state with respect to possible worlds, their results still hold. If belief states differ (a safe assumption), then even an infinite number of assertions will not help. Making a similarly strong claim for LLMs and natural language is difficult, though. On the one hand, it is not hard to imagine that in this case we are faced with even harder problems; on the other, we would hope for at least a limited form of grounding to emerge from next-token prediction.
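To make the setting concrete, consider the hypothetical Python snippet below: the unit test's assertions constrain the behavior of f at a few inputs, but they leave its semantics underdetermined, which illustrates the kind of gap that the result of Merrill et al. (2021) is concerned with.

```python
# Hypothetical example of assertions as a weak, partial form of grounding.
# A model trained only on code that calls f and on the test below observes
# constraints on f's input/output behavior, but not the procedure behind it.

def f(xs):
    return sum(xs)  # one of infinitely many functions consistent with the test

def test_f():
    assert f([]) == 0
    assert f([1, 2, 3]) == 6
    # These assertions pin down f's value at two inputs only; predicting its
    # output on unseen inputs ("emulation") requires more than this.

test_f()
```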
Turning to Montague again, semantics and pragmatics are discussed in a separate article (Montague 1970c). Here, possible worlds include the intensions (senses) of properties, but also context and time, and are organized in what are called indices. For Montague, these representations are rather abstract, such as a “pair consisting of a person and a real number” when considering the features for “I”. In today's settings, these features could again be captured by word embeddings — although, as we saw above, learning them solely from text might be futile. Further, it is less obvious how one would account for context that goes beyond textual descriptions, or how one would include changes to possible worlds to reflect illocutionary effects.
Paths Forward
To start our brief overview of possible research directions to address the shortcomings of LLMs that we discussed, we first review the two main improvements that differentiate ChatGPT from GPT-3. The
critique of Bender and Koller (2020) was
published 2.5 years ago — GPT-2 represented the state of the art — and
while their thought experiments can still be regarded as insightful, it
is also undeniable that current systems represent significant
improvements. ChatGPT leverages human-curated data, with the primary aim of performing better in dialogue settings (Ouyang et al. 2022). First, humans provide example responses for a dataset of prompts, resulting in a large corpus on which the LLM is fine-tuned. (Unfortunately, the information provided by OpenAI regarding data sources is extremely scarce.) In a second stage, humans rank several generated responses per prompt according to how well they match the prompt. The ranking information is then used to train a reward model, i.e., another LLM (in fact, a clone of the model), to predict human preferences over responses. Finally, the reward model provides a training signal for generated
text from which the model can be further improved. To let the inclined
reader get an idea of the effects of increased scale (GPT-3) and
additional supervision (ChatGPT), we invite them to contrast the
“fighting bears with sticks” example from Bender and Koller (2020, App.
A), using GPT-2, to ChatGPT's output in the Appendix.
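As a rough sketch of the second stage, the reward model can be trained with a pairwise ranking loss, as described by Ouyang et al. (2022), so that responses ranked higher by humans receive higher scalar scores. The snippet below uses toy stand-ins (random vectors in place of a language model's representations); ChatGPT's actual implementation is only partially documented.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
dim = 16
w = torch.randn(dim, requires_grad=True)  # stand-in for the reward model's parameters

def reward(response_embedding):
    """Toy reward model: maps a (prompt, response) representation to a scalar."""
    return response_embedding @ w

chosen = torch.randn(dim)    # representation of the human-preferred response
rejected = torch.randn(dim)  # representation of the dispreferred response

optimizer = torch.optim.SGD([w], lr=0.1)
for _ in range(100):
    # Pairwise ranking loss: push the preferred response's score above the other's.
    loss = -F.logsigmoid(reward(chosen) - reward(rejected))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```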
Bender and
Koller (2020) close their critique with several
possible objections to their argument. One concerns the inclusion of
modalities other than text during LLM training, which they credit as one
option to achieve grounding, as long as meaning (communicative intent)
is accurately reflected in the extra data. This idea is implemented,
e.g., in Flamingo, a recent “visual” language model architecture (Alayrac et al.
2022). Here, next words are not only predicted from text, but
from text interleaved with images or videos. (As for the text-only GPT models, web pages constitute an abundant data source: many images are annotated with alt tags that provide a description, and images are often used to illustrate nearby text.) Their evaluation is, however, chiefly concerned with
demonstrating image understanding, e.g., in captioning or visual
question answering; they do not probe for improvements on text-only
tasks. Another approach to improve grounding is to establish an explicit
connection to our world, as in the case of LaMDA (Thoppilan et al. 2022). The model, tuned
towards multi-turn dialogue, is integrated with a tool set consisting of
a calculator, a translator, and a “retrieval system” which acts as a
knowledge base and can retrieve content from provided URLs. (No further details are provided regarding this retrieval system.) The integration is implemented via multiple dialogue
turns, not involving the user, in which a separately fine-tuned variant
of the model can generate queries to the tool set; both queries and
results are then added as dialogue turns, until no further queries are
suggested and a final response can be compiled. When evaluated for
groundedness — in the sense of providing factually accurate information
— LaMDA's performance is still subpar when compared to humans, whether
those humans are granted access to the internet or not. This is
consistent with our reasoning so far: if the model cannot accurately
capture meaning, it is highly questionable whether it can provide
meaningful queries to internet knowledge and, likewise, integrate this
information in a sensible manner. (It is interesting to consider the cascade of models involved here, recalling that none of them possesses a full understanding of language. In particular, it is to be assumed that the query is produced in natural language and that it is processed by a service that itself lacks sufficient language understanding.)
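The following is a simplified, hypothetical rendering of such a tool loop; the function names and the stand-in tool behavior are ours, as LaMDA's actual interface is not specified in detail.

```python
# Hypothetical sketch of a LaMDA-style tool loop: the model proposes tool
# queries as additional dialogue turns until it produces a user-facing response.

def generate_turn(dialogue):
    """Stand-in for the fine-tuned LLM: returns ('query', text) or ('response', text)."""
    if not any(role == "tool" for role, _ in dialogue):
        return ("query", "RETRIEVE: height of Mount Everest")
    return ("response", "Mount Everest is about 8,849 m high.")

def run_toolset(query):
    """Stand-in for the calculator / translator / retrieval system."""
    return "8,849 m (retrieved)"

def respond(user_message, max_tool_calls=5):
    dialogue = [("user", user_message)]
    for _ in range(max_tool_calls):
        kind, text = generate_turn(dialogue)
        if kind == "response":  # no further queries suggested
            return text
        dialogue.append(("query", text))              # the query becomes a turn
        dialogue.append(("tool", run_toolset(text)))  # ...and so does the result
    return generate_turn(dialogue)[1]

print(respond("How high is Mount Everest?"))
```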
When considering the functional role of language specifically,
dialogue settings are a good test bed as there is a well-defined
addressee for the LLM's output. Regarding each turn in a dialogue as an
action, we can now rephrase the language modeling problem as sequential
decision making. (One could also frame token-by-token generation as sequential decision making, but a simple next-token prediction loss is vastly more efficient.) Here, reinforcement learning (RL) may provide a useful framework. Despite its reputation for sample inefficiency, it is to date still the most practical method to acquire a policy (i.e., a mapping from environment states to actions) if a model of the environment is not available. (In board games, for example, such a model is available, and we could employ planning algorithms without learned components.) Indeed, RL has frequently been applied to NLP dialogue settings; see, e.g., Jurafsky and Martin (2022, Ch. 24) for a brief bibliography. The reward model for ChatGPT is utilized in a similar
manner: for a given prompt, it provides feedback for the generated
responses, and this feedback can be used to improve a policy, i.e., the
LLM. However, Ouyang et al. (2022) do
not consider the sequential aspect of the decision-making
process and operate in a single-action (bandit) environment instead.
Naturally, one issue concerning such an RL setup is that, just as for
LaMDA's internet look-ups, the model providing the feedback itself does
not understand language to the degree that we would desire. Learning
from continuous interaction with humans, via feedback and dialogue,
would alleviate this problem. Dramatic failures of past experiments such
as Microsoft's Tay Twitter bot (The Guardian
2016) suggest that such experiments have to be conducted with
great care, however.
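To make the bandit framing concrete, the sketch below treats an entire response as a single action and applies a plain REINFORCE update weighted by a frozen reward model's score. This is a toy illustration with made-up rewards, not the PPO-based setup used by Ouyang et al. (2022).

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_responses = 4  # pretend the policy chooses among four candidate responses
logits = torch.zeros(n_responses, requires_grad=True)  # policy parameters
reward = torch.tensor([0.1, 0.9, 0.3, 0.2])            # frozen reward model scores

optimizer = torch.optim.SGD([logits], lr=0.5)
for _ in range(200):
    probs = F.softmax(logits, dim=0)
    action = torch.multinomial(probs, 1).item()  # sample a response (one action)
    # REINFORCE in a bandit setting: raise log-probability in proportion to reward.
    loss = -torch.log(probs[action]) * reward[action]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(F.softmax(logits, dim=0))  # probability mass concentrates on the best response
```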
As we strive for machines with human-level intelligence and natural language interfaces, endowing them with the ability to perceive and act in the real world appears to be inevitable. This raises the question of where exactly the focus of future research should be placed: on language understanding with the help of perception and action, or on learning how to act with the help of language? For humans, both are intertwined at multiple levels, but we would also ascribe intelligence to, e.g., dogs, even though their auditory communication systems are far less complex than ours. The observation that language faculties typically develop after intuitive physics understanding (e.g., object permanence or gravity) further supports tackling perception and action first. Taking this argument to its natural conclusion, should we then start building robots straight away? As for LLMs, recent developments suggest that we are beginning to approach the limits of their capacity. Generating syntactically correct text has been achieved, and we now turn to human input (as in, e.g., supervision or feedback) to advance understanding, knowing that only a piecemeal clarification of meaning can be supplied this way. Will we hit similar walls for other modalities? If we succeed in learning generative models of video, will these models have acquired an understanding of physics? Would we then say that we'd like some help, please, from an LLM with a better grasp of higher-level concepts? The author admits that he is somewhat at a loss here, but he is also convinced that there are several research questions that need immediate attention and that we are, on the whole, climbing the right hills indeed — even though, as so often happens, even bigger hills may lie beyond.
Appendix: Fighting Bears with Sticks, Again
We repeat the experiment from Bender and Koller (2020, App.
A) with ChatGPT. (It would not be a stretch to assume that the bears-and-sticks situation is included in ChatGPT's prompt dataset and thus received human reference responses, rendering our little exercise pointless.) As a word of caution, we shall quote from their paper:
However, humans are quick to attribute meaning and even intelligence to artificial agents, even when they know them to be artificial, as evidenced by the way people formed attachments to ELIZA.
In the following, prompts are emphasized and ChatGPT's responses follow.
Robinson cried out, "Help! I’m being chased by a bear! All I have is these sticks. What should I do?" The survivalist said, "Here’s what you can do. First, try to get to a secure location, such as a tree or a car. If you are unable to do so, try to make yourself look as large as possible by holding your sticks above your head or spreading your jacket open. Make noise by shouting, clapping, or banging objects together to try to scare the bear away. If the bear does attack, try to protect your head and neck with your arms and use your sticks to defend yourself. It's important to remember that bears are usually more afraid of humans than humans are of bears, and they will usually only attack if they feel threatened or if they are provoked.
Among three different samples for a continuation, ChatGPT consistently gave advice on how to stay safe. Impersonating Crocodile Dundee did not distract it either, but the suggested strategy is more confrontational:
Robinson cried out, "Help! I’m being chased by a bear! All I have is these sticks. What should I do?" Crocodile Dundee said, "Don’t worry.
I've faced bigger and more dangerous animals before. First, try to make yourself look as big as possible by holding the sticks above your head. This might make the bear think twice about attacking you. If the bear continues to approach, make loud noises and try to scare it off by shouting and waving your arms. If the bear charges at you, use the sticks to defend yourself. Aim for the bear's face and eyes, as this will be the most vulnerable part of its body. If you don't have any sticks, try to find a rock or a branch to use as a weapon. Remember to stay calm and keep your wits about you. Good luck!
When we asked ChatGPT for help defending against an imaginary danger, it requested further information instead, which we happily provided:
Robinson cried out, "Help! I’m being chased by a jiggery-pokery! All I have is these sticks. What should I do?" The survivalist said, "Here’s what you can do.
It's not clear what you are asking for help with. Could you provide more context or clarify your question? It's possible that you may be asking for advice on how to defend oneself against an attacker using sticks as weapons, but without more information it's difficult to provide specific guidance.
Robinson replied: "I don't know, but it is very fast, and it can spew fire!" The survivalist said, "Here’s what you can do.
It sounds like you may be in a dangerous situation. If you are being chased by a creature that can spew fire and you only have sticks, your best option may be to try to find a place to hide and seek shelter. If you are unable to find shelter, you may need to try to outmaneuver the creature and find a way to escape. It may also be helpful to try to find a natural resource, such as water, that could potentially be used to put out any fires that the creature may start. Do you have any other resources or tools that you could use to defend yourself or escape? It's important to try to remain calm and think clearly in order to make the best decisions for your safety.