Towards Natural Natural Language Processing
Term Paper for 263-5353-00L Philosophy of Language and Computation
Structuralism: The Search for the Right Units
In natural language processing, the chief subject of research is language in its textual form. The grounding of text in speech, the primordial form of language, is taken for granted. Recently, language understanding, production, and discourse at the acoustic level have been approaching practical feasibility through large-scale statistical modeling with neural networks. In this essay, we set out to explore the connection between spoken and written language from a structuralist viewpoint and examine works from both (textual) language and speech processing. Finally, we present recent work on spoken language processing along with opportunities and challenges for future work.
People attach even more importance to the written image of a vocal sign than to the sign itself. A similar mistake would be in thinking that more can be learned about someone by looking at his photograph than by viewing him directly.
Introduction
The central tenet of structuralism is that complex phenomena, such as human language, can only be understood by studying the relations of their individual elements. This approach necessitates a suitable definition of what constitutes an element. Fundamentally, language manifests as acoustic waves subject to biological but also information-theoretic constraints, such as discreteness and duality of patterning (Hockett 1960). These constraints are reflected in the symbolic abstractions, i.e., writing systems, that emerged to represent language (Meletis 2020). From the outset, structuralist linguistic theories were designed to encompass all manifestations of language, starting at the acoustic level, albeit with the assumption of an underlying symbolic representation. However, Harris remarks that a specific choice of elements to study affects the analyses and insights that can be obtained:
Phonology and morphology, therefore, each independently provides information concerning regularities in selected aspects of human behavior. The general methods of scientific technique are the same for both: associating discrete elements with particular features of portions of continuous events, and then stating the interrelations among these elements. But the results in each — the number of elements and classes of elements, the type of interrelations — are different. [...] phonology is more useful in taking down anthropological texts, learning a new dialect, etc., while morphology is more useful in the understanding of texts, in discovering “what is said” in a new language, etc. (Zellig S. Harris 1963, 22–23)
Echoes of Harris' argument are clearly audible when considering how fields of research concerned with syntactic and semantic understanding selected text as their medium of choice, whereas phonetics is of interest, for example, in acoustic modeling for automatic speech recognition (ASR). One may however question such distinctions based on the medium of language; as Saussure stresses in the Course in General Linguistics, “The linguistic object is not both the written and the spoken forms of words; the spoken forms alone constitute the object.” (Saussure 1916, 23–24). Can the strong focus on textual representations in contemporary NLP thus be considered harmful, and which opportunities are potentially missed?
To set the stage, we will briefly review how Saussure and Harris relate the spoken and written forms of language in their works, and then trace how the issue of defining the proper elements of study, linking back to Harris' observation above, developed similarly in text and audio applications in the wake of deep learning. We then discuss spoken language modeling and conclude with challenges and opportunities in this emerging field.
Spoken and Written Language in Structural Linguistics
Structuralism was initially developed by Ferdinand de Saussure and
treated in a series of lecture notes in the posthumously published
Course in General Linguistics. At its core, Saussure establishes the linguistic
sign as the chief unit of interest, which in turn consists of
the signified (the mental concept) and the signifier
(the physical realization). Notably, the signifier is defined as a
“sound-image” of a concept and hence concerns the acoustic level.
Indeed, Saussure objects to studying language purely in its written
form, going so far as to speak of a “tyranny of writing” that not only
“disguises” but “influences and modifies language”, and criticizing
earlier linguistics for confusing language and writing (Saussure 1916,
31, 24). The insistence on spoken language might stem in part
from Saussure's interest in how languages evolve over time (“to describe
and trace the history of all observable languages” is listed as the
first item of the scope of linguistics), and his observations on how
grammar and writing can stifle such evolution and in turn obscure
developments that actually took place. However, he also emphasizes the
need to study written language, in order to be “acquainted with its
usefulness, shortcomings, and dangers.”
The structuralist approach is then further developed by considering the fundamental element (the sign) in detail and positing that it is defined solely by separating it from other signs, at which point its function can be studied through its relations to other signs. This approach does, however, lead to immediate difficulties, as Saussure readily admits: it is not at all clear how such a separation should work in practice, and Saussure is dissatisfied with resorting to concepts of syllables, words, or sentences (Saussure 1916, 105).
In Structural Linguistics, Zellig Harris sets out to develop
practical methods for studying interrelations of linguistic elements. As
an American structuralist, Harris was concerned with synchronic
linguistics, investigating a single language over a brief period of
time, presented as a corpus of utterances. The linguistic element is now
defined on a symbolic level, which is integral as “[these] elements can
be manipulated in ways in which records or descriptions of speech can
not be: and as a result regularities of speech are discovered which
would be far more difficult to find without the translation into
linguistic symbols.” (Zellig S. Harris 1963, 18). While
Saussure posits that “linguistic signs, though basically psychological,
are not abstractions,” (Saussure 1916, 15) Harris admits that he
does in fact perform an abstraction of speech and that, consequently,
many features of speech cannot be captured. (For Saussure, language and speech are to be clearly
separated, leading to the conclusion that “we can dispense with the
other elements of speech; indeed, the science of language is possible
only if the other elements are excluded” (Saussure 1916, 15). Harris, on the other hand,
writes that “if we ever become able to state with some regularity
the distribution of these other behavioral features, we would associate
them too with particular linguistic elements.” (Zellig S. Harris 1963, 19).)
The notion of working with utterance fragments
themselves is dismissed as defining corresponding elements would be too
cumbersome, and because once a subdivision of speech yields
correspondences of symbols between utterances, nothing would be gained
by further dividing elements into their constituents. With respect to
current deep learning methods applied to text and audio signals,
however, this reasoning does not hold up well as we shall see in the
next section.
Harris distinguishes between phonetic and morphological analysis.
While phonemes and morphemes are portrayed as two sides of the same coin
in that both can fully describe an utterance, they differ in their
purpose. As alluded to in the introduction of this essay, Harris regards
morphemes as suitable to determine the meaning of an utterance.
Consequently, while Harris spends considerable effort on distributional
analysis on a phonetic level, his work culminates in approaches to
derive morphological structure. (Phonetics was of high importance to American
structuralism, not least due to an interest in the study of
indigenous languages of the Americas, which frequently lacked
standardized writing systems. Harris also performed distributional
analysis on the Cherokee language (Zellig S. Harris 1954), for which a written
form was, however, available.) Similar attitudes can be found in other structuralist
works from this period, such as Firth's “no meaning without morphology”
(Firth
1957). With Chomsky shifting focus to analysis of whole
sentences, units of interest moved to yet more abstract levels, as it
was regarded as “absurd, or even futile, to state principles of sentence
construction in terms of phonemes or morphemes” due to the complexity
that such descriptions would entail (Noam Chomsky 1957, 59). This is not to
say that the relevance of phonetics in linguistics disappeared; Chomsky
himself set out to define grammars for production at a phonetic level
several years later (N. Chomsky and Halle 1968).
Statistical Modeling of Audio and Text
Pioneering work in artificial intelligence, in particular in natural language processing (NLP), was carried out under a strong structural influence from Chomsky and Wittgenstein (Biggs 1987). Liddy (2001) notes that Chomsky's work resulted in a split of the NLP community, with generative grammar playing an important role in language processing, while statistical methods dominated applications of speech processing. With renewed interest in connectionist methods in the last decade due to the success of deep learning, statistics is at the heart of today's NLP. In this section, we give examples of how deep neural network models encourage the use of primitive units that carry no semantic information themselves: individual characters.
Text
When working with text, a natural choice for an elementary unit seems
to be the word. The definition of a word is however not as trivial as it
might appear. Mielke
et al. (2021) review past and recent trends in
tokenization (transforming text into primitive units) for NLP
applications and provide several examples where difficulties arise,
e.g., when deciding on a tokenization for “don't”. Besides technical
feasibility, two main considerations determine the choice of units in a
statistical setting: (1) the ability to gather useful statistics from a
given training corpus, and (2) how well these statistics can be
generalized to new data. (Generalization is determined by a combination of many
factors, but deciding which data to model in the first place is an
important one; which data a system should generalize to further depends
on the particular research setting considered.)
With the first point being addressed by ever-growing
datasets, the second point becomes more relevant with the increasing
capabilities and applications of NLP models. For example, a machine
learning system might encounter unseen words at test time; a graceful
fall-back might consider the new word's morphological
information. With powerful neural network models that can model long
sequences, it is appealing to consider characters as the fundamental
element (Sutskever,
Martens, and Hinton 2011; Al-Rfou et al. 2019). But even here
things are not as clear-cut: when opting for raw bytes or for a
particular encoding such as Unicode, different trade-offs arise (Mielke et al.
2021). The prevalent approach in today's systems is sub-word
modeling, e.g., by combining individual letters based on their
co-occurrence (Sennrich, Haddow,
and Birch 2016). This compromise between words and characters
allows for generalization while processing several thousand words in
sequence is still feasible. (Such trade-offs depend on the balance between available
hardware, to ensure reasonable training times, and limitations in
modeling, and are thus subject to constant and rapid change.)
Interestingly, recent work attempts to sidestep the
issue of selecting textual units by considering visual language
processing, e.g., by modeling images of rendered sentences (Salesky, Etter, and
Post 2021). (It is tempting to imagine how Saussure would have
reacted to such a setup.)
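To make the co-occurrence-based merging concrete, the following is a minimal sketch of byte-pair encoding merges in the spirit of Sennrich, Haddow, and Birch (2016); the toy vocabulary and the number of merges are illustrative only, and production tokenizers add further handling of word boundaries and unknown bytes.

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with a single merged symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(merged, word): freq for word, freq in vocab.items()}

# Toy corpus: each word is a space-separated sequence of symbols (initially characters).
vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

for _ in range(5):  # the number of merges is a hyperparameter of the tokenizer
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)  # most frequently co-occurring pair
    vocab = merge_pair(best, vocab)
    print(best, "->", "".join(best))
```

Frequent merges yield units between characters and words, which is precisely the compromise described above.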
Audio
In traditional speech transcription systems, separate models are
utilized for audio and text modalities and combined with Bayes' rule to
find the most likely text sequence \(T^*\) given an audio signal \(X\), i.e., \(T^*
= \operatorname{argmax}_T p(T | X) = \operatorname{argmax}_T p(X | T)
p(T)\). The first quantity, \(p(X |
T)\), is referred to as the acoustic model and serves as a good
example regarding the fundamental units used in audio processing. The
acoustic signal \(X\) is typically
represented as features over fixed-length audio fragments, extracted via
Fourier transforms. Which units should be predicted from such
features? (For ease of exposition, we disregard the problem of
finding the optimal sequence of linguistic units and focus
solely on the classical acoustic modeling task, i.e., predicting a
single unit with respect to given acoustic features. In full ASR systems, a decoder
integrates both acoustic and language model scores to determine a
transcription hypothesis.)
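As a sketch of how the two quantities interact, consider rescoring a small set of candidate transcriptions under the noisy-channel decomposition above; the scores below are made up, and a real decoder searches over transcriptions rather than enumerating them.

```python
# Toy candidate transcriptions with made-up log-probabilities. In a real system,
# log p(X | T) comes from the acoustic model and log p(T) from the language model.
candidates = {
    "recognize speech":   {"acoustic": -12.3, "lm": -8.1},
    "wreck a nice beach": {"acoustic": -11.9, "lm": -14.6},
}

def best_transcript(candidates, lm_weight=1.0):
    """Noisy-channel rescoring: argmax_T log p(X | T) + lm_weight * log p(T)."""
    return max(
        candidates,
        key=lambda t: candidates[t]["acoustic"] + lm_weight * candidates[t]["lm"],
    )

print(best_transcript(candidates))  # -> "recognize speech"
```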
Historically, directly predicting letters or sub-word units from
audio in continuous speech recognition was considered practical only for
languages with a strong correspondence between written and spoken forms,
such as German or Spanish (Killer, Stuker, and Schultz 2003). (In continuous
speech recognition systems, words and sentences, and possibly speakers, are
not explicitly separated by pauses.) For languages such as English, phonetic units
were of higher interest, which came with the additional requirement of
dictionaries to map words to their corresponding phone sequences. Early
systems used simple phonetic units from a manually designed set of
phones, but significant advances were made by a division into
subphonetic states and taking context information of surrounding states
into account (Schwartz
et al. 1985). Eventually, advances in neural network modeling
and larger training corpora enabled the prediction of characters
directly from digital waveform representations, alleviating the need for
expert knowledge on pronunciation and audio featurization (Collobert,
Puhrsch, and Synnaeve 2016).
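As an illustration of this last step, the following is a minimal sketch (assuming PyTorch) of a convolutional acoustic model mapping raw waveforms to per-frame character logits; it is not the architecture of Collobert, Puhrsch, and Synnaeve (2016), and all hyperparameters are placeholders.

```python
import torch
from torch import nn

class RawWaveformAcousticModel(nn.Module):
    """Strided 1-D convolutions turn the raw 16 kHz signal into frames, and a final
    1x1 convolution produces per-frame logits over a small character inventory."""
    def __init__(self, num_chars=29, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(1, hidden, kernel_size=400, stride=160),  # ~25 ms windows, 10 ms hop
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        self.classifier = nn.Conv1d(hidden, num_chars, kernel_size=1)

    def forward(self, waveform):          # waveform: (batch, 1, num_samples)
        frames = self.encoder(waveform)   # (batch, hidden, num_frames)
        return self.classifier(frames)    # (batch, num_chars, num_frames)

model = RawWaveformAcousticModel()
logits = model(torch.randn(2, 1, 16000))  # two one-second utterances at 16 kHz
print(logits.shape)                       # torch.Size([2, 29, 98])
```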
Representation Learning
The conclusion thus far seems to be that for language tasks, modeling techniques are ultimately converging on single letters as the unit of choice. While letters do not carry any meaning in a linguistic sense (i.e., the smallest units of text that convey meaning would be morphemes), this is in line with the promise of deep learning to construct suitable internal representations from raw signals automatically (LeCun, Bengio, and Hinton 2015). However, a key assumption in NLP is that the tasks of interest are already defined in a textual form: parsers are evaluated on parsing text, language models on generating text, and speech recognition systems on transcribing speech to text.
A separate branch of contemporary deep learning literature deals with learning representations as a goal in itself. Such techniques have recently been of high interest under the term of self-supervised learning, i.e., representation learning from unlabeled data, and can be broadly divided into two categories. In generative representation learning, the aim is to learn a compressed representation that allows for the reconstruction of the input or a future datum (Ballard 1987; Kingma and Welling 2014); in contrastive learning, the learning signal is the disambiguation of similar from unrelated data (Arora et al. 2019). These approaches can also be combined; one example is Contrastive Predictive Coding (CPC) (Oord, Li, and Vinyals 2018), in which a representation is learned to model (representations of) future observations in sequential data, with a contrastive loss to avoid explicit prediction of these observations.
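As a minimal sketch of the contrastive idea behind CPC, the following (assuming PyTorch) shows an InfoNCE-style loss in which a predicted future representation must be identified among sampled negatives; the encoder and prediction networks of the full CPC model are omitted, and all shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(predicted, positive, negatives):
    """Contrastive (InfoNCE-style) objective as used in CPC.

    predicted:  (batch, dim)     prediction of a future frame's representation
    positive:   (batch, dim)     the actual representation of that frame
    negatives:  (batch, k, dim)  representations sampled from other times/utterances
    """
    pos_score = (predicted * positive).sum(-1, keepdim=True)      # (batch, 1)
    neg_score = torch.einsum("bd,bkd->bk", predicted, negatives)  # (batch, k)
    logits = torch.cat([pos_score, neg_score], dim=-1)            # (batch, 1 + k)
    # The positive sits at index 0; the loss pushes its score above the negatives'.
    target = torch.zeros(logits.size(0), dtype=torch.long)
    return F.cross_entropy(logits, target)

loss = info_nce_loss(torch.randn(8, 256), torch.randn(8, 256), torch.randn(8, 10, 256))
print(loss.item())
```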
Could the issue of unit selection thus be sidestepped by learning a representation instead? In any case, it is necessary to define the input data, which constitutes a definition of the unit's “natural” instantiation, e.g., tokenized text or a fixed-length temporal signal. On the other hand, representation learning enables a direct application of structuralist maxims from first principles: the construction of elements (here: representations) defined by their similarity to other elements. Harris' methods for identifying phonetic units rely primarily on the ability to substitute fragments of utterances for each other (Zellig S. Harris 1954, 29) — in other words, elements are further defined by the context in which they occur, which is implemented in methods such as word2vec (Mikolov et al. 2013).
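The context-based definition of elements can be illustrated with a skip-gram word2vec model; the sketch below assumes the gensim library, and the toy corpus and hyperparameters are for illustration only.

```python
from gensim.models import Word2Vec

# Toy corpus: elements (here, words) are characterized purely by the contexts
# in which they occur, in the spirit of distributional analysis.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["a", "cat", "chased", "a", "dog"],
]

# sg=1 selects the skip-gram objective: predict surrounding context words from a target word.
model = Word2Vec(sentences, vector_size=16, window=2, min_count=1, sg=1, epochs=50)

# Words that substitute for one another in similar contexts obtain similar vectors.
print(model.wv.most_similar("cat", topn=3))
```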
Spoken Language Processing
A recently emerging work stream considers natural language processing without textual representations (Lakhotia et al. 2021; Kharitonov et al. 2022). This can be regarded as a return to the origins of Saussure's work, but focusing on spoken language also opens several further research opportunities. First, putting fundamental differences across modalities aside, breaking the reliance on text can be regarded as a next step in terms of data presentation, advancing to units more fine-grained than individual characters. Second, a key question in linguistics is how humans acquire language. Chomsky postulates that humans are endowed with innate abilities for language learning and rejects the notion of language learning purely from external feedback (Noam Chomsky 1975, 1959). However, concrete evidence for either hypothesis has remained elusive to date. With large neural network models approaching human capabilities in text understanding and generation, insights into language acquisition might be derived from their study (Warstadt and Bowman 2022). In this line of thinking, it can be argued that one may need to expose learned models to the same modality humans are exposed to, i.e., speech rather than text (Dupoux 2018). Finally, a practical benefit is the option to deploy systems for languages without a writing system, which constitute the majority of the world's languages (Tjandra, Sakti, and Nakamura 2019).
As an exemplary system for spoken language generation, we consider the setup proposed by Lakhotia et al. (2021) and depicted in the figure above. In a first stage (S2u), acoustic units are discovered via representation learning. Here, techniques such as CPC are used to obtain representations for fixed-length audio segments (here: 10ms), and discrete units are then obtained by clustering the representations via k-means. A language model (uLM) is then trained on sequences of discrete units as they appear in the dataset, implemented as a Transformer (Vaswani et al. 2017). Finally, a decoder (u2S) is employed to translate acoustic units into spectrogram features for subsequent audio synthesis (Shen et al. 2018). The resulting system can be prompted with an audio segment and predicts possible continuations, akin to large language models like GPT (Brown et al. 2020).
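A minimal sketch of the S2u stage under these assumptions: frame-level representations are clustered with k-means (here via scikit-learn) into a discrete unit inventory, and each utterance becomes a unit sequence the uLM can be trained on. The cpc_encode function is a hypothetical stand-in for a pretrained encoder, and all numbers are placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans

def cpc_encode(waveform):
    """Hypothetical stand-in for a pretrained CPC encoder: one vector per 10 ms frame."""
    num_frames = len(waveform) // 160           # 10 ms hop at a 16 kHz sampling rate
    rng = np.random.default_rng(len(waveform))
    return rng.normal(size=(num_frames, 256))   # placeholder frame representations

# Placeholder audio; real input would be recorded waveforms.
utterances = [np.zeros(16000), np.zeros(32000)]

# S2u: cluster frame representations into a small discrete unit inventory.
frames = np.concatenate([cpc_encode(u) for u in utterances])
kmeans = KMeans(n_clusters=50, n_init=10, random_state=0).fit(frames)

# Each utterance becomes a sequence of discrete units, ready for uLM training.
unit_sequences = [kmeans.predict(cpc_encode(u)) for u in utterances]
print(unit_sequences[0][:20])
```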
In the setting above, the explicit discovery of acoustic units seems
to be at odds with the trend towards building end-to-end deep learning
systems. The motivations for discrete acoustic units are two-fold.
First, discrete units allow the sampling of sequences from the language
model based on class probabilities obtained via softmax normalization,
whereas sampling directly on the level of acoustic features would
require the assumption of an underlying probability distribution.
Predicting a sequence of raw wave form samples would again involve
discrete units, but leads to enormous sequence lengths; furthermore,
variations at this level have only little influence on the resulting
sound so that, e.g., generating two different spoken words would involve
a large number of suitably correlated sampling steps. (Both issues — handling very large sequences as well as
learning flexible distributions and sampling from them — are important
topics in current generative modeling research, so future work may well
challenge the motivation outlined here.)
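To illustrate the first motivation, the sketch below samples the next discrete unit from a softmax-normalized categorical distribution, as a uLM would at generation time; the logits are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_next_unit(logits, temperature=1.0):
    """Sample the next discrete acoustic unit from softmax-normalized uLM logits."""
    logits = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(logits - logits.max())  # numerically stable softmax
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

# Made-up logits over a 100-unit inventory; a real uLM would produce these
# conditioned on the units generated so far.
print(sample_next_unit(rng.normal(size=100)))
```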
In follow-up work, it was pointed out that acoustic unit discovery via representation learning fails to capture non-phonetic information and thus misses out on one of the potential benefits of spoken language processing: the integration of non-verbal and prosodic features which convey, e.g., emotion and emphasis (Kharitonov et al. 2022). As a possible solution, the duration (number of repetitions) of clustered acoustic units as well as the fundamental frequency can be modeled explicitly with the uLM component.
A major challenge addressed in Lakhotia et al. (2021) is the evaluation of spoken language generation models; as noted above, typical NLP benchmarks are defined on a textual level. While native speakers can judge the generated speech in terms of intelligibility and semantics, automatic evaluations are valuable for swift and deterministic benchmarking. Their work proposes evaluation methods for both intermediate models and overall generation. Acoustic units are evaluated by estimating whether units that fall into the same discrete category are closer to each other than units in separate categories (Schatz et al. 2013) — a notion which again corresponds to Harris' criteria. Language-level performance is measured by a word spotting task, in which single-word utterances of real and pseudo-words (e.g., “brick” vs. “blick”) are discriminated based on uLM scores (Nguyen et al. 2020). Finally, generated speech is evaluated by utilizing an off-the-shelf speech transcription system. Intelligibility can be gauged with the transcription system's acoustic model: an utterance is encoded and decoded with the S2u and u2S components, and the discrepancy between the acoustic model outputs of the original and re-synthesized utterances is computed. Judging the content of the generated speech is however more involved, and the authors propose to measure two signals. First, the diversity of words within a transcribed sentence is measured on the level of words and word combinations. Second, the perplexity of a standard language model is computed, serving as a (weak) proxy for whether the entire transcription is sensible (as in, likely with respect to a given corpus of text).
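The word spotting evaluation can be sketched as follows: a pair counts as correct if the real word receives the higher score. Here, ulm_score is a hypothetical stand-in for the unit language model's score of a synthesized single-word utterance, and the pairs and numbers are made up.

```python
def ulm_score(word):
    """Hypothetical stand-in: the uLM's log-probability for the unit sequence of a
    synthesized single-word utterance (all numbers are made up)."""
    return {"brick": -4.2, "blick": -6.9, "dog": -3.1, "dag": -3.5}[word]

# Minimal pairs of real words and matched pseudo-words.
pairs = [("brick", "blick"), ("dog", "dag")]

# A pair is scored as correct if the real word receives the higher uLM score.
accuracy = sum(ulm_score(real) > ulm_score(fake) for real, fake in pairs) / len(pairs)
print(f"word spotting accuracy: {accuracy:.2f}")
```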
Outlook
What role will spoken language processing play going forward? While practical benefits exist, e.g., the inclusion of languages that are spoken only, or the integration of non-phonetic features into language processing and analysis, today's primary medium for human-computer interaction is text. Furthermore, text is increasingly hailed as the interface of choice for content generation in such diverse modalities as images, animation, or code (Ramesh et al. 2022; Zhang et al. 2022; Chen et al. 2021). On the other hand, one can imagine shifts in user interfaces, e.g., fueled by the creation and consumption of richer content such as video, or an expansion of speech interfaces, e.g., in immersive environments in which interaction via speech is natural, paving the way for new applications. For example, a conversational agent could directly react to irony in a user's voice, and likewise provide emotional cues in its response (if so desired). Dupoux (2018) argues that modeling spoken language can provide insight into language acquisition in children, tackling the problem from a reverse-engineering point of view. This may allow researchers to address, for example, the bootstrapping problem, where children apparently learn multiple interdependent aspects of language concurrently (e.g., learning phonemes requires the concept of words while learning a vocabulary requires knowledge of phonemes). Constructing artificial systems that successfully acquire language purely from acoustic signals could thus put Chomsky's universal grammar hypothesis to the test. Likewise, the search for the most suitable units of language may be advanced, either through improvements in representation learning for explicit discovery, or by studying intermediate representations in end-to-end models.
What are open challenges in this field? As mentioned above, current limitations in deep learning necessitate a separate learning stage to acquire acoustic units. The question of how these units are acquired has not been settled, as we've discussed in the context of capturing non-phonetic information. Besides finding good inductive biases for representation learning, there are further design considerations, such as specifying the number of discrete units, or whether discretization is necessary at all. Nguyen, Sagot, and Dupoux (2022) perform a preliminary investigation into the latter issues, albeit without considering the generation of speech, and arrive at the conclusion that continuous acoustic representations generally perform worse and that the number of units should at least amount to a typical phoneme inventory of about 40 units. Finally, relying on speech transcription systems for evaluation is at odds with the promise of being able to model low-resource languages, or those without a writing system. By applying this line of research to languages that have been extensively studied in NLP and ASR, the additional requirement for techniques to generalize across languages might alleviate this issue (Lakhotia et al. 2021).
Finally, has the circle to Saussure been closed for good now? Indeed, spoken language processing picks up the proposition to leave textual representations aside and is likewise motivated by a desire to access the “true” manifestation of language. The results obtained in recent work are at least starting to undermine Harris' claim that, ultimately, textual representations are more useful for understanding “what is said”. Scholars have however also lamented the downstream effects of early structuralism's focus on speech, and argue that the study of writing systems is useful in its own right (Meletis 2020). For example, regarding possible new advances in understanding language acquisition, how would such findings apply to deaf people acquiring reading and writing skills? A likely outcome is that spoken language processing adds to the overall canon of methodologies that can help us understand human language as a phenomenon, alongside the study of text. This would, once again, be in accord with Saussure's recognition of how speech and writing systems mutually influence each other.