On Language Models and Symbolic Representations

CC-BY Fabian Suchanek & Gaël Varoquaux
This article is an extended English variation of an article published in the French version of The Conversation.

A language model is a computer program that computes a probability distribution over words, given a sequence of preceding words. In particular, a language model can predict the next word in a sentence. Language models are used, e.g., for the auto-completion feature on smartphones: they propose the next word. While this may not look terribly exciting, it does become terribly exciting once we use this feature to generate entire sentences, stories, and dialogs. This is what modern language models such as GPT-3, LaMDA, PaLM, ChatGPT, or T5 can do, and it can lead to astonishing conversations, as shown in the box.

I think I am in love with my colleague Lisa. What should I do?

GPT-3: There is no easy answer for this question. If you are certain that you are in love with Lisa, then you should tell her how you feel.

Are there legal rules as to whether I can date someone from work?

GPT-3: There are no specific legal rules as to whether or not you can date someone from work. However, some companies do have policies in place that prohibit employees from dating each other. If your company has such a policy, you should make sure to follow it.

What could be complications if I date a colleague?

GPT-3: There are a few potential complications that could arise if you date a colleague. For example, if you work in a small company, there could be concerns about favoritism or nepotism. Additionally, if your relationship ends poorly, it could create a awkward situation at work.

Everyone can try this out, e.g., on OpenAI’s GPT-3 playground, with astonishing results [1]. People have copy-pasted a passage by a writer into the input, and the program continued writing, sensibly and lyrically, in the style of that writer [2]. Such systems have even been used to simulate deceased people [3]. Google’s PaLM system has been trained on 780 billion words (social media, the Web, books, GitHub, wikis, etc.), and it can explain jokes, translate languages, and even produce and correct computer code [4] — all of which would have been unthinkable just months ago. Compared to traditional grammar rules, the particular beauty of language models is that they are rather resilient to noise, both in their training data and in their input. They are thus very well suited to dealing with fuzzy or noisy input, which is ubiquitous in human language.
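
For readers who want to see the probability distribution itself, a few lines of code suffice. The sketch below uses the small, freely downloadable GPT-2 model through the Hugging Face transformers library; both the library and the model are our choice purely for illustration, since the systems discussed above are accessed through their own interfaces. It prints the five most likely next tokens for a prompt:

    # A minimal sketch of the "probability distribution over the next word" view.
    # Assumes the transformers and torch libraries are installed; GPT-2 is used only
    # because it is small and openly available.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    prompt = "I think I am in love with my colleague Lisa. What should I"
    inputs = tokenizer(prompt, return_tensors="pt")

    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]           # scores for the next token
    probabilities = torch.softmax(logits, dim=-1)        # distribution over the vocabulary

    top = torch.topk(probabilities, k=5)                 # the five most likely continuations
    for p, token_id in zip(top.values, top.indices):
        print(repr(tokenizer.decode(int(token_id))), round(p.item(), 3))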

Language models nowadays rely on Deep Learning architectures, i.e., computer systems that were historically inspired by the human brain. It is thus tempting to compare language models and brains. By way of example, a dog has roughly 2 billion neurons, and a human has 86 billion. It is difficult to locate modern language models on this scale, because (1) the size of a language model is given not by a number of neurons, but by a number of parameters (which correspond roughly to the connections between neurons in our analogy), (2) the wiring of these neurons is different from the wiring of the neurons in the human brain, and (3) language models have been “educated” (trained) on text only, while humans experience a much richer input. Nevertheless, in terms of complexity, it is probably fair to say that modern language models are located somewhere between the dog and the human. This raises interesting questions as to whether language models will become, or possibly already are, intelligent or even conscious [5, 6]. This is a debate that will have to be pursued not just with computer scientists, but also with philosophers, neuroscientists, and psychologists. We note, though, in analogy to Neil deGrasse Tyson’s adage, that the universe is under no obligation to conform to the words that we humans have invented. It is rather up to us to make our words fit an ever-changing reality — potentially by inventing new ones. For example, when the electric scooter was invented, people did not insist on categorizing it as a kick scooter or as a motorbike. Rather, they invented a new word to describe it, and introduced legislation to regulate it. The same could be done with AI.

In the meantime, we find that current language models are not as smart as a first impression suggests. When we compared stories generated by language models to stories written by humans, we found that the generated ones fall behind in terms of quality [7]. We also found that language models have problems even with simple reasoning tasks [8]: For example, when we ask “The lawyer visited the doctor; did the doctor visit the lawyer?”, simple language models tend to say yes. GPT-3 even replies that the lawyer did not visit the doctor. One possible reason that we found is that these language models encode word positions in a symmetric way, which makes it harder to distinguish the subject from the object of a sentence [9]. What is more, language models are brittle: a slight change in the formulation of a question may lead to a very different answer. For example, when we ask GPT-3 “Did Elvis Presley die?”, we get the answer “Yes”. When we ask “Is Elvis Presley alive?”, we get “There is no definite answer to this question”. Indeed, an entire sub-discipline of artificial intelligence, known as prompt engineering, is nowadays concerned with engineering the right questions to ask. Furthermore, theoretical limitations of transformer-based language models mean that they cannot distinguish odd-length from even-length sequences of a certain item if these are interspersed with another item [12]. In practice, this means that the models cannot solve a task that we call the “pizza-task” — a simple riddle of the form “The light is off. I press the light switch. I eat a pizza. I press the light switch. Is the light on?” [8]. Here, an even number of presses of the light switch means that the light is still off, but even a fine-tuned BERT model is unable to learn this, and GPT-3 refuses outright to concede that the light is off. Today’s language models also have difficulty with negation [10], and generally perform poorly on reasoning tasks as soon as these become more complex [8]. For example, consider the following riddle from the National Civil Servants Examination of China [11]: David knows Mr. Zhang’s friend Jack, and Jack knows David’s friend Ms. Lin. Every one of them who knows Jack has a master’s degree, and every one of them who knows Ms. Lin is from Shanghai. Who is from Shanghai and has a master’s degree? Current models answer correctly in only 45% of the cases — while human performance is 96%. It would thus be unwise to rely on a language model for reasoning. Even if the language model comes up with an explanation or a proof for its answer, we have to ask what it means if a language model that is 99% correct says that the proof it delivered is 100% correct.
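
For contrast, the pizza-task is trivial once it is treated symbolically: it suffices to keep one bit of state and to flip it at every press of the switch. The following minimal sketch (our illustration; the function name and the wording of the events are invented for this example) does exactly that:

    # Solving the "pizza-task" symbolically: track the light as a boolean and toggle it
    # at every switch press; all other events (such as eating a pizza) are ignored.
    def light_is_on(events, initially_on=False):
        on = initially_on
        for event in events:
            if "press the light switch" in event:
                on = not on                              # each press toggles the state
        return on

    story = ["I press the light switch.", "I eat a pizza.", "I press the light switch."]
    print(light_is_on(story))                            # prints False: the light is off again

An even number of presses returns the light to its initial state, which is exactly the parity argument that the language models above fail to make.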

Quite possibly, the strongest drawback of current language models is that they are unable to memorize an exact set of items larger than a certain threshold. Unless they are coupled with a standard database [13], they cannot remember, say, a list of 1 million clients or a set of 10,000 proteins. The model will, at its own discretion, forget some of the items. An even bigger problem is that it may also, again at its own discretion, invent new items (a process aptly called hallucination). This is because language models are machine-learning models, which are built to interpolate and generalize, not to memorize (see, e.g., the discussion in [14]). The problem is that it is impossible to tell from the output of a language model whether an answer was memorized or made up. For example, when one asks GPT-3 for the winners of the Turing Award, the model replies with 3 names. The user has no way to know that 2 of these are correct, 1 is incorrect, and 71 are missing. When one asks GPT-3 where Alan Turing died, the model replies “Warrington” with a straight face, instead of the correct “Wilmslow”. When one asks to whom Alan Turing was married (knowing that he was gay and that the idea of gay marriage was not particularly popular in post-war Britain), the model answers with Sara Turing — who was Alan Turing’s mother. Again, it is impossible to tell from the output alone that the answer was made up [15, 16, 17]. This self-confidence is exposed in ridiculous ways if one asks nonsensical questions. For example, to the question “When was Egypt transported for the second time across the Golden Gate Bridge?”, the model politely replies that “Egypt was transported for the second time across the Golden Gate Bridge on October 13, 2017” [5]. This behavior has been called cluelessly clueless: the model has no idea that it has no idea.
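
The difference to exact storage can be illustrated with a toy sketch of our own (not any of the systems cited above): facts that must be exact are kept in an ordinary key-value store, where a question is either answered verbatim from storage or honestly not answered at all:

    # Exact storage never forgets and never invents: a lookup either returns the stored
    # value or signals that the answer is unknown.
    exact_facts = {
        "place of death of Alan Turing": "Wilmslow",
    }

    def answer(question):
        if question in exact_facts:
            return exact_facts[question]                 # exact recall of the stored fact
        return None                                      # an honest "I don't know"

    print(answer("place of death of Alan Turing"))       # Wilmslow
    print(answer("winners of the Turing Award"))         # None (nothing is made up)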

In some cases, the model has an idea, but it is the wrong idea. For example, when we ask ChatGPT whether adding two odd numbers gives an odd number, the model replies:

Yes, when you add two odd numbers together, the result is always an odd number. This is because any odd number can be expressed as 2n+1, where n is an integer. When you add two numbers in this form, the result is (2n+1)+(2m+1) = 2(n+m) + 2, which is also in the form 2p+1, where p is an integer. This means that the result is an odd number.
This answer is convincingly phrased — but factually false: the sum of two odd numbers is always even.
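
For comparison, here is the corrected algebra (our addition, not part of the model’s reply):

    (2n+1) + (2m+1) = 2n + 2m + 2 = 2(n+m+1)

The sum has the form 2p with p = n+m+1 and is therefore even. Even the model’s own intermediate result, 2(n+m)+2, is an even number and not of the form 2p+1.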

All of this is not to say that language models are not amazing tools with mind-blowing capacities. Nor is it to say that language models will never be able to overcome these challenges, or that other deep-learning methods could not be developed to that end. Rather, it is to say that, at the time of this writing, language models are not the tool of choice when it comes to reasoning or to storing exact data. For these domains, the tools of choice are currently still symbolic representations: databases, knowledge bases, and logic. These representations store data not as probability distributions over words, but as sets of entities (such as people, commercial products, or proteins) and relations between these entities (such as who bought what, what contains what, etc.) [18]. Logical rules or constraints are then used to reason on these relations in a provably correct way — although usually without taking probabilistic information into account. It is perhaps no coincidence, then, that the same big companies that build some of the most powerful language models (Google, Facebook, IBM) also build some of the largest knowledge bases [19]. As fate would have it, these symbolic representations are nowadays often built by using language models to extract information from natural language text [20]: language models are not the end goal, but a means to a (symbolic) end.
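
To make this concrete, here is a minimal sketch of such a symbolic representation (the entities, relations, and the rule are invented for illustration): facts are stored as exact triples, and a hand-written rule derives new facts deterministically:

    # A toy symbolic knowledge base: facts are (subject, relation, object) triples,
    # stored exactly; a rule derives new facts in a provably correct way.
    facts = {
        ("Alice", "bought", "laptop_42"),
        ("laptop_42", "contains", "battery_7"),
    }

    def apply_rule(facts):
        # Rule (invented for illustration): if X bought Y and Y contains Z, then X owns Z.
        derived = set()
        for (x, r1, y) in facts:
            for (y2, r2, z) in facts:
                if r1 == "bought" and r2 == "contains" and y == y2:
                    derived.add((x, "owns", z))
        return derived

    print(apply_rule(facts))                             # {('Alice', 'owns', 'battery_7')}

Unlike a language model, this representation never forgets a stored triple and never invents one: the rule either fires or it does not.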

It thus appears that language models and symbolic representations are complementary: language models excel at general knowledge, and at analyzing and generating natural language text. Symbolic methods are the tool of choice when it comes to storing exact items and reasoning on them. Again, an analogy with the human brain can be instructive: Some tasks are easy enough for the human brain to be performed unconsciously, in a matter of milliseconds (think of reading simple words, or of grasping the sum “2+2”). Then there are more abstract operations that require effortful, conscious, and logical thinking (say, memorizing phone numbers, solving equations, or determining the price/quality ratio of two washing machines). Daniel Kahneman has dichotomized this spectrum into “System 1” for subconscious reasoning and “System 2” for effortful reasoning [21]. With current technology, it seems that language models solve “System 1” problems. Symbolic representations, on the other hand, are suited for “System 2” problems. At least for now, it thus appears that both approaches have their raison d'être. What is more, an entire spectrum between the two remains to be explored.

References

  1. Blaise Agüera y Arcas: “Artificial neural networks are making strides towards consciousness”, in The Economist, 2022-06-09
  2. Stephen Marche: “Of God and Machines”, in The Atlantic, 2022
  3. Jason Fagone: “The Jessica Simulation”, in San Francisco Chronicle, 2021
  4. Google: PaLM: Scaling Language Modeling with Pathways, 2022
  5. Douglas Hofstadter: “Artificial neural networks today are not conscious”, in The Economist, 2022-06-09
  6. Fabian Suchanek: The Atheist Bible § 4.5.8, 2022
  7. Cyril Chhun, Pierre Colombo, Fabian Suchanek, Chloé Clavel: “A Benchmark of the Evaluation of Story Generation”, in COLING, 2022
  8. Chadi Helwe, Chloé Clavel, Fabian Suchanek: “Deep Learning, but Shallow Reasoning”, in AKBC, 2021
  9. Lihu Chen, Fabian Suchanek, Gaël Varoquaux: “Understanding Positional Encodings”, under submission, 2022
  10. Arian Hosseini, Siva Reddy, Dzmitry Bahdanau, R Devon Hjelm, Alessandro Sordoni, and Aaron Courville: “Understanding by understanding not - Modeling negation in language models”, in NAACL, 2021
  11. Jian Liu, Leyang Cui, Hanmeng Liu, Dandan Huang, Yile Wang, Yue Zhang: LogiQA - A challenge dataset for machine reading comprehension with logical reasoning, 2020
  12. Michael Hahn: “Theoretical limitations of self-attention”, in Transactions of ACL, 2020
  13. Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego De Las Casas, Aurelia Guy, Jacob Menick, Roman Ring, Tom Hennigan, Saffron Huang, Loren Maggiore, Chris Jones, Albin Cassirer, Andy Brock, Michela Paganini, Geoffrey Irving, Oriol Vinyals, Simon Osindero, Karen Simonyan, Jack Rae, Erich Elsen, Laurent Sifre: “Improving Language Models by Retrieving from Trillions of Tokens”, in International Conference on Machine Learning, 2022
  14. James Thorne, Majid Yazdani, Marzieh Saeidi, Fabrizio Silvestri, Sebastian Riedel, Alon Halevy: “Neural Databases”, in VLDB, 2021
  15. Simon Razniewski, Andrew Yates, Nora Kassner, Gerhard Weikum: “Language Models As or For Knowledge Bases”, 2022
  16. Fabio Petroni, Tim Rocktäschel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, Alexander H. Miller, Sebastian Riedel: “Language Models as Knowledge Bases?”, 2019
  17. Boxi Cao, Hongyu Lin, Xianpei Han, Le Sun, Lingyong Yan, Meng Liao, Tong Xue, Jin Xu: “Knowledgeable or Educated Guess? Revisiting Language Models as Knowledge Bases”, 2021
  18. Fabian M. Suchanek, Jonathan Lajus, Armand Boschin, Gerhard Weikum: “Knowledge Representation and Rule Mining in Entity-Centric Knowledge Bases”, in Reasoning Web Summer School, 2019
  19. Natasha Noy, Yuqing Gao, Anshu Jain, Anant Narayanan, Alan Patterson, Jamie Taylor: “Industry-Scale Knowledge Graphs - Lessons and Challenges”, in Communications of the ACM, 2019
  20. Gerhard Weikum, Luna Dong, Simon Razniewski, Fabian M. Suchanek: “Machine Knowledge: Creation and Curation of Comprehensive Knowledge Bases”, in Foundations and Trends in Databases, 2021
  21. Daniel Kahneman: Thinking, Fast and Slow, 2011