A language model is a computer program that computes a probability distribution over words, given a sequence of other words. In particular, a language model can predict the next word in a sentence. Language models are used, e.g., for the auto-completion feature on smartphones: they propose the next word. While this may not look terribly exciting, it does become terribly exciting once we use this feature to generate entire sentences, stories, and dialogs. This is what modern language models such as GPT-3, LaMDA, PaLM, chatGPT, or T5 can do, and it can lead to astonishing conversations, as shown in the box below (followed by a small code sketch of next-word prediction).
GPT-3: There is no easy answer for this question. If you are certain that you are in love with Lisa, then you should tell her how you feel.
Are there legal rules as to whether I can date someone from work?
GPT-3: There are no specific legal rules as to whether or not you can date someone from work. However, some companies do have policies in place that prohibit employees from dating each other. If your company has such a policy, you should make sure to follow it.
What could be complications if I date a colleague?
GPT-3: There are a few potential complications that could arise if you date a colleague. For example, if you work in a small company, there could be concerns about favoritism or nepotism. Additionally, if your relationship ends poorly, it could create an awkward situation at work.
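To make the notion of "a probability distribution over the next word" concrete, the following sketch queries a small, openly available model (GPT-2, via the Hugging Face transformers library; the choice of model and prompt is ours and purely illustrative, not the setup of GPT-3 or chatGPT) for the most likely continuations of a prompt:

```python
# Minimal sketch: ask a small language model for its next-word distribution.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits          # shape: (1, sequence_length, vocabulary_size)

# The distribution over the *next* token is given by the last position.
next_word_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_word_probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(token_id))!r}: p = {prob.item():.3f}")
```

Typically, a plausible continuation such as " Paris" receives a large share of the probability mass; sampling from this distribution over and over again is what turns auto-completion into text generation.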
Language models nowadays rely on Deep Learning architectures, i.e., computer systems that were historically inspired by the human brain. It is thus tempting to compare language models and brains. By way of example, a dog has 2 billion neurons; a human has 86 billion. It is difficult to locate modern language models on this scale, because (1) the size of a language model is measured not in neurons but in parameters (which correspond roughly to the connections between neurons in our analogy), (2) the wiring of these artificial neurons is different from the wiring of the neurons in the human brain, and (3) language models have been “educated” (trained) on text only, while humans experience a much richer input. Nevertheless, in terms of complexity, it is probably fair to say that modern language models are located somewhere between the dog and the human. This raises interesting questions as to whether language models will become, or possibly even are, intelligent or even conscious 5 16. This is a debate that will have to be pursued not just with computer scientists, but also with philosophers, neuroscientists, and psychologists. We note, though, in analogy to Neil deGrasse Tyson’s adage, that the universe is under no obligation to conform to the words that we humans have invented. It is rather up to us to make our words fit an ever-changing reality — potentially by inventing new ones. For example, when the electric scooter was invented, people did not insist on categorizing it as a kick scooter or as a motorbike. Rather, they invented a new word to describe it, and introduced legislation to regulate it. The same could be done with AI.
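As a side note to this comparison, the parameter count of a (much smaller) openly available model can be computed directly; the sketch below uses GPT-2 small merely as an example, since the models discussed above are orders of magnitude larger:

```python
# Count the parameters (not "neurons") of a small open language model.
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")
num_params = sum(p.numel() for p in model.parameters())
print(f"GPT-2 small: {num_params / 1e6:.0f} million parameters")  # roughly 124 million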
In the meantime, we find that current language models are not as smart as a first impression suggests. When we compared stories generated by language models to stories written by humans, we found that the generated stories fall behind in terms of quality 7. We also found that language models have problems even with simple reasoning tasks 8: For example, when we ask “The lawyer visited the doctor; did the doctor visit the lawyer?”, simple language models tend to say yes. GPT-3 even replies that the lawyer did not visit the doctor. One possible reason that we found is that these language models encode word positions in a symmetric way, which makes it harder to distinguish the subject from the object of a sentence 9. What is more, language models are brittle: a slight change in the formulation of a question may lead to a very different answer. For example, when we ask GPT-3 “Did Elvis Presley die?”, we get the answer “Yes”. When we ask “Is Elvis Presley alive?”, we get “There is no definite answer to this question”. Indeed, an entire sub-discipline of artificial intelligence is nowadays concerned with engineering the right questions to ask. Furthermore, transformer-based language models have theoretical limitations: they cannot distinguish sequences with an odd number of occurrences of a certain item from sequences with an even number, if these occurrences are interspersed with another item 12. In practice, this means that the models cannot solve a task that we call the “pizza-task” — a simple riddle of the form “The light is off. I press the light switch. I eat a pizza. I press the light switch. Is the light on?” 8. Here, an even number of switch presses means that the light is off, but even a fine-tuned BERT model is unable to learn this. GPT-3 refuses outright to concede that the light is off. Today’s language models also have difficulty with negation 10, and generally perform poorly on reasoning tasks as soon as these become more complex 8. For example, consider the following riddle from the National Civil Servants Examination of China 11: David knows Mr. Zhang’s friend Jack, and Jack knows David’s friend Ms. Lin. Every one of them who knows Jack has a master’s degree, and every one of them who knows Ms. Lin is from Shanghai. Who is from Shanghai and has a master’s degree? Current models answer correctly in only 45% of the cases — while human performance is 96%. It would thus be unwise to rely on a language model for reasoning. Even if the language model comes up with an explanation or a proof for its answer, we have to ask what it means if a language model that is 99% correct says that the proof it delivered is 100% correct.
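To see why the pizza-task is, at its core, a parity problem, the following toy sketch (our own illustration of the logical structure, not part of the cited experiments and not a model of how a language model fails) tracks the light state explicitly:

```python
# The pizza-task boils down to tracking one bit of state through a sequence of
# events, i.e., to counting switch presses modulo 2. Distractor events are ignored.
def light_is_on(events, initially_on=False):
    on = initially_on
    for event in events:
        if "press the light switch" in event.lower():
            on = not on          # each press flips the state
        # other events ("I eat a pizza.") do not affect the light
    return on

riddle = ["I press the light switch.", "I eat a pizza.", "I press the light switch."]
print("Is the light on?", light_is_on(riddle, initially_on=False))  # False: two presses cancel out
```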
Quite possibly, the strongest drawback of current language models is that they are unable to memorize an exact set of items larger than a certain threshold. Unless they are coupled with a standard database 13, they cannot remember, say, a list of one million clients, or a set of 10,000 proteins. The model will, at its own discretion, forget some of the items. An even bigger problem is that it may also, again at its own discretion, invent new items (a process aptly called hallucination). This is because language models are machine learning models, which are built to interpolate and generalize, not to memorize (see, e.g., the discussion in 14). The problem is that it is impossible to tell from the output of a language model whether an answer was memorized or made up. For example, when one asks GPT-3 for the winners of the Turing Award, the model replies with 3 names. The user has no way to know that 2 of these are correct, 1 is incorrect, and 71 are missing. When one asks GPT-3 where Alan Turing died, the model replies “Warrington” with a straight face, instead of the correct “Wilmslow”. When one asks to whom Alan Turing was married (knowing that he was gay and that the idea of gay marriage was not particularly popular in post-war Britain), the model answers with Sara Turing — who was Alan Turing’s mother. Again, it is impossible to tell just from the output that the answer was made up 15 16 17. This self-confidence is exposed in ridiculous ways if one asks nonsensical questions. For example, to the question “When was Egypt transported for the second time across the Golden Gate Bridge?”, the model politely replies that “Egypt was transported for the second time across the Golden Gate Bridge on October 13, 2017” 5. This behavior has been called cluelessly clueless: the model has no idea that it has no idea.
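The contrast with an exact store can be made explicit in a few lines. The following is a deliberately trivial sketch of the idea of coupling with a standard database mentioned above, not a description of any particular system; the only fact it contains comes from the text:

```python
# Facts that must not be forgotten or invented live in an ordinary key-value store.
# If the store has no entry, the system says so, instead of letting a model guess.
facts = {
    ("Alan Turing", "place of death"): "Wilmslow",
}

def answer(subject, relation):
    value = facts.get((subject, relation))
    return value if value is not None else "I don't know."

print(answer("Alan Turing", "place of death"))   # Wilmslow
print(answer("Alan Turing", "spouse"))           # I don't know.
```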
In some cases, the model has an idea, but it is the wrong idea. For example, when we ask chatGPT whether adding two odd numbers gives an odd number, the model replies:
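For reference, the correct answer is that the sum of two odd numbers is always even: writing the numbers as 2a+1 and 2b+1, their sum is 2(a+b+1). A brute-force check (our own one-liner, independent of the model's reply) confirms this:

```python
# The sum of two odd numbers, (2a+1) + (2b+1) = 2(a+b+1), is always even.
assert all(((2 * a + 1) + (2 * b + 1)) % 2 == 0 for a in range(50) for b in range(50))
```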
All of this is not to say that language models are not amazing tools with mind-blowing capabilities. Nor is it to say that language models will never be able to overcome these challenges, or that other deep-learning methods will not be developed to that end. Rather, it is to say that, at the time of this writing, language models are not the tool of choice when it comes to reasoning or to storing exact data. For these domains, the tools of choice are currently still symbolic representations: databases, knowledge bases, and logic. These representations store data not as probability distributions over words, but as sets of entities (such as people, commercial products, or proteins) and relations between these entities (such as who bought what, what contains what, etc.) 18. Logical rules or constraints are then used to reason on these relations in a provably correct way — although usually without taking into account probabilistic information. It is maybe no coincidence, then, that the same big companies that build some of the most powerful language models (Google, Facebook, IBM) also build some of the largest knowledge bases 19. As fate would have it, these symbolic representations are nowadays often built by using language models to extract information from natural language text 20: language models are not the end goal, but a means to a (symbolic) end.
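A minimal sketch of the symbolic side, with entities and facts invented purely for illustration: data is stored as (subject, relation, object) triples, and a logical rule (here, that "contains" is transitive) is applied until no new facts can be derived — every derived fact provably follows from the stored ones.

```python
# Toy knowledge base: facts as triples, plus one logical rule applied to a fixpoint.
facts = {
    ("menu", "contains", "pizza"),
    ("pizza", "contains", "cheese"),
    ("cheese", "contains", "lactose"),
}

def apply_transitivity(facts, relation="contains"):
    """Derive (a, r, c) whenever (a, r, b) and (b, r, c) hold, until nothing new appears."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for (a, r1, b) in list(derived):
            for (b2, r2, c) in list(derived):
                if r1 == r2 == relation and b == b2 and (a, relation, c) not in derived:
                    derived.add((a, relation, c))
                    changed = True
    return derived

for triple in sorted(apply_transitivity(facts)):
    print(triple)   # includes the derived fact ('menu', 'contains', 'lactose')
```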
It thus appears that language models and symbolic representations are complementary: language models excel at general knowledge, and at analyzing and generating natural language text. Symbolic methods are the tool of choice when it comes to storing exact items and reasoning on them. Again, an analogy with the human brain can be instructive: some tasks are easy enough for the human brain to be performed unconsciously, in a matter of milliseconds (think of reading simple words, or of grasping the sum “2+2”). Then there are more abstract operations that require effortful, conscious, and logical thinking (say, memorizing phone numbers, solving equations, or determining the price/quality ratio of two washing machines). Daniel Kahneman has dichotomized this spectrum into “System 1” for subconscious reasoning and “System 2” for effortful reasoning 21. With current technology, it seems that language models solve “System 1” problems, while symbolic representations are suited for “System 2” problems. At least for now, it thus appears that both approaches have their raison d’être. What is more, an entire spectrum between the two remains to be explored.