Machines beat humans on reading tests, but do they understand what they read?

A tool called BERT can now outperform humans on reading comprehension tests. But it also reveals how far AI still has to go.




In the fall of 2017, Sam Bowman, a computational linguist at New York University, decided that computers still did not understand text very well. Sure, they had become decent at simulating that understanding in narrow domains such as machine translation or sentiment analysis (for example, deciding whether a sentence sounds "mean or nice," as he put it). But Bowman wanted measurable evidence of the real thing: genuine comprehension of text written in human language. So he came up with a test.

In an April 2018 paper written with collaborators from the University of Washington and DeepMind, the Google-owned artificial intelligence company, Bowman introduced a battery of nine reading comprehension tasks for computers called GLUE (General Language Understanding Evaluation). The benchmark was designed as "a fairly representative sample of what the research community considers interesting tasks," Bowman said, but also one that is "easy for humans." For example, one task asks whether a sentence is true based on information from a preceding sentence. If you can tell that "President Trump landed in Iraq for the start of a seven-day visit" implies that "President Trump is on an overseas visit," you pass.
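To make the task format concrete, here is a minimal, hypothetical sketch in Python of what one such entailment example might look like as data, together with a deliberately naive baseline; the field names and the word-overlap rule are illustrative, not part of GLUE itself.

```python
# A hypothetical GLUE-style entailment example (field names are illustrative).
example = {
    "premise":    "President Trump landed in Iraq for the start of a seven-day visit.",
    "hypothesis": "President Trump is on an overseas visit.",
    "label":      "entailment",   # the system must decide: entailment or not
}

def naive_baseline(premise: str, hypothesis: str) -> str:
    """A deliberately crude guess: predict 'entailment' whenever the two
    sentences share several words. Benchmarks like GLUE are meant to be
    hard enough that such shortcuts are not sufficient."""
    overlap = set(premise.lower().split()) & set(hypothesis.lower().split())
    return "entailment" if len(overlap) >= 3 else "not_entailment"

print(naive_baseline(example["premise"], example["hypothesis"]))
```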

The machines bombed. Even state-of-the-art neural networks scored no higher than 69 out of 100 across all nine tasks: a D-plus, in letter-grade terms. Bowman and his colleagues weren't surprised. Neural networks - multilayered constructions of computational connections loosely modeled on the way neurons work in the mammalian brain - had shown promise in the field of "natural language processing" (NLP), but researchers weren't convinced these systems were learning anything substantial about language itself. And GLUE seemed to prove it. "Early results show that solving GLUE is beyond the capabilities of current models and methods," Bowman and his co-authors wrote.

Their assessment didn't hold up for long. In October 2018, Google introduced a new method called BERT (Bidirectional Encoder Representations from Transformers). It scored 80.5 on GLUE. In just six months, machines had jumped from a D-plus to a B-minus on a brand-new benchmark meant to measure their real understanding of natural language.

"It was definitely an 'oh, damn' moment," Bowman recalled, using a more colorful word. "The community reaction was incredulity. BERT was getting scores on many of the tasks that were close to what we thought would be the maximum possible." Indeed, before BERT arrived, GLUE didn't even include human baseline scores to compare against. When Bowman and one of his graduate students added them in February 2019, they lasted only a few months before a BERT-based model from Microsoft beat them as well.

As of this writing, nearly every top spot on the GLUE leaderboard is held by a system that incorporates, extends or optimizes BERT. Five of them exceed the human baseline.

But does this mean AI is starting to understand our language, or is it just getting better at gaming our tests? After BERT-based neural networks took benchmarks like GLUE by storm, new evaluation methods appeared that cast these NLP systems as computerized versions of Clever Hans, the early-20th-century horse that seemed smart enough to do arithmetic in its head but was actually reading unconscious cues from its handler.

"We know we're somewhere in the gray area between solving language in a very boring, narrow sense and building AI," Bowman said. "The general reaction in the field was: How did this happen? What does it mean? What do we do now?"

Writing Your Own Rules


In the famous "Chinese Room" thought experiment, a person who knows no Chinese sits in a room filled with rule books. Together, the books specify exactly how to take any incoming sequence of Chinese symbols and compose an appropriate reply. A person outside slips questions written in Chinese under the door. The person inside consults the rule books and sends back perfectly coherent answers in Chinese.

The thought experiment has been used to argue that, whatever it looks like from the outside, the person in the room cannot be said to have any real understanding of Chinese. Still, even a simulation of understanding has been an acceptable enough goal for NLP.

The only problem is that perfect rule books don't exist, because natural language is far too complex and haphazard to be reduced to a rigid set of specifications. Take syntax, for example: the rules (and rules of thumb) that govern how words group into meaningful sentences. The phrase "colorless green ideas sleep furiously" has proper syntax, but anyone who knows the language recognizes it as nonsense. What prewritten rule book could capture this unwritten fact about natural language, not to mention countless others?

NLP researchers have tried to square this circle by having neural networks write their own makeshift rule books, in a process called pretraining.

Before 2018, one of NLP's main pretraining tools was something like a dictionary. Known as word embeddings, this dictionary encoded associations between words as numbers that a neural network could accept as input - roughly like handing the person in the Chinese room a crude vocabulary book. But a network pretrained with word embeddings is still blind to the meaning of words at the sentence level. "It would think that 'the man bit the dog' and 'the dog bit the man' are exactly the same thing," said Tal Linzen, a computational linguist at Johns Hopkins University.
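A minimal sketch of why that is, using made-up three-dimensional vectors: if the network's input is just the bag of word vectors, both sentences collapse into the same representation.

```python
import numpy as np

# Toy word embeddings (the 3-dimensional values are made up for illustration).
embeddings = {
    "the": np.array([0.1, 0.1, 0.1]),
    "man": np.array([0.2, 0.7, 0.1]),
    "bit": np.array([0.9, 0.1, 0.4]),
    "dog": np.array([0.3, 0.6, 0.2]),
}

def bag_of_vectors(sentence: str) -> np.ndarray:
    # Sum the word vectors, which throws away word order entirely.
    return sum(embeddings[word] for word in sentence.split())

print(np.allclose(bag_of_vectors("the man bit the dog"),
                  bag_of_vectors("the dog bit the man")))   # True: identical
```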


Tal Linzen, computational linguist at Johns Hopkins University.

A better approach uses pretraining to equip the network with richer rule books - not just a dictionary, but syntax and context as well - before training it on a specific NLP task. In early 2018, researchers at OpenAI, the University of San Francisco, the Allen Institute for Artificial Intelligence and the University of Washington simultaneously hit on a clever way to approximate this. Instead of pretraining only the network's first layer with word embeddings, they began pretraining entire networks on a broader task called language modeling.

"The simplest kind of language model is: I'm going to read a bunch of words and then try to predict the next one," explained Myle Ott, a research scientist at Facebook. "If I say, 'George W. Bush was born in,' the model has to predict the next word in that sentence."

Such deep pretrained language models can be built fairly efficiently. Researchers simply feed their neural networks huge amounts of written text from freely available sources like Wikipedia - billions of words arranged into grammatical sentences - and let the networks learn to predict the next word on their own. In essence, it's like asking the person in the Chinese room to write his own rule book, using only the incoming Chinese messages for reference.
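A toy sketch of that self-supervised setup (the sentence is just an example): raw text provides its own training signal, because every position yields a (context, next word) pair without any human labeling.

```python
# Build next-word prediction pairs from raw text; no labels are needed,
# because the text itself supplies the answers.
text = "George W. Bush was born in Connecticut in 1946".split()

training_pairs = [
    (text[:i], text[i])      # (everything read so far, the word to predict)
    for i in range(1, len(text))
]

for context, target in training_pairs[:3]:
    print(" ".join(context), "->", target)
# George -> W.
# George W. -> Bush
# George W. Bush -> was
```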

"The beauty of this approach is that the model ends up learning a ton about syntax," Ott said.

Better yet, these pretrained networks can then apply their richer representations of language to learning a narrower task unrelated to word prediction, a process called fine-tuning.

"You can take the model from the pretraining stage and adapt it to whatever real task you need," Ott explained. "By doing that, you get much better results than if you had tried to tackle your end task directly from scratch."
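As a rough illustration of that pretrain-then-fine-tune workflow, here is a minimal sketch using the Hugging Face `transformers` library; the checkpoint name, the binary task and the hyperparameters are illustrative assumptions, not details from the article.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Load weights produced by the generic pretraining phase.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2,   # the narrow downstream task: binary classification
)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# One fine-tuning step on a single toy example with a hypothetical label.
inputs = tokenizer("President Trump is on an overseas visit.", return_tensors="pt")
labels = torch.tensor([1])
outputs = model(**inputs, labels=labels)   # loss comes from the new task-specific head
outputs.loss.backward()                    # gradients also update the pretrained body
optimizer.step()
```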

Indeed, in June 2018, when OpenAI introduced its GPT neural network, which included a language model pretrained for an entire month on nearly a billion words (drawn from 11,038 digital books), its GLUE score of 72.8 immediately became the best on the leaderboard. Even so, Sam Bowman assumed the field had a long way to go before any system would even come close to human-level performance.

And then BERT appeared.

A Promising Recipe


So what is BERT?

First of all, it's not a fully trained neural network capable of delivering human-level results right out of the box. Instead, Bowman said, BERT is "a very precise recipe for pretraining a neural network." Just as a baker can follow a recipe to reliably produce a delicious pre-baked pie crust - which can then be used for many different kinds of pie, from blueberry to spinach quiche - Google researchers developed the BERT recipe to serve as an ideal foundation for "baking" neural networks (that is, fine-tuning them) so that they perform well on all sorts of natural language processing tasks. Google also open-sourced the BERT code, which means other researchers don't have to repeat the recipe from scratch - they can simply download it, much like buying a pre-made pie crust from the store.

If BERT is essentially a recipe, what's on the ingredient list? "It's the result of three different things coming together to really make the system work," said Omer Levy, a research scientist at Facebook who has analyzed how BERT works.


Omer Levy, Facebook Researcher

The first is a pretrained language model - those reference books in the Chinese room. The second is the ability to figure out which features of a sentence matter most.

In 2017, Jakob Uszkoreit, an engineer at Google Brain, was working on ways to accelerate the company's language understanding efforts. He noticed that state-of-the-art neural networks all suffered from a built-in constraint: they processed sentences word by word. This "sequentiality" seemed to match intuitions about how people read text. But Uszkoreit wondered whether "it might be the case that understanding language in a linear, sequential fashion is suboptimal."

Uszkoreit and his colleagues developed a new neural network architecture centered on "attention," a mechanism that lets each layer of the network assign more weight to certain features of the input than to others. This attention-based architecture, called a transformer, can take a sentence like "a dog bites the man" as input and encode each word in several different ways in parallel. For example, a transformer can link "bites" and "man" together as verb and object while ignoring the article "a"; at the same time, it can link "bites" and "dog" together as verb and subject while mostly ignoring the article "the."

The transformer's non-sequential nature represents sentences more expressively - treelike, as Uszkoreit puts it. Each layer of the network makes many parallel connections between certain words while ignoring the rest, much the way a schoolchild diagrams a sentence. These connections are often drawn between words that aren't anywhere near each other. "Those structures effectively look like a number of trees overlaid on each other," Uszkoreit explained.

These treelike representations give transformers a powerful way to model contextual meaning, and also to efficiently learn associations between words that sit far apart in complex sentences. "It's a bit counterintuitive," Uszkoreit said, "but it is rooted in linguistics, which has long worked with treelike models of language."
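The core of that attention mechanism can be sketched in a few lines of NumPy. This is a minimal, illustrative version of scaled dot-product attention (the dimensions and random projection matrices are arbitrary), showing how every word's representation is built in parallel as a weighted mix of all the other words.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys, values):
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)   # how relevant each word is to every other word
    weights = softmax(scores, axis=-1)       # the "attention" each word pays to the others
    return weights @ values                  # each output row mixes the whole sentence

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                  # 4 words ("a", "dog", "bites", "man"), 8-dim vectors
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = attention(x @ W_q, x @ W_k, x @ W_v)
print(out.shape)                             # (4, 8): one context-aware vector per word
```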


Jakob Uszkoreit, head of the Google AI Brain team in Berlin.

Finally, the third ingredient in the BERT recipe takes nonlinear reading one step further.

Unlike other pretrained language models, which are created by having neural networks process terabytes of text from left to right, BERT's model reads left to right and right to left at the same time, and learns to predict words that have been randomly masked out of sentences. For example, BERT might take a sentence like "George W. Bush [...] in Connecticut in 1946" and predict the word hidden in the middle (in this case, "born") after processing the text in both directions. "This bidirectionality pushes the neural network to extract as much information as it can from any subset of words," Uszkoreit said.
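A hedged sketch of that masked-word game, using the `fill-mask` pipeline from the Hugging Face `transformers` library; the checkpoint name is an assumption, and the exact ranking of candidates depends on the model you load.

```python
from transformers import pipeline

# Load a BERT checkpoint that was pretrained with masked language modeling.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

sentence = "George W. Bush was [MASK] in Connecticut in 1946."
for candidate in fill_mask(sentence)[:3]:
    print(candidate["token_str"], round(candidate["score"], 3))
# A well-pretrained model should rank "born" near the top, because it has
# read the words on both sides of the gap.
```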

The fill-in-the-blank pretraining task that BERT uses - known as masked language modeling - is nothing new. In fact, it has been used to assess human language comprehension for decades. For Google, it offered a practical way to bring bidirectionality into neural networks, in place of the unidirectional pretraining methods that had previously dominated the field. "Before BERT, unidirectional language modeling was the standard, even though it is an unnecessarily restrictive constraint," said Kenton Lee, a researcher at Google.

Each of these three ingredients - a deep pretrained language model, attention and bidirectionality - existed separately before BERT. But until Google released its recipe in late 2018, no one had combined them in such a successful way.

Refining the Recipe


Like any good recipe, BERT was soon adapted by various cooks to their own tastes. In the spring of 2019, Bowman recalled, there was a period "when Microsoft and Alibaba were leapfrogging each other week by week, tuning their models and trading places at the top of the leaderboard." When an improved version of BERT called RoBERTa first appeared later that summer, the DeepMind researcher Sebastian Ruder dryly noted in his widely read NLP newsletter: "Another month, another state-of-the-art pretrained language model."

Like a pie crust, BERT reflects a number of design decisions that affect how well it works: the size of the neural network being baked, the amount of pretraining data, how words are masked, and how long the network trains on that data. Subsequent recipes like RoBERTa come from researchers tweaking these decisions, much as a chef refines a dish.

In RoBERTa's case, researchers at Facebook and the University of Washington increased some ingredients (more pretraining data, longer input sequences, more training time), removed one (a "next sentence prediction" task originally included in BERT that turned out to hurt performance) and modified another (they made the individual-word masking task harder). The result: first place on the GLUE leaderboard, briefly. Six weeks later, researchers from Microsoft and the University of Maryland added their own tweaks to RoBERTa and eked out a new win. As of this writing, yet another model, ALBERT (short for "A Lite BERT"), has taken GLUE's top spot by further adjusting BERT's basic design.

“We're still sorting out which recipes work, which ones don't,” said Ott of Facebook, who worked on RoBERTa.

But just as perfecting your pie-crust technique won't teach you the principles of chemistry, incrementally improving BERT doesn't necessarily yield much theoretical insight about advancing NLP. "I'll be perfectly honest with you: I don't follow these papers, because they're extremely boring to me," said Linzen, the Johns Hopkins computational linguist. "There is a scientific puzzle here," he grants, but it isn't figuring out how to make BERT and its descendants smarter, or even why they're so smart in the first place. Instead, "we are trying to understand to what extent these models really understand language," he said, "as opposed to picking up strange tricks that happen to work on the data sets we usually evaluate them on."

In other words: BERT is doing something right. But what if it's doing it for the wrong reasons?

Tricky but Not Smart


In July 2019, two researchers from National Cheng Kung University in Taiwan used BERT to achieve an impressive result on a relatively obscure benchmark called the argument reasoning comprehension task. Completing the task requires selecting the implicit premise (the "warrant") that supports an argument for some claim. For example, to argue that "smoking causes cancer" (the claim) because "scientific studies have shown a link between smoking and cancer" (the reason), you need to pick the warrant "scientific studies are credible," rather than the alternative "scientific studies are expensive" (which may be true, but is irrelevant in this context). Got all that?

If not, don't worry. Even humans don't do particularly well on this task without practice: the average baseline score for an untrained person is 80 out of 100. BERT scored 77, which the authors considered "surprising."

But instead of concluding that BERT could give neural networks reasoning skills worthy of Aristotle, they suspected a simpler explanation: BERT was picking up on superficial patterns in how the warrants were phrased. Indeed, after reanalyzing the training data, the authors found ample evidence of these so-called spurious cues. For example, simply choosing every warrant that contained the word "not" gave correct answers 61% of the time. After these patterns were scrubbed from the data, BERT's score dropped from 77 to 53 - roughly equivalent to random guessing. An article in The Gradient, a machine learning magazine published out of the Stanford Artificial Intelligence Laboratory, compared BERT to Clever Hans, the horse with the phony powers of arithmetic.
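A minimal sketch of the kind of spurious cue the authors describe: a "classifier" that never looks at the claim or the reason at all, and simply prefers whichever candidate warrant contains the word "not." The data format here is hypothetical; only the cue itself and the 61% figure come from the article.

```python
def pick_by_cue(warrants):
    """Pick whichever candidate warrant contains the word 'not';
    fall back to the first one otherwise. On the real data set, this
    shortcut alone was reportedly right about 61% of the time."""
    for warrant in warrants:
        if "not" in warrant.lower().split():
            return warrant
    return warrants[0]

candidates = ["Scientific studies are credible.",
              "Scientific studies are not credible."]
print(pick_by_cue(candidates))   # picks the one with "not", claim unread
```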

In another paper, "Right for the Wrong Reasons," Linzen and his co-authors published evidence that BERT's high scores on certain GLUE tasks might also be attributable to spurious cues in the training data. They introduced an alternative data set designed to deprive BERT of that kind of shortcut, called HANS (Heuristic Analysis for Natural-Language-Inference Systems).

So are BERT and all of its leaderboard-storming relatives essentially a sham? Bowman agrees with Linzen that some of GLUE's training data is sloppy - riddled with subtle biases introduced by the people who created it, all of which a powerful BERT-based network can potentially exploit. "There's no single cheap trick that will solve everything in GLUE, but there are lots of ways to cut corners that will help," Bowman said, "and the model can find those shortcuts." But he doesn't think BERT's success is built on nothing, either. "We seem to have a model that has really learned something interesting about language," he said. "But it certainly isn't understanding human language in any general sense."

According to Yejin Choi, a computer scientist at the University of Washington and the Allen Institute, one way to encourage progress toward genuine language understanding is to focus not just on building better BERTs but also on designing better benchmarks and training data that lower the odds of Clever Hans-style cheating. Her work explores an approach called adversarial filtering, which uses algorithms to scan NLP training data and remove examples that are overly repetitive or that otherwise leave implicit cues for a neural network to pick up on. After adversarial filtering, "BERT's performance can drop significantly," she said, while "human performance doesn't drop as much."

Still, some NLP researchers believe that even with better training, real obstacles to genuine language understanding will remain. Even with its powerful pretraining, BERT is not designed to model language perfectly in general. After fine-tuning, it models "a specific NLP task, or even a specific data set for that task," said Anna Rogers, a computational linguist at the Text Machine Lab at the University of Massachusetts. And it's likely that no training data set, however carefully constructed or filtered, can capture all the edge cases and unexpected inputs that humans handle effortlessly when using natural language.

Bowman points out that it's hard even to know what would convince us that a neural network has achieved real understanding of language. Standardized tests are supposed to reveal something generalizable about the test-taker's knowledge; but as every student knows, tests can be gamed. "It's very hard for us to design tests that are hard enough, and cheat-proof enough, that solving them convinces us we've really solved some aspect of AI language technology," he said.

Bowman and his collaborators recently introduced a test called SuperGLUE, designed specifically to be hard for BERT-based systems. So far, no neural network has beaten the human baseline on it. But even if (or when) that happens, will it mean that machines have genuinely learned to understand language better than before? Or will it just mean that science has gotten better at teaching machines to beat the test?

"That's a good analogy," Bowman said. "We figured out how to pass the LSAT and the MCAT, and yet we might not actually be qualified to be doctors or lawyers." And yet, by all appearances, that is exactly how AI research moves forward. "Chess felt like a serious test of intelligence until we figured out how to write a chess program," he said. "We're definitely in an era where the goal is to keep coming up with harder problems that represent language understanding, and to keep figuring out how to solve them."

Source: https://habr.com/ru/post/479446/

