University of Amsterdam’s Lab helps create language technologies that go beyond where ChatGPT ends
UvA’s Language and Technology Lab helps create language technologies for languages for which little data is available and which are not served by the big tech companies.
In recent months, the text generator ChatGPT has amazed the world by automatically writing humanlike texts in all kinds of styles. Based on the prompts you type in, ChatGPT can generate news articles, long reads, essays, poems, dialogues, scripts and even jokes or computer code. It can also answer questions and translate.
Scale up
The fundamental techniques behind ChatGPT date from 2017, but since then OpenAI, the company that developed the commercial text generator, has scaled the model up from 200 million parameters to the 175 billion parameters of last year’s version. In addition, it has scaled up computing power and training data to such an extent that this year’s results have astonished even experts in the field.
‘Scientists could see ChatGPT coming,’ says UvA professor Christof Monz, ‘but I was still surprised at how well it works. It is great to see how much interest there is now in language technology. That shows how close human thinking skills and language are and also how important language is to give the impression of an intelligent system.’
Having said that, ChatGPT hasn’t solved everything in natural language processing and generation. Monz: ‘It can, for example, generate plausible-looking text that is factually incorrect, logically inconsistent, or contains harmful prejudices. You should be well aware that you cannot fully trust ChatGPT’s texts.’
At the Informatics Institute Monz leads the Language and Technology Lab, which goes beyond where ChatGPT ends. One of ChatGPT’s shortcomings is that it needs enormous amounts of data. The text generator is trained on so much text, scraped from the internet, Wikipedia, online libraries and other sources, that if a single human read for eight hours a day, seven days a week, they would need 22,000 years to get through what ChatGPT processed during its training.
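A quick back-of-envelope calculation shows what that figure implies; the reading speed of 250 words per minute below is an assumption for illustration, not a number from the article:

```python
# Back-of-envelope check of the 22,000-year figure. The reading speed is an
# assumed average, not a number given in the article.
WORDS_PER_MINUTE = 250   # assumed adult reading speed
HOURS_PER_DAY = 8
DAYS_PER_YEAR = 365

words_per_year = WORDS_PER_MINUTE * 60 * HOURS_PER_DAY * DAYS_PER_YEAR
corpus_words = 22_000 * words_per_year  # implied size of the training text

print(f"A reader covers about {words_per_year:,} words per year.")
print(f"22,000 years of reading is roughly {corpus_words:,} words.")
# -> about 43.8 million words per year, so a corpus on the order of 10^12 words
```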
‘Smaller’ languages
Of the more than seven thousand languages spoken worldwide, however, most have so little digital data available that ChatGPT cannot understand, generate or translate these ‘smaller’ languages, many of which nevertheless have millions of speakers. ‘Google Translate works for something like 140 languages,’ says Monz, ‘and the European equivalent DeepL for something like twenty languages. From the point of view of inclusiveness, though, you want to offer language technology for those smaller languages as well. There is a lot to be gained there, and that is an important part of what we do in our lab.’
The Language and Technology Lab that Monz leads focuses on machine translation, question-answering systems, document summarisation and non-toxic language generation. Multilingual aspects of language technology are a common thread.
Monz: ‘We want to be able to translate languages for which little or no data exists. Let’s take the example of translating between Arabic and Dutch. Surprisingly few texts translated from Arabic to Dutch are available, too few to train our deep learning models on. Therefore, we train our systems on other language pairs for which we do have a lot of data, for example Arabic-English, English-Chinese and Dutch-English. We try to develop a system that can find language-independent representations for multilingual sentences with the same meaning.’
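The article does not spell out the lab’s own training recipe, but one well-known published approach to training a single model on several language pairs at once is Google’s multilingual machine translation trick (Johnson et al., 2017): prepend a target-language tag to every source sentence. A minimal sketch, with invented toy examples:

```python
# A sketch of the multilingual-data trick from Johnson et al. (2017): one model
# is trained on many language pairs at once, with a target-language tag
# prepended to each source sentence. The word pairs below are invented examples.
def tag_example(src_sentence: str, tgt_lang: str) -> str:
    """Prepend a target-language token so one model learns all directions."""
    return f"<2{tgt_lang}> {src_sentence}"

training_pairs = [
    ("ar", "en", "كتاب", "book"),
    ("en", "zh", "book", "书"),
    ("nl", "en", "boek", "book"),
]

for src_lang, tgt_lang, src, tgt in training_pairs:
    print(tag_example(src, tgt_lang), "->", tgt)

# After training on these directions, the same tagging lets you request
# Arabic -> Dutch ("<2nl> ...") even though that pair never appeared in training.
```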
Neural networks
Deep learning systems are essentially neural networks: thousands to billions of artificial neurons arranged in tens or hundreds of layers and connected with each other. The number of connections between the neurons is the number of parameters of the model. Two sentences in two different languages have the same representation if the numbers the network computes for them, the activations of its neurons, are equal or roughly equal.
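As an illustration of that counting, here is a minimal PyTorch sketch; the layer sizes are arbitrary, not the dimensions of any system mentioned here:

```python
# A minimal PyTorch sketch: the parameters are mostly the weights on the
# connections between layers, plus one bias value per neuron. Layer sizes
# here are arbitrary illustrations.
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(512, 1024),  # 512*1024 connection weights + 1024 biases
    nn.ReLU(),
    nn.Linear(1024, 512),  # 1024*512 connection weights + 512 biases
)

total = sum(p.numel() for p in model.parameters())
print(f"{total:,} parameters")  # 1,050,112 = 2*(512*1024) + 1024 + 512
```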
‘We are trying to invent techniques that give the same representation to multilingual sentences with the same meaning,’ says Monz. ‘We are not there yet, but ideally, if an Arabic sentence has the same representation as a Dutch sentence, you have found the Dutch translation of the Arabic sentence without any explicit Arabic-Dutch translation data being available.’
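To make the idea concrete, here is a minimal sketch using an off-the-shelf multilingual sentence encoder from the sentence-transformers library; the model name and the sentences are illustrative choices, not the lab’s own system:

```python
# A sketch of translation-by-representation with an off-the-shelf multilingual
# encoder (sentence-transformers). The model and sentences are illustrations.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

arabic = "القطة تجلس على السجادة"  # "The cat sits on the mat"
dutch_candidates = [
    "De kat zit op de mat",
    "De hond rent door het park",
    "Het regent vandaag",
]

# Encode all sentences into the same vector space, regardless of language.
query_vec = model.encode(arabic, convert_to_tensor=True)
cand_vecs = model.encode(dutch_candidates, convert_to_tensor=True)

# The Dutch sentence whose vector lies closest to the Arabic one should be
# its translation, even though no Arabic-Dutch parallel data was used.
scores = util.cos_sim(query_vec, cand_vecs)[0]
best = scores.argmax().item()
print(dutch_candidates[best], f"(cosine similarity {scores[best]:.2f})")
```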