Study shows risks and limitations of ChatGPT and Bing Chat
Chatbots like OpenAI’s ChatGPT, Microsoft’s Bing Chat, and Google’s Bard are grabbing the world’s attention. In an opinion piece for News24, Dr Tanya de Villiers-Botha (Centre for Applied Ethics) emphasises the need for greater awareness about the ethical risks arising from this technology.
Tanya de Villiers-Botha
A little over two years ago, the co-leader of Google’s Ethical AI team, Dr Timnit Gebru, was forced out of the company on the back of a peer-reviewed paper she and members of her team wrote on some of the ethical risks relating to Large Language Models (LLMs), the type of AI model that underlies chatbots like OpenAI’s ChatGPT, Microsoft’s Bing Chat, and Google’s Bard. The paper flags potential dangers arising from this technology and makes some recommendations on how these might be mitigated.
Some within Google reportedly found the paper “too bleak” and countered that the technology had been engineered to avoid at least some of the problems flagged. Two-odd years and the much-hyped introduction of two chatbots based on this technology later, Gebru and her team seem largely vindicated.
In their paper, the researchers note that using deep learning to create general models that can process natural language and perform language-based tasks comes with unique risks, three of which are particularly salient in the context of the ChatGPT- and Bing-related stories currently in the news. (Microsoft’s limited-release Bing Chat purportedly makes use of a next-generation version of ChatGPT.) These are bias, lack of actual understanding, and the potential to mislead. To understand how these risks arise, we need to look at how these systems are trained.
Large Language Models are trained on massive amounts of data relating to human language use so that they can appropriately and convincingly simulate such language use in response to prompts. In many ways, this technology is very successful. ChatGPT’s facility with language is astounding. The technology can be useful in specific, contained contexts that require natural language-use abilities, such as transcription, translation, and answering FAQs. Nevertheless, as is vividly demonstrated by Bing’s more unhinged recent outputs and by the enthusiastic attempts to jailbreak ChatGPT, the technology does not come without significant risks and limitations.
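To make the kind of contained use mentioned above concrete, here is a minimal sketch of a narrow translation task. It assumes the OpenAI Python client (the pre-1.0 ChatCompletion interface) and a placeholder API key; the model name, prompt, and text are purely illustrative and not drawn from the article.

```python
# A minimal, illustrative sketch of a contained LLM use case (translation),
# assuming the OpenAI Python client's pre-1.0 ChatCompletion interface.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder; supply your own key

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",  # illustrative model name
    messages=[
        {"role": "system",
         "content": "You are a translation assistant. Translate the user's text into Afrikaans."},
        {"role": "user",
         "content": "The meeting has been moved to Thursday at 10:00."},
    ],
    temperature=0,  # keep output as predictable as possible for a narrow task
)

print(response["choices"][0]["message"]["content"])
```

The point of keeping the task this narrow is that the system’s fluency is doing well-defined work, and a human can easily check the result; the risks discussed below grow as the task becomes more open-ended.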
Biases
LLMs require enormous amounts of data to learn, which means that most of their training data is scraped off the internet, the largest and most readily accessible source of examples of “natural” human language use available. An almost inevitable result is that any biases inherent in that data are “baked into” the system. For example, part of the data used to train GPT-2, the forerunner to ChatGPT, came from scraping outbound links from Reddit, which has a relatively homogenous user base: in the United States, 67% of Reddit users are men and 64% are aged between 18 and 29. ChatGPT has almost certainly ingested this source, as well as Wikipedia, which has its own documented problems with bias.
As Gebru et al. note, data derived from the web often skews towards English content created by young men from developed countries. Any source of data consisting of text or other information generated by a relatively homogenous group is likely to reflect the worldview and biases of that group. This is how racist, sexist, homophobic, and similarly prejudiced language enters the training sets of LLMs. Any such biases will be replicated, and sometimes amplified, by the model: it outputs what it learns.
OpenAI, the creators of ChatGPT, seem to have tried to mitigate this risk, both through reinforcement learning from human feedback (a process fraught with ethical issues), which attempts to have the system “unlearn” such biases, and through adding guardrails that prohibit it from outputting some of the more overtly offensive responses. Those who bemoan the purportedly “woke” values they see exemplified in the current guardrails seem to lose sight of the fact (or perhaps they don’t) that without these guardrails in place, ChatGPT would not be unbiased; it would reflect the biases encoded in its training data.
The apparent left-leaning bias of ChatGPT’s guardrails is a somewhat clumsy attempt to compensate for the existing underlying biases in the opposite direction. As Gebru et al. point out, a better fix for such biases lies partly in better, more carefully curated training data; however, this option is costly and time-consuming, as there simply isn’t enough free, quality data available.
Lack of understanding
A further and very significant problem is that LLMs do not actually understand language. Ultimately, they merely mimic the language use they have been trained on, which is why Gebru et al. call them “stochastic parrots”. Advances in building and training these models have allowed for truly uncanny mimicry, but the illusion of understanding remains just that, an illusion.
These systems generate text based on statistical patterns in their training data that indicate what kind of text tends to follow a given prompt. Basically, they’re autocomplete on steroids.
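As a rough illustration of this “autocomplete on steroids” idea (a deliberately toy sketch, not OpenAI’s actual architecture or training pipeline), the snippet below builds next-word statistics from a tiny corpus and then generates text by repeatedly sampling a likely continuation. Real LLMs use neural networks over tokens rather than word counts, but the underlying principle of predicting the next item from prior context is the same.

```python
import random
from collections import defaultdict, Counter

# Toy corpus standing in for the web-scale text an LLM is trained on.
corpus = "the cat sat on the mat and the cat saw the dog on the mat".split()

# Count which word tends to follow which (a simple bigram model).
follows = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    follows[current_word][next_word] += 1

def generate(start, length=8):
    """Generate text by sampling the next word from observed frequencies."""
    word, output = start, [start]
    for _ in range(length):
        options = follows.get(word)
        if not options:  # no continuation seen in the training data
            break
        words, counts = zip(*options.items())
        word = random.choices(words, weights=counts)[0]
        output.append(word)
    return " ".join(output)

print(generate("the"))  # e.g. "the cat sat on the mat and the dog"
```

Nothing in this toy model “knows” what a cat or a mat is; it only reproduces statistical regularities in its training text, which is why fluency and truthfulness can come apart so easily.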
This lack of understanding underlies these systems’ uneasy relationship with the truth and explains why the internet is currently awash with examples of outright nonsense from both ChatGPT and Bing Chat. None of this comes as much of a surprise to those familiar with the technology.
What does come as a surprise is the apparent lack of awareness of (or concern about) these limitations on the part of those rushing to roll the technology out. Given that LLMs do not understand language and cannot distinguish between truth and falsehood, using them to power general-purpose web search does not seem wise or feasible. We have no reason to think that we are even close to a technical solution to LLMs’ falsehood problem; both Bing Chat and Google’s Bard made factual errors in their product-unveiling demonstrations.
Misleading
Gebru et al. point out the further risk that LLM-based chatbots’ facility in mimicking human-like language use, along with our tendency to anthropomorphise entities that act in human-like ways, can lead us to place too much faith in these bots, crediting them with insight, authority, and agency where there is none. Even some of the more technically informed testers of the limited-release Bing Chat report catching themselves thinking of the bot as intelligent and sentient, while it, in turn, declares that it is in love with its users or becomes belligerent in its exchanges with them. This makes it much more likely that people will be taken in by any falsehoods these bots generate and be emotionally affected by them.
A related risk, mentioned in the Gebru paper and elaborated on by Dr Gary Marcus, is that while these systems are not very good at reliably generating factual information, they are very good at generating massive amounts of convincing-sounding misinformation. This makes them ideally suited to mass propaganda, mass spamming, mass-producing fake, mutually reinforcing websites, and any other use that requires language fluency but not veracity. Moreover, if the web becomes flooded with masses of well-written misinformation, web-linked LLMs will take that output as input, further exacerbating their misinformation problem.
The current crop of LLM-based chatbots is extremely impressive in some ways and truly terrible in others. Overall, they have the feel of technical marvels in search of useful applications, and their limited roll-outs look like mass product-testing driven by industry’s overconfidence in the technology’s abilities.
What does seem clear is that, at least as far as web search and other applications requiring high accuracy and low bias are concerned, these chatbots are not ready to be released, and it is still an open question whether they ever will be. It remains to be seen whether Microsoft will be hurt by seemingly prematurely setting off the “AI arms race” to roll out LLM-based web search.
In the meantime, society will be confronted with ever more ethical risks arising from this technology, and it remains an open question whether the advantages to be had from LLMs outweigh those risks. It is worth reiterating that none of the limitations or risks mentioned here were unknown or inevitable, which further underscores the need for greater ethical awareness throughout the tech industry.