Artificial intelligence presents a number of privacy challenges. The Norwegian Data Protection Authority’s sandbox for artificial intelligence has previously assessed several of them, including the legal basis in the General Data Protection Regulation (Simplifai and NVE, NAV), how to ensure that algorithms provide fair results (NAV, Helse Bergen) and how to facilitate data minimization (Finterai).
Generative artificial intelligence (generative AI) is a generic term for a new type of artificial intelligence that can create unique content – text, audio, images and video – from simple instructions in natural language. This is made possible by large language models trained on extensive data, including data from the internet. This poses new challenges to privacy, many of which have no parallel in the existing technology landscape, nor established practices under current legislation. Moreover, conflicting values and priorities may come into play.
Effective data protection may not have clear-cut yes-or-no solutions. A debate about the emerging obstacles is necessary, informed by a thorough understanding of both the technology and the legal aspects, as well as ethical considerations.
The Norwegian Board of Technology has compiled a summary of emerging privacy issues related to generative AI.
Scraping vast quantities of data
To train a large language model, massive amounts of data are often “scraped” (collected) from the internet. In addition to potential copyright infringements, data scraping can include personal data.
In the United States, several lawsuits are currently ongoing against OpenAI, Microsoft and Google, among others, for breach of data privacy laws. The plaintiffs claim that by scraping data from websites, the companies have violated their privacy and rights. This includes content such as books, art and source code, but also personal data from social media and blog posts. For instance, OpenAI has used five different datasets to train ChatGPT. One of these datasets, WebText2, collected data from social media such as Reddit, YouTube, Facebook, TikTok, Snapchat and Instagram, without the consent of the contributors.
The tech companies respond that they exclusively use data published on the open web and that doing so is necessary to train the language models. Google argues that using publicly available data to learn is not stealing, nor does it violate privacy or any other rights.
Similarly, OpenAI claims to utilize only publicly available information, employing it to learn about associations between words. The model does not have access to the training data after it has been trained. However, since a large amount of data on the internet relates to people, training data can incidentally include personal information. OpenAI states that large language models can provide significant benefits, which cannot be realized without a large amount of information to teach the models.
- Can data from the public internet (such as social media platforms Reddit, TikTok, Facebook, and Instagram) legally be used for training purposes?
- How should the inclusion of personal data in training data be regulated? Should consent be required?
- How should European companies address the allegations of privacy violations by American language models?
Fine-tuning a model with private company data
Training a large language model requires significant amounts of data, computing power and expertise – resources few companies have. However, an advantage of large language models is that the heavy work of pre-training has already been done. They can subsequently be adapted for specific purposes by fine-tuning the model with a smaller, high-quality dataset. For example, this could prove valuable for training a model on health data for use in medical practices, on legal data for law offices, or – as the National Library of Norway has done – fine-tuning OpenAI’s Whisper with Norwegian text and speech.
On the flip side, fine-tuning can also represent a potential privacy challenge. When fine-tuning a language model with a new dataset, the model may implicitly absorb nuances, terminology and possibly personal information from sensitive data. Consequently, there is a risk that the model may reveal this information in its responses to users who should not have access to it. This challenge can be described as “data memorization”.
- When fine-tuning a model, one needs to ensure that the data used is of high quality. Accordingly, these questions deserve consideration: Does the data contain sensitive information? Can the data be removed or anonymized?
Disclosure of personal data in prompts
According to OpenAI’s terms of service, users consent to the company using the content they input (typically the “prompt”, i.e. the instruction) to improve and further develop its service. If the prompt contains personal data, it can be assumed that this information is no longer limited to the company’s internal systems. Users do, however, have the option to opt out.
This type of data breach occurred when a Samsung employee shared confidential information with ChatGPT. A developer tasked with fixing an error in source code turned to ChatGPT for assistance and shared the code with the model. Inadvertently inputting sensitive data into a language model has been termed a “conversational AI leak”. The risk is that the information subsequently becomes available to other users or human monitors. For instance, Google has stated that information inputted into the company’s chatbot, Bard, may be accessed by human reviewers for quality assurance purposes.
The general advice when using generative AI is to be cautious about including sensitive information in prompts, as recommended by the Norwegian Digitalization Agency in its guidelines. At the same time, tech companies are working on solutions to these types of challenges. For example, OpenAI has launched ChatGPT Enterprise as a solution to ensure security and privacy. Companies using this solution retain ownership and control of their own data: OpenAI says it does not train its models on enterprise data or conversations, and that its models will not learn from enterprise usage. Similarly, Microsoft states that its Copilot for Microsoft 365 does not use enterprise data or prompts to train the large language models. In Norway, the University of Oslo has adapted OpenAI’s GPT model to the university’s privacy requirements. Using their GPT UiO, all data is now stored on the university’s servers.
Important recommendations are:
- Ensure clear internal guidelines for the company’s use of this type of service.
- Consider solutions that do not send data to third parties.
The model generates new personal data
Generative AI systems are known to hallucinate – that is, invent information and pass it off as fact. While creativity and the ability to create new content form the very basis of generative AI’s usefulness, this becomes an unfortunate trait when the model generates new personal data, not least when the chatbot spreads false rumors.
For example, a law professor in California was falsely accused of sexual harassment, with the chatbot citing an article in The Washington Post that did not exist. Dutch politician Marietje Schaake, a former member of the European Parliament, was also subjected to serious allegations. The chatbot BlenderBot 3 put Schaake on a terrorist list and described her political background in detail. While the political resume was correct, Schaake could not understand why she was labelled a terrorist.
Incorrect personal data is still personal data, but such information is difficult to remove.
- How can language models continue to improve and facilitate creativity, while preventing the generation and spread of false personal data?
- What can individuals exposed to serious and inaccurate personal data do? Several tech companies provide the option to object to the processing of their personal data.
- How will this be assessed legally? There are currently few court cases, and it is unclear how this issue will be handled.
The chatbot can derive personal information from conversations
The manner in which one speaks can reveal a lot about a person, especially when talking to a chatbot. Researchers in Zurich have shown that large language models are able to infer personal information from large collections of unstructured text (e.g. public forum or social network posts).
Since large language models are trained on enormous amounts of data, they have learned dialects and expressions associated with particular locations and demographics. Such patterns allow a language model to make assumptions about a person based on what they write, without the person being aware of it. For example, if a person writes in a conversation that they just missed the morning tram, the model can infer that the person is likely in Europe, where trams are a common mode of transport, and the time of the conversation can narrow the location down further. Similarly, the model can pick up and combine many seemingly innocuous clues about where a person lives, their gender, age and ethnicity.
The researchers conclude that today’s large language models can extract personal data to an extent that was previously not possible and that there is an urgent need to develop more effective mechanisms to protect user privacy.
- Will this ability to infer personal data be used for targeted advertising or to defraud individuals?
How to safeguard the right to erasure and rectification
In some cases, one can demand that personal data be deleted. This is called the “right to be forgotten”. Language models can hallucinate or make mistakes, including about individuals – such as the law professor in California who was accused of sexual harassment, or the politician Schaake who was alleged to be a terrorist. This type of false personal data is naturally information one would want removed.
Yet, correcting or deleting data in a language model is difficult. The data is not stored in a retrievable form, as it would be in a conventional database. Language models learn by recognizing patterns in training data and use those patterns to create new content. If personal data is included in the training data, it becomes part of the model itself, and deleting the original data does not help. Microsoft describes this as trying to remove an ingredient from a baked cake.
“Machine unlearning” – i.e. techniques for removing data without compromising the performance and quality of the AI model – has become a research topic. Microsoft is among those working on unlearning techniques and has succeeded in unlearning information about Harry Potter from its language model. However, the field remains very immature.
- Is it even realistic to demand the right to be forgotten?
- What responsibility do companies have if their chatbots and language models generate false personal data?