Using existing data to train GenAI tools raises legal risks of intellectual property and copyright infringement. In the US and around the world, courts are currently litigating questions about fair use, data ownership, and licensing as they relate to how AI tools are built and used. The decisions in these cases will have implications for both developers and users of AI tools as they consider legal and ethical uses.
In the US, copyright currently requires human authorship to be granted. This has implications for creators and for how their work product is protected.
Beyond bias on individual points of fact, large language models reflect the language, attitudes, and perspectives of the creators of their training data. The style of language, the kinds of ideas expressed, and even the conclusions an LLM reaches reflect those creators, not some general "universal" human. This is true of gender and other demographics, and also of location: the vast majority of training data comes from the Global North.
The image below, from a recent paper, shows the locations of place names found in the Llama-2-70b LLM. Europe, North America, and Asia are fairly well represented, while Africa and South America are nearly absent. As a result, the language model may reflect the attitudes and cultural assumptions of people in the well-represented regions far more consistently.
Gurnee, W., & Tegmark, M. (2023). Language Models Represent Space and Time (arXiv:2310.02207). arXiv. https://doi.org/10.48550/arXiv.2310.02207
GenAI tools are often trained on biased data, reflecting issues like racism and sexism, and may underrepresent diverse perspectives. As a result, they can produce responses that are factually incorrect, biased, or both. Factually incorrect information is simply wrong, while biased information presents a skewed perspective based on incomplete or prejudiced data.
“White supremacist and misogynistic, ageist, etc., views are overrepresented in the training data, not only exceeding their prevalence in the general population but also setting up models trained on these datasets to further amplify biases and harms.”
Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell, "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?", p. 613.
Several major generative AI models have been trained on sources that include much of the internet: the good, the bad, the weird - everything. Models build their responses to prompts from these inputs, which can carry over the biases of the original material. Even if specific biased sources are excluded from a model, the overall training material can still under-represent certain groups and perspectives.
Many of the large AI companies acknowledge these risks and try to mitigate them, but it is not clear how successful they can be.
As stated earlier in this LibGuide, early LLM development was based on scraping the internet. There is a severe lack of transparency about what exactly is included in most large language models and their data sets, but there are allegations that the models include pirated content: books, images not licensed by their creators for reuse, and web content that was intended to support its creators through advertising revenue. Some companies, including OpenAI and Anthropic, are beginning to let websites opt out of future scraping.
There are legal questions surrounding fair use and what content can be scraped, put into a model and/or a data set, and used to generate new content.
You can use generative AI to create content in the style of many artists and writers. How might an artist react when an AI creates a new work similar to their own? If you created a body of artwork and someone then fed it to a GenAI to generate content in your style, would you feel slighted?
GenAI tools can take content and transform it in a number of ways. What happens when that content includes personal or private information, such as personal data or medical records? Many AI companies use user input to improve their models. Interacting with a GenAI may feel personal, but that GenAI could be interacting with millions of other people. When sharing private or personal information, keep in mind that it may be used to train and develop the model further.
In the case of ChatGPT, one can use a free model (as of this writing, GPT-4o mini) that requires creating an account, or pay extra for a newer, more robust model with more features. Paid models may offer, among other things, increased processing capability, more in-depth responses, reasoning capabilities, or longer "memory." If GenAI is used in academic work, what does it mean when some students can afford the best AI and others cannot?
Creating a GenAI model and tool set takes an enormous amount of computing resources. All of these computers require electricity to operate and water to cool the data centers that power GenAI. When generating an image of a dachshund wearing rabbit ears, or a 5-page paper for an assignment, are the emissions associated with that power consumption and cooling worth the result?