Using existing data to train GenAI tools raises legal risks of intellectual property and copyright infringement. In the US and around the world, courts are currently litigating questions about fair use, data ownership, and licensing as they relate to how AI tools are built and used. The decisions in these cases will have implications for both developers and users of AI tools as they consider legal and ethical uses.
In the US, copyright currently requires human authorship to be granted. This has implications for creators and for how their work product is protected.
Beyond bias on individual points of fact, large language models reflect the language, attitudes, and perspectives of the creators of their training data. The style of language, the kinds of ideas expressed, and even the conclusions an LLM reaches reflect those creators, not some general "universal" human. This is true of gender and other demographics, and also of location: the vast majority of training data comes from the Global North.
The image below, from a recent paper, shows the locations of place names found in the Llama-2-70b LLM. Europe, North America, and Asia are fairly well represented, while Africa and South America are nearly absent. As a result, the language model may reflect the attitudes and cultural assumptions of people in the well-represented regions far more consistently.
Gurnee, W., & Tegmark, M. (2023). Language Models Represent Space and Time (arXiv:2310.02207). arXiv. https://doi.org/10.48550/arXiv.2310.02207
GenAI tools are often trained on biased data, reflecting issues like racism and sexism, and may underrepresent diverse perspectives. As a result, they can produce responses that are factually incorrect, biased, or both. Factually incorrect information is simply wrong, while biased information presents a skewed perspective based on incomplete or prejudiced data.
“White supremacist and misogynistic, ageist, etc., views are overrepresented in the training data, not only exceeding their prevalence in the general population but also setting up models trained on these datasets to further amplify biases and harms.”
Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell, "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?", p. 613.
Several major generative AI models have been trained on sources that include much of the internet: the good, the bad, the weird - everything. Models build their responses to prompts from these inputs, which can carry over the biases of the original material. Even if specific biased sources are excluded from a model, the overall training material can still under-represent certain groups and perspectives.
Many of the large AI companies acknowledge these risks and try to mitigate them, but it is not clear how successful they can be.
As stated earlier in this LibGuide, early LLM development was based on scraping the internet. There is a severe lack of transparency about what exactly is included in most large language models and their data sets, but there are allegations that the models include pirated content: books, images not licensed by their creators for reuse, and web content that was intended to support its creators through advertising revenue. Some companies, including OpenAI and Anthropic, are beginning to let websites opt out of future scraping.
There are legal questions surrounding fair use and what content can be scraped, put into a model and/or a data set, and used to generate new content.
You can use generative AI to create content in the style of many artists and writers. How might an artist react when an AI creates a new work similar to their own? If you created a body of artwork and someone then fed it to a GenAI to generate content in your style, would you feel slighted?
GenAI tools can take content and transform it in a number of ways. What happens when that content includes personal or private information, such as personal data or medical records? Many AI companies use user input to improve their models. Interacting with a GenAI may feel personal, but that GenAI could be interacting with millions of other people. When sharing private or personal information, keep in mind that it may be used to train and develop the model further.
In the case of ChatGPT, one can use a free model (as of this writing, GPT-4o mini) that requires creating an account, or pay extra for a newer, more robust model with more features. Paid models may offer, among other things, increased processing capability, more in-depth responses, reasoning capabilities, or longer "memory." If GenAI is used in academic work, what does it mean when some students can afford the best AI and others cannot?
Creating a GenAI model and tool set takes an enormous amount of computing resources. All of these computers require electricity to operate and water to cool the data centers that power GenAI. When generating an image of a dachshund wearing rabbit ears, or a 5-page paper for an assignment, are the emissions associated with that power consumption and cooling worth the result?