- Dec 18, 2023

Your Personal Information Is Probably Being Used to Train Generative AI Models

Updated: Jul 18

Written by Arda Kizilkaya

Artists and writers are up in arms about generative AI systems, and understandably so. These machine learning models can only pump out images and text because they have been trained on mountains of real people’s creative work, much of it copyrighted. Major AI developers, including OpenAI, Meta, and Stability AI, now face multiple lawsuits because of this. Independent analyses support such legal claims; in August, for instance, The Atlantic reported that Meta trained its large language model (LLM) in part on a data set called Books3, which contained more than 170,000 pirated and copyrighted books.

Training data sets for these AI models include more than books. In the rush to build and train ever-larger AI models, developers have swept up much of the searchable Internet. This not only has the potential to violate copyrights but also threatens the privacy of the billions of people who share information online on platforms such as X, YouTube, and Instagram. It also means that supposedly neutral models could be trained on biased data. For example, if an AI model is trained on data that includes racist content, it will be more inclined to give racist answers. A lack of corporate transparency makes it difficult to figure out exactly where companies get their training data.

How Does AI Training Data Work?

Learning from data isn’t new. The ancient Egyptians used long-term observations to predict the water level of the Nile River, which means they were onto something we would today call statistical predictive modeling. The era of modern AI systems started with the rise of big data. Once you have large amounts of logged, structured data, such as clicks on products in an online store or time spent on a certain webpage, data science steps in. Building models to predict outcomes such as loan repayment rates, or how many views a video will get given current trends, becomes a standard task for a data science team. In practice, however, the data is often unstructured or, even worse, does not exist at all. A search engine will only learn to rank the most relevant sites at the top after “seeing” millions of user queries and web pages, each judged by how relevant the match is, while a voice assistant will only learn to activate correctly after the model analyses thousands of hours of speech recorded in different voices and accents amid background noise. A brand-new AI-powered app can recommend the trendiest outfit only if it is trained on a vast, up-to-date dataset of trendy outfits. If the creators of the app fail to update their dataset every season, before long it will be suggesting outfits that went out of fashion seasons ago.
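To make the idea of "learning from logged data" a little more concrete, here is a minimal sketch in Python. It is only a toy: the numbers are invented for illustration, the names `fit_line` and `predict` are hypothetical helpers, and real AI systems use far larger data sets and far more complex models. Still, the principle is the same: the model is nothing more than a summary of the data it was given.

```python
# Toy illustration of learning from data: fit a straight line to past
# observations, then use it to predict an unseen case.
# All numbers are made up for illustration only.

def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How x and y vary together, and how much x varies on its own
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(model, x):
    """Apply the fitted line to a new input."""
    slope, intercept = model
    return slope * x + intercept

# Hypothetical store log: product clicks per visit vs. items bought
clicks = [1, 2, 3, 4, 5]
items_bought = [2, 4, 6, 8, 10]

model = fit_line(clicks, items_bought)
print(predict(model, 6))  # prints 12.0
```

Everything the model "knows" comes from the two lists it was trained on. If those lists are outdated, skewed, or collected without permission, the predictions inherit the same problems.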

AI is a powerful technology that helps us in many areas of life, from driving a car so that we can rest and not worry about the road to helping us dress appropriately for the season. But because of corporate greed, one of the most important parts of any AI system, its training data, is being mishandled, and that leads to bad results for the people who use these systems. For example, if an AI model is trained on data that includes racist content, it will be more inclined to give racist answers to the person using the program. To train these programs, companies often collect data in bulk from sites where we post private information, which could lead to serious problems in the future. For example, if a company holding this kind of private information were hacked by an anonymous attacker, that attacker would know where you live, what you do, and more. On top of that, these companies also violate the copyrights of many kinds of works, such as books.


Hevsel Times
