The Ethics of GAN and LLM Training - Part 3

Jun 9

The ethical concerns around AI training data apply to both image generators and large language models (LLMs).

Image generators like Stable Diffusion were built on a dataset called LAION, which was assembled from billions of images scraped from Getty Images, Flickr, Behance, Pinterest, and other websites. If you had photographs on the internet any time between 2008 and 2021, there’s a good chance your images are in there somewhere.

For LLMs, Common Crawl was the foundation dataset, and its crawlers went through 3.1 BILLION web pages, including blog posts, forum comments, Reddit threads, and personal websites. Online publications such as The New York Times, Forbes, and The Atlantic, as well as Books3, a collection of nearly 200,000 copyrighted books, also got sucked into the training data.

No one asked for permission. No one was compensated. And no privacy law can remove your content once the training is complete. You can block future crawling, but the data already collected is baked into the already deployed models.
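
For site owners wondering what "blocking future crawling" looks like in practice, here is a minimal sketch, assuming the usual robots.txt approach. CCBot is the user agent of Common Crawl's crawler and GPTBot is OpenAI's; the robots.txt contents, the example.com URL, and the is_allowed helper are made up for illustration. Keep in mind that robots.txt is a request, not an enforcement mechanism: it only works if the crawler chooses to honor it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration only.
# It asks two AI-related crawlers to stay out and allows everyone else.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if these robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/photos/"  # placeholder URL
    for agent in ("CCBot", "GPTBot", "SomeBrowser"):
        print(agent, is_allowed(ROBOTS_TXT, agent, page))
```

This only affects crawls that happen after the file is in place. It does nothing about copies already collected.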

How did we get here?

Common Crawl was founded as a non-profit with the stated goal of making a public, open archive of the web freely available to researchers, educators, journalists, and developers. Tens of thousands of research papers relied on it. It was created to level the playing field by making the kind of data that was normally accessible only to big tech companies available to everyone.

LAION also had an academic origin. It was created to conduct basic research into dataset creation. Could you create a set of training images using only automated methods? It was also used in hundreds of academic projects. The creators of LAION explicitly warned against using the dataset for real-world commercial products.

There was no malicious intent behind the creation of the original datasets and archives. The researchers who built them were operating under academic fair use assumptions - the same legal framework that allows a university library to archive content for research purposes. The problem wasn’t the researchers. It was the pipeline. Academic tools built under research exemptions were released as open public resources with no access controls. Anyone could access and download them – researcher, student, startup…

Stable Diffusion was created by a bunch of researchers playing around with the LAION data and refining the models. Stability AI stepped up to pay the computing costs of training new models. They were patrons of open-source AI research – not a product company. Then they released Stable Diffusion 1.4 in 2022 and the company’s profile changed overnight. Within two months, the firm was valued at $1 billion.

Nobody in that original research group was sitting around plotting how to mine photographers' work for profit. They were academics trying to answer a research question about image synthesis. The commercialization happened to them almost as fast as it happened to everyone else. The LAION warning against using the dataset for commercial products wasn't ignored out of malice - it was overtaken by events moving faster than anyone anticipated.

The ethical failure wasn't individual bad actors. It was a system with no brakes: open source data, open source models, venture capital moving at internet speed, and no legal or regulatory framework that could keep up. By the time anyone thought to ask whether fair use assumptions still applied, the billion-dollar valuations were already in.

LLMs also started with a similarly benign intention: a research problem. Unfortunately, funding for the philanthropic ideals ran out, and OpenAI deliberately turned into a commercial venture to sustain its growth. It needed billions to do the research and train the models. The billions came with strings. The non-profit that was worried GPT-2 was too dangerous to release became the company that put GPT-4 in the hands of 900 million weekly users. The only people with enough money to keep it going were the ones who wanted a return on their investment.

And so here we are. The neural nets that are consuming our resources were built on the accumulated wealth of human knowledge.

We can’t put the toothpaste back in the tube – the data has already been used without consent. Individual compensation for the data is impossible. Once an image or piece of writing is broken down into the statistical weights of a neural network, individual works don’t retain discrete value. Who would you even pay, and how much, when the value is divided across billions of contributions?

Senator Bernie Sanders recently proposed a more interesting answer. His argument: "The foundation of AI is our collective human intelligence. Our books, songs, artwork, journalism, computer code, scientific research, videos, conversations, images, and ideas spanning generations." His proposed American AI Sovereign Wealth Fund Act would create a public fund by imposing a one-time 50% tax on the stock of OpenAI, Anthropic, and other major AI companies - giving ordinary Americans voting rights, board representation, and eventually, a dividend check. Whether or not you agree with the mechanism, the underlying logic is hard to dismiss: if the models were built on all of our collective output, perhaps the wealth they generate shouldn't flow exclusively to a handful of people.

References

Common Crawl Foundation. (n.d.). About Common Crawl. https://commoncrawl.org/about

Mozilla Foundation. (2024). Training data for the price of a sandwich: Common Crawl's impact on generative AI. https://www.mozillafoundation.org/en/research/library/generative-ai-training-data/common-crawl/

Knowing Machines. (2022). Models all the way down. https://knowingmachines.org/models-all-the-way

LAION. (n.d.). FAQ. https://laion.ai/faq/

MIT Technology Review. (2023, February 6). The original startup behind Stable Diffusion has launched a generative AI for video. https://www.technologyreview.com/2023/02/06/1067897/runway-stable-diffusion-gen-1-generative-ai-for-video/

Runway Research. (2022, October 5). The research origins of Stable Diffusion. https://research.runwayml.com/the-research-origins-of-stable-difussion

Britannica Money. (2025). OpenAI: ChatGPT, Sam Altman, & Microsoft. https://www.britannica.com/money/OpenAI

LLM Rumors. (2026, February). From altruism to empire: The full OpenAI origin story. https://www.llmrumors.com/news/openai-origin-story-founding-lore-empire

Waxy.org. (2022, September). AI data laundering: How academic and nonprofit researchers shield tech companies from accountability. https://waxy.org/2022/09/ai-data-laundering-how-academic-and-nonprofit-researchers-shield-tech-companies-from-accountability/

PPC Land. (2026, May). News publishers target Common Crawl, the AI training data backdoor. https://ppc.land/news-publishers-target-common-crawl-the-ai-training-data-backdoor/

Sanders, B. (2026, June 1). The public should own half of the big A.I. companies. Senator Bernie Sanders official website. https://www.sanders.senate.gov/op-eds/the-public-should-own-half-of-the-big-a-i-companies/

Danica Barreau https://pouka.com