Legal challenges against AI companies for alleged copyright infringement in training their models

The rapid rise of generative AI has unleashed a wave of creativity — and a wave of courtroom activity. Over the last few years, publishers, photographers, visual artists, software developers and film studios have sued AI companies, accusing them of copying copyrighted works at scale to train large language models (LLMs) and image generators, and of producing outputs that infringe those same works. These cases are testing long-standing copyright doctrines, stretching familiar legal concepts (like “copying” and “fair use”) into unfamiliar territory, and could determine how generative AI is built, sold, and governed for years to come. This article explains the leading lawsuits, the legal theories on both sides, the key questions courts must resolve, the remedies at stake, and what creators and AI companies are doing now to reduce risk.


A snapshot of the major cases and claimants

Several headline-grabbing suits illustrate the diversity and scale of the disputes.

  • Authors (fiction writers) vs. OpenAI. The Authors Guild, joined by dozens of individual novelists, filed class-action litigation alleging that OpenAI copied millions of copyrighted books without authorization to train GPT-style models and that those models reproduce or “regurgitate” protected expression. The Authors Guild and allied plaintiffs argue the training and outputs constitute copyright infringement. (Sources: The Authors Guild; 4ipcouncil.com)

  • The New York Times v. OpenAI & Microsoft. The New York Times sued OpenAI and Microsoft, alleging that the companies used Times articles to train models without permission or payment, and seeking damages and changes to practice. The suit highlights news publishers’ concerns about commercial re-use of journalistic content. (Sources: The Guardian; Harvard Law Review)

  • Getty Images v. Stability AI (and related UK litigation). Getty Images sued Stability AI in 2023 in the U.S. for allegedly scraping more than 12 million photos and associated metadata to train Stable Diffusion and similar systems. Getty later pressed related issues in the U.K., where its case proceeded in the London High Court in 2025. Getty’s claims focus on unauthorized copying of photographs and the commercial exploitation of those images via image-generation models. (Source: Reuters)

  • Visual artists (Andersen, McKernan, Ortiz) vs. image-generator companies. A group of visual artists brought class actions alleging that image generators (Stability AI, Midjourney, DeviantArt, and others) used registered artworks to train models and then produced images that replicate those styles or content. Judges have allowed parts of these suits to proceed, and some rulings have been interpreted as early wins for artists. (Sources: Center for Art Law; The Art Newspaper)

  • GitHub Copilot / open-source code litigation. Lawsuits against GitHub (Microsoft) and OpenAI allege that code-generation tools trained on public and open-source repositories violate open-source licenses and authors’ rights by reproducing licensed snippets without honoring license terms. Courts have winnowed claims in some cases, but the litigation continues to shape how code training and outputs are treated. (Sources: saverilawfirm.com; Legal.io)

  • Entertainment studios vs. image-generator companies. Major studios (Disney, Universal, and more recently Warner Bros. Discovery) have sued companies such as Midjourney, alleging that the models were trained on copyrighted film and TV imagery and that the services facilitate the creation and distribution of images of copyrighted characters and other protected elements. These suits seek damages and injunctions to stop further commercial exploitation. (Sources: AP News; Reuters)

Taken together, the claims cover a wide spectrum of creative content (books, news articles, photos, artworks, source code, and film/TV imagery) and paint a legal landscape that is fragmented: cases are filed in multiple U.S. districts and in other countries, and judges are reaching different conclusions about which claims can proceed.


Core legal theories plaintiffs are using

Plaintiffs rely mainly on traditional copyright doctrines, adapted to the realities of machine learning.

  1. Direct copyright infringement for copying into training datasets. Plaintiffs allege AI companies copied (i.e., reproduced) copyrighted works when they scraped and stored text, images, captions, and metadata that were then used as training data. Under U.S. law, copying a copyrighted work without authorization can be actionable, even if later transformed, unless a defense (like fair use) applies. (Sources: Reuters; The Authors Guild)

  2. Creation of infringing derivative works or outputs. Plaintiffs also claim model outputs are derivative or substantially similar to protected works — for example, an image generator reproducing a copyrighted illustration or a language model producing passages close to a novelist’s text. If outputs reproduce protected expression beyond what fair use allows, those outputs could themselves be infringing. (Sources: Center for Art Law; The Art Newspaper)

  3. Violation of exclusive rights (reproduction, distribution, display). Storage of copyrighted files for training, making copies available (e.g., through model APIs), or distributing outputs that replicate copyrighted content are framed as violations of the copyright holder’s exclusive rights. (Sources: Reuters; The Guardian)

  4. License and contract-based claims. When plaintiffs can show that content was used in ways that breach licensing terms (including open-source licenses for code), they press contract-based claims alongside copyright claims. The Copilot litigation is a clear example where license compliance is a central issue. (Source: saverilawfirm.com)


Common defenses asserted by AI companies

AI companies have advanced several defenses — some doctrinal, some factual.

  • Fair use / transformative use. Defendants often argue that using copyrighted materials to train statistical models is transformative: the model extracts patterns, not expressive content, and the resulting outputs are not substitutes for the original works. Courts will weigh purpose, nature, amount, and market effect (the four fair-use factors). The “transformative” question is central and contested. (Sources: Harvard Law Review; 4ipcouncil.com)

  • No actionable copying (technical arguments). Some companies argue that their training makes ephemeral or non-human-readable transformations (e.g., parameter weight updates) rather than literal copies accessible to humans, so, on their view, no “copy” in the traditional legal sense is made. Courts may disagree depending on evidence about how training datasets were stored and used. (Source: 4ipcouncil.com)

  • Output causation and user responsibility. Defendants sometimes say infringements (if any) are caused by users’ prompts and that the company is not directly responsible for each output. This shifts focus toward secondary liability doctrines (contributory or vicarious infringement) and platform immunity, raising thorny factual questions. (Source: AP News)

  • License or implied permission. Where companies can show licenses or lawful access to data, or argue that content was in the public domain, they will rely on those bases to defeat infringement claims. For open-source code, demonstrating compatibility with copyleft licenses is a key line of defense. (Source: saverilawfirm.com)

  • Procedural victories and claim narrowing. Courts have sometimes dismissed or narrowed claims early (for example, trimming many claims in the Copilot litigation), signaling that not all legal theories survive initial scrutiny. But dismissal of some claims does not end the broader policy fight. (Source: Legal.io)


Key legal questions courts must answer

These lawsuits force courts to resolve several novel — and foundational — questions:

  • Is copying for model training a “copy” under the Copyright Act? Does ingesting copyrighted material and encoding it into model weights count as making an infringing reproduction, or is it a functional/transformative use that does not infringe? How courts answer this affects whether training without a license is per se unlawful. (Source: 4ipcouncil.com)

  • When is model output an infringing derivative work? If an AI output resembles a copyrighted work in style or content, when is that resemblance substantial enough to be infringement? Courts will examine similarity, the amount and importance of what’s reproduced, and whether the output displaces the market for the original. (Sources: Center for Art Law; The Art Newspaper)

  • Whose liability is it — user or provider? If a user prompts a model to produce infringing material, is the user solely liable, or is the provider (platform) also liable for contributory infringement or inducement? The answer influences platform moderation duties and architecture. (Source: AP News)

  • What remedies should courts order? Beyond damages, plaintiffs seek injunctions that could block the sale or distribution of models (or require dataset purges). Courts must balance harms to rightsholders against potential innovation and First Amendment considerations. Recent studio suits, for example, seek to enjoin image-generator services from producing or distributing certain outputs. (Source: Reuters)


Recent rulings and procedural posture (what’s happened so far)

Judges have taken different tacks in early rounds. Some rulings allowed lawsuits to proceed past initial dismissal motions, while others narrowed plaintiffs’ claims.

  • In the visual-artist class actions, judges allowed several claims to move forward, a development many saw as a partial victory for creators and a sign that discovery will probe how training datasets were compiled and used. (Sources: The Art Newspaper; Artnet News)

  • In the Copilot litigation, courts dismissed many claims but allowed a handful to proceed, showing judicial caution and a tendency to parse plaintiffs’ theories claim by claim. (Source: Legal.io)

  • Major content owners (e.g., Getty, NYT, large studios) are pursuing robust litigation strategies — some cases are active in U.S. courts and others in the U.K. and elsewhere — raising the possibility of divergent cross-border jurisprudence. (Source: Reuters)

Because discovery is often ongoing, many of the decisive factual records (datasets, internal policies, filtering choices) are still being developed through litigation, and the ultimate legal doctrines therefore remain in flux.


What’s at stake — remedies, industry impacts, and markets

The legal stakes are high:

  • Monetary damages. U.S. copyright law allows statutory damages of $750 to $30,000 per infringed work, rising to as much as $150,000 per work where the infringement is willful, or, as an alternative, actual damages and disgorgement of profits. Large-scale copying could, in theory, expose companies to very large damages awards, or, more likely, to settlements (a back-of-the-envelope sketch follows this list). (Source: AP News)

  • Injunctions and business model disruption. Plaintiffs often request, and courts could grant, injunctions that limit model distribution or force companies to remove datasets. Injunctions could require operational changes that are costly and that slow product roll-outs. Studios have sought injunctions against image-generator platforms. (Source: Reuters)

  • Licensing markets and new business models. If courts rule that training requires licenses, expect the rise of licensing deals (news publishers, photo agencies, art collectives, and code licensors) and possibly an ecosystem of dataset rights clearinghouses or opt-out registries. Getty and news publishers are already pressing for recognition of rights. (Sources: Reuters; The Guardian)

  • Global fragmentation. Differing outcomes across jurisdictions (U.S., U.K., EU) could fragment how AI companies operate globally, affecting where models are trained, what datasets are used, and what features are offered to customers.
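
To make the damages exposure concrete, here is the back-of-the-envelope sketch referenced above. The per-work statutory ranges come from 17 U.S.C. § 504(c); the work count in the example is entirely hypothetical and not drawn from any actual case.

```python
# Back-of-the-envelope estimate of U.S. statutory damages exposure.
# Per-work statutory ranges under 17 U.S.C. § 504(c):
#   ordinary infringement: $750 to $30,000; willful: up to $150,000.

STATUTORY_MIN = 750
STATUTORY_MAX = 30_000
WILLFUL_MAX = 150_000

def exposure_range(num_works: int, willful: bool = False) -> tuple[int, int]:
    """Return (low, high) statutory-damages exposure for num_works works."""
    high = WILLFUL_MAX if willful else STATUTORY_MAX
    return num_works * STATUTORY_MIN, num_works * high

# Hypothetical example: 10,000 registered works found willfully infringed.
low, high = exposure_range(10_000, willful=True)
print(f"${low:,} to ${high:,}")  # $7,500,000 to $1,500,000,000
```

Even at the statutory floor, the totals scale linearly with the number of registered works, which is why plaintiffs emphasize the size of the training corpora and why settlement is often the likelier endgame.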


How companies and creators are responding now

Faced with litigation risk, stakeholders have adopted short- and medium-term strategies:

  • Seeking licenses and partnerships. Some AI firms are negotiating licenses with publishers, photo agencies, and rights holders to legalize training pipelines or to offer paid access to training-grade datasets. (Source: Reuters)

  • Dataset audits and documentation. Companies are doing more provenance work, documenting dataset sources, building “dataset inventories,” and publishing model cards, to reduce uncertainty in court and show good-faith practices (a minimal sketch of an inventory record follows this list). (Source: 4ipcouncil.com)

  • Technical mitigations. These include filtering options, style/character blacklists, watermarking, and prompt controls to block clear recreations of copyrighted works. Studios’ lawsuits argue some services removed safeguards; companies say technical measures exist or can be improved. (Source: AP News)

  • Creators’ tactics. Artists and writers are registering copyrights more proactively, launching class actions, and calling for opt-out registries or legislative reforms. Some creators also pursue collaboration opportunities (licensed models that pay royalties). (Sources: The Authors Guild; Artnet News)
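
As a concrete illustration of the audit and mitigation bullets above, here is a minimal sketch of a provenance inventory record plus a crude prompt blocklist. Every field name, license string, URL, and blocklist entry is hypothetical and illustrative; real pipelines are far more elaborate, and no company's actual implementation is described here.

```python
# Minimal sketch: a dataset-inventory record and a simple prompt filter.
# All names, licenses, and blocklist entries below are hypothetical.
from dataclasses import dataclass

@dataclass
class DatasetRecord:
    """One provenance entry in a training-data inventory."""
    source_url: str       # where the item was obtained
    license: str          # e.g., "CC-BY-4.0", "proprietary", "unknown"
    retrieved: str        # ISO date of collection
    rights_cleared: bool  # whether a license or permission covers training use

def flag_unlicensed(inventory: list[DatasetRecord]) -> list[DatasetRecord]:
    """Return records whose training use is not documented as cleared."""
    return [r for r in inventory if not r.rights_cleared]

# A crude blocklist of the kind the mitigations bullet describes.
BLOCKED_TERMS = {"famous character x", "studio y artwork"}  # hypothetical

def prompt_allowed(prompt: str) -> bool:
    """Reject prompts that name blocklisted characters or properties."""
    lowered = prompt.lower()
    return not any(term in lowered for term in BLOCKED_TERMS)

inventory = [
    DatasetRecord("https://example.org/photo1.jpg", "CC-BY-4.0", "2024-01-15", True),
    DatasetRecord("https://example.org/article2.html", "unknown", "2024-02-03", False),
]
print([r.source_url for r in flag_unlicensed(inventory)])  # the "unknown" entry
print(prompt_allowed("draw famous character X in a spaceship"))  # False
```

The point of such records is evidentiary as much as operational: a documented inventory gives a company something concrete to produce in discovery, while even a simple filter demonstrates the kind of safeguard the studio suits allege was weakened or removed.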


Where this could go — likely scenarios

Predicting outcomes in unsettled litigation is hazardous, but several plausible paths emerge:

  1. Settlement and licensing equilibrium. Many high-stakes plaintiffs may secure settlements and licensing deals that monetize training datasets, creating a market-based solution that lets AI companies continue building models while compensating creators.

  2. Narrow judicial rules favoring fair use for training. Courts might hold that non-expressive, statistical uses of copyrighted works during training are fair use while reserving infringement liability for outputs that substantially reproduce protected content. That would permit much training but still expose companies when outputs are close copies.

  3. Broad rulings for rightsholders. Courts could adopt doctrines that treat training as copying requiring a license, forcing large-scale operational changes and potentially slowing the industry while licensing systems emerge.

  4. Legislative action. Lawmakers (in the U.S., EU, and elsewhere) could enact tailored rules — e.g., compulsory licensing for data used in model training, or clearer safe harbors — to provide predictable standards. Because judicial outcomes will differ across borders, pressure for legislation will likely grow. (Note: legislative activity is ongoing and evolving in multiple jurisdictions.) (Sources: Reuters; The Authors Guild)


Practical takeaways for creators, companies and policymakers

  • Creators: Register copyright proactively; document instances where AI outputs replicate your work; consider collective action or licensing pools to monetize uses at scale.

  • AI companies: Audit datasets, secure licenses where practical, document provenance, implement technical mitigations for re-creation of protected works, and prepare for discovery obligations in litigation.

  • Policymakers: Consider creating transparent, efficient licensing mechanisms or clearer statutory safe harbors to balance innovation and creators’ rights; avoid rushed, one-size-fits-all rules.


Conclusion

The copyright wars over AI training datasets are more than a set of discrete lawsuits: they are a contest over how society values creative labor in a world of algorithmic scale. Courts will shape whether the future of generative AI is built on compulsory bargains with creators, broad doctrines of transformative use, or a patchwork of settlements and injunctions. For creators, lawyers, technologists, and policymakers, the coming years will be a critical period of jurisprudence and market design — and the outcomes will shape who gets paid (and who gets to build) in the era of machine learning.


Selected sources & further reading (representative): Authors Guild press materials and case filings; Reuters and AP reporting on the Getty, Midjourney, and studio suits; Harvard Law Review analysis of NYT v. OpenAI; legal analyses of Andersen v. Stability AI; reporting and court summaries on the GitHub Copilot litigation.

