Licensing or Liability: Why Training Data Is AI's Biggest Legal Risk

With AI fights moving to federal court, copyright risk now hinges on training data. Lawful, licensed inputs, backed by proof of provenance, are the only scalable defense.

Published on: Jan 24, 2026

AI Law After the Federal Pivot: Copyright Risk Moves to Center Stage

The legal fight around AI is consolidating. With a new national AI framework set by executive order in December 2025, the patchwork of state rules is likely to fade. That doesn't lower risk for developers; it concentrates it. The scrutiny now points straight at copyright and, more specifically, at the inputs used to train models.

The core lesson from recent cases, led by Bartz v. Anthropic PBC, is simple: the biggest exposure isn't what models output; it's how they were trained. Provenance, licensing, and documentation decide the case before outputs even enter the conversation.

Inputs Are Key

No US court has held that model outputs become infringing derivative works solely because the model trained on copyrighted material. Courts have, however, drawn a bright line around copying protected expression into outputs, which is classic infringement. Training on lawfully obtained works has been treated as fair use and "transformative," including in Bartz and in Kadrey v. Meta Platforms, Inc. In training, models learn statistical relationships, not protected expression, so there's no taking where the material was lawfully acquired.

Fair use collapses when the underlying copies were unlawfully obtained. Pirated books, scraped content of uncertain provenance, or data acquired outside a license fall outside the safe zone. At that point, liability flows from the exclusive rights to reproduce and create derivative works under 17 USC §106. With willful infringement, statutory damages can reach $150,000 per work under 17 USC §504, and at training-set scale, those numbers become existential.

Bartz as Blueprint

Judge William Alsup split the baby in Bartz. Training on purchased or licensed books qualified as fair use; training on pirated copies from shadow libraries did not. The court labeled the lawful training "exceedingly transformative," but the retention of over seven million pirated books cut against fair use and sent the case to a damages phase.

Then came the class certification: 482,460 copyright holders tied to works in shadow library datasets. That turned manageable exposure into a bet-the-company scenario: minimum damages exceeding $360 million, with a ceiling near $72 billion. Anthropic settled for $1.5 billion. Plaintiffs' firms now call this the "shadow library strategy": track the training data, identify unlawful copies, certify a class, and leverage statutory damages. Nothing in current law stops others from running the same play.
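The arithmetic behind those figures is straightforward: 17 USC §504(c) sets statutory damages between a $750 floor and a $150,000 willful ceiling per infringed work. A quick back-of-the-envelope check of the class-scale exposure:

```python
# Per-work statutory damages under 17 USC §504(c)
FLOOR = 750            # statutory minimum per infringed work
WILLFUL_CAP = 150_000  # maximum per work for willful infringement

works = 482_460        # certified class in Bartz

print(f"Minimum exposure: ${works * FLOOR:,}")        # $361,845,000
print(f"Willful ceiling:  ${works * WILLFUL_CAP:,}")  # $72,369,000,000
```

Multiply a per-work remedy designed for individual disputes across a training-set-sized class, and "existential" is not hyperbole.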

What the Executive Order Changes

The executive order has been cast as deregulatory. That misses the point. It recenters AI disputes in federal court and elevates copyright as the primary enforcement tool. A DOJ AI litigation task force, Commerce's role in neutralizing conflicting state rules, and funding tied to state compliance will funnel disputes into one arena: federal court.

Copyright is already mature law with strict liability features, statutory damages without proof of harm (once a work is timely registered), class aggregation, and a fair-use defense that depends on lawful acquisition. As federal agencies push disclosure and provenance requirements, developers won't just need to avoid infringement. They'll need to prove their inputs were lawful.

Licensed Training Data Is the Only Scalable Shield

Fair use isn't a hall pass for opaque data grabs. It helps only if you start with lawful copies. It can't rescue pirated inputs, unknown-provenance scrapes, or mixed datasets you can't untangle. Licensing, by contrast, is a complete defense to the reproduction right most implicated in training.

Train on licensed or public-domain data and the shadow library strategy collapses. Statutory damages tied to unlawful acquisition fall away. Fair-use arguments remain intact for transformative training. Provenance requirements are easier to meet. And class aggregation loses its foundation.

Practical Playbook for Legal Teams

  • Inventory and map your training corpora. For each source, document chain of custody, license terms, and retention. If you can't prove it, assume you'll be asked to in discovery.
  • Segregate datasets by provenance. Keep licensed/public-domain data separate from anything uncertain. Sunset and delete unlawful copies on discovery.
  • License at the source. Use direct licenses, reputable aggregators, or collective licenses. Avoid "gray market" scrapes and shadow-library derivatives.
  • Contract for provenance. Require supplier reps and warranties on lawful acquisition, audit rights, indemnities, and prompt cure/deletion obligations.
  • Maintain reproducible records. Preserve hashes, access logs, versioned manifests, and removal logs. You need evidence that stands up in federal court (a minimal sketch of such a manifest follows this list).
  • Design for exclusion. Build mechanisms to honor takedowns and remove tainted data across model checkpoints, dataset snapshots, and derived training sets.
  • Align insurance and reserves. Confirm coverage for IP claims, class actions, and statutory-damages exposure. Stress-test worst-case scenarios.
  • Update disclosures. If you're a public company or contracting with the government, ensure risk factors and compliance attestations reflect data provenance controls.
  • Educate your engineering leads. Make provenance, licensing, and deletion workflows standard operating procedure, not one-off fire drills.
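For the record-keeping and exclusion items above, here is a minimal sketch of a provenance manifest in Python. It assumes a simple flat-file corpus; the `ProvenanceRecord` schema, its field names, and the `exclude` helper are illustrative choices, not an established standard or any court-mandated format.

```python
import hashlib
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from pathlib import Path

@dataclass
class ProvenanceRecord:
    """One manifest entry per acquired work (illustrative schema)."""
    path: str         # location of the copy in the corpus
    sha256: str       # content hash tying the exact bytes to this record
    source: str       # where the copy was acquired (vendor, URL, aggregator)
    license: str      # terms governing this copy
    acquired_at: str  # ISO 8601 timestamp of acquisition

def hash_file(path: Path) -> str:
    # Hash in chunks so large files don't load into memory at once.
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def register(path: Path, source: str, license_terms: str) -> ProvenanceRecord:
    """Record provenance at ingestion, before the file can enter training."""
    return ProvenanceRecord(
        path=str(path),
        sha256=hash_file(path),
        source=source,
        license=license_terms,
        acquired_at=datetime.now(timezone.utc).isoformat(),
    )

def exclude(manifest: list[ProvenanceRecord],
            tainted: set[str]) -> list[ProvenanceRecord]:
    """Drop entries matching known-tainted hashes. Rerun this against every
    dataset snapshot and derived training set so removals propagate."""
    return [r for r in manifest if r.sha256 not in tainted]

if __name__ == "__main__":
    # Illustrative stand-in file so the example runs end to end.
    demo = Path("demo_work.txt")
    demo.write_text("licensed text", encoding="utf-8")
    manifest = [register(demo, "https://licensor.example", "direct license")]
    manifest = exclude(manifest, tainted=set())
    print(json.dumps([asdict(r) for r in manifest], indent=2))
```

Content hashes are the load-bearing piece: they tie a byte-for-byte copy to its license at acquisition time and later let you demonstrate deletion, which is exactly the evidentiary posture the playbook is driving at.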

Outlook

Federal preemption won't end AI risk; it concentrates it. Copyright will be the main lever, and inputs will decide outcomes. The companies that win will be the ones that can show their models rest on lawful, licensed, and well-documented data.

If your organization needs to upskill teams on AI development practices and data provenance, explore practical training options at Complete AI Training.

