Salesforce enters legal crossfire over AI training data, expands Google partnership
Salesforce has been sued in federal court in San Francisco under the Copyright Act for allegedly training its XGen models on a pirated library of books. The plaintiffs, authors E. Molly Tanzer and Jennifer Gilmore, seek to certify a class of U.S. copyright holders whose works have allegedly been used since October 2022. The complaint cites statements from Salesforce CEO Marc Benioff, including a January 2024 interview in which he said AI companies "ripped off" training data and that "all the training data has been stolen."
The complaint: core allegations and requested relief
According to the filing, Salesforce used the RedPajama and The Pile datasets, including the Books3 subset containing more than 196,000 titles copied from the private tracker Bibliotik. Plaintiffs allege that Salesforce referenced "RedPajama-Books" as a training source when XGen launched in June 2023, and that an engineer linked to these datasets on GitHub. They claim Salesforce later removed those references and replaced them with generic language about "natural language data" from "publicly available sources."
- Causes of action: ongoing infringement under the Copyright Act.
- Relief sought: statutory damages, destruction of infringing copies, disgorgement of profits, declaration of willful infringement, and attorneys' fees.
- Proposed class: U.S. copyright holders whose works have been used since October 2022.
Plaintiffs also allege that Salesforce trained its CodeGen models on The Pile in 2022, later commercializing that technology within Agentforce, and released the XGen-Sales model in October 2024.
Context: recent rulings on AI training and copyright
Recent decisions have set a high bar for plaintiffs. U.S. District Judge Vince Chhabria ruled for Meta on similar claims, emphasizing that asserting "our work was used" is insufficient without showing real market harm. The order characterized training on copyrighted books as fair use in that case. Courts have also issued rulings favorable to OpenAI and Anthropic on market harm, though one judge criticized Anthropic for maintaining a permanent library of pirated books.
For counsel, the thread running through these cases is clear: causation and damages are decisive. Absent concrete evidence of substitution or market impact, training-based claims face steep odds. Still, permanent retention of copyrighted works and inconsistent disclosures can create risk.
Practical takeaways for legal teams
- Preserve evidence now: implement litigation holds covering training datasets, data cards, model cards, data pipeline logs, checkpoints, and redaction histories.
- Map data lineage: document sources, licenses, filtering steps, and removal protocols; be prepared to show when and how datasets (e.g., Books3, RedPajama, The Pile) were used or removed.
- Retention policy: define whether copyrighted works are retained in raw form, embeddings, or checkpoints; justify retention periods and purge schedules.
- Disclosure hygiene: ensure public documentation matches internal practice; scrubbing specifics after publication can be portrayed as concealment.
- Contract posture: review indemnities, IP warranties, and "AI training" clauses with vendors and partners; add audit rights and model provenance representations.
- Risk framing: if training relied on copyrighted books, assess potential exposure by model version, release date, and commercialization path; segment product claims accordingly.
- Remedies exposure: model potential statutory damages and profit attribution scenarios; prepare for demands for destruction of copies and checkpoints.
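For the remedies-exposure item above, a rough first pass at the statutory damages range can be sketched in a few lines. This is an illustrative sketch, not a damages model: the per-work figures are the ordinary and willful ranges in 17 U.S.C. § 504(c), while the function name, the record type, and the work counts in the usage loop are hypothetical and not drawn from the complaint. It ignores registration requirements, innocent-infringer reductions, and the court's discretion within the range.

```python
from dataclasses import dataclass

# Per-work statutory ranges under 17 U.S.C. § 504(c), in U.S. dollars.
STATUTORY_MIN = 750       # ordinary minimum per work
STATUTORY_MAX = 30_000    # ordinary maximum per work
WILLFUL_MAX = 150_000     # maximum per work if infringement is found willful


@dataclass(frozen=True)
class ExposureRange:
    """Hypothetical container for a statutory damages range."""
    works: int
    low: int
    high: int
    high_if_willful: int


def statutory_exposure(works: int) -> ExposureRange:
    """Multiply a count of infringed works by the per-work statutory bounds."""
    if works < 0:
        raise ValueError("works must be non-negative")
    return ExposureRange(
        works=works,
        low=works * STATUTORY_MIN,
        high=works * STATUTORY_MAX,
        high_if_willful=works * WILLFUL_MAX,
    )


if __name__ == "__main__":
    # Hypothetical work counts for scenario planning only.
    for n in (1_000, 10_000):
        r = statutory_exposure(n)
        print(f"{r.works:>6} works: ${r.low:,} to ${r.high:,} "
              f"(up to ${r.high_if_willful:,} if willful)")
```

Running the scenario by model version or release date, as the risk-framing item suggests, would mean calling the function once per segment with that segment's own work count.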
Product update: deeper Google Gemini integration with Agentforce 360
Separately, Salesforce extended its partnership with Google to integrate Gemini models more deeply with Agentforce 360. The Atlas Reasoning Engine can now use Gemini for hybrid reasoning and multi-step process automation in enterprise sales and IT workflows. The integration expands beyond Gmail into Google Workspace apps such as Sheets, Docs, Drive, Slides, and Meet, with native interoperability for tasks like initiating engagements, qualifying leads, and scheduling from Gmail and Google Calendar.
Salesforce positions these agents as both capable and consistent for critical enterprise use cases. As chief scientist Silvio Savarese put it, "In the enterprise environment, it's imperative for AI agents to be highly capable and highly consistent, especially for critical use cases […] Together, we are setting a new standard for building the future of what's possible in the Agentic Enterprise down to the model level."
Compliance considerations tied to the integration
- Data flow mapping: confirm what customer data, prompts, and outputs traverse between Salesforce and Google; document processing roles and DPAs.
- Use restrictions: verify that customer data is excluded from training by default across both providers and reflected in MSAs.
- Audit readiness: align model choice logic (e.g., when Gemini is invoked) with documented risk assessments and enterprise policies.
- Records management: maintain versioned artifacts for prompt templates, tool use, and agent policies used within Agentforce 360.
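One way to approach the records-management item above is to fingerprint each artifact so that any change to a prompt template or agent policy is detectable later. The sketch below is a minimal illustration using only the Python standard library; the function names, record fields, and the sample artifact name are assumptions for illustration and do not reflect any Salesforce or Google API.

```python
import hashlib
from datetime import datetime, timezone


def _digest(content: str) -> str:
    """Return the SHA-256 hex digest of the artifact text."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def artifact_record(name: str, kind: str, content: str) -> dict:
    """Build a hypothetical audit record for one versioned artifact."""
    return {
        "name": name,                      # e.g. a prompt template identifier
        "kind": kind,                      # e.g. "prompt_template" or "agent_policy"
        "sha256": _digest(content),        # fingerprint of the exact text
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


def has_changed(previous: dict, content: str) -> bool:
    """True if the current text no longer matches the recorded fingerprint."""
    return previous["sha256"] != _digest(content)


if __name__ == "__main__":
    record = artifact_record(
        "lead_qualification", "prompt_template", "Qualify the lead politely."
    )
    print(record["sha256"])
    print(has_changed(record, "Qualify the lead politely."))
    print(has_changed(record, "Qualify the lead quickly."))
```

Storing these records alongside the artifact text in an append-only log gives counsel a dated trail of what an agent was instructed to do at a given time.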
Bottom line for legal teams: the case against Salesforce turns on dataset provenance, retention, and market harm. At the same time, product integrations increase the importance of precise disclosures, contractual controls, and defensible data governance.