Can published patent specifications and drawings be used to train AI models in China?
Yes, sometimes. But the answer sits at a tight intersection of copyright, fair use, and new regulatory duties. If you're advising on datasets, you need to assume these materials can be copyrighted works and treat training as a use that must be justified or licensed.
Copyright status: patent documents can be "works"
Under China's Copyright Law, a work must be an original intellectual achievement that can be expressed in a certain form. Patent specifications and drawings are drafted by applicants, not by the patent office, and they can carry original expression in text structure, word choice, illustration style, and layout.
Courts have said so explicitly. In case (2021) Jing 73 Civil Final No. 4384, the Beijing IP Court held that specifications and drawings are not administrative documents and that the choices made in the drawings showed originality, qualifying them for protection as graphic works. In case (2022) Shan IP Civil Final No. 112, the Shanxi High People's Court found originality in the textual expression and treated the specification as a literary work.
Disclosure narrows, but does not erase, rights
Publishing through CNIPA serves a public goal: disseminating technical knowledge. That supports limited fair use, such as reproduction and sharing to access the technical information, so long as it doesn't interfere with normal exploitation or unreasonably harm rightsholders.
Scholarly commentary points the same way: protection should be moderately narrowed to keep knowledge flowing. But the copyright baseline still stands where originality is present.
AI training and the three-step test
Training may implicate reproduction and the right of communication through information networks. Whether it qualifies as fair use under China's three-step test remains fact-dependent: it must fall within statutory scenarios, not conflict with normal exploitation of the work, and not cause unreasonable prejudice to the rightsholder's legitimate interests.
Model training is often "analytical" rather than substitutive, but where that line sits needs more judicial clarity. Expect arguments focused on market effect, output substitution, and whether expressive elements are being learned or replicated.
The Ultraman LoRA case: "analytical use" and provider duties
The Shanghai Intellectual Property Court, in the 2023 Ultraman LoRA matter, framed generative model training as an analytical use, one that parses concepts and expressive patterns rather than reproducing works, and that framing can favor fair use. That said, the court underscored a key point: fair use does not immunize service providers from responsibility for infringing outputs.
Translation for counsel: you can defend the training, but you must control the outputs. Filters, monitoring, and responsive takedown procedures are not optional.
Regulatory overlay: Interim Measures on Generative AI
The Interim Measures require lawful sources for training data and foundation models, respect for intellectual property rights, measures to improve the truthfulness, accuracy, objectivity, and diversity of training data, and mechanisms to prevent infringing outputs. They also emphasize transparency around data sources and model behavior.
Even if courts grow more tolerant of analytical training, these obligations stand. Compliance is an independent track, not a fallback.
Practical playbook for training on patent specs and drawings
- Assume protectability. Treat published specifications and drawings as potentially copyrighted works. Build policy and tooling on that baseline.
- Source validation. Pull from official repositories with clear disclosure status. Record jurisdiction, publication number, and terms of use for each source (a provenance-record sketch follows this list).
- Purpose documentation. Document the analytical purpose of training and why it does not target verbatim or near-verbatim reproduction of expressive passages or drawings.
- Data minimization. Prefer non-expressive fields and structured metadata where feasible. De-duplicate aggressively; avoid capturing layout artifacts and ornamental drafting choices.
- Secure processing. Use transient copies where possible. Apply hashing, access controls, and full audit trails for ingestion and training runs.
- Output controls. Block prompts requesting reproduction of specific patent texts or figures. Add similarity filters to prevent long, near-verbatim passages or figure-like outputs (an n-gram filter sketch follows this list).
- Opt-outs and takedowns. Offer channels for rightsholders to request removal. Maintain dataset lineage so you can locate and excise items quickly.
- Licensing fallback. Where risk is high (e.g., highly expressive drawings), obtain licenses or use sources with explicit permissions.
- Cross-border rules. If you operate in the EU, respect the text-and-data-mining opt-out and any machine-readable exclusions. Align crawlers and dataset builders with those signals.
- Evaluation and red-teaming. Test for memorization of specifications and figures. Tune training and decoding parameters and apply post-training alignment to reduce reproduction risk (a memorization probe sketch follows this list).
- Governance. Map your controls to the Interim Measures and internal policies. Keep DPIA-style records to demonstrate compliance.
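For the source-validation and lineage items, here is a minimal sketch of what a per-document provenance record could look like. The field names, record layout, and the `lineage.jsonl` path are illustrative assumptions, not a prescribed standard; the point is that a content hash plus sourcing metadata lets you de-duplicate at ingestion and locate items quickly if a takedown arrives.

```python
# Hypothetical provenance record for each ingested patent document.
# Field names and storage format are assumptions, not a prescribed standard.
from dataclasses import dataclass, asdict
import hashlib
import json


@dataclass
class PatentDocRecord:
    source_repository: str      # e.g. an official publication database
    jurisdiction: str           # e.g. "CN"
    publication_number: str     # as printed on the published document
    terms_of_use: str           # license or usage notice recorded at ingestion
    retrieved_at: str           # ISO 8601 timestamp
    content_sha256: str         # hash of the raw file, used for dedup and lineage


def make_record(raw_bytes: bytes, **metadata) -> PatentDocRecord:
    """Hash the raw document and bundle it with its sourcing metadata."""
    digest = hashlib.sha256(raw_bytes).hexdigest()
    return PatentDocRecord(content_sha256=digest, **metadata)


def append_to_lineage_log(record: PatentDocRecord, path: str = "lineage.jsonl") -> None:
    """Append one JSON line per ingested document so items can be found and excised later."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(asdict(record), ensure_ascii=False) + "\n")
```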
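For the output-controls item, one way to approximate a similarity filter is a character n-gram overlap check against known specification texts. The n-gram length and threshold below are placeholder values you would tune against your own corpus and false-positive tolerance; this is a sketch of the idea, not a production filter.

```python
# Minimal n-gram overlap filter: flags model output that reproduces long spans
# of a reference specification. Parameters are illustrative assumptions.

def char_ngrams(text: str, n: int = 20) -> set[str]:
    """Character n-grams are robust to minor whitespace and punctuation changes."""
    text = " ".join(text.split())  # normalize whitespace
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 0))}


def overlap_ratio(candidate: str, reference: str, n: int = 20) -> float:
    """Fraction of the candidate's n-grams that also appear in the reference text."""
    cand = char_ngrams(candidate, n)
    if not cand:
        return 0.0
    return len(cand & char_ngrams(reference, n)) / len(cand)


def blocks_output(candidate: str, reference_texts: list[str], threshold: float = 0.3) -> bool:
    """Return True if the candidate reproduces too much of any reference specification."""
    return any(overlap_ratio(candidate, ref) >= threshold for ref in reference_texts)
```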
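For evaluation and red-teaming, a simple memorization probe prompts the model with the opening of a training document and measures how much of the true continuation comes back verbatim. The `generate` callable below stands in for whatever inference API you actually use; the prefix and continuation lengths are arbitrary illustrative choices.

```python
# Sketch of a memorization probe: prompt the model with the opening of a training
# document and measure how much of the true continuation it reproduces verbatim.
# `generate` is a placeholder for your own inference function, not a library call.
from typing import Callable
import difflib


def memorization_score(document: str,
                       generate: Callable[[str], str],
                       prefix_chars: int = 500,
                       continuation_chars: int = 500) -> float:
    """Longest-common-substring ratio between the model's continuation and the
    document's true continuation (0.0 = no verbatim reuse, 1.0 = exact copy)."""
    prefix = document[:prefix_chars]
    truth = document[prefix_chars:prefix_chars + continuation_chars]
    output = generate(prefix)[:continuation_chars]
    matcher = difflib.SequenceMatcher(None, output, truth)
    match = matcher.find_longest_match(0, len(output), 0, len(truth))
    return match.size / max(len(truth), 1)
```

Scores near 1.0 on held-in documents are a signal to revisit de-duplication, decoding settings, or post-training alignment before release.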
International context
In the United States, recent policy discussion continues to lean on fair use, especially for analytical, non-substitutive uses in search and data mining. See the U.S. Copyright Office's third report on AI training (May 2025).
The European Union codified a text-and-data-mining exception with an opt-out for rightsholders under the Copyright Directive (Directive (EU) 2019/790).
Expect China to keep refining the boundary through cases and policy. A hybrid of case-by-case fair use analysis and operational compliance duties is the likely path.
Bottom line for counsel
Patent specifications and drawings are valuable training data, but often protected works. Treat training as an analytical use you must justify under the three-step test, and back it with strong controls on sourcing, outputs, and remediation.
The opportunity is real, and so are the duties. If you build the compliance spine early, you won't have to rebuild it under pressure later.