Legal Risks in Generative AI Data Training
Recently, a court decided China's first copyright infringement case involving an AI-generated image, sparking fresh debate among legal experts and industry players over whether AI-generated content is copyrightable. Beyond this unsettled issue, the legality of data training in generative AI, specifically whether it infringes rights holders' interests, remains controversial. This article summarizes the key legal risks in generative AI data training.
Legal Requirements for Generative AI Training Data
Article 7 of the Interim Measures for the Administration of Generative Artificial Intelligence Services sets out clear rules for AI service providers on training data:
- Use data and base models with legitimate sources.
- Respect intellectual property rights according to law.
- Obtain consent when processing personal information, or comply with relevant laws.
- Take effective measures to ensure data quality, authenticity, accuracy, objectivity, and diversity.
- Comply with other applicable laws, such as the Cybersecurity Law, the Data Security Law, and the Personal Information Protection Law.
This article focuses primarily on the first three requirements.
Using Data with Legitimate Sources
Illegitimate sources typically involve improper data acquisition, such as unauthorized database breaches or scraping. Such conduct can constitute unfair competition under the Anti-Unfair Competition Law. Several cases illustrate the principle:
- (2017) Yue 03 Min Chu No. 822: A company used Python scripts to capture real-time bus data from another company's software without authorization; the court treated this as theft of intangible property.
- Zhe 01 Min Zhong No. 7312: Inducing users to share their accounts violated business ethics and was ruled unlawful.
- Zhe 8601 Min Chu No. 956: Data obtained by breaching a database was treated as "free-riding" and ruled unlawful.
- Zhe 01 Min Zhong No. 5889: The court distinguished between original data and derivative data controlled by network operators.
Some argue that Articles 49 and 53 of the Copyright Law define lawful means of access. However, the third paragraph of Article 49 addresses circumvention of technical measures that prevent unauthorized access to works. Because generative AI training typically does not involve directly providing works to the public, this provision likely does not apply.
Intellectual Property Rights Concerns
The training phase often involves data mining and the digitization of non-electronic materials, both of which may infringe the reproduction right, especially when permanent copies are made.
China has no established case law specifically addressing fair use in generative AI training. The closest precedent, Wang Xin v Google, held that full-text copying constituted copyright infringement and rejected a fair use defense. The court referenced the U.S. four-factor fair use test but stressed that fair use beyond the statutory exceptions should be strictly controlled.
Article 24 of China's Copyright Law lists 12 specific fair use circumstances plus a catch-all clause for "other circumstances." Generative AI training is difficult to fit into any of these categories, creating legal uncertainty.
Legal scholars have raised concerns about the lack of clear legislation. Courts often mix the Berne Convention's "three-step test" with the U.S. four-factor analysis, leading to unpredictable outcomes. This ambiguity risks increased litigation and hampers growth in the AI industry.
Some propose categorizing generative AI data training as fair use. Arguments include:
- Machine learning extracts value-added knowledge independent from the original work’s market value.
- AI’s deep learning does not reproduce the original work as-is but creates transformative outputs.
- Distinguishing “expressive use” from “non-expressive use” suggests some AI uses may be defensible.
- Current copyright frameworks, focused on author-centric models and strict tests, need adaptation to support innovation.
Others caution that fair use should be applied with nuance, distinguishing between commercial and non-commercial AI training. Judicial discretion combined with comprehensive tests may offer a practical path forward.
Handling Personal Information
Regarding personal data, the "Sina-Maimai" case provides guidance. It involved unfair competition through the unauthorized capture and use of social media user data, and it established the "triple authorization" principle: before a third party may use user data held by another platform, three consents are required.
- Users must consent to the original platform's collection of their data.
- The original platform must authorize the third party to access the data.
- Users must separately consent to the third party's use of their data.
This principle is reinforced by Article 23 of the Personal Information Protection Law, which requires a processor that provides personal information to another processor to inform the individuals concerned and obtain their separate consent.
Opinions differ on the triple authorization principle:
- Supporters see it as a balanced approach that protects all parties and promotes healthy data industry development.
- Critics argue it may hinder innovation and offer pseudo-protection without effective market benefits.
- Compromise views suggest differentiating between identifiable raw data and non-identifiable derivative data, applying tailored rules.
Conclusion
Generative AI’s rapid growth challenges traditional legal frameworks, raising unresolved questions around copyright, data sourcing, and privacy. China’s Interim Measures provide a foundational regulatory approach, especially through Article 7’s clear rules on training data.
Legal clarity and further refinement of these rules will be essential as AI applications expand. Legal professionals should closely monitor developments and consider the nuances of intellectual property and personal information protections in AI data training.