The Flaws in India's Approach to Copyright Reform for AI
India is moving fast on AI policy. An expert committee has been set up to revisit the Copyright Act, 1957, and DPIIT's working paper proposes a blanket licence for AI training and a new collection body, the Copyright Royalties Collective for AI Training (CRCAT). The intent is clear: close gaps on text and data mining (TDM) and address AI-generated works.
The execution, however, creates structural risks for privacy, paywalled content, creators' rights, and downstream enforcement. Below is a concise breakdown for policymakers, regulators, and counsel.
What the paper proposes
- A blanket licence for AI developers to train on "lawfully accessed" content, with no opt-out for rightsholders.
- A centralized body (CRCAT), built by rightsholders but restricted to organizations (via one collective management organisation, or CMO, per class of work), to collect and distribute royalties.
- Royalty rates to be fixed by a government-appointed committee dominated by officials and experts, with one representative each from CRCAT and the AI industry.
- Royalties paid on a future "commercialisation" event, using revenue sharing rather than upfront remuneration.
1) Blanket licence ignores personal data risks
"Lawfully accessible" data on the open internet often includes personal data posted unintentionally or for limited purposes (old rank lists, cached PDFs, scraped directories). The paper acknowledges this only in a footnote and offers no operative safeguards. In practice, that opens the door to training and potential regurgitation of sensitive personal information.
Without mandatory exclusions, auditability, and redress, this approach collides with privacy principles and creates liability for both developers and the state. An opt-out for rightsholders is not enough; there must be a default prohibition on processing personal and confidential data for training, backed by technical controls and penalties.
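What "technical controls" could mean in practice is not exotic. Below is a minimal, illustrative pre-training filter, assuming a hypothetical scraped corpus of {id, text} records; the regex patterns stand in for the audited PII detectors a real pipeline would need.

```python
import re

# Illustrative patterns only: a real pipeline would rely on audited PII
# detectors and human review, not regexes alone.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "indian_mobile": re.compile(r"(?:\+91[\s-]?)?\b[6-9]\d{9}\b"),
    "aadhaar_like": re.compile(r"\b\d{4}\s?\d{4}\s?\d{4}\b"),
}

def exclude_personal_data(documents):
    """Drop any document matching a PII pattern and log why, for auditability."""
    kept, excluded = [], []
    for doc in documents:
        hits = [name for name, pattern in PII_PATTERNS.items()
                if pattern.search(doc["text"])]
        if hits:
            excluded.append({"id": doc["id"], "reasons": hits})  # audit trail
        else:
            kept.append(doc)
    return kept, excluded

# Hypothetical usage:
# kept, excluded = exclude_personal_data(scraped_docs)
# audit_log.write(excluded)  # redress requires knowing what was dropped and why
```

The point is less the patterns than the audit trail: every exclusion is logged with a reason, which is what auditability and redress require.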
2) Paywalls, TPMs, and downstream leakage
The paper limits training to "lawfully accessed" content, but it does not address what happens when users prompt an AI system to output paywalled material that the developer accessed legitimately. Such downstream reproduction undermines technological protection measures (TPMs) and effectively enables paywall bypass by proxy.
- Section 65A of the Copyright Act penalizes circumvention of TPMs. See the statutory text here: Copyright Act, 1957 - Section 65A.
- Courts have treated functional access to paywalled content as infringement, even without formal circumvention (e.g., the Sci-Hub litigation): Elsevier Ltd. & Ors. v. Alexandra Elbakyan & Ors.
The paper is silent on user liability, developer duties to prevent paywalled outputs, and remedies for rightsholders. A licensing fee for training does not compensate for potentially unlimited, on-demand distribution of subscription content.
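A licensing fee also does not remove the need for output-side guardrails. One hedged sketch of such a guardrail: fingerprint licensed paywalled sources at ingestion, then refuse outputs that reproduce too many fingerprinted passages. The shingle size, hashing scheme, and overlap threshold below are illustrative assumptions, not anything the paper specifies.

```python
import hashlib

SHINGLE_WORDS = 12  # window size; the right value is a policy and tuning question

def shingles(text, n=SHINGLE_WORDS):
    """Hash every n-word window so paywalled text can be matched without storing it."""
    words = text.lower().split()
    return {
        hashlib.sha256(" ".join(words[i:i + n]).encode()).hexdigest()
        for i in range(max(len(words) - n + 1, 1))
    }

def build_paywall_index(paywalled_docs):
    """Fingerprint licensed paywalled sources once, at ingestion time."""
    index = set()
    for doc in paywalled_docs:
        index |= shingles(doc)
    return index

def output_is_blocked(candidate_output, index, max_overlap=3):
    """Refuse an output that reproduces more than max_overlap fingerprinted windows."""
    return len(shingles(candidate_output) & index) > max_overlap

# Hypothetical usage: if output_is_blocked(model_reply, index), return a refusal
# or a short licensed summary instead of the verbatim text.
```

Exact-match fingerprints miss paraphrase, so a filter like this is a floor, not a ceiling, for what developer duties should require.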
3) CRCAT's composition and payout rules sideline creators
CRCAT membership is restricted to organizations through a single CMO per class of work. Individuals and small, unregistered creators have no voice in distribution policy, even though the paper allows CRCAT to decide payout methodology by simple majority.
There is no prescribed valuation method: no clarity on contribution-based metrics, usage intensity, or revenue attribution. Non-members can apply for payouts but have no say in rules that determine their compensation. That is an imbalance, not a collective solution.
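For concreteness, a contribution-based distribution rule could look like the following, where the weights and signals are assumptions for illustration rather than anything the paper prescribes:

$$
s_i = \frac{\alpha\, c_i + \beta\, u_i}{\sum_j \left( \alpha\, c_j + \beta\, u_j \right)}, \qquad p_i = s_i \cdot R
$$

Here $c_i$ is rightsholder $i$'s measured share of the training corpus, $u_i$ a usage or attribution signal for their works in model outputs, $\alpha$ and $\beta$ published policy weights, $R$ the royalty pool, and $p_i$ the payout. The gap is that the paper specifies neither the signals nor who sets the weights.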
4) "Commercialisation" test delays and dilutes remuneration
Royalties trigger only when the AI system "commercialises," but the paper blurs revenue, profit, and commercial exploitation. Companies can generate revenue without profit for years; some models may be deployed for data flywheels or ecosystem lock-in, not direct income.
Deferring payment to an undefined future event normalizes uncompensated extraction today. It also ducks the question of how to apportion value where model outputs are statistical abstractions, platform revenue is pooled, and substitutability varies by use case.
5) Broadcasting is the wrong analogy
Broadcasting is one-to-many, expressive, and repetitive: the use is visible and meterable. AI training is ingestion and transformation to build general-purpose systems with open-ended downstream uses. Treating training like performance misses how value is created and how risks show up (privacy spillovers, paywall leakage, provenance loss).
Collective licensing can help at scale, but only if the use is identifiable and recurring. Training is neither.
A workable path forward
- Consent and control
- Offer opt-out and opt-in tiers for rightsholders, with machine-readable signals enforced rather than merely advisory (see the sketch after this list).
- Exclude personal and confidential data by default; require documented data provenance and documented filtering.
- Paywalled content safeguards
- Ban training on paywalled content unless licensed directly from the publisher with explicit downstream guardrails.
- Mandate output filters to block reproduction or summary leakage of subscription material.
- Clarify end-user liability and developer obligations under Section 65A and related provisions.
- Fair and transparent royalties
- Include elected independent creators (not just organizations) in CRCAT governance and rate-setting.
- Publish distribution formulas; allow audits; support claims by non-members on equal footing.
- Use mixed models: modest upfront access fees + usage-weighted payouts, not revenue-only triggers.
- Measurement and auditability
- Require model logs and dataset registries; support content fingerprinting and attribution where feasible.
- Fund third-party measurement to reduce asymmetry between developers and rightsholders.
- Public interest safeguards
- Protect exceptions for research, education, and accessibility with tight scoping and non-substitutive outputs.
- Impose proportionality limits on data use relative to stated purpose.
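On the consent point above, enforcing machine-readable signals is technically straightforward. A minimal sketch, assuming a hypothetical crawler identity and treating the "noai"/"notrain" header values as emerging conventions rather than a settled standard (a real regime would have to name the signals it recognises):

```python
from urllib import robotparser, request
from urllib.parse import urljoin, urlparse

CRAWLER_NAME = "ExampleAITrainingBot"  # hypothetical crawler identity

def robots_allows(url, agent=CRAWLER_NAME):
    """Respect robots.txt as the baseline machine-readable signal."""
    root = "{0.scheme}://{0.netloc}".format(urlparse(url))
    rp = robotparser.RobotFileParser()
    rp.set_url(urljoin(root, "/robots.txt"))
    rp.read()
    return rp.can_fetch(agent, url)

def header_opts_out(url):
    """Check response headers for 'noai'/'notrain'-style reservations
    (emerging conventions, not a settled standard)."""
    with request.urlopen(request.Request(url, method="HEAD")) as resp:
        tag = resp.headers.get("X-Robots-Tag", "").lower()
    return "noai" in tag or "notrain" in tag

def may_train_on(url):
    """A page enters the training corpus only if both signals allow it."""
    return robots_allows(url) and not header_opts_out(url)
```

Signals like these only bind if the licence makes honouring them a condition of lawful training, which the current no-opt-out design omits.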
What policymakers can do now
- Direct the committee to explicitly separate: (a) lawful access for training, (b) lawful access for outputs, and (c) liability for downstream leakage.
- Codify personal data exclusions with technical and audit requirements, not just general cross-references to privacy law.
- Redesign CRCAT to include independent creators, set mandatory distribution principles, and require public reporting.
- Define "commercialisation" precisely; allow alternative triggers (e.g., deployment scale, model usage tiers) for payment.
- Pilot the framework with limited-domain sandboxes; iterate based on measured risks and enforcement costs.
Bottom line
The paper tries to bring order to AI training, but the current design trades away consent, privacy, and practical enforceability. Without opt-outs, personal data safeguards, paywall protections, and creator governance, the scheme will face legal challenges and public pushback.
India can lead here, but it has to get the incentives, enforcement, and safeguards right: upfront, not after deployment.