Training-Data Privacy and Data-Subject Rights Against AI Models

When personal data goes into training an AI model, two GDPR questions follow it: on what legal basis was it processed, and what rights can the people in that data still exercise once it’s baked into model weights? The European Data Protection Board’s Opinion 28/2024 and the CNIL’s June 2025 recommendations have now given the most authoritative answers available — and they are more nuanced than either the “models are anonymous, so GDPR doesn’t apply” camp or the “you must delete every individual on request” camp would like.

This piece states what those instruments actually say and where the genuine uncertainty remains.

EDPB Opinion 28/2024: The Anchor

The EDPB adopted Opinion 28/2024 in December 2024, at the request of the Irish Data Protection Commission, addressing data-protection aspects of processing personal data in the development and deployment of AI models. Three holdings matter most.

Holding 1: A trained model is not automatically “anonymous”

The most consequential point: the EDPB declined to say that an AI model trained on personal data is inherently anonymous. It rejected an all-or-nothing rule in favor of a case-by-case assessment of whether a given model can be considered anonymous — turning on whether personal data can be extracted from, or related to individuals through, the model.

The practical effect is that a developer cannot simply assert “the weights aren’t personal data, so GDPR stops at the model.” Whether the model is anonymous is a factual question to be demonstrated, not assumed — and where it isn’t anonymous, GDPR obligations continue to attach to the model itself, not just the training dataset.

Holding 2: Legitimate interest can be a lawful basis — with a structured test

The EDPB confirmed that a controller may rely on legitimate interests (Article 6(1)(f)) for processing personal data in developing or deploying AI models, applying the established three-step assessment: identify the legitimate interest, show the processing is necessary for it, and balance it against the data subjects’ rights and freedoms. There is no AI carve-out from the balancing test, and the reasonable expectations of the people in the data weigh heavily in it.
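One way to keep the three steps from collapsing into boilerplate is to record each as its own field and refuse to call the assessment complete while any is blank. The sketch below is illustrative only: the class and field names are hypothetical, not drawn from the EDPB opinion, and the check is purely structural. It says nothing about whether the recorded reasoning would satisfy a regulator.

```python
from dataclasses import dataclass, field


@dataclass
class LegitimateInterestAssessment:
    """Hypothetical record of the three-step test; names are illustrative."""

    interest: str              # step 1: the legitimate interest pursued
    necessity_rationale: str   # step 2: why the processing is necessary for it
    balancing_rationale: str   # step 3: weighing against data subjects' rights
    mitigations: list[str] = field(default_factory=list)

    def missing_steps(self) -> list[str]:
        """Return the names of steps with no recorded reasoning."""
        steps = {
            "interest": self.interest,
            "necessity": self.necessity_rationale,
            "balancing": self.balancing_rationale,
        }
        return [name for name, text in steps.items() if not text.strip()]

    def is_complete(self) -> bool:
        """True only when all three steps contain some text."""
        return not self.missing_steps()


draft = LegitimateInterestAssessment(
    interest="Develop a general-purpose language model",
    necessity_rationale="Comparable quality is not achievable without this corpus",
    balancing_rationale="",  # not yet written
    mitigations=["prior opt-out", "pseudonymization after collection"],
)
print(draft.missing_steps())
```

A record like this is a container for the reasoning, not a substitute for it; the balancing step in particular still has to be argued case by case.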

Holding 3: Unlawful upstream processing can taint downstream deployment

The opinion also addressed the consequence of building a model on unlawfully processed personal data, signaling that illegality in the development phase can carry into the deployment phase — a warning that data-sourcing shortcuts are not cured by the act of training.

CNIL’s 2025 Guidance: Operationalizing It

On June 19, 2025, the French regulator (CNIL) published recommendations that put operational meat on the EDPB’s bones, covering legitimate interest as a basis for AI training and measures for collecting data via web scraping.

The CNIL accepts that legitimate interest is, realistically, the most workable basis for AI developers given the impracticality of obtaining consent at training scale — but conditions it on mitigations. Among those it identifies: anonymization, use of synthetic data, and offering a prior opt-out mechanism so individuals can object before their data is used.

On web scraping specifically, the CNIL did not impose a ban but set conditions: define precise collection criteria in advance, exclude certain data categories, respect sites’ explicit objections (terms, robots.txt, CAPTCHA), maintain exclusion lists of sites not to scrape, apply anonymization or pseudonymization promptly after collection, and provide a discretionary right to object, meaning one that individuals can exercise without giving reasons. The throughline is the EDPB’s balancing test made concrete: legitimate interest survives only when the developer actively limits the impact on the people in the data.
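Two of those conditions, exclusion lists and robots.txt, lend themselves to a pre-collection gate. The following is a minimal sketch using Python's standard urllib.robotparser. The host names, the user-agent string, and the may_collect function are all assumptions made for illustration. Fetching the robots.txt file is left out so the example runs offline, and terms of service or CAPTCHA signals are not covered here.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# Hypothetical exclusion list: hosts the developer has decided never to scrape.
EXCLUDED_HOSTS = {"forum.example.org"}


def may_collect(url: str, robots_txt: str,
                user_agent: str = "example-training-bot") -> bool:
    """Return True only if the host is not excluded and robots.txt allows the URL."""
    host = urlparse(url).hostname or ""
    if host in EXCLUDED_HOSTS:
        return False
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse text already fetched elsewhere
    return parser.can_fetch(user_agent, url)


robots = "User-agent: *\nDisallow: /private/\n"
print(may_collect("https://news.example.com/public/story", robots))
print(may_collect("https://news.example.com/private/profile", robots))
print(may_collect("https://forum.example.org/public/thread", robots))
```

The exclusion check runs first so that an excluded host is refused even when its robots.txt would permit the crawl.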

The Right to Erasure Problem

This is where law and engineering collide. Under GDPR Article 17, individuals can request erasure of their personal data. Applied to a trained model, that runs into a hard technical wall: the data isn’t stored as a retrievable database row; it’s diffused across the model’s weights. Literally removing one person’s influence may require costly retraining or “machine unlearning” techniques that remain immature.

Regulators have not resolved this cleanly, but a pragmatic direction is emerging. The CNIL’s guidance allows for alternatives to literal deletion, such as output filtering that suppresses a person’s name or other documented suppression logic, provided the rationale is recorded. The same logic extends to the rights of access and rectification: where the data is embedded rather than stored, controllers are expected to provide meaningful information about how the data was used and to take available measures to limit its influence, rather than to perform a deletion the architecture may not support.
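As a rough illustration of output filtering paired with a recorded rationale, the sketch below keeps each suppression request as a small record and replaces the listed names in generated text. Everything here is hypothetical: the SuppressionEntry fields, the filter_output function, and the placeholder string are illustrative assumptions, not anything the CNIL prescribes. Exact-string matching of this kind will also miss misspellings, nicknames, and indirect references, so it is a starting point rather than a complete control.

```python
import re
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SuppressionEntry:
    """Hypothetical record tying a suppressed name to its documented rationale."""

    name: str           # name to suppress in model output
    request_date: date  # when the request was received
    rationale: str      # why filtering was chosen over retraining


def filter_output(text: str, entries: list[SuppressionEntry],
                  placeholder: str = "[name withheld]") -> str:
    """Replace each listed name, case-insensitively and on word boundaries."""
    for entry in entries:
        pattern = re.compile(r"\b" + re.escape(entry.name) + r"\b", re.IGNORECASE)
        # A function replacement keeps the placeholder literal (no escape handling).
        text = pattern.sub(lambda _match: placeholder, text)
    return text


entries = [
    SuppressionEntry(
        name="Jane Doe",
        request_date=date(2025, 7, 1),
        rationale="Retraining not feasible; name suppressed at output",
    )
]
print(filter_output("Contact jane doe or Jane Doerr.", entries))
```

Storing the rationale next to the rule means the record of why this measure was chosen travels with the measure itself.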

The honest state of play: there is no settled doctrine that erasure-by-retraining is required, nor a blanket exemption for models. Controllers are expected to do what is feasible, document why, and not treat technical difficulty as a free pass.

What This Means for AI Developers

A working compliance posture given the current instruments:

Do not assume anonymity — assess it. Per EDPB 28/2024, treat “is the model anonymous?” as a documented factual determination, including whether training data can be extracted or individuals re-identified. If you cannot demonstrate anonymity, GDPR continues to apply to the model.

Build the legitimate-interest assessment as a real document. Identify the interest, justify necessity, run the balancing test, and record the mitigations (opt-out, scraping exclusions, pseudonymization) the CNIL expects. A boilerplate LIA will not survive scrutiny.

Implement an upstream opt-out and scraping discipline. Honor robots.txt and explicit objections, keep exclusion lists, and offer a pre-collection objection mechanism. This is now the regulator-expected baseline, not a nicety.

Have an answer for erasure/access requests. You may not be able to retrain on demand, but you need a defensible, documented response: what suppression or output-filtering measures you apply, what information you provide about the data’s use, and why your approach is the feasible one.

Watch the development-phase legality. Unlawful sourcing can taint deployment. Diligence on where training data came from is a compliance control, not just an ethics one.

The Direction of Travel

The combined message of EDPB 28/2024 and the CNIL’s 2025 work is that AI training is inside the GDPR, not adjacent to it — but the regulators are pragmatic about the technical realities. Legitimate interest is available; anonymity is provable but not presumed; and data-subject rights apply, with feasibility-bounded expectations on the hardest cases like erasure from weights. Developers who treat training data as a GDPR-free zone are exposed; developers who build documented assessments and real upstream controls are in defensible territory.

Cross-references

For how automated decisions made by these models are governed, see GDPR Article 22 and LLM automated decision-making. For the cross-border dimension of moving training and inference data, see cross-border LLM data transfers and SCC compliance. For the assessment artifact that ties this together operationally, see the DPIA template for LLM deployment.

For continued coverage of EDPB and member-state regulator guidance, AI policy watch tracks these developments.
