AI Training Data Requirements: Article 10


Complete guide to AI Act Article 10 data governance. Training data requirements, bias detection, data provenance, GPAI transparency template, and compliance steps.

Legalithm Team · 26 min read
Topic: AI Act
Updated: March 2026

AI Training Data Requirements Under the EU AI Act: Article 10 and Data Governance

If there is one provision in the EU AI Act that will force organisations to change how they build AI systems, it is Article 10. Not because it bans anything. Not because it imposes a fine. But because it reaches inside the machine learning pipeline itself and prescribes how training data must be collected, prepared, documented, and governed — before the system ever reaches the market. For providers of high-risk AI systems, Article 10 transforms data governance from an internal best practice into a legal obligation. For providers of general-purpose AI (GPAI) models, a parallel transparency regime under Article 53 requires detailed summaries of training data. This guide breaks down every requirement, explains what compliance looks like in practice, and gives you the steps to get there before the 2 August 2026 deadline.

TL;DR — AI training data requirements essentials

  • Who: Every provider of a high-risk AI system (as classified under Annex III or sector-specific legislation) and every provider of a GPAI model.
  • What: Article 10 mandates data governance and management practices covering the entire data lifecycle — design, collection, preparation, labelling, cleaning, enrichment, aggregation, bias examination, gap identification, and documentation.
  • When: High-risk AI system obligations apply from 2 August 2026. GPAI obligations already apply from 2 August 2025, with a transitional deadline of 2 August 2027 for models already on the market.
  • Datasets covered: Training data, validation data, and testing data — all three.
  • Bias obligation: Datasets must be examined for possible biases, and providers must take appropriate measures to detect, prevent, and mitigate them.
  • Special category data: Article 10(5) creates a legal basis for processing otherwise prohibited personal data (racial or ethnic origin, health data, sexual orientation, etc.) when strictly necessary for bias monitoring — subject to strict safeguards.
  • GPAI transparency: Article 53(1)(d) requires GPAI providers to publish a sufficiently detailed summary of training data content, using the EU's standardised template published in July 2025.
  • Documentation: All data governance practices must be documented in Annex IV technical documentation, Section 2.

Why data governance is the foundation of AI Act compliance

Every AI system learns from data. The quality, representativeness, and fairness of that data determine whether the system works correctly, fails silently, or actively harms people. The EU legislator understood this when drafting the AI Act — and responded by making data governance one of the most detailed and prescriptive requirements in the entire regulation.

The principle is straightforward: garbage in, garbage out. A recruitment AI trained on historical hiring data that reflects gender bias will replicate that bias at scale. A medical diagnostic system trained on datasets that underrepresent certain ethnic groups will produce less accurate diagnoses for those populations. A credit scoring model trained on data that conflates geography with creditworthiness will systematically disadvantage applicants from certain postcodes. In each case, the harm originates in the data.

Article 10 sits at the centre of the AI Act's requirements architecture. It connects directly to the risk management system (Article 9), the Annex IV technical documentation obligations (Article 11), and the post-market monitoring regime (Article 72).

This means you cannot comply with the AI Act without complying with Article 10. It is not one requirement among many — it is the foundation on which every other technical requirement depends.

The 2 August 2026 deadline applies to all high-risk AI system obligations, including Article 10. If you are building a system that falls within Annex III — biometric identification, critical infrastructure, education, employment, access to essential services, law enforcement, migration, or administration of justice — your training data practices must meet these requirements before that date. Start now. Retrospective documentation takes two to three times longer than prospective documentation.

Article 10 — the legal requirements explained

Article 10 is titled "Data and data governance." It applies to high-risk AI systems that use techniques involving the training of AI models with data. It is one of the longest and most detailed articles in the regulation, covering everything from design choices to special category personal data. Here is what each of its key paragraphs requires.

Data governance and management practices (Article 10(2))

Article 10(2) is the core provision. It states that training, validation, and testing datasets shall be subject to data governance and management practices appropriate for the intended purpose of the AI system. Those practices must concern, at minimum, the elements listed in sub-paragraphs (a) through (f).

The word "appropriate" is important. The AI Act does not prescribe one-size-fits-all data management. What is appropriate depends on the intended purpose, the risk level, the state of the art, and the specific characteristics of the data. A facial recognition system trained on biometric data demands different governance than a manufacturing quality control system trained on product images. But both must demonstrate that governance practices exist, are documented, and are proportionate.

Design choices for datasets (Article 10(2)(a)–(f))

The six mandatory elements are:

(a) The relevant design choices: This covers the foundational decisions about your datasets — what data to include, what to exclude, what the target variable represents, what the input features are, how the data is structured. Every design choice must be documented and justified. If you decided to exclude certain data sources, explain why. If you chose a particular labelling taxonomy, document it.

(b) Data collection processes and the origin of data, and in the case of personal data, the original purpose of the data collection: You must know where your data comes from. For every dataset, document the source, the collection method, and the timeframe. For personal data, you must additionally document the original purpose for which the data was collected — which has direct implications for GDPR lawful basis assessments.

(c) Relevant data-preparation processing operations: This includes cleaning, filtering, normalisation, encoding, feature engineering, augmentation, imputation, and any other transformation applied to the raw data. Each operation must be documented, including the rationale for choosing it and its effect on the dataset.

(d) The formulation of assumptions: Any assumptions underlying the datasets must be made explicit. If you assume that historical data is representative of future conditions, document that assumption. If you assume that a proxy variable correlates with the actual variable of interest, document that assumption and the evidence supporting it.

(e) An assessment of the availability, quantity, and suitability of the datasets that are needed: You must evaluate whether you have enough data, whether that data is suitable for the intended purpose, and whether there are gaps. This is not a one-time check — it must be reassessed when the system or its operating conditions change.

(f) Examination in view of possible biases that are likely to affect the health and safety of persons, have a negative impact on fundamental rights, or lead to discrimination: This is the bias examination obligation. It requires proactive examination of the datasets — not just the model outputs — for biases that could cause harm. The examination must be followed by appropriate measures to detect, prevent, and mitigate those biases. For a detailed guide on implementing this requirement, see our dedicated article on AI bias testing for EU AI Act compliance.

Relevant, representative, and free of errors (Article 10(3))

Article 10(3) sets three quality standards for training, validation, and testing datasets:

  1. Relevant: The data must be relevant to the intended purpose. A system designed to assess credit risk must be trained on data that actually relates to creditworthiness — not on tangentially related proxies.

  2. Sufficiently representative: The datasets must be representative of the persons or conditions on which the system will be used. A hiring tool deployed across the EU must be trained on data that represents the diversity of the EU labour market — not on data from a single country or demographic group.

  3. To the best extent possible, free of errors and complete: The regulation uses realistic language — "to the best extent possible" — acknowledging that perfect data does not exist. But it requires a demonstrable, documented effort to identify and correct errors, handle missing values, and ensure completeness.

These three standards create an ongoing obligation. If the operating context changes — for example, if a system trained on Dutch healthcare data is deployed in Romania — the representativeness assessment must be updated.
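In practice, the "best extent possible" standard is demonstrated through routine, documented quality audits rather than claims of perfection. A minimal sketch of such an audit in Python (using pandas; the column names and example data are illustrative, not taken from any real system):

```python
import pandas as pd

def data_quality_report(df: pd.DataFrame) -> dict:
    """Minimal Article 10(3)-style quality audit: row count, duplicate
    rows, missing values per column, and overall cell completeness."""
    return {
        "rows": len(df),
        "duplicate_rows": int(df.duplicated().sum()),
        "missing_by_column": df.isna().sum().to_dict(),
        "completeness": round(1 - df.isna().to_numpy().mean(), 4),
    }

# Illustrative applicant data with one missing value and one duplicate row
applications = pd.DataFrame({
    "income": [32000, None, 51000, 51000],
    "age":    [29,    41,   35,    35],
})
report = data_quality_report(applications)
```

Running such a report at every dataset version, and archiving the output, gives you the documented, demonstrable effort that Article 10(3) asks for.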

Taking into account specific geographical, contextual, and behavioural settings (Article 10(4))

Article 10(4) requires that datasets take into account the specific geographical, contextual, behavioural, or functional settings within which the system is intended to be used. This provision targets a common failure mode: systems that perform well in the environment where they were developed but fail when deployed in a different context.

Practical example: A fraud detection system trained on transaction patterns from Northern European banking markets may not perform accurately when deployed in Southern European markets, where payment behaviour, transaction volumes, and fraud patterns differ significantly. Article 10(4) requires the provider to account for these differences in the training data.

This applies equally to linguistic context (a system must handle the languages and dialects of its deployment market), temporal context (data from 2019 may not represent post-pandemic behaviour), and demographic context.

Processing of special categories of personal data (Article 10(5))

Article 10(5) addresses one of the most sensitive intersections in AI regulation: the need to process data about racial or ethnic origin, health, religion, sexual orientation, and other protected characteristics in order to detect and correct bias — even though processing such special category data is generally prohibited under GDPR Article 9.

The AI Act resolves this tension by creating a specific legal basis for processing special category data, but only when all of the following conditions are met:

  • The processing is strictly necessary for the purpose of ensuring bias monitoring, detection, and correction.
  • The processing is subject to appropriate safeguards for the fundamental rights and freedoms of natural persons.
  • Those safeguards include technical limitations on re-use and use of state-of-the-art security and privacy-preserving measures, such as pseudonymisation or encryption where anonymisation may significantly affect the purpose pursued.
  • The special category data must be deleted once the bias has been corrected or the data retention period has expired, whichever comes first.
  • The provider must maintain records of why the processing was necessary and what safeguards were applied.

This is significant for compliance teams. It means you can — and in many cases should — collect demographic data for the purpose of bias testing. But the safeguards are non-negotiable. Data must be pseudonymised, access-controlled, purpose-limited, and deleted when no longer needed. See our guide on the interaction between the AI Act and GDPR for a deeper analysis of this overlap.

Summary of Article 10 requirements

| Paragraph | Requirement | Key obligation |
| --- | --- | --- |
| 10(1) | Scope | Applies to high-risk systems using data-trained models |
| 10(2)(a) | Design choices | Document and justify all dataset design decisions |
| 10(2)(b) | Data origin | Record source, collection method, and (for personal data) original purpose |
| 10(2)(c) | Data preparation | Document all processing operations applied to raw data |
| 10(2)(d) | Assumptions | Make explicit all assumptions underlying the data |
| 10(2)(e) | Availability assessment | Evaluate whether datasets are sufficient and suitable |
| 10(2)(f) | Bias examination | Examine datasets for biases; detect, prevent, and mitigate them |
| 10(3) | Data quality | Datasets must be relevant, representative, and free of errors |
| 10(4) | Context awareness | Account for geographical, contextual, and behavioural deployment settings |
| 10(5) | Special category data | Permitted for bias detection under strict safeguards |

Training, validation, and testing datasets — what you must document

Article 10 does not exist in isolation. Its requirements feed directly into the Annex IV technical documentation, particularly Section 2 (detailed information about the development process, including data). Here is what you must document for each dataset type.

Data provenance and origin

For every dataset used in training, validation, or testing, you must be able to answer:

  • Where did the data come from? Identify each source — internal databases, third-party data providers, public datasets, web scraping, user-generated content, synthetic generation, or other origins.
  • When was the data collected? Specify the timeframe and any temporal limitations.
  • Who collected the data? Identify the entity responsible for collection.
  • Under what legal basis? For personal data, specify the GDPR lawful basis under Article 6, and for special category data, the Article 9 exception relied upon.
  • What licences or terms govern the data? For third-party or public datasets, document licensing terms and any restrictions on use.

Data provenance is not just a documentation exercise — it is a liability shield. If a downstream error is traced to a specific data source, provenance records allow you to isolate the problem and demonstrate due diligence.
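One practical way to make these questions answerable on demand is to capture a structured provenance record per dataset. The sketch below is one possible shape; the field names are illustrative, not mandated by the AI Act or Annex IV:

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional

@dataclass
class DatasetProvenance:
    """One record per dataset, mirroring the provenance questions above."""
    name: str
    source: str                   # e.g. internal database, vendor, public dataset
    collected_from: str           # start of collection window (ISO date)
    collected_to: str             # end of collection window (ISO date)
    collected_by: str             # entity responsible for collection
    gdpr_lawful_basis: Optional[str] = None  # Art. 6 basis, if personal data
    art9_exception: Optional[str] = None     # Art. 9 exception, if special-category data
    licence: str = "internal"
    restrictions: List[str] = field(default_factory=list)

# Hypothetical example record
record = DatasetProvenance(
    name="loan-applications-v3",
    source="internal CRM export",
    collected_from="2021-01-01",
    collected_to="2024-06-30",
    collected_by="Example Bank NV",
    gdpr_lawful_basis="Art. 6(1)(b) contract",
)
```

Serialising such records (for example with `asdict`) lets provenance feed directly into the Annex IV documentation without a separate manual exercise.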

Data collection methodology

Document the methodology used to collect each dataset:

  • Sampling strategy: Random, stratified, convenience, or purposive sampling? What was the sampling frame?
  • Inclusion/exclusion criteria: What criteria determined which data points were included or excluded?
  • Annotation/labelling process: Who labelled the data? What guidelines did they follow? What was the inter-annotator agreement rate?
  • Quality control during collection: What checks were applied during collection to ensure accuracy?

Statistical properties and representativeness

Article 10(3) requires datasets to be "sufficiently representative." To demonstrate this, you must document:

  • Distributions of key variables: Summary statistics, histograms, and distribution analyses for all features relevant to the system's purpose.
  • Demographic composition: For systems affecting natural persons, the demographic breakdown of the dataset compared to the target population.
  • Class balance: For classification tasks, the distribution of target classes and any imbalance.
  • Temporal distribution: How the data is distributed over time, and whether temporal trends exist.
  • Geographical distribution: Where the data subjects or data points are located, and how this maps to the intended deployment geography.

Practical example: A provider of a high-risk hiring AI deployed across Germany, France, and Spain must document that its training data includes applicants from all three countries in proportions that reflect the deployment population. If 80% of training data comes from Germany but 40% of deployment usage occurs in France, the provider must explain this gap and describe the mitigation measures taken (e.g., oversampling, domain adaptation, or separate model tuning).
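A first-pass version of this comparison can be automated: compute each group's share in the training data, compare it against the deployment population, and flag deviations beyond a chosen tolerance. The country shares and the 5% tolerance below are illustrative assumptions, not regulatory thresholds:

```python
def representativeness_gaps(dataset_share, population_share, tolerance=0.05):
    """Flag groups whose share in the training data deviates from the
    target population by more than `tolerance` (absolute difference).
    Each dict maps group label -> fractional share summing to 1."""
    gaps = {}
    for group, target in population_share.items():
        observed = dataset_share.get(group, 0.0)
        if abs(observed - target) > tolerance:
            gaps[group] = round(observed - target, 3)
    return gaps

# Hypothetical deployment population vs. training data composition
population = {"DE": 0.40, "FR": 0.40, "ES": 0.20}
training   = {"DE": 0.80, "FR": 0.12, "ES": 0.08}
gaps = representativeness_gaps(training, population)
```

Flagged gaps would then feed the documented explanation and mitigation plan described above (oversampling, domain adaptation, or deployment restrictions).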

Bias identification and mitigation steps

This is where Article 10(2)(f) meets Annex IV documentation. You must document:

  • Which biases were examined: The specific types of bias assessed (historical, representation, measurement, aggregation, evaluation, deployment).
  • The methodology used: What tools, metrics, and tests were applied. Examples include demographic parity analysis, equalized odds testing, disparate impact ratios, and subgroup performance comparisons.
  • The results: What biases were detected, their severity, and which groups were affected.
  • The mitigation measures: What was done to address each detected bias — resampling, reweighting, adversarial debiasing, threshold adjustment, data augmentation, or other techniques.
  • The residual risk: What bias remains after mitigation, and why it is considered acceptable given the system's intended purpose and risk level.

For a step-by-step methodology, see our guide on AI bias testing and fairness.

Gap identification and filling

Article 10(2)(e) requires an assessment of whether the available data is sufficient and suitable. If gaps are identified, you must document:

  • What gaps exist: Underrepresented subgroups, missing time periods, geographical blind spots, insufficient edge cases.
  • How gaps were identified: Through statistical analysis, domain expert review, comparison against the target population, or other methods.
  • How gaps were addressed: Data augmentation, synthetic data generation, additional data collection, transfer learning, or — if gaps cannot be filled — restrictions on the system's deployment scope.
  • Residual gaps: What gaps remain after remediation, and how they affect the system's performance and fairness.

GPAI model training data transparency

The AI Act imposes a separate but related regime for general-purpose AI models under Article 53. While Article 10 governs data governance for high-risk systems, Article 53 governs training data transparency for GPAI models — and the two regimes overlap when a GPAI model is integrated into a high-risk system.

Article 53(1)(d) — training data summary obligation

Article 53(1)(d) requires every provider of a GPAI model to draw up and make publicly available a sufficiently detailed summary about the content used for training of the general-purpose AI model. This summary must be prepared in accordance with a template provided by the AI Office.

The purpose is twofold: to allow downstream providers to understand the data foundations of the model, and to enable copyright holders to identify whether their content was used in training.

The EU training data transparency template (published July 2025)

The AI Office published the standardised training data summary template in July 2025, following public consultation. The template is a structured document that GPAI providers must complete and publish. It is designed to be machine-readable as well as human-readable, enabling systematic analysis by regulators and researchers.

What the template requires

The transparency template is organised into three main sections:

1. General metadata:

  • Model name, version, and release date
  • Provider name and contact details
  • Model type (foundation model, fine-tuned, retrieval-augmented, etc.)
  • Modalities supported (text, image, audio, video, code, multimodal)
  • Intended downstream uses and known limitations

2. Data source categories and descriptions:

  • Categorised list of data sources (e.g., web crawl, licensed datasets, public domain works, government data, user-contributed data, synthetic data)
  • For each category: a description of the type of content, the approximate volume, the languages represented, the geographical origin, and the timeframe
  • Identification of whether copyrighted works are included, and if so, which categories (books, news articles, academic papers, music, images, code repositories, etc.)
  • Any data sources explicitly excluded and the reason for exclusion

3. Data processing and governance:

  • Data filtering and cleaning procedures applied
  • Deduplication methodology
  • Content safety filtering (removal of CSAM, toxic content, personal data)
  • Quality scoring or selection mechanisms
  • Personal data handling — whether personal data is included, and if so, what categories and what safeguards are applied
  • Opt-out mechanisms provided to data subjects or copyright holders

The template does not require disclosure of the actual training data or the specific URLs or documents used. It requires a summary — detailed enough to be meaningful, but not so granular that it reveals trade secrets.
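As an illustration only, a completed summary might be organised along the three sections above. The structure and field names here are paraphrased from this article's description of the template, not copied from the official AI Office document:

```python
# Hypothetical skeleton of a training data summary, organised into the
# three sections described above. All values are invented examples.
training_data_summary = {
    "general_metadata": {
        "model_name": "example-model",
        "version": "1.0",
        "provider": "Example AI BV",
        "modalities": ["text", "code"],
    },
    "data_sources": [
        {
            "category": "web crawl",
            "description": "publicly accessible web pages",
            "approx_volume": "2 TB of text",
            "languages": ["en", "de", "fr"],
            "timeframe": "2019-2024",
            "copyrighted_content": True,
        },
    ],
    "processing_and_governance": {
        "deduplication": "near-duplicate removal",
        "safety_filtering": ["CSAM removal", "toxicity filtering"],
        "personal_data_included": True,
        "opt_out_mechanism": "robots.txt and provider opt-out form",
    },
}
```

Because the official template is designed to be machine-readable, maintaining the summary as structured data from the start avoids a later conversion exercise.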

Transitional deadline: 2 August 2027 for existing models

GPAI obligations under Article 53 apply from 2 August 2025 for new models placed on the market after that date. For GPAI models already on the market before 2 August 2025, a transitional period applies — providers must comply by 2 August 2027.

This gives existing GPAI providers two years to reconstruct or compile training data summaries for models that may have been trained before the transparency requirements were known.

Bias in training data — detection and mitigation

Article 10(2)(f) requires providers to examine datasets for biases and take appropriate measures to address them. This section provides an overview of the types of bias relevant to compliance, the testing methodologies available, and the documentation required. For a complete technical guide, see our dedicated article on AI bias testing.

Types of bias

| Bias type | Description | Example |
| --- | --- | --- |
| Selection bias | The data collection process systematically excludes certain groups or conditions | A medical AI trained only on data from university hospitals, excluding rural clinics |
| Measurement bias | Inconsistent or inaccurate data collection across groups | A credit scoring system where income is self-reported in one region but verified in another |
| Historical bias | The data accurately reflects a world that contains systemic inequality | A hiring model trained on 20 years of promotion data that reflects historical gender imbalances |
| Representation bias | Certain groups are underrepresented relative to the target population | A facial recognition system with 90% of training images from light-skinned individuals |
| Aggregation bias | A single model is used for groups that require different treatment | A clinical prediction model that averages across ethnic groups with different baseline health profiles |
| Evaluation bias | The evaluation dataset or metrics do not reflect the deployment population | A model evaluated on English-language benchmarks but deployed in a multilingual context |

Testing methodologies

Bias testing under Article 10 must be conducted at the data level (examining datasets before and during training) and at the model level (evaluating outputs across subgroups). Key approaches include:

  • Subgroup analysis: Disaggregate performance metrics (accuracy, precision, recall, F1) by protected attributes (gender, ethnicity, age, disability status).
  • Disparate impact ratio: Compare outcome rates across groups. A ratio below 0.8 (the four-fifths rule) is a common indicator of adverse impact.
  • Equalized odds: Test whether the model has equal true positive and false positive rates across groups.
  • Counterfactual fairness: Change protected attributes in the input and test whether the output changes.
  • Intersectional analysis: Test for bias at the intersection of multiple attributes (e.g., women over 50, young men from ethnic minorities).
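The disparate impact ratio from the list above can be computed directly from outcome records. A minimal sketch; the outcomes, group labels, and protected/reference roles are invented for illustration, and a real assessment should also report sample sizes and confidence intervals:

```python
def disparate_impact_ratio(outcomes, groups, protected, reference):
    """Ratio of favourable-outcome rates: protected group vs. reference.
    A value below 0.8 is the conventional four-fifths-rule warning level.
    `outcomes` are 1 (favourable) / 0; `groups` labels each record."""
    def rate(g):
        selected = [o for o, grp in zip(outcomes, groups) if grp == g]
        return sum(selected) / len(selected)
    return rate(protected) / rate(reference)

# Hypothetical loan-approval outcomes for two groups
outcomes = [1, 0, 1, 1, 0,  0, 1, 0, 0, 1]
groups   = ["A", "A", "A", "A", "A",  "B", "B", "B", "B", "B"]
ratio = disparate_impact_ratio(outcomes, groups, protected="B", reference="A")
```

Here group A is approved 60% of the time and group B 40%, giving a ratio of about 0.67, below the four-fifths threshold, so this dataset would warrant mitigation and documentation under Article 10(2)(f).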

Annex IV documentation requirements for bias

Annex IV Section 2 requires the technical documentation to include:

  • A description of the data examination measures undertaken
  • The specific biases tested for and why they were selected
  • The metrics and thresholds used
  • The results of bias testing, including quantitative findings
  • The mitigation measures applied and their effectiveness
  • Any residual bias and the justification for its acceptability
  • The plan for ongoing bias monitoring in production

Practical example — credit scoring: A provider of a high-risk credit scoring system trained on historical loan application data conducts the following bias assessment:

  1. Selection bias check: Compare the demographic composition of the training data against Eurostat population statistics for the target deployment countries. Finding: women aged 18–25 are underrepresented by 12%. Mitigation: stratified oversampling and synthetic data augmentation for that subgroup.
  2. Historical bias check: Analyse historical approval rates by ethnicity. Finding: applicants with non-European-sounding names had 15% lower approval rates even after controlling for financial indicators. Mitigation: remove name-derived features; retrain model; validate that disparity reduces to below 3%.
  3. Measurement bias check: Verify consistency of income reporting across data sources. Finding: self-employed income is measured differently across three data providers. Mitigation: standardise income calculation methodology and document assumptions.

Special categories of personal data

Article 10(5) addresses the paradox at the heart of AI fairness: to detect bias against protected groups, you need data about those groups — but processing that data is generally prohibited.

When processing is permitted under Article 10(5)

Processing special category data is permitted under Article 10(5) when:

  1. Bias cannot be monitored through other means: If you can detect and correct bias using anonymised or non-sensitive data, you must do so. Article 10(5) is a last resort, not a default option.
  2. Processing is strictly necessary: Not merely useful or convenient, but strictly necessary for the purpose of bias monitoring, detection, or correction.
  3. Appropriate safeguards are in place: Technical and organisational measures must protect the data throughout its lifecycle.

Interaction with GDPR Article 9

The AI Act's Article 10(5) creates a sector-specific exception that interacts with GDPR Article 9(2)(g) (substantial public interest). However, the two regimes impose cumulative obligations, not alternative ones. Providers must comply with both:

  • The AI Act's safeguard requirements under Article 10(5)
  • GDPR's data protection principles (lawfulness, purpose limitation, data minimisation, storage limitation, integrity and confidentiality)
  • GDPR's accountability requirements (DPIA, records of processing, DPO consultation)

This means you need both a GDPR-compliant legal basis and AI Act-compliant safeguards. A data protection impact assessment (DPIA) under GDPR Article 35 is effectively mandatory, since processing special category data for bias detection will almost always meet the threshold for "high risk" processing under GDPR.

For a comprehensive analysis, see our article on EU AI Act vs GDPR: differences and overlap.

Safeguards and limitations

At minimum, the following safeguards must be implemented:

  • Pseudonymisation: Replace direct identifiers with pseudonyms. True anonymisation is preferred where it does not undermine the bias detection purpose.
  • Encryption: Apply encryption at rest and in transit for all special category data.
  • Access controls: Restrict access to special category data to a defined group of authorised personnel with a documented need.
  • Purpose limitation: Special category data must be used only for bias monitoring, detection, and correction — never for model training or any other purpose.
  • Retention limits: Delete special category data once the bias analysis is complete or the retention period expires, whichever is sooner.
  • Logging and audit trails: Maintain records of all access to and processing of special category data.
  • Technical limitations on re-use: Implement technical controls (not just policies) that prevent the data from being re-used for other purposes.
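Pseudonymisation for bias analysis is often implemented with a keyed hash: the same individual always maps to the same token, so records can be linked across the analysis, but the mapping cannot be reversed without the key. A minimal standard-library sketch (key management, rotation, and deletion schedules are organisational measures outside this snippet):

```python
import hmac
import hashlib

def pseudonymise(identifier: str, secret_key: bytes) -> str:
    """Deterministic keyed pseudonym (HMAC-SHA256) for a direct identifier."""
    return hmac.new(secret_key, identifier.encode(), hashlib.sha256).hexdigest()

key = b"illustrative-key-store-in-a-vault"  # never hard-code in production
token_a = pseudonymise("subject-12345", key)
token_b = pseudonymise("subject-12345", key)  # same subject -> same token
```

Deleting the key once the bias analysis is complete is one way to operationalise the Article 10(5) deletion requirement, since the tokens then cannot be linked back to individuals.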

Implementation guide — achieving Article 10 compliance

Compliance with Article 10 is not a single deliverable. It is a set of ongoing practices embedded in your AI development lifecycle. The following step-by-step guide provides a structured approach.

Step 1: Inventory your datasets

Before you can govern your data, you need to know what you have. Create a comprehensive inventory of all datasets used in training, validation, and testing for each high-risk AI system.

Checklist:

  • List all training datasets with source, size, format, and date of acquisition
  • List all validation datasets with the same information
  • List all testing datasets with the same information
  • Identify which datasets contain personal data and which contain special category data
  • Map data flows from source to model

Step 2: Establish data governance policies

Define and document governance policies that cover the Article 10(2) requirements.

Checklist:

  • Data quality policy (cleaning, error correction, completeness checks)
  • Data collection policy (sources, methods, consent/legal basis)
  • Data labelling policy (guidelines, annotator training, quality assurance)
  • Data retention and deletion policy
  • Data access control policy
  • Data versioning and change management policy

Step 3: Assess representativeness

Compare your datasets against the target deployment population along all relevant dimensions.

Checklist:

  • Identify the target population for each high-risk system
  • Compare dataset demographics against target population demographics
  • Compare dataset geographical distribution against deployment geography
  • Compare dataset temporal distribution against intended operational period
  • Document any gaps and define a remediation plan

Step 4: Conduct bias examination

Perform the bias examination required by Article 10(2)(f).

Checklist:

  • Select bias types to test for based on system context and risk profile
  • Choose appropriate fairness metrics and thresholds
  • Conduct data-level bias analysis (distribution analysis, representation checks)
  • Conduct model-level bias analysis (subgroup performance, disparate impact, equalized odds)
  • Document all findings, including negative results (no bias found)
  • Implement mitigation measures for identified biases
  • Validate that mitigation measures are effective
  • Document residual bias and justification

Step 5: Document everything in Annex IV format

Compile all of the above into Annex IV technical documentation, Section 2.

Checklist:

  • Design choices documented and justified
  • Data provenance recorded for all datasets
  • Data preparation operations documented
  • Assumptions made explicit
  • Availability and suitability assessment completed
  • Bias examination results and mitigation documented
  • Statistical properties and representativeness analysis included
  • Special category data processing documented with safeguards

Step 6: Establish ongoing monitoring

Article 10 compliance is not a one-time gate. Data quality and representativeness must be monitored throughout the system's lifecycle, as required by Article 72's post-market monitoring obligations.

Checklist:

  • Define data quality metrics and monitoring frequency
  • Implement data drift detection for production data
  • Schedule periodic bias re-examination (at least annually or upon material changes)
  • Define triggers for dataset revalidation (new deployment geography, regulatory change, performance degradation)
  • Integrate data governance monitoring into the overall compliance programme
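Data drift detection is commonly implemented with the population stability index (PSI) over binned feature distributions. A minimal sketch; the bucket shares are invented, and the thresholds in the comment are the conventional rule of thumb, not values prescribed by the AI Act:

```python
import math

def population_stability_index(expected, observed):
    """PSI across pre-binned share distributions (each dict sums to 1).
    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate drift,
    > 0.25 significant drift warranting dataset revalidation."""
    psi = 0.0
    for bucket, e in expected.items():
        o = max(observed.get(bucket, 0.0), 1e-6)  # avoid log(0)
        e = max(e, 1e-6)
        psi += (o - e) * math.log(o / e)
    return psi

# Hypothetical income-band distribution: training data vs. production traffic
training_dist   = {"low": 0.30, "mid": 0.50, "high": 0.35 - 0.15}
production_dist = {"low": 0.20, "mid": 0.45, "high": 0.35}
psi = population_stability_index(training_dist, production_dist)
```

A PSI crossing the chosen threshold would be a natural trigger for the dataset revalidation and bias re-examination steps defined earlier in this checklist.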

Frequently asked questions

Does Article 10 apply if I use a pre-trained model or foundation model?

Yes. If you fine-tune, adapt, or integrate a pre-trained or foundation model into a high-risk AI system, you are the provider of that system and Article 10 applies to the data you use for fine-tuning, validation, and testing. For the underlying pre-training data, the GPAI transparency obligations under Article 53 apply to the GPAI model provider. You should request the GPAI provider's training data summary and assess whether the pre-training data is suitable for your intended purpose.

What if my training data is entirely synthetic?

Synthetic data is not exempt from Article 10. The data governance requirements apply regardless of whether the data is real, synthetic, or a combination. You must document the generation methodology, verify that the synthetic data does not introduce or amplify biases, and assess whether it is representative of real-world conditions. Synthetic data can help address representation gaps, but it introduces new risks — such as mode collapse or amplification of seed data patterns — that must be documented and managed.

How does Article 10 interact with GDPR data minimisation?

There is an inherent tension. GDPR Article 5(1)(c) requires data minimisation — collecting only what is necessary. Article 10 requires datasets that are sufficiently representative and comprehensive. The AI Act resolves this in Recital 67 by clarifying that providers may process personal data to the extent necessary for ensuring bias monitoring, detection, and correction, subject to appropriate safeguards. The key is purpose specification: you must clearly articulate which data is needed for which purpose and cannot use "AI training" as a blanket justification for unlimited data collection. A properly conducted data protection impact assessment (DPIA) will help navigate this balance.

What level of data quality is "good enough" for compliance?

Article 10(3) uses the phrase "to the best extent possible, free of errors and complete." This is a reasonableness standard, not a perfection standard. You must demonstrate that you have taken appropriate, documented steps to identify and correct errors, handle missing values, and ensure data quality. What is "appropriate" depends on the risk level — a system that classifies AI-generated images has a lower data quality bar than a system that determines eligibility for social benefits. Document your quality processes, your error rates, and your rationale for considering the remaining quality level acceptable.

Do I need to re-examine my training data for bias every time I retrain?

Yes. Each retraining cycle may introduce new data, change data distributions, or alter the model's relationship with protected groups. Article 10(2)(f) and the risk management obligations under Article 9 require ongoing examination. In practice, you should integrate bias examination into your CI/CD pipeline so that it runs automatically with each retraining cycle, and conduct deeper manual reviews at defined intervals or when triggered by monitoring alerts.
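An automated bias gate in such a pipeline can be as simple as comparing fairness metrics against pre-agreed thresholds and blocking the release on any failure. The metric names and threshold values below are illustrative assumptions:

```python
def bias_gate(metrics: dict, thresholds: dict) -> list:
    """Return the fairness checks that fail; an empty list means the
    retrained model may proceed to release."""
    failures = []
    if metrics["disparate_impact"] < thresholds["min_disparate_impact"]:
        failures.append("disparate_impact")
    if metrics["tpr_gap"] > thresholds["max_tpr_gap"]:
        failures.append("tpr_gap")
    return failures

# Thresholds agreed in the risk management process (hypothetical values)
thresholds = {"min_disparate_impact": 0.8, "max_tpr_gap": 0.05}
failures = bias_gate({"disparate_impact": 0.72, "tpr_gap": 0.03}, thresholds)
# here the pipeline would block release: disparate impact is below 0.8
```

Logging each gate result alongside the model version also produces the audit trail that Annex IV and post-market monitoring expect.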

What happens if I cannot obtain representative data for my deployment population?

If representative data is not available for a specific deployment context — for example, if you lack training data from a particular EU member state where the system will be deployed — you have several options: (1) restrict deployment to contexts for which you have representative data and document those restrictions in the instructions for use; (2) use transfer learning or domain adaptation techniques and validate performance in the target context; (3) generate synthetic data that fills the representativeness gap; or (4) collect additional data. What you cannot do is deploy the system in contexts where you know the data is not representative without disclosing that limitation to deployers and users.

Next steps

Article 10 data governance is one component of a broader compliance obligation. To build a complete compliance programme, pair it with the related guides referenced throughout this article: AI bias testing, Annex IV technical documentation, and the interaction between the AI Act and GDPR.

Not sure whether your AI system is high-risk? Take our free AI Act risk assessment to find out in five minutes.

Tags: AI Act · Training Data · Data Governance · Article 10 · Bias Detection · GPAI · Compliance
