AI Act Technical Documentation: A Practical Guide to Annex IV Requirements
TL;DR — What you need to know about Annex IV documentation
- Who: Every provider of a high-risk AI system must prepare Annex IV technical documentation before placing the system on the EU market or putting it into service.
- What: Nine mandatory sections covering system description, development process, monitoring, performance metrics, risk management, lifecycle changes, harmonised standards, declaration of conformity, and post-market monitoring.
- When: Documentation must exist before conformity assessment begins — not after. The high-risk obligations deadline is 2 August 2026.
- How long: Expect 40–60 hours for simple systems, 60–100 hours for moderate systems, and 100–200+ hours for complex systems. Retrospective documentation takes 2–3x longer.
- SME relief: Article 11(2) allows SMEs and startups to use a simplified form, though all nine sections must still be addressed.
- Living document: Annex IV documentation is not a one-time deliverable — it must be updated throughout the AI system's entire lifecycle.
- Conformity gate: Without complete documentation, you cannot pass conformity assessment. Without conformity assessment, you cannot legally operate.
Why technical documentation matters
Article 11 requires that technical documentation of a high-risk AI system be drawn up before that system is placed on the market or put into service, and be kept up to date throughout its lifecycle.
The documentation serves two purposes:
- Demonstrate compliance with the requirements in Articles 8–15 — risk management, data governance, transparency, human oversight, accuracy, robustness, and cybersecurity.
- Provide national competent authorities with all necessary information to assess the system's conformity.
This is not a formality. During conformity assessment — whether self-assessed or evaluated by a notified body — the assessor evaluates your documentation against the legal requirements. Gaps in documentation translate directly to assessment failures, delays, and in the worst case, inability to place your system on the market.
The connection to conformity assessment is direct: for systems requiring self-assessment under Annex VI, the provider's own quality management team reviews the documentation. For systems requiring a notified body under Annex VII (primarily biometric identification), an external assessor scrutinises every section. In both cases, incomplete or vague documentation is the single most common reason for assessment failure.
The nine mandatory sections of Annex IV
The following sections correspond to the structure specified in Annex IV of the AI Act. For each section, we explain what to include, what not to include, and provide a practical example based on a credit scoring AI system — one of the most common high-risk classifications under Annex III point 5(b).
Section 1: General description of the AI system
What to include:
- The system's intended purpose, stated precisely
- The provider's name, address, and contact details
- System version number and any predecessor versions
- How the system interacts with external hardware or software
- Versions of relevant software or firmware, and requirements for version updates
- All forms in which the system is placed on the market (SaaS, API, embedded, on-premise)
- The hardware on which the system is intended to run
- For product components: photographs showing external features, marking, and internal layout
- A basic description of the user interface provided to the deployer
What NOT to include: Marketing copy, aspirational feature descriptions, or vague claims about the system's capabilities. Write as if explaining the system to a regulator who has never seen it before.
Credit scoring example: "CreditScore Pro v3.2 is an AI system that assesses the creditworthiness of natural persons applying for consumer loans between EUR 1,000 and EUR 50,000. It ingests applicant financial history, employment data, and transaction patterns via API integration with the deploying bank's core banking system. It outputs a numerical score (300–850) and a risk category (low/medium/high/very high). It is deployed as a cloud-hosted SaaS application running on AWS eu-west-1. The system does not make autonomous lending decisions — it provides a recommendation that a human credit officer evaluates."
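Annex IV does not prescribe any particular format for this section, but capturing the Section 1 fields as structured metadata makes completeness easy to check before an assessment. A minimal sketch; every name and value below is hypothetical:

```python
from dataclasses import dataclass, fields

@dataclass
class SystemDescription:
    """Annex IV Section 1 fields as structured metadata (illustrative only)."""
    intended_purpose: str
    provider_name: str
    provider_contact: str
    version: str
    predecessor_versions: str
    external_interactions: str
    distribution_forms: str   # e.g. "SaaS, API, embedded, on-premise"
    target_hardware: str
    ui_description: str

def missing_fields(desc: SystemDescription) -> list[str]:
    """Return the names of any fields left empty, as a completeness check."""
    return [f.name for f in fields(desc) if not getattr(desc, f.name).strip()]

desc = SystemDescription(
    intended_purpose="Assess creditworthiness of natural persons applying for consumer loans",
    provider_name="Example Provider GmbH",
    provider_contact="compliance@example.eu",
    version="3.2",
    predecessor_versions="3.0, 3.1",
    external_interactions="API integration with the deployer's core banking system",
    distribution_forms="Cloud-hosted SaaS",
    target_hardware="AWS eu-west-1",
    ui_description="",  # deliberately left blank to show the check firing
)
print(missing_fields(desc))  # ['ui_description']
```

A check like this can run in CI so that a release with an incomplete Section 1 fails early rather than during conformity assessment.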
Section 2: Detailed description of the elements and development process
This is the most technically demanding section. It must cover six sub-areas:
Design and development:
- General logic of the system and the algorithms used
- Key design choices, including rationale and assumptions made
- System architecture explaining how software components build on or feed into each other
- Computational resources used for development, training, testing, and validation
- Third-party tools, libraries, or pre-trained models used, with version numbers
Data practices:
- Training methodologies and techniques
- Training data: description of datasets, data provenance, scope, main characteristics
- How data was obtained and selected
- Labelling procedures and data-cleaning methodologies
- Data assessment in terms of suitability, biases, and potential gaps
Human oversight:
- Measures designed into the system to facilitate human oversight under Article 14
Pre-determined changes:
- Any pre-determined changes to the system and its performance, with details of the technical solutions to ensure continued compliance
Validation and testing:
- Validation and testing procedures, including the data used and its main characteristics
- Metrics used to measure accuracy, robustness, and compliance
- Test logs and test reports with dates and signatures
Cybersecurity:
- Technical solutions addressing Article 15 requirements
- Measures against AI-specific vulnerabilities: data poisoning, model poisoning, adversarial examples
Credit scoring example — data practices section: "Training data comprises 2.4 million anonymised historical loan applications from the period 2018–2024, sourced from three EU banking partners under data sharing agreements. The dataset includes 43 features per application. Applicants' protected characteristics (gender, ethnicity, age) were excluded from model inputs but retained in a separate analysis dataset for bias testing. Labelling: each application was labelled with the actual repayment outcome (default/no-default) at 12 months. Data cleaning: 14,200 records (0.6%) were excluded due to incomplete repayment data. Bias assessment: the training dataset over-represents applicants aged 30–50 and under-represents applicants under 25. This was addressed through stratified sampling during training and post-hoc calibration of scores across age brackets."
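The stratified-sampling decision in the example rests on first measuring how each subgroup is represented in the training data. A minimal sketch of such a representation check, using hypothetical applicant ages and brackets:

```python
from collections import Counter

def representation_report(ages, brackets):
    """Share of training records per age bracket, to surface over/under-representation."""
    def bracket(age):
        for label, (lo, hi) in brackets.items():
            if lo <= age < hi:
                return label
        return "other"
    counts = Counter(bracket(a) for a in ages)
    total = sum(counts.values())
    return {label: round(counts[label] / total, 3) for label in brackets}

# Hypothetical sample of applicant ages drawn from the training set
ages = [22, 24, 31, 35, 38, 41, 44, 47, 52, 58, 33, 36, 39, 42, 45, 48, 29, 61, 37, 40]
brackets = {"<25": (0, 25), "25-50": (25, 50), "50+": (50, 120)}
print(representation_report(ages, brackets))
# {'<25': 0.1, '25-50': 0.75, '50+': 0.15}
```

The resulting shares can be pasted directly into the bias-assessment paragraph of Section 2, alongside the mitigation chosen (stratified sampling, post-hoc calibration, or both).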
Handling third-party and pre-trained models: If your system uses a model you did not train — a fine-tuned foundation model, a pre-trained embedding model, or a third-party classification API — you must still document the base model's characteristics, your adaptation process, and any limitations inherited from the base model. "We used Model X" is not sufficient. Request technical documentation, model cards, or data sheets from your suppliers. Document what you know, what you do not know, and what steps you have taken to address gaps. If the supplier cannot provide adequate documentation, this itself is a risk that must be documented and mitigated.
Section 3: Monitoring, functioning, and control
- The system's capabilities and limitations in performance, including degrees of accuracy for specific persons or groups
- Foreseeable unintended outcomes and sources of risks to health, safety, and fundamental rights
- Human oversight specifications: technical measures to facilitate interpretation of outputs
- Specifications for input data, as appropriate
Credit scoring example: "The system's accuracy (AUC-ROC) is 0.87 on the general test population. Known limitations: accuracy drops to 0.79 for applicants with fewer than 12 months of credit history, and to 0.81 for self-employed applicants with irregular income patterns. The system may produce unreliable scores for applicants from countries with incompatible credit reporting frameworks. Human oversight: the deployer dashboard displays the score, the top five contributing factors, and a confidence indicator. If confidence is below 70%, the system flags the case for mandatory manual review."
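The confidence-gated manual review rule in the example amounts to a small routing function. A sketch, with the 70% threshold taken from the example and all field names illustrative:

```python
def review_decision(score: int, confidence: float) -> dict:
    """Route a scored application: low-confidence cases go to mandatory manual review."""
    needs_review = confidence < 0.70  # threshold from the human-oversight specification
    return {
        "score": score,
        "confidence": confidence,
        "route": "manual_review" if needs_review else "credit_officer_dashboard",
    }

print(review_decision(612, 0.64)["route"])  # manual_review
```

Documenting the rule as executable logic, not just prose, also gives the assessor direct evidence that the oversight measure is actually built into the system.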
Section 4: Appropriateness of performance metrics
- The metrics chosen to measure performance
- Why these metrics are appropriate for the specific system and intended purpose
- The benchmark(s) against which performance is measured
Disaggregated accuracy requirements: The AI Act expects performance metrics to be broken down across relevant subgroups — not reported only as aggregate figures. For a credit scoring system, this means reporting accuracy, false positive rates, and false negative rates disaggregated by age bracket, gender, geographic region, and employment type. A single aggregate "95% accuracy" figure is insufficient and will likely be challenged during conformity assessment.
Credit scoring example: "Primary metric: AUC-ROC, chosen because it measures discriminative ability across all classification thresholds, which is appropriate for a scoring system where deployers set their own acceptance thresholds. Secondary metrics: false positive rate (FPR) and false negative rate (FNR), reported disaggregated by age group (<25, 25–35, 35–50, 50–65, 65+), gender, and employment type (employed, self-employed, unemployed). Benchmark: the system's performance is compared against the incumbent logistic regression model used by the primary banking partner, using the same test dataset."
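Disaggregated FPR/FNR reporting of the kind described above can be computed from labelled test outcomes with a few lines of code. A sketch on hypothetical data; in practice these figures come from your evaluation pipeline:

```python
def fpr_fnr(records):
    """False positive and false negative rates from (y_true, y_pred) pairs."""
    fp = sum(1 for t, p in records if t == 0 and p == 1)
    tn = sum(1 for t, p in records if t == 0 and p == 0)
    fn = sum(1 for t, p in records if t == 1 and p == 0)
    tp = sum(1 for t, p in records if t == 1 and p == 1)
    return {"FPR": fp / (fp + tn), "FNR": fn / (fn + tp)}

def disaggregate(records_by_group):
    """Report FPR/FNR per subgroup rather than one aggregate figure."""
    return {group: fpr_fnr(recs) for group, recs in records_by_group.items()}

# Hypothetical test-set outcomes: (actual_default, predicted_default), by age bracket
by_group = {
    "<25":   [(1, 1), (0, 1), (0, 0), (1, 0), (0, 1), (1, 1)],
    "25-50": [(0, 0), (0, 0), (1, 1), (1, 1), (0, 0), (1, 1)],
}
report = disaggregate(by_group)
```

A subgroup gap that only shows up in a breakdown like this (here the hypothetical under-25 group) is exactly what an aggregate accuracy figure hides.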
Section 5: Risk management system
- The risk management system under Article 9
- Known or foreseeable risks identified
- Risk evaluation results
- Risk management measures adopted and residual risk assessment
- Evidence that the process was iterative and carried out throughout the development lifecycle
Credit scoring example: "Risk register includes 23 identified risks. Top-5 by severity: (1) systematic bias against young applicants with thin credit files — mitigated by age-stratified calibration and mandatory manual review for applicants under 25; (2) proxy discrimination via postal code — mitigated by excluding geographic features and testing for disparate impact; (3) data drift from changing economic conditions — mitigated by quarterly model performance monitoring and retraining triggers; (4) adversarial manipulation of input data — mitigated by input validation, anomaly detection, and transaction pattern cross-verification; (5) over-reliance by deployers on automated scores — mitigated by requiring human review for all borderline scores (550–650 range)."
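A machine-checkable risk register makes the "iterative process" evidence easier to produce. The entry below is a hypothetical sketch mirroring the first risk in the example; the field names are illustrative, not prescribed by Article 9:

```python
# One entry from a hypothetical risk register: identified risk, evaluation,
# mitigations, residual risk, and a review history proving iteration.
risk_entry = {
    "id": "R-001",
    "description": "Systematic bias against young applicants with thin credit files",
    "severity": "high",
    "likelihood": "medium",
    "mitigations": [
        "Age-stratified score calibration",
        "Mandatory manual review for applicants under 25",
    ],
    "residual_risk": "low",
    "reviews": [
        {"date": "2025-11-03", "outcome": "mitigations verified on Q3 test set"},
    ],
}

def open_high_risks(register):
    """List risks whose residual risk is still high, for escalation before release."""
    return [r["id"] for r in register if r["residual_risk"] == "high"]

print(open_high_risks([risk_entry]))  # []
```

A release gate that fails while `open_high_risks` is non-empty gives you a direct, dated audit trail of when each residual risk was accepted or reduced.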
Section 6: Changes throughout the lifecycle
- All relevant changes made to the system throughout its lifecycle
- How changes were tested and validated
- Version control and change management procedures
This section must be maintained as a living record. Every model update, retraining, feature addition, or performance recalibration should be logged with the date, rationale, test results, and confirmation of continued compliance.
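One way to keep such a living record is an append-only change log next to the model artifacts. A sketch as JSON records; the file name and fields are illustrative, not prescribed by Annex IV:

```python
import json
import os
import tempfile
from datetime import date

def log_change(changelog_path, version, rationale, tests_passed):
    """Append a Section 6 lifecycle-change record with date, rationale, and test outcome."""
    try:
        with open(changelog_path) as f:
            entries = json.load(f)
    except FileNotFoundError:
        entries = []
    entries.append({
        "date": date.today().isoformat(),
        "version": version,
        "rationale": rationale,
        "tests_passed": tests_passed,
        "compliance_confirmed": tests_passed,  # simplification: gated on test outcome
    })
    with open(changelog_path, "w") as f:
        json.dump(entries, f, indent=2)
    return entries

# Demonstration in a temporary directory
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "changes.json")
    log_change(path, "3.3", "Quarterly retraining on 2025-Q3 data", True)
    entries = log_change(path, "3.4", "Hotfix: input validation for missing income field", True)
```

Because each entry carries a date and test result, the log doubles as the audit trail assessors look for in Section 6.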
Section 7: Applied harmonised standards
- If harmonised standards under Article 40 were applied, list them with version numbers
- Where harmonised standards were not applied, describe the solutions adopted to meet the requirements of Chapter III, Section 2
As of April 2026, CEN/CENELEC has published draft standards but not all have been formally harmonised. Document which standards you followed and, for areas without harmonised standards, explain how you met the legal requirements directly from the text of Articles 8–15.
Section 8: EU declaration of conformity
- A copy of the EU declaration of conformity under Article 47
This section is completed at the end of the conformity assessment process. The declaration references the system, the provider, the harmonised standards or other specifications used, and the conformity assessment procedure followed.
Section 9: Post-market monitoring system
- The post-market monitoring system under Article 72
- How performance data is collected and analysed after deployment
- Thresholds and triggers for corrective action
- Incident reporting procedures under Article 73
Credit scoring example: "Performance monitoring: automated weekly calculation of AUC-ROC, FPR, and FNR on a rolling 90-day window of production decisions, disaggregated by subgroup. Alert thresholds: if any subgroup AUC-ROC drops below 0.80, an investigation is triggered within 48 hours. If aggregate AUC-ROC drops below 0.83, the system is flagged for retraining. Feedback loop: deployment partners report quarterly on actual default rates for AI-scored applications, enabling back-testing of predictions. Incident reporting: the post-market monitoring team reports serious incidents to the relevant market surveillance authority within 15 days per Article 73."
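The alert thresholds in the example map naturally onto a small decision function. A sketch, with the numeric thresholds taken directly from the example text and everything else illustrative:

```python
def monitoring_action(aggregate_auc, subgroup_aucs):
    """Map monitored metrics to the corrective actions in the example thresholds."""
    actions = []
    for group, auc in subgroup_aucs.items():
        if auc < 0.80:
            actions.append(f"investigate:{group}")   # investigation within 48 hours
    if aggregate_auc < 0.83:
        actions.append("flag_for_retraining")
    return actions or ["no_action"]

print(monitoring_action(0.86, {"<25": 0.78, "25-50": 0.88}))
# ['investigate:<25']
```

Encoding the thresholds once, in code that both the monitoring pipeline and the documentation reference, avoids the drift that occurs when prose and production disagree.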
Real-world documentation scenarios
Scenario 1: Medical device AI (radiology)
A provider of an AI system that assists radiologists in detecting lung nodules on CT scans (high-risk under Article 6(1) as a safety component of a device covered by the Medical Devices Regulation) faces the most demanding documentation requirements. Third-party conformity assessment by a notified body is likely required. Section 2 must include detailed descriptions of the training dataset (tens of thousands of annotated scans), inter-rater reliability of the labelling, performance broken down by nodule size, patient demographics, and scanner manufacturer. Section 5 must address the risk of missed detections (false negatives) and false alarms (false positives) with specific residual risk quantification.
Scenario 2: HR screening tool (recruitment)
An HR technology company providing an AI system that filters job applications (high-risk under Annex III point 4(a)) must document in Section 2 how the training data was curated to avoid encoding historical hiring biases. Section 3 must specify accuracy disaggregated by gender, age, ethnicity, and disability status. Section 5 must address risks including indirect discrimination via proxy features (university name as a proxy for socioeconomic status, gap years as a proxy for caregiving responsibilities). Section 9 must describe how the provider monitors whether the system's recommendations lead to disparate outcomes across protected groups in production.
Scenario 3: Critical infrastructure monitoring
A provider of an AI system that monitors electrical grid stability and triggers automated load-shedding decisions (high-risk under Annex III point 2) must document in Section 1 the system's interaction with SCADA systems and grid hardware. Section 2 must cover the simulation environments used for testing, since live grid testing is impractical. Section 5 must address cascading failure risks, including scenarios where the AI incorrectly triggers load-shedding and causes unplanned outages affecting hospitals and emergency services.
Common pitfalls and how to fix them
Pitfall 1: Writing documentation retrospectively
The AI Act requires documentation to be prepared during development, not after the system is complete. If design decisions, training data choices, and test results were not documented as they happened, reconstructing them is harder and less credible to an assessor.
Fix: Start a documentation log from day one. Record key decisions, dataset descriptions, and test results in real time. Integrate documentation tasks into your sprint or development cycle.
Pitfall 2: Treating documentation as a one-time deliverable
Annex IV documentation is a living document. It must be kept up to date throughout the AI system's lifecycle. Any significant change — a model update, a new training dataset, a change in intended purpose — triggers a documentation update.
Fix: Tie documentation updates to your CI/CD pipeline. Every release that changes model behaviour should trigger a documentation review. Use version control (Git) for documentation alongside code.
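A crude version of such a release gate can be sketched with file modification times; a real pipeline would compare git history for the model artifact and the documentation instead. All paths below are hypothetical:

```python
import os
import tempfile
import time

def docs_stale(model_path: str, docs_path: str) -> bool:
    """True if the model artifact changed more recently than the documentation."""
    return os.path.getmtime(model_path) > os.path.getmtime(docs_path)

# Demonstration with temporary files standing in for the model and the docs
with tempfile.TemporaryDirectory() as d:
    model = os.path.join(d, "model.bin")
    docs = os.path.join(d, "annex_iv.md")
    open(docs, "w").close()
    open(model, "w").close()
    os.utime(docs, (time.time() - 3600,) * 2)  # docs last touched an hour ago
    stale = docs_stale(model, docs)
print(stale)  # True
```

Wired into CI, a check like this fails any release where the model changed but the Annex IV documentation did not.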
Pitfall 3: Ignoring inherited limitations from third-party components
If your system uses a pre-trained model, a third-party dataset, or an external API, you must document the limitations and risks inherited from these components. "We used GPT-4 via API" is not sufficient.
Fix: Request technical documentation or model cards from your suppliers. Document what you know and what you do not know. If a supplier cannot provide adequate documentation, document this gap and explain your mitigation strategy.
Pitfall 4: Vague or aggregate-only accuracy claims
"The system achieves 95% accuracy" fails the Annex IV standard. You must specify:
- The metric used (precision, recall, F1, AUC-ROC, etc.)
- The dataset on which it was measured
- The population segments for which it was measured (disaggregated performance)
- Known failure modes and performance drops under specific conditions
Fix: Report performance disaggregated across all relevant subgroups. Document the datasets, conditions, and thresholds used. Be explicit about where performance degrades.
Pitfall 5: Missing cybersecurity documentation
Many teams document functional performance but neglect cybersecurity. Article 15 requires measures against data poisoning, model poisoning, adversarial inputs, and unauthorised access, and Annex IV requires you to document those measures.
Fix: Conduct a threat model specific to AI vulnerabilities. Document each threat, the mitigation measures, and the residual risk. This is distinct from your general IT security posture.
Pitfall 6: No version control or audit trail
Assessors will look for evidence that documentation evolved alongside the system. A single, undated Word document with no change history is a red flag.
Fix: Store documentation in version-controlled repositories. Use timestamped commits. Maintain a changelog for each major documentation revision.
SME simplifications under Article 11(2)
Article 11(2) explicitly allows SMEs and startups to provide Annex IV elements in a simplified form. The European Commission is tasked with establishing a simplified technical documentation form tailored to the needs of small and micro enterprises.
As of April 2026, the Commission has not yet published this form. The practical approach in the interim:
- Cover all nine sections — the simplification applies to depth, not scope.
- Scale detail to system complexity — a simple classification tool does not need the same depth as a medical AI diagnostic system.
- Focus on substance over length — assessors evaluate whether you addressed the requirements, not page counts.
- Document what you genuinely know — it is better to state "we tested on a dataset of 5,000 records and found X" than to fabricate elaborate testing narratives.
Example: A five-person startup providing an AI system that prioritises customer support tickets (high-risk if used in essential services) can document its development process in 15–20 pages rather than the 80+ pages that a large medical AI provider might need — as long as every Annex IV section is addressed with honest, specific information.
Documentation as a living document
Annex IV documentation is not a deliverable you complete and archive. The AI Act requires it to be maintained throughout the system's lifecycle. Triggers for updates include:
- Model retraining or fine-tuning — update Sections 2, 4, and 5.
- New training data — update Sections 2 and 5.
- Change in intended purpose or deployment context — update Sections 1, 3, and 5.
- New identified risks — update Section 5.
- Performance degradation detected — update Sections 4 and 9.
- Regulatory guidance or harmonised standards published — update Section 7.
- Post-market incidents — update Sections 5 and 9.
Establish a review cadence: at minimum quarterly, or triggered by any of the events above.
Practical tips on tooling and version control
Documentation that lives in disconnected Word documents across email threads will not survive a conformity assessment. Practical approaches used by early adopters:
- Docs-as-code: Store documentation in Markdown or reStructuredText alongside your codebase, versioned in Git. Every documentation change is a commit with a timestamp and author.
- Structured templates: Use a consistent template mirroring the nine Annex IV sections. This ensures completeness and makes assessor review straightforward.
- Automated data capture: Pull training metadata, test results, and performance metrics directly from your ML pipeline into documentation templates. Tools like MLflow, Weights & Biases, or DVC can automate much of Section 2 and Section 4.
- Review workflows: Require sign-off on documentation changes, similar to code review. This creates an audit trail showing who approved what and when.
- Single source of truth: Avoid duplicating information across systems. If your risk register lives in a GRC tool, reference it from the documentation rather than copying it.
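The "automated data capture" idea needs no specific MLOps tool to get started. A minimal sketch that renders pipeline metrics (hypothetical values, matching the Section 3 example) into a documentation fragment:

```python
# Hypothetical metrics pulled from an ML pipeline run, rendered into a
# Section 4 fragment for the docs-as-code repository.
metrics = {
    "AUC-ROC (aggregate)": 0.87,
    "AUC-ROC (<25, thin credit file)": 0.79,
    "AUC-ROC (self-employed)": 0.81,
}

def render_section4(metrics, run_id):
    """Render a metrics dict as a markdown fragment tagged with its pipeline run."""
    lines = [f"Section 4: Performance metrics (pipeline run {run_id})", ""]
    lines += [f"- {name}: {value}" for name, value in metrics.items()]
    return "\n".join(lines)

print(render_section4(metrics, "run-2025-11-03"))
```

Because the fragment is generated and committed rather than hand-typed, the documented figures cannot silently diverge from what the pipeline actually measured.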
Preparation checklist
Use this to audit your readiness before starting formal documentation:
- System description and intended purpose defined precisely
- Architecture diagrams and component inventory prepared
- All third-party components identified with supplier documentation obtained
- Training data sources, selection criteria, and cleaning methods documented
- Bias assessment of training and testing data completed with disaggregated results
- Human oversight measures specified and tested
- Accuracy metrics defined with disaggregated performance data across relevant subgroups
- Cybersecurity threat model (AI-specific) completed with mitigations documented
- Risk management process documented with iteration evidence across the development lifecycle
- Test plans, test results, and test reports archived with dates and signatures
- Post-market monitoring plan drafted with thresholds and triggers
- Change management and version control procedures defined
- Documentation stored in version-controlled repository with audit trail
- SME simplification applicability assessed (if relevant)
Time and resource estimates by system complexity
- Simple systems: 40–60 hours
- Moderate systems: 60–100 hours
- Complex systems: 100–200+ hours
These figures assume design decisions were documented from the start. Retrospective documentation — reconstructing decisions, test results, and data provenance after the fact — typically takes 2–3x longer and is consistently the most expensive and error-prone path.
Connection to conformity assessment
Technical documentation is not an end in itself — it is the primary input to conformity assessment. The relationship is direct:
- Self-assessment (Annex VI): Your internal quality management system reviews the documentation against Articles 8–15. If the documentation is incomplete, the self-assessment cannot pass.
- Notified body assessment (Annex VII): The notified body examines your documentation in detail. Expect questions, requests for clarification, and follow-up audits. The quality of your documentation determines the speed and cost of the assessment.
- Declaration of conformity (Article 47): You cannot sign the declaration without a completed conformity assessment, and you cannot complete conformity assessment without complete documentation.
Next steps
- Classify your AI system to confirm whether Annex IV documentation is required.
- Review the full Annex IV text for the exact legal requirements.
- Use the checklist above to audit your current documentation gaps.
- Start with Section 1 (general description) and Section 2 (development process) — these are the most time-intensive.
- Review the full compliance checklist to see how documentation fits into the broader compliance programme.
Run the free AI Act assessment to confirm your system's risk classification and documentation obligations.
For the full legal text, see the complete AI Act guide.
Frequently asked questions
How detailed does Annex IV documentation need to be?
Detailed enough for an assessor — whether internal or a notified body — to verify that your system meets every requirement in Articles 8–15 without needing to ask you supplementary questions. The standard is not a page count; it is completeness and specificity. A 30-page document that addresses every section with concrete evidence is better than a 100-page document that uses generic language. The key test: could a qualified assessor who has never seen your system understand how it works, what risks it presents, and how you mitigated them, solely from reading the documentation?
Can I reuse documentation from ISO or other frameworks?
Partially. Existing documentation from ISO 42001 (AI management systems), ISO 27001 (information security), or IEC 62304 (medical device software) can provide building blocks, but none of these standards maps directly to all nine Annex IV sections. Use existing materials where they address the same topics, but be prepared to fill gaps — particularly around AI-specific requirements like disaggregated performance metrics, bias assessments, and AI-specific cybersecurity threats (data poisoning, adversarial examples).
What if my system uses a pre-trained model and the supplier will not share full documentation?
Document what the supplier has provided (model card, data sheet, performance benchmarks), what you requested but did not receive, and how you addressed the resulting documentation gaps. Conduct your own evaluation of the model's performance in your deployment context. Document the inherited risks and your mitigation strategy. A conformity assessor will evaluate whether your approach is reasonable given the information available — but a complete absence of supplier documentation is a significant risk factor that must be explicitly addressed.
Does the documentation need to be in a specific language?
The documentation must be drawn up in an official language of the Member State where the system is placed on the market or put into service. In practice, English is widely accepted by market surveillance authorities across the EU, but confirm with the relevant national authority. If you operate in multiple Member States, you may need translations of key sections.
How often must Annex IV documentation be updated?
There is no fixed schedule in the AI Act. The requirement is that documentation must be "kept up to date" throughout the system's lifecycle. In practice, updates should be triggered by any material change to the system (retraining, new data, new deployment context, identified incidents), any new risk information, and at regular review intervals (quarterly is a reasonable baseline). Every update should be version-controlled with a clear changelog.
What are the penalties for inadequate technical documentation?
Inadequate documentation of high-risk AI systems falls under the general high-risk violation category, carrying fines of up to EUR 15 million or 3% of global annual turnover, whichever is higher. For SMEs, the lower amount applies. Beyond fines, the practical consequence is that you cannot complete conformity assessment, which means you cannot legally place the system on the EU market. See the penalties and fines guide for the full breakdown.
Legalithm is an AI-assisted compliance workflow tool — not legal advice. Final compliance decisions should be reviewed by qualified legal counsel.