Annex XIII: Criteria for Classification of GPAI Models with Systemic Risk
Annex XIII lists the criteria for classifying a GPAI model as having systemic risk under Article 51. It includes both quantitative indicators (notably the 10^25 FLOPs cumulative compute threshold that creates a rebuttable presumption) and qualitative criteria the AI Office considers when assessing high-impact capabilities. The Commission may update these criteria via delegated acts under Article 97 as technology evolves.
Who does this apply to?
- Providers of GPAI models assessing whether they meet systemic-risk thresholds
- The AI Office and scientific panel for AI (applying and monitoring the criteria)
- Downstream providers integrating GPAI models who need to know systemic-risk status
- Compliance teams tracking threshold changes via Commission delegated acts
Scenarios
A new frontier model is trained with cumulative compute exceeding 10^25 floating-point operations.
A model is below 10^25 FLOPs but achieves state-of-the-art scores on reasoning and code generation benchmarks with broad deployment across the EU.
The Commission adopts a delegated act lowering the FLOPs threshold to 10^24 after advances in training efficiency.
What Annex XIII covers (in plain terms)
Annex XIII provides the assessment framework the AI Office uses to determine whether a GPAI model has high-impact capabilities and should be classified as systemic risk. The criteria include:
- Number of parameters of the model
- Quality and size of the dataset used for training
- Amount of computation used for training the model (measured in FLOPs) — including the 10^25 FLOPs presumption threshold
- Input and output modalities of the model (text, image, video, code, etc.)
- Benchmarks and evaluations of the model, including state-of-the-art performance
- Number of registered users or reach
- Any other indicator of high-impact capabilities
The 10^25 FLOPs threshold creates a rebuttable presumption: models above it are presumed systemic risk, but providers may argue otherwise. Models below it can still be designated if other criteria demonstrate equivalent capabilities.
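The two routes to classification described above can be sketched as a simple decision rule. This is an illustrative simplification only, not legal logic from the Act: the function name, inputs, and return strings are all hypothetical, and a real assessment also involves notification, rebuttal, and AI Office review under Article 52.

```python
THRESHOLD_FLOPS = 1e25  # Article 51(2) presumption threshold (cumulative training compute)

def classification_status(cumulative_flops: float, designated_by_ai_office: bool) -> str:
    """Illustrative sketch of the two routes to systemic-risk classification.

    Route 1: crossing the compute threshold creates a rebuttable presumption.
    Route 2: the AI Office may designate a below-threshold model on
             qualitative Annex XIII criteria (Article 51(1)(b)).
    """
    if cumulative_flops >= THRESHOLD_FLOPS:
        return "presumed systemic risk (rebuttable under Article 52)"
    if designated_by_ai_office:
        return "designated systemic risk (qualitative Annex XIII criteria)"
    return "not classified as systemic risk"

# Hypothetical examples:
print(classification_status(2e25, False))  # above threshold: presumption applies
print(classification_status(5e24, True))   # below threshold but designated
```

Note that even in the first case the presumption is rebuttable, so the output is a starting point for the Article 52 procedure, not a final status.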
The 10^25 FLOPs threshold — context
The 10^25 floating-point operations threshold was calibrated to frontier models at the time of legislative negotiations (roughly GPT-4-class training compute). Key considerations:
- It is a rebuttable presumption, not a hard boundary
- The Commission can update the threshold via delegated act as training efficiency evolves
- Distillation, data quality improvements, and architecture advances may reduce the compute needed for equivalent capabilities—the threshold may under-capture risk over time
- The AI Office can designate models below the threshold based on qualitative criteria
Providers should track both their absolute FLOPs and benchmark performance to assess classification risk.
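A minimal sketch of FLOPs tracking, using the common rule of thumb for transformer training compute (roughly 6 floating-point operations per parameter per training token). The function name and the model figures below are hypothetical; real estimates should follow a documented methodology and include all training runs that count toward cumulative compute.

```python
THRESHOLD_FLOPS = 1e25  # Article 51(2) presumption threshold

def estimate_training_flops(n_params: float, n_tokens: float) -> float:
    """Common transformer approximation: ~6 FLOPs per parameter per token.

    This covers forward and backward passes for dense training; actual
    compute depends on architecture, precision, and training procedure.
    """
    return 6.0 * n_params * n_tokens

# Hypothetical model: 70B parameters trained on 15T tokens
flops = estimate_training_flops(7e10, 1.5e13)
print(f"Estimated compute: {flops:.2e} FLOPs")
print(f"Above 1e25 presumption threshold: {flops >= THRESHOLD_FLOPS}")
```

Because the threshold is cumulative, providers running multiple training phases should sum estimates across runs and re-check after each release.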
How Annex XIII connects to the rest of the Act
- Article 51 — Uses Annex XIII criteria to define systemic risk; paragraph (2) establishes the FLOPs presumption.
- Article 52 — Procedure for classification (notification, designation, rebuttal) based on Annex XIII assessment.
- Article 55 — Additional obligations triggered by systemic-risk classification.
- Annex XI Section 2 — Documentation requirements triggered by classification (evaluation strategies, red teaming, architecture).
- Article 97 — Delegated acts allowing the Commission to update Annex XIII criteria and thresholds.
- Article 90 — Scientific panel that may issue qualified alerts based on Annex XIII analysis.
- Article 113 — Application dates (Chapter V from 2 August 2025).
Recitals (preamble) on EUR-Lex
The recitals in the same consolidated AI Act on EUR-Lex contextualise the 10^25 FLOPs calibration, the rebuttable presumption design, and the Commission's power to evolve criteria. Use the official preamble on EUR-Lex—do not rely on unofficial recital lists without checking sequence and wording against the authentic text.
Compliance checklist
- Calculate and document cumulative training compute (FLOPs) for each GPAI model release.
- Track benchmark performance against state-of-the-art metrics across modalities.
- Monitor Commission delegated acts for threshold updates to Annex XIII.
- If above 10^25 FLOPs: prepare notification to the AI Office under Article 52.
- If below threshold but with broad deployment: assess qualitative criteria proactively.
- Document rebuttal arguments if you believe systemic-risk classification is not warranted despite crossing the threshold.
- Track the AI Office's published list of systemic-risk models for upstream dependencies.
Related Articles
Article 51: Classification of GPAI Models with Systemic Risk
Article 52: Procedure for Systemic Risk Classification of GPAI Models
Article 55: Obligations for Providers of GPAI Models with Systemic Risk
Article 56: Codes of Practice for GPAI Models
Article 90: Alerts of Systemic Risks by the Scientific Panel
Article 97: Exercise of the Delegation
Article 101: Fines for Providers of General-Purpose AI Models
Article 113: Entry into Force and Application Dates
Annex XI: Technical Documentation for Providers of General-Purpose AI Models
Related annexes
- Annex XI — GPAI technical documentation (Section 2 triggered by systemic-risk classification)
Frequently asked questions
Is the 10^25 FLOPs threshold permanent?
No. The Commission can update it via delegated act under Article 97 based on evolving technological benchmarks and state of the art.
Can a model below 10^25 FLOPs still be systemic risk?
Yes. Article 51(1)(b) allows the AI Office to designate based on equivalent capabilities or impact using qualitative Annex XIII criteria, even if the compute threshold is not crossed.
How do I calculate FLOPs?
FLOPs here means the cumulative number of floating-point operations used during training. For transformer models, a common rule of thumb is roughly 6 × parameter count × number of training tokens, though the right estimate depends on architecture and training procedure. Whatever method you use, document your methodology.
Does fine-tuning compute count?
The Annex refers to 'cumulative amount of computation used for training.' Whether fine-tuning adds to the base model's FLOPs depends on interpretation—document your position and monitor AI Office guidance.