Special Report: Why the aviation industry can't agree on how to certify AI
Release Date: November 5, 2025
Aviation wants the benefits of AI, but it runs on a zero-tolerance safety culture. That clash is why certification is stuck. Regulators demand traceability, determinism, and proven safety. Modern ML brings probability, data drift, and models that aren't easy to explain - and that's the friction.
For IT and development teams, this isn't a theoretical debate. It decides what you can ship, how you build it, and how long it takes. Here's what "certifiable AI" means in practice, what standards are coming, and how to prepare your engineering process now.
What counts as certifiable AI in aviation
Certification is anchored in development assurance - the processes that prove a system does what it should and nothing it shouldn't. Two pillars guide this: ARP4754 for system development and ARP4761 for safety assessment. These tie functions to safety outcomes, then define how much rigor is required.
That rigor is expressed as Development Assurance Levels (DAL A-E). DAL A covers failures with catastrophic effects and demands the most exhaustive evidence. On the software and hardware side, DO-178 and DO-254 define what "acceptable" looks like for airborne code and electronics, respectively.
Traditional software fits this model: deterministic, traceable, and verifiable against clear requirements. Machine learning challenges each of those - requirements don't map cleanly to weights, behavior shifts with data, and explainability is limited. That's why new guidance is being written.
The new standard in the works: ARP6983/ED-324
G-34/WG-114 (a joint SAE/EUROCAE group) is drafting ARP6983/ED-324. It aims to provide a process to support certification of aircraft systems and equipment that incorporate ML. The first release is intentionally narrow: supervised learning only, with frozen models that do not continue learning in service.
Two key ideas in the draft: "learning assurance" (evidence that the data is correct and complete, and that the model performs on unseen cases) and the "ML constituent" (a defined bundle that includes the model and its supporting items). Together, they adapt conventional development assurance without relaxing safety expectations.
If accepted by regulators, ARP6983/ED-324 could serve as a means of compliance for a subset of ML applications. It will not cover agents trained via reinforcement learning or anything that updates itself post-deployment. For background on regulator posture, see EASA's public materials on AI in aviation; for current airborne software assurance, see RTCA's DO-178C.
Why consensus is hard
- Safety culture vs. ML practice: Certification demands determinism and traceability. ML delivers distributions and confidence scores.
- Requirements mapping: Turning high-level safety requirements into dataset specifications and acceptance criteria is new ground.
- Evidence burden: Proving generalization, edge-case behavior, and performance under shifts goes beyond classic test coverage.
- Change control: Small model changes can have wide effects, complicating incremental certification and configuration management.
- Explainability vs. performance: Deep models perform well but are harder to justify in safety cases.
- Supply chain: Datasets, labeling vendors, pretrained weights, toolchains, and silicon become safety-critical dependencies.
- Scope creep: Where does the "system" end and the "ML constituent" begin? Boundaries matter for DAL allocation and V&V.
Determinism, agents, and what's allowed today
Early certifications will center on frozen supervised models deployed on specified hardware so outputs are repeatable for a given input. That enables deterministic behavior at the system interface even if internal logic is statistical.
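As a concrete illustration of that repeatability requirement, here is a minimal Python sketch. The frozen artifact path and the generic predict callable are hypothetical placeholders for whatever your inference stack provides; the point is the two checks teams commonly automate: pinning the exact model file by hash for configuration management, and confirming bit-identical outputs across repeated runs of the same input.

```python
# Sketch only: "frozen model + repeatable outputs" at the system interface.
# MODEL_PATH and the `predict` callable are hypothetical, not a real API.
import hashlib
import numpy as np

MODEL_PATH = "perception_model_v1.onnx"   # hypothetical frozen artifact

def artifact_hash(path: str) -> str:
    """Hash the frozen model file so configuration management can pin the exact build."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def check_repeatability(predict, x: np.ndarray, runs: int = 3) -> bool:
    """Run the same input several times and require bit-identical outputs."""
    outputs = [predict(x) for _ in range(runs)]
    return all(np.array_equal(outputs[0], out) for out in outputs[1:])
```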
Reinforcement learning agents are largely out for safety-significant functions because their policies are hard to constrain and explain. The same applies to generative models and large language models - useful for ground tools and support systems, but not for DAL A/B onboard functions. Expect strict segregation and guardrails where they are used.
Practical playbook for IT and dev teams building aviation-grade ML
- Start with functions and hazards: Perform a functional hazard assessment, allocate DALs, and decide early whether ML is justified for each function.
- Bound the ML constituent: Define clear inputs, outputs, operating conditions, and monitoring/mitigation at the system level.
- Make requirements data-centric: Specify data coverage, class balance, edge-case definitions, and acceptance metrics tied to safety objectives.
- Engineer the dataset: Version everything (raw, labels, splits). Add data quality gates, label audits, inter-annotator agreement, and bias checks.
- Separate splits with purpose: Training for fit, validation for model selection, test for final evidence. Preserve an untouched "assurance" set for certification (see the split sketch after this list).
- Prove generalization: Quantify performance by environment, sensor state, and corner cases. Add out-of-distribution detection and safe fallbacks.
- Stabilize the toolchain: Lock compilers, frameworks, and hardware. If tools affect outputs, qualify or constrain them.
- Plan change control: Use strict configuration management. For any retrain, show unchanged behavior where required and bounded deltas where not.
- Design for explainability where it counts: Use interpretable features at interfaces, saliency checks, and rule-based monitors that can override or vote.
- Integrate runtime assurance: Health checks, confidence thresholds, and deterministic fallbacks that preserve safety objectives when the ML output is rejected or unavailable (a minimal sketch follows this list).
- Build the safety case continuously: Trace from requirements to data to tests to outcomes. Treat it as a living artifact, not an afterthought.
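For the split discipline above, here is a minimal Python sketch, assuming each sample carries a stable identifier; the 70/15/10/5 boundaries and the separate "assurance" bucket are illustrative choices, not values from any standard. Hashing the ID makes membership deterministic, so a sample can never migrate between splits across retrains.

```python
# Minimal sketch of purpose-built splits, assuming each sample has a stable ID.
# The split percentages and the "assurance" set are illustrative, not prescribed.
import hashlib

def assign_split(sample_id: str) -> str:
    """Deterministically map a sample ID to a split so membership never drifts."""
    bucket = int(hashlib.sha256(sample_id.encode()).hexdigest(), 16) % 100
    if bucket < 70:
        return "train"        # used to fit the model
    if bucket < 85:
        return "validation"   # used for model selection and tuning
    if bucket < 95:
        return "test"         # used for final developer-side evidence
    return "assurance"        # untouched until certification evidence is produced

splits = {sid: assign_split(sid) for sid in ["img_0001", "img_0002", "img_0003"]}
```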
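And for the runtime-assurance item, a sketch of the accept-or-defer logic, assuming a classifier that returns class probabilities plus some out-of-distribution score computed upstream; the thresholds, the Decision type, and the fallback label are hypothetical placeholders, not a prescribed architecture. The conservative fallback would come from the system-level safety assessment, not from the ML constituent itself.

```python
# Sketch of a runtime-assurance wrapper; thresholds and fallback are illustrative.
from dataclasses import dataclass
import numpy as np

@dataclass
class Decision:
    label: str
    source: str   # "ml" if the model's output was accepted, "fallback" otherwise

def runtime_assured_decision(probs: np.ndarray,
                             labels: list[str],
                             ood_score: float,
                             conf_threshold: float = 0.9,
                             ood_threshold: float = 0.5) -> Decision:
    """Accept the ML output only when it is confident and in-distribution;
    otherwise defer to a deterministic, pre-approved fallback."""
    top = int(np.argmax(probs))
    if ood_score < ood_threshold and probs[top] >= conf_threshold:
        return Decision(label=labels[top], source="ml")
    # Deterministic fallback: the conservative action defined by the system-level
    # safety assessment (here simply a placeholder label).
    return Decision(label="no_detection", source="fallback")
```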
Quick glossary for this debate
- Agent: An AI-driven software system that acts toward preset goals, often trained via reinforcement learning with rewards and penalties.
- Algorithm: The procedure used to train a model to spot patterns in data and make predictions.
- ARP4754: SAE guidance for civil aircraft and systems development; a recognized way to comply with system functional and safety regulations.
- ARP4761: SAE guidance for safety assessments on aircraft, systems, and equipment; a recognized way to comply with system safety regulations.
- ARP6983/ED-324: Draft SAE/EUROCAE process standard for AI in aeronautics; first release focuses on frozen supervised ML models.
- Artificial intelligence (AI): Machine-based systems performing tasks linked to human intelligence; also used as a catch-all for the techniques involved.
- Artificial neural network: An interconnected set of computational nodes that learns patterns from data and applies them to new inputs.
- Deep neural network: A neural network with multiple hidden layers, typical of deep learning.
- Deterministic: Produces the same output for the same input; possible with frozen ML models on specified hardware.
- Development assurance: Processes that build confidence systems meet intended functions without unsafe behavior.
- Development assurance level (DAL): Required rigor for a function or item, from A (catastrophic potential) to E (no safety effect).
- DO-178: RTCA guidance for the development assurance of airborne software.
- DO-254: RTCA guidance for the design assurance of airborne electronic hardware.
- Expert systems: Rule-based AI meant to emulate human expert decisions; prominent in the 1970s-1980s.
- Functional hazard assessment: Identification of functions and the effects of their loss or malfunction.
- G-34/WG-114: The SAE/EUROCAE group developing ARP6983/ED-324.
- Generative AI: Models that learn data patterns and produce new text, images, audio, or other content.
- Industry consensus standard: A collaboratively developed document of best practices or requirements for a field.
- Item: Hardware or software installed to meet a requirement within development assurance.
- Large language model: A text-trained ML model for natural language tasks; used in chatbots such as ChatGPT.
- Learning assurance: Activities proving training data quality/completeness and model performance on unseen data.
- Machine learning (ML): Algorithms that learn from data rather than fixed rules.
- ML constituent: A bounded set of hardware/software that includes at least one ML model, used to adapt assurance practices.
- ML model: A program trained via ML that maps inputs to predictions or decisions.
- Means of compliance: An accepted way to show a regulation has been met; standards can serve if regulators agree.
- Requirements: What an aircraft, system, or item must do; varies by level and type.
- S-18/WG-63: SAE/EUROCAE committee behind ARP4754B and ARP4761A, both widely recognized by regulators.
- Safety assessment: System/aircraft-level analysis defining safety objectives (preliminary) and verifying they're met (final).
- Supervised learning: ML trained on labeled data, e.g., images annotated as aircraft, birds, or drones.
- System: A function-oriented level of aircraft design; a system allocates requirements to the hardware and software items that implement them.
- Training data: The dataset used to train an ML model.
Bottom line
Certification will accept ML where you can freeze behavior, bound risk, and prove performance with data tied to safety objectives. If you build with that constraint set from day one, you'll move faster when the standard lands - and you'll avoid costly rework.