Developing Formally Proven Code with AI
May 26, 2026
Blog
Developing formally proven code is challenging. It is also extraordinarily valuable: it produces software with a level of robustness that is otherwise unreachable, and you can have very high confidence that your code will not fail over time. But there is no way around it: it’s hard work.
You often find yourself reasoning alongside the prover, explaining why something is correct, and occasionally discovering, sometimes late in the process, that it isn’t. Writing software is easy. Writing correct software is not.
The prover is the ultimate code reviewer. Pushing you to make sure your code is provably correct.
Over the years, tools have made this work much more approachable and closer to everyday engineering practice. With some practice and patience, formal proof becomes a realistically achievable task for many engineers. This is a major breakthrough in software engineering over the past few years.
Now comes the next challenge: maintenance.
In theory, formally proven code should be ideal from a maintenance standpoint. Wouldn’t you want to immediately know which potential failures a change might introduce into previously verified software? In practice, it can be more complicated. The provers themselves evolve over time: many things become easier for them, but some become harder. The code itself may still be correct, but the prover may be using a different set of heuristics to run the demonstration, and thus may no longer be able to establish the proof. Worse, small changes can introduce subtle proof regressions that are difficult and time-consuming to track down.
For high-integrity software, every bit of effort is usually worth it, whether the goal is to protect lives, large financial systems, or critical infrastructure. The intellectual challenge is real and often rewarding. Over the long run, however, it can also be taxing.
Seven years ago, I worked on formally proving memory operations, in particular secure buffer movement (see https://www.adacore.com/blog/proving-memory-operations-a-spark-journey). It was a very satisfying experience. I knew the code was correct, and I knew that as long as it never changed, the proof would not need to be revisited. At the same time, I was aware that the proof relied on a number of demonstrations embedded in the code, some of them more brittle than others, and that these might not age well.
For years, I wanted to revisit that code: simplify the proofs and migrate them to a newer prover. And for years, I postponed it. It required time I didn’t quite have, and it was hard to justify spending days revisiting code that was already known to be correct.
Fast forward to today. The code is still correct. It has been proven. But it no longer proves with the latest prover. I know exactly what to do: run the proof, identify the failing checks, retry with longer timeouts, and, where necessary, add or adjust assertions to guide the prover. This is a deterministic process. In this case, I know the code is correct, but it still represents several days of focused work. Time I don’t have.
Enter Agentic AI.
I recently started using Agentic AI for development, and it’s been a surprisingly compelling experience: interacting with a very capable environment entirely through the command line. So I decided to try it on this specific problem. I downloaded the code, installed the tools, and provided the agent with the codebase and the original blog post. After a few back-and-forth exchanges on the general strategy, it got to work.
What makes this particularly effective is the amount of structure and context available. SPARK is a highly expressive language that allows rich specifications directly in the code. In this case, the code already contained much of the reasoning needed for the proof: not just what correctness means (via specifications), but why it holds (via assertions). On top of that, every modification can be validated by rerunning the prover, which precisely identifies what no longer holds. For each line of code, the agent can therefore access a large amount of semantic and formal feedback and iterate automatically.
It worked. Out of the box. With only a few high-level hints, the agent ran for about an hour and completed the necessary fixes. The result is visible in this commit:
https://github.com/AdaCore/SPARK_memory/commit/cea2e16bdaf0a62a3f0346476094983788a6af27
Pushing a bit further, I asked the agent to reduce proof effort, from level 4 with 600-second timeouts to level 2 with 120-second timeouts. That led to this follow-up commit:
https://github.com/AdaCore/SPARK_memory/commit/3138bb07e259ce1f7dbef9126c8703d5172daed6
While these sessions may look long, most of the time is not spent on AI reasoning but on running the prover itself, which is arguably much cheaper than large-scale LLM inference. There is probably a broader lesson here: combining deterministic tools with AI is a promising path forward, where AI replaces some of the more tedious human tasks, while traditional automation remains the default for many repetitive operations.
From there, additional confidence-building steps remain: checking code generation, validating runtime behavior, considering the possibility of prover issues, and so on. The advantage of SPARK is that the proof artifacts can also be checked dynamically during testing. Assertions used for formal proof can be exercised at runtime with test vectors, ensuring that both the behavior and the assumptions hold in practice. Here again, AI can help with the repetitive parts: generating test suites, driving fuzzing campaigns, and so on. At the end of the process, there is very little room left for residual errors, all achieved with a few prompts and a few hours of background computation, compared to the potentially enormous cost of defects escaping into the field.
There is a broader point behind this story. One of the main concerns with AI-generated or AI-modified code is correctness. But the tools already exist to address this. The choice of programming language matters. Writing SPARK is harder than writing Python, just as writing formally specified code is harder than writing informal code, but when the extra rigor is required, the payoff is significant. And today, this level of rigor is more accessible than ever.
There are many directions this can go: maintaining existing formally proven code, upgrading Ada codebases to SPARK, or even translating C or C++ into SPARK. Formal proof became an industrial reality over a decade ago. AI is now removing one of the remaining obstacles to making it an industry standard.
Quentin Ochem is the Chief Product Officer at AdaCore, overseeing product management. His involvement with AdaCore began in 2002 during his school years, officially joining in 2005 to work on IDE and cross-language bindings. Quentin has a background in software engineering, particularly in high-integrity domains like avionics and defense. His roles expanded to include training and technical sales, leading him to build the technical sales department and global product management in the US. In 2021, he stepped into his current role, steering the company’s strategic initiatives. Quentin holds a master's degree in Computer Engineering from Polytech Marseille, awarded in 2005.