Creating an LLM benchmark suite for specification inference

Background

Automated program synthesis has recently seen major advances through the use of machine learning techniques. Large language models (LLMs) are capable of generating functional code from natural language prompts. However, LLMs are probabilistic and offer no formal guarantees of the correctness or safety of the generated code. While progress in recent years has been impressive, the models still struggle with many difficult reasoning problems. One central way to measure the capabilities of models is through benchmarks of standardized problems, for example MMLU for general tasks or MathEval for mathematical reasoning. Such benchmarks make it possible to compare different models and to measure progress over time. Frameworks such as Inspect provide an easy way to develop new LLM evaluation benchmarks.
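As a rough illustration, a minimal Inspect task could look like the sketch below. The exact API details are assumptions based on Inspect's documented Python interface, and the sample, target, and scorer choice are purely illustrative:

```python
# Minimal sketch of an Inspect evaluation task. The parameter names
# (e.g. solver=) follow Inspect's documented API but may differ between versions.
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import includes
from inspect_ai.solver import generate, system_message

@task
def spec_inference_demo():
    # A single hypothetical sample: the model is asked to propose a
    # postcondition; includes() checks whether the expected predicate
    # appears in the model's answer.
    dataset = [
        Sample(
            input="Propose a postcondition for: def abs(x: int) -> int: ...",
            target="result >= 0",
        )
    ]
    return Task(
        dataset=dataset,
        solver=[system_message("You are a verification assistant."), generate()],
        scorer=includes(),
    )

# The task can then be run with the `inspect eval` CLI or inspect_ai.eval(),
# given a configured model.
```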

Static program verification tools make it possible to establish formal guarantees for important program properties such as correctness and memory safety. Such tools usually trade off the level of automation against the strength of the results. A common middle ground is to automate the proof search while requiring developers to provide specifications manually. Providing these specifications can be expensive and difficult, limiting the popularity and practicality of such verification tools. It is therefore a natural question whether LLMs can generate these specifications automatically. This problem is usually called specification inference. While current models are often capable of inferring specifications for simple programs, they struggle to produce specifications for longer programs with more complex behavior. A benchmark consisting of inference tasks would thus offer a valuable measure of LLM capabilities.
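To make the task concrete, consider the following small Python function. The contract comments are a hypothetical example of the kind of precondition, postcondition, and loop invariant an LLM would be asked to infer; the notation is illustrative and not tied to any particular verifier:

```python
def sum_to(n: int) -> int:
    # Precondition the model should infer:  n >= 0
    # Postcondition the model should infer: result == n * (n + 1) // 2
    total = 0
    i = 0
    while i <= n:
        # Loop invariant the model should infer:
        #   0 <= i <= n + 1  and  total == i * (i - 1) // 2
        total += i
        i += 1
    return total
```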

Goals

This project aims to develop a benchmark that measures an LLM's ability to infer specifications for program verification. The benchmark should cover a variety of program verifiers (including but not limited to Viper and its front-ends) and a variety of inference tasks, for example (a sketch of how such tasks might be represented follows the list):

  • Finding memory-safety specifications for code without any specifications
  • Finding functional correctness specifications for programs with only high-level correctness specifications
  • Fixing incorrect specifications
  • Generating and applying required lemmas
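One possible way to represent such tasks in a benchmark suite is sketched below. The schema, field names, and task categories are assumptions made for illustration rather than a fixed design:

```python
# Hypothetical schema for a single benchmark instance; names and fields
# are illustrative assumptions, not a fixed design.
from dataclasses import dataclass, field
from enum import Enum, auto

class TaskKind(Enum):
    MEMORY_SAFETY_SPECS = auto()      # add specs proving memory safety
    FUNCTIONAL_CORRECTNESS = auto()   # refine high-level specs into full contracts
    FIX_INCORRECT_SPECS = auto()      # repair specifications that do not verify
    LEMMA_GENERATION = auto()         # generate and apply auxiliary lemmas

@dataclass
class SpecInferenceTask:
    task_id: str
    kind: TaskKind
    verifier: str                              # e.g. "viper" or one of its front-ends
    program: str                               # source code, possibly partially specified
    reference_solution: str | None = None      # a known verifying annotation, if any
    notes: list[str] = field(default_factory=list)
```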

It should also test different prompting scenarios, such as zero-shot, few-shot, and an interactive mode in which the LLM can query the verifier and receive feedback on its attempts.
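A rough sketch of such an interactive loop is shown below. Here run_verifier and ask_model are hypothetical placeholders standing in for an actual verifier invocation and an LLM call, since the concrete interfaces are not specified in this description:

```python
# Sketch of an interactive verification loop. run_verifier and ask_model are
# hypothetical placeholders for a real verifier backend and an LLM client.
def run_verifier(program: str) -> tuple[bool, str]:
    """Return (verified, error_report) for an annotated program (placeholder)."""
    raise NotImplementedError

def ask_model(prompt: str) -> str:
    """Return the model's proposed annotated program (placeholder)."""
    raise NotImplementedError

def interactive_inference(program: str, max_rounds: int = 5) -> str | None:
    prompt = f"Add specifications so that this program verifies:\n{program}"
    for _ in range(max_rounds):
        candidate = ask_model(prompt)
        verified, report = run_verifier(candidate)
        if verified:
            return candidate
        # Feed the verifier's error report back to the model and retry.
        prompt = (
            f"The verifier rejected the following attempt:\n{candidate}\n"
            f"Errors:\n{report}\nPlease revise the specifications."
        )
    return None
```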

Prerequisites

Experience in using LLMs and program verification would be beneficial.

Contact

Nicolas Klose (Programming Methodology Group)