Experiments and Results

Dataset Evaluations and Main Results

Explore our experiments to see how ImProver optimizes formal proofs across a variety of datasets spanning educational, research, and competition mathematics.

Ablations

Explore our ablation tests, which identify the best parameter configuration for ImProver against a GPT-4o baseline.

Neural Theorem Proving

Explore our evaluation of ImProver, in its best parameter configuration, on neural theorem proving tasks.

Performance Metrics

We define four performance metrics that measure aspects of correctness and metric improvement: Improvement, Nonempty Improvement, Accuracy, and Improved Accuracy. Improvement is the expected improvement in metric score from input to output, counting erroneous generations as zero improvement. Nonempty Improvement is the expected improvement in metric score conditioned on the generation containing no errors. Accuracy is the percentage of theorems in the dataset for which the model generated a correct output. Improved Accuracy is the percentage of theorems for which the model generated a correct output and achieved a nonzero metric improvement.
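The four metrics can be sketched in code. This is a minimal illustration, not ImProver's actual evaluation harness: the `Generation` record type and the convention of marking an erroneous generation with `output_score = None` are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Generation:
    """One model output for a theorem (hypothetical record format)."""
    input_score: float             # metric score of the original proof
    output_score: Optional[float]  # metric score of the output; None if generation had errors

def improvement(gens: List[Generation]) -> float:
    # Expected improvement, counting erroneous generations as zero improvement.
    deltas = [(g.output_score - g.input_score) if g.output_score is not None else 0.0
              for g in gens]
    return sum(deltas) / len(deltas)

def nonempty_improvement(gens: List[Generation]) -> float:
    # Expected improvement over error-free generations only.
    deltas = [g.output_score - g.input_score for g in gens if g.output_score is not None]
    return sum(deltas) / len(deltas) if deltas else 0.0

def accuracy(gens: List[Generation]) -> float:
    # Fraction of theorems with a correct (error-free) output.
    return sum(g.output_score is not None for g in gens) / len(gens)

def improved_accuracy(gens: List[Generation]) -> float:
    # Fraction of theorems with a correct output and a strictly positive improvement.
    return sum(g.output_score is not None and g.output_score > g.input_score
               for g in gens) / len(gens)
```

Note that Accuracy bounds Improved Accuracy from above, and Nonempty Improvement is always at least as large as Improvement whenever all improvements are nonnegative, since the latter averages in zeros for failed generations.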