ImProver Aggregate Results

Overview

We evaluate ImProver on subsets of the following datasets.

  • Mathematics in Lean (MIL): This dataset contains pedagogical solutions of common undergraduate-level exercises, and as such contains many readable, yet verbose and inefficient proofs. We use exercise solutions from set theory, elementary number theory, group theory, topology, differential calculus, and integration & measure theory. This dataset contains theorems at an undergraduate-level of complexity. For our main results, we evaluated on 72 theorems from exercise solutions from MIL chapters 4,5,8,9, and 10.
  • Compfiles: Solutions of International Mathematics Olympiad (IMO) and American Mathematics Olympiad (USAMO) competition problems from 2016 to 2024. This is a dataset of internationally-renowned competitive math problems, many of which are readable, yet quite verbose. This dataset contains theorems of a competitive format, and although they contain concepts only at a high-school level, the logical complexity of internationally-renowned competition results is far above that. For our main results, we used all 26 theorems and lemmas from the Compfiles database of complete solutions to the International Mathematics Olympiad (IMO) and the American Mathematics Olympiad (USAMO) from 2016-2024.
  • Mathlib: Mathlib contains many advanced results at the forefront of mathematics, and has been at the center of research-level formalizations. These proofs are extremely efficient, concise, and generalized - which often comes at the cost of readability and understandability. These results and theorems often are at the cutting edge of research.

Results

Metric Model Improvement Nonempty Improvement Accuracy Improved Acc.
Length GPT-4o 3.7 15.15 26.36% 8.31%
Length ImProver 20.96 55.29 100.0% 35.44%
Readability GPT-4o 2.21 8.02 18.75% 6.13%
Readability ImProver 9.34 30.53 100.0% 24.56%
MIL
Metric Model Improvement Nonempty Improvement Accuracy Improved Acc.
Length GPT-4o 6.25 18.58 37.5% 14.42%
Length ImProver 30.54 56.56 100.0% 50.0%
Readability GPT-4o 4.18 14.48 28.85% 11.54%
Readability ImProver 13.45 30.97 100.0% 34.21%
Compfiles
Metric Model Improvement Nonempty Improvement Accuracy Improved Acc.
Length GPT-4o 2.75 30.7 11.54% 5.13%
Length ImProver 18.86 54.48 100.0% 34.62%
Readability GPT-4o 0.39 3.38 14.1% 1.28%
Readability ImProver 5.74 24.89 100.0% 19.23%
Mathlib
Metric Model Improvement Nonempty Improvement Accuracy Improved Acc.
Length GPT-4o 0.0 0.0 16.67% 0.0%
Length ImProver 6.19 53.65 100.0% 11.54%
Readability GPT-4o 0.0 0.0 4.65% 0.0%
Readability ImProver 4.63 33.19 100.0% 11.63%
Raw data can be found and downloaded here.