ImProver Aggregate Results

Overview

We evaluate ImProver on subsets of the following datasets.

Mathematics in Lean (MIL): This dataset contains pedagogical solutions of common undergraduate-level exercises, and as such contains many readable, yet verbose and inefficient proofs. We use exercise solutions from set theory, elementary number theory, group theory, topology, differential calculus, and integration & measure theory. This dataset contains theorems at an undergraduate-level of complexity. For our main results, we evaluated on 72 theorems from exercise solutions from MIL chapters 4,5,8,9, and 10.
Compfiles: Solutions of International Mathematics Olympiad (IMO) and American Mathematics Olympiad (USAMO) competition problems from 2016 to 2024. This is a dataset of internationally-renowned competitive math problems, many of which are readable, yet quite verbose. This dataset contains theorems of a competitive format, and although they contain concepts only at a high-school level, the logical complexity of internationally-renowned competition results is far above that. For our main results, we used all 26 theorems and lemmas from the Compfiles database of complete solutions to the International Mathematics Olympiad (IMO) and the American Mathematics Olympiad (USAMO) from 2016-2024.
Mathlib: Mathlib contains many advanced results at the forefront of mathematics, and has been at the center of research-level formalizations. These proofs are extremely efficient, concise, and generalized - which often comes at the cost of readability and understandability. These results and theorems often are at the cutting edge of research.

Results

Metric	Model	Improvement	Nonempty Improvement	Accuracy	Improved Acc.
Length	GPT-4o	3.7	15.15	26.36%	8.31%
Length	ImProver	20.96	55.29	100.0%	35.44%
Readability	GPT-4o	2.21	8.02	18.75%	6.13%
Readability	ImProver	9.34	30.53	100.0%	24.56%

MIL

Metric	Model	Improvement	Nonempty Improvement	Accuracy	Improved Acc.
Length	GPT-4o	6.25	18.58	37.5%	14.42%
Length	ImProver	30.54	56.56	100.0%	50.0%
Readability	GPT-4o	4.18	14.48	28.85%	11.54%
Readability	ImProver	13.45	30.97	100.0%	34.21%

Compfiles

Metric	Model	Improvement	Nonempty Improvement	Accuracy	Improved Acc.
Length	GPT-4o	2.75	30.7	11.54%	5.13%
Length	ImProver	18.86	54.48	100.0%	34.62%
Readability	GPT-4o	0.39	3.38	14.1%	1.28%
Readability	ImProver	5.74	24.89	100.0%	19.23%

Mathlib

Metric	Model	Improvement	Nonempty Improvement	Accuracy	Improved Acc.
Length	GPT-4o	0.0	0.0	16.67%	0.0%
Length	ImProver	6.19	53.65	100.0%	11.54%
Readability	GPT-4o	0.0	0.0	4.65%	0.0%
Readability	ImProver	4.63	33.19	100.0%	11.63%

Raw data can be found and downloaded here.