ImProver Ablations

Overview

There are over 8640 possible parameter combinations, and we aim to define ImProver so that its default specification is the optimal combination with respect to the Improvement score. To do this, we employ a factorial testing strategy, using the length metric, with the following testing groups (a code sketch of the staged search appears at the end of this overview):

  • GPT-4o-mini/GPT-4o: This varies the base model (GPT-4o-mini vs. GPT-4o), outputting a plain string with no other features enabled.
  • Output and CoS: We evaluate the effects of the different output formatting styles (string, string list, string tree) and of Chain-of-States (CoS) prompting (enabled vs. disabled), with the model fixed as GPT-4o and no other features enabled.
  • Example Retrieval: We evaluate the effect of increasing the number of retrieved examples (multi-shot prompting) over 0, 3, 5, 7, and 10 examples, with the model fixed as GPT-4o, CoS and output formatting fixed as the best combination from the previous test, and no other features enabled.
  • Sampling Method: Here we evaluate the effects of best-of-\(n\) sampling and iterative refinement for a fixed \(n = 5\). For the refinement cases, we additionally test whether forwarding only the most recent iteration's result or all previous iterations' results performs best, and whether to keep the best iterate across iterations or the most recent one (see the sketch after this list). The model is fixed as GPT-4o; CoS, output formatting, and examples are fixed as the best combination from the previous test; and no other features are enabled.
  • \(n\) and Model: Here we evaluate the effects of larger \(n\) values and different models. We test \(n = 3, 5, 7, 10, 15\) on GPT-4o and GPT-4o-mini, as well as \(n = 20\) on GPT-4o-mini (as it has a far lower token cost). CoS, output formatting, examples, and sampling method are fixed as the best combination from the previous test, and no other features are enabled.
  • Combos and RAG: We evaluate the combination methods refinement(best_of_m', m) and best_of_m'(refinement(m)) for \(m \neq m'\), with the product \(mm'\) equal to the optimal \(n\) from the previous test (both orders are illustrated in the sketch after this list). We also test the effect of enabling document retrieval (RAG). Model, CoS, output formatting, examples, \(n\), and sampling method are fixed as the best combination from the previous test.

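Since the sampling methods and their combinations recur throughout these groups, the sketch below makes them concrete. It is a minimal illustration only: the `propose` and `score` callables are hypothetical stand-ins for ImProver's LLM-generation and metric-scoring steps, not its actual interface.

```python
# Minimal sketch of the two sampling methods and their combinations.
# `propose(theorem, context)` and `score(candidate)` are hypothetical
# stand-ins for ImProver's generation and metric-scoring steps.

def refinement(theorem, rounds, propose, score,
               forward_all=True, keep_best=True):
    """Iteratively refine a candidate proof for `rounds` iterations.

    forward_all: feed all previous iterations back into the next
                 prompt, rather than only the most recent one.
    keep_best:   return the best-scoring iterate rather than the last.
    """
    history = []
    for _ in range(rounds):
        context = history if forward_all else history[-1:]
        history.append(propose(theorem, context))
    return max(history, key=score) if keep_best else history[-1]

def best_of_n(theorem, n, propose, score, context=()):
    """Sample n independent candidates and keep the highest-scoring one."""
    return max((propose(theorem, list(context)) for _ in range(n)), key=score)

# The two combination orders from the "Combos and RAG" group:

def refinement_of_best_of(theorem, m, m_prime, propose, score):
    """refinement(best_of_m', m): each of m refinement rounds keeps
    the best of m' sampled candidates."""
    step = lambda thm, ctx: best_of_n(thm, m_prime, propose, score, ctx)
    return refinement(theorem, m, step, score)

def best_of_refinement(theorem, m, m_prime, propose, score):
    """best_of_m'(refinement(m)): run m' independent m-round refinement
    chains and keep the best final proof."""
    chains = [refinement(theorem, m, propose, score) for _ in range(m_prime)]
    return max(chains, key=score)
```

Note that both combination orders spend the same budget of \(mm'\) generations, which is what makes them comparable to plain best-of-\(n\) at \(n = mm'\).
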
We evaluate our ablations on a subset of MIL (Mathematics in Lean). However, due to the increase in model calls for larger \(n\) values, we switch to a smaller representative sample of this subset for some test groups. Namely, the GPT-4o-mini/GPT-4o, Output and CoS, Example Retrieval, and Sampling Method groups are tested on the 133 theorems in the solutions of C03_Logic, C04_Sets_and_Functions, and C05_Elementary_Number_Theory; the \(n\) and Model group is tested on a representative sample of 55 of these theorems; and the Combos and RAG group is tested on a representative sample of 32 of them.
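
The overall selection procedure can likewise be summarized in code. This is a sketch only, assuming each candidate configuration is scored by its mean Improvement on the ablation set; the group definitions shown and the `evaluate` function are illustrative, not ImProver's actual implementation.

```python
from itertools import product

# Each stage searches only its own parameters; winners are frozen
# before the next stage runs, so the full grid of over 8640
# combinations is never enumerated. Group contents are illustrative.
TEST_GROUPS = [
    {"model": ["gpt-4o-mini", "gpt-4o"]},
    {"output_format": ["string", "string_list", "string_tree"],
     "cos": [True, False]},
    {"num_examples": [0, 3, 5, 7, 10]},
    # ... Sampling Method, n and Model, Combos and RAG
]

def staged_search(evaluate, groups, base_config):
    """Greedy stage-wise search. evaluate(config) -> Improvement score."""
    config = dict(base_config)
    for group in groups:
        keys = list(group)
        best_values, best_score = None, float("-inf")
        # Cartesian product over this group's parameters only.
        for values in product(*(group[k] for k in keys)):
            score = evaluate({**config, **dict(zip(keys, values))})
            if score > best_score:
                best_values, best_score = values, score
        config.update(dict(zip(keys, best_values)))  # fix the winners
    return config
```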

Results

Configuration         Improvement    Nonempty Improve.  Accuracy         Improved Acc.
GPT-4o-mini           0              0                  3.62%            0%
GPT-4o                7.03           19.67              35.77%           15.33%
+ Output and CoS      8.04 / 6.31    12.38 / 14.17      64.96% / 44.53%  21.17% / 16.06%
+ Example Retrieval   9.34 / 5.67    14.7 / 8.44        63.5% / 67.15%   21.9% / 16.79%
+ Sampling Method     15.35 / 9.34   18.44 / 14.7       83.21% / 63.5%   36.5% / 21.9%
+ \(n\) and Model     23.51 / 3.65   26.28 / 4.63       89.47% / 78.95%  45.61% / 8.77%
+ Combos and RAG      34.88 / 28.25  57.56 / 33.48      60.61% / 84.38%  54.55% / 53.12%
ImProver              34.88          57.56              100%             54.55%
Each cell reports best / worst values: the scores of the best and worst parameter combinations within each testing group, with respect to the Improvement score.

Readability Ablation

We additionally examine the effect of disabling CoS on readability-optimization tasks, since the study above focused on length optimization. We speculate that CoS has a high impact on readability performance: the proof states embedded by CoS appear to be critical for generating the explicit declarations that the readability metric measures.
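
As a hypothetical illustration (not drawn from the ablation data) of the structure at stake, compare a terse tactic proof with a rewrite that spells out its intermediate states, the kind of explicit declaration the readability metric is intended to reward:

```lean
-- Terse proof: intermediate states are implicit in the rewrite chain.
example (a b c : Nat) (h1 : a = b) (h2 : b = c) : a = c := by
  rw [h1, h2]

-- More "readable" form: each intermediate state is written out
-- explicitly, the kind of declaration that CoS-embedded proof states
-- appear to help the model produce.
example (a b c : Nat) (h1 : a = b) (h2 : b = c) : a = c :=
  calc a = b := h1
       _ = c := h2
```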

Configuration           Improvement  Nonempty Improve.  Accuracy  Improved Acc.
GPT-4o                  4.97         15.89              37.5%     12.5%
ImProver, CoS Disabled  9.23         24.61              100.0%    28.12%
ImProver                16.69        31.42              100.0%    46.88%

Raw data can be found and downloaded here.