Refactoring leaderboard
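Aider's refactoring benchmark asks the LLM to refactor 89 large methods taken from large Python classes. It is a more challenging benchmark that tests a model's ability to output long chunks of code without skipping sections or making mistakes. It was designed to provoke and measure GPT-4 Turbo's "lazy coding" habit.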
The refactoring benchmark requires a large context window to handle large source files, so fewer models can be tested on it.
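To make "lazy coding" concrete, here is a minimal sketch of how skipped code might be flagged in a refactored method. It is not Aider's actual scoring logic. The names `PLACEHOLDER`, `count_nodes`, `looks_lazy`, and the `min_ratio` threshold are all hypothetical, chosen only for illustration. The idea is that a faithful refactor moves code around without discarding it, so the output should contain no placeholder comments and should keep roughly the same number of syntax tree nodes as the original.

```python
import ast
import re

# Hypothetical patterns for placeholder comments such as "# ... rest of method".
# These are illustrative assumptions, not patterns taken from Aider.
PLACEHOLDER = re.compile(r"#\s*\.\.\.|#.*\b(?:rest of|remaining code)\b", re.IGNORECASE)


def count_nodes(source: str) -> int:
    """Count every AST node in a piece of Python source."""
    return sum(1 for _ in ast.walk(ast.parse(source)))


def looks_lazy(original: str, refactored: str, min_ratio: float = 0.9) -> bool:
    """Return True if the refactored code appears to have skipped content.

    The check is heuristic: it flags placeholder comments, and it flags
    output whose AST is much smaller than the original's. The 0.9 default
    is an arbitrary assumption.
    """
    if PLACEHOLDER.search(refactored):
        return True
    return count_nodes(refactored) < min_ratio * count_nodes(original)
```

A call such as `looks_lazy(original_source, model_output)` would then separate outputs that elide code from outputs that keep it, although a real harness would also need to confirm that the refactored code still runs correctly.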
Model | Percent completed correctly | Percent using correct edit format | Command | Edit format |
---|---|---|---|---|
claude-3-5-sonnet-20241022 | 92.1% | 91.0% | aider --sonnet | diff |
o1-preview | 75.3% | 57.3% | aider --model o1-preview | diff |
claude-3-opus-20240229 | 72.3% | 79.5% | aider --opus | diff |
claude-3.5-sonnet-20240620 | 64.0% | 76.4% | aider --sonnet | diff |
gpt-4o | 62.9% | 53.9% | aider | diff |
gpt-4-1106-preview | 50.6% | 39.3% | aider --model gpt-4-1106-preview | udiff |
gpt-4o-2024-08-06 | 49.4% | 89.9% | aider --model openai/gpt-4o-2024-08-06 | diff |
gemini/gemini-1.5-pro-latest | 49.4% | 7.9% | aider --model gemini/gemini-1.5-pro-latest | diff-fenced |
o1-mini | 44.9% | 29.2% | aider --model o1-mini | diff |
gpt-4-turbo-2024-04-09 (udiff) | 34.1% | 30.7% | aider --gpt-4-turbo | udiff |
gpt-4-0125-preview | 33.7% | 47.2% | aider --model gpt-4-0125-preview | udiff |
DeepSeek Coder V2 0724 (deprecated) | 32.6% | 59.6% | aider --model deepseek/deepseek-coder | diff |
DeepSeek Chat V2.5 | 31.5% | 67.4% | aider --deepseek | diff |
gpt-4-turbo-2024-04-09 (diff) | 21.4% | 6.8% | aider --model gpt-4-turbo-2024-04-09 | diff |
By Paul Gauthier, last updated April 12, 2025.