Refactoring leaderboard

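Aider's refactoring benchmark asks the LLM to refactor 89 large methods from large Python classes. It is a more challenging benchmark that tests a model's ability to output long chunks of code without skipping sections or making mistakes. It was designed to provoke and measure GPT-4 Turbo's "lazy coding" habit.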
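
To make the idea of "skipping sections" concrete, here is a minimal sketch of one way to flag placeholder comments that stand in for omitted code. The patterns and the `find_elisions` helper are hypothetical and purely illustrative; they are not the checks Aider's benchmark performs.

```python
import re

# Hypothetical patterns for comments that stand in for omitted code.
# These are illustrative guesses, not Aider's actual benchmark checks.
ELISION_PATTERNS = [
    # A comment that starts with an ellipsis, e.g. "# ..."
    re.compile(r"#\s*\.\.\."),
    # A comment such as "# rest of the code" or "# existing implementation"
    re.compile(
        r"#\s*(rest of|remaining|existing)\b.*\b(code|method|implementation)",
        re.IGNORECASE,
    ),
]


def find_elisions(source: str) -> list[int]:
    """Return 1-based line numbers that look like elided-code placeholders."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if any(pattern.search(line) for pattern in ELISION_PATTERNS):
            hits.append(lineno)
    return hits


sample = '''def process(self, items):
    total = 0
    # ... rest of the method unchanged ...
    return total
'''

print(find_elisions(sample))  # flags the placeholder comment on line 3
```

A text heuristic like this only catches the most obvious placeholders; it says nothing about whether the refactored code is actually correct.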

The refactoring benchmark requires a large context window to work with large source files, so results are available for fewer models.

| Model | Percent completed correctly | Percent using correct edit format | Command | Edit format |
|---|---|---|---|---|
| claude-3-5-sonnet-20241022 | 92.1% | 91.0% | aider --sonnet | diff |
| o1-preview | 75.3% | 57.3% | aider --model o1-preview | diff |
| claude-3-opus-20240229 | 72.3% | 79.5% | aider --opus | diff |
| claude-3.5-sonnet-20240620 | 64.0% | 76.4% | aider --sonnet | diff |
| gpt-4o | 62.9% | 53.9% | aider | diff |
| gpt-4-1106-preview | 50.6% | 39.3% | aider --model gpt-4-1106-preview | udiff |
| gpt-4o-2024-08-06 | 49.4% | 89.9% | aider --model openai/gpt-4o-2024-08-06 | diff |
| gemini/gemini-1.5-pro-latest | 49.4% | 7.9% | aider --model gemini/gemini-1.5-pro-latest | diff-fenced |
| o1-mini | 44.9% | 29.2% | aider --model o1-mini | diff |
| gpt-4-turbo-2024-04-09 (udiff) | 34.1% | 30.7% | aider --gpt-4-turbo | udiff |
| gpt-4-0125-preview | 33.7% | 47.2% | aider --model gpt-4-0125-preview | udiff |
| DeepSeek Coder V2 0724 (deprecated) | 32.6% | 59.6% | aider --model deepseek/deepseek-coder | diff |
| DeepSeek Chat V2.5 | 31.5% | 67.4% | aider --deepseek | diff |
| gpt-4-turbo-2024-04-09 (diff) | 21.4% | 6.8% | aider --model gpt-4-turbo-2024-04-09 | diff |
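
The leaderboard is ordered by the percentage of tasks completed correctly, but the edit-format column tells a different story for some models. As a small sketch, the snippet below copies a handful of rows from the table and re-ranks them by edit-format compliance; the `rank_by` helper and the tuple layout are assumptions made for illustration.

```python
# A few rows copied from the leaderboard:
# (model, percent completed correctly, percent using correct edit format)
rows = [
    ("claude-3-5-sonnet-20241022", 92.1, 91.0),
    ("o1-preview", 75.3, 57.3),
    ("claude-3-opus-20240229", 72.3, 79.5),
    ("gpt-4o-2024-08-06", 49.4, 89.9),
    ("gemini/gemini-1.5-pro-latest", 49.4, 7.9),
]


def rank_by(rows, column):
    """Return rows sorted from highest to lowest on the given column index."""
    return sorted(rows, key=lambda row: row[column], reverse=True)


# Column 2 is the edit-format percentage.
by_format = rank_by(rows, 2)

for model, correct, fmt in by_format:
    print(f"{model:30} correct={correct:5.1f}%  format={fmt:5.1f}%")
```

In this subset, gpt-4o-2024-08-06 moves from fourth place by correctness to second place by edit-format compliance, which shows that following the edit format reliably does not by itself translate into correct refactorings.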
