Code editing leaderboard
This old aider code editing leaderboard has been replaced by the new, more challenging polyglot leaderboard.
Aider's code editing benchmark asks the LLM to edit Python source files to complete 133 small coding exercises from Exercism. This measures the LLM's coding ability, and whether it can write new code that integrates into existing code. The model also has to successfully apply all of its changes to the source files without human intervention.
Model | Percent correct | Percent using correct edit format | Command | Edit format |
---|---|---|---|---|
o1 | 84.2% | 99.2% | aider --model openrouter/openai/o1 | diff |
claude-3-5-sonnet-20241022 | 84.2% | 99.2% | aider --model anthropic/claude-3-5-sonnet-20241022 | diff |
gemini-exp-1206 (whole) | 80.5% | 100.0% | aider --model gemini/gemini-exp-1206 | whole |
o1-preview | 79.7% | 93.2% | aider --model o1-preview | diff |
claude-3.5-sonnet-20240620 | 77.4% | 99.2% | aider --model claude-3.5-sonnet-20240620 | diff |
claude-3-5-haiku-20241022 | 75.2% | 95.5% | aider --model anthropic/claude-3-5-haiku-20241022 | diff |
ollama/qwen2.5-coder:32b | 72.9% | 100.0% | aider --model ollama/qwen2.5-coder:32b | whole |
DeepSeek Coder V2 0724 | 72.9% | 97.7% | aider --model deepseek/deepseek-coder | diff |
gpt-4o-2024-05-13 | 72.9% | 96.2% | aider | diff |
DeepSeek-V2.5-1210 | 72.2% | 99.2% | aider --model deepseek/deepseek-chat | diff |
openai/chatgpt-4o-latest | 72.2% | 97.0% | aider --model openai/chatgpt-4o-latest | diff |
DeepSeek V2.5 | 72.2% | 96.2% | aider --deepseek | diff |
gpt-4o-2024-11-20 | 71.4% | 99.2% | aider --model openai/gpt-4o-2024-11-20 | diff |
Qwen2.5-Coder-32B-Instruct | 71.4% | 94.7% | aider --model openai/hf:Qwen/Qwen2.5-Coder-32B-Instruct --openai-api-base https://glhf.chat/api/openai/v1 | diff |
gpt-4o-2024-08-06 | 71.4% | 98.5% | aider --model openai/gpt-4o-2024-08-06 | diff |
o1-mini (whole) | 70.7% | 90.0% | aider --model o1-mini | whole |
gemini-2.0-flash-exp | 69.9% | 97.0% | aider --model gemini/gemini-2.0-flash-exp | diff |
DeepSeek Chat V2 0628 | 69.9% | 97.7% | aider --model deepseek/deepseek-chat | diff |
gemini-exp-1206 (diff) | 69.2% | 84.2% | aider --model gemini/gemini-exp-1206 | diff |
Qwen2.5-Coder-14B-Instruct | 69.2% | 100.0% | aider --model openai/Qwen2.5-Coder-14B-Instruct | whole |
claude-3-opus-20240229 | 68.4% | 100.0% | aider --opus | diff |
gpt-4-0613 | 67.7% | 100.0% | aider -4 | diff |
Dracarys2-72B-Instruct | 66.9% | 100.0% | (via glhf.chat) | whole |
gemini-1.5-pro-exp-0827 | 66.9% | 94.7% | aider --model gemini/gemini-1.5-pro-exp-0827 | diff-fenced |
llama-3.1-405b-instruct (whole) | 66.2% | 100.0% | aider --model openrouter/meta-llama/llama-3.1-405b-instruct | whole |
gpt-4-0314 | 66.2% | 93.2% | aider --model gpt-4-0314 | diff |
gpt-4-0125-preview | 66.2% | 97.7% | aider --model gpt-4-0125-preview | udiff |
yi-lightning | 65.4% | 97.0% | aider --model openai/yi-lightning | whole |
openrouter/qwen/qwen-2.5-coder-32b-instruct | 65.4% | 84.2% | aider --model openrouter/qwen/qwen-2.5-coder-32b-instruct | diff |
Mistral Large (2411) | 65.4% | 96.2% | aider --model mistral/mistral-large-latest | diff |
gemini-1.5-pro-002 | 65.4% | 96.2% | aider --model gemini/gemini-1.5-pro-002 | diff-fenced |
qwen-2.5-72b-instruct (bf16) | 65.4% | 96.2% | aider --model openrouter/qwen/qwen-2.5-72b-instruct | diff |
gpt-4-1106-preview | 65.4% | 92.5% | aider --model gpt-4-1106-preview | udiff |
ollama/Qwen2.5.1-Coder-7B-Instruct-GGUF:Q8_0-32k | 63.9% | 100.0% | aider --model ollama/Qwen2.5.1-Coder-7B-Instruct-GGUF:Q8_0-32k | whole |
nousresearch/hermes-3-llama-3.1-405b | 63.9% | 100.0% | aider --model openrouter/nousresearch/hermes-3-llama-3.1-405b | whole |
llama-3.1-405b-instruct (diff) | 63.9% | 92.5% | aider --model openrouter/meta-llama/llama-3.1-405b-instruct | diff |
gpt-4-turbo-2024-04-09 (udiff) | 63.9% | 97.0% | aider --gpt-4-turbo | udiff |
ollama/qwen2.5-coder:14b | 61.7% | 98.5% | aider --model ollama/qwen2.5-coder:14b | whole |
o1-mini | 61.1% | 100.0% | aider --model o1-mini | diff |
gemini-exp-1114 | 60.9% | 85.7% | aider --model gemini/gemini-exp-1114 | diff |
Mistral Large 2 (2407) | 60.2% | 100.0% | aider --model mistral/mistral-large-2407 | whole |
llama-3.3-70b-instruct | 59.4% | 88.7% | aider --model openrouter/meta-llama/llama-3.3-70b-instruct | diff |
ollama/qwen2.5:32b-instruct-q8_0 | 58.6% | 100.0% | aider --model ollama/qwen2.5:32b-instruct-q8_0 | whole |
Grok-2 | 58.6% | 98.5% | aider --model openrouter/x-ai/grok-2 | whole |
llama-3.1-70b-instruct | 58.6% | 100.0% | aider --model fireworks_ai/accounts/fireworks/models/llama-v3p1-70b-instruct | whole |
gemini-exp-1121 | 57.9% | 83.5% | aider --model gemini/gemini-exp-1121 | diff |
Qwen2.5-Coder-7B-Instruct | 57.9% | 100.0% | aider --model openai/Qwen2.5-Coder-7B-Instruct | whole |
gpt-3.5-turbo-0301 | 57.9% | 100.0% | aider --model gpt-3.5-turbo-0301 | whole |
gpt-4-turbo-2024-04-09 (diff) | 57.6% | 100.0% | aider --model gpt-4-turbo-2024-04-09 | diff |
gemini-1.5-pro-001 | 57.1% | 87.2% | aider --model gemini/gemini-1.5-pro-latest | diff-fenced |
gpt-3.5-turbo-1106 | 56.1% | 100.0% | aider --model gpt-3.5-turbo-1106 | whole |
gpt-4o-mini | 55.6% | 100.0% | aider --model gpt-4o-mini | whole |
Qwen2 72B Instruct | 55.6% | 100.0% | aider --model together_ai/qwen/Qwen2-72B-Instruct | whole |
Llama-3.1-Nemotron-70B-Instruct-HF | 54.9% | 99.2% | (via glhf.chat) | whole |
Grok-2-mini | 54.9% | 100.0% | aider --model openrouter/x-ai/grok-2-mini | whole |
claude-3-sonnet-20240229 | 54.9% | 100.0% | aider --sonnet | whole |
Nova Pro | 54.1% | 100.0% | aider --model bedrock/us.amazon.nova-pro-v1:0 | whole |
ollama/qwen2.5:32b | 54.1% | 100.0% | aider --model ollama/qwen2.5:32b | whole |
Yi Coder 9B Chat | 54.1% | 100.0% | aider --model openai/hf:01-ai/Yi-Coder-9B-Chat --openai-api-base https://glhf.chat/api/openai/v1 | whole |
gemini-1.5-flash-exp-0827 | 52.6% | 100.0% | aider --model gemini/gemini-1.5-flash-exp-0827 | whole |
qwen2.5-coder:7b-instruct-q8_0 | 51.9% | 100.0% | aider --model ollama/qwen2.5-coder:7b-instruct-q8_0 | whole |
gemini-1.5-flash-002 (0924) | 51.1% | 100.0% | aider --model gemini/gemini-1.5-flash-002 | whole |
codestral-2405 | 51.1% | 100.0% | aider --model mistral/codestral-2405 | whole |
gpt-3.5-turbo-0613 | 50.4% | 100.0% | aider --model gpt-3.5-turbo-0613 | whole |
gpt-3.5-turbo-0125 | 50.4% | 100.0% | aider -3 | whole |
qwen2:72b-instruct-q8_0 | 49.6% | 100.0% | aider --model ollama/qwen2:72b-instruct-q8_0 | whole |
llama3-70b-8192 | 49.2% | 73.5% | aider --model groq/llama3-70b-8192 | diff |
Codestral-22B-v0.1-Q4_K_M | 48.1% | 100.0% | aider --model Codestral-22B-v0.1-Q4_K_M | whole |
codestral:22b-v0.1-q8_0 | 48.1% | 100.0% | aider --model ollama/codestral:22b-v0.1-q8_0 | whole |
claude-3-haiku-20240307 | 47.4% | 100.0% | aider --model claude-3-haiku-20240307 | whole |
ollama/codestral | 45.9% | 98.5% | aider --model ollama/codestral | whole |
yi-coder:9b-chat-q4_0 | 45.1% | 100.0% | aider --model ollama/yi-coder:9b-chat-q4_0 | whole |
gemini-1.5-flash-latest | 44.4% | 100.0% | aider --model gemini/gemini-1.5-flash-latest | whole |
WizardLM-2 8x22B | 44.4% | 100.0% | aider --model openrouter/microsoft/wizardlm-2-8x22b | whole |
ollama/yi-coder:9b-chat-fp16 | 43.6% | 99.2% | aider --model ollama/yi-coder:9b-chat-fp16 | whole |
Reflection-70B | 42.1% | 100.0% | (not currently supported) | whole |
Qwen2.5-Coder-3B-Instruct | 39.1% | 100.0% | aider --model openai/Qwen2.5-Coder-3B-Instruct | whole |
ollama/mistral-small | 38.3% | 99.2% | aider --model ollama/mistral-small | whole |
gemini-1.5-flash-8b-exp-0924 | 38.3% | 100.0% | aider --model gemini/gemini-1.5-flash-8b-exp-0924 | whole |
Command R (08-24) | 38.3% | 100.0% | aider --model command-r-08-2024 | whole |
Command R+ (08-24) | 38.3% | 100.0% | aider --model command-r-plus-08-2024 | whole |
gemini-1.5-flash-8b-exp-0827 | 38.3% | 100.0% | aider --model gemini/gemini-1.5-flash-8b-exp-0827 | whole |
llama-3.1-8b-instruct | 37.6% | 100.0% | aider --model fireworks_ai/accounts/fireworks/models/llama-v3p1-8b-instruct | whole |
qwen1.5-110b-chat | 37.6% | 100.0% | aider --model together_ai/qwen/qwen1.5-110b-chat | whole |
gemma2:27b-instruct-q8_0 | 36.1% | 100.0% | aider --model ollama/gemma2:27b-instruct-q8_0 | whole |
codeqwen:7b-chat-v1.5-q8_0 | 34.6% | 100.0% | aider --model ollama/codeqwen:7b-chat-v1.5-q8_0 | whole |
ollama/mistral-nemo:12b-instruct-2407-q4_K_M | 33.1% | 100.0% | aider --model ollama/mistral-nemo:12b-instruct-2407-q4_K_M | whole |
ollama/codegeex4 | 32.3% | 97.0% | aider --model ollama/codegeex4 | whole |
Qwen2.5-Coder-1.5B-Instruct | 31.6% | 100.0% | aider --model openai/Qwen2.5-Coder-1.5B-Instruct | whole |
command-r-plus | 31.6% | 100.0% | aider --model command-r-plus | whole |
ollama/hermes3:8b-llama3.1-fp16 | 30.1% | 98.5% | aider --model ollama/hermes3:8b-llama3.1-fp16 | whole |
ollama/wojtek/opencodeinterpreter:6.7b | 30.1% | 91.0% | aider --model ollama/wojtek/opencodeinterpreter:6.7b | whole |
o1-mini-2024-09-12 | 27.1% | 95.6% | aider --model o1-mini | whole |
ollama/tulu3 | 26.3% | 100.0% | aider --model ollama/tulu3 | whole |
ollama/llama3.2:3b-instruct-fp16 | 26.3% | 97.0% | aider --model ollama/llama3.2:3b-instruct-fp16 | whole |
ollama/hermes3 | 22.6% | 98.5% | aider --model ollama/hermes3 | whole |
ollama/granite3-dense:8b | 20.3% | 78.9% | aider --model ollama/granite3-dense:8b | whole |
Qwen2.5-Coder-0.5B-Instruct | 14.3% | 100.0% | aider --model openai/Qwen2.5-Coder-0.5B-Instruct | whole |
Notes on benchmark results
The key benchmark metrics are:
- Percent correct - the percentage of coding tasks that the LLM completed successfully. To complete a task, the LLM must solve the programming exercise and edit the code to implement that solution.
- Percent using correct edit format - the percentage of coding tasks in which the LLM complied with the edit format specified in the system prompt. If the LLM makes edit errors, aider provides feedback and asks for a fixed copy of the edit. The best performing models can reliably conform to the edit format without making errors.
Notes on the edit format
Aider uses different "edit formats" to collect code edits from different LLMs:
- The "whole" format is the easiest for an LLM to use, but it uses a lot of tokens and may limit how large a file can be edited.
- Models that can use one of the diff formats are much more efficient, using far fewer tokens.
- Models that use a diff-like format can edit larger files at lower cost and without hitting token limits.
Aider is configured to use the optimal edit format for the popular OpenAI and Anthropic models, and for the other models recommended on the LLM pages. For less well-known models, aider defaults to the "whole" edit format, since it is the easiest format for an LLM to use.
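If you want to experiment with a format other than aider's default for a given model, the edit format can be overridden on the command line. A minimal sketch, reusing model names from the table above (your preferred model and provider may differ):

```
# Rely on aider's configured default edit format for a well-known model
aider --model openai/gpt-4o-2024-08-06

# Force the simpler "whole" format, e.g. for a lesser-known or local model
aider --model ollama/qwen2.5-coder:32b --edit-format whole

# Ask for a diff-style format instead, to save tokens when editing large files
aider --model ollama/qwen2.5-coder:32b --edit-format diff
```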
Contributing benchmark results
Contributions of benchmark results are welcome! See the benchmark README for information on running aider's code editing benchmarks. Submit results by opening a PR that edits the benchmark results data files.
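As a rough sketch of what a contribution workflow might look like (the authoritative steps, script names, and flags are in the benchmark README; the `benchmark.py` invocation and its flags below are assumptions for illustration, and the benchmark is normally run inside the project's docker container):

```
# Get the aider source, which contains the benchmark harness
git clone https://github.com/Aider-AI/aider.git
cd aider

# Hypothetical benchmark run for one model; consult the benchmark README
# for the real flags, then open a PR adding the resulting row to the
# benchmark results data files
./benchmark/benchmark.py my-run-name --model gpt-4o-mini --edit-format whole
```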
By Paul Gauthier, last updated April 12, 2025.