Code editing leaderboard

This old aider code editing leaderboard has been replaced by the new, much more challenging polyglot leaderboard.

Aider's code editing benchmark asks the LLM to edit Python source files to complete 133 small coding exercises from Exercism. This measures the LLM's coding ability, and whether it can write new code that integrates into existing code. The model also has to successfully apply all of its changes to the source files without human intervention.

| Model | Percent completed correctly | Percent using correct edit format | Command | Edit format |
|---|---|---|---|---|
| o1 | 84.2% | 99.2% | aider --model openrouter/openai/o1 | diff |
| claude-3-5-sonnet-20241022 | 84.2% | 99.2% | aider --model anthropic/claude-3-5-sonnet-20241022 | diff |
| gemini-exp-1206 (whole) | 80.5% | 100.0% | aider --model gemini/gemini-exp-1206 | whole |
| o1-preview | 79.7% | 93.2% | aider --model o1-preview | diff |
| claude-3.5-sonnet-20240620 | 77.4% | 99.2% | aider --model claude-3.5-sonnet-20240620 | diff |
| claude-3-5-haiku-20241022 | 75.2% | 95.5% | aider --model anthropic/claude-3-5-haiku-20241022 | diff |
| ollama/qwen2.5-coder:32b | 72.9% | 100.0% | aider --model ollama/qwen2.5-coder:32b | whole |
| DeepSeek Coder V2 0724 | 72.9% | 97.7% | aider --model deepseek/deepseek-coder | diff |
| gpt-4o-2024-05-13 | 72.9% | 96.2% | aider | diff |
| DeepSeek-V2.5-1210 | 72.2% | 99.2% | aider --model deepseek/deepseek-chat | diff |
| openai/chatgpt-4o-latest | 72.2% | 97.0% | aider --model openai/chatgpt-4o-latest | diff |
| DeepSeek V2.5 | 72.2% | 96.2% | aider --deepseek | diff |
| gpt-4o-2024-11-20 | 71.4% | 99.2% | aider --model openai/gpt-4o-2024-11-20 | diff |
| Qwen2.5-Coder-32B-Instruct | 71.4% | 94.7% | aider --model openai/hf:Qwen/Qwen2.5-Coder-32B-Instruct --openai-api-base https://glhf.chat/api/openai/v1 | diff |
| gpt-4o-2024-08-06 | 71.4% | 98.5% | aider --model openai/gpt-4o-2024-08-06 | diff |
| o1-mini (whole) | 70.7% | 90.0% | aider --model o1-mini | whole |
| gemini-2.0-flash-exp | 69.9% | 97.0% | aider --model gemini/gemini-2.0-flash-exp | diff |
| DeepSeek Chat V2 0628 | 69.9% | 97.7% | aider --model deepseek/deepseek-chat | diff |
| gemini-exp-1206 (diff) | 69.2% | 84.2% | aider --model gemini/gemini-exp-1206 | diff |
| Qwen2.5-Coder-14B-Instruct | 69.2% | 100.0% | aider --model openai/Qwen2.5-Coder-14B-Instruct | whole |
| claude-3-opus-20240229 | 68.4% | 100.0% | aider --opus | diff |
| gpt-4-0613 | 67.7% | 100.0% | aider -4 | diff |
| Dracarys2-72B-Instruct | 66.9% | 100.0% | (via glhf.chat) | whole |
| gemini-1.5-pro-exp-0827 | 66.9% | 94.7% | aider --model gemini/gemini-1.5-pro-exp-0827 | diff-fenced |
| llama-3.1-405b-instruct (whole) | 66.2% | 100.0% | aider --model openrouter/meta-llama/llama-3.1-405b-instruct | whole |
| gpt-4-0314 | 66.2% | 93.2% | aider --model gpt-4-0314 | diff |
| gpt-4-0125-preview | 66.2% | 97.7% | aider --model gpt-4-0125-preview | udiff |
| yi-lightning | 65.4% | 97.0% | aider --model openai/yi-lightning | whole |
| openrouter/qwen/qwen-2.5-coder-32b-instruct | 65.4% | 84.2% | aider --model openrouter/qwen/qwen-2.5-coder-32b-instruct | diff |
| Mistral Large (2411) | 65.4% | 96.2% | aider --model mistral/mistral-large-latest | diff |
| gemini-1.5-pro-002 | 65.4% | 96.2% | aider --model gemini/gemini-1.5-pro-002 | diff-fenced |
| qwen-2.5-72b-instruct (bf16) | 65.4% | 96.2% | aider --model openrouter/qwen/qwen-2.5-72b-instruct | diff |
| gpt-4-1106-preview | 65.4% | 92.5% | aider --model gpt-4-1106-preview | udiff |
| ollama/Qwen2.5.1-Coder-7B-Instruct-GGUF:Q8_0-32k | 63.9% | 100.0% | aider --model ollama/Qwen2.5.1-Coder-7B-Instruct-GGUF:Q8_0-32k | whole |
| nousresearch/hermes-3-llama-3.1-405b | 63.9% | 100.0% | aider --model openrouter/nousresearch/hermes-3-llama-3.1-405b | whole |
| llama-3.1-405b-instruct (diff) | 63.9% | 92.5% | aider --model openrouter/meta-llama/llama-3.1-405b-instruct | diff |
| gpt-4-turbo-2024-04-09 (udiff) | 63.9% | 97.0% | aider --gpt-4-turbo | udiff |
| ollama/qwen2.5-coder:14b | 61.7% | 98.5% | aider --model ollama/qwen2.5-coder:14b | whole |
| o1-mini | 61.1% | 100.0% | aider --model o1-mini | diff |
| gemini-exp-1114 | 60.9% | 85.7% | aider --model gemini/gemini-exp-1114 | diff |
| Mistral Large 2 (2407) | 60.2% | 100.0% | aider --model mistral/mistral-large-2407 | whole |
| llama-3.3-70b-instruct | 59.4% | 88.7% | aider --model openrouter/meta-llama/llama-3.3-70b-instruct | diff |
| ollama/qwen2.5:32b-instruct-q8_0 | 58.6% | 100.0% | aider --model ollama/qwen2.5:32b-instruct-q8_0 | whole |
| Grok-2 | 58.6% | 98.5% | aider --model openrouter/x-ai/grok-2 | whole |
| llama-3.1-70b-instruct | 58.6% | 100.0% | aider --model fireworks_ai/accounts/fireworks/models/llama-v3p1-70b-instruct | whole |
| gemini-exp-1121 | 57.9% | 83.5% | aider --model gemini/gemini-exp-1121 | diff |
| Qwen2.5-Coder-7B-Instruct | 57.9% | 100.0% | aider --model openai/Qwen2.5-Coder-7B-Instruct | whole |
| gpt-3.5-turbo-0301 | 57.9% | 100.0% | aider --model gpt-3.5-turbo-0301 | whole |
| gpt-4-turbo-2024-04-09 (diff) | 57.6% | 100.0% | aider --model gpt-4-turbo-2024-04-09 | diff |
| gemini-1.5-pro-001 | 57.1% | 87.2% | aider --model gemini/gemini-1.5-pro-latest | diff-fenced |
| gpt-3.5-turbo-1106 | 56.1% | 100.0% | aider --model gpt-3.5-turbo-1106 | whole |
| gpt-4o-mini | 55.6% | 100.0% | aider --model gpt-4o-mini | whole |
| Qwen2 72B Instruct | 55.6% | 100.0% | aider --model together_ai/qwen/Qwen2-72B-Instruct | whole |
| Llama-3.1-Nemotron-70B-Instruct-HF | 54.9% | 99.2% | (via glhf.chat) | whole |
| Grok-2-mini | 54.9% | 100.0% | aider --model openrouter/x-ai/grok-2-mini | whole |
| claude-3-sonnet-20240229 | 54.9% | 100.0% | aider --sonnet | whole |
| Nova Pro | 54.1% | 100.0% | aider --model bedrock/us.amazon.nova-pro-v1:0 | whole |
| ollama/qwen2.5:32b | 54.1% | 100.0% | aider --model ollama/qwen2.5:32b | whole |
| Yi Coder 9B Chat | 54.1% | 100.0% | aider --model openai/hf:01-ai/Yi-Coder-9B-Chat --openai-api-base https://glhf.chat/api/openai/v1 | whole |
| gemini-1.5-flash-exp-0827 | 52.6% | 100.0% | aider --model gemini/gemini-1.5-flash-exp-0827 | whole |
| qwen2.5-coder:7b-instruct-q8_0 | 51.9% | 100.0% | aider --model ollama/qwen2.5-coder:7b-instruct-q8_0 | whole |
| gemini-1.5-flash-002 (0924) | 51.1% | 100.0% | aider --model gemini/gemini-1.5-flash-002 | whole |
| codestral-2405 | 51.1% | 100.0% | aider --model mistral/codestral-2405 | whole |
| gpt-3.5-turbo-0613 | 50.4% | 100.0% | aider --model gpt-3.5-turbo-0613 | whole |
| gpt-3.5-turbo-0125 | 50.4% | 100.0% | aider -3 | whole |
| qwen2:72b-instruct-q8_0 | 49.6% | 100.0% | aider --model ollama/qwen2:72b-instruct-q8_0 | whole |
| llama3-70b-8192 | 49.2% | 73.5% | aider --model groq/llama3-70b-8192 | diff |
| Codestral-22B-v0.1-Q4_K_M | 48.1% | 100.0% | aider --model Codestral-22B-v0.1-Q4_K_M | whole |
| codestral:22b-v0.1-q8_0 | 48.1% | 100.0% | aider --model ollama/codestral:22b-v0.1-q8_0 | whole |
| claude-3-haiku-20240307 | 47.4% | 100.0% | aider --model claude-3-haiku-20240307 | whole |
| ollama/codestral | 45.9% | 98.5% | aider --model ollama/codestral | whole |
| yi-coder:9b-chat-q4_0 | 45.1% | 100.0% | aider --model ollama/yi-coder:9b-chat-q4_0 | whole |
| gemini-1.5-flash-latest | 44.4% | 100.0% | aider --model gemini/gemini-1.5-flash-latest | whole |
| WizardLM-2 8x22B | 44.4% | 100.0% | aider --model openrouter/microsoft/wizardlm-2-8x22b | whole |
| ollama/yi-coder:9b-chat-fp16 | 43.6% | 99.2% | aider --model ollama/yi-coder:9b-chat-fp16 | whole |
| Reflection-70B | 42.1% | 100.0% | (not currently supported) | whole |
| Qwen2.5-Coder-3B-Instruct | 39.1% | 100.0% | aider --model openai/Qwen2.5-Coder-3B-Instruct | whole |
| ollama/mistral-small | 38.3% | 99.2% | aider --model ollama/mistral-small | whole |
| gemini-1.5-flash-8b-exp-0924 | 38.3% | 100.0% | aider --model gemini/gemini-1.5-flash-8b-exp-0924 | whole |
| Command R (08-24) | 38.3% | 100.0% | aider --model command-r-08-2024 | whole |
| Command R+ (08-24) | 38.3% | 100.0% | aider --model command-r-plus-08-2024 | whole |
| gemini-1.5-flash-8b-exp-0827 | 38.3% | 100.0% | aider --model gemini/gemini-1.5-flash-8b-exp-0827 | whole |
| llama-3.1-8b-instruct | 37.6% | 100.0% | aider --model fireworks_ai/accounts/fireworks/models/llama-v3p1-8b-instruct | whole |
| qwen1.5-110b-chat | 37.6% | 100.0% | aider --model together_ai/qwen/qwen1.5-110b-chat | whole |
| gemma2:27b-instruct-q8_0 | 36.1% | 100.0% | aider --model ollama/gemma2:27b-instruct-q8_0 | whole |
| codeqwen:7b-chat-v1.5-q8_0 | 34.6% | 100.0% | aider --model ollama/codeqwen:7b-chat-v1.5-q8_0 | whole |
| ollama/mistral-nemo:12b-instruct-2407-q4_K_M | 33.1% | 100.0% | aider --model ollama/mistral-nemo:12b-instruct-2407-q4_K_M | whole |
| ollama/codegeex4 | 32.3% | 97.0% | aider --model ollama/codegeex4 | whole |
| Qwen2.5-Coder-1.5B-Instruct | 31.6% | 100.0% | aider --model openai/Qwen2.5-Coder-1.5B-Instruct | whole |
| command-r-plus | 31.6% | 100.0% | aider --model command-r-plus | whole |
| ollama/hermes3:8b-llama3.1-fp16 | 30.1% | 98.5% | aider --model ollama/hermes3:8b-llama3.1-fp16 | whole |
| ollama/wojtek/opencodeinterpreter:6.7b | 30.1% | 91.0% | aider --model ollama/wojtek/opencodeinterpreter:6.7b | whole |
| o1-mini-2024-09-12 | 27.1% | 95.6% | aider --model o1-mini | whole |
| ollama/tulu3 | 26.3% | 100.0% | aider --model ollama/tulu3 | whole |
| ollama/llama3.2:3b-instruct-fp16 | 26.3% | 97.0% | aider --model ollama/llama3.2:3b-instruct-fp16 | whole |
| ollama/hermes3 | 22.6% | 98.5% | aider --model ollama/hermes3 | whole |
| ollama/granite3-dense:8b | 20.3% | 78.9% | aider --model ollama/granite3-dense:8b | whole |
| Qwen2.5-Coder-0.5B-Instruct | 14.3% | 100.0% | aider --model openai/Qwen2.5-Coder-0.5B-Instruct | whole |

Notes on benchmarking results

The key benchmarking metrics are:

  • Percent completed correctly - Measures what percentage of the coding tasks the LLM completed successfully. To complete a task, the LLM must both solve the programming problem and edit the code to implement that solution.
  • Percent using correct edit format - Measures what percentage of the coding tasks the LLM complied with the edit format specified in the system prompt. If the LLM makes edit mistakes, aider gives it feedback and asks for a fixed copy of the edit. The best-performing models can reliably conform to the edit format without making errors.
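The two percentages relate to per-task outcomes in a straightforward way. A minimal sketch of the arithmetic, assuming a list of per-task results (the function and field names are illustrative, not aider's actual API):

```python
def leaderboard_metrics(tasks):
    """Compute the two leaderboard percentages from per-task outcomes.

    tasks: list of dicts with boolean fields:
      - "solved": the exercise's tests passed
      - "good_edit_format": every edit complied with the requested format
    """
    total = len(tasks)
    pct_correct = round(100 * sum(t["solved"] for t in tasks) / total, 1)
    pct_format = round(100 * sum(t["good_edit_format"] for t in tasks) / total, 1)
    return pct_correct, pct_format

# Example: 133 Exercism tasks, 112 solved, 132 with well-formed edits.
tasks = [{"solved": i < 112, "good_edit_format": i < 132} for i in range(133)]
print(leaderboard_metrics(tasks))  # → (84.2, 99.2)
```

With 133 tasks, each solved task moves the first percentage by roughly 0.75 points, which is why the leaderboard scores cluster at values like 84.2% and 99.2%.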

Notes on the edit format

Aider uses different "edit formats" to collect code edits from different LLMs:

  • The "whole" format is the easiest for an LLM to use, but it consumes a lot of tokens and may limit how large a file can be edited.
  • Models that can use one of the diff formats are much more efficient, consuming far fewer tokens.
  • Models that use a diff-like format can edit larger files at lower cost, without hitting token limits.

Aider is configured to use the best edit format for the popular OpenAI and Anthropic models, as well as the other models recommended on the LLM page. For less well-known models, aider defaults to the "whole" edit format, since it is the easiest format for an LLM to use.
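As a simplified illustration of the difference (the exact markup aider expects is defined in its prompts and varies by format; this file and edit are hypothetical), a "whole" reply re-sends the complete updated file, while a "diff" reply sends only a search/replace hunk:

```
# "whole" format: the full updated file is returned
greeting.py
print("hello world")
print("goodbye")

# "diff" format: only the changed region is returned
greeting.py
<<<<<<< SEARCH
print("hello")
=======
print("hello world")
>>>>>>> REPLACE
```

The "whole" reply grows with the size of the file, while the "diff" reply grows only with the size of the change, which is why diff-capable models can edit larger files more cheaply.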

Contributing benchmark results

Contributions of benchmark results are welcome! See the benchmark README for information on running aider's code editing benchmarks. To submit results, open a PR that edits the benchmark results data files.
