Code editing leaderboard

This old aider code editing leaderboard has been replaced by the new, much more challenging polyglot leaderboard.

Aider's code editing benchmark asks the LLM to edit Python source files to complete 133 small coding exercises from Exercism. This measures the LLM's coding ability, and whether it can write new code that integrates into existing code. The model also has to successfully apply all of its changes to the source files without human intervention.

| Model | Percent completed correctly | Percent using correct edit format | Command | Edit format |
|---|---|---|---|---|
| o1 | 84.2% | 99.2% | aider --model openrouter/openai/o1 | diff |
| claude-3-5-sonnet-20241022 | 84.2% | 99.2% | aider --model anthropic/claude-3-5-sonnet-20241022 | diff |
| gemini-exp-1206 (whole) | 80.5% | 100.0% | aider --model gemini/gemini-exp-1206 | whole |
| o1-preview | 79.7% | 93.2% | aider --model o1-preview | diff |
| claude-3.5-sonnet-20240620 | 77.4% | 99.2% | aider --model claude-3.5-sonnet-20240620 | diff |
| claude-3-5-haiku-20241022 | 75.2% | 95.5% | aider --model anthropic/claude-3-5-haiku-20241022 | diff |
| ollama/qwen2.5-coder:32b | 72.9% | 100.0% | aider --model ollama/qwen2.5-coder:32b | whole |
| DeepSeek Coder V2 0724 | 72.9% | 97.7% | aider --model deepseek/deepseek-coder | diff |
| gpt-4o-2024-05-13 | 72.9% | 96.2% | aider | diff |
| DeepSeek-V2.5-1210 | 72.2% | 99.2% | aider --model deepseek/deepseek-chat | diff |
| openai/chatgpt-4o-latest | 72.2% | 97.0% | aider --model openai/chatgpt-4o-latest | diff |
| DeepSeek V2.5 | 72.2% | 96.2% | aider --deepseek | diff |
| gpt-4o-2024-11-20 | 71.4% | 99.2% | aider --model openai/gpt-4o-2024-11-20 | diff |
| Qwen2.5-Coder-32B-Instruct | 71.4% | 94.7% | aider --model openai/hf:Qwen/Qwen2.5-Coder-32B-Instruct --openai-api-base https://glhf.chat/api/openai/v1 | diff |
| gpt-4o-2024-08-06 | 71.4% | 98.5% | aider --model openai/gpt-4o-2024-08-06 | diff |
| o1-mini (whole) | 70.7% | 90.0% | aider --model o1-mini | whole |
| gemini-2.0-flash-exp | 69.9% | 97.0% | aider --model gemini/gemini-2.0-flash-exp | diff |
| DeepSeek Chat V2 0628 | 69.9% | 97.7% | aider --model deepseek/deepseek-chat | diff |
| gemini-exp-1206 (diff) | 69.2% | 84.2% | aider --model gemini/gemini-exp-1206 | diff |
| Qwen2.5-Coder-14B-Instruct | 69.2% | 100.0% | aider --model openai/Qwen2.5-Coder-14B-Instruct | whole |
| claude-3-opus-20240229 | 68.4% | 100.0% | aider --opus | diff |
| gpt-4-0613 | 67.7% | 100.0% | aider -4 | diff |
| Dracarys2-72B-Instruct | 66.9% | 100.0% | (via glhf.chat) | whole |
| gemini-1.5-pro-exp-0827 | 66.9% | 94.7% | aider --model gemini/gemini-1.5-pro-exp-0827 | diff-fenced |
| llama-3.1-405b-instruct (whole) | 66.2% | 100.0% | aider --model openrouter/meta-llama/llama-3.1-405b-instruct | whole |
| gpt-4-0314 | 66.2% | 93.2% | aider --model gpt-4-0314 | diff |
| gpt-4-0125-preview | 66.2% | 97.7% | aider --model gpt-4-0125-preview | udiff |
| yi-lightning | 65.4% | 97.0% | aider --model openai/yi-lightning | whole |
| openrouter/qwen/qwen-2.5-coder-32b-instruct | 65.4% | 84.2% | aider --model openrouter/qwen/qwen-2.5-coder-32b-instruct | diff |
| Mistral Large (2411) | 65.4% | 96.2% | aider --model mistral/mistral-large-latest | diff |
| gemini-1.5-pro-002 | 65.4% | 96.2% | aider --model gemini/gemini-1.5-pro-002 | diff-fenced |
| qwen-2.5-72b-instruct (bf16) | 65.4% | 96.2% | aider --model openrouter/qwen/qwen-2.5-72b-instruct | diff |
| gpt-4-1106-preview | 65.4% | 92.5% | aider --model gpt-4-1106-preview | udiff |
| ollama/Qwen2.5.1-Coder-7B-Instruct-GGUF:Q8_0-32k | 63.9% | 100.0% | aider --model ollama/Qwen2.5.1-Coder-7B-Instruct-GGUF:Q8_0-32k | whole |
| nousresearch/hermes-3-llama-3.1-405b | 63.9% | 100.0% | aider --model openrouter/nousresearch/hermes-3-llama-3.1-405b | whole |
| llama-3.1-405b-instruct (diff) | 63.9% | 92.5% | aider --model openrouter/meta-llama/llama-3.1-405b-instruct | diff |
| gpt-4-turbo-2024-04-09 (udiff) | 63.9% | 97.0% | aider --gpt-4-turbo | udiff |
| ollama/qwen2.5-coder:14b | 61.7% | 98.5% | aider --model ollama/qwen2.5-coder:14b | whole |
| o1-mini | 61.1% | 100.0% | aider --model o1-mini | diff |
| gemini-exp-1114 | 60.9% | 85.7% | aider --model gemini/gemini-exp-1114 | diff |
| Mistral Large 2 (2407) | 60.2% | 100.0% | aider --model mistral/mistral-large-2407 | whole |
| llama-3.3-70b-instruct | 59.4% | 88.7% | aider --model openrouter/meta-llama/llama-3.3-70b-instruct | diff |
| ollama/qwen2.5:32b-instruct-q8_0 | 58.6% | 100.0% | aider --model ollama/qwen2.5:32b-instruct-q8_0 | whole |
| Grok-2 | 58.6% | 98.5% | aider --model openrouter/x-ai/grok-2 | whole |
| llama-3.1-70b-instruct | 58.6% | 100.0% | aider --model fireworks_ai/accounts/fireworks/models/llama-v3p1-70b-instruct | whole |
| gemini-exp-1121 | 57.9% | 83.5% | aider --model gemini/gemini-exp-1121 | diff |
| Qwen2.5-Coder-7B-Instruct | 57.9% | 100.0% | aider --model openai/Qwen2.5-Coder-7B-Instruct | whole |
| gpt-3.5-turbo-0301 | 57.9% | 100.0% | aider --model gpt-3.5-turbo-0301 | whole |
| gpt-4-turbo-2024-04-09 (diff) | 57.6% | 100.0% | aider --model gpt-4-turbo-2024-04-09 | diff |
| gemini-1.5-pro-001 | 57.1% | 87.2% | aider --model gemini/gemini-1.5-pro-latest | diff-fenced |
| gpt-3.5-turbo-1106 | 56.1% | 100.0% | aider --model gpt-3.5-turbo-1106 | whole |
| gpt-4o-mini | 55.6% | 100.0% | aider --model gpt-4o-mini | whole |
| Qwen2 72B Instruct | 55.6% | 100.0% | aider --model together_ai/qwen/Qwen2-72B-Instruct | whole |
| Llama-3.1-Nemotron-70B-Instruct-HF | 54.9% | 99.2% | (via glhf.chat) | whole |
| Grok-2-mini | 54.9% | 100.0% | aider --model openrouter/x-ai/grok-2-mini | whole |
| claude-3-sonnet-20240229 | 54.9% | 100.0% | aider --sonnet | whole |
| Nova Pro | 54.1% | 100.0% | aider --model bedrock/us.amazon.nova-pro-v1:0 | whole |
| ollama/qwen2.5:32b | 54.1% | 100.0% | aider --model ollama/qwen2.5:32b | whole |
| Yi Coder 9B Chat | 54.1% | 100.0% | aider --model openai/hf:01-ai/Yi-Coder-9B-Chat --openai-api-base https://glhf.chat/api/openai/v1 | whole |
| gemini-1.5-flash-exp-0827 | 52.6% | 100.0% | aider --model gemini/gemini-1.5-flash-exp-0827 | whole |
| qwen2.5-coder:7b-instruct-q8_0 | 51.9% | 100.0% | aider --model ollama/qwen2.5-coder:7b-instruct-q8_0 | whole |
| gemini-1.5-flash-002 (0924) | 51.1% | 100.0% | aider --model gemini/gemini-1.5-flash-002 | whole |
| codestral-2405 | 51.1% | 100.0% | aider --model mistral/codestral-2405 | whole |
| gpt-3.5-turbo-0613 | 50.4% | 100.0% | aider --model gpt-3.5-turbo-0613 | whole |
| gpt-3.5-turbo-0125 | 50.4% | 100.0% | aider -3 | whole |
| qwen2:72b-instruct-q8_0 | 49.6% | 100.0% | aider --model ollama/qwen2:72b-instruct-q8_0 | whole |
| llama3-70b-8192 | 49.2% | 73.5% | aider --model groq/llama3-70b-8192 | diff |
| Codestral-22B-v0.1-Q4_K_M | 48.1% | 100.0% | aider --model Codestral-22B-v0.1-Q4_K_M | whole |
| codestral:22b-v0.1-q8_0 | 48.1% | 100.0% | aider --model ollama/codestral:22b-v0.1-q8_0 | whole |
| claude-3-haiku-20240307 | 47.4% | 100.0% | aider --model claude-3-haiku-20240307 | whole |
| ollama/codestral | 45.9% | 98.5% | aider --model ollama/codestral | whole |
| yi-coder:9b-chat-q4_0 | 45.1% | 100.0% | aider --model ollama/yi-coder:9b-chat-q4_0 | whole |
| gemini-1.5-flash-latest | 44.4% | 100.0% | aider --model gemini/gemini-1.5-flash-latest | whole |
| WizardLM-2 8x22B | 44.4% | 100.0% | aider --model openrouter/microsoft/wizardlm-2-8x22b | whole |
| ollama/yi-coder:9b-chat-fp16 | 43.6% | 99.2% | aider --model ollama/yi-coder:9b-chat-fp16 | whole |
| Reflection-70B | 42.1% | 100.0% | (not currently supported) | whole |
| Qwen2.5-Coder-3B-Instruct | 39.1% | 100.0% | aider --model openai/Qwen2.5-Coder-3B-Instruct | whole |
| ollama/mistral-small | 38.3% | 99.2% | aider --model ollama/mistral-small | whole |
| gemini-1.5-flash-8b-exp-0924 | 38.3% | 100.0% | aider --model gemini/gemini-1.5-flash-8b-exp-0924 | whole |
| Command R (08-24) | 38.3% | 100.0% | aider --model command-r-08-2024 | whole |
| Command R+ (08-24) | 38.3% | 100.0% | aider --model command-r-plus-08-2024 | whole |
| gemini-1.5-flash-8b-exp-0827 | 38.3% | 100.0% | aider --model gemini/gemini-1.5-flash-8b-exp-0827 | whole |
| llama-3.1-8b-instruct | 37.6% | 100.0% | aider --model fireworks_ai/accounts/fireworks/models/llama-v3p1-8b-instruct | whole |
| qwen1.5-110b-chat | 37.6% | 100.0% | aider --model together_ai/qwen/qwen1.5-110b-chat | whole |
| gemma2:27b-instruct-q8_0 | 36.1% | 100.0% | aider --model ollama/gemma2:27b-instruct-q8_0 | whole |
| codeqwen:7b-chat-v1.5-q8_0 | 34.6% | 100.0% | aider --model ollama/codeqwen:7b-chat-v1.5-q8_0 | whole |
| ollama/mistral-nemo:12b-instruct-2407-q4_K_M | 33.1% | 100.0% | aider --model ollama/mistral-nemo:12b-instruct-2407-q4_K_M | whole |
| ollama/codegeex4 | 32.3% | 97.0% | aider --model ollama/codegeex4 | whole |
| Qwen2.5-Coder-1.5B-Instruct | 31.6% | 100.0% | aider --model openai/Qwen2.5-Coder-1.5B-Instruct | whole |
| command-r-plus | 31.6% | 100.0% | aider --model command-r-plus | whole |
| ollama/hermes3:8b-llama3.1-fp16 | 30.1% | 98.5% | aider --model ollama/hermes3:8b-llama3.1-fp16 | whole |
| ollama/wojtek/opencodeinterpreter:6.7b | 30.1% | 91.0% | aider --model ollama/wojtek/opencodeinterpreter:6.7b | whole |
| o1-mini-2024-09-12 | 27.1% | 95.6% | aider --model o1-mini | whole |
| ollama/tulu3 | 26.3% | 100.0% | aider --model ollama/tulu3 | whole |
| ollama/llama3.2:3b-instruct-fp16 | 26.3% | 97.0% | aider --model ollama/llama3.2:3b-instruct-fp16 | whole |
| ollama/hermes3 | 22.6% | 98.5% | aider --model ollama/hermes3 | whole |
| ollama/granite3-dense:8b | 20.3% | 78.9% | aider --model ollama/granite3-dense:8b | whole |
| Qwen2.5-Coder-0.5B-Instruct | 14.3% | 100.0% | aider --model openai/Qwen2.5-Coder-0.5B-Instruct | whole |

Notes on benchmarking results

The key benchmarking metrics are:

  • Percent completed correctly - Measures what percentage of the coding tasks the LLM completed successfully. To complete a task, the LLM must both solve the programming problem and edit the code to implement that solution.
  • Percent using correct edit format - Measures what percentage of the coding tasks the LLM complied with the edit format specified in the system prompt. If the LLM makes edit mistakes, aider gives it feedback and asks for a fixed copy of the edit. The best-performing models can reliably conform to the edit format without making errors.
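The two percentages relate to per-task outcomes in a straightforward way. A minimal sketch of the arithmetic, assuming a list of per-task results (the function and field names are illustrative, not aider's actual API):

```python
def leaderboard_metrics(tasks):
    """Compute the two leaderboard percentages from per-task outcomes.

    tasks: list of dicts with boolean fields:
      - "solved": the exercise's tests passed
      - "good_edit_format": every edit complied with the requested format
    """
    total = len(tasks)
    pct_correct = round(100 * sum(t["solved"] for t in tasks) / total, 1)
    pct_format = round(100 * sum(t["good_edit_format"] for t in tasks) / total, 1)
    return pct_correct, pct_format

# Example: 133 Exercism tasks, 112 solved, 132 with well-formed edits.
tasks = [{"solved": i < 112, "good_edit_format": i < 132} for i in range(133)]
print(leaderboard_metrics(tasks))  # → (84.2, 99.2)
```

With 133 tasks, each solved task moves the first percentage by roughly 0.75 points, which is why the leaderboard scores cluster at values like 84.2% and 99.2%.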

Notes on the edit format

Aider uses different "edit formats" to collect code edits from different LLMs:

  • The "whole" format is the easiest for an LLM to use, but it consumes a lot of tokens and may limit how large a file can be edited.
  • Models that can use one of the diff formats are much more efficient, consuming far fewer tokens.
  • Models that use a diff-like format can edit larger files at lower cost, without hitting token limits.

Aider is configured to use the best edit format for the popular OpenAI and Anthropic models, as well as the other models recommended on the LLM page. For less well-known models, aider defaults to the "whole" edit format, since it is the easiest format for an LLM to use.
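As a simplified illustration of the difference (the exact markup aider expects is defined in its prompts and varies by format; this file and edit are hypothetical), a "whole" reply re-sends the complete updated file, while a "diff" reply sends only a search/replace hunk:

```
# "whole" format: the full updated file is returned
greeting.py
print("hello world")
print("goodbye")

# "diff" format: only the changed region is returned
greeting.py
<<<<<<< SEARCH
print("hello")
=======
print("hello world")
>>>>>>> REPLACE
```

The "whole" reply grows with the size of the file, while the "diff" reply grows only with the size of the change, which is why diff-capable models can edit larger files more cheaply.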

Contributing benchmark results

Contributions of benchmark results are welcome! See the benchmark README for information on running aider's code editing benchmarks. To submit results, open a PR that edits the benchmark results data files.
