Devin on SWE-BENCH: A Deep Dive into the SWE-BENCH Benchmark and Insights into Evaluating Devin's Performance
Understanding the benchmark behind Cognition's impressive claims
This newsletter is a brief guide to how Devin was evaluated and why it is reported to outperform other models on SWE-BENCH. We will explore the evaluation process and the factors behind Devin's superior performance in resolving real-world software issues.
SWE-BENCH is a benchmark developed by researchers from Princeton University and the University of Chicago to evaluate the ability of large language models to resolve real-world software issues found on GitHub. It is detailed in the 2023 research paper "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?", which describes both how the benchmark was constructed and how models are evaluated on it.
The benchmark includes 2,294 software engineering problems drawn from real-world GitHub issues and corresponding pull requests across 12 popular Python repositories. Given a codebase along with a description of an issue to be resolved, language models are tasked with editing the codebase to address the issue.
The benchmark aims to capture what state-of-the-art large language models can and cannot do by reflecting how they are actually applied to real-world software, with the goal of shaping their future development and usage.
SWE-BENCH sources task instances from real-world Python repositories by pairing GitHub issues with the merged pull requests that resolve them and the tests that verify the fix. Given the issue text and a snapshot of the repository's codebase, a model must generate a patch, which is then evaluated against those real tests.
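To make the task format concrete, the sketch below shows roughly what a task instance provides and what a model is expected to produce. The field names and values are illustrative simplifications, not the exact schema of the published dataset.

```python
# Illustrative sketch (not the official schema): the rough shape of a
# SWE-BENCH task instance and the artifact a model must produce.
task_instance = {
    "repo": "django/django",              # source repository
    "base_commit": "abc123",              # codebase snapshot the model works against
    "problem_statement": "Issue text copied from GitHub ...",
    "FAIL_TO_PASS": ["tests/test_forms.py::test_regression"],  # tests the fix must make pass
    "PASS_TO_PASS": ["tests/test_forms.py::test_existing"],    # tests that must keep passing
}

# The model's output is a patch in unified diff format (abbreviated illustration):
model_patch = """\
diff --git a/pkg/module.py b/pkg/module.py
--- a/pkg/module.py
+++ b/pkg/module.py
@@ -42,7 +42,7 @@
-        return value
+        return value.strip()
"""
```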
The evaluation of multiple state-of-the-art large language models on SWE-BENCH, detailed in the research paper, found that these models fail to resolve all but the simplest issues: Claude 2 and GPT-4 resolved only 4.8% and 1.7% of tasks, respectively. Devin is not included in the paper's results because it was released later, so the only figure available for comparison is Cognition's internally reported 13.8%.
The research team also fine-tuned CodeLlama on 19,000 issue-and-pull-request pairs drawn from 37 repositories outside the benchmark to produce the SWE-Llama 7b and 13b models. These models are competitive with Claude 2 and can resolve issues when given more than 100,000 tokens of context.
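As a rough illustration of what such a fine-tune can look like in practice, here is a minimal LoRA setup using the Hugging Face transformers and peft libraries. The base model name, hyperparameters, and prompt format are placeholders and are not the recipe the authors used for SWE-Llama.

```python
# Hypothetical sketch of a LoRA fine-tune on issue -> patch pairs.
# Hyperparameters and prompt format are assumptions, not SWE-Llama's actual recipe.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "codellama/CodeLlama-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Low-rank adapters on the attention projections keep the fine-tune lightweight.
lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                  lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

def format_example(issue_text: str, code_context: str, gold_patch: str) -> str:
    # One training example: issue description plus relevant code as the prompt,
    # the gold patch from the merged pull request as the target.
    return f"Issue:\n{issue_text}\n\nCode:\n{code_context}\n\nPatch:\n{gold_patch}"
```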
To create the benchmark, approximately 90,000 pull requests (PRs) from 12 popular open-source Python repositories on GitHub were collected; popular repositories were chosen because they tend to be better maintained, have clearer contributor guidelines, and have better test coverage.
These candidates then go through two stages of filtering: attribute filtering, which keeps only merged PRs that resolve a GitHub issue and contribute tests, and execution-based filtering, which keeps only PRs whose associated tests go from failing to passing once the PR's changes are applied. This filtering reduces the original 90,000 pull requests to 2,294 task instances, as sketched below.
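Here is a conceptual sketch of that two-stage funnel. The CandidatePR record and its fields are invented for illustration and do not correspond to SWE-BENCH's actual collection code.

```python
# Conceptual sketch of the two filtering stages; data structures are hypothetical.
from dataclasses import dataclass, field

@dataclass
class CandidatePR:
    """Hypothetical record for one pull request under consideration."""
    is_merged: bool
    resolves_issue: bool
    adds_tests: bool
    fail_to_pass_tests: list[str] = field(default_factory=list)  # tests that flip after the patch

def passes_attribute_filter(pr: CandidatePR) -> bool:
    # Stage 1: keep merged PRs that resolve a GitHub issue and contribute tests.
    return pr.is_merged and pr.resolves_issue and pr.adds_tests

def passes_execution_filter(pr: CandidatePR) -> bool:
    # Stage 2: keep PRs where at least one test goes from failing to passing
    # after the PR's code changes are applied.
    return len(pr.fail_to_pass_tests) > 0

candidates = [
    CandidatePR(True, True, True, ["tests/test_x.py::test_bug"]),
    CandidatePR(True, False, False),
]
task_instances = [pr for pr in candidates
                  if passes_attribute_filter(pr) and passes_execution_filter(pr)]
print(len(task_instances))  # -> 1
```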
To evaluate a model's proposed solution, the generated patch is applied to the codebase snapshot for that task instance and the instance's associated tests are run. If the patch applies cleanly and all of those tests pass, including the ones that failed before the fix, the issue is considered resolved.
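A minimal sketch of that check is below, assuming a local checkout of the repository at the right commit and pytest-style test identifiers; it is illustrative and is not the official SWE-BENCH evaluation harness.

```python
# Minimal sketch of patch-based evaluation (not the official SWE-BENCH harness).
# The repository path and test commands are illustrative assumptions.
import subprocess

def evaluate_patch(repo_dir: str, patch_file: str, test_ids: list[str]) -> bool:
    # 1. Apply the model-generated patch to the codebase snapshot.
    apply = subprocess.run(["git", "apply", patch_file], cwd=repo_dir)
    if apply.returncode != 0:
        return False  # patch does not apply cleanly -> issue not resolved

    # 2. Run the task instance's associated tests; every one must pass.
    tests = subprocess.run(["python", "-m", "pytest", *test_ids], cwd=repo_dir)
    return tests.returncode == 0
```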
The performance of models on the benchmark differs across repositories. This is because repositories have different contributor guidelines, popularity, complexity, dependencies, and other factors that can influence the difficulty of resolving issues and the performance of language models.
Due to the need to process long sequences of text, only a few models were suitable for SWE-BENCH. These include ChatGPT-3.5 with a context window size of 16K tokens, GPT-4 with 32K tokens, and Claude 2 with 100K tokens.
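One practical consequence of these limits is that the issue text plus any retrieved code has to fit inside the model's window. Below is a rough sketch of such a budget check using the tiktoken tokenizer; the token limits and greedy truncation strategy are simplifying assumptions, not SWE-BENCH's exact retrieval setup.

```python
# Rough sketch: check whether an issue plus retrieved files fits a context window.
# The limits and truncation strategy are assumptions for illustration only.
import tiktoken

CONTEXT_LIMITS = {"gpt-3.5-turbo-16k": 16_000, "gpt-4-32k": 32_000}

def build_prompt(issue_text: str, file_contents: list[str], model: str) -> str:
    enc = tiktoken.encoding_for_model("gpt-4")  # tokenizer used as a rough proxy
    budget = CONTEXT_LIMITS[model]
    prompt = issue_text
    for content in file_contents:
        candidate = prompt + "\n\n" + content
        if len(enc.encode(candidate)) > budget:
            break  # stop adding files once the window would overflow
        prompt = candidate
    return prompt
```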
Interestingly, an increase in context length tends to decrease the model's performance, as the models can become distracted by the additional context. The figure below shows Claude 2's performance as context length increases.
In conclusion, the introduction of benchmarks like SWE-BENCH is an essential step towards improving the capabilities of language models in software engineering tasks. As more information becomes available about Devin's capabilities in the upcoming technical paper, it will be easier to properly evaluate its performance.
By Mlondi