Small improvements & fixes to SWE-Bench #1874
Merged
Conversation
li-boxuan changed the title from "Small improvements to SWE-Bench doc and scripts" to "Small improvements & fixes to SWE-Bench doc and scripts" on May 19, 2024.
li-boxuan changed the title from "Small improvements & fixes to SWE-Bench doc and scripts" to "Small improvements & fixes to SWE-Bench" on May 19, 2024.
xingyaoww approved these changes on May 20, 2024:
Looks awesome to me!! Thanks for polishing it so well! :)
li-boxuan commented on May 20, 2024:

I was able to run a few benchmark instances from SWE-Bench by myself following the documentation - it was great! In general the experience was smooth, thanks to @xingyaoww, @libowen2121 and the team! I made a few small enhancements and fixes to further improve the developer experience:

- Always use `poetry run python` (i.e., the `python` from poetry's virtual environment) rather than `python` or `python3` in scripts, so the behavior is consistent.
- Make `AGENT` configurable: an argument now controls which agent to benchmark. To facilitate this, I removed the hardcoded `CodeActAgent` from `run_infer.sh`, and also added a `VERSION` attribute to all agents, since the benchmark needs to record the agent version. (A sketch of this pattern follows below.)
- Make `EVAL_LIMIT` configurable: an argument now controls how many instances to benchmark, which is useful for debugging and development.
- Fix the `eval_output_dir` not defined error in `run_infer.py`.
- Other enhancements to the README file and logs.

I also noticed that a lot of code in `run_infer.py` could be shared with other benchmarks, but since we only have one benchmark now, I think we can avoid over-engineering. A refactor and code dedup would be useful in the future once we have more benchmarks, though.

li-boxuan added a commit to li-boxuan/OpenDevin that referenced this pull request on May 21, 2024.
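To make the `AGENT`/`VERSION`/`EVAL_LIMIT` changes above concrete, here is a minimal, self-contained sketch of the pattern. This is not OpenDevin's actual code: the registry, agent class names, and CLI flags below are illustrative assumptions only.

```python
# Sketch of a configurable agent + eval limit for a benchmark script.
# NOT OpenDevin's actual implementation; all names here are hypothetical.
import argparse


class Agent:
    """Hypothetical base class; each agent carries a VERSION so the
    benchmark can record which agent revision produced a run."""
    VERSION: str = 'v0'


class CodeActAgent(Agent):
    VERSION = 'v1.0'


class MonologueAgent(Agent):
    VERSION = 'v1.0'


# Hypothetical registry mapping the AGENT argument to an agent class.
AGENTS = {cls.__name__: cls for cls in (CodeActAgent, MonologueAgent)}


def main() -> None:
    parser = argparse.ArgumentParser(description='Run benchmark inference (sketch).')
    # AGENT is an argument instead of a hardcoded CodeActAgent.
    parser.add_argument('--agent-cls', default='CodeActAgent', choices=sorted(AGENTS))
    # EVAL_LIMIT caps how many instances run; handy for debugging.
    parser.add_argument('--eval-limit', type=int, default=None)
    args = parser.parse_args()

    agent_cls = AGENTS[args.agent_cls]
    print(f'Benchmarking {agent_cls.__name__} (version {agent_cls.VERSION})')

    instances = list(range(100))  # stand-in for SWE-Bench instances
    if args.eval_limit is not None:
        instances = instances[:args.eval_limit]
    print(f'Running {len(instances)} instance(s)')


if __name__ == '__main__':
    main()
```

Under these assumptions, a run might look like `poetry run python run_infer.py --agent-cls CodeActAgent --eval-limit 10`: ten instances are benchmarked with the chosen agent, and its `VERSION` is recorded alongside the results.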
Hey @li-boxuan! Thanks for the excellent work!