Senior SWE-Bench: open-source benchmark that assesses agents as senior engineers

(senior-swe-bench.snorkel.ai)

37 points | by matt_d 2 hours ago

9 comments

magnio 1 minute ago
I saw on Twitter that in an ML course at Tsinghua University, one of the tests asks students to write quizzes that fail as many LLMs as possible.
What if we create a benchmark that works like this and assigns Elo ratings? Models fight head-to-head by writing a question, a bug, or an incomplete implementation, which the opponent has to answer, fix, or finish.
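The rating side of that is the easy part. A minimal sketch, assuming the standard Elo update with K = 32 and a 400-point scale (both are my assumptions, nothing the benchmark or the course specifies); `expected_score` and `update_ratings` are made-up names for illustration:

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_ratings(rating_a: float, rating_b: float, score_a: float,
                   k: float = 32.0) -> tuple[float, float]:
    """Return new (A, B) ratings after one head-to-head round.

    score_a is 1.0 if A won (e.g. B failed A's task), 0.0 if A lost,
    and 0.5 for a draw. K = 32 is an assumed step size.
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    # Zero-sum update: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


# Two models start level; A writes a bug that B fails to fix.
new_a, new_b = update_ratings(1500.0, 1500.0, score_a=1.0)
print(new_a, new_b)  # 1516.0 1484.0
```

The hard part is everything this leaves out: deciding who "won" a round still needs a checker (tests, a reference answer, or a judge), and that is where the subjectivity comes back in.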
jonathanleane 1 hour ago
Top solve rate is currently 24% with Opus 4.8... What's a competent human supposed to score?
- lacunary 1 hour ago
  presumably whatever the top model uses and then some, since the human can use the model.
  I wonder if a model could score higher if it had a human at its disposal?
guilhermecgs 21 minutes ago
fable 5?
- guessmyname 4 minutes ago
  The people who created the benchmark(s) don’t have access to Fable 5.
LiamPowell 1 hour ago
> You are a senior SWE-Bench reviewer, make no mistakes.
I don't know what a better approach would look like while still remaining feasible, but this approach of telling an LLM to make a subjective judgement seems fundamentally flawed.
Madmallard 55 minutes ago
next round of trust me bro benchmarks
- dozerly 12 minutes ago
  Just wait for the next 100 rounds. People love seeing the 65% -> 85% seemingly over and over again for every new model.
danpalmer 1 hour ago
Why didn't they just make it "Staff SWE-Bench", would be much better smh. /s
But seriously, as an industry we're terrible at assessing engineering levels. I've worked with "senior engineers" who can't code and I've worked with "junior engineers" who could run rings around them.
Benchmarks like this should be much more precise about what they're actually testing, and what axes they're hard on. We also need to rise above prompts like "you are a senior engineer". It's woo, and it's far better to ask for precise outcomes.
- glaslong 24 minutes ago
  Principal-SWE-Bench will take some time to run, because the LLM needs to wait for a crisis to present its solution, having correctly identified that the same solution would have been organizationally impossible to propose until that moment.
- amrrs 1 hour ago
  As someone who's trying to get better assessments, I'm struggling to come up with objective coding tasks that evaluate all aspects of real life like planning, design choices, problem solving and context usage. From your experience with humans, do you have any recommendations on what could be effective in measuring it?
  - allan_s 44 minutes ago
    I think the source of your issue is in your statement itself: why do you want a task that evaluates things that broad to be only a coding task? Shouldn't it be a planning task, documentation task, knowledge retrieval task, etc.? And very certainly not with just an initial prompt, but with an existing codebase + existing docs + tickets?
purple-leafy 1 hour ago
Benchmarks are great, but I feel like there’s a better way; this seems quite subjective.
What you really need is an objective benchmark.
- eli 1 hour ago
  I actually really like subjective benchmarks, so long as it's a human (ideally me) grading the results. LLM as judge never made much sense.
  - charcircuit 16 minutes ago
    The issue is that you can't do unsupervised learning if you require humans.
- echelon 1 hour ago
  > What you really need is an objective benchmark
  "When are all the software engineers unemployed?"
  - purple-leafy 59 minutes ago
    Not sure I follow haha