Tag
1 article
DeepSWE’s 113-task test across 91 repos puts GPT-5.5 at 70% and exposes a loophole in Claude Opus.