[paper review] Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risk of Language Models
A collection of 40 recent CTF challenges from 4 CTF competitions, with intermediate steps for 17 of them. Recent models are tested, including Llama 3.1 405B Instruct. The best performers are Claude 3.5 Sonnet and GPT-4o, though statistical power seems very low: despite the small number of challenges, each model gets only a single attempt at each of them.
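To get a sense of how noisy single-attempt scores on 40 challenges are, here is a minimal sketch (the solve counts are hypothetical, for illustration only, not the paper’s numbers): the 95% confidence intervals on the solve rate are so wide that a gap of a couple of challenges between two models is well within noise.

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion."""
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return centre - half, centre + half

# Hypothetical solve counts on 40 single-attempt challenges (not the paper's actual numbers).
for name, solved in [("model A", 7), ("model B", 5)]:
    lo, hi = wilson_interval(solved, 40)
    print(f"{name}: {solved}/40 solved, 95% CI ~ [{lo:.2f}, {hi:.2f}]")
# The two intervals overlap heavily, so a 2-challenge gap tells us very little.
```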
As they use CTF challenges that have been run in real competitions, they can derive a difficulty estimate from the first solve time (FST) by humans. The hardest problems that the best models can solve correspond to an FST of 11 minutes. The FST metric can be slightly misleading to outsiders, though: in a CTF competition, all teams are presented with all the challenges at the same time, which introduces randomness in FST compared to a situation where every team would be working on the same challenge from the start (I don’t see this limitation mentioned in the paper; the toy simulation below illustrates it). Some CTF competitions even unlock certain challenges only after others are completed, so the FST of those challenges would overstate their difficulty (it’s unclear whether these competitions did this, though in all likelihood no model solved any challenge that would fall in this category).
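To make the concurrency point concrete, here is a toy simulation (the team count, intrinsic solve times, and exponential solve-time model are all my own assumptions, not taken from the paper): each team works through the challenges in a random order, so the observed FST of a challenge mixes its intrinsic solve time with the time teams first spent on other challenges.

```python
import random

random.seed(0)

N_TEAMS = 50
# Hypothetical intrinsic mean solve times (minutes) for 5 challenges, easiest to hardest.
INTRINSIC = [10, 20, 40, 80, 160]

def simulate_fst() -> list[float]:
    """Observed first-solve time per challenge when every team gets all
    challenges at once and tackles them in a random order."""
    observed = [float("inf")] * len(INTRINSIC)
    for _ in range(N_TEAMS):
        order = random.sample(range(len(INTRINSIC)), len(INTRINSIC))
        clock = 0.0
        for idx in order:
            # Team's solve time for this challenge: exponential around the intrinsic mean.
            clock += random.expovariate(1 / INTRINSIC[idx])
            observed[idx] = min(observed[idx], clock)
    return observed

print("intrinsic means:", INTRINSIC)
print("observed FSTs:  ", [round(t, 1) for t in simulate_fst()])
# The observed FSTs are inflated (and sometimes reordered) relative to the intrinsic
# difficulty, because they also reflect time teams spent on other challenges first.
```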
A significant fraction of the challenges predate the training data cutoffs of some models, though the authors note that “there is minimal overlap between training and test data on any solved task besides those for Claude 3.5 Sonnet”. However, this means the benchmark isn’t future-proof: it won’t remain useful for future models with later training cutoffs.
This work is conceptually very similar to the other papers on LLM agents for CTFs, such as NYU CTF Dataset: A Scalable Open-Source Benchmark Dataset for Evaluating LLMs in Offensive Security.