[paper review] NYU CTF Dataset: A Scalable Open-Source Benchmark Dataset for Evaluating LLMs in Offensive Security
[paper]
This paper introduces a dataset of 200 CTF challenges sourced from NYU’s annual CSAW CTF competition. They were selected from 568 candidates by manually verifying that each challenge still works (outdated software packages / GPG keys can be a significant issue when trying to run old software). The authors upload Docker images to Docker Hub, which is good: it limits the risk of dependency rot going forward, compared to building the images from source.
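As a minimal illustration (assuming Docker is installed; the image name and port below are hypothetical, since the real per-challenge images on the authors’ Docker Hub account each have their own name and tag), consuming a prebuilt challenge looks roughly like this:

```python
import socket
import subprocess

# Hypothetical image name and port -- not taken from the paper.
IMAGE = "nyuctf/example-pwn-challenge:latest"
PORT = 1337

# Pull the prebuilt image instead of rebuilding it from (possibly bit-rotted) sources.
subprocess.run(["docker", "pull", IMAGE], check=True)

# Start the challenge service, exposing its port on the host.
subprocess.run(["docker", "run", "-d", "--rm", "-p", f"{PORT}:{PORT}", IMAGE], check=True)

# Interact with it the same way a player (or an LLM agent) would.
with socket.create_connection(("127.0.0.1", PORT), timeout=5) as conn:
    print(conn.recv(4096).decode(errors="replace"))
```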
The paper also includes a framework for solving the challenges, which gives the model access to some tools (like netcat, Ghidra, gmpy2…). This part doesn’t seem much different from the previous paper by the same team.
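As an aside (not from the paper), here is the kind of arithmetic a crypto challenge solver leans on gmpy2 for: an exact integer root, used in this toy example to break textbook RSA with e = 3 and a short unpadded message.

```python
import gmpy2

# Toy parameters, purely illustrative: e = 3, no padding, short message.
e = 3
m = int.from_bytes(b"flag{tiny}", "big")
n = int(gmpy2.next_prime(2**512)) * int(gmpy2.next_prime(2**513))
c = pow(m, e, n)

# Since m**3 < n, the ciphertext is simply m**3 over the integers,
# so an exact integer cube root recovers the plaintext.
root, exact = gmpy2.iroot(c, e)
assert exact
print(int(root).to_bytes((int(root).bit_length() + 7) // 8, "big"))  # b'flag{tiny}'
```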
Challenges are roughly evenly distributed between 2017 and 2023, at a rate of about 30 challenges a year. Unless sources other than NYU’s CSAW are added, the rate of addition of new challenges going forward should be in this ballpark, perhaps slightly higher. This also raises training data contamination concerns, as nearly all challenges predate the knowledge cutoffs of the models considered (so public writeups may well be in the training data).
As in the previous paper, they don’t specify exactly which model versions they’re testing. They test:
“GPT-4” (probably gpt-4-1106-preview or gpt-4-0125-preview, which are listed in the backend section)
“GPT-3.5” (probably gpt-3.5-turbo-1106, the only one listed in the backend section)
“Mixtral” (probably mistralai/Mixtral-8x7B-Instruct-v0.1, for the same reason)
“LLaMA 3” (one of the two 70b versions listed in the backend section)
“Claude 3” (we have to assume claude-3-opus-20240229)
At 200 challenges, this is probably the largest readily available CTF dataset out there, but it is still fairly small, as evidenced by noisy results such as GPT-3.5 outperforming GPT-4 on the 2022 qualifiers and finals.
More paper reviews are available on this page: https://tchauvin.com/recent-papers-ai-hacking.