Cybersecurity in AI: where progress is needed
Below are some areas in the field of cybersecurity in AI where I would like to see progress. By “cybersecurity in AI”, I mean one of two things: cybersecurity from AI (e.g. how to deal with deep fakes making phishing easier), and cybersecurity for AI (e.g. how to make ML supply chains trustworthy). (There’s also cybersecurity with AI, e.g. using AI to improve cybersecurity products, but I don’t consider this here). The following are at least partly research questions, unless I mark them as [implementation], meaning that almost no research is needed, and it’s just a matter of the relevant organizations implementing some measures.
confidential computing: third-party model evaluations are an important component of AI risk mitigation. Today, there is a trust issue: evaluators don’t want the AI labs to know which evaluations they’re running (because they could game them), and labs don’t want evaluators to have access to their model weights. In practice, third-party evaluators have so far been trusting the labs not to look at the evaluations, for lack of better options, and are limited to black-box investigations on frontier models. Confidential computing could change both of these things. There are software-based solutions (which tend to have a significant overhead), and hardware-based solutions (example company in this space: Mithril Security).
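As a rough illustration of the trust split this enables, here is a toy Python sketch using PyNaCl. It only models the data flow, not a real trusted execution environment: in practice the enclave keypair would be generated inside attested hardware and the public key would come with a hardware attestation report, which this sketch simply assumes.

```python
# Toy illustration of the evaluator/lab trust split (NOT a real TEE).
# Assumption: the enclave keypair exists only inside attested hardware;
# here it is generated in-process just to show the data flow.
from nacl.public import PrivateKey, SealedBox

# --- Inside the enclave (runs on the lab's hardware, but opaque to the lab) ---
enclave_key = PrivateKey.generate()
attested_pubkey = enclave_key.public_key  # published alongside an attestation report

# --- Evaluator side: encrypt the eval suite so only the enclave can read it ---
eval_suite = b'{"task": "hypothetical dangerous-capability eval", "items": "..."}'
ciphertext = SealedBox(attested_pubkey).encrypt(eval_suite)
# The lab's infrastructure only ever sees `ciphertext`, never the prompts.

# --- Inside the enclave: decrypt, run the model on the evals, return scores ---
plaintext = SealedBox(enclave_key).decrypt(ciphertext)
scores = {"passed": 12, "failed": 3}  # placeholder for actual model inference

# Only aggregate scores leave the enclave; the weights never do.
print(plaintext[:40], scores)
```

The point is symmetry: the evaluator’s prompts are encrypted end-to-end to the enclave, and the lab’s weights are only ever loaded inside it.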
[implementation] widespread cryptographic proofs of content authenticity, or “HTTPS for media”: deep fakes have the potential to make phishing easier (by spoofing voice and video), and to undermine images and videos as trusted media. However, public key cryptography can be leveraged to prove the authenticity of content, in the same way that HTTPS is now a widespread proof of the authenticity of websites. To prove that a raw picture or video is genuine, it’s enough for the camera to contain a private key, ideally in a Trusted Platform Module (TPM), and use it to digitally sign all its outputs (though this is technically straightforward, camera manufacturers are only starting to roll it out). Where progress is needed is in the infrastructure to make verification of all media online the default, just as HTTPS has, after decades of effort, become a universal standard, with browsers now warning on non-HTTPS content.
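A minimal sketch of the sign-and-verify step, using Ed25519 via the `cryptography` library. This is illustrative only: in a real camera the private key would sit in a secure element or TPM, and the public key would chain up to a manufacturer certificate rather than being used bare as here.

```python
# Minimal sketch of camera-side signing and client-side verification.
# Assumption: in a real deployment the private key lives in the camera's
# secure element / TPM and the public key is certified by the manufacturer.
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

# --- At manufacturing time: key pair provisioned into the camera ---
camera_private_key = Ed25519PrivateKey.generate()
camera_public_key = camera_private_key.public_key()  # published / certified

# --- In the camera: sign the raw output at capture time ---
image_bytes = b"\x89RAW...sensor data..."  # placeholder for the actual file
signature = camera_private_key.sign(image_bytes)

# --- In the browser / verifier: check the signature before trusting the image ---
def is_authentic(image: bytes, sig: bytes) -> bool:
    try:
        camera_public_key.verify(sig, image)
        return True
    except InvalidSignature:
        return False

print(is_authentic(image_bytes, signature))             # True
print(is_authentic(image_bytes + b"edit", signature))   # False: any edit breaks it
```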
trustworthy ML supply chain: backdoors are easier to create than to detect, and when the weights of a model are published, we would ideally want more guarantees on these weights than just trusting the model developer. Work in this domain could include cryptographic proofs that a model was indeed pretrained on the alleged training data, post-trained with the alleged method and data… In the future, a powerful model trained to be honest and unbiased would be valuable, and similar proofs would be needed for everyone to agree that such a model isn’t, e.g., pushing a hidden agenda.
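The weakest useful version of such guarantees is simply binding the published artifacts together: hash the claimed training data and the released weights into a manifest and have the developer sign it, so the alleged inputs and outputs at least can’t be silently swapped later. A minimal sketch below (all names and contents are hypothetical); actually proving that the training run connected the two is the open research part.

```python
# Sketch of a signed provenance manifest binding training data to released
# weights. This only prevents the published artifacts from being silently
# swapped; it does NOT prove that the training run itself connected them.
import hashlib
import json
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

def sha256_bytes(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# Placeholders for the real artifacts (training shards, final checkpoint).
training_shards = {"shard-000.jsonl": b"...", "shard-001.jsonl": b"..."}
weights_blob = b"...safetensors bytes..."

manifest = {
    "model": "example-model-v1",                 # hypothetical name
    "pretraining_data": {name: sha256_bytes(blob)
                         for name, blob in training_shards.items()},
    "post_training_method": "RLHF on dataset X", # the alleged recipe
    "weights": sha256_bytes(weights_blob),
}

developer_key = Ed25519PrivateKey.generate()  # in practice a long-lived, published key
payload = json.dumps(manifest, sort_keys=True).encode()
signature = developer_key.sign(payload)
# Publish (manifest, signature, developer public key) alongside the weights;
# anyone can re-hash the released artifacts and check they match the manifest.
```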
securing source code: work to improve AI performance at vulnerability detection in source code (e.g. something like the eyeballvul benchmark) would be valuable¹. Reducing false positives should be relatively easy, by spawning agents that investigate each lead in detail. Reducing false negatives (finding more, harder vulnerabilities) is more challenging. There’s a lot of work to be done in integrating frontier LLMs with the usual tools used by vulnerability researchers.
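For the false-positive side, the basic pattern is cheap: one pass proposes candidate leads, then a focused follow-up query per lead confirms or rejects it. A sketch of that loop is below; `query_llm` is a hypothetical stand-in for whatever frontier-model API is used, not a real library call.

```python
# Sketch of the "spawn an agent per lead" triage loop for cutting false positives.
# `query_llm` is a hypothetical placeholder, not a real API.
from dataclasses import dataclass

@dataclass
class Lead:
    file: str
    hypothesis: str  # e.g. "possible SQL injection in build_query()"

def query_llm(prompt: str) -> str:
    # Placeholder: call the LLM API of your choice here.
    return "NOT CONFIRMED"

def triage(leads: list[Lead], read_file) -> list[Lead]:
    confirmed = []
    for lead in leads:
        # Each lead gets its own focused investigation with full file context,
        # rather than trusting the initial pass's judgment.
        prompt = (
            f"Candidate vulnerability: {lead.hypothesis}\n"
            f"Relevant source:\n{read_file(lead.file)}\n"
            "Trace the data flow. Answer CONFIRMED or NOT CONFIRMED, with reasoning."
        )
        if query_llm(prompt).startswith("CONFIRMED"):
            confirmed.append(lead)
    return confirmed

# Example usage with an in-memory "repo":
repo = {"db.py": "def build_query(user_input):\n"
                 "    return 'SELECT * FROM t WHERE id=' + user_input"}
leads = [Lead("db.py", "possible SQL injection in build_query()")]
print(triage(leads, repo.get))
```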
good benchmarking of offensive hacking capabilities: we don’t want to be caught by surprise by sudden jumps in capabilities there. In fact, when agents are able to significantly uplift offensive hacking teams, it will probably be time to start restricting access to these capabilities (from the current default of everyone having access to SOTA models, protected only by non-adversarially robust “alignment” guardrails and some amount of monitoring).
securing AI labs against state-level actors: this is really hard. Even the NSA’s hacking arsenal was stolen and published in 2016/2017. The RAND report Securing AI Model Weights is a well-regarded overview of the problem.
Footnotes
1. Isn’t vulnerability detection dual-use? Yes, but see the discussion in section 6 of the eyeballvul paper for why I believe that vulnerability detection in source code, using simple and universal tooling, in the absence of an implementation overhang, should empower defenders disproportionately over attackers.