Abstract
EXP-Bench is a novel benchmark designed by the Curie team to systematically evaluate AI agents on complete research experiments sourced from influential AI publications. Given a research question and incomplete starter code, EXP-Bench challenges AI agents to formulate hypotheses, design and implement experimental procedures, execute them, and analyze the results. The dataset curates 461 AI research tasks drawn from 51 top-tier AI research papers.
Figure 1: EXP-Bench high-level overview.
EXP-Bench evaluates AI agents on research experiment tasks.
Figure 2: One AI research task example from ICLR 2024 MogaNet.
Citation
Patrick Tser Jern Kon, Jiachen Liu, Xinyi Zhu, Qiuyi Ding, Jingjia Peng, Jiarong Xing, Yibo Huang, Yiming Qiu, Jayanth Srinivasa, Myungjin Lee, Mosharaf Chowdhury, Matei Zaharia, and Ang Chen. “EXP-Bench: Can AI Conduct AI Research Experiments?” arXiv, 2025.