EXP-Bench: Can AI Conduct AI Research Experiments? ICLR 2026.

EXP-Bench is the first benchmark to evaluate AI agents on research experiment tasks that are semi-autonomously constructed from top-tier ML research papers.

January 2026 · Patrick Tser Jern Kon, Qiuyi Ding, Jiachen Liu, Xinyi Zhu, Jingjia Peng, Jiarong Xing, Yibo Huang, Yiming Qiu, Jayanth Srinivasa, Myungjin Lee, Mosharaf Chowdhury, Matei Zaharia and Ang Chen

Cloud Infrastructure Management in the Age of AI Agents. SIGOPS Operating Systems Review 2025.

We explore the promise of LLM-powered AI agents for cloud infrastructure management and report early takeaways and research challenges from a preliminary study across common cloud interfaces.

June 2025 · Zhenning Yang, Archit Bhatnagar, Yiming Qiu, Tongyuan Miao, Patrick Tser Jern Kon, Yunming Xiao, Yibo Huang, Martin Casado and Ang Chen

Curie: Toward Rigorous and Automated Scientific Experimentation with AI Agents. arXiv 2025.

Curie is the first AI-agent framework designed for automated and rigorous scientific experimentation. Curie helps answer your curiosity through end-to-end experimentation automation, ensuring that every step—from hypothesis formulation to result interpretation—is conducted with precision, reliability, and reproducibility.

January 2025 · Patrick Tser Jern Kon, Jiachen Liu, Qiuyi Ding, Yiming Qiu, Zhenning Yang, Yibo Huang, Jayanth Srinivasa, Myungjin Lee, Mosharaf Chowdhury and Ang Chen

Automated Bug Discovery in Cloud Infrastructure-as-Code Updates with LLM Agents. In AIOps 2025 (ICSE Workshop).

Cloud environments are increasingly managed by Infrastructure-as-Code (IaC) platforms like Terraform, which let developers define infrastructure as configuration code. While IaC automates deployment, its update logic is error-prone, often introducing subtle yet impactful bugs. IaC updates are common because cloud infrastructures are long-lived but user requirements fluctuate over time. Testing updates is challenging due to the vast and evolving search space of infrastructure setups and resources. We introduce TerraFault, an efficient, LLM-guided system for discovering update bugs. Our prototype optimizes search and testing to systematically detect bugs, even in simple updates, improving cloud reliability.

January 2025 · Yiming Xiang, Zhenning Yang, Jingjia Peng, Hermann Bauer, Patrick Tser Jern Kon, Yiming Qiu, and Ang Chen

Automated Lifting for Cloud Infrastructure-as-Code Programs. In AIOps 2025 (ICSE Workshop).

While effective for greenfield (new) cloud deployments, existing IaC platforms struggle with brownfield migration: translating existing non-IaC infrastructure into IaC programs. This limits cloud adoption, as current tools rely on error-prone, rule-based reverse engineering. We introduce Lilac, a novel approach that automates IaC lifting by combining LLMs for rule extraction with symbolic methods for correctness assurance. Lilac aims to enable an automated, provider-agnostic lifting tool with broad coverage and high accuracy, streamlining IaC adoption.

January 2025 · Jingjia Peng, Yiming Qiu, Patrick Tser Jern Kon, Pinhan Zhao, Yibo Huang, Zheng Guo, Xinyu Wang, and Ang Chen