EXP-Bench: Can AI Conduct AI Research Experiments? arXiv 2025.
EXP-Bench is the first benchmark to evaluate AI agents on research experiment tasks that are semi-autonomously constructed from top-tier ML research papers.
Curie is the first AI-agent framework designed for automated and rigorous scientific experimentation. Curie helps turn your curiosity into answers through end-to-end experiment automation, ensuring that every step, from hypothesis formulation to result interpretation, is carried out with precision, reliability, and reproducibility.
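To make that pipeline concrete, here is a minimal Python sketch of the hypothesis-run-interpret loop that Curie automates; the function names and stub logic are hypothetical illustrations, not Curie's actual API:

```python
from dataclasses import dataclass

@dataclass
class Experiment:
    hypothesis: str
    config: dict
    result: float | None = None

def run(exp: Experiment) -> Experiment:
    # Stand-in for launching a real training or measurement job.
    exp.result = sum(exp.config.values()) / len(exp.config)
    return exp

def interpret(exp: Experiment) -> str:
    # Stand-in for rigorous statistical analysis of the outcome.
    verdict = "supported" if exp.result is not None and exp.result > 0.5 else "not supported"
    return f"Hypothesis {exp.hypothesis!r} is {verdict} (score={exp.result:.2f})."

exp = Experiment("larger batch size improves accuracy",
                 {"batch=32": 0.41, "batch=128": 0.64})
print(interpret(run(exp)))
```

The point of automating this loop end to end is that each stage feeds the next without manual glue, which is where reproducibility is usually lost.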
Cloud environments are increasingly managed through Infrastructure-as-Code (IaC) platforms such as Terraform, which let developers define infrastructure as configuration code. While IaC automates deployment, its update logic is error-prone and often introduces subtle yet impactful bugs. Updates are common because cloud infrastructures are long-lived while user requirements evolve over time, yet testing them is challenging due to the vast and evolving search space of infrastructure setups and resources. We introduce TerraFault, an efficient, LLM-guided system for discovering update bugs. Our prototype optimizes the search and testing process to systematically detect bugs, even in seemingly simple updates, improving cloud reliability.
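To make this failure mode concrete, here is a toy Python model of how an IaC planner diffs an old state against a new configuration; the resource addresses are hypothetical, and real Terraform plans operate over full dependency graphs rather than flat dictionaries:

```python
# Old deployed state vs. updated configuration: the user only meant
# to rename the resource, but the address (the dict key) changed.
old = {"aws_s3_bucket.log_bucket": {"bucket": "example-logs"}}
new = {"aws_s3_bucket.logs":       {"bucket": "example-logs"}}

def plan(old_state: dict, new_config: dict):
    """Return (create, destroy, update) actions, keyed by resource address."""
    create  = [addr for addr in new_config if addr not in old_state]
    destroy = [addr for addr in old_state if addr not in new_config]
    update  = [addr for addr in new_config
               if addr in old_state and new_config[addr] != old_state[addr]]
    return create, destroy, update

# A naive planner sees an unrelated create + destroy, so the "rename"
# destroys the live bucket and its data: a subtle update bug of the
# kind TerraFault is designed to surface.
print(plan(old, new))
# (['aws_s3_bucket.logs'], ['aws_s3_bucket.log_bucket'], [])
```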
While effective for greenfield (new) cloud deployments, existing IaC platforms struggle with brownfield migration: translating existing non-IaC infrastructure into IaC programs. This limits IaC adoption, as current tools rely on error-prone, rule-based reverse engineering. We introduce Lilac, a novel approach that automates IaC lifting by combining LLMs for rule extraction with symbolic methods for correctness assurance. Lilac aims to provide an automated, provider-agnostic lifting tool with broad coverage and high accuracy, streamlining IaC adoption.
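As a rough illustration of what lifting involves, the hypothetical Python sketch below maps one observed cloud resource to an IaC block and applies a toy correctness check. The hand-written rule stands in for an LLM-extracted one, the check stands in for symbolic correctness assurance, and all names are illustrative; Lilac's actual rules and checks are far more general:

```python
# A resource as it might be observed via a cloud provider's API.
observed = {"type": "s3_bucket", "name": "example-logs", "tags": {"env": "prod"}}

def lift(resource: dict) -> str:
    """Apply a stand-in translation rule mapping an observed resource to HCL text."""
    tags = " ".join(f'{k} = "{v}"' for k, v in resource.get("tags", {}).items())
    return (
        'resource "aws_s3_bucket" "lifted" {\n'
        f'  bucket = "{resource["name"]}"\n'
        f'  tags   = {{ {tags} }}\n'
        '}\n'
    )

def check(resource: dict, hcl: str) -> None:
    """Toy correctness check: every observed attribute must survive lifting."""
    assert resource["name"] in hcl
    for key, value in resource.get("tags", {}).items():
        assert f'{key} = "{value}"' in hcl

hcl = lift(observed)
check(observed, hcl)
print(hcl)
```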