EXP-Bench: Can AI Conduct AI Research Experiments? arXiv 2025.
EXP-Bench is the first benchmark to evaluate AI agents on research experiment tasks that are semi-autonomously constructed from top-tier ML research papers.
Curie is the first AI-agent framework designed for automated and rigorous scientific experimentation. Curie helps turn your curiosity into answers through end-to-end experiment automation, ensuring that every step, from hypothesis formulation to result interpretation, is carried out with precision, reliability, and reproducibility.
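To make that pipeline concrete, here is a minimal Python sketch of the hypothesis-run-interpret loop that Curie automates; the function names and stub logic are hypothetical illustrations, not Curie's actual API:

```python
from dataclasses import dataclass

@dataclass
class Experiment:
    hypothesis: str
    config: dict
    result: float | None = None

def run(exp: Experiment) -> Experiment:
    # Stand-in for launching a real training or measurement job.
    exp.result = sum(exp.config.values()) / len(exp.config)
    return exp

def interpret(exp: Experiment) -> str:
    # Stand-in for rigorous statistical analysis of the outcome.
    verdict = "supported" if exp.result is not None and exp.result > 0.5 else "not supported"
    return f"Hypothesis {exp.hypothesis!r} is {verdict} (score={exp.result:.2f})."

exp = Experiment("larger batch size improves accuracy",
                 {"batch=32": 0.41, "batch=128": 0.64})
print(interpret(run(exp)))
```

The point of automating this loop end to end is that each stage feeds the next without manual glue, which is where reproducibility is usually lost.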
Cloud environments are increasingly managed through Infrastructure-as-Code (IaC) platforms such as Terraform, which let developers define infrastructure as configuration code. While IaC automates deployment, its update logic is error-prone and often introduces subtle yet impactful bugs. Updates are common because cloud infrastructures are long-lived while user requirements evolve over time, yet testing them is challenging due to the vast and evolving search space of infrastructure setups and resources. We introduce TerraFault, an efficient, LLM-guided system for discovering update bugs. Our prototype optimizes the search and testing process to systematically detect bugs, even in seemingly simple updates, improving cloud reliability.
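To make this failure mode concrete, here is a toy Python model of how an IaC planner diffs an old state against a new configuration; the resource addresses are hypothetical, and real Terraform plans operate over full dependency graphs rather than flat dictionaries:

```python
# Old deployed state vs. updated configuration: the user only meant
# to rename the resource, but the address (the dict key) changed.
old = {"aws_s3_bucket.log_bucket": {"bucket": "example-logs"}}
new = {"aws_s3_bucket.logs":       {"bucket": "example-logs"}}

def plan(old_state: dict, new_config: dict):
    """Return (create, destroy, update) actions, keyed by resource address."""
    create  = [addr for addr in new_config if addr not in old_state]
    destroy = [addr for addr in old_state if addr not in new_config]
    update  = [addr for addr in new_config
               if addr in old_state and new_config[addr] != old_state[addr]]
    return create, destroy, update

# A naive planner sees an unrelated create + destroy, so the "rename"
# destroys the live bucket and its data: a subtle update bug of the
# kind TerraFault is designed to surface.
print(plan(old, new))
# (['aws_s3_bucket.logs'], ['aws_s3_bucket.log_bucket'], [])
```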
While effective for greenfield (new) cloud deployments, existing IaC platforms struggle with brownfield migration: translating existing non-IaC infrastructure into IaC programs. This limits IaC adoption, as current tools rely on error-prone, rule-based reverse engineering. We introduce Lilac, a novel approach that automates IaC lifting by combining LLMs for rule extraction with symbolic methods for correctness assurance. Lilac aims to provide an automated, provider-agnostic lifting tool with broad coverage and high accuracy, streamlining IaC adoption.
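As a rough illustration of what lifting involves, the hypothetical Python sketch below maps one observed cloud resource to an IaC block and applies a toy correctness check. The hand-written rule stands in for an LLM-extracted one, the check stands in for symbolic correctness assurance, and all names are illustrative; Lilac's actual rules and checks are far more general:

```python
# A resource as it might be observed via a cloud provider's API.
observed = {"type": "s3_bucket", "name": "example-logs", "tags": {"env": "prod"}}

def lift(resource: dict) -> str:
    """Apply a stand-in translation rule mapping an observed resource to HCL text."""
    tags = " ".join(f'{k} = "{v}"' for k, v in resource.get("tags", {}).items())
    return (
        'resource "aws_s3_bucket" "lifted" {\n'
        f'  bucket = "{resource["name"]}"\n'
        f'  tags   = {{ {tags} }}\n'
        '}\n'
    )

def check(resource: dict, hcl: str) -> None:
    """Toy correctness check: every observed attribute must survive lifting."""
    assert resource["name"] in hcl
    for key, value in resource.get("tags", {}).items():
        assert f'{key} = "{value}"' in hcl

hcl = lift(observed)
check(observed, hcl)
print(hcl)
```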