EXP-Bench: Can AI Conduct AI Research Experiments? arXiv 2025.
EXP-Bench is the first benchmark to evaluate AI agents on research experiment tasks that are semi-autonomously constructed from top-tier ML research papers.
Curie is the first AI-agent framework designed for automated and rigorous scientific experimentation. Curie helps satisfy your curiosity through end-to-end experimentation automation, ensuring that every step, from hypothesis formulation to result interpretation, is conducted with precision, reliability, and reproducibility.
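A minimal sketch of the kind of hypothesis-to-interpretation loop such a framework automates, assuming nothing about Curie's actual API; `Hypothesis`, `run_experiment`, and the toy trial below are hypothetical placeholders.

```python
# Minimal sketch of an automated hypothesis -> trials -> interpretation loop.
# All names here are hypothetical placeholders, not Curie's actual API.
from dataclasses import dataclass, field
from statistics import mean
from typing import Callable, List

@dataclass
class Hypothesis:
    statement: str                          # e.g. "batching reduces mean latency"
    trials: List[float] = field(default_factory=list)

def run_experiment(h: Hypothesis, trial: Callable[[], float], n: int = 5) -> dict:
    """Repeat the trial n times so the conclusion is reproducible, then summarize."""
    h.trials = [trial() for _ in range(n)]
    return {"hypothesis": h.statement, "mean_metric": mean(h.trials), "n_trials": n}

if __name__ == "__main__":
    import random
    toy = Hypothesis("toy metric stays below 1.0")
    print(run_experiment(toy, trial=random.random))
```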
Cloud environments are increasingly managed by Infrastructure-as-Code (IaC) platforms like Terraform, which let developers define infrastructure as configuration code. While IaC automates deployment, its update logic is error-prone, often introducing subtle yet impactful bugs. IaC updates are common because cloud infrastructures are long-lived while user requirements fluctuate over time. Testing updates is challenging due to the vast and evolving search space of infrastructure setups and resources. We introduce TerraFault, an efficient, LLM-guided system for discovering update bugs. Our prototype optimizes search and testing to systematically detect bugs, even in seemingly simple updates, improving cloud reliability.
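As a rough illustration of the kind of check an LLM-guided search can apply to each candidate update, the sketch below inspects a Terraform plan for destructive resource changes; the LLM mutation step is left as a commented stub, and nothing here reflects TerraFault's actual implementation.

```python
# Sketch: flag candidate IaC updates whose Terraform plan would destroy resources.
# The LLM-guided mutation step (propose_update) is a hypothetical stub.
import json
import subprocess

def plan_actions(workdir: str) -> list:
    """Run `terraform plan` in workdir and return each resource change's actions."""
    subprocess.run(["terraform", "plan", "-out=tf.plan"], cwd=workdir, check=True)
    shown = subprocess.run(["terraform", "show", "-json", "tf.plan"],
                           cwd=workdir, capture_output=True, text=True, check=True)
    plan = json.loads(shown.stdout)
    return [rc["change"]["actions"] for rc in plan.get("resource_changes", [])]

def looks_destructive(actions: list) -> bool:
    # A seemingly minor edit that forces delete-and-recreate is a classic subtle bug.
    return any("delete" in a for a in actions)

# Search loop (sketch):
# for candidate in propose_update(base_config):   # LLM-guided mutation, stubbed
#     write candidate into workdir, then:
#     if looks_destructive(plan_actions(workdir)):
#         report(candidate)
```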
While effective for greenfield (new) cloud deployments, existing IaC platforms struggle with brownfield migration, i.e., translating existing non-IaC infrastructure into IaC programs. This limits IaC adoption, as current tools rely on error-prone, rule-based reverse engineering. We introduce Lilac, a novel approach that automates IaC lifting by combining LLMs for rule extraction with symbolic methods for correctness assurance. Lilac aims to enable an automated, provider-agnostic lifting tool with broad coverage and high accuracy, streamlining IaC adoption.
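To make the lifting idea concrete, the sketch below pairs one (hypothetically LLM-extracted) mapping rule, from a simplified EC2 instance description to a Terraform block, with a simple round-trip check; both are illustrative stand-ins, not Lilac's actual rules or symbolic machinery.

```python
# Sketch: an illustrative lifting rule plus a round-trip consistency check.
# The rule and check are stand-ins, not Lilac's actual rules or symbolic methods.
def lift_ec2_instance(desc: dict) -> str:
    """Map a (simplified) EC2 instance description to a Terraform aws_instance block."""
    return (
        f'resource "aws_instance" "{desc["Name"]}" {{\n'
        f'  ami           = "{desc["ImageId"]}"\n'
        f'  instance_type = "{desc["InstanceType"]}"\n'
        f'}}\n'
    )

def round_trip_ok(desc: dict, hcl: str) -> bool:
    # Simplified correctness check: every lifted attribute must survive in the HCL.
    return all(desc[k] in hcl for k in ("ImageId", "InstanceType"))

example = {"Name": "web1", "ImageId": "ami-0abc1234", "InstanceType": "t3.micro"}
hcl = lift_ec2_instance(example)
assert round_trip_ok(example, hcl)
print(hcl)
```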
While LLMs show potential in general code generation, their efficacy in IaC development remains unknown. To address this, we developed the first dataset and benchmark for evaluating IaC code generation. Our dataset comprises 458 human-curated scenarios spanning various AWS services and involved over 1,720 hours of human effort. Our results reveal significant performance gaps in current LLMs' IaC code generation.
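A minimal sketch of one way such a benchmark can be scored: generate IaC for each scenario and count outputs that at least pass `terraform validate`. The `scenarios` format and the `generate_iac` callable are hypothetical placeholders, not the benchmark's actual harness or metric.

```python
# Sketch: score generated IaC by whether it passes `terraform validate`.
# `scenarios` and `generate_iac` are hypothetical placeholders.
import pathlib
import subprocess
import tempfile

def passes_validate(hcl: str) -> bool:
    with tempfile.TemporaryDirectory() as d:
        pathlib.Path(d, "main.tf").write_text(hcl)
        init = subprocess.run(["terraform", "init", "-backend=false"],
                              cwd=d, capture_output=True)
        validate = subprocess.run(["terraform", "validate"], cwd=d, capture_output=True)
        return init.returncode == 0 and validate.returncode == 0

def score(scenarios, generate_iac) -> float:
    """Fraction of scenarios whose generated IaC at least validates."""
    results = [passes_validate(generate_iac(s["prompt"])) for s in scenarios]
    return sum(results) / len(results)
```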
Zodiac automatically unearths complex cloud IaC semantic checks/rules that state-of-the-art IaC tools cannot easily capture, turning runtime errors that can take a very long time to debug into simple compile-time checks.
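As an illustration of promoting a runtime-only semantic rule to a plan-time check, the sketch below flags Lambda functions whose S3 deployment bucket is not declared in the same plan; this rule is invented for illustration and is not one of Zodiac's mined rules.

```python
# Sketch: one invented semantic rule, checked against `terraform show -json` output.
import json

def check_lambda_s3_rule(plan_json: str) -> list:
    """Flag aws_lambda_function resources whose s3_bucket is not declared in the plan."""
    plan = json.loads(plan_json)
    resources = plan.get("planned_values", {}).get("root_module", {}).get("resources", [])
    declared = {r["values"].get("bucket")
                for r in resources if r["type"] == "aws_s3_bucket"}
    violations = []
    for r in resources:
        if r["type"] == "aws_lambda_function":
            bucket = r["values"].get("s3_bucket")
            if bucket and bucket not in declared:
                violations.append(f'{r["address"]}: deployment bucket "{bucket}" is not declared')
    return violations
```

Catching such a violation at plan time replaces a lengthy post-deployment debugging session with an immediate, actionable message.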
SpotProxy is a censorship resistance system that uses cost-effective and high-churn cloud instances to maximize the circumvention utility of cloud-hosted proxies.
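A minimal sketch of the cost/churn idea: run proxies on spot instances and re-provision whenever one is reclaimed, so clients always have a fresh, inexpensive endpoint. The AMI, region, instance type, and control loop below are placeholders, not SpotProxy's actual design.

```python
# Sketch: keep a proxy alive on cheap, high-churn spot capacity.
# AMI, region, and instance type are placeholders; the real proxy bootstrap is omitted.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # placeholder region

def launch_spot_proxy(ami: str = "ami-PLACEHOLDER", itype: str = "t3.micro") -> str:
    resp = ec2.run_instances(
        ImageId=ami, InstanceType=itype, MinCount=1, MaxCount=1,
        InstanceMarketOptions={"MarketType": "spot"},  # cost-effective, high-churn
    )
    return resp["Instances"][0]["InstanceId"]

def still_running(instance_id: str) -> bool:
    resp = ec2.describe_instances(InstanceIds=[instance_id])
    return resp["Reservations"][0]["Instances"][0]["State"]["Name"] == "running"

# Controller loop (sketch): whenever still_running(pid) turns False because the
# spot instance was reclaimed, call launch_spot_proxy() again and publish the
# new endpoint to clients.
```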
NetShuffle is a censorship resistance system that offers shuffle proxies, designed to engage a new class of support base, edge networks, which have received scant attention from existing work.
Cloudless Computing makes a case for simplifying cloud infrastructure management by sinking these "cloudy" infrastructure management tasks out of the user's view and providing them as a service, analogous to serverless computing, which relieves users of the burden of managing server instances.
Stargaze is a security-centric experimentation platform for low-Earth-orbit (LEO) satellite constellations.