EXP-Bench: Can AI Conduct AI Research Experiments? arXiv 2025.
EXP-Bench is the first benchmark to evaluate AI agents on research experiment tasks that are semi-autonomously constructed from top-tier ML research papers.
Curie is the first AI-agent framework designed for automated and rigorous scientific experimentation. Curie helps satisfy your curiosity through end-to-end experimentation automation, ensuring that every step, from hypothesis formulation to result interpretation, is conducted with precision, reliability, and reproducibility.
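A minimal sketch of the kind of hypothesis-to-interpretation loop such a framework automates, assuming nothing about Curie's actual API; `Hypothesis`, `run_experiment`, and the toy trial below are hypothetical placeholders.

```python
# Minimal sketch of an automated hypothesis -> trials -> interpretation loop.
# All names here are hypothetical placeholders, not Curie's actual API.
from dataclasses import dataclass, field
from statistics import mean
from typing import Callable, List

@dataclass
class Hypothesis:
    statement: str                          # e.g. "batching reduces mean latency"
    trials: List[float] = field(default_factory=list)

def run_experiment(h: Hypothesis, trial: Callable[[], float], n: int = 5) -> dict:
    """Repeat the trial n times so the conclusion is reproducible, then summarize."""
    h.trials = [trial() for _ in range(n)]
    return {"hypothesis": h.statement, "mean_metric": mean(h.trials), "n_trials": n}

if __name__ == "__main__":
    import random
    toy = Hypothesis("toy metric stays below 1.0")
    print(run_experiment(toy, trial=random.random))
```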
Cloud environments are increasingly managed by Infrastructure-as-Code (IaC) platforms like Terraform, which let developers define infrastructure as configuration code. While IaC automates deployment, its update logic is error-prone, often introducing subtle yet impactful bugs. IaC updates are common because cloud infrastructures are long-lived while user requirements fluctuate over time. Testing updates is challenging due to the vast and evolving search space of infrastructure setups and resources. We introduce TerraFault, an efficient, LLM-guided system for discovering update bugs. Our prototype optimizes search and testing to systematically detect bugs, even in seemingly simple updates, improving cloud reliability.
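As a rough illustration of the kind of check an LLM-guided search can apply to each candidate update, the sketch below inspects a Terraform plan for destructive resource changes; the LLM mutation step is left as a commented stub, and nothing here reflects TerraFault's actual implementation.

```python
# Sketch: flag candidate IaC updates whose Terraform plan would destroy resources.
# The LLM-guided mutation step (propose_update) is a hypothetical stub.
import json
import subprocess

def plan_actions(workdir: str) -> list:
    """Run `terraform plan` in workdir and return each resource change's actions."""
    subprocess.run(["terraform", "plan", "-out=tf.plan"], cwd=workdir, check=True)
    shown = subprocess.run(["terraform", "show", "-json", "tf.plan"],
                           cwd=workdir, capture_output=True, text=True, check=True)
    plan = json.loads(shown.stdout)
    return [rc["change"]["actions"] for rc in plan.get("resource_changes", [])]

def looks_destructive(actions: list) -> bool:
    # A seemingly minor edit that forces delete-and-recreate is a classic subtle bug.
    return any("delete" in a for a in actions)

# Search loop (sketch):
# for candidate in propose_update(base_config):   # LLM-guided mutation, stubbed
#     write candidate into workdir, then:
#     if looks_destructive(plan_actions(workdir)):
#         report(candidate)
```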
While effective for greenfield (new) cloud deployments, existing IaC platforms struggle with brownfield migration, i.e., translating existing non-IaC infrastructure into IaC programs. This limits IaC adoption, as current tools rely on error-prone, rule-based reverse engineering. We introduce Lilac, a novel approach that automates IaC lifting by combining LLMs for rule extraction with symbolic methods for correctness assurance. Lilac aims to enable an automated, provider-agnostic lifting tool with broad coverage and high accuracy, streamlining IaC adoption.
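To make the lifting idea concrete, the sketch below pairs one (hypothetically LLM-extracted) mapping rule, from a simplified EC2 instance description to a Terraform block, with a simple round-trip check; both are illustrative stand-ins, not Lilac's actual rules or symbolic machinery.

```python
# Sketch: an illustrative lifting rule plus a round-trip consistency check.
# The rule and check are stand-ins, not Lilac's actual rules or symbolic methods.
def lift_ec2_instance(desc: dict) -> str:
    """Map a (simplified) EC2 instance description to a Terraform aws_instance block."""
    return (
        f'resource "aws_instance" "{desc["Name"]}" {{\n'
        f'  ami           = "{desc["ImageId"]}"\n'
        f'  instance_type = "{desc["InstanceType"]}"\n'
        f'}}\n'
    )

def round_trip_ok(desc: dict, hcl: str) -> bool:
    # Simplified correctness check: every lifted attribute must survive in the HCL.
    return all(desc[k] in hcl for k in ("ImageId", "InstanceType"))

example = {"Name": "web1", "ImageId": "ami-0abc1234", "InstanceType": "t3.micro"}
hcl = lift_ec2_instance(example)
assert round_trip_ok(example, hcl)
print(hcl)
```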
While LLMs show potential in general code generation, their efficacy in IaC development remains unknown. To address this, we developed the first dataset and benchmark for evaluating IaC code generation. Our dataset comprises 458 human-curated scenarios spanning various AWS services and involved over 1,720 hours of human effort. Our results reveal significant performance gaps in current LLMs' IaC code generation.
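A minimal sketch of one way such a benchmark can be scored: generate IaC for each scenario and count outputs that at least pass `terraform validate`. The `scenarios` format and the `generate_iac` callable are hypothetical placeholders, not the benchmark's actual harness or metric.

```python
# Sketch: score generated IaC by whether it passes `terraform validate`.
# `scenarios` and `generate_iac` are hypothetical placeholders.
import pathlib
import subprocess
import tempfile

def passes_validate(hcl: str) -> bool:
    with tempfile.TemporaryDirectory() as d:
        pathlib.Path(d, "main.tf").write_text(hcl)
        init = subprocess.run(["terraform", "init", "-backend=false"],
                              cwd=d, capture_output=True)
        validate = subprocess.run(["terraform", "validate"], cwd=d, capture_output=True)
        return init.returncode == 0 and validate.returncode == 0

def score(scenarios, generate_iac) -> float:
    """Fraction of scenarios whose generated IaC at least validates."""
    results = [passes_validate(generate_iac(s["prompt"])) for s in scenarios]
    return sum(results) / len(results)
```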
Zodiac automatically unearths complex cloud IaC semantic checks/rules that state-of-the-art IaC tools cannot easily capture, turning runtime errors that can take a very long time to debug into simple compile-time checks.
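As an illustration of promoting a runtime-only semantic rule to a plan-time check, the sketch below flags Lambda functions whose S3 deployment bucket is not declared in the same plan; this rule is invented for illustration and is not one of Zodiac's mined rules.

```python
# Sketch: one invented semantic rule, checked against `terraform show -json` output.
import json

def check_lambda_s3_rule(plan_json: str) -> list:
    """Flag aws_lambda_function resources whose s3_bucket is not declared in the plan."""
    plan = json.loads(plan_json)
    resources = plan.get("planned_values", {}).get("root_module", {}).get("resources", [])
    declared = {r["values"].get("bucket")
                for r in resources if r["type"] == "aws_s3_bucket"}
    violations = []
    for r in resources:
        if r["type"] == "aws_lambda_function":
            bucket = r["values"].get("s3_bucket")
            if bucket and bucket not in declared:
                violations.append(f'{r["address"]}: deployment bucket "{bucket}" is not declared')
    return violations
```

Catching such a violation at plan time replaces a lengthy post-deployment debugging session with an immediate, actionable message.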
SpotProxy is a censorship resistance system that uses cost-effective and high-churn cloud instances to maximize the circumvention utility of cloud-hosted proxies.
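A minimal sketch of the cost/churn idea: run proxies on spot instances and re-provision whenever one is reclaimed, so clients always have a fresh, inexpensive endpoint. The AMI, region, instance type, and control loop below are placeholders, not SpotProxy's actual design.

```python
# Sketch: keep a proxy alive on cheap, high-churn spot capacity.
# AMI, region, and instance type are placeholders; the real proxy bootstrap is omitted.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # placeholder region

def launch_spot_proxy(ami: str = "ami-PLACEHOLDER", itype: str = "t3.micro") -> str:
    resp = ec2.run_instances(
        ImageId=ami, InstanceType=itype, MinCount=1, MaxCount=1,
        InstanceMarketOptions={"MarketType": "spot"},  # cost-effective, high-churn
    )
    return resp["Instances"][0]["InstanceId"]

def still_running(instance_id: str) -> bool:
    resp = ec2.describe_instances(InstanceIds=[instance_id])
    return resp["Reservations"][0]["Instances"][0]["State"]["Name"] == "running"

# Controller loop (sketch): whenever still_running(pid) turns False because the
# spot instance was reclaimed, call launch_spot_proxy() again and publish the
# new endpoint to clients.
```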
NetShuffle is a censorship resistance system that offers shuffle proxies, designed to engage a new class of support base, edge networks, which have received scant attention from existing work.
Cloudless Computing makes a case for simplifying cloud infrastructure management by sinking these "cloudy" infrastructure management tasks out of the user's view and providing them as a service, analogous to serverless computing, which relieves users of the burden of managing server instances.
Stargaze is a security-centric experimentation platform for low-Earth-orbit (LEO) satellite constellations.