Our Work
Papers, datasets, and open-source tools from Icaro Lab.
Tools and datasets
Public infrastructure, benchmarks, and reusable artifacts.
Adversarial Humanities Benchmark
A text-only benchmark that tests model safety against humanities-style adversarial reformulations.
Papers
Published and publicly available research.
Adversarial Humanities Benchmark: Results on Stylistic Robustness in Frontier Model Safety
Results from the Adversarial Humanities Benchmark (AHB), showing that stylistic reformulations substantially increase attack success rates across 31 frontier models.
Agentic Microphysics: A Manifesto for Generative AI Safety
A methodological proposal for studying agentic AI safety from local interaction dynamics up to population-level risks.
Institutional AI: Governing LLM Collusion in Multi-Agent Cournot Markets via Public Governance Graphs
An experimental governance-graph framework for reducing collusion in multi-agent LLM Cournot markets.
Institutional AI: A Governance Framework for Distributional AGI Safety
A system-level alignment framework that treats AI agent safety as a question of institutional governance and mechanism design.
From Adversarial Poetry to Adversarial Tales: An Interpretability Research Agenda
A study of culturally coded jailbreaks through narrative structure, with an agenda for mechanistic interpretability of stylistic attacks.
Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models
Evidence that poetic reformulations can produce systematic single-turn safety failures across frontier and open-weight models.
Beyond Single-Agent Safety: A Taxonomy of Risks in LLM-to-LLM Interactions
A taxonomy of micro-, meso-, and macro-level risks that emerge when language models interact with other language models.