What progress have you made since your last update?
We have developed an advanced version of our problem generation tool, which creates planning problem instances in a "gripper-like" environment modeled after the classical STRIPS planning domain: a robot moves between rooms and picks up, drops, or otherwise interacts with objects. The numbers of objects, locations, and safety constraints are all configurable.
This tool allows us to systematically vary the size of the problem and the number of safety constraints, supporting the construction of a flexible and scalable benchmark.
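As a rough illustration of what such a generator might look like, the sketch below emits a PDDL-style problem string with configurable numbers of balls, rooms, and constraints. The function name, the object naming scheme, and the "(never ...)" constraint syntax are all assumptions for illustration; the actual tool's interface and constraint language may differ.

```python
import random


def generate_gripper_problem(n_balls: int, n_rooms: int,
                             n_constraints: int, seed: int = 0) -> str:
    """Emit a PDDL-style problem for a gripper-like domain with a
    configurable number of balls, rooms, and safety constraints.
    (Hypothetical sketch, not the project's actual generator.)"""
    rng = random.Random(seed)  # seeded for reproducible benchmark instances
    balls = [f"ball{i}" for i in range(1, n_balls + 1)]
    rooms = [f"room{i}" for i in range(1, n_rooms + 1)]

    # Robot starts in room1 with both grippers free; balls are scattered.
    init = ["(at-robby room1)", "(free left)", "(free right)"]
    init += [f"(at {b} {rng.choice(rooms)})" for b in balls]

    # Goal: move every ball into the last room.
    goal = [f"(at {b} {rooms[-1]})" for b in balls]

    # Safety constraints, phrased here as forbidden ball/room pairs.
    constraints = [
        f"(:constraint (never (at {rng.choice(balls)} {rng.choice(rooms)})))"
        for _ in range(n_constraints)
    ]

    return "\n".join([
        "(define (problem gripper-gen)",
        "  (:domain gripper)",
        f"  (:objects {' '.join(balls)} - ball "
        f"{' '.join(rooms)} - room left right - gripper)",
        "  (:init " + " ".join(init) + ")",
        "  (:goal (and " + " ".join(goal) + "))",
        *constraints,
        ")",
    ])


print(generate_gripper_problem(n_balls=3, n_rooms=2, n_constraints=1))
```

Varying the three size parameters independently is what makes the benchmark scalable along both the problem-size and constraint-count axes.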
Initial experimental runs using this setup have also provided us with important conceptual clarity. In particular, we've identified a promising direction for contribution: characterizing the computational complexity of safety constraints. Our aim is to link different classes of constraints to known complexity classes in automated planning — and to use this connection to better understand and empirically predict how likely it is that state-of-the-art frontier models will violate these constraints, depending on their complexity.
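To make the empirical side of this concrete, one simple constraint class is a per-state invariant ("never <atom>"), which can be checked against a model-produced plan trace in time linear in the trace length; richer trajectory constraints sit higher in the complexity hierarchy. The checker below is a minimal sketch under assumed representations (a state as a frozenset of ground atoms), not the project's actual evaluation code.

```python
# A state is modeled as a frozenset of ground atoms,
# e.g. frozenset({"(at ball1 room2)", "(at-robby room2)"}).
State = frozenset


def violates_invariant(trace: list[State], forbidden: str) -> bool:
    """Check a state-invariant safety constraint ("never <forbidden>")
    against a plan trace: violated iff the forbidden atom holds in any
    visited state. Runs in time linear in the trace length."""
    return any(forbidden in state for state in trace)


# Hypothetical example: the robot carries ball1 into a forbidden room.
trace = [
    frozenset({"(at ball1 room1)", "(at-robby room1)"}),
    frozenset({"(at ball1 room3)", "(at-robby room3)"}),  # violation here
]
print(violates_invariant(trace, "(at ball1 room3)"))  # True
```

Running such checkers over model-generated plans, stratified by constraint class, is one way to turn the complexity-theoretic framing into measurable violation rates.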
What are your next steps?
Formalize our theoretical framing around safety constraint complexity and its empirical implications, with the goal of producing a framework that connects symbolic planning theory with LLM behavior in practice.
Finalize the SafePlanBench benchmark by expanding the set of safety constraint types and further diversifying problem templates.
Begin large-scale evaluation of instruction-tuned and reasoning LLMs using the benchmark.