Standard evaluation protocols in robotic manipulation typically assess policy performance over curated, in-distribution test sets, offering limited insight into how systems fail under plausible variation.
We introduce a red-teaming framework that probes robustness through object-centric geometric perturbations, automatically generating CrashShapes: structurally valid, user-constrained mesh deformations that trigger catastrophic failures in pre-trained manipulation policies.
The method integrates a Jacobian field–based deformation model with a gradient-free, simulator-in-the-loop optimization strategy.
Across insertion, articulation, and grasping tasks, our approach consistently discovers deformations that collapse policy performance, revealing brittle failure modes missed by static benchmarks.
By combining task-level policy rollouts with constraint-aware shape exploration, we aim to build a general-purpose framework for structured, object-centric robustness evaluation in robotic manipulation.
We additionally show that fine-tuning on individual CrashShapes, a process we refer to as blue-teaming, improves task success by up to 60 percentage points on those shapes, while preserving performance on the original object, demonstrating the utility of red-teamed geometries for targeted policy refinement.
Finally, we validate both red-teaming and blue-teaming results with a real robotic arm, confirming that the discovered red-teaming failure cases, and corresponding blue-teaming refinement, transfer from simulation to the real world.
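To give a feel for what a gradient-domain (Jacobian-style) deformation does, here is a minimal sketch on a toy polyline rather than a triangle mesh. It is not the deformation model used in this work: the function `deform_chain`, the soft-constraint `weight`, and the chain setup are all assumptions made for illustration. The idea it shows is that each edge tries to keep its rest-pose difference vector (an identity target Jacobian), while anchor points are pinned and handle points are displaced, and a least-squares solve spreads the handle motion smoothly over the shape.

```python
import numpy as np

def deform_chain(rest, anchor_idx, handle_idx, handle_disp, weight=1e3):
    """Toy gradient-domain deformation of a polyline (illustrative only).

    rest:        (n, d) rest-pose vertex positions
    anchor_idx:  indices of vertices that should stay in place
    handle_idx:  indices of vertices that should move
    handle_disp: one displacement vector per handle
    weight:      soft-constraint weight (hypothetical parameter)
    """
    rest = np.asarray(rest, dtype=float)
    n = len(rest)
    rows, rhs = [], []
    # Edge terms: each deformed edge should match its rest-pose edge vector.
    for i in range(n - 1):
        r = np.zeros(n)
        r[i], r[i + 1] = -1.0, 1.0
        rows.append(r)
        rhs.append(rest[i + 1] - rest[i])
    # Anchor terms: keep anchors at their rest positions.
    for i in anchor_idx:
        r = np.zeros(n)
        r[i] = weight
        rows.append(r)
        rhs.append(weight * rest[i])
    # Handle terms: move handles by the requested displacement.
    for i, d in zip(handle_idx, handle_disp):
        r = np.zeros(n)
        r[i] = weight
        rows.append(r)
        rhs.append(weight * (rest[i] + np.asarray(d, dtype=float)))
    A = np.vstack(rows)
    B = np.vstack(rhs)
    # Least-squares solve for all deformed vertex positions at once.
    deformed, *_ = np.linalg.lstsq(A, B, rcond=None)
    return deformed

# Usage: pin the first vertex, lift the last one by one unit in y.
rest_pose = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [4.0, 0.0]])
deformed_pose = deform_chain(rest_pose, [0], [4], [[0.0, 1.0]])
```

In this toy setting the handle displacement is distributed evenly along the chain, so the middle vertex moves by roughly half of the handle displacement.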
System overview of our geometric red-teaming pipeline. Given a task description and nominal object (Initialization Parameters), our pipeline first selects anchor and handle points using a vision-language model (a). Initial handle displacements are sampled to define a population of deformation candidates. Each sample is converted into a perturbed mesh via Jacobian field-based optimization (b) and evaluated in simulation with a frozen policy (c). Deformations that induce failure are selected and mutated to guide the next population.
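The sample, evaluate, select, and mutate loop described in this caption can be sketched as a simple population-based search. This is a minimal illustration under stated assumptions, not the pipeline's actual implementation: `toy_failure_score` is a hypothetical stand-in for rolling out the frozen policy in simulation, and the population size, elite count, mutation scale `sigma`, and norm bound `max_norm` (standing in for a user constraint) are invented for the example.

```python
import numpy as np

def toy_failure_score(displacement):
    # Hypothetical stand-in for a simulator rollout with a frozen policy.
    # Returns a value in (0, 1]; higher means the policy fails more often.
    target = np.array([0.8, -0.5, 0.3])
    return float(np.exp(-np.sum((displacement - target) ** 2)))

def red_team_search(score_fn, dim, pop_size=32, n_elite=8, generations=30,
                    sigma=0.2, max_norm=1.5, seed=0):
    """Gradient-free search over handle displacements (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Sample an initial population of handle displacements.
    population = rng.normal(0.0, sigma, size=(pop_size, dim))
    best_x, best_score = None, -np.inf
    for _ in range(generations):
        # Assumed user constraint: cap the displacement magnitude.
        norms = np.linalg.norm(population, axis=1, keepdims=True)
        population = population * np.minimum(1.0, max_norm / np.maximum(norms, 1e-12))
        # Evaluate every candidate (in the real setting: deform, then simulate).
        scores = np.array([score_fn(x) for x in population])
        order = np.argsort(scores)[::-1]
        elites = population[order[:n_elite]]
        if scores[order[0]] > best_score:
            best_score, best_x = float(scores[order[0]]), elites[0].copy()
        # Mutate the failure-inducing candidates to form the next population.
        parents = elites[rng.integers(0, n_elite, size=pop_size)]
        population = parents + rng.normal(0.0, sigma, size=(pop_size, dim))
    return best_x, best_score

best_x, best_score = red_team_search(toy_failure_score, dim=3)
```

Because the search only needs a scalar score per candidate, the simulator and policy can stay black boxes, which is what makes a gradient-free strategy a natural fit here.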
Object Type | Task 1: Grasping | Task 2: Insertion | Task 3: Articulation |
---|---|---|---|
Nominal Objects | [image] | [image] | [image] |
CrashShapes | [image] | [image] | [image] |
Two-stage VLM prompting strategy for 3D handle-point selection. First, the Geometric Reasoning template aligns a canonical view-panel and indexed keypoints with a high-level task description, guiding the VLM to infer which vertices control meaningful mesh deformations. Next, the Task-Critical Ranking template asks the model to Pareto-rank these candidates by plausibility and task relevance, producing a compact set of handle points for targeted, task-aware red-teaming.
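In the pipeline, the ranking is produced by the VLM itself. To make explicit what a Pareto ranking over two criteria means, the sketch below sorts candidates into non-dominated fronts. The function `pareto_fronts`, the keypoint names, and the numeric scores are hypothetical and only serve the illustration.

```python
def pareto_fronts(candidates):
    """Group candidates into Pareto fronts (illustrative only).

    candidates: dict mapping a name to (plausibility, task_relevance),
                where higher is better on both criteria.
    Returns a list of fronts; the first front holds the non-dominated items.
    """
    def dominates(a, b):
        # a dominates b if it is at least as good everywhere and better somewhere.
        return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

    remaining = dict(candidates)
    fronts = []
    while remaining:
        front = sorted(
            name for name, score in remaining.items()
            if not any(dominates(other_score, score)
                       for other, other_score in remaining.items() if other != name)
        )
        fronts.append(front)
        for name in front:
            del remaining[name]
    return fronts

# Hypothetical keypoint scores: (plausibility, task relevance).
scores = {
    "kp_3": (0.9, 0.4),
    "kp_7": (0.6, 0.8),
    "kp_12": (0.5, 0.5),
    "kp_20": (0.2, 0.1),
}
fronts = pareto_fronts(scores)
# fronts == [["kp_3", "kp_7"], ["kp_12"], ["kp_20"]]
```

Here `kp_3` and `kp_7` share the first front because neither is better on both criteria, so a compact handle set would draw from them first.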
We tested our geometric red-teaming framework on a physical xARM 6 robot using 3D-printed plugs (Nominal, CS-1, CS-2). The following videos demonstrate the effectiveness of red-teaming in identifying failures and blue-teaming in enhancing policy robustness.
The pre-trained policy consistently succeeds with the standard, undeformed object.
The same policy consistently fails when presented with CrashShape CS-1.
The policy also consistently fails when presented with CrashShape CS-2.
A policy was fine-tuned using CS-1 and the nominal plug. It was then evaluated on both shapes.
The blue-teamed policy now consistently succeeds on CS-1.
Performance on the nominal plug is preserved.
A separate policy was fine-tuned using CS-2 and the nominal plug, then evaluated.
This blue-teamed policy now consistently succeeds on CS-2.
Performance on the nominal plug is also preserved with this policy.
We tested our geometric red-teaming framework on a physical Franka Emika Panda robot using 3D-printed (Nominal, Deformed) object pairs from the YCB dataset. The following videos demonstrate the effectiveness of red-teaming in identifying failures of the generalizable grasping model Contact-GraspNet.
The most confident grasp from Contact-GraspNet consistently succeeds on the 3D-printed mustard bottle taken directly from the YCB dataset.
The most confident grasp from Contact-GraspNet fails on the 3D-printed deformed mustard bottle obtained through geometric red-teaming.
The most confident grasp from Contact-GraspNet consistently succeeds on the 3D-printed screwdriver taken directly from the YCB dataset.
The most confident grasp from Contact-GraspNet fails on the 3D-printed deformed screwdriver obtained through geometric red-teaming.