Geometric Red-Teaming for Robotic Manipulation

Conference on Robot Learning (CoRL) 2025

1 Robotics Institute, Carnegie Mellon University
2 National Institute of Standards and Technology
Overview of the Geometric Red-Teaming framework showing object deformation and policy failure.

Abstract

Standard evaluation protocols in robotic manipulation typically assess policy performance over curated, in-distribution test sets, offering limited insight into how systems fail under plausible variation. We introduce a red-teaming framework that probes robustness through object-centric geometric perturbations, automatically generating CrashShapes---structurally valid, user-constrained mesh deformations that trigger catastrophic failures in pre-trained manipulation policies. The method integrates a Jacobian field–based deformation model with a gradient-free, simulator-in-the-loop optimization strategy. Across insertion, articulation, and grasping tasks, our approach consistently discovers deformations that collapse policy performance, revealing brittle failure modes missed by static benchmarks. By combining task-level policy rollouts with constraint-aware shape exploration, we aim to build a general-purpose framework for structured, object-centric robustness evaluation in robotic manipulation. We additionally show that fine-tuning on individual CrashShapes, a process we refer to as blue-teaming, improves task success by up to 60 percentage points on those shapes while preserving performance on the original object, demonstrating the utility of red-teamed geometries for targeted policy refinement. Finally, we validate both red-teaming and blue-teaming on a real robotic arm, confirming that the discovered failure cases and the corresponding policy refinements transfer from simulation to the real world.

System Overview


System overview of our geometric red-teaming pipeline. Given a task description and nominal object (Initialization Parameters), our pipeline first selects anchor and handle points using a vision-language model (a). Initial handle displacements are sampled to define a population of deformation candidates. Each sample is converted into a perturbed mesh via Jacobian field-based optimization (b) and evaluated in simulation with a frozen policy (c). Deformations that induce failure are selected and mutated to guide the next population.
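To make the search loop in (a)-(c) concrete, the sketch below shows one plausible way to implement the gradient-free, simulator-in-the-loop optimization in Python. The functions deform_mesh and rollout_policy are hypothetical stand-ins for the Jacobian field-based deformation step and the frozen-policy simulation rollout, and the elitist mutation scheme is a generic choice rather than the exact operator used in our pipeline.

```python
import numpy as np

# Hypothetical interfaces (illustrative, not the released API):
#   deform_mesh(mesh, handles, anchors, displacements) -> perturbed mesh
#       Jacobian field-based optimization producing a structurally valid mesh.
#   rollout_policy(mesh) -> task success rate in [0, 1]
#       Simulated rollouts of the frozen manipulation policy.

def red_team(mesh, handles, anchors, deform_mesh, rollout_policy,
             pop_size=16, n_generations=10, sigma=0.01, elite_frac=0.25):
    """Gradient-free, simulator-in-the-loop search for CrashShapes."""
    dim = (len(handles), 3)  # one 3D displacement per handle point
    population = np.random.normal(0.0, sigma, size=(pop_size, *dim))
    n_elite = max(1, int(elite_frac * pop_size))

    best_shape, best_score = None, np.inf
    for _ in range(n_generations):
        # Evaluate each candidate deformation with the frozen policy.
        scores, meshes = [], []
        for d in population:
            m = deform_mesh(mesh, handles, anchors, d)
            scores.append(rollout_policy(m))  # lower success = better CrashShape
            meshes.append(m)
        scores = np.asarray(scores)

        # Keep the deformations that hurt the policy the most.
        elite_idx = np.argsort(scores)[:n_elite]
        if scores[elite_idx[0]] < best_score:
            best_score, best_shape = scores[elite_idx[0]], meshes[elite_idx[0]]

        # Mutate elites to form the next population.
        parents = population[np.random.choice(elite_idx, size=pop_size)]
        population = parents + np.random.normal(0.0, sigma, size=parents.shape)

    return best_shape, best_score
```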

Simulation Demos

Video grid: rollouts on Nominal Objects and the corresponding CrashShapes across Task 1 (Grasping), Task 2 (Insertion), and Task 3 (Articulation).

VLM Prompting Strategy

Diagram of the Two-stage VLM prompting strategy for 3D handle-point selection

Two-stage VLM prompting strategy for 3D handle-point selection. First, the Geometric Reasoning template pairs a canonical view panel and indexed keypoints with a high-level task description, guiding the VLM to infer which vertices control meaningful mesh deformations. Next, the Task-Critical Ranking template asks the model to Pareto-rank these candidates by plausibility and task relevance, producing a compact set of handle points for targeted, task-aware red-teaming.
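The sketch below illustrates, under stated assumptions, how the two templates could be chained in code. query_vlm is a hypothetical wrapper around a vision-language model that takes images and a text prompt and returns a text reply, and the prompt strings only paraphrase the templates described above.

```python
import re

# Stage 1: geometric reasoning over a canonical view panel with indexed keypoints.
GEOMETRIC_REASONING = (
    "You are shown a canonical multi-view panel of an object with indexed "
    "keypoints. Task: {task}. List the keypoint indices whose displacement "
    "would produce meaningful, structurally plausible deformations."
)

# Stage 2: Pareto-rank the surviving candidates by plausibility and task relevance.
TASK_CRITICAL_RANKING = (
    "Given the candidate keypoints {candidates} for the task '{task}', "
    "Pareto-rank them by (i) geometric plausibility and (ii) task relevance, "
    "and return the top {k} indices."
)

def parse_indices(text):
    # Extract integer keypoint indices from the model's free-form reply.
    return [int(t) for t in re.findall(r"\d+", text)]

def select_handle_points(query_vlm, view_panel_image, task, k=4):
    # Stage 1: ask which keypoints control meaningful deformations.
    reply = query_vlm(images=[view_panel_image],
                      prompt=GEOMETRIC_REASONING.format(task=task))
    candidates = parse_indices(reply)

    # Stage 2: rank candidates and keep a compact set of handle points.
    reply = query_vlm(images=[view_panel_image],
                      prompt=TASK_CRITICAL_RANKING.format(
                          candidates=candidates, task=task, k=k))
    return parse_indices(reply)[:k]
```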

VLM Prompting Examples

Real-World Validation of the Insertion Policy

We tested our geometric red-teaming framework on a physical xArm 6 robot using 3D-printed plugs (Nominal, CS-1, CS-2). The following videos demonstrate the effectiveness of red-teaming in identifying failures and of blue-teaming in enhancing policy robustness.


1. Baseline: Original Policy with Nominal Plug

The pre-trained policy consistently succeeds with the standard, undeformed object.

2. Red Teaming: Original Policy Fails on CrashShape CS-1

The same policy consistently fails when presented with CrashShape CS-1.

3. Red Teaming: Original Policy Fails on CrashShape CS-2

The policy also consistently fails when presented with CrashShape CS-2.

4. Blue Teaming on CrashShape CS-1

A policy was fine-tuned using CS-1 and the nominal plug. It was then evaluated on both shapes.

Performance on CS-1 (with CS-1 Blue-Teamed Policy)

The blue-teamed policy now consistently succeeds on CS-1.


Performance on Nominal Plug (with CS-1 Blue-Teamed Policy)

Performance on the nominal plug is preserved.

5. Blue Teaming on CrashShape CS-2

A separate policy was fine-tuned using CS-2 and the nominal plug, then evaluated.

Performance on CS-2 (with CS-2 Blue-Teamed Policy)

This blue-teamed policy now consistently succeeds on CS-2.


Performance on Nominal Plug (with CS-2 Blue-Teamed Policy)

Performance on the nominal plug is also preserved with this policy.

Real-World Validation of Contact-GraspNet

We tested our geometric red-teaming framework on a physical Franka Emika Panda robot using 3D-printed object pairs (Nominal, Deformed) from the YCB dataset. The following videos demonstrate the effectiveness of red-teaming in identifying failures of the generalizable grasping model Contact-GraspNet.
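In these trials we execute only the single most confident grasp returned by the model. For clarity, the snippet below sketches this selection step, assuming the predictor returns an array of grasp poses with per-grasp confidence scores; the function and variable names are illustrative rather than part of the Contact-GraspNet API.

```python
import numpy as np

def most_confident_grasp(grasp_poses, scores):
    """Pick the highest-confidence grasp from a batch of predictions.

    grasp_poses: (N, 4, 4) homogeneous grasp transforms (assumed format).
    scores:      (N,) per-grasp confidence scores.
    """
    best = int(np.argmax(scores))
    return grasp_poses[best], float(scores[best])
```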


1. Baseline: Contact-GraspNet on the Original YCB Mustard Bottle

The most confident grasp from Contact-GraspNet consistently succeeds on the 3D-printed mustard bottle taken directly from the YCB dataset.

2. Red-Teaming: Contact-GraspNet Fails on the Deformed YCB Mustard Bottle

The most confident grasp from Contact-GraspNet fails on the 3D-printed version of the deformed mustard bottle obtained via geometric red-teaming.

3. Baseline: Contact-GraspNet on the Original YCB Screwdriver

The most confident grasp from Contact-GraspNet consistently succeeds on the 3D-printed screwdriver taken directly from the YCB dataset.

4. Red-Teaming: Contact-GraspNet Fails on the Deformed YCB Screwdriver

The most confident grasp from Contact-GraspNet fails on the 3D-printed version of the deformed screwdriver obtained via geometric red-teaming.