UAD: Unsupervised Affordance Distillation
for Generalization in Robotic Manipulation

ICRA 2025 (Best Paper Finalist)

Stanford University


Unsupervised Affordance Distillation (UAD) distills visual affordances from off-the-shelf foundation models into predictions that are fine-grained, task-conditioned, and work in the wild and in dynamic environments, all without any manual annotations.

Abstract

Understanding fine-grained object affordances is imperative for robots to manipulate objects in unstructured environments given open-ended task instructions. However, existing methods for visual affordance prediction often rely on manually annotated data or condition only on a predefined set of tasks. We introduce Unsupervised Affordance Distillation (UAD), a method for distilling affordance knowledge from foundation models into a task-conditioned affordance model without any manual annotations. By leveraging the complementary strengths of large vision models and vision-language models, UAD automatically annotates a large-scale dataset with detailed instruction-affordance pairs. Training only a lightweight task-conditioned decoder atop frozen features, UAD exhibits notable generalization to in-the-wild robotic scenes as well as to various human activities, despite being trained only on rendered objects in simulation. Using affordances provided by UAD as the observation space, we show an imitation learning policy that generalizes promisingly to unseen object instances, object categories, and even variations in task instructions after training on as few as 10 demonstrations.

Overview of UAD



Using renderings of 3D objects, we first perform multi-view fusion of DINOv2 features and clustering to obtain fine-grained semantic regions of objects, which are then fed to a VLM to propose relevant tasks and their corresponding regions (a). The extracted affordances are then distilled by training a language-conditioned FiLM module atop frozen DINOv2 features (b). The learned task-conditioned affordance model provides in-the-wild predictions for diverse fine-grained regions, which serve as the observation space for manipulation policies (c).
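
To make the distillation step (b) concrete, below is a minimal sketch of a language-conditioned FiLM decoder over frozen patch features. It is not the authors' implementation: the feature dimension (384-d, as in DINOv2 ViT-S/14), the text-embedding size, the layer widths, and the per-patch sigmoid output are illustrative assumptions.

```python
# Minimal sketch: FiLM-conditioned affordance decoder over frozen visual features.
# Shapes and layer sizes are illustrative assumptions, not the released code.
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Predicts per-channel scale and shift from a task (instruction) embedding."""
    def __init__(self, text_dim: int, feat_dim: int):
        super().__init__()
        self.to_gamma = nn.Linear(text_dim, feat_dim)
        self.to_beta = nn.Linear(text_dim, feat_dim)

    def forward(self, feats, text_emb):
        # feats: (B, N, C) patch features; text_emb: (B, text_dim)
        gamma = self.to_gamma(text_emb).unsqueeze(1)   # (B, 1, C)
        beta = self.to_beta(text_emb).unsqueeze(1)     # (B, 1, C)
        return gamma * feats + beta

class AffordanceDecoder(nn.Module):
    """Lightweight task-conditioned head trained atop frozen patch features."""
    def __init__(self, feat_dim: int = 384, text_dim: int = 512):
        super().__init__()
        self.film = FiLM(text_dim, feat_dim)
        self.head = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, 1),
        )

    def forward(self, patch_feats, text_emb):
        # patch_feats: (B, N, C) frozen DINOv2 patch tokens (extracted offline)
        # text_emb:    (B, text_dim) embedding of the task instruction
        x = self.film(patch_feats, text_emb)
        logits = self.head(x).squeeze(-1)              # (B, N) per-patch logits
        return torch.sigmoid(logits)                   # relevance scores in [0, 1]

# Example: 16 x 16 = 256 patch tokens from a frozen backbone, one instruction.
decoder = AffordanceDecoder()
patch_feats = torch.randn(1, 256, 384)   # placeholder for frozen DINOv2 features
text_emb = torch.randn(1, 512)           # placeholder instruction embedding
affordance = decoder(patch_feats, text_emb)   # (1, 256) -> reshape to a 16 x 16 map
```

Only the FiLM and head parameters would be trained; the backbone stays frozen, with the task-region pairs extracted in step (a) providing supervision.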

Task-Conditioned Affordance Prediction

DROID Eval Results
AGD20K Eval Results

Leveraging pre-trained features, UAD generalizes seamlessly to real-world robotic scenes and even to human activities.
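
As a rough illustration of how such a trained model might be queried on an arbitrary image, the sketch below reuses the AffordanceDecoder from the overview and pulls frozen patch features from the public DINOv2 hub model (dinov2_vits14); the preprocessing, instruction encoder, and 224x224 input size are assumptions.

```python
# Hypothetical test-time query of a task-conditioned affordance model on an RGB frame.
import torch
import torch.nn.functional as F

# Frozen DINOv2 ViT-S/14 backbone from the public torch hub; weights are not fine-tuned.
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()

@torch.no_grad()
def predict_affordance(decoder, image, text_emb):
    # image: (1, 3, H, W) normalized RGB, with H and W divisible by the 14-pixel patch size
    feats = backbone.forward_features(image)["x_norm_patchtokens"]  # (1, N, 384)
    h, w = image.shape[-2] // 14, image.shape[-1] // 14
    scores = decoder(feats, text_emb).reshape(1, 1, h, w)           # coarse per-patch grid
    # Upsample the patch grid back to image resolution for visualization as a heatmap.
    return F.interpolate(scores, size=image.shape[-2:], mode="bilinear", align_corners=False)

# Example usage, with `decoder` being the AffordanceDecoder sketched in the overview above:
image = torch.randn(1, 3, 224, 224)      # placeholder for a normalized RGB frame
text_emb = torch.randn(1, 512)           # placeholder instruction embedding
heatmap = predict_affordance(decoder, image, text_emb)   # (1, 1, 224, 224)
```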

Interactive Affordance Demo


Generalization Properties for Policy Learning

Using affordance maps as a task-conditioned visual representation, policies learned from as few as 10 demonstrations generalize to novel poses, instances, categories, and even novel instructions, all in a zero-shot manner. Here we show examples of these generalization settings for three simulated tasks: pouring, opening, and insertion; a minimal sketch of such a policy follows the examples.

Pouring: hold liquid container and pour into bowl

Training Scenario

Unseen Pose

Unseen Instance
(beer bottle & bowl)

Unseen Category
(beer bottle → coke can)

Unseen Instruction
(pour beer → water plant)

Opening: grasp and pull open revolute cabinet door

Training Scenario

Unseen Pose

Unseen Instance

Unseen Category
(cabinet → fridge)

Insertion: pick pen and insert into pen holder

Training Scenario

Unseen Pose

Unseen Instance (marker)

Unseen Category
(pen holder → cup)

Unseen Instruction
(insert pen → insert carrot)
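
As referenced above, here is a minimal behavior-cloning sketch in which the task-conditioned affordance map, together with proprioception, forms the policy observation. The CNN/MLP sizes, proprioception dimension, and 7-DoF action output are illustrative assumptions rather than the paper's exact policy architecture.

```python
# Hypothetical behavior-cloning policy with an affordance-map observation space.
import torch
import torch.nn as nn

class AffordancePolicy(nn.Module):
    def __init__(self, proprio_dim: int = 8, action_dim: int = 7):
        super().__init__()
        # Small CNN over the single-channel, task-conditioned affordance heatmap.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # MLP over the image embedding concatenated with robot proprioception.
        self.mlp = nn.Sequential(
            nn.Linear(32 + proprio_dim, 128), nn.ReLU(),
            nn.Linear(128, action_dim),
        )

    def forward(self, affordance_map, proprio):
        # affordance_map: (B, 1, H, W) heatmap from UAD; proprio: (B, proprio_dim)
        z = self.encoder(affordance_map)
        return self.mlp(torch.cat([z, proprio], dim=-1))   # e.g. end-effector deltas

policy = AffordancePolicy()
action = policy(torch.rand(1, 1, 224, 224), torch.zeros(1, 8))   # (1, 7)
```

Because the instruction is already baked into the affordance map, the policy itself needs no language input, which is one plausible reason a handful of demonstrations can transfer across instances, categories, and instructions.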

Real-World Robot Execution

We further show that UAD-based policies can solve real-world tasks. Each policy is trained on 10 demonstrations.

Pouring: hold liquid container and water the plant.

Opening: pull open the drawer.

Insertion: pick pen and insert into pen holder.