Self-supervised learning for robotics

A crash course from Robotics: Science and Systems Conference 2020.

Nathan Lambert
6 min read · Jul 14, 2020

Self-supervised learning is an exciting research direction that aims to learn representations from the data itself without explicit and potentially even manual supervision. One of the major benefits of self-supervised learning is the ability to scale to large amounts of unlabelled data in a lifelong learning manner and to improve performance by reducing the effect of dataset bias. Recent developments in self-supervised learning have achieved performance comparable to or better than fully-supervised models. However, many of these methods are developed in domain-specific communities such as robotics, computer vision or reinforcement learning. The aim of this workshop is to bring together researchers from different communities to discuss opportunities and challenges and to explore new directions.

I wanted to learn something new. Self-supervised learning for robotics is all about data creation (and augmentation), reward engineering, and experimental setup so that our robots can learn on their own (and move toward lifelong learning). It struck me as a very young field with a breadth of applications. The link to all materials is here.

Live stream.

Dieter Fox: Overview of self-supervised learning for robotics

Autonomous data generation

  • Train pose estimation beforehand.
  • Robots generate and label data on their own (refining detection further); see the sketch after this list.
  • Accurate pose initialization is important.
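
A minimal sketch of the loop in the bullets above, with hypothetical `pose_estimator` and `robot` interfaces (these names are illustrative, not from the talk): the robot labels its own images with a pretrained estimator, keeps only confident estimates, and retrains on them.

```python
# Hypothetical sketch of a self-labeling loop; pose_estimator and robot are
# illustrative interfaces, not APIs named in the talk.
def self_label_poses(pose_estimator, robot, n_rounds=5, n_images=1000,
                     confidence_threshold=0.9):
    dataset = []
    for _ in range(n_rounds):
        for _ in range(n_images):
            image = robot.capture_image()
            pose = pose_estimator.predict(image)          # pretrained model supplies the label
            if pose.confidence > confidence_threshold:    # accurate pose initialization matters
                dataset.append((image, pose))
        pose_estimator.fit(dataset)                       # refine detection on the robot's own labels
    return pose_estimator
```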

“Generate and label their own data”

A good introduction to self-supervision for me. I joined this talk late.

Abhinav Gupta: Learning like babies

Scaling learning with self-supervision and lifelong learning

  • 3 core vectors: “100x images, supervising robot data, curiosity”
  • This work is the intersection of supervised / passive learning with RL in robotics.
Linking to how we all learn already.

Existing approaches don’t scale!

  • ImageNet-like approaches labeled 1M boxes over 5 years, but Facebook generates > 600M images a day… can we label that?
  • Simulation is 1 task, tons of interactions, but in reality babies do 1000s of tasks in parallel with less structure.
Stages of how robots could learn from simplest to most complex.
  • Remove data labelling bottleneck! How do self-supervised approaches scale?
  • Hardest tasks refine the representations further (more specific embeddings, potentially better performance).
  • Example: picking up objects at random locations (with force sensor feedback) is a way to collect data; a sketch of that recipe follows this list.
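
A rough sketch of that data-collection recipe, with a hypothetical `robot` interface: the force sensor supplies the grasp success/failure label for free, so no human annotation is needed. All names and values below are illustrative.

```python
import random

FORCE_THRESHOLD = 0.5  # illustrative value; depends on the gripper

# Hypothetical robot interface: the force reading is the self-supervised label.
def collect_grasp_data(robot, x_range=(-0.3, 0.3), y_range=(-0.3, 0.3), n_attempts=10_000):
    data = []
    for _ in range(n_attempts):
        x, y = random.uniform(*x_range), random.uniform(*y_range)  # random grasp location
        image = robot.capture_image()
        robot.grasp_at(x, y)
        success = robot.gripper_force() > FORCE_THRESHOLD          # label from force feedback
        data.append({"image": image, "action": (x, y), "label": success})
    return data
```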

There is not enough diverse data, so robots cannot transition from the lab to the real world

  • Chicken-and-egg issue: we need data for the method to be useful, but it will not be useful until we have access to the data, so…
  • Rented Airbnbs to collect data!
Hilarious videos of training robots in various Airbnbs.

Can we formulate curiosity as an end-to-end gradient method rather than learning by “rewarding actions that disagree with the environment”?
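
For context on the “disagreement” framing, a rough sketch of ensemble-disagreement curiosity (in the spirit of exploration-via-disagreement work from Gupta's group, to my understanding): the intrinsic reward is the variance across an ensemble of learned forward models, and because that quantity is differentiable it could in principle be optimized end-to-end rather than used only as an RL reward. Network sizes and dimensions are toy placeholders.

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, N_MODELS = 8, 2, 5

# Ensemble of forward dynamics models f_i(s, a) -> s'.
ensemble = nn.ModuleList(
    nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(), nn.Linear(64, STATE_DIM))
    for _ in range(N_MODELS)
)

def disagreement_reward(state, action):
    """Intrinsic reward = variance of ensemble predictions (no environment label needed)."""
    inp = torch.cat([state, action], dim=-1)
    preds = torch.stack([f(inp) for f in ensemble])   # (N_MODELS, batch, STATE_DIM)
    return preds.var(dim=0).mean(dim=-1)              # (batch,)

state = torch.randn(4, STATE_DIM)
action = torch.randn(4, ACTION_DIM, requires_grad=True)
reward = disagreement_reward(state, action).sum()
reward.backward()            # gradients flow to the action: curiosity as an end-to-end objective
print(action.grad.shape)
```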

Pierre Sermanet: Using play and language to scale robot learning

Recent works

What is “play data”?

  • Play can be a substitute for RL because it gives exploration inherently (RL is designed to balance exploration and exploitation).
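
One reason play data works without labels is that it can be relabeled in hindsight: any window of the play stream is a valid demonstration of reaching its own final state (my reading of the learning-from-play line of work from Sermanet's group). A toy sketch with plain arrays:

```python
import numpy as np

def relabel_play(observations, actions, window=32):
    """Slice an unlabeled play stream into goal-conditioned training examples.

    Each window becomes a demonstration of "how to get from its first
    observation to its last observation" -- no human labels required.
    """
    examples = []
    for start in range(0, len(observations) - window, window):
        end = start + window
        examples.append({
            "obs": observations[start:end],
            "actions": actions[start:end],
            "goal": observations[end - 1],   # hindsight goal = where the play ended up
        })
    return examples

# Toy usage: a fake play log of 10-dim observations and 3-dim actions.
obs = np.random.randn(1000, 10)
acts = np.random.randn(1000, 3)
print(len(relabel_play(obs, acts)))
```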

The great slides from this talk below (thanks Zoom).

Definitions of play.
Different types of robotics data potentially used.
How these data types are linked.

Panel 1 was highlighted by this question:

How far are we from real industrial application?

Answer: “crickets”

“Robots are going to have to explain themselves, so they will need to generate text”

Roberto Calandra: Few lessons learned from self-supervised learning on real robots

What is self-supervised learning?

  • Supervised learning without labelling the data: Learn embeddings, automatic labelling.
  • Benefits: large-scale data collection is feasible; in the real world it leads to better experimental design and engineering; and it seems natural given how humans learn.
  • Limitations: the structure of the problem needs to be known and consistent, and a labelling mechanism is needed.
  • Remember the polished front end of robotics demos versus the hidden behind-the-scenes challenges; self-supervision may mitigate this gap.
Marble manipulation task used self-supervision.
Can use self-supervision with model-based planning.
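
To make “self-supervision with model-based planning” concrete, here is a minimal random-shooting planner over a learned dynamics model; the robot's own transitions provide the training signal for that model. The dynamics and cost used below are toy stand-ins, not anything from the talk.

```python
import numpy as np

def plan_with_model(dynamics_fn, cost_fn, state, horizon=10, n_candidates=256, action_dim=2):
    """Random-shooting MPC over a learned dynamics model.

    dynamics_fn(state, action) -> next_state would be trained on the robot's own
    (state, action, next_state) transitions -- the self-supervised part.
    """
    candidates = np.random.uniform(-1, 1, size=(n_candidates, horizon, action_dim))
    costs = np.zeros(n_candidates)
    for i, action_seq in enumerate(candidates):
        s = state
        for a in action_seq:
            s = dynamics_fn(s, a)
            costs[i] += cost_fn(s)
    return candidates[np.argmin(costs)][0]   # execute the first action of the best sequence

# Toy usage: a linear "learned" model and a cost that pulls the state to the origin.
best_action = plan_with_model(
    dynamics_fn=lambda s, a: s + 0.1 * np.concatenate([a, -a]),
    cost_fn=lambda s: float(np.sum(s ** 2)),
    state=np.ones(4),
)
print(best_action)
```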

6 lessons from robotics & self-supervision

  1. Safety, safety, safety.
  2. Careful experiment design is very important. “Measure twice, cut once” is not always enough -> you need an iterative improvement process.
  3. Don’t underestimate engineering.
  4. Designing and monitoring diagnostics is crucial.
  5. Log everything and maintain consistency (hydra.cc).
  6. Do not code experiments as sequences of actions (finite state machines offer substantially more robustness); a sketch follows this list.
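
A sketch of lesson 6 with a hypothetical robot interface: instead of a fixed “move, grasp, lift” script, the experiment is a small state machine whose failure transitions route back to recovery states instead of crashing the run.

```python
# Hypothetical sketch: an experiment written as a finite state machine rather
# than a fixed sequence of actions. The robot methods are illustrative.
def run_grasp_experiment(robot):
    state = "RESET"
    while state != "DONE":
        if state == "RESET":
            robot.move_home()
            state = "APPROACH"
        elif state == "APPROACH":
            state = "GRASP" if robot.move_to_object() else "RESET"
        elif state == "GRASP":
            # A failed grasp routes back to APPROACH instead of aborting.
            state = "LIFT" if robot.close_gripper() else "APPROACH"
        elif state == "LIFT":
            robot.lift()
            state = "DONE"
```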

Can we run robots entirely remotely?

Chelsea Finn: Data scalability for robot learning

Generalizing across tasks, objects, and environments

How the training and test distributions often look in robotics.
  • To generalize broadly, train on broad data. We want scalable data sources to do so (as in modern CV).
  • What does robot learning data look like? Match this with the ML data process of training + validation sets.
Timeline on multi-robot, multi-task learning.
ML <-> robotics.

Need to get large datasets and algorithms to work with them

  • Goal: accumulate and reuse datasets across labs: RoboNet.
  • Like a validation set, one robot’s data can be used to fine-tune performance (sketched below).
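
A sketch of that held-out-robot idea, assuming a dataset organized as a dict from robot name to trajectories; the robot names and the model API in the usage comments are hypothetical.

```python
def split_by_robot(dataset, held_out_robot):
    """Treat one robot's data like a validation/fine-tuning set (RoboNet-style)."""
    pretrain = [traj for robot, trajs in dataset.items()
                if robot != held_out_robot for traj in trajs]
    finetune = dataset[held_out_robot]
    return pretrain, finetune

# Hypothetical usage with a model exposing fit() / evaluate():
# pretrain, finetune = split_by_robot(robonet_like_data, held_out_robot="franka")
# model.fit(pretrain)
# model.fit(finetune[: len(finetune) // 10])   # small amount of target-robot data
# print(model.evaluate(finetune))
```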

Can we link all the videos available online to robotic tasks? Need to account for dramatic domain shift, but it’s a huge dataset opportunity.

  • Trying to do video prediction for robots: the bottleneck is underfitting (trying to model everything in the scene).
  • Instead, we consider goal-aware prediction to focus on goal-relevant content, redistributing model capacity toward good trajectories.

Pieter Abbeel: DRL — Can learning from pixels be as efficient as learning from state?

DRL from pixels was far behind DRL from state a few years ago, but now it has nearly matched it.

History

  • The Contrastive Unsupervised Representations for Reinforcement Learning (CURL) and Reinforcement Learning with Augmented Data (RAD) papers.
  • Contrastive learning is now dominant in CV. Results show that combining supervised and unsupervised learning can do better.
  • SimCLR is SOTA in image recognition (with self-supervision).
  • It is important to use a sequence of 3 frames (movement is key in robotics).
Blue is with self-supervision on ImageNet, red is without (self-supervision is more data efficient).

Contrastive learning (CURL)

The main objective function in DRL from images these days.
  • Add query/key pairs (e.g., random crops of the same observation), a bilinear inner product with a learned weight matrix, and keys encoded with a momentum encoder; a sketch follows this list.
  • Hard environments don’t work yet: Supervised learning cannot extract the state from the image (need good enough images).
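
A compact sketch of that objective as I understood it from the talk: two random crops of the same observation become the query and key, the key goes through a momentum copy of the encoder, and similarity is a bilinear product q·W·k scored against every other key in the batch. The encoder below is a toy placeholder, not the actual CURL architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM = 64
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, EMBED_DIM))     # toy encoder
key_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, EMBED_DIM))
key_encoder.load_state_dict(encoder.state_dict())  # momentum copy (updated by EMA, not SGD)
W = nn.Parameter(torch.eye(EMBED_DIM))             # learned bilinear weight matrix

def curl_loss(query_crops, key_crops):
    q = encoder(query_crops)                        # (B, EMBED_DIM)
    with torch.no_grad():
        k = key_encoder(key_crops)                  # keys come from the momentum encoder
    logits = q @ W @ k.t()                          # bilinear similarity against every key in the batch
    labels = torch.arange(q.size(0))                # the matching key is the positive
    return F.cross_entropy(logits, labels)

# Toy usage: pretend these are two random crops of the same batch of frames.
obs = torch.rand(8, 3, 64, 64)
print(curl_loss(obs, obs).item())
```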

RAD: RL + Data augmentation

  • Rotation, mirroring, etc. (crop and rotate are the most important); matches CURL. A crop sketch follows this list.
  • CURL can be applied without reward function (multi-task fit).
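
The RAD side is almost embarrassingly simple: apply an image augmentation such as random crop to the observations before each update, with no change to the RL algorithm itself. A minimal crop sketch (array sizes are illustrative):

```python
import numpy as np

def random_crop(batch, out_size=84):
    """Randomly crop each image in a batch of (B, C, H, W) observations (RAD-style)."""
    b, c, h, w = batch.shape
    cropped = np.empty((b, c, out_size, out_size), dtype=batch.dtype)
    for i in range(b):
        top = np.random.randint(0, h - out_size + 1)
        left = np.random.randint(0, w - out_size + 1)
        cropped[i] = batch[i, :, top:top + out_size, left:left + out_size]
    return cropped

obs = np.random.rand(8, 3, 100, 100).astype(np.float32)
print(random_crop(obs).shape)   # (8, 3, 84, 84)
```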
Another comparison of DRL to image-based research.

Andy Zeng: Learning to See Actions (for Vision-based Manipulation)

  • How do we get our training labels?
  • Trying to label 3D orientation in 2D images is non-intuitive and hard.
A great slide “what are objects”.

A proposal for end-to-end learning of object-centric representations (no assumptions of object-ness when learning end-to-end)
