Test-Time Training on Video Streams

Prior work has established Test-Time Training (TTT) as a general framework for further improving a trained model at test time. Before making a prediction on each test instance, the model is first trained on that same instance using a self-supervised task such as reconstruction. We extend TTT to the streaming setting, where multiple test instances (video frames in our case) arrive in temporal order. Our extension is online TTT: the current model is initialized from the previous model, then trained on the current frame and a small window of frames immediately before it. Online TTT significantly outperforms the fixed-model baseline on four tasks across three real-world datasets. For instance and panoptic segmentation, the improvements are more than 2.2x and 1.5x, respectively. Surprisingly, online TTT also outperforms its offline variant, which accesses strictly more information by training on all frames from the entire test video regardless of temporal order. This finding challenges the findings of prior work that used synthetic videos. We formalize a notion of locality as the advantage of online over offline TTT, and analyze its role with ablations and a theory based on the bias-variance trade-off.
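The online TTT loop described above can be illustrated with a minimal sketch. This is a toy under stated assumptions, not the authors' implementation: the "model" is a single scalar parameter, the self-supervised task is squared-error reconstruction of the frames in the window, and the names `online_ttt`, `reconstruction_step`, `window_size`, and `steps_per_frame` are hypothetical. In the real setting the parameter would be a network that is trained on the self-supervised task and then used to make the main-task prediction.

```python
from collections import deque
from typing import Callable, List, Sequence, Tuple


def online_ttt(
    frames: Sequence[float],
    init_param: float,
    ttt_step: Callable[[float, List[float]], float],
    window_size: int = 4,
    steps_per_frame: int = 1,
) -> Tuple[List[float], float]:
    """Run a toy online TTT loop; return (per-frame predictions, final parameter)."""
    param = init_param                  # start from the model trained before test time
    window = deque(maxlen=window_size)  # current frame plus the frames just before it
    predictions = []
    for frame in frames:                # frames arrive in temporal order
        window.append(frame)            # the oldest frame drops out automatically
        for _ in range(steps_per_frame):
            # Train on the window, starting from the previous frame's parameter.
            param = ttt_step(param, list(window))
        predictions.append(param)       # toy "prediction": the adapted parameter
    return predictions, param


def reconstruction_step(param: float, window: List[float], lr: float = 0.25) -> float:
    """One gradient step on the mean squared reconstruction error over the window."""
    grad = sum(2.0 * (param - x) for x in window) / len(window)
    return param - lr * grad


# Hypothetical drifting stream: the signal grows by 1 every frame.
stream = [float(t) for t in range(20)]
preds, final_param = online_ttt(stream, init_param=0.0, ttt_step=reconstruction_step)

fixed_error = abs(0.0 - stream[-1])         # the fixed-model baseline never adapts
online_error = abs(preds[-1] - stream[-1])  # online TTT follows the recent frames
print(f"fixed-model error: {fixed_error:.2f}, online TTT error: {online_error:.2f}")
```

Because the parameter is carried over from frame to frame and only the most recent frames are used for training, the toy model follows the drift in the stream, while the fixed parameter does not. An offline variant in this sketch would instead train once on all of `stream` and use the same parameter for every frame.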

Results

Task 1: COCO Videos - Instance Segmentation

Restaurant [videos: Input Video, Baseline, TTT]

Havana [videos: Input Video, Baseline, TTT]

Task 2: COCO Videos - Panoptic Segmentation

School [videos: Input Video, Baseline, TTT]

Bangkok [videos: Input Video, Baseline, TTT]

Task 3: KITTI-STEP - Semantic Segmentation

Video 0002 [videos: Input Video, Baseline, TTT]

Video 0018 [videos: Input Video, Baseline, TTT]

Task 4: Video Colorization

L'Arrivée d'un Train En Gare de La Ciotat ("The Arrival of a Train") [videos: Input Video, Baseline, TTT]

La Pêche Aux Poissons Rouges ("Fishing for Goldfish") [videos: Input Video, Baseline, TTT]

Repas de Bébé ("Baby's Breakfast") [videos: Input Video, Baseline, TTT]

COCO-Videos Example (Havana) [videos: RGB Ground Truth, Baseline, TTT]