The idea of moving from pixel forecasting to feature forecasting is really clever, especially for tasks like autonomous driving where you care about segmentation or depth but not necessarily RGB details. I wonder how the temporal extrapolation holds up in practice when scenes have sudden changes or occlusions. Does the model degrade quickly, or does it stay surprisingly robust?
Well, if you think about it, this model is at most as good as having the ground-truth future RGB and extracting DINO features from it. The features themselves go through large variations under sudden changes, and the model should replicate that behavior. In practice it has only been trained on a small dataset, so I would guess the gap is larger in those cases, but I haven't tested it.
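That upper-bound argument can be sketched numerically: a forecast can at best match the features extracted from the real future frame (the "oracle"). Below is a minimal numpy sketch, with random arrays standing in for DINO patch features; all array names and the noise scale are illustrative assumptions, not anything from the actual model.

```python
import numpy as np

def mean_cosine_sim(a, b):
    # Mean cosine similarity between two (N, D) feature maps.
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return float((a * b).sum(-1).mean())

rng = np.random.default_rng(0)
D = 384  # feature dimension of DINO ViT-S, for concreteness

# Stand-ins for patch features of the ground-truth future frame (the
# oracle), a reasonable forecast, and an unrelated prediction.
oracle = rng.normal(size=(196, D))
forecast = oracle + 0.1 * rng.normal(size=(196, D))  # forecast with small error
baseline = rng.normal(size=(196, D))                 # uncorrelated features

# A forecast close to the oracle scores near 1; an unrelated one near 0.
print(mean_cosine_sim(forecast, oracle))  # high, close to 1
print(mean_cosine_sim(baseline, oracle))  # near 0
```

The point of the toy setup: any evaluation of the forecaster is naturally measured against features of the true future RGB, so a sudden scene change that moves the oracle features also moves the target the forecaster is judged against.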