Stable Video Diffusion: Training Video Models
Stable Video Diffusion Image Pretraining

For image pretraining, the paper discusses initially pretraining a diffusion model on CC-12M, a large-scale dataset of image-text pairs. This was done using self-supervised learning to acquire strong image representation capabilities. The pretraining enabled the model to recognize visual details and structures in images, such as faces and objects. Subsequently, ...