(click to copy)
There are two prevailing methods for pre-training on large datasets to learn transferable representations: 1) supervised pre-training on large but weakly-labeled datasets; 2) contrastively training on image only and image, text pairs.
While supervised pre-training learns good representations that can be transferred to a wide range of tasks, contrastively models such as CLIP have demonstrated unprecedented zero-shot transfer.
In this work, we compare the transferability of the two aforementioned methods to multiple downstream tasks. The pre-training distributions we consider include YFCC, Conceptual Captions, and ImageNet-21K while pre-training objectives range from supervised to SimCLR, CLIP, and SLIP.
We observe that different pre-training methods with the same training source transfer similarly given their ImageNet accuracy.
M. M. Shariatnia, R. Entezari, M. Wortsman, O. Saukh, L. Schmidt, How well do contrastively trained models transfer?, ICML 2022 Pre-training Workshop