Published in arXiv:2103.03568, 2021
Pretext-based self-supervised learning aims to learn a semantic representation via a handcrafted pretext task over unlabeled data and then uses the learned representation for downstream prediction tasks. It has been proved that pretext-based self-supervised learning can effectively reduce the sample complexity of downstream tasks when the components of the pretext task are conditionally independent given the downstream label, a condition known as Conditional Independence (CI). However, the downstream sample complexity becomes much worse if the CI condition does not hold. One interesting question is whether we can make the CI condition hold by using downstream data to refine the unlabeled data and thereby boost self-supervised learning. At first glance, one might think that seeing downstream data in advance would always boost downstream performance. However, we show that this intuition is not always correct and point out that in some cases it hurts the final performance instead. In particular, we prove both model-free and model-dependent lower bounds on the number of downstream samples used for data refinement. Moreover, we conduct several experiments on both synthetic and real-world datasets to verify our theoretical results.
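
For readers unfamiliar with the CI condition, it can be written out as follows. This is only a sketch under assumed notation: the symbols $X_1$, $X_2$ (the two components of the pretext task) and $Y$ (the downstream label) do not appear in the abstract and are introduced here for illustration.

```latex
% Assumed notation (not from the abstract): X_1, X_2 are the two components
% of the pretext task and Y is the downstream label.
X_1 \perp X_2 \mid Y
\quad\Longleftrightarrow\quad
p(x_1, x_2 \mid y) = p(x_1 \mid y)\, p(x_2 \mid y)
\quad \text{for all } x_1, x_2, y .
```

Read this way, the CI condition says that once the downstream label is known, one pretext component carries no further information about the other.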