This blog post provides an overview of my ongoing project with Machine Learning for Science (ML4Sci) as part of the Google Summer of Code (GSoC) 2024. All the project's code is openly accessible at github.com/iyersreehari/DeepLense_SSL_Sreehari_Iyer
Strong gravitational lensing provides a means to probe dark matter substructure. In recent years, machine learning techniques, particularly supervised learning, have been utilized for substructure detection and for other regression and classification tasks on lensing data. However, labeled training data for strong gravitational lensing is scarce, while supervised learning requires abundant labeled data and can be biased by class imbalance in the training set. To circumvent this, previous works have trained supervised models on simulated lensing datasets, an approach that may degrade performance when running inference on real data. In computer vision, self-supervised learning (SSL) has emerged as a potent solution, particularly effective in scenarios with abundant unlabeled data and scarce labeled data. Recent works have studied convolutional neural network (CNN) based SSL with simulated lensing data. This project focuses on evaluating self-supervised learning techniques with Transformers on a real-world strong gravitational lensing dataset.
The learned representations are then evaluated on the downstream task of classifying lens and non-lens images. We train and compare Vision Transformers [Dosovitskiy, Alexey, et al., 2020] with different self-supervised learning algorithms, along with a supervised baseline. Vision Transformers split the image into fixed-size patches and linearly embed them, adding position embeddings. The resulting sequence of vectors is then fed into a standard Transformer encoder. The following illustration of the ViT architecture is from Dosovitskiy, Alexey, et al., 2020.
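As a complement to that illustration, a rough PyTorch sketch of the patch-embedding step is given below. The module name, image size, patch size, and embedding dimension are placeholders, not the project's actual configuration.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    # Illustrative sketch: split an image into fixed-size patches, linearly
    # embed each patch, prepend a [CLS] token and add learnable position
    # embeddings before the sequence enters the Transformer encoder.
    def __init__(self, img_size=64, patch_size=8, in_chans=1, embed_dim=384):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to cutting the image into patches
        # and applying a shared linear projection to each patch.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

    def forward(self, x):
        B = x.shape[0]
        x = self.proj(x).flatten(2).transpose(1, 2)      # (B, num_patches, embed_dim)
        cls = self.cls_token.expand(B, -1, -1)           # (B, 1, embed_dim)
        x = torch.cat([cls, x], dim=1) + self.pos_embed  # add position embeddings
        return x                                         # input to the Transformer encoder
```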
A supervised learning baseline is trained to compare the performance of the self-supervised learning algorithms against. The ViT-B and ViT-S networks are trained to minimize the cross-entropy loss between the predicted labels and the true labels of the labeled training dataset.
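A minimal sketch of one training epoch for this baseline is shown below, assuming a hypothetical `vit` classifier and a labeled `train_loader`; the optimizer and other training details are omitted.

```python
import torch
import torch.nn as nn

# Hypothetical names: `vit` is a ViT-S/ViT-B classifier with a linear head,
# `train_loader` yields (image, label) batches from the labeled dataset.
def train_supervised_epoch(vit, train_loader, optimizer, device="cuda"):
    criterion = nn.CrossEntropyLoss()   # cross entropy between predictions and labels
    vit.train()
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        logits = vit(images)            # class predictions
        loss = criterion(logits, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```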
Recent studies in self-supervised representation learning utilize methods involving some form of Siamese network.
The SimSiam architecture takes as input two randomly augmented views \( x_1 \) and \( x_2 \) of an image \( x \). We utilize random horizontal flips, random brightness and contrast jitter, and random Gaussian blur to obtain the augmentations.
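A possible torchvision-based sketch of this augmentation pipeline is given below; the jitter strengths and blur parameters are assumptions, not the exact values used in the project.

```python
import torchvision.transforms as T

# Illustrative parameters only.
augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),
    T.ColorJitter(brightness=0.4, contrast=0.4),      # brightness and contrast jitter
    T.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),  # random gaussian blur
])

def two_views(image):
    # Two independent random augmentations of the same image.
    return augment(image), augment(image)
```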
The architecture consists of an encoder network \( f \) followed by a prediction MLP head \( h \). The encoder network \( f \) comprises the backbone network followed by a projection MLP.
Let the cosine similarity between the output vectors \( p_1 = h(f(x_1)) \) and \( z_2 = f(x_2) \) be \[ \mathcal{D} (p_1, z_2) = \frac{p_1}{|| p_1 ||_2} \cdot \frac{z_2}{|| z_2 ||_2} \]
The objective of SimSiam is to minimize the symmetrized loss defined as follows:
\[ \mathcal{L} = - \frac{1}{2} \mathcal{D} (p_1, z_2) - \frac{1}{2} \mathcal{D} (p_2, z_1) \]
Further, the architecture utilizes a stop-gradient operation to avoid representation collapse, in which the optimizer quickly finds a degenerate solution that reaches the minimum possible loss of −1.
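Putting the pieces together, the sketch below shows how the symmetrized loss and the stop-gradient (via `.detach()`) could be computed, assuming `f` and `h` are the encoder and prediction-head modules described above.

```python
import torch.nn.functional as F

def simsiam_loss(f, h, x1, x2):
    # z: encoder outputs, p: predictor outputs.
    z1, z2 = f(x1), f(x2)
    p1, p2 = h(z1), h(z2)
    # D(p, z): cosine similarity; detach() is the stop-gradient on the target branch.
    d = lambda p, z: F.cosine_similarity(p, z.detach(), dim=-1).mean()
    # Symmetrized objective: L = -1/2 D(p1, z2) - 1/2 D(p2, z1).
    return -0.5 * d(p1, z2) - 0.5 * d(p2, z1)
```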
The evaluation results on the test dataset after fine-tuning the architecture on the training dataset are given below.
The model consists of student and teacher networks.
The output of the teacher network is centered with a mean computed over the batch.
The outputs of the student and the teacher networks are normalized with a temperature softmax over the feature dimension.
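A sketch of the centering and temperature-softmax operations is given below; the temperature and center-momentum values are placeholders, not the settings used in the project.

```python
import torch
import torch.nn.functional as F

def sharpen_teacher(t_out, center, teacher_temp=0.04):
    # Center the teacher output with a running batch mean, then apply a temperature softmax.
    return F.softmax((t_out - center) / teacher_temp, dim=-1)

def sharpen_student(s_out, student_temp=0.1):
    return F.softmax(s_out / student_temp, dim=-1)

def update_center(center, t_out, momentum=0.9):
    # Exponential moving average of the per-batch mean of teacher outputs.
    batch_center = t_out.mean(dim=0, keepdim=True)
    return center * momentum + batch_center * (1 - momentum)
```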
Following [Zhou, Jinghao, et al., 2021], given the training set \( \mathcal{I} \), an image \( x \sim \mathcal{I} \) is sampled uniformly and two random augmentations are applied to it, yielding two distorted views \( u \) and \( v \). We utilize random horizontal flips, random brightness and contrast jitter, and random Gaussian blur to obtain the augmentations.
The two distorted views are then passed through a teacher-student framework to obtain predictive categorical distributions from the \( \texttt{[CLS]} \) token:
\[
v^{\texttt{[CLS]}}_t = P_{\theta '}^{\texttt{[CLS]}} (v)
\quad \text{and} \quad
u^{\texttt{[CLS]}}_s = P_{\theta}^{\texttt{[CLS]}} (u)
\]
where \( \theta ' \) denotes the parameters of the teacher network and \( \theta \) denotes the parameters of the student network.
The knowledge is distilled from the teacher to the student by minimizing their cross-entropy loss w.r.t. the student parameters \( \theta \):
\[
\mathcal{L}_{\texttt{[CLS]}} (u, v) = - P_{\theta '}^{\texttt{[CLS]}} (v) \cdot \log {(P_{\theta}^{\texttt{[CLS]}} (u))}
\]
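Assuming `p_teacher_v` and `p_student_u` are the centered and temperature-sharpened distributions from the sketch above, this distillation term could be computed as follows.

```python
import torch

def cls_distillation_loss(p_teacher_v, p_student_u, eps=1e-7):
    # - P_teacher(v) . log P_student(u), averaged over the batch.
    # Gradients flow only through the student distribution.
    return -(p_teacher_v.detach() * torch.log(p_student_u + eps)).sum(dim=-1).mean()
```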
The teacher and the student share the same architecture, consisting of a backbone \( f \)
and a projection MLP head \( h \). A stop-gradient (sg) operator is applied on the teacher network, similar to the SimSiam architecture.
Self-supervised learning is implemented in Caron, Mathilde, et al., 2021 through different distorted views, or crops, of an image, using a multi-crop strategy. For a given image, a set \( V \) of different views is generated. This set contains two (or more) global views, \( x_{g1} \) and \( x_{g2} \), and several local views at smaller resolution (crops covering a smaller area of the image). All crops are passed through the student network, while only the global views are passed through the teacher network.
Thus, the objective to be minimized w.r.t. the student parameters \( \theta \) is as follows:
\[
\mathcal{L}_{DINO} = \sum_{v \in \{ x_{g1}, x_{g2} \}} \ \sum_{u \in V, u \neq v}
- P_{\theta '}^{\texttt{[CLS]}} (v) \cdot \log {(P_{\theta}^{\texttt{[CLS]}} (u))}
\]
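A sketch of how this multi-crop objective could be assembled is given below, reusing the `cls_distillation_loss` helper from the earlier sketch; note that the DINO reference implementation additionally normalizes by the number of view pairs.

```python
def dino_multicrop_loss(teacher_global, student_all):
    # teacher_global: [CLS] distributions of the global views from the teacher.
    # student_all:    [CLS] distributions of all crops (global + local) from the
    #                 student, with the global views first, in the same order.
    total = 0.0
    for iv, p_t in enumerate(teacher_global):
        for iu, p_s in enumerate(student_all):
            if iu == iv:          # skip the pair where u and v are the same view
                continue
            total = total + cls_distillation_loss(p_t, p_s)
    return total
```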
The teacher parameters \( \theta' \) are updated with
an exponential moving average (EMA) of the student parameters \( \theta \).
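The EMA update could look like the following sketch; the momentum value is a placeholder (in practice it is typically scheduled over training).

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    # theta' <- m * theta' + (1 - m) * theta
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.data.mul_(momentum).add_(p_s.data, alpha=1 - momentum)
```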
The evaluation results on the test dataset after fine-tuning the architecture on the training dataset are given below.
The iBOT objective is a linear combination of the DINO loss and the Masked Image Modelling (MIM) loss:
\[
\lambda_1 \cdot \mathcal{L}_{DINO} + \lambda_2 \cdot \mathcal{L}_{MIM}
\]
where \( \mathcal{L}_{DINO} \) is the objective minimized in DINO as described in the previous section, and \( \lambda_1 \), \( \lambda_2 \) are hyperparameters.
The objective is minimized with respect to \( \theta \), the parameters of the student network. The parameters of the teacher network \( \theta' \) are updated with an exponential moving average (EMA) of the student parameters. To compute the MIM loss, a blockwise mask is applied to the two augmented views \( x_1 \) and \( x_2 \) of the same image \( x \), yielding the corresponding masked views \( \hat{x}_1 \) and \( \hat{x}_2 \). This is achieved by masking random contiguous blocks of image patches, effectively covering square-shaped regions of the image.
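A deliberately simplified sketch of such a blockwise mask over the patch grid is shown below; iBOT's actual block sampling is more elaborate (multiple blocks with varying sizes and aspect ratios), so this single-square version is only illustrative.

```python
import torch

def random_block_mask(grid_h, grid_w, block=3):
    # Simplified blockwise masking: mask one random `block` x `block` square of
    # patches on the (grid_h x grid_w) patch grid; returns a flat 0/1 mask m_i.
    mask = torch.zeros(grid_h, grid_w)
    top = torch.randint(0, grid_h - block + 1, (1,)).item()
    left = torch.randint(0, grid_w - block + 1, (1,)).item()
    mask[top:top + block, left:left + block] = 1.0
    return mask.flatten()    # one entry per image patch
```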
The training objective of MIM in iBOT is defined as:
\[
\mathcal{L}_{MIM} = - \sum_{i=1}^N m_i \cdot P_{\theta'}^{\text{patch}} (x^i_1) \cdot \log \left( P_{\theta}^{\text{patch}} (\hat{x}^i_1) \right) - \sum_{i=1}^N m_i \cdot P_{\theta'}^{\text{patch}} (x^i_2) \cdot \log \left( P_{\theta}^{\text{patch}} (\hat{x}^i_2) \right)
\]
Here, \( N \) is the number of patch tokens in a view and \( m_i \in \{0, 1\} \) indicates whether the \( i \)-th patch token is masked. When implementing self-supervised learning with the multi-crop strategy described in the previous section, \( x_1 \) and \( x_2 \) correspond to the global views.
We again use random horizontal flips, random brightness and contrast jitter, and random Gaussian blur to obtain the augmentations.
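For concreteness, the sketch below shows how the MIM term for one view and the weighted combination with the DINO term could be computed, assuming per-patch categorical distributions of shape (batch, patches, prototypes) and a 0/1 patch mask; normalizing by the number of masked patches is an implementation convention rather than part of the written objective.

```python
import torch

def mim_loss_one_view(p_teacher_patch, p_student_patch, mask, eps=1e-7):
    # p_teacher_patch: teacher patch distributions on the unmasked view, (B, N, K)
    # p_student_patch: student patch distributions on the masked view,   (B, N, K)
    # mask:            0/1 mask over the N patch tokens,                 (B, N)
    ce = -(p_teacher_patch.detach() * torch.log(p_student_patch + eps)).sum(dim=-1)
    return (mask * ce).sum() / mask.sum().clamp(min=1)

def ibot_objective(loss_dino, loss_mim, lam1=1.0, lam2=1.0):
    # Hyperparameter-weighted combination of the two terms (placeholder weights).
    return lam1 * loss_dino + lam2 * loss_mim
```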