Learning Representation Through Self-Supervised Learning on Real Gravitational Lensing Images

Indian Institute of Technology Madras

This blog post provides an overview of my ongoing project with Machine Learning for Science (ML4Sci) as part of the Google Summer of Code (GSoC) 2024. All the project's code is openly accessible at github.com/iyersreehari/DeepLense_SSL_Sreehari_Iyer

Abstract

Strong gravitational lensing provides a means to probe dark matter substructure. In recent years, machine learning techniques, particularly supervised learning, have been utilized for substructure detection and for other regression and classification tasks on lensing datasets. For strong gravitational lensing, however, labelled training data is scarce, while supervised learning requires abundant labelled data and can be biased by class imbalance in the training set. To circumvent this, previous works have trained supervised models on simulated lensing datasets, but this approach may result in diminished performance when running inference on real data. In computer vision, self-supervised learning (SSL) has emerged as a potent solution, particularly effective in scenarios with abundant unlabelled data and scarce labelled data. Recent works have studied convolutional neural network (CNN) based SSL with simulated lensing data. This project evaluates self-supervised learning techniques with Transformers on a real-world strong gravitational lensing dataset.

Experiment

This project focuses on evaluating self-supervised learning techniques with Transformers on a real-world strong gravitational lensing dataset. The learned representations are then evaluated on the downstream task of classifying lens and non-lens images. We train and compare Vision Transformers [Dosovitskiy, Alexey, et al., 2020] trained with different self-supervised learning algorithms, along with a supervised baseline. A Vision Transformer (ViT) splits the image into fixed-size patches, linearly embeds them, and adds position embeddings; the resulting sequence of vectors is then fed into a standard Transformer encoder. The following illustration of the ViT architecture is from Dosovitskiy, Alexey, et al., 2020.

ViT architecture, from Dosovitskiy, Alexey, et al., 2020
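To make the patch-embedding step concrete, below is a minimal PyTorch sketch of how a 32 × 32 image is split into 8 × 8 patches, linearly projected, prepended with a \( \texttt{[CLS]} \) token, and combined with a learnable position embedding before entering the Transformer encoder. The module and the embedding dimension (384, as in ViT-S) are illustrative assumptions, not code from the project repository.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    # Illustrative patch embedding: a strided convolution is equivalent to
    # splitting the image into non-overlapping patches and applying a
    # shared linear projection to each patch.
    def __init__(self, img_size=32, patch_size=8, in_chans=3, embed_dim=384):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2          # 16 patches
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, x):                       # x: (B, 3, 32, 32)
        x = self.proj(x)                        # (B, D, 4, 4)
        x = x.flatten(2).transpose(1, 2)        # (B, 16, D) patch tokens
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)          # prepend the [CLS] token
        return x + self.pos_embed               # add position embeddings
```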
For the supervised learning baseline, the ViT backbone is followed by an MLP classifier head. The hyperparameters for the classification task are chosen to minimize the loss computed over the validation dataset. For self-supervised learning, the ViT backbone is trained on the training dataset without label information. To evaluate the learned representation, the backbone followed by a linear classifier is fine-tuned on the training dataset (with label information) and then evaluated on the held-out test dataset.
Example images from the training dataset
Each image has three channels, or filters: b (blue filter), g (green filter), and i (near-infrared filter). Detailed information about the filters is available at SDSS Filters.
The train dataset contains 2333 lens images and 1530 non-lens images. The validation dataset contains 259 lens images and 170 non-lens images. The test dataset contains 458 lens images and 300 non-lens images.
The images are center-cropped to 32 × 32 pixels, as this empirically resulted in better prediction accuracy. To understand how well SSL works with different fractions of labelled and unlabelled data, the model is pre-trained through self-supervision on the entire training set, fine-tuned on a labelled fraction of the training data, and compared with a supervised baseline trained only on that labelled fraction (a short sketch of this split is given below). This simulates the real-world scenario where only a fraction of the dataset may have associated labels.
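The sketch below illustrates this protocol under stated assumptions (hypothetical names, two classes, PyTorch/torchvision): images are center-cropped to 32 × 32, the SSL backbone sees all training images, and only a randomly sampled subset of the labels is used for fine-tuning or for the supervised baseline.

```python
import torch
from torchvision import transforms

center_crop = transforms.CenterCrop(32)          # crop each image to 32 x 32 pixels

def labelled_subset(labels: torch.Tensor, n_labelled: int, seed: int = 0) -> torch.Tensor:
    # Sample a fixed number of labelled examples from the full training set;
    # SSL pre-training still uses every training image, without labels.
    g = torch.Generator().manual_seed(seed)
    perm = torch.randperm(len(labels), generator=g)
    return perm[:n_labelled]                      # indices of the labelled fraction

# Example with placeholder labels (0 = non-lens, 1 = lens) for 3863 training images:
labels = torch.randint(0, 2, (3863,))
idx_300 = labelled_subset(labels, n_labelled=300)
```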

Supervised Learning Baseline

A supervised learning baseline is trained to serve as a point of comparison for the self-supervised learning algorithms. The ViT-S and ViT-B networks are trained to minimize the cross-entropy loss between the predicted labels and the ground-truth labels on the labelled training dataset.
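Schematically, the baseline attaches a classification head to the ViT backbone and minimizes cross-entropy on the labelled images. The following is an illustrative PyTorch sketch, not the project's exact training code; the module names, embedding dimension, and training step are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SupervisedClassifier(nn.Module):
    # Hypothetical baseline: any backbone that maps an image batch to a
    # (B, embed_dim) feature tensor, followed by a linear classification head.
    def __init__(self, backbone: nn.Module, embed_dim: int = 384, num_classes: int = 2):
        super().__init__()
        self.backbone = backbone
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        return self.head(self.backbone(x))

def training_step(model, images, labels, optimizer):
    # One optimization step on a labelled batch using cross-entropy loss.
    optimizer.zero_grad()
    loss = F.cross_entropy(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```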

| Backbone | # labelled images for training/fine-tuning | Accuracy | AUC Score |
| --- | --- | --- | --- |
| ViT-S (patch size: 8) | 300 | 89.7098% | 0.9547 |
| ViT-B (patch size: 8) | 300 | 90.7652% | 0.9534 |
| ViT-S (patch size: 8) | 600 | 92.2164% | 0.9445 |
| ViT-B (patch size: 8) | 600 | 92.6121% | 0.9579 |
| ViT-S (patch size: 8) | 1200 | 93.1398% | 0.9780 |
| ViT-B (patch size: 8) | 1200 | 93.4037% | 0.9678 |
| ViT-S (patch size: 8) | 3863 | 94.9868% | 0.9861 |
| ViT-B (patch size: 8) | 3863 | 94.9868% | 0.9843 |

ROC curves for the ViT-S and ViT-B backbones

SimSiam

[Chen, Xinlei, and Kaiming He., 2021]

Recent studies in self-supervised representation learning utilize methods built around some form of Siamese network.

SimSiam architecture, from Chen, Xinlei, and Kaiming He, 2021

The SimSiam architecture takes as input two randomly augmented views \( x_1 \) and \( x_2 \) of an image \( x \). We utilize random horizontal flips, random brightness and contrast jitter, and random Gaussian blur to obtain the augmentations.
The architecture consists of an encoder network \( f \) followed by a prediction MLP head \( h \). The encoder network \( f \) consists of the backbone network followed by a projection MLP.
Let the cosine similarity between the output vectors \( p_1 = h(f(x_1)) \) and \( z_2 = f(x_2) \) be \[ \mathcal{D} (p_1, z_2) = \frac{p_1}{|| p_1 ||_2} \cdot \frac{z_2}{|| z_2 ||_2} \] The objective of SimSiam is to minimize the symmetrized loss defined as follows: \[ \mathcal{L} = - \frac{1}{2} \mathcal{D} (p_1, z_2) - \frac{1}{2} \mathcal{D} (p_2, z_1) \] Further, the architecture applies a stop-gradient operation to the encoder outputs to avoid representation collapse, which would otherwise let the optimizer quickly reach a degenerate solution with the minimum possible loss of −1.
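To make the objective concrete, here is a minimal PyTorch sketch of the symmetrized negative cosine similarity with the stop-gradient applied to the encoder outputs; the function name and signature are illustrative, not taken from the project code.

```python
import torch.nn.functional as F

def simsiam_loss(p1, p2, z1, z2):
    # p_i = h(f(x_i)) are predictor outputs, z_i = f(x_i) are encoder outputs.
    # Detaching z implements the stop-gradient: the encoder only receives
    # gradients through the predictor branch.
    def D(p, z):
        return (F.normalize(p, dim=-1) * F.normalize(z.detach(), dim=-1)).sum(dim=-1).mean()
    return -0.5 * D(p1, z2) - 0.5 * D(p2, z1)
```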
The evaluation results on the test dataset after fine-tuning the architecture on the training dataset are given below.

| Backbone | # labelled images for training/fine-tuning | Accuracy | AUC Score |
| --- | --- | --- | --- |
| ViT-S (patch size: 8) | 300 | 89.8417% | 0.9540 |
| ViT-B (patch size: 8) | 300 | 90.3694% | 0.9539 |
| ViT-S (patch size: 8) | 600 | 92.2164% | 0.9691 |
| ViT-B (patch size: 8) | 600 | 92.0844% | 0.9589 |
| ViT-S (patch size: 8) | 1200 | 94.9868% | 0.9843 |
| ViT-B (patch size: 8) | 1200 | 93.5356% | 0.9817 |
| ViT-S (patch size: 8) | 3863 | 95.1187% | 0.9864 |
| ViT-B (patch size: 8) | 3863 | 94.8549% | 0.9832 |

ROC curves for the ViT-S and ViT-B backbones

DINO

[Caron, Mathilde, et al., 2021]

DINO architecture, from Caron, Mathilde, et al., 2021

The model consists of a student network and a teacher network. The output of the teacher network is centered with a mean computed over the batch. The outputs of the student and the teacher networks are normalized with a temperature-scaled softmax over the feature dimension.

Following the notation of [Zhou, Jinghao, et al., 2021]: given the training set \( \mathcal{I} \), an image \( x \sim \mathcal{I} \) is sampled uniformly, and two random augmentations are applied to it, yielding two distorted views \( u \) and \( v \). We utilize random horizontal flips, random brightness and contrast jitter, and random Gaussian blur to obtain the augmentations. The two distorted views are then passed through a teacher-student framework to obtain the predictive categorical distributions from the \( \texttt{[CLS]} \) token: \[ v^{\texttt{[CLS]}}_t = P_{\theta '}^{\texttt{[CLS]}} (v) \ \text{ and } \ u^{\texttt{[CLS]}}_s = P_{\theta}^{\texttt{[CLS]}} (u) \]
where \( \theta ' \) denotes the parameters of the teacher network and \( \theta \) denotes the parameters of the student network.
The knowledge is distilled from teacher to student by minimizing the cross-entropy loss w.r.t. the student parameters \( \theta \):
\[ \mathcal{L}_{\texttt{[CLS]}} (u, v) = - P_{\theta '}^{\texttt{[CLS]}} (v) \cdot \log {(P_{\theta}^{\texttt{[CLS]}} (u))} \] The teacher and the student share the same architecture, consisting of a backbone \( f \) and a projection MLP head \( h \). A stop-gradient (sg) operator is applied on the teacher network, similar to the SimSiam architecture.
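A minimal PyTorch sketch of this per-pair loss, with teacher centering, temperature-scaled softmax, the stop-gradient on the teacher, and the EMA teacher update, is given below. The temperature and momentum values are illustrative defaults, not the project's exact settings.

```python
import torch
import torch.nn.functional as F

def dino_cls_loss(student_out, teacher_out, center, tau_s=0.1, tau_t=0.04):
    # student_out / teacher_out: [CLS] projection-head outputs of shape (B, K)
    # for one student view u and one teacher (global) view v; `center` is the
    # running mean of teacher outputs used for centering.
    t = F.softmax((teacher_out - center) / tau_t, dim=-1).detach()  # stop-gradient
    log_s = F.log_softmax(student_out / tau_s, dim=-1)
    return -(t * log_s).sum(dim=-1).mean()

@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    # Teacher parameters are an exponential moving average of the student's.
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)
```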

Self-supervised learning is implemented in Caron, Mathilde, et al., 2021 through different distorted views, or crops, of an image with a multi-crop strategy. For a given image, a set \( V \) of different views is generated. This set contains two (or more) global views, \( x_{g1} \) and \( x_{g2} \), and several local views of smaller resolution (crops covering a smaller area of the image). All crops are passed through the student network, while only the global views are passed through the teacher network.
Thus, the objective to minimize w.r.t. the student parameters \( \theta \) is as follows: \[ \mathcal{L}_{DINO} = \sum_{v \in \{ x_{g1}, x_{g2} \}} \ \sum_{u \in V, u \neq v} - P_{\theta '}^{\texttt{[CLS]}} (v) \cdot \log {(P_{\theta}^{\texttt{[CLS]}} (u))} \] The teacher parameters \( \theta ' \) are updated with an exponential moving average (ema) of the student parameters \( \theta \). The evaluation results on the test dataset after fine-tuning the architecture on the training dataset are given below.

| Backbone | # labelled images for training/fine-tuning | Accuracy | AUC Score |
| --- | --- | --- | --- |
| ViT-S (patch size: 8) | 300 | 91.1609% | 0.9689 |
| ViT-B (patch size: 8) | 300 | 91.5567% | 0.9703 |
| ViT-S (patch size: 8) | 600 | 92.6121% | 0.9653 |
| ViT-B (patch size: 8) | 600 | 92.2164% | 0.9667 |
| ViT-S (patch size: 8) | 1200 | 93.7995% | 0.9858 |
| ViT-B (patch size: 8) | 1200 | 93.2718% | 0.9798 |
| ViT-S (patch size: 8) | 3863 | 94.9868% | 0.9865 |
| ViT-B (patch size: 8) | 3863 | 94.1953% | 0.9876 |

ROC curves for the ViT-S and ViT-B backbones

iBOT

[Zhou, Jinghao, et al., 2021]

iBOT architecture, from Zhou, Jinghao, et al., 2021

The iBOT objective is a linear combination of the DINO loss and the Masked Image Modelling (MIM) loss: \[ \mathcal{L}_{iBOT} = \lambda_1 \cdot \mathcal{L}_{DINO} + \lambda_2 \cdot \mathcal{L}_{MIM} \] where \( \mathcal{L}_{DINO} \) is the DINO objective described in the previous section and \( \lambda_1 \), \( \lambda_2 \) are hyperparameters.
The objective is minimized with respect to \( \theta \), the parameters of the student network. The parameters of the teacher network \( \theta ' \) are updated with an exponential moving average (ema) of the student parameters. To compute the MIM loss, a blockwise mask is applied to the two augmented views \( x_1 \) and \( x_2 \) of the same image \( x \), yielding the corresponding masked views \( \hat{x}_1 \) and \( \hat{x}_2 \). This is achieved by masking random contiguous blocks of image patches, effectively covering square-shaped regions. The training objective of MIM in iBOT is defined as: \[ \mathcal{L}_{MIM} = - \sum_{i=1}^N {m_i \cdot P_{\theta'}^{patch} (x^i_1) \cdot \log (P_{\theta}^{patch} (\hat{x}^i_1)) } - \sum_{i=1}^N {m_i \cdot P_{\theta'}^{patch} (x^i_2) \cdot \log (P_{\theta}^{patch} (\hat{x}^i_2)) } \] Here, \( N \) is the number of patch tokens per image and \( m_i \) indicates whether the \( i \)-th patch is masked. When self-supervised learning is implemented through the multi-crop strategy described in the previous section, \( x_1 \) and \( x_2 \) correspond to the global views. We again use random horizontal flips, random brightness and contrast jitter, and random Gaussian blur to obtain the augmentations.
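As an illustration (not the exact iBOT implementation), the masked patch-level term for one view and the combined objective can be sketched in PyTorch as follows, assuming patch-head outputs of shape (batch, patches, prototypes) and a boolean block-wise mask.

```python
import torch
import torch.nn.functional as F

def mim_loss_one_view(student_patch_out, teacher_patch_out, mask, tau_s=0.1, tau_t=0.04):
    # student_patch_out: patch-head outputs for the masked view, shape (B, N, K)
    # teacher_patch_out: patch-head outputs for the unmasked view, shape (B, N, K)
    # mask: boolean block-wise mask of shape (B, N); only masked patches contribute.
    t = F.softmax(teacher_patch_out / tau_t, dim=-1).detach()      # stop-gradient
    log_s = F.log_softmax(student_patch_out / tau_s, dim=-1)
    per_patch = -(t * log_s).sum(dim=-1)                           # (B, N)
    m = mask.float()
    return (per_patch * m).sum() / m.sum().clamp(min=1.0)

def ibot_loss(dino_term, mim_term, lambda1=1.0, lambda2=1.0):
    # Linear combination of the DINO [CLS] loss and the MIM loss.
    return lambda1 * dino_term + lambda2 * mim_term
```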

| Backbone | # labelled images for training/fine-tuning | Accuracy | AUC Score |
| --- | --- | --- | --- |
| ViT-S (patch size: 8) | 300 | 91.4248% | 0.9669 |
| ViT-B (patch size: 8) | 300 | 91.9525% | 0.9668 |
| ViT-S (patch size: 8) | 600 | 92.2164% | 0.9631 |
| ViT-B (patch size: 8) | 600 | 92.6121% | 0.9696 |
| ViT-S (patch size: 8) | 1200 | 93.7995% | 0.9814 |
| ViT-B (patch size: 8) | 1200 | 93.1398% | 0.9814 |
| ViT-S (patch size: 8) | 3863 | 95.6464% | 0.9906 |
| ViT-B (patch size: 8) | 3863 | 94.4591% | 0.9885 |

ROC curves for the ViT-S and ViT-B backbones

Conclusion

In this blog post, we present the results of the SSL algorithms evaluated on the real lensing dataset. We observe that, when unlabelled data is abundant, SSL-pretrained models fine-tuned on the labelled data perform better than a supervised baseline trained from scratch on the same labelled data. This is most clearly observed in the setting with only 300 labelled images.

References

Dosovitskiy, Alexey, et al. "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale." International Conference on Learning Representations, 2020.
Chen, Xinlei, and Kaiming He. "Exploring Simple Siamese Representation Learning." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
Caron, Mathilde, et al. "Emerging Properties in Self-Supervised Vision Transformers." Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021.
Zhou, Jinghao, et al. "iBOT: Image BERT Pre-Training with Online Tokenizer." arXiv preprint arXiv:2111.07832, 2021.