This blog post provides an overview of my ongoing project with Machine Learning for Science (ML4Sci) as part of the Google Summer of Code (GSoC) 2024. All the project's code is openly accessible at github.com/iyersreehari/DeepLense_SSL_Sreehari_Iyer
Strong gravitational lensing provides a means to probe dark matter substructure. In recent years, machine learning techniques, particularly supervised learning, have been utilized for substructure detection and for other regression and classification tasks on lensing data. However, labeled training data for strong gravitational lensing is scarce, while supervised learning requires abundant labeled data and can be biased by class imbalance in the training set. To circumvent this, previous works have trained supervised models on simulated lensing datasets, an approach that may degrade performance when running inference on real data. In computer vision, self-supervised learning (SSL) has emerged as a potent solution, particularly effective in scenarios with abundant unlabeled data and scarce labeled data. Recent works have studied convolutional neural network (CNN) based SSL with simulated lensing data. This project focuses on evaluating self-supervised learning techniques with Transformers on a real-world strong gravitational lensing dataset.
The learned representations are then evaluated on the downstream task of classifying lens and non-lens images. We train and compare Vision Transformers [Dosovitskiy, Alexey, et al., 2020] with different self-supervised learning algorithms, along with a supervised baseline. Vision Transformers split the image into fixed-size patches and linearly embed them, adding position embeddings. The resulting sequence of vectors is then fed into a standard Transformer encoder. The following illustration of the ViT architecture is from Dosovitskiy, Alexey, et al., 2020.
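As a complement to that illustration, a rough PyTorch sketch of the patch-embedding step is given below. The module name, image size, patch size, and embedding dimension are placeholders, not the project's actual configuration.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    # Illustrative sketch: split an image into fixed-size patches, linearly
    # embed each patch, prepend a [CLS] token and add learnable position
    # embeddings before the sequence enters the Transformer encoder.
    def __init__(self, img_size=64, patch_size=8, in_chans=1, embed_dim=384):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to cutting the image into patches
        # and applying a shared linear projection to each patch.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

    def forward(self, x):
        B = x.shape[0]
        x = self.proj(x).flatten(2).transpose(1, 2)      # (B, num_patches, embed_dim)
        cls = self.cls_token.expand(B, -1, -1)           # (B, 1, embed_dim)
        x = torch.cat([cls, x], dim=1) + self.pos_embed  # add position embeddings
        return x                                         # input to the Transformer encoder
```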
A supervised learning baseline is trained to compare the performance of the self-supervised learning algorithms against. The ViT-B and ViT-S networks are trained to minimize the cross-entropy loss between the predicted labels and the true labels of the labeled training dataset.
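A minimal sketch of one training epoch for this baseline is shown below, assuming a hypothetical `vit` classifier and a labeled `train_loader`; the optimizer and other training details are omitted.

```python
import torch
import torch.nn as nn

# Hypothetical names: `vit` is a ViT-S/ViT-B classifier with a linear head,
# `train_loader` yields (image, label) batches from the labeled dataset.
def train_supervised_epoch(vit, train_loader, optimizer, device="cuda"):
    criterion = nn.CrossEntropyLoss()   # cross entropy between predictions and labels
    vit.train()
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        logits = vit(images)            # class predictions
        loss = criterion(logits, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```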
Recent studies in self-supervised representation learning utilize methods involving some form of Siamese network.
The SimSiam architecture takes as input two randomly augmented views \( x_1 \) and \( x_2 \) of an image \( x \). We utilize random horizontal flips, random brightness and contrast jitter, and random Gaussian blur to obtain the augmentations.
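A possible torchvision-based sketch of this augmentation pipeline is given below; the jitter strengths and blur parameters are assumptions, not the exact values used in the project.

```python
import torchvision.transforms as T

# Illustrative parameters only.
augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),
    T.ColorJitter(brightness=0.4, contrast=0.4),      # brightness and contrast jitter
    T.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),  # random gaussian blur
])

def two_views(image):
    # Two independent random augmentations of the same image.
    return augment(image), augment(image)
```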
The architecture consists of an encoder network \( f \) followed by a prediction MLP head \( h \). The encoder network \( f \) comprises the backbone network followed by a projection MLP.
Let the cosine similarity between the output vectors \( p_1 = h(f(x_1)) \) and \( z_2 = f(x_2) \) be \[ \mathcal{D} (p_1, z_2) = \frac{p_1}{|| p_1 ||_2} \cdot \frac{z_2}{|| z_2 ||_2} \]
The objective of SimSiam is to minimize the symmetrized loss defined as follows:
\[ \mathcal{L} = - \frac{1}{2} \mathcal{D} (p_1, z_2) - \frac{1}{2} \mathcal{D} (p_2, z_1) \]
Further, the architecture utilizes a stop-gradient operation to avoid representation collapse, in which the optimizer quickly finds a degenerate solution that reaches the minimum possible loss of −1.
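Putting the pieces together, the sketch below shows how the symmetrized loss and the stop-gradient (via `.detach()`) could be computed, assuming `f` and `h` are the encoder and prediction-head modules described above.

```python
import torch.nn.functional as F

def simsiam_loss(f, h, x1, x2):
    # z: encoder outputs, p: predictor outputs.
    z1, z2 = f(x1), f(x2)
    p1, p2 = h(z1), h(z2)
    # D(p, z): cosine similarity; detach() is the stop-gradient on the target branch.
    d = lambda p, z: F.cosine_similarity(p, z.detach(), dim=-1).mean()
    # Symmetrized objective: L = -1/2 D(p1, z2) - 1/2 D(p2, z1).
    return -0.5 * d(p1, z2) - 0.5 * d(p2, z1)
```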
The evaluation results on the test dataset after fine-tuning the architecture on the training dataset are given below.
The model consists of student and teacher networks.
The output of the teacher network is centered with a mean computed over the batch.
The outputs of the student and the teacher networks are normalized with a temperature softmax over the feature dimension.
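A sketch of the centering and temperature-softmax operations is given below; the temperature and center-momentum values are placeholders, not the settings used in the project.

```python
import torch
import torch.nn.functional as F

def sharpen_teacher(t_out, center, teacher_temp=0.04):
    # Center the teacher output with a running batch mean, then apply a temperature softmax.
    return F.softmax((t_out - center) / teacher_temp, dim=-1)

def sharpen_student(s_out, student_temp=0.1):
    return F.softmax(s_out / student_temp, dim=-1)

def update_center(center, t_out, momentum=0.9):
    # Exponential moving average of the per-batch mean of teacher outputs.
    batch_center = t_out.mean(dim=0, keepdim=True)
    return center * momentum + batch_center * (1 - momentum)
```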
Following [Zhou, Jinghao, et al., 2021], given the training set \( \mathcal{I} \), an image \( x \sim \mathcal{I} \) is sampled uniformly and two random augmentations are applied to it, yielding two distorted views \( u \) and \( v \). We utilize random horizontal flips, random brightness and contrast jitter, and random Gaussian blur to obtain the augmentations.
The two distorted views are then passed through a teacher-student framework to obtain predictive categorical distributions from the \( \texttt{[CLS]} \) token:
\[
v^{\texttt{[CLS]}}_t = P_{\theta '}^{\texttt{[CLS]}} (v)
\quad \text{and} \quad
u^{\texttt{[CLS]}}_s = P_{\theta}^{\texttt{[CLS]}} (u)
\]
where \( \theta ' \) denotes the parameters of the teacher network and \( \theta \) denotes the parameters of the student network.
The knowledge is distilled from the teacher to the student by minimizing their cross-entropy loss w.r.t. the student parameters \( \theta \):
\[
\mathcal{L}_{\texttt{[CLS]}} (u, v) = - P_{\theta '}^{\texttt{[CLS]}} (v) \cdot \log {(P_{\theta}^{\texttt{[CLS]}} (u))}
\]
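Assuming `p_teacher_v` and `p_student_u` are the centered and temperature-sharpened distributions from the sketch above, this distillation term could be computed as follows.

```python
import torch

def cls_distillation_loss(p_teacher_v, p_student_u, eps=1e-7):
    # - P_teacher(v) . log P_student(u), averaged over the batch.
    # Gradients flow only through the student distribution.
    return -(p_teacher_v.detach() * torch.log(p_student_u + eps)).sum(dim=-1).mean()
```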
The teacher and the student share the same architecture, consisting of a backbone \( f \)
and a projection MLP head \( h \). A stop-gradient (sg) operator is applied on the teacher network, similar to the SimSiam architecture.
Self-supervised learning is implemented in Caron, Mathilde, et al., 2021 through different distorted views, or crops, of an image, using a multi-crop strategy. For a given image, a set \( V \) of different views is generated. This set contains two (or more) global views, \( x_{g1} \) and \( x_{g2} \), and several local views at smaller resolution (crops covering a smaller area of the image). All crops are passed through the student network, while only the global views are passed through the teacher network.
Thus, the objective to be minimized w.r.t. the student parameters \( \theta \) is as follows:
\[
\mathcal{L}_{DINO} = \sum_{v \in \{ x_{g1}, x_{g2} \}} \ \sum_{u \in V, u \neq v}
- P_{\theta '}^{\texttt{[CLS]}} (v) \cdot \log {(P_{\theta}^{\texttt{[CLS]}} (u))}
\]
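A sketch of how this multi-crop objective could be assembled is given below, reusing the `cls_distillation_loss` helper from the earlier sketch; note that the DINO reference implementation additionally normalizes by the number of view pairs.

```python
def dino_multicrop_loss(teacher_global, student_all):
    # teacher_global: [CLS] distributions of the global views from the teacher.
    # student_all:    [CLS] distributions of all crops (global + local) from the
    #                 student, with the global views first, in the same order.
    total = 0.0
    for iv, p_t in enumerate(teacher_global):
        for iu, p_s in enumerate(student_all):
            if iu == iv:          # skip the pair where u and v are the same view
                continue
            total = total + cls_distillation_loss(p_t, p_s)
    return total
```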
The teacher parameters \( \theta' \) are updated with
an exponential moving average (EMA) of the student parameters \( \theta \).
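The EMA update could look like the following sketch; the momentum value is a placeholder (in practice it is typically scheduled over training).

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    # theta' <- m * theta' + (1 - m) * theta
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.data.mul_(momentum).add_(p_s.data, alpha=1 - momentum)
```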
The evaluation results on the test dataset after fine-tuning the architecture on the training dataset are given below.
The iBOT objective is a linear combination of the DINO loss and the Masked Image Modelling (MIM) loss:
\[
\lambda_1 \cdot \mathcal{L}_{DINO} + \lambda_2 \cdot \mathcal{L}_{MIM}
\]
where \( \mathcal{L}_{DINO} \) is the objective minimized in DINO as described in the previous section, and \( \lambda_1 \), \( \lambda_2 \) are hyperparameters.
The objective is minimized with respect to \( \theta \), the parameters of the student network. The parameters of the teacher network \( \theta' \) are updated with an exponential moving average (EMA) of the student parameters. To compute the MIM loss, a blockwise mask is applied to the two augmented views \( x_1 \) and \( x_2 \) of the same image \( x \), yielding the corresponding masked views \( \hat{x}_1 \) and \( \hat{x}_2 \). This is achieved by masking random contiguous blocks of image patches, effectively covering square-shaped regions of the image.
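A deliberately simplified sketch of such a blockwise mask over the patch grid is shown below; iBOT's actual block sampling is more elaborate (multiple blocks with varying sizes and aspect ratios), so this single-square version is only illustrative.

```python
import torch

def random_block_mask(grid_h, grid_w, block=3):
    # Simplified blockwise masking: mask one random `block` x `block` square of
    # patches on the (grid_h x grid_w) patch grid; returns a flat 0/1 mask m_i.
    mask = torch.zeros(grid_h, grid_w)
    top = torch.randint(0, grid_h - block + 1, (1,)).item()
    left = torch.randint(0, grid_w - block + 1, (1,)).item()
    mask[top:top + block, left:left + block] = 1.0
    return mask.flatten()    # one entry per image patch
```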
The training objective of MIM in iBOT is defined as:
\[
\mathcal{L}_{MIM} = - \sum_{i=1}^N m_i \cdot P_{\theta'}^{\text{patch}} (x^i_1) \cdot \log \left( P_{\theta}^{\text{patch}} (\hat{x}^i_1) \right) - \sum_{i=1}^N m_i \cdot P_{\theta'}^{\text{patch}} (x^i_2) \cdot \log \left( P_{\theta}^{\text{patch}} (\hat{x}^i_2) \right)
\]
Here, \( N \) is the number of patch tokens in a view and \( m_i \in \{0, 1\} \) indicates whether the \( i \)-th patch token is masked. When implementing self-supervised learning with the multi-crop strategy described in the previous section, \( x_1 \) and \( x_2 \) correspond to the global views.
We again use random horizontal flips, random brightness and contrast jitter, and random Gaussian blur to obtain the augmentations.
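For concreteness, the sketch below shows how the MIM term for one view and the weighted combination with the DINO term could be computed, assuming per-patch categorical distributions of shape (batch, patches, prototypes) and a 0/1 patch mask; normalizing by the number of masked patches is an implementation convention rather than part of the written objective.

```python
import torch

def mim_loss_one_view(p_teacher_patch, p_student_patch, mask, eps=1e-7):
    # p_teacher_patch: teacher patch distributions on the unmasked view, (B, N, K)
    # p_student_patch: student patch distributions on the masked view,   (B, N, K)
    # mask:            0/1 mask over the N patch tokens,                 (B, N)
    ce = -(p_teacher_patch.detach() * torch.log(p_student_patch + eps)).sum(dim=-1)
    return (mask * ce).sum() / mask.sum().clamp(min=1)

def ibot_objective(loss_dino, loss_mim, lam1=1.0, lam2=1.0):
    # Hyperparameter-weighted combination of the two terms (placeholder weights).
    return lam1 * loss_dino + lam2 * loss_mim
```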