[논문 리뷰] simCLR, A Simple Framework for Contrastive Learning of Visual Representations

Posted Jul 3, 2024

By Luna Lee

10 min read

Geoffrey Hinton PMLR 2020

Paper: https://arxiv.org/abs/2002.05709
Git: https://github.com/google-research/simclr

📖 핵심 훑어보기 !!

Visual representation을 학습하기 위해 latent space에서 비슷한 이미지는 가까워지도록, 다른 이미지는 서로 멀어지도록 학습하는 Contrastive Learning 방법을 사용
데이터 sample에 augmentation을 적용하여 같은 이미지에 생성된 예제는 positive pair로, batch 내의 나머지 예제를 negative로 취급하여 학습함
다양한 Augmentation, model size, projection head, batch size등 실험을 통해 다양한 parameter의 영향에 대해 증명함

Introduction

Human supervision 없이 효과적인 visual representation을 학습하는 것은 오래 다뤄져온 문제이다. 일반적인 SSL(self-supervised learning) 방법으로 self-supervised learning에서 사용하는 object function을 사용하되, unlabeled data에서 파생된 pretext task를 수행하도록 모델을 학습시키는 방법을 수행해왔다. 하지만 이러한 방법은 pretext task를 설계하기 위해 heuristic에 의존할 뿐 아니라 학습된 representation의 generality를 제한할 수 있다.

저자는 contrastive learning을 기반 알고리즘에 영감을 얻어, SimCLR이라고 부르는 visual representation의 contrastive learning을 위한 framework를 제안한다. SimCLR은 이전 작업보다 뛰어난 성능을 보일 뿐 아니라 특수한 구조나 memory bank를 필요로 하지 않는다. 주요 발견은 다음과 같다.

여러 data augmentation의 구성은 효과적인 contrastive prediction task를 정의하는데 중요하다.
Representation과 contrastive loss 사이에 학습 가능한 nonlinear transformation을 추가하면 학습된 representation의 품질이 크게 향상된다.
Contrastive cross entropy loss을 사용한 학습은 normalized embedding과 조정된 temperature parameter로 부터 이점을 얻는다.
Supervised learning에 비해 큰 batch size와 긴 학습, 더 깊고 넓은 네트워크의 이점을 얻는다.

Method

1. The Contrastive Learning Framework

Contrastive Learning은 말 그대로 “대조 학습”으로, representation을 학습할 때 latent space에서 비슷한 이미지는 가까워지도록, 다른 이미지는 서로 멀어지도록 학습하는 방법을 말한다. SimCLR도 마찬가지로 latent space에서 contrastive loss를 활용하여, 동일한 이미지에서 다른 Augmentation을 적용하여 같은 이미지에서 만들어진 것을 positive pair로 최대한 가까워지도록 학습한다. 위 그림과 같이 simCLR은 크게 네가지 구성 요소로 이루어져 있다.

stochastic data augmentation module: data sample을 random하게 augmentation하여, 동일한 이미지에서 생성된 augmented 이미지를 positive pair로 간주함.
base encoder $f(\cdot)$: augmented 된 data에서 representation vector를 추출.
projection head $g(\cdot)$: representation을 constrastive loss가 적용되는 공간으로 매핑.
contrastive loss function: contrastive prediction task에 의해 정의됨.

이 때 random 하게 N개의 데이터 예제로 구성된 minibatch를 샘플링하고, augmentation을 적용하여 2N개의 데이터 point를 생성한다. 같은 이미지에 생성된 예제는 positive pair가 되고, batch 내의 나머지 2(N-1)개의 예제를 negative로 취급하여 학습한다. Loss function은 다음과 같이 정의된다. $\text{sim}(u, v) = u^\top v/∥u∥∥v∥$는 cosine similarity를 나타내고, $\tau$는 temperature parameter를 의미한다.

\[\ell_{i,j} = -\log \frac{\exp(\text{sim} (z_i, z_j)/\tau)}{\sum^{2N}_{k=1} \mathbb 1_{[k \not = i]}\exp (\text{sim}(z_i, z_k)/\tau)}\]

알고리즘은 아래와 같이 요약될 수 있다.

2. Data Augmentation for Contrastive Representation Learning

Data Augmentation은 supervised, unsupervised representation learning에서 많이 사용되었지만, contrastive prediction task를 위한 체계적인 방법으로 사용되지는 않았다. 기존 접근 방식들은 contrastive prediction task를 위해 architecture를 변경하여, contrastive prediction task를 정의했다.

예를들면 네트워크 architecture에서 receptive field를 제한하여 global-to-local view prediction을 수행한다던지, context aggregation 네트워크를 사용하여 이미지 분할 절차를 수정하여 neighboring view prediction을 수행했다.

본 논문에서는 아래 그림과 같이 간단한 random cropping(with resizing)만을 수행하여 이러한 복잡한 과정을 대체할 수 있음을 증명했다.

Composition of data augmentation operations is crucial for learning good representations.

Data Augmentation의 영향을 체계적으로 연구하기 위해 몇가지 augmentation을 적용했다.

Cropping, resizing, rotation, cutout 같은 spatial/geometric transformation
Color distortion, Gaussian blur, Sobel filtering과 같은 appearance transformation

개별 augmentation을 사용하는 효과와 augmentation의 구성 즉, pair로 사용할 때의 성능을 조사했다. 위 그림에서 마지막 열 ‘Avarage’를 제외하고 나머지는 x, y에 해당하는 augmentation을 pair로 적용한 결과이다(대각행렬은 개별 augmentation 결과이다).

위 결과를 통해 단일 augmentation을 적용한 경우, 성능이 확연히 저하되는 것을 알 수 있다. 또한, random cropping + random color distortion으로 구성했을 때 가장 좋은 결과를 보이는 것을 알 수 있다.

3. Architectures for Encoder and Head

Unsupervised contrastive learning benefits (more) from bigger models.

다음으로는 모델 크기에 대해 성능 향상을 조사했다. 위의 그림과 같이, 모델의 depth와 width를 늘리면 성능이 향상되는 것을 볼 수 있다. 이 때, 모델의 크기가 커질 수록 supervised learning에서보다 unsupervised learning에서의 성능 격차가 줄어드는 것을 보면 unsupervised learning이 모델 크기 측면에서 더 많은 이점을 얻는다는 것을 알 수 있다.

A nonlinear projection head improves the representation quality of the layer before it.

다음으로는 projection head g(h)의 중요성에 대해 연구했다. 위의 그림은 head에 세 가지 다른 architecture를 사용했을 때의 성능에 대해 평가한 그래프이다. 각각 Identity mapping, linear projection, 추가적인 hidden layer를 사용한 nonlinear projection이다. Nonlinear projection이 linear projection 보다 더 좋은 결과를 보이고, projection이 없는 것보다 훨씬 나은 성능을 보이는 것을 확인할 수 있다.

또한 nonliear projection을 사용하더라도, hidden layer가 projection head 앞에 위치하는 경우$(\mathcal h)$가 뒤에 위치하는 경우$(z=g (\mathcal h))$보다 성능이 좋았다. 저자들은 이러한 현상을 nonlinear projection 앞의 representation을 사용하게 되면 contrastive loss로 인해 정보가 손실되기 때문이라고 추측했다.

4. Loss Functions and Batch Size

Normalized cross entropy loss with adjustable temperature works better than alternatives.

다음으로는 loss function에 따른 결과이다. 위 표와 같이 NT-Xent loss가 가장 높은 Top-1 score를 보이는 것을 알 수 있다.

또한 표 5를 통해 NT-Xent loss에서 $\ell_2$ normalization(cosine similarity vs dot product)과 temperature $\tau$의 중요성에 대해 실험한 결과를 볼 수 있다. Noramlization과 적절한 temperature 설정 없이는 성능이 좋지 않은 것을 확인할 수 있다.

Contrastive learning benefits (more) from larger batch sizes and longer training.

위의 그래프는 모델의 epoch에 따른 batch size의 영향을 나타낸다. Training epoch가 적을 때에는 batch size에 따른 성능 차이가 크고, epoch가 진행될 수록 batch size로 인한 격차가 줄어드는 것을 볼 수 있다. Contrastive learning에서 더 큰 batch size를 사용하면 더 많은 negative example을 제공하므로 모델이 수렴하기에 용이하다. 마찬가지로 더 많은 training epoch를 진행하면 더 많은 negative example을 제공하고, 결과가 개선된다.

AI, Paper Review

Contrastive Learning Self-supervised

This post is licensed under CC BY 4.0 by the author.