[논문 리뷰] ViViT: A Video Vision Transformer

Posted Feb 28, 2024

By Luna Lee

17 min read

Google Research ICCV 2021

Introduction

Transformer는 Multi-headed self-attention을 기반으로 long-range dependency를 모델링하는데 효과적인 모델이다. Receptive field가 제한되고 네트워크의 깊이에 따라 선형적으로 증가하는 convolution과 달리 모델이 입력 sequence의 모든 요소에 attend 할 수 있도록 하는 특징이 있다. 최근 NLP에 이어 Attention-based 모델인 Transformer는 Computer vision에서도 convolution을 대체하려는 시도가 등장했다. 하지만 Transformer는 convolution의 inductive bias(translational equivariance와 같은)가 부족하기 때문에 더 많은 데이터 혹은 강력한 regularization이 필요하다.

Convolution 구조보다 뛰어난 성능을 보인 ViT에 영감을 받아, attention-based 구조가 long-range contextual relationship을 모델링하는데 효과적이라는 사실에 기반하여 저자는 Video classification을 위한 transformer-based 모델을 제안한다. 제안하는 모델은 pure-transformer 구조로 입력 video에서 추출한 일련의 spatio-temporal token을 기반으로 계산된다. 또한 수많은 spatio-temporal token을 효과적으로 처리하기 위해 spatial 및 temporal 차원에 따라 모델을 factorising하여 효율성과 확장성을 높이는 여러 가지 방법을 제시한다. 뿐만 아니라 더 작은 데이터셋에서 모델을 효과적으로 훈련하기 위한 regularization, pre-training 방법을 제시한다.

Method (Video Vision Transformers)

1. Overview of Vision Transformers (ViT)

Vision Transformer(ViT)는 최소한의 변경으로 Transformer architecture를 적용한다. ViT에서는 N개의 non-overlapping patch $(x_i \in ℝ^{h\times w})$를 추출하여 linear projection 한 뒤 1D token으로 raster화$(z_i \in ℝ^d)$ 한다. Learned classification token $z_{cls}$가 시퀀스에 추가되고, 위치정보를 유지하기 위한 learned positional embedding $p \in ℝ^{N \times d}$가 추가된다.

\[\mathbf{z} = [z_{cls}, \mathbf{E}x_1, \mathbf{E}x_2, ... , \mathbf{E}x_N] + \mathbf{p}\]

모델은 총 L개의 transformer layer로 구성되어있으며, 각각의 layer에는 Multi-Headed Self-Attention, layer normalization (LN), MLP blocks 로 구성되어있다. 마지막으로 linear classifier는 입력에 classification token이 추가된 경우 $z^L_{cls} \in ℝ^d$를 기반으로, 아닌 경우 모든 token $\mathbf{z}^L$의 global average pooling를 기반으로 분류를 수행한다.

\[\mathbf{y^\ell} = MSA(LM(\mathbf{z^{\ell}})) + \mathbf{z^{\ell}} \\\] \[\mathbf{z^{\ell+1}} = MLP(LN(y^{\ell})) + y^{\ell}\]

2. Embedding video clips

논문에서는 video $\mathbf{V} \in ℝ^{T \times H \times W \times C}$ 를 token sequence $\tilde{\mathbf{z}} \in ℝ^{n_t \times n_h \times n_w \times d}$ 로 매핑하는 두 가지 방법을 제시한다. 그 뒤에 positional embedding을 추가하고 $ℝ^{N \times d}$로 reshape 하여 transformer input $\mathbf{z}$를 얻는다.

< Uniform frame sampling >

입력 비디오를 토큰화하는 가장 간단한 방법은 위의 그림과 같이 video clip에서 frame $n_t$를 균일하게 샘플링하고, 각 2D frame에 대해 독립적으로 ViT와 동일한 방법을 사용하는 것이다. 단순하게 모든 frame에 대해 각각 ViT patch 분할과정을 적용한다고 보면 된다.

각각 embed된 patch는 모든 frame에 대해 concat되어 입력으로 사용된다. 각 frame에서 non-overlapping patch $n_h \cdot n_w$를 추출하면 총 $n_t \cdot n_h \cdot n_w$ token이 transformer encoder에 들어간다.

< Tubelet Embedding >

두 번째 방법은 입력 volume에서 spatio-temporal “tube”를 추출하고 이를 $ℝ^d$로 linearly project하는 것이다. 이 방법은 ViT의 embedding을 3D로 확장한 버전이다. Tubelet의 dimension이 $t \times h times w$ 일 때, token의 개수 $n_t \cdot n_h \cdot n_w$ 는 각각 $n_t = T/t, n_h = H/h, n_w = W/w$ 를 의미한다.

Tubelet의 dimension이 작을 수록 더 많은 token이 생성되고 계산량이 증가한다. 이 방법은 서로 다른 frame에 대한 temporal 정보가 transformer에서 융합되는 Uniform frame sampling과 달리 token화 중에 spatio-temporal 정보를 융합한다.

3. Transformer Models for Video

논문에서는 다양한 Transformer 구조를 제안한다. 모든 spatio-temporal token의 pairwise interaction를 모델링하는 ViT의 간단한 확장 버전부터 입력 video의 spatial,temporal dimension을 효율적으로 factorise하는 모델까지 다양한 transformer 구조를 개발했다.

Model 1. Spatio-temporal attention

Spatio-temporal attention 모델은 단순히 video $z^0$에서 추출한 모든 spatio-temporal token을 인코더 입력으로 사용하는 방법이다. Receptive field가 선형적으로 증가하는 CNN 구조와 달리 각각의 transformer layer는 모든 spatio-temporal token에 대해, 모든 쌍에 대한 상호작용(pair-wise interaction)을 모델링한다. 따라서 첫 번째 layer에서도 video 전체의 long-range interaction을 모델링할 수 있다.

하지만 이 때문에 Multi-Headed Self Attention (MSA)는 token 수의 제곱만큼 복잡도가 증가한다는 문제가 있다.

Model 2. Factorised encoder

Factorised encoder 모델은 위의 그림과 같이 두개의 transformer encoder로 구성되어 있다.

Spatial encoder($L_s$ transformer layer)
- 동일한 temporal index에서 추출된 token 간의 상호 작용만 모델링. 즉, 동일한 시간(temporal) 정보를 가진 token에 대해 공간적인(spatial) 정보를 학습하기 위함
- Spatial encoder를 거쳐 encoding된 각각의 CLS token은 해당 temporal index에 대한 representation$(h_i \in ℝ^d)$으로 볼 수 있다. (입력으로 CLS token이 추가 된 경우에만. 만약 CLS token이 추가되지 않았다면 앞서 언급한 대로 spatial encoder output token $\mathbf{z}^{L_s}$를 global average pooling해서 구한다).
- frame-level representation $h_i$는 concat 되어$(\mathbf{H} \in ℝ^{n_t \times d})$ temporal encoder로 전달된다.
Temporal encoder($L_t$ transformer layer)
- 공간적인 정보를 학습한 representation을 받아 다른 temporal index를 가진 token 사이의 상호작용을 모델링, 즉 시간적인(temporal) 정보를 학습하기 위함
- Temporal Encoder의 output token이 최종적으로 분류에 사용된다.

이 구조는 시간적인 정보를 “late fusion”하는 구조이다. 모델 1과 비교하여 더 많은 수의 transformer layer를 포함하지만 적은 FLOPs가 요구된다.

Model 3. Factorised self-attention

Factorised self-attention 모델은 모델1과 동일한 수의 transformer layer를 포함한다. 하지만 layer l에서 모든 token 쌍 $\mathbf{z}^{\ell}$에 대해 multi-headed self-attention을 계산하지 않고 연산을 factorise한다. 먼저 공간적(spatially)으로 self-attention을 수행한 다음 시간적(temporally)으로(동일한 temporal index에서 추출된 모든 token에 대해) self-attention을 수행한다. 이 방법으로 transformer는 spatio-temporal 상호작용을 모델링 할 수 있지만 연산을 factorise했으므로 computational complexity는 모델 2와 동일하다.

효과적인 spatial self-attention 연산을 위해 token $\mathbf{z}$를 $ℝ^{1 \times n_t \cdot n_h \cdot n_w \cdot d}$ 에서 $ℝ^{n_t \times n_h \cdot n_w \cdot d}$로 reshape한다$( \mathbf{z_s})$. 마찬가지로 temporal self-attention 연산을 위해 $\mathbf{z}_t$를 $ℝ^{n_t \cdot n_h \times n_w \cdot d}$로 reshape한다. Self-attention은 아래와 같이 계산된다.

저자는 spatial-then-temporal selfattention과 temporal-then-spatial self-attention의 차이, 즉 순서에 따른 차이는 없었다고 한다. 또한 공간적 차원과 시간적 차원 사이에서 입력 token을 reshape할 때 모호함을 피하기 위해 이 모델에서 classification token을 사용하지 않았다고 한다.

Model 4. Factorised dot-product attention

마지막으로, 저자는 모델2, 3과 동일한 computational complexity를 가지면서 모델1과 동일한 parameter수를 갖는 모델을 개발했다. 공간 및 시간 차원의 factorisation은 모델3과 유사하지만, 대신 multi-head dot-product attention 연산에 대해 factorise를 수행한다. 각각의 token의 시간, 공간 dimension에 대해 분리된 head를 사용하여 attention weight을 계산한다. 각각의 head에 대한 attention 연산은 아래와 같다.

\[Attention(\mathbf{Q, K, V}) = softmax\Big(\dfrac{\mathbf{Q{K}^\top}}{\sqrt{d_k}}\Big)\mathbf{V}\]

여기서 query $\mathbf{Q}=\mathbf{X}\mathbf{W}_q$, key $\mathbf{K}=\mathbf{X}\mathbf{W}_k$, value $\mathbf{V}=\mathbf{X}\mathbf{W}_v$ 는 입력 $\mathbf{X}$에 대한 linear projections이다.$(\mathbf{X, Q, K, V} \in ℝ^{N \times d}, \; N = n_t \cdot n_h \cdot n_w)$

논문의 메인 아이디어는 동일한 spatial-, temporal index를 가지는 token만 attend 할 수 있도록 key와 value 값을 각각의 query를 기준으로 수정하는 것이다. 이를 위해 key와 value 값의 차원을 $\mathbf{K}_s, \mathbf{V}_s \in ℝ^{n_h \cdot n_w \times d}$ 와 $\mathbf{K}_t, \mathbf{V}_t \in ℝ^{n_t \times d}$ 로 수정한다.

Sptial head와 Temporal head는 아래와 같이 나타낼 수 있다.

\[\mathbf{Y}_s = Attention(\mathbf{Q}, \mathbf{K}_s, \mathbf{V}_s) \\\] \[\mathbf{Y}_t = Attention(\mathbf{Q}, \mathbf{K}_t, \mathbf{V}_t)\]

그리고 최종적으로 두 결과는 concat되고 linear projection한다.

\[\mathbf{Y} = Concat(\mathbf{Y}_s, \mathbf{Y}_t)\mathbf{W}_o\]

4. Initialisation by leveraging pretrained models

Transformer는 CNN의 inductive bias가 부족하기 때문에 대규모 데이터셋이 필요한 것으로 알려져있다. 하지만 Video 데이터는 가장 크다고 알려진 Kinetics 조차 2D 이미지에 비해 적은 양의 데이터를 가지고 있다. 따라서 본 논문에서는 이러한 문제를 해결하고 효율적으로 모델을 학습시키기 위해 pretrained image 모델로 부터 video 모델을 initialise(초기화) 하기 위한 전략을 제시한다.

Positional embeddings
positinal embedding은 각 입력 token에 추가된다. 하지만 video 모델은 image 모델보다 몇 배의 token을 가지고 있다. 따라서 positional embedding을 시간적으로 반복하여 초기화한다($ℝ^{n_h \cdot n_w \times d} → ℝ^{n_t \cdot n_h \cdot n_w \times d}$). 따라서 동일한 spatial index를 가진 token은 동일한 embedding으로 초기화되며, fine-tuning된다.

Embedding weights, E
Pre-train된 모델 $\mathbf{E}_{image}$의 embedding filter가 2D tensor것과 달리, “tubelet embedding” tokenisation 방법을 사용할 때 embedding filter는 3D tensor이다. 일반적으로 Video classification을 위해 2D filter에서 3D convolution filter를 초기화하는 방법은 시간적 차원을 따라 filter를 복제하고 평균내는 “inflate” 방법 이다.

\[\mathbf{E} = \dfrac1t[\mathbf{E}_{image},\; ...\;, \mathbf{E}_{image},\;...\;, \mathbf{E}_{image}].\]

저자는 “central frame initialisation”라고 하는 추가적인 전략을 고안했다. 가운데 t/2를 제외하고 나머지 시간적 위치를 0으로 초기화한다. 이 방법을 이용하면 3D convolution filter는 초기화시에 “Uniform frame sampling”처럼 보이게 되어 효과적으로 동작하고, 모델 학습이 진행됨에 따라 여러 frame에서 시간 정보를 집계하는 방법을 학습할 수 있다.

\[\mathbf{E} = [0, \;...\;, \mathbf{E}_{image}, \; ... \;, 0 ].\]

Transformer weights for Model 3
모델 3의 transformer block은 pretrain된 ViT와 다르게 2개의 MSA(Multi-headed Self Attention) 모듈을 포함한다. 이 경우에 pretrained 모듈에서 spatial MSA를 초기화하고 temporal MSA의 모든 weight을 0으로 초기화한다.

Empirical evaluation

1. Experimental Setup

Network, training, datasets.

backbone: ViT(ViT-Base, ViT-Large, ViT-Huge), BERT
ImageNet-21K, larger JFT dataset으로 학습한 ViT를 이용해 ViViT 모델을 초기화함
모든 실험에서 tubelet의 size는 동일. $h \times w \times t = 16×16×2$.
Dataset: Kinetics, Epic Kitchens-100, Moments in Time, Something-Something v2(SSv2)

Inference.

network에 대한 입력은 32 frames, stride 2의 video clip이다. inference시에는 더 긴 video에 대해 multiple view로 처리하고, 각각의 per-view logit에 대해 average해서 최종 결과를 얻는다고 한다. 긴 동영상은 쪼개서 여러번 모델에 대해 결과를 얻고 average해서 결과를 얻는 방식인것 같다.

2. Ablation study

Input encoding

모델 1과 Kinetics 400 데이터셋을 이용하여 다양한 입력 인코딩 방법(Method 2.)에 대한 실험을 진행했다.

Model variants

accuracy와 efficiency 측면에서 Kinetics 400 및 Epic Kitchens 데이터셋에 대해 제안된 model variants(Method 3.)에 대해 비교했다.

모델 2는 추가적인 hyperparameter $L_t$(temporal transformer 수)가 있다. $L_t$에 대한 실험 결과는 아래와 같다.

Model regularisation

Factorised encoder 모델(모델 2)를 사용하여 각각의 regularisation 추가에 따른 성능 향상에 대해 조사했다.

Varying the backbone, number of tokens & input frames

3. Comparison to state-of-the-art

Ablation study 결과를 바탕으로 두가지 모델 변형을 사용하여 SOTA와 비교했다. 더 높은 정확도를 달성하기 위해 모델 1보다 더 많은 토큰을 처리할 수 있는 Factorized Encoder 모델(모델 2)를 주로 사용하여 실험을 진행했다. 각각의 데이터셋에 대한 비교 결과는 아래와 같다.

AI, Paper Review

This post is licensed under CC BY 4.0 by the author.