MAE: Masked Autoencoders Are Scalable Vision Learners

2022 CVPR

One-line summary: randomly mask patches of the input image and reconstruct the masked content

Approach

A simple, generic autoencoding approach

encoder : signal ⇒ latent representation

decoder : latent representation ⇒ reconstruction of the original signal

Unlike a classical autoencoder, MAE adopts an asymmetric design

Masking

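As in a standard ViT, the image is divided into non-overlapping patches
A subset of the patches is randomly sampled and kept visible; the remaining patches are masked (removed)

The authors mask as much as 75% of the patches, which they report makes pre-training more efficient

Such a high masking ratio largely eliminates redundancy

⇒ This creates a task that cannot be solved easily by extrapolating from visible neighboring patches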
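The masking step described above can be sketched at the level of patch indices. This is a minimal illustration, not the authors' code: the helper name `random_masking`, the seed, and the 224x224 image with 16x16 patches (196 patches) are assumptions chosen for the example.

```python
import random

def random_masking(num_patches, mask_ratio=0.75, seed=None):
    """Split patch indices into visible and masked sets via a random shuffle.

    Hypothetical helper for illustration; name and defaults are assumptions.
    """
    rng = random.Random(seed)
    indices = list(range(num_patches))
    rng.shuffle(indices)                       # random permutation of patch indices
    num_keep = int(num_patches * (1 - mask_ratio))
    visible = sorted(indices[:num_keep])       # patches the encoder will see
    masked = sorted(indices[num_keep:])        # patches to be reconstructed
    return visible, masked

# Assumed setup: 224x224 image, 16x16 patches -> 14 * 14 = 196 patches
num_patches = (224 // 16) ** 2
visible, masked = random_masking(num_patches, mask_ratio=0.75, seed=0)
print(len(visible), len(masked))
```

With a 75% ratio, 49 of the 196 patches stay visible and 147 are masked. Sampling without replacement through a shuffle guarantees the two sets are disjoint and together cover every patch.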

MAE encoder

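The encoder is a ViT ⇒ applied only to the unmasked (visible) patches!

Patches are embedded as in a standard ViT. No mask tokens are used in the encoder

⇒ Only 25% of the patches are processed, so computation drops, and the full token set can be left to a lightweight decoder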
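To make the saving concrete, here is a rough back-of-the-envelope sketch. The image size, patch size, and the function name `encoder_token_stats` are assumptions for illustration, and the estimate only counts tokens: it ignores the class token and treats self-attention cost as proportional to the square of the sequence length.

```python
def encoder_token_stats(image_size=224, patch_size=16, mask_ratio=0.75):
    """Estimate how many tokens the encoder sees and the relative attention cost.

    Hypothetical helper; all default values are assumptions for illustration.
    """
    total = (image_size // patch_size) ** 2          # patches per image
    visible = int(total * (1 - mask_ratio))          # tokens fed to the encoder
    token_fraction = visible / total                 # cost of per-token layers (e.g. MLP)
    attention_fraction = visible ** 2 / total ** 2   # cost of pairwise attention scores
    return total, visible, token_fraction, attention_fraction

total, visible_count, token_fraction, attention_fraction = encoder_token_stats()
print(total, visible_count, token_fraction, attention_fraction)
```

Under these assumptions the encoder handles 49 of 196 tokens, so per-token layers cost about a quarter of the full-sequence cost and the attention score computation about one sixteenth.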

MAE decoder

input : encoded visible patches, mask tokens

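A mask token is a shared, learned vector that indicates the presence of a patch to be predicted

Positional embeddings are added to all tokens
The decoder is used only during pre-training, for the image reconstruction task ⇒ the decoder architecture can be designed flexibly, independently of the encoder design
The authors use a decoder whose computation per token is under 10% of the encoder's
Only the encoder is used to produce image representations for recognition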
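A minimal sketch of how the decoder input could be assembled from the pieces listed above, using plain Python lists as stand-ins for tensors. The function names, the toy dimensions, and the numeric values are assumptions; in a real model the mask token and positional embeddings would be learned or fixed parameters of the network.

```python
def build_decoder_input(encoded_visible, visible_idx, num_patches, mask_token):
    """Put encoded visible tokens back at their original positions and
    fill every other position with a copy of the shared mask token."""
    tokens = [list(mask_token) for _ in range(num_patches)]
    for vec, idx in zip(encoded_visible, visible_idx):
        tokens[idx] = list(vec)
    return tokens

def add_positional_embedding(tokens, pos_embed):
    """Add a per-position embedding to every token, visible or masked."""
    return [[t + p for t, p in zip(tok, pos)] for tok, pos in zip(tokens, pos_embed)]

# Toy example (assumed values): 4 patches, 2-dim tokens, patches 1 and 3 visible
encoded_visible = [[1.0, 2.0], [3.0, 4.0]]
visible_idx = [1, 3]
mask_token = [0.5, 0.5]
pos_embed = [[0.0, 0.0], [10.0, 10.0], [20.0, 20.0], [30.0, 30.0]]

tokens = build_decoder_input(encoded_visible, visible_idx, 4, mask_token)
decoder_input = add_positional_embedding(tokens, pos_embed)
print(decoder_input)
```

Because every masked position holds the same vector, the positional embedding is what tells the decoder which patch each mask token stands for.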

Reconstruction target

The input is reconstructed by predicting pixel values for each masked patch
Each element of the decoder output: a vector of pixel values representing one patch
The last layer of the decoder is a linear projection whose number of output channels = the number of pixel values in a patch

The loss function is the MSE (mean squared error) between the reconstructed and original images in pixel space

⇒ The MSE is computed only on the masked patches
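The masked loss can be sketched as follows. This is an illustrative pure-Python version under assumed conventions (`mask` holds 1 for masked patches and 0 for visible ones, and the loss is averaged per patch and then over masked patches); the function name `masked_mse` is hypothetical.

```python
def masked_mse(pred, target, mask):
    """Mean squared error averaged over masked patches only.

    pred, target: list of patches, each a list of pixel values
    mask: list of 0/1 flags, 1 = patch was masked (assumed convention)
    """
    total, count = 0.0, 0
    for p, t, m in zip(pred, target, mask):
        if m:
            # per-patch mean of squared pixel errors
            total += sum((pi - ti) ** 2 for pi, ti in zip(p, t)) / len(p)
            count += 1
    return total / count if count else 0.0

# Toy example: patch 0 is visible, so its large error does not count
pred = [[0.0, 0.0], [1.0, 3.0], [5.0, 5.0]]
target = [[9.0, 9.0], [1.0, 1.0], [4.0, 6.0]]
mask = [0, 1, 1]
loss = masked_mse(pred, target, mask)
print(loss)
```

Here the two masked patches contribute per-patch errors of 2.0 and 1.0, giving a loss of 1.5, while the visible patch is ignored entirely.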

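The authors also study a variant in which the reconstruction target is the normalized pixel values of each masked patch

⇒ The mean and standard deviation of all pixels in a patch are computed and used to normalize that (masked) patch

⇒ Using normalized pixels as the target improved representation quality in the authors' experiments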
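A minimal sketch of the per-patch normalization of the target. The function name `normalize_patch`, the small `eps` added for numerical stability, and the use of the population variance are assumptions for illustration, not details taken from these notes.

```python
import math

def normalize_patch(pixels, eps=1e-6):
    """Normalize one patch's pixel values to zero mean and (nearly) unit variance.

    eps is an assumed stabilizer that avoids dividing by zero on flat patches.
    """
    mean = sum(pixels) / len(pixels)
    var = sum((x - mean) ** 2 for x in pixels) / len(pixels)  # population variance
    return [(x - mean) / math.sqrt(var + eps) for x in pixels]

normalized = normalize_patch([1.0, 2.0, 3.0, 4.0])
flat = normalize_patch([5.0, 5.0, 5.0])   # constant patch maps to all zeros
print(normalized, flat)
```

The normalized values would then replace the raw pixels as the regression target for that patch, so the loss compares predictions against per-patch standardized values rather than absolute intensities.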
