Improving Vision Transformers to Learn Small-Size Dataset From Scratch

This paper proposes several techniques that help Vision Transformer (ViT) learn small-size datasets from scratch. ViT, which applies the transformer architecture to image classification, has recently outperformed convolutional neural networks. However, the high performance of ViT comes from pre-training on large-size datasets, and this dependence on large datasets stems from ViT's low locality inductive bias.

Moreover, conventional ViT cannot effectively attend to the target class because of redundant attention caused by a rather high constant temperature factor. To improve the locality inductive bias of ViT, this paper proposes a novel tokenization (Shifted Patch Tokenization: SPT) using shifted patches and a position encoding (CoordConv Position Encoding: CPE) using $1 \times 1$ CoordConv. To address the poor attention, we also propose a new self-attention mechanism (Locality Self-Attention: LSA) based on a learnable temperature and self-relation masking.
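To make the ideas above concrete, here is a minimal PyTorch sketch of SPT and LSA (CPE, which adds coordinate channels via a $1 \times 1$ CoordConv, is omitted for brevity). The class names, hyper-parameters, shift size, and zero-padded shifting are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ShiftedPatchTokenization(nn.Module):
    """SPT sketch: concatenate the image with four diagonally shifted copies, then patchify."""

    def __init__(self, img_channels=3, patch_size=16, embed_dim=192):
        super().__init__()
        self.patch_size = patch_size
        # 1 original + 4 shifted copies stacked along the channel axis.
        in_dim = img_channels * 5 * patch_size * patch_size
        self.norm = nn.LayerNorm(in_dim)
        self.proj = nn.Linear(in_dim, embed_dim)

    @staticmethod
    def _shift(x, dx, dy):
        """Zero-padded spatial shift by (dx, dy) pixels (an illustrative choice)."""
        out = torch.zeros_like(x)
        H, W = x.shape[-2:]
        src_y = slice(max(0, -dy), H - max(0, dy))
        dst_y = slice(max(0, dy), H - max(0, -dy))
        src_x = slice(max(0, -dx), W - max(0, dx))
        dst_x = slice(max(0, dx), W - max(0, -dx))
        out[..., dst_y, dst_x] = x[..., src_y, src_x]
        return out

    def forward(self, x):                       # x: (B, C, H, W)
        s = self.patch_size // 2                # assumed shift: half the patch size
        shifted = [self._shift(x, dx, dy)
                   for dx, dy in [(s, s), (s, -s), (-s, s), (-s, -s)]]
        x = torch.cat([x] + shifted, dim=1)     # (B, 5C, H, W)
        # Flatten non-overlapping patches into visual tokens.
        patches = F.unfold(x, kernel_size=self.patch_size, stride=self.patch_size)
        patches = patches.transpose(1, 2)       # (B, N, 5C * P * P)
        return self.proj(self.norm(patches))    # (B, N, embed_dim)


class LocalitySelfAttention(nn.Module):
    """LSA sketch: self-attention with a learnable temperature and diagonal (self-relation) masking."""

    def __init__(self, dim=192, num_heads=3):
        super().__init__()
        self.num_heads = num_heads
        head_dim = dim // num_heads
        # Learnable temperature, initialized to the usual 1/sqrt(d) scaling.
        self.temperature = nn.Parameter(torch.tensor(head_dim ** -0.5))
        self.qkv = nn.Linear(dim, dim * 3)
        self.out = nn.Linear(dim, dim)

    def forward(self, x):                       # x: (B, N, dim)
        B, N, D = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, -1).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]        # each: (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.temperature
        # Mask the diagonal so each token cannot attend to itself,
        # sharpening attention on relations between different tokens.
        mask = torch.eye(N, dtype=torch.bool, device=x.device)
        attn = attn.masked_fill(mask, float('-inf'))
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, D)
        return self.out(out)
```

In a typical ViT backbone, a module like ShiftedPatchTokenization would replace the standard patch-embedding layer, and a module like LocalitySelfAttention would replace the attention block inside each transformer layer.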

SPT, CPE, and LSA are intuitive techniques, but they substantially improve the performance of ViT even on small-size datasets. We qualitatively show that each technique attends to more important regions and contributes to a flatter loss landscape. Moreover, the proposed techniques are generic add-on modules applicable to various ViT backbones.

Our experiments show that, when learning Tiny-ImageNet from scratch, the proposed scheme based on SPT, CPE, and LSA increases the accuracy of ViT backbones by +3.66 on average and by up to +5.7.

In addition, the performance gains of ViT backbones on ImageNet-1K classification, when learning COCO from scratch, and in transfer learning on classification datasets verify the excellent generalization ability of the proposed method.
