ViT (public struct)

ViT(mode::Symbol = :base; imsize::Dims{2} = (256, 256), inchannels = 3,
    patch_size::Dims{2} = (16, 16), pool = :class, nclasses = 1000)

Creates a Vision Transformer (ViT) model (reference).

Arguments

  • mode: the model configuration, one of [:tiny, :small, :base, :large, :huge, :giant, :gigantic]
  • imsize: size of the input image, as a 2-tuple (default (256, 256))
  • inchannels: number of input channels (default 3)
  • patch_size: size of each patch, as a 2-tuple (default (16, 16))
  • pool: pooling type, either :class or :mean
  • nclasses: number of classes in the output
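
How imsize and patch_size interact can be sketched with a little arithmetic. The helper below is hypothetical (npatches is not part of Metalhead) and assumes the image is cut into non-overlapping patches, so each image dimension must be divisible by the matching patch dimension:

```julia
# Hypothetical helper, not part of Metalhead: counts the non-overlapping
# patches obtained from an image of size `imsize` with patches of `patch_size`.
function npatches(imsize::Dims{2}, patch_size::Dims{2})
    # Assumption: each image dimension is a multiple of the patch dimension.
    all(imsize .% patch_size .== 0) ||
        throw(ArgumentError("imsize must be divisible by patch_size"))
    # Patches per dimension, multiplied together.
    return prod(imsize .÷ patch_size)
end

# Default values from the signature: 16 patches per side, 256 in total.
n = npatches((256, 256), (16, 16))
```

With the default arguments, the transformer would therefore operate on 256 patch embeddings per image; a smaller patch_size increases this count quadratically.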

See also Metalhead.vit.
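
A minimal usage sketch, assuming Metalhead is installed and that the constructor accepts the keyword arguments exactly as listed in the signature above. The :tiny configuration and nclasses = 10 are chosen only to keep the example light, and the input layout (width, height, channels, batch) is the usual Flux convention rather than something stated on this page:

```julia
using Metalhead

# Build a small ViT; all keywords are taken from the documented signature.
model = ViT(:tiny; imsize = (256, 256), inchannels = 3,
            patch_size = (16, 16), pool = :class, nclasses = 10)

# A random batch of two 3-channel 256x256 images
# (assumed layout: width x height x channels x batch).
x = rand(Float32, 256, 256, 3, 2)

# Forward pass; one column of class scores per image is expected.
y = model(x)
size(y)
```

Larger configurations such as :base or :huge take the same arguments but need considerably more memory and time to construct.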