private vit — function

vit(imsize::Dims{2} = (256, 256); inchannels = 3, patch_size::Dims{2} = (16, 16),
    embedplanes = 768, depth = 6, nheads = 16, mlp_ratio = 4.0, dropout = 0.1,
    emb_dropout = 0.1, pool = :class, nclasses = 1000)

Creates a Vision Transformer (ViT) model. (reference).

Arguments

imsize: image size
inchannels: number of input channels
patch_size: size of the patches
embedplanes: the number of channels after the patch embedding
depth: number of blocks in the transformer
nheads: number of attention heads in the transformer
mlp_ratio: ratio of the number of hidden channels in the MLP block in the transformer to embedplanes
dropout: dropout rate
emb_dropout: dropout rate for the positional embedding layer
pool: pooling type, either :class or :mean
nclasses: number of classes in the output
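
As a rough illustration of how the default arguments relate to one another, the sketch below works out the patch grid, the length of one flattened patch, and the MLP hidden width. The helper names (patch_grid, patch_length, mlp_hidden) are hypothetical and are not part of the library. The divisibility check and the reading of mlp_ratio as a multiplier on embedplanes are assumptions, not confirmed behavior of vit.

```julia
# Hypothetical helpers for illustration only; none of these are library functions.

# Number of patches along each image dimension.
# Assumption: the image size must be an exact multiple of the patch size.
function patch_grid(imsize::Dims{2}, patch_size::Dims{2})
    all(imsize .% patch_size .== 0) ||
        throw(ArgumentError("imsize must be divisible by patch_size"))
    return imsize .÷ patch_size
end

# Number of values in one flattened patch, before projection to embedplanes.
patch_length(patch_size::Dims{2}, inchannels::Integer) = prod(patch_size) * inchannels

# Assumption: mlp_ratio scales embedplanes to give the MLP hidden width.
mlp_hidden(embedplanes::Integer, mlp_ratio::Real) = floor(Int, mlp_ratio * embedplanes)

# Values for the defaults in the signature above.
grid = patch_grid((256, 256), (16, 16))   # (16, 16)
npatches = prod(grid)                     # 256
plen = patch_length((16, 16), 3)          # 768
hidden = mlp_hidden(768, 4.0)             # 3072
```

With the defaults, each 256×256 image would be split into 256 patches. If a class token is prepended, as in the original ViT design, the transformer would see one additional token on top of these.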
