private vit — function
vit(imsize::Dims{2} = (256, 256); inchannels = 3, patch_size::Dims{2} = (16, 16),
embedplanes = 768, depth = 6, nheads = 16, mlp_ratio = 4.0, dropout = 0.1,
emb_dropout = 0.1, pool = :class, nclasses = 1000)
Creates a Vision Transformer (ViT) model (reference).
Arguments
imsize: image size
inchannels: number of input channels
patch_size: size of the patches
embedplanes: the number of channels after the patch embedding
depth: number of blocks in the transformer
nheads: number of attention heads in the transformer
mlp_ratio: ratio of the number of hidden channels in the MLP block in the transformer to embedplanes
dropout: dropout rate
emb_dropout: dropout rate for the positional embedding layer
pool: pooling type, either :class or :mean
nclasses: number of classes in the output
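
To make the relationship between these arguments concrete, the following is a minimal sketch of the sizes implied by the defaults. It assumes the standard ViT layout (non-overlapping patches, embedplanes split evenly across heads, MLP hidden width equal to mlp_ratio times embedplanes). The helper vit_shapes and its field names are hypothetical and are not part of the documented API.

```julia
# Hypothetical helper, for illustration only: derives the sizes implied by
# the arguments of `vit`, assuming the standard ViT layout.
function vit_shapes(imsize::Dims{2} = (256, 256); patch_size::Dims{2} = (16, 16),
                    embedplanes = 768, nheads = 16, mlp_ratio = 4.0)
    # Each image dimension must split evenly into patches.
    all(imsize .% patch_size .== 0) ||
        throw(ArgumentError("imsize must be divisible by patch_size"))
    # The embedding is assumed to be divided evenly among the attention heads.
    embedplanes % nheads == 0 ||
        throw(ArgumentError("embedplanes must be divisible by nheads"))

    npatches = prod(imsize .÷ patch_size)             # patches per image
    headplanes = embedplanes ÷ nheads                  # channels per attention head
    mlpplanes = floor(Int, mlp_ratio * embedplanes)    # hidden channels in the MLP block
    return (npatches = npatches, headplanes = headplanes, mlpplanes = mlpplanes)
end

shapes = vit_shapes()
println(shapes)  # (npatches = 256, headplanes = 48, mlpplanes = 3072)
```

With the defaults, a 256×256 image cut into 16×16 patches gives 256 patches, each embedded into 768 channels. Whether the model adds an extra class token to that sequence depends on the implementation and is not stated here.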