public ViT — struct
ViT(mode::Symbol = :base; imsize::Dims{2} = (256, 256), inchannels = 3,
    patch_size::Dims{2} = (16, 16), pool = :class, nclasses = 1000)
Creates a Vision Transformer (ViT) model (reference).
Arguments
- mode: the model configuration, one of [:tiny, :small, :base, :large, :huge, :giant, :gigantic]
- imsize: image size
- inchannels: number of input channels
- patch_size: size of the patches
- pool: pooling type, either :class or :mean
- nclasses: number of classes in the output
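
As a rough illustration of how imsize and patch_size interact, the sketch below counts the patches an image is split into. It assumes the patches are non-overlapping and must tile the image exactly. The helper name npatches is hypothetical and is not part of Metalhead.

```julia
# Hypothetical helper (not a Metalhead function): number of non-overlapping
# patches obtained when an image of size `imsize` is cut into `patch_size` tiles.
function npatches(imsize::Dims{2}, patch_size::Dims{2})
    # Assumption: each image dimension must be a multiple of the patch dimension.
    all(imsize .% patch_size .== 0) ||
        throw(ArgumentError("imsize must be divisible by patch_size"))
    return prod(imsize .÷ patch_size)
end

# With the defaults above: (256 ÷ 16) * (256 ÷ 16) = 16 * 16 patches.
n = npatches((256, 256), (16, 16))
```

Under these assumptions, a larger patch_size gives fewer and coarser patches, so the transformer processes a shorter sequence.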
See also Metalhead.vit.
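
A minimal usage sketch, assuming Metalhead is installed and that the model takes input in the width × height × channels × batch layout commonly used with Flux. The choices of :tiny and nclasses = 10 are illustrative only.

```julia
using Metalhead

# Build a small ViT; every keyword is taken from the signature above.
model = ViT(:tiny; imsize = (256, 256), inchannels = 3,
            patch_size = (16, 16), pool = :class, nclasses = 10)

# Assumption: input is laid out as (width, height, channels, batch).
x = rand(Float32, 256, 256, 3, 1)

# Forward pass; the output is expected to have one row per class
# and one column per batch element.
y = model(x)
size(y)
```

The spatial size of x must match imsize and its channel count must match inchannels, since both are fixed when the model is constructed.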