Vision Transformer models
This is the API reference for the Vision Transformer models supported by Metalhead.jl.
The higher-level model constructors
Metalhead.ViT — Type
ViT(config::Symbol = :base; imsize::Dims{2} = (224, 224), inchannels::Integer = 3,
    patch_size::Dims{2} = (16, 16), pool = :class, nclasses::Integer = 1000)
Creates a Vision Transformer (ViT) model. (reference).
Arguments
config: the model configuration, one of [:tiny, :small, :base, :large, :huge, :giant, :gigantic]
imsize: image size
inchannels: number of input channels
patch_size: size of the patches
pool: pooling type, either :class or :mean
nclasses: number of classes in the output
See also Metalhead.vit.
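As a minimal usage sketch, the snippet below builds the smallest listed configuration and runs one forward pass. It assumes the keyword names match the signature documented above (they may differ between Metalhead.jl releases) and that the model takes Flux-style input in width x height x channels x batch order; the 10-class head and the random input are illustrative choices only.

```julia
using Metalhead

# Smallest documented configuration, with an illustrative 10-class head.
# Keyword names follow the signature above and are assumed to match the
# installed Metalhead.jl version.
model = ViT(:tiny; imsize = (224, 224), inchannels = 3, patch_size = (16, 16),
            pool = :class, nclasses = 10)

# One random image; the width x height x channels x batch layout is an
# assumption based on Flux conventions.
x = rand(Float32, 224, 224, 3, 1)

# Forward pass; the result should hold one score per class for each image.
y = model(x)
println(size(y))
```

Larger configurations such as :base or :large are selected the same way, by changing only the first argument.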
The mid-level functions
Metalhead.vit — Function
vit(imsize::Dims{2} = (256, 256); inchannels::Integer = 3, patch_size::Dims{2} = (16, 16),
    embedplanes = 768, depth = 6, nheads = 16, mlp_ratio = 4.0, dropout_prob = 0.1,
    emb_dropout_prob = 0.1, pool = :class, nclasses::Integer = 1000)
Creates a Vision Transformer (ViT) model. (reference).
Arguments
imsize: image size
inchannels: number of input channels
patch_size: size of the patches
embedplanes: the number of channels after the patch embedding
depth: number of blocks in the transformer
nheads: number of attention heads in the transformer
mlp_ratio: ratio of the number of hidden channels in the MLP block to embedplanes
dropout_prob: dropout probability
emb_dropout_prob: dropout probability for the positional embedding layer
pool: pooling type, either :class or :mean
nclasses: number of classes in the output
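To make the relationship between these arguments concrete, the sketch below derives the sizes they imply under the standard ViT layout. The helper `vit_shapes` is hypothetical and not part of Metalhead.jl. It assumes one class token is prepended to the patch sequence, that the image size must be divisible by the patch size, that embedplanes must be divisible by nheads, and that the MLP hidden width is mlp_ratio times embedplanes; none of these is confirmed by the reference above.

```julia
# Hypothetical helper, not part of Metalhead.jl: computes the sizes implied
# by a set of `vit` arguments, assuming the standard ViT layout.
function vit_shapes(imsize::Dims{2} = (256, 256); inchannels::Integer = 3,
                    patch_size::Dims{2} = (16, 16), embedplanes::Integer = 768,
                    nheads::Integer = 16, mlp_ratio::Real = 4.0)
    # Assumed constraints: patches tile the image and heads split the embedding evenly.
    all(iszero, imsize .% patch_size) ||
        throw(ArgumentError("imsize must be divisible by patch_size"))
    embedplanes % nheads == 0 ||
        throw(ArgumentError("embedplanes must be divisible by nheads"))

    npatches = prod(imsize .÷ patch_size)
    return (npatches = npatches,                               # patches per image
            ntokens = npatches + 1,                            # plus one class token (assumed)
            patch_dim = prod(patch_size) * inchannels,         # values in one flattened patch
            head_dim = embedplanes ÷ nheads,                   # channels per attention head
            mlp_hidden = floor(Int, mlp_ratio * embedplanes))  # MLP hidden width (assumed)
end

# Sizes for the default arguments listed above.
shapes = vit_shapes()
println(shapes)
```

With the defaults, a 256 x 256 image cut into 16 x 16 patches gives 256 patches, and each flattened patch holds 16 * 16 * 3 = 768 values before projection to embedplanes channels.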