An introduction to the Layers module in Metalhead.jl
Since v0.8, Metalhead.jl exports a Layers module that contains a number of useful layers and utilities for building neural networks. This guide will walk you through the most commonly used layers and utilities present in the Layers module, and how to use them. It also contains some examples of how these layers are used in Metalhead.jl as well as a comprehensive API reference.
The Layers module is still a work in progress. While we will endeavour to keep the API stable, we cannot guarantee that it will not change in the future. In particular, the API may change significantly between major versions of Metalhead.jl. If you find any of the functions in this module do not work as expected, please open an issue on GitHub.
First, however, you want to make sure that the Layers module is loaded, and that the functions and types are available in your current scope. You can do this by running the following code:
using Metalhead
using Metalhead.Layers

Convolution + Normalisation: the conv_norm layer
One of the most common patterns in modern neural networks is to have a convolutional layer followed by a normalisation layer. Most major deep learning libraries have a way to combine these two layers into a single layer. In Metalhead.jl, this is done with the Metalhead.Layers.conv_norm layer. The function signature for this is given below:
Metalhead.Layers.conv_norm — Function

conv_norm(kernel_size::Dims{2}, inplanes::Integer, outplanes::Integer,
          activation = relu; norm_layer = BatchNorm, revnorm::Bool = false,
          preact::Bool = false, stride::Integer = 1, pad::Integer = 0,
          dilation::Integer = 1, groups::Integer = 1, [bias, weight, init])

Create a convolution + normalisation layer pair with activation.
Arguments
- kernel_size: size of the convolution kernel (tuple)
- inplanes: number of input feature maps
- outplanes: number of output feature maps
- activation: the activation function for the final layer
- norm_layer: the normalisation layer used. Note that using identity as the normalisation layer will result in no normalisation being applied. (This is only compatible with preact and revnorm both set to false.)
- revnorm: set to true to place the normalisation layer before the convolution
- preact: set to true to place the activation function before the normalisation layer (only compatible with revnorm = false)
- bias: bias for the convolution kernel. This is set to false by default if norm_layer is not identity and true otherwise.
- stride: stride of the convolution kernel
- pad: padding of the convolution kernel
- dilation: dilation of the convolution kernel
- groups: groups for the convolution kernel
- weight, init: initialization for the convolution kernel (see Flux.Conv)
To know more about the exact details of each of these parameters, you can refer to the documentation for this function. For now, we will focus on some common use cases. For example, if you want to create a convolutional layer with a kernel size of 3x3, with 32 input channels and 64 output channels, along with a BatchNorm layer, you can do the following:
conv_norm((3, 3), 32, 64)

This returns a Vector with the desired layers. To use it in a model, the user should splat it into a Chain. For example:

Chain(Dense(3, 32), conv_norm((3, 3), 32, 64)..., Dense(64, 10))

The default activation function for conv_norm is relu, and the default normalisation layer is BatchNorm. To use a different activation function, you can just pass it in as a positional argument. For example, to use a sigmoid activation function:

conv_norm((3, 3), 32, 64, sigmoid)

Let's try something else. Suppose you want to use a GroupNorm layer instead of a BatchNorm layer. Note that norm_layer is a keyword argument in the function signature of conv_norm, as shown above. Then we can write:

conv_norm((3, 3), 32, 64; norm_layer = GroupNorm)

What if you want to change certain specific parameters of the norm_layer? For example, what if you want to change the number of groups in the GroupNorm layer?
# defining the norm layer
norm_layer = planes -> GroupNorm(planes, 4)
# passing it to the conv_norm layer
conv_norm((3, 3), 32, 64; norm_layer = norm_layer)

One of Julia's features is that functions are first-class objects and can be passed around as arguments to other functions. Here, we have created an anonymous function that takes in the number of planes as an argument and returns a GroupNorm layer with 4 groups. This is then passed to the norm_layer keyword argument of the conv_norm layer. Using anonymous functions allows us to configure the layers in a very flexible manner, and this is a common pattern in Metalhead.jl.
Let's take a slightly more complicated example. TensorFlow uses different defaults for its normalisation layers. In particular, it uses an epsilon value of 1e-3 for BatchNorm layers. If you want to use the same defaults as TensorFlow, you can do the following:
# note that 1e-3 is not a Float32 and Flux is optimized for Float32, so we use 1.0f-3
conv_norm((3, 3), 32, 64; norm_layer = planes -> BatchNorm(planes, eps = 1.0f-3))

which, incidentally, is very similar to the code Metalhead uses internally for the Metalhead.Layers.basic_conv_bn layer that is used in the Inception family of models.
Metalhead.Layers.basic_conv_bn — Function

basic_conv_bn(kernel_size::Dims{2}, inplanes, outplanes, activation = relu;
              kwargs...)

Returns a convolution + batch normalisation pair with activation, as used by the Inception family of models, with default values matching those used in the official TensorFlow implementation.
Arguments
- kernel_size: size of the convolution kernel (tuple)
- inplanes: number of input feature maps
- outplanes: number of output feature maps
- activation: the activation function for the final layer
- batchnorm: set to true to include batch normalization after each convolution
- kwargs: keyword arguments passed to conv_norm
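Like conv_norm, basic_conv_bn returns a Vector of layers that can be splatted into a Chain. As a minimal sketch (the kernel size and channel counts here are purely illustrative):

using Flux
# a 5x5 Inception-style convolution block going from 32 to 64 channels
inception_block = Chain(basic_conv_bn((5, 5), 32, 64)...)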
Normalisation layers
The Layers module provides some custom normalisation functions that are not present in Flux.
Metalhead.Layers.LayerScale — Function

LayerScale(planes::Integer, λ)

Creates a Flux.Scale layer that performs "LayerScale" (reference).
Arguments
- planes: Size of channel dimension in the input.
- λ: initialisation value for the learnable diagonal matrix.
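For instance, a LayerScale layer for a 64-channel input, with the learnable diagonal initialised to a small value (the channel count and initialisation value here are only examples), can be created as follows:

# a learnable per-channel scaling for a 64-channel input, initialised to 1.0f-4
LayerScale(64, 1.0f-4)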
Metalhead.Layers.LayerNormV2 — Type

LayerNormV2(size..., λ=identity; affine=true, eps=1f-5)

Same as Flux's LayerNorm but eps is added before taking the square root in the denominator. Therefore, LayerNormV2 matches PyTorch's LayerNorm.
Metalhead.Layers.ChannelLayerNorm — Type

ChannelLayerNorm(sz::Integer, λ = identity; eps = 1.0f-6)

A variant of LayerNorm where the input is normalised along the channel dimension. The input is expected to have a channel dimension with size sz. It also applies a learnable shift and rescaling after the normalization.
Note that this is specifically for inputs with 4 dimensions in the format (H, W, C, N) where H, W are the height and width of the input, C is the number of channels, and N is the batch size.
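For example, ChannelLayerNorm can be applied to a feature map in (H, W, C, N) order as follows (the sizes here are arbitrary):

x = rand(Float32, 16, 16, 32, 4)   # a 16x16 feature map with 32 channels and a batch of 4
norm = ChannelLayerNorm(32)        # normalises along the channel dimension
size(norm(x))                      # the output has the same size as the input: (16, 16, 32, 4)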
There is also a utility function, prenorm, which applies a normalisation layer before a given block and simply returns a Chain with the normalisation layer and the block. This is useful for creating Vision Transformer (ViT)-like models.
Metalhead.Layers.prenorm — Function

prenorm(planes, block; norm_layer = LayerNorm)

Utility function to apply a normalization layer before a block.
Arguments
- planes: Size of dimension to normalize.
- block: The block before which the normalization layer is applied.
- norm_layer: The normalization layer to use.
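For example, to normalise the input of a feed-forward block before it is applied, as is common in transformer encoders (a minimal sketch; the block and its sizes are illustrative):

using Flux
# equivalent to Chain(LayerNorm(64), Dense(64, 64, gelu))
prenorm(64, Dense(64, 64, gelu))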
Dropout layers
The Layers module provides two dropout-like layers not present in Flux:
Metalhead.Layers.DropBlock — Type

DropBlock(drop_block_prob = 0.1, block_size = 7, gamma_scale = 1.0, [rng])

The DropBlock layer. While training, it zeroes out contiguous regions of size block_size in the input; during inference, it simply returns the input x unchanged. It can be used in two ways: either with all blocks having the same survival probability or with a linear scaling rule across the blocks.
Arguments
- drop_block_prob: probability of dropping a block. If nothing is passed, it returns identity. Note that some literature uses the term "survival probability" instead, which is equivalent to 1 - drop_block_prob.
- block_size: size of the block to drop
- gamma_scale: multiplicative factor for gamma used. For the calculation of gamma, refer to the paper.
- rng: can be used to pass in a custom RNG instead of the default. Custom RNGs are only supported on the CPU.
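As a rough sketch of how this might be used (the layer sizes and probabilities here are only examples), a DropBlock layer can simply be placed after a convolutional stage in a Chain; it only has an effect at training time:

using Flux
# zero out 7x7 blocks with probability 0.1 during training
Chain(conv_norm((3, 3), 32, 64)..., DropBlock(0.1, 7, 1.0))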
Metalhead.Layers.StochasticDepth — Function

StochasticDepth(p, mode = :row; [rng])

Implements Stochastic Depth. This is a Dropout layer from Flux that drops values with probability p. (reference)
This layer can be used to drop certain blocks in a residual structure and allow them to propagate completely through the skip connection. It can be used in two ways: either with all blocks having the same survival probability or with a linear scaling rule across the blocks. This is performed only at training time. At test time, the StochasticDepth layer is equivalent to identity.
Arguments
- p: probability of Stochastic Depth. Note that some literature uses the term "survival probability" instead, which is equivalent to 1 - p.
- mode: Either :batch or :row. :batch randomly zeroes the entire input, :row zeroes randomly selected rows from the batch. The default is :row.
- rng: can be used to pass in a custom RNG instead of the default. See Flux.Dropout for more information on the behaviour of this argument. Custom RNGs are only supported on the CPU.
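A typical use is inside a residual block, where it randomly drops the residual branch so that only the skip connection is propagated (a minimal sketch with illustrative sizes and probability):

using Flux
# the residual branch is dropped with probability 0.2 during training
SkipConnection(Chain(conv_norm((3, 3), 64, 64; pad = 1)..., StochasticDepth(0.2)), +)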
DropBlock also has a functional variant present in the Layers module:
Metalhead.Layers.dropblock — Function

dropblock([rng], x::AbstractArray{T, 4}, drop_block_prob, block_size,
          gamma_scale, active::Bool = true)

The dropblock function. If active is true, for each input, it zeroes out contiguous regions of size block_size in the input. Otherwise, it simply returns the input x.
Arguments
- rng: can be used to pass in a custom RNG instead of the default. Custom RNGs are only supported on the CPU.
- x: input array
- drop_block_prob: probability of dropping a block. If nothing is passed, it returns identity.
- block_size: size of the block to drop
- gamma_scale: multiplicative factor for gamma used. For the calculations, refer to the paper.
If you are not a package developer, you most likely do not want this function. Use DropBlock instead.
Both DropBlock and StochasticDepth are used along with probability values that vary based on a linear schedule across the structure of the model (see the respective papers for more details). The Layers module provides a utility function to create such a schedule as well:
Metalhead.Layers.linear_scheduler — Function

linear_scheduler(drop_prob = 0.0; start_value = 0.0, depth)
linear_scheduler(drop_prob::Nothing; depth::Integer)

Returns the dropout probabilities for a given depth using the linear scaling rule. Note that this returns evenly spaced values between start_value and drop_prob, not including drop_prob. If drop_prob is nothing, it returns a Vector of length depth with all values equal to nothing.
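For example, to give each of four residual blocks its own stochastic depth probability (a minimal sketch; the block structure, sizes and target probability are purely illustrative):

using Flux
# probabilities increase linearly from 0.0 towards 0.2 across the four blocks
probs = linear_scheduler(0.2; depth = 4)
blocks = [SkipConnection(Chain(conv_norm((3, 3), 64, 64; pad = 1)..., StochasticDepth(p)), +)
          for p in probs]
model = Chain(blocks...)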
The Metalhead.resnet function which powers the ResNet family of models in Metalhead.jl is configured to allow the use of both these layers. For examples, check out the guide for using the ResNet family in Metalhead here. These layers can also be used by the user to construct other custom models.
Pooling layers
The Layers module provides a Metalhead.Layers.AdaptiveMeanMaxPool layer, which is inspired by a similar layer present in timm.
Metalhead.Layers.AdaptiveMeanMaxPool — Function

AdaptiveMeanMaxPool([connection = +], output_size::Tuple = (1, 1))

A type of adaptive pooling layer which uses both mean and max pooling and combines them to produce a single output. Note that this is equivalent to Parallel(connection, AdaptiveMeanPool(output_size), AdaptiveMaxPool(output_size)). When connection is not specified, it defaults to +.
Arguments
- connection: The connection type to use.
- output_size: The size of the output after pooling.
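For example, pooling a convolutional feature map down to 1x1 (the array sizes here are arbitrary):

pool = AdaptiveMeanMaxPool((1, 1))   # sums the outputs of mean and max pooling by default
x = rand(Float32, 7, 7, 512, 2)      # a 7x7 feature map with 512 channels and a batch of 2
size(pool(x))                        # (1, 1, 512, 2)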
Many mid-level model functions in Metalhead.jl have been written to support passing custom pooling layers to them if applicable (either in the model itself or in the classifier head). For example, the Metalhead.resnet function supports this, and examples of this can be found in the guide for using the ResNet family in Metalhead here.
Classifier creation
Metalhead provides a function to create a classifier for neural network models that is quite flexible, and is used by the library extensively to create the classifier "head" for networks. This function is called Metalhead.Layers.create_classifier and is documented below:
Metalhead.Layers.create_classifier — Function

create_classifier(inplanes::Integer, nclasses::Integer, activation = identity;
                  use_conv::Bool = false, pool_layer = AdaptiveMeanPool((1, 1)),
                  dropout_prob = nothing)

Creates a classifier head to be used for models.
Arguments
- inplanes: number of input feature maps
- nclasses: number of output classes
- activation: activation function to use
- use_conv: whether to use a 1x1 convolutional layer instead of a Dense layer.
- pool_layer: pooling layer to use. This is passed in with the layer instantiated with any arguments that are needed, i.e. as AdaptiveMeanPool((1, 1)), for example.
- dropout_prob: dropout probability used in the classifier head. Set to nothing to disable dropout.
create_classifier(inplanes::Integer, hidden_planes::Integer, nclasses::Integer,
                  activations::NTuple{2} = (relu, identity);
                  use_conv::NTuple{2, Bool} = (false, false),
                  pool_layer = AdaptiveMeanPool((1, 1)), dropout_prob = nothing)

Creates a classifier head to be used for models with an extra hidden layer.
Arguments
- inplanes: number of input feature maps
- hidden_planes: number of hidden feature maps
- nclasses: number of output classes
- activations: activation functions to use for the hidden and output layers. This is a tuple of two elements, the first being the activation function for the hidden layer and the second for the output layer.
- use_conv: whether to use a 1x1 convolutional layer instead of a Dense layer. This is a tuple of two booleans, the first for the hidden layer and the second for the output layer.
- pool_layer: pooling layer to use. This is passed in with the layer instantiated with any arguments that are needed, i.e. as AdaptiveMeanPool((1, 1)), for example.
- dropout_prob: dropout probability used in the classifier head. Set to nothing to disable dropout.
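As a rough illustration of the two methods (the feature map sizes, class count, activations and dropout probability here are only examples):

using Flux
# global pooling, flattening and a single Dense layer mapping 512 features to 1000 classes
head = create_classifier(512, 1000)
# the same, but with a hidden layer of 1280 features and dropout, similar in spirit to the MobileNetv3 head
head_hidden = create_classifier(960, 1280, 1000, (hardswish, identity); dropout_prob = 0.2)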
Due to the power of multiple dispatch in Julia, the above function can be called with two different signatures - one of which creates a classifier with no hidden layers, and the other which creates a classifier with a single hidden layer. The function signature for both is documented above, and the user can choose the one that is most convenient for them. Both are used in Metalhead.jl - the latter is used in MobileNetv3, and the former is used almost everywhere else.