# Built-in Layer Types

If you started at the beginning of the guide, then you have already met the basic `Dense` layer, and seen `Chain` for combining layers. These core layers form the foundation of almost all neural networks.

The `Dense` layer exemplifies several features:

- It contains an activation function, which is broadcast over the output. Because this broadcast can be fused with other operations, doing so is more efficient than applying the activation function separately.
- It takes an `init` keyword, which accepts a function acting like `rand`. That is, `init(2,3,4)` should create an array of this size. Flux has many such functions built-in. All make a CPU array, moved later with `gpu` if desired.
- The bias vector is always initialised with `Flux.zeros32`. The keyword `bias=false` will turn this off, i.e. keeping the bias permanently zero.
- It is annotated with `@functor`, which means that `params` will see the contents, and `gpu` will move their arrays to the GPU.

By contrast, `Chain` itself contains no parameters, but connects other layers together. The section on dataflow layers introduces others like this.

## Fully Connected

`Flux.Dense` — Type

```
Dense(in => out, σ=identity; bias=true, init=glorot_uniform)
Dense(W::AbstractMatrix, [bias, σ])
```

Create a traditional fully connected layer, whose forward pass is given by:

`y = σ.(W * x .+ bias)`

The input `x` should be a vector of length `in`, or a batch of vectors represented as an `in × N` matrix, or any array with `size(x,1) == in`. The output `y` will be a vector of length `out`, or a batch with `size(y) == (out, size(x)[2:end]...)`.

Keyword `bias=false` will switch off trainable bias for the layer. The initialisation of the weight matrix is `W = init(out, in)`, calling the function given to keyword `init`, with default `glorot_uniform`. The weight matrix and/or the bias vector (of length `out`) may also be provided explicitly.

**Examples**

```
julia> d = Dense(5 => 2)
Dense(5 => 2) # 12 parameters
julia> d(rand32(5, 64)) |> size
(2, 64)
julia> d(rand32(5, 6, 4, 64)) |> size # treated as three batch dimensions
(2, 6, 4, 64)
julia> d1 = Dense(ones(2, 5), false, tanh) # using provided weight matrix
Dense(5 => 2, tanh; bias=false) # 10 parameters
julia> d1(ones(5))
2-element Vector{Float64}:
0.9999092042625951
0.9999092042625951
julia> Flux.params(d1) # no trainable bias
Params([[1.0 1.0 … 1.0 1.0; 1.0 1.0 … 1.0 1.0]])
```
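The forward pass `y = σ.(W * x .+ bias)` can be reproduced without Flux. The following is a minimal sketch in plain Julia; `dense_forward` is a hypothetical helper, not Flux's implementation, and it assumes that extra batch dimensions are handled by merging them into a single matrix dimension, which is consistent with the output sizes shown above.

```julia
# Hypothetical helper, not part of Flux: y = σ.(W * x .+ bias)
function dense_forward(W::AbstractMatrix, bias, σ, x::AbstractArray)
    size(x, 1) == size(W, 2) || throw(DimensionMismatch("size(x, 1) must equal `in`"))
    x2 = reshape(x, size(x, 1), :)      # merge all batch dimensions into one
    y2 = σ.(W * x2 .+ bias)             # bias and activation in one fused broadcast
    return reshape(y2, size(W, 1), size(x)[2:end]...)
end

W = ones(2, 5)                           # out = 2, in = 5
y_vec = dense_forward(W, false, tanh, ones(5))
y_batch = dense_forward(W, zeros(2), identity, rand(5, 6, 4, 64))
size(y_vec), size(y_batch)               # ((2,), (2, 6, 4, 64))
```

Passing `false` as the bias works here because adding `false` leaves every element unchanged.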

`Flux.Bilinear` — Type

```
Bilinear((in1, in2) => out, σ=identity; bias=true, init=glorot_uniform)
Bilinear(W::AbstractArray, [bias, σ])
```

Creates a layer which is fully connected between two inputs and the output, and otherwise similar to `Dense`. Its output, given vectors `x` & `y`, is another vector `z` with, for all `i ∈ 1:out`:

`z[i] = σ(x' * W[i,:,:] * y + bias[i])`

If `x` and `y` are matrices, then each column of the output `z = B(x, y)` is of this form, with `B` the Bilinear layer.

If the second input `y` is not given, it is taken to be equal to `x`, i.e. `B(x) == B(x, x)`.

The two inputs may also be provided as a tuple, `B((x, y)) == B(x, y)`, which is accepted as the input to a `Chain`.

If the two input sizes are the same, `in1 == in2`, then you may write `Bilinear(in => out, σ)`.

The initialisation works as for the `Dense` layer, with `W = init(out, in1, in2)`. By default the bias vector is `zeros(Float32, out)`; the option `bias=false` will switch off trainable bias. Either of these may be provided explicitly.

**Examples**

```
julia> x, y = randn(Float32, 5, 32), randn(Float32, 5, 32);
julia> B = Flux.Bilinear((5, 5) => 7)
Bilinear(5 => 7) # 182 parameters
julia> B(x) |> size # interactions based on one input
(7, 32)
julia> B(x,y) == B((x,y)) # two inputs, may be given as a tuple
true
julia> sc = SkipConnection(
Chain(Dense(5 => 20, tanh), Dense(20 => 9, tanh)),
Flux.Bilinear((9, 5) => 3, bias=false),
); # used as the recombinator, with skip as the second input
julia> sc(x) |> size
(3, 32)
julia> Flux.Bilinear(rand(4,8,16), false, tanh) # first dim of weight is the output
Bilinear((8, 16) => 4, tanh; bias=false) # 512 parameters
```
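To make the formula `z[i] = σ(x' * W[i,:,:] * y + bias[i])` concrete, here is a minimal sketch in plain Julia for vector inputs. The name `bilinear_forward` is hypothetical and the code is written for clarity, not as Flux's actual (more efficient) implementation.

```julia
# Hypothetical helper: one output element per slice W[i, :, :]
function bilinear_forward(W::AbstractArray{<:Any,3}, bias, σ, x::AbstractVector, y::AbstractVector)
    out = size(W, 1)
    return [σ(x' * W[i, :, :] * y + bias[i]) for i in 1:out]
end

W = rand(4, 8, 16)          # (out, in1, in2): the first dimension is the output
b = zeros(4)
x, y = rand(8), rand(16)
z = bilinear_forward(W, b, tanh, x, y)
length(z)                   # 4
```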

`Flux.Scale` — Type

```
Scale(size::Integer..., σ=identity; bias=true, init=ones32)
Scale(scale::AbstractArray, [bias, σ])
```

Create an element-wise layer, whose forward pass is given by:

`y = σ.(scale .* x .+ bias)`

This uses `.*` instead of the matrix multiplication `*` of `Dense`.

The learnable scale & bias are initialised `init(size...)` and `zeros32(size...)`, with `init=ones32` by default. You may specify the function `init`, turn off trainable bias with `bias=false`, or provide the array(s) explicitly.

Used by `LayerNorm` with `affine=true`.

**Examples**

```
julia> a = Flux.Scale(2)
Scale(2) # 4 parameters
julia> Flux.params(a)
Params([Float32[1.0, 1.0], Float32[0.0, 0.0]])
julia> a([1 2 3])
2×3 Matrix{Float32}:
1.0 2.0 3.0
1.0 2.0 3.0
julia> b = Flux.Scale([1 2 3 4], false, abs2)
Scale(1, 4, abs2; bias=false) # 4 parameters
julia> b([1, 10])
2×4 Matrix{Int64}:
1 4 9 16
100 400 900 1600
julia> Flux.params(b)
Params([[1 2 3 4]])
```

Perhaps `Scale` isn't quite fully connected, but it may be thought of as `Dense(Diagonal(s.weights), s.bias)`, and LinearAlgebra's `Diagonal` is a matrix which just happens to contain many zeros.
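The equivalence between element-wise scaling and multiplication by a diagonal matrix can be checked directly. This is a small sketch using only the `LinearAlgebra` standard library; the variable names are illustrative.

```julia
using LinearAlgebra

s = [1.0, 2.0, 3.0]                # element-wise scale
b = [0.5, 0.5, 0.5]                # bias
x = rand(3, 4)                     # batch of 4 vectors

y_scale = s .* x .+ b              # what a Scale-like layer computes
y_dense = Diagonal(s) * x .+ b     # same result from a Dense-like matrix product
y_scale ≈ y_dense                  # true
```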

Old versions of Flux accepted only `Dense(in, out, act)` and not `Dense(in => out, act)`. This notation makes a `Pair` object. If you get an error like `MethodError: no method matching Dense(::Pair{Int64,Int64})`, this means that you should upgrade to newer Flux versions.

## Convolution Models

These layers are used to build convolutional neural networks (CNNs).

They all expect images in what is called WHCN order: a batch of 32 colour images, each 50 x 50 pixels, will have `size(x) == (50, 50, 3, 32)`. A single grayscale image might instead have `size(x) == (28, 28, 1, 1)`.

Besides 2D data such as images, they also work with 1D data, where for instance a stereo sound recording with 1000 samples might have `size(x) == (1000, 2, 1)`. They will also work with 3D data, `ndims(x) == 5`, where again the last two dimensions are channel and batch.

To understand how strides and padding work, the article by Dumoulin & Visin has great illustrations.

`Flux.Conv` — Type

```
Conv(filter, in => out, σ = identity;
     stride = 1, pad = 0, dilation = 1, groups = 1, [bias, init])
```

Standard convolutional layer. `filter` is a tuple of integers specifying the size of the convolutional kernel; `in` and `out` specify the number of input and output channels.

Image data should be stored in WHCN order (width, height, channels, batch). In other words, a 100×100 RGB image would be a `100×100×3×1` array, and a batch of 50 would be a `100×100×3×50` array. This has `N = 2` spatial dimensions, and needs a kernel size like `(5,5)`, a 2-tuple of integers.

To take convolutions along `N` feature dimensions, this layer expects as input an array with `ndims(x) == N+2`, where `size(x, N+1) == in` is the number of input channels, and `size(x, ndims(x))` is (as always) the number of observations in a batch. Then:

- `filter` should be a tuple of `N` integers.
- Keywords `stride` and `dilation` should each be either a single integer, or a tuple with `N` integers.
- Keyword `pad` specifies the number of elements added to the borders of the data array. It can be
  - a single integer for equal padding all around,
  - a tuple of `N` integers, to apply the same padding at begin/end of each spatial dimension,
  - a tuple of `2*N` integers, for asymmetric padding, or
  - the singleton `SamePad()`, to calculate padding such that `size(output,d) == size(x,d) / stride` (possibly rounded) for each spatial dimension.
- Keyword `groups` is expected to be an `Int`. It specifies the number of groups to divide a convolution into.

Keywords to control initialization of the layer:

- `init` - Function used to generate initial weights. Defaults to `glorot_uniform`.
- `bias` - The initial bias vector is all zero by default. Trainable bias can be disabled entirely by setting this to `false`, or another vector can be provided such as `bias = randn(Float32, out)`.

See also `ConvTranspose`, `DepthwiseConv`, `CrossCor`.

**Examples**

```
julia> xs = rand32(100, 100, 3, 50); # a batch of 50 RGB images
julia> layer = Conv((5,5), 3 => 7, relu; bias = false)
Conv((5, 5), 3 => 7, relu, bias=false) # 525 parameters
julia> layer(xs) |> size
(96, 96, 7, 50)
julia> Conv((5,5), 3 => 7; stride = 2)(xs) |> size
(48, 48, 7, 50)
julia> Conv((5,5), 3 => 7; stride = 2, pad = SamePad())(xs) |> size
(50, 50, 7, 50)
julia> Conv((1,1), 3 => 7; pad = (20,10,0,0))(xs) |> size
(130, 100, 7, 50)
julia> Conv((5,5), 3 => 7; stride = 2, dilation = 4)(xs) |> size
(42, 42, 7, 50)
```
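The output sizes in these examples follow the usual convolution arithmetic. The sketch below is a hypothetical helper in plain Julia (`conv_out_size` is not a Flux or NNlib function); it assumes that `pad` is the total padding along one dimension (begin plus end), and it reproduces the sizes printed above.

```julia
# Hypothetical helper: output length along one spatial dimension.
# `pad` is the total padding (begin + end) for that dimension.
function conv_out_size(n::Int, k::Int; stride::Int=1, pad::Int=0, dilation::Int=1)
    k_eff = dilation * (k - 1) + 1           # kernel extent including dilation
    return fld(n + pad - k_eff, stride) + 1
end

conv_out_size(100, 5)                          # 96
conv_out_size(100, 5; stride=2)                # 48
conv_out_size(100, 5; stride=2, pad=4)         # 50, two elements of padding per side
conv_out_size(100, 1; pad=30)                  # 130, as for pad = (20, 10, 0, 0)
conv_out_size(100, 5; stride=2, dilation=4)    # 42
```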

`Flux.Conv` — Method

`Conv(weight::AbstractArray, [bias, activation; stride, pad, dilation])`

Constructs a convolutional layer with the given weight and bias. Accepts the same keywords and has the same defaults as `Conv(k::NTuple{N,Integer}, ch::Pair{<:Integer,<:Integer}, σ; ...)`.

```
julia> weight = rand(3, 4, 5);
julia> bias = zeros(5);
julia> layer = Conv(weight, bias, sigmoid) # expects 1 spatial dimension
Conv((3,), 4 => 5, σ) # 65 parameters
julia> layer(randn(100, 4, 64)) |> size
(98, 5, 64)
julia> Flux.params(layer) |> length
2
```

`Flux.ConvTranspose` — Type

`ConvTranspose(filter, in => out, σ=identity; stride=1, pad=0, dilation=1, [bias, init])`

Standard convolutional transpose layer. `filter` is a tuple of integers specifying the size of the convolutional kernel, while `in` and `out` specify the number of input and output channels.

Note that `pad=SamePad()` here tries to ensure `size(output,d) == size(x,d) * stride`.

Parameters are controlled by additional keywords, with defaults `init=glorot_uniform` and `bias=true`.

See also `Conv` for a more detailed description of the keywords.

**Examples**

```
julia> xs = rand32(100, 100, 3, 50); # a batch of 50 RGB images
julia> layer = ConvTranspose((5,5), 3 => 7, relu)
ConvTranspose((5, 5), 3 => 7, relu) # 532 parameters
julia> layer(xs) |> size
(104, 104, 7, 50)
julia> ConvTranspose((5,5), 3 => 7, stride=2)(xs) |> size
(203, 203, 7, 50)
julia> ConvTranspose((5,5), 3 => 7, stride=3, pad=SamePad())(xs) |> size
(300, 300, 7, 50)
```

`Flux.ConvTranspose` — Method

`ConvTranspose(weight::AbstractArray, [bias, activation; stride, pad, dilation, groups])`

Constructs a ConvTranspose layer with the given weight and bias. Accepts the same keywords and has the same defaults as `ConvTranspose(k::NTuple{N,Integer}, ch::Pair{<:Integer,<:Integer}, σ; ...)`.

**Examples**

```
julia> weight = rand(3, 4, 5);
julia> bias = zeros(4);
julia> layer = ConvTranspose(weight, bias, sigmoid)
ConvTranspose((3,), 5 => 4, σ) # 64 parameters
julia> layer(randn(100, 5, 64)) |> size # transposed convolution will increase the dimension size (upsampling)
(102, 4, 64)
julia> Flux.params(layer) |> length
2
```

`Flux.CrossCor` — Type

`CrossCor(filter, in => out, σ=identity; stride=1, pad=0, dilation=1, [bias, init])`

Standard cross correlation layer. `filter` is a tuple of integers specifying the size of the convolutional kernel; `in` and `out` specify the number of input and output channels.

Parameters are controlled by additional keywords, with defaults `init=glorot_uniform` and `bias=true`.

See also `Conv` for a more detailed description of the keywords.

**Examples**

```
julia> xs = rand(Float32, 100, 100, 3, 50); # a batch of 50 RGB images
julia> layer = CrossCor((5,5), 3 => 6, relu; bias=false)
CrossCor((5, 5), 3 => 6, relu, bias=false) # 450 parameters
julia> layer(xs) |> size
(96, 96, 6, 50)
julia> CrossCor((5,5), 3 => 7, stride=3, pad=(2,0))(xs) |> size
(34, 32, 7, 50)
```

`Flux.CrossCor` — Method

`CrossCor(weight::AbstractArray, [bias, activation; stride, pad, dilation])`

Constructs a CrossCor layer with the given weight and bias. Accepts the same keywords and has the same defaults as `CrossCor(k::NTuple{N,Integer}, ch::Pair{<:Integer,<:Integer}, σ; ...)`.

**Examples**

```
julia> weight = rand(3, 4, 5);
julia> bias = zeros(5);
julia> layer = CrossCor(weight, bias, relu)
CrossCor((3,), 4 => 5, relu) # 65 parameters
julia> layer(randn(100, 4, 64)) |> size
(98, 5, 64)
```

`Flux.DepthwiseConv` — Function

```
DepthwiseConv(filter, in => out, σ=identity; stride=1, pad=0, dilation=1, [bias, init])
DepthwiseConv(weight::AbstractArray, [bias, activation; stride, pad, dilation])
```

Return a depthwise convolutional layer, that is a `Conv` layer with the number of groups equal to the number of input channels.

See `Conv` for a description of the arguments.

**Examples**

```
julia> xs = rand(Float32, 100, 100, 3, 50); # a batch of 50 RGB images
julia> layer = DepthwiseConv((5,5), 3 => 6, relu; bias=false)
Conv((5, 5), 3 => 6, relu, groups=3, bias=false) # 150 parameters
julia> layer(xs) |> size
(96, 96, 6, 50)
julia> DepthwiseConv((5, 5), 3 => 9, stride=2, pad=2)(xs) |> size
(50, 50, 9, 50)
```

`Flux.SamePad` — Type

`SamePad()`

Passed as an option to convolutional layers (and friends), this causes the padding to be chosen such that the input and output sizes agree (on the first `N` dimensions, the kernel or window) when `stride==1`. When `stride≠1`, the output size equals `ceil(input_size/stride)`.

**Examples**

```
julia> xs = rand32(100, 100, 3, 50); # a batch of images
julia> layer = Conv((2,2), 3 => 7, pad=SamePad())
Conv((2, 2), 3 => 7, pad=(1, 0, 1, 0)) # 91 parameters
julia> layer(xs) |> size # notice how the dimensions stay the same with this padding
(100, 100, 7, 50)
julia> layer2 = Conv((2,2), 3 => 7)
Conv((2, 2), 3 => 7) # 91 parameters
julia> layer2(xs) |> size # the output dimension changes as the padding was not "same"
(99, 99, 7, 50)
julia> layer3 = Conv((5, 5), 3 => 7, stride=2, pad=SamePad())
Conv((5, 5), 3 => 7, pad=2, stride=2) # 532 parameters
julia> layer3(xs) |> size # output size = `ceil(input_size/stride)` = 50
(50, 50, 7, 50)
```
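The padding printed in these examples, `pad=(1, 0, 1, 0)` for a `(2,2)` kernel and `pad=2` for a `(5,5)` kernel, is consistent with splitting a total of `dilation * (k - 1)` elements between the two sides of each dimension. The following plain-Julia sketch illustrates that rule; `same_padding` is a hypothetical helper, and whether Flux computes the padding in exactly this way is an assumption.

```julia
# Hypothetical helper: (begin, end) padding for one dimension
function same_padding(k::Int; dilation::Int=1)
    total = dilation * (k - 1)               # padding needed to keep the size at stride 1
    return (cld(total, 2), fld(total, 2))    # an odd total is split unevenly
end

same_padding(2)    # (1, 0)
same_padding(5)    # (2, 2)
cld(100, 2)        # 50, the `ceil(input_size/stride)` rule for stride = 2
```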

`Flux.flatten` — Function

`flatten(x)`

Same as `MLUtils.flatten`, which should be preferred; this method exists only for backward compatibility.

## MultiHeadAttention

The basic blocks needed to implement Transformer architectures. See also the functional counterparts documented in NNlib's Attention section.

`Flux.MultiHeadAttention` — Type

`MultiHeadAttention(dims; [nheads, bias, init, dropout_prob])`

The multi-head dot-product attention layer used in Transformer architectures [1].

Returns the transformed input sequence and the attention scores.

[1] Vaswani et al. "Attention is all you need." Advances in Neural Information Processing Systems. 2017.

**Arguments**

- `dims`: The embedding dimensions of inputs, intermediate tensors and outputs. In the most general case, it is given as a) `(q_in_dim, k_in_dim, v_in_dim) => (qk_dim, v_dim) => out_dim`. It can also take simpler forms: b) `dims::Int`; c) `in_dim::Int => (qk_dim, v_dim) => out_dim`; d) `in_dim::Int => qkv_dim => out_dim`.
- `nheads`: number of heads. Default `8`.
- `init`: weight initializer for the Dense layers. Default `glorot_uniform`.
- `bias`: whether pointwise QKVO dense transforms use bias. Default `false`.
- `dropout_prob`: dropout probability for the attention scores. Default `0.0`.

**Forward**

`(mha::MultiHeadAttention)(q_in, k_in, v_in, [bias]; [mask])`

The arguments of the forward pass are:

- `q_in`: Input query array of size `(q_in_dim, q_len, batch_size)`.
- `k_in`: Input key array of size `(k_in_dim, kv_len, batch_size)`.
- `v_in`: Input value array of size `(v_in_dim, kv_len, batch_size)`.
- `bias`: Bias array broadcastable to size `(kv_len, q_len, nheads, batch_size)`. It will be added to the attention scores before the softmax. Default `nothing`.
- `mask`: Input array broadcastable to size `(kv_len, q_len, nheads, batch_size)`. The mask is applied to the attention scores just before the softmax. See `NNlib.make_causal_mask` for creating causal masks. Default `nothing`.

Alternative calling signatures are `mha(q_in)`, equivalent to `mha(q_in, q_in, q_in)` (self-attention), and `mha(q_in, k_in)`, equivalent to `mha(q_in, k_in, k_in)` (key and value are the same).

See also `NNlib.dot_product_attention`.

**Examples**

```
mha = MultiHeadAttention(64, nheads = 8)
q = rand(Float32, (64, 10, 32))
k = rand(Float32, (64, 20, 32))
v = rand(Float32, (64, 20, 32))
y, α = mha(q, k, v)
# [y] = [64, 10, 32]
# [α] = [20, 10, 8, 32]
mha = MultiHeadAttention(64 => 1024 => 1024, nheads = 8)
y, α = mha(q) # self-attention
# [y] = [1024, 10, 32]
# [α] = [10, 10, 8, 32]
```
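The shapes of `y` and `α` above come from the underlying dot-product attention. The sketch below shows that core step in plain Julia for a single head and a single sample, using the same feature-first layout. It deliberately leaves out the learned projections, multiple heads, bias, mask and dropout; the names `softmax_cols` and `attention` are hypothetical.

```julia
# Softmax over the first dimension, so that every column sums to one
function softmax_cols(s::AbstractMatrix)
    e = exp.(s .- maximum(s; dims=1))
    return e ./ sum(e; dims=1)
end

# Hypothetical single-head, single-sample attention
function attention(q::AbstractMatrix, k::AbstractMatrix, v::AbstractMatrix)
    d = size(q, 1)
    scores = (k' * q) ./ sqrt(d)     # (kv_len, q_len)
    α = softmax_cols(scores)          # attention weights, one column per query
    return v * α, α                   # (v_dim, q_len), (kv_len, q_len)
end

q, k, v = rand(Float32, 8, 10), rand(Float32, 8, 20), rand(Float32, 8, 20)
y, α = attention(q, k, v)
size(y), size(α)                      # ((8, 10), (20, 10))
```

The score matrix has `kv_len` rows and `q_len` columns, which matches the leading dimensions of `α` in the example above.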

## Pooling

These layers are commonly used after a convolution layer, and reduce the size of its output. They have no trainable parameters.

`Flux.AdaptiveMaxPool` — Type

`AdaptiveMaxPool(out::NTuple)`

Adaptive max pooling layer. Calculates the necessary window size such that its output has `size(y)[1:N] == out`.

Expects as input an array with `ndims(x) == N+2`, i.e. channel and batch dimensions, after the `N` feature dimensions, where `N = length(out)`.

See also `MaxPool`, `AdaptiveMeanPool`.

**Examples**

```
julia> xs = rand(Float32, 100, 100, 3, 50); # batch of 50 RGB images
julia> AdaptiveMaxPool((25, 25))(xs) |> size
(25, 25, 3, 50)
julia> MaxPool((4,4))(xs) ≈ AdaptiveMaxPool((25, 25))(xs)
true
```

`Flux.MaxPool` — Type

`MaxPool(window::NTuple; pad=0, stride=window)`

Max pooling layer, which replaces all pixels in a block of size `window` with a single one, their maximum.

Expects as input an array with `ndims(x) == N+2`, i.e. channel and batch dimensions, after the `N` feature dimensions, where `N = length(window)`.

By default the window size is also the stride in each dimension. The keyword `pad` accepts the same options as for the `Conv` layer, including `SamePad()`.

See also `Conv`, `MeanPool`, `AdaptiveMaxPool`, `GlobalMaxPool`.

**Examples**

```
julia> xs = rand(Float32, 100, 100, 3, 50); # batch of 50 RGB images
julia> m = Chain(Conv((5, 5), 3 => 7, pad=SamePad()), MaxPool((5, 5), pad=SamePad()))
Chain(
Conv((5, 5), 3 => 7, pad=2), # 532 parameters
MaxPool((5, 5), pad=2),
)
julia> m[1](xs) |> size
(100, 100, 7, 50)
julia> m(xs) |> size
(20, 20, 7, 50)
julia> layer = MaxPool((5,), pad=2, stride=(3,)) # one-dimensional window
MaxPool((5,), pad=2, stride=3)
julia> layer(rand(Float32, 100, 7, 50)) |> size
(34, 7, 50)
```
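For the default case, where the stride equals the window and there is no padding, max pooling simply takes the maximum over non-overlapping blocks. The plain-Julia sketch below shows this for a single matrix; `maxpool2d` is a hypothetical helper and ignores the channel and batch dimensions.

```julia
# Hypothetical helper: non-overlapping max pooling (stride == window, no padding)
function maxpool2d(x::AbstractMatrix, window::NTuple{2,Int})
    w, h = window
    nw, nh = size(x) .÷ window
    return [maximum(@view x[(i-1)*w+1:i*w, (j-1)*h+1:j*h]) for i in 1:nw, j in 1:nh]
end

x = reshape(1:16, 4, 4)
maxpool2d(x, (2, 2))                 # [6 14; 8 16]
```

When the input size is an exact multiple of the requested output size, an adaptive pooling layer corresponds to the window `size(x) .÷ out`, as in the `MaxPool((4,4))` versus `AdaptiveMaxPool((25, 25))` comparison above.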

`Flux.GlobalMaxPool` — Type

`GlobalMaxPool()`

Global max pooling layer.

Transforms (w,h,c,b)-shaped input into (1,1,c,b)-shaped output, by performing max pooling on the complete (w,h)-shaped feature maps.

See also `MaxPool`, `GlobalMeanPool`.

```
julia> xs = rand(Float32, 100, 100, 3, 50);
julia> m = Chain(Conv((3,3), 3 => 7), GlobalMaxPool());
julia> m(xs) |> size
(1, 1, 7, 50)
julia> GlobalMaxPool()(rand(3,5,7)) |> size # preserves 2 dimensions
(1, 5, 7)
```

`Flux.AdaptiveMeanPool` — Type

`AdaptiveMeanPool(out::NTuple)`

Adaptive mean pooling layer. Calculates the necessary window size such that its output has `size(y)[1:N] == out`.

Expects as input an array with `ndims(x) == N+2`, i.e. channel and batch dimensions, after the `N` feature dimensions, where `N = length(out)`.

See also `MaxPool`, `AdaptiveMaxPool`.

**Examples**

```
julia> xs = rand(Float32, 100, 100, 3, 50); # batch of 50 RGB images
julia> AdaptiveMeanPool((25, 25))(xs) |> size
(25, 25, 3, 50)
julia> MeanPool((4,4))(xs) ≈ AdaptiveMeanPool((25, 25))(xs)
true
```

`Flux.MeanPool` — Type

`MeanPool(window::NTuple; pad=0, stride=window)`

Mean pooling layer, averaging all pixels in a block of size `window`.

Expects as input an array with `ndims(x) == N+2`, i.e. channel and batch dimensions, after the `N` feature dimensions, where `N = length(window)`.

By default the window size is also the stride in each dimension. The keyword `pad` accepts the same options as for the `Conv` layer, including `SamePad()`.

See also `Conv`, `MaxPool`, `AdaptiveMeanPool`.

**Examples**

```
julia> xs = rand(Float32, 100, 100, 3, 50);
julia> m = Chain(Conv((5,5), 3 => 7), MeanPool((5,5), pad=SamePad()))
Chain(
Conv((5, 5), 3 => 7), # 532 parameters
MeanPool((5, 5), pad=2),
)
julia> m[1](xs) |> size
(96, 96, 7, 50)
julia> m(xs) |> size
(20, 20, 7, 50)
```

`Flux.GlobalMeanPool` — Type

`GlobalMeanPool()`

Global mean pooling layer.

Transforms (w,h,c,b)-shaped input into (1,1,c,b)-shaped output, by performing mean pooling on the complete (w,h)-shaped feature maps.

```
julia> xs = rand(Float32, 100, 100, 3, 50);
julia> m = Chain(Conv((3,3), 3 => 7), GlobalMeanPool());
julia> m(xs) |> size
(1, 1, 7, 50)
```

## Upsampling

The opposite of pooling, these layers increase the size of an array. They have no trainable parameters.

`Flux.Upsample` — Type

```
Upsample(mode = :nearest; [scale, size])
Upsample(scale, mode = :nearest)
```

An upsampling layer. One of two keywords must be given:

If `scale` is a number, this applies to all but the last two dimensions (channel and batch) of the input. It may also be a tuple, to control dimensions individually. Alternatively, keyword `size` accepts a tuple, to directly specify the leading dimensions of the output.

Currently supported upsampling `mode`s and the corresponding NNlib methods are:

- `:nearest` -> `NNlib.upsample_nearest`
- `:bilinear` -> `NNlib.upsample_bilinear`
- `:trilinear` -> `NNlib.upsample_trilinear`

**Examples**

```
julia> m = Upsample(scale = (2, 3))
Upsample(:nearest, scale = (2, 3))
julia> m(ones(2, 2, 1, 1)) |> size
(4, 6, 1, 1)
julia> m = Upsample(:bilinear, size = (4, 5))
Upsample(:bilinear, size = (4, 5))
julia> m(ones(2, 2, 1, 1)) |> size
(4, 5, 1, 1)
```
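For integer scale factors, nearest-neighbour upsampling amounts to repeating every element along the spatial dimensions. The following sketch uses only Base Julia's `repeat` to illustrate the idea; it is not how `NNlib.upsample_nearest` is implemented.

```julia
x = reshape(Float32[1, 2, 3, 4], 2, 2, 1, 1)     # WHCN array: one channel, one image
y = repeat(x; inner=(2, 3, 1, 1))                # scale = (2, 3) on the spatial dimensions
size(y)                                          # (4, 6, 1, 1)
```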

`Flux.PixelShuffle` — Type

`PixelShuffle(r::Int)`

Pixel shuffling layer with upscale factor `r`. Usually used for generating higher resolution images while upscaling them.

See `NNlib.pixel_shuffle`.

**Examples**

```
julia> p = PixelShuffle(2);
julia> xs = [2row + col + channel/10 for row in 1:2, col in 1:2, channel in 1:4, n in 1:1]
2×2×4×1 Array{Float64, 4}:
[:, :, 1, 1] =
3.1 4.1
5.1 6.1
[:, :, 2, 1] =
3.2 4.2
5.2 6.2
[:, :, 3, 1] =
3.3 4.3
5.3 6.3
[:, :, 4, 1] =
3.4 4.4
5.4 6.4
julia> p(xs)
4×4×1×1 Array{Float64, 4}:
[:, :, 1, 1] =
3.1 3.3 4.1 4.3
3.2 3.4 4.2 4.4
5.1 5.3 6.1 6.3
5.2 5.4 6.2 6.4
julia> xs = [3row + col + channel/10 for row in 1:2, col in 1:3, channel in 1:4, n in 1:1]
2×3×4×1 Array{Float64, 4}:
[:, :, 1, 1] =
4.1 5.1 6.1
7.1 8.1 9.1
[:, :, 2, 1] =
4.2 5.2 6.2
7.2 8.2 9.2
[:, :, 3, 1] =
4.3 5.3 6.3
7.3 8.3 9.3
[:, :, 4, 1] =
4.4 5.4 6.4
7.4 8.4 9.4
julia> p(xs)
4×6×1×1 Array{Float64, 4}:
[:, :, 1, 1] =
4.1 4.3 5.1 5.3 6.1 6.3
4.2 4.4 5.2 5.4 6.2 6.4
7.1 7.3 8.1 8.3 9.1 9.3
7.2 7.4 8.2 8.4 9.2 9.4
```
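The rearrangement shown in these outputs can be written as a reshape, a permutation of dimensions, and another reshape. The sketch below is a hypothetical 2D-only helper in plain Julia (`pixel_shuffle_2d` is not an NNlib function); the dimension order was chosen to reproduce the first example above.

```julia
# Hypothetical 2D-only helper: (W, H, C*r^2, N) -> (W*r, H*r, C, N)
function pixel_shuffle_2d(x::AbstractArray{<:Any,4}, r::Int)
    W, H, C, N = size(x)
    C % r^2 == 0 || throw(ArgumentError("channel count must be divisible by r^2"))
    c = C ÷ r^2
    y = reshape(x, W, H, r, r, c, N)           # split channels into two factors of r
    y = permutedims(y, (3, 1, 4, 2, 5, 6))     # interleave them with width and height
    return reshape(y, W * r, H * r, c, N)
end

xs = [2row + col + channel/10 for row in 1:2, col in 1:2, channel in 1:4, n in 1:1]
ys = pixel_shuffle_2d(xs, 2)
size(ys)                                       # (4, 4, 1, 1)
```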

## Embedding Vectors

These layers accept an index, and return a vector (or several indices, and several vectors). The possible embedding vectors are learned parameters.

`Flux.Embedding` — Type

`Embedding(in => out; init=randn32)`

A lookup table that stores embeddings of dimension `out` for a vocabulary of size `in`, as a trainable matrix.

This layer is often used to store word embeddings and retrieve them using indices. The input to the layer can be a vocabulary index in `1:in`, an array of indices, or the corresponding `onehot encoding`.

For indices `x`, the result is of size `(out, size(x)...)`, allowing several batch dimensions. For one-hot `ohx`, the result is of size `(out, size(ohx)[2:end]...)`.

**Examples**

```
julia> emb = Embedding(26 => 4, init=Flux.identity_init(gain=22))
Embedding(26 => 4) # 104 parameters
julia> emb(2) # one column of emb.weight (here not random!)
4-element Vector{Float32}:
0.0
22.0
0.0
0.0
julia> emb([3, 1, 20, 14, 4, 15, 7]) # vocabulary indices, in 1:26
4×7 Matrix{Float32}:
0.0 22.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0
22.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 22.0 0.0 0.0
julia> ans == emb(Flux.onehotbatch("cat&dog", 'a':'z', 'n'))
true
julia> emb(rand(1:26, (10, 1, 12))) |> size # three batch dimensions
(4, 10, 1, 12)
```
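An embedding lookup is column indexing into the weight matrix, and it gives the same result as multiplying by a one-hot matrix. The sketch below checks this in plain Julia with a small illustrative table; the variable names are not taken from Flux.

```julia
W = Float32[1 2 3 4; 5 6 7 8; 9 10 11 12]   # out = 3, vocabulary size in = 4
idx = [2, 4, 2]

lookup = W[:, idx]                            # one column per index

# The same result via an explicit one-hot matrix of size (in, length(idx))
onehot = Float32.((1:4) .== idx')
lookup == W * onehot                          # true
```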

`Flux.EmbeddingBag` — Type

`EmbeddingBag(in => out, reduction=mean; init=Flux.randn32)`

A lookup table that stores embeddings of dimension `out` for a vocabulary of size `in`. Differs from `Embedding` in that, instead of acting on a single vocabulary index, it always acts on a vector of indices which it calls a "bag". Their individual embedding vectors are reduced to one, using `mean` or some other function.

Instead of acting on one "bag", such as `x::Vector{Int}`, the layer can also act on several:

- Acting on a vector of "bags", it produces a matrix whose columns are the reduced vectors. More generally on `x::Array{Vector{Int}}`, its output is of size `(out, size(x)...)`.
- Any higher-rank array of integers is interpreted as a collection of "bags" each along the first dimension. Thus the output is `mapslices(e, x; dims=1)` when `e::EmbeddingBag` and `x::Array{Int,N}`. This method is more efficient, but requires that all "bags" have the same length.
- A vector of "bags" may also be produced by splitting a vector of indices at specified points. For this case the layer takes two inputs, both vectors of integers. See details below.

The "bag" may equivalently be represented as a `OneHotMatrix`. A collection of these, or one higher-rank `OneHotArray`, again produce a stack of embeddings. See details below.

**Examples**

```
julia> vocab_size = 26; # embed into 3 dimensions, with non-random vectors:
julia> eb = EmbeddingBag(vocab_size => 3, init=Flux.identity_init(gain=100))
EmbeddingBag(26 => 3) # 78 parameters
julia> eb([2]) # one bag of 1 item
3-element Vector{Float32}:
0.0
100.0
0.0
julia> eb([3,3,1]) # one bag of 3 items, one mean embedding
3-element Vector{Float32}:
33.333332
0.0
66.666664
julia> eb([[3,1,3], [2,1]]) # two bags
3×2 Matrix{Float32}:
33.3333 50.0
0.0 50.0
66.6667 0.0
julia> eb([1 1 1 1; 1 2 3 4]) # 4 bags each of 2 items, eachcol([1 1 1 1; 1 2 3 4])
3×4 Matrix{Float32}:
100.0 50.0 50.0 50.0
0.0 50.0 0.0 0.0
0.0 0.0 50.0 0.0
julia> eb(rand(1:26, 10, 5, 5)) |> size # 25 bags each of 10 items
(3, 5, 5)
```

Another way to specify "many bags of many items" is to provide a vector `data` (each in `1:in`) and a vector `at` stating where to split that up into "bags". The first bag starts with `data[at[1]]`, the second at `data[at[2]]`, and so on, with no overlaps and nothing left out (thus it requires `at[1]==1`).

```
julia> data = [11, 1, 12, 2, 13, 3, 14];
julia> Flux._splitat(data, [1, 4]) |> println # internal function, makes data[1:3], data[4:end]
[[11, 1, 12], [2, 13, 3, 14]]
julia> eb(data, [1, 4]) # two bags, of 3 and 4 items
3×2 Matrix{Float32}:
33.3333 0.0
0.0 25.0
0.0 25.0
```
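The splitting rule and the mean reduction can be sketched in plain Julia as follows. Both `split_bags` and `bag_embedding` are hypothetical helpers written for illustration (they are not Flux's internal `_splitat`), and the small table `W` stands in for the trainable embedding matrix.

```julia
using Statistics

# Hypothetical helper: split `data` into bags starting at the positions in `at`
function split_bags(data::AbstractVector, at::AbstractVector{<:Integer})
    first(at) == 1 || throw(ArgumentError("at[1] must be 1"))
    stops = [at[2:end] .- 1; length(data)]        # last index of each bag
    return [data[a:b] for (a, b) in zip(at, stops)]
end

# Hypothetical helper: mean of the embedding columns selected by one bag
bag_embedding(W::AbstractMatrix, bag) = vec(mean(W[:, bag]; dims=2))

W = Float32[100 0 0 0; 0 100 0 0; 0 0 100 0]     # 3×4 table, zero beyond column 3
bags = split_bags([3, 1, 3, 2, 1], [1, 4])        # [[3, 1, 3], [2, 1]]
embedded = reduce(hcat, [bag_embedding(W, bag) for bag in bags])
size(embedded)                                    # (3, 2)
```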

Finally, each bag may also be represented as a `OneHotMatrix`.

```
julia> eb(Flux.onehotbatch("bba", 'a':'z')) # same as [2,2,1], one bag of 3 items
3-element Vector{Float32}:
33.333332
66.666664
0.0
julia> eb([Flux.onehotbatch("bba", 'a':'z'), Flux.onehotbatch("cc", 'a':'z')]) # two bags
3×2 Matrix{Float32}:
33.3333 0.0
66.6667 0.0
0.0 100.0
```

## Dataflow Layers, or Containers

The basic `Chain(F, G, H)` applies the layers it contains in sequence, equivalent to `H ∘ G ∘ F`. Flux has some other layers which contain layers, but connect them up in a more complicated way: `SkipConnection` allows ResNet's residual connection.

`Flux.Chain` — Type

```
Chain(layers...)
Chain(name = layer, ...)
```

Collects multiple layers / functions to be called in sequence on a given input. Supports indexing and slicing, `m[2]` or `m[1:end-1]`, and if names are given, `m[:name] == m[1]` etc.

**Examples**

```
julia> m = Chain(x -> x^2, x -> x+1);
julia> m(5) == 26
true
julia> m = Chain(Dense(10 => 5, tanh), Dense(5 => 2));
julia> x = rand32(10, 32);
julia> m(x) == m[2](m[1](x))
true
julia> m2 = Chain(enc = Chain(Flux.flatten, Dense(10 => 5, tanh)),
dec = Dense(5 => 2));
julia> m2(x) == (m2[:dec] ∘ m2[:enc])(x)
true
```

For large models, there is a special type-unstable path which can reduce compilation times. This can be used by supplying a vector of layers `Chain([layer1, layer2, ...])`. This feature is somewhat experimental, beware!

`Flux.activations` — Function

`activations(c::Chain, input)`

Like calling a `Chain`, but saves the result of each layer as an output.

**Examples**

```
julia> using Flux: activations
julia> c = Chain(x -> x + 1, x -> x * 2, x -> x ^ 3);
julia> activations(c, 1)
(2, 4, 64)
```
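The relationship between sequential application, composition with `∘`, and the saved intermediate results can be sketched in plain Julia. The function `collect_activations` below is a hypothetical stand-in written for illustration, not Flux's implementation of `activations`.

```julia
# Hypothetical helper: apply functions in sequence, keeping every intermediate result
function collect_activations(layers, input)
    outputs = Any[]
    x = input
    for layer in layers
        x = layer(x)
        push!(outputs, x)
    end
    return Tuple(outputs)
end

F, G, H = x -> x + 1, x -> x * 2, x -> x ^ 3
acts = collect_activations((F, G, H), 1)      # (2, 4, 64)
last(acts) == (H ∘ G ∘ F)(1)                  # true
```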

`Flux.Maxout` — Type

```
Maxout(layers...)
Maxout(f, n_alts)
```

This contains a number of internal layers, each of which receives the same input. Its output is the elementwise maximum of the internal layers' outputs.

Instead of defining layers individually, you can provide a zero-argument function which constructs them, and the number to construct.

Maxout over linear dense layers satisfies the universal approximation theorem. See Goodfellow, Warde-Farley, Mirza, Courville & Bengio "Maxout Networks" https://arxiv.org/abs/1302.4389.

See also `Parallel` to reduce with other operators.

**Examples**

```
julia> m = Maxout(x -> abs2.(x), x -> x .* 3);
julia> m([-2 -1 0 1 2])
1×5 Matrix{Int64}:
4 1 0 3 6
julia> m3 = Maxout(() -> Dense(5 => 7, tanh), 3)
Maxout(
Dense(5 => 7, tanh), # 42 parameters
Dense(5 => 7, tanh), # 42 parameters
Dense(5 => 7, tanh), # 42 parameters
) # Total: 6 arrays, 126 parameters, 888 bytes.
julia> Flux.outputsize(m3, (5, 11))
(7, 11)
```

`Flux.SkipConnection` — Type

`SkipConnection(layer, connection)`

Create a skip connection which consists of a layer or `Chain` of consecutive layers and a shortcut connection linking the block's input to the output through a user-supplied 2-argument callable. The first argument to the callable will be propagated through the given `layer` while the second is the unchanged, "skipped" input.

The simplest "ResNet"-type connection is just `SkipConnection(layer, +)`. Here is a more complicated example:

```
julia> m = Conv((3,3), 4 => 7, pad=(1,1));
julia> x = ones(Float32, 5, 5, 4, 10);
julia> size(m(x)) == (5, 5, 7, 10)
true
julia> sm = SkipConnection(m, (mx, x) -> cat(mx, x, dims=3));
julia> size(sm(x)) == (5, 5, 11, 10)
true
```

`Flux.Parallel` — Type

```
Parallel(connection, layers...)
Parallel(connection; name = layer, ...)
```

Create a layer which passes an input array to each path in `layers`, before reducing the output with `connection`.

Called with one input `x`, this is equivalent to `connection([l(x) for l in layers]...)`. If called with multiple inputs, one is passed to each layer, thus `Parallel(+, f, g)(x, y) = f(x) + g(y)`.

Like `Chain`, its sub-layers may be given names using the keyword constructor. These can be accessed by indexing: `m[1] == m[:name]` is the first layer.

See also `SkipConnection` which is `Parallel` with one `identity`, and `Maxout` which reduces by broadcasting `max`.

**Examples**

```
julia> model = Chain(Dense(3 => 5),
Parallel(vcat, Dense(5 => 4), Chain(Dense(5 => 7), Dense(7 => 4))),
Dense(8 => 17));
julia> model(rand32(3)) |> size
(17,)
julia> model2 = Parallel(+; α = Dense(10 => 2, tanh), β = Dense(5 => 2))
Parallel(
+,
α = Dense(10 => 2, tanh), # 22 parameters
β = Dense(5 => 2), # 12 parameters
) # Total: 4 arrays, 34 parameters, 392 bytes.
julia> model2(rand32(10), rand32(5)) |> size
(2,)
julia> model2[:α](rand32(10)) |> size
(2,)
julia> model2[:β] == model2[2]
true
```
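The calling rules described above can be sketched with ordinary functions. The helper `parallel` below is hypothetical and only mirrors the documented behaviour; it is not Flux's implementation and handles neither named layers nor tuple inputs.

```julia
# Hypothetical helper mirroring the description above
function parallel(connection, layers...)
    return function (xs...)
        if length(xs) == 1
            connection((l(xs[1]) for l in layers)...)             # same input to every layer
        else
            connection((l(x) for (l, x) in zip(layers, xs))...)   # one input per layer
        end
    end
end

f, g = x -> 2 .* x, x -> x .+ 1
parallel(+, f, g)([1, 2])             # f(x) + g(x) == [4, 7]
parallel(+, f, g)([1, 2], [10, 20])   # f(x) + g(y) == [13, 25]
parallel(+, f, identity)([1, 2])      # like SkipConnection(f, +): [3, 6]
parallel((a, b) -> max.(a, b), x -> abs2.(x), x -> x .* 3)([-2 -1 0 1 2])  # like Maxout: [4 1 0 3 6]
```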

`Flux.PairwiseFusion` — Type

`PairwiseFusion(connection, layers...)`

**Arguments**

- `connection`: A function taking 2 inputs and combining them into a single output
- `layers`: The layers whose outputs are combined

**Inputs**

This layer behaves differently based on input type:

- If input `x` is a tuple of length N (or the input is `xs` with N `x`'s), matching the number of `layers`, then each layer receives a new input `x[i]` combined with the previous output `y[i-1]` using `connection`. Thus `(y1, y2, y3) = PairwiseFusion(connection, layer1, layer2, layer3)((x1, x2, x3))` may be drawn as:

```
x1 → layer1 → y1 ↘
                  connection → layer2 → y2 ↘
              x2 ↗                          connection → layer3 → y3
                                        x3 ↗
```

... or written as:

```
y1 = layer1(x1)
y2 = layer2(connection(y1, x2))
y3 = layer3(connection(y2, x3))
```

- With just one input, each layer receives the same `x` combined with the previous output. Thus `y = PairwiseFusion(connection, layers...)(x)` obeys:

```
y[1] == layers[1](x)
for i in 2:length(layers)
    y[i] == layers[i](connection(y[i-1], x))
end
```

**Returns**

A tuple of length N with the output of each fusion (`(y1, y2, ..., yN)` in the example above).
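The tuple-input recurrence written out above (`y1 = layer1(x1)`, `y2 = layer2(connection(y1, x2))`, and so on) can be traced with a small plain-Julia sketch. The helper `pairwise_apply` and the three scalar functions are hypothetical stand-ins, not part of Flux.

```julia
# Hypothetical helper, not part of Flux: follows the written-out recurrence
# y1 = layer1(x1); y_i = layer_i(connection(y_{i-1}, x_i)).
function pairwise_apply(connection, layers, xs::Tuple)
    length(xs) == length(layers) || throw(ArgumentError("need one input per layer"))
    y = layers[1](xs[1])
    ys = Any[y]
    for i in 2:length(layers)
        y = layers[i](connection(y, xs[i]))
        push!(ys, y)
    end
    return Tuple(ys)
end

double(x) = 2x      # stand-ins for layers
addone(x) = x + 1
square(x) = x^2

ys = pairwise_apply(+, (double, addone, square), (1, 10, 100))
# y1 = double(1)         = 2
# y2 = addone(2 + 10)    = 13
# y3 = square(13 + 100)  = 12769
```

Every intermediate output is kept, so the result has one entry per layer rather than only the last value.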

## Recurrent Models

These layers are much like the core layers above, but can be used to process sequence data (as well as other kinds of structured data).

`Flux.RNN` — Function

`RNN(in => out, σ = tanh)`

The most basic recurrent layer; essentially acts as a `Dense` layer, but with the output fed back into the input each time step.

The arguments `in` and `out` describe the size of the feature vectors passed as input and as output. That is, it accepts a vector of length `in` or a batch of vectors represented as an `in x B` matrix and outputs a vector of length `out` or a batch of vectors of size `out x B`.

This constructor is syntactic sugar for `Recur(RNNCell(a...))`, and so RNNs are stateful. Note that the state shape can change depending on the inputs, and so it is good to `reset!` the model between inference calls if the batch size changes. See the examples below.

**Examples**

```
julia> r = RNN(3 => 5)
Recur(
RNNCell(3 => 5, tanh), # 50 parameters
) # Total: 4 trainable arrays, 50 parameters,
# plus 1 non-trainable, 5 parameters, summarysize 432 bytes.
julia> r(rand(Float32, 3)) |> size
(5,)
julia> Flux.reset!(r);
julia> r(rand(Float32, 3, 10)) |> size # batch size of 10
(5, 10)
```

Failing to call `reset!` when the input batch size changes can lead to unexpected behavior. See the following example:

```
julia> r = RNN(3 => 5)
Recur(
RNNCell(3 => 5, tanh), # 50 parameters
) # Total: 4 trainable arrays, 50 parameters,
# plus 1 non-trainable, 5 parameters, summarysize 432 bytes.
julia> r.state |> size
(5, 1)
julia> r(rand(Float32, 3)) |> size
(5,)
julia> r.state |> size
(5, 1)
julia> r(rand(Float32, 3, 10)) |> size # batch size of 10
(5, 10)
julia> r.state |> size # state shape has changed
(5, 10)
julia> r(rand(Float32, 3)) |> size # erroneously outputs a length 5*10 = 50 vector.
(50,)
```
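The 50 elements in the last call come from broadcasting against a stale state. The sketch below replays the shapes with plain arrays, assuming a single step of the form `σ.(Wi*x .+ Wh*h .+ b)`. The function `rnn_step` is a hypothetical name, and how Flux reshapes the final output is not modelled here.

```julia
# A minimal sketch of one recurrent step, assuming h_new = σ.(Wi*x .+ Wh*h .+ b).
# Plain arrays only; the names mirror the docs but this is not Flux's code.
rnn_step(σ, Wi, Wh, b, h, x) = σ.(Wi * x .+ Wh * h .+ b)

Wi, Wh, b = randn(5, 3), randn(5, 5), randn(5)
h0 = zeros(5, 1)                                   # initial state, one column

h1 = rnn_step(tanh, Wi, Wh, b, h0, rand(3))        # single sample
h2 = rnn_step(tanh, Wi, Wh, b, h0, rand(3, 10))    # batch of 10: state widens
h3 = rnn_step(tanh, Wi, Wh, b, h2, rand(3))        # stale 5×10 state, single sample

size(h1), size(h2), size(h3)   # ((5, 1), (5, 10), (5, 10))
```

Because `h3` still has 10 columns, a single input sample yields `5 * 10 = 50` values, matching the erroneous output in the example above.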

**Note:** `RNNCell`s can be constructed directly by specifying the non-linear function, the `Wi` and `Wh` internal matrices, a bias vector `b`, and a learnable initial state `state0`. The `Wi` and `Wh` matrices do not need to be the same type, but if `Wh` is `dxd`, then `Wi` should be of shape `dxN`.

```
julia> using LinearAlgebra
julia> r = Flux.Recur(Flux.RNNCell(tanh, rand(5, 4), Tridiagonal(rand(5, 5)), rand(5), rand(5, 1)));
julia> r(rand(4, 10)) |> size # batch size of 10
(5, 10)
```

`Flux.LSTM` — Function

`LSTM(in => out)`

Long Short Term Memory recurrent layer. Behaves like an RNN but generally exhibits a longer memory span over sequences.

The arguments `in` and `out` describe the size of the feature vectors passed as input and as output. That is, it accepts a vector of length `in` or a batch of vectors represented as an `in x B` matrix and outputs a vector of length `out` or a batch of vectors of size `out x B`.

This constructor is syntactic sugar for `Recur(LSTMCell(a...))`, and so LSTMs are stateful. Note that the state shape can change depending on the inputs, and so it is good to `reset!` the model between inference calls if the batch size changes. See the examples below.

See this article for a good overview of the internals.

**Examples**

```
julia> l = LSTM(3 => 5)
Recur(
LSTMCell(3 => 5), # 190 parameters
) # Total: 5 trainable arrays, 190 parameters,
# plus 2 non-trainable, 10 parameters, summarysize 1.062 KiB.
julia> l(rand(Float32, 3)) |> size
(5,)
julia> Flux.reset!(l);
julia> l(rand(Float32, 3, 10)) |> size # batch size of 10
(5, 10)
```

Failing to call `reset!` when the input batch size changes can lead to unexpected behavior. See the example in `RNN`.

**Note:** `LSTMCell`s can be constructed directly by specifying the non-linear function, the `Wi` and `Wh` internal matrices, a bias vector `b`, and a learnable initial state `state0`. The `Wi` and `Wh` matrices do not need to be the same type. See the example in `RNN`.

`Flux.GRU` — Function

`GRU(in => out)`

Gated Recurrent Unit layer. Behaves like an RNN but generally exhibits a longer memory span over sequences. This implements the variant proposed in v1 of the referenced paper.

The integer arguments `in` and `out` describe the size of the feature vectors passed as input and as output. That is, it accepts a vector of length `in` or a batch of vectors represented as an `in x B` matrix and outputs a vector of length `out` or a batch of vectors of size `out x B`.

This constructor is syntactic sugar for `Recur(GRUCell(a...))`, and so GRUs are stateful. Note that the state shape can change depending on the inputs, and so it is good to `reset!` the model between inference calls if the batch size changes. See the examples below.

See this article for a good overview of the internals.

**Examples**

```
julia> g = GRU(3 => 5)
Recur(
GRUCell(3 => 5), # 140 parameters
) # Total: 4 trainable arrays, 140 parameters,
# plus 1 non-trainable, 5 parameters, summarysize 792 bytes.
julia> g(rand(Float32, 3)) |> size
(5,)
julia> Flux.reset!(g);
julia> g(rand(Float32, 3, 10)) |> size # batch size of 10
(5, 10)
```

Failing to call `reset!` when the input batch size changes can lead to unexpected behavior. See the example in `RNN`.

**Note:** `GRUCell`s can be constructed directly by specifying the non-linear function, the `Wi` and `Wh` internal matrices, a bias vector `b`, and a learnable initial state `state0`. The `Wi` and `Wh` matrices do not need to be the same type. See the example in `RNN`.

`Flux.GRUv3` — Function

`GRUv3(in => out)`

Gated Recurrent Unit layer. Behaves like an RNN but generally exhibits a longer memory span over sequences. This implements the variant proposed in v3 of the referenced paper.

The arguments `in` and `out` describe the size of the feature vectors passed as input and as output. That is, it accepts a vector of length `in` or a batch of vectors represented as an `in x B` matrix and outputs a vector of length `out` or a batch of vectors of size `out x B`.

This constructor is syntactic sugar for `Recur(GRUv3Cell(a...))`, and so GRUv3s are stateful. Note that the state shape can change depending on the inputs, and so it is good to `reset!` the model between inference calls if the batch size changes. See the examples below.

See this article for a good overview of the internals.

**Examples**

```
julia> g = GRUv3(3 => 5)
Recur(
GRUv3Cell(3 => 5), # 140 parameters
) # Total: 5 trainable arrays, 140 parameters,
# plus 1 non-trainable, 5 parameters, summarysize 848 bytes.
julia> g(rand(Float32, 3)) |> size
(5,)
julia> Flux.reset!(g);
julia> g(rand(Float32, 3, 10)) |> size # batch size of 10
(5, 10)
```

Failing to call `reset!` when the input batch size changes can lead to unexpected behavior. See the example in `RNN`.

**Note:** `GRUv3Cell`s can be constructed directly by specifying the non-linear function, the `Wi`, `Wh`, and `Wh_h` internal matrices, a bias vector `b`, and a learnable initial state `state0`. The `Wi`, `Wh`, and `Wh_h` matrices do not need to be the same type. See the example in `RNN`.

`Flux.Recur` — Type

`Recur(cell)`

`Recur` takes a recurrent cell and makes it stateful, managing the hidden state in the background. `cell` should be a model of the form:

`h, y = cell(h, x...)`

For example, here's a recurrent network that keeps a running total of its inputs:

**Examples**

```
julia> accum(h, x) = (h + x, x)
accum (generic function with 1 method)
julia> rnn = Flux.Recur(accum, 0)
Recur(accum)
julia> rnn(2)
2
julia> rnn(3)
3
julia> rnn.state
5
```

Folding over a 3d Array of dimensions `(features, batch, time)` is also supported:

```
julia> accum(h, x) = (h .+ x, x)
accum (generic function with 1 method)
julia> rnn = Flux.Recur(accum, zeros(Int, 1, 1))
Recur(accum)
julia> rnn([2])
1-element Vector{Int64}:
2
julia> rnn([3])
1-element Vector{Int64}:
3
julia> rnn.state
1×1 Matrix{Int64}:
5
julia> out = rnn(reshape(1:10, 1, 1, :)); # apply to a sequence of (features, batch, time)
julia> out |> size
(1, 1, 10)
julia> vec(out)
10-element Vector{Int64}:
1
2
3
4
5
6
7
8
9
10
julia> rnn.state
1×1 Matrix{Int64}:
60
```
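The `h, y = cell(h, x...)` contract can be illustrated without Flux. The sketch below uses a hypothetical `MiniRecur` wrapper and `minireset!` function, which are simplified stand-ins for `Recur` and `reset!` rather than their actual definitions. Unlike `accum` above, the cell here returns the running total as its output.

```julia
# Hypothetical stand-in for `Recur`: a mutable wrapper that stores the state
# and expects `cell(h, x)` to return `(new_state, output)`.
mutable struct MiniRecur{F,S}
    cell::F
    state::S
    state0::S
end
MiniRecur(cell, state) = MiniRecur(cell, state, state)

function (m::MiniRecur)(x)
    m.state, y = m.cell(m.state, x)   # update the state, return the output
    return y
end

# Restore the initial state (analogous in spirit to `reset!`).
minireset!(m::MiniRecur) = (m.state = m.state0)

running_total(h, x) = (h + x, h + x)   # the output is the new total

acc = MiniRecur(running_total, 0)
outputs = [acc(x) for x in (2, 3, 5)]  # [2, 5, 10]
final_state = acc.state                # 10
minireset!(acc)                        # state is back to 0
```

Keeping `state0` separately from `state` is what makes a reset possible after the state has been overwritten by several calls.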

`Flux.reset!` — Function

`reset!(rnn)`

Reset the hidden state of a recurrent layer back to its original value.

Assuming you have a `Recur` layer `rnn`, this is roughly equivalent to:

`rnn.state = hidden(rnn.cell)`

**Examples**

```
julia> r = Flux.RNNCell(relu, ones(1,1), zeros(1,1), ones(1,1), zeros(1,1)); # users should use the RNN wrapper struct instead
julia> y = Flux.Recur(r, ones(1,1));
julia> y.state
1×1 Matrix{Float64}:
1.0
julia> y(ones(1,1)) # relu(1*1 + 1)
1×1 Matrix{Float64}:
2.0
julia> y.state
1×1 Matrix{Float64}:
2.0
julia> Flux.reset!(y)
1×1 Matrix{Float64}:
0.0
julia> y.state
1×1 Matrix{Float64}:
0.0
```

## Normalisation & Regularisation

These layers don't affect the structure of the network but may improve training times or reduce overfitting. Some of them contain trainable parameters, while others do not.

`Flux.BatchNorm` — Type

```
BatchNorm(channels::Integer, λ=identity;
          initβ=zeros32, initγ=ones32,
          affine=true, track_stats=true, active=nothing,
          eps=1f-5, momentum=0.1f0)
```

Batch Normalization layer. `channels` should be the size of the channel dimension in your data (see below).

Given an array with `N` dimensions, the `N-1`th is called the channel dimension. For a batch of feature vectors this is just the data dimension, for `WHCN` images it's the usual channel dimension.

`BatchNorm` computes the mean and variance for each `D_1×...×D_{N-2}×1×D_N` input slice and normalises the input accordingly.

If `affine=true`, it also applies a shift and a rescale to the input through learnable per-channel bias β and scale γ parameters.

After normalisation, elementwise activation `λ` is applied.

If `track_stats=true`, it accumulates mean and variance statistics during the training phase, which are then used to normalise the input in the test phase.

Use `testmode!` during inference.

**Examples**

```
julia> using Statistics
julia> xs = rand(3, 3, 3, 2); # a batch of 2 images, each having 3 channels
julia> m = BatchNorm(3);
julia> Flux.trainmode!(m);
julia> isapprox(std(m(xs)), 1, atol=0.1) && std(xs) != std(m(xs))
true
```
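For `WHCN` input, the per-channel statistics described above are taken over dimensions 1, 2 and 4. The following sketch covers only the normalisation step, assuming the variance is uncorrected and `eps` is added under the square root. The function name `channel_normalise` is hypothetical, and the affine parameters, activation and running statistics are left out.

```julia
using Statistics

# Sketch of the normalisation step only (no affine parameters, no running
# statistics), assuming WHCN layout so the channel dimension is 3.
function channel_normalise(x::AbstractArray{T,4}; eps = 1f-5) where {T}
    dims = (1, 2, 4)                              # everything except channels
    μ  = mean(x; dims)
    σ² = var(x; dims, mean = μ, corrected = false)
    return (x .- μ) ./ sqrt.(σ² .+ eps)
end

xs = rand(Float32, 3, 3, 3, 2)     # 2 images with 3 channels each
ys = channel_normalise(xs)

per_channel_mean = vec(mean(ys; dims = (1, 2, 4)))                    # each close to 0
per_channel_std  = vec(std(ys; dims = (1, 2, 4), corrected = false))  # each close to 1
```

Reducing over the batch dimension as well is what distinguishes this from per-sample normalisation.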

`Flux.Dropout` — Type

`Dropout(p; [dims, rng, active])`

Layer implementing dropout with the given probability. This is used as a regularisation, i.e. to reduce overfitting.

While training, it sets each input to `0` (with probability `p`) or else scales it by `1 / (1 - p)`, using the `NNlib.dropout` function. While testing, it has no effect.

By default the mode will switch automatically, but it can also be controlled manually via `Flux.testmode!`, or by passing keyword `active=true` for training mode.

By default every input is treated independently. With the `dims` keyword, instead it takes a random choice only along that dimension. For example `Dropout(p; dims = 3)` will randomly zero out entire channels on WHCN input (also called 2D dropout).

Keyword `rng` lets you specify a custom random number generator. (Only supported on the CPU.)

**Examples**

```
julia> m = Chain(Dense(ones(3,2)), Dropout(0.4))
Chain(
Dense(2 => 3), # 9 parameters
Dropout(0.4),
)
julia> m(ones(2, 7)) # test mode, no effect
3×7 Matrix{Float64}:
2.0 2.0 2.0 2.0 2.0 2.0 2.0
2.0 2.0 2.0 2.0 2.0 2.0 2.0
2.0 2.0 2.0 2.0 2.0 2.0 2.0
julia> Flux.trainmode!(m) # equivalent to use within gradient
Chain(
Dense(2 => 3), # 9 parameters
Dropout(0.4, active=true),
)
julia> m(ones(2, 7))
3×7 Matrix{Float64}:
0.0 0.0 3.33333 0.0 0.0 0.0 0.0
3.33333 0.0 3.33333 0.0 3.33333 0.0 3.33333
3.33333 3.33333 0.0 3.33333 0.0 0.0 3.33333
julia> y = m(ones(2, 10_000));
julia> using Statistics
julia> mean(y) # is about 2.0, same as in test mode
1.9989999999999961
julia> mean(iszero, y) # is about 0.4
0.4003
```
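The training-mode rule (zero with probability `p`, otherwise scale by `1 / (1 - p)`) is short enough to write out by hand. The function `dropout_sketch` below is a hypothetical illustration, not the `NNlib.dropout` implementation, and the seed is there only to make the run repeatable.

```julia
using Random, Statistics

# Sketch of training-mode dropout: zero each entry with probability `p`,
# scale the survivors by 1 / (1 - p). Not Flux's or NNlib's implementation.
function dropout_sketch(rng::AbstractRNG, x::AbstractArray, p::Real)
    0 <= p < 1 || throw(ArgumentError("p must be in [0, 1)"))
    keep = rand(rng, size(x)...) .>= p       # true with probability 1 - p
    return x .* keep ./ (1 - p)
end

rng = MersenneTwister(0)            # seeded only to make the run repeatable
x = fill(2.0, 3, 10_000)
y = dropout_sketch(rng, x, 0.4)

mean(y)            # close to 2.0: the expected value is preserved
mean(iszero, y)    # close to 0.4
```

The rescaling is what keeps the mean near 2.0, so the layer's average output matches between training and test mode.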

`Flux.AlphaDropout` — Type

`AlphaDropout(p; [rng, active])`

A dropout layer. Used in Self-Normalizing Neural Networks. The AlphaDropout layer ensures that mean and variance of activations remain the same as before.

Does nothing to the input once the layer is in test mode (see `testmode!`).

**Examples**

```
julia> using Statistics
julia> x = randn32(1000,1);
julia> m = Chain(Dense(1000 => 1000, selu), AlphaDropout(0.2));
julia> Flux.trainmode!(m);
julia> y = m(x);
julia> isapprox(std(x), std(y), atol=0.2)
true
```

`Flux.LayerNorm` — Type

`LayerNorm(size..., λ=identity; affine=true, eps=1f-5)`

A normalisation layer designed to be used with recurrent hidden states. The argument `size` should be an integer or a tuple of integers.

In the forward pass, the layer normalises the mean and standard deviation of the input, then applies the elementwise activation `λ`. The input is normalised along the first `length(size)` dimensions for tuple `size`, and along the first dimension for integer `size`. The leading dimensions of the input are expected to have sizes equal to `size`.

If `affine=true`, it also applies a learnable shift and rescaling using the `Scale` layer.

See also `BatchNorm`, `InstanceNorm`, `GroupNorm`, and `normalise`.

**Examples**

```
julia> using Statistics
julia> xs = rand(3, 3, 3, 2); # a batch of 2 images, each having 3 channels
julia> m = LayerNorm(3);
julia> y = m(xs);
julia> isapprox(std(y, dims=1:3), ones(1, 1, 1, 2), atol=0.1) && std(y, dims=1:3) != std(xs, dims=1:3)
true
```
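The difference between an integer and a tuple `size` is only which leading dimensions the statistics are taken over. A minimal sketch, assuming an uncorrected variance with `eps` added under the square root (the exact placement in Flux may differ). The name `layer_normalise` is hypothetical, and neither the affine step nor the activation is included.

```julia
using Statistics

# Sketch only: normalise over the first `n` dimensions, where `n` is 1 for an
# integer `size` and `length(size)` for a tuple. No affine step, no activation.
function layer_normalise(x::AbstractArray, n::Integer; eps = 1f-5)
    dims = 1:n
    μ  = mean(x; dims)
    σ² = var(x; dims, mean = μ, corrected = false)
    return (x .- μ) ./ sqrt.(σ² .+ eps)
end

xs = rand(Float32, 3, 3, 3, 2)
y1 = layer_normalise(xs, 1)    # like size = 3: statistics per column
y3 = layer_normalise(xs, 3)    # like size = (3, 3, 3): statistics per sample

size(mean(y1; dims = 1))      # (1, 3, 3, 2)
size(mean(y3; dims = 1:3))    # (1, 1, 1, 2)
```

In both cases the batch dimension is never reduced, so each sample is normalised independently of the others.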

`Flux.InstanceNorm` — Type

```
InstanceNorm(channels::Integer, λ=identity;
             initβ=zeros32, initγ=ones32,
             affine=false, track_stats=false,
             eps=1f-5, momentum=0.1f0)
```

Instance Normalization layer. `channels` should be the size of the channel dimension in your data (see below).

Given an array with `N > 2` dimensions, the `N-1`th is called the channel dimension. For `WHCN` images it's the usual channel dimension.

`InstanceNorm` computes the mean and variance for each `D_1×...×D_{N-2}×1×1` input slice and normalises the input accordingly.

If `affine=true`, it also applies a shift and a rescale to the input through learnable per-channel bias `β` and scale `γ` parameters.

If `track_stats=true`, it accumulates mean and variance statistics during the training phase, which are then used to normalise the input in the test phase.

**Warning**: the defaults for `affine` and `track_stats` used to be `true` in previous Flux versions (< v0.12).

**Examples**

```
julia> using Statistics
julia> xs = rand(3, 3, 3, 2); # a batch of 2 images, each having 3 channels
julia> m = InstanceNorm(3);
julia> y = m(xs);
julia> isapprox(std(y, dims=1:2), ones(1, 1, 3, 2), atol=0.2) && std(y, dims=1:2) != std(xs, dims=1:2)
true
```

`Flux.GroupNorm` — Type

```
GroupNorm(channels::Int, G::Int, λ = identity;
          initβ = zeros32,
          initγ = ones32,
          affine = true,
          eps = 1f-5,
          momentum = 0.1f0)
```

Group Normalization layer.

`channels` should be the size of the channel dimension in your data. Given an array with `N > 2` dimensions, the `N-1`th is called the channel dimension. For `WHCN` images it's the usual channel dimension.

`G` is the number of groups along which the statistics are computed. The number of channels must be an integer multiple of the number of groups.

If `affine=true`, it also applies a shift and a rescale to the input through learnable per-channel bias `β` and scale `γ` parameters.

**Examples**

```
julia> using Statistics
julia> xs = rand(3, 3, 4, 2); # a batch of 2 images, each having 4 channels
julia> m = GroupNorm(4, 2);
julia> y = m(xs);
julia> isapprox(std(y[:, :, 1:2, 1]), 1, atol=0.1) && std(xs[:, :, 1:2, 1]) != std(y[:, :, 1:2, 1])
true
julia> isapprox(std(y[:, :, 3:4, 2]), 1, atol=0.1) && std(xs[:, :, 3:4, 2]) != std(y[:, :, 3:4, 2])
true
```
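Grouping can be pictured as a reshape that splits the channel dimension into `(channels ÷ G, G)`, followed by per-group, per-sample normalisation. The sketch below assumes that layout and an uncorrected variance. `group_normalise` is a hypothetical name, and the affine step is omitted.

```julia
using Statistics

# Sketch only: split the C channels into G groups by reshaping
# (W, H, C, N) -> (W, H, C ÷ G, G, N), then normalise each group per sample.
function group_normalise(x::AbstractArray{T,4}, G::Integer; eps = 1f-5) where {T}
    W, H, C, N = size(x)
    C % G == 0 || throw(ArgumentError("channels must be a multiple of G"))
    xg = reshape(x, W, H, C ÷ G, G, N)
    μ  = mean(xg; dims = 1:3)
    σ² = var(xg; dims = 1:3, mean = μ, corrected = false)
    return reshape((xg .- μ) ./ sqrt.(σ² .+ eps), W, H, C, N)
end

xs = rand(Float32, 3, 3, 4, 2)    # 4 channels, to be split into 2 groups
y  = group_normalise(xs, 2)

std(y[:, :, 1:2, 1]; corrected = false)   # close to 1 (group 1, sample 1)
std(y[:, :, 3:4, 2]; corrected = false)   # close to 1 (group 2, sample 2)
```

The divisibility check mirrors the requirement that the number of channels be an integer multiple of the number of groups.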

`Flux.normalise` — Function

`normalise(x; dims=ndims(x), eps=1e-5)`

Normalise `x` to mean 0 and standard deviation 1 across the dimension(s) given by `dims`. By default, `dims` is the last dimension. `eps` is a small term added to the denominator for numerical stability.

**Examples**

```
julia> using Statistics
julia> x = [90, 100, 110, 130, 70];
julia> mean(x), std(x; corrected=false)
(100.0, 20.0)
julia> y = Flux.normalise(x)
5-element Vector{Float64}:
-0.49999975000012503
0.0
0.49999975000012503
1.499999250000375
-1.499999250000375
julia> isapprox(std(y; corrected=false), 1, atol=1e-5)
true
julia> x = rand(10:100, 10, 10);
julia> y = Flux.normalise(x, dims=1);
julia> isapprox(std(y; dims=1, corrected=false), ones(1, 10), atol=1e-5)
true
```
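The printed values in the first example can be reproduced by hand. The sketch below assumes the formula `(x - mean) / (std + eps)` with the uncorrected standard deviation. That assumption is inferred from the numbers shown above, not taken from Flux's source.

```julia
using Statistics

# Reproduces the first example by hand, assuming the formula
# (x - mean) / (std + ϵ) with the uncorrected standard deviation.
x = [90, 100, 110, 130, 70]
μ = mean(x)                       # 100.0
σ = std(x; corrected = false)     # 20.0
ϵ = 1e-5                          # same value as the default `eps`
y = (x .- μ) ./ (σ + ϵ)

y[1]    # -10 / 20.00001, about -0.49999975
```

The small offset from exactly `-0.5` is the effect of `eps` in the denominator, which is also why the example checks the standard deviation with `atol=1e-5` rather than for equality.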

### Test vs. Train

Several normalisation layers behave differently under training and inference (testing). By default, Flux will automatically determine when a layer evaluation is part of training or inference.

This automatic train/test detection works best with Zygote, the default automatic differentiation package. It may not work with other packages such as Tracker, Yota, or ForwardDiff.

The functions `Flux.trainmode!` and `Flux.testmode!` let you manually specify which behaviour you want. When called on a model, they will place all layers within the model into the specified mode.

`Flux.testmode!` — Method

`testmode!(model, [mode]) -> model`

Set a layer, or all layers in a model, to test mode. This disables the effect of `Dropout` and some other regularisation layers.

If you manually set a model into test mode, you need to manually place it back into train mode during training phase, using `trainmode!`.

There is an optional second argument, which takes a symbol `:auto` to reset all layers back to the default automatic mode.

**Example**

```
julia> d = Dropout(0.3)
Dropout(0.3)
julia> testmode!(d) # dropout is now always disabled
Dropout(0.3, active=false)
julia> trainmode!(d) # dropout is now always enabled
Dropout(0.3, active=true)
julia> testmode!(d, :auto) # back to default
Dropout(0.3)
```

`Flux.testmode!` — Method

`testmode!(model, inactive)`

This two-argument method is largely internal. It recurses into the `model` until a method like `testmode!(d::Dropout, inactive)` alters the activity of a layer. Custom layers can support manual `testmode!` / `trainmode!` switching by defining such a method.

Possible values of `inactive` are:

- `true` for testing, i.e. `active=false`
- `false` for training, same as `trainmode!(m)`
- `:auto` or `nothing` for Flux to detect training automatically.

This method may be removed in a future breaking change, to separate the user-facing `testmode!` from the internal recursion.
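The mapping from `inactive` to a layer's stored flag can be sketched in plain Julia. Everything in this block is hypothetical: `NoisyLayer` and `set_inactive!` are illustrative names, and a real custom layer would instead define a `testmode!` method as described above.

```julia
# Hypothetical layer with a three-state `active` flag; `nothing` stands for
# "detect automatically". This is an illustration, not a Flux type.
mutable struct NoisyLayer
    active::Union{Bool,Nothing}
end

# Translate the `inactive` argument described above into the stored flag.
function set_inactive!(layer::NoisyLayer, inactive)
    layer.active = (inactive === :auto || inactive === nothing) ? nothing : !inactive
    return layer
end

layer = NoisyLayer(nothing)
set_inactive!(layer, true)      # test mode: layer.active is now false
```

Returning the layer itself mirrors the `-> model` convention of `testmode!`, so calls can be chained.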

`Flux.trainmode!` — Function

`trainmode!(model) -> model`

Set a layer, or all layers in a model, to training mode. Opposite to `testmode!`, see further details there.

`trainmode!(m, active)`

This two-argument method is deprecated.

Possible values of `active` are:

- `true` for training, or
- `false` for testing, same as `testmode!(m)`
- `:auto` or `nothing` for Flux to detect training automatically.