# How Flux Works: Gradients and Layers

## Taking Gradients

Flux's core feature is taking gradients of Julia code. The `gradient`

function takes another Julia function `f`

and a set of arguments, and returns the gradient with respect to each argument. (It's a good idea to try pasting these examples in the Julia terminal.)

```
julia> using Flux
julia> f(x) = 3x^2 + 2x + 1;
julia> df(x) = gradient(f, x)[1]; # df/dx = 6x + 2
julia> df(2)
14.0
julia> d2f(x) = gradient(df, x)[1]; # d²f/dx² = 6
julia> d2f(2)
6.0
```

When a function has many parameters, we can get gradients of each one at the same time:

```
julia> f(x, y) = sum((x .- y).^2);
julia> gradient(f, [2, 1], [2, 0])
([0.0, 2.0], [-0.0, -2.0])
```

These gradients are based on `x`

and `y`

. Flux works by instead taking gradients based on the weights and biases that make up the parameters of a model.

Machine learning often can have *hundreds* of parameter arrays. Instead of passing them to `gradient`

individually, we can store them together in a structure. The simplest example is a named tuple, created by the following syntax:

```
julia> nt = (a = [2, 1], b = [2, 0], c = tanh);
julia> g(x::NamedTuple) = sum(abs2, x.a .- x.b);
julia> g(nt)
1
julia> dg_nt = gradient(g, nt)[1]
(a = [0.0, 2.0], b = [-0.0, -2.0], c = nothing)
```

Notice that `gradient`

has returned a matching structure. The field `dg_nt.a`

is the gradient for `nt.a`

, and so on. Some fields have no gradient, indicated by `nothing`

.

Rather than define a function like `g`

every time (and think up a name for it), it is often useful to use anonymous functions: this one is `x -> sum(abs2, x.a .- x.b)`

. Anonymous functions can be defined either with `->`

or with `do`

, and such `do`

blocks are often useful if you have a few steps to perform:

```
julia> gradient((x, y) -> sum(abs2, x.a ./ y .- x.b), nt, [1, 2])
((a = [0.0, 0.5], b = [-0.0, -1.0], c = nothing), [-0.0, -0.25])
julia> gradient(nt, [1, 2]) do x, y
z = x.a ./ y
sum(abs2, z .- x.b)
end
((a = [0.0, 0.5], b = [-0.0, -1.0], c = nothing), [-0.0, -0.25])
```

Sometimes you may want to know the value of the function, as well as its gradient. Rather than calling the function a second time, you can call `withgradient`

instead:

```
julia> Flux.withgradient(g, nt)
(val = 1, grad = ((a = [0.0, 2.0], b = [-0.0, -2.0], c = nothing),))
```

Flux used to handle many parameters in a different way, using the `params`

function. This uses a method of `gradient`

which takes a zero-argument function, and returns a dictionary through which the resulting gradients can be looked up:

```
julia> x = [2, 1];
julia> y = [2, 0];
julia> gs = gradient(Flux.params(x, y)) do
f(x, y)
end
Grads(...)
julia> gs[x]
2-element Vector{Float64}:
0.0
2.0
julia> gs[y]
2-element Vector{Float64}:
-0.0
-2.0
```

## Building Simple Models

Consider a simple linear regression, which tries to predict an output array `y`

from an input `x`

.

```
W = rand(2, 5)
b = rand(2)
predict(x) = W*x .+ b
function loss(x, y)
ŷ = predict(x)
sum((y .- ŷ).^2)
end
x, y = rand(5), rand(2) # Dummy data
loss(x, y) # ~ 3
```

To improve the prediction we can take the gradients of the loss with respect to `W`

and `b`

and perform gradient descent.

```
using Flux
gs = gradient(() -> loss(x, y), Flux.params(W, b))
```

Now that we have gradients, we can pull them out and update `W`

to train the model.

```
W̄ = gs[W]
W .-= 0.1 .* W̄
loss(x, y) # ~ 2.5
```

The loss has decreased a little, meaning that our prediction `x`

is closer to the target `y`

. If we have some data we can already try training the model.

All deep learning in Flux, however complex, is a simple generalisation of this example. Of course, models can *look* very different – they might have millions of parameters or complex control flow. Let's see how Flux handles more complex models.

## Building Layers

It's common to create more complex models than the linear regression above. For example, we might want to have two linear layers with a nonlinearity like sigmoid (`σ`

) in between them. In the above style we could write this as:

```
using Flux
W1 = rand(3, 5)
b1 = rand(3)
layer1(x) = W1 * x .+ b1
W2 = rand(2, 3)
b2 = rand(2)
layer2(x) = W2 * x .+ b2
model(x) = layer2(σ.(layer1(x)))
model(rand(5)) # => 2-element vector
```

This works but is fairly unwieldy, with a lot of repetition – especially as we add more layers. One way to factor this out is to create a function that returns linear layers.

```
function linear(in, out)
W = randn(out, in)
b = randn(out)
x -> W * x .+ b
end
linear1 = linear(5, 3) # we can access linear1.W etc
linear2 = linear(3, 2)
model(x) = linear2(σ.(linear1(x)))
model(rand(5)) # => 2-element vector
```

Another (equivalent) way is to create a struct that explicitly represents the affine layer.

```
struct Affine
W
b
end
Affine(in::Integer, out::Integer) =
Affine(randn(out, in), randn(out))
# Overload call, so the object can be used as a function
(m::Affine)(x) = m.W * x .+ m.b
a = Affine(10, 5)
a(rand(10)) # => 5-element vector
```

Congratulations! You just built the `Dense`

layer that comes with Flux. Flux has many interesting layers available, but they're all things you could have built yourself very easily.

(There is one small difference with `Dense`

– for convenience it also takes an activation function, like `Dense(10 => 5, σ)`

.)

## Stacking It Up

It's pretty common to write models that look something like:

```
layer1 = Dense(10 => 5, σ)
# ...
model(x) = layer3(layer2(layer1(x)))
```

For long chains, it might be a bit more intuitive to have a list of layers, like this:

```
using Flux
layers = [Dense(10 => 5, σ), Dense(5 => 2), softmax]
model(x) = foldl((x, m) -> m(x), layers, init = x)
model(rand(10)) # => 2-element vector
```

Handily, this is also provided for in Flux:

```
model2 = Chain(
Dense(10 => 5, σ),
Dense(5 => 2),
softmax)
model2(rand(10)) # => 2-element vector
```

This quickly starts to look like a high-level deep learning library; yet you can see how it falls out of simple abstractions, and we lose none of the power of Julia code.

A nice property of this approach is that because "models" are just functions (possibly with trainable parameters), you can also see this as simple function composition.

```
m = Dense(5 => 2) ∘ Dense(10 => 5, σ)
m(rand(10))
```

Likewise, `Chain`

will happily work with any Julia function.

```
m = Chain(x -> x^2, x -> x+1)
m(5) # => 26
```

## Layer Helpers

There is still one problem with this `Affine`

layer, that Flux does not know to look inside it. This means that `Flux.train!`

won't see its parameters, nor will `gpu`

be able to move them to your GPU. These features are enabled by the `@functor`

macro:

`Flux.@functor Affine`

Finally, most Flux layers make bias optional, and allow you to supply the function used for generating random weights. We can easily add these refinements to the `Affine`

layer as follows, using the helper function `create_bias`

:

```
function Affine((in, out)::Pair; bias=true, init=Flux.randn32)
W = init(out, in)
b = Flux.create_bias(W, bias, out)
Affine(W, b)
end
Affine(3 => 1, bias=false, init=ones) |> gpu
```