How Flux Works: Gradients and Layers
Taking Gradients
Flux's core feature is taking gradients of Julia code. The gradient function takes another Julia function f and a set of arguments, and returns the gradient with respect to each argument. (It's a good idea to try pasting these examples in the Julia terminal.)
julia> using Flux
julia> f(x) = 3x^2 + 2x + 1;
julia> df(x) = gradient(f, x)[1]; # df/dx = 6x + 2
julia> df(2)
14.0
julia> d2f(x) = gradient(df, x)[1]; # d²f/dx² = 6
julia> d2f(2)
6.0

When a function has many parameters, we can get gradients of each one at the same time:
julia> f(x, y) = sum((x .- y).^2);
julia> gradient(f, [2, 1], [2, 0])
([0.0, 2.0], [-0.0, -2.0])

These gradients are with respect to x and y. Flux works by instead taking gradients with respect to the weights and biases that make up the parameters of a model.
Machine learning models can often have hundreds of parameter arrays. Instead of passing them to gradient individually, we can store them together in a structure. The simplest example is a named tuple, created by the following syntax:
julia> nt = (a = [2, 1], b = [2, 0], c = tanh);
julia> g(x::NamedTuple) = sum(abs2, x.a .- x.b);
julia> g(nt)
1
julia> dg_nt = gradient(g, nt)[1]
(a = [0.0, 2.0], b = [-0.0, -2.0], c = nothing)

Notice that gradient has returned a matching structure. The field dg_nt.a is the gradient for nt.a, and so on. Some fields have no gradient, indicated by nothing.
Rather than define a function like g every time (and think up a name for it), it is often useful to use anonymous functions: this one is x -> sum(abs2, x.a .- x.b). Anonymous functions can be defined either with -> or with do, and such do blocks are often useful if you have a few steps to perform:
julia> gradient((x, y) -> sum(abs2, x.a ./ y .- x.b), nt, [1, 2])
((a = [0.0, 0.5], b = [-0.0, -1.0], c = nothing), [-0.0, -0.25])
julia> gradient(nt, [1, 2]) do x, y
z = x.a ./ y
sum(abs2, z .- x.b)
end
((a = [0.0, 0.5], b = [-0.0, -1.0], c = nothing), [-0.0, -0.25])

Sometimes you may want to know the value of the function, as well as its gradient. Rather than calling the function a second time, you can call withgradient instead:
julia> Flux.withgradient(g, nt)
(val = 1, grad = ((a = [0.0, 2.0], b = [-0.0, -2.0], c = nothing),))

Flux used to handle many parameters in a different way, using the params function. This uses a method of gradient which takes a zero-argument function, and returns a dictionary-like object through which the resulting gradients can be looked up:
julia> x = [2, 1];
julia> y = [2, 0];
julia> gs = gradient(Flux.params(x, y)) do
f(x, y)
end
Grads(...)
julia> gs[x]
2-element Vector{Float64}:
0.0
2.0
julia> gs[y]
2-element Vector{Float64}:
-0.0
-2.0

Building Simple Models
Consider a simple linear regression, which tries to predict an output array y from an input x.
W = rand(2, 5)
b = rand(2)
predict(x) = W*x .+ b
function loss(x, y)
ŷ = predict(x)
sum((y .- ŷ).^2)
end
x, y = rand(5), rand(2) # Dummy data
loss(x, y) # ~ 3

To improve the prediction we can take the gradients of the loss with respect to W and b and perform gradient descent.
using Flux
gs = gradient(() -> loss(x, y), Flux.params(W, b))

Now that we have gradients, we can pull them out and update W to train the model.
W̄ = gs[W]
W .-= 0.1 .* W̄
loss(x, y) # ~ 2.5

The loss has decreased a little, meaning that our prediction predict(x) is closer to the target y. If we have some data we can already try training the model.
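For instance, here is a minimal sketch of such a training loop built only from the pieces above; the learning rate of 0.1 and the 100 steps are arbitrary choices for illustration.

# A sketch: repeat the same gradient-descent update for both W and b.
for step in 1:100
    gs = gradient(() -> loss(x, y), Flux.params(W, b))
    W .-= 0.1 .* gs[W]
    b .-= 0.1 .* gs[b]
end
loss(x, y)  # should now be smaller than before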
All deep learning in Flux, however complex, is a simple generalisation of this example. Of course, models can look very different – they might have millions of parameters or complex control flow. Let's see how Flux handles more complex models.
Building Layers
It's common to create more complex models than the linear regression above. For example, we might want to have two linear layers with a nonlinearity like sigmoid (σ) in between them. In the above style we could write this as:
using Flux
W1 = rand(3, 5)
b1 = rand(3)
layer1(x) = W1 * x .+ b1
W2 = rand(2, 3)
b2 = rand(2)
layer2(x) = W2 * x .+ b2
model(x) = layer2(σ.(layer1(x)))
model(rand(5)) # => 2-element vector

This works but is fairly unwieldy, with a lot of repetition – especially as we add more layers. One way to factor this out is to create a function that returns linear layers.
function linear(in, out)
W = randn(out, in)
b = randn(out)
x -> W * x .+ b
end
linear1 = linear(5, 3) # we can access linear1.W etc
linear2 = linear(3, 2)
model(x) = linear2(σ.(linear1(x)))
model(rand(5)) # => 2-element vector

Another (equivalent) way is to create a struct that explicitly represents the affine layer.
struct Affine
W
b
end
Affine(in::Integer, out::Integer) =
Affine(randn(out, in), randn(out))
# Overload call, so the object can be used as a function
(m::Affine)(x) = m.W * x .+ m.b
a = Affine(10, 5)
a(rand(10)) # => 5-element vector

Congratulations! You just built the Dense layer that comes with Flux. Flux has many interesting layers available, but they're all things you could have built yourself very easily.
(There is one small difference with Dense – for convenience it also takes an activation function, like Dense(10 => 5, σ).)
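If you wanted that too, one way is to store the activation function in the struct as well. This is only a sketch (AffineAct is an illustrative name, not how Dense itself is written):

# A sketch of an affine layer that also applies an activation function.
struct AffineAct
  W
  b
  act
end

AffineAct(in::Integer, out::Integer, act = identity) =
  AffineAct(randn(out, in), randn(out), act)

(m::AffineAct)(x) = m.act.(m.W * x .+ m.b)

a = AffineAct(10, 5, σ)
a(rand(10)) # => 5-element vector, each entry passed through σ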
Stacking It Up
It's pretty common to write models that look something like:
layer1 = Dense(10 => 5, σ)
# ...
model(x) = layer3(layer2(layer1(x)))

For long chains, it might be a bit more intuitive to have a list of layers, like this:
using Flux
layers = [Dense(10 => 5, σ), Dense(5 => 2), softmax]
model(x) = foldl((x, m) -> m(x), layers, init = x)
model(rand(10)) # => 2-element vector

Handily, this is also provided for in Flux:
model2 = Chain(
Dense(10 => 5, σ),
Dense(5 => 2),
softmax)
model2(rand(10)) # => 2-element vector

This quickly starts to look like a high-level deep learning library; yet you can see how it falls out of simple abstractions, and we lose none of the power of Julia code.
A nice property of this approach is that because "models" are just functions (possibly with trainable parameters), you can also see this as simple function composition.
m = Dense(5 => 2) ∘ Dense(10 => 5, σ)
m(rand(10))

Likewise, Chain will happily work with any Julia function.
m = Chain(x -> x^2, x -> x+1)
m(5) # => 26

Layer Helpers
There is still one problem with this Affine layer: Flux does not know to look inside it. This means that Flux.train! won't see its parameters, nor will gpu be able to move them to your GPU. These features are enabled by the @layer macro:

Flux.@layer Affine
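As a quick check (a sketch, assuming the Affine struct defined above), the parameters should now be visible to these functions:

a = Affine(10, 5)
Flux.params(a)   # Params containing a.W and a.b
gpu(a)           # moves both arrays to the GPU, if one is available (otherwise returns a unchanged)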
Finally, most Flux layers make bias optional, and allow you to supply the function used for generating random weights. We can easily add these refinements to the Affine layer as follows, using the helper function create_bias:

function Affine((in, out)::Pair; bias=true, init=Flux.randn32)
W = init(out, in)
b = Flux.create_bias(W, bias, out)
Affine(W, b)
end
Affine(3 => 1, bias=false, init=ones) |> gpu

Flux.@layer — Macro

@layer Dense
@layer :expand Chain
@layer BatchNorm trainable=(β,γ)

This macro replaces most uses of @functor. Its basic purpose is the same: when you define a new layer, this tells Flux to explore inside it to see the parameters it trains, and also to move them to the GPU, change precision, etc.
Like @functor, this assumes your struct has the default constructor, to enable re-building. If you define an inner constructor (i.e. a function within the struct block), things may break.
The keyword trainable allows you to limit this exploration, instead of visiting all fieldnames(T). Note that it is never necessary to tell Flux to ignore non-array objects such as functions or sizes.
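For instance, a sketch of restricting which fields are trained (PartlyTrained is an illustrative struct, not one from Flux):

# Only w and b are trained; the field `frozen` is carried along but never updated.
struct PartlyTrained
  w
  b
  frozen
end

(m::PartlyTrained)(x) = m.w .* x .+ m.b .+ m.frozen

Flux.@layer PartlyTrained trainable=(w,b)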
The macro also handles overloads of show for pretty printing.
- By default, it adds methods to 3-arg Base.show to treat your layer much like Dense or Conv.
- If your layer is a container, more like Chain or Parallel, then :expand makes show unfold its contents.
- To disable all show overloads, there is an :ignore option too.
(You probably still want to define 2-arg show(io::IO, x::Layer); the macro does not touch this.)
Note that re-running the macro with different options may not remove all methods; you will need to restart Julia.
Example
julia> struct Trio; a; b; c end
julia> tri = Trio(Dense([1.1 2.2], [0.0], tanh), Dense(hcat(3.3), false), Dropout(0.4))
Trio(Dense(2 => 1, tanh), Dense(1 => 1; bias=false), Dropout(0.4))
julia> Flux.destructure(tri) # parameters are not yet visible to Flux
(Bool[], Restructure(Trio, ..., 0))
julia> Flux.@layer :expand Trio
julia> Flux.destructure(tri) # now gpu, params, train!, etc will see inside too
([1.1, 2.2, 0.0, 3.3], Restructure(Trio, ..., 4))
julia> tri # and layer is printed like Chain
Trio(
Dense(2 => 1, tanh), # 3 parameters
Dense(1 => 1; bias=false), # 1 parameters
Dropout(0.4),
) # Total: 3 arrays, 4 parameters, 224 bytes.

Flux.create_bias — Function

create_bias(weights, bias, size...)

Return a bias parameter for a layer, based on the value given to the constructor's keyword bias=bias.
- bias == true creates a trainable array of the given size, of the same type as weights, initialised to zero.
- bias == false returns false, which is understood by AD to be non-differentiable.
- bias::AbstractArray uses the array provided, if it has the correct size. It will also convert the eltype to match that of weights.
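A small sketch of the three cases (the weight array W here is just an illustrative stand-in for a layer's weights):

W = randn(Float32, 3, 4)           # weights of a hypothetical layer

Flux.create_bias(W, true, 3)       # trainable 3-element Float32 vector of zeros
Flux.create_bias(W, false, 3)      # false, i.e. no bias at all
Flux.create_bias(W, ones(3), 3)    # uses the given array, with eltype matched to W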