# Optimisers.jl
## An optimisation rule

A new optimiser must overload two functions, `apply!` and `init`. These act on one array of parameters:
```julia
# Define a container to hold any optimiser specific parameters (if any):
struct DecayDescent <: Optimisers.AbstractRule
    eta::Float64
end

# Define an `apply!` rule which encodes how the gradients will be used to
# update the parameters:
function Optimisers.apply!(o::DecayDescent, state, x, x̄)
    T = eltype(x)
    newx̄ = T(o.eta / √state) .* x̄
    nextstate = state + 1
    return nextstate, newx̄
end

# Define the function which sets up the initial state (if any):
Optimisers.init(o::DecayDescent, x::AbstractArray) = 1
```
The parameters will be immediately updated to `x .- newx̄`, while `nextstate` is carried to the next iteration.

Notice that the state is handled separately from the optimiser itself. This is a key design principle: it lets users manage their own state explicitly, and it also makes the state easy to store.
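Such a rule can be tried directly on one array, without any model around it. A minimal sketch, with a made-up gradient:

```julia
using Optimisers

x = [1.0, 2.0]
x̄ = [0.1, 0.1]  # a made-up gradient, for illustration

st = Optimisers.setup(DecayDescent(0.1), x)  # a Leaf holding the state from init, here 1
st, x = Optimisers.update(st, x, x̄)  # x is now x .- (0.1 / √1) .* x̄
```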
## Usage with Flux.jl

To apply such an optimiser to a whole model, `setup` builds a tree containing any initial state for every trainable array. Then at each step, `update` uses this and the gradient to adjust the model:
```julia
using Flux, Metalhead, Zygote, Optimisers

model = Metalhead.ResNet(18) |> gpu  # define a model to train
image = rand(Float32, 224, 224, 3, 1) |> gpu;  # dummy data
@show sum(model(image));  # dummy loss function

rule = Optimisers.Adam()  # use the Adam optimiser with its default settings
state_tree = Optimisers.setup(rule, model);  # initialise this optimiser's momentum etc.

∇model, _ = gradient(model, image) do m, x  # calculate the gradients
    sum(m(x))
end;

state_tree, model = Optimisers.update(state_tree, model, ∇model);
@show sum(model(image));  # reduced
```
Notice that a completely new instance of the model is returned. Internally, this is handled by Functors.jl, where we do a walk over the tree formed by the model and update the parameters using the gradients.
There is also `Optimisers.update!` which similarly returns a new model, but is free to mutate arrays within the old one for efficiency. (The method of `apply!` above is likewise free to mutate arrays within its state; they are defensively copied when this rule is used with `update`.) For `Adam()`, there are two momenta per parameter, thus `state_tree` is about twice the size of `model`:
```julia
Base.summarysize(model) / 1024^2       # about 45MB
Base.summarysize(state_tree) / 1024^2  # about 90MB
```
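Both update functions are called in the same way. A sketch, continuing the example above:

```julia
# `update` leaves the old model and state untouched, returning new objects:
state_tree, model = Optimisers.update(state_tree, model, ∇model);

# `update!` may instead overwrite their arrays in place, saving memory:
state_tree, model = Optimisers.update!(state_tree, model, ∇model);
```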
Optimisers.jl does not depend on any one automatic differentiation package, but for now the most likely source of gradients is Zygote.jl. Note that `update` always wants the gradient from Zygote's "explicit" mode, as shown above. This `∇model` is another tree structure, rather than the dictionary-like object from Zygote's "implicit" mode `gradient(() -> loss(...), Flux.params(model))` – see Zygote's documentation for more about this difference.
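For example, here is the explicit-mode gradient of one small layer (a minimal sketch): it is a `NamedTuple` whose fields mirror the layer's own:

```julia
layer = Dense(2 => 3);
∇layer, = gradient(m -> sum(m(ones(Float32, 2))), layer);

∇layer.weight  # an array, the same size as layer.weight
∇layer.σ === nothing  # true: non-differentiable fields appear as nothing
```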
## Usage with Lux.jl

The main design difference of Lux from Flux is that the tree of parameters is separate from the layer structure. It is these parameters which `setup` and `update` need to know about.
Lux describes this separation of parameter storage from model description as "explicit" parameters. Beware that it has nothing to do with Zygote's notion of "explicit" gradients. (If the same model is written in Flux and Lux, `∇model` above and `∇params` below will be nearly identical trees of nested `NamedTuple`s.)
```julia
using Lux, Boltz, Zygote, Optimisers

lux_model, params, lux_state = Boltz.resnet(:resnet18) |> gpu;  # define and initialise model
images = rand(Float32, 224, 224, 3, 4) |> gpu;  # batch of dummy data
y, lux_state = Lux.apply(lux_model, images, params, lux_state);  # run the model
@show sum(y);  # initial dummy loss

rule = Optimisers.Adam()
opt_state = Optimisers.setup(rule, params);  # optimiser state based on model parameters

(loss, lux_state), back = Zygote.pullback(params, images) do p, x
    y, st = Lux.apply(lux_model, x, p, lux_state)
    sum(y), st  # return both the loss, and the updated lux_state
end;
∇params, _ = back((one.(loss), nothing));  # gradient of only the loss, with respect to parameter tree
loss == sum(y)  # not yet changed

opt_state, params = Optimisers.update!(opt_state, params, ∇params);

y, lux_state = Lux.apply(lux_model, images, params, lux_state);
@show sum(y);  # now reduced
```
Besides the parameters stored in `params` and gradually optimised, any other model state is stored in `lux_state`, and updated by `Lux.apply`. (In this example, `BatchNorm` has state.) This is completely unrelated to Optimisers.jl's state, although designed in a similar spirit.
```julia
Base.summarysize(lux_model) / 1024  # just 2KB
Base.summarysize(params) / 1024^2   # about 45MB, same as Flux model
Base.summarysize(lux_state) / 1024  # 40KB
Base.summarysize(opt_state) / 1024^2  # about 90MB, with Adam
```
If you are certain there is no model state, then the gradient calculation can be simplified to use `Zygote.gradient` instead of `Zygote.pullback`:
```julia
∇params, _ = gradient(params, images) do p, x
    y, _ = Lux.apply(lux_model, x, p, lux_state)  # discards new lux_state
    sum(y)
end;
```
## Non-trainable Parameters

Optimisers.jl uses Functors.jl to walk the `struct`s making up the model, for which they must be annotated `@functor Type`. By default optimisation will alter all `isnumeric` arrays.

If some arrays of a particular layer should not be treated this way, you can define a method for `trainable`:
```julia
struct Layer{T}
    alpha::T
    beta::T
    length::Int
end
Layer(n::Int) = Layer(randn(n), zeros(n), n)

Functors.@functor Layer

# Both array fields will be, for example, moved to the GPU:
Functors.children(Layer(3))  # (alpha = [...], beta = [...], length = 3)

Optimisers.trainable(x::Layer) = (; alpha = x.alpha)  # must be a subset of children

# Only the first field will be optimised:
st = Optimisers.setup(DecayDescent(0.1), Layer(3))
```
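As a check, only the `alpha` field receives any optimiser state; the other children are left as empty placeholders. A small sketch, continuing the example:

```julia
st.alpha isa Optimisers.Leaf  # true, this array will be optimised
st.beta  # (), no state is stored, and no update will be made
```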
## Frozen Parameters

To temporarily prevent training from affecting some parameters, use `freeze!` and `thaw!`. They work by mutating all `Leaf`s of the state tree, or part of it.
```julia
using Flux, Optimisers

x = randn(Float32, 28, 28, 1, 1);
net = @autosize (size(x)...,) Chain(
    Conv((3, 3), 1 => 3, stride=2, bias=false), Flux.flatten, Dense(_ => 2, relu),
)
opt = Optimisers.setup(Optimisers.Momentum(), net);

net.layers[3] isa Dense  # now freeze this layer's parameters:
Optimisers.freeze!(opt.layers[3])
opt.layers[3].bias  # confirm: Leaf(Momentum(...), [0.0, 0.0], frozen = true)

Optimisers.update!(opt, net, gradient(m -> sum(m(x)), net)...);

net.layers[3].bias  # still zero, and its momentum is too:

Optimisers.thaw!(opt)
opt.layers[3].bias  # Leaf(Momentum(...), [0.0, 0.0])
```
## Adjusting Hyperparameters

To change the learning rate during training, use `adjust!`. This works much like `freeze!` by mutating the state tree, or part of it, without discarding the momenta. For the Flux model from just above:
```julia
Optimisers.adjust!(opt, 0.03)  # change η for the whole model...
Optimisers.adjust!(opt.layers[3], 0.04)  # ... or just for one layer.
```
To change other fields of the optimisation rule, `adjust!` accepts keyword arguments:
```julia
Momentum |> fieldnames  # (:eta, :rho)
Optimisers.adjust!(opt, rho = 0.95)  # change ρ for the whole model.
```
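The effect can be confirmed by inspecting one `Leaf` (a sketch, with the printed form abbreviated; layer 3 keeps the η set for it above):

```julia
opt.layers[3].bias  # Leaf(Momentum(0.04, 0.95), ...)
```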
## Tied Parameters

If the same array appears twice (or more) in the model, Functors.jl should recognise this. Within Optimisers.jl, `setup` will initialise once, and use the same `Leaf` for both parameters. Then `update` will accumulate the gradient from both, and the updated model returned will have the tie maintained.
```julia
using Flux, Optimisers

enc = Chain(Dense(40 => 20, tanh), Dense(20 => 10));
dec = Chain(Dense(enc[2].weight', true, tanh), Dense(enc[1].weight', true, tanh));  # tied to enc, in reverse order
model = Chain(; enc, dec)

st = Optimisers.setup(Optimisers.Adam(), model);

st.layers.enc.layers[1].weight === st.layers.dec.layers[2].weight.parent  # true
```
This identification relies on `===`, and will work for ordinary `Array`s and `CuArray`s. It will not at present work for `reshape`d arrays, nor for immutable arrays such as those from StaticArrays.jl.
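The tie also survives an update step: the gradients from both uses are accumulated, and the returned model still shares one array. A sketch, continuing the example with a made-up loss:

```julia
x = randn(Float32, 40, 1);  # dummy input for this 40 → 10 → 40 autoencoder
g = gradient(m -> sum(abs2, m(x) .- x), model)[1];

st, model = Optimisers.update(st, model, g);
model.layers.enc.layers[1].weight === model.layers.dec.layers[2].weight.parent  # still true
```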
## Obtaining a flat parameter vector

Instead of a nested tree-like structure, sometimes it is convenient to have all the parameters as one simple vector. Optimisers.jl contains a function `destructure` which creates this vector, and also creates a way to re-build the original structure with new parameters. Both flattening and re-building may be used within `gradient` calls.

An example with Flux's `model`:
```julia
using ForwardDiff  # an example of a package which only likes one array

model = Chain(  # much smaller model example, as ForwardDiff is a slow algorithm here
    Conv((3, 3), 3 => 5, pad=1, bias=false),
    BatchNorm(5, relu),
    Conv((3, 3), 5 => 3, stride=16),
)
image = rand(Float32, 224, 224, 3, 1);
@show sum(model(image));

flat, re = destructure(model)
st = Optimisers.setup(rule, flat)  # state is just one Leaf now

∇flat = ForwardDiff.gradient(flat) do v
    m = re(v)  # rebuild a new object like model
    sum(m(image))  # call that as before
end

st, flat = Optimisers.update(st, flat, ∇flat)
@show sum(re(flat)(image));
```
Here `flat` contains only the 283 trainable parameters, while the non-trainable ones are preserved inside `re`, an object of type `Restructure`. When defining new layers, which arrays are trainable can be specified if necessary by overloading `trainable`. By default, all numeric arrays visible to Functors.jl are assumed to contain trainable parameters. Tied parameters (arrays appearing in different layers) are included only once in `flat`.
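These claims are easy to check on the example above:

```julia
length(flat)  # 283: all trainable arrays, but not BatchNorm's μ and σ² buffers
re  # Restructure(Chain, ..., 283)
```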
Lux stores only the trainable parameters in `params`. This can also be flattened to a plain `Vector` in the same way:
```julia
using Random

params, lux_state = Lux.setup(Random.default_rng(), lux_model);
flat, re = destructure(params)

∇flat = ForwardDiff.gradient(flat) do v
    p = re(v)  # rebuild an object like params
    y, _ = Lux.apply(lux_model, images, p, lux_state)
    sum(y)
end
```
## Collecting all trainable parameters

Sometimes it is useful to collect all trainable parameters in a model, similarly to what `destructure` does but without concatenating the arrays into a flat vector. This is done by `trainables`, which returns a list of arrays:
```julia-repl
julia> using Flux, Optimisers

julia> model = Chain(Dense(2 => 3, tanh), BatchNorm(3), Dense(3 => 2));

julia> trainables(model)
6-element Vector{AbstractArray}:
 Float32[0.5756773 -0.1975264; 0.4723181 -0.7546912; -0.91631395 0.07392061]
 Float32[0.0, 0.0, 0.0]
 Float32[0.0, 0.0, 0.0]
 Float32[1.0, 1.0, 1.0]
 Float32[-0.8764882 0.40812716 0.1919528; -0.9123545 -0.4462516 0.6751252]
 Float32[0.0, 0.0]

julia> l2reg(model) = sum([sum(abs2, p) for p in trainables(model)]);

julia> g = gradient(l2reg, model)[1];
```
Notice that the `BatchNorm` layer has two trainable parameters, `γ` and `β`, which are included in the list, while the `μ` and `σ²` buffers are not.
Sometimes one wants to iterate over all trainable parameters in a model together with the corresponding entries of a matched structure, such as a gradient or a moving average of the model. This can be done using `trainables(model, path=true)`. For instance, here is how to update the parameters of a moving average model with the parameters of the model:
```julia
for (kp, p_avg) in trainables(model_avg, path=true)
    p = getkeypath(model, kp)  # `getkeypath` is provided by Functors.jl
    p_avg .= 0.99 .* p_avg .+ 0.01 .* p
end
```
## Incomplete or nothing gradients

If the gradient is not available for some parameters, or branches of the model, `update` will not take an optimisation step for those parameters. This is the case when the gradient is `nothing` or a subtype of `ChainRules.AbstractZero`.

For stateful optimisers, skipping an update is generally not the same as updating with a zero gradient. For example, in the case of Adam, the momentum and variance are updated even if the gradient is zero:
```julia-repl
julia> x = (a = ones(2), b = ones(2))
(a = [1.0, 1.0], b = [1.0, 1.0])

julia> opt_state = Optimisers.setup(Adam(0.1), x)
(a = Leaf(Adam(0.1, (0.9, 0.999), 1.0e-8), ([0.0, 0.0], [0.0, 0.0], (0.9, 0.999))), b = Leaf(Adam(0.1, (0.9, 0.999), 1.0e-8), ([0.0, 0.0], [0.0, 0.0], (0.9, 0.999))))

julia> g = (; a = ones(2), b = ones(2));  # first an update with a non-zero gradient to increase the momentum and variance

julia> Optimisers.update!(opt_state, x, g);

julia> opt_state  # the state in `a` and `b` is the same
(a = Leaf(Adam(0.1, (0.9, 0.999), 1.0e-8), ([0.1, 0.1], [0.001, 0.001], (0.81, 0.998001))), b = Leaf(Adam(0.1, (0.9, 0.999), 1.0e-8), ([0.1, 0.1], [0.001, 0.001], (0.81, 0.998001))))

julia> g = (; a = zeros(2), b = nothing);  # now an update with a zero gradient for a and no gradient for b

julia> Optimisers.update!(opt_state, x, g);

julia> opt_state  # the state in `a` and `b` differs
(a = Leaf(Adam(0.1, (0.9, 0.999), 1.0e-8), ([0.09, 0.09], [0.000999, 0.000999], (0.729, 0.997003))), b = Leaf(Adam(0.1, (0.9, 0.999), 1.0e-8), ([0.1, 0.1], [0.001, 0.001], (0.81, 0.998001))))
```
## Usage with Enzyme.jl

Enzyme.jl is a new automatic differentiation package, an alternative to Zygote.jl. It likes to store the model and the gradient together, as an object `Duplicated(x, dx)`.

Optimisers.jl now has some methods to handle this:

* `update!(opt_state, Duplicated(model, grad))` uses the gradient to update both the model and the optimiser state, and
* `setup(::AbstractRule, ::Duplicated)` ignores the gradient and returns `setup(rule, model)`.
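A rough sketch of how these fit together, re-using the Flux `model` and `image` from above (see Enzyme's documentation for the exact `autodiff` calling convention):

```julia
using Enzyme, Optimisers

dup = Duplicated(model, Enzyme.make_zero(model))  # the model plus matching storage for its gradient
opt_state = Optimisers.setup(Optimisers.Adam(), dup)  # the gradient half is ignored here

Enzyme.autodiff(Reverse, m -> sum(m(image)), Active, dup)  # fills dup.dval with the gradient
Optimisers.update!(opt_state, dup)  # reads dup.dval, updates both model and state
```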