Scheduling optimizers
A schedule by itself is not very useful; we need to use it to adjust the optimizer's hyperparameters during training. In this tutorial, we will examine three ways to do just that: iterating the schedule, using a stateful iterator, and using a scheduled optimizer.
Iterating during training
Since every schedule is a standard iterator, we can insert it into a training loop simply by zipping it with another iterator. For example, the following code adjusts the learning rate of the optimizer before each batch of training.
using Flux, ParameterSchedulers
data = [(rand(4, 10), rand([-1, 1], 1, 10)) for _ in 1:3]
m = Chain(Dense(4, 4, tanh), Dense(4, 1, tanh))
p = params(m)
opt = Descent()
s = Exp(λ = 1e-1, γ = 0.2)  # exponential decay: η drops by a factor of γ every iteration
for (η, (x, y)) in zip(s, data)
opt.eta = η  # set the optimizer's learning rate from the schedule
g = Flux.gradient(() -> Flux.mse(m(x), y), p)
Flux.update!(opt, p, g)
println("η: ", opt.eta)
end
η: 0.1
η: 0.020000000000000004
η: 0.004000000000000001
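Because the schedule is an ordinary iterator, it can also be inspected on its own, outside of any training loop. A minimal sketch using the same Exp schedule as above (the commented values are approximate, following λ * γ^(t - 1)):
s = Exp(λ = 1e-1, γ = 0.2)
collect(Iterators.take(s, 4))  # ≈ [0.1, 0.02, 0.004, 0.0008]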
We can also adjust the learning rate on a per-epoch basis instead. All that is required is to change what we zip our schedule with.
nepochs = 6
s = Step(λ = 1e-1, γ = 0.2, step_sizes = [3, 2, 1])  # η = 0.1 for 3 epochs, then 0.02 for 2, then 0.004
for (η, epoch) in zip(s, 1:nepochs)
opt.eta = η
for (i, (x, y)) in enumerate(data)
g = Flux.gradient(() -> Flux.mse(m(x), y), p)
Flux.update!(opt, p, g)
println("epoch: $epoch, batch: $i, η: $(opt.eta)")
end
end
epoch: 1, batch: 1, η: 0.1
epoch: 1, batch: 2, η: 0.1
epoch: 1, batch: 3, η: 0.1
epoch: 2, batch: 1, η: 0.1
epoch: 2, batch: 2, η: 0.1
epoch: 2, batch: 3, η: 0.1
epoch: 3, batch: 1, η: 0.1
epoch: 3, batch: 2, η: 0.1
epoch: 3, batch: 3, η: 0.1
epoch: 4, batch: 1, η: 0.020000000000000004
epoch: 4, batch: 2, η: 0.020000000000000004
epoch: 4, batch: 3, η: 0.020000000000000004
epoch: 5, batch: 1, η: 0.020000000000000004
epoch: 5, batch: 2, η: 0.020000000000000004
epoch: 5, batch: 3, η: 0.020000000000000004
epoch: 6, batch: 1, η: 0.004000000000000001
epoch: 6, batch: 2, η: 0.004000000000000001
epoch: 6, batch: 3, η: 0.004000000000000001
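Schedules can also be evaluated at an arbitrary iteration by calling them like a function, which makes it easy to preview the learning rate planned for each epoch before training. A quick sketch with the same Step schedule as above:
s = Step(λ = 1e-1, γ = 0.2, step_sizes = [3, 2, 1])
[s(t) for t in 1:nepochs]  # ≈ [0.1, 0.1, 0.1, 0.02, 0.02, 0.004]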
Stateful iteration during training
Sometimes zipping the schedule up with another iterator isn't sufficient. For example, we might want to advance the schedule on every batch without restarting it at the start of each epoch. In such situations with nested loops, it becomes useful to use ParameterSchedulers.Stateful, which maintains its own iteration state.
nepochs = 3
s = ParameterSchedulers.Stateful(Inv(λ = 1e-1, γ = 0.2, p = 2))
for epoch in 1:nepochs
for (i, (x, y)) in enumerate(data)
opt.eta = ParameterSchedulers.next!(s)  # grab the next value and advance the schedule
g = Flux.gradient(() -> Flux.mse(m(x), y), p)
Flux.update!(opt, p, g)
println("epoch: $epoch, batch: $i, η: $(opt.eta)")
end
end
epoch: 1, batch: 1, η: 0.1
epoch: 1, batch: 2, η: 0.06944444444444445
epoch: 1, batch: 3, η: 0.051020408163265314
epoch: 2, batch: 1, η: 0.03906249999999999
epoch: 2, batch: 2, η: 0.030864197530864196
epoch: 2, batch: 3, η: 0.025
epoch: 3, batch: 1, η: 0.020661157024793386
epoch: 3, batch: 2, η: 0.01736111111111111
epoch: 3, batch: 3, η: 0.014792899408284023
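The stateful wrapper can also be exercised outside of a loop: each call to ParameterSchedulers.next! returns the current value and advances an internal counter, and rebuilding the wrapper starts the schedule over from the beginning. A small sketch (the Exp schedule here is chosen purely for illustration):
sched = ParameterSchedulers.Stateful(Exp(λ = 1e-2, γ = 0.5))
ParameterSchedulers.next!(sched)  # 0.01
ParameterSchedulers.next!(sched)  # 0.005
ParameterSchedulers.next!(sched)  # 0.0025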
Working with Flux optimizers
Warning
Currently, we are porting Scheduler to Flux.jl. It may be renamed once it is ported out of this package, and the API will also undergo minor changes.
While the approaches above can be helpful when dealing with fine-grained training loops, it is usually simpler to just use a ParameterSchedulers.Scheduler.
using ParameterSchedulers: Scheduler
nepochs = 3
s = Inv(λ = 1e-1, p = 2, γ = 0.2)
opt = Scheduler(s, Descent())  # wraps Descent and sets its learning rate from s on every update
for epoch in 1:nepochs
for (i, (x, y)) in enumerate(data)
g = Flux.gradient(() -> Flux.mse(m(x), y), p)
Flux.update!(opt, p, g)
println("epoch: $epoch, batch: $i, η: $(opt.optim.eta)")
end
end
epoch: 1, batch: 1, η: 0.1
epoch: 1, batch: 2, η: 0.06944444444444445
epoch: 1, batch: 3, η: 0.051020408163265314
epoch: 2, batch: 1, η: 0.03906249999999999
epoch: 2, batch: 2, η: 0.030864197530864196
epoch: 2, batch: 3, η: 0.025
epoch: 3, batch: 1, η: 0.020661157024793386
epoch: 3, batch: 2, η: 0.01736111111111111
epoch: 3, batch: 3, η: 0.014792899408284023
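The wrapped optimizer is not limited to Descent; any Flux optimizer with a learning rate field can be scheduled the same way, since the Scheduler's default update adjusts the wrapped optimizer's eta (the opt.optim.eta printed above). A sketch pairing the same schedule with Momentum, where the hyperparameters are purely illustrative:
s = Inv(λ = 1e-1, p = 2, γ = 0.2)
opt = Scheduler(s, Momentum())  # Momentum's eta follows s; its momentum term is left untouched
for epoch in 1:nepochs, (x, y) in data
    g = Flux.gradient(() -> Flux.mse(m(x), y), p)
    Flux.update!(opt, p, g)
end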
The scheduler, opt, can be used anywhere a Flux optimizer can. For example, it can be passed to Flux.train!:
s = Inv(λ = 1e-1, p = 2, γ = 0.2)
opt = Scheduler(s, Descent())
loss(x, y, m) = Flux.mse(m(x), y)
cb = () -> @show(opt.optim.eta)  # callback that prints the current learning rate after each batch
Flux.@epochs nepochs Flux.train!((x, y) -> loss(x, y, m), params(m), data, opt, cb = cb)
[ Info: Epoch 1
opt.optim.eta = 0.1
opt.optim.eta = 0.06944444444444445
opt.optim.eta = 0.051020408163265314
[ Info: Epoch 2
opt.optim.eta = 0.03906249999999999
opt.optim.eta = 0.030864197530864196
opt.optim.eta = 0.025
[ Info: Epoch 3
opt.optim.eta = 0.020661157024793386
opt.optim.eta = 0.01736111111111111
opt.optim.eta = 0.014792899408284023
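Because the scheduler behaves like any other Flux optimizer, it should also compose with optimizer wrappers such as Flux.Optimiser. The sketch below combines it with gradient clipping; the ClipValue(1.0) threshold is an arbitrary choice for illustration:
s = Inv(λ = 1e-1, p = 2, γ = 0.2)
opt = Flux.Optimiser(ClipValue(1.0), Scheduler(s, Descent()))
Flux.train!((x, y) -> loss(x, y, m), params(m), data, opt)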
Finally, you might be interested in reading Interpolating schedules to see how to specify a schedule in terms of epochs but iterate it at the granularity of batches.