Utilities
Zygote's gradients can be used to construct a Jacobian (by repeated evaluation) or a Hessian (by taking a second derivative).
Zygote.jacobian
— Functionjacobian(f, args...) -> Tuple
For each array a ∈ args
this returns a matrix with Ja[k,i] = ∂y[k]/∂a[i]
where y = f(args...)
is usually a vector. Arrays of higher dimension are treated like vec(a)
, or vec(y)
for output.
For scalar x::Number ∈ args
, the result is a vector Jx[k] = ∂y[k]/∂x
, while for scalar y
all results have just one row.
With any other argument type, no result is produced, even if gradient
would work.
This reverse-mode Jacobian needs to evaluate the pullback once for each element of y
. Doing so is usually only efficient when length(y)
is small compared to length(a)
, otherwise forward mode is likely to be better.
See also withjacobian
, hessian
, hessian_reverse
.
Examples
julia> jacobian(a -> 100*a[1:3].^2, 1:7)[1] # first index (rows) is output
3×7 Matrix{Int64}:
200 0 0 0 0 0 0
0 400 0 0 0 0 0
0 0 600 0 0 0 0
julia> jacobian((a,x) -> a.^2 .* x, [1,2,3], 1) # scalar argument has vector jacobian
([2 0 0; 0 4 0; 0 0 6], [1, 4, 9])
julia> jacobian((a,d) -> prod(a, dims=d), [1 2; 3 4; 5 6], 2)
([2 0 … 0 0; 0 4 … 3 0; 0 0 … 0 5], [0, 0, 0])
For arguments of any type except Number
& AbstractArray
, the result is nothing
.
julia> jacobian((a,s) -> a.^length(s), [1,2,3], "str")
([3 0 0; 0 12 0; 0 0 27], nothing)
julia> jacobian((a,t) -> sum(a .* t[1]) + t[2], [1,2,3], (4,5))
([4 4 4], nothing)
julia> gradient((a,t) -> sum(a .* t[1]) + t[2], [1,2,3], (4,5)) # gradient undersands the tuple
([4 4 4], (6, 1))
jacobian(loss, ::Params)
Like gradient
with implicit parameters, this method takes a zero-argument function and returns an IdDict
-like object, now containing the Jacobian for each parameter.
Examples
julia> xs = [1 2; 3 4]; ys = [5,7,9];
julia> Jxy = jacobian(() -> ys[1:2] .+ sum(xs.^2), Params([xs, ys]))
Grads(...)
julia> Jxy[ys]
2×3 Matrix{Int64}:
1 0 0
0 1 0
julia> Jxy[xs]
2×4 Matrix{Int64}:
2 6 4 8
2 6 4 8
Zygote.hessian
— Functionhessian(f, x)
Construct the Hessian ∂²f/∂x²
, where x
is a real number or an array, and f(x)
is a real number. When x
is an array, the result is a matrix H[i,j] = ∂²f/∂x[i]∂x[j]
, using linear indexing x[i]
even if the argument is higher-dimensional.
This uses forward over reverse, ForwardDiff over Zygote, calling hessian_dual(f, x)
. See hessian_reverse
for an all-Zygote alternative.
See also diaghessian
to compute only the diagonal part.
Examples
julia> hessian(x -> x[1]*x[2], randn(2))
2×2 Matrix{Float64}:
0.0 1.0
1.0 0.0
julia> hessian(x -> sum(x.^3), [1 2; 3 4]) # uses linear indexing of x
4×4 Matrix{Int64}:
6 0 0 0
0 18 0 0
0 0 12 0
0 0 0 24
julia> hessian(sin, pi/2)
-1.0
Zygote.hessian_reverse
— Functionhessian_reverse(f, x)
This should be equivalent to hessian(f, x)
, but implemented using reverse over reverse mode, all Zygote. (This is usually much slower, and more likely to find errors.)
Zygote.diaghessian
— Functiondiaghessian(f, args...) -> Tuple
Diagonal part of the Hessian. Returns a tuple containing, for each argument x
, h
of the same shape with h[i] = Hᵢᵢ = ∂²y/∂x[i]∂x[i]
. The original evaluation y = f(args...)
must give a real number y
.
For one vector argument x
, this is equivalent to (diag(hessian(f,x)),)
. Like hessian
it uses ForwardDiff over Zygote.
For arguments of any type except Number
& AbstractArray
, the result is nothing
.
Examples
julia> diaghessian(x -> sum(x.^3), [1 2; 3 4])[1]
2×2 Matrix{Int64}:
6 12
18 24
julia> Diagonal(vec(ans)) == hessian(x -> sum(x.^3), [1 2; 3 4]) # full Hessian is diagonal
true
julia> diaghessian((x,y) -> sum(x .* y .* y'), [1 22; 333 4], [0.5, 0.666]) # two array arguments
([0.0 0.0; 0.0 0.0], [2.0, 8.0])
julia> diaghessian(atan, 1, 2) # two scalar arguments
(-0.16, 0.16)
julia> hessian(xy -> atan(xy[1], xy[2]), [1, 2]) # full Hessian is not diagonal
2×2 Matrix{Float64}:
-0.16 -0.12
-0.12 0.16
Zygote also provides a set of helpful utilities. These are all "user-level" tools – in other words you could have written them easily yourself, but they live in Zygote for convenience.
See ChainRules.ignore_derivatives
if you want to exclude some of your code from the gradient calculation. This replaces previous Zygote-specific ignore
and dropgrad
functionality.
Zygote.withgradient
— Functionwithgradient(f, args...)
withgradient(f, ::Params)
Returns both the value of the function and the gradient
, as a named tuple.
julia> y, ∇ = withgradient(/, 1, 2)
(val = 0.5, grad = (0.5, -0.25))
julia> ∇ == gradient(/, 1, 2)
true
Allows you to capture auxillary outputs, in addition to the scalar used by gradient
. To do this, f
must return a Tuple or NamedTuple. Then it calculates grad = gradient(first∘f, args...) but returns the whole
val = f(args...)`:
julia> withgradient([1,2,4]) do x
z = 1 ./ x
sum(z), z # here z is an auxillary output
end
(val = (1.75, [1.0, 0.5, 0.25]), grad = ([-1.0, -0.25, -0.0625],))
julia> withgradient(3.0, 4.0) do x, y
(div = x/y, mul = x*y)
end
(val = (div = 0.75, mul = 12.0), grad = (0.25, -0.1875))
Also supports implicit mode:
julia> w = [3.0];
julia> res = withgradient(() -> sum(abs2, w), Params([w]))
(val = 9.0, grad = Grads(...))
julia> res.grad[w]
1-element Vector{Float64}:
6.0
Zygote.withjacobian
— Functionwithjacobian(f, args...)
Returns both the value f(args...)
and the jacobian
as a named tuple.
julia> withjacobian(cumsum, [1,2,3])
(val = [1, 3, 6], grad = ([1 0 0; 1 1 0; 1 1 1],))
Zygote.@showgrad
— Macro@showgrad(x) -> x
Much like @show
, but shows the gradient about to accumulate to x
. Useful for debugging gradients.
julia> gradient(2, 3) do a, b
@showgrad(a)*b
end
∂(a) = 3
(3, 2)
Note that the gradient depends on how the output of @showgrad
is used, and is not the overall gradient of the variable a
. For example:
julia> gradient(2) do a
@showgrad(a)*a
end
∂(a) = 2
(4,)
julia> gradient(2, 3) do a, b
@showgrad(a) # not used, so no gradient
a*b
end
∂(a) = nothing
(3, 2)
Zygote.hook
— Functionhook(x̄ -> ..., x) -> x
Gradient hooks. Allows you to apply an arbitrary function to the gradient for x
.
julia> gradient(2, 3) do a, b
hook(ā -> @show(ā), a)*b
end
ā = 3
(3, 2)
julia> gradient(2, 3) do a, b
hook(-, a)*b
end
(-3, 2)
Zygote.Buffer
— TypeBuffer(xs, ...)
Buffer
is an array-like type which is mutable when taking gradients. You can construct a Buffer
with the same syntax as similar
(e.g. Buffer(xs, 5)
) and then use normal indexing. Finally, use copy
to get back a normal array.
For example:
julia> function vstack(xs)
buf = Buffer(xs, length(xs), 5)
for i = 1:5
buf[:, i] = xs
end
return copy(buf)
end
vstack (generic function with 1 method)
julia> vstack([1, 2, 3])
3×5 Array{Int64,2}:
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
julia> gradient(x -> sum(vstack(x)), [1, 2, 3])
([5.0, 5.0, 5.0],)
Buffer
is not an AbstractArray
and can't be used for linear algebra operations like matrix multiplication. This prevents it from being captured by pullbacks.
copy
is a semantic copy, but does not allocate memory. Instead the Buffer
is made immutable after copying.
Zygote.forwarddiff
— Functionforwarddiff(f, x; chunk_threshold = ForwardDiff.DEFAULT_CHUNK_THRESHOLD) -> f(x)
Runs f(x)
as usual, but instructs Zygote to differentiate f
using forward mode, rather than the usual reverse mode. The chunk_threshold
argument controls the maximum chunk size (c.f. ForwardDiff documentation).
Forward mode takes time linear in length(x)
but only has constant memory overhead, and is very efficient for scalars, so in some cases this can be a useful optimisation.
julia> function pow(x, n)
r = one(x)
for i = 1:n
r *= x
end
return r
end
pow (generic function with 1 method)
julia> gradient(5) do x
forwarddiff(x) do x
pow(x, 2)
end
end
(10,)
Note that the function f
will drop gradients for any closed-over values.
julia> gradient(2, 3) do a, b
forwarddiff(a) do a
a*b
end
end
(3, nothing)
This can be rewritten by explicitly passing through b
, i.e.
gradient(2, 3) do a, b
forwarddiff([a, b]) do (a, b)
a*b
end
end
Zygote.checkpointed
— Functioncheckpointed(f, xs...)
Use gradient checkpointing on the call f(xs...)
. This means that checkpointed(f, xs...) === f(xs...)
, but when computing the derivative intermediate results from the forward pass of f
will not be stored. Instead the forward pass will be repeated, when computing the derivative. This saves memory at the cost of increasing execution time.
If f
is not a pure function, checkpointed
will likely give wrong results.
Zygote.eager_update!
— Functioneager_update!(state, model, update!)
Eagerly updates the model parameters, discarding the updated gradients to save memory. model
stores the parameters to be updated, state
is the optimization state (eg. from Optimisers.jl) matching your model component, and update!
is the function that updates the parameters (eg. from Optimisers.jl
), usually called as update!(state, model, grads)
.
If f
is a function that takes a single layer, called as h = f(model.layers[i], h, other_args...)
then we can eagerly update with:
h = f(Zygote.eager_update!(state.layers[i], model.layers[i], Optimisers.update!), h, other_args...)
or combine this with gradient checkpointing (for additional memory saving at the cost of increased execution time) with:
h = Zygote.checkpointed(f, eager_update!(state.layers[i], model.layers[i], Optimisers.update!), h, other_args...)
If model.layers[i]
itself is callable, we can use the above by first wrapping it:
f(model, xs...) = model(xs...)
h = f(Zygote.eager_update!(state.layers[i], model.layers[i], Optimisers.update!), h, other_args...)
If different layers share trainable parameters, then eager_update!
will likely give wrong results.
Params
and Grads
can be copied to and from arrays using the copy!
function.
Working with Grads
Map, broadcast, and iteration are supported for the dictionary-like Grads
objects. These operations are value based and preserve the keys.
using Zygote, Test
w, x1, x2, b = rand(2), rand(2), rand(2), rand(2)
gs1 = gradient(() -> sum(tanh.(w .* x1 .+ b)), Params([w, b]))
gs2 = gradient(() -> sum(tanh.(w .* x2 .+ b)), Params([w, b]))
# accumulate gradients
gs = gs1 .+ gs2
@test gs[w] ≈ gs1[w] + gs2[w]
@test gs[b] ≈ gs1[b] + gs2[b]
# gradients and IdDict interact nicely
# note that an IdDict must be used for gradient algebra on the GPU
gs .+= IdDict(p => randn(size(p)) for p in keys(gs))
# clip gradients
map(x -> clamp.(x, -0.1, 0.1), gs)
# clip gradients in-place
foreach(x -> clamp!(x, -0.1, 0.1), gs)
for (p, g) in pairs(gs)
# do something with parameter `p` and corresponding gradient `g`
end
# note that gradients must be w.r.t. to the same parameter key set
gs3 = gradient(() -> sum(tanh.(w .* x2)), Params([w]))
# gs3 does not have the key b
@test_throws ArgumentError gs1 .+ gs3