Utilities
Zygote's gradients can be used to construct a Jacobian (by repeated evaluation) or a Hessian (by taking a second derivative).
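To make "repeated evaluation" concrete, a Jacobian can be assembled row by row from one gradient call per output element. A minimal sketch (the function f and the point x below are made up for illustration):

using Zygote

f(x) = [x[1]^2, x[1]*x[2]]   # toy vector-valued function
x = [3.0, 4.0]

# one reverse-mode gradient call per output element gives one row of the Jacobian
J = vcat([gradient(v -> f(v)[k], x)[1]' for k in 1:2]...)

J ≈ jacobian(f, x)[1]   # true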
Zygote.jacobian — Function

jacobian(f, args...) -> Tuple
For each array a ∈ args this returns a matrix with Ja[k,i] = ∂y[k]/∂a[i] where y = f(args...) is usually a vector. Arrays of higher dimension are treated like vec(a), or vec(y) for output.
For scalar x::Number ∈ args, the result is a vector Jx[k] = ∂y[k]/∂x, while for scalar y all results have just one row.
With any other argument type, no result is produced, even if gradient would work.
This reverse-mode Jacobian needs to evaluate the pullback once for each element of y. Doing so is usually only efficient when length(y) is small compared to length(a); otherwise forward mode is likely to be better (see the sketch after the examples below).
See also withjacobian, hessian, hessian_reverse.
Examples
julia> jacobian(a -> 100*a[1:3].^2, 1:7)[1] # first index (rows) is output
3×7 Matrix{Int64}:
 200    0    0  0  0  0  0
   0  400    0  0  0  0  0
   0    0  600  0  0  0  0
julia> jacobian((a,x) -> a.^2 .* x, [1,2,3], 1) # scalar argument has vector jacobian
([2 0 0; 0 4 0; 0 0 6], [1, 4, 9])
julia> jacobian((a,d) -> prod(a, dims=d), [1 2; 3 4; 5 6], 2)
([2 0 … 0 0; 0 4 … 3 0; 0 0 … 0 5], [0, 0, 0])
For arguments of any type except Number & AbstractArray, the result is nothing.
julia> jacobian((a,s) -> a.^length(s), [1,2,3], "str")
([3 0 0; 0 12 0; 0 0 27], nothing)
julia> jacobian((a,t) -> sum(a .* t[1]) + t[2], [1,2,3], (4,5))
([4 4 4], nothing)
julia> gradient((a,t) -> sum(a .* t[1]) + t[2], [1,2,3], (4,5)) # gradient understands the tuple
([4 4 4], (6, 1))
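To make the cost comparison above concrete: the function below has one input and three outputs, so forward mode needs a single dual-number sweep while this reverse-mode jacobian evaluates the pullback three times. A sketch assuming the ForwardDiff package is loaded; the function g is made up:

using ForwardDiff, Zygote

g(x) = [sin(x[1]), cos(x[1]), exp(x[1])]   # 3 outputs from 1 input

# forward mode: one sweep; reverse mode: one pullback call per output
ForwardDiff.jacobian(g, [0.5]) ≈ Zygote.jacobian(g, [0.5])[1]   # true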
jacobian(loss, ::Params)

Like gradient with implicit parameters, this method takes a zero-argument function and returns an IdDict-like object, now containing the Jacobian for each parameter.
Examples
julia> xs = [1 2; 3 4]; ys = [5,7,9];
julia> Jxy = jacobian(() -> ys[1:2] .+ sum(xs.^2), Params([xs, ys]))
Grads(...)
julia> Jxy[ys]
2×3 Matrix{Int64}:
1 0 0
0 1 0
julia> Jxy[xs]
2×4 Matrix{Int64}:
2 6 4 8
2 6 4 8
Zygote.hessian — Function

hessian(f, x)

Construct the Hessian ∂²f/∂x², where x is a real number or an array, and f(x) is a real number. When x is an array, the result is a matrix H[i,j] = ∂²f/∂x[i]∂x[j], using linear indexing x[i] even if the argument is higher-dimensional.

This uses forward over reverse, ForwardDiff over Zygote, calling hessian_dual(f, x). See hessian_reverse for an all-Zygote alternative.

See also diaghessian to compute only the diagonal part.
Examples
julia> hessian(x -> x[1]*x[2], randn(2))
2×2 Matrix{Float64}:
0.0 1.0
1.0 0.0
julia> hessian(x -> sum(x.^3), [1 2; 3 4]) # uses linear indexing of x
4×4 Matrix{Int64}:
 6   0   0   0
 0  18   0   0
 0   0  12   0
 0   0   0  24
julia> hessian(sin, pi/2)
-1.0
Zygote.hessian_reverse — Function

hessian_reverse(f, x)

This should be equivalent to hessian(f, x), but implemented using reverse over reverse mode, all Zygote. (This is usually much slower, and more likely to find errors.)
Zygote.diaghessian — Function

diaghessian(f, args...) -> Tuple

Diagonal part of the Hessian. Returns a tuple containing, for each argument x, an array h of the same shape with h[i] = Hᵢᵢ = ∂²y/∂x[i]∂x[i]. The original evaluation y = f(args...) must give a real number y.

For one vector argument x, this is equivalent to (diag(hessian(f,x)),). Like hessian it uses ForwardDiff over Zygote.

For arguments of any type except Number & AbstractArray, the result is nothing.
Examples
julia> diaghessian(x -> sum(x.^3), [1 2; 3 4])[1]
2×2 Matrix{Int64}:
  6  12
 18  24
julia> Diagonal(vec(ans)) == hessian(x -> sum(x.^3), [1 2; 3 4]) # full Hessian is diagonal
true
julia> diaghessian((x,y) -> sum(x .* y .* y'), [1 22; 333 4], [0.5, 0.666]) # two array arguments
([0.0 0.0; 0.0 0.0], [2.0, 8.0])
julia> diaghessian(atan, 1, 2) # two scalar arguments
(-0.16, 0.16)
julia> hessian(xy -> atan(xy[1], xy[2]), [1, 2]) # full Hessian is not diagonal
2×2 Matrix{Float64}:
-0.16 -0.12
-0.12 0.16
Zygote also provides a set of helpful utilities. These are all "user-level" tools – in other words you could have written them easily yourself, but they live in Zygote for convenience.
See ChainRules.ignore_derivatives if you want to exclude some of your code from the gradient calculation. This replaces the previous Zygote-specific ignore and dropgrad functionality.
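For example, side effects such as logging can be wrapped so that Zygote does not try to differentiate them. A sketch using the do-block form, assuming ChainRulesCore is available (it is a dependency of Zygote):

using Zygote
using ChainRulesCore: ignore_derivatives

gradient([1.0, 2.0, 3.0]) do x
  y = sum(abs2, x)
  ignore_derivatives() do
    @debug "current loss" y   # side effect, excluded from differentiation
  end
  return y
end
# ([2.0, 4.0, 6.0],)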
Zygote.withgradient — Function

withgradient(f, args...)
withgradient(f, ::Params)

Returns both the value of the function and the gradient, as a named tuple.
julia> y, ∇ = withgradient(/, 1, 2)
(val = 0.5, grad = (0.5, -0.25))
julia> ∇ == gradient(/, 1, 2)
true
Allows you to capture auxiliary outputs, in addition to the scalar used by gradient. To do this, f must return a Tuple or NamedTuple. Then it calculates grad = gradient(first∘f, args...) but returns the whole val = f(args...):
julia> withgradient([1,2,4]) do x
         z = 1 ./ x
         sum(z), z  # here z is an auxiliary output
       end
(val = (1.75, [1.0, 0.5, 0.25]), grad = ([-1.0, -0.25, -0.0625],))
julia> withgradient(3.0, 4.0) do x, y
         (div = x/y, mul = x*y)
       end
(val = (div = 0.75, mul = 12.0), grad = (0.25, -0.1875))
Also supports implicit mode:
julia> w = [3.0];
julia> res = withgradient(() -> sum(abs2, w), Params([w]))
(val = 9.0, grad = Grads(...))
julia> res.grad[w]
1-element Vector{Float64}:
6.0
Zygote.withjacobian — Function

withjacobian(f, args...)

Returns both the value f(args...) and the jacobian as a named tuple.
julia> withjacobian(cumsum, [1,2,3])
(val = [1, 3, 6], grad = ([1 0 0; 1 1 0; 1 1 1],))
Zygote.@showgrad — Macro

@showgrad(x) -> x

Much like @show, but shows the gradient about to accumulate to x. Useful for debugging gradients.
julia> gradient(2, 3) do a, b
         @showgrad(a)*b
       end
∂(a) = 3
(3, 2)
Note that the gradient depends on how the output of @showgrad is used, and is not the overall gradient of the variable a. For example:
julia> gradient(2) do a
         @showgrad(a)*a
       end
∂(a) = 2
(4,)
julia> gradient(2, 3) do a, b
         @showgrad(a)  # not used, so no gradient
         a*b
       end
∂(a) = nothing
(3, 2)
Zygote.hook — Function

hook(x̄ -> ..., x) -> x

Gradient hooks. Allows you to apply an arbitrary function to the gradient for x.
julia> gradient(2, 3) do a, b
         hook(ā -> @show(ā), a)*b
       end
ā = 3
(3, 2)
julia> gradient(2, 3) do a, b
         hook(-, a)*b
       end
(-3, 2)
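Since the hook may be any function, one common pattern is clipping a gradient as it flows backwards. A made-up sketch:

julia> gradient(10.0, 3.0) do a, b
         hook(ā -> clamp(ā, -1, 1), a)*b   # clip a's incoming gradient to [-1, 1]
       end
(1.0, 10.0)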
Zygote.Buffer — Type

Buffer(xs, ...)

Buffer is an array-like type which is mutable when taking gradients. You can construct a Buffer with the same syntax as similar (e.g. Buffer(xs, 5)) and then use normal indexing. Finally, use copy to get back a normal array.
For example:
julia> function vstack(xs)
         buf = Buffer(xs, length(xs), 5)
         for i = 1:5
           buf[:, i] = xs
         end
         return copy(buf)
       end
vstack (generic function with 1 method)
julia> vstack([1, 2, 3])
3×5 Matrix{Int64}:
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
julia> gradient(x -> sum(vstack(x)), [1, 2, 3])
([5.0, 5.0, 5.0],)
Buffer is not an AbstractArray and can't be used for linear algebra operations like matrix multiplication. This prevents it from being captured by pullbacks.

copy is a semantic copy, but does not allocate memory. Instead the Buffer is made immutable after copying.
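Because reads as well as writes are tracked, a Buffer also supports loops where each entry depends on earlier ones. A made-up running-product sketch:

function cumprods(xs)
  buf = Buffer(xs)              # same shape and eltype as xs
  buf[1] = xs[1]
  for i in 2:length(xs)
    buf[i] = buf[i-1] * xs[i]   # reading earlier entries is also differentiable
  end
  return copy(buf)              # freeze and return a normal array
end

gradient(x -> sum(cumprods(x)), [1.0, 2.0, 3.0])   # expected: ([9.0, 4.0, 2.0],)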
Zygote.forwarddiff — Function

forwarddiff(f, x; chunk_threshold = ForwardDiff.DEFAULT_CHUNK_THRESHOLD) -> f(x)

Runs f(x) as usual, but instructs Zygote to differentiate f using forward mode, rather than the usual reverse mode. The chunk_threshold argument controls the maximum chunk size (c.f. ForwardDiff documentation).

Forward mode takes time linear in length(x) but only has constant memory overhead, and is very efficient for scalars, so in some cases this can be a useful optimisation.
julia> function pow(x, n)
         r = one(x)
         for i = 1:n
           r *= x
         end
         return r
       end
pow (generic function with 1 method)
julia> gradient(5) do x
         forwarddiff(x) do x
           pow(x, 2)
         end
       end
(10,)
Note that the function f will drop gradients for any closed-over values.
julia> gradient(2, 3) do a, b
         forwarddiff(a) do a
           a*b
         end
       end
(3, nothing)
This can be rewritten by explicitly passing through b, i.e.
gradient(2, 3) do a, b
  forwarddiff([a, b]) do (a, b)
    a*b
  end
end
Zygote.checkpointed — Function

checkpointed(f, xs...)

Use gradient checkpointing on the call f(xs...). This means that checkpointed(f, xs...) === f(xs...), but when computing the derivative, intermediate results from the forward pass of f will not be stored. Instead, the forward pass is repeated when computing the derivative. This saves memory at the cost of increasing execution time.

If f is not a pure function, checkpointed will likely give wrong results.
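A sketch showing that the gradient is unchanged while memory is traded for recomputation (the layer function is made up):

using Zygote

layer(x) = tanh.(x .* x')   # made-up memory-hungry step
x = rand(3)

g1 = gradient(x -> sum(layer(x)), x)[1]
g2 = gradient(x -> sum(Zygote.checkpointed(layer, x)), x)[1]
g1 ≈ g2   # true, but the second run repeats layer's forward pass during the backward pass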
Params and Grads can be copied to and from arrays using the copy! function.
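For instance, the gradients in a Grads object can be flattened into a single vector (for use with an optimiser, say), and updated values written back into the parameters. A sketch based on the sentence above; the functions and numbers are made up:

using Zygote

w, b = rand(2), rand(3)
ps = Params([w, b])
gs = gradient(() -> sum(w) + sum(abs2, b), ps)

v = zeros(5)
copy!(v, gs)          # flatten the gradients into v, following the order of ps
copy!(ps, zeros(5))   # write new values back into w and b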
Working with Grads
Map, broadcast, and iteration are supported for the dictionary-like Grads objects. These operations are value-based and preserve the keys.
using Zygote, Test
w, x1, x2, b = rand(2), rand(2), rand(2), rand(2)
gs1 = gradient(() -> sum(tanh.(w .* x1 .+ b)), Params([w, b]))
gs2 = gradient(() -> sum(tanh.(w .* x2 .+ b)), Params([w, b]))
# accumulate gradients
gs = gs1 .+ gs2
@test gs[w] ≈ gs1[w] + gs2[w]
@test gs[b] ≈ gs1[b] + gs2[b]
# gradients and IdDict interact nicely
# note that an IdDict must be used for gradient algebra on the GPU
gs .+= IdDict(p => randn(size(p)) for p in keys(gs))
# clip gradients
map(x -> clamp.(x, -0.1, 0.1), gs)
# clip gradients in-place
foreach(x -> clamp!(x, -0.1, 0.1), gs)
for (p, g) in pairs(gs)
  # do something with parameter `p` and corresponding gradient `g`
end
# note that gradients must be w.r.t. the same parameter key set
gs3 = gradient(() -> sum(tanh.(w .* x2)), Params([w]))
# gs3 does not have the key b
@test_throws ArgumentError gs1 .+ gs3