# Utilities

Zygote's gradients can be used to construct a Jacobian (by repeated evaluation) or a Hessian (by taking a second derivative).

`Zygote.jacobian`

— Function`jacobian(f, args...) -> Tuple`

For each array `a ∈ args`

this returns a matrix with `Ja[k,i] = ∂y[k]/∂a[i]`

where `y = f(args...)`

is usually a vector. Arrays of higher dimension are treated like `vec(a)`

, or `vec(y)`

for output.

For scalar `x::Number ∈ args`

, the result is a vector `Jx[k] = ∂y[k]/∂x`

, while for scalar `y`

all results have just one row.

With any other argument type, no result is produced, even if `gradient`

would work.

This reverse-mode Jacobian needs to evaluate the pullback once for each element of `y`

. Doing so is usually only efficient when `length(y)`

is small compared to `length(a)`

, otherwise forward mode is likely to be better.

See also `withjacobian`

, `hessian`

, `hessian_reverse`

.

**Examples**

```
julia> jacobian(a -> 100*a[1:3].^2, 1:7)[1] # first index (rows) is output
3×7 Matrix{Int64}:
200 0 0 0 0 0 0
0 400 0 0 0 0 0
0 0 600 0 0 0 0
julia> jacobian((a,x) -> a.^2 .* x, [1,2,3], 1) # scalar argument has vector jacobian
([2 0 0; 0 4 0; 0 0 6], [1, 4, 9])
julia> jacobian((a,d) -> prod(a, dims=d), [1 2; 3 4; 5 6], 2)
([2 0 … 0 0; 0 4 … 3 0; 0 0 … 0 5], [0, 0, 0])
```

For arguments of any type except `Number`

& `AbstractArray`

, the result is `nothing`

.

```
julia> jacobian((a,s) -> a.^length(s), [1,2,3], "str")
([3 0 0; 0 12 0; 0 0 27], nothing)
julia> jacobian((a,t) -> sum(a .* t[1]) + t[2], [1,2,3], (4,5))
([4 4 4], nothing)
julia> gradient((a,t) -> sum(a .* t[1]) + t[2], [1,2,3], (4,5)) # gradient undersands the tuple
([4 4 4], (6, 1))
```

`jacobian(loss, ::Params)`

Like `gradient`

with implicit parameters, this method takes a zero-argument function and returns an `IdDict`

-like object, now containing the Jacobian for each parameter.

**Examples**

```
julia> xs = [1 2; 3 4]; ys = [5,7,9];
julia> Jxy = jacobian(() -> ys[1:2] .+ sum(xs.^2), Params([xs, ys]))
Grads(...)
julia> Jxy[ys]
2×3 Matrix{Int64}:
1 0 0
0 1 0
julia> Jxy[xs]
2×4 Matrix{Int64}:
2 6 4 8
2 6 4 8
```

`Zygote.hessian`

— Function`hessian(f, x)`

Construct the Hessian `∂²f/∂x²`

, where `x`

is a real number or an array, and `f(x)`

is a real number. When `x`

is an array, the result is a matrix `H[i,j] = ∂²f/∂x[i]∂x[j]`

, using linear indexing `x[i]`

even if the argument is higher-dimensional.

This uses forward over reverse, ForwardDiff over Zygote, calling `hessian_dual(f, x)`

. See `hessian_reverse`

for an all-Zygote alternative.

See also `diaghessian`

to compute only the diagonal part.

**Examples**

```
julia> hessian(x -> x[1]*x[2], randn(2))
2×2 Matrix{Float64}:
0.0 1.0
1.0 0.0
julia> hessian(x -> sum(x.^3), [1 2; 3 4]) # uses linear indexing of x
4×4 Matrix{Int64}:
6 0 0 0
0 18 0 0
0 0 12 0
0 0 0 24
julia> hessian(sin, pi/2)
-1.0
```

`Zygote.diaghessian`

— Function`diaghessian(f, args...) -> Tuple`

Diagonal part of the Hessian. Returns a tuple containing, for each argument `x`

, `h`

of the same shape with `h[i] = Hᵢᵢ = ∂²y/∂x[i]∂x[i]`

. The original evaluation `y = f(args...)`

must give a real number `y`

.

For one vector argument `x`

, this is equivalent to `(diag(hessian(f,x)),)`

. Like `hessian`

it uses ForwardDiff over Zygote.

For arguments of any type except `Number`

& `AbstractArray`

, the result is `nothing`

.

**Examples**

```
julia> diaghessian(x -> sum(x.^3), [1 2; 3 4])[1]
2×2 Matrix{Int64}:
6 12
18 24
julia> Diagonal(vec(ans)) == hessian(x -> sum(x.^3), [1 2; 3 4]) # full Hessian is diagonal
true
julia> diaghessian((x,y) -> sum(x .* y .* y'), [1 22; 333 4], [0.5, 0.666]) # two array arguments
([0.0 0.0; 0.0 0.0], [2.0, 8.0])
julia> diaghessian(atan, 1, 2) # two scalar arguments
(-0.16, 0.16)
julia> hessian(xy -> atan(xy[1], xy[2]), [1, 2]) # full Hessian is not diagonal
2×2 Matrix{Float64}:
-0.16 -0.12
-0.12 0.16
```

Zygote also provides a set of helpful utilities. These are all "user-level" tools – in other words you could have written them easily yourself, but they live in Zygote for convenience.

See `ChainRules.ignore_derivatives`

if you want to exclude some of your code from the gradient calculation. This replaces previous Zygote-specific `ignore`

and `dropgrad`

functionality.

`Zygote.withgradient`

— Function```
withgradient(f, args...)
withgradient(f, ::Params)
```

Returns both the value of the function and the `gradient`

, as a named tuple.

```
julia> y, ∇ = withgradient(/, 1, 2)
(val = 0.5, grad = (0.5, -0.25))
julia> ∇ == gradient(/, 1, 2)
true
```

`Zygote.withjacobian`

— Function`withjacobian(f, args...)`

Returns both the value `f(args...)`

and the `jacobian`

as a named tuple.

```
julia> withjacobian(cumsum, [1,2,3])
(val = [1, 3, 6], grad = ([1 0 0; 1 1 0; 1 1 1],))
```

`Zygote.@showgrad`

— Macro`@showgrad(x) -> x`

Much like `@show`

, but shows the gradient about to accumulate to `x`

. Useful for debugging gradients.

```
julia> gradient(2, 3) do a, b
@showgrad(a)*b
end
∂(a) = 3
(3, 2)
```

Note that the gradient depends on how the output of `@showgrad`

is *used*, and is not the *overall* gradient of the variable `a`

. For example:

```
julia> gradient(2) do a
@showgrad(a)*a
end
∂(a) = 2
(4,)
julia> gradient(2, 3) do a, b
@showgrad(a) # not used, so no gradient
a*b
end
∂(a) = nothing
(3, 2)
```

`Zygote.hook`

— Function`hook(x̄ -> ..., x) -> x`

Gradient hooks. Allows you to apply an arbitrary function to the gradient for `x`

.

```
julia> gradient(2, 3) do a, b
hook(ā -> @show(ā), a)*b
end
ā = 3
(3, 2)
julia> gradient(2, 3) do a, b
hook(-, a)*b
end
(-3, 2)
```

`Zygote.Buffer`

— Type`Buffer(xs, ...)`

`Buffer`

is an array-like type which is mutable when taking gradients. You can construct a `Buffer`

with the same syntax as `similar`

(e.g. `Buffer(xs, 5)`

) and then use normal indexing. Finally, use `copy`

to get back a normal array.

For example:

```
julia> function vstack(xs)
buf = Buffer(xs, length(xs), 5)
for i = 1:5
buf[:, i] = xs
end
return copy(buf)
end
vstack (generic function with 1 method)
julia> vstack([1, 2, 3])
3×5 Array{Int64,2}:
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
julia> gradient(x -> sum(vstack(x)), [1, 2, 3])
([5.0, 5.0, 5.0],)
```

`Buffer`

is not an `AbstractArray`

and can't be used for linear algebra operations like matrix multiplication. This prevents it from being captured by pullbacks.

`copy`

is a semantic copy, but does not allocate memory. Instead the `Buffer`

is made immutable after copying.

`Zygote.forwarddiff`

— Function`forwarddiff(f, x; chunk_threshold = ForwardDiff.DEFAULT_CHUNK_THRESHOLD) -> f(x)`

Runs `f(x)`

as usual, but instructs Zygote to differentiate `f`

using forward mode, rather than the usual reverse mode. The `chunk_threshold`

argument controls the maximum chunk size (c.f. ForwardDiff documentation).

Forward mode takes time linear in `length(x)`

but only has constant memory overhead, and is very efficient for scalars, so in some cases this can be a useful optimisation.

```
julia> function pow(x, n)
r = one(x)
for i = 1:n
r *= x
end
return r
end
pow (generic function with 1 method)
julia> gradient(5) do x
forwarddiff(x) do x
pow(x, 2)
end
end
(10,)
```

Note that the function `f`

will *drop gradients* for any closed-over values.

```
julia> gradient(2, 3) do a, b
forwarddiff(a) do a
a*b
end
end
(3, nothing)
```

This can be rewritten by explicitly passing through `b`

, i.e.

```
gradient(2, 3) do a, b
forwarddiff([a, b]) do (a, b)
a*b
end
end
```

`Zygote.checkpointed`

— Function`checkpointed(f, xs...)`

Use gradient checkpointing on the call `f(xs...)`

. This means that `checkpointed(f, xs...) === f(xs...)`

, but when computing the derivative intermediate results from the forward pass of `f`

will not be stored. Instead the forward pass will be repeated, when computing the derivative. This saves memory at the cost of increasing exectution time.

If `f`

is not a pure function, `checkpointed`

will likely give wrong results.

`Params`

and `Grads`

can be copied to and from arrays using the `copy!`

function.

## Working with Grads

Map, broadcast, and iteration are supported for the dictionary-like `Grads`

objects. These operations are value based and preserve the keys.

```
using Zygote, Test
w, x1, x2, b = rand(2), rand(2), rand(2), rand(2)
gs1 = gradient(() -> sum(tanh.(w .* x1 .+ b)), Params([w, b]))
gs2 = gradient(() -> sum(tanh.(w .* x2 .+ b)), Params([w, b]))
# accumulate gradients
gs = gs1 .+ gs2
@test gs[w] ≈ gs1[w] + gs2[w]
@test gs[b] ≈ gs1[b] + gs2[b]
# gradients and IdDict interact nicely
# note that an IdDict must be used for gradient algebra on the GPU
gs .+= IdDict(p => randn(size(p)) for p in keys(gs))
# clip gradients
map(x -> clamp.(x, -0.1, 0.1), gs)
# clip gradients in-place
foreach(x -> clamp!(x, -0.1, 0.1), gs)
for (p, g) in pairs(gs)
# do something with parameter `p` and corresponding gradient `g`
end
# note that gradients must be w.r.t. to the same parameter key set
gs3 = gradient(() -> sum(tanh.(w .* x2)), Params([w]))
# gs3 does not have the key b
@test_throws ArgumentError gs1 .+ gs3
```