GPU Support
Most work on neural networks involves the use of GPUs, as they can typically perform the required computation much faster. This page describes how Flux co-operates with various other packages, which talk to GPU hardware.
For those in a hurry, see the quickstart page. Or do using CUDA and then call gpu on both the model and the data.
Basic GPU use: from Array to CuArray
Julia's GPU packages work with special array types, in place of the built-in Array. The most used is CuArray, provided by CUDA.jl for GPUs made by NVIDIA. That package provides a function cu which converts an ordinary Array (living in CPU memory) to a CuArray (living in GPU memory). Functions like * and broadcasting specialise so that, when given CuArrays, all the computation happens on the GPU:
W = randn(3, 4) # some weights, on CPU: 3×4 Array{Float64, 2}
x = randn(4) # fake data
y = tanh.(W * x) # computation on the CPU
using CUDA
cu(W) isa CuArray{Float32}
(cW, cx) = (W, x) |> cu # move both to GPU
cy = tanh.(cW * cx) # computation on the GPU
Notice that cu doesn't only move arrays: it also recurses into many structures, such as the tuple (W, x) above. (Notice also that it converts Julia's default Float64 numbers to Float32, as this is what most GPUs support efficiently – it calls itself "opinionated". Flux defaults to Float32 in all cases.)
To use CUDA with Flux, you can simply use cu to move both the model and the data. It will create a copy of the Flux model, with all of its parameter arrays moved to the GPU:
using Pkg; Pkg.add(["CUDA", "cuDNN"]) # do this once
using Flux, CUDA
CUDA.allowscalar(false) # recommended
model = Dense(W, true, tanh) # wrap the same matrix W in a Flux layer
model(x) ≈ y # same result, still on CPU
c_model = cu(model) # move all the arrays within model to the GPU
c_model(cx) # computation on the GPU
Notice that you need using CUDA (every time) but also ] add cuDNN (once, when installing packages). This is a quirk of how these packages are set up. (The cuDNN.jl sub-package handles operations such as convolutions, called by Flux via NNlib.jl.)
Flux's gradient, and training functions like setup, update!, and train!, are all equally happy to accept GPU arrays and GPU models, and then perform all computations on the GPU. It is recommended that you move the model to the GPU before calling setup.
grads = Flux.gradient((f,x) -> sum(abs2, f(x)), model, x) # on CPU
c_grads = Flux.gradient((f,x) -> sum(abs2, f(x)), c_model, cx) # same result, all on GPU
c_opt = Flux.setup(Adam(), c_model) # setup optimiser after moving model to GPU
Flux.update!(c_opt, c_model, c_grads[1]) # mutates c_model but not model
To move arrays and other objects back to the CPU, Flux provides a function cpu. This is recommended when saving models, e.g. Flux.state(c_model |> cpu); see below.
cpu(cW) isa Array{Float32, 2}
model2 = cpu(c_model) # copy model back to CPU
model2(x)
Old versions of Flux automatically loaded CUDA.jl to provide GPU support. Starting from Flux v0.14, it has to be loaded separately. Julia's package extensions allow Flux to automatically load some GPU-specific code when needed.
Other GPU packages for AMD & Apple
Non-NVIDIA graphics cards are supported by other packages. Each provides its own function which behaves like cu. AMD GPU support is provided by AMDGPU.jl, on systems with ROCm and MIOpen installed. This package has a function roc which converts Array to ROCArray:
using Flux, AMDGPU
AMDGPU.allowscalar(false)
r_model = roc(model)
r_model(roc(x))
Flux.gradient((f,x) -> sum(abs2, f(x)), r_model, roc(x))
Experimental support for Apple devices with M-series chips is provided by Metal.jl. This has a function mtl which works like cu, converting Array to MtlArray:
using Flux, Metal
Metal.allowscalar(false)
m_model = mtl(model)
m_y = m_model(mtl(x))
Flux.gradient((f,x) -> sum(abs2, f(x)), m_model, mtl(x))
Metal support in Flux is experimental and many features are not yet available. AMD support is improving, but likely to have more rough edges than CUDA.
If you want your model to work with any brand of GPU, or none, then you may not wish to write cu everywhere. One simple way to be generic is, at the top of the file, to un-comment one of several lines which import a package and assign its "adaptor" to the same name:
using CUDA: cu as device # after this, `device === cu`
# using AMDGPU: roc as device
# device = identity # do-nothing, for CPU
using Flux
model = Chain(...) |> device
The functions cu, mtl, and roc all use Adapt.jl to work within various wrappers. The reason they work on Flux models is that Flux.@layer Layer defines methods of Adapt.adapt_structure(to, lay::Layer).
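As an illustration, here is a minimal sketch using a hypothetical custom layer (the name Affine and its fields are ours, not part of Flux). Once Flux.@layer has been called, cu recurses into it just as it does for built-in layers:

using Flux, CUDA

struct Affine   # a hypothetical layer, purely for illustration
    W
    b
end
Flux.@layer Affine                  # among other things, defines Adapt.adapt_structure for Affine
(a::Affine)(x) = a.W * x .+ a.b     # forward pass

layer = Affine(randn(Float32, 3, 4), zeros(Float32, 3))
c_layer = cu(layer)                 # both W and b are now CuArrays, on the GPU
c_layer(cu(randn(Float32, 4)))      # computation on the GPU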
Automatic GPU choice with gpu and gpu_device
Flux also provides a more automatic way of choosing which GPU (or none) to use. This is the function gpu:
- By default it does nothing.
- If the package CUDA is loaded, and CUDA.functional() === true, then it behaves like cu.
- If the package AMDGPU is loaded, and AMDGPU.functional() === true, then it behaves like roc.
- If the package Metal is loaded, and Metal.functional() === true, then it behaves like mtl.
- If two different GPU packages are loaded, the first one takes priority.
For the most part, this means that a script which says model |> gpu and data |> gpu will just work. It should always run, and if a GPU package is loaded (and finds the correct hardware) then that will be used.
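For example, a minimal sketch (the layer sizes and data here are arbitrary):

using Flux, CUDA   # or AMDGPU, or Metal, or no GPU package at all

model = Chain(Dense(3 => 2, relu), Dense(2 => 1)) |> gpu
x = rand(Float32, 3, 16) |> gpu     # a fake batch of 16 samples
model(x)                            # runs on the GPU if one was found, otherwise on the CPU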
The function gpu uses a lower-level function called gpu_device from MLDataDevices.jl, which checks what to do and then returns some device object. In fact, the entire implementation is just this:
gpu(x) = gpu_device()(x)
cpu(x) = cpu_device()(x)
Automatic backend selection through gpu is not type-stable. That doesn't matter if you do it once, or once per large batch – it costs a few microseconds. But it might matter if you do it within some loop.
To avoid this, you can first obtain a "device object" with device = gpu_device(), once, and then use this as the function to transfer data. Something like this:
to_device = gpu_device()
gpu_model = model |> to_device
for epoch in 1:num_epochs
    for (x, y) in dataloader
        x_gpu, y_gpu = (x, y) |> to_device
        # training code...
    end
end
Finally, setting a backend preference with gpu_backend! gives type stability to the whole pipeline.
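For instance, something like this (a sketch: it writes a preference to LocalPreferences.toml, so it only needs to be done once, and takes effect in the next Julia session):

using Flux
Flux.gpu_backend!("CUDA")   # or "AMDGPU", "Metal", "CPU"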
Transferring Training Data
In order to train the model using the GPU, both the model and the training data have to be transferred to GPU memory. Moving the data can be done in two different ways:
1. Iterating over the batches in a DataLoader object, transferring each one of the training batches to the GPU one at a time. This is recommended for large datasets. Done by hand, it might look like this:

train_loader = Flux.DataLoader((X, Y), batchsize=64, shuffle=true)
# ... model definition, optimiser setup
for epoch in 1:epochs
    for (x_cpu, y_cpu) in train_loader
        x = gpu(x_cpu)
        y = gpu(y_cpu)
        grads = gradient(m -> loss(m, x, y), model)
        Flux.update!(opt_state, model, grads[1])
    end
end
Rather than write this out every time, you can just call gpu(::DataLoader):

gpu_train_loader = Flux.DataLoader((X, Y), batchsize=64, shuffle=true) |> gpu
# ... model definition, optimiser setup
for epoch in 1:epochs
    for (x, y) in gpu_train_loader
        grads = gradient(m -> loss(m, x, y), model)
        Flux.update!(opt_state, model, grads[1])
    end
end
This is equivalent to DataLoader(MLUtils.mapobs(gpu, (X, Y)); keywords...). Something similar can also be done with CUDA.CuIterator, gpu_train_loader = CUDA.CuIterator(train_loader). However, this only works with a limited number of data types: first(train_loader) should be a tuple (or NamedTuple) of arrays.

2. Transferring all training data to the GPU at once before creating the DataLoader. This is usually performed for smaller datasets which are sure to fit in the available GPU memory.

gpu_train_loader = Flux.DataLoader((X, Y) |> gpu, batchsize = 32)
# ...
for epoch in 1:epochs
    for (x, y) in gpu_train_loader
        # ...
    end
end
Here (X, Y) |> gpu applies gpu to both arrays, as it recurses into structures.
Saving GPU-Trained Models
After the training process is done, we must always transfer the trained model back to CPU memory before serializing or saving it to disk. This can be done with cpu:
model = cpu(model) # or model = model |> cpu
and then
using BSON
# ...
BSON.@save "./path/to/trained_model.bson" model
# in this approach the cpu-transferred model (referenced by the variable `model`)
# only exists inside the `let` statement
let model = cpu(model)
# ...
BSON.@save "./path/to/trained_model.bson" model
end
# is equivalent to the above, but uses `key=value` storing directive from BSON.jl
BSON.@save "./path/to/trained_model.bson" model = cpu(model)
The reason behind this is that models trained on the GPU but not transferred back to CPU memory will expect CuArrays as input. In other words, Flux models expect input data coming from the same kind of device they were trained on.
In controlled scenarios, in which the data fed to the loaded models is guaranteed to be on the GPU, there is no need to transfer it back to CPU memory. However, in production environments, where artifacts are shared among different processes, equipment, or configurations, there is no guarantee that the CUDA.jl package will be available for the process performing inference on the model loaded from disk.
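For example, a CPU-only process can then reload and run the saved model without CUDA.jl at all (a minimal sketch, assuming the file saved above and the 4-feature model from the earlier examples):

using Flux, BSON            # note: no CUDA loaded here
BSON.@load "./path/to/trained_model.bson" model
model(rand(Float32, 4))     # ordinary Array input, computed on the CPU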
Disabling CUDA or choosing which GPUs are visible to Flux
Sometimes it is required to control which GPUs are visible to julia on a system with multiple GPUs, or to disable GPUs entirely. This can be achieved with the environment variable CUDA_VISIBLE_DEVICES.
To disable all devices:
$ export CUDA_VISIBLE_DEVICES='-1'
To select specific devices by device id:
$ export CUDA_VISIBLE_DEVICES='0,1'
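The same variable can also be set from inside Julia, provided this happens before CUDA.jl is loaded and initialised (a sketch of one common approach):

ENV["CUDA_VISIBLE_DEVICES"] = "0"   # or "-1" to hide all devices
using CUDA, Flux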
More information for conditional use of GPUs in CUDA.jl can be found in its documentation, and information about the specific use of the variable is described in the Nvidia CUDA blog post.
Data movement across GPU devices
Flux also supports getting handles to specific GPU devices, and transferring models from one GPU device to another GPU device from the same backend. Let's try it out for NVIDIA GPUs. First, we list all the available devices:
julia> using Flux, CUDA;
julia> CUDA.devices()
CUDA.DeviceIterator() for 3 devices:
0. NVIDIA TITAN RTX
1. NVIDIA TITAN RTX
2. NVIDIA TITAN RTX
Then, let's select the device with id 0:
julia> device0 = gpu_device(1)
(::CUDADevice{CuDevice}) (generic function with 4 methods)
julia> device0.device
CuDevice(0): NVIDIA TITAN RTX
Notice that indexing starts from 0 in the CUDA.devices() output, but gpu_device expects the device id starting from 1.
Then, let's move a simple dense layer to the GPU represented by device0:
julia> dense_model = Dense(2 => 3)
Dense(2 => 3) # 9 parameters
julia> dense_model = dense_model |> device0;
julia> dense_model.weight
3×2 CuArray{Float32, 2, CUDA.DeviceMemory}:
-0.142062 -0.131455
-0.828134 -1.06552
0.608595 -1.05375
julia> CUDA.device(dense_model.weight) # check the GPU to which dense_model is attached
CuDevice(0): NVIDIA TITAN RTX
Next, we'll get a handle to the device with id 1, and move dense_model to that device:
julia> device1 = gpu_device(2)
(::CUDADevice{CuDevice}) (generic function with 4 methods)
julia> dense_model = dense_model |> device1; # don't directly print the model; see warning below
julia> CUDA.device(dense_model.weight)
CuDevice(1): NVIDIA TITAN RTX
Due to a limitation in Metal.jl, currently this kind of data movement across devices is only supported for the CUDA and AMDGPU backends.
Distributed data parallel training
Distributed support is experimental and could change in the future.
Flux now supports distributed data parallel training with the DistributedUtils module. If you want to run your code on multiple GPUs, you have to install MPI.jl (see its docs for more info).
julia> using MPI
julia> MPI.install_mpiexecjl()
Now you can run your code with mpiexecjl --project=. -n <np> julia <filename>.jl from the command line.
You can use either the MPIBackend or the NCCLBackend, the latter only if NCCL.jl is also loaded. First, initialize a backend with DistributedUtils.initialize, e.g.
julia> using Flux, MPI, NCCL, CUDA
julia> CUDA.allowscalar(false)
julia> DistributedUtils.initialize(NCCLBackend)
julia> backend = DistributedUtils.get_distributed_backend(NCCLBackend)
NCCLBackend{Communicator, MPIBackend{MPI.Comm}}(Communicator(Ptr{NCCL.LibNCCL.ncclComm} @0x000000000607a660), MPIBackend{MPI.Comm}(MPI.Comm(1140850688)))
Pass your model, as well as any data, to the GPU device.
julia> model = Chain(Dense(1 => 256, tanh), Dense(256 => 1)) |> gpu
Chain(
Dense(1 => 256, tanh), # 512 parameters
Dense(256 => 1), # 257 parameters
) # Total: 4 arrays, 769 parameters, 744 bytes.
julia> x = rand(Float32, 1, 16) |> gpu
1×16 CUDA.CuArray{Float32, 2, CUDA.DeviceMemory}:
0.239324 0.331029 0.924996 0.55593 0.853093 0.874513 0.810269 0.935858 0.477176 0.564591 0.678907 0.729682 0.96809 0.115833 0.66191 0.75822
julia> y = x .^ 3
1×16 CUDA.CuArray{Float32, 2, CUDA.DeviceMemory}:
0.0137076 0.0362744 0.791443 0.171815 0.620854 0.668804 0.53197 0.819654 0.108651 0.179971 0.312918 0.388508 0.907292 0.00155418 0.29 0.435899
In this case, we are training on a total of 16 * number of processes samples. You can also use DistributedUtils.DistributedDataContainer to split the data uniformly across processes (or do it manually).
julia> data = DistributedUtils.DistributedDataContainer(backend, x)
Flux.DistributedUtils.DistributedDataContainer(Float32[0.23932439 0.33102947 … 0.66191036 0.75822026], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16])
You have to wrap your model in DistributedUtils.FluxDistributedModel and synchronize it (broadcast it across all processes):
julia> model = DistributedUtils.synchronize!!(backend, DistributedUtils.FluxDistributedModel(model); root=0)
Chain(
Dense(1 => 256, tanh), # 512 parameters
Dense(256 => 1), # 257 parameters
) # Total: 4 arrays, 769 parameters, 744 bytes.
Time to set up an optimizer by using DistributedUtils.DistributedOptimizer and synchronize it as well.
julia> using Optimisers
julia> opt = DistributedUtils.DistributedOptimizer(backend, Optimisers.Adam(0.001f0))
DistributedOptimizer{MPIBackend{Comm}}(MPIBackend{Comm}(Comm(1140850688)), Adam(0.001, (0.9, 0.999), 1.0e-8))
julia> st_opt = Optimisers.setup(opt, model)
(layers = ((weight = Leaf(DistributedOptimizer{MPIBackend{Comm}}(MPIBackend{Comm}(Comm(1140850688)), Adam(0.001, (0.9, 0.999), 1.0e-8)), (Float32[0.0; 0.0; … ; 0.0; 0.0;;], Float32[0.0; 0.0; … ; 0.0; 0.0;;], (0.9, 0.999))), bias = Leaf(DistributedOptimizer{MPIBackend{Comm}}(MPIBackend{Comm}(Comm(1140850688)), Adam(0.001, (0.9, 0.999), 1.0e-8)), (Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 … 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 … 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], (0.9, 0.999))), σ = ()), (weight = Leaf(DistributedOptimizer{MPIBackend{Comm}}(MPIBackend{Comm}(Comm(1140850688)), Adam(0.001, (0.9, 0.999), 1.0e-8)), (Float32[0.0 0.0 … 0.0 0.0], Float32[0.0 0.0 … 0.0 0.0], (0.9, 0.999))), bias = Leaf(DistributedOptimizer{MPIBackend{Comm}}(MPIBackend{Comm}(Comm(1140850688)), Adam(0.001, (0.9, 0.999), 1.0e-8)), (Float32[0.0], Float32[0.0], (0.9, 0.999))), σ = ())),)
julia> st_opt = DistributedUtils.synchronize!!(backend, st_opt; root=0)
(layers = ((weight = Leaf(DistributedOptimizer{MPIBackend{Comm}}(MPIBackend{Comm}(Comm(1140850688)), Adam(0.001, (0.9, 0.999), 1.0e-8)), (Float32[0.0; 0.0; … ; 0.0; 0.0;;], Float32[0.0; 0.0; … ; 0.0; 0.0;;], (0.9, 0.999))), bias = Leaf(DistributedOptimizer{MPIBackend{Comm}}(MPIBackend{Comm}(Comm(1140850688)), Adam(0.001, (0.9, 0.999), 1.0e-8)), (Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 … 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 … 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], (0.9, 0.999))), σ = ()), (weight = Leaf(DistributedOptimizer{MPIBackend{Comm}}(MPIBackend{Comm}(Comm(1140850688)), Adam(0.001, (0.9, 0.999), 1.0e-8)), (Float32[0.0 0.0 … 0.0 0.0], Float32[0.0 0.0 … 0.0 0.0], (0.9, 0.999))), bias = Leaf(DistributedOptimizer{MPIBackend{Comm}}(MPIBackend{Comm}(Comm(1140850688)), Adam(0.001, (0.9, 0.999), 1.0e-8)), (Float32[0.0], Float32[0.0], (0.9, 0.999))), σ = ())),)
Now you can define the loss and train the model. Note that mean comes from Statistics and withgradient is used from Zygote, so both need to be loaded.

julia> using Statistics, Zygote

julia> loss(model) = mean((model(x) .- y).^2)
loss (generic function with 1 method)
julia> for epoch in 1:100
global model, st_opt
l, grad = Zygote.withgradient(loss, model)
println("Epoch $epoch: Loss $l")
st_opt, model = Optimisers.update(st_opt, model, grad[1])
end
Epoch 1: Loss 0.011638729
Epoch 2: Loss 0.0116432225
Epoch 3: Loss 0.012763695
...
Remember that in order to run it on multiple GPUs you have to run it from the command line with mpiexecjl --project=. -n <np> julia <filename>.jl, where <np> is the number of processes that you want to use. The number of processes usually corresponds to the number of GPUs.
By default the MPI installation used by MPI.jl is not CUDA-aware, so if you want to run it in CUDA-aware mode, read more here on custom installation and rebuilding MPI.jl. Then test whether your MPI is CUDA-aware:
julia> import Pkg
julia> Pkg.test("MPI"; test_args=["--backend=CUDA"])
If it is, set your local preference as below:
julia> using Preferences
julia> set_preferences!("Flux", "FluxDistributedMPICUDAAware" => true)
We don't run CUDA-aware tests, so you're running this at your own risk.
Checking GPU Availability
By default, Flux will run checks on your system to see if it can support GPU functionality. You can check whether Flux identified a valid GPU setup by typing the following:
julia> using CUDA
julia> CUDA.functional()
true
For AMD GPU:
julia> using AMDGPU
julia> AMDGPU.functional()
true
julia> AMDGPU.functional(:MIOpen)
true
For Metal GPU:
julia> using Metal
julia> Metal.functional()
true