Bottlenecks in data pipelines and how to measure and fix them
When training large deep learning models on a GPU, we want training to complete in as little time as possible. The hardware bottleneck is usually the GPU compute available to you, which means the data pipeline needs to be fast enough to keep the GPU at 100% utilization, that is, to keep it from "starving". Reducing the time the GPU has to wait for the next batch of data directly lowers training time, up to the point where the GPU is fully utilized. There are other ways to reduce training time, like hyperparameter schedules and different optimizers for faster convergence, but here we'll only talk about improving GPU utilization.
The main cause of low GPU utilization is that the next batch of data is not available after a training step and the GPU has to wait. This means that in order to get full GPU utilization:

1. loading a batch must not take longer than a training step; and
2. the data must be loaded in the background, so that it is ready the moment the GPU needs it.
These issues can be addressed by

1. using worker threads to load multiple batches in parallel, keeping the primary thread free; and
2. reducing the time it takes to load a single batch.
FastAI.jl by default uses `DataLoader` from the DataLoaders.jl package, which addresses points 1. and 2. For those familiar with PyTorch, it closely resembles `torch.utils.data.DataLoader`. It also efficiently collates the data by reusing a buffer where supported.
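For reference, basic usage looks something like the following sketch; the array is just a stand-in data container and the variable names are made up for illustration:

```julia
using DataLoaders

# A plain array works as a data container; observations are indexed along the
# last dimension, so this stands in for 10_000 observations of size 128.
dummydata = rand(Float32, 128, 10_000)
dataloader = DataLoader(dummydata, 16)  # collated batches of 16 observations

for batch in dataloader
    # `batch` is a 128×16 array; a real training loop would move it to the GPU
    # and run an optimization step here.
end
```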
We can measure the large performance difference by comparing a naive sequential data iterator with `eachobsparallel`, the data iterator that `DataLoader` uses.
```julia
using DataLoaders: batchviewcollated
using FastAI
using FastAI.Datasets

data, blocks = load(datarecipes()["imagenette2-320"])
task = ImageClassificationSingle(blocks, size=(224, 224))

# maps data processing over `data`
taskdata = taskdataset(data, task, Training())
# creates a data container of collated batches
batchdata = batchviewcollated(taskdata, 16)

NBATCHES = 200

# sequential data iterator
@time for (i, batch) in enumerate(getobs(batchdata, i) for i in 1:numobs(batchdata))
    i != NBATCHES || break
end

# parallel data iterator
@time for (i, batch) in enumerate(eachobsparallel(batchdata))
    i != NBATCHES || break
end
```
Running each timer twice to exclude compilation time, the sequential iterator takes 20 seconds, while the parallel iterator using 11 background threads takes only 2.5 seconds. This certainly isn't a proper benchmark, but it shows that performance can be improved by an order of magnitude with little effort.
Besides increasing the amount of compute available with worker threads as above, data loading performance can also be improved by reducing the time it takes to load a single batch. Since a batch is made up of some number of observations, this usually boils down to reducing the loading time of a single observation. If you're using the `LearningTask` API, this can be further broken down into a loading part and an encoding part.
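Conceptually (a simplified sketch, not the exact internals), producing one model-ready observation consists of these two steps, using the `data` and `task` defined in the snippet above:

```julia
# 1. loading: read the raw `(image, class)` observation from the data container
sample = getobs(data, 1)
# 2. encoding: turn it into a model-ready `(x, y)` pair
x, y = encodesample(task, Training(), sample)
```

Each step can be measured and optimized separately, as shown further down.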
So how do you know if your GPU is underutilized? If it isn't, then improving data pipeline performance won't help you at all! One way to check this is to start training and run
```sh
watch -n 0.1 nvidia-smi
```
in a terminal, which displays and refreshes the GPU stats every 1/10th of a second. If `GPU-Util` stays between 90% and 99%, you're good!
If that's not the case, you might see it frantically jumping up and down. We can get a better estimate of how much training time can be sped up by running the following experiment:
1. Load one batch and run `n` optimization steps on it. The time this takes corresponds to the training time when the GPU never has to wait for data to be available.
2. Next, take your data iterator and time iterating over the first `n` batches without an optimization step.
The time for the complete training loop (data loading and optimization) will be around the maximum of the two measurements. Roughly speaking, if 1. takes 100 seconds and 2. takes 200 seconds, you know you can speed up training by about a factor of 2 by halving the data loading time, after which the GPU becomes the bottleneck.
```julia
using FastAI
using FastAI.Datasets
using FluxTraining: step!

data, blocks = load(datarecipes()["imagenette2-320"])
task = ImageClassificationSingle(blocks)
learner = tasklearner(task, data)

NBATCHES = 100

# Measure GPU time: load a single batch once and repeatedly step on it
batch = gpu(first(learner.data.training))
learner.model = gpu(learner.model)
@time for i in 1:NBATCHES
    step!(learner, TrainingPhase(), batch)
end

# Measure data loading time: iterate over the first NBATCHES batches without stepping
@time for (batch, i) in zip(learner.data.training, 1:NBATCHES)
end
```
Again, make sure to run each measurement twice so you don't include the compilation time.
To find performance bottlenecks in the loading of each observation, you'll want to compare the time it takes to load an observation from the data container with the time it takes to encode that observation.
```julia
using BenchmarkTools
using FastAI
using FastAI.Datasets, FastAI.MLUtils

# Since loading times can vary per observation, we'll average the
# measurements over multiple observations.
N = 10
data = MLUtils.ObsView(data, 1:N)

# Time it takes to load an `(image, class)` observation
@btime for i in 1:N
    getobs(data, i)
end

# Time it takes to encode an `(image, class)` observation into `(x, y)`
obss = [getobs(data, i) for i in 1:N]
@btime for i in 1:N
    encodesample(task, Training(), obss[i])
end
```
This will give you a pretty good idea of where the performance bottleneck is. Note that the encoding performance often depends on the task configuration; if we used `ImageClassification` with input size `(64, 64)`, it would be much faster.
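To check this yourself, you could rerun the encoding benchmark with a smaller input size. This is a sketch assuming the `blocks`, `obss`, and `N` defined above; `task_small` is just an illustrative name:

```julia
# Same benchmark as above, but with a much smaller target image size
task_small = ImageClassificationSingle(blocks, size=(64, 64))

@btime for i in 1:N
    encodesample(task_small, Training(), obss[i])
end
```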
So, you've identified the data pipeline as a performance bottleneck. What now? Before anything else, make sure you're doing the following:
- Use `DataLoaders.DataLoader` as a data iterator. If you're using `taskdataloaders` or `tasklearner`, this is already the case.
- Start Julia with multiple threads by passing the `-t n` or `-t auto` flag when launching it. If it worked, `Threads.nthreads()` should be larger than `1` (see the snippet after this list).
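A quick way to verify the thread setup from within Julia:

```julia
# If this fails, restart Julia with e.g. `julia -t auto` so that worker
# threads are available for data loading.
@assert Threads.nthreads() > 1 "Start Julia with the `-t n` or `-t auto` flag"
```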
If the data loading is still slowing down training, you'll probably have to speed up the loading of each observation. As mentioned above, this can be broken down into observation loading and encoding. The exact strategy will depend on your use case, but here are some examples.
For many computer vision tasks, you will resize and crop images to a specific size during training for GPU performance reasons. If the images themselves are large, loading them from disk can itself take some time. If your dataset consists of 1920x1080 resolution images but you're resizing them to 256x256 during training, you're wasting a lot of time loading the large images. Presizing means saving resized versions of each image to disk once, and then loading these smaller versions during training. We can see the performance difference using ImageNette, since it comes in 3 sizes: original, 320px, and 160px.
```julia
# Full-size images
data_orig, _ = load(datarecipes()["imagenette2"])
@time for _ in eachobsparallel(data_orig, buffered=false) end

# Presized to 320px
data_320px, _ = load(datarecipes()["imagenette2-320"])
@time for _ in eachobsparallel(data_320px, buffered=false) end

# Presized to 160px
data_160px, _ = load(datarecipes()["imagenette2-160"])
@time for _ in eachobsparallel(data_160px, buffered=false) end
```
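If your own dataset only comes in full resolution, presizing can be a one-off script. Here is a hedged sketch using Images.jl; the flat folder layout, the `presize` helper name, and the target size are assumptions for illustration, not FastAI.jl's own presizing tooling:

```julia
using Images, FileIO

# Resize every image in `srcdir` once so its smaller side is `smallside` pixels
# and save the result to `dstdir`; point your data container at `dstdir` during
# training.
function presize(srcdir, dstdir; smallside=320)
    mkpath(dstdir)
    for name in readdir(srcdir)
        img = load(joinpath(srcdir, name))
        small = imresize(img; ratio=smallside / minimum(size(img)))
        save(joinpath(dstdir, name), small)
    end
end

# presize("path/to/originals", "path/to/presized")
```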
When implementing the `LearningTask` interface, you have the option to implement `encode!(buf, task, context, sample)`, an in-place version of `encode` that reuses a buffer to avoid allocations. Reducing allocations often speeds up the encoding step and can also reduce the frequency of garbage collector pauses during training, which would otherwise lower GPU utilization.
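The pattern itself is not specific to FastAI.jl. Here is a generic illustration of allocating vs. buffered encoding; the function names are made up, and the normalization is arbitrary (see the FastAI.jl reference for the exact `encode!` interface):

```julia
# Allocating version: creates a new array on every call
encode_alloc(image) = (Float32.(image) .- 0.5f0) ./ 0.25f0

# Buffered version: writes the result into a preallocated buffer
function encode_buffered!(buf, image)
    @. buf = (Float32(image) - 0.5f0) / 0.25f0
    return buf
end

img = rand(Float32, 224, 224, 3)
buf = similar(img)
encode_buffered!(buf, img)  # later calls reuse the same `buf`, avoiding allocations
```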
Many kinds of augmentation can be composed efficiently. Prime examples are image transformations like resizing, scaling, and cropping, which are powered by DataAugmentation.jl. See its documentation to find out how to implement efficient, composable data transformations.
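As a rough sketch of what such a composed pipeline can look like (check DataAugmentation.jl's documentation for the authoritative API; the transform choice and sizes here are arbitrary):

```julia
using DataAugmentation
using Images  # for the dummy `RGB` image below

# Composed projective transforms like scale-and-crop are fused, so intermediate
# images are never fully materialized.
tfm = ScaleKeepAspect((224, 224)) |> CenterCrop((224, 224))

item = Image(rand(RGB{N0f8}, 320, 480))  # stand-in for a loaded image
titem = apply(tfm, item)
x = itemdata(titem)                      # the transformed 224×224 image
```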