DataLoader
struct defined in module MLUtils
DataLoader(data; [batchsize, buffer, collate, parallel, partial, rng, shuffle])
An object that iterates over mini-batches of `data`, each mini-batch containing `batchsize` observations (except possibly the last one).
Takes as input a single data array, a tuple (or a named tuple) of arrays, or in general any `data` object that implements the `numobs` and `getobs` methods.
The last dimension in each array is the observation dimension, i.e. the one divided into mini-batches.
The original data is preserved in the `data` field of the DataLoader.
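As a small illustration of the observation dimension, the sketch below calls `numobs` and `getobs` directly on a matrix and on a named tuple. The variable names and toy values are made up for this example.

```julia
using MLUtils

# Toy data: 3 features per observation, 4 observations (the last dimension).
X = reshape(collect(1:12), 3, 4)
Y = [10, 20, 30, 40]

n = numobs(X)           # number of observations, i.e. the size of the last dimension
second = getobs(X, 2)   # a single observation: the second column of X
pair = getobs(X, 2:3)   # several observations: columns 2 and 3 of X

# Tuples and named tuples are indexed field by field.
nt = (x = X, y = Y)
obs = getobs(nt, 1)     # a named tuple holding the first observation of each field
```

A DataLoader built on `X` or `nt` slices the data along this same dimension when it forms mini-batches.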
`data`: The data to be iterated over. The data type has to be supported by `numobs` and `getobs`.
`batchsize`: If not positive (`batchsize <= 0`), iterates over individual observations. Otherwise, each iteration (except possibly the last) yields a mini-batch containing `batchsize` observations. Default `1`.
`buffer`: If `buffer=true` and supported by the type of `data`, a buffer will be allocated and reused for memory efficiency. You can also pass a preallocated object to `buffer`. Default `false`.
`collate`: Batching behavior. If `nothing` (default), a batch is `getobs(data, indices)`. If `false`, each batch is `[getobs(data, i) for i in indices]`. When `true`, applies `batch` to the vector of observations in a batch, recursively collating arrays in the last dimensions. See `batch` for more information and examples.
`parallel`: Whether to load data in parallel using worker threads. This can speed up data loading by up to a factor of the number of available threads. Requires starting Julia with multiple threads. Check `Threads.nthreads()` to see the number of available threads. Passing `parallel = true` breaks ordering guarantees. Default `false`.
`partial`: This argument is used only when `batchsize > 0`. If `partial=false` and the number of observations is not divisible by the batchsize, then the last mini-batch is dropped. Default `true`.
`rng`: A random number generator. Default `Random.GLOBAL_RNG`.
`shuffle`: Whether to shuffle the observations before iterating. Unlike wrapping the data container with `shuffleobs(data)`, `shuffle=true` ensures that the observations are shuffled anew every time you start iterating over the DataLoader. Default `false`.
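The keyword arguments above can be tried out on a small array. The following is a minimal sketch, not an exhaustive tour: the array `X`, the batch sizes, and the seed `42` are arbitrary choices for illustration, and `MersenneTwister` is just one possible generator to pass as `rng`.

```julia
using MLUtils, Random

X = reshape(collect(1:20), 2, 10)   # 10 observations with 2 features each

# batchsize and partial: 10 observations in batches of 4 leave a remainder of 2.
sizes_partial = [size(x) for x in DataLoader(X, batchsize=4)]
sizes_full = [size(x) for x in DataLoader(X, batchsize=4, partial=false)]

# collate: the default stacks observations into one array, while
# collate=false yields a vector with one entry per observation.
stacked = first(DataLoader(X, batchsize=4))
listed = first(DataLoader(X, batchsize=4, collate=false))

# shuffle and rng: two loaders with identically seeded generators
# are expected to produce the same first batch.
a = DataLoader(X, batchsize=5, shuffle=true, rng=MersenneTwister(42))
b = DataLoader(X, batchsize=5, shuffle=true, rng=MersenneTwister(42))
same_first_batch = first(a) == first(b)
```

Here `sizes_partial` includes the trailing two-observation batch, whereas `sizes_full` omits it.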
julia> Xtrain = rand(10, 100);
julia> array_loader = DataLoader(Xtrain, batchsize=2);
julia> for x in array_loader
           @assert size(x) == (10, 2)
           # do something with x, 50 times
       end
julia> array_loader.data === Xtrain
true
julia> tuple_loader = DataLoader((Xtrain,), batchsize=2); # similar, but yielding 1-element tuples
julia> for x in tuple_loader
           @assert x isa Tuple{Matrix}
           @assert size(x[1]) == (10, 2)
       end
julia> Ytrain = rand('a':'z', 100); # now make a DataLoader yielding 2-element named tuples
julia> train_loader = DataLoader((data=Xtrain, label=Ytrain), batchsize=5, shuffle=true);
julia> for epoch in 1:100
           for (x, y) in train_loader # access via tuple destructuring
               @assert size(x) == (10, 5)
               @assert size(y) == (5,)
               # loss += f(x, y) # etc, runs 100 * 20 times
           end
       end
julia> first(train_loader).label isa Vector{Char} # access via property name
true
julia> first(train_loader).label == Ytrain[1:5] # because of shuffle=true
false
julia> foreach(println∘summary, DataLoader(rand(Int8, 10, 64), batchsize=30)) # partial=false would omit last
10×30 Matrix{Int8}
10×30 Matrix{Int8}
10×4 Matrix{Int8}
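Beyond arrays and tuples, any object that implements `numobs` and `getobs` can be passed as `data`. The sketch below shows one way this might look. `SquaresDataset` is a hypothetical type invented for the example, and the choice to define `getobs` for both a single index and a vector of indices is an assumption about what the default batching path needs.

```julia
using MLUtils

# Hypothetical container: observation i is the number i^2, computed on demand.
struct SquaresDataset
    n::Int
end

# Number of observations in the container.
MLUtils.numobs(d::SquaresDataset) = d.n

# A single observation, and a batch of observations for a vector of indices.
MLUtils.getobs(d::SquaresDataset, i::Integer) = i^2
MLUtils.getobs(d::SquaresDataset, idxs::AbstractVector{<:Integer}) = [i^2 for i in idxs]

loader = DataLoader(SquaresDataset(5), batchsize=2)
batches = [b for b in loader]   # three batches, the last one holding a single observation
```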