Working with Data, using MLUtils.jl

Flux re-exports the DataLoader type and utility functions for working with data from MLUtils.

`DataLoader`

The DataLoader can be used to create mini-batches of data, in the format train! expects.

MLUtils.DataLoader — Type

DataLoader(data; [batchsize, buffer, collate, parallel, partial, rng, shuffle])

An object that iterates over mini-batches of data, each mini-batch containing batchsize observations (except possibly the last one).

Takes as input a single data array, a tuple (or a named tuple) of arrays, or in general any data object that implements the numobs and getobs methods.

The last dimension in each array is the observation dimension, i.e. the one divided into mini-batches.

The original data is preserved in the data field of the DataLoader.

Arguments

data: The data to be iterated over. The data type has to be supported by numobs and getobs.
batchsize: If less than 0, iterates over individual observations. Otherwise, each iteration (except possibly the last) yields a mini-batch containing batchsize observations. Default 1.
buffer: If buffer=true and supported by the type of data, a buffer will be allocated and reused for memory efficiency. You can also pass a preallocated object to buffer. Default false.
collate: Batching behavior. If nothing (default), a batch is getobs(data, indices). If false, each batch is [getobs(data, i) for i in indices]. When true, applies batch to the vector of observations in a batch, recursively collating arrays in the last dimensions. See batch for more information and examples.
parallel: Whether to use load data in parallel using worker threads. Greatly speeds up data loading by factor of available threads. Requires starting Julia with multiple threads. Check Threads.nthreads() to see the number of available threads. Passing parallel = true breaks ordering guarantees. Default false.
partial: This argument is used only when batchsize > 0. If partial=false and the number of observations is not divisible by the batchsize, then the last mini-batch is dropped. Default true.
rng: A random number generator. Default Random.GLOBAL_RNG.
shuffle: Whether to shuffle the observations before iterating. Unlike wrapping the data container with shuffleobs(data), shuffle=true ensures that the observations are shuffled anew every time you start iterating over eachobs. Default false.

Examples

julia> Xtrain = rand(10, 100);

julia> array_loader = DataLoader(Xtrain, batchsize=2);

julia> for x in array_loader
         @assert size(x) == (10, 2)
         # do something with x, 50 times
       end

julia> array_loader.data === Xtrain
true

julia> tuple_loader = DataLoader((Xtrain,), batchsize=2);  # similar, but yielding 1-element tuples

julia> for x in tuple_loader
         @assert x isa Tuple{Matrix}
         @assert size(x[1]) == (10, 2)
       end

julia> Ytrain = rand('a':'z', 100);  # now make a DataLoader yielding 2-element named tuples

julia> train_loader = DataLoader((data=Xtrain, label=Ytrain), batchsize=5, shuffle=true);

julia> for epoch in 1:100
         for (x, y) in train_loader  # access via tuple destructuring
           @assert size(x) == (10, 5)
           @assert size(y) == (5,)
           # loss += f(x, y) # etc, runs 100 * 20 times
         end
       end

julia> first(train_loader).label isa Vector{Char}  # access via property name
true

julia> first(train_loader).label == Ytrain[1:5]  # because of shuffle=true
false

julia> foreach(println∘summary, DataLoader(rand(Int8, 10, 64), batchsize=30))  # partial=false would omit last
10×30 Matrix{Int8}
10×30 Matrix{Int8}
10×4 Matrix{Int8}

Utility Functions

The utility functions are meant to be used while working with data; these functions help create inputs for your models or batch your dataset.

MLUtils.batch — Function

batch(xs)

Batch the arrays in xs into a single array with an extra dimension.

If the elements of xs are tuples, named tuples, or dicts, the output will be of the same type.