Working with data using MLUtils.jl

Flux re-exports from MLUtils.jl the DataLoader type and various utility functions for working with data.
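
For example, once Flux is loaded the DataLoader constructor can be used directly (a minimal illustration of the re-export described above, using random data):

julia> using Flux

julia> loader = DataLoader(rand(4, 8), batchsize=2);  # no explicit `using MLUtils` needed

julia> length(loader)  # 8 observations in batches of 2
4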

DataLoader

DataLoader can be used to handle iteration over mini-batches of data.

For more information, see the dedicated DataLoader tutorial on Flux's website.

MLUtils.DataLoader (Type)
DataLoader(data; [batchsize, buffer, collate, parallel, partial, rng, shuffle])

An object that iterates over mini-batches of data, each mini-batch containing batchsize observations (except possibly the last one).

Takes as input a single data array, a tuple (or a named tuple) of arrays, or in general any data object that implements the numobs and getobs methods.

The last dimension in each array is the observation dimension, i.e. the one divided into mini-batches.

The original data is preserved in the data field of the DataLoader.

Arguments

  • data: The data to be iterated over. The data type has to be supported by numobs and getobs.
  • batchsize: If less than 0, iterates over individual observations. Otherwise, each iteration (except possibly the last) yields a mini-batch containing batchsize observations. Default 1.
  • buffer: If buffer=true and supported by the type of data, a buffer will be allocated and reused for memory efficiency. You can also pass a preallocated object to buffer. Default false.
  • collate: Batching behavior. If nothing (default), a batch is getobs(data, indices). If false, each batch is [getobs(data, i) for i in indices]. If true, applies batch to the vector of observations in a batch, recursively collating arrays in the last dimensions. See batch for more information and examples, and the sketch after this list.
  • parallel: Whether to load data in parallel using worker threads. This can greatly speed up data loading, by a factor of the number of available threads. Requires starting Julia with multiple threads; check Threads.nthreads() to see how many are available. Passing parallel = true breaks ordering guarantees. Default false.
  • partial: This argument is used only when batchsize > 0. If partial=false and the number of observations is not divisible by the batchsize, then the last mini-batch is dropped. Default true.
  • rng: A random number generator. Default Random.GLOBAL_RNG.
  • shuffle: Whether to shuffle the observations before iterating. Unlike wrapping the data container with shuffleobs(data), shuffle=true ensures that the observations are shuffled anew every time you start iterating over the DataLoader. Default false.
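
As a rough sketch of how batchsize and collate interact (illustrative code using random data, not part of the original docstring):

julia> X = rand(Float32, 10, 6);  # 6 observations, each a 10-element column

julia> size(first(DataLoader(X, batchsize=3)))  # default collate: each batch is a single 10×3 array
(10, 3)

julia> length(first(DataLoader(X, batchsize=3, collate=false)))  # each batch is a vector of 3 observations
3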

Examples

julia> Xtrain = rand(10, 100);

julia> array_loader = DataLoader(Xtrain, batchsize=2);

julia> for x in array_loader
         @assert size(x) == (10, 2)
         # do something with x, 50 times
       end

julia> array_loader.data === Xtrain
true

julia> tuple_loader = DataLoader((Xtrain,), batchsize=2);  # similar, but yielding 1-element tuples

julia> for x in tuple_loader
         @assert x isa Tuple{Matrix}
         @assert size(x[1]) == (10, 2)
       end

julia> Ytrain = rand('a':'z', 100);  # now make a DataLoader yielding 2-element named tuples

julia> train_loader = DataLoader((data=Xtrain, label=Ytrain), batchsize=5, shuffle=true);

julia> for epoch in 1:100
         for (x, y) in train_loader  # access via tuple destructuring
           @assert size(x) == (10, 5)
           @assert size(y) == (5,)
           # loss += f(x, y) # etc, runs 100 * 20 times
         end
       end

julia> first(train_loader).label isa Vector{Char}  # access via property name
true

julia> first(train_loader).label == Ytrain[1:5]  # because of shuffle=true
false

julia> foreach(println∘summary, DataLoader(rand(Int8, 10, 64), batchsize=30))  # partial=false would omit last
10×30 Matrix{Int8}
10×30 Matrix{Int8}
10×4 Matrix{Int8}

Utility functions for working with data

These utility functions help you prepare data for your models, for example by reshaping arrays, padding sequences, or batching your dataset.

Below is a non-exhaustive list of such utility functions.

MLUtils.unsqueeze (Function)
unsqueeze(x; dims)

Return x reshaped into an array one dimensionality higher than x, where dims indicates in which dimension x is extended.

See also flatten, stack.

Examples

julia> unsqueeze([1 2; 3 4], dims=2)
2×1×2 Array{Int64, 3}:
[:, :, 1] =
 1
 3

[:, :, 2] =
 2
 4


julia> xs = [[1, 2], [3, 4], [5, 6]]
3-element Vector{Vector{Int64}}:
 [1, 2]
 [3, 4]
 [5, 6]

julia> unsqueeze(xs, dims=1)
1×3 Matrix{Vector{Int64}}:
 [1, 2]  [3, 4]  [5, 6]

unsqueeze(; dims)

Returns a function which, acting on an array, inserts a dimension of size 1 at dims.

Examples

julia> rand(21, 22, 23) |> unsqueeze(dims=2) |> size
(21, 1, 22, 23)

MLUtils.flatten (Function)
flatten(x::AbstractArray)

Reshape arbitrarily-shaped input into a matrix-shaped output, preserving the size of the last dimension.

See also unsqueeze.

Examples

julia> rand(3,4,5) |> flatten |> size
(12, 5)

MLUtils.stack (Function)
stack(xs; dims)

Concatenate the given array of arrays xs into a single array along the given dimension dims.

See also unstack and batch.

Examples

julia> xs = [[1, 2], [3, 4], [5, 6]]
3-element Vector{Vector{Int64}}:
 [1, 2]
 [3, 4]
 [5, 6]

julia> stack(xs, dims=1)
3×2 Matrix{Int64}:
 1  2
 3  4
 5  6

julia> stack(xs, dims=2)
2×3 Matrix{Int64}:
 1  3  5
 2  4  6

julia> stack(xs, dims=3)
2×1×3 Array{Int64, 3}:
[:, :, 1] =
 1
 2

[:, :, 2] =
 3
 4

[:, :, 3] =
 5
 6

MLUtils.unstack (Function)
unstack(xs; dims)

Unroll the given xs into an array of arrays along the given dimension dims.

See also stack and unbatch.

Examples

julia> unstack([1 3 5 7; 2 4 6 8], dims=2)
4-element Vector{Vector{Int64}}:
 [1, 2]
 [3, 4]
 [5, 6]
 [7, 8]

MLUtils.numobs (Function)
numobs(data)

Return the total number of observations contained in data.

If data does not have numobs defined, then this function falls back to length(data). Authors of custom data containers should implement Base.length for their type instead of numobs. numobs should only be implemented for types where there is a difference between numobs and Base.length (such as multi-dimensional arrays).

See also getobs.
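
For example (an illustrative snippet, not part of the docstring above):

julia> numobs(rand(4, 10))  # for arrays, the last dimension counts the observations
10

julia> numobs((rand(4, 10), rand(10)))  # tuples must agree on the number of observations
10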

MLUtils.getobs (Function)
getobs(data, [idx])

Return the observations corresponding to the observation-index idx. Note that idx can be any type as long as data has defined getobs for that type.

If data does not have getobs defined, then this function falls back to data[idx]. Authors of custom data containers should implement Base.getindex for their type instead of getobs. getobs should only be implemented for types where there is a difference between getobs and Base.getindex (such as multi-dimensional arrays).

The returned observation(s) should be in the form intended to be passed as-is to some learning algorithm. There is no strict interface requirement on what this "actual data" should look like; every author of a custom data container can make that decision themselves. The output should be consistent when idx is a scalar vs. a vector.

See also getobs! and numobs.
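
For instance, with a plain matrix whose columns are the observations (an illustrative snippet, not part of the docstring above):

julia> X = [1 4 7; 2 5 8; 3 6 9];

julia> getobs(X, 2)  # a single observation: the 2nd column
3-element Vector{Int64}:
 4
 5
 6

julia> getobs(X, [1, 3])  # several observations, collated along the last dimension
3×2 Matrix{Int64}:
 1  7
 2  8
 3  9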

MLUtils.getobs! (Function)
getobs!(buffer, data, idx)

In-place version of getobs(data, idx). If this method is defined for the type of data, then buffer should be used to store the result instead of allocating a dedicated object.

Implementing this method is optional. If no such method is provided for the type of data, then buffer is ignored and the result of getobs is returned instead, for example because the type of data does not lend itself to the concept of copy!.
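
To make this concrete, here is a sketch of a hypothetical custom container (the type ColumnData and its methods are illustrative, not part of MLUtils): it implements Base.length and Base.getindex as recommended above, plus an optional getobs! method that reuses a preallocated buffer.

julia> using MLUtils

julia> struct ColumnData   # hypothetical container whose observations are the columns of X
           X::Matrix{Float64}
       end

julia> Base.length(d::ColumnData) = size(d.X, 2);    # numobs falls back to this

julia> Base.getindex(d::ColumnData, i) = d.X[:, i];  # getobs falls back to this

julia> MLUtils.getobs!(buffer::AbstractVector, d::ColumnData, i::Integer) =
           copyto!(buffer, view(d.X, :, i));         # optional in-place variant

julia> d = ColumnData(rand(3, 100)); numobs(d)
100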

MLUtils.chunk (Function)
chunk(x, n; [dims])
chunk(x; [size, dims])

Split x into n parts or, alternatively, into chunks of size size. The parts contain the same number of elements except possibly for the last one, which can be smaller.

If x is an array, dims can be used to specify along which dimension to split (defaults to the last dimension).

Examples

julia> chunk(1:10, 3)
3-element Vector{UnitRange{Int64}}:
 1:4
 5:8
 9:10

julia> chunk(1:10; size = 2)
5-element Vector{UnitRange{Int64}}:
 1:2
 3:4
 5:6
 7:8
 9:10

julia> x = reshape(collect(1:20), (5, 4))
5×4 Matrix{Int64}:
 1   6  11  16
 2   7  12  17
 3   8  13  18
 4   9  14  19
 5  10  15  20

julia> xs = chunk(x, 2, dims=1)
2-element Vector{SubArray{Int64, 2, Matrix{Int64}, Tuple{UnitRange{Int64}, Base.Slice{Base.OneTo{Int64}}}, false}}:
 [1 6 11 16; 2 7 12 17; 3 8 13 18]
 [4 9 14 19; 5 10 15 20]

julia> xs[1]
3×4 view(::Matrix{Int64}, 1:3, :) with eltype Int64:
 1  6  11  16
 2  7  12  17
 3  8  13  18

julia> xes = chunk(x; size = 2, dims = 2)
2-element Vector{SubArray{Int64, 2, Matrix{Int64}, Tuple{Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}}, true}}:
 [1 6; 2 7; … ; 4 9; 5 10]
 [11 16; 12 17; … ; 14 19; 15 20]

julia> xes[2]
5×2 view(::Matrix{Int64}, :, 3:4) with eltype Int64:
 11  16
 12  17
 13  18
 14  19
 15  20

MLUtils.group_counts (Function)
group_counts(x)

Count the number of times that each element of x appears.

See also group_indices.

Examples

julia> group_counts(['a', 'b', 'b'])
Dict{Char, Int64} with 2 entries:
  'a' => 1
  'b' => 2

MLUtils.group_indices (Function)
group_indices(x) -> Dict

Compute the indices of the elements of the vector x for each distinct value it contains. This information is useful for resampling strategies, such as stratified sampling.

See also group_counts.

Examples

julia> x = [:yes, :no, :maybe, :yes];

julia> group_indices(x)
Dict{Symbol, Vector{Int64}} with 3 entries:
  :yes   => [1, 4]
  :maybe => [3]
  :no    => [2]

MLUtils.batch (Function)
batch(xs)

Batch the arrays in xs into a single array with an extra dimension.

If the elements of xs are tuples, named tuples, or dicts, the output will be of the same type.

See also unbatch.

Examples

julia> batch([[1,2,3], 
              [4,5,6]])
3×2 Matrix{Int64}:
 1  4
 2  5
 3  6

julia> batch([(a=[1,2], b=[3,4])
               (a=[5,6], b=[7,8])]) 
(a = [1 5; 2 6], b = [3 7; 4 8])

MLUtils.unbatch (Function)
unbatch(x)

Reverse of the batch operation, unstacking the last dimension of the array x.

See also unstack.

Examples

julia> unbatch([1 3 5 7;
                2 4 6 8])
4-element Vector{Vector{Int64}}:
 [1, 2]
 [3, 4]
 [5, 6]
 [7, 8]

MLUtils.batchseq (Function)
batchseq(seqs, pad)

Take a list of N sequences, and turn them into a single sequence where each item is a batch of N. Short sequences will be padded by pad.

Examples

julia> batchseq([[1, 2, 3], [4, 5]], 0)
3-element Vector{Vector{Int64}}:
 [1, 4]
 [2, 5]
 [3, 0]

Base.rpad (Method)
rpad(v::AbstractVector, n::Integer, p)

Return the given sequence padded with p up to a maximum length of n.

Examples

julia> rpad([1, 2], 4, 0)
4-element Vector{Int64}:
 1
 2
 0
 0

julia> rpad([1, 2, 3], 2, 0)
3-element Vector{Int64}:
 1
 2
 3