MLUtils.jl

Flux re-exports the DataLoader type and utility functions for working with data from MLUtils.

DataLoader

DataLoader can be used to handle iteration over mini-batches of data.

MLUtils.DataLoader - Type
DataLoader(data; batchsize=1, shuffle=false, partial=true, rng=GLOBAL_RNG)

An object that iterates over mini-batches of data, each mini-batch containing batchsize observations (except possibly the last one).

Takes as input a single data tensor, or a tuple (or a named tuple) of tensors. The last dimension in each tensor is the observation dimension, i.e. the one divided into mini-batches.

If shuffle=true, the observations are reshuffled each time iteration restarts. If partial=false and the number of observations is not divisible by batchsize, the last (incomplete) mini-batch is dropped.

The original data is preserved in the data field of the DataLoader.

Examples

julia> Xtrain = rand(10, 100);

julia> array_loader = DataLoader(Xtrain, batchsize=2);

julia> for x in array_loader
         @assert size(x) == (10, 2)
         # do something with x, 50 times
       end

julia> array_loader.data === Xtrain
true

julia> tuple_loader = DataLoader((Xtrain,), batchsize=2);  # similar, but yielding 1-element tuples

julia> for x in tuple_loader
         @assert x isa Tuple{Matrix}
         @assert size(x[1]) == (10, 2)
       end

julia> Ytrain = rand('a':'z', 100);  # now make a DataLoader yielding 2-element named tuples

julia> train_loader = DataLoader((data=Xtrain, label=Ytrain), batchsize=5, shuffle=true);

julia> for epoch in 1:100
         for (x, y) in train_loader  # access via tuple destructuring
           @assert size(x) == (10, 5)
           @assert size(y) == (5,)
           # loss += f(x, y) # etc, runs 100 * 20 times
         end
       end

julia> first(train_loader).label isa Vector{Char}  # access via property name
true

julia> first(train_loader).label == Ytrain[1:5]  # because of shuffle=true
false

julia> foreach(println∘summary, DataLoader(rand(Int8, 10, 64), batchsize=30))  # partial=false would omit last
10×30 Matrix{Int8}
10×30 Matrix{Int8}
10×4 Matrix{Int8}
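
As a small sketch of how the observation dimension and partial interact, the following assumes MLUtils is installed and loaded directly (Flux re-exports the same DataLoader). The array sizes and variable names are arbitrary choices for illustration.

```julia
using MLUtils  # assumption: MLUtils is installed; `using Flux` would also provide DataLoader

# 100 observations, each of size 28×28; the last dimension indexes observations.
X = rand(Float32, 28, 28, 100)

loader = DataLoader(X, batchsize=32)                        # partial=true by default
strict_loader = DataLoader(X, batchsize=32, partial=false)  # drop an incomplete last batch

# 100 = 3 * 32 + 4, so the default loader ends with a batch of 4 observations.
batch_sizes = [size(x) for x in loader]
strict_sizes = [size(x) for x in strict_loader]
```

Here length(loader) counts the mini-batches, so it differs between the two loaders whenever the number of observations is not a multiple of batchsize.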

Utility functions for working with data

These utility functions help prepare data for your models, for example by reshaping inputs or batching a dataset.

Below is a non-exhaustive list of such utility functions.

MLUtils.unsqueeze - Function
unsqueeze(x; dims)

Return x reshaped into an array with one more dimension than x, where dims is the position at which the new dimension of size 1 is inserted.

See also flatten, stack.

Examples

julia> unsqueeze([1 2; 3 4], dims=2)
2×1×2 Array{Int64, 3}:
[:, :, 1] =
 1
 3

[:, :, 2] =
 2
 4


julia> xs = [[1, 2], [3, 4], [5, 6]]
3-element Vector{Vector{Int64}}:
 [1, 2]
 [3, 4]
 [5, 6]

julia> unsqueeze(xs, dims=1)
1×3 Matrix{Vector{Int64}}:
 [1, 2]  [3, 4]  [5, 6]

unsqueeze(; dims)

Returns a function which, acting on an array, inserts a dimension of size 1 at dims.

Examples

julia> rand(21, 22, 23) |> unsqueeze(dims=2) |> size
(21, 1, 22, 23)
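
To make the relationship to reshape concrete, here is a minimal sketch comparing both forms of unsqueeze with an explicit reshape. The variable names are illustrative only, and MLUtils is assumed to be installed.

```julia
using MLUtils  # assumption: MLUtils is installed

x = [1 2; 3 4]                      # size (2, 2)

y = unsqueeze(x, dims=2)            # size (2, 1, 2)
z = reshape(x, 2, 1, 2)             # the same result written as an explicit reshape

add_trailing_dim = unsqueeze(dims=3)  # curried form: returns a function
w = add_trailing_dim(x)               # size (2, 2, 1)
```

The curried form is convenient in pipelines with |>, while the direct form reads more naturally when the array is already at hand.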

MLUtils.stack - Function
stack(xs; dims)

Concatenate the arrays in xs into a single array along a new dimension inserted at dims.

See also unstack and batch.

Examples

julia> xs = [[1, 2], [3, 4], [5, 6]]
3-element Vector{Vector{Int64}}:
 [1, 2]
 [3, 4]
 [5, 6]

julia> stack(xs, dims=1)
3×2 Matrix{Int64}:
 1  2
 3  4
 5  6

julia> stack(xs, dims=2)
2×3 Matrix{Int64}:
 1  3  5
 2  4  6

julia> stack(xs, dims=3)
2×1×3 Array{Int64, 3}:
[:, :, 1] =
 1
 2

[:, :, 2] =
 3
 4

[:, :, 3] =
 5
 6

MLUtils.unstack - Function
unstack(xs; dims)

Unroll the given xs into an array of arrays along the given dimension dims.

See also stack and unbatch.

Examples

julia> unstack([1 3 5 7; 2 4 6 8], dims=2)
4-element Vector{Vector{Int64}}:
 [1, 2]
 [3, 4]
 [5, 6]
 [7, 8]
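
The choice of dims decides which slices come out. A short sketch, assuming MLUtils is installed, that unrolls the same matrix along each of its two dimensions:

```julia
using MLUtils  # assumption: MLUtils is installed

x = [1 3 5 7; 2 4 6 8]

cols = unstack(x, dims=2)   # one vector per column
rows = unstack(x, dims=1)   # one vector per row
```

With dims=2 the result has size(x, 2) entries, and with dims=1 it has size(x, 1) entries.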

MLUtils.chunk - Function
chunk(x, n; [dims])

Split x into n parts. All parts contain the same number of elements, except possibly the last one, which can be smaller.

If x is an array, dims can be used to specify along which dimension to split (defaults to the last dimension).

Examples

julia> chunk(1:10, 3)
3-element Vector{UnitRange{Int64}}:
 1:4
 5:8
 9:10

julia> x = reshape(collect(1:20), (5, 4))
5×4 Matrix{Int64}:
 1   6  11  16
 2   7  12  17
 3   8  13  18
 4   9  14  19
 5  10  15  20

julia> xs = chunk(x, 2, dims=1)
2-element Vector{SubArray{Int64, 2, Matrix{Int64}, Tuple{UnitRange{Int64}, Base.Slice{Base.OneTo{Int64}}}, false}}:
 [1 6 11 16; 2 7 12 17; 3 8 13 18]
 [4 9 14 19; 5 10 15 20]

julia> xs[1]
3×4 view(::Matrix{Int64}, 1:3, :) with eltype Int64:
 1  6  11  16
 2  7  12  17
 3  8  13  18
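
The SubArray type in the output above indicates that the chunks are views into x rather than copies. The following sketch, with illustrative variable names and assuming MLUtils is installed, shows the practical consequence: writing to a chunk writes to the original array.

```julia
using MLUtils  # assumption: MLUtils is installed

x = reshape(collect(1:20), (5, 4))
xs = chunk(x, 2, dims=1)

part_sizes = size.(xs)       # 3 rows in the first part, 2 in the second

xs[1][1, 1] = 100            # writes through the view into x

independent = copy(xs[2])    # copy when the chunk must not alias x
independent[1, 1] = -1       # x is not affected by this assignment
```

Copying a chunk is only needed when it will be mutated independently of the source array.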

MLUtils.group_counts - Function
group_counts(x)

Count the number of times that each element of x appears.

See also group_indices.

Examples

julia> group_counts(['a', 'b', 'b'])
Dict{Char, Int64} with 2 entries:
  'a' => 1
  'b' => 2
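
As a hedged illustration of how group_counts relates to group_indices, the sketch below assumes that group_indices returns a dictionary mapping each distinct element to the positions where it occurs. MLUtils is assumed to be installed, and the variable names are illustrative.

```julia
using MLUtils  # assumption: MLUtils is installed

labels = ['a', 'b', 'b']

counts = group_counts(labels)     # how often each element occurs
indices = group_indices(labels)   # assumed: element => positions where it occurs

# Each count should equal the number of positions recorded for that element.
consistent = all(counts[k] == length(indices[k]) for k in keys(counts))
```

Counts are enough to inspect class balance, while indices are what a resampling step would need.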

MLUtils.batch - Function
batch(xs)

Batch the arrays in xs into a single array with an extra dimension.

If the elements of xs are tuples, named tuples, or dicts, the output will be of the same type.

See also unbatch.

Examples

julia> batch([[1,2,3], 
              [4,5,6]])
3×2 Matrix{Int64}:
 1  4
 2  5
 3  6

julia> batch([(a=[1,2], b=[3,4])
               (a=[5,6], b=[7,8])]) 
(a = [1 5; 2 6], b = [3 7; 4 8])

MLUtils.unbatch - Function
unbatch(x)

Reverse of the batch operation, unstacking the last dimension of the array x.

See also unstack.

Examples

julia> unbatch([1 3 5 7; 2 4 6 8])
4-element Vector{Vector{Int64}}:
 [1, 2]
 [3, 4]
 [5, 6]
 [7, 8]
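
Since unbatch is described as the reverse of batch, a round trip should give back the original array. A minimal sketch, assuming MLUtils is installed:

```julia
using MLUtils  # assumption: MLUtils is installed

x = [1 3 5 7; 2 4 6 8]

parts = unbatch(x)        # one vector per slice along the last dimension
restored = batch(parts)   # stacks the vectors back along a new last dimension
```

The round trip holds here because every element of parts has the same size, which batch needs in order to form a single array.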

MLUtils.batchseq - Function
batchseq(seqs, pad)

Take a list of N sequences and turn them into a single sequence in which each item is a batch of N. Shorter sequences are padded with pad.

Examples

julia> batchseq([[1, 2, 3], [4, 5]], 0)
3-element Vector{Vector{Int64}}:
 [1, 4]
 [2, 5]
 [3, 0]
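
Conceptually, this amounts to padding followed by regrouping by position. The sketch below spells out those two steps by hand for the same input. It is not how MLUtils implements batchseq, only an illustration, and the helper variables are hypothetical.

```julia
using MLUtils  # assumption: MLUtils is installed (only used for the comparison)

seqs = [[1, 2, 3], [4, 5]]
pad = 0

# Step 1: pad every sequence to the length of the longest one.
n = maximum(length, seqs)
padded = [vcat(s, fill(pad, n - length(s))) for s in seqs]

# Step 2: regroup by position, so item t collects element t of every sequence.
by_step = [[p[t] for p in padded] for t in 1:n]

same_as_batchseq = by_step == batchseq(seqs, pad)
```

Each item of the result has N entries, one per input sequence, which is the layout expected when feeding sequences to a model one time step at a time.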

Base.rpad - Method
rpad(v::AbstractVector, n::Integer, p)

Return the given sequence padded on the right with p up to length n. Sequences that already have n or more elements are not truncated.

Examples

julia> rpad([1, 2], 4, 0)
4-element Vector{Int64}:
 1
 2
 0
 0

julia> rpad([1, 2, 3], 2, 0)
3-element Vector{Int64}:
 1
 2
 3