MLUtils.jl

Flux re-exports the DataLoader type and several data-handling utility functions from MLUtils.

DataLoader can be used to handle iteration over mini-batches of data.

MLUtils.DataLoader - Type
DataLoader(data; batchsize=1, shuffle=false, partial=true, rng=GLOBAL_RNG)

An object that iterates over mini-batches of data, each mini-batch containing batchsize observations (except possibly the last one).

Takes as input a single data tensor, or a tuple (or a named tuple) of tensors. The last dimension in each tensor is the observation dimension, i.e. the one divided into mini-batches.

If shuffle=true, the observations are reshuffled each time iteration restarts. If partial=false and the number of observations is not divisible by batchsize, the last (incomplete) mini-batch is dropped.

The original data is preserved in the data field of the DataLoader.

Examples

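julia> Xtrain = rand(10, 100);

julia> array_loader = DataLoader(Xtrain, batchsize=2);

julia> for x in array_loader
         @assert size(x) == (10, 2)
         # do something with x, 50 times
       end

julia> array_loader.data === Xtrain
true

julia> tuple_loader = DataLoader((Xtrain,), batchsize=2);  # similar, but yielding 1-element tuples

julia> for x in tuple_loader
         @assert x isa Tuple{Matrix}
         @assert size(x[1]) == (10, 2)
       end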

julia> Ytrain = rand('a':'z', 100);  # now make a DataLoader yielding 2-element named tuples

julia> train_loader = DataLoader((data=Xtrain, label=Ytrain), batchsize=5, shuffle=true);

julia> for epoch in 1:100
         for (x, y) in train_loader  # access via tuple destructuring
           @assert size(x) == (10, 5)
           @assert size(y) == (5,)
           # loss += f(x, y) # etc, runs 100 * 20 times
         end
       end

julia> first(train_loader).label isa Vector{Char}  # access via property name
true

julia> first(train_loader).label == Ytrain[1:5]  # because of shuffle=true
false

julia> foreach(println∘summary, DataLoader(rand(Int8, 10, 64), batchsize=30))  # partial=false would omit last
10×30 Matrix{Int8}
10×30 Matrix{Int8}
10×4 Matrix{Int8}
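
The batching rules described above (split along the last dimension, optional shuffling, optionally dropping a short final batch) can be sketched in plain Julia. This is only an illustration under those stated rules, not how MLUtils implements DataLoader; `minibatch_indices` is a hypothetical helper that does not exist in the library.

```julia
using Random

# Hypothetical helper, not part of MLUtils: returns the observation indices
# that make up each mini-batch for `nobs` observations.
function minibatch_indices(nobs::Integer; batchsize=1, shuffle=false,
                           partial=true, rng=Random.default_rng())
    idx = shuffle ? randperm(rng, nobs) : collect(1:nobs)
    batches = [idx[i:min(i + batchsize - 1, nobs)] for i in 1:batchsize:nobs]
    # With partial=false, drop a final batch that is shorter than batchsize.
    if !partial && !isempty(batches) && length(last(batches)) < batchsize
        pop!(batches)
    end
    return batches
end

X = rand(Int8, 10, 64)
# Index the last (observation) dimension to materialise each mini-batch.
sizes = [size(X[:, b]) for b in minibatch_indices(size(X, 2); batchsize=30)]
sizes_full = [size(X[:, b])
              for b in minibatch_indices(size(X, 2); batchsize=30, partial=false)]
```

Here `sizes` has three entries, the last one being the short batch, while `sizes_full` keeps only the two complete batches.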

Utility functions for working with data

These utility functions help prepare data for training, for example by building model inputs or batching a dataset.

Below is a non-exhaustive list of such utility functions.

MLUtils.unsqueeze - Function
unsqueeze(x; dims)

Return x reshaped into an array with one more dimension than x, where dims is the position at which the new dimension of size 1 is inserted.

See also flatten, stack.

Examples

julia> unsqueeze([1 2; 3 4], dims=2)
2×1×2 Array{Int64, 3}:
[:, :, 1] =
1
3

[:, :, 2] =
2
4

julia> xs = [[1, 2], [3, 4], [5, 6]]
3-element Vector{Vector{Int64}}:
[1, 2]
[3, 4]
[5, 6]

julia> unsqueeze(xs, dims=1)
1×3 Matrix{Vector{Int64}}:
[1, 2]  [3, 4]  [5, 6]

unsqueeze(; dims)

Returns a function which, acting on an array, inserts a dimension of size 1 at dims.

Examples

julia> rand(21, 22, 23) |> unsqueeze(dims=2) |> size
(21, 1, 22, 23)
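
Inserting a dimension of size 1 does not move any data; it only changes the shape. A minimal sketch of the idea using only `Base.reshape` is shown below. `unsqueeze_sketch` is a hypothetical name for illustration and is not claimed to be the library's implementation.

```julia
# Hypothetical stand-in for unsqueeze, written with Base.reshape only.
function unsqueeze_sketch(x::AbstractArray; dims::Integer)
    n = max(ndims(x), dims - 1) + 1
    # Sizes before `dims` are kept, a 1 is placed at `dims`,
    # and the remaining sizes shift one position to the right.
    newsize = ntuple(i -> i < dims ? size(x, i) : (i == dims ? 1 : size(x, i - 1)), n)
    return reshape(x, newsize)
end

a = unsqueeze_sketch([1 2; 3 4]; dims=2)
b = unsqueeze_sketch(rand(21, 22, 23); dims=2)
```

The results have the same shapes as the two examples above: `size(a)` is `(2, 1, 2)` and `size(b)` is `(21, 1, 22, 23)`.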

MLUtils.stack - Function
stack(xs; dims)

Concatenate the given array of arrays xs into a single array along a new dimension inserted at position dims.

See also unstack and batch.

Examples

julia> xs = [[1, 2], [3, 4], [5, 6]]
3-element Vector{Vector{Int64}}:
[1, 2]
[3, 4]
[5, 6]

julia> stack(xs, dims=1)
3×2 Matrix{Int64}:
1  2
3  4
5  6

julia> stack(xs, dims=2)
2×3 Matrix{Int64}:
1  3  5
2  4  6

julia> stack(xs, dims=3)
2×1×3 Array{Int64, 3}:
[:, :, 1] =
1
2

[:, :, 2] =
3
4

[:, :, 3] =
5
6
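
One way to read the outputs above is that each element first gains a singleton dimension at position dims, and the results are then concatenated along that dimension. The sketch below follows that reading with Base functions only; `stack_sketch` is a hypothetical name, and the library may be implemented differently.

```julia
# Hypothetical stand-in for stack: give every array a singleton dimension
# at `dims`, then concatenate along that dimension.
function stack_sketch(xs; dims::Integer)
    expanded = map(xs) do x
        n = max(ndims(x), dims - 1) + 1
        newsize = ntuple(i -> i < dims ? size(x, i) : (i == dims ? 1 : size(x, i - 1)), n)
        reshape(x, newsize)
    end
    return cat(expanded...; dims=dims)
end

xs = [[1, 2], [3, 4], [5, 6]]
s1 = stack_sketch(xs; dims=1)
s2 = stack_sketch(xs; dims=2)
s3 = stack_sketch(xs; dims=3)
```

With dims=1 each vector becomes a row, with dims=2 a column, and with dims=3 a 2×1 slice of a three-dimensional array.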

MLUtils.unstack - Function
unstack(xs; dims)

Unroll the given xs into an array of arrays along the given dimension dims.

See also stack and unbatch.

Examples

julia> unstack([1 3 5 7; 2 4 6 8], dims=2)
4-element Vector{Vector{Int64}}:
[1, 2]
[3, 4]
[5, 6]
[7, 8]
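
Unrolling along a dimension amounts to collecting the slices taken along it. The following sketch uses `Base.eachslice`; `unstack_sketch` is a hypothetical name used only to illustrate the behaviour shown above.

```julia
# Hypothetical stand-in for unstack: collect (copies of) the slices along `dims`.
unstack_sketch(x::AbstractArray; dims::Integer) =
    [copy(s) for s in eachslice(x; dims=dims)]

m = [1 3 5 7; 2 4 6 8]
cols = unstack_sketch(m; dims=2)   # one vector per column
rows = unstack_sketch(m; dims=1)   # one vector per row
```

Concatenating the columns again with `hcat(cols...)` recovers `m`, which is the sense in which unstacking reverses stacking.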

MLUtils.chunk - Function
chunk(x, n; [dims])

Split x into n parts. All parts contain the same number of elements, except possibly the last one, which can be smaller.

If x is an array, dims can be used to specify along which dimension to split (defaults to the last dimension).

Examples

julia> chunk(1:10, 3)
3-element Vector{UnitRange{Int64}}:
1:4
5:8
9:10

julia> x = reshape(collect(1:20), (5, 4))
5×4 Matrix{Int64}:
1   6  11  16
2   7  12  17
3   8  13  18
4   9  14  19
5  10  15  20

julia> xs = chunk(x, 2, dims=1)
2-element Vector{SubArray{Int64, 2, Matrix{Int64}, Tuple{UnitRange{Int64}, Base.Slice{Base.OneTo{Int64}}}, false}}:
[1 6 11 16; 2 7 12 17; 3 8 13 18]
[4 9 14 19; 5 10 15 20]

julia> xs[1]
3×4 view(::Matrix{Int64}, 1:3, :) with eltype Int64:
1  6  11  16
2  7  12  17
3  8  13  18
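
The part sizes in the examples above are consistent with each part holding `cld(length, n)` elements, the last one taking whatever remains. The sketch below computes the index ranges under that assumption; `chunk_ranges` is a hypothetical helper, not an MLUtils function.

```julia
# Hypothetical helper: index ranges that split `len` items into parts of
# size cld(len, n), with a possibly shorter last part.
function chunk_ranges(len::Integer, n::Integer)
    partsize = cld(len, n)
    return [i:min(i + partsize - 1, len) for i in 1:partsize:len]
end

r = chunk_ranges(10, 3)

x = reshape(collect(1:20), (5, 4))
# Split along the first dimension, returning views rather than copies.
parts = [view(x, rows, :) for rows in chunk_ranges(size(x, 1), 2)]
```

Using `view` mirrors the SubArray results shown above: the parts share memory with `x` instead of copying it.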

MLUtils.group_counts - Function
group_counts(x)

Count the number of times that each element of x appears.

See also group_indices.

Examples

julia> group_counts(['a', 'b', 'b'])
Dict{Char, Int64} with 2 entries:
'a' => 1
'b' => 2
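
Counting occurrences is a single pass over the data with a dictionary. A minimal sketch in plain Julia follows; `group_counts_sketch` is a hypothetical name and makes no claim about the library's internals.

```julia
# Hypothetical stand-in for group_counts using a plain Dict.
function group_counts_sketch(xs)
    counts = Dict{eltype(xs), Int}()
    for x in xs
        # Start from 0 for unseen elements, then increment.
        counts[x] = get(counts, x, 0) + 1
    end
    return counts
end

c = group_counts_sketch(['a', 'b', 'b'])
```

Applied to a vector of labels, such a count gives a quick check of class balance before training.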

MLUtils.batch - Function
batch(xs)

Batch the arrays in xs into a single array with an extra dimension.

If the elements of xs are tuples, named tuples, or dicts, the output will be of the same type.

See also unbatch.

Examples

julia> batch([[1,2,3],
              [4,5,6]])
3×2 Matrix{Int64}:
1  4
2  5
3  6

julia> batch([(a=[1,2], b=[3,4])
              (a=[5,6], b=[7,8])])
(a = [1 5; 2 6], b = [3 7; 4 8])
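
For arrays, batching concatenates along a new trailing dimension; for named tuples, each field is batched separately and the result keeps the same keys. The sketch below covers only these two cases with Base functions. `batch_sketch` is a hypothetical name, and tuples and dicts are left out for brevity.

```julia
# Hypothetical stand-in for batch: arrays are joined along a new last dimension.
batch_sketch(xs::AbstractVector{<:AbstractArray}) =
    cat(xs...; dims=ndims(first(xs)) + 1)

# Named tuples: batch each field on its own and rebuild with the same keys.
function batch_sketch(xs::AbstractVector{<:NamedTuple})
    names = keys(first(xs))
    return NamedTuple{names}(map(k -> batch_sketch([x[k] for x in xs]), names))
end

b1 = batch_sketch([[1, 2, 3], [4, 5, 6]])
b2 = batch_sketch([(a=[1, 2], b=[3, 4]), (a=[5, 6], b=[7, 8])])
```

Both results match the outputs shown above: a 3×2 matrix for the vectors, and a named tuple of 2×2 matrices for the named tuples.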

MLUtils.batchseq - Function
batchseq(seqs, pad)

Take a list of N sequences, and turn them into a single sequence where each item is a batch of N. Short sequences will be padded by pad.

Examples

julia> batchseq([[1, 2, 3], [4, 5]], 0)
3-element Vector{Vector{Int64}}:
[1, 4]
[2, 5]
[3, 0]
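
The transformation can be read as two steps: pad every sequence to the length of the longest one, then collect one batch per time step. The sketch below spells this out in plain Julia; `batchseq_sketch` is a hypothetical name and the library's own implementation may differ.

```julia
# Hypothetical stand-in for batchseq: pad to a common length, then
# gather the t-th item of every sequence into the t-th batch.
function batchseq_sketch(seqs, pad)
    n = maximum(length, seqs)
    padded = [vcat(s, fill(pad, n - length(s))) for s in seqs]
    return [[p[t] for p in padded] for t in 1:n]
end

bs = batchseq_sketch([[1, 2, 3], [4, 5]], 0)
```

Each element of `bs` holds one time step for all N sequences, which is the layout typically fed step by step to a recurrent model.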

Base.rpad - Method
rpad(v::AbstractVector, n::Integer, p)

Return the given sequence padded on the right with p until its length is at least n. A sequence that already has n or more elements is returned without padding.

Examples

julia> rpad([1, 2], 4, 0)
4-element Vector{Int64}:
1
2
0
0

julia> rpad([1, 2, 3], 2, 0)
3-element Vector{Int64}:
1
2
3
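
As the second example shows, n acts as a minimum length rather than a cap: nothing is truncated. A one-line sketch of that behaviour in plain Julia is given below, with `rpad_sketch` as a hypothetical name.

```julia
# Hypothetical stand-in for rpad on vectors: append `p` only when `v`
# is shorter than `n`; longer inputs are not truncated.
rpad_sketch(v::AbstractVector, n::Integer, p) =
    vcat(v, fill(p, max(0, n - length(v))))

short = rpad_sketch([1, 2], 4, 0)
long = rpad_sketch([1, 2, 3], 2, 0)
```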