Working with Data, using MLUtils.jl
Flux re-exports the DataLoader
type and utility functions for working with data from MLUtils.
DataLoader
The DataLoader
can be used to create mini-batches of data, in the format train!
expects.
Flux's website has a dedicated tutorial on DataLoader
for more information.
MLUtils.DataLoader
— Type
DataLoader(data; [batchsize, buffer, collate, parallel, partial, rng, shuffle])
An object that iterates over mini-batches of data
, each mini-batch containing batchsize
observations (except possibly the last one).
Takes as input a single data array, a tuple (or a named tuple) of arrays, or in general any data
object that implements the numobs
and getobs
methods.
The last dimension in each array is the observation dimension, i.e. the one divided into mini-batches.
The original data is preserved in the data
field of the DataLoader.
Arguments
data: The data to be iterated over. The data type has to be supported by numobs and getobs.
batchsize: If less than 0, iterates over individual observations. Otherwise, each iteration (except possibly the last) yields a mini-batch containing batchsize observations. Default 1.
buffer: If buffer=true and supported by the type of data, a buffer will be allocated and reused for memory efficiency. You can also pass a preallocated object to buffer. Default false.
collate: Batching behavior. If nothing (default), a batch is getobs(data, indices). If false, each batch is [getobs(data, i) for i in indices]. When true, applies batch to the vector of observations in a batch, recursively collating arrays in the last dimensions. See batch for more information and examples.
parallel: Whether to load data in parallel using worker threads. This can speed up data loading by up to a factor of the number of available threads. Requires starting Julia with multiple threads. Check Threads.nthreads() to see the number of available threads. Passing parallel = true breaks ordering guarantees. Default false.
partial: This argument is used only when batchsize > 0. If partial=false and the number of observations is not divisible by the batchsize, then the last mini-batch is dropped. Default true.
rng: A random number generator. Default Random.GLOBAL_RNG.
shuffle: Whether to shuffle the observations before iterating. Unlike wrapping the data container with shuffleobs(data), shuffle=true ensures that the observations are shuffled anew every time you start iterating over eachobs. Default false.
Examples
julia> Xtrain = rand(10, 100);
julia> array_loader = DataLoader(Xtrain, batchsize=2);
julia> for x in array_loader
           @assert size(x) == (10, 2)
           # do something with x, 50 times
       end
julia> array_loader.data === Xtrain
true
julia> tuple_loader = DataLoader((Xtrain,), batchsize=2); # similar, but yielding 1-element tuples
julia> for x in tuple_loader
           @assert x isa Tuple{Matrix}
           @assert size(x[1]) == (10, 2)
       end
julia> Ytrain = rand('a':'z', 100); # now make a DataLoader yielding 2-element named tuples
julia> train_loader = DataLoader((data=Xtrain, label=Ytrain), batchsize=5, shuffle=true);
julia> for epoch in 1:100
           for (x, y) in train_loader  # access via tuple destructuring
               @assert size(x) == (10, 5)
               @assert size(y) == (5,)
               # loss += f(x, y) # etc, runs 100 * 20 times
           end
       end
julia> first(train_loader).label isa Vector{Char} # access via property name
true
julia> first(train_loader).label == Ytrain[1:5] # because of shuffle=true
false
julia> foreach(println∘summary, DataLoader(rand(Int8, 10, 64), batchsize=30)) # partial=false would omit last
10×30 Matrix{Int8}
10×30 Matrix{Int8}
10×4 Matrix{Int8}
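As a small sketch of the partial keyword described above, counting the mini-batches by iteration shows the incomplete final batch being kept or dropped. The array X and the helper count_batches are illustrative names introduced here, not part of MLUtils.

```julia
using MLUtils

# Hypothetical data: 64 observations with 10 features each.
X = rand(Float32, 10, 64)

# Illustrative helper (not an MLUtils function): count mini-batches by iterating.
count_batches(loader) = sum(1 for _ in loader)

keep_last = DataLoader(X, batchsize=30)                 # partial=true is the default
drop_last = DataLoader(X, batchsize=30, partial=false)  # drops the final 4 observations

n_keep = count_batches(keep_last)   # batches of 30, 30 and 4 observations
n_drop = count_batches(drop_last)   # only the two full batches
```

Dropping the last batch can be convenient when downstream code assumes a fixed batch size, at the cost of skipping a few observations in each pass.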
Utility Functions
The utility functions are meant to be used while working with data; these functions help create inputs for your models or batch your dataset.
MLUtils.unsqueeze
— Function
unsqueeze(x; dims)
Return x
reshaped into an array one dimensionality higher than x
, where dims
indicates in which dimension x
is extended. dims
can be an integer between 1 and ndims(x)+1
.
Examples
julia> unsqueeze([1 2; 3 4], dims=2)
2×1×2 Array{Int64, 3}:
[:, :, 1] =
1
3
[:, :, 2] =
2
4
julia> xs = [[1, 2], [3, 4], [5, 6]]
3-element Vector{Vector{Int64}}:
[1, 2]
[3, 4]
[5, 6]
julia> unsqueeze(xs, dims=1)
1×3 Matrix{Vector{Int64}}:
[1, 2] [3, 4] [5, 6]
unsqueeze(; dims)
Returns a function which, acting on an array, inserts a dimension of size 1 at dims
.
Examples
julia> rand(21, 22, 23) |> unsqueeze(dims=2) |> size
(21, 1, 22, 23)
MLUtils.flatten
— Function
flatten(x::AbstractArray)
Reshape arbitrarily-shaped input into a matrix-shaped output, preserving the size of the last dimension.
See also unsqueeze
.
Examples
julia> rand(3,4,5) |> flatten |> size
(12, 5)
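The two reshaping helpers often appear together when preparing image-like inputs. The following is a minimal sketch under assumed shapes; the variable names and the 28×28 sizes are illustrative only.

```julia
using MLUtils

# Hypothetical batch of 16 single-channel 28×28 images (width, height, channel, batch).
images = rand(Float32, 28, 28, 1, 16)

# flatten merges every dimension except the last into one.
flat = flatten(images)
size(flat)             # (784, 16)

# unsqueeze adds a singleton dimension, here a batch dimension for a single image.
single = rand(Float32, 28, 28, 1)
batched = unsqueeze(single, dims=4)
size(batched)          # (28, 28, 1, 1)
```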
Compat.stack
— Function
stack(df::AbstractDataFrame[, measure_vars[, id_vars] ];
variable_name=:variable, value_name=:value,
view::Bool=false, variable_eltype::Type=String)
Stack a data frame df
, i.e. convert it from wide to long format.
Return the long-format DataFrame
with: columns for each of the id_vars
, column value_name
(:value
by default) holding the values of the stacked columns (measure_vars
), and column variable_name
(:variable
by default) a vector holding the name of the corresponding measure_vars
variable.
If view=true
then return a stacked view of a data frame (long format). The result is a view because the columns are special AbstractVectors
that return views into the original data frame.
Arguments
df: the AbstractDataFrame to be stacked
measure_vars: the columns to be stacked (the measurement variables), as a column selector (Symbol, string or integer; :, Cols, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers). If neither measure_vars nor id_vars are given, measure_vars defaults to all floating point columns.
id_vars: the identifier columns that are repeated during stacking, as a column selector (Symbol, string or integer; :, Cols, All, Between, Not, a regular expression, or a vector of Symbols, strings or integers). Defaults to all variables that are not measure_vars
variable_name: the name (Symbol or string) of the new stacked column that shall hold the names of each of measure_vars
value_name: the name (Symbol or string) of the new stacked column containing the values from each of measure_vars
view: whether the stacked data frame should be a view rather than contain freshly allocated vectors.
variable_eltype: determines the element type of column variable_name. By default a PooledArray{String} is created. If variable_eltype=Symbol a PooledVector{Symbol} is created, and if variable_eltype=CategoricalValue{String} a CategoricalArray{String} is produced (call using CategoricalArrays first if needed). Passing any other type T will produce a PooledVector{T} column as long as it supports conversion from String. When view=true, a RepeatedVector{T} is produced.
Metadata: table-level :note
-style metadata and column-level :note
-style metadata for identifier columns are preserved.
Examples
julia> df = DataFrame(a=repeat(1:3, inner=2),
b=repeat(1:2, inner=3),
c=repeat(1:1, inner=6),
d=repeat(1:6, inner=1),
e=string.('a':'f'))
6×5 DataFrame
Row │ a b c d e
│ Int64 Int64 Int64 Int64 String
─────┼────────────────────────────────────
1 │ 1 1 1 1 a
2 │ 1 1 1 2 b
3 │ 2 1 1 3 c
4 │ 2 2 1 4 d
5 │ 3 2 1 5 e
6 │ 3 2 1 6 f
julia> stack(df, [:c, :d])
12×5 DataFrame
Row │ a b e variable value
│ Int64 Int64 String String Int64
─────┼───────────────────────────────────────
1 │ 1 1 a c 1
2 │ 1 1 b c 1
3 │ 2 1 c c 1
4 │ 2 2 d c 1
5 │ 3 2 e c 1
6 │ 3 2 f c 1
7 │ 1 1 a d 1
8 │ 1 1 b d 2
9 │ 2 1 c d 3
10 │ 2 2 d d 4
11 │ 3 2 e d 5
12 │ 3 2 f d 6
julia> stack(df, [:c, :d], [:a])
12×3 DataFrame
Row │ a variable value
│ Int64 String Int64
─────┼────────────────────────
1 │ 1 c 1
2 │ 1 c 1
3 │ 2 c 1
4 │ 2 c 1
5 │ 3 c 1
6 │ 3 c 1
7 │ 1 d 1
8 │ 1 d 2
9 │ 2 d 3
10 │ 2 d 4
11 │ 3 d 5
12 │ 3 d 6
julia> stack(df, Not([:a, :b, :e]))
12×5 DataFrame
Row │ a b e variable value
│ Int64 Int64 String String Int64
─────┼───────────────────────────────────────
1 │ 1 1 a c 1
2 │ 1 1 b c 1
3 │ 2 1 c c 1
4 │ 2 2 d c 1
5 │ 3 2 e c 1
6 │ 3 2 f c 1
7 │ 1 1 a d 1
8 │ 1 1 b d 2
9 │ 2 1 c d 3
10 │ 2 2 d d 4
11 │ 3 2 e d 5
12 │ 3 2 f d 6
julia> stack(df, Not([:a, :b, :e]), variable_name=:somemeasure)
12×5 DataFrame
Row │ a b e somemeasure value
│ Int64 Int64 String String Int64
─────┼──────────────────────────────────────────
1 │ 1 1 a c 1
2 │ 1 1 b c 1
3 │ 2 1 c c 1
4 │ 2 2 d c 1
5 │ 3 2 e c 1
6 │ 3 2 f c 1
7 │ 1 1 a d 1
8 │ 1 1 b d 2
9 │ 2 1 c d 3
10 │ 2 2 d d 4
11 │ 3 2 e d 5
12 │ 3 2 f d 6
MLUtils.unstack
— Function
MLUtils.numobs
— Function
numobs(data)
Return the total number of observations contained in data
.
If data
does not have numobs
defined, then in the case of Tables.istable(data) == true
returns the number of rows, otherwise returns length(data)
.
Authors of custom data containers should implement Base.length
for their type instead of numobs
. numobs
should only be implemented for types where there is a difference between numobs
and Base.length
(such as multi-dimensional arrays).
numobs
supports by default nested combinations of arrays, tuples, named tuples, and dictionaries.
See also getobs
.
Examples
# named tuples
x = (a = [1, 2, 3], b = rand(6, 3))
numobs(x) == 3
# dictionaries
x = Dict(:a => [1, 2, 3], :b => rand(6, 3))
numobs(x) == 3
All internal containers must have the same number of observations:
julia> x = (a = [1, 2, 3, 4], b = rand(6, 3));
julia> numobs(x)
ERROR: DimensionMismatch: All data containers must have the same number of observations.
Stacktrace:
[1] _check_numobs_error()
@ MLUtils ~/.julia/dev/MLUtils/src/observation.jl:163
[2] _check_numobs
@ ~/.julia/dev/MLUtils/src/observation.jl:130 [inlined]
[3] numobs(data::NamedTuple{(:a, :b), Tuple{Vector{Int64}, Matrix{Float64}}})
@ MLUtils ~/.julia/dev/MLUtils/src/observation.jl:177
[4] top-level scope
@ REPL[35]:1
MLUtils.getobs
— Function
getobs(data, [idx])
Return the observations corresponding to the observation index idx
. Note that idx
can be any type as long as data
has defined getobs
for that type. If idx
is not provided, then materialize all observations in data
.
If data
does not have getobs
defined, then in the case of Tables.istable(data) == true
returns the row(s) in position idx
, otherwise returns data[idx]
.
Authors of custom data containers should implement Base.getindex
for their type instead of getobs
. getobs
should only be implemented for types where there is a difference between getobs
and Base.getindex
(such as multi-dimensional arrays).
The returned observation(s) should be in the form intended to be passed as-is to some learning algorithm. There is no strict interface requirement on what this "actual data" must look like. Every author of a custom data container can make this decision themselves. The output should be consistent when idx
is a scalar vs vector.
getobs
supports by default nested combinations of arrays, tuples, named tuples, and dictionaries.
Examples
# named tuples
x = (a = [1, 2, 3], b = rand(6, 3))
getobs(x, 2) == (a = 2, b = x.b[:, 2])
getobs(x, [1, 3]) == (a = [1, 3], b = x.b[:, [1, 3]])
# dictionaries
x = Dict(:a => [1, 2, 3], :b => rand(6, 3))
getobs(x, 2) == Dict(:a => 2, :b => x[:b][:, 2])
getobs(x, [1, 3]) == Dict(:a => [1, 3], :b => x[:b][:, [1, 3]])
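The advice for authors of custom data containers can be illustrated with a short sketch. SquaresData is a hypothetical type invented for this example; it only implements Base.length and Base.getindex and relies on the fallbacks described above for numobs and getobs.

```julia
using MLUtils

# Hypothetical container: observation i is the pair (i, i^2).
struct SquaresData
    n::Int
end

# Implement the Base interface rather than numobs / getobs directly.
Base.length(d::SquaresData) = d.n
Base.getindex(d::SquaresData, i::Integer) = (i, i^2)
Base.getindex(d::SquaresData, idxs::AbstractVector{<:Integer}) = [d[i] for i in idxs]

data = SquaresData(5)

n = numobs(data)               # falls back to length(data)
one = getobs(data, 2)          # falls back to data[2]
many = getobs(data, [1, 3])    # falls back to data[[1, 3]]
```

Returning a vector for vector indices and a single observation for a scalar index keeps the scalar and vector cases consistent with each other.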
MLUtils.getobs!
— Function
getobs!(buffer, data, idx)
In-place version of getobs(data, idx)
. If this method is defined for the type of data
, then buffer
should be used to store the result, instead of allocating a dedicated object.
Implementing this function is optional. In the case no such method is provided for the type of data
, then buffer
will be ignored and the result of getobs
returned. This could be because the type of data
may not lend itself to the concept of copy!
. Thus, supporting a custom getobs!
is optional and not required.
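A minimal sketch of calling getobs! with a preallocated buffer follows. The matrix X and the buffer size are assumptions made for illustration. Because the buffer may be ignored for some data types, the sketch uses the returned value rather than reading from buffer afterwards.

```julia
using MLUtils

X = reshape(collect(1.0:12.0), 3, 4)   # 4 observations with 3 features each
buffer = zeros(3, 2)                    # preallocated space for two observations

# Use the returned value: depending on the data type, it may or may not be `buffer` itself.
obs = getobs!(buffer, X, 1:2)
```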
MLUtils.chunk
— Function
chunk(x, n; [dims])
chunk(x; [size, dims])
Split x
into n
parts or alternatively, if size
is an integer, into equal chunks of size size
. The parts contain the same number of elements except possibly for the last one that can be smaller.
In case size
is a collection of integers instead, the elements of x
are split into chunks of the given sizes.
If x
is an array, dims
can be used to specify along which dimension to split (defaults to the last dimension).
Examples
julia> chunk(1:10, 3)
3-element Vector{UnitRange{Int64}}:
1:4
5:8
9:10
julia> chunk(1:10; size = 2)
5-element Vector{UnitRange{Int64}}:
1:2
3:4
5:6
7:8
9:10
julia> x = reshape(collect(1:20), (5, 4))
5×4 Matrix{Int64}:
1 6 11 16
2 7 12 17
3 8 13 18
4 9 14 19
5 10 15 20
julia> xs = chunk(x, 2, dims=1)
2-element Vector{SubArray{Int64, 2, Matrix{Int64}, Tuple{UnitRange{Int64}, Base.Slice{Base.OneTo{Int64}}}, false}}:
[1 6 11 16; 2 7 12 17; 3 8 13 18]
[4 9 14 19; 5 10 15 20]
julia> xs[1]
3×4 view(::Matrix{Int64}, 1:3, :) with eltype Int64:
1 6 11 16
2 7 12 17
3 8 13 18
julia> xes = chunk(x; size = 2, dims = 2)
2-element Vector{SubArray{Int64, 2, Matrix{Int64}, Tuple{Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}}, true}}:
[1 6; 2 7; … ; 4 9; 5 10]
[11 16; 12 17; … ; 14 19; 15 20]
julia> xes[2]
5×2 view(::Matrix{Int64}, :, 3:4) with eltype Int64:
11 16
12 17
13 18
14 19
15 20
julia> chunk(1:6; size = [2, 4])
2-element Vector{UnitRange{Int64}}:
1:2
3:6
chunk(x, partition_idxs; [npartitions, dims])
Partition the array x
along the dimension dims
according to the indexes in partition_idxs
.
partition_idxs
must be sorted and contain only positive integers between 1 and the number of partitions.
If the number of partitions npartitions
is not provided, it is inferred from partition_idxs
.
If dims
is not provided, it defaults to the last dimension.
See also unbatch
.
Examples
julia> x = reshape([1:10;], 2, 5)
2×5 Matrix{Int64}:
1 3 5 7 9
2 4 6 8 10
julia> chunk(x, [1, 2, 2, 3, 3])
3-element Vector{SubArray{Int64, 2, Matrix{Int64}, Tuple{Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}}, true}}:
[1; 2;;]
[3 5; 4 6]
[7 9; 8 10]
MLUtils.group_counts
— Function
group_counts(x)
Count the number of times that each element of x
appears.
See also group_indices
.
Examples
julia> group_counts(['a', 'b', 'b'])
Dict{Char, Int64} with 2 entries:
'a' => 1
'b' => 2
MLUtils.group_indices
— Function
group_indices(x) -> Dict
Computes the indices of elements in the vector x
for each distinct value contained. This information is useful for resampling strategies, such as stratified sampling.
See also group_counts
.
Examples
julia> x = [:yes, :no, :maybe, :yes];
julia> group_indices(x)
Dict{Symbol, Vector{Int64}} with 3 entries:
:yes => [1, 4]
:maybe => [3]
:no => [2]
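To make the link to stratified sampling concrete, here is a rough sketch that keeps at most k observations per class. The labels, the value of k, and the choice of taking the first indices (instead of random ones) are assumptions made to keep the example deterministic.

```julia
using MLUtils

labels = [:yes, :no, :maybe, :yes, :no, :yes]
by_class = group_indices(labels)

# Illustrative choice: keep at most `k` observations per class. A real sampler
# would likely pick them at random rather than taking the first ones.
k = 2
selected = sort(reduce(vcat, [idxs[1:min(k, length(idxs))] for idxs in values(by_class)]))
```

The resulting selected vector can then be used to index the data, for example with getobs.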
MLUtils.batch
— Function
batch(xs)
Batch the arrays in xs
into a single array with an extra dimension.
If the elements of xs
are tuples, named tuples, or dicts, the output will be of the same type.
See also unbatch
.
Examples
julia> batch([[1,2,3],
[4,5,6]])
3×2 Matrix{Int64}:
1 4
2 5
3 6
julia> batch([(a=[1,2], b=[3,4])
(a=[5,6], b=[7,8])])
(a = [1 5; 2 6], b = [3 7; 4 8])
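Since the text points to unbatch as the counterpart of batch, a short round-trip sketch may help. It assumes that unbatch splits an array along its last dimension, which is how it behaves for plain arrays as far as this sketch relies on it.

```julia
using MLUtils

observations = [[1, 2, 3], [4, 5, 6]]

batched = batch(observations)    # 3×2 matrix, one observation per column
restored = unbatch(batched)      # back to a vector of observations
```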
MLUtils.unbatch
— Function
MLUtils.batchseq
— Function
batchseq(seqs, val = 0)
Take a list of N
sequences, and turn them into a single sequence where each item is a batch of N
. Short sequences will be padded by val
.
Examples
julia> batchseq([[1, 2, 3], [4, 5]], 0)
3-element Vector{Vector{Int64}}:
[1, 4]
[2, 5]
[3, 0]