Working with Data, using MLUtils.jl

Flux re-exports the DataLoader type and utility functions for working with data from MLUtils.

DataLoader

The DataLoader can be used to create mini-batches of data, in the format train! expects.

MLUtils.DataLoaderType
DataLoader(data; [batchsize, buffer, collate, parallel, partial, rng, shuffle])

An object that iterates over mini-batches of data, each mini-batch containing batchsize observations (except possibly the last one).

Takes as input a single data array, a tuple (or a named tuple) of arrays, or in general any data object that implements the numobs and getobs methods.

The last dimension in each array is the observation dimension, i.e. the one divided into mini-batches.

The original data is preserved in the data field of the DataLoader.

Arguments

  • data: The data to be iterated over. The data type has to be supported by numobs and getobs.
  • batchsize: If less than 0, iterates over individual observations. Otherwise, each iteration (except possibly the last) yields a mini-batch containing batchsize observations. Default 1.
  • buffer: If buffer=true and supported by the type of data, a buffer will be allocated and reused for memory efficiency. You can also pass a preallocated object to buffer. Default false.
  • collate: Batching behavior. If nothing (default), a batch is getobs(data, indices). If false, each batch is [getobs(data, i) for i in indices]. When true, applies batch to the vector of observations in a batch, recursively collating arrays in the last dimensions. See batch for more information and examples.
  • parallel: Whether to use load data in parallel using worker threads. Greatly speeds up data loading by factor of available threads. Requires starting Julia with multiple threads. Check Threads.nthreads() to see the number of available threads. Passing parallel = true breaks ordering guarantees. Default false.
  • partial: This argument is used only when batchsize > 0. If partial=false and the number of observations is not divisible by the batchsize, then the last mini-batch is dropped. Default true.
  • rng: A random number generator. Default Random.GLOBAL_RNG.
  • shuffle: Whether to shuffle the observations before iterating. Unlike wrapping the data container with shuffleobs(data), shuffle=true ensures that the observations are shuffled anew every time you start iterating over eachobs. Default false.

Examples

julia> Xtrain = rand(10, 100);

julia> array_loader = DataLoader(Xtrain, batchsize=2);

julia> for x in array_loader
         @assert size(x) == (10, 2)
         # do something with x, 50 times
       end

julia> array_loader.data === Xtrain
true

julia> tuple_loader = DataLoader((Xtrain,), batchsize=2);  # similar, but yielding 1-element tuples

julia> for x in tuple_loader
         @assert x isa Tuple{Matrix}
         @assert size(x[1]) == (10, 2)
       end

julia> Ytrain = rand('a':'z', 100);  # now make a DataLoader yielding 2-element named tuples

julia> train_loader = DataLoader((data=Xtrain, label=Ytrain), batchsize=5, shuffle=true);

julia> for epoch in 1:100
         for (x, y) in train_loader  # access via tuple destructuring
           @assert size(x) == (10, 5)
           @assert size(y) == (5,)
           # loss += f(x, y) # etc, runs 100 * 20 times
         end
       end

julia> first(train_loader).label isa Vector{Char}  # access via property name
true

julia> first(train_loader).label == Ytrain[1:5]  # because of shuffle=true
false

julia> foreach(println∘summary, DataLoader(rand(Int8, 10, 64), batchsize=30))  # partial=false would omit last
10×30 Matrix{Int8}
10×30 Matrix{Int8}
10×4 Matrix{Int8}
source

Utility Functions

The utility functions are meant to be used while working with data; these functions help create inputs for your models or batch your dataset.

MLUtils.batchFunction
batch(xs)

Batch the arrays in xs into a single array with an extra dimension.

If the elements of xs are tuples, named tuples, or dicts, the output will be of the same type.

See also unbatch.

Examples

julia> batch([[1,2,3], 
              [4,5,6]])
3×2 Matrix{Int64}:
 1  4
 2  5
 3  6

julia> batch([(a=[1,2], b=[3,4])
               (a=[5,6], b=[7,8])]) 
(a = [1 5; 2 6], b = [3 7; 4 8])
source
MLUtils.batchsizeFunction
batchsize(data::BatchView) -> Int

Return the fixed size of each batch in data.

Examples

using MLUtils
X, Y = MLUtils.load_iris()

A = BatchView(X, batchsize=30)
@assert batchsize(A) == 30
source
MLUtils.batchseqFunction
batchseq(seqs, val = 0)

Take a list of N sequences, and turn them into a single sequence where each item is a batch of N. Short sequences will be padded by val.

Examples

julia> batchseq([[1, 2, 3], [4, 5]], 0)
3-element Vector{Vector{Int64}}:
 [1, 4]
 [2, 5]
 [3, 0]
source
MLUtils.BatchViewType
BatchView(data, batchsize; partial=true, collate=nothing)
BatchView(data; batchsize=1, partial=true, collate=nothing)

Create a view of the given data that represents it as a vector of batches. Each batch will contain an equal amount of observations in them. The batch-size can be specified using the parameter batchsize. In the case that the size of the dataset is not dividable by the specified batchsize, the remaining observations will be ignored if partial=false. If partial=true instead the last batch-size can be slightly smaller.

Note that any data access is delayed until getindex is called.

If used as an iterator, the object will iterate over the dataset once, effectively denoting an epoch.

For BatchView to work on some data structure, the type of the given variable data must implement the data container interface. See ObsView for more info.

Arguments

  • data : The object describing the dataset. Can be of any type as long as it implements getobs and numobs (see Details for more information).

  • batchsize : The batch-size of each batch. It is the number of observations that each batch must contain (except possibly for the last one).

  • partial : If partial=false and the number of observations is not divisible by the batch-size, then the last mini-batch is dropped.

  • collate: Batching behavior. If nothing (default), a batch is getobs(data, indices). If false, each batch is [getobs(data, i) for i in indices]. When true, applies batch to the vector of observations in a batch, recursively collating arrays in the last dimensions. See batch for more information and examples.

Examples

using MLUtils
X, Y = MLUtils.load_iris()

A = BatchView(X, batchsize=30)
@assert typeof(A) <: BatchView <: AbstractVector
@assert eltype(A) <: SubArray{Float64,2}
@assert length(A) == 5 # Iris has 150 observations
@assert size(A[1]) == (4,30) # Iris has 4 features

# 5 batches of size 30 observations
for x in BatchView(X, batchsize=30)
    @assert typeof(x) <: SubArray{Float64,2}
    @assert numobs(x) === 30
end

# 7 batches of size 20 observations
# Note that the iris dataset has 150 observations,
# which means that with a batchsize of 20, the last
# 10 observations will be ignored
for (x, y) in BatchView((X, Y), batchsize=20, partial=false)
    @assert typeof(x) <: SubArray{Float64,2}
    @assert typeof(y) <: SubArray{String,1}
    @assert numobs(x) == numobs(y) == 20
end

# collate tuple observations
for (x, y) in BatchView((rand(10, 3), ["a", "b", "c"]), batchsize=2, collate=true, partial=false)
    @assert size(x) == (10, 2)
    @assert size(y) == (2,)
end


# randomly assign observations to one and only one batch.
for (x, y) in BatchView(shuffleobs((X, Y)), batchsize=20)
    @assert typeof(x) <: SubArray{Float64,2}
    @assert typeof(y) <: SubArray{String,1}
end
source
MLUtils.chunkFunction
chunk(x, n; [dims])
chunk(x; [size, dims])

Split x into n parts or alternatively, if size is an integer, into equal chunks of size size. The parts contain the same number of elements except possibly for the last one that can be smaller.

In case size is a collection of integers instead, the elements of x are split into chunks of the given sizes.

If x is an array, dims can be used to specify along which dimension to split (defaults to the last dimension).

Examples

julia> chunk(1:10, 3)
3-element Vector{UnitRange{Int64}}:
 1:4
 5:8
 9:10

julia> chunk(1:10; size = 2)
5-element Vector{UnitRange{Int64}}:
 1:2
 3:4
 5:6
 7:8
 9:10

julia> x = reshape(collect(1:20), (5, 4))
5×4 Matrix{Int64}:
 1   6  11  16
 2   7  12  17
 3   8  13  18
 4   9  14  19
 5  10  15  20

julia> xs = chunk(x, 2, dims=1)
2-element Vector{SubArray{Int64, 2, Matrix{Int64}, Tuple{UnitRange{Int64}, Base.Slice{Base.OneTo{Int64}}}, false}}:
 [1 6 11 16; 2 7 12 17; 3 8 13 18]
 [4 9 14 19; 5 10 15 20]

julia> xs[1]
3×4 view(::Matrix{Int64}, 1:3, :) with eltype Int64:
 1  6  11  16
 2  7  12  17
 3  8  13  18

julia> xes = chunk(x; size = 2, dims = 2)
2-element Vector{SubArray{Int64, 2, Matrix{Int64}, Tuple{Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}}, true}}:
 [1 6; 2 7; … ; 4 9; 5 10]
 [11 16; 12 17; … ; 14 19; 15 20]

julia> xes[2]
5×2 view(::Matrix{Int64}, :, 3:4) with eltype Int64:
 11  16
 12  17
 13  18
 14  19
 15  20

julia> chunk(1:6; size = [2, 4])
2-element Vector{UnitRange{Int64}}:
 1:2
 3:6
source
chunk(x, partition_idxs; [npartitions, dims])

Partition the array x along the dimension dims according to the indexes in partition_idxs.

partition_idxs must be sorted and contain only positive integers between 1 and the number of partitions.

If the number of partition npartitions is not provided, it is inferred from partition_idxs.

If dims is not provided, it defaults to the last dimension.

See also unbatch.

Examples

julia> x = reshape([1:10;], 2, 5)
2×5 Matrix{Int64}:
 1  3  5  7   9
 2  4  6  8  10

julia> chunk(x, [1, 2, 2, 3, 3])
3-element Vector{SubArray{Int64, 2, Matrix{Int64}, Tuple{Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}}, true}}:
 [1; 2;;]
 [3 5; 4 6]
 [7 9; 8 10]
source
MLUtils.eachobsFunction
eachobs(data; kws...)

Return an iterator over data.

Supports the same arguments as DataLoader. The batchsize default is -1 here while it is 1 for DataLoader.

Examples

X = rand(4,100)

for x in eachobs(X)
    # loop entered 100 times
    @assert typeof(x) <: Vector{Float64}
    @assert size(x) == (4,)
end

# mini-batch iterations
for x in eachobs(X, batchsize=10)
    # loop entered 10 times
    @assert typeof(x) <: Matrix{Float64}
    @assert size(x) == (4,10)
end

# support for tuples, named tuples, dicts
for (x, y) in eachobs((X, Y))
    # ...
end
source
MLUtils.fill_likeFunction
fill_like(x, val, [element_type=eltype(x)], [dims=size(x)]))

Create an array with the given element type and size, based upon the given source array x. All element of the new array will be set to val. The third and fourth arguments are both optional, defaulting to the given array's eltype and size. The dimensions may be specified as an integer or as a tuple argument.

See also zeros_like and ones_like.

Examples

julia> x = rand(Float32, 2)
2-element Vector{Float32}:
 0.16087806
 0.89916044

julia> fill_like(x, 1.7, (3, 3))
3×3 Matrix{Float32}:
 1.7  1.7  1.7
 1.7  1.7  1.7
 1.7  1.7  1.7

julia> using CUDA

julia> x = CUDA.rand(2, 2)
2×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
 0.803167  0.476101
 0.303041  0.317581

julia> fill_like(x, 1.7, Float64)
2×2 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:
 1.7  1.7
 1.7  1.7
source
MLUtils.filterobsFunction
filterobs(f, data)

Return a subset of data container data including all indices i for which f(getobs(data, i)) === true.

data = 1:10
numobs(data) == 10
fdata = filterobs(>(5), data)
numobs(fdata) == 5
source
MLUtils.flattenFunction
flatten(x::AbstractArray)

Reshape arbitrarly-shaped input into a matrix-shaped output, preserving the size of the last dimension.

See also unsqueeze.

Examples

julia> rand(3,4,5) |> flatten |> size
(12, 5)
source
MLUtils.getobsFunction
getobs(data, [idx])

Return the observations corresponding to the observation index idx. Note that idx can be any type as long as data has defined getobs for that type. If idx is not provided, then materialize all observations in data.

If data does not have getobs defined, then in the case of Tables.table(data) == true returns the row(s) in position idx, otherwise returns data[idx].

Authors of custom data containers should implement Base.getindex for their type instead of getobs. getobs should only be implemented for types where there is a difference between getobs and Base.getindex (such as multi-dimensional arrays).

The returned observation(s) should be in the form intended to be passed as-is to some learning algorithm. There is no strict interface requirement on how this "actual data" must look like. Every author behind some custom data container can make this decision themselves. The output should be consistent when idx is a scalar vs vector.

getobs supports by default nested combinations of array, tuple, named tuples, and dictionaries.

See also getobs! and numobs.

Examples

# named tuples 
x = (a = [1, 2, 3], b = rand(6, 3))

getobs(x, 2) == (a = 2, b = x.b[:, 2])
getobs(x, [1, 3]) == (a = [1, 3], b = x.b[:, [1, 3]])


# dictionaries
x = Dict(:a => [1, 2, 3], :b => rand(6, 3))

getobs(x, 2) == Dict(:a => 2, :b => x[:b][:, 2])
getobs(x, [1, 3]) == Dict(:a => [1, 3], :b => x[:b][:, [1, 3]])
source
MLUtils.getobs!Function
getobs!(buffer, data, idx)

Inplace version of getobs(data, idx). If this method is defined for the type of data, then buffer should be used to store the result, instead of allocating a dedicated object.

Implementing this function is optional. In the case no such method is provided for the type of data, then buffer will be ignored and the result of getobs returned. This could be because the type of data may not lend itself to the concept of copy!. Thus, supporting a custom getobs! is optional and not required.

See also getobs and numobs.

source
MLUtils.joinobsFunction
joinobs(datas...)

Concatenate data containers datas.

data1, data2 = 1:10, 11:20
jdata = joinumobs(data1, data2)
getobs(jdata, 15) == 15
source
MLUtils.group_countsFunction
group_counts(x)

Count the number of times that each element of x appears.

See also group_indices

Examples

julia> group_counts(['a', 'b', 'b'])
Dict{Char, Int64} with 2 entries:
  'a' => 1
  'b' => 2
source
MLUtils.group_indicesFunction
group_indices(x) -> Dict

Computes the indices of elements in the vector x for each distinct value contained. This information is useful for resampling strategies, such as stratified sampling.

See also group_counts.

Examples

julia> x = [:yes, :no, :maybe, :yes];

julia> group_indices(x)
Dict{Symbol, Vector{Int64}} with 3 entries:
  :yes   => [1, 4]
  :maybe => [3]
  :no    => [2]
source
MLUtils.groupobsFunction
groupobs(f, data)

Split data container data data into different data containers, grouping observations by f(obs).

data = -10:10
datas = groupobs(>(0), data)
length(datas) == 2
source
MLUtils.kfoldsFunction
kfolds(n::Integer, k = 5) -> Tuple

Compute the train/validation assignments for k repartitions of n observations, and return them in the form of two vectors. The first vector contains the index-vectors for the training subsets, and the second vector the index-vectors for the validation subsets respectively. A general rule of thumb is to use either k = 5 or k = 10. The following code snippet generates the indices assignments for k = 5

julia> train_idx, val_idx = kfolds(10, 5);

Each observation is assigned to the validation subset once (and only once). Thus, a union over all validation index-vectors reproduces the full range 1:n. Note that there is no random assignment of observations to subsets, which means that adjacent observations are likely to be part of the same validation subset.

julia> train_idx
5-element Array{Array{Int64,1},1}:
 [3,4,5,6,7,8,9,10]
 [1,2,5,6,7,8,9,10]
 [1,2,3,4,7,8,9,10]
 [1,2,3,4,5,6,9,10]
 [1,2,3,4,5,6,7,8]

julia> val_idx
5-element Array{UnitRange{Int64},1}:
 1:2
 3:4
 5:6
 7:8
 9:10
source
kfolds(data, [k = 5])

Repartition a data container k times using a k folds strategy and return the sequence of folds as a lazy iterator. Only data subsets are created, which means that no actual data is copied until getobs is invoked.

Conceptually, a k-folds repartitioning strategy divides the given data into k roughly equal-sized parts. Each part will serve as validation set once, while the remaining parts are used for training. This results in k different partitions of data.

In the case that the size of the dataset is not dividable by the specified k, the remaining observations will be evenly distributed among the parts.

for (x_train, x_val) in kfolds(X, k=10)
    # code called 10 times
    # nobs(x_val) may differ up to ±1 over iterations
end

Multiple variables are supported (e.g. for labeled data)

for ((x_train, y_train), val) in kfolds((X, Y), k=10)
    # ...
end

By default the folds are created using static splits. Use shuffleobs to randomly assign observations to the folds.

for (x_train, x_val) in kfolds(shuffleobs(X), k = 10)
    # ...
end

See leavepout for a related function.

source
MLUtils.leavepoutFunction
leavepout(n::Integer, [size = 1]) -> Tuple

Compute the train/validation assignments for k ≈ n/size repartitions of n observations, and return them in the form of two vectors. The first vector contains the index-vectors for the training subsets, and the second vector the index-vectors for the validation subsets respectively. Each validation subset will have either size or size+1 observations assigned to it. The following code snippet generates the index-vectors for size = 2.

julia> train_idx, val_idx = leavepout(10, 2);

Each observation is assigned to the validation subset once (and only once). Thus, a union over all validation index-vectors reproduces the full range 1:n. Note that there is no random assignment of observations to subsets, which means that adjacent observations are likely to be part of the same validation subset.

julia> train_idx
5-element Array{Array{Int64,1},1}:
 [3,4,5,6,7,8,9,10]
 [1,2,5,6,7,8,9,10]
 [1,2,3,4,7,8,9,10]
 [1,2,3,4,5,6,9,10]
 [1,2,3,4,5,6,7,8]

julia> val_idx
5-element Array{UnitRange{Int64},1}:
 1:2
 3:4
 5:6
 7:8
 9:10
source
leavepout(data, p = 1)

Repartition a data container using a k-fold strategy, where k is chosen in such a way, that each validation subset of the resulting folds contains roughly p observations. Defaults to p = 1, which is also known as "leave-one-out" partitioning.

The resulting sequence of folds is returned as a lazy iterator. Only data subsets are created. That means no actual data is copied until getobs is invoked.

for (train, val) in leavepout(X, p=2)
    # if nobs(X) is dividable by 2,
    # then numobs(val) will be 2 for each iteraton,
    # otherwise it may be 3 for the first few iterations.
end

Seekfolds for a related function.

source
MLUtils.mapobsFunction
mapobs(f, data; batched=:auto)

Lazily map f over the observations in a data container data. Returns a new data container mdata that can be indexed and has a length. Indexing triggers the transformation f.

The batched keyword argument controls the behavior of mdata[idx] and mdata[idxs] where idx is an integer and idxs is a vector of integers:

  • batched=:auto (default). Let f handle the two cases. Calls f(getobs(data, idx)) and f(getobs(data, idxs)).
  • batched=:never. The function f is always called on a single observation. Calls f(getobs(data, idx)) and [f(getobs(data, idx)) for idx in idxs].
  • batched=:always. The function f is always called on a batch of observations. Calls getobs(f(getobs(data, [idx])), 1) and f(getobs(data, idxs)).

Examples

julia> data = (a=[1,2,3], b=[1,2,3]);

julia> mdata = mapobs(data) do x
         (c = x.a .+ x.b,  d = x.a .- x.b)
       end
mapobs(#25, (a = [1, 2, 3], b = [1, 2, 3]); batched=:auto))

julia> mdata[1]
(c = 2, d = 0)

julia> mdata[1:2]
(c = [2, 4], d = [0, 0])
source
mapobs(fs, data)

Lazily map each function in tuple fs over the observations in data container data. Returns a tuple of transformed data containers.

source
mapobs(namedfs::NamedTuple, data)

Map a NamedTuple of functions over data, turning it into a data container of NamedTuples. Field syntax can be used to select a column of the resulting data container.

data = 1:10
nameddata = mapobs((x = sqrt, y = log), data)
getobs(nameddata, 10) == (x = sqrt(10), y = log(10))
getobs(nameddata.x, 10) == sqrt(10)
source
MLUtils.numobsFunction
numobs(data)

Return the total number of observations contained in data.

If data does not have numobs defined, then in the case of Tables.table(data) == true returns the number of rows, otherwise returns length(data).

Authors of custom data containers should implement Base.length for their type instead of numobs. numobs should only be implemented for types where there is a difference between numobs and Base.length (such as multi-dimensional arrays).

getobs supports by default nested combinations of array, tuple, named tuples, and dictionaries.

See also getobs.

Examples


# named tuples 
x = (a = [1, 2, 3], b = rand(6, 3))
numobs(x) == 3

# dictionaries
x = Dict(:a => [1, 2, 3], :b => rand(6, 3))
numobs(x) == 3

All internal containers must have the same number of observations:

julia> x = (a = [1, 2, 3, 4], b = rand(6, 3));

julia> numobs(x)
ERROR: DimensionMismatch: All data containers must have the same number of observations.
Stacktrace:
 [1] _check_numobs_error()
   @ MLUtils ~/.julia/dev/MLUtils/src/observation.jl:163
 [2] _check_numobs
   @ ~/.julia/dev/MLUtils/src/observation.jl:130 [inlined]
 [3] numobs(data::NamedTuple{(:a, :b), Tuple{Vector{Int64}, Matrix{Float64}}})
   @ MLUtils ~/.julia/dev/MLUtils/src/observation.jl:177
 [4] top-level scope
   @ REPL[35]:1
source
MLUtils.normaliseFunction
normalise(x; dims=ndims(x), ϵ=1e-5)

Normalise the array x to mean 0 and standard deviation 1 across the dimension(s) given by dims. Per default, dims is the last dimension.

ϵ is a small additive factor added to the denominator for numerical stability.

source
MLUtils.obsviewFunction
obsview(data, [indices])

Returns a lazy view of the observations in data that correspond to the given indices. No data will be copied except of the indices. It is similar to constructing an ObsView, but returns a SubArray if the type of data is Array or SubArray. Furthermore, this function may be extended for custom types of data that also want to provide their own subset-type.

In case data is a tuple, the constructor will be mapped over its elements. That means that the constructor returns a tuple of ObsView instead of a ObsView of tuples.

If instead you want to get the subset of observations corresponding to the given indices in their native type, use getobs.

See ObsView for more information.

source
MLUtils.ObsViewType
ObsView(data, [indices])

Used to represent a subset of some data of arbitrary type by storing which observation-indices the subset spans. Furthermore, subsequent subsettings are accumulated without needing to access actual data.

The main purpose for the existence of ObsView is to delay data access and movement until an actual batch of data (or single observation) is needed for some computation. This is particularily useful when the data is not located in memory, but on the hard drive or some remote location. In such a scenario one wants to load the required data only when needed.

Any data access is delayed until getindex is called, and even getindex returns the result of obsview which in general avoids data movement until getobs is called. If used as an iterator, the view will iterate over the dataset once, effectively denoting an epoch. Each iteration will return a lazy subset to the current observation.

Arguments

  • data : The object describing the dataset. Can be of any type as long as it implements getobs and numobs (see Details for more information).

  • indices : Optional. The index or indices of the observation(s) in data that the subset should represent. Can be of type Int or some subtype of AbstractVector.

Methods

  • getindex : Returns the observation(s) of the given index/indices. No data is copied aside from the required indices.

  • numobs : Returns the total number observations in the subset.

  • getobs : Returns the underlying data that the ObsView represents at the given relative indices. Note that these indices are in "subset space", and in general will not directly correspond to the same indices in the underlying data set.

Details

For ObsView to work on some data structure, the desired type MyType must implement the following interface:

  • getobs(data::MyType, idx) : Should return the observation(s) indexed by idx. In what form is up to the user. Note that idx can be of type Int or AbstractVector.

  • numobs(data::MyType) : Should return the total number of observations in data

The following methods can also be provided and are optional:

  • getobs(data::MyType) : By default this function is the identity function. If that is not the behaviour that you want for your type, you need to provide this method as well.

  • obsview(data::MyType, idx) : If your custom type has its own kind of subset type, you can return it here. An example for such a case are SubArray for representing a subset of some AbstractArray.

  • getobs!(buffer, data::MyType, [idx]) : Inplace version of getobs(data, idx). If this method is provided for MyType, then eachobs can preallocate a buffer that is then reused every iteration. Note: buffer should be equivalent to the return value of getobs(::MyType, ...), since this is how buffer is preallocated by default.

Examples

X, Y = MLUtils.load_iris()

# The iris set has 150 observations and 4 features
@assert size(X) == (4,150)

# Represents the 80 observations as a ObsView
v = ObsView(X, 21:100)
@assert numobs(v) == 80
@assert typeof(v) <: ObsView
# getobs indexes into v
@assert getobs(v, 1:10) == X[:, 21:30]

# Use `obsview` to avoid boxing into ObsView
# for types that provide a custom "subset", such as arrays.
# Here it instead creates a native SubArray.
v = obsview(X, 1:100)
@assert numobs(v) == 100
@assert typeof(v) <: SubArray

# Also works for tuples of arbitrary length
subset = obsview((X, Y), 1:100)
@assert numobs(subset) == 100
@assert typeof(subset) <: Tuple # tuple of SubArray

# Use as iterator
for x in ObsView(X)
    @assert typeof(x) <: SubArray{Float64,1}
end

# iterate over each individual labeled observation
for (x, y) in ObsView((X, Y))
    @assert typeof(x) <: SubArray{Float64,1}
    @assert typeof(y) <: String
end

# same but in random order
for (x, y) in ObsView(shuffleobs((X, Y)))
    @assert typeof(x) <: SubArray{Float64,1}
    @assert typeof(y) <: String
end

# Indexing: take first 10 observations
x, y = ObsView((X, Y))[1:10]

See also

obsview, getobs, numobs, splitobs, shuffleobs, kfolds.

source
MLUtils.ones_likeFunction
ones_like(x, [element_type=eltype(x)], [dims=size(x)]))

Create an array with the given element type and size, based upon the given source array x. All element of the new array will be set to 1. The second and third arguments are both optional, defaulting to the given array's eltype and size. The dimensions may be specified as an integer or as a tuple argument.

See also zeros_like and fill_like.

Examples

julia> x = rand(Float32, 2)
2-element Vector{Float32}:
 0.8621633
 0.5158395

julia> ones_like(x, (3, 3))
3×3 Matrix{Float32}:
 1.0  1.0  1.0
 1.0  1.0  1.0
 1.0  1.0  1.0

julia> using CUDA

julia> x = CUDA.rand(2, 2)
2×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
 0.82297   0.656143
 0.701828  0.391335

julia> ones_like(x, Float64)
2×2 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:
 1.0  1.0
 1.0  1.0
source
MLUtils.oversampleFunction
oversample(data, classes; fraction=1, shuffle=true)
oversample(data::Tuple; fraction=1, shuffle=true)

Generate a re-balanced version of data by repeatedly sampling existing observations in such a way that every class will have at least fraction times the number observations of the largest class in classes. This way, all classes will have a minimum number of observations in the resulting data set relative to what largest class has in the given (original) data.

As an example, by default (i.e. with fraction = 1) the resulting dataset will be near perfectly balanced. On the other hand, with fraction = 0.5 every class in the resulting data with have at least 50% as many observations as the largest class.

The classes input is an array with the same length as numobs(data).

The convenience parameter shuffle determines if the resulting data will be shuffled after its creation; if it is not shuffled then all the repeated samples will be together at the end, sorted by class. Defaults to true.

The output will contain both the resampled data and classes.

# 6 observations with 3 features each
X = rand(3, 6)
# 2 classes, severely imbalanced
Y = ["a", "b", "b", "b", "b", "a"]

# oversample the class "a" to match "b"
X_bal, Y_bal = oversample(X, Y)

# this results in a bigger dataset with repeated data
@assert size(X_bal) == (3,8)
@assert length(Y_bal) == 8

# now both "a", and "b" have 4 observations each
@assert sum(Y_bal .== "a") == 4
@assert sum(Y_bal .== "b") == 4

For this function to work, the type of data must implement numobs and getobs.

Note that if data is a tuple and classes is not given, then it will be assumed that the last element of the tuple contains the classes.

julia> data = DataFrame(X1=rand(6), X2=rand(6), Y=[:a,:b,:b,:b,:b,:a])
6×3 DataFrames.DataFrame
│ Row │ X1        │ X2          │ Y │
├─────┼───────────┼─────────────┼───┤
│ 1   │ 0.226582  │ 0.0443222   │ a │
│ 2   │ 0.504629  │ 0.722906    │ b │
│ 3   │ 0.933372  │ 0.812814    │ b │
│ 4   │ 0.522172  │ 0.245457    │ b │
│ 5   │ 0.505208  │ 0.11202     │ b │
│ 6   │ 0.0997825 │ 0.000341996 │ a │

julia> getobs(oversample(data, data.Y))
8×3 DataFrame
 Row │ X1        X2         Y      
     │ Float64   Float64    Symbol 
─────┼─────────────────────────────
   1 │ 0.376304  0.100022   a
   2 │ 0.467095  0.185437   b
   3 │ 0.481957  0.319906   b
   4 │ 0.336762  0.390811   b
   5 │ 0.376304  0.100022   a
   6 │ 0.427064  0.0648339  a
   7 │ 0.427064  0.0648339  a
   8 │ 0.457043  0.490688   b

See ObsView for more information on data subsets. See also undersample.

source
MLUtils.randobsFunction
randobs(data, [n])

Pick a random observation or a batch of n random observations from data. For this function to work, the type of data must implement numobs and getobs.

source
MLUtils.rand_likeFunction
rand_like([rng=default_rng()], x, [element_type=eltype(x)], [dims=size(x)])

Create an array with the given element type and size, based upon the given source array x. All element of the new array will be set to a random value. The last two arguments are both optional, defaulting to the given array's eltype and size. The dimensions may be specified as an integer or as a tuple argument.

The default random number generator is used, unless a custom one is passed in explicitly as the first argument.

See also Base.rand and randn_like.

Examples

julia> x = ones(Float32, 2)
2-element Vector{Float32}:
 1.0
 1.0

julia> rand_like(x, (3, 3))
3×3 Matrix{Float32}:
 0.780032  0.920552  0.53689
 0.121451  0.741334  0.5449
 0.55348   0.138136  0.556404

julia> using CUDA

julia> CUDA.ones(2, 2)
2×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
 1.0  1.0
 1.0  1.0

julia> rand_like(x, Float64)
2×2 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:
 0.429274  0.135379
 0.718895  0.0098756
source
MLUtils.randn_likeFunction
randn_like([rng=default_rng()], x, [element_type=eltype(x)], [dims=size(x)])

Create an array with the given element type and size, based upon the given source array x. All element of the new array will be set to a random value drawn from a normal distribution. The last two arguments are both optional, defaulting to the given array's eltype and size. The dimensions may be specified as an integer or as a tuple argument.

The default random number generator is used, unless a custom one is passed in explicitly as the first argument.

See also Base.randn and rand_like.

Examples

julia> x = ones(Float32, 2)
2-element Vector{Float32}:
 1.0
 1.0

julia> randn_like(x, (3, 3))
3×3 Matrix{Float32}:
 -0.385331    0.956231   0.0745102
  1.43756    -0.967328   2.06311
  0.0482372   1.78728   -0.902547

julia> using CUDA

julia> CUDA.ones(2, 2)
2×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
 1.0  1.0
 1.0  1.0

julia> randn_like(x, Float64)
2×2 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:
 -0.578527   0.823445
 -1.01338   -0.612053
source
MLUtils.rpad_constantFunction
rpad_constant(v::AbstractArray, n::Union{Integer, Tuple}, val = 0; dims=:)

Return the given sequence padded with val along the dimensions dims up to a maximum length in each direction specified by n.

Examples

julia> rpad_constant([1, 2], 4, -1) # passing with -1 up to size 4
4-element Vector{Int64}:
 1
 2
 -1
 -1

julia> rpad_constant([1, 2, 3], 2) # no padding if length is already greater than n
3-element Vector{Int64}:
 1
 2
 3

julia> rpad_constant([1 2; 3 4], 4; dims=1) # padding along the first dimension
4×2 Matrix{Int64}:
 1  2
 3  4
 0  0
 0  0 

julia> rpad_constant([1 2; 3 4], 4) # padding along all dimensions by default
4×2 Matrix{Int64}:
 1  2
 3  4
 0  0
 0  0 
source
MLUtils.shuffleobsFunction
shuffleobs([rng], data)

Return a "subset" of data that spans all observations, but has the order of the observations shuffled.

The values of data itself are not copied. Instead only the indices are shuffled. This function calls obsview to accomplish that, which means that the return value is likely of a different type than data.

# For Arrays the subset will be of type SubArray
@assert typeof(shuffleobs(rand(4,10))) <: SubArray

# Iterate through all observations in random order
for x in eachobs(shuffleobs(X))
    ...
end

The optional parameter rng allows one to specify the random number generator used for shuffling. This is useful when reproducible results are desired. By default, uses the global RNG. See Random in Julia's standard library for more info.

For this function to work, the type of data must implement numobs and getobs. See ObsView for more information.

source
MLUtils.splitobsFunction
splitobs(n::Int; at) -> Tuple

Compute the indices for two or more disjoint subsets of the range 1:n with splits given by at.

Examples

julia> splitobs(100, at=0.7)
(1:70, 71:100)

julia> splitobs(100, at=(0.1, 0.4))
(1:10, 11:50, 51:100)
source
splitobs(data; at, shuffle=false) -> Tuple

Partition the data into two or more subsets. When at is a number (between 0 and 1) this specifies the proportion in the first subset. When at is a tuple, each entry specifies the proportion an a subset, with the last having 1-sum(at). In all there are length(at)+1 subsets returned.

If shuffle=true, randomly permute the observations before splitting.

Supports any datatype implementing the numobs and getobs interfaces – including arrays, tuples & NamedTuples of arrays.

Examples

julia> splitobs(permutedims(1:100); at=0.7)  # simple 70%-30% split, of a matrix
([1 2 … 69 70], [71 72 … 99 100])

julia> data = (x=ones(2,10), n=1:10)  # a NamedTuple, consistent last dimension
(x = [1.0 1.0 … 1.0 1.0; 1.0 1.0 … 1.0 1.0], n = 1:10)

julia> splitobs(data, at=(0.5, 0.3))  # a 50%-30%-20% split, e.g. train/test/validation
((x = [1.0 1.0 … 1.0 1.0; 1.0 1.0 … 1.0 1.0], n = 1:5), (x = [1.0 1.0 1.0; 1.0 1.0 1.0], n = 6:8), (x = [1.0 1.0; 1.0 1.0], n = 9:10))

julia> train, test = splitobs((permutedims(1.0:100.0), 101:200), at=0.7, shuffle=true);  # split a Tuple

julia> vec(test[1]) .+ 100 == test[2]
true
source
MLUtils.unbatchFunction
unbatch(x)

Reverse of the batch operation, unstacking the last dimension of the array x.

See also unstack and chunk.

Examples

julia> unbatch([1 3 5 7;
                2 4 6 8])
4-element Vector{Vector{Int64}}:
 [1, 2]
 [3, 4]
 [5, 6]
 [7, 8]
source
MLUtils.undersampleFunction
undersample(data, classes; shuffle=true)

Generate a class-balanced version of data by subsampling its observations in such a way that the resulting number of observations will be the same number for every class. This way, all classes will have as many observations in the resulting data set as the smallest class has in the given (original) data.

The convenience parameter shuffle determines if the resulting data will be shuffled after its creation; if it is not shuffled then all the observations will be in their original order. Defaults to false.

The output will contain both the resampled data and classes.

# 6 observations with 3 features each
X = rand(3, 6)
# 2 classes, severely imbalanced
Y = ["a", "b", "b", "b", "b", "a"]

# subsample the class "b" to match "a"
X_bal, Y_bal = undersample(X, Y)

# this results in a smaller dataset
@assert size(X_bal) == (3,4)
@assert length(Y_bal) == 4

# now both "a", and "b" have 2 observations each
@assert sum(Y_bal .== "a") == 2
@assert sum(Y_bal .== "b") == 2

For this function to work, the type of data must implement numobs and getobs.

Note that if data is a tuple, then it will be assumed that the last element of the tuple contains the targets.

julia> data = DataFrame(X1=rand(6), X2=rand(6), Y=[:a,:b,:b,:b,:b,:a])
6×3 DataFrames.DataFrame
│ Row │ X1        │ X2          │ Y │
├─────┼───────────┼─────────────┼───┤
│ 1   │ 0.226582  │ 0.0443222   │ a │
│ 2   │ 0.504629  │ 0.722906    │ b │
│ 3   │ 0.933372  │ 0.812814    │ b │
│ 4   │ 0.522172  │ 0.245457    │ b │
│ 5   │ 0.505208  │ 0.11202     │ b │
│ 6   │ 0.0997825 │ 0.000341996 │ a │

julia> getobs(undersample(data, data.Y))
4×3 DataFrame
 Row │ X1        X2         Y      
     │ Float64   Float64    Symbol 
─────┼─────────────────────────────
   1 │ 0.427064  0.0648339  a
   2 │ 0.376304  0.100022   a
   3 │ 0.467095  0.185437   b
   4 │ 0.457043  0.490688   b

See ObsView for more information on data subsets. See also oversample.

source
MLUtils.unsqueezeFunction
unsqueeze(x; dims)

Return x reshaped into an array one dimensionality higher than x, where dims indicates in which dimension x is extended. dims can be an integer between 1 and ndims(x)+1.

See also flatten, stack.

Examples

julia> unsqueeze([1 2; 3 4], dims=2)
2×1×2 Array{Int64, 3}:
[:, :, 1] =
 1
 3

[:, :, 2] =
 2
 4


julia> xs = [[1, 2], [3, 4], [5, 6]]
3-element Vector{Vector{Int64}}:
 [1, 2]
 [3, 4]
 [5, 6]

julia> unsqueeze(xs, dims=1)
1×3 Matrix{Vector{Int64}}:
 [1, 2]  [3, 4]  [5, 6]
source
unsqueeze(; dims)

Returns a function which, acting on an array, inserts a dimension of size 1 at dims.

Examples

julia> rand(21, 22, 23) |> unsqueeze(dims=2) |> size
(21, 1, 22, 23)
source
MLUtils.unstackFunction
unstack(xs; dims)

Unroll the given xs into an array of arrays along the given dimension dims.

See also stack, unbatch, and chunk.

Examples

julia> unstack([1 3 5 7; 2 4 6 8], dims=2)
4-element Vector{Vector{Int64}}:
 [1, 2]
 [3, 4]
 [5, 6]
 [7, 8]
source
MLUtils.zeros_likeFunction
zeros_like(x, [element_type=eltype(x)], [dims=size(x)]))

Create an array with the given element type and size, based upon the given source array x. All element of the new array will be set to 0. The second and third arguments are both optional, defaulting to the given array's eltype and size. The dimensions may be specified as an integer or as a tuple argument.

See also ones_like and fill_like.

Examples

julia> x = rand(Float32, 2)
2-element Vector{Float32}:
 0.4005432
 0.36934233

julia> zeros_like(x, (3, 3))
3×3 Matrix{Float32}:
 0.0  0.0  0.0
 0.0  0.0  0.0
 0.0  0.0  0.0

julia> using CUDA

julia> x = CUDA.rand(2, 2)
2×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
 0.0695155  0.667979
 0.558468   0.59903

julia> zeros_like(x, Float64)
2×2 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:
 0.0  0.0
 0.0  0.0
source