# Working with Data, using MLUtils.jl

Flux re-exports the `DataLoader` type and utility functions for working with data from MLUtils.

## `DataLoader`

The `DataLoader` can be used to create mini-batches of data, in the format `train!` expects.

`MLUtils.DataLoader`

— Type`DataLoader(data; [batchsize, buffer, collate, parallel, partial, rng, shuffle])`

An object that iterates over mini-batches of `data`, each mini-batch containing `batchsize` observations (except possibly the last one).

Takes as input a single data array, a tuple (or a named tuple) of arrays, or in general any `data` object that implements the `numobs` and `getobs` methods.

The last dimension in each array is the observation dimension, i.e. the one divided into mini-batches.

The original data is preserved in the `data` field of the DataLoader.

**Arguments**

- `data`: The data to be iterated over. The data type has to be supported by `numobs` and `getobs`.
- `batchsize`: If less than 0, iterates over individual observations. Otherwise, each iteration (except possibly the last) yields a mini-batch containing `batchsize` observations. Default `1`.
- `buffer`: If `buffer=true` and supported by the type of `data`, a buffer will be allocated and reused for memory efficiency. You can also pass a preallocated object to `buffer`. Default `false`.
- `collate`: Batching behavior. If `nothing` (default), a batch is `getobs(data, indices)`. If `false`, each batch is `[getobs(data, i) for i in indices]`. When `true`, applies `batch` to the vector of observations in a batch, recursively collating arrays in the last dimensions. See `batch` for more information and examples.
- `parallel`: Whether to load data in parallel using worker threads. This can speed up data loading by up to a factor of the number of available threads. Requires starting Julia with multiple threads. Check `Threads.nthreads()` to see the number of available threads. **Passing `parallel = true` breaks ordering guarantees.** Default `false`.
- `partial`: This argument is used only when `batchsize > 0`. If `partial=false` and the number of observations is not divisible by the batchsize, then the last mini-batch is dropped. Default `true`.
- `rng`: A random number generator. Default `Random.GLOBAL_RNG`.
- `shuffle`: Whether to shuffle the observations before iterating. Unlike wrapping the data container with `shuffleobs(data)`, `shuffle=true` ensures that the observations are shuffled anew every time you start iterating over `eachobs`. Default `false`.

**Examples**

```
julia> Xtrain = rand(10, 100);
julia> array_loader = DataLoader(Xtrain, batchsize=2);
julia> for x in array_loader
@assert size(x) == (10, 2)
# do something with x, 50 times
end
julia> array_loader.data === Xtrain
true
julia> tuple_loader = DataLoader((Xtrain,), batchsize=2); # similar, but yielding 1-element tuples
julia> for x in tuple_loader
@assert x isa Tuple{Matrix}
@assert size(x[1]) == (10, 2)
end
julia> Ytrain = rand('a':'z', 100); # now make a DataLoader yielding 2-element named tuples
julia> train_loader = DataLoader((data=Xtrain, label=Ytrain), batchsize=5, shuffle=true);
julia> for epoch in 1:100
for (x, y) in train_loader # access via tuple destructuring
@assert size(x) == (10, 5)
@assert size(y) == (5,)
# loss += f(x, y) # etc, runs 100 * 20 times
end
end
julia> first(train_loader).label isa Vector{Char} # access via property name
true
julia> first(train_loader).label == Ytrain[1:5] # because of shuffle=true
false
julia> foreach(println∘summary, DataLoader(rand(Int8, 10, 64), batchsize=30)) # partial=false would omit last
10×30 Matrix{Int8}
10×30 Matrix{Int8}
10×4 Matrix{Int8}
```
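The number of mini-batches a loader yields follows from the number of observations, `batchsize`, and `partial`. The sketch below reproduces that arithmetic in plain Julia. `count_batches` is a hypothetical helper written for illustration, not an MLUtils function, and it assumes `batchsize > 0`.

```julia
# Hypothetical helper (not part of MLUtils): number of mini-batches produced
# from `n` observations, assuming batchsize > 0.
function count_batches(n::Integer, batchsize::Integer; partial::Bool = true)
    batchsize > 0 || throw(ArgumentError("this sketch assumes batchsize > 0"))
    # partial=true keeps the smaller trailing batch; partial=false drops it
    return partial ? cld(n, batchsize) : fld(n, batchsize)
end

count_batches(64, 30)                   # 3, the 30 + 30 + 4 split shown above
count_batches(64, 30; partial = false)  # 2, the trailing 4 observations are dropped
```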

## Utility Functions

The utility functions are meant to be used while working with data; these functions help create inputs for your models or batch your dataset.

`MLUtils.batch`

— Function`batch(xs)`

Batch the arrays in `xs` into a single array with an extra dimension.

If the elements of `xs` are tuples, named tuples, or dicts, the output will be of the same type.

See also `unbatch`.

**Examples**

```
julia> batch([[1,2,3],
[4,5,6]])
3×2 Matrix{Int64}:
1 4
2 5
3 6
julia> batch([(a=[1,2], b=[3,4])
(a=[5,6], b=[7,8])])
(a = [1 5; 2 6], b = [3 7; 4 8])
```

`MLUtils.batchsize`

— Function`batchsize(data::BatchView) -> Int`

Return the fixed size of each batch in `data`.

**Examples**

```
using MLUtils
X, Y = MLUtils.load_iris()
A = BatchView(X, batchsize=30)
@assert batchsize(A) == 30
```

`MLUtils.batchseq`

— Function`batchseq(seqs, val = 0)`

Take a list of `N` sequences, and turn them into a single sequence where each item is a batch of `N`. Short sequences will be padded by `val`.

**Examples**

```
julia> batchseq([[1, 2, 3], [4, 5]], 0)
3-element Vector{Vector{Int64}}:
[1, 4]
[2, 5]
[3, 0]
```
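The padding-then-regrouping behaviour described above can be sketched in plain Julia. `batchseq_sketch` is a hypothetical name used only for illustration; it is not the MLUtils implementation and handles only vectors of vectors.

```julia
# Hypothetical re-implementation for illustration (vectors of vectors only).
function batchseq_sketch(seqs, val = 0)
    maxlen = maximum(length, seqs)
    # right-pad every sequence with `val` up to the longest length
    padded = [vcat(s, fill(val, maxlen - length(s))) for s in seqs]
    # item t of the result collects the t-th element of every sequence
    return [[p[t] for p in padded] for t in 1:maxlen]
end

batchseq_sketch([[1, 2, 3], [4, 5]], 0)  # [[1, 4], [2, 5], [3, 0]]
```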

`MLUtils.BatchView`

— Type

```
BatchView(data, batchsize; partial=true, collate=nothing)
BatchView(data; batchsize=1, partial=true, collate=nothing)
```

Create a view of the given `data` that represents it as a vector of batches. Each batch will contain an equal number of observations. The batch size can be specified using the parameter `batchsize`. If the size of the dataset is not divisible by the specified `batchsize`, the remaining observations will be ignored if `partial=false`. If `partial=true` instead, the last batch can be smaller than the others.

Note that any data access is delayed until `getindex` is called.

If used as an iterator, the object will iterate over the dataset once, effectively denoting an epoch.

For `BatchView` to work on some data structure, the type of the given variable `data` must implement the data container interface. See `ObsView` for more info.

**Arguments**

- `data`: The object describing the dataset. Can be of any type as long as it implements `getobs` and `numobs` (see Details for more information).
- `batchsize`: The batch-size of each batch. It is the number of observations that each batch must contain (except possibly for the last one).
- `partial`: If `partial=false` and the number of observations is not divisible by the batch-size, then the last mini-batch is dropped.
- `collate`: Batching behavior. If `nothing` (default), a batch is `getobs(data, indices)`. If `false`, each batch is `[getobs(data, i) for i in indices]`. When `true`, applies `batch` to the vector of observations in a batch, recursively collating arrays in the last dimensions. See `batch` for more information and examples.

**Examples**

```
using MLUtils
X, Y = MLUtils.load_iris()
A = BatchView(X, batchsize=30)
@assert typeof(A) <: BatchView <: AbstractVector
@assert eltype(A) <: SubArray{Float64,2}
@assert length(A) == 5 # Iris has 150 observations
@assert size(A[1]) == (4,30) # Iris has 4 features
# 5 batches of size 30 observations
for x in BatchView(X, batchsize=30)
@assert typeof(x) <: SubArray{Float64,2}
@assert numobs(x) === 30
end
# 7 batches of size 20 observations
# Note that the iris dataset has 150 observations,
# which means that with a batchsize of 20, the last
# 10 observations will be ignored
for (x, y) in BatchView((X, Y), batchsize=20, partial=false)
@assert typeof(x) <: SubArray{Float64,2}
@assert typeof(y) <: SubArray{String,1}
@assert numobs(x) == numobs(y) == 20
end
# collate tuple observations
for (x, y) in BatchView((rand(10, 3), ["a", "b", "c"]), batchsize=2, collate=true, partial=false)
@assert size(x) == (10, 2)
@assert size(y) == (2,)
end
# randomly assign observations to one and only one batch.
for (x, y) in BatchView(shuffleobs((X, Y)), batchsize=20)
@assert typeof(x) <: SubArray{Float64,2}
@assert typeof(y) <: SubArray{String,1}
end
```

`MLUtils.chunk`

— Function

```
chunk(x, n; [dims])
chunk(x; [size, dims])
```

Split `x` into `n` parts or alternatively, if `size` is an integer, into equal chunks of size `size`. The parts contain the same number of elements except possibly for the last one that can be smaller.

In case `size` is a collection of integers instead, the elements of `x` are split into chunks of the given sizes.

If `x` is an array, `dims` can be used to specify along which dimension to split (defaults to the last dimension).

**Examples**

```
julia> chunk(1:10, 3)
3-element Vector{UnitRange{Int64}}:
1:4
5:8
9:10
julia> chunk(1:10; size = 2)
5-element Vector{UnitRange{Int64}}:
1:2
3:4
5:6
7:8
9:10
julia> x = reshape(collect(1:20), (5, 4))
5×4 Matrix{Int64}:
1 6 11 16
2 7 12 17
3 8 13 18
4 9 14 19
5 10 15 20
julia> xs = chunk(x, 2, dims=1)
2-element Vector{SubArray{Int64, 2, Matrix{Int64}, Tuple{UnitRange{Int64}, Base.Slice{Base.OneTo{Int64}}}, false}}:
[1 6 11 16; 2 7 12 17; 3 8 13 18]
[4 9 14 19; 5 10 15 20]
julia> xs[1]
3×4 view(::Matrix{Int64}, 1:3, :) with eltype Int64:
1 6 11 16
2 7 12 17
3 8 13 18
julia> xes = chunk(x; size = 2, dims = 2)
2-element Vector{SubArray{Int64, 2, Matrix{Int64}, Tuple{Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}}, true}}:
[1 6; 2 7; … ; 4 9; 5 10]
[11 16; 12 17; … ; 14 19; 15 20]
julia> xes[2]
5×2 view(::Matrix{Int64}, :, 3:4) with eltype Int64:
11 16
12 17
13 18
14 19
15 20
julia> chunk(1:6; size = [2, 4])
2-element Vector{UnitRange{Int64}}:
1:2
3:6
```
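The index arithmetic behind the first and last examples above can be sketched with Base Julia alone. This is an illustration consistent with the outputs shown, not the MLUtils implementation; the variable names are arbitrary.

```julia
# Splitting into n parts, as in `chunk(1:10, 3)`: every part holds
# cld(length, n) elements except possibly the last one.
x = 1:10
n = 3
parts = collect(Iterators.partition(x, cld(length(x), n)))

# Splitting by explicit sizes, as in `chunk(1:6; size = [2, 4])`:
# each chunk ends at the running total of the sizes so far.
sizes = [2, 4]
stops = cumsum(sizes)
by_size = [(stop - sz + 1):stop for (sz, stop) in zip(sizes, stops)]
```

Here `parts` holds the ranges `1:4`, `5:8`, `9:10` and `by_size` holds `1:2`, `3:6`, matching the REPL output above.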

`chunk(x, partition_idxs; [npartitions, dims])`

Partition the array `x` along the dimension `dims` according to the indexes in `partition_idxs`.

`partition_idxs` must be sorted and contain only positive integers between 1 and the number of partitions.

If the number of partitions `npartitions` is not provided, it is inferred from `partition_idxs`.

If `dims` is not provided, it defaults to the last dimension.

See also `unbatch`.

**Examples**

```
julia> x = reshape([1:10;], 2, 5)
2×5 Matrix{Int64}:
1 3 5 7 9
2 4 6 8 10
julia> chunk(x, [1, 2, 2, 3, 3])
3-element Vector{SubArray{Int64, 2, Matrix{Int64}, Tuple{Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}}, true}}:
[1; 2;;]
[3 5; 4 6]
[7 9; 8 10]
```

`MLUtils.eachobs`

— Function`eachobs(data; kws...)`

Return an iterator over `data`.

Supports the same arguments as `DataLoader`. The `batchsize` default is `-1` here while it is `1` for `DataLoader`.

**Examples**

```
X = rand(4,100)
for x in eachobs(X)
# loop entered 100 times
@assert typeof(x) <: Vector{Float64}
@assert size(x) == (4,)
end
# mini-batch iterations
for x in eachobs(X, batchsize=10)
# loop entered 10 times
@assert typeof(x) <: Matrix{Float64}
@assert size(x) == (4,10)
end
# support for tuples, named tuples, dicts
Y = rand(100) # one target per observation, defined so the loop below runs
for (x, y) in eachobs((X, Y))
# ...
end
```

`MLUtils.fill_like`

— Function`fill_like(x, val, [element_type=eltype(x)], [dims=size(x)])`

Create an array with the given element type and size, based upon the given source array `x`. All elements of the new array will be set to `val`. The third and fourth arguments are both optional, defaulting to the given array's eltype and size. The dimensions may be specified as an integer or as a tuple argument.

See also `zeros_like` and `ones_like`.

**Examples**

```
julia> x = rand(Float32, 2)
2-element Vector{Float32}:
0.16087806
0.89916044
julia> fill_like(x, 1.7, (3, 3))
3×3 Matrix{Float32}:
1.7 1.7 1.7
1.7 1.7 1.7
1.7 1.7 1.7
julia> using CUDA
julia> x = CUDA.rand(2, 2)
2×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
0.803167 0.476101
0.303041 0.317581
julia> fill_like(x, 1.7, Float64)
2×2 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:
1.7 1.7
1.7 1.7
```

`MLUtils.filterobs`

— Function`filterobs(f, data)`

Return a subset of data container `data` including all indices `i` for which `f(getobs(data, i)) === true`.

```
data = 1:10
numobs(data) == 10
fdata = filterobs(>(5), data)
numobs(fdata) == 5
```

`MLUtils.flatten`

— Function`flatten(x::AbstractArray)`

Reshape arbitrarily-shaped input into a matrix-shaped output, preserving the size of the last dimension.

See also `unsqueeze`.

**Examples**

```
julia> rand(3,4,5) |> flatten |> size
(12, 5)
```
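In plain Julia, the same result can be obtained with a single `reshape`. The helper below is a hypothetical stand-in named `flatten_sketch`, shown only to make the "preserve the last dimension" rule concrete.

```julia
# Hypothetical stand-in for `flatten`: collapse every dimension except the last.
flatten_sketch(x::AbstractArray) = reshape(x, :, size(x, ndims(x)))

x = rand(3, 4, 5)
y = flatten_sketch(x)

size(y)                     # (12, 5)
y[:, 1] == vec(x[:, :, 1])  # true: each column is one flattened observation
```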

`MLUtils.getobs`

— Function`getobs(data, [idx])`

Return the observations corresponding to the observation index `idx`. Note that `idx` can be any type as long as `data` has defined `getobs` for that type. If `idx` is not provided, then materialize all observations in `data`.

If `data` does not have `getobs` defined, then in the case of `Tables.istable(data) == true` returns the row(s) in position `idx`, otherwise returns `data[idx]`.

Authors of custom data containers should implement `Base.getindex` for their type instead of `getobs`. `getobs` should only be implemented for types where there is a difference between `getobs` and `Base.getindex` (such as multi-dimensional arrays).

The returned observation(s) should be in the form intended to be passed as-is to some learning algorithm. There is no strict interface requirement on what this "actual data" must look like. Every author of a custom data container can make this decision themselves. The output should be consistent when `idx` is a scalar vs vector.

`getobs` supports by default nested combinations of arrays, tuples, named tuples, and dictionaries.

**Examples**

```
# named tuples
x = (a = [1, 2, 3], b = rand(6, 3))
getobs(x, 2) == (a = 2, b = x.b[:, 2])
getobs(x, [1, 3]) == (a = [1, 3], b = x.b[:, [1, 3]])
# dictionaries
x = Dict(:a => [1, 2, 3], :b => rand(6, 3))
getobs(x, 2) == Dict(:a => 2, :b => x[:b][:, 2])
getobs(x, [1, 3]) == Dict(:a => [1, 3], :b => x[:b][:, [1, 3]])
```
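For a custom data container, the text above recommends implementing `Base.getindex` (and `Base.length`) and relying on the fallbacks. The following is a minimal sketch of that pattern; `SquaresDataset` is a hypothetical type invented for this example, and the sketch assumes MLUtils is installed.

```julia
using MLUtils

# Hypothetical data container: observation i is the square of i.
struct SquaresDataset
    n::Int
end

# Implementing the Base methods is enough; `numobs` and `getobs` fall back to them.
Base.length(d::SquaresDataset) = d.n
Base.getindex(d::SquaresDataset, i::Integer) = i^2
Base.getindex(d::SquaresDataset, idxs::AbstractVector) = [d[i] for i in idxs]

d = SquaresDataset(5)
numobs(d)          # falls back to length(d)
getobs(d, 2)       # falls back to d[2]
getobs(d, [1, 3])  # falls back to d[[1, 3]]
```

Returning a scalar for a scalar index and a vector for a vector index keeps the output consistent in the sense described above.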

`MLUtils.getobs!`

— Function`getobs!(buffer, data, idx)`

Inplace version of `getobs(data, idx)`. If this method is defined for the type of `data`, then `buffer` should be used to store the result, instead of allocating a dedicated object.

Implementing this function is optional. In the case no such method is provided for the type of `data`, then `buffer` will be *ignored* and the result of `getobs` returned. This could be because the type of `data` may not lend itself to the concept of `copy!`. Thus, supporting a custom `getobs!` is optional and not required.

`MLUtils.joinobs`

— Function`joinobs(datas...)`

Concatenate data containers `datas`

.

```
data1, data2 = 1:10, 11:20
jdata = joinobs(data1, data2)
getobs(jdata, 15) == 15
```

`MLUtils.group_counts`

— Function`group_counts(x)`

Count the number of times that each element of `x` appears.

See also `group_indices`.

**Examples**

```
julia> group_counts(['a', 'b', 'b'])
Dict{Char, Int64} with 2 entries:
'a' => 1
'b' => 2
```

`MLUtils.group_indices`

— Function`group_indices(x) -> Dict`

Computes the indices of elements in the vector `x` for each distinct value contained. This information is useful for resampling strategies, such as stratified sampling.

See also `group_counts`.

**Examples**

```
julia> x = [:yes, :no, :maybe, :yes];
julia> group_indices(x)
Dict{Symbol, Vector{Int64}} with 3 entries:
:yes => [1, 4]
:maybe => [3]
:no => [2]
```

`MLUtils.groupobs`

— Function`groupobs(f, data)`

Split data container `data` into different data containers, grouping observations by `f(obs)`.

```
data = -10:10
datas = groupobs(>(0), data)
length(datas) == 2
```

`MLUtils.kfolds`

— Function`kfolds(n::Integer, k = 5) -> Tuple`

Compute the train/validation assignments for `k` repartitions of `n` observations, and return them in the form of two vectors. The first vector contains the index-vectors for the training subsets, and the second vector the index-vectors for the validation subsets respectively. A general rule of thumb is to use either `k = 5` or `k = 10`. The following code snippet generates the index assignments for `k = 5`:

`julia> train_idx, val_idx = kfolds(10, 5);`

Each observation is assigned to the validation subset once (and only once). Thus, a union over all validation index-vectors reproduces the full range `1:n`. Note that there is no random assignment of observations to subsets, which means that adjacent observations are likely to be part of the same validation subset.

```
julia> train_idx
5-element Array{Array{Int64,1},1}:
[3,4,5,6,7,8,9,10]
[1,2,5,6,7,8,9,10]
[1,2,3,4,7,8,9,10]
[1,2,3,4,5,6,9,10]
[1,2,3,4,5,6,7,8]
julia> val_idx
5-element Array{UnitRange{Int64},1}:
1:2
3:4
5:6
7:8
9:10
```
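The assignment shown above can be reproduced with a few lines of plain Julia. `kfolds_sketch` is a hypothetical helper, not the MLUtils implementation; in particular, it assumes that when `n` is not divisible by `k` the extra observations go to the first folds.

```julia
# Hypothetical sketch of contiguous k-fold index assignment.
function kfolds_sketch(n::Integer, k::Integer = 5)
    1 < k <= n || throw(ArgumentError("need 1 < k <= n"))
    base, extra = divrem(n, k)
    # assumption: the first `extra` folds receive one additional observation
    sizes = [base + (i <= extra ? 1 : 0) for i in 1:k]
    stops = cumsum(sizes)
    val_idx = [(stops[i] - sizes[i] + 1):stops[i] for i in 1:k]
    # training indices are everything outside the validation range
    train_idx = [setdiff(1:n, v) for v in val_idx]
    return train_idx, val_idx
end

train_idx, val_idx = kfolds_sketch(10, 5)
val_idx[1]    # 1:2
train_idx[1]  # [3, 4, 5, 6, 7, 8, 9, 10]
```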

`kfolds(data, [k = 5])`

Repartition a `data` container `k` times using a `k` folds strategy and return the sequence of folds as a lazy iterator. Only data subsets are created, which means that no actual data is copied until `getobs` is invoked.

Conceptually, a k-folds repartitioning strategy divides the given `data` into `k` roughly equal-sized parts. Each part will serve as validation set once, while the remaining parts are used for training. This results in `k` different partitions of `data`.

If the size of the dataset is not divisible by the specified `k`, the remaining observations will be evenly distributed among the parts.

```
for (x_train, x_val) in kfolds(X, k=10)
# code called 10 times
# numobs(x_val) may differ by up to ±1 between iterations
end
```

Multiple variables are supported (e.g. for labeled data)

```
for ((x_train, y_train), val) in kfolds((X, Y), k=10)
# ...
end
```

By default the folds are created using static splits. Use `shuffleobs` to randomly assign observations to the folds.

```
for (x_train, x_val) in kfolds(shuffleobs(X), k = 10)
# ...
end
```

See `leavepout` for a related function.

`MLUtils.leavepout`

— Function`leavepout(n::Integer, [size = 1]) -> Tuple`

Compute the train/validation assignments for `k ≈ n/size` repartitions of `n` observations, and return them in the form of two vectors. The first vector contains the index-vectors for the training subsets, and the second vector the index-vectors for the validation subsets respectively. Each validation subset will have either `size` or `size+1` observations assigned to it. The following code snippet generates the index-vectors for `size = 2`.

`julia> train_idx, val_idx = leavepout(10, 2);`

Each observation is assigned to the validation subset once (and only once). Thus, a union over all validation index-vectors reproduces the full range `1:n`. Note that there is no random assignment of observations to subsets, which means that adjacent observations are likely to be part of the same validation subset.

```
julia> train_idx
5-element Array{Array{Int64,1},1}:
[3,4,5,6,7,8,9,10]
[1,2,5,6,7,8,9,10]
[1,2,3,4,7,8,9,10]
[1,2,3,4,5,6,9,10]
[1,2,3,4,5,6,7,8]
julia> val_idx
5-element Array{UnitRange{Int64},1}:
1:2
3:4
5:6
7:8
9:10
```

`leavepout(data, p = 1)`

Repartition a `data` container using a k-fold strategy, where `k` is chosen in such a way that each validation subset of the resulting folds contains roughly `p` observations. Defaults to `p = 1`, which is also known as "leave-one-out" partitioning.

The resulting sequence of folds is returned as a lazy iterator. Only data subsets are created. That means no actual data is copied until `getobs` is invoked.

```
for (train, val) in leavepout(X, p=2)
# if numobs(X) is divisible by 2,
# then numobs(val) will be 2 for each iteration,
# otherwise it may be 3 for the first few iterations.
end
```

See `kfolds` for a related function.

`MLUtils.mapobs`

— Function`mapobs(f, data; batched=:auto)`

Lazily map `f` over the observations in a data container `data`. Returns a new data container `mdata` that can be indexed and has a length. Indexing triggers the transformation `f`.

The `batched` keyword argument controls the behavior of `mdata[idx]` and `mdata[idxs]` where `idx` is an integer and `idxs` is a vector of integers:

- `batched=:auto` (default). Let `f` handle the two cases. Calls `f(getobs(data, idx))` and `f(getobs(data, idxs))`.
- `batched=:never`. The function `f` is always called on a single observation. Calls `f(getobs(data, idx))` and `[f(getobs(data, idx)) for idx in idxs]`.
- `batched=:always`. The function `f` is always called on a batch of observations. Calls `getobs(f(getobs(data, [idx])), 1)` and `f(getobs(data, idxs))`.
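The three `batched` modes can be mimicked for a plain vector, where `getobs(data, idx)` is simply `data[idx]`. `mapped_getobs` below is a hypothetical helper that mirrors the calls listed above; it is a sketch for illustration, not how MLUtils implements `mapobs`.

```julia
# Hypothetical helper mirroring the dispatch described for each `batched` mode.
function mapped_getobs(f, data::AbstractVector, idx; batched::Symbol = :auto)
    if batched === :auto
        # f must cope with both a single observation and a batch
        return f(data[idx])
    elseif batched === :never
        # f only ever sees one observation at a time
        return idx isa Integer ? f(data[idx]) : [f(data[i]) for i in idx]
    elseif batched === :always
        # f only ever sees a batch; a single index becomes a batch of one
        return idx isa Integer ? f(data[[idx]])[1] : f(data[idx])
    else
        throw(ArgumentError("batched must be :auto, :never or :always"))
    end
end

data = [1, 2, 3]
mapped_getobs(x -> 2 .* x, data, [1, 3])                   # [2, 6]
mapped_getobs(x -> x + 1, data, [1, 3]; batched = :never)  # [2, 4]
```

A function that only makes sense on scalars, such as `x -> x + 1`, needs `:never`; one that needs the whole batch, such as a per-batch mean subtraction, needs `:always`.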

**Examples**

```
julia> data = (a=[1,2,3], b=[1,2,3]);
julia> mdata = mapobs(data) do x
(c = x.a .+ x.b, d = x.a .- x.b)
end
mapobs(#25, (a = [1, 2, 3], b = [1, 2, 3]); batched=:auto))
julia> mdata[1]
(c = 2, d = 0)
julia> mdata[1:2]
(c = [2, 4], d = [0, 0])
```

`mapobs(fs, data)`

Lazily map each function in tuple `fs` over the observations in data container `data`. Returns a tuple of transformed data containers.

`mapobs(namedfs::NamedTuple, data)`

Map a `NamedTuple` of functions over `data`, turning it into a data container of `NamedTuple`s. Field syntax can be used to select a column of the resulting data container.

```
data = 1:10
nameddata = mapobs((x = sqrt, y = log), data)
getobs(nameddata, 10) == (x = sqrt(10), y = log(10))
getobs(nameddata.x, 10) == sqrt(10)
```

`MLUtils.numobs`

— Function`numobs(data)`

Return the total number of observations contained in `data`.

If `data` does not have `numobs` defined, then in the case of `Tables.istable(data) == true` returns the number of rows, otherwise returns `length(data)`.

Authors of custom data containers should implement `Base.length` for their type instead of `numobs`. `numobs` should only be implemented for types where there is a difference between `numobs` and `Base.length` (such as multi-dimensional arrays).

`numobs` supports by default nested combinations of arrays, tuples, named tuples, and dictionaries.

See also `getobs`.

**Examples**

```
# named tuples
x = (a = [1, 2, 3], b = rand(6, 3))
numobs(x) == 3
# dictionaries
x = Dict(:a => [1, 2, 3], :b => rand(6, 3))
numobs(x) == 3
```

All internal containers must have the same number of observations:

```
julia> x = (a = [1, 2, 3, 4], b = rand(6, 3));
julia> numobs(x)
ERROR: DimensionMismatch: All data containers must have the same number of observations.
Stacktrace:
[1] _check_numobs_error()
@ MLUtils ~/.julia/dev/MLUtils/src/observation.jl:163
[2] _check_numobs
@ ~/.julia/dev/MLUtils/src/observation.jl:130 [inlined]
[3] numobs(data::NamedTuple{(:a, :b), Tuple{Vector{Int64}, Matrix{Float64}}})
@ MLUtils ~/.julia/dev/MLUtils/src/observation.jl:177
[4] top-level scope
@ REPL[35]:1
```

`MLUtils.normalise`

— Function`normalise(x; dims=ndims(x), ϵ=1e-5)`

Normalise the array `x` to mean 0 and standard deviation 1 across the dimension(s) given by `dims`. By default, `dims` is the last dimension.

`ϵ` is a small additive factor added to the denominator for numerical stability.
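As a worked illustration of the formula implied above, here is a minimal sketch in plain Julia using the `Statistics` standard library. `normalise_sketch` is a hypothetical name, and the use of the uncorrected standard deviation is an assumption of this sketch.

```julia
using Statistics

# Hypothetical sketch: subtract the mean and divide by (std + ϵ) along `dims`.
# Assumption: the uncorrected (population) standard deviation is used.
function normalise_sketch(x::AbstractArray; dims = ndims(x), ϵ = 1e-5)
    μ = mean(x; dims = dims)
    σ = std(x; dims = dims, mean = μ, corrected = false)
    return (x .- μ) ./ (σ .+ ϵ)
end

x = [1.0 2.0 3.0; 4.0 6.0 8.0]
y = normalise_sketch(x; dims = 2)

mean(y; dims = 2)                    # approximately 0 in every row
std(y; dims = 2, corrected = false)  # approximately 1 in every row
```

Because of `ϵ`, the resulting standard deviation is slightly below 1 rather than exactly 1.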

`MLUtils.obsview`

— Function`obsview(data, [indices])`

Returns a lazy view of the observations in `data` that correspond to the given `indices`. No data will be copied except for the indices. It is similar to constructing an `ObsView`, but returns a `SubArray` if the type of `data` is `Array` or `SubArray`. Furthermore, this function may be extended for custom types of `data` that also want to provide their own subset-type.

In case `data` is a tuple, the constructor will be mapped over its elements. That means that the constructor returns a tuple of `ObsView` instead of an `ObsView` of tuples.

If instead you want to get the subset of observations corresponding to the given `indices` in their native type, use `getobs`.

See `ObsView` for more information.

`MLUtils.ObsView`

— Type`ObsView(data, [indices])`

Used to represent a subset of some `data` of arbitrary type by storing which observation-indices the subset spans. Furthermore, subsequent subsettings are accumulated without needing to access actual data.

The main purpose for the existence of `ObsView` is to delay data access and movement until an actual batch of data (or single observation) is needed for some computation. This is particularly useful when the data is not located in memory, but on the hard drive or some remote location. In such a scenario one wants to load the required data only when needed.

Any data access is delayed until `getindex` is called, and even `getindex` returns the result of `obsview`, which in general avoids data movement until `getobs` is called. If used as an iterator, the view will iterate over the dataset once, effectively denoting an epoch. Each iteration will return a lazy subset to the current observation.

**Arguments**

- `data`: The object describing the dataset. Can be of any type as long as it implements `getobs` and `numobs` (see Details for more information).
- `indices`: Optional. The index or indices of the observation(s) in `data` that the subset should represent. Can be of type `Int` or some subtype of `AbstractVector`.

**Methods**

- `getindex`: Returns the observation(s) of the given index/indices. No data is copied aside from the required indices.
- `numobs`: Returns the total number of observations in the subset.
- `getobs`: Returns the underlying data that the `ObsView` represents at the given relative indices. Note that these indices are in "subset space", and in general will not directly correspond to the same indices in the underlying data set.

**Details**

For `ObsView` to work on some data structure, the desired type `MyType` must implement the following interface:

- `getobs(data::MyType, idx)`: Should return the observation(s) indexed by `idx`. In what form is up to the user. Note that `idx` can be of type `Int` or `AbstractVector`.
- `numobs(data::MyType)`: Should return the total number of observations in `data`.

The following methods can also be provided and are optional:

- `getobs(data::MyType)`: By default this function is the identity function. If that is not the behaviour that you want for your type, you need to provide this method as well.
- `obsview(data::MyType, idx)`: If your custom type has its own kind of subset type, you can return it here. An example of such a case is `SubArray` for representing a subset of some `AbstractArray`.
- `getobs!(buffer, data::MyType, [idx])`: Inplace version of `getobs(data, idx)`. If this method is provided for `MyType`, then `eachobs` can preallocate a buffer that is then reused every iteration. Note: `buffer` should be equivalent to the return value of `getobs(::MyType, ...)`, since this is how `buffer` is preallocated by default.

**Examples**

```
X, Y = MLUtils.load_iris()
# The iris set has 150 observations and 4 features
@assert size(X) == (4,150)
# Represents the 80 observations as a ObsView
v = ObsView(X, 21:100)
@assert numobs(v) == 80
@assert typeof(v) <: ObsView
# getobs indexes into v
@assert getobs(v, 1:10) == X[:, 21:30]
# Use `obsview` to avoid boxing into ObsView
# for types that provide a custom "subset", such as arrays.
# Here it instead creates a native SubArray.
v = obsview(X, 1:100)
@assert numobs(v) == 100
@assert typeof(v) <: SubArray
# Also works for tuples of arbitrary length
subset = obsview((X, Y), 1:100)
@assert numobs(subset) == 100
@assert typeof(subset) <: Tuple # tuple of SubArray
# Use as iterator
for x in ObsView(X)
@assert typeof(x) <: SubArray{Float64,1}
end
# iterate over each individual labeled observation
for (x, y) in ObsView((X, Y))
@assert typeof(x) <: SubArray{Float64,1}
@assert typeof(y) <: String
end
# same but in random order
for (x, y) in ObsView(shuffleobs((X, Y)))
@assert typeof(x) <: SubArray{Float64,1}
@assert typeof(y) <: String
end
# Indexing: take first 10 observations
x, y = ObsView((X, Y))[1:10]
```


`MLUtils.ones_like`

— Function`ones_like(x, [element_type=eltype(x)], [dims=size(x)])`

Create an array with the given element type and size, based upon the given source array `x`. All elements of the new array will be set to 1. The second and third arguments are both optional, defaulting to the given array's eltype and size. The dimensions may be specified as an integer or as a tuple argument.

See also `zeros_like` and `fill_like`.

**Examples**

```
julia> x = rand(Float32, 2)
2-element Vector{Float32}:
0.8621633
0.5158395
julia> ones_like(x, (3, 3))
3×3 Matrix{Float32}:
1.0 1.0 1.0
1.0 1.0 1.0
1.0 1.0 1.0
julia> using CUDA
julia> x = CUDA.rand(2, 2)
2×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
0.82297 0.656143
0.701828 0.391335
julia> ones_like(x, Float64)
2×2 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:
1.0 1.0
1.0 1.0
```

`MLUtils.oversample`

— Function

```
oversample(data, classes; fraction=1, shuffle=true)
oversample(data::Tuple; fraction=1, shuffle=true)
```

Generate a re-balanced version of `data` by repeatedly sampling existing observations in such a way that every class will have at least `fraction` times the number of observations of the largest class in `classes`. This way, all classes will have a minimum number of observations in the resulting data set relative to what the largest class has in the given (original) `data`.

As an example, by default (i.e. with `fraction = 1`) the resulting dataset will be near perfectly balanced. On the other hand, with `fraction = 0.5` every class in the resulting data will have at least 50% as many observations as the largest class.

The `classes` input is an array with the same length as `numobs(data)`.

The convenience parameter `shuffle` determines if the resulting data will be shuffled after its creation; if it is not shuffled then all the repeated samples will be together at the end, sorted by class. Defaults to `true`.

The output will contain both the resampled data and classes.

```
# 6 observations with 3 features each
X = rand(3, 6)
# 2 classes, severely imbalanced
Y = ["a", "b", "b", "b", "b", "a"]
# oversample the class "a" to match "b"
X_bal, Y_bal = oversample(X, Y)
# this results in a bigger dataset with repeated data
@assert size(X_bal) == (3,8)
@assert length(Y_bal) == 8
# now both "a", and "b" have 4 observations each
@assert sum(Y_bal .== "a") == 4
@assert sum(Y_bal .== "b") == 4
```
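The counting rule behind this example can be sketched in plain Julia by working on observation indices only. `oversample_indices_sketch` is a hypothetical helper, not the MLUtils implementation; it assumes the extra observations are drawn with replacement from each under-represented class and it does not shuffle.

```julia
# Hypothetical sketch: indices of a re-balanced dataset (unshuffled).
function oversample_indices_sketch(classes::AbstractVector; fraction = 1)
    # group observation indices by class label
    groups = Dict{eltype(classes), Vector{Int}}()
    for (i, c) in enumerate(classes)
        push!(get!(groups, c, Int[]), i)
    end
    # every class must reach `fraction` times the size of the largest class
    target = ceil(Int, fraction * maximum(length, values(groups)))
    idxs = collect(1:length(classes))   # keep all original observations
    for members in values(groups)
        shortfall = target - length(members)
        # assumption: repeats are sampled with replacement from the same class
        shortfall > 0 && append!(idxs, rand(members, shortfall))
    end
    return idxs
end

Y = ["a", "b", "b", "b", "b", "a"]
idxs = oversample_indices_sketch(Y)
length(idxs)              # 8
count(==("a"), Y[idxs])   # 4
```

The repeated indices are appended after the original ones, which corresponds to the unshuffled layout described above.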

For this function to work, the type of `data` must implement `numobs` and `getobs`.

Note that if `data` is a tuple and `classes` is not given, then it will be assumed that the last element of the tuple contains the classes.

```
julia> data = DataFrame(X1=rand(6), X2=rand(6), Y=[:a,:b,:b,:b,:b,:a])
6×3 DataFrames.DataFrame
│ Row │ X1 │ X2 │ Y │
├─────┼───────────┼─────────────┼───┤
│ 1 │ 0.226582 │ 0.0443222 │ a │
│ 2 │ 0.504629 │ 0.722906 │ b │
│ 3 │ 0.933372 │ 0.812814 │ b │
│ 4 │ 0.522172 │ 0.245457 │ b │
│ 5 │ 0.505208 │ 0.11202 │ b │
│ 6 │ 0.0997825 │ 0.000341996 │ a │
julia> getobs(oversample(data, data.Y))
8×3 DataFrame
Row │ X1 X2 Y
│ Float64 Float64 Symbol
─────┼─────────────────────────────
1 │ 0.376304 0.100022 a
2 │ 0.467095 0.185437 b
3 │ 0.481957 0.319906 b
4 │ 0.336762 0.390811 b
5 │ 0.376304 0.100022 a
6 │ 0.427064 0.0648339 a
7 │ 0.427064 0.0648339 a
8 │ 0.457043 0.490688 b
```

See `ObsView` for more information on data subsets. See also `undersample`.

`MLUtils.randobs`

— Function

`MLUtils.rand_like`

— Function`rand_like([rng=default_rng()], x, [element_type=eltype(x)], [dims=size(x)])`

Create an array with the given element type and size, based upon the given source array `x`. All elements of the new array will be set to a random value. The last two arguments are both optional, defaulting to the given array's eltype and size. The dimensions may be specified as an integer or as a tuple argument.

The default random number generator is used, unless a custom one is passed in explicitly as the first argument.

See also `Base.rand` and `randn_like`.

**Examples**

```
julia> x = ones(Float32, 2)
2-element Vector{Float32}:
1.0
1.0
julia> rand_like(x, (3, 3))
3×3 Matrix{Float32}:
0.780032 0.920552 0.53689
0.121451 0.741334 0.5449
0.55348 0.138136 0.556404
julia> using CUDA
julia> x = CUDA.ones(2, 2)
2×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
1.0 1.0
1.0 1.0
julia> rand_like(x, Float64)
2×2 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:
0.429274 0.135379
0.718895 0.0098756
```

`MLUtils.randn_like`

— Function`randn_like([rng=default_rng()], x, [element_type=eltype(x)], [dims=size(x)])`

Create an array with the given element type and size, based upon the given source array `x`

. All element of the new array will be set to a random value drawn from a normal distribution. The last two arguments are both optional, defaulting to the given array's eltype and size. The dimensions may be specified as an integer or as a tuple argument.

The default random number generator is used, unless a custom one is passed in explicitly as the first argument.

See also `Base.randn` and `rand_like`.

**Examples**

```
julia> x = ones(Float32, 2)
2-element Vector{Float32}:
1.0
1.0
julia> randn_like(x, (3, 3))
3×3 Matrix{Float32}:
-0.385331 0.956231 0.0745102
1.43756 -0.967328 2.06311
0.0482372 1.78728 -0.902547
julia> using CUDA
julia> x = CUDA.ones(2, 2)
2×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
1.0 1.0
1.0 1.0
julia> randn_like(x, Float64)
2×2 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:
-0.578527 0.823445
-1.01338 -0.612053
```
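The optional `rng` argument described above can be used to make draws reproducible. The following is a minimal sketch on CPU arrays, assuming the `rng`-first call form from the signatures above; the seeds and the noise scale `0.1f0` are arbitrary choices for illustration.

```julia
using MLUtils, Random

x = ones(Float32, 3, 4)

# Passing an explicit RNG as the first argument makes the draw reproducible
u1 = rand_like(MersenneTwister(1), x)
u2 = rand_like(MersenneTwister(1), x)
same_draw = (u1 == u2)   # same seed, same values

# Gaussian noise with the same element type and size as x
noise = randn_like(MersenneTwister(2), x)
x_noisy = x .+ 0.1f0 .* noise
```

Because the noise inherits the element type of `x`, adding it does not promote a `Float32` array to `Float64`.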

`MLUtils.rpad_constant` — Function

`rpad_constant(v::AbstractArray, n::Union{Integer, Tuple}, val = 0; dims=:)`

Return the given array `v` padded with `val` along the dimensions `dims`, up to the length specified by `n` in each of those dimensions. Dimensions that are already at least as long as `n` are left unchanged.

**Examples**

```
julia> rpad_constant([1, 2], 4, -1) # padding with -1 up to size 4
4-element Vector{Int64}:
1
2
-1
-1
julia> rpad_constant([1, 2, 3], 2) # no padding if length is already greater than n
3-element Vector{Int64}:
1
2
3
julia> rpad_constant([1 2; 3 4], 4; dims=1) # padding along the first dimension
4×2 Matrix{Int64}:
1 2
3 4
0 0
0 0
julia> rpad_constant([1 2; 3 4], 4) # padding along all dimensions by default
4×2 Matrix{Int64}:
1 2
3 4
0 0
0 0
```
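A common use of right-padding is to bring variable-length sequences to a common length before batching them. The sketch below uses only the vector form shown in the examples above; the sequences and the choice of `0` as the pad value are made up for illustration.

```julia
using MLUtils

seqs = [[1, 2, 3], [4], [5, 6]]   # variable-length sequences
maxlen = maximum(length, seqs)

# Pad every sequence on the right with 0 up to the longest length
padded = [rpad_constant(s, maxlen) for s in seqs]

# Stack into a maxlen × num_sequences matrix, one sequence per column
batch = reduce(hcat, padded)
```

Each column of `batch` holds one sequence, so the last dimension is the observation dimension, as the rest of this page assumes.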

`MLUtils.shuffleobs` — Function

`shuffleobs([rng], data)`

Return a "subset" of `data` that spans all observations, but has the order of the observations shuffled.

The values of `data` itself are not copied. Instead only the indices are shuffled. This function calls `obsview` to accomplish that, which means that the return value is likely of a different type than `data`.

```
# For Arrays the subset will be of type SubArray
X = rand(4, 10)
@assert typeof(shuffleobs(X)) <: SubArray
# Iterate through all observations in random order
for x in eachobs(shuffleobs(X))
    # ... process the observation `x` here
end
```

The optional parameter `rng` allows one to specify the random number generator used for shuffling. This is useful when reproducible results are desired. By default, uses the global RNG. See `Random` in Julia's standard library for more info.

For this function to work, the type of `data` must implement `numobs` and `getobs`. See `ObsView` for more information.
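As a minimal sketch of the reproducibility point above, the same seed should give the same permutation. The seed `42` and the toy matrix are arbitrary; the `rng`-first call form follows the signature `shuffleobs([rng], data)`.

```julia
using MLUtils, Random

# 10 observations with 2 features each, filled with 1.0, 2.0, ..., 20.0
X = reshape(collect(1.0:20.0), 2, 10)

# Two shuffles seeded identically produce the same order of observations
A = shuffleobs(MersenneTwister(42), X)
B = shuffleobs(MersenneTwister(42), X)
same_order = (A == B)

# Every observation is still present; only the order has changed
all_present = sort(vec(collect(A))) == vec(X)
```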

`MLUtils.splitobs` — Function

`splitobs(n::Int; at) -> Tuple`

Compute the indices for two or more disjoint subsets of the range `1:n` with splits given by `at`.

**Examples**

```
julia> splitobs(100, at=0.7)
(1:70, 71:100)
julia> splitobs(100, at=(0.1, 0.4))
(1:10, 11:50, 51:100)
```

`splitobs(data; at, shuffle=false) -> Tuple`

Partition the `data` into two or more subsets. When `at` is a number (between 0 and 1), it specifies the proportion in the first subset. When `at` is a tuple, each entry specifies the proportion in a subset, with the last subset receiving the remaining `1-sum(at)`. In all, `length(at)+1` subsets are returned.

If `shuffle=true`, randomly permute the observations before splitting.

Supports any datatype implementing the `numobs` and `getobs` interfaces, including arrays, tuples and NamedTuples of arrays.

**Examples**

```
julia> splitobs(permutedims(1:100); at=0.7) # simple 70%-30% split, of a matrix
([1 2 … 69 70], [71 72 … 99 100])
julia> data = (x=ones(2,10), n=1:10) # a NamedTuple, consistent last dimension
(x = [1.0 1.0 … 1.0 1.0; 1.0 1.0 … 1.0 1.0], n = 1:10)
julia> splitobs(data, at=(0.5, 0.3)) # a 50%-30%-20% split, e.g. train/test/validation
((x = [1.0 1.0 … 1.0 1.0; 1.0 1.0 … 1.0 1.0], n = 1:5), (x = [1.0 1.0 1.0; 1.0 1.0 1.0], n = 6:8), (x = [1.0 1.0; 1.0 1.0], n = 9:10))
julia> train, test = splitobs((permutedims(1.0:100.0), 101:200), at=0.7, shuffle=true); # split a Tuple
julia> vec(test[1]) .+ 100 == test[2]
true
```
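A three-way split with shuffling can be sketched as follows. The field names `x` and `y` and the toy data are assumptions for illustration; the proportions are the same `(0.5, 0.3)` as in the example above, so 10 observations are divided into subsets of 5, 3 and 2.

```julia
using MLUtils

# Features and targets share the last (observation) dimension;
# the i-th target equals the single feature of the i-th observation
data = (x = permutedims(collect(1.0:10.0)), y = collect(1.0:10.0))

train, val, test = splitobs(data; at=(0.5, 0.3), shuffle=true)

sizes = (numobs(train), numobs(val), numobs(test))

# Shuffling permutes x and y together, so the pairing is preserved
paired = vec(train.x) == train.y
```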

`MLUtils.unbatch` — Function

`MLUtils.undersample` — Function

`undersample(data, classes; shuffle=true)`

Generate a class-balanced version of `data` by subsampling its observations in such a way that the resulting number of observations will be the same number for every class. This way, all classes will have as many observations in the resulting data set as the smallest class has in the given (original) `data`.

The convenience parameter `shuffle` determines if the resulting data will be shuffled after its creation; if it is not shuffled then all the observations will be in their original order. Defaults to `true`, as shown in the signature.

The output will contain both the resampled data and classes.

```
# 6 observations with 3 features each
X = rand(3, 6)
# 2 classes, severely imbalanced
Y = ["a", "b", "b", "b", "b", "a"]
# subsample the class "b" to match "a"
X_bal, Y_bal = undersample(X, Y)
# this results in a smaller dataset
@assert size(X_bal) == (3,4)
@assert length(Y_bal) == 4
# now both "a", and "b" have 2 observations each
@assert sum(Y_bal .== "a") == 2
@assert sum(Y_bal .== "b") == 2
```

For this function to work, the type of `data` must implement `numobs` and `getobs`.

Note that if `data` is a tuple, then it will be assumed that the last element of the tuple contains the targets.

```
julia> data = DataFrame(X1=rand(6), X2=rand(6), Y=[:a,:b,:b,:b,:b,:a])
6×3 DataFrames.DataFrame
│ Row │ X1 │ X2 │ Y │
├─────┼───────────┼─────────────┼───┤
│ 1 │ 0.226582 │ 0.0443222 │ a │
│ 2 │ 0.504629 │ 0.722906 │ b │
│ 3 │ 0.933372 │ 0.812814 │ b │
│ 4 │ 0.522172 │ 0.245457 │ b │
│ 5 │ 0.505208 │ 0.11202 │ b │
│ 6 │ 0.0997825 │ 0.000341996 │ a │
julia> getobs(undersample(data, data.Y))
4×3 DataFrame
Row │ X1 X2 Y
│ Float64 Float64 Symbol
─────┼─────────────────────────────
1 │ 0.427064 0.0648339 a
2 │ 0.376304 0.100022 a
3 │ 0.467095 0.185437 b
4 │ 0.457043 0.490688 b
```

See `ObsView` for more information on data subsets. See also `oversample`.
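The idea behind undersampling can be sketched in plain Julia. This is not the MLUtils implementation: `balanced_indices` is a hypothetical helper written for illustration, and it always returns the selected indices in their original order.

```julia
using Random

# Hypothetical helper, not part of MLUtils: returns the indices of a
# class-balanced subset, keeping the original observation order.
function balanced_indices(rng::AbstractRNG, classes::AbstractVector)
    # Group observation indices by class label
    groups = Dict{eltype(classes), Vector{Int}}()
    for (i, c) in enumerate(classes)
        push!(get!(groups, c, Int[]), i)
    end
    # Size of the smallest class
    nmin = minimum(length, values(groups))
    # Draw nmin indices per class without replacement
    inds = Int[]
    for idx in values(groups)
        append!(inds, shuffle(rng, idx)[1:nmin])
    end
    return sort!(inds)
end

Y = ["a", "b", "b", "b", "b", "a"]
inds = balanced_indices(MersenneTwister(0), Y)
Y_bal = Y[inds]
```

Both observations of class `"a"` are always kept, while two of the four `"b"` observations are chosen at random, which mirrors the class counts asserted in the example above.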

`MLUtils.unsqueeze` — Function

`unsqueeze(x; dims)`

Return `x` reshaped into an array with one more dimension than `x`, where `dims` indicates the position at which the new dimension of size 1 is inserted. `dims` can be an integer between 1 and `ndims(x)+1`.

**Examples**

```
julia> unsqueeze([1 2; 3 4], dims=2)
2×1×2 Array{Int64, 3}:
[:, :, 1] =
1
3
[:, :, 2] =
2
4
julia> xs = [[1, 2], [3, 4], [5, 6]]
3-element Vector{Vector{Int64}}:
[1, 2]
[3, 4]
[5, 6]
julia> unsqueeze(xs, dims=1)
1×3 Matrix{Vector{Int64}}:
[1, 2] [3, 4] [5, 6]
```

`unsqueeze(; dims)`

Returns a function which, acting on an array, inserts a dimension of size 1 at `dims`.

**Examples**

```
julia> rand(21, 22, 23) |> unsqueeze(dims=2) |> size
(21, 1, 22, 23)
```
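One typical use is adding the singleton channel and batch dimensions that image models expect. The sketch below assumes a width × height × channels × batch layout; the 28×28 image size is an arbitrary choice for illustration.

```julia
using MLUtils

img = rand(Float32, 28, 28)    # a single grayscale image

x = unsqueeze(img; dims=3)     # add a channel dimension
x = unsqueeze(x; dims=4)       # add a batch (observation) dimension
sz = size(x)

# The values are unchanged; only the shape differs
same_values = (x == reshape(img, 28, 28, 1, 1))
```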

`MLUtils.unstack` — Function

`MLUtils.zeros_like` — Function

`zeros_like(x, [element_type=eltype(x)], [dims=size(x)])`

Create an array with the given element type and size, based upon the given source array `x`. All elements of the new array will be set to 0. The second and third arguments are both optional, defaulting to the given array's eltype and size. The dimensions may be specified as an integer or as a tuple argument.

See also `ones_like` and `fill_like`.

**Examples**

```
julia> x = rand(Float32, 2)
2-element Vector{Float32}:
0.4005432
0.36934233
julia> zeros_like(x, (3, 3))
3×3 Matrix{Float32}:
0.0 0.0 0.0
0.0 0.0 0.0
0.0 0.0 0.0
julia> using CUDA
julia> x = CUDA.rand(2, 2)
2×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
0.0695155 0.667979
0.558468 0.59903
julia> zeros_like(x, Float64)
2×2 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:
0.0 0.0
0.0 0.0
```