# Working with data using MLUtils.jl

Flux re-exports the `DataLoader` type and utility functions for working with data from MLUtils.

## DataLoader

`DataLoader` can be used to handle iteration over mini-batches of data.

Flux's website has a dedicated tutorial on `DataLoader` for more information.

`MLUtils.DataLoader` - Type

`DataLoader(data; [batchsize, buffer, collate, parallel, partial, rng, shuffle])`

An object that iterates over mini-batches of `data`, each mini-batch containing `batchsize` observations (except possibly the last one).

Takes as input a single data array, a tuple (or a named tuple) of arrays, or in general any `data` object that implements the `numobs` and `getobs` methods.

The last dimension in each array is the observation dimension, i.e. the one divided into mini-batches.

The original data is preserved in the `data` field of the DataLoader.

**Arguments**

- `data`: The data to be iterated over. The data type has to be supported by `numobs` and `getobs`.
- `batchsize`: If less than 0, iterates over individual observations. Otherwise, each iteration (except possibly the last) yields a mini-batch containing `batchsize` observations. Default `1`.
- `buffer`: If `buffer=true` and supported by the type of `data`, a buffer will be allocated and reused for memory efficiency. You can also pass a preallocated object to `buffer`. Default `false`.
- `collate`: Batching behavior. If `nothing` (default), a batch is `getobs(data, indices)`. If `false`, each batch is `[getobs(data, i) for i in indices]`. When `true`, applies `batch` to the vector of observations in a batch, recursively collating arrays in the last dimensions. See `batch` for more information and examples.
- `parallel`: Whether to load data in parallel using worker threads. This can greatly speed up data loading, by up to a factor of the number of available threads. Requires starting Julia with multiple threads. Check `Threads.nthreads()` to see the number of available threads. **Passing `parallel = true` breaks ordering guarantees**. Default `false`.
- `partial`: This argument is used only when `batchsize > 0`. If `partial=false` and the number of observations is not divisible by the batchsize, then the last mini-batch is dropped. Default `true`.
- `rng`: A random number generator. Default `Random.GLOBAL_RNG`.
- `shuffle`: Whether to shuffle the observations before iterating. Unlike wrapping the data container with `shuffleobs(data)`, `shuffle=true` ensures that the observations are shuffled anew every time you start iterating over the `DataLoader`. Default `false`.

**Examples**

```
julia> Xtrain = rand(10, 100);

julia> array_loader = DataLoader(Xtrain, batchsize=2);

julia> for x in array_loader
         @assert size(x) == (10, 2)
         # do something with x, 50 times
       end

julia> array_loader.data === Xtrain
true

julia> tuple_loader = DataLoader((Xtrain,), batchsize=2);  # similar, but yielding 1-element tuples

julia> for x in tuple_loader
         @assert x isa Tuple{Matrix}
         @assert size(x[1]) == (10, 2)
       end

julia> Ytrain = rand('a':'z', 100);  # now make a DataLoader yielding 2-element named tuples

julia> train_loader = DataLoader((data=Xtrain, label=Ytrain), batchsize=5, shuffle=true);

julia> for epoch in 1:100
         for (x, y) in train_loader  # access via tuple destructuring
           @assert size(x) == (10, 5)
           @assert size(y) == (5,)
           # loss += f(x, y) # etc, runs 100 * 20 times
         end
       end

julia> first(train_loader).label isa Vector{Char}  # access via property name
true

julia> first(train_loader).label == Ytrain[1:5]  # because of shuffle=true
false

julia> foreach(println∘summary, DataLoader(rand(Int8, 10, 64), batchsize=30))  # partial=false would omit last
10×30 Matrix{Int8}
10×30 Matrix{Int8}
10×4 Matrix{Int8}
```
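The effect of the `partial` argument can be checked directly. The following is a minimal sketch, assuming `MLUtils` is loaded (with Flux, `using Flux` should expose `DataLoader` as well); the helper `batch_widths` is a hypothetical name introduced only for this illustration.

```julia
using MLUtils

X = rand(Float32, 10, 64)   # 64 observations along the last dimension

# Hypothetical helper: record how many observations each mini-batch holds.
function batch_widths(loader)
    widths = Int[]
    for x in loader
        push!(widths, size(x, 2))
    end
    return widths
end

# partial=true is the default, so the final, smaller batch is kept.
with_partial = batch_widths(DataLoader(X, batchsize=30))

# partial=false drops the final batch because 64 is not divisible by 30.
without_partial = batch_widths(DataLoader(X, batchsize=30, partial=false))
```

With 64 observations and `batchsize=30`, the first loader yields batches of 30, 30 and 4 observations, while the second yields only the two full batches.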

## Utility functions for working with data

The utility functions are meant to be used while working with data; these functions help create inputs for your models or batch your dataset.

Below is a non-exhaustive list of such utility functions.

`MLUtils.unsqueeze` - Function

`unsqueeze(x; dims)`

Return `x` reshaped into an array one dimensionality higher than `x`, where `dims` indicates in which dimension `x` is extended.

**Examples**

```
julia> unsqueeze([1 2; 3 4], dims=2)
2×1×2 Array{Int64, 3}:
[:, :, 1] =
1
3
[:, :, 2] =
2
4
julia> xs = [[1, 2], [3, 4], [5, 6]]
3-element Vector{Vector{Int64}}:
[1, 2]
[3, 4]
[5, 6]
julia> unsqueeze(xs, dims=1)
1×3 Matrix{Vector{Int64}}:
[1, 2] [3, 4] [5, 6]
```

`unsqueeze(; dims)`

Returns a function which, acting on an array, inserts a dimension of size 1 at `dims`.

**Examples**

```
julia> rand(21, 22, 23) |> unsqueeze(dims=2) |> size
(21, 1, 22, 23)
```
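A common use of `unsqueeze` is adding singleton channel and batch dimensions to a single sample before passing it to a model. The sketch below assumes a hypothetical 28×28 grayscale image; the variable names are illustrative only.

```julia
using MLUtils

img = rand(Float32, 28, 28)       # hypothetical single grayscale image

x = unsqueeze(img, dims=3)        # add a channel dimension: 28×28×1
x = unsqueeze(x, dims=4)          # add a batch dimension:   28×28×1×1

# The same result using the curried form and the pipe operator.
y = img |> unsqueeze(dims=3) |> unsqueeze(dims=4)
```

Both forms only reshape the array, so the element values and their order are unchanged.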

`MLUtils.flatten` - Function

`flatten(x::AbstractArray)`

Reshape arbitrarily-shaped input into a matrix-shaped output, preserving the size of the last dimension.

See also `unsqueeze`.

**Examples**

```
julia> rand(3,4,5) |> flatten |> size
(12, 5)
```
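As an illustration, `flatten` can turn a batch of images into a matrix with one column per observation, for example before a dense layer. This is a minimal sketch with a hypothetical batch of 64 single-channel 28×28 images; the function is qualified as `MLUtils.flatten` to avoid any ambiguity with other `flatten` functions.

```julia
using MLUtils

imgs = rand(Float32, 28, 28, 1, 64)   # hypothetical image batch, batch dimension last

flat = MLUtils.flatten(imgs)          # 784×64: one column per image

# Equivalent reshape written by hand, keeping the last dimension.
manual = reshape(imgs, :, size(imgs, 4))
```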

`MLUtils.stack` - Function

`stack(xs; dims)`

Concatenate the given array of arrays `xs` into a single array along the given dimension `dims`.

**Examples**

```
julia> xs = [[1, 2], [3, 4], [5, 6]]
3-element Vector{Vector{Int64}}:
[1, 2]
[3, 4]
[5, 6]
julia> stack(xs, dims=1)
3×2 Matrix{Int64}:
1 2
3 4
5 6
julia> stack(xs, dims=2)
2×3 Matrix{Int64}:
1 3 5
2 4 6
julia> stack(xs, dims=3)
2×1×3 Array{Int64, 3}:
[:, :, 1] =
1
2
[:, :, 2] =
3
4
[:, :, 3] =
5
6
```

`MLUtils.unstack` - Function

`MLUtils.numobs` - Function

`numobs(data)`

Return the total number of observations contained in `data`.

If `data` does not have `numobs` defined, then this function falls back to `length(data)`. Authors of custom data containers should implement `Base.length` for their type instead of `numobs`. `numobs` should only be implemented for types where there is a difference between `numobs` and `Base.length` (such as multi-dimensional arrays).

See also `getobs`.

`MLUtils.getobs` - Function

`getobs(data, [idx])`

Return the observations corresponding to the observation-index `idx`. Note that `idx` can be any type as long as `data` has defined `getobs` for that type.

If `data` does not have `getobs` defined, then this function falls back to `data[idx]`. Authors of custom data containers should implement `Base.getindex` for their type instead of `getobs`. `getobs` should only be implemented for types where there is a difference between `getobs` and `Base.getindex` (such as multi-dimensional arrays).

The returned observation(s) should be in the form intended to be passed as-is to some learning algorithm. There is no strict interface requirement on what this "actual data" must look like. Every author of a custom data container can make this decision themselves. The output should be consistent whether `idx` is a scalar or a vector.
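The fallback behavior described for `numobs` and `getobs` can be sketched with a small custom container that implements only `Base.length` and `Base.getindex`. The type `SquaresDataset` is hypothetical and exists purely for illustration.

```julia
using MLUtils

# Hypothetical container: observation i is the pair (i, i^2), computed on demand.
struct SquaresDataset
    n::Int
end

# Implementing the Base interface is enough; numobs and getobs fall back to it.
Base.length(d::SquaresDataset) = d.n
Base.getindex(d::SquaresDataset, i::Integer) = (i, i^2)
Base.getindex(d::SquaresDataset, idxs::AbstractVector) = [d[i] for i in idxs]

ds = SquaresDataset(10)

n_obs  = numobs(ds)        # falls back to length(ds)
single = getobs(ds, 3)     # falls back to ds[3]
several = getobs(ds, 2:4)  # falls back to ds[2:4]
```

Handling both scalar and vector indices in `getindex` keeps the output consistent for single observations and for batches of indices.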

`MLUtils.getobs!` - Function

`getobs!(buffer, data, idx)`

In-place version of `getobs(data, idx)`. If this method is defined for the type of `data`, then `buffer` should be used to store the result, instead of allocating a dedicated object.

Implementing this function is optional. If no such method is provided for the type of `data`, then `buffer` will be *ignored* and the result of `getobs` returned. This could be because the type of `data` may not lend itself to the concept of `copy!`.

`MLUtils.chunk` - Function

```
chunk(x, n; [dims])
chunk(x; [size, dims])
```

Split `x` into `n` parts or alternatively, into equal chunks of size `size`. The parts contain the same number of elements except possibly for the last one that can be smaller.

If `x` is an array, `dims` can be used to specify along which dimension to split (defaults to the last dimension).

**Examples**

```
julia> chunk(1:10, 3)
3-element Vector{UnitRange{Int64}}:
1:4
5:8
9:10
julia> chunk(1:10; size = 2)
5-element Vector{UnitRange{Int64}}:
1:2
3:4
5:6
7:8
9:10
julia> x = reshape(collect(1:20), (5, 4))
5×4 Matrix{Int64}:
1 6 11 16
2 7 12 17
3 8 13 18
4 9 14 19
5 10 15 20
julia> xs = chunk(x, 2, dims=1)
2-element Vector{SubArray{Int64, 2, Matrix{Int64}, Tuple{UnitRange{Int64}, Base.Slice{Base.OneTo{Int64}}}, false}}:
[1 6 11 16; 2 7 12 17; 3 8 13 18]
[4 9 14 19; 5 10 15 20]
julia> xs[1]
3×4 view(::Matrix{Int64}, 1:3, :) with eltype Int64:
1 6 11 16
2 7 12 17
3 8 13 18
julia> xes = chunk(x; size = 2, dims = 2)
2-element Vector{SubArray{Int64, 2, Matrix{Int64}, Tuple{Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}}, true}}:
[1 6; 2 7; … ; 4 9; 5 10]
[11 16; 12 17; … ; 14 19; 15 20]
julia> xes[2]
5×2 view(::Matrix{Int64}, :, 3:4) with eltype Int64:
11 16
12 17
13 18
14 19
15 20
```
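One way to sanity-check a split is to verify that the parts have the expected sizes and that concatenating them restores the original array. The following is a minimal sketch relying only on the `chunk(x, n; dims)` form documented above.

```julia
using MLUtils

X = reshape(collect(1:20), (5, 4))

# With no dims given, the split is along the last dimension (columns here).
parts = chunk(X, 2)
part_sizes = size.(parts)

# Concatenating the parts along the same dimension restores X.
restored = hcat(parts...)

# For a range, the last part may be smaller than the others.
range_lengths = length.(chunk(1:10, 3))
```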

`MLUtils.group_counts` - Function

`group_counts(x)`

Count the number of times that each element of `x` appears.

See also `group_indices`.

**Examples**

```
julia> group_counts(['a', 'b', 'b'])
Dict{Char, Int64} with 2 entries:
'a' => 1
'b' => 2
```

`MLUtils.group_indices` - Function

`group_indices(x) -> Dict`

Computes the indices of elements in the vector `x` for each distinct value contained. This information is useful for resampling strategies, such as stratified sampling.

See also `group_counts`.

**Examples**

```
julia> x = [:yes, :no, :maybe, :yes];
julia> group_indices(x)
Dict{Symbol, Vector{Int64}} with 3 entries:
:yes => [1, 4]
:maybe => [3]
:no => [2]
```
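To illustrate how these two functions relate, the sketch below groups a hypothetical label vector and picks one representative index per class. Taking the first occurrence is a deterministic stand-in for the random draw a real stratified sampler would make.

```julia
using MLUtils

labels = [:cat, :dog, :cat, :bird, :dog, :cat]   # hypothetical class labels

idxs   = group_indices(labels)   # class => positions where it occurs
counts = group_counts(labels)    # class => number of occurrences

# One representative per class (first occurrence), sorted for a stable order.
picked = sort([first(v) for v in values(idxs)])

# Each group holds exactly as many indices as group_counts reports.
consistent = all(length(idxs[k]) == counts[k] for k in keys(counts))
```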

`MLUtils.batch` - Function

`batch(xs)`

Batch the arrays in `xs` into a single array with an extra dimension.

If the elements of `xs` are tuples, named tuples, or dicts, the output will be of the same type.

See also `unbatch`.

**Examples**

```
julia> batch([[1,2,3],
[4,5,6]])
3×2 Matrix{Int64}:
1 4
2 5
3 6
julia> batch([(a=[1,2], b=[3,4])
(a=[5,6], b=[7,8])])
(a = [1 5; 2 6], b = [3 7; 4 8])
```
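Since `batch` points to `unbatch` as its counterpart, a round trip is a quick way to see how the two relate. This is a minimal sketch, assuming `unbatch` splits an array along its last dimension, which is the reverse of what `batch` does to a vector of arrays.

```julia
using MLUtils

xs = [[1, 2, 3], [4, 5, 6]]

b = batch(xs)        # 3×2 matrix, one column per original vector
ys = unbatch(b)      # assumed to split b back along its last dimension
```

Under that assumption, `ys` has the same elements as `xs`.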

`MLUtils.unbatch` - Function

`MLUtils.batchseq` - Function

`batchseq(seqs, pad)`

Take a list of `N` sequences, and turn them into a single sequence where each item is a batch of `N`. Short sequences will be padded by `pad`.

**Examples**

```
julia> batchseq([[1, 2, 3], [4, 5]], 0)
3-element Vector{Vector{Int64}}:
[1, 4]
[2, 5]
[3, 0]
```
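The same call extends to more than two sequences of differing lengths. The sketch below assumes the behavior shown in the example above (a vector with one batch per time step); the return type may differ in other MLUtils versions.

```julia
using MLUtils

seqs = [[1, 2, 3], [4, 5], [6]]   # three hypothetical token sequences

steps = batchseq(seqs, 0)         # pad the shorter sequences with 0

n_steps = length(steps)           # one entry per time step of the longest sequence
```

Each entry of `steps` holds the values of all three sequences at that time step, with `0` filling positions past the end of a shorter sequence.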

`Base.rpad` - Method

`rpad(v::AbstractVector, n::Integer, p)`

Return the given sequence padded with `p` up to a length of `n`. Sequences that already have `n` or more elements are not shortened.

**Examples**

```
julia> rpad([1, 2], 4, 0)
4-element Vector{Int64}:
1
2
0
0
julia> rpad([1, 2, 3], 2, 0)
3-element Vector{Int64}:
1
2
3
```