One-Hot Encoding with OneHotArrays.jl

It's common to encode categorical variables (like true, false or cat, dog) in "one-of-k" or "one-hot" form. OneHotArrays.jl provides the onehot function to make this easy.

julia> using OneHotArrays

julia> onehot(:b, [:a, :b, :c])
3-element OneHotVector(::UInt32) with eltype Bool:
 ⋅
 1
 ⋅

julia> onehot(:c, [:a, :b, :c])
3-element OneHotVector(::UInt32) with eltype Bool:
 ⋅
 ⋅
 1

The onecold function is the inverse of onehot. It also accepts an array of numbers instead of Booleans, in which case it performs an argmax-like operation and returns the label with the highest corresponding weight.

julia> onecold(ans, [:a, :b, :c])
:c

julia> onecold([true, false, false], [:a, :b, :c])
:a

julia> onecold([0.3, 0.2, 0.5], [:a, :b, :c])
:c

For multiple samples at once, onehotbatch creates a batch (matrix) of one-hot vectors, and onecold treats matrices as batches.

julia> onehotbatch([:b, :a, :b], [:a, :b, :c])
3×3 OneHotMatrix(::Vector{UInt32}) with eltype Bool:
 ⋅  1  ⋅
 1  ⋅  1
 ⋅  ⋅  ⋅

julia> onecold(ans, [:a, :b, :c])
3-element Vector{Symbol}:
 :b
 :a
 :b

Note that these operations return OneHotVector and OneHotMatrix rather than Arrays. OneHotVectors behave like normal vectors but avoid any unnecessary cost compared to using an integer index directly. For example, multiplying a matrix by a one-hot vector simply slices out the relevant column of the matrix under the hood.
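As a small check of the multiplication behaviour described above, the following sketch compares the product with a plain column slice. The matrix A and the label set are arbitrary values chosen for illustration; they are not part of the package.

```julia
using OneHotArrays

# Illustrative 3×3 matrix (values are arbitrary).
A = [1 3 5
     2 4 6
     3 6 9]

# One-hot vector with its single 1 in position 2.
v = onehot(:b, [:a, :b, :c])

# Multiplying by the one-hot vector selects the matching column of A.
product = A * v
column = A[:, 2]
println(product == column)
```

The comparison should print true, because a one-hot vector with its 1 in position 2 picks out the second column.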

Function listing

OneHotArrays.onehot — Function

onehot(x, labels, [default])

Returns a OneHotVector, which is roughly a sparse representation of x .== labels.

Instead of storing, say, a Vector{Bool}, it stores the index of the first occurrence of x in labels. If x is not found in labels, then it either returns onehot(default, labels), or gives an error if no default is given.

See also onehotbatch to apply this to many xs, and onecold to reverse either of these, as well as to generalise argmax.

Examples

julia> β = onehot(:b, (:a, :b, :c))
3-element OneHotVector(::UInt32) with eltype Bool:
 ⋅
 1
 ⋅

julia> αβγ = (onehot(0, 0:2), β, onehot(:z, [:a, :b, :c], :c))  # uses default
(Bool[1, 0, 0], Bool[0, 1, 0], Bool[0, 0, 1])

julia> hcat(αβγ...)  # preserves sparsity
3×3 OneHotMatrix(::Vector{UInt32}) with eltype Bool:
 1  ⋅  ⋅
 ⋅  1  ⋅
 ⋅  ⋅  1
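The fallback and first-occurrence rules described for onehot can be exercised directly. This is a minimal sketch; the label sets, including the one with a repeated label, are made up for illustration, and the exact error type raised is not assumed here.

```julia
using OneHotArrays

labels = [:a, :b, :c]

# An unknown value falls back to the default label when one is supplied.
fallback = onecold(onehot(:z, labels, :c), labels)

# Without a default, an unknown value raises an error (type not assumed).
raised = try
    onehot(:z, labels)
    false
catch
    true
end

# With a repeated label, the index of the first occurrence is stored.
first_match = onecold(onehot(:a, [:a, :b, :a]))

println((fallback, raised, first_match))
```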

OneHotArrays.onecold — Function

onecold(y::AbstractArray, labels = 1:size(y,1))

Roughly the inverse operation of onehot or onehotbatch: This finds the index of the largest element of y, or each column of y, and looks them up in labels.

If labels are not specified, the default is the integers 1:size(y,1). This is the same operation as argmax(y, dims=1), but sometimes with a different return type.
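To make the difference in return type concrete, here is a short sketch on a made-up 2×2 score matrix (two classes, two samples). It assumes only Base's argmax and the onecold default described above.

```julia
using OneHotArrays

# Hypothetical scores: rows are classes, columns are samples.
scores = [0.1 0.7
          0.9 0.3]

# onecold returns a plain vector of row indices, one per column.
cold = onecold(scores)

# argmax along dims=1 returns a 1×2 matrix of CartesianIndex values.
am = argmax(scores, dims=1)

# Taking the row component of each CartesianIndex recovers the same indices.
rows = getindex.(vec(am), 1)
println(cold == rows)
```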

Examples

julia> onecold([false, true, false])
2

julia> onecold([0.3, 0.2, 0.5], (:a, :b, :c))
:c

julia> onecold([ 1  0  0  1  0  1  0  1  0  0  1
                 0  1  0  0  0  0  0  0  1  0  0
                 0  0  0  0  1  0  0  0  0  0  0
                 0  0  0  0  0  0  1  0  0  0  0
                 0  0  1  0  0  0  0  0  0  1  0 ], 'a':'e') |> String
"abeacadabea"

OneHotArrays.onehotbatch — Function

onehotbatch(xs, labels, [default])

Returns a OneHotMatrix where the kth column of the matrix is onehot(xs[k], labels). This is a sparse matrix, which stores just a Vector{UInt32} containing the indices of the nonzero elements.

If one of the inputs in xs is not found in labels, that column is onehot(default, labels) if default is given, else an error.

If xs has more dimensions, N = ndims(xs) > 1, then the result is an AbstractArray{Bool, N+1} which is one-hot along the first dimension, i.e. result[:, k...] == onehot(xs[k...], labels).

Note that xs can be any iterable, such as a string. Using a tuple for labels will often speed up construction, certainly for fewer than 32 classes.

Examples

julia> oh = onehotbatch("abracadabra", 'a':'e', 'e')
5×11 OneHotMatrix(::Vector{UInt32}) with eltype Bool:
 1  ⋅  ⋅  1  ⋅  1  ⋅  1  ⋅  ⋅  1
 ⋅  1  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  1  ⋅  ⋅
 ⋅  ⋅  ⋅  ⋅  1  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅
 ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  1  ⋅  ⋅  ⋅  ⋅
 ⋅  ⋅  1  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  1  ⋅

julia> reshape(1:15, 3, 5) * oh  # this matrix multiplication is done efficiently
3×11 Matrix{Int64}:
 1  4  13  1  7  1  10  1  4  13  1
 2  5  14  2  8  2  11  2  5  14  2
 3  6  15  3  9  3  12  3  6  15  3
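The higher-dimensional case mentioned above (ndims(xs) > 1) is not covered by these examples, so here is a hedged sketch of it. The 2×2 array of class ids is invented for illustration.

```julia
using OneHotArrays

# Hypothetical 2×2 array of class ids drawn from the labels 1:3.
xs = [1 2
      3 1]

oh = onehotbatch(xs, 1:3)

# The result gains a leading dimension of length 3 (the number of labels).
dims = size(oh)

# Each slice along the first dimension is the one-hot vector of that entry.
slice_matches = oh[:, 1, 2] == onehot(xs[1, 2], 1:3)

# onecold undoes the encoding and returns an array shaped like xs.
roundtrip = onecold(oh, 1:3)
println((dims, slice_matches, roundtrip == xs))
```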

OneHotArrays.OneHotArray — Type

OneHotArray{T, N, M, I} <: AbstractArray{Bool, M}
OneHotArray(indices, L)

A one-hot M-dimensional array with L labels (i.e. size(A, 1) == L and all(sum(A, dims=1) .== 1)) stored as a compact N == M-1-dimensional array of indices.

Typically constructed by onehot and onehotbatch. Parameter I is the type of the underlying storage, and T its eltype.

OneHotArrays.OneHotVector — Type

OneHotVector{T} = OneHotArray{T, 0, 1, T}
OneHotVector(indices, L)

A one-hot vector with L labels (i.e. length(A) == L and count(A) == 1) typically constructed by onehot. Stored efficiently as a single index of type T, usually UInt32.

OneHotArrays.OneHotMatrix — Type

OneHotMatrix{T, I} = OneHotArray{T, 1, 2, I}
OneHotMatrix(indices, L)

A one-hot matrix (with L labels) typically constructed using onehotbatch. Stored efficiently as a vector of indices with type I and eltype T.
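Although these types are typically produced by onehot and onehotbatch, the constructor signatures listed above can also be called directly. The sketch below uses plain Int indices, which is an assumption made for illustration; onehot and onehotbatch normally store UInt32.

```julia
using OneHotArrays

# A one-hot vector with 3 labels and the 1 at index 2.
v = OneHotVector(2, 3)

# A one-hot matrix with 3 labels; column k has its 1 at indices[k].
m = OneHotMatrix([1, 3, 2], 3)

# Both compare equal, elementwise, to what onehot and onehotbatch build.
same_vector = v == onehot(:b, [:a, :b, :c])
same_matrix = m == onehotbatch([:a, :c, :b], [:a, :b, :c])
println((same_vector, same_matrix, size(m)))
```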