One-Hot Encoding with OneHotArrays.jl
It's common to encode categorical variables (like true, false or cat, dog) in "one-of-k" or "one-hot" form. OneHotArrays.jl provides the onehot function to make this easy.
julia> using OneHotArrays
julia> onehot(:b, [:a, :b, :c])
3-element OneHotVector(::UInt32) with eltype Bool:
⋅
1
⋅
julia> onehot(:c, [:a, :b, :c])
3-element OneHotVector(::UInt32) with eltype Bool:
⋅
⋅
1
There is also a onecold function, which is an inverse of onehot. It can also be given an array of numbers instead of booleans, in which case it performs an argmax-like operation, returning the label with the highest corresponding weight.
julia> onecold(ans, [:a, :b, :c])
:c
julia> onecold([true, false, false], [:a, :b, :c])
:a
julia> onecold([0.3, 0.2, 0.5], [:a, :b, :c])
:c
For multiple samples at once, onehotbatch creates a batch (matrix) of one-hot vectors, and onecold treats matrices as batches.
julia> using OneHotArrays
julia> onehotbatch([:b, :a, :b], [:a, :b, :c])
3×3 OneHotMatrix(::Vector{UInt32}) with eltype Bool:
⋅ 1 ⋅
1 ⋅ 1
⋅ ⋅ ⋅
julia> onecold(ans, [:a, :b, :c])
3-element Vector{Symbol}:
:b
:a
:b
Note that these operations return OneHotVector and OneHotMatrix rather than Arrays. OneHotVectors behave like normal vectors but avoid any unnecessary cost compared to using an integer index directly. For example, multiplying a matrix with a one-hot vector simply slices out the relevant column of the matrix under the hood.
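As a small sketch of the multiplication shortcut described above: the product of a matrix and a one-hot vector should hold the same values as the column of the matrix picked out by the hot index. The matrix W and its entries are arbitrary values chosen for illustration, not something from the package.

```julia
using OneHotArrays

# W is an arbitrary 3×5 matrix used only for illustration.
W = Float64[1 4 7 10 13;
            2 5 8 11 14;
            3 6 9 12 15]

v = onehot(2, 1:5)      # one-hot vector selecting the 2nd of 5 labels

col = W * v             # product with the one-hot vector
col == W[:, 2]          # compare against slicing out column 2 directly
```

The comparison is by value only; it does not show which internal code path the package takes.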
Function listing
OneHotArrays.onehot - Function
onehot(x, labels, [default])
Returns a OneHotVector which is roughly a sparse representation of x .== labels.
Instead of storing, say, a Vector{Bool}, it stores the index of the first occurrence of x in labels. If x is not found in labels, then it either returns onehot(default, labels), or gives an error if no default is given.
See also onehotbatch to apply this to many xs, and onecold to reverse either of these, as well as to generalise argmax.
Examples
julia> β = onehot(:b, (:a, :b, :c))
3-element OneHotVector(::UInt32) with eltype Bool:
⋅
1
⋅
julia> αβγ = (onehot(0, 0:2), β, onehot(:z, [:a, :b, :c], :c)) # uses default
(Bool[1, 0, 0], Bool[0, 1, 0], Bool[0, 0, 1])
julia> hcat(αβγ...) # preserves sparsity
3×3 OneHotMatrix(::Vector{UInt32}) with eltype Bool:
1 ⋅ ⋅
⋅ 1 ⋅
⋅ ⋅ 1
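The default and first-occurrence rules in the description above can be checked with a short sketch. The labels and the value :z are illustrative choices. The exact exception type raised for a missing label is not assumed here, so the code catches any exception.

```julia
using OneHotArrays

labels = [:a, :b, :c]

# With a default, a value missing from labels is encoded as the default label.
fallback = onehot(:z, labels, :c)
fallback == onehot(:c, labels)

# Without a default, a missing value raises an error.
# The exception type is not assumed, so any exception counts.
threw = try
    onehot(:z, labels)
    false
catch
    true
end

# With repeated labels, the first matching position is the one stored.
first_match = onecold(onehot(:a, [:b, :a, :a]))
```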
OneHotArrays.onecold - Function
onecold(y::AbstractArray, labels = 1:size(y,1))
Roughly the inverse operation of onehot or onehotbatch: this finds the index of the largest element of y, or of each column of y, and looks them up in labels.
If labels are not specified, the default is the integers 1:size(y,1), which is the same operation as argmax(y, dims=1) but sometimes with a different return type.
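To make the comparison with argmax concrete, here is a minimal sketch on a small score matrix. The values in y are illustrative. It assumes only that argmax(y, dims=1) returns CartesianIndex values, which is standard Julia behaviour.

```julia
using OneHotArrays

# Each column of y holds the scores for one sample (illustrative values).
y = [0.3 0.9;
     0.7 0.1]

idx = onecold(y)             # row index of the largest entry in each column
cis = argmax(y, dims=1)      # 1×2 matrix of CartesianIndex values

# Taking the row component of each CartesianIndex recovers the same indices.
rows = vec(getindex.(cis, 1))
rows == idx
```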
Examples
julia> onecold([false, true, false])
2
julia> onecold([0.3, 0.2, 0.5], (:a, :b, :c))
:c
julia> onecold([ 1 0 0 1 0 1 0 1 0 0 1
0 1 0 0 0 0 0 0 1 0 0
0 0 0 0 1 0 0 0 0 0 0
0 0 0 0 0 0 1 0 0 0 0
0 0 1 0 0 0 0 0 0 1 0 ], 'a':'e') |> String
"abeacadabea"
OneHotArrays.onehotbatch - Function
onehotbatch(xs, labels, [default])
Returns a OneHotMatrix where the kth column of the matrix is onehot(xs[k], labels). This is a sparse matrix, which stores just a Vector{UInt32} containing the indices of the nonzero elements.
If one of the inputs in xs is not found in labels, that column is onehot(default, labels) if default is given, else an error.
If xs has more dimensions, N = ndims(xs) > 1, then the result is an AbstractArray{Bool, N+1} which is one-hot along the first dimension, i.e. result[:, k...] == onehot(xs[k...], labels).
Note that xs can be any iterable, such as a string, and that using a tuple for labels will often speed up construction, certainly for fewer than 32 classes.
Examples
julia> oh = onehotbatch("abracadabra", 'a':'e', 'e')
5×11 OneHotMatrix(::Vector{UInt32}) with eltype Bool:
1 ⋅ ⋅ 1 ⋅ 1 ⋅ 1 ⋅ ⋅ 1
⋅ 1 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 1 ⋅ ⋅
⋅ ⋅ ⋅ ⋅ 1 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅
⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 1 ⋅ ⋅ ⋅ ⋅
⋅ ⋅ 1 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 1 ⋅
julia> reshape(1:15, 3, 5) * oh # this matrix multiplication is done efficiently
3×11 Matrix{Int64}:
1 4 13 1 7 1 10 1 4 13 1
2 5 14 2 8 2 11 2 5 14 2
3 6 15 3 9 3 12 3 6 15 3
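The higher-dimensional case described above is not covered by the example, so here is a small sketch of it. The 2×2 array xs and the labels are illustrative. The checks follow the stated rule result[:, k...] == onehot(xs[k...], labels).

```julia
using OneHotArrays

labels = [:a, :b, :c]
xs = [:a :b;
      :c :a]                    # a 2×2 array of samples, so N = 2

oh = onehotbatch(xs, labels)    # one-hot along the first dimension

size(oh)                                    # label dimension first, then the shape of xs
oh[:, 1, 2] == onehot(xs[1, 2], labels)     # the slice at position (1, 2)
onecold(oh, labels) == xs                   # decoding restores the original array
```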
OneHotArrays.OneHotArray - Type
OneHotArray{T, N, M, I} <: AbstractArray{Bool, M}
OneHotArray(indices, L)
A one-hot M-dimensional array with L labels (i.e. size(A, 1) == L and sum(A, dims=1) == 1) stored as a compact (N == M-1)-dimensional array of indices.
Typically constructed by onehot and onehotbatch. Parameter I is the type of the underlying storage, and T its eltype.
OneHotArrays.OneHotVector - Type
OneHotVector{T} = OneHotArray{T, 0, 1, T}
OneHotVector(indices, L)
A one-hot vector with L labels (i.e. length(A) == L and count(A) == 1) typically constructed by onehot. Stored efficiently as a single index of type T, usually UInt32.
OneHotArrays.OneHotMatrix - Type
OneHotMatrix{T, I} = OneHotArray{T, 1, 2, I}
OneHotMatrix(indices, L)
A one-hot matrix (with L labels) typically constructed using onehotbatch. Stored efficiently as a vector of indices with type I and eltype T.
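The constructor signatures listed above take the indices and the number of labels L directly. The following sketch builds a vector and a matrix that way and compares them with what onehot and onehotbatch produce. The index values and the choice of UInt32 are illustrative, following the "usually UInt32" remark rather than any requirement.

```julia
using OneHotArrays

# Indices are passed directly; the second argument is the number of labels L.
v = OneHotVector(UInt32(2), 3)
m = OneHotMatrix(UInt32[1, 3, 2], 3)

v == onehot(:b, [:a, :b, :c])
m == onehotbatch([:a, :c, :b], [:a, :b, :c])

# The invariants stated in the type descriptions.
length(v) == 3 && count(v) == 1
size(m, 1) == 3 && all(sum(m, dims=1) .== 1)
```

In everyday use onehot and onehotbatch are likely the more convenient entry points, since they look the indices up from the labels.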