`undersample`: function defined in module `MLUtils`
undersample(data, classes; shuffle=true)
Generate a class-balanced version of `data` by subsampling its observations so that every class ends up with the same number of observations. Each class in the resulting data set has as many observations as the smallest class has in the original `data`.
The convenience parameter `shuffle` determines whether the resulting data is shuffled after its creation; if it is not shuffled, all observations stay in their original order. Defaults to `true`.
The output will contain both the resampled data and classes.
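The index selection described above can be sketched in plain Julia. This is a minimal illustration under stated assumptions, not the MLUtils implementation: the helper name `undersample_indices` is hypothetical, and it returns only the indices to keep rather than a data view.

```julia
using Random

# Hypothetical helper (not part of MLUtils): returns the observation
# indices that survive undersampling.
function undersample_indices(rng::AbstractRNG, classes::AbstractVector; shuffle::Bool=true)
    # Group observation indices by class label.
    groups = Dict{eltype(classes), Vector{Int}}()
    for (i, c) in enumerate(classes)
        push!(get!(groups, c, Int[]), i)
    end
    # Every class is cut down to the size of the smallest class.
    n_min = minimum(length, values(groups))
    keep = Int[]
    for idxs in values(groups)
        # Pick n_min indices of this class at random, without replacement.
        append!(keep, Random.shuffle(rng, idxs)[1:n_min])
    end
    # Either shuffle the result or restore the original observation order.
    return shuffle ? Random.shuffle!(rng, keep) : sort!(keep)
end

Y = ["a", "b", "b", "b", "b", "a"]
X = rand(3, 6)
idx = undersample_indices(MersenneTwister(42), Y; shuffle=false)
X_bal, Y_bal = X[:, idx], Y[idx]
```

With `shuffle=false` the kept indices are sorted, so the surviving observations appear in their original order; with `shuffle=true` they are permuted after selection.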
using MLUtils
# 6 observations with 3 features each
X = rand(3, 6)
# 2 classes, severely imbalanced
Y = ["a", "b", "b", "b", "b", "a"]
# subsample the class "b" to match "a"
X_bal, Y_bal = undersample(X, Y)
# this results in a smaller dataset
@assert size(X_bal) == (3, 4)
@assert length(Y_bal) == 4
# now both "a" and "b" have 2 observations each
@assert sum(Y_bal .== "a") == 2
@assert sum(Y_bal .== "b") == 2
For this function to work, the type of `data` must implement `numobs` and `getobs`.
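As a rough illustration of what such a container might look like, the sketch below defines a hypothetical `LabeledPoints` type with `Base.length` and `Base.getindex`. It assumes that the generic `numobs` and `getobs` fall back to these two methods for custom types; that fallback is an assumption to check against the installed MLUtils version. The example itself uses only Base.

```julia
# Hypothetical container type, used only for illustration.
struct LabeledPoints
    features::Matrix{Float64}   # one column per observation
    labels::Vector{String}      # one label per observation
end

# Number of observations in the container
# (assumed to be what `numobs` falls back to).
Base.length(d::LabeledPoints) = length(d.labels)

# Access a single observation or a subset of observations
# (assumed to be what `getobs` falls back to).
Base.getindex(d::LabeledPoints, i::Integer) = (d.features[:, i], d.labels[i])
Base.getindex(d::LabeledPoints, idx::AbstractVector{<:Integer}) =
    LabeledPoints(d.features[:, idx], d.labels[idx])

data = LabeledPoints(rand(3, 6), ["a", "b", "b", "b", "b", "a"])
subset = data[[1, 6]]   # keep only the two "a" observations
```

Indexing with a vector returns another `LabeledPoints`, so a subset can be treated as a data container in its own right.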
Note that if `data` is a tuple, the last element of the tuple is assumed to contain the targets.
julia> data = DataFrame(X1=rand(6), X2=rand(6), Y=[:a,:b,:b,:b,:b,:a])
6×3 DataFrames.DataFrame
│ Row │ X1        │ X2          │ Y │
├─────┼───────────┼─────────────┼───┤
│ 1   │ 0.226582  │ 0.0443222   │ a │
│ 2   │ 0.504629  │ 0.722906    │ b │
│ 3   │ 0.933372  │ 0.812814    │ b │
│ 4   │ 0.522172  │ 0.245457    │ b │
│ 5   │ 0.505208  │ 0.11202     │ b │
│ 6   │ 0.0997825 │ 0.000341996 │ a │
julia> getobs(undersample(data, data.Y))
4×3 DataFrame
 Row │ X1        X2         Y
     │ Float64   Float64    Symbol
─────┼─────────────────────────────
   1 │ 0.427064  0.0648339  a
   2 │ 0.376304  0.100022   a
   3 │ 0.467095  0.185437   b
   4 │ 0.457043  0.490688   b
See `ObsView` for more information on data subsets. See also `oversample`.
There are 2 methods for `MLUtils.undersample`.