undersample
function defined in module MLUtils
undersample(data, classes; shuffle=true)
Generate a class-balanced version of data by subsampling its observations so that every class ends up with the same number of observations. Each class in the resulting data set has as many observations as the smallest class has in the original data.
The convenience parameter shuffle determines whether the resulting data is shuffled after its creation; if it is not shuffled, all observations remain in their original order. Defaults to true.
The output will contain both the resampled data and classes.
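The balancing step described above can be sketched in plain Julia. This is a minimal illustration, not the MLUtils implementation: the helper name undersample_indices and its rng keyword are hypothetical, and it returns only the indices to keep rather than the resampled data and classes.

```julia
using Random

# Hypothetical helper for illustration only; not part of MLUtils.
# Returns the indices of the observations that survive undersampling.
function undersample_indices(classes::AbstractVector; shuffle::Bool=true,
                             rng::AbstractRNG=Random.default_rng())
    # Group observation indices by class label.
    groups = Dict{eltype(classes), Vector{Int}}()
    for i in eachindex(classes)
        push!(get!(groups, classes[i], Int[]), i)
    end
    # Every class is cut down to the size of the smallest class.
    mincount = minimum(length, values(groups))
    inds = Int[]
    for idxs in values(groups)
        # Draw `mincount` indices from this class without replacement.
        append!(inds, Random.shuffle(rng, idxs)[1:mincount])
    end
    # Either shuffle the kept indices or restore the original order.
    shuffle ? Random.shuffle!(rng, inds) : sort!(inds)
    return inds
end

X = rand(3, 6)
Y = ["a", "b", "b", "b", "b", "a"]
inds = undersample_indices(Y; shuffle=false)
X_bal, Y_bal = X[:, inds], Y[inds]
```

With shuffle=false the kept indices are sorted, so the surviving observations appear in their original relative order.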
# 6 observations with 3 features each
X = rand(3, 6)
# 2 classes, severely imbalanced
Y = ["a", "b", "b", "b", "b", "a"]
# subsample the class "b" to match "a"
X_bal, Y_bal = undersample(X, Y)
# this results in a smaller dataset
@assert size(X_bal) == (3,4)
@assert length(Y_bal) == 4
# now both "a", and "b" have 2 observations each
@assert sum(Y_bal .== "a") == 2
@assert sum(Y_bal .== "b") == 2
For this function to work, the type of data must implement numobs and getobs.
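As a rough sketch of what satisfying this requirement can look like for a custom container: to my understanding of the MLUtils documentation, numobs and getobs fall back to Base.length and Base.getindex, so defining those two methods is usually enough. The LabeledSamples type below is hypothetical, and the fallback behavior is an assumption to check against the MLUtils version in use.

```julia
# Hypothetical container for illustration; not an MLUtils type.
struct LabeledSamples
    features::Matrix{Float64}   # one column per observation
    labels::Vector{String}      # one label per observation
end

# Number of observations (assumed to back numobs).
Base.length(d::LabeledSamples) = length(d.labels)

# A single observation (assumed to back getobs with an integer index).
Base.getindex(d::LabeledSamples, i::Integer) = (d.features[:, i], d.labels[i])

# A subset of observations (assumed to back getobs with a vector of indices).
Base.getindex(d::LabeledSamples, idx::AbstractVector{<:Integer}) =
    LabeledSamples(d.features[:, idx], d.labels[idx])

ds = LabeledSamples(rand(3, 6), ["a", "b", "b", "b", "b", "a"])
sub = ds[[1, 6]]   # keep only the two "a" observations
```

The vector-index method matters because a resampling function selects several observations at once.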
Note that if data is a tuple, then it will be assumed that the last element of the tuple contains the targets.
julia> data = DataFrame(X1=rand(6), X2=rand(6), Y=[:a,:b,:b,:b,:b,:a])
6×3 DataFrames.DataFrame
│ Row │ X1 │ X2 │ Y │
├─────┼───────────┼─────────────┼───┤
│ 1 │ 0.226582 │ 0.0443222 │ a │
│ 2 │ 0.504629 │ 0.722906 │ b │
│ 3 │ 0.933372 │ 0.812814 │ b │
│ 4 │ 0.522172 │ 0.245457 │ b │
│ 5 │ 0.505208 │ 0.11202 │ b │
│ 6 │ 0.0997825 │ 0.000341996 │ a │
julia> getobs(undersample(data, data.Y))
4×3 DataFrame
Row │ X1 X2 Y
│ Float64 Float64 Symbol
─────┼─────────────────────────────
1 │ 0.427064 0.0648339 a
2 │ 0.376304 0.100022 a
3 │ 0.467095 0.185437 b
4 │ 0.457043 0.490688 b
See ObsView for more information on data subsets. See also oversample.