MLUtils
"""
    ObsView(data, [indices])

Used to represent a subset of some `data` of arbitrary type by
storing which observation-indices the subset spans. Furthermore,
subsequent subsettings are accumulated without needing to access
the actual data.

The main purpose of `ObsView` is to delay data access and movement
until an actual batch of data (or a single observation) is needed
for some computation. This is particularly useful when the data is
not located in memory, but on the hard drive or at some remote
location. In such a scenario one wants to load the required data
only when needed.

Any data access is delayed until `getindex` is called,
and even `getindex` returns the result of
[`obsview`](@ref), which in general avoids data movement until
[`getobs`](@ref) is called.

If used as an iterator, the view will iterate over the dataset
once, effectively denoting an epoch. Each iteration will return a
lazy view of the current observation.
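
The accumulation of subsequent subsettings can be illustrated with a
small sketch. It assumes `MLUtils` is loaded and uses a random matrix
`X` as stand-in data (an assumption made only for illustration); the
field name `indices` is the one defined on the struct in this file.

```julia
using MLUtils

X = rand(4, 150)            # stand-in data: 150 observations, 4 features

v1 = ObsView(X, 21:100)     # observations 21 through 100 of X
v2 = ObsView(v1, 1:10)      # first 10 observations of v1

# The nested view stores composed indices into the original data
@assert v2.indices == 21:30
@assert parent(v2) === X

# Data is only materialized once getobs is called
@assert getobs(v2) == X[:, 21:30]
```

Only the index ranges are combined when `v2` is constructed; `X`
itself is not read until `getobs` is called.
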
# Arguments

- **`data`** : The object describing the dataset. Can be of any
  type as long as it implements [`getobs`](@ref) and
  [`numobs`](@ref) (see Details for more information).

- **`indices`** : Optional. The index or indices of the
  observation(s) in `data` that the subset should represent.
  Can be of type `Int` or some subtype of `AbstractVector`.

# Methods

- **`getindex`** : Returns the observation(s) of the given
  index/indices. No data is copied aside
  from the required indices.

- **`numobs`** : Returns the total number of observations in the subset.

- **`getobs`** : Returns the underlying data that the
  `ObsView` represents at the given relative indices. Note
  that these indices are in "subset space", and in general will
  not directly correspond to the same indices in the underlying
  data set.
# Details

For `ObsView` to work on some data structure, the desired type
`MyType` must implement the following interface:

- `getobs(data::MyType, idx)` :
  Should return the observation(s) indexed by `idx`.
  The form of the return value is up to the user.
  Note that `idx` can be of type `Int` or `AbstractVector`.

- `numobs(data::MyType)` :
  Should return the total number of observations in `data`.

The following methods can also be provided and are optional:

- `getobs(data::MyType)` :
  By default this function is the identity function.
  If that is not the behaviour that you want for your type,
  you need to provide this method as well.

- `obsview(data::MyType, idx)` :
  If your custom type has its own kind of subset type, you can
  return it here. An example of such a case is `SubArray`, which
  represents a subset of some `AbstractArray`.

- `getobs!(buffer, data::MyType, [idx])` :
  In-place version of `getobs(data, idx)`. If this method
  is provided for `MyType`, then `eachobs` can preallocate a
  buffer that is then reused every iteration. Note: `buffer`
  should be equivalent to the return value of
  `getobs(::MyType, ...)`, since this is how `buffer` is
  preallocated by default.
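
The required part of the interface above could be implemented roughly
as follows. This is a minimal sketch, not a reference implementation:
the type `SquaresData` is hypothetical and exists only for this
illustration, and it assumes that `numobs` and `getobs` can be
extended through `import MLUtils: numobs, getobs`.

```julia
using MLUtils
import MLUtils: numobs, getobs

# Hypothetical container: observation i is computed on demand as i^2
struct SquaresData
    n::Int
end

# Required interface
numobs(d::SquaresData) = d.n
getobs(d::SquaresData, idx::Int) = idx^2
getobs(d::SquaresData, idx::AbstractVector) = [i^2 for i in idx]

v = ObsView(SquaresData(10), 3:7)

@assert numobs(v) == 5
# Indices passed to getobs are relative to the subset, not to the data
@assert getobs(v, 1) == 9
@assert getobs(v, 1:2) == [9, 16]
@assert getobs(v) == [9, 16, 25, 36, 49]
```

Nothing is computed when `v` is constructed; values are only produced
once `getobs` is called.
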
# Examples
```julia
X, Y = MLUtils.load_iris()

# The iris set has 150 observations and 4 features
@assert size(X) == (4,150)

# Represent observations 21 through 100 (80 in total) as an ObsView
v = ObsView(X, 21:100)
@assert numobs(v) == 80
@assert typeof(v) <: ObsView

# getobs indexes into v
@assert getobs(v, 1:10) == X[:, 21:30]

# Use `obsview` to avoid boxing into ObsView
# for types that provide a custom "subset", such as arrays.
# Here it instead creates a native SubArray.
v = obsview(X, 1:100)
@assert numobs(v) == 100
@assert typeof(v) <: SubArray

# Also works for tuples of arbitrary length
subset = obsview((X, Y), 1:100)
@assert numobs(subset) == 100
@assert typeof(subset) <: Tuple # tuple of SubArray

# Use as iterator
for x in ObsView(X)
    @assert typeof(x) <: SubArray{Float64,1}
end

# iterate over each individual labeled observation
for (x, y) in ObsView((X, Y))
    @assert typeof(x) <: SubArray{Float64,1}
    @assert typeof(y) <: String
end

# same but in random order
for (x, y) in ObsView(shuffleobs((X, Y)))
    @assert typeof(x) <: SubArray{Float64,1}
    @assert typeof(y) <: String
end

# Indexing: take first 10 observations
x, y = ObsView((X, Y))[1:10]
```
# See also
[`obsview`](@ref), [`getobs`](@ref), [`numobs`](@ref),
[`splitobs`](@ref), [`shuffleobs`](@ref),
[`kfolds`](@ref).
"""
struct ObsView{Tdata, I<:Union{Int,AbstractVector}} <: AbstractDataContainer
    data::Tdata
    indices::I

    function ObsView(data::T, indices::I) where {T,I}
        1 <= minimum(indices) || throw(BoundsError(data, indices))
        maximum(indices) <= numobs(data) || throw(BoundsError(data, indices))
        new{T,I}(data, indices)
    end
end

ObsView(data) = ObsView(data, 1:numobs(data))

ObsView(subset::ObsView) = subset

function ObsView(subset::ObsView, indices::Union{Int,AbstractVector})
    ObsView(subset.data, subset.indices[indices])
end
function Base.show(io::IO, subset::ObsView)
    if get(io, :compact, false)
        print(io, "ObsView{", typeof(subset.data), "} with ", numobs(subset), " observations")
    else
        print(io, summary(subset), "\n ", numobs(subset), " observations")
    end
end

function Base.summary(subset::ObsView)
    io = IOBuffer()
    print(io, typeof(subset).name.name, "(")
    Base.showarg(io, subset.data, false)
    print(io, ", ")
    Base.showarg(io, subset.indices, false)
    print(io, ')')
    first(readlines(seek(io, 0)))
end
# compare if both subsets cover the same observations of the same data
# we don't care how the indices are stored, just that they match
# in order and values
function Base.:(==)(s1::ObsView, s2::ObsView)
    s1.data == s2.data && s1.indices == s2.indices
end
Base.IteratorEltype(::Type{<:ObsView}) = Base.EltypeUnknown()

@propagate_inbounds Base.getindex(subset::ObsView, idx) =
    obsview(subset.data, subset.indices[idx])

Base.length(subset::ObsView) = length(subset.indices)

getobs(subset::ObsView) = getobs(subset.data, subset.indices)

@propagate_inbounds getobs(subset::ObsView, idx) =
    getobs(subset.data, subset.indices[idx])

getobs!(buffer, subset::ObsView) = getobs!(buffer, subset.data, subset.indices)

@propagate_inbounds getobs!(buffer, subset::ObsView, idx) =
    getobs!(buffer, subset.data, subset.indices[idx])

Base.parent(x::ObsView) = x.data
"""
    obsview(data, [indices])

Returns a lazy view of the observations in `data` that
correspond to the given `indices`. No data will be copied except
for the indices. It is similar to constructing an [`ObsView`](@ref),
but returns a `SubArray` if `data` is an `AbstractArray` (such as
`Array` or `SubArray`). Furthermore, this function may
be extended for custom types of `data` that also want to provide
their own subset type.

In case `data` is a tuple, the function will be mapped
over its elements. That means that it returns a
tuple of views (one per element) instead of an `ObsView` of tuples.

If instead you want to get the subset of observations
corresponding to the given `indices` in their native type, use
`getobs`.

See [`ObsView`](@ref) for more information.
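
The mapping over tuple elements can be sketched as follows. Going by
the `Union{Tuple,NamedTuple}` method defined in this file, a named
tuple should be handled the same way as a plain tuple; the arrays `X`
and `Y` are random stand-in data chosen only for this illustration.

```julia
using MLUtils

X = rand(4, 150)   # stand-in features: 150 observations
Y = rand(150)      # stand-in targets

nt = obsview((x = X, y = Y), 1:100)

# Each element is viewed separately; the container type is preserved
@assert nt isa NamedTuple
@assert nt.x isa SubArray && nt.y isa SubArray
@assert size(nt.x) == (4, 100)
@assert length(nt.y) == 100

# getobs materializes the viewed observations
@assert getobs(nt.x) == X[:, 1:100]
```
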
"""
obsview(data, indices=1:numobs(data)) = ObsView(data, indices)
obsview(A::SubArray) = A

function obsview(A::AbstractArray{T,N}, idx) where {T,N}
    I = ntuple(_ -> :, N-1)
    return view(A, I..., idx)
end

getobs(a::SubArray) = getobs(a.parent, last(a.indices))

function obsview(tup::Union{Tuple,NamedTuple}, indices)
    map(data -> obsview(data, indices), tup)
end