kfolds

function defined in module MLUtils


			kfolds(n::Integer, k = 5) -> Tuple

Compute the train/validation assignments for k repartitions of n observations and return them as a tuple of two vectors. The first vector contains the index vectors for the training subsets, and the second contains the index vectors for the corresponding validation subsets. A common rule of thumb is to use either k = 5 or k = 10. The following code snippet generates the index assignments for k = 5:


			
			
			
			
			
			
			julia> train_idx, val_idx = kfolds(10, 5);

Each observation is assigned to the validation subset once (and only once). Thus, a union over all validation index-vectors reproduces the full range 1:n. Note that there is no random assignment of observations to subsets, which means that adjacent observations are likely to be part of the same validation subset.


			
			julia> train_idx
5-element Array{Array{Int64,1},1}:
 [3,4,5,6,7,8,9,10]
 [1,2,5,6,7,8,9,10]
 [1,2,3,4,7,8,9,10]
 [1,2,3,4,5,6,9,10]
 [1,2,3,4,5,6,7,8]

julia> val_idx
5-element Array{UnitRange{Int64},1}:
 1:2
 3:4
 5:6
 7:8
 9:10
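
The properties described above can be checked directly on the indices shown. The following is a minimal sketch in plain Julia that does not call MLUtils: the validation ranges are copied from the output above, and the training vectors are rebuilt as complements with `setdiff`, which is an assumption about how they relate rather than a statement about the library's internals.

```julia
n = 10
val_idx = [1:2, 3:4, 5:6, 7:8, 9:10]            # copied from the output above
train_idx = [setdiff(1:n, v) for v in val_idx]  # complement of each validation range

# Every observation appears in exactly one validation subset,
# so concatenating all validation ranges reproduces 1:n.
@assert sort(reduce(vcat, val_idx)) == 1:n

# Each training vector is disjoint from its validation range,
# and together they cover the full range 1:n.
for (t, v) in zip(train_idx, val_idx)
    @assert isempty(intersect(t, v))
    @assert sort(vcat(t, v)) == 1:n
end
```

Because the validation subsets are contiguous ranges, data that is sorted (for example by class label) would yield folds that each contain only a narrow slice of the data.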

			kfolds(data, [k = 5])

Repartition a data container k times using a k-folds strategy and return the sequence of folds as a lazy iterator. Only data subsets are created, which means that no actual data is copied until getobs is invoked.

Conceptually, a k-folds repartitioning strategy divides the given data into k roughly equal-sized parts. Each part will serve as validation set once, while the remaining parts are used for training. This results in k different partitions of data.

If the size of the dataset is not divisible by the specified k, the remaining observations are distributed evenly among the parts, so the part sizes differ by at most one.


			
			
			
			for (x_train, x_val) in kfolds(X, k = 10)
			    # code called 10 times
			    # nobs(x_val) may differ up to ±1 over iterations
			end
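
To make the "differ by at most one" behaviour concrete, here is a small sketch of how fold sizes could be computed when n is not divisible by k. The helper `sketch_fold_ranges` is hypothetical and written only for illustration; in particular, giving the extra observations to the first folds is an assumption of this sketch, not something the text above specifies.

```julia
# Hypothetical helper for illustration only; not part of MLUtils.
function sketch_fold_ranges(n::Integer, k::Integer)
    # Every fold gets div(n, k) observations; the first rem(n, k) folds get one more.
    # (Which folds receive the extras is an assumption of this sketch.)
    sizes = [div(n, k) + (i <= rem(n, k) ? 1 : 0) for i in 1:k]
    stops = cumsum(sizes)            # last index of each fold
    starts = stops .- sizes .+ 1     # first index of each fold
    return [a:b for (a, b) in zip(starts, stops)]
end

ranges = sketch_fold_ranges(10, 3)   # 10 observations, 3 folds
sizes = length.(ranges)              # sizes == [4, 3, 3] in this sketch
```

With 10 observations and 3 folds, one observation is left over after an even split of 3 per fold, so exactly one fold ends up with 4 observations.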

Multiple variables are supported (e.g. for labeled data):


			
			
			
			for ((x_train, y_train), val) in kfolds((X, Y), k = 10)
			    # ...
			end
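
The idea behind folding several variables together is that the same observation indices are applied to each of them, so features and labels stay aligned. The sketch below imitates this in plain Julia with `view`, without calling MLUtils. The arrays `X` and `Y`, the hard-coded validation ranges, and the layout with observations along the last dimension of `X` are all assumptions made for the illustration.

```julia
X = reshape(collect(1.0:20.0), 2, 10)   # 2 features x 10 observations (last dim = observations)
Y = collect(101:110)                    # one label per observation
val_idx = [1:2, 3:4, 5:6, 7:8, 9:10]    # static validation ranges for k = 5

# Build view-based folds shaped like ((x_train, y_train), (x_val, y_val)).
# Views reference the original arrays, so no data is copied here.
folds = map(val_idx) do v
    t = setdiff(1:10, v)
    ((view(X, :, t), view(Y, t)), (view(X, :, v), view(Y, v)))
end

for ((x_train, y_train), (x_val, y_val)) in folds
    # Features and labels stay aligned because both use the same indices.
    @assert size(x_train, 2) == length(y_train) == 8
    @assert size(x_val, 2) == length(y_val) == 2
end
```

Here the validation part is destructured into `(x_val, y_val)`; it can equally be bound to a single name, as `val` is in the loop above.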

By default the folds are created using static splits. Use shuffleobs to randomly assign observations to the folds.


			
			
			
			for (x_train, x_val) in kfolds(shuffleobs(X), k = 10)
			    # ...
			end
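
The effect of shuffling before a static split can be pictured at the index level: the folds are still contiguous blocks, but of a random permutation rather than of 1:n. The following sketch uses only the `Random` standard library and does not call `shuffleobs` or `kfolds`; the seed, the variable names, and the restriction to n divisible by k are choices made for this illustration.

```julia
using Random

n, k = 10, 5
rng = MersenneTwister(42)        # fixed seed so the run is repeatable
perm = randperm(rng, n)          # random ordering of the indices 1:n

# Static contiguous folds over positions in `perm`,
# which correspond to randomly chosen original observations.
fold_len = div(n, k)             # n is divisible by k in this sketch
val_idx = [perm[((i - 1) * fold_len + 1):(i * fold_len)] for i in 1:k]
train_idx = [setdiff(perm, v) for v in val_idx]

# Each observation still lands in exactly one validation fold.
@assert sort(reduce(vcat, val_idx)) == 1:n
```

Adjacent observations are no longer likely to share a validation fold, which matters when the data has a meaningful order.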

See leavepout for a related function.

Methods

There are 3 methods for MLUtils.kfolds: