Keypoint regression

Single keypoint regression consists of localizing one keypoint in an image. Here we'll be training on a head pose dataset, where every image shows a person and the position of their head is annotated with a keypoint. Since keypoint datasets all have different formats, we have to do a bit more manual work to load the task dataset. First we import everything we'll need:


			
			
			
			
    import CairoMakie; CairoMakie.activate!(type="png")

    using DelimitedFiles: readdlm
    using FastAI, FastVision, Flux, Metalhead
    using FastAI.FilePathsBase
    import FastVision.DataAugmentation

Creating a task data container

load(datasets()[id]) downloads the files, but it's up to us to load them into a usable format. In the end, the task data container should hold one tuple of an image and its keypoint per observation.


			
			
			
    path = load(datasets()["biwi_head_pose"])

    "/home/lorenz/.julia/datadeps/fastai-biwi_head_pose"

    files = loadfolderdata(path);
    files[1:10]

			10-element Vector{String}:
 "/home/lorenz/.julia/datadeps/fastai-biwi_head_pose/01"
 "/home/lorenz/.julia/datadeps/fastai-biwi_head_pose/01.obj"
 "/home/lorenz/.julia/datadeps/fastai-biwi_head_pose/02"
 "/home/lorenz/.julia/datadeps/fastai-biwi_head_pose/02.obj"
 "/home/lorenz/.julia/datadeps/fastai-biwi_head_pose/03"
 "/home/lorenz/.julia/datadeps/fastai-biwi_head_pose/03.obj"
 "/home/lorenz/.julia/datadeps/fastai-biwi_head_pose/04"
 "/home/lorenz/.julia/datadeps/fastai-biwi_head_pose/04.obj"
 "/home/lorenz/.julia/datadeps/fastai-biwi_head_pose/05"
 "/home/lorenz/.julia/datadeps/fastai-biwi_head_pose/05.obj"

loadfolderdata creates a data container from the directory the dataset was downloaded to, treating every file as a single observation. However, that is not what we want here: one image file and one annotation file together make up a single observation, and we want to ignore all other files, like the README. To achieve this, we create two data containers, one holding all the image paths and one holding all the annotation paths, by filtering the container of all paths.


			
			
			
    imagefiles = loadfolderdata(path, filterfn=FastVision.isimagefile)
    annotfiles = loadfolderdata(path, filterfn=p -> occursin("pose", pathname(p)))

			ObsView(::MLDatasets.FileDataset{typeof(identity), String}, ::Vector{Int64})
 15678 observations

			
			
			
    (getobs(imagefiles, 1), getobs(annotfiles, 1))

			("/home/lorenz/.julia/datadeps/fastai-biwi_head_pose/01/frame_00003_rgb.jpg", "/home/lorenz/.julia/datadeps/fastai-biwi_head_pose/01/frame_00003_pose.txt")

Next we map loading functions over each container so that observations hold the actual data instead of file paths. An image file can be loaded using the loadfile utility. The keypoints have a custom format, so we write a helper function to parse them from a text file. The details of how the format is loaded aren't important.


			
			
			
			
    readcalibrationfile(p) = readdlm(string(p))[1:3, 1:3]

    CAL = readcalibrationfile(joinpath(path, "01", "rgb.cal"))

    function loadannotfile(annotpath, cal = CAL)
        ctr = readdlm(string(annotpath))[4, :]
        cx = ctr[1] * cal[1, 1] / ctr[3] + cal[1, 3]
        cy = ctr[2] * cal[2, 2] / ctr[3] + cal[2, 3]
        return [FastVision.SVector(cy, cx) .+ 1]
    end

			loadannotfile (generic function with 2 methods)

Now we can use mapobs to lazily map the loading functions over the containers. Note that besides loading the image and keypoint, we also extract the subject ID from each path. We'll use this in a bit to split the dataset appropriately, since we no longer have access to the path information once we have a container of loaded data.


			
			
			
    data = (
        mapobs(loadfile, imagefiles),
        mapobs(loadannotfile, annotfiles)
    )

    ids = map(p -> parse(Int, pathname(pathparent(p))), imagefiles)

    obs = image, ks = getobs(data, 2000)

			(ColorTypes.RGB{FixedPointNumbers.N0f8}[RGB{N0f8}(0.0,0.0,0.0) RGB{N0f8}(0.004,0.004,0.004) … RGB{N0f8}(0.063,0.043,0.027) RGB{N0f8}(0.016,0.0,0.0); RGB{N0f8}(0.051,0.051,0.051) RGB{N0f8}(0.859,0.859,0.859) … RGB{N0f8}(0.522,0.502,0.486) RGB{N0f8}(0.059,0.039,0.024); … ; RGB{N0f8}(0.043,0.012,0.0) RGB{N0f8}(0.129,0.094,0.059) … RGB{N0f8}(0.463,0.447,0.435) RGB{N0f8}(0.078,0.059,0.043); RGB{N0f8}(0.027,0.0,0.0) RGB{N0f8}(0.055,0.02,0.0) … RGB{N0f8}(0.067,0.051,0.039) RGB{N0f8}(0.016,0.0,0.0)], StaticArrays.SVector{2, Float64}[[237.24571342301059, 412.9481253320988]])

We can visualize an observation using DataAugmentation.showitems if we wrap the data in item types:
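For example, a minimal sketch, assuming DataAugmentation's Image and Keypoints item constructors accept the raw image and the keypoint vector with the image size as bounds:

    # Hypothetical sketch: wrap the observation in DataAugmentation item types
    # (Keypoints takes the image size as its bounds) and render both items.
    items = (DataAugmentation.Image(image), DataAugmentation.Keypoints(ks, size(image)))
    DataAugmentation.showitems(items)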

Before we can start using this data container for training, we need to split it into a training and validation dataset. Since there are 13 different persons with many images each, randomly splitting the container does not make sense. The validation dataset would then contain many images that are very similar to those seen in training, and would hence say little about the generalization ability of a model. We instead use the first 12 subjects as a training dataset and validate on the last.


			
			
			
    using FastAI.MLUtils

    traindata = MLUtils.ObsView(data, (1:numobs(data))[ids .!= 13])
    validdata = MLUtils.ObsView(data, (1:numobs(data))[ids .== 13])

    numobs(traindata), numobs(validdata)

			(15193, 485)

The learning task

Next we need to define a learning task that encodes and augments each image and keypoint into a form that we can train a model on. Here we make use of ProjectiveTransforms for resizing, cropping and augmenting the image and keypoint, and ImagePreprocessing to reshape and normalize the image. Finally, KeypointPreprocessing scales keypoints so they fall between -1 and 1.


			
			
			
    sz = (224, 224)
    task = SupervisedTask(
        (Image{2}(), Keypoints{2}(1)),
        (
            ProjectiveTransforms(sz, buffered=true, augmentations=augs_projection(max_warp=0)),
            ImagePreprocessing(),
            KeypointPreprocessing(sz),
        )
    )

			SupervisedTask(Image{2} -> Keypoints{2, 1})

We can check that each image is resized to (224, 224) and the keypoints are normalized:


			
			
			
			
    im, k = getobs(traindata, 1)
    x, y = encodesample(task, Training(), (im, k))
    summary(x), y

			("224×224×3 Array{Float32, 3}", Float32[0.4072907, -0.48156565])

Decoding the encoded targets should give back a point within the original image bounds:


			
			
			
			

	
    FastAI.decodeypred(task, Training(), y)

			1-element Vector{StaticArrays.SVector{2, Float32}}:
 [157.61655, 58.064644]

			
			
			
			
    xs, ys = FastAI.makebatch(task, traindata, 1:2)
    showbatch(task, (xs, ys))

That is looking good! We can see that the keypoint is aligned with the center of the head even after heavy augmentation. Now it is finally time to train a model.

Training

We'll use a modified ResNet as a model backbone and add a couple of layers that regress the keypoint. taskmodel knows how to do this by looking at the data blocks used and calling blockmodel with the encoded input and output blocks, here an ImageTensor{2} and a KeypointTensor{2, Float32}((1,)), along with the backbone.

The implementation, for reference, looks like this:


			
			
			
    function blockmodel(inblock::ImageTensor{N}, outblock::KeypointTensor{N}, backbone) where N
        outsz = Flux.outputsize(backbone, (ntuple(_ -> 256, N)..., inblock.nchannels, 1))
        outch = outsz[end-1]
        head = Models.visionhead(outch, prod(outblock.sz)*N, p = 0.)
        return Chain(backbone, head)
    end

			
			
			
    backbone = Metalhead.ResNet(34).layers[1:end-1]
    model = taskmodel(task, backbone);

Next we create a pair of training and validation data loaders. They take care of batching and loading the data in parallel in the background.


			
			
			
			
			
    traindl, validdl = FastAI.taskdataloaders(traindata, validdata, task, 16);

With the addition of an optimizer and a loss function, we can now create a Learner and start training. Just like taskmodel, tasklossfn selects an appropriate loss function based on the task's blocks. Here both the encoded target block and the model output block are KeypointTensor{2, Float32}((1,)), so blocklossfn(block, block) is called, which returns mean squared error as a suitable loss function.


			
			
			
    learner = Learner(
        model,
        tasklossfn(task);
        data=(traindl, validdl),
        optimizer=Flux.Adam(),
        callbacks=[ToGPU()])

			Learner()

			
			
			

	
    fitonecycle!(learner, 5)

Epoch 1 TrainingPhase(): 100%|██████████████████████████| Time: 0:02:55
┌───────────────┬───────┬─────────┐
│         Phase │ Epoch │    Loss │
├───────────────┼───────┼─────────┤
│ TrainingPhase │   1.0 │ 0.28911 │
└───────────────┴───────┴─────────┘
Epoch 1 ValidationPhase(): 100%|████████████████████████| Time: 0:00:08
┌─────────────────┬───────┬─────────┐
│           Phase │ Epoch │    Loss │
├─────────────────┼───────┼─────────┤
│ ValidationPhase │   1.0 │ 0.01148 │
└─────────────────┴───────┴─────────┘
Epoch 2 TrainingPhase(): 100%|██████████████████████████| Time: 0:01:25
┌───────────────┬───────┬─────────┐
│         Phase │ Epoch │    Loss │
├───────────────┼───────┼─────────┤
│ TrainingPhase │   2.0 │ 0.04745 │
└───────────────┴───────┴─────────┘
Epoch 2 ValidationPhase(): 100%|████████████████████████| Time: 0:00:01
┌─────────────────┬───────┬─────────┐
│           Phase │ Epoch │    Loss │
├─────────────────┼───────┼─────────┤
│ ValidationPhase │   2.0 │ 0.01927 │
└─────────────────┴───────┴─────────┘
Epoch 3 TrainingPhase(): 100%|██████████████████████████| Time: 0:01:25
┌───────────────┬───────┬─────────┐
│         Phase │ Epoch │    Loss │
├───────────────┼───────┼─────────┤
│ TrainingPhase │   3.0 │ 0.03839 │
└───────────────┴───────┴─────────┘
Epoch 3 ValidationPhase(): 100%|████████████████████████| Time: 0:00:01
┌─────────────────┬───────┬─────────┐
│           Phase │ Epoch │    Loss │
├─────────────────┼───────┼─────────┤
│ ValidationPhase │   3.0 │ 0.01213 │
└─────────────────┴───────┴─────────┘
Epoch 4 TrainingPhase(): 100%|██████████████████████████| Time: 0:01:27
┌───────────────┬───────┬─────────┐
│         Phase │ Epoch │    Loss │
├───────────────┼───────┼─────────┤
│ TrainingPhase │   4.0 │ 0.03249 │
└───────────────┴───────┴─────────┘
Epoch 4 ValidationPhase(): 100%|████████████████████████| Time: 0:00:01
┌─────────────────┬───────┬─────────┐
│           Phase │ Epoch │    Loss │
├─────────────────┼───────┼─────────┤
│ ValidationPhase │   4.0 │ 0.00193 │
└─────────────────┴───────┴─────────┘
Epoch 5 TrainingPhase(): 100%|██████████████████████████| Time: 0:01:30
┌───────────────┬───────┬─────────┐
│         Phase │ Epoch │    Loss │
├───────────────┼───────┼─────────┤
│ TrainingPhase │   5.0 │ 0.02201 │
└───────────────┴───────┴─────────┘
Epoch 5 ValidationPhase(): 100%|████████████████████████| Time: 0:00:00
┌─────────────────┬───────┬─────────┐
│           Phase │ Epoch │    Loss │
├─────────────────┼───────┼─────────┤
│ ValidationPhase │   5.0 │ 0.00079 │
└─────────────────┴───────┴─────────┘

We can save the model for later inference using savetaskmodel:


			
			
			

	
    savetaskmodel("keypointregression.jld2", task, learner.model)
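The saved model can later be loaded back together with its task for inference. A minimal sketch, assuming loadtaskmodel reads the file written by savetaskmodel:

    # Load the task and the trained model back from the JLD2 file
    # (loadtaskmodel is assumed to return both).
    loadedtask, loadedmodel = loadtaskmodel("keypointregression.jld2")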

The loss is going down during training, which is a good sign, but visualizing predictions against the ground truth gives a better idea of how well the model performs. We'll use showoutputs to compare batches of encoded targets and model outputs; for this we run the model on a batch from the validation dataset.


			
			
			

	
    showoutputs(task, learner; n = 3, context = Validation())

We can also see that the trained model generalizes well to the heavy augmentation employed during training. The augmentation also explains why the training loss is so much higher than the validation loss.


			
			
			

	
    showoutputs(task, learner; n = 3, context = Training())