Single keypoint regression consists of localizing a keypoint in an image. Here we'll be training on a head pose dataset, where every image has a person in it and the head of the person is annotated. Since keypoint datasets all have different formats, we have to do a bit more manual work to get the task dataset loaded. First we import everything we'll need:
```julia
import CairoMakie; CairoMakie.activate!(type="png")

using DelimitedFiles: readdlm
using FastAI, FastVision, Flux, Metalhead
using FastAI.FilePathsBase
import FastVision.DataAugmentation
```
`load(datasets()[id])` downloads the files, but it's up to us to load them into a usable format. In the end, the task data container should contain tuples of an image and a keypoint each.
```julia
path = load(datasets()["biwi_head_pose"])
```

```
"/home/lorenz/.julia/datadeps/fastai-biwi_head_pose"
```
First we create a data container with `loadfolderdata` from the directory the dataset has been downloaded to:

```julia
files = loadfolderdata(path);
files[1:10]
```

```
10-element Vector{String}:
 "/home/lorenz/.julia/datadeps/fastai-biwi_head_pose/01"
 "/home/lorenz/.julia/datadeps/fastai-biwi_head_pose/01.obj"
 "/home/lorenz/.julia/datadeps/fastai-biwi_head_pose/02"
 "/home/lorenz/.julia/datadeps/fastai-biwi_head_pose/02.obj"
 "/home/lorenz/.julia/datadeps/fastai-biwi_head_pose/03"
 "/home/lorenz/.julia/datadeps/fastai-biwi_head_pose/03.obj"
 "/home/lorenz/.julia/datadeps/fastai-biwi_head_pose/04"
 "/home/lorenz/.julia/datadeps/fastai-biwi_head_pose/04.obj"
 "/home/lorenz/.julia/datadeps/fastai-biwi_head_pose/05"
 "/home/lorenz/.julia/datadeps/fastai-biwi_head_pose/05.obj"
```

A container created with `loadfolderdata` simply treats every file as a single observation. However, that is not what we want here: each observation consists of one image file and one annotation file, and we want to ignore all other files, like the README. To achieve this, we'll create two data containers, one with all the image paths and one with all the annotation paths, by filtering the container of all paths.
```julia
imagefiles = loadfolderdata(path, filterfn=FastVision.isimagefile)
annotfiles = loadfolderdata(path, filterfn=p -> occursin("pose", pathname(p)))
```

```
ObsView(::MLDatasets.FileDataset{typeof(identity), String}, ::Vector{Int64})
 15678 observations
```

The first observation of each container confirms that image and annotation paths line up:

```
("/home/lorenz/.julia/datadeps/fastai-biwi_head_pose/01/frame_00003_rgb.jpg", "/home/lorenz/.julia/datadeps/fastai-biwi_head_pose/01/frame_00003_pose.txt")
```
Next we need to map loading functions over each observation to read the data from the files. An image file can be loaded using the `loadfile` utility. The keypoints have a custom format, so we write a helper function to parse them from a text file. The details of how the format is loaded aren't important here.
```julia
readcalibrationfile(p) = readdlm(string(p))[1:3, 1:3]
CAL = readcalibrationfile(joinpath(path, "01", "rgb.cal"))
```
```julia
function loadannotfile(annotpath, cal = CAL)
    ctr = readdlm(string(annotpath))[4, :]
    cx = ctr[1] * cal[1, 1] / ctr[3] + cal[1, 3]
    cy = ctr[2] * cal[2, 2] / ctr[3] + cal[2, 3]
    return [FastVision.SVector(cy, cx) .+ 1]
end
```

```
loadannotfile (generic function with 2 methods)
```
Now we can use `mapobs` to lazily map the loading functions over the containers. Note that besides loading the image and keypoint, we also extract the subject ID from each path. We'll use this in a bit to split the dataset appropriately, since we no longer have access to the path information once we have a container of loaded data.
```julia
data = (
    mapobs(loadfile, imagefiles),
    mapobs(loadannotfile, annotfiles),
)
ids = map(p -> parse(Int, pathname(pathparent(p))), imagefiles)
obs = image, ks = getobs(data, 2000)
```

```
(ColorTypes.RGB{FixedPointNumbers.N0f8}[RGB{N0f8}(0.0,0.0,0.0) RGB{N0f8}(0.004,0.004,0.004) … RGB{N0f8}(0.063,0.043,0.027) RGB{N0f8}(0.016,0.0,0.0); RGB{N0f8}(0.051,0.051,0.051) RGB{N0f8}(0.859,0.859,0.859) … RGB{N0f8}(0.522,0.502,0.486) RGB{N0f8}(0.059,0.039,0.024); … ; RGB{N0f8}(0.043,0.012,0.0) RGB{N0f8}(0.129,0.094,0.059) … RGB{N0f8}(0.463,0.447,0.435) RGB{N0f8}(0.078,0.059,0.043); RGB{N0f8}(0.027,0.0,0.0) RGB{N0f8}(0.055,0.02,0.0) … RGB{N0f8}(0.067,0.051,0.039) RGB{N0f8}(0.016,0.0,0.0)], StaticArrays.SVector{2, Float64}[[237.24571342301059, 412.9481253320988]])
```
We can visualize an observation using `DataAugmentation.showitems` if we wrap the data in item types:
```julia
DataAugmentation.showitems((
    DataAugmentation.Image(image),
    DataAugmentation.Keypoints(ks, size(image)),
))
```
Before we can start using this data container for training, we need to split it into a training and validation dataset. Since there are 13 different persons with many images each, randomly splitting the container does not make sense. The validation dataset would then contain many images that are very similar to those seen in training, and would hence say little about the generalization ability of a model. We instead use the first 12 subjects as a training dataset and validate on the last.
```julia
using FastAI.MLUtils

traindata = MLUtils.ObsView(data, (1:numobs(data))[ids .!= 13])
validdata = MLUtils.ObsView(data, (1:numobs(data))[ids .== 13])
numobs(traindata), numobs(validdata)
```

```
(15193, 485)
```
Next we need to define a learning task that encodes and augments each image and keypoint into a form that we can train a model on. Here we make use of `ProjectiveTransforms` for resizing, cropping and augmenting the image and keypoint, and `ImagePreprocessing` to reshape and normalize the image. Finally, `KeypointPreprocessing` makes sure keypoints fall between -1 and 1.
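For intuition, here is a rough sketch of the kind of mapping `KeypointPreprocessing` performs; `scale_to_unit` is our own illustrative helper, not FastVision's implementation, which may differ in details.

```julia
# Illustrative only: map a pixel coordinate in an image of size `sz` into [-1, 1].
scale_to_unit(k, sz) = 2 .* (k ./ sz) .- 1

scale_to_unit([112.0, 112.0], (224, 224))  # ≈ [0.0, 0.0]: the image center lands near the origin
```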
```julia
sz = (224, 224)
task = SupervisedTask(
    (Image{2}(), Keypoints{2}(1)),
    (
        ProjectiveTransforms(sz, buffered=true, augmentations=augs_projection(max_warp=0)),
        ImagePreprocessing(),
        KeypointPreprocessing(sz),
    )
)
```

```
SupervisedTask(Image{2} -> Keypoints{2, 1})
```
We can check that each image is resized to `(224, 224)` and the keypoints are normalized:
```julia
im, k = getobs(traindata, 1)
x, y = encodesample(task, Training(), (im, k))
summary(x), y
```

```
("224×224×3 Array{Float32, 3}", Float32[0.4072907, -0.48156565])
```
Decoding the encoded targets should give back a point within the original image bounds:
```julia
FastAI.decodeypred(task, Training(), y)
```

```
1-element Vector{StaticArrays.SVector{2, Float32}}:
 [157.61655, 58.064644]
```
That is looking good! We can see that the keypoint stays aligned with the center of the head even after heavy augmentation. Now it is finally time to train a model.
We'll use a modified ResNet as a model backbone and add a couple of layers that regress the keypoint. `taskmodel` knows how to do this by looking at the data blocks used and calling `blockmodel(inblock, outblock, backbone)`, where the encoded input block is an `ImageTensor{2}` and the encoded output block is a `KeypointTensor{2, Float32}((1,))`. The implementation, for reference, looks like this:
```julia
function blockmodel(inblock::ImageTensor{N}, outblock::KeypointTensor{N}, backbone) where N
    outsz = Flux.outputsize(backbone, (ntuple(_ -> 256, N)..., inblock.nchannels, 1))
    outch = outsz[end-1]
    head = Models.visionhead(outch, prod(outblock.sz) * N, p = 0.)
    return Chain(backbone, head)
end
```
```julia
backbone = Metalhead.ResNet(34).layers[1:end-1]
model = taskmodel(task, backbone);
```
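As a quick sanity check (the dummy batch below is our own, not part of the tutorial output), we can push a random batch of encoded images through the untrained model and confirm that it returns one set of keypoint coordinates per sample:

```julia
# Hypothetical smoke test: two random 224×224 RGB inputs in encoded form.
xbatch = rand(Float32, 224, 224, 3, 2)
ybatch = model(xbatch)
size(ybatch)  # expected: (2, 2), i.e. two keypoint coordinates for each of the two samples
```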
Next we create a pair of training and validation data loaders. They take care of batching and loading the data in parallel in the background.
```julia
traindl, validdl = FastAI.taskdataloaders(traindata, validdata, task, 16);
```
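To verify the loaders produce what we expect, we can look at the first training batch (a sketch; the stated shapes are an assumption based on the encodings above):

```julia
# Grab one batch and inspect its shapes. With a batch size of 16 we expect a
# 224×224×3×16 image tensor and a 2×16 matrix of encoded keypoints.
xs, ys = first(traindl)
summary(xs), summary(ys)
```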
With the addition of an optimizer and a loss function, we can now create a `Learner` and start training. Just like `taskmodel`, `tasklossfn` selects the appropriate loss function for a `BlockTask`'s blocks. Here both the encoded target block and the model output block are `block = KeypointTensor{2, Float32}((1,))`, so `blocklossfn(block, block)` is called, which returns mean squared error as a suitable loss function.
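Since the selected loss is just mean squared error over the encoded keypoint coordinates, we can evaluate it directly; the numbers below are made up for illustration.

```julia
lossfn = tasklossfn(task)
# Two made-up samples: columns are samples, rows are the two encoded coordinates.
ypred = hcat(Float32[0.40, -0.50], Float32[-0.10, 0.30])   # model outputs
ytrue = hcat(Float32[0.41, -0.48], Float32[-0.12, 0.33])   # encoded targets
lossfn(ypred, ytrue)  # ≈ 0.00045, the mean of the squared coordinate differences
```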
```julia
learner = Learner(
    model,
    tasklossfn(task);
    data=(traindl, validdl),
    optimizer=Flux.Adam(),
    callbacks=[ToGPU()],
)
```

```
Learner()
```
```julia
fitonecycle!(learner, 5)
```
```
Epoch 1 TrainingPhase(): 100%|██████████████████████████| Time: 0:02:55
┌───────────────┬───────┬─────────┐
│         Phase │ Epoch │    Loss │
├───────────────┼───────┼─────────┤
│ TrainingPhase │   1.0 │ 0.28911 │
└───────────────┴───────┴─────────┘
Epoch 1 ValidationPhase(): 100%|████████████████████████| Time: 0:00:08
┌─────────────────┬───────┬─────────┐
│           Phase │ Epoch │    Loss │
├─────────────────┼───────┼─────────┤
│ ValidationPhase │   1.0 │ 0.01148 │
└─────────────────┴───────┴─────────┘
Epoch 2 TrainingPhase(): 100%|██████████████████████████| Time: 0:01:25
┌───────────────┬───────┬─────────┐
│         Phase │ Epoch │    Loss │
├───────────────┼───────┼─────────┤
│ TrainingPhase │   2.0 │ 0.04745 │
└───────────────┴───────┴─────────┘
Epoch 2 ValidationPhase(): 100%|████████████████████████| Time: 0:00:01
┌─────────────────┬───────┬─────────┐
│           Phase │ Epoch │    Loss │
├─────────────────┼───────┼─────────┤
│ ValidationPhase │   2.0 │ 0.01927 │
└─────────────────┴───────┴─────────┘
Epoch 3 TrainingPhase(): 100%|██████████████████████████| Time: 0:01:25
┌───────────────┬───────┬─────────┐
│         Phase │ Epoch │    Loss │
├───────────────┼───────┼─────────┤
│ TrainingPhase │   3.0 │ 0.03839 │
└───────────────┴───────┴─────────┘
Epoch 3 ValidationPhase(): 100%|████████████████████████| Time: 0:00:01
┌─────────────────┬───────┬─────────┐
│           Phase │ Epoch │    Loss │
├─────────────────┼───────┼─────────┤
│ ValidationPhase │   3.0 │ 0.01213 │
└─────────────────┴───────┴─────────┘
Epoch 4 TrainingPhase(): 100%|██████████████████████████| Time: 0:01:27
┌───────────────┬───────┬─────────┐
│         Phase │ Epoch │    Loss │
├───────────────┼───────┼─────────┤
│ TrainingPhase │   4.0 │ 0.03249 │
└───────────────┴───────┴─────────┘
Epoch 4 ValidationPhase(): 100%|████████████████████████| Time: 0:00:01
┌─────────────────┬───────┬─────────┐
│           Phase │ Epoch │    Loss │
├─────────────────┼───────┼─────────┤
│ ValidationPhase │   4.0 │ 0.00193 │
└─────────────────┴───────┴─────────┘
Epoch 5 TrainingPhase(): 100%|██████████████████████████| Time: 0:01:30
┌───────────────┬───────┬─────────┐
│         Phase │ Epoch │    Loss │
├───────────────┼───────┼─────────┤
│ TrainingPhase │   5.0 │ 0.02201 │
└───────────────┴───────┴─────────┘
Epoch 5 ValidationPhase(): 100%|████████████████████████| Time: 0:00:00
┌─────────────────┬───────┬─────────┐
│           Phase │ Epoch │    Loss │
├─────────────────┼───────┼─────────┤
│ ValidationPhase │   5.0 │ 0.00079 │
└─────────────────┴───────┴─────────┘
```
We can save the model for later inference using `savetaskmodel`:
```julia
savetaskmodel("keypointregression.jld2", task, learner.model)
```
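The saved task and model can later be restored, for example in a fresh session. The call below assumes `loadtaskmodel`, `savetaskmodel`'s counterpart in FastAI.jl.

```julia
# Restore the task and the trained model from the JLD2 file saved above (assumed API).
task2, model2 = loadtaskmodel("keypointregression.jld2")
```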
The loss is going down during training, which is a good sign, but visualizing the predictions against the ground truth gives us a better idea of how well the model performs. We'll use `showoutputs` to compare batches of encoded targets and model outputs. For this we can run the model on a batch from the validation dataset and see how it does.
```julia
showoutputs(task, learner; n=3, context=Validation())
```
We can also see that the trained model generalizes well to the heavy augmentation employed during training. The augmentation also explains why the training loss is so much higher than the validation loss.
```julia
showoutputs(task, learner; n=3, context=Training())
```