There are many approaches. We start with the simple ones (which perform badly) and move on to those that perform somewhat better.
Loss via Linear regression
Combine all the features into one loss function and compute it (diff of mouth + diff of nose + diff of eyes, etc.). The results are usually quite rough, so various cascades have been suggested to fine-tune the windows on which the CNN runs.
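As a rough illustration of such a combined loss, the sketch below sums squared coordinate differences over all landmarks. The function name `landmark_regression_loss`, the dictionary layout and the sample coordinates are assumptions made for this example; they are not taken from the tutorials or papers listed here.

```python
def landmark_regression_loss(predicted, target):
    """Sum of squared coordinate differences over all landmarks.

    Both arguments map a landmark name to an (x, y) pair.
    """
    total = 0.0
    for name, (tx, ty) in target.items():
        px, py = predicted[name]
        total += (px - tx) ** 2 + (py - ty) ** 2
    return total


# Made-up coordinates, only to show the call.
target = {"left_eye": (30.0, 40.0), "nose": (48.0, 60.0), "mouth": (48.0, 80.0)}
predicted = {"left_eye": (31.0, 40.0), "nose": (48.0, 62.0), "mouth": (45.0, 84.0)}
loss = landmark_regression_loss(predicted, target)
```

Every landmark contributes to a single scalar, so a large error on one feature (the mouth here) dominates the total.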
- Facial recognition with no cascade: Tutorial with full code in Lasagne
- Joint location with cascade: DeepPose - Run one DNN to get rough estimates of the joint locations. Cut a small window around each joint location and run the same network architecture (but with different parameters) on it to get a finer estimate. Repeat this one more time to get the best results.
- Deep Convolutional Network Cascade for Facial Point Detection
Modification 1: absolute value of tanh, instead of ReLU, on some layers.
Modification 2: locally shared weights instead of globally shared weights in the convolution layers.
Modification 3: a multi-network structure. Train 3 networks with the same architecture (high1) to detect the eye region, the nose region and the mouth region.
Then pass the result to multiple shallow networks (shallow1), each trained on its own and each seeing only a small region. These can only slightly modify the output location: we assume they are more accurate, but they do not "see the big picture".
Then pass it again to multiple shallow arch... with an even smaller region.
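The coarse-to-fine idea shared by these cascades can be sketched as follows. This is only an outline under stated assumptions: `coarse_net` and the per-stage refinement callables are hypothetical stand-ins for trained networks, and `brightest_pixel` is a toy replacement so the example runs without any model.

```python
def crop(image, cx, cy, half):
    """Cut a square window around (cx, cy), clipped to the image borders.

    Returns the window and the (x0, y0) offset of its top-left corner.
    """
    h, w = len(image), len(image[0])
    x0, x1 = max(0, cx - half), min(w, cx + half + 1)
    y0, y1 = max(0, cy - half), min(h, cy + half + 1)
    window = [row[x0:x1] for row in image[y0:y1]]
    return window, (x0, y0)


def cascade_locate(image, coarse_net, stages):
    """Rough estimate first, then refine inside ever smaller windows.

    stages is a list of (half_window_size, refine_net) pairs; each
    refine_net returns a position relative to the window it was given.
    """
    x, y = coarse_net(image)
    for half, refine_net in stages:
        window, (x0, y0) = crop(image, x, y, half)
        dx, dy = refine_net(window)
        x, y = x0 + dx, y0 + dy  # back to full-image coordinates
    return x, y


def brightest_pixel(window):
    """Toy stand-in for a refinement network (assumption, not a real model)."""
    _, x, y = max((value, x, y)
                  for y, row in enumerate(window)
                  for x, value in enumerate(row))
    return x, y


# Toy image with one bright "joint" at x=20, y=13.
image = [[0.0] * 32 for _ in range(32)]
image[13][20] = 1.0
result = cascade_locate(
    image,
    coarse_net=lambda img: (16, 16),  # hypothetical rough guess
    stages=[(8, brightest_pixel), (4, brightest_pixel)],
)
```

Each stage only sees its own window, which is why a later stage can correct the estimate by at most the window size.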
Classification as Loss function
Cascade: This is usually used for bounding-box calculations on classification tasks. Since you already have a good classifier model, you want to reuse it. This is done by cutting the image into many very small sub-images and checking each one for the existence of a "nose" or "eye"; if one is found, mark what is there. Applied to the whole image, this gives a rough estimate of where the nose is.
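A minimal sketch of this sliding-window scheme is shown below. `classify_patch` is a hypothetical placeholder for an existing classifier that returns a score for one patch; the patch size, the stride and the toy `brightness_score` function are arbitrary choices for the example.

```python
def sliding_window_scores(image, classify_patch, patch=4, stride=2):
    """Score every patch; return (score, x, y) per top-left corner."""
    h, w = len(image), len(image[0])
    results = []
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            window = [row[x:x + patch] for row in image[y:y + patch]]
            results.append((classify_patch(window), x, y))
    return results


def rough_location(image, classify_patch, patch=4, stride=2):
    """Centre of the best-scoring patch, as a coarse position estimate."""
    _, x, y = max(sliding_window_scores(image, classify_patch, patch, stride))
    return x + patch // 2, y + patch // 2


def brightness_score(window):
    """Toy stand-in for a real "nose" classifier (assumption)."""
    return sum(sum(row) for row in window)


# Toy image with a 2x2 bright blob covering x=7..8, y=5..6.
image = [[0.0] * 12 for _ in range(12)]
for yy in (5, 6):
    for xx in (7, 8):
        image[yy][xx] = 1.0

location = rough_location(image, brightness_score)
```

The estimate can only be as precise as the stride and patch size allow, which is why the result is described as rough.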
Heatmap as Loss function
The output of the network is a heatmap (WxH pixels with intensity levels from 0 to 1), where a few pixels around the feature are highlighted. This appears to provide better results, as it is a more "natural" calculation for a CNN. Note that regular classification models are shaped like a cone, with smaller and smaller layers down to the fully connected (FC) output. That architecture is not adequate here.
In theory, one may not need a cascade here.
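To make the heatmap target concrete, here is a small sketch that builds a target map for one feature and reads a location back from a map. The Gaussian shape and the `sigma` value are assumptions (the text only says that a few pixels around the feature are highlighted), and the pixel-wise mean squared error is just one plausible choice of loss.

```python
import math


def gaussian_heatmap(width, height, cx, cy, sigma=1.5):
    """Target map with values in (0, 1], peaking at exactly 1 on (cx, cy)."""
    return [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
             for x in range(width)]
            for y in range(height)]


def heatmap_argmax(heatmap):
    """Read the feature location back as the brightest pixel."""
    _, x, y = max((value, x, y)
                  for y, row in enumerate(heatmap)
                  for x, value in enumerate(row))
    return x, y


def heatmap_mse(predicted, target):
    """Pixel-wise mean squared error between two maps of equal size."""
    count = len(target) * len(target[0])
    return sum((p - t) ** 2
               for prow, trow in zip(predicted, target)
               for p, t in zip(prow, trow)) / count


target = gaussian_heatmap(width=16, height=12, cx=10, cy=4)
peak = heatmap_argmax(target)
```

Because the output keeps the spatial layout of the input, the network has to preserve resolution up to the last layer instead of shrinking down to a few FC units.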
Localization as side-effect
The CAM technique (class activation mapping) can generate the localization automatically, as a side-effect of the attention model.
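In its usual formulation, a class activation map is the weighted sum of the last convolutional feature maps, using the weights that connect the globally average-pooled features to the class of interest. The sketch below shows only that weighted sum; the feature maps and weights are made-up toy values, not output from a real network.

```python
def class_activation_map(feature_maps, class_weights):
    """Weighted sum of K feature maps (each H x W) for a single class."""
    h, w = len(feature_maps[0]), len(feature_maps[0][0])
    cam = [[0.0] * w for _ in range(h)]
    for fmap, weight in zip(feature_maps, class_weights):
        for y in range(h):
            for x in range(w):
                cam[y][x] += weight * fmap[y][x]
    return cam


# Two toy 2x3 feature maps and their weights for one class (assumed values).
feature_maps = [
    [[0, 1, 0],
     [0, 0, 0]],
    [[0, 0, 0],
     [0, 0, 2]],
]
class_weights = [0.5, 1.0]
cam = class_activation_map(feature_maps, class_weights)
```

The brightest cell of `cam` marks the region that contributed most to the class score, which is what gives the localization for free.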
Appendix, Honorable mentions: Linear combination of the previous layers
This works well for the pixels of the whole face, but not as well for locating occluded features in a face that is not fully facing the camera.
Datasets and competition results for human joints
FLIC: Frames Labeled In Cinema contains 4000 training and 1000 test images obtained from popular Hollywood movies. The images contain people in diverse poses and especially diverse clothing. For each labeled human, 10 upper body joints are labeled.
Note: FLIC_Full contains more challenging (occluded) scenes, which are too hard to train from.
FLIC-motion-dataset
Includes short clips. In this case, the motion may help produce better estimates.
Leeds Sports Dataset [12] and its extension [13], which we will jointly denote by LSP. Combined, they contain 11000 training and 1000 testing images. These are images from sports activities and as such are quite challenging in terms of appearance and especially articulation. In addition, the majority of people are 150 pixels tall, which makes pose estimation even more challenging. In this dataset, the full body of each person is labeled with a total of 14 joints.
MPII Human Pose: The dataset includes around 25K images containing over 40K people with annotated body joints. The images were systematically collected using an established taxonomy of everyday human activities. Overall the dataset covers 410 human activities, and each image is provided with an activity label.
Best results:
2016 Convolutional Pose Machines
2015 DeepCut: Joint Subset Partition and Labeling for Multi Person Pose Estimation