Wednesday, July 6, 2016

Features regression/localization

There are many approaches. We start with the simple ones, which perform badly, and continue to those that perform somewhat better.

Loss via Linear regression

Combine all the features into one loss function and compute it as a sum (difference in mouth position + difference in nose position + difference in eye positions, etc.). The results are usually quite rough, so various cascades have been suggested to fine-tune the windows on which the CNN runs.
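
As a rough sketch of such a combined loss, assuming each feature is a single (x, y) point and the per-feature "diff" is a squared distance (the feature names and the squared-error choice are assumptions for illustration, not taken from any of the papers below):

```python
import numpy as np

def landmark_loss(pred, target):
    """Sum of squared (x, y) errors over all facial features.

    pred, target: dicts mapping a feature name to an (x, y) pair.
    The feature names used below are hypothetical.
    """
    total = 0.0
    for name, true_xy in target.items():
        diff = np.asarray(pred[name], dtype=float) - np.asarray(true_xy, dtype=float)
        total += float(np.sum(diff ** 2))  # squared distance for this feature
    return total

pred = {"mouth": (50.0, 80.0), "nose": (48.0, 60.0), "left_eye": (35.0, 40.0)}
target = {"mouth": (52.0, 80.0), "nose": (48.0, 63.0), "left_eye": (35.0, 40.0)}
print(landmark_loss(pred, target))  # 2^2 + 3^2 + 0 = 13.0
```

Because every feature is summed into a single number, one badly localized feature dominates the loss, which is one reason the raw results tend to be rough.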

  • Facial recognition with no cascade: Tutorial with full code in Lasagne
  • Joint location with cascade: DeepPose. Run one DNN to get rough estimates of the joint locations. Cut a small window around each joint location and run the same network architecture (but with different parameters) on it to get a finer estimate. Repeat once more to get the best results.
  • Deep Convolutional Network Cascade for Facial Point Detection
    Modification 1: absolute value of tanh, instead of ReLU, on some layers.
    Modification 2: locally shared weights instead of globally shared weights in the convolution layers.
    Modification 3: many networks. Train 3 networks with the same architecture (high1) to detect the eye region, nose region and mouth region.
    Then pass the result to multiple shallow networks (shallow1), again each trained on its own, each of which sees only a small region. Such a network can only slightly modify the output location: we assume it is more accurate, but it 'does not see the big picture'.
    Then pass it again to multiple shallow networks with an even smaller region.
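
The cascade idea shared by the entries above (crop a window around the current estimate, predict inside it, map the result back to image coordinates, then shrink the window) can be sketched as follows. This is a minimal illustration, not the code of either paper: `stage_models` are hypothetical stand-ins for the trained networks, and the toy "model" simply returns the brightest pixel of the crop.

```python
import numpy as np

def crop_window(image, center_xy, half_size):
    """Cut a square window around center_xy, clipped to the image bounds.

    Returns the crop and the (x0, y0) offset of its top-left corner.
    """
    h, w = image.shape[:2]
    cx, cy = int(round(center_xy[0])), int(round(center_xy[1]))
    x0, y0 = max(cx - half_size, 0), max(cy - half_size, 0)
    x1, y1 = min(cx + half_size, w), min(cy + half_size, h)
    return image[y0:y1, x0:x1], (x0, y0)

def refine(image, coarse_xy, stage_models, half_sizes):
    """Run each stage on a shrinking window around the current estimate.

    stage_models: callables taking a crop and returning (x, y) in crop
    coordinates (hypothetical stand-ins for the per-stage networks).
    """
    xy = np.asarray(coarse_xy, dtype=float)
    for model, half in zip(stage_models, half_sizes):
        crop, (x0, y0) = crop_window(image, xy, half)
        local_xy = np.asarray(model(crop), dtype=float)
        xy = local_xy + np.array([x0, y0], dtype=float)  # back to image coordinates
    return xy

def brightest_pixel(crop):
    """Toy stage model: (x, y) of the brightest pixel in the crop."""
    iy, ix = np.unravel_index(np.argmax(crop), crop.shape)
    return (ix, iy)

image = np.zeros((100, 100))
image[30, 60] = 1.0  # the "joint" sits at x=60, y=30
refined = refine(image, (50, 40), [brightest_pixel, brightest_pixel], [20, 8])
print(refined)
```

Note that a later stage can only fix the estimate if the true location still falls inside its smaller window, which is why the early stages need to be roughly right.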

Classification as Loss function

Cascade: This is usually used for bounding-box calculations in classification tasks. Since you already have a good classifier model, you want to reuse it. This is done by cutting the image into many very small sub-images and checking each one for the existence of a "nose" or an "eye"; if one is found, mark what is there. Applying this to the whole image gives a rough estimate of where the nose is.
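
A minimal sketch of this sliding-window scan, assuming a square window and a fixed stride. `classify` is a hypothetical stand-in for the reused classifier; in the toy run below it is just the mean brightness of the patch.

```python
import numpy as np

def sliding_window_scores(image, classify, win, stride):
    """Score every win x win patch with `classify` and return a coarse score grid.

    classify: callable mapping a patch to a score that the feature is present
    (hypothetical stand-in for the existing classifier).
    """
    h, w = image.shape[:2]
    rows = (h - win) // stride + 1
    cols = (w - win) // stride + 1
    scores = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            y, x = r * stride, c * stride
            scores[r, c] = classify(image[y:y + win, x:x + win])
    return scores

def best_window_center(scores, win, stride):
    """Center (x, y) of the highest-scoring window, in image coordinates."""
    r, c = np.unravel_index(np.argmax(scores), scores.shape)
    return (c * stride + win / 2.0, r * stride + win / 2.0)

image = np.zeros((64, 64))
image[40:48, 16:24] = 1.0  # toy "nose" patch
scores = sliding_window_scores(image, classify=np.mean, win=8, stride=8)
print(best_window_center(scores, win=8, stride=8))
```

The estimate is only as fine as the stride, so the cost of a more precise location is many more classifier evaluations.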

Detection
Region proposal networks: Faster R-CNN
YOLO

Heatmap as Loss function

The output of the network is a heatmap (WxH pixels with intensity levels from 0 to 1) in which a few pixels around the feature are highlighted. This appears to give better results, as it is a more "natural" computation for a CNN. Do note that regular classification models are shaped like a cone, with smaller and smaller layers up to the FC result. This architecture is not adequate here.
In theory, one may not need a cascade here.
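
For illustration, a target heatmap and the matching loss might look like the sketch below. The Gaussian bump, the value of `sigma` and the mean-squared-error loss are assumptions; the notes above only say that a few pixels around the feature are highlighted.

```python
import numpy as np

def gaussian_heatmap(height, width, center_xy, sigma=2.0):
    """Target heatmap in [0, 1] with a Gaussian bump at center_xy."""
    ys, xs = np.mgrid[0:height, 0:width]
    cx, cy = center_xy
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

def heatmap_loss(pred, target):
    """Mean squared error between predicted and target heatmaps."""
    return float(np.mean((pred - target) ** 2))

def decode(heatmap):
    """Read the feature location back as the (x, y) of the hottest pixel."""
    iy, ix = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return int(ix), int(iy)

target = gaussian_heatmap(64, 48, center_xy=(20, 33))
print(target.shape, decode(target))
```

Since the output has the same spatial layout as the input, the network needs an architecture that keeps (or restores) resolution instead of shrinking down to an FC layer.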

Localization as side-effect

The CAM technique (class activation mapping) can generate the localization automatically, as a side effect of the attention model.
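
As a sketch of how CAM reads a location out of a classifier: the map for a class is the sum of the last convolutional feature maps, each weighted by that class's weight in the linear layer that follows global average pooling. The array shapes and the min-max normalization below are assumptions for illustration, and the inputs are toy values rather than real network activations.

```python
import numpy as np

def class_activation_map(features, fc_weights, class_idx):
    """CAM for one class: weighted sum of the last conv feature maps.

    features:   (K, H, W) activations of the last convolutional layer
    fc_weights: (num_classes, K) weights of the linear layer that follows
                global average pooling
    """
    cam = np.tensordot(fc_weights[class_idx], features, axes=([0], [0]))  # (H, W)
    cam = cam - cam.min()
    peak = cam.max()
    return cam / peak if peak > 0 else cam  # scale to [0, 1] for display

features = np.zeros((2, 4, 4))
features[0, 1, 2] = 5.0  # channel 0 fires at row 1, col 2
features[1, 3, 0] = 5.0  # channel 1 fires at row 3, col 0
fc_weights = np.array([[1.0, 0.0],
                       [0.0, 1.0]])
cam = class_activation_map(features, fc_weights, class_idx=0)
print(np.unravel_index(np.argmax(cam), cam.shape))
```

The map has the resolution of the last convolutional layer, so it is usually upsampled to the input size before being overlaid on the image.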


Appendix, Honorable mentions: Linear combination of the previous layers


This is good for whole-face pixels, but not as good for locating occluded features in a face that is not fully facing the camera.


Datasets and competition results for Object Segmentation
COCO - a huge one.


Datasets and competition results for Faces

HPDatabase 
Youtube face DB
TAU article: TCNN, Facial Landmark Detection with Tweaked Convolutional Neural Networks. Implementation in Caffe.


Datasets and competition results for human joints 

2016 - COCO keypoint challenge - new (July 2016) and probably the best one to use. 90K person instances labeled with keypoints (the majority of people in COCO at medium and large scales) and over 1 million total labeled keypoints.

The Human Pose Recovery and Behavior Analysis HuPBA 8k+ dataset, from ChaLearn

FLIC: Frames Labeled In Cinema contains 4000 training and 1000 test images obtained from popular Hollywood movies. The images contain people in diverse poses and especially diverse clothing. For each labeled human, 10 upper body joints are labeled.
Note: FLIC_Full contains more challenging (occluded) scenes, which are too hard to train from.
FLIC-motion-dataset includes short clips. In this case, the motion may help produce better estimates.

Leeds Sports Dataset [12] and its extension [13], which we will jointly denote by LSP. Combined they contain 11000 training and 1000 testing images. These are images from sports activities and as such are quite challenging in terms of appearance and especially articulations. In addition, the majority of people have 150 pixel height which makes the pose estimation even more challenging. In this dataset, for each person the full body is labeled with total 14 joints.

MPII Human Pose: the dataset includes around 25K images containing over 40K people with annotated body joints. The images were systematically collected using an established taxonomy of everyday human activities. Overall the dataset covers 410 human activities, and each image is provided with an activity label.
Best results for MPII

2016 Convolutional Pose Machines in Caffe/MATLAB; Stacked Hourglass in Torch
2015 DeepCut: Joint Subset Partition and Labeling for Multi Person Pose Estimation





