Thursday, August 4, 2016

segmentations

The two main segmentation types are:
semantic-segmentation: color the entire image with one color per class (usually up to ~10 classes like road, cars, pedestrians, etc.). If there are 3 overlapping cars, we will just see all of them in the same color.
Implementation is usually done in one feed-forward pass, using a per-pixel softmax on the final layer (minimal sketch below).
See U-Net / SegNet / DeepLab / PSPNet.
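
Here is a minimal sketch of that idea (my illustration, Keras 2 API; the layer sizes and the single stand-in conv layer are placeholders, not any real architecture):

from keras.models import Model
from keras.layers import Input, Conv2D, Activation

n_classes = 10                                               # road, car, pedestrian, ...
x = Input(shape=(224, 224, 3))
feat = Conv2D(64, 3, padding='same', activation='relu')(x)   # stand-in for the real encoder/decoder
logits = Conv2D(n_classes, 1, padding='same')(feat)          # one score per class, per pixel
out = Activation('softmax')(logits)                          # softmax over the class axis, per pixel
model = Model(x, out)
model.compile(optimizer='adam', loss='categorical_crossentropy')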

instance-segmentation: color each instance separately, even if they are of the same class (3 cars should get 3 masks).
The usual implementation is a first sweep for object-proposals, and then refining each image-patch in a second sweep. Using multiple heads, you can later get the bounding-box / classification / human-joint key-points of each image-patch. Thus it is usually slower than semantic-segmentation.
Notable architectures: R-CNN / Fast R-CNN / Faster R-CNN / Mask R-CNN, and DeepMask / SharpMask / MultiPathNet. YOLO takes a different approach.


DeepMask ("Learning to Segment Object Candidates") is covered below.
I also want to mention newer extensions of DeepMask, which build on it and improve it:
SharpMask ("Learning to Refine Object Segments") - a really good article, which both improves the speed of DeepMask itself and adds architecture on top of it.
A MultiPath Network for Object Detection


Object detection and Object proposals.
Object detection is one of the most foundational tasks in computer vision. Unlike classification, which says what the main object somewhere in the image is, object-detection needs to find multiple objects in the scene, and their exact location as a bounding-box or even a pixel-mask.
The usual CNN architectures are good at returning one object classification and/or one location, but not multiple ones in one sweep.
Until recently, the dominant paradigm in object detection was the (dumb) sliding-window framework: a classifier is applied at every object location and scale, which means running it a lot of times.
More recently, Girshick et al. [10] proposed a two-phase approach, R-CNN. First, a rich set of object proposals (i.e., a set of image regions which are likely to contain an object) is generated using a fast (but possibly imprecise) algorithm. Second, a convolutional neural network classifier is applied on each of the proposals. This approach provides a notable gain in object detection accuracy compared to classic sliding window approaches. Since then, most state-of-the-art object detectors rely on object proposals as a first pre-processing step [10, 15, 33].


DeepMask goal: given an input image patch, the algorithm generates a class-agnostic mask and an associated score which estimates the likelihood that the patch fully contains a centered object (without any notion of an object category; in other words, it's a binary classifier for "something is in the center of the region", not "it's a horse" or "it's a ball").


The core of our model is a ConvNet which jointly predicts the mask and the object score. A large part of the network is shared between those two tasks: only the last few network layers are specialized for separately outputting a mask and score prediction

Input:
image-patch
label (binary score): 1 or -1, whether an object is fully contained and centered in the patch.
label (mask): 1 or -1 on every pixel; only relevant if the score label is 1 (toy sketch below).
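
A toy sketch of one training sample under this ±1 labeling convention (my illustration; the exact patch and mask sizes are in the paper, not here):

import numpy as np

def make_sample(image_patch, object_mask, contains_centered_object):
    """image_patch: HxWx3 array; object_mask: boolean HxW array."""
    y_score = 1.0 if contains_centered_object else -1.0
    y_mask = np.where(object_mask, 1.0, -1.0)   # per-pixel labels in {+1, -1}
    return image_patch, y_score, y_mask         # y_mask meaningful only when y_score == 1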


Architecture:
Based on VGG-A (8 conv layers, 5 pooling layers) without the last pooling layer and the dense head. They also use pre-trained weights. That leaves 4 max-poolings, which take a 3×W×H image and make it 512 × W/16 × H/16. Look at the figure in the paper for the different heads (rough Keras sketch below).
Note that in the seg-branch there is no activation (no ReLU or other).
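
A rough Keras sketch of the shared-trunk / two-head layout (my reading of the paper; the stand-in trunk and the exact layer sizes are approximations, not the reference code):

from keras.models import Model
from keras.layers import (Input, Conv2D, MaxPooling2D, Flatten, Dense,
                          Dropout, Reshape)

inp = Input(shape=(224, 224, 3))
# stand-in for the pre-trained VGG-A conv layers with the last pooling
# removed: 4 max-poolings overall, so 224x224x3 -> 14x14x512
trunk = inp
for n_filters in (64, 128, 256, 512):
    trunk = Conv2D(n_filters, 3, padding='same', activation='relu')(trunk)
    trunk = MaxPooling2D(2)(trunk)
trunk = Conv2D(512, 3, padding='same', activation='relu')(trunk)

# segmentation head: note there is no ReLU / nonlinearity, as said above
seg = Conv2D(512, 1)(trunk)
seg = Flatten()(seg)
seg = Dense(512)(seg)                      # low-dimensional bottleneck
seg = Dense(56 * 56)(seg)                  # one linear output per mask pixel
seg = Reshape((56, 56), name='seg_out')(seg)

# score head: pool + two FC layers with ReLU/dropout, then one linear score
sc = MaxPooling2D(2)(trunk)
sc = Flatten()(sc)
sc = Dense(512, activation='relu')(sc)
sc = Dropout(0.5)(sc)
sc = Dense(1024, activation='relu')(sc)
sc = Dropout(0.5)(sc)
sc = Dense(1, name='score_out')(sc)

model = Model(inp, [seg, sc])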

Joint-loss: 
Smartly defined as 
score_constant × binary log-loss on the score (1/-1) 
+ (y_score+1)/2 × mean binary log-loss over the mask pixels (i.e., the pixel-wise sum divided by the patch size)

We alternate between backpropagating through the segmentation branch and the scoring branch (and set score_constant to 1/32).
For the scoring branch, the data is sampled such that the model is trained with an equal number of positive and negative samples. Note that the factor (y_score+1)/2 multiplying the segmentation term implies that we only backpropagate the error over the segmentation branch if y_k = 1. In other words, the segmentation branch learns only on real objects, and at test time we ignore its output otherwise.
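
Written out (this is how I read the loss in the paper, with λ the score constant and w^o, h^o the mask output dimensions):

\mathcal{L}(\theta) = \sum_k \left[ \frac{1+y_k}{2\,w^o h^o} \sum_{ij} \log\!\left(1 + e^{-m_k^{ij} f_{seg}^{ij}(x_k)}\right) + \lambda \log\!\left(1 + e^{-y_k f_{score}(x_k)}\right) \right]

with λ = 1/32, y_k ∈ {1, -1} the score label, and m_k^{ij} ∈ {1, -1} the per-pixel mask labels.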

Full-scene
Go over the whole image in strides of 16 pixels, testing multiple patches in each feed-forward pass; some tricks are used to make this efficient (read the paper). A naive version is sketched below.
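
A naive full-scene sketch (my illustration, not the paper's optimized trick; assumes the two-output model from above and scipy for resizing):

import numpy as np
from scipy.ndimage import zoom

def dense_proposals(image, model, patch=224, stride=16, scales=(0.5, 1.0, 2.0)):
    proposals = []
    for s in scales:
        scaled = zoom(image, (s, s, 1))   # resize H and W, keep the channels
        H, W = scaled.shape[:2]
        for y in range(0, H - patch + 1, stride):
            for x in range(0, W - patch + 1, stride):
                crop = scaled[y:y + patch, x:x + patch]
                mask, score = model.predict(crop[None])   # batch of one patch
                proposals.append((float(score), (x, y, s), mask[0]))
    return proposals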


Implementation Code

unofficial-implementation in keras
Let's discuss some non-trivial points.

combined-loss
Here we have one combined loss function for two heads, which can't simply run in parallel: if the classifier-head decides there is no object, the segmentation-head output is ignored, and we do not back-propagate the loss from that head.
Here is a workaround; there is a request for change on this multi-target-loss.
Another trick (more of a hack) is to use a dirty-pixel, like the upper-left one, which marks whether the patch contains an object.

import keras.backend as K
score_output_lambda = 1.0  # weight of the score loss (re-set each round, see below)

def binary_regression_error(y_true, y_pred):
    # binary log-loss for +-1 labels: log(1 + exp(-y * f(x)))
    return score_output_lambda * K.log(1 + K.exp(-y_true * y_pred))

def mask_binary_regression_error(y_true, y_pred):
    # dirty-pixel hack: the upper-left pixel of y_true flags the sample, so
    # trick is 0 when that pixel is +1 (negative) and 1 when it is -1
    # (positive); for a batch of 3 with one negative, e.g. [0, 1, 1]
    trick = 0.5 * (1 - y_true[:, 0, 0])
    # mean per-pixel log-loss, gated per-sample; note it is not
    # re-normalized across batches (some batches can be all zeros)
    return trick * K.mean(K.log(1 + K.exp(-y_true * y_pred)), axis=(1, 2))

If the absolute value of one branch's loss is much higher than the other's, we will not "learn" in the other branch. In the article, they specify that they alternated between propagating the branches.
One can simply change the "strength" of the loss each round (and then recompile):
score_output_lambda = 1 if round_number % 2 == 0 else min_score_output_lambda
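
For instance (a sketch of that alternating scheme, not the paper's code; n_rounds, x_batch, y_mask_batch and y_score_batch are placeholder names, and model is the two-output model from above):

min_score_output_lambda = 1.0 / 32   # the paper's score constant

for round_number in range(n_rounds):
    score_output_lambda = 1.0 if round_number % 2 == 0 else min_score_output_lambda
    # the loss functions above read score_output_lambda when the graph is
    # (re)built, so recompiling applies the new weight
    model.compile(optimizer='sgd',
                  loss=[mask_binary_regression_error, binary_regression_error])
    model.fit(x_batch, [y_mask_batch, y_score_batch], epochs=1, verbose=0)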




SharpMask

First, a faster, better DeepMask: they use ResNet instead of VGG, and change the classification heads to a 128-channel 1×1 conv (output: a 128×10×10 tensor) → a 512 1×1 tensor. This splits into:
Score: add a 1024 1×1, then a 1×1×1 score.
Segmentation:

