Monday, July 18, 2016

Statefarm - experiment 2

Use a pre-trained googlenet.
Step 1: get a pre-trained googlenet
Find one on the net, usually converted from a Caffe model. Make sure you understand how to pre-process the data by testing it on a few images.
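
As a rough illustration of the pre-processing step, here is a minimal sketch of what a Caffe-converted model typically expects. The mean values and the channels-first layout are assumptions (they are the ones commonly shipped with Caffe ImageNet models); check them against the model you actually downloaded.

```python
import numpy as np

# Assumption: per-channel BGR means commonly used with Caffe ImageNet models.
CAFFE_BGR_MEAN = np.array([104.0, 117.0, 123.0], dtype=np.float32)

def preprocess_caffe_style(img_rgb, channels_first=True):
    """img_rgb: H x W x 3 uint8 array in RGB order, already resized (e.g. 224x224)."""
    x = img_rgb.astype(np.float32)
    x = x[:, :, ::-1]               # RGB -> BGR, the channel order Caffe models expect
    x = x - CAFFE_BGR_MEAN          # subtract the per-channel mean (broadcast over H, W)
    if channels_first:
        x = x.transpose(2, 0, 1)    # H x W x C -> C x H x W
    return x[np.newaxis]            # add the batch dimension
```

Feeding a few known ImageNet images through this and checking that the top predictions make sense is a quick way to confirm the pre-processing matches the model.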

Step 2: Cut off its head classifier and replace it with yours. Then train your new classifier on the frozen body (the no-head part).
The googlenet classifier has 1000 categories, and there are actually 3 heads to the body - a "hydra" :)
There is a main one and 2 aux classifiers in the middle. All need to be replaced and trained.

This is the graph of the main classifier being trained while all the rest of the model is frozen (in fact, I dumped the output of the 'merge' layer just before the classifier sections to a file; it saved a lot of time, but cost a few dozen gigs of disk space).
After epoch 12 (model_chapter6_12epoc) we don't see any further improvement. This is the output at epoch 32:
loss: 0.5412 - acc: 0.8648 - val_loss: 0.9205 - val_acc: 0.6908
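
The caching trick mentioned above (run the frozen body once, store its output, then train the head on the stored features) could look roughly like this. `body_predict` is a hypothetical callable standing in for the truncated model's predict function; the file layout is an assumption too.

```python
import numpy as np

def cache_features(body_predict, images, out_path, batch_size=32):
    """Run the frozen body once over `images` and save its outputs to `out_path`.

    body_predict: hypothetical callable mapping a batch of images to a batch
                  of feature maps (e.g. the output of the 'merge' layer).
    """
    chunks = []
    for start in range(0, len(images), batch_size):
        chunks.append(body_predict(images[start:start + batch_size]))
    features = np.concatenate(chunks, axis=0)
    np.save(out_path, features)     # one big .npy file; this is where the disk space goes
    return features
```

The head can then be trained on `np.load(out_path)` for as many epochs as needed without ever touching the body again.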

In the same way, we will train the other 2 aux classifiers. First aux1 (the middle one):

  loss: 0.1330 - acc: 0.9771 - val_loss: 1.2633 - val_acc: 0.8178
Saved model to disk model_chapter6_aux1_try2_11epoc

aux0 behaves surprisingly well (look at the weird behaviour of the validation: higher than the training in epoch 1, then going down...)

loss: 7.4296 - acc: 0.3291 - val_loss: 0.4846 - val_acc: 0.8513
Saved model to disk model_chapter6_aux0_1epoc

loss: 0.1624 - acc: 0.9737 - val_loss: 0.5066 - val_acc: 0.9082
Saved model to disk model_chapter6_aux0_8epoc

Even without fine-tuning, we might use aux0 for a validation loss of 0.5, or the other classifiers for a loss of 0.9/1.2.
Let's test this and submit the model with aux0
(used model_chapter6_aux0_25epoc)
Validation score= 0.1 accuracy= 0.92
LeaderBoard: 1.82
Conclusion: a classic case of overfitting to the validation set. So let's continue to step 3
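
For reference, assuming the scores quoted here are multi-class log loss (the metric Kaggle used for this competition), a minimal numpy version is sketched below. The clipping constant is an assumption.

```python
import numpy as np

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """y_true: (N,) integer labels; probs: (N, C) predicted class probabilities."""
    p = np.clip(probs, eps, 1.0 - eps)          # avoid log(0)
    p = p / p.sum(axis=1, keepdims=True)        # renormalise rows after clipping
    return -np.mean(np.log(p[np.arange(len(y_true)), y_true]))
```

A handful of confident but wrong predictions dominates this metric, which is one way a model can look fine on accuracy and still score badly on the leaderboard.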

Step 3: connect the new heads and fine-tune the entire model.
We will do it by freezing most layers (the inception blocks), except the last one or two blocks.
We will use the new classifier.
A note on this quite "low" number:
  • We did not augment the data while training this
  • We used rmsprop, which is a "fast but less accurate" optimizer.
We did this because we will have a training step later, which should work with augmentation and better optimization.

3.1 bad-experiment example (everyone has bugs...)
Use only a partial graph (only aux0). High learning rate of 0.001. Heavy augmentation (flip/zoom/shear). Can you see the problem here?

This should never happen, and is usually a bug. The bug in this case was a broken random flip (on training it always flipped, on validation it never flipped).
Result in: model_chapter6_aux0_finetune7epoc
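
For comparison, a random flip that avoids the bug described above draws one coin toss per training sample and leaves validation images alone. This is a minimal sketch, not the original augmentation code; the channels-last layout and the function name are assumptions.

```python
import numpy as np

def random_horizontal_flip(batch, rng, p=0.5, training=True):
    """batch: (N, H, W, C) array. Flips each training sample independently with probability p."""
    if not training:
        return batch                          # validation/test images are left untouched
    flip = rng.random(len(batch)) < p         # one independent coin toss per sample
    out = batch.copy()
    out[flip] = out[flip][:, :, ::-1, :]      # reverse the width axis of the chosen samples
    return out
```

Usage: `random_horizontal_flip(x_batch, np.random.default_rng(0))`. With p=0.5 roughly half of the training samples end up mirrored, so training and validation images come from overlapping distributions.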

3.2 Can we use only aux0 and a small subset of the googlenet? (the answer is no...)
We again use only the partial graph with aux0, this time with a slower learning rate of 0.0001.
Result sample in: model_chapter6_aux0_finetune_lr_1e40epoc
This proved to give bad results.

3.3 Let's fine-tune the whole model, and look at the result of the end classifier.
Fine-tuned for 16 epochs using SGD with lr 0.003. This proved to be a great improvement, LB=0.51286
Lock the first layers: conv1, conv2, inception_3a/b, inception_4a, loss1
Keep the others trainable: inception_4b/c/d/e, inception_5a/b and loss2/3
Compile while adding loss_weights, giving more weight to the main classifier:
full_model.compile(loss='categorical_crossentropy',loss_weights=[0.2,0.2,0.6], optimizer=SGD(lr=0.003, momentum=0.9), [stopped in the middle]
Augmentation used: googlenet_augment with shift 0.05, rotation 8 degrees, zoom 0.1, shear 0.2
Saved model after 16 epochs: model_chapter6_finetune_all_lr_1e4_binary
see: statefarm-chapter6-finetune-0.003-fix_aug.ipynb
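
The lock/keep-trainable split above can be expressed as a small helper that works on anything with `.name` and `.trainable` attributes (Keras layers have both). This is a sketch under assumptions: the exact layer names in the converted model may differ from the prefixes used here.

```python
def freeze_by_prefix(layers, frozen_prefixes):
    """Set layer.trainable = False for layers whose name starts with a frozen prefix.

    Returns the names of the layers that remain trainable.
    """
    prefixes = tuple(frozen_prefixes)
    trainable = []
    for layer in layers:
        layer.trainable = not layer.name.startswith(prefixes)
        if layer.trainable:
            trainable.append(layer.name)
    return trainable

# Prefixes taken from the list above; actual layer names in the converted
# model are an assumption (e.g. "inception_3a/1x1").
FROZEN = ["conv1", "conv2", "inception_3a", "inception_3b", "inception_4a", "loss1"]
```

With Keras this would be called as `freeze_by_prefix(full_model.layers, FROZEN)` before compiling, since changes to `trainable` only take effect at compile time.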

Validation score (overfit again): SCORE= 0.0529023816348 accuracy= 0.908552631579
Confusion matrix:
[[291   0   4   1   1   0   0   0   4  24]
 [  1 298   0  19   0   0   4   0   0   0]
 [  0   0 315   0   2   0   0   0   1   0]
 [  0   1   0 317   0   0   1   0   1   0]
 [  0   0   1   2 313   0   0   0   0   0]
 [  0   0   0   0   0 321   0   0   0   0]
 [  0   0   1   0   0   0 318   1   0   0]
 [  0   0   0   0   0   0   0 256   0   0]
 [  0   0   7   0   0   0   1   1 243   2]
 [156   0   0   0   5   0   1   0   1 125]]
                                precision    recall  f1-score   support

              0 normal driving       0.65      0.90      0.75       325
             1 texting - right       1.00      0.93      0.96       322
2 talking on the phone - right       0.96      0.99      0.98       318
              3 texting - left       0.94      0.99      0.96       320
 4 talking on the phone - left       0.98      0.99      0.98       316
         5 operating the radio       1.00      1.00      1.00       321
                    6 drinking       0.98      0.99      0.99       320
             7 reaching behind       0.99      1.00      1.00       256
             8 hair and makeup       0.97      0.96      0.96       254
        9 talking to passenger       0.83      0.43      0.57       288

                   avg / total       0.93      0.92      0.92      3040
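
To see how the report follows from the confusion matrix, the per-class precision and recall can be recomputed directly. A small sketch, assuming rows are true classes and columns are predicted classes (the scikit-learn convention):

```python
import numpy as np

# Confusion matrix of the main classifier, copied from above.
cm = np.array([
    [291,   0,   4,   1,   1,   0,   0,   0,   4,  24],
    [  1, 298,   0,  19,   0,   0,   4,   0,   0,   0],
    [  0,   0, 315,   0,   2,   0,   0,   0,   1,   0],
    [  0,   1,   0, 317,   0,   0,   1,   0,   1,   0],
    [  0,   0,   1,   2, 313,   0,   0,   0,   0,   0],
    [  0,   0,   0,   0,   0, 321,   0,   0,   0,   0],
    [  0,   0,   1,   0,   0,   0, 318,   1,   0,   0],
    [  0,   0,   0,   0,   0,   0,   0, 256,   0,   0],
    [  0,   0,   7,   0,   0,   0,   1,   1, 243,   2],
    [156,   0,   0,   0,   5,   0,   1,   0,   1, 125],
])

def precision_recall(cm):
    tp = np.diag(cm).astype(float)
    precision = tp / cm.sum(axis=0)   # column sums: everything predicted as the class
    recall = tp / cm.sum(axis=1)      # row sums: everything that truly is the class
    return precision, recall

precision, recall = precision_recall(cm)
```

Rounded to two digits, `precision[0]`, `recall[0]` and `precision[9]`, `recall[9]` reproduce the 0.65/0.90 and 0.83/0.43 in the report: 156 of the 288 class-9 images end up predicted as class 0.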

This is the confusion matrix of the aux1 classifier (the intermediate one)
Validation SCORE= 0.0607746900158 accuracy= 0.899342105263
LB score: 0.768
Comparing the two confusion matrices, the differences are mainly in classes 0 and 9:
[[191   0  16   4   7  10   1   0  14  82]
 [  0 309   0   5   0   0   7   1   0   0]
 [  1   0 314   0   1   1   0   0   1   0]
 [  0   1   0 312   0   5   0   0   0   2]
 [  4   0   8   5 299   0   0   0   0   0]
 [  0   0   0   0   0 321   0   0   0   0]
 [  0   0   0   1   0   0 316   0   2   1]
 [  1   0   1   0   0   0   0 246   0   8]
 [  0   0   0   0   0   0   0   0 254   0]
 [ 57   0   0   0   5   2   3   0   2 219]]
                                precision    recall  f1-score   support

              0 normal driving       0.75      0.59      0.66       325
             1 texting - right       1.00      0.96      0.98       322
2 talking on the phone - right       0.93      0.99      0.96       318
              3 texting - left       0.95      0.97      0.96       320
 4 talking on the phone - left       0.96      0.95      0.95       316
         5 operating the radio       0.95      1.00      0.97       321
                    6 drinking       0.97      0.99      0.98       320
             7 reaching behind       1.00      0.96      0.98       256
             8 hair and makeup       0.93      1.00      0.96       254
        9 talking to passenger       0.70      0.76      0.73       288

                   avg / total       0.91      0.91      0.91      3040

What will happen if we average the 2 results?
Nothing fancy, just a simple average of all of them improves the score to 0.419 (!)
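
The averaging itself is a one-liner. A minimal sketch, assuming each classifier's predictions are stored as an (N, 10) array of class probabilities (the function name is made up for illustration):

```python
import numpy as np

def average_predictions(*prob_sets):
    """Each argument is an (N, C) array of class probabilities from one classifier."""
    avg = np.mean(np.stack(prob_sets, axis=0), axis=0)
    return avg / avg.sum(axis=1, keepdims=True)   # keep each row a valid distribution
```

Because the two heads make different mistakes on classes 0 and 9, averaging softens the confident-but-wrong predictions of each one, which is what log loss punishes most.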

Other results from open competitors:
An ensemble of VGG16 (0.27) + googlenet (0.38) together generates: 0.22
Adding small blocks from other images helped a bit more.

Looking at some results:

Running a few experiments, I constantly get bad results for some of the classes. This is the report:
                                precision    recall  f1-score   support

              0 normal driving       0.74      0.80      0.77       325
             1 texting - right       0.99      0.97      0.98       322
2 talking on the phone - right       0.95      0.91      0.93       318
              3 texting - left       0.79      0.99      0.88       320
 4 talking on the phone - left       0.98      0.94      0.96       316
         5 operating the radio       0.96      1.00      0.98       321
                    6 drinking       0.96      0.96      0.96       320
             7 reaching behind       0.99      1.00      0.99       256
             8 hair and makeup       0.87      0.92      0.89       254
        9 talking to passenger       0.89      0.59      0.71       288

                   avg / total       0.91      0.91      0.91      3040

The recall for "9 - talking to passenger" is extremely bad (0.59).
The precision for "3 - texting - left" is 0.79,
and both the precision and recall for "0 - normal driving" are bad: 0.74/0.80

Let's have a look at some photos, from these categories:
category 0:  
The driver usually has 2 hands on the wheel, with the head straight ahead or slightly tilted towards the camera. Bad classifications exist, usually when the driver looks hard to the right side (probably category "9").
category 0:  In total 2076 good. 94 bad-human-classification.  4.3% bad classification.

category 9: In total 1364 good. 477 bad-human-classification. 26% (!) bad classification. Mainly drivers looking forward (class "0").

category 3:
Good ground truth: rarely the right hand completely hides the phone (or at least 90% of it). Sometimes drivers look completely to the right (passenger side), but they still always have a phone.
So why do we have a texting-left precision of 0.79? Looking at the confusion matrix, we see there are 30 predictions of this class where the image was actually class 0.

Conclusions so far:
The remarkably bad ground-truth classification of category 9 was (according to a forum entry) due to classification by whole video section instead of individual frames.
This means that if, in a 30-second video, the driver did action 9 for 75% of the time and action 0 for 25% of the time, all 100% will be counted as action 9 in the ground truth.

In other words, there is no way anyone (even a human) can classify it correctly above 75%.
There are two options here:
1. Hack: Reconstruct the video from single frames, classify the whole section and mark accordingly.
2. Understand that class 0 can mean both 0 and 9, and artificially change the final weights accordingly to minimize the error rate
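
One possible reading of option 2, sketched under assumptions: whenever the model predicts class 0, move a fraction of that probability to class 9, so that a confident "0" on a frame labelled "9" costs less. The function name and `alpha` are hypothetical; 0.2 is a placeholder to be tuned on validation data, not a measured value.

```python
import numpy as np

def share_mass_between_classes(probs, src=0, dst=9, alpha=0.2):
    """Move a fraction `alpha` of the probability predicted for `src` to `dst`.

    probs: (N, C) array of class probabilities.
    """
    out = probs.astype(float)       # astype returns a copy, the input stays untouched
    moved = alpha * out[:, src]
    out[:, src] -= moved
    out[:, dst] += moved
    return out
```

Rows still sum to 1 after the adjustment, so the output can be submitted as-is. Whether this actually lowers the score depends on how many "9" labels really look like "0" in the test set.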

