Use a pre-trained GoogLeNet.
Step 1: Get a pre-trained GoogLeNet
Find one on the net, usually converted from a Caffe model. Make sure you understand how to pre-process the data by testing a few images.
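A rough sketch of the kind of pre-processing a Caffe-converted model tends to expect: BGR channel order, per-channel mean subtraction, and channels-first layout. The mean values and the layout below are assumptions, not taken from the model used here, so check them against the model you actually downloaded.

```python
import numpy as np

# Assumed per-channel BGR means; verify against the model you downloaded.
ASSUMED_BGR_MEAN = np.array([104.0, 117.0, 123.0], dtype=np.float32)

def preprocess_caffe_style(rgb_image):
    """Convert an HxWx3 RGB uint8 image to a 3xHxW mean-subtracted BGR float array."""
    img = rgb_image.astype(np.float32)
    img = img[:, :, ::-1]            # RGB -> BGR
    img = img - ASSUMED_BGR_MEAN     # subtract per-channel mean (broadcast over H, W)
    return img.transpose(2, 0, 1)    # HWC -> CHW (channels first, assumed)

# Sanity check on a synthetic pure-red RGB image.
red = np.zeros((224, 224, 3), dtype=np.uint8)
red[:, :, 0] = 255
out = preprocess_caffe_style(red)
print(out.shape, out[:, 0, 0])
```

Running a few real images through the original 1000-class model and checking that the top predictions look sensible is the quickest way to confirm the pre-processing is right.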
Step 2: Cut off its head classifier and replace it with yours. Then train your new classifier on the frozen body (the no-head part).
The GoogLeNet classifier has 1000 categories, and there are actually 3 heads on the body - a "hydra" :)
There is a main one and 2 aux classifiers in the middle. All need to be replaced and trained.
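To make "train a new head on a frozen body" concrete, here is a minimal NumPy sketch, not the Keras code used in this write-up. The features are synthetic stand-ins for the body's output, and the function names, learning rate and epoch count are hypothetical. Only the new softmax layer is updated; the features never change.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_head(features, labels, n_classes, lr=0.01, epochs=200):
    """Fit a softmax layer on fixed features; the body that produced them is never updated."""
    n, d = features.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[labels]
    losses = []
    for _ in range(epochs):
        probs = softmax(features @ W + b)
        losses.append(-np.log(probs[np.arange(n), labels] + 1e-12).mean())
        grad = (probs - onehot) / n          # gradient of cross-entropy w.r.t. logits
        W -= lr * features.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b, losses

# Synthetic "frozen body" outputs: 10 classes, 32-dimensional features.
rng = np.random.default_rng(0)
n_classes = 10
centers = rng.normal(size=(n_classes, 32)) * 3.0
labels = rng.integers(0, n_classes, size=500)
features = centers[labels] + rng.normal(size=(500, 32))

W, b, losses = train_head(features, labels, n_classes)
train_acc = (softmax(features @ W + b).argmax(axis=1) == labels).mean()
```

With three heads, the same procedure is repeated once per head, each on the features taken at the point where that head branches off.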
This is the graph of the main classifier being trained while the rest of the model is frozen. (In fact, I dumped the output of the 'merge' layer just before the classifier sections to a file; this saved a lot of time, but cost a few dozen GB of disk space.)
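The caching trick can be sketched as follows. `frozen_body` is a hypothetical stand-in for running images through the network up to the 'merge' layer, and the file name is made up; the point is only that the expensive forward pass runs once and later epochs read the result from disk.

```python
import os
import tempfile
import numpy as np

def frozen_body(images):
    """Hypothetical stand-in for the frozen network up to the 'merge' layer."""
    return images.reshape(len(images), -1).astype(np.float32)

def cached_features(images, cache_path):
    """Compute body outputs once; later calls load them from disk instead."""
    if os.path.exists(cache_path):
        return np.load(cache_path)
    feats = frozen_body(images)
    np.save(cache_path, feats)
    return feats

images = np.random.default_rng(0).random((8, 3, 4, 4))
cache_path = os.path.join(tempfile.mkdtemp(), "merge_features.npy")
first = cached_features(images, cache_path)    # computed and written to disk
second = cached_features(images, cache_path)   # loaded from the cache file
```

Note that this only works while the body is frozen and no augmentation is applied, since both would change the cached values.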
After epoch 12 (model_chapter6_12epoc) we see no further improvement. This is the output at epoch 32:
loss: 0.5412 - acc: 0.8648 - val_loss: 0.9205 - val_acc: 0.6908
In the same way, we train the other 2 aux classifiers. aux1 (the middle one):
loss: 0.1330 - acc: 0.9771 - val_loss: 1.2633 - val_acc: 0.8178
Saved model to disk model_chapter6_aux1_try2_11epoc
aux0 behaves surprisingly well (look at the weird behaviour of the validation: higher than the training in epoch 1, then going down...)
loss: 7.4296 - acc: 0.3291 - val_loss: 0.4846 - val_acc: 0.8513
Saved model to disk model_chapter6_aux0_1epoc
loss: 0.1624 - acc: 0.9737 - val_loss: 0.5066 - val_acc: 0.9082
Saved model to disk model_chapter6_aux0_8epoc
Even without fine-tuning, we might use aux0 for a validation loss of 0.5, or the other classifiers for a loss of 0.9/1.2.
Let's test this and submit the model with aux0
(used model_chapter6_aux0_25epoc)
Validation score= 0.1 accuracy= 0.92
LeaderBoard: 1.82
Conclusion: a classic case of overfitting to the validation set. So let's continue to step 3.
Step 3: connect the new heads and fine-tune the entire model.
We will do it by freezing most layers (the inception-blocks), except the last one/two blocks.
We will use the new classifier.
A note on this quite "low" number:
- We did not augment the data while training this.
- We used rmsprop, which is a "fast but less accurate" optimizer.
We did this because a later training step should work with augmentation and better optimization.
3.1 A bad-experiment example (everyone has bugs...)
Use only a partial graph (only aux0), a high learning rate of 0.001, and heavy augmentation (flip/zoom/shear). Can you see the problem here?
This should never happen, and it usually indicates a bug. The bug in this case was a broken random flip (always flip on training, never flip on validation).
result in: model_chapter6_aux0_finetune7epoc
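For contrast with the bug described above, a minimal sketch of a random flip that behaves as intended: roughly half the training images are mirrored and validation images are never touched. The function names and the list-of-rows image format are illustrative, not the augmentation code used in this experiment.

```python
import random

def hflip(image):
    """Mirror an image given as a list of pixel rows."""
    return [row[::-1] for row in image]

def augment(image, training, rng, flip_prob=0.5):
    """Flip with probability flip_prob during training; never alter validation images."""
    if training and rng.random() < flip_prob:
        return hflip(image)
    return image

rng = random.Random(0)
image = [[1, 2, 3], [4, 5, 6]]
train_out = [augment(image, True, rng) for _ in range(1000)]
val_out = [augment(image, False, rng) for _ in range(1000)]
flipped_fraction = sum(out != image for out in train_out) / len(train_out)
```

With "always flip" on training, the network never sees an image in the orientation used at validation time, which is why the two curves drift apart.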
3.2 Can we use only aux0 and a small subset of the GoogLeNet? (The answer is no...)
We again use only the partial graph with aux0, this time with a lower learning rate of 0.0001.
result sample in: model_chapter6_aux0_finetune_lr_1e40epoc
This proved to have bad results.
3.3 Let's fine-tune the whole model and look at the result of the end classifier.
Fine-tuned for 16 epochs using SGD with a learning rate of 0.003. This proved to be a great improvement: LB=0.51286
Lock the first layers: conv1, conv2, inception_3a/b, inception_4a, loss1.
Keep the others trainable: inception_4b/c/d/e, inception_5a/b and loss2/3.
Compile with loss_weights, giving more weight to the main classifier:
full_model.compile(loss='categorical_crossentropy',loss_weights=[0.2,0.2,0.6], optimizer=SGD(lr=0.003, momentum=0.9), [stopped in the middle]
Augmentation used (googlenet_augment): shift 0.05, rotation 8 degrees, zoom 0.1, shear 0.2.
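The freezing rule and the effect of loss_weights can be sketched in plain Python. The layer names in the list are hypothetical examples in a Caffe-like naming scheme; only the block prefixes and the 0.2/0.2/0.6 weights come from the setup above. In Keras, the result of `is_trainable` would be assigned to each layer's `trainable` attribute before compiling.

```python
# Prefixes of the blocks that stay frozen during fine-tuning.
FROZEN_PREFIXES = ("conv1", "conv2", "inception_3a", "inception_3b",
                   "inception_4a", "loss1")

def is_trainable(layer_name):
    """A layer is trainable unless its name starts with a frozen prefix."""
    return not layer_name.startswith(FROZEN_PREFIXES)

def weighted_total_loss(head_losses, loss_weights=(0.2, 0.2, 0.6)):
    """Total loss as the weighted sum of the per-head losses."""
    return sum(w * l for w, l in zip(loss_weights, head_losses))

# Hypothetical layer names, for illustration only.
layer_names = [
    "conv1/7x7_s2", "conv2/3x3", "inception_3a/1x1", "inception_4a/output",
    "loss1/classifier", "inception_4b/1x1", "inception_5b/output",
    "loss2/classifier", "loss3/classifier",
]
trainable = {name: is_trainable(name) for name in layer_names}

# Example per-head losses (aux, aux, main); the main head dominates the total.
total = weighted_total_loss([1.0, 2.0, 3.0])
```

Giving the main classifier 0.6 of the weight means gradients from the two aux heads still reach the shared blocks, but with less influence.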
Saved model after 16 epochs: model_chapter6_finetune_all_lr_1e4_binary
see: statefarm-chapter6-finetune-0.003-fix_aug.ipynb
Validation score (overfit again): SCORE= 0.0529023816348 accuracy= 0.908552631579
Confusion matrix:
[[291   0   4   1   1   0   0   0   4  24]
 [  1 298   0  19   0   0   4   0   0   0]
 [  0   0 315   0   2   0   0   0   1   0]
 [  0   1   0 317   0   0   1   0   1   0]
 [  0   0   1   2 313   0   0   0   0   0]
 [  0   0   0   0   0 321   0   0   0   0]
 [  0   0   1   0   0   0 318   1   0   0]
 [  0   0   0   0   0   0   0 256   0   0]
 [  0   0   7   0   0   0   1   1 243   2]
 [156   0   0   0   5   0   1   0   1 125]]
                                precision  recall  f1-score  support
0 normal driving                     0.65    0.90      0.75      325
1 texting - right                    1.00    0.93      0.96      322
2 talking on the phone - right       0.96    0.99      0.98      318
3 texting - left                     0.94    0.99      0.96      320
4 talking on the phone - left        0.98    0.99      0.98      316
5 operating the radio                1.00    1.00      1.00      321
6 drinking                           0.98    0.99      0.99      320
7 reaching behind                    0.99    1.00      1.00      256
8 hair and makeup                    0.97    0.96      0.96      254
9 talking to passenger               0.83    0.43      0.57      288
avg / total                          0.93    0.92      0.92     3040
This is the confusion matrix of the aux1 classifier (the intermediate one)
Validation SCORE= 0.0607746900158 accuracy= 0.899342105263
LB score: 0.768
Comparing the two confusion matrices, the main differences are in classes 0 and 9:
[[191   0  16   4   7  10   1   0  14  82]
 [  0 309   0   5   0   0   7   1   0   0]
 [  1   0 314   0   1   1   0   0   1   0]
 [  0   1   0 312   0   5   0   0   0   2]
 [  4   0   8   5 299   0   0   0   0   0]
 [  0   0   0   0   0 321   0   0   0   0]
 [  0   0   0   1   0   0 316   0   2   1]
 [  1   0   1   0   0   0   0 246   0   8]
 [  0   0   0   0   0   0   0   0 254   0]
 [ 57   0   0   0   5   2   3   0   2 219]]
                                precision  recall  f1-score  support
0 normal driving                     0.75    0.59      0.66      325
1 texting - right                    1.00    0.96      0.98      322
2 talking on the phone - right       0.93    0.99      0.96      318
3 texting - left                     0.95    0.97      0.96      320
4 talking on the phone - left        0.96    0.95      0.95      316
5 operating the radio                0.95    1.00      0.97      321
6 drinking                           0.97    0.99      0.98      320
7 reaching behind                    1.00    0.96      0.98      256
8 hair and makeup                    0.93    1.00      0.96      254
9 talking to passenger               0.70    0.76      0.73      288
avg / total                          0.91    0.91      0.91     3040
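For readers who want to check how the report relates to the matrix, here is a small sketch that recomputes per-class precision and recall from the confusion matrix directly above. It assumes rows are true classes and columns are predicted classes, which matches the row sums and the support column.

```python
def precision_recall(cm):
    """Per-class precision and recall from a square confusion matrix
    (rows = true class, columns = predicted class)."""
    n = len(cm)
    precision, recall = [], []
    for k in range(n):
        predicted_k = sum(cm[i][k] for i in range(n))   # column sum
        actual_k = sum(cm[k])                            # row sum
        precision.append(cm[k][k] / predicted_k if predicted_k else 0.0)
        recall.append(cm[k][k] / actual_k if actual_k else 0.0)
    return precision, recall

cm = [
    [191, 0, 16, 4, 7, 10, 1, 0, 14, 82],
    [0, 309, 0, 5, 0, 0, 7, 1, 0, 0],
    [1, 0, 314, 0, 1, 1, 0, 0, 1, 0],
    [0, 1, 0, 312, 0, 5, 0, 0, 0, 2],
    [4, 0, 8, 5, 299, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 321, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 316, 0, 2, 1],
    [1, 0, 1, 0, 0, 0, 0, 246, 0, 8],
    [0, 0, 0, 0, 0, 0, 0, 0, 254, 0],
    [57, 0, 0, 0, 5, 2, 3, 0, 2, 219],
]
precision, recall = precision_recall(cm)
for k in (0, 9):
    print(k, round(precision[k], 2), round(recall[k], 2))
```

The off-diagonal cells cm[0][9] = 82 and cm[9][0] = 57 account for most of the errors in both classes.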
What will happen if we average the 2 results?
Nothing fancy: a simple average of all improves the score to 0.419 (!)
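A toy illustration of why averaging can help, assuming the score is a multi-class log loss. The probabilities are invented so that each classifier is confidently wrong on a different sample; this is not the submission code, and averaging is not guaranteed to help in general.

```python
import math

def average_predictions(*prediction_sets):
    """Element-wise mean of several classifiers' probability rows."""
    n = len(prediction_sets)
    return [[sum(vals) / n for vals in zip(*rows)]
            for rows in zip(*prediction_sets)]

def log_loss(probs, labels, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    return -sum(math.log(max(p[y], eps)) for p, y in zip(probs, labels)) / len(labels)

labels = [0, 1]
main_probs = [[0.9, 0.1], [0.6, 0.4]]   # wrong on the second sample
aux_probs = [[0.4, 0.6], [0.1, 0.9]]    # wrong on the first sample
avg = average_predictions(main_probs, aux_probs)

loss_main = log_loss(main_probs, labels)
loss_aux = log_loss(aux_probs, labels)
loss_avg = log_loss(avg, labels)
```

Because log loss punishes confident mistakes heavily, pulling each wrong prediction toward the other classifier's answer lowers the average loss in this example.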
other results of open competitors
Ensemble of pretrained VGG16: 0.23
Ensemble of VGG16 (0.27) + GoogLeNet (0.38) together gives 0.22
Adding small blocks from other images helped a bit more.
Appendix
Looking at some results:
Running a few experiments, I consistently get bad results for some of the classes. This is the report:
                                precision  recall  f1-score  support
0 normal driving                     0.74    0.80      0.77      325
1 texting - right                    0.99    0.97      0.98      322
2 talking on the phone - right       0.95    0.91      0.93      318
3 texting - left                     0.79    0.99      0.88      320
4 talking on the phone - left        0.98    0.94      0.96      316
5 operating the radio                0.96    1.00      0.98      321
6 drinking                           0.96    0.96      0.96      320
7 reaching behind                    0.99    1.00      0.99      256
8 hair and makeup                    0.87    0.92      0.89      254
9 talking to passenger               0.89    0.59      0.71      288
avg / total                          0.91    0.91      0.91     3040
The recall for "9 - talking to passenger" is extremely bad (0.59).
The precision for "3 - texting - left" is 0.79,
and both the precision and recall for "0 - normal driving" are bad (0.74/0.80).
Let's have a look at some photos from these categories:
category 0:
The driver usually has 2 hands on the wheel, with the head facing straight ahead or slightly tilted towards the camera. Bad classifications exist, usually when the driver looks hard to the right side (probably category "9").
category 0: In total 2076 good, 94 bad human classifications: 4.3% bad classification.
category 9: In total 1364 good, 477 bad human classifications: 26% (!) bad classification. Mainly drivers looking forward (class "0").
category 3:
Good ground truth: rarely, the right hand completely hides the phone (or at least 90% of it). Sometimes drivers look completely to the right (passenger side), but they still always have a phone.
So why is the texting-left precision 0.79? Looking at the confusion matrix, we see there are 30 predictions where it was actually class 0.
Conclusions so far:
The remarkably bad ground-truth classification in category 9 was (according to a forum entry) due to labelling by whole video section instead of by individual frame.
This means that if, in a 30-second video, the driver performed action 9 for 75% of the time and action 0 for 25% of the time, all 100% is counted as action 9 in the ground truth.
In other words, there is no way anyone (even a human) can classify it correctly above 75%.
There are two options here:
1. Hack: Reconstruct the video from single frames, classify the whole section and mark accordingly.
2. Understand that class 0 can mean both 0 and 9, and artificially change the final weights accordingly to minimize the error rate.
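One possible reading of option 2, sketched under loud assumptions: treat the "final weights" as the predicted probabilities and pull the class 0 and class 9 values part of the way toward their mean, so that neither is predicted with extreme confidence. The function name and the alpha value are hypothetical; alpha would have to be tuned on held-out data, and this has not been tried in the experiments above.

```python
def soften_pair(probs, a=0, b=9, alpha=0.25):
    """Move the probabilities of classes a and b a fraction alpha toward their mean.
    The total probability mass is unchanged."""
    out = list(probs)
    mean = (probs[a] + probs[b]) / 2.0
    out[a] = (1 - alpha) * probs[a] + alpha * mean
    out[b] = (1 - alpha) * probs[b] + alpha * mean
    return out

# A prediction that is very confident in class 0.
probs = [0.90, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.02]
softened = soften_pair(probs)
print(softened[0], softened[9])
```

The idea is that when the label could legitimately be either 0 or 9, a less extreme split between the two costs little when the prediction is right and saves a lot when the ground truth disagrees.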