Monday, July 18, 2016

Statefarm - experiment 2

Use a pre-trained googlenet.
Step 1: get a pre-trained googlenet
Find one on the net, usually converted from a Caffe model. Make sure you understand how to pre-process the data by testing a few images.
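For example, a minimal preprocessing sketch - the exact input size, mean values and channel order are assumptions and must be verified against the specific converted model:

import numpy as np
from PIL import Image

def preprocess(path):
    # Assumed Caffe-style conventions: 224x224 input, BGR channel order,
    # ImageNet mean-pixel subtraction, channels-first ('th') ordering.
    img = Image.open(path).resize((224, 224))
    x = np.array(img, dtype=np.float32)      # HxWx3, RGB
    x = x[:, :, ::-1]                        # RGB -> BGR (Caffe convention)
    x -= np.array([104.0, 117.0, 123.0])     # BGR ImageNet mean pixel
    x = x.transpose(2, 0, 1)                 # HxWxC -> CxHxW
    return x[np.newaxis, ...]                # add batch dimension

# Sanity check: run a few known images through the net and confirm the
# top-1 ImageNet class makes sense before going further.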

Step 2: Cut off its head (the classifier) and replace it with your own. Then train your new classifier on the frozen body (the headless part).
The googlenet classifier has 1000 categories, and there are actually 3 heads on the body - a "hydra" :)
There is one main classifier and 2 aux classifiers in the middle. All need to be replaced and trained.
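A minimal sketch of the head replacement in Keras 1.x; the layer names ('pool5/7x7_s1', 'loss1/ave_pool', 'loss2/ave_pool') follow the Caffe prototxt and are assumptions about the converted model:

from keras.models import Model
from keras.layers import Dense, Dropout, Flatten

# Assumed: 'googlenet' is the loaded pre-trained model, and these are the
# tensors that fed the three original 1000-way classifiers.
body_aux0 = googlenet.get_layer('loss1/ave_pool').output  # aux0 head input
body_aux1 = googlenet.get_layer('loss2/ave_pool').output  # aux1 head input
body_main = googlenet.get_layer('pool5/7x7_s1').output    # main head input

def new_head(tensor, name):
    # A fresh 10-way softmax head for the 10 StateFarm classes.
    x = Flatten()(tensor)
    x = Dropout(0.5)(x)
    return Dense(10, activation='softmax', name=name)(x)

full_model = Model(input=googlenet.input,
                   output=[new_head(body_aux0, 'new_loss1'),
                           new_head(body_aux1, 'new_loss2'),
                           new_head(body_main, 'new_loss3')])

# Freeze the body so only the new heads train in this step.
for layer in googlenet.layers:
    layer.trainable = False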

This is the graph of the main classifier being trained while all the rest of the model is frozen (in fact, I dumped the output of the 'merge' layer just before the classifier sections to file; it saved a lot of time, but cost a few dozen GB of disk space - see the sketch below).
After epoch 12 (model_chapter6_12epoc), we don't see any improvement. This is the output at epoch 32:
loss: 0.5412 - acc: 0.8648 - val_loss: 0.9205 - val_acc: 0.6908
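A sketch of the feature-dump trick, assuming the merge layer feeding the main classifier is named 'inception_5b/output' and that X_train/y_train and a standalone head_model already exist:

import numpy as np
from keras.models import Model

# Truncate the network at the merge layer just before the classifier.
body = Model(input=googlenet.input,
             output=googlenet.get_layer('inception_5b/output').output)

# One-time dump: a few dozen GB on disk, but every later epoch only needs
# to run the small head, not the whole frozen body.
features = body.predict(X_train, batch_size=32)
np.save('train_merge_features.npy', features)

# Later: train the new head directly on the cached activations.
features = np.load('train_merge_features.npy')
head_model.fit(features, y_train, nb_epoch=12, batch_size=32)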

In the same way, we will train the other 2 aux classifiers. aux1 (the middle one):

  loss: 0.1330 - acc: 0.9771 - val_loss: 1.2633 - val_acc: 0.8178
Saved model to disk model_chapter6_aux1_try2_11epoc

aux0 behaves surprisingly well (look at the weird behaviour of the validation - higher than the training in epoch 1, then going down...):

loss: 7.4296 - acc: 0.3291 - val_loss: 0.4846 - val_acc: 0.8513
Saved model to disk model_chapter6_aux0_1epoc

loss: 0.1624 - acc: 0.9737 - val_loss: 0.5066 - val_acc: 0.9082
Saved model to disk model_chapter6_aux0_8epoc

Even without fine-tuning, we could use aux0 for a validation loss of 0.5, or the other classifiers for a 0.9/1.2 loss.
Let's test this and submit the model with aux0
(used model_chapter6_aux0_25epoc)
Validation score= 0.1 accuracy= 0.92
LeaderBoard: 1.82
Conclusion: a classic case of overfitting to the validation set. So let's continue to step 3.

Step 3: connect the new heads and fine-tune the entire model.
We will do it by freezing most layers (the inception blocks), except the last one or two blocks.
We will use the new classifier.
A note on this quite "low" number:
  • We did not augment the data while training this.
  • We used rmsprop, which is "fast but less accurate".
We did this because we will have a later training step, which should work with augmentation and a better optimizer.



3.1 A bad-experiment example (everyone has bugs...)
Using only a partial graph (only the aux0), a high learning rate of 0.001, and heavy augmentation (flip/zoom/shear). Can you see the problem here?


This should never happen, and is usually a bug. The bug in this case was a bad random flip (training always flipped; validation never flipped).
Result in: model_chapter6_aux0_finetune7epoc
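A minimal sketch of the buggy flip versus the fixed per-sample random flip (channels-first CxHxW arrays assumed):

import numpy as np

# Buggy version: a deterministic flip, so the training distribution is
# systematically mirrored while validation never is.
def augment_buggy(x, training):
    if training:
        return x[:, :, ::-1]   # ALWAYS flips horizontally
    return x

# Fixed version: flip with probability 0.5, and only during training.
def augment_fixed(x, training):
    if training and np.random.rand() < 0.5:
        return x[:, :, ::-1]
    return x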

3.2 Can we use only aux0 and a small subset of the googlenet? (the answer is no...)
We again used only a partial graph with aux0, this time with a slower learning rate of 0.0001.
Result sample in: model_chapter6_aux0_finetune_lr_1e40epoc. This proved to have bad results.

3.3 Let's fine-tune the whole model and look at the result of the end classifier. Fine-tuned for 16 epochs using SGD 0.003. This proved to be a great improvement, LB=0.51286.
  • Lock the first layers: conv1, conv2, inception_3a/b, inception_4a, loss1.
  • Keep the others trainable: inception_4b/c/d/e, inception_5a/b and loss2/3.
  • Compile with loss_weights, giving extra weight to the main classifier: full_model.compile(loss='categorical_crossentropy', loss_weights=[0.2,0.2,0.6], optimizer=SGD(lr=0.003, momentum=0.9), [stopped in the middle]
  • Augmentation used: googlenet_augment - shift 0.05, rotation 8 degrees, zoom 0.1, shear 0.2.
Saved model after 16 epochs: model_chapter6_finetune_all_lr_1e4_binary
See: statefarm-chapter6-finetune-0.003-fix_aug.ipynb
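A sketch of this setup in Keras 1.x, assuming full_model is the three-headed model from step 2 (outputs ordered aux0, aux1, main) and that the layer names follow the Caffe prefixes listed above:

from keras.optimizers import SGD
from keras.preprocessing.image import ImageDataGenerator

# Freeze everything up to and including inception_4a, plus the aux0 branch.
frozen_prefixes = ('conv1', 'conv2', 'inception_3a', 'inception_3b',
                   'inception_4a', 'loss1')
for layer in full_model.layers:
    layer.trainable = not layer.name.startswith(frozen_prefixes)

# Three outputs -> three losses; weight the main classifier highest.
full_model.compile(loss='categorical_crossentropy',
                   loss_weights=[0.2, 0.2, 0.6],
                   optimizer=SGD(lr=0.003, momentum=0.9),
                   metrics=['accuracy'])

# The augmentation described above (googlenet_augment).
augment = ImageDataGenerator(width_shift_range=0.05,
                             height_shift_range=0.05,
                             rotation_range=8,
                             zoom_range=0.1,
                             shear_range=0.2)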

Validation score (overfit again): SCORE= 0.0529023816348, accuracy= 0.908552631579. Confusion matrix:
[[291   0   4   1   1   0   0   0   4  24]
 [  1 298   0  19   0   0   4   0   0   0]
 [  0   0 315   0   2   0   0   0   1   0]
 [  0   1   0 317   0   0   1   0   1   0]
 [  0   0   1   2 313   0   0   0   0   0]
 [  0   0   0   0   0 321   0   0   0   0]
 [  0   0   1   0   0   0 318   1   0   0]
 [  0   0   0   0   0   0   0 256   0   0]
 [  0   0   7   0   0   0   1   1 243   2]
 [156   0   0   0   5   0   1   0   1 125]]
                                precision    recall  f1-score   support

              0 normal driving       0.65      0.90      0.75       325
             1 texting - right       1.00      0.93      0.96       322
2 talking on the phone - right       0.96      0.99      0.98       318
              3 texting - left       0.94      0.99      0.96       320
 4 talking on the phone - left       0.98      0.99      0.98       316
         5 operating the radio       1.00      1.00      1.00       321
                    6 drinking       0.98      0.99      0.99       320
             7 reaching behind       0.99      1.00      1.00       256
             8 hair and makeup       0.97      0.96      0.96       254
        9 talking to passenger       0.83      0.43      0.57       288

                   avg / total       0.93      0.92      0.92      3040

This is the confusion matrix of the aux1 classifier (the intermediate one).
Validation SCORE= 0.0607746900158 accuracy= 0.899342105263
LB score: 0.768
Comparing the two confusion matrices, look at classes 0 and 9:
[[191   0  16   4   7  10   1   0  14  82]
 [  0 309   0   5   0   0   7   1   0   0]
 [  1   0 314   0   1   1   0   0   1   0]
 [  0   1   0 312   0   5   0   0   0   2]
 [  4   0   8   5 299   0   0   0   0   0]
 [  0   0   0   0   0 321   0   0   0   0]
 [  0   0   0   1   0   0 316   0   2   1]
 [  1   0   1   0   0   0   0 246   0   8]
 [  0   0   0   0   0   0   0   0 254   0]
 [ 57   0   0   0   5   2   3   0   2 219]]
                                precision    recall  f1-score   support

              0 normal driving       0.75      0.59      0.66       325
             1 texting - right       1.00      0.96      0.98       322
2 talking on the phone - right       0.93      0.99      0.96       318
              3 texting - left       0.95      0.97      0.96       320
 4 talking on the phone - left       0.96      0.95      0.95       316
         5 operating the radio       0.95      1.00      0.97       321
                    6 drinking       0.97      0.99      0.98       320
             7 reaching behind       1.00      0.96      0.98       256
             8 hair and makeup       0.93      1.00      0.96       254
        9 talking to passenger       0.70      0.76      0.73       288

                   avg / total       0.91      0.91      0.91      3040


What will happen if we average the 2 results?
Nothing fancy - a simple average of all predictions improves the LB score to 0.419 (!)
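The averaging itself is one line; a sketch, assuming preds_main and preds_aux1 are the two classifiers' softmax outputs over the test images:

import numpy as np

# Simple unweighted mean of the two per-image probability vectors.
preds_avg = (preds_main + preds_aux1) / 2.0
# Defensive re-normalization so each row still sums to 1.
preds_avg /= preds_avg.sum(axis=1, keepdims=True)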



Other results from open competitors:
An ensemble of VGG16 (0.27) + googlenet (0.38) together generates 0.22.
Adding small blocks from other images helped a bit more.


Appendix
Looking at some results:

Running a few experiments, I consistently get bad results for some of the classes. This is the report:
                                precision    recall  f1-score   support
              0 normal driving       0.74      0.80      0.77       325
             1 texting - right       0.99      0.97      0.98       322
2 talking on the phone - right       0.95      0.91      0.93       318
              3 texting - left       0.79      0.99      0.88       320
 4 talking on the phone - left       0.98      0.94      0.96       316
         5 operating the radio       0.96      1.00      0.98       321
                    6 drinking       0.96      0.96      0.96       320
             7 reaching behind       0.99      1.00      0.99       256
             8 hair and makeup       0.87      0.92      0.89       254
        9 talking to passenger       0.89      0.59      0.71       288

                   avg / total       0.91      0.91      0.91      3040



The recall for "9 - talking to passenger" is extremely bad (0.59).
The precision for "3 - texting left" is only 0.79,
and both the precision and recall for "0 - normal driving" are bad: 0.74/0.80.

Let's have a look at some photos from these categories:
Category 0:
The driver usually has 2 hands on the wheel, with the head straight ahead or slightly tilted towards the camera. Bad classification exists, usually when the driver looks hard to the right side (probably category "9").
Category 0: in total 2076 good, 94 bad human classifications: 4.3% bad classification.

Category 9: in total 1364 good, 477 bad human classifications: 26% (!) bad classification. Mainly drivers looking forward (class "0").

Category 3:
Good ground truth: rarely, the right hand completely shadows the phone (or at least 90% of it). Sometimes users look completely to the right (passenger side), but still, there is always a phone.
So why do we have a texting-left precision of 0.79? Looking at the confusion matrix, we see there are 30 predictions where it was actually class 0.


Conclusions so far:
The remarkably bad ground-truth classification on category 9 was (according to a forum entry) due to classification by whole video section instead of individual frames.
This means that if, in a 30-second video, the user spent 75% of the time on action 9 and 25% of the time on action 0, all 100% will be counted as action 9 in the ground truth.

In other words, there is no way (even for a human) to classify it correctly above 75%.
There are two options here:
1. Hack: reconstruct the video from single frames, classify the whole section and mark accordingly.
2. Understand that class 0 can mean both 0 and 9, and artificially change the final weights accordingly to minimize the error rate (see the sketch below).
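A sketch of option 2, assuming probs is the N x 10 matrix of softmax outputs and alpha is a hypothetical blend factor to be tuned on validation:

import numpy as np

def redistribute_0_9(probs, alpha=0.25):
    # Since ground-truth "9" clips contain a sizeable share of true "0"
    # frames, move some predicted class-0 mass onto class 9 before
    # computing the log-loss.
    out = probs.copy()
    moved = alpha * out[:, 0]
    out[:, 0] -= moved
    out[:, 9] += moved
    return out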








Wednesday, July 13, 2016

Confusion Matrix

Let's look at one experiment's confusion matrix:

[[259   0   7  30   1   8   0   0   7  13]
 [  1 311   0   8   0   0   2   0   0   0]
 [  3   0 288   5   1   0   8   0  13   0]
 [  0   0   0 318   2   0   0   0   0   0]
 [  6   0   0  10 297   2   0   0   1   0]
 [  1   0   0   0   0 320   0   0   0   0]
 [  0   0   0   1   0   0 308   0  11   0]
 [  0   0   0   0   0   0   1 255   0   0]
 [  0   0   7   1   0   0   1   3 234   8]
 [ 78   2   1  29   1   4   1   0   3 169]]

The X axis is the prediction, the Y axis is the true label (the true label of the entire first row is 0).
Let's have a look at row 0.
259 in [0,0] means true-positive results with a correct match.
30 in [0,3] means the truth is 0, but we predicted 3.
0 in [0,1] means we never (wrongly) predicted that a 0 is a 1.
In total there are 259 correct predictions and 7+30+1+8+7+13 = 66 wrong predictions.
259/325 = 0.80. This is the hit rate, or the recall.
Let's look at column 0.
78 in [9,0] means we predicted 0 although it is actually 9. This is the biggest mistake in a single cell.
If we sum the whole column, we see a total of 1+3+6+1+78 = 89 false-positive predictions. In total we are correct in 259/(259+89) = 0.74 of our predictions; this is the precision.

To iterate on recall and precision: what will happen if we change the algorithm to a dumb "always return 0" algorithm? Column 0 will be filled with values; all other columns will be empty.
We will get 325 in [0,0] (all true) and the rest of the diagonal all zeros.
The recall will be a full 1.00 for category 0 - we always recall this one correctly. For the rest it will be 0.00.
The precision will be very bad: 325/3040 = ~10%.
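The same arithmetic in numpy, with the row and column of class 0 from the matrix above (true labels on rows, predictions on columns):

import numpy as np

def recall_precision(cm, k):
    """Per-class recall and precision from a confusion matrix where
    cm[i, j] counts samples with true label i predicted as j."""
    tp = cm[k, k]
    recall = tp / float(cm[k, :].sum())     # row sum = all truly-k samples
    precision = tp / float(cm[:, k].sum())  # column sum = all predicted-k
    return recall, precision

# Checking class 0 with the numbers from the matrix above:
cm = np.zeros((10, 10), dtype=int)
cm[0] = [259, 0, 7, 30, 1, 8, 0, 0, 7, 13]      # row 0 (true label 0)
cm[:, 0] = [259, 1, 3, 0, 6, 1, 0, 0, 0, 78]    # column 0 (predicted 0)
print(recall_precision(cm, 0))  # -> (0.796..., 0.744...)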


                                precision    recall  f1-score   support
              0 normal driving       0.74      0.80      0.77       325
             1 texting - right       0.99      0.97      0.98       322
2 talking on the phone - right       0.95      0.91      0.93       318
              3 texting - left       0.79      0.99      0.88       320
 4 talking on the phone - left       0.98      0.94      0.96       316
         5 operating the radio       0.96      1.00      0.98       321
                    6 drinking       0.96      0.96      0.96       320
             7 reaching behind       0.99      1.00      0.99       256
             8 hair and makeup       0.87      0.92      0.89       254
        9 talking to passenger       0.89      0.59      0.71       288

                   avg / total       0.91      0.91      0.91      3040


Let's go back to the classification report.
About "0 - normal driving" we talked already.
We can see that "1 - texting right" has good recall (0.97) and also good precision (0.99).
"3 - texting left" has 0.99 recall but only 0.79 precision (it fires too strongly), which means there are many false positives. Looking at column 3 of the confusion matrix: 30 predictions were actually 0-normal-driving and 29 were actually 9-talking-to-passenger.


True Positive (TP): eqv. with hit.
False Positive (FP): eqv. with false alarm, Type I error.


Sensitivity or true positive rate (TPR): eqv. with hit rate, recall. TPR = TP / (TP + FN).

Precision or positive predictive value (PPV): PPV = TP / (TP + FP).

F1 score: the harmonic mean of precision and sensitivity, F1 = 2 * (precision * recall) / (precision + recall).
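These tables look like scikit-learn output; a minimal sketch of producing both the matrix and the report, assuming y_true and y_pred are integer class ids for the validation images:

from sklearn.metrics import confusion_matrix, classification_report

labels = ['0 normal driving', '1 texting - right',
          '2 talking on the phone - right', '3 texting - left',
          '4 talking on the phone - left', '5 operating the radio',
          '6 drinking', '7 reaching behind', '8 hair and makeup',
          '9 talking to passenger']

# y_true / y_pred: one integer class id per validation image (assumed).
print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred, target_names=labels))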






Tuesday, July 12, 2016

StateFarm experiment 1

Let's start with a simple and quick-to-run model.

150x150 input -> Conv(32,3,3) -> Conv(32,3,3) -> Conv(64,3,3) -> Dense(3x200) -> Dropout(0.5) -> Dense(10)
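A sketch of this model in Keras 1.x; the pooling layers and activations are assumptions, since the line above only lists the convolution and dense shapes:

from keras.models import Sequential
from keras.layers import Convolution2D, MaxPooling2D, Flatten, Dense, Dropout

model = Sequential([
    Convolution2D(32, 3, 3, activation='relu', input_shape=(3, 150, 150)),
    MaxPooling2D((2, 2)),
    Convolution2D(32, 3, 3, activation='relu'),
    MaxPooling2D((2, 2)),
    Convolution2D(64, 3, 3, activation='relu'),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(200, activation='relu'),   # "3x200": three dense layers of 200
    Dense(200, activation='relu'),
    Dense(200, activation='relu'),
    Dropout(0.5),
    Dense(10, activation='softmax'),
])
model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'])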


Each epoch is: train on 5*1024 samples, validate on 1*1024 samples, batch size 32.

model_chapter3
epoch 0 699s - loss: 18.1189 - acc: 0.2369 - val_loss: 2.1892 - val_acc: 0.3574
epoch 1 765s - loss: 7.5443 - acc: 0.4570 - val_loss: 1.5257 - val_acc: 0.4697
epoch 2 689s - loss: 3.5896 - acc: 0.6488 - val_loss: 1.9699 - val_acc: 0.3590
epoch 3 697s - loss: 1.8959 - acc: 0.7616 - val_loss: 1.8912 - val_acc: 0.3887
epoch 4 707s - loss: 1.2178 - acc: 0.7992 - val_loss: 1.5978 - val_acc: 0.4756
epoch 5 710s - loss: 0.9396 - acc: 0.8277 - val_loss: 1.6677 - val_acc: 0.5829
epoch 6 702s - loss: 0.8008 - acc: 0.8520 - val_loss: 1.9146 - val_acc: 0.5781
epoch 7 707s - loss: 0.6810 - acc: 0.8798 - val_loss: 1.3611 - val_acc: 0.5752
epoch 8 707s - loss: 0.6647 - acc: 0.8748 - val_loss: 1.8251 - val_acc: 0.5314
epoch 9 706s - loss: 0.6234 - acc: 0.8936 - val_loss: 1.5517 - val_acc: 0.5908
epoch 10 709s - loss: 0.5812 - acc: 0.9054 - val_loss: 1.8407 - val_acc: 0.5225

Usually we would plot the loss, but here I plot the accuracy graph (training converges to 95% while validation does not pass 58%).


Continuing until epoch 30 reduces the training loss a bit more (and improves the training accuracy), but the validation does not improve.
epoch 30 - loss: 0.3641 - acc: 0.9482 - val_loss: 1.3679 - val_acc: 0.6270



Notes on this run:
After epoch 5 (in this case an epoch is a sample of 1/4 of the images), we start to overfit. Further epochs do not help (validation stays the same while the training loss becomes extremely small).

There could be two main reasons:
1. The model is too strong and not regularized enough - not the case here... it's small, with heavy regularization and dropout.
2. The model is too strong compared to the data. I think this is the case.

The data
The number of training images is small (20k); furthermore, they are taken from ~20 videos of 20 actors, cut into frames, while the test set is from different videos of different actors.
20 actors are not enough to generalize to all the people in the world.

What can be done?
  • More data is the obvious solution, but there is none.
  • Pretrained models are allowed in the competition if they are public and can be used commercially. Great improvements were achieved using VGG-16 (10 times better), which can't be used commercially. What does the pretrained network give us?
    • Better visual filters in the lower layers.
    • Cellphone detection in the higher layers.
    • Probably good human detection, but it is not clear if it gives good hand localization.
  • Or use a cascade of 2 pretrained models creating features, combine them into an image/new channel and provide this to a small model.
    • A good one for humans exists, but runs in 17s x 20,000 images = 340K seconds / 86,400 = 3.93 days.

Further experiments with similar architectures



experiment 3
Dense 3x200, l2(0.01), BN on all layers except the 1st dense, Adam optimizer.
711s - loss: 0.4788 - acc: 0.9227 - val_loss: 2.0435 - val_acc: 0.5019
Saved model to disk model_chapter3_17epoc
#Validation : SCORE of model_chapter3_17epoc 0.290623311932 accuracy 0.434080421885
#  Leader-board score = 1.64778



experiment 4
Ran with: dense 200-100-50, full BN, pre-ReLU, SGD(lr=0.001, decay=1e-7, momentum=.9) optimizer.


experiment 5
Ran with: dense 256-124-64, BN on all but the 1st dense, regular ReLU, Adam optimizer.

5120/5120 - 1012s - loss: 0.4410 - acc: 0.9084 - val_loss: 1.0536 - val_acc: 0.6631
Saved model to disk model_chapter5_18epoc