Monday, February 13, 2017

LightGBM - accurate and fast tree boosting

Install


1. Open the LightGBM GitHub page and follow the instructions. For Windows, you will need to compile with Visual Studio (download + install can be done in under an hour)
2. Use the "pylightgbm" Python package binding to run the code below

Code example

import os
from pylightgbm.models import GBMRegressor

os.environ['LIGHTGBM_EXEC'] = "c:/.../LightGBM/windows/x64/Release/lightgbm"

model = GBMRegressor(
    num_threads=-1,
    learning_rate=0.03,
    num_iterations=5000,
    verbose=False,
    early_stopping_round=50,
    feature_fraction=0.8,
    bagging_fraction=0.8,
)

model.fit(
    train_local[features].values,
    train_local['loss'].values,
    test_data=[(
        validation[features].values,
        validation['loss'].values
    )]
)


Sunday, November 6, 2016

Generative Adversarial Networks

No one can explain it better than OpenAI.

There are a few important aspects:

Loss Function (and discriminator)

A classic DCGAN has one discriminator: IN: image, OUT: Real/Fake.
The generator is simple too: IN: random vector (noise), OUT: image.

pix2pix: Let's take coloring a greyscale image as an example.
Discriminator: IN: a pair of images (grey + color), OUT: Real (matching pair) or Fake (non-matching). A real pair is the grey and color versions of the same image; a fake pair is the grey image plus the generator's synthetic colorization.
Generator: IN: image, OUT: image.
* They also added an L1 similarity term between the generator output and the target image, weighted by a lambda (the main objective is still to fool the discriminator).
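The combined generator objective described above can be sketched as follows (a minimal illustration, not the paper's actual code; `generator_loss` and `lam` are hypothetical names, and `lam` is a tunable weight):

```python
import numpy as np

# adversarial term (fool the discriminator) plus a lambda-weighted L1 term
def generator_loss(d_fake_prob, gen_img, target_img, lam):
    adv = -np.log(d_fake_prob + 1e-8)         # generator wants D(fake) -> 1
    l1 = np.abs(gen_img - target_img).mean()  # pixel-wise L1 similarity
    return adv + lam * l1
```

A perfect output (discriminator fully fooled, exact pixel match) drives both terms to zero; either a fooled-but-wrong or accurate-but-detected output pays a penalty.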

pixel level domain transfer: Let's take the example of a man wearing a sweater, and the sweater alone.
Generator: IN: image of a fashion model, OUT: image of the sweater
Real/Fake Discriminator: IN: sweater image, OUT: real/fake
Domain Discriminator: IN: two images, the sweater and the fashion model, OUT: match/no-match

Network Architectures

As with all CNNs, the network size, depth and structure are important for the quality of the output.
pix2pix uses U-Net based generators (an encoder-decoder with skip-connections), originally used for segmentation, which works great here.
Discriminators are patch-based (PatchGAN).

Original articles and code links


Generative adversarial networks have been vigorously explored in the last two years, and many conditional variants have been proposed. Please see the discussion of related work in our paper. Below we point out two papers that especially influenced this work: the original GAN paper from Goodfellow et al., and the DCGAN framework, from which our code is derived. 

2014: Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative Adversarial Networks NIPS, 2014. [PDF]
Code (Theano)

2015: DCGAN (Alec Radford)
Paper: Alec Radford, Luke Metz, Soumith Chintala. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. ICLR, 2016. [PDF]
Code:
Theano
Torch (Soumith)
Keras (170 lines)
From the Keras code, each batch runs:
1. generator: Input: noise, Output: image. Predict a batch (on the first epoch this is pure noise)
2. discriminator: Input: image, Output: boolean (real/fake). Trained on X = batch_size real (MNIST) images + batch_size images generated in the previous step; Y = [1..1, 0..0]
3. discriminator_on_generator: the generator followed by the discriminator (trainable=False). Input: noise, Output: real/fake. X = fresh random noise, Y = [1..1]. Training pushes the output toward 1; since the discriminator is frozen and cannot simply learn to always answer 1, the generator must improve.
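The batch assembly in steps 2-3 can be sketched with stand-in arrays (a minimal illustration; in the real code `fake_images` comes from the Keras generator model, not random numbers):

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size = 4

# stand-ins: flattened 28x28 "MNIST" images and generator output
real_images = rng.random((batch_size, 28 * 28))
fake_images = rng.random((batch_size, 28 * 28))  # generator(noise) in the real code

# step 2: discriminator batch, X = real + generated, Y = [1..1, 0..0]
X = np.concatenate([real_images, fake_images])
y = np.concatenate([np.ones(batch_size), np.zeros(batch_size)])

# step 3: train the generator through the frozen discriminator, all-ones targets
noise = rng.random((batch_size, 100))
y_gen = np.ones(batch_size)
```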

2016: pix2pix (applied, based on DCGAN)
Article: Image-to-image translation using conditional adversarial nets (including many different image pair types: night/day, color/greyscale, aerial photo/road map)
Code: original Torch, TensorFlow

2016: Improved Techniques for Training GANs (Goodfellow et al.)
Code: TensorFlow (original)

2016: pixel-level domain transfer (applied, based on DCGAN)
Code: Torch (original)




Recommendation systems


The 3 types
  • Item-Item Content Filtering: “If you liked this item, you might also like …”
  • Item-Item Collaborative Filtering: “Customers who liked this item also liked …”
  • User-Item Collaborative Filtering: “Customers who are similar to you also liked …”
There are other trivial, domain-dependent types (like most-popular / trending-today). These use global data (or global data per category) and do not distinguish between users.



Item-Item Content Filtering: “If you liked this item, you might also like …”
How: Find similarity between items. Finding similarity is an art in itself, but essentially it's a two-step procedure: cut into (good) features, then calculate distance using those features. Lately deep learning does both together.
Example: Pandora processes a song into its features (rock/classic, fast/slow) and finds similar songs.
For books, you can look at book genre + author name.
For fashion photos, you can (though it's hard) use deep learning to find similar dress/shoe styles.
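A minimal sketch of content-based similarity, using hypothetical hand-crafted feature vectors and cosine similarity (the feature names and values are made up):

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# hypothetical song features: [rock-ness, tempo, acousticness]
song_a = np.array([0.9, 0.8, 0.1])
song_b = np.array([0.8, 0.9, 0.2])  # another fast rock song
song_c = np.array([0.1, 0.2, 0.9])  # a slow acoustic piece

# song_a scores closer to song_b than to song_c
```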


Item-Item Collaborative Filtering: “Customers who liked this item also liked …”
Example: Amazon (at least in its first years)
How: Use a lot of user data. Good for companies with big user data, not for newcomers.
Market-basket based: If you have purchase history, just count the most common items purchased together (for cat-food: cat-toy = 50%, cat-litter = 40%, dog-food = 0.3%, etc). Drawbacks of this approach: it uses the items of one purchase rather than the user's whole history, and we know the user bought the item but not whether they liked it or returned it.
Rating based: If you also have enough user ratings (stars), you can look at how users rated items. For example: users giving Harry Potter 1 five stars also gave Harry Potter 2 five stars on average, but gave Dan Brown one star.
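The market-basket counting can be sketched as follows (the baskets are made up, so the rates differ from the illustrative percentages above):

```python
from collections import Counter

# made-up purchase histories, one basket per checkout
baskets = [
    {"cat-food", "cat-toy", "cat-litter"},
    {"cat-food", "cat-toy"},
    {"cat-food", "cat-litter"},
    {"dog-food", "dog-leash"},
]

# count how often each item appears alongside "cat-food"
co_counts = Counter()
for basket in baskets:
    if "cat-food" in basket:
        co_counts.update(basket - {"cat-food"})

n_cat_food = sum("cat-food" in b for b in baskets)
co_rates = {item: c / n_cat_food for item, c in co_counts.items()}
```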

User-Item Collaborative Filtering: “Customers who are similar to you also liked …”
Example: Netflix challenge
Find similarity between users according to the ratings they gave to the same movies, but do that after normalization (some users love everything, with a mean score of 4 out of 5; some are "haters" with a mean of 2 out of 5, so the first user's 3 is like the hater's 1).
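The normalization idea can be sketched with mean-centering (made-up ratings):

```python
import numpy as np

# made-up ratings: an "everything-lover" with mean 4 and a "hater" with mean 2
lover = np.array([5, 4, 3, 4])
hater = np.array([3, 2, 1, 2])

# subtract each user's mean so scores become comparable across users
lover_centered = lover - lover.mean()
hater_centered = hater - hater.mean()
# the lover's 3 and the hater's 1 now both map to -1
```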

Implementation using NN:
For large data-sets this works best, and it can be implemented as a rather small Keras NN.
In a few words: find latent vectors for users and movies.
Linear: the rating should be the dot product of a user latent vector and a movie latent vector. Optimize to minimize the prediction differences.
Better, non-linear: the rating is the output of a shallow NN on top of the two latent vectors; again optimize to minimize the differences.
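The linear variant can be sketched in plain numpy, learning the latent vectors by gradient descent on the squared rating error (synthetic data for illustration; a real system would use a Keras model and only the observed ratings):

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_movies, k = 20, 15, 4

# synthetic "true" latent factors generate the observed rating matrix
true_u = rng.normal(size=(n_users, k))
true_m = rng.normal(size=(n_movies, k))
ratings = true_u @ true_m.T

# learn latent vectors so that rating ~= dot(user_vec, movie_vec)
U = rng.normal(scale=0.1, size=(n_users, k))
M = rng.normal(scale=0.1, size=(n_movies, k))
lr = 0.01
for _ in range(3000):
    err = U @ M.T - ratings   # prediction differences
    U -= lr * (err @ M)       # gradient of 0.5 * ||err||^2 w.r.t. U
    M -= lr * (err.T @ U)     # ...and w.r.t. M

final_mse = float(((U @ M.T - ratings) ** 2).mean())
```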





Appendix



Thursday, August 11, 2016

statefarm - retrospect

Reviewing the best team competition results.



1. How to train each single-model 

1.1 Synthesize/augment to generate a huge amount of data
  • Synthesize 5M new images by combining the left and right (almost) halves of images from the same class. This worked so well it was able to train GoogLeNet V3 from scratch to 0.15. ref
  • Synthesize images by combining images from the test set (possible in this competition, since they all came from the same video)
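The left/right-half splicing can be sketched as follows (random arrays stand in for two same-class images):

```python
import numpy as np

rng = np.random.default_rng(0)

# random arrays stand in for two images of the SAME class (H x W x C)
img_a = rng.random((224, 224, 3))
img_b = rng.random((224, 224, 3))

# splice the left (almost-)half of one onto the right part of the other
split = 224 // 2
synthetic = np.concatenate([img_a[:, :split], img_b[:, split:]], axis=1)
```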


1.2 If that's not possible, use a pre-trained model. The stronger, the better:
ResNet-152 > VGG-19 > VGG-16 > GoogLeNet

1.3 Use semi-supervised learning
"dark knowledge" - let an ensemble predict on the test set and take the most confident predictions as pseudo-labels. 6-12K images; don't use too many.
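The confident-selection step can be sketched as follows (random probabilities stand in for the ensemble's test-set predictions):

```python
import numpy as np

rng = np.random.default_rng(0)
n_test, n_classes = 1000, 10

# stand-in for averaged ensemble probabilities on the (unlabeled) test set
probs = rng.dirichlet(np.ones(n_classes), size=n_test)

# keep only the most confident predictions as pseudo-labels
confidence = probs.max(axis=1)
k = 200                             # "don't use too many"
top = np.argsort(confidence)[-k:]   # indices of the k most confident images
pseudo_labels = probs[top].argmax(axis=1)
```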

* Some numbers to compare, from the same team/model (GoogLeNet V3):
0.31 pre-trained, augmented (flip/rotate)
0.26 pre-trained, augmented + "dark knowledge"/semi-supervised
0.15 from scratch, but with 5M synthesized images(!)

2. How to run a single model

If the test data can be clustered, use this fact (in this competition it could):
  • Hack the input and get 3rd place: since the input was a sequence of images, use a NN over the sequence for better training and test predictions (ResNet 0.27 -> 0.18)
  • Another approach: run all images through VGG, take a mid-layer output, cluster it (1000 clusters), and use the cluster-mean prediction
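The cluster-mean trick can be sketched as follows (stand-in predictions and cluster assignments; in practice the cluster ids would come from k-means on the VGG mid-layer features):

```python
import numpy as np

rng = np.random.default_rng(0)
n_imgs, n_classes, n_clusters = 12, 10, 3

probs = rng.dirichlet(np.ones(n_classes), size=n_imgs)  # per-image predictions
clusters = np.arange(n_imgs) % n_clusters               # stand-in cluster ids

# replace each image's prediction with the mean over its cluster
smoothed = probs.copy()
for c in range(n_clusters):
    mask = clusters == c
    smoothed[mask] = probs[mask].mean(axis=0)
```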

All images or part of it?
  • Most ran the image as a whole (with/without clustering)
  • R-CNN (tuned differently than for object detection) helped: VGG 0.22 -> 0.175



3. How to choose models for an ensemble

  • Try to use different models, trained differently. For example, one VGG and another ResNet; one augmented, the other not...
  • K-fold is common, but basic.



4. How to combine models

  • Use scipy's minimize function with a custom weighted geometric average to minimize the log loss over all models' predictions.
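One way to sketch this with scipy (synthetic labels and predictions; the exact parametrization the winning team used is not shown here):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, k = 500, 3                   # test rows, models in the ensemble
y = rng.integers(0, 2, size=n)  # made-up binary labels

# made-up per-model probabilities, loosely correlated with the labels
preds = np.clip(y[:, None] * 0.7 + rng.random((n, k)) * 0.3, 1e-6, 1 - 1e-6)

def logloss(w):
    w = np.abs(w) / np.abs(w).sum()                  # positive, normalized weights
    blend = np.exp((np.log(preds) * w).sum(axis=1))  # weighted geometric mean
    blend = np.clip(blend, 1e-9, 1 - 1e-9)
    return -np.mean(y * np.log(blend) + (1 - y) * np.log(1 - blend))

res = minimize(logloss, x0=np.ones(k) / k, method="Nelder-Mead")
best_w = np.abs(res.x) / np.abs(res.x).sum()
```

The optimizer starts from equal weights and can only improve on (or match) the plain geometric average.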



Statefarm - experiment 3 - VGG16 finetune




VGG: [conv(x2/3) -> max-pool] x a few times, then a classifier head (4096->4096->1000)

Finetune VGG16

I saw one approach, with great results(!), where the whole model was loaded and only the last softmax layer was changed from the original (1000 classes) to the new target (10).
In that case fine-tuning was done on ALL of the model together, with a slow learning rate (SGD 1e-4).

I will use another approach:
We will replace the whole classifier head (4096->4096->1000).
1. [optional, to save time later] Load the model without the last dense part. Run it once on all the images and save each intermediate output (512x7x7) per image to disk, for the train/validate/test sets. For reference, 10K files should take about 1.9 GB of disk space.
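Step 1 can be sketched as follows (a random array stands in for the real VGG conv output, and the cache directory and filename are hypothetical):

```python
import os
import tempfile
import numpy as np

# a stand-in for the 512x7x7 conv-block output of one image
features = np.random.rand(512, 7, 7).astype(np.float32)

out_dir = tempfile.mkdtemp()          # hypothetical cache directory
path = os.path.join(out_dir, "img_0001.npy")
np.save(path, features)               # one file per image

loaded = np.load(path)                # later: train the small head on these
```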

2. Create an alternative classifier head. I used a small one (256->10) due to a somewhat weak machine.

    from keras.models import Sequential
    from keras.layers import Flatten, Dense, Dropout

    model = Sequential()
    model.add(Flatten(input_shape=(512, 7, 7)))
    model.add(Dense(256, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(10, activation='softmax'))

Train it. I tried a few optimizers:
SGD(lr=1e-3, momentum=0.9, nesterov=True)

SGD(lr=1e-3, momentum=0.9, nesterov=False) - BEST. Saved model to disk as vgg_head_only1_out_0_epoc20

SGD(lr=1e-4, momentum=0.9, nesterov=True)

SGD(lr=1e-4, momentum=0.9, nesterov=False)

'adam'