Saturday, December 6, 2014

Searching for close matches to my choices

Problem:

Every user is asked a serious of binary questions:
What do you prefer, cats or dogs
Do prefer winter or summer
Do you consider computer-science gradatees as scientinst. Yes/No

There will be lots of users (1M), but not more than 300 questions.
Find the 10 closest matches for a certain user.

Solution:

This problem is a similar to search nearest-neighbors of  Hamming distance. It is simplified as there are only binary(yes/no questions), hamming works with more options, but nearest-neighbors is extremely difficult problem....

1. For a query for unkown, new user, this stackoverflow question says that it`s a tricky problem.  For short distances (4 from 32 bit integers) BK-tree and VP-tree can speed up queries up to 10 times, or even 1000 times for very short distance of 1.   For medium ranges and above, no good data structure exists, and brute-force is needed.

If we can assume that our users always have 10 close matches with short distance, we can use one of the suggested data-structure, otherwise, just use blunt-force each time.

2. For a big calculation of all the matches, there might be a better solution. Didn`t find one.