This is a domain-specific problem, not a general technology post, so read is as a computer-science riddle...
The problem
Match-making site between 2 people (like a dating site) or 2+ people activity (like a community basketball game of 5 on 5).
- Each person can have a list of properties and preferences for activities, for example: age-group, distance from home, activity-type.
- A person can choose multiple overlapping activities with priorities between them.
- The selection should be bi-directional and we should make sure that there will not be "sterlie" selections where the other person did not even had a chance to choose him back.
Few other requirements:
- There will be a rating system for people, after enough activities were made.
Discussion
Approaches:
CLUSTERING - TBD
One approach will be to try and cluster the data before any user-query is given, so that later queries will be answered faster. Many search engines use this approach.
The issue here is to define the clustering criteria. It is clear that we can cluster by location (NYC is in different cluster than San Francisco.
We may also cluster in NYC by age (a year for cluster).
And by degrees (none , 1 or 2 and above).
So now in NYC there are 40 clusters (20-60 age group) x 3 degree-types.
Although it can reduce the query time, a query on NYC women, any-degrees, age 25-35 will still yield a lot of data. (8m people in NYC area. This can yield 8m(2x4)=1m. Considrable reduction but still a big cluster...
The issue here is to define the clustering criteria. It is clear that we can cluster by location (NYC is in different cluster than San Francisco.
We may also cluster in NYC by age (a year for cluster).
And by degrees (none , 1 or 2 and above).
So now in NYC there are 40 clusters (20-60 age group) x 3 degree-types.
Although it can reduce the query time, a query on NYC women, any-degrees, age 25-35 will still yield a lot of data. (8m people in NYC area. This can yield 8m(2x4)=1m. Considrable reduction but still a big cluster...
QUERY
Storage: Each (1M) user has document/description of up to 1K, of which only small amount can be used for filtering. For a total of 1GB of storage.
Calculation: We assume the DB sort-by field is not enough, and that we need a smarter score function (like this)
[V1] Let`s first start with the trivial but but slow solution.
Assuming there are 1M persons, the match matrix can grow to be quite big. (1Mx1M)
In addition we need to save the history choices of every user (let`s say 1K for each) and filter them out, so they will not appear again. The issue here is that when we sort the results, for an old user, we will always filter out the first few thousand results.
[V2] Let`s try to refine this. As we know that only people with close-location proximity and age proximity are relevant, we first filter out big mismatches (using database-query)
On-demand, but cache query results
On Init, we do nothing
On change in preferences, we do the N/100 query, sort, and cache the results which will serve our paging result.
On new-person we do another query, and add the recent results to any cached result, as a secondary list.
To illustrate:
PersonA have a cached query [10,000 sorted results, of them we choose the first 5,000] and a list of people already ranked by the user [the last 320 people he saw]
New PersonZ arrives and creates a query. Person C just changed it`s location to Far-away
PersonA now will also have a new-candidate [personF score 54] to combine during the next pagination, and [-MINUS- personF score ]
The first query (each time a user changes it`s preferences) requires ranking N/100. but it is cached.
The bigger issue, is how to save the updates to the cache. If a new user arrives, we may need to update 1,000-10,000 options. No DB can simply persist this load. It may be solved by having the second-table in-memory with no-persistence.
[V3] Only on-demand, no-caching
Starting at a certain hour, only the logged-in users are the subset to query from. This will further considerably reduce the user set. Also use ElasticSearch to do the calculation in-server-farm
see function_score there.
There will be no caching (except maybe client side cache of the next page of results, 5-10 tops).
Problem: as we do re-ranking each time, of everything, it must be super fast, and it is hard to imagine how we can achieve it when needing to rank N/100 results.
[V4] On the first call, do a query on N/100 and save result and time of query.
On the second call, query only new data which changed preferences and merge to create a new snapshot.
Problem: snapshot change can be add-new-person. update-up-score
- On Init, we will load all the N persons into memory, and check match between them (N^2), and will save a mapping of person1-person2 = 50%. person1-person3=95%, etc. we will have N^2/2 such mapping.
- On a change in person preferences, we will recheck all his matches (N)
- On a new person, we will check all his matches (N).
Assuming there are 1M persons, the match matrix can grow to be quite big. (1Mx1M)
In addition we need to save the history choices of every user (let`s say 1K for each) and filter them out, so they will not appear again. The issue here is that when we sort the results, for an old user, we will always filter out the first few thousand results.
[V2] Let`s try to refine this. As we know that only people with close-location proximity and age proximity are relevant, we first filter out big mismatches (using database-query)
On-demand, but cache query results
On Init, we do nothing
On change in preferences, we do the N/100 query, sort, and cache the results which will serve our paging result.
On new-person we do another query, and add the recent results to any cached result, as a secondary list.
To illustrate:
PersonA have a cached query [10,000 sorted results, of them we choose the first 5,000] and a list of people already ranked by the user [the last 320 people he saw]
New PersonZ arrives and creates a query. Person C just changed it`s location to Far-away
PersonA now will also have a new-candidate [personF score 54] to combine during the next pagination, and [-MINUS- personF score ]
The first query (each time a user changes it`s preferences) requires ranking N/100. but it is cached.
The bigger issue, is how to save the updates to the cache. If a new user arrives, we may need to update 1,000-10,000 options. No DB can simply persist this load. It may be solved by having the second-table in-memory with no-persistence.
[V3] Only on-demand, no-caching
Starting at a certain hour, only the logged-in users are the subset to query from. This will further considerably reduce the user set. Also use ElasticSearch to do the calculation in-server-farm
see function_score there.
There will be no caching (except maybe client side cache of the next page of results, 5-10 tops).
Problem: as we do re-ranking each time, of everything, it must be super fast, and it is hard to imagine how we can achieve it when needing to rank N/100 results.
[V4] On the first call, do a query on N/100 and save result and time of query.
On the second call, query only new data which changed preferences and merge to create a new snapshot.
Problem: snapshot change can be add-new-person. update-up-score