Thursday, April 30, 2009

Fighting bytes - 64bit Lean hashmaps

Lets say you need to implement a 375,000,000 (more than 3/8 GB) key-value mapping in live memory.
The good: It is integer based and you have a 40GB RAM machine.
The bad: It is a multi-map : integer to multiple integer mapping.

Lets start with the trivial solution:
First of all, as we use 40GB , we need a 64bit java implementation.
Lets use hashmap where key=Integer , value= LinkedList of Integers.
HashMap cost (64bit) = aroud 60 bytes per entry.
Integer cost(64 bit) = 24 byte. remeber we need one in key and at least one in value.
LinkedList(64bit) = starts with ~80byes , ~40 more per entry.

For fast calculation, lets assume one Integer in key and one in value.
We get N*(60+24+24+120) = N*228 = 85.5GB. Way above what we have - the 40GB include the linux , some GC room and few other applications.


But The real max we can have per entry is 100bytes (which will get us to 37.5GB)

So , what to do?
We must reduce the multi-map cost (LinkedList is a big waste). We can also reduce the Map implementation.

Lets start with the hashmap for a minute.
It mainly consists of an Entry[] = N*pointer_size
and each entry is a hash value(integer) and three pointer(to key,to value, to next-entry) = 16(minimal-per-java-object)+4bytes+3*pointer_size = (64bit machines) 48.
For a total of: N*56.
As the size of Entry[] is typically more than than the current size(load-factor ~0.75) , it will even be bigger.

Use of open-addressing , can reduce the size (although there are certain other issues):
N keys
N values
no need for hash(performance by caching), no need for next-entry.
This can reduce the size by 4+8=12 , for a total of N*44. Drawback is that load-factor need to decrease.

Other optimizations:
if key/value are primitives , save a lof of space by using the value instead a pointer and a heap-allocated value.
Integer as a key costs a pointer + 24 byte (if integers are not object-pooled)
int as a key(in a specialy designed map) only cost 4 byte!!!


array based - http://members.optusnet.com.au/~askitisn/CRPITV91Askitis.pdf





This post (http://stackoverflow.com/questions/629804/what-is-the-most-efficient-java-collections-library) gives a list of few customized hash implementation with up to 1/3 the java.util size.

In particular Trove supplies TIntIntHash for int to int hashin TIntObjectHash and TIntArrayList which can be used for (half) efficent integer based multi-maps.
It does not supports out-of-the-box multimaps.
I`m pretty sure it will take me under the 100 bytes threshold. If not, the only other options is to implement a custom int to multi-int hash table, using open-addresing. I hope I won`t get to that (re-inventing wheels is not my cup of tea)

Friday, April 24, 2009

Nice .... Google App Engine support java

Good new: http://googleappengine.blogspot.com/2009/04/seriously-this-time-new-language-on-app.html
Good news for java web-developers . Sad news for the java cloud-based startups...


Going to port the python app to java. Yes, it will take 10 classes where in python it was 100 rows , but ....