Monday, December 28, 2009

Why is cloud hosting of small objects so slow? Amazon S3 and SimpleDB performance

Is it possible to use S3 as scalable storage for small objects?

S3 is perfect for large objects (a 1GB file) but quite bad for small objects like those a media server needs (songs, images).
Behind the scenes, S3 uses standard disks with varying levels of replication and striping. Standard disks have poor random-access performance but good sequential throughput.
Each object has a minimal access time, caused by the disk seek time. See disk throughput and seek-time.
The real solution for small files is to cache the most-accessed ones in memory (as most content delivery networks, such as CloudFront, do).
If you want to retrieve thousands of small objects per second without a cache, the only way to do it fast is to spread them across as many physical disks as possible. With large files the number of disks grows linearly, since a few files fill a whole disk.
But with small files the number of disks grows very slowly (500GB holds thousands of images), and they may all sit on one machine. That's why S3 is not appropriate for small files.
See read-performance of S3.
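
To put rough numbers on the seek-time argument, here is a back-of-the-envelope sketch. The 10 ms seek time and the 5,000 requests per second are illustrative assumptions of mine, not measurements of S3, and the class name is made up for the example.

```java
// Back-of-the-envelope estimate: how many spinning disks are needed to serve
// small random reads when every read pays one full seek.
class SeekBoundEstimate {

    // Random reads per second one disk can serve if each read costs one seek.
    static double readsPerSecondPerDisk(double seekMillis) {
        return 1000.0 / seekMillis;
    }

    // Minimum number of disks needed to sustain the requested read rate.
    static long disksNeeded(double requestsPerSecond, double seekMillis) {
        return (long) Math.ceil(requestsPerSecond / readsPerSecondPerDisk(seekMillis));
    }

    public static void main(String[] args) {
        double seekMillis = 10.0;          // assumed average seek + rotational delay
        double requestsPerSecond = 5000.0; // assumed target load
        System.out.println("reads/s per disk: " + readsPerSecondPerDisk(seekMillis));
        System.out.println("disks needed: " + disksNeeded(requestsPerSecond, seekMillis));
    }
}
```

The point is only that the disk count is driven by the request rate, not by the amount of data stored, which is why a small-object store that fits on a handful of disks cannot serve thousands of uncached reads per second.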

For really fast, cost-effective random access to small files, the only solution is different hardware: solid-state disks. They are becoming cheaper every year, and their random-access speed is close to their sequential performance.
Why is there no CDN with this solution yet? Because caching the most-accessed objects is enough for the typical website. The access pattern of most users is not really random.





Tuesday, November 24, 2009

Command & Conquer management approach

Two great articles. A must-read for anyone who manages anything more than himself.

Friday, July 24, 2009

Flex visualization libraries

Multiple layouts: http://labs.kapit.fr/label/diagrammer
Spring-based map: http://mark-shepherd.com/blog/springgraph-flex-component/
Facebook integration: http://www.adobe.com/devnet/facebook/

Thursday, April 30, 2009

Fighting bytes - 64bit Lean hashmaps

Let's say you need to hold a key-value mapping with 375,000,000 entries (3/8 of a billion) in live memory.
The good: it is integer-based and you have a machine with 40GB of RAM.
The bad: it is a multi-map: each integer maps to multiple integers.

Let's start with the trivial solution:
First of all, as we use 40GB, we need a 64-bit Java implementation.
Let's use a HashMap where key = Integer, value = LinkedList of Integers.
HashMap cost (64-bit) = around 60 bytes per entry.
Integer cost (64-bit) = 24 bytes. Remember we need one in the key and at least one in the value.
LinkedList (64-bit) = starts at ~80 bytes, ~40 more per entry.

For a quick calculation, let's assume one Integer in the key and one in the value.
We get N*(60+24+24+120) = N*228 = 85.5GB. Way above what we have: the 40GB must also hold Linux, some GC headroom and a few other applications.


But the real maximum we can afford per entry is 100 bytes (which gets us to 37.5GB).
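
The arithmetic above can be double-checked with a few lines of Java. This is only a sketch: the per-object sizes are the rough 64-bit figures quoted in this post, not values measured on a particular JVM, and the class name is made up.

```java
// Re-does the estimate for the naive HashMap<Integer, LinkedList<Integer>> layout.
class NaiveMapEstimate {
    static final long ENTRIES = 375_000_000L;

    // Rough 64-bit costs quoted in the post (bytes).
    static final long HASHMAP_ENTRY = 60;
    static final long INTEGER = 24;
    static final long LINKED_LIST_WITH_ONE_ELEMENT = 80 + 40;

    // One map entry + key Integer + value Integer + a one-element LinkedList.
    static long bytesPerEntry() {
        return HASHMAP_ENTRY + INTEGER + INTEGER + LINKED_LIST_WITH_ONE_ELEMENT;
    }

    static long totalBytes(long entries, long perEntry) {
        return entries * perEntry;
    }

    public static void main(String[] args) {
        long naive = totalBytes(ENTRIES, bytesPerEntry());
        long budget = totalBytes(ENTRIES, 100);
        System.out.println("bytes per entry (naive): " + bytesPerEntry());
        System.out.println("naive total, GB:  " + naive / 1e9);
        System.out.println("budget total, GB: " + budget / 1e9);
    }
}
```

GB here means 10^9 bytes, matching the figures in the text.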

So, what to do?
We must reduce the multi-map cost (LinkedList is a big waste). We can also shrink the map implementation itself.

Let's look at the HashMap for a minute.
It mainly consists of an Entry[] = N*pointer_size,
and each entry holds a hash value (int) and three pointers (to key, to value, to next entry) = 16 (minimum per Java object) + 4 bytes + 3*pointer_size = 44, which is padded to 48 on a 64-bit machine.
For a total of: N*56.
As the Entry[] is typically larger than the current size (load factor ~0.75), it will be even bigger.

Open addressing can reduce the size (although it has certain other issues):
N keys
N values
No need for the stored hash (it is only a performance cache), no need for next-entry.
This reduces the size by 4+8=12, for a total of N*44. The drawback is that the load factor needs to decrease.

Other optimizations:
If the key/value are primitives, save a lot of space by storing the value itself instead of a pointer to a heap-allocated object.
An Integer as a key costs a pointer + 24 bytes (if Integers are not object-pooled).
An int as a key (in a specially designed map) costs only 4 bytes!
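
To make the two ideas concrete (open addressing plus primitive keys and values), here is a minimal sketch of an int-to-int map. It is a toy under stated assumptions: fixed capacity with no resizing, no removal, a naive hash, and Integer.MIN_VALUE reserved as the "empty slot" marker. The class and its method names are made up for illustration and are not taken from any library.

```java
import java.util.Arrays;

// Fixed-capacity int -> int map using open addressing with linear probing.
// Keys and values live in two primitive arrays: no Entry objects, no Integer boxes.
class IntIntOpenMap {
    private static final int EMPTY = Integer.MIN_VALUE; // assumption: never a real key

    private final int[] keys;
    private final int[] values;
    private int size;

    IntIntOpenMap(int capacity) {
        keys = new int[capacity];
        values = new int[capacity];
        Arrays.fill(keys, EMPTY);
    }

    // Slot holding the key, or the empty slot where it would be inserted.
    // Terminates because put() always leaves at least one slot empty.
    private int findSlot(int key) {
        int i = (key & 0x7fffffff) % keys.length; // naive non-negative hash
        while (keys[i] != EMPTY && keys[i] != key) {
            i = (i + 1) % keys.length; // linear probing
        }
        return i;
    }

    void put(int key, int value) {
        if (key == EMPTY) throw new IllegalArgumentException("reserved key");
        int i = findSlot(key);
        if (keys[i] == EMPTY) {
            if (size == keys.length - 1) throw new IllegalStateException("map is full");
            keys[i] = key;
            size++;
        }
        values[i] = value; // insert or overwrite
    }

    // Returns the stored value, or defaultValue when the key is absent.
    int get(int key, int defaultValue) {
        if (key == EMPTY) return defaultValue;
        int i = findSlot(key);
        return keys[i] == key ? values[i] : defaultValue;
    }

    int size() { return size; }

    public static void main(String[] args) {
        IntIntOpenMap map = new IntIntOpenMap(8);
        map.put(1, 10);
        map.put(9, 90); // collides with key 1 in a table of 8 slots
        System.out.println(map.get(1, -1) + " " + map.get(9, -1) + " " + map.get(17, -1));
    }
}
```

Each slot costs 8 bytes (4 for the key, 4 for the value), so the real per-entry cost is 8 bytes divided by the load factor you are willing to run at.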


array based - http://members.optusnet.com.au/~askitisn/CRPITV91Askitis.pdf





This post (http://stackoverflow.com/questions/629804/what-is-the-most-efficient-java-collections-library) lists a few customized hash implementations that take as little as 1/3 of the java.util size.

In particular, Trove supplies TIntIntHashMap for int-to-int hashing, plus TIntObjectHashMap and TIntArrayList, which can be combined into (half-)efficient integer-based multi-maps.
It does not support multi-maps out of the box.
I'm pretty sure it will take me under the 100-byte threshold. If not, the only other option is to implement a custom int to multi-int hash table using open addressing. I hope I won't get to that (reinventing wheels is not my cup of tea).
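
For the record, the fallback does not have to be big. Below is one possible sketch of a custom int to multi-int table, not the Trove approach and not something I benchmarked: it stores every (key, value) pair in its own slot of an open-addressing table and lets the same key appear many times. The assumptions are loud ones: fixed capacity, no removal, a naive hash, Integer.MIN_VALUE reserved as the empty marker, and a made-up class name.

```java
import java.util.Arrays;

// int -> multiple ints, stored as flat (key, value) pairs with linear probing.
// The same key may occupy several slots; get() collects all of them.
class IntMultiMap {
    private static final int EMPTY = Integer.MIN_VALUE; // assumption: never a real key

    private final int[] keys;
    private final int[] values;
    private int size;

    IntMultiMap(int capacity) {
        keys = new int[capacity];
        values = new int[capacity];
        Arrays.fill(keys, EMPTY);
    }

    private int home(int key) {
        return (key & 0x7fffffff) % keys.length; // naive non-negative hash
    }

    // Adds one more value for the key (duplicates are kept).
    void put(int key, int value) {
        if (key == EMPTY) throw new IllegalArgumentException("reserved key");
        // Always keep one slot empty so that probing in get() terminates.
        if (size >= keys.length - 1) throw new IllegalStateException("table is full");
        int i = home(key);
        while (keys[i] != EMPTY) {
            i = (i + 1) % keys.length;
        }
        keys[i] = key;
        values[i] = value;
        size++;
    }

    // Returns all values stored for the key, in insertion order; empty if none.
    int[] get(int key) {
        if (key == EMPTY) return new int[0];
        int[] found = new int[4];
        int count = 0;
        for (int i = home(key); keys[i] != EMPTY; i = (i + 1) % keys.length) {
            if (keys[i] == key) {
                if (count == found.length) found = Arrays.copyOf(found, count * 2);
                found[count++] = values[i];
            }
        }
        return Arrays.copyOf(found, count);
    }

    public static void main(String[] args) {
        IntMultiMap map = new IntMultiMap(8);
        map.put(7, 1);
        map.put(7, 2);
        map.put(15, 3); // shares a home slot with key 7
        System.out.println(Arrays.toString(map.get(7)) + " " + Arrays.toString(map.get(15)));
    }
}
```

The trade-off: 8 bytes per pair and no per-key list objects at all, but a key with many values lengthens the probe run for every key that hashes near it.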

Friday, April 24, 2009

Nice... Google App Engine supports Java

Good news: http://googleappengine.blogspot.com/2009/04/seriously-this-time-new-language-on-app.html
Good news for Java web developers. Sad news for the Java cloud-based startups...


Going to port the Python app to Java. Yes, it will take 10 classes where in Python it took 100 lines, but ...

Thursday, March 26, 2009

Automation

1. Deployment automation
2. Test automation

Automatic deployment and automatic tests are the developer's best friends.
Application development has three stages: development, debugging and maintenance.
In the first stage you have a lot of time to spare. In the second you have less, and in the third you have none.

Automatic tests and deployment save you time and bugs in the 2nd and 3rd stages, when you really need them (you need to fix the bug and redeploy by noon tomorrow or your very important customer will drop you, and just now your QA is on vacation and your IT guy is out sick).
But automation takes time to develop. It will lengthen your development time (usually by a mere few days), and automation may not be the coolest feature to develop (usually it's pretty boring)... you can see where this is going...

When you are in maintenance mode, suffering greatly from the time wasted on manual testing and deployment, and you finally understand the need for automation, you simply do not have the time to spare to build it.


Conclusion: Think ahead, during the development period, about whether the feature can benefit from automation down the line. If so, add the required days to the (middle of the) development period - you will thank yourself later for doing so.

Saturday, March 14, 2009

Google App Engine

Prediction:
Google App Engine (or its copycats) will be the only way to develop web applications in the coming years.
They have everything exactly right except the programming language. Note to Google: the average industry developer does not know Python. In fact, if he knew Python, he would probably already have been recruited by Google... So your runtime environment is not usable by us, the common programmers.

When (and if) they implement PHP, Java (and maybe even ASP) runtimes, this project will become the industry leader and will kill all the data stores (including Amazon's) and half of the IT guys in the world.

If they are smart enough, it can even create a standard web-development kit, as long as it is not Python-based!!!

Outsourcing

In the current state of the economy, outsourcing becomes more popular every day.
Recruiting new workers is tough. Recruiting new workers through outsourcing is even tougher.

I'm making a gross generalization here, and I apologize in advance for this observation:
Outsourced workers can be divided into two groups:
The first group consists of extremely talented and experienced men who understand they can make a lot more money as outsourced consultants than as regular employees.
The second group consists of the "refugees": the less talented or experienced men who could not find a regular job as company employees and turned to outsourcing companies as a last resort.

The outsourcing company's H.R. will always tell you their workers are excellent for the job. So how can you distinguish between the extremely talented and the less-than-average ones?
One way is to hold a good technical interview, but as you usually outsource the field in which your group lacks technical expertise, it will be hard for you to really evaluate the interview results.
A second way is quite simple. The expert will demand a salary which is usually more than twice that of a regular employee, whereas the refugee will usually ask for a salary similar to or even lower than a regular employee's. So if the man you are interviewing wants a lot of money, there is a good chance he is the real deal. If he is cheap, there is a good chance he is not worth a penny.

I'm not saying that anyone who demands lots of money is automatically good and that anyone who demands less is automatically bad, but pay attention to it...

Another Gem from Joel

Joel is truly a genius. As I'm "maturing" in work experience, I see more and more how true his observations are.

Joel has written a great post on "How to be a Product Manager"

And a mention of a few other great ones:
1. UI - Never show a UI screen to an "outsider" before it is fully polished.
2. The Law of Leaky Abstractions

Saturday, January 24, 2009

Running a web-server from home

So, you want to run a web server from home (why? For testing purposes, of course! Don't deploy anything real on a home server. Your ISP will eventually block you...)

Behind a router:
1. Change Tomcat back to port 80.
2. Assign a static LAN IP to the web server.
3. Configure your router to forward port 80 to that static IP.
4. Use a dynamic DNS service to map a domain to your ever-changing ISP-provided IP address.
And you are ready to go.
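
Before blaming the router, it helps to confirm that the web server actually answers on port 80 from another machine on the LAN. A minimal sketch of such a check, using only the standard java.net classes; the class name, the 2-second timeout and the "localhost" default are my own placeholders:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Checks whether a TCP port accepts connections, e.g. Tomcat on port 80.
class PortCheck {

    // Returns true if a TCP connection to host:port succeeds within timeoutMillis.
    static boolean isReachable(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false; // refused, timed out, or host could not be resolved
        }
    }

    public static void main(String[] args) {
        // Pass the web server's static LAN IP as the first argument.
        String host = args.length > 0 ? args[0] : "localhost";
        System.out.println(host + ":80 reachable = " + isReachable(host, 80, 2000));
    }
}
```

This only proves the LAN side (steps 1 and 2). Whether the port forwarding and the dynamic DNS work has to be tested from outside your network, which is what the external port-test site listed below is for.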


A few resources:
Good ref (for Mac)
http://www.canyouseeme.org/ - external site which tests your open ports.
http://www.no-ip.com/ - dynamic-DNS (with free version for non-commercial usage)

Thursday, January 15, 2009

Web Servers

web-servers
Lots of buzzwords fly around when talking about web servers. I assume you know what a web server is and just want to find out which of the free/commercial servers best suits your needs:


web-servers For PHP/CGI/Python (etc.) developers

1. Apache HTTP server [ = HTTPd] - a C-based implementation of a web server for which you can use modules for the well-known "script" languages: PHP, Perl, Python, Ruby.
2. Nginx - a 2nd-generation server (faster, less known). A C-based implementation of a web server which again lets you use the "script" languages. This is a rather new server with better I/O utilization, which should outperform the Apache HTTP server.
3. Lighttpd - another 2nd-generation server, less stable than Nginx.

web-servers For Java developers

1. Apache Tomcat - a Java-based implementation of a web server for which you can write modules in Java (plain Java servlets or JSP). The internal module that supports servlets is called "Catalina" and the JSP one is called "Jasper", but these names only matter for knowing which log file to look at. It is not related to the popular C-based Apache HTTP server (they share the "Apache" prefix because they come from the same open-source organization, but there are dozens of Apache projects that are not technologically related to one another).
2. Jetty - a small web server, usually used for web applications that run on a very small scale (some applications only need to work on localhost, and Jetty is a good solution for them).
3. JRun
4. Resin

web-server for ASP.NET
Here, as with many Microsoft applications, you have one and only one choice: IIS


Simple so far? Let's make it a bit more complex - interconnections!!!!
On some sites, one web server gets the original request and forwards it to another web server. A few milliseconds later, it gets the response and sends it to the user.
In these setups it is possible that the "gateway" web server is a C-based one and the other is a Java-based one. They connect to each other over a different protocol (it's a waste to use HTTP between two servers).
For Apache HTTP server (C-based) to Apache Tomcat (Java-based), the connector uses the AJP protocol. See The Apache Tomcat Connector - AJP Protocol Reference
http://tomcat.apache.org/connectors-doc/generic_howto/workers.html

Why use interconnections?
Usually when the Apache (C-based) server is the load balancer and the Tomcats actually do the dynamic HTTP generation.
Note that it is possible to run a load balancer as a (pure Java) web app on Tomcat, but (by rumor) it does not get close to the Apache (C-based) performance at load balancing.
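
To show what the gateway is deciding on every request, here is a tiny sketch of the simplest balancing policy, round-robin. It is an illustration only: round-robin is my choice of example, it is not how the Apache connector is implemented, and the class name and backend addresses are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Picks the next backend (e.g. a Tomcat instance) in strict rotation.
class RoundRobinBalancer {
    private final List<String> backends;
    private final AtomicInteger next = new AtomicInteger();

    RoundRobinBalancer(List<String> backends) {
        if (backends.isEmpty()) throw new IllegalArgumentException("no backends");
        this.backends = new ArrayList<>(backends); // defensive copy
    }

    // Thread-safe; floorMod keeps the index valid even after the counter overflows.
    String pick() {
        int i = Math.floorMod(next.getAndIncrement(), backends.size());
        return backends.get(i);
    }

    public static void main(String[] args) {
        List<String> tomcats = new ArrayList<>();
        tomcats.add("10.0.0.11:8009"); // hypothetical backend addresses
        tomcats.add("10.0.0.12:8009");
        RoundRobinBalancer balancer = new RoundRobinBalancer(tomcats);
        for (int request = 0; request < 4; request++) {
            System.out.println("request " + request + " -> " + balancer.pick());
        }
    }
}
```

A real balancer also has to notice dead backends and usually keeps a user's session on the same Tomcat, which is where most of the complexity lives.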





Performance Benchmarks


Between different web-server "languages"
It would be interesting to see which web-server flavor (C-based, Java-based or IIS) performs better on similar tasks: static pages and similar dynamic pages. I didn't find a good comparison, but I do not see anyone changing the implementation language of his company site just because the other language's web server performs better.

Inside java web-server
http://www.webperformanceinc.com/library/reports/ServletReport/index.html - bottom line is that Tomcat and Resin are close: 1000 concurrent users per second (with a mix of short, medium and long sessions; yes, it does not say much about the numbers on your site)
http://tomcat.apache.org/articles/performance.pdf
http://tomcat.apache.org/articles/benchmark_summary.pdf - static files (Tomcat VS apache-C-based-server)


conclusion
It's hard to find a reliable benchmark covering the whole web-server matrix. (There should be a "Tom's Hardware" type of benchmark for web servers.)
That said, these are the server choices that seem to be recommended:
  • Choose the dynamic web server according to the language you are familiar with. If you use Java, the free Apache Tomcat should be great.
  • If you are not going to have a lot of static pages/images or load balancing, you can use Tomcat for all the rest too.
  • Otherwise, choose one of the 2nd-generation C-based servers as the "static" web server. Nginx is a good choice. It's probably better than the Java-based servers. You can also use its load-balancing capability.
Final notes:

  • Look at the feature matrix of the servers you consider. If you need advanced features (load balancing, proxying, etc.), your choice will be easier because the number of candidate servers will be smaller.
  • Remember to look for the bottlenecks: a dynamic web server usually uses a database, which tends to be the bottleneck. A static web server's bottleneck tends to be the bandwidth itself.