Thursday, September 23, 2010

Postgres 9.0 hot-standby

Postgres 9.0 has a new feature: "streaming replication" + "hot standby". In a few words, it allows asynchronous replication between a master and a slave (there is a delay, but a small one), and the slave can serve read-only queries.
Synchronous replication is only planned for 9.1.

It's a good time to recap the different sharding/master-slave options and their roles.

At first you develop the code for one DB instance, and you assume three things about it:
1. It will never fail - the machine, process and disks are eternal.
2. Data scale: it can grow in size as much as you want - terabytes of disk space.
3. Query (user) scale: the machine is as fast as you want - it can support millions of fast queries in parallel.

After a few months you deploy it, and then find out that these 3 things are not so easy to achieve. Let's see what the (Postgres) solutions for them are.

Query scale
  1. Scale-up: still one machine, but buy the best one money can get - 8 cores and 32GB RAM are relatively cheap. But a machine twice as strong will cost a lot more than twice the money. If you need x2 or x4, switch to scale-out.
  2. Scale-out:
    2.1 Scale reads, not writes (good for a mostly-read DB):
    Use one master server (read & write) and multiple read-only slave servers which get continuously updated from the master server. The master and slaves share the same data (see the sketch after this list).
    Postgres 9.0 supports it out of the box.
    2.2 Scale writes by sharding.
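To make 2.1 concrete, here is a minimal read/write-splitting sketch in plain JDBC; the host names, database and credentials ("master-db", "slave-db", "app") are placeholders, not anything Postgres mandates:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class ReadWriteSplit {

    // all writes go to the single master
    static Connection master() throws SQLException {
        return DriverManager.getConnection("jdbc:postgresql://master-db:5432/app", "app", "secret");
    }

    // reads may go to a read-only slave, which can lag slightly behind
    static Connection slave() throws SQLException {
        return DriverManager.getConnection("jdbc:postgresql://slave-db:5432/app", "app", "secret");
    }

    public static void main(String[] args) throws SQLException {
        try (Connection w = master(); Statement st = w.createStatement()) {
            st.executeUpdate("INSERT INTO events(msg) VALUES ('hello')");
        }
        try (Connection r = slave(); Statement st = r.createStatement();
             ResultSet rs = st.executeQuery("SELECT count(*) FROM events")) {
            rs.next();
            // the count may not include the insert yet - replication is asynchronous
            System.out.println("events on slave: " + rs.getLong(1));
        }
    }
}

Adding more slaves multiplies the read capacity, but every write still has to go through the one master - hence 2.2.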

Fault-tolerance
For a transactional DB where no record can be lost, this is a hard task. For a data warehouse which may lose the latest few minutes of updates, this is much simpler.

No data loss:
  1. Shared disks: the master writes to shared storage. When it fails, the slave mounts it and loads the data. This is a "cold standby", as it can take a few minutes to recover.
  2. pgpool-II - slows down the updates on the master! (each write is replicated to all nodes synchronously)
  3. Synchronous log shipping (should be in Postgres 9.1) - does not (considerably) slow the master. Failover takes ~15 seconds (10 to detect the failure, 5 to start the warm standby).
Some data loss allowed:
  1. Postgres 9.0 hot-standby & streaming-replication
  2. Slony-I, which updates the slave using triggers

Data scale
Start with more disks in the machine, then scale out with sharding.
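A toy sketch of the sharding idea - each key is routed to one of N databases by a stable hash, so every shard holds (and serves) only a fraction of the data. The shard URLs are hypothetical:

import java.util.Arrays;
import java.util.List;

public class ShardRouter {

    private final List<String> shardUrls;

    public ShardRouter(List<String> shardUrls) {
        this.shardUrls = shardUrls;
    }

    // a stable function of the key keeps a given user on the same shard
    public String shardFor(long userId) {
        int idx = (int) ((userId & Long.MAX_VALUE) % shardUrls.size());
        return shardUrls.get(idx);
    }

    public static void main(String[] args) {
        ShardRouter router = new ShardRouter(Arrays.asList(
                "jdbc:postgresql://shard0:5432/app",
                "jdbc:postgresql://shard1:5432/app"));
        System.out.println(router.shardFor(42L)); // -> shard0
    }
}

Note that with plain modulo routing, changing the number of shards moves most of the keys, so plan the shard count (or use a scheme like consistent hashing) up front.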

Tuesday, July 13, 2010

Axis2 RPC client performance

The client time consists of a few parts:
* Open a TCP connection to the server (send SYN, receive SYN/ACK, send ACK = 1 round trip)
* Send/receive the WS XML (send at least one packet, receive at least one back = at least 1 round trip)

Let's eliminate the TCP connection time. My ping to the server is 222ms. The server calculation time is small (less than 15ms), so one round trip = ~235ms.
The goal is to reduce each WS call to one round trip.

1. Use a high HTTP/1.1 keepAlive value:
Pro: removes the TCP-connection round trip.
Con: load balancing based on IP will not function as intended, as after the first connection the client will keep using the same host over and over.

  • Make sure HTTP/1.1 keepAlive is turned on on both the client and the server (see the client sketch after this list).
    Tomcat's HTTP/1.1 keepAlive is turned on by default (you can shut it off in server.xml, Connector section, by adding maxKeepAliveRequests="1"; the default is no parameter at all, which means 100 requests).
    Turning it off will cause 2 round trips per call - [first=626ms] then 457, 455, 458, ...

  • Keep the connection alive by using it more frequently than the timeout.
    The default keepAliveTimeout in Tomcat 6.0.18 equals connectionTimeout, which is typically 20000ms = 20 seconds.
    It means that the TCP connection will stay alive only if it is used more frequently than that.
    You can configure Tomcat to a higher value (120 seconds, for example) in the server.xml, Connector section.
    Note: adding the parameter keepAliveTimeout="500000" alone is not enough for Tomcat 6.0.18, because of this bug; you will need to add disableUploadTimeout="false" keepAliveTimeout="500000"
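A minimal client-side sketch of connection reuse with Axis2, assuming a hypothetical service URL and operation; the key is to reuse the same underlying HTTP client, and with it the kept-alive TCP connection, across calls:

import javax.xml.namespace.QName;

import org.apache.axis2.addressing.EndpointReference;
import org.apache.axis2.client.Options;
import org.apache.axis2.rpc.client.RPCServiceClient;
import org.apache.axis2.transport.http.HTTPConstants;

public class KeepAliveClient {
    public static void main(String[] args) throws Exception {
        RPCServiceClient client = new RPCServiceClient();
        Options options = client.getOptions();
        options.setTo(new EndpointReference("http://server:8080/axis2/services/MyService"));
        // reuse the underlying HTTP client between invocations, so the
        // kept-alive TCP connection is used instead of a new handshake
        options.setProperty(HTTPConstants.REUSE_HTTP_CLIENT, Boolean.TRUE);

        QName op = new QName("http://service.example", "echo"); // hypothetical namespace/operation
        for (int i = 0; i < 5; i++) {
            // after the first call, each call should cost ~1 round trip
            Object[] result = client.invokeBlocking(op, new Object[]{"ping"},
                    new Class[]{String.class});
            System.out.println(result[0]);
        }
    }
}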

2. If your data is more than a few bytes, compress it to save transfer time. You need to configure both the server and the client to support compression.
  • Client - GZIP options:
    options.setProperty(HTTPConstants.MC_ACCEPT_GZIP, Boolean.TRUE);

    [Note A: if you configure HTTPConstants.MC_GZIP_REQUEST, the request itself will be compressed. This works fine for a server which supports gzip, but will fail on other servers.
    If your requests are not huge, it is not worth the risk!]
    [Note B: this option can cause you problems if you have a load balancer which is problematic with chunking: options.setProperty(HTTPConstants.CHUNKED, Boolean.TRUE)]

  • Server: on Tomcat 6.0.18, configure server.xml, Connector section. Add compression="on", which means: try to use compression, depending on the client.
    Firefox 3.6.6 will not get it compressed by default, nor will a WS client without the GZIP options, but our optimized client will use it.
  • Note: this configuration works with or without compression on both the client and the server side, so it is "backward compatible". Transfer time can be considerably reduced (in my case, from 500ms for 6KB down to 260ms for 1KB).
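Putting the client-side options together - a sketch of the safe choice (accept compressed responses) with the riskier ones from notes A and B left commented out:

import org.apache.axis2.client.Options;
import org.apache.axis2.transport.http.HTTPConstants;

public class GzipOptions {
    static void configure(Options options) {
        // safe: tell the server we accept gzip-compressed responses
        options.setProperty(HTTPConstants.MC_ACCEPT_GZIP, Boolean.TRUE);

        // note A (risky): compress the request itself - fails against
        // servers that do not support gzip requests
        // options.setProperty(HTTPConstants.MC_GZIP_REQUEST, Boolean.TRUE);

        // note B: if your load balancer mishandles chunked transfer,
        // you may have to turn chunking off
        // options.setProperty(HTTPConstants.CHUNKED, Boolean.FALSE);
    }
}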

3. TCP connection optimization
  • The TCP implementation increases the window size, sending more and more bytes per second as long as the connection is not congested, so after a second or two you will reach the optimal rate. But on high-latency, high-bandwidth connections (100Mb/s, 50ms latency) the OS's max TCP buffer limits the maximum throughput and can reduce it by a factor of 2-10.
    A good max value should be: round-trip time [ping] * bandwidth (the bandwidth-delay product).
    If this is your use case (like a connection between data centers), tune the system.
    Example: 100Mb/s = 12.5MB/s, ping 100ms --> 12.5MB/s * 0.1s = 1.25MB. The default on UNIX is 256KB, so you can get a boost of x5.

  • If you are not using a persistent TCP connection, and the round trip is large, a small file will take a few round trips to reach the optimal speed (and by then the file is already finished). If this is your case, set the socket receive/send buffers to a high value immediately (see the sketch after this list).
  • For low-latency, small-transmission traffic, also use Socket.setTcpNoDelay, which disables Nagle's algorithm. Wikipedia quote:
    Nagle's algorithm works by combining a number of small outgoing messages, and sending them all at once. Specifically, as long as there is a sent packet for which the sender has received no acknowledgment, the sender should keep buffering its output until it has a full packet's worth of output, so that output can be sent all at once.
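A small sketch of both socket tweaks, with a hypothetical host and port. The buffer sizes must be set before connect() so they can affect the TCP window negotiation, and the OS maximum (e.g. net.core.rmem_max on Linux) must be raised to match:

import java.net.InetSocketAddress;
import java.net.Socket;

public class TunedSocket {
    public static void main(String[] args) throws Exception {
        Socket s = new Socket();
        // bandwidth-delay product: 12.5MB/s * 0.1s ping = ~1.25MB per buffer
        s.setReceiveBufferSize(1250 * 1024);
        s.setSendBufferSize(1250 * 1024);
        // disable Nagle's algorithm: small messages are sent immediately
        // instead of being buffered until an ACK or a full packet
        s.setTcpNoDelay(true);
        s.connect(new InetSocketAddress("server", 8080), 5000);
        // ... use the socket ...
        s.close();
    }
}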


P.S.
1. The default Axis2 RPC client does not use a pool of HTTP connections. This default implementation is good only for tests and will cause problems in production; follow the instructions here and use MultiThreadedHttpConnectionManager (see the sketch below).
2. To see the TCP connections, run (Linux) "netstat -nap | grep "
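A sketch of that pooled setup, assuming Axis2 1.x over Commons HttpClient 3.x: a shared MultiThreadedHttpConnectionManager is handed to the service client, so concurrent calls stop fighting over a single connection:

import org.apache.axis2.client.Options;
import org.apache.axis2.rpc.client.RPCServiceClient;
import org.apache.axis2.transport.http.HTTPConstants;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.MultiThreadedHttpConnectionManager;

public class PooledAxis2Client {
    public static void main(String[] args) throws Exception {
        MultiThreadedHttpConnectionManager mgr = new MultiThreadedHttpConnectionManager();
        mgr.getParams().setDefaultMaxConnectionsPerHost(20); // tune for your load
        HttpClient httpClient = new HttpClient(mgr);

        RPCServiceClient client = new RPCServiceClient();
        Options options = client.getOptions();
        options.setProperty(HTTPConstants.REUSE_HTTP_CLIENT, Boolean.TRUE);
        options.setProperty(HTTPConstants.CACHED_HTTP_CLIENT, httpClient);
        // all invocations through this client now share the pooled connections
    }
}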

Friday, March 26, 2010

RAID and SSD

I really need a new desktop machine. The current dilemma is between a large, cheap mechanical HD and a small, costly SSD.
The second question is "to RAID" or "not to RAID".

Solid-state drives vs. mechanical disks:
Typical Windows usage is faster with an SSD. The question remains whether there is a good & cheap drive. The new generation from SanDisk (G3) claims the following numbers:
Average sequential throughput (higher is better) and random-access latency (lower is better):
Read 220MB/s, write 120MB/s, 0.1-0.2ms latency. Cost: 60GB for $229, 120GB for $399.

Mechanical disks, some numbers:
VelociRaptor VR150, 10,000 rpm, 300GB: 102MB/s, 7ms - ~$200
Seagate Barracuda ST3320613AS, 7,200 rpm, 320GB: 99.2MB/s, 17.1ms - $50

Conclusion: the SSD is worth the money.

Raid
There are two goals for RAID: one is performance, the second is fault tolerance. For a desktop owner the first tends to be the only relevant goal. Let's assume you have a budget of up to 2 HDs.
RAID 0 - two striped disks. If a file contains 4 sections, one disk contains sections 1,3 and the other 2,4.
On mechanical/SSD disks: x2 sequential read, x0.9 random read.
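A toy illustration of the striping layout - mapping a logical section number to (disk, local block) on two striped disks, which is why sequential reads can stream from both disks at once:

public class Raid0 {
    static final int DISKS = 2;

    // section i (counting from 0) lands on disk (i % 2), at local block (i / 2):
    // even sections on one disk, odd sections on the other
    static int diskFor(int section)     { return section % DISKS; }
    static int blockOnDisk(int section) { return section / DISKS; }

    public static void main(String[] args) {
        for (int section = 0; section < 4; section++) {
            System.out.println("section " + section + " -> disk "
                    + diskFor(section) + ", local block " + blockOnDisk(section));
        }
    }
}

Random reads do not gain the same way: each small read still lands on a single disk, hence the ~x0.9 figure above.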



P.S. What does the SATA150/SATA300 interface mean?
This is the bus speed: SATA150 supports up to 150MB/s; SATA300 (or SATA-II) supports up to 300MB/s.
Make sure that the SATA interface's max throughput > the HD's max throughput.