Fine-tune your database parameters to extract the maximum out of it!
These are a few of my experiences from building a search engine and optimising the PostgreSQL database towards sonic speeds!
Our configuration for the PostgreSQL server was:
P-IV 2.00 GHz system
One of the first things I noticed after turning on the servlet program was that, although queries were returned almost as fast as from the previous MySQL-based system, the load on the server was much higher. So I started to dig into the details. I had optimized MySQL before by greatly increasing cache and buffer sizes and by throwing more RAM at the problem. The single biggest thing one has to do before running PostgreSQL is to provide enough shared buffer space. But then,
How much is enough?
There is a heated debate between people who argue that, logically, the entire RAM could be devoted to it, and those who say that throwing in more RAM beyond a certain limit has no use. The more shared buffer cache you have, the bigger the percentage of your database that causes neither read()s nor memory copying from the OS buffer cache. On the other hand, you will cache a smaller total number of blocks, because you will be buffering them twice: when you copy a block from the OS buffer cache into shared memory, the copy still exists in the OS cache, so that block is now buffered twice. Still, a single disk I/O is far more expensive than hundreds of copies between the OS buffer cache and Postgres' shared memory. Also consider all the other things running on the machine -- little things like cron take memory too. It is therefore dangerous not to let the OS manage a good chunk of memory.
It occurs to me that these two opposing factors could each be plotted as a curve; the ideal setting would be where the curves cross.
Besides this, I also optimised the SQL queries for my purpose. A major drawback of PostgreSQL lies in its evaluation of queries containing 'IN' and 'EXISTS'. Suppose:
Query 1. SELECT * FROM db1 WHERE ID IN (SELECT id FROM db2 WHERE word = 'whatever') LIMIT 20;
Query 2. SELECT * FROM db1 WHERE ID IN (1234, 2345, 1242, 1256, 1245, 1567, 2222, 22, 345, 234, 567, 456, 35, 56);
(where ID is the primary key)
The latter query is scanned using the index on ID, while the former runs into a sequential scan: the database ends up re-running the subquery for each row of the outer query. If we use explicit JOINs instead (as below), we can coax the database into an index scan.
Final Query:
SELECT * FROM db1, db2 a, db2 b
 WHERE db1.id = a.id AND a.word = 'word1'
   AND db1.id = b.id AND b.word = 'word2';
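EXPLAIN is the tool for seeing which plan you actually get. A sketch, assuming the db1/db2 tables from the queries above (the exact plan output varies with your data and PostgreSQL version):

```sql
-- The IN (subquery) form: here it produced a sequential scan on db1.
EXPLAIN SELECT * FROM db1
 WHERE id IN (SELECT id FROM db2 WHERE word = 'whatever') LIMIT 20;

-- The explicit-join form: this one used the index on db1's primary key.
EXPLAIN SELECT * FROM db1, db2 a, db2 b
 WHERE db1.id = a.id AND a.word = 'word1'
   AND db1.id = b.id AND b.word = 'word2';
```

Comparing the two plans side by side is the quickest way to confirm that a rewrite actually changed a sequential scan into an index scan.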
NOTE: You can also run into a sequential scan, instead of the expected index scan, if the number of tuples to be scanned is more than about 30-40% of the total tuples in the table. This behaviour can be tuned by changing the weights the optimizer uses for these decisions: random_page_cost, cpu_tuple_cost, cpu_index_tuple_cost and cpu_operator_cost.
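These weights can be changed per session with SET, which makes them easy to experiment with before touching postgresql.conf. A sketch against the db1/db2 tables above (the values are illustrative, not recommendations):

```sql
-- Lowering random_page_cost tells the planner that random (index-driven)
-- page reads are cheaper relative to sequential reads; the default is 4.
SET random_page_cost = 2;

-- Then re-check the plan to see whether the index is now chosen.
EXPLAIN SELECT * FROM db1, db2 a
 WHERE db1.id = a.id AND a.word = 'word1';
```

Because SET only affects the current session, you can tune until the plans look right and only then commit the values to postgresql.conf.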
I also decided to throw more RAM at the problem, allocating 64MB towards the shared buffer space. The file /var/lib/pgsql/data/postgresql.conf contains the settings for the database server. PostgreSQL uses system shared memory as a buffer. On a Linux system, you can see how much shared memory your system allows by running:

cat /proc/sys/kernel/shmmax

And to view shared memory use on the system:

ipcs -m

The result of the first command is in bytes. By default Red Hat 7.2 allows 32MB of shared memory, which might not be enough for PostgreSQL. I increased this limit to 64MB with the command:
echo 67108864 > /proc/sys/kernel/shmmax
To make this setting permanent, place that line in your PostgreSQL startup file, or in /etc/rc.d/rc.local. Then in our postgresql.conf I set shared_buffers to 8192 and sort_mem to 16384 (16MB for each sort memory area). Since connection pooling was in effect, I set max_connections to 50.
fsync was also set to false, which trades crash safety for speed. The relevant postgresql.conf lines:

shared_buffers = 8192
sort_mem = 16384
max_connections = 50
fsync = false
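As a sanity check, the numbers line up: shared_buffers is counted in buffer pages of 8kB each in a default build, so 8192 buffers account for essentially all of the 64MB shmmax set earlier (in practice the server asks for slightly more shared memory than this for its other structures, so err on the high side). A quick sketch of the arithmetic:

```shell
# shared_buffers is counted in pages of BLCKSZ bytes (8 kB in a default build).
buffers=8192
page_size=8192                    # bytes per buffer page
echo $((buffers * page_size))     # prints 67108864, the shmmax value above
```

If you raise shared_buffers later, redo this multiplication and bump shmmax accordingly, or the postmaster will fail to start.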
One hitch I found initially was that the system had to build up and tear down a PostgreSQL connection with each request. This was intolerable, so I turned to the connection pooling features provided by Resin (http://caucho.com).
Acknowledgments: Curt, Bruce, Andrew et al. for clearing my doubts!