PostgreSQL 8.3 Warm Stand-by Replication

In an effort to make things a bit more redundant at my workplace, I have implemented PostgreSQL 8.3 warm stand-by replication, as outlined in the PostgreSQL 8.3 manual. The manual leaves the implementation possibilities wide open, which is good for scalability and for applying the process in many different ways, but bad if you are looking for more detail on how to set it up. In this tutorial, I will outline my exact method for implementing warm stand-by replication for PostgreSQL running on Ubuntu Server 9.04.

The basic gist of PostgreSQL’s warm stand-by system is its “continuous archiving” ability. It uses this feature to do “log shipping” of write-ahead log (WAL) files, which are then retrieved by a stand-by server in “continuous recovery” mode. The stand-by server remains in continuous recovery until a given condition is met, at which point the server comes up into normal operation.

First, you must start with two “as close to the same as possible” servers, as recommended by the PostgreSQL manual. In my case, I have a virtual pgsql server in our primary location, and another in our disaster recovery location. Both servers run Ubuntu Server 9.04 (Jaunty) 32-bit, with PostgreSQL installed from the repositories. PostgreSQL also requires that the two servers be the same architecture (32- or 64-bit).

Once both servers are running, you must create a location to store the archived WAL files. It is highly recommended to locate the archive storage remotely from the primary server, so that in the event of a primary server failure you have the latest and greatest WALs for your stand-by server to import. The archive must be readable and writeable from both the primary and secondary servers. For my installation, I created an NFS share on the stand-by server itself, and the stand-by server then imports the WAL files locally. Just be sure that the archive on the stand-by server is owned by “postgres:postgres” and that the world permissions are “0”, because all data written to your database will be visible here as it is written. Remember to take precautions; security is your number one concern at all times.
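As a sketch, preparing the archive directory on the stand-by might look like this (the /psql_archive path is from my setup; adjust to taste):

```shell
# Create the archive directory on the stand-by (path is illustrative)
sudo mkdir -p /psql_archive
# Owned by the postgres user, as noted above
sudo chown postgres:postgres /psql_archive
# World permissions "0": the WAL stream contains all of your data
sudo chmod 700 /psql_archive
```
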

With the remote archive mounted locally, we must configure pgsql for continuous archiving. Edit /etc/postgresql/8.3/main/postgresql.conf and set the “archive_mode” directive to “on”. For my purposes, I set the “archive_command” to “test -f /psql_archive/mounted && test ! -f /psql_archive/%f && rsync -a %p /psql_archive/%f”. I created an empty file called “mounted” on the NFS share so that I could verify that the share was actually mounted, preventing WAL files from being written to the local place-holder archive directory. So the archive_command first checks whether the share is mounted; if it is, it then checks whether the current WAL already exists; if it doesn’t, it writes the file with rsync. Rsync is important here because it copies the data to a temporary file name and renames the destination file into place only after all the data has been copied successfully. This prevents the stand-by server from trying to load a partial log file.
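Put together, the relevant postgresql.conf lines on the primary look roughly like this (the /psql_archive path and the “mounted” marker file are from my setup; adjust for yours):

```
archive_mode = on
archive_command = 'test -f /psql_archive/mounted && test ! -f /psql_archive/%f && rsync -a %p /psql_archive/%f'
```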

The archive_timeout directive will require some careful thought on your part. PostgreSQL archives a WAL segment when it reaches 16MB worth of writes, or when the data becomes older than “archive_timeout” seconds. Unfortunately, the shipped WAL file is ALWAYS 16MB, whether that much data has been written or not, so setting this value too low will dramatically increase the amount of bandwidth required to transmit your WALs. If the stand-by server is on the LAN, this is less of an issue, but over the WAN you’ll need to be really careful how often the archive times out. Please also remember that if 16MB of data is written to the database before your time-out, the log is shipped at that point in time, so if you are replicating over the WAN, plan your bandwidth capacity according to the requirements of your database’s write activity. For me, the archive_timeout is only every 10 minutes, which is safe because we don’t have rapidly changing data, and losing 10 minutes’ worth of data in the event of a failure would not be catastrophic. This only averages out to around 220 kbps of continuous bandwidth utilization.
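You can sanity-check that figure yourself: one 16MB segment every 600 seconds works out to roughly:

```shell
# 16 MB per segment, one segment per 600 s, expressed in kilobits per second
awk 'BEGIN { printf "%.0f kbps\n", 16 * 1024 * 1024 * 8 / 600 / 1000 }'
# prints "224 kbps"
```
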

Your main concern with WAN replication during idle database times is that you do not want the next WAL file to begin transmitting before the previous one has finished. This causes a chain reaction in which your secondary database falls further and further behind, without an opportunity to catch up. If you happen to run into this during peak periods of activity, it shouldn’t hurt, so long as the stand-by is able to catch up within a time frame that is acceptable to you.
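One way to keep an eye on this is a backlog check that counts the WAL segments waiting in the archive. This is a hypothetical sketch (the path and threshold are my own assumptions, not from the PostgreSQL docs):

```shell
#!/bin/sh
# Warn if too many WAL segments (24-hex-digit file names) are queued
# in the archive, which suggests the stand-by is falling behind.
ARCHIVE="/psql_archive"
THRESHOLD=5
count=$(ls "$ARCHIVE" 2>/dev/null | grep -Ec '^[0-9A-F]{24}$')
if [ "$count" -gt "$THRESHOLD" ]; then
    echo "WARNING: $count WAL segments queued in $ARCHIVE"
fi
```

You could run this from cron and mail yourself the warning, which also addresses the monitoring gap discussed in the comments below.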

After setting these parameters and restarting the PostgreSQL service (/etc/init.d/postgresql-8.3 restart), you can watch the archive directory, and you should start seeing WAL files being transmitted. Be careful: if you run in this state too long, the archive will run out of disk space, since nothing is consuming the files yet.

Now we can begin configuring our stand-by server for continuous recovery mode. My preference is to create a new file at /etc/postgresql/8.3/main/recovery.conf and symbolically link to it, but you can optionally create the file directly in the final destination directory, which I’ll mention later.

The continuous recovery portion would be quite a bit more complicated if it weren’t for “pg_standby”. Luckily, pg_standby is installed and compiled by default on Ubuntu Server 9.04 (on some distributions it ships in the postgresql-contrib package), but you must create a symbolic link to pg_wrapper from /usr/bin:


sudo ln -s /usr/share/postgresql-common/pg_wrapper /usr/bin/pg_standby

Once that is complete, just open your recovery.conf file and add the following line (please disregard any line wrap-around; this should be a single line in the file):


restore_command = 'pg_standby -d -l -r 3 -s 60 -t /psql_archive/role.master /psql_archive %f %p %r 2>>/var/log/postgresql/pg_standby.log'

This line basically tells PostgreSQL to continue recovery forever, until the file /psql_archive/role.master exists. In order to make fail-over in the event of a disaster a bit more seamless, I have written a custom cron job to create this file automatically, based on an internal DNS change. To put it as simply as possible: I have a generic CNAME “postgres” in my internal DNS, which all of my PostgreSQL clients point to for service. Under normal circumstances, the CNAME points to “pg-primary”. In the event of a fail-over, I change the CNAME so that it points to “pg-secondary”. Adding the following line to the /etc/crontab file on your stand-by machine will automatically create your “trigger file” after you make the DNS change:


* * * * * postgres test -n "`nslookup postgres | grep pg-secondary`" && touch /psql_archive/role.master >/dev/null 2>&1

I still wouldn’t go as far as calling this a “hot standby” just yet, but it does simplify the process of failing over.

So to actually get this thing up and running, we must first take a base backup of our primary server’s data directory. It is neither necessary nor recommended to shut the server down for the base backup. From a SQL prompt, issue the query:


select pg_start_backup('base_backup');

After that, create a tar archive of the entire data directory. On Ubuntu Server 9.04, the PostgreSQL data directory is located at /var/lib/postgresql/8.3/main. To do this, issue the command:


tar -czf base_backup.tar.gz /var/lib/postgresql/8.3/main

Once the tar file is made, connect back to the database and issue the query:


select pg_stop_backup();

Copy the resulting file to your stand-by server. Once there, shut down postgresql on the stand-by server, and extract the contents of the tar.gz archive, overwriting the existing data directory. Navigate to the data directory:


cd /var/lib/postgresql/8.3/main

Then create a symbolic link to /etc/postgresql/8.3/main/recovery.conf, if you chose to make it there, or create your recovery.conf file here.


ln -s /etc/postgresql/8.3/main/recovery.conf

Assuming everything is set up correctly, you should be able to start the PostgreSQL service (/etc/init.d/postgresql-8.3 start). Monitoring the archive directory, you should see the WAL files being consumed and removed a minute or so after each one appears.

Unfortunately, there is no way to verify that the data is actually being replicated and recovered into the stand-by server, because during “recovery mode” the database is not operational. To test that everything is working properly, you have to manually create the trigger file:


touch /psql_archive/role.master

This should trigger the recovery process to complete, and the server should come on-line automatically. To verify, check the pgsql data directory (/var/lib/postgresql/8.3/main) and see if the recovery.conf file has been renamed to recovery.done. If so, connect to the secondary database and check for the existence of some known changes made since the base backup. If all looks well, you must repeat the steps above, starting from creating the original base backup, in order to get the stand-by server into recovery mode again. Simply renaming recovery.done back to recovery.conf WILL NOT WORK, because once the server has been started, it creates a unique ‘instance’ or ‘timeline’, and it will not start a recovery process unless the timeline of the WAL files matches that of the recovering server. Hence the necessity of re-doing the base backup.
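A quick way to script that verification (PostgreSQL renames recovery.conf to recovery.done when recovery completes; the data directory path is the one used throughout this article):

```shell
#!/bin/sh
# After touching the trigger file, check whether the stand-by exited recovery.
DATADIR="/var/lib/postgresql/8.3/main"
if [ -f "$DATADIR/recovery.done" ]; then
    echo "stand-by has exited recovery"
else
    echo "stand-by is still recovering"
fi
```
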

I also recommend testing the automatic trigger file by changing the DNS with a NON PRODUCTION set of DNS names, to make sure that it works properly. What works for me, may not work for you.

There are a number of options for returning to the status quo after a complete fail-over scenario. You could reconfigure the two servers so that the secondary becomes the new primary. Or you could use the pg_dump and pg_restore tools to migrate your data back to the original primary, and then follow the steps in this article from the base backup onward to get replication going again.

Another method may be to simply shut down the original secondary server, make a tar of its data directory, extract it over top of the original primary server’s data directory (once repaired, of course), and then follow the steps to start up the secondary server in recovery mode again. I have not tested that method of recovery, but theoretically it should work, as both instances, or timelines, should be the same. Please try this last method at your own risk.

Disclaimer: Please note that this setup may not work for any/every environment, and I cannot and will not be held responsible if the use of this information results in data loss or damage. It is your responsibility to make backups and use precautions while working in a production environment with production data. The information here is provided as-is, and without warranty. That being said, I will be willing to help you out if you run into a snag, or if my directions aren’t quite clear enough.

Further Reading, related articles, and sources:
http://www.postgresql.org/docs/8.3/static/backup-file.html
http://www.postgresql.org/docs/8.3/static/continuous-archiving.html
http://www.postgresql.org/docs/8.3/static/warm-standby.html
http://www.postgresql.org/docs/8.3/static/pgstandby.html

  • http://www.kyledye.com Kyle

    Once again, friggin brilliant!

  • David Lee

    Nice article, thanks.
    It’s probably worth pointing out that pg_standby is available in the postgresql-contrib[-8.4] package – the first time I set this up I didn’t realize and went through a few unnecessary hoops to compile it from source.
    cheers,
    DL

  • SL

    Just wanted to say great job. Very informative and nicely written.

  • http://www.travishegner.com/about-me.html Travis Hegner

    Thanks for all of the good feedback!
    Be on the lookout as I have been exploring Slony-I a little, and I plan to write a tutorial on how to set that up.
    Thanks,
    Travis

  • http://iandouglas.com/ Ian Douglas

    Thanks for a great write-up. With a few small tweaks, I’ve successfully set up a replication node using your notes. Thanks for a very thorough write-up.
    A few notes:
- Debian 5.0 (Lenny) needs “postgresql-contrib-8.3” installed before the symlink from pg_wrapper to pg_standby will work.
    - PG v8.3 renames the recovery.conf symlink file to recovery.done
    - We’re seeing old logs hanging around for a two 10-minute cycles. When a new WAL shows up, we see console output like this:
    running restore : OK
removing “/pg_import/000000010000000000000022”
    Trigger file : /pg_import/role.master
    Waiting for WAL file : 000000010000000000000025
    WAL file path : /pg_import/000000010000000000000025
    Restoring to… : pg_xlog/RECOVERYXLOG
    Sleep interval : 60 seconds
    Max wait interval : 0 forever
Command for restore : ln -s -f “/pg_import/000000010000000000000025” “pg_xlog/RECOVERYXLOG”
    Keep archive history : 000000010000000000000024 and later
    … so it removed file 22 when 24 showed up, and updates its trigger to look for file 25, but file 23 and 24 hang around. I was concerned that I hadn’t done something correctly, but all appears well.

  • http://www.travishegner.com/about-me.html Travis Hegner

    Thanks Ian!
    I assume that in case of a busy server, it keeps the last couple of WAL logs laying around in case they haven’t been read in yet. If the number of writes hits 16MB before your time limit, it will send an extra WAL file to the warm standby.
    I have since moved from this method of replication to one based on Slony. I have been very pleased so far. Slony is a little bit harder to set up and maintain, but the fail-over is much smoother, and you can run backups/queries/reports from your slave, without impacting the master’s performance. I intend at some point to do a write up on it, but time is a scarce resource these days.
    Thanks for reading!
    Travis

  • Chris Row

    I was wondering if you ever got around to doing that write up on Slony. I’ve been working on implementing it myself on Debian systems, and while I can get it to mostly work I keep running into little snafus. Some due to the way we have to purge old data from the master database that would break Slony, and some on general setup.
I’m going over the docs again now, but thought if you had that write-up it might save me some trouble. If I can’t figure it out I may have to fall back to WAL shipping for safety’s sake.
    Thanks
    -Chris

  • Enzo

    “Unfortunately, there is no way to verify that the data is actually being replicated and recovered into the stand-by server, because during “recovery mode” the database is not operational.”
    You can tail the pg_standby.log file and it will show the files that have been applied. Unfortunately though, this is pretty crappy in a production environment because you need to monitor this constantly and alert if you are too far behind. If you have a lot of databases, you aren’t going to be sitting around tailing log files. There needs to be a better way to monitor this and alert if you are too many log files behind.

  • Abraão

Nice work. I would like to know if you have any tips about warm replication on Windows using Postgres 8.2.11-1. Material about it is very hard to find. Thanks very much.

  • Peggy

    I have been trying to set this up in an 8.4 environment. When I startup the standby server and look at the standby.log, the Waiting for WAL file lists the WAL file that is the next in sequence for the standby server rather than the next log in sequence from the production server. How do I get it to look for the right WAL file. The %f parameter is providing the file name for the standby server. Has anyone else encountered this problem?