Wednesday, 25 January 2017

PostgreSQL upgrades - Methodology

 Yesterday, i came across a conversation in postgres slack channel, someone was interested in upgrading his database and i realised that there is a pattern i see all the time. Those who are interested in upgrading major postgres versions, only ask questions about pg_upgrade. Don't get me wrong, pg_upgrade is awesome its well documented, and it explains things much better than i can. 
During my time in OmniTI we've done many... MANY upgrades some of them pretty challenging,
so i have come to realise that the upgrade tool is not as important as the plan and the methodology, especially during an upgrade or a migration. For this reason this post won't be about how to use pg_upgrade, instead it will be all about methodology, minimising downtime and reducing risk to almost zero.
For this post i will be showing the steps i would follow in order to upgrade an 8.4.22 to 9.6.1. I will be using docker containers using a dockerfile that i wrote for lab use (more info about this docker file can be found in this post) I will include a couple of "traps" that could potentially cause trouble to my upgrade just so i can show how to work around them.
A typical postgres installation would have one master and one or more slaves and maybe a dev/test server. 
For the needs of this post , i will concentrate on the master and on the test. As a matter of fact the test server will be much more important because thats where we will develop an upgrade plan.
So,  we need 2 containers :
An upgrade target host
An upgrade plan development server (aka a test machine)

So , lets assume that we have a master with the following objects that we want to upgrade :


postgres=# select version();

                                                          version
---------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 8.4.22 on x86_64-unknown-linux-gnu, compiled by GCC gcc (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609, 64-bit
(1 row)

postgres=# \dt

                 List of relations
 Schema |          Name          | Type  |  Owner
--------+------------------------+-------+----------
 public | another_small_table    | table | postgres
 public | big_reporting_table    | table | postgres
 public | really_important_table | table | postgres
 public | small_oltp_table       | table | postgres
(4 rows)

postgres=# \dv

              List of relations
 Schema |       Name       | Type |  Owner
--------+------------------+------+----------
 public | my_stat_activity | view | postgres
(1 row)

First order of business would be to create a replica (or to dump and restore if size permits) on the test server that should run on exactly the same postgres version. 

Next, you should check if the test server has the same configuration and if all contrib modules are installed (not calling them extensions because ... 8.4.x).
So compare postgresql.conf on these 2 servers and compare output of ls -l <postgres binary installation directory>/lib. 
In my case i didn't have pg_buffercache , so i installed it. 
At this point it is a good idea to write down the extensions you would need for 9.6, in my case just pg_buffercache.

To sum up, the steps are :


On the test server :



  1. Install exactly the same version as production
  2. Transfer the configs from production
  3. Install all the extensions that exist in production
  4. Create a replica from production
  5. Promote the replica to be a standalone master


At this point we should have a test environment that looks like production. From this moment on, everything should be timed and documented.


On the Test server



  1. Install the version you want to upgrade to, in my case 9.6.1 and don't put its binaries in the path (yet) while documenting and timing all the steps. DO NOT overwrite your old binaries.
  2. Port your configuration file to the new version, DO NOT copy the old config fileIts not just that some parameters have changed, or that some parameters are not compatible from an old to a new pg version, there are new settings commented out that are important and new defaults that you should tune before putting it to production.
  3. Initdb a cluster using 9.6 binaries and test your configuration files, remember to start it on a different port than 8.4.  Keep the config files somewhere safe ideally at the same place where you develop your migration plan
  4. Stop 8.4 and 9.6 clusters
  5. Delete the previously created (for config test) cluster and re-initialize a new $PGDATA directory using the 9.6 binary. This will be your upgraded $PGDATA directory
  6. Test pg_upgrade (9.6 binary) like this : 

pg_upgrade -b ~/pgsql8/bin/ -B ~/pgsql9/bin/ -d ~/pgdata8 -D ~/pgdata9 -c
Performing Consistency Checks
-----------------------------
Checking cluster versions                                   ok
Checking database user is the install user                  ok
Checking database connection settings                       ok
Checking for prepared transactions                          ok
Checking for reg* system OID user data types                ok
Checking for contrib/isn with bigint-passing mismatch       ok
Checking for roles starting with 'pg_'                      ok
Checking for invalid "line" user columns                    ok
Checking for large objects                                  ok
Checking for presence of required libraries                 ok
Checking database user is the install user                  ok
Checking for prepared transactions                          ok

*Clusters are compatible*


everything looks good,  lets try to upgrade :


time pg_upgrade -b ~/pgsql8/bin/ -B ~/pgsql9/bin/ -d ~/pgdata8 -D ~/pgdata9 -k

Performing Consistency Checks
-----------------------------
Checking cluster versions                                   ok
Checking database user is the install user                  ok
Checking database connection settings                       ok
Checking for prepared transactions                          ok
Checking for reg* system OID user data types                ok
Checking for contrib/isn with bigint-passing mismatch       ok
Checking for roles starting with 'pg_'                      ok
Checking for invalid "line" user columns                    ok
Checking for large objects                                  ok
Creating dump of global objects                             ok
Creating dump of database schemas
                                                            ok
Checking for presence of required libraries                 ok
Checking database user is the install user                  ok
Checking for prepared transactions                          ok

If pg_upgrade fails after this point, you must re-initdb the

new cluster before continuing.

Performing Upgrade

------------------
Analyzing all rows in the new cluster                       ok
Freezing all rows on the new cluster                        ok
Deleting files from new pg_clog                             ok
Copying old pg_clog to new server                           ok
Setting next transaction ID and epoch for new cluster       ok
Deleting files from new pg_multixact/offsets                ok
Setting oldest multixact ID on new cluster                  ok
Resetting WAL archives                                      ok
Setting frozenxid and minmxid counters in new cluster       ok
Restoring global objects in the new cluster                 ok
Restoring database schemas in the new cluster
  postgres
*failure*

Consult the last few lines of "pg_upgrade_dump_11564.log" for

the probable cause of the failure.
Failure, exiting

real 0m7.011s

user 0m0.000s
sys 0m0.050s

This is why we HAVE to test and time things in a test server. If we check the log we'll see :


pg_restore: [archiver (db)] Error from TOC entry 142; 1259 16390 VIEW my_stat_activity postgres

pg_restore: [archiver (db)] could not execute query: ERROR:  column pg_stat_activity.procpid does not exist
LINE 14: ...CT (now() - pg_stat_activity.query_start) AS age, pg_stat_ac...

This was an intended to happen error, just to show things can go wrong, and that you don't want these kinds of things to happen in production. Its just a view that uses columns from pg_stat_activity that were renamed later on. Same thing can happen from some old extensions (like tsearch) so one more reason to test.


In order to fix this i would get the view definition, i would adjust it to work in 9.6, i would test it and when i felt confident that the view works I would write a script that would look something like :

pg_ctl -D ~/pgdata8 start && psql -c "drop view my_stat_activity" postgres && pg_ctl -D ~/pgdata8 stop && rm -rf ~/pgdata9/* && ~/pgsql9/bin/initdb -D ~/pgdata9/ && time pg_upgrade -b ~/pgsql8/bin/ -B ~/pgsql/bin/ -d ~/pgdata -D ~/pgdata9 -k && cp ~/configs9/*.conf ~/pgdata9/ && ~/pgsql9/bin/pg_ctl -D ~/pgdata9/ start  && psql -c "create view my_stat_activity as select now()-query_start as age,pid,state, query from pg_stat_activity where query !='<IDLE>' order by 1 desc;" postgres && psql -c "create extension pg_buffercache" postgres

Don't pay too much attention to the code above since its not even tested, just keep a similar mindset that its a really good idea to automate / script everything, scripts are easy to test, easy to predict and easy to keep and change.

The steps for the code above would be :


  • start pg8
  • drop problematic view
  • stop pg8
  • empty pg9 data dir
  • initdb pg9 data dir
  • upgrade (this should be timed)
  • copy configs to pg9 config
  • start pg9
  • create view
  • create extensions


Upgrade is not done though. Next step would be to analyze the database and this can take a long time!

The last 2 steps would be to delete old cluster and analyze the new one (because optimizer statistics are not kept by pg_upgrade). pg_upgrade provides 2 scripts that could be used for these steps :

analyze_new_cluster.sh

delete_old_cluster.sh

delete_old_cluster.sh can be run now, but before you run analyze_new_cluster you could prioritise work and maybe even run analyze commands in parallel. My dummy database has the following tables :


                 List of relations

 Schema |          Name          | Type  |  Owner
--------+------------------------+-------+----------
 public | another_small_table    | table | postgres
 public | big_reporting_table    | table | postgres
 public | really_important_table | table | postgres
 public | small_oltp_table       | table | postgres

At this point i could say that i want really_important_table analyzed asap followed by another_small_table and small_oltp_table. big_reporting_table can be last or run on a different process. My suggestion is to prioritise important tables first by creating one or more scripts. 

Remember that we are still working on test server so test and time these analyze scripts. eg :

time psql -c "analyze really_important_table;" postgres

time psql -c "analyze small_oltp_table;" postgres
time psql -c "analyze another_small_table;" postgres
time psql -c "analyze big_reporting_table;" postgres

In my opinion some of the most important things in an upgrade are :


  • Minimising risk of losing data
  • Minimising downtime
  • Providing accurate estimations on downtime to whoever needs to know.
  • Providing accurate estimations on analyze. Assuming you'll use hard links (-k) This is the step that will take the most, and during the time that the database lacks statistics, performance will be bad. (READ HOW -k WORKS BEFORE USING IT)
  • Not having any surprises.

I can't emphasis enough on the importance of scripting, timing and testing any upgrade procedure. In my opinion it's more important that the production upgrade itself. Do these steps again and again till you are confident that your procedure :


  • Has been tested well enough
  • Has been documented well enough
  • Can't go wrong
  • Is so easy that you could even train your dog to do it.
This upgraded test database should now be used to thoroughly test the application against the new pg version.


When you start upgrading production, its a good idea to keep two things in mind


  • Keep a backup handy. If the database is large, this is not as easy as it sounds and restoration time can be a lot, if you can afford the safety of a backup though, do it !!
  • ALWAYS have a rollback plan. Make sure a replica of the master is up that has no replication lag, bring it down at the time of the upgrade and be ready to move the application there, just in case something goes south. Even if you don't have a replica in your stack, the upgrade is a good opportunity to get one, even temporarily until the upgrade is done. (protip: if you failover your application to a slave keep connections on your failed-to-upgrade-master restricted making sure no one will commit any transaction).


Thanks for reading.

-- Vasilis

No comments:

Post a Comment