pgreplay-ing logfiles

Recently I wanted to find something to rerun a workload (from logfiles) in postgres. the reasons ? benchmark , testing , troubleshooting.
My colleague Denish reminded me a project called pgreplay. pgreplay reads a PostgreSQL logfile , extracts the SQL statements and executes them in the same order and with (or without) the original timing against a PG database.

I download it, compiled it and went straight to the testing.

The setup :

log_min_messages = error (or more)
   (if you know that you have no cancel requests, 'log' will do)
log_min_error_statement = log (or more)
log_connections = on
log_disconnections = on
log_line_prefix = '%m|%u|%d|%c|' (if you don't use CSV logging)
log_statement = 'all'
lc_messages must be set to English (the encoding does not matter)
bytea_output = escape (from version 9.0 on, only if you want to replay
                        the log on 8.4 or earlier)

It is highly recommended that you use CSV logging, because anything that
the PostgreSQL server or any loaded modules write to standard error will
be written to the stderr log and might confuse the parser.

For small and simple workloads it seemed that it worked fine. Bellow the output from a simple workload :

Replay statistics
=================

Speed factor for replay: 1.000
Total run time: 29.700 seconds
Maximum lag behind schedule: 0 seconds
Calls to the server: 8
(0.269 calls per second)
Total number of connections: 1
Maximum number of concurrent connections: 1
Average number of concurrent connections: 1.000
Average session idle percentage: 99.680%
SQL statements executed: 6
(0 or 0.000% of these completed with error)
Maximum number of concurrent SQL statements: 1
Average number of concurrent SQL statements: 0.003
Average SQL statement duration: 0.016 seconds
Maximum SQL statement duration: 0.038 seconds
Statement duration histogram:
0 to 0.02 seconds: 83.333%
0.02 to 0.1 seconds: 16.667%
0.1 to 0.5 seconds: 0.000%
0.5 to 2 seconds: 0.000%
over 2 seconds: 0.000%

So far so good, i wanted to see what will happen when concurrency occurred so i tried with pgbench , with 2 concurrent everything was ok, so i tried to scale it up a bit to 16 concurrent transactions, and then these nice messages started showing :
"Execution is 30 minutes behind schedule"

The reason ? Locks Lots and lots of Locks ! but let's see closer what happened...
While pgreplay was running trying to handle my logfile the best way it could, a simple lock investigating query was enough to show me that i would never get the successful prompt that i wanted :

monkey=#   SELECT bl.pid     AS blocked_pid,
monkey-#          a.usename AS blocked_user,
monkey-#          kl.pid     AS blocking_pid,
monkey-#          ka.usename AS blocking_user,
monkey-#          a.query    AS blocked_statement
monkey-#    FROM pg_catalog.pg_locks         bl
monkey-#     JOIN pg_catalog.pg_stat_activity a ON a.pid = bl.pid
monkey-#     JOIN pg_catalog.pg_locks         kl ON kl.transactionid = bl.transactionid AND kl.pid != bl.pid
monkey-#     JOIN pg_catalog.pg_stat_activity ka ON ka.pid = kl.pid
monkey-#    WHERE NOT bl.granted;
blocked_pid | blocked_user | blocking_pid | blocking_user |                          blocked_statement
-------------+--------------+--------------+---------------+----------------------------------------------------------------------
        7398 | vasilis      |         7403 | vasilis       | UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2;
        7398 | vasilis      |         7399 | vasilis       | UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2;
        7398 | vasilis      |         7404 | vasilis       | UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2;
        7398 | vasilis      |         7402 | vasilis       | UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2;
        7402 | vasilis      |         7398 | vasilis       | UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2;
        7402 | vasilis      |         7403 | vasilis       | UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2;
        7402 | vasilis      |         7399 | vasilis       | UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2;
        7402 | vasilis      |         7404 | vasilis       | UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2;
        7403 | vasilis      |         7398 | vasilis       | UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2;
        7403 | vasilis      |         7399 | vasilis       | UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2;
        7403 | vasilis      |         7404 | vasilis       | UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2;
        7403 | vasilis      |         7402 | vasilis       | UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2;
        7404 | vasilis      |         7398 | vasilis       | UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2;
        7404 | vasilis      |         7403 | vasilis       | UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2;
        7404 | vasilis      |         7399 | vasilis       | UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2;
        7404 | vasilis      |         7402 | vasilis       | UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2;
        7409 | vasilis      |         7404 | vasilis       | UPDATE pgbench_tellers SET tbalance = tbalance + $1 WHERE tid = $2;

Deadlock, the reason ?
(quote from pgreplay documentation) :

While pgreplay makes sure that commands are sent to the server in the order in which they were originally executed, there is no way to guarantee that they will be executed in the same order during replay: Network delay, processor contention and other factors may cause a later command to "overtake" an earlier one. While this does not matter if the commands don't affect each other, it can lead to SQL statements hitting locks unexpectedly, causing replay to deadlock and "hang". This is particularly likely if many different sessions change the same data repeatedly in short intervals.

And that was my case, lots of processes were changing the same data.
For the record, i also set up a simple troubleshooting project that works on postgres, with 2 users doing their own thing pgreplay worked fine.

conclusion :
Ok , pgreplay didn't do the job for me, i am pretty sure that if i got the logs from a production DB i would run into deadlocks, still messing a bit with it was good because i could use this for replaying simple workload logfiles to a different machine or just because its a cool project that does what it does and you never know when you might need something like this..

Evol Monkey

pgreplay-ing logfiles

Comments

Post a Comment

Popular posts from this blog

Tuning checkpoints

Accessing PostgreSQL data from AWS Lambda

AWS Aurora Postgres, not a great first impression