Measure Replication Delay with Percona Toolkit Heartbeat (pt-heartbeat)

Replication lag occurs when the slave cannot keep up with the changes being made on the master instance. The reasons for a lag can be network congestion, poor performance on the slave or service interruptions of the slave mechanism, leading to a large pile of work to be catched up by the slave. When you use SHOW SLAVE STATUS you can monitor the current delay, which is indicated in the Seconds_Behind_Master: 0column. Ideally, this value is zero, but the smallest granularity you get for this delay counter is one second.

Percona offers a tool that allows measuring the delay more accurately, by writing timestamps in the master database and calculate the difference when they arrive at the slave: pt-heartbeat

On the Master

Create a dedicated database schema called percona and add the following table. 

CREATE TABLE IF NOT EXISTS `percona`.`heartbeat` (
    ts                    varchar(26) NOT NULL,
    server_id             int unsigned NOT NULL PRIMARY KEY,
    file                  varchar(255) DEFAULT NULL,    -- SHOW MASTER STATUS
    position              bigint unsigned DEFAULT NULL, -- SHOW MASTER STATUS
    relay_master_log_file varchar(255) DEFAULT NULL,    -- SHOW SLAVE STATUS
    exec_master_log_pos   bigint unsigned DEFAULT NULL  -- SHOW SLAVE STATUS
);

Then you can launch the script on the master, where it will write new timestamps periodically as long as the script is running.

pt-heartbeat -D percona --update --utc h=127.0.0.1,u=root,p=SECRET,P=3306

The -D flag specifies the database schema where the table for the heartbeat data is stored. The –update command is needed on the master for updating the table and the last part is the DSN, which specifies host address, user name and password. Very important is the UTC flag –utc, that ensures that the timestamps will be interpreted as UTC, regardless of the timezone setting. 

On the Slave

Create a user for reading the replicated heartbeat table like this:

GRANT REPLICATION SLAVE,PROCESS,SELECT ON percona.* TO `percona_checksum`@'10.20.30.40' IDENTIFIED BY 'SECRET';

Then you can run the script and point it to the slave. It will output precise delay counts in fractions of seconds

pt-heartbeat h=127.0.0.1,u=percona_checksum,p=SECRET,P=3306 -D percona --monitor --utc --master-server-id 1

Notice the different DSN, the –monitor flag and the master-server id, which needs to be the one of your master of course. You need this because the tool supports hierarchies of masters and therefore you would need to know which one is to be considered.

Results

The results will look similar to this

0.09s [  0.00s,  0.00s,  0.00s ]
0.02s [  0.20s,  0.00s,  0.00s ]
0.09s [  0.00s,  0.00s,  0.00s ]
0.03s [  0.02s,  0.00s,  0.00s ]
0.09s [  0.01s,  0.00s,  0.00s ]
0.09s [  0.01s,  0.00s,  0.00s ]
0.09s [  0.01s,  0.00s,  0.00s ]
0.08s [  0.01s,  0.00s,  0.00s ]
0.08s [  0.01s,  0.00s,  0.00s ]
0.10s [  0.01s,  0.00s,  0.00s ]
0.12s [  0.02s,  0.00s,  0.00s ]
0.08s [  0.01s,  0.00s,  0.00s ]
0.11s [  0.02s,  0.00s,  0.00s ]
0.08s [  0.02s,  0.00s,  0.00s ]
0.09s [  0.02s,  0.00s,  0.00s ]
0.08s [  0.02s,  0.00s,  0.00s ]
0.08s [  0.03s,  0.01s,  0.00s ]

The output is the current delay followed by moving averages over 1m, 5m and 15m, as you might know from your favorite OS already.

Have a look at the official documentation, there is a lot of options available. 

 

 

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.