MySQL Replication: Monitor the Replication system

Monitoring

As you may have come to expect, MySQL provides all the data necessary to answer those questions (and many questions you probably haven't even considered), but extracting and understanding the data is something it won't do for you. In Section 7.5.4, later in this chapter, we'll try to provide some details to help you make sense of all the data MySQL provides, which should help you understand the tools that are helpful in processing that data.

Master status

Using the SHOW MASTER STATUS command, the master will tell you about its replication status:

mysql> SHOW MASTER STATUS \G *************************** 1. row ***************************             File: binary-log.004         Position: 635904327     Binlog_do_db: Binlog_ignore_db: 1 row in set (0.00 sec)

The output includes the current binary log filename and the position (or offset) into the binary log where the next query will be written. The other two fields correspond to the binlog-do-db and binlog-ignore-db filtering options in the server's my.cnf file. If you are filtering binary log records on the master, one or both of these will list the database names affected by your filters.

You can also ask the master which binary logs still exist on disk:

mysql> SHOW MASTER LOGS; +----------------+ | Log_name       | +----------------+ | binary-log.001 | | binary-log.002 | | binary-log.003 | | binary-log.004 | +----------------+ 4 rows in set (0.02 sec)

But the output is quite limited. It would be helpful to know the sizes and ages of the files as well. MySQL is doing little more than reading and displaying the contents of the log-bin.index file. To get more information, you need to log on to the server and examine the files by hand.

Slave status

There is significantly more information available on the slave side of replication, mostly because the slaves have more information to keep track of. To start, the SHOW SLAVE STATUS command provides a good summary of the information from both the master.info and relay-log.info files:

mysql> SHOW SLAVE STATUS \G *************************** 1. row ***************************           Master_Host: master.example.com           Master_User: repl           Master_Port: 3306         Connect_retry: 15       Master_Log_File: binary-log.004   Read_Master_Log_Pos: 635904807        Relay_Log_File: relay-log.004         Relay_Log_Pos: 846096118 Relay_Master_Log_File: binary-log.004      Slave_IO_Running: Yes     Slave_SQL_Running: Yes       Replicate_do_db:   Replicate_ignore_db:            Last_errno: 0            Last_error:          Skip_counter: 0   Exec_master_log_pos: 635904807       Relay_log_space: 846096122 1 row in set (0.00 sec)

In addition, there is some other metadata in the output. The Last_errno and Last_error fields provide information about the most recent replication-related error, if any. The Relay_log_space tells you how much space the relay log is consuming.

The two most important fields are Slave_IO_Running and Slave_SQL_Running. They tell you if the IO and slave threads are running.

Replication heartbeat

Watching the values produced by SHOW MASTER STATUS and SHOW SLAVE STATUS can give you a rough idea of how up to date a slave is. The trouble with relying on that information is that you're only looking at bytes. You can determine how many more bytes of log data the slave needs to execute before it is current. However, that doesn't tell you how many queries need to be executed. There's no good way to figure that out, short of running the binary log through mysqlbinlog and counting queries.

It is possible to determine how out of date the slave is with some degree of accuracy by implementing a simple heartbeat system. The heartbeat principle is easy. At a fixed interval, say 20 seconds, a process on the master inserts a record with the latest timestamp into a table. On the slave, a corresponding process reads the most recent record every 20 seconds. Assuming that the system clocks on both machines are in sync, you can tell how far behind the slave is to within 20 seconds of accuracy.

See the write_heartbeat and read_heartbeat scripts in Section 7.5.4 for a sample implementation.

Log Rotation

Binary log files accumulate on the server until they are explicitly removed. The SHOW MASTER LOGS command tells you how many logs there are at any given time. To remove one or more logs, use the PURGE MASTER LOGS TO ... command. It removes all the logs up to but not including the given log name.

Here's an example:

mysql> SHOW MASTER LOGS; +----------------+ | Log_name       | +----------------+ | binary-log.001 | | binary-log.002 | | binary-log.003 | | binary-log.004 | +----------------+ 4 rows in set (0.02 sec)  mysql> PURGE MASTER LOGS TO 'binary-log.004';

The command tells MySQL to remove binary-log.001, binary-log.002, and binary-log.003. Be careful not to remove logs too quickly. If a slave is offline for a significant period of time, there's a chance that it still needs one or more of the logs you removed. If you're in doubt, run SHOW SLAVE STATUS on each slave to verify which log it is using.

To automate this process, see the purge_binary_logs script in Section 7.5.4.

Changing Masters

Sooner or later you'll need to point your slaves at a new master. Maybe the old one is being replaced with a newer, faster computer; perhaps there was a failure, and you are promoting a slave to master. In MySQL 3.23 and 4.0, you need to inform the slaves about their new master. A future version of MySQL is supposed to include a fail-safe replication feature that automates the process.

A planned changing of masters is a straightforward process. (In the event of a master failure, it may not be so easy.) You simply need to issue the CHANGE MASTER TO ... command on each slave. In doing so, you inform the slave of the new master's parameters—the same ones specified in the my.cnf file. The slave will begin replicating from its new master, and MySQL will also update the master.info with the new information.

Using the right values

As usual, the devil is in the details. How do you decide which values to use? What if you get them wrong?

First, let's consider the easy case. If you are in control of the situation, the process is easy. Follow these steps:

Disconnect all clients (not slaves) from the master.
Make sure the new master is completely caught up.
Execute RESET MASTER on the new master.
Make sure each slave is caught up.
Shut down the old master.
Let all clients connect to the new master.
Issue a CHANGE MASTER TO ... command on each slave, pointing it to the new master.

The RESET MASTER command tells the master to flush all its binary logs and start fresh. By starting with a clean slate on the new master, there's no guesswork involved in determining the right log position. Since it's a brand new log, we know the position is 4, because each binary log has a 4-byte header that consumes positions 0-3.

The complete CHANGE MASTER TO ... command looks like this:

mysql> CHANGE MASTER TO     -> MASTER_HOST='newmaster.example.com',     -> MASTER_USER='repl',     -> MASTER_PASSWORD='MySecret!',     -> MASTER_PORT=3306,     -> MASTER_LOG_FILE='log-bin.001',     -> MASTER_LOG_POS=4;

If, on the other hand, the master crashes and you can't bring it back online in a reasonable amount of time, things aren't so clear-cut. If you have only one slave, of course, there's no decision to make. You use the slave. But if you have multiple slaves, you need to determine which one is the most up to date.

By examining the output of SHOW SLAVE STATUS on each slave, you can easily determine which one is closest to matching the master at the time it crashed. Once you know the log name and position, you can construct a CHANGE MASTER TO ... command to run on the remaining slaves.

In doing so, however, you'll likely cause some the slaves to be slightly out of sync with their new master. To illustrate why, assume that each query is assigned an increasing unique ID number. The original master had just executed query 500 when it crashed. The "most up-to-date" slave, the new master, had executed query 496. That means that your best slave is missing four queries, and there's no way to recover them unless your application logs every query it writes, which is unlikely.

Now, let's assume that there are two more slaves, slave2 and slave3; slave2 executed query 490, and slave3 executed query 493. You have a choice. You can either point both slaves at the new master's current position (query 496) or you can try to figure the corresponding offsets for each slave in the new master's binary log. That will take more time, but it means you lose less data.

To find the matching log position for each slave, you need to have the binary log enabled on each slave. Use the mysqlbinlog command (described in Section 7.5.4) to locate the last query executed. Then locate exactly the same query in the new master's binary log. Once you find the query, you'll have the offset you need. The output of mysqlbinlog always includes the offset in a comment right before the query. For example:

$ mysqlbinlog log-bin.001 ... # at 683 #021103 18:36:33 server id 1  log_pos 683   Query  thread_id=288 exec_time=0  error_code=0 SET TIMESTAMP=1036377393; insert into test1 values (8);

The # at 683 line lists the position of the insert into test1 values (8) query in the log.

REF: http://dev.mysql.com/books/hpmysql-excerpts/ch07.html