Chapter 4. The Binary Log
The previous chapter included a very brief introduction to the binary log. In this chapter, we will fill in more details and give a more thorough description of the binary log structure, the replication event format, and how to use the mysqlbinlog tool to investigate and work with the contents of binary logs.
The binary log records changes made to the database. It is usually used for replication, and the binary log then allows the same changes to be made on the slaves as well. Because the binary log normally keeps a record of all changes, you can also use it for auditing purposes to see what happened in the database, and for point-in-time recovery (PITR) by playing back the binary log to a server, repeating changes that were recorded in the binary log. (This is what we did in Reporting, where we played back all changes done between 23:55:00 and 23:59:59.)
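For example, a point-in-time recovery over a given window can be performed by piping the output of mysqlbinlog back into a server. The following is a minimal sketch; the date, the times, and the binlog filename are placeholders that you would replace with your own:

$ mysqlbinlog --start-datetime="2012-11-14 23:55:00" \
              --stop-datetime="2012-11-14 23:59:59" \
              master-bin.000001 | mysql --user=root --password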
The binary log contains only statements that could change the database. Note that statements that could potentially change the database are also logged, even if they don’t actually change the database. The most notable cases are statements that optionally make a change, such as DROP TABLE IF EXISTS or CREATE TABLE IF NOT EXISTS, along with statements such as DELETE and UPDATE that have WHERE conditions that don’t happen to match any rows on the master. SELECT statements are not normally logged because they do not make any changes to any database. There are, however, exceptions.
The binary log records each transaction in the order that the commit took place on the master. Although transactions may be interleaved on the master, each appears as an uninterrupted sequence in the binary log, the order determined by the time of the commit.
Structure of the Binary Log
Conceptually, the binary log is a sequence of binary log events (also called binlog events or even just events when there is no risk of confusion). As you saw in Chapter 3, the binary log actually consists of several files, as shown in Figure 4-1, that together form the binary log.
The actual events are stored in a series of files called binlog files with names in the form host-bin.000001, accompanied by a binlog index file, usually named host-bin.index, that keeps track of the existing binlog files. The binlog file that is currently being written to by the server is called the active binlog file. If no slaves are lagging, this is also the file that is being read by the slaves. The names of the binlog files and the binlog index file can be controlled using the log-bin and log-bin-index options, which you are familiar with from Configuring the Master. The options are covered in more detail later in this chapter.
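As a reminder, a minimal sketch of these options in my.cnf might look as follows. The base name master-bin is just an example; if log-bin-index is omitted, the index file name defaults to the base name with an .index suffix:

[mysqld]
log-bin=master-bin
log-bin-index=master-bin.index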
The index file keeps track of all the binlog files used by the
server so that the server can correctly create new binlog files when
necessary, even after server restarts. Each line in the index file
contains the name of a binlog file that is part of the binary log.
Depending on the MySQL version, it can either be the full name or a name
relative to the data directory. Commands that affect the binlog files,
such as PURGE BINARY LOGS, RESET MASTER, and FLUSH LOGS, also affect the index file by adding or removing lines to match the files that were added or removed by the command.
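For example, you can list the binlog files tracked by the index file and then purge the oldest ones; the filename given to PURGE BINARY LOGS below is just an illustration:

SHOW BINARY LOGS;
PURGE BINARY LOGS TO 'master-bin.000002';

The PURGE BINARY LOGS command removes the named files from disk as well as the corresponding lines from the index file.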
As shown in Figure 4-1, each binlog file is made up of binlog events, with the Format_description event serving as the file’s header and the Rotate event as its footer. Note that a binlog file might not end with a rotate event if the server was interrupted or crashed.
The Format_description event contains information about the server that wrote the binlog file as well as some critical information about the file’s status. If the server is stopped and restarted, it creates a new binlog file and writes a new Format_description event to it. This is necessary because changes can potentially occur between bringing a server down and bringing it up again. For example, the server could be upgraded, in which case a new Format_description event would have to be written.
When the server has finished writing a binlog file, a Rotate event is added to end the file. The event points to the next binlog file in sequence by giving the name of the file as well as the position to start reading from. The Format_description event and the Rotate event will be described in detail in the next section.
With the exception of control events (e.g., Format_description, Rotate, and Incident), events of a binlog file are grouped into units called groups, as seen in Figure 4-2. In transactional storage engines, each group is roughly equivalent to a transaction, but for nontransactional storage engines or statements that cannot be part of a transaction, such as CREATE or ALTER, each statement is a group by itself.[4]
In short, each group of events in the binlog file contains either a single statement not in a transaction or a transaction consisting of several statements.
Each group is executed entirely or not at all (with the exception of a few well-defined cases). If, for some reason, the slave stops in the middle of a group, replication will start from the beginning of the group and not from the last statement executed. Chapter 8 describes in detail how the slave executes events.
Binlog Event Structure
MySQL 5.0 introduced a new binlog format: binlog format 4. The preceding formats were not easy to extend with additional fields if the need were to arise, so binlog format 4 was designed specifically to be extensible. This is still the event format used in every server version since 5.0, even though each version of the server has extended the binlog format with new events and some events with new fields. Binlog format 4 is the event format described in this chapter.
Each binlog event consists of four parts:
- Common header
The common header is—as the name suggests—common to all events in the binlog file.
The common header contains basic information about the event, the most important fields being the event type and the size of the event.
- Post header
The post header is specific to each event type; in other words, each event type stores different information in this field. The size of this header, just as with the common header, is the same throughout a given binlog file. The size of each event type’s post header is given by the Format_description event.
- Event body
After the headers comes the event body, which is the variable-sized part of the event. The size and the end position are listed in the common header for the event. The event body stores the main data of the event, which is different for different event types. For the Query event, for instance, the body stores the query, and for the User_var event, the body stores the name and value of a user variable that was just set by a statement.
- Checksum
Starting with MySQL 5.6, there is a checksum at the end of the event, if the server is configured to generate one. The checksum is a 32-bit integer that is used to check that the event has not been corrupted since it was written.
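If you want to see these parts in an actual file, mysqlbinlog can print a byte-level breakdown of each event alongside the decoded output. A minimal sketch, where the filename is a placeholder:

$ mysqlbinlog --hexdump master-bin.000001

The hex dump annotates the common header fields, such as the timestamp, event type, server ID, event size, and end position, for each event.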
A complete listing of the formats of all events is beyond the scope of this book, but because the Format_description and Rotate events are critical to how the other events are interpreted, we will briefly cover them here. If you are interested in the details of the events, you can find them in the MySQL Internals Manual.
As already noted, the Format_description event starts every binlog file and contains common information about the events in the file. The result is that the Format_description event can be different between different files; this typically occurs when a server is upgraded and restarted. The Format_description_log_event contains the following fields:
- Binlog file format version
This is the version of the binlog file, which should not be confused with the version of the server. MySQL versions 3.23, 4.0, and 4.1 use version 3 of the binary log, while MySQL versions 5.0 and later use version 4 of the binary log.
The binlog file format version changes when developers make significant changes in the overall structure of the file or the events. In MySQL version 5.0, the start event for a binlog file was changed to use a different format and the common headers for all events were also changed, which prompted the change in the binlog file format version.
- Server version
This is a version string denoting the server that created the file. This includes the version of the server as well as additional information if special builds are made. The format is normally the three-position version number, followed by a hyphen and any additional build options. For example, “5.5.10-debug-log” means debug build version 5.5.10 of the server.
- Common header length
This field stores the length of the common header. Because it’s here in the Format_description event, this length can be different for different binlog files. This holds for all events except the Format_description and Rotate events, whose common header lengths cannot vary.
- Post-header lengths
The post-header length for each event is fixed within a binlog file, and this field stores an array of the post-header length for each event that can occur in the binlog file. Because the number of event types can vary between servers, the number of different event types that the server can produce is stored before this field.
Note
The Rotate and Format_description log events have a fixed length because the server needs them before it knows the size of the common header. When connecting to the server, the first event sent is a Format_description event. Because the length of the common header is stored in the Format_description event, there is no way for the server to know what the size of the common header is for the Rotate event unless it has a fixed size. So for these two events, the size of the common header is fixed and will never change between server versions, even if the size of the common header changes for other events.
Because both the size of the common header and the size of the post header for each event type are given in the Format_description event, extending the format with new events, or even increasing the size of the post headers by adding new fields, is supported by this format and will therefore not require a change in the binlog file format.
With each extension, particular care is taken to ensure that the extension does not affect interpretation of events that were already in earlier versions. For example, the common header can be extended with an additional field to indicate that the event is compressed and the type of compression used, but if this field is missing—which would be the case if a slave is reading events from an old master—the server should still be able to fall back on its default behavior.
Event Checksums
Because hardware can fail and software can contain bugs, it is necessary to have some way to ensure that data corrupted by such events is not applied on the slave. Random failures can occur anywhere, and if they occur inside a statement, they often lead to a syntax error causing the slave to stop. However, relying on this to prevent corrupt events from being replicated is a poor way to ensure integrity of events in the binary log. This policy would not catch many types of corruptions, such as in timestamps, nor would it work for row-based events where the data is encoded in binary form and random corruptions are more likely to lead to incorrect data.
To ensure the integrity of each event, MySQL 5.6 introduced replication event checksums. When events are written, a checksum is added, and when the events are read, the checksum is computed for the event and compared against the checksum written with the event. If the checksums do not match, execution can be aborted before any attempt is made to apply the event on the slave. The computation of checksums can potentially impact performance, but benchmarking has demonstrated no noticeable performance degradation from the addition and checking of checksums, so they are enabled by default in MySQL 5.6. They can, however, be turned off if necessary.
In MySQL 5.6, checksums can be generated when changes are written either to the binary log or to the relay log, and verified when reading events back from one of these logs.
Replication event checksums are controlled using three options:
binlog-checksum=type
This option enables checksums and tells the server what checksum computation to use. Currently there are only two choices: CRC32 uses ISO-3309 CRC-32 checksums, whereas NONE turns off checksumming. The default is CRC32, meaning that checksums are generated.
master-verify-checksum=boolean
This option controls whether the master verifies the checksum when reading it from the binary log. This means that the event checksum is verified when it is read from the binary log by the dump thread (see Replication Architecture Basics), but before it is sent out, and also when using SHOW BINLOG EVENTS. If any of the events shown is corrupt, the command will throw an error. This option is off by default.
slave-sql-verify-checksum=boolean
This option controls whether the slave verifies the event checksum after reading it from the relay log and before applying it to the slave database. This option is off by default.
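Putting these together, a configuration that both generates checksums and verifies them on the master and the slave might look like the following sketch. Since CRC32 is already the default in MySQL 5.6, the first line is shown only for clarity:

[mysqld]
binlog-checksum=CRC32
master-verify-checksum=1
slave-sql-verify-checksum=1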
If you get a corrupt binary log or relay log, mysqlbinlog can be used to find the bad checksum using the --verify-binlog-checksum option. This option causes mysqlbinlog to verify the checksum of each event read and stop when a corrupt event is found, as the following example demonstrates:
$ client/mysqlbinlog --verify-binlog-checksum master-bin.000001
.
.
.
# at 261
#110406 8:35:28 server id 1 end_log_pos 333 CRC32 0xed927ef2...
SET TIMESTAMP=1302071728/*!*/;
BEGIN
/*!*/;
# at 333
#110406 8:35:28 server id 1 end_log_pos 365 CRC32 0x01ed254d Intvar
SET INSERT_ID=1/*!*/;
ERROR: Error in Log_event::read_log_event(): 'Event crc check failed!...
DELIMITER ;
# End of log file
ROLLBACK /* added by mysqlbinlog */;
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
Logging Statements
Starting with MySQL 5.1, row-based replication is also available, which will be covered in Row-Based Replication.
In statement-based replication, the actual executed statement is written to the binary log together with some execution information, and the statement is re-executed on the slave. Because not all events can be logged as statements, there are some exceptions that you should be aware of. This section will describe the process of logging individual statements as well as the important caveats.
Because the binary log is a common resource—all threads write to it—it is critical to prevent two threads from updating the binary log at the same time. To handle this, a lock for the binary log—called the LOCK_log mutex—is acquired just before each group is written to the binary log and released just after the group has been written. Because all session threads for the server can potentially log transactions to the binary log, it is quite common for several session threads to block on this lock.
Logging Data Manipulation Language Statements
Data manipulation language (DML) statements are usually DELETE, INSERT, and UPDATE statements. To support logging changes in a consistent manner, MySQL writes the binary log while transaction-level locks are held, and releases them after the binary log has been written.
To ensure that the binary log is updated consistently with the
tables that the statement modifies, each statement is logged to the
binary log during statement commit, just before the table locks are
released. If the logging were not made as part of the statement, another
statement could be “injected” between the changes that the statement
introduces to the database and the logging of the statement to the
binary log. This would mean that the statements would be logged in a
different order than the order in which they took effect in the
database, which could lead to inconsistencies between master and slave.
For instance, an UPDATE statement with a WHERE clause could update different rows on the slave because the values in those rows might be different if the statement order changed.
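To make this concrete, consider a hedged sketch using a hypothetical account table. In the order shown below, a row with balance 400 ends up as 1300; if the two statements were logged and replayed in the opposite order, it would end up as 1800 instead:

UPDATE account SET balance = balance * 2 WHERE balance < 1000;
UPDATE account SET balance = balance + 500 WHERE balance < 1000;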
Logging Data Definition Language Statements
Data definition language (DDL) statements affect a schema; examples are CREATE TABLE and ALTER TABLE statements. These create or change objects in the filesystem—for example, table definitions are stored in .frm files and databases are represented as filesystem directories—so the server keeps information about these available in internal data structures. To protect the update of the internal data structure, it is necessary to acquire an internal lock (called LOCK_open) before altering the table definition.
Because a single lock is used to protect these data structures, the creation, alteration, and destruction of database objects can be a considerable source of performance problems. This includes the creation and destruction of temporary tables, which is quite common as a technique to create an intermediate result set to perform computations on.
If you are creating and destroying a lot of temporary tables, it is often possible to boost performance by reducing the creation (and subsequent destruction) of temporary tables.
Logging Queries
For statement-based replication, the most common binlog event is the Query event, which is used to write a statement executed on the master to the binary log. In addition to the actual statement executed, the event contains some additional information necessary to execute the statement.
Recall that the binary log can be used for many purposes and contains statements in a potentially different order from that in which they were executed on the master. In some cases, part of the binary log may be played back to a server to perform PITR, and in some cases, replication may start in the middle of a sequence of events because a backup has been restored on a slave before starting replication.
In all these cases, the events are executing in different contexts (i.e., there is information that is implicit when the server executes the statement but that has to be known to execute the statement correctly). Examples include:
- Current database
If the statement refers to a table, function, or procedure without qualifying it with the database, the current database is implicit for the statement.
- Value of user-defined variable
If a statement refers to a user-defined variable, the value of the variable is implicit for the statement.
- Seed for the RAND function
The RAND function is based on a pseudorandom number function, meaning that it can generate a sequence of numbers that are reproducible but appear random in the sense that they are evenly distributed. The function is not really random, but starts from a seed number and applies a pseudorandom function to generate a deterministic sequence of numbers. This means that given the same seed, the RAND function will always return the same number. However, this makes the seed implicit for the statement.
- The current time
Obviously, the time the statement started executing is implicit. Having a correct time is important when calling functions that are dependent on the current time—such as NOW and UNIX_TIMESTAMP—because otherwise they will return different results if there is a delay between the statement execution on the master and on the slave.
- Value used when inserting into an AUTO_INCREMENT column
If a statement inserts a row into a table with a column defined with the AUTO_INCREMENT attribute, the value used for that row is implicit for the statement because it depends on the rows inserted before it.
- Value returned by a call to LAST_INSERT_ID
If the LAST_INSERT_ID function is used in a statement, it depends on the value inserted by a previous statement, which makes this value implicit for the statement.
- Thread ID
For some statements, the thread ID is implicit. For example, if the statement refers to a temporary table or uses the CONNECTION_ID function, the thread ID is implicit for the statement.
Because the context for executing the statements cannot be known when they’re replayed—either on a slave or on the master after a crash and restart—it is necessary to make the implicit information explicit by adding it to the binary log. This is done in slightly different ways for different kinds of information.
In addition to the previous list, some information is implicit to the execution of triggers and stored routines, but we will cover that separately in Triggers, Events, and Stored Routines.
Let’s consider each of the cases of implicit information individually, demonstrate the problem with each one, and examine how the server handles it.
Current database
The log records the current database by adding it to a special field of the Query event. This field also exists for the events used to handle the LOAD DATA INFILE statement, discussed in LOAD DATA INFILE Statements, so the description here applies to that statement as well. The current database also plays an important role in filtering on the database and is described later in this chapter.
Current time
Five functions use the current time to compute their values: NOW, CURDATE, CURTIME, UNIX_TIMESTAMP, and SYSDATE. The first four functions return a value based on the time when the statement started to execute. In contrast, SYSDATE returns the time when the function is executed. The difference can best be demonstrated by comparing the execution of NOW and SYSDATE with an intermediate sleep:
mysql> SELECT SYSDATE(), NOW(), SLEEP(2), SYSDATE(), NOW()\G
*************************** 1. row ***************************
SYSDATE(): 2013-06-08 23:24:08
NOW(): 2013-06-08 23:24:08
SLEEP(2): 0
SYSDATE(): 2013-06-08 23:24:10
NOW(): 2013-06-08 23:24:08
1 row in set (2.00 sec)
Both functions are evaluated when they are encountered, but NOW returns the time that the statement started executing, whereas SYSDATE returns the time when the function was executed.
To handle these time functions correctly, the timestamp indicating when the event started executing is stored in the event. This value is then copied from the event to the slave execution thread and used as if it were the time the event started executing when computing the value of the time functions.
Because SYSDATE gets the time from the operating system directly, it is not safe for statement-based replication and will return different values on the master and slave when executed. So unless you really want to have the actual time inserted into your tables, it is prudent to stay away from this function.
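If an application depends on SYSDATE, one workaround worth knowing about is the sysdate-is-now server option, which makes SYSDATE an alias for NOW and therefore replication-safe. Note that this is a server-wide setting, so it changes the semantics of SYSDATE for every session:

[mysqld]
sysdate-is-now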
Context events
Some implicit information is associated with statements that meet certain conditions:
- If the statement contains a reference to a user-defined variable (as in Example 4-1), it is necessary to add the value of the user-defined variable to the binary log.
- If the statement contains a call to the RAND function, it is necessary to add the pseudorandom seed to the binary log.
- If the statement contains a call to the LAST_INSERT_ID function, it is necessary to add the last inserted ID to the binary log.
- If the statement performs an insert into a table with an AUTO_INCREMENT column, it is necessary to add the value that was used for the column (or columns) to the binary log.
In each of these cases, one or more context events are added to the binary log before the event containing the query is written. Because there can be several context events preceding a Query event, the binary log can handle multiple user-defined variables together with the RAND function, or (almost) any combination of the previously listed conditions. The binary log stores the necessary context information through the following events:
User_var
Each such event records the name and value of a single user-defined variable.
Rand
Records the random number seed used by the RAND function. The seed is fetched internally from the session’s state.
Intvar
If the statement is inserting into an autoincrement column, this event records the value of the internal autoincrement counter for the table before the statement starts. If the statement contains a call to LAST_INSERT_ID, this event records the value that this function returned in the statement.
Example 4-2 shows some statements that generate all of the context events and how the events appear when displayed using SHOW BINLOG EVENTS. Note that there can be several context events before each statement.
master> SET @foo = 12;
Query OK, 0 rows affected (0.00 sec)

master> SET @bar = 'Smoothnoodlemaps';
Query OK, 0 rows affected (0.00 sec)

master> INSERT INTO t1(b,c) VALUES
    -> (@foo,@bar), (RAND(), 'random');
Query OK, 2 rows affected (0.00 sec)
Records: 2  Duplicates: 0  Warnings: 0

master> INSERT INTO t1(b) VALUES (LAST_INSERT_ID());
Query OK, 1 row affected (0.00 sec)

master> SHOW BINLOG EVENTS FROM 238\G
*************************** 1. row ***************************
   Log_name: mysqld1-bin.000001
        Pos: 238
 Event_type: Query
  Server_id: 1
End_log_pos: 306
       Info: BEGIN
*************************** 2. row ***************************
   Log_name: mysqld1-bin.000001
        Pos: 306
 Event_type: Intvar
  Server_id: 1
End_log_pos: 334
       Info: INSERT_ID=1
*************************** 3. row ***************************
   Log_name: mysqld1-bin.000001
        Pos: 334
 Event_type: RAND
  Server_id: 1
End_log_pos: 369
       Info: rand_seed1=952494611,rand_seed2=949641547
*************************** 4. row ***************************
   Log_name: mysqld1-bin.000001
        Pos: 369
 Event_type: User var
  Server_id: 1
End_log_pos: 413
       Info: @`foo`=12
*************************** 5. row ***************************
   Log_name: mysqld1-bin.000001
        Pos: 413
 Event_type: User var
  Server_id: 1
End_log_pos: 465
       Info: @`bar`=_utf8 0x536D6F6F74686E6F6F6...
*************************** 6. row ***************************
   Log_name: mysqld1-bin.000001
        Pos: 465
 Event_type: Query
  Server_id: 1
End_log_pos: 586
       Info: use `test`; INSERT INTO t1(b,c) VALUES (@foo,@bar)...
*************************** 7. row ***************************
   Log_name: mysqld1-bin.000001
        Pos: 586
 Event_type: Xid
  Server_id: 1
End_log_pos: 613
       Info: COMMIT /* xid=44 */
*************************** 8. row ***************************
   Log_name: mysqld1-bin.000001
        Pos: 613
 Event_type: Query
  Server_id: 1
End_log_pos: 681
       Info: BEGIN
*************************** 9. row ***************************
   Log_name: mysqld1-bin.000001
        Pos: 681
 Event_type: Intvar
  Server_id: 1
End_log_pos: 709
       Info: LAST_INSERT_ID=1
*************************** 10. row ***************************
   Log_name: mysqld1-bin.000001
        Pos: 709
 Event_type: Intvar
  Server_id: 1
End_log_pos: 737
       Info: INSERT_ID=3
*************************** 11. row ***************************
   Log_name: mysqld1-bin.000001
        Pos: 737
 Event_type: Query
  Server_id: 1
End_log_pos: 843
       Info: use `test`; INSERT INTO t1(b) VALUES (LAST_INSERT_ID())
*************************** 12. row ***************************
   Log_name: mysqld1-bin.000001
        Pos: 843
 Event_type: Xid
  Server_id: 1
End_log_pos: 870
       Info: COMMIT /* xid=45 */
12 rows in set (0.00 sec)
Thread ID
The last implicit piece of information that the binary log sometimes needs is the thread ID of the MySQL session handling the statement. The thread ID is necessary when a function is dependent on the thread ID—such as when it refers to CONNECTION_ID—but most importantly for handling temporary tables.
Temporary tables are specific to each thread, meaning that two temporary tables with the same name are allowed to coexist, provided they are defined in different sessions. Temporary tables can provide an effective means to improve the performance of certain operations, but they require special handling to work with the binary log.
Internally in the server, temporary tables are handled by creating obscure names for storing the table definitions. The names are based on the process ID of the server, the thread ID that creates the table, and a thread-specific counter to distinguish between different instances of the table from the same thread. This naming scheme allows tables from different threads to be distinguished from each other, but each statement can access its proper table only if the thread ID is stored in the binary log.
Similar to how the current database is handled in the binary log, the thread ID is stored as a separate field in every Query event and can therefore be used to compute thread-specific data and handle temporary tables correctly.
When writing the Query event, the thread ID to store in the event is read from the pseudo_thread_id server variable. This means that it can be set before executing a statement, but only if you have SUPER privileges. This server variable is intended to be used by mysqlbinlog to emit statements correctly and should not normally be used.
For a statement that contains a call to the CONNECTION_ID function or that uses or creates a temporary table, the Query event is marked as thread-specific in the binary log. Because the thread ID is always present in the Query event, this flag is not strictly necessary, but it is mainly used to allow mysqlbinlog to avoid printing unnecessary assignments to the pseudo_thread_id variable.
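When mysqlbinlog prints a thread-specific statement, it precedes it with an assignment to this variable, similar to the following sketch. The thread ID value 4 and the temporary table statement are made up for illustration:

SET @@session.pseudo_thread_id=4/*!*/;
CREATE TEMPORARY TABLE tmp_ids (id INT)
/*!*/;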
LOAD DATA INFILE Statements
The LOAD DATA INFILE statement makes it easy to fill a table quickly from a file. Unfortunately, it is dependent on a certain kind of context that cannot be covered by the context events we have discussed: files that need to be read from the filesystem.
To handle LOAD DATA INFILE, the MySQL server uses a special set of events to handle the transfer of the file using the binary log. In addition to solving the problem for LOAD DATA INFILE, this makes the statement a very convenient tool for transferring large amounts of data from the master to the slave, as you will see soon. To correctly transfer and execute a LOAD DATA INFILE statement, several new events are introduced into the binary log:
Begin_load_query
This event signals the start of a file transfer and holds the first block of the file’s data.
Append_block
A sequence of one or more of these events follows the Begin_load_query event to contain the rest of the file’s data, if the file was larger than the maximum allowed packet size on the connection.
Execute_load_query
This event is a specialized variant of the Query event that contains the LOAD DATA INFILE statement executed on the master. Even though the statement contained in this event includes the name of the file that was used on the master, this file will not be sought by the slave. Instead, the contents provided by the preceding Begin_load_query and Append_block events will be used.
For each LOAD DATA INFILE
statement executed on the master, the file to read is mapped to an
internal file-backed buffer, which is used in the following processing.
In addition, a unique file ID is assigned to the execution of the statement and
is used to refer to the file read by the statement.
While the statement is executing, the file contents are written to the binary log as a sequence of events starting with a Begin_load_query event—which indicates the beginning of a new file—followed by zero or more Append_block events. Each event written to the binary log is no larger than the maximum allowed packet size, as specified by the max-allowed-packet option.
After the entire file is read and applied to the table, the execution of the statement terminates by writing the Execute_load_query event to the binary log. This event contains the statement executed together with the file ID assigned to the execution of the statement. Note that the statement is not the original statement as the user wrote it, but rather a recreated version of the statement.
Note
If you are reading an old binary log, you might instead find Load_log_event, Execute_log_event, and Create_file_log_event. These were the events used to replicate LOAD DATA INFILE prior to MySQL version 5.0.3 and were replaced by the implementation just described.
Example 4-3 shows the events written to the binary log by a successful execution of a LOAD DATA INFILE statement. In the Info field, you can see the assigned file ID—1, in this case—and see that it is used for all the events that are part of the execution of the statement. You can also see that the file foo.dat used by the statement contains more data than the maximum allowed packet size of 16,384 bytes, so it is split into three events.
master> SHOW BINLOG EVENTS IN 'master-bin.000042' FROM 269\G
*************************** 1. row ***************************
Log_name: master-bin.000042
Pos: 269
Event_type: Begin_load_query
Server_id: 1
End_log_pos: 16676
Info: ;file_id=1;block_len=16384
*************************** 2. row ***************************
Log_name: master-bin.000042
Pos: 16676
Event_type: Append_block
Server_id: 1
End_log_pos: 33083
Info: ;file_id=1;block_len=16384
*************************** 3. row ***************************
Log_name: master-bin.000042
Pos: 33083
Event_type: Append_block
Server_id: 1
End_log_pos: 33633
Info: ;file_id=1;block_len=527
*************************** 4. row ***************************
Log_name: master-bin.000042
Pos: 33633
Event_type: Execute_load_query
Server_id: 1
End_log_pos: 33756
Info: use `test`; LOAD DATA INFILE 'foo.dat' INTO...;file_id=1
4 rows in set (0.00 sec)
Binary Log Filters
It is possible to filter out statements from the binary log using two options: binlog-do-db and binlog-ignore-db (which we will call binlog-*-db, collectively). The binlog-do-db option is used when you want to filter only statements belonging to a certain database, and binlog-ignore-db is used when you want to ignore a certain database but replicate all other databases.
These options can be given multiple times, so to filter out both the database one_db and the database two_db, you must give both options in the my.cnf file. For example:
[mysqld]
binlog-ignore-db=one_db
binlog-ignore-db=two_db
The way MySQL filters events can be quite a surprise to unfamiliar users, so we’ll explain how filtering works and make some recommendations on how to avoid some of the major headaches.
Figure 4-3 shows how MySQL determines whether the statement is filtered. The filtering is done at the statement level—either the entire statement is filtered out or the entire statement is written to the binary log—and the binlog-*-db options use the current database to decide whether the statement should be filtered, not the database of the tables affected by the statement.
To help you understand the behavior, consider the statements in Example 4-4. Each line uses bad as the current database and changes tables in different databases.
Now, given these statements, consider what happens if the bad database is filtered using binlog-ignore-db=bad. None of the three statements in Example 4-4 will be written to the binary log, even though the second and third statements change tables in the good and ugly databases and make no reference to the bad database. This might seem strange at first—why not filter the statement based on the database of the table changed? But consider what would happen with the third statement if the ugly database was filtered instead of the bad database. Now one database in the UPDATE is filtered out and the other isn’t. This puts the server in a catch-22 situation, so the problem is solved by just filtering on the current database, and this rule is used for all statements (with a few exceptions).
Note
To avoid mistakes when executing statements that can potentially be filtered out, make it a habit not to qualify table, function, or procedure names with a database name. Instead, whenever you want to access a table in a different database, issue a USE statement to make that database the current database. In other words, instead of writing:

INSERT INTO other.book VALUES ('MySQL', 'Paul DuBois');

write:

USE other;
INSERT INTO book VALUES ('MySQL', 'Paul DuBois');
Using this practice, it is easy to see by inspection that the statement does not update multiple databases simply because no tables should be qualified with a database name.
This behavior does not apply when row-based replication is used. The filtering employed in row-based replication will be discussed in Filtering in Row-Based Replication, but since row-based replication can work with each individual row change, it is able to filter on the actual table that the row is targeted for and does not use the current database.
So, what happens when both binlog-do-db and binlog-ignore-db are used at the same time? For example, consider a configuration file containing the following two rules:
[mysqld]
binlog-do-db=good
binlog-ignore-db=bad
In this case, will the following statement be filtered out or not?
USE ugly;
INSERT INTO t1 VALUES (1);
Following the diagram in Figure 4-3, you can see that if there is at least one binlog-do-db rule, all binlog-ignore-db rules are ignored completely, and since only the good database is included, the previous statement will be filtered out.
Warning
Because of the way that the binlog-*-db rules are evaluated, it is pointless to have both binlog-do-db and binlog-ignore-db rules at the same time. Since the binary log can be used for recovery as well as replication, the recommendation is not to use the binlog-*-db options, and instead to filter out the events on the slave by using the replicate-* options (these are described in Filtering and skipping events). Using the binlog-*-db options would filter out statements from the binary log, and you would not be able to restore the database from the binary log in the event of a crash.
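As a sketch of that recommendation, the equivalent rule for the bad database from the earlier example would go in the slave’s my.cnf rather than the master’s:

[mysqld]
replicate-ignore-db=bad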
Triggers, Events, and Stored Routines
A few other constructions that are treated specially when logged are stored programs—that is, triggers, events, and stored routines (the last is a collective name for stored procedures and stored functions). Their treatment with respect to the binary log contains some elements in common, so they will be covered together in this section. The explanation distinguishes statements of two types: statements that define, destroy, or alter stored programs and statements that invoke them.
Statements that define or destroy stored programs
The following discussion shows triggers in the examples, but the same principles apply to definition of events and stored routines. To understand why the server needs to handle these features specially when writing them to the binary log, consider the code in Example 4-5.
In the example, a table named employee keeps information about all employees of an imagined system, and a table named log keeps a log of interesting information. Note that the log table has a timestamp column that notes the time of a change and that the name column in the employee table is the primary key for the table. There is also a status column to tell whether the addition succeeded or failed.
To track information about employee information changes—for example, for auditing purposes—three triggers are created so that whenever an employee is added, removed, or changed, a log entry of the change is added to a log table.
Notice that the triggers are after triggers, which means that entries are added only if the executed statement is successful. Failed statements will not be logged. We will later extend the example so that unsuccessful attempts are also logged.
CREATE TABLE employee (
   name CHAR(64) NOT NULL,
   email CHAR(64),
   password CHAR(64),
   PRIMARY KEY (name)
);

CREATE TABLE log (
   id INT AUTO_INCREMENT,
   email CHAR(64),
   status CHAR(10),
   message TEXT,
   ts TIMESTAMP,
   PRIMARY KEY (id)
);

CREATE TRIGGER tr_employee_insert_after AFTER INSERT ON employee
FOR EACH ROW
   INSERT INTO log(email, status, message)
   VALUES (NEW.email, 'OK', CONCAT('Adding employee ', NEW.name));

CREATE TRIGGER tr_employee_delete_after AFTER DELETE ON employee
FOR EACH ROW
   INSERT INTO log(email, status, message)
   VALUES (OLD.email, 'OK', 'Removing employee');

delimiter $$
CREATE TRIGGER tr_employee_update_after AFTER UPDATE ON employee
FOR EACH ROW
BEGIN
   IF OLD.name != NEW.name THEN
      INSERT INTO log(email, status, message)
      VALUES (OLD.email, 'OK',
              CONCAT('Name change from ', OLD.name, ' to ', NEW.name));
   END IF;
   IF OLD.password != NEW.password THEN
      INSERT INTO log(email, status, message)
      VALUES (OLD.email, 'OK', 'Password change');
   END IF;
   IF OLD.email != NEW.email THEN
      INSERT INTO log(email, status, message)
      VALUES (OLD.email, 'OK', CONCAT('E-mail change to ', NEW.email));
   END IF;
END $$
delimiter ;
With these trigger definitions, it is now possible to add and remove employees as shown in Example 4-6. Here an employee is added, modified, and removed, and as you can see, each of the operations is logged to the log table.
The operations of adding, removing, and modifying employees may be done by a user who has access to the employee table, but what about access to the log table? In this case, a user who can manipulate the employee table should not be able to make changes to the log table. There are many reasons for this, but they all boil down to trusting the contents of the log table for purposes of maintenance, auditing, disclosure to legal authorities, and so on. So the DBA may choose to make access to the employee table available to many users while keeping access to the log table very restricted.
master> SET @pass = PASSWORD('xyzzy');
Query OK, 0 rows affected (0.00 sec)

master> INSERT INTO employee VALUES ('mats', 'mats@example.com', @pass);
Query OK, 1 row affected (0.00 sec)

master> UPDATE employee SET name = 'matz'
     -> WHERE email = 'mats@example.com';
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0

master> SET @pass = PASSWORD('foobar');
Query OK, 0 rows affected (0.00 sec)

master> UPDATE employee SET password = @pass
     -> WHERE email = 'mats@example.com';
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0

master> DELETE FROM employee WHERE email = 'mats@example.com';
Query OK, 1 row affected (0.00 sec)

master> SELECT * FROM log;
+----+------------------+--------+-------------------------------+---------------------+
| id | email            | status | message                       | ts                  |
+----+------------------+--------+-------------------------------+---------------------+
|  1 | mats@example.com | OK     | Adding employee mats          | 2012-11-14 18:56:08 |
|  2 | mats@example.com | OK     | Name change from mats to matz | 2012-11-14 18:56:11 |
|  3 | mats@example.com | OK     | Password change               | 2012-11-14 18:56:41 |
|  4 | mats@example.com | OK     | Removing employee             | 2012-11-14 18:57:11 |
+----+------------------+--------+-------------------------------+---------------------+
4 rows in set (0.00 sec)
The INSERT, UPDATE, and DELETE statements in the example can generate a warning that they are unsafe to log when statement mode is used, because they invoke a trigger that inserts into an autoincrement column. In general, such warnings should be investigated.
To make sure the triggers can execute successfully against a highly protected table, they are executed as the user who defined the trigger, not as the user who changed the contents of the employee table. So the CREATE TRIGGER statements in Example 4-5 are executed by the DBA, who has privileges to make additions to the log table, whereas the statements altering employee information in Example 4-6 are executed through a user management account that only has privileges to change the employee table.
When the statements in Example 4-6 are executed, the employee management account is used for updating entries in the employee table, but the DBA privileges are used to make additions to the log table. The employee management account cannot be used to add or remove entries from the log table.
As an aside, Example 4-6 assigns passwords to a user variable before using them in the statement. This is done to avoid sending sensitive data in plain text to another server.
If you recall the earlier discussion about implicit information, you may already have noticed that both the user executing a line of code and the user who defines a trigger are implicit. As you will see in Chapter 8, neither the definer nor the invoker of the trigger is critical to executing the trigger on the slave, and the user information is effectively ignored when the slave executes the statement. However, the information is important when the binary log is played back to a server—for instance, when doing a PITR.
To play back a binary log to the server without problems in handling privileges on all the various tables, it is necessary to execute all the statements as a user with SUPER privileges. But the triggers may not have been defined using SUPER privileges, so it is important to recreate the triggers with the correct user as the trigger’s definer. If a trigger is defined with SUPER privileges instead of by the user who defined the trigger originally, it might cause a privilege escalation.
To permit a DBA to specify the user under which to execute a trigger, the CREATE TRIGGER syntax includes an optional DEFINER clause. If a DEFINER is not given in the statement—as is the case in Example 4-5—the statement will be rewritten for the binary log to add a DEFINER clause with the current user as the definer. This means that the definition of the insert trigger appears in the binary log as shown in Example 4-7. It lists the account that created the trigger (root@localhost) as the definer, which is what we want in this case.
master> SHOW BINLOG EVENTS FROM 92236 LIMIT 1\G
*************************** 1. row ***************************
Log_name: master-bin.000038
Pos: 92236
Event_type: Query
Server_id: 1
End_log_pos: 92491
Info: use `test`; CREATE DEFINER=`root`@`localhost` TRIGGER ...
1 row in set (0.00 sec)
Statements that invoke triggers and stored routines
Moving over from definitions to invocations, we can ask how the master’s triggers are handled during replication. Well, actually they’re not handled at all.
The statement that invokes the trigger is logged to the binary log, but it is not linked to the particular trigger. Instead, when the slave executes the statement, it automatically executes any triggers associated with the tables affected by the statement. This means that there can be different triggers on the master and the slave, and the triggers on the master will be invoked on the master while the triggers on the slave will be invoked on the slave. For example, if the trigger to add entries to the log table is not necessary on the slave, performance can be improved by eliminating the trigger from the slave.
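Removing a trigger on the slave is then just a matter of dropping it there. A sketch for the insert trigger from Example 4-5 (run on the slave only; the master’s trigger is unaffected):

slave> DROP TRIGGER tr_employee_insert_after;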
Still, any context events necessary for replicating correctly will be written to the binary log before the statement that invokes the trigger, even if it is just the statements in the trigger that require the context events. Thus, Example 4-8 shows the binary log after executing the INSERT statement in Example 4-6. Note that the first event writes the INSERT_ID for the log table’s primary key. This reflects the use of the log table in the trigger, but it might appear to be redundant because the slave will not use the trigger.
You should, however, note that using different triggers on the master and slave—or no trigger at all on either the master or slave—is the exception, and that the INSERT_ID is necessary for replicating the INSERT statement correctly when the trigger exists on both the master and slave.
master> SHOW BINLOG EVENTS FROM 93340\G
*************************** 1. row ***************************
Log_name: master-bin.000038
Pos: 93340
Event_type: Intvar
Server_id: 1
End_log_pos: 93368
Info: INSERT_ID=1
*************************** 2. row ***************************
Log_name: master-bin.000038
Pos: 93368
Event_type: User var
Server_id: 1
End_log_pos: 93396
       Info: @`pass`=_utf8 0x2A3942353030333433424335324... COLLATE
             utf8_general_ci
*************************** 3. row ***************************
Log_name: master-bin.000038
Pos: 93396
Event_type: Query
Server_id: 1
End_log_pos: 93537
Info: use `test`; INSERT INTO employee VALUES ...
3 rows in set (0.00 sec)
Stored Procedures
Stored functions and stored procedures are known by the common name stored routines. Since the server treats stored procedures and stored functions very differently, stored procedures will be covered in this section and stored functions in the next section.
The situation for stored routines is similar to triggers in some aspects, but very different in others. Like triggers, stored routines offer a DEFINER clause, which must be explicitly added to the binary log whether or not the statement includes it. But the invocation of stored routines is handled differently from triggers.
To begin, let’s extend Example 4-5, which defines tables for employees and logs, with some utility routines to work with the employees. Even though this can be handled with standard INSERT, DELETE, and UPDATE statements, we’ll use stored procedures to demonstrate some issues involved in writing them to the binary log. For these purposes, let’s extend the example with the procedures in Example 4-9 for adding and removing employees.
delimiter $$

CREATE PROCEDURE employee_add(p_name CHAR(64), p_email CHAR(64),
                              p_password CHAR(64))
   MODIFIES SQL DATA
BEGIN
   DECLARE l_pass CHAR(64);
   SET l_pass = PASSWORD(p_password);
   INSERT INTO employee(name, email, password)
      VALUES (p_name, p_email, l_pass);
END $$

CREATE PROCEDURE employee_passwd(p_email CHAR(64), p_password CHAR(64))
   MODIFIES SQL DATA
BEGIN
   DECLARE l_pass CHAR(64);
   SET l_pass = PASSWORD(p_password);
   UPDATE employee SET password = l_pass WHERE email = p_email;
END $$

CREATE PROCEDURE employee_del(p_name CHAR(64))
   MODIFIES SQL DATA
BEGIN
   DELETE FROM employee WHERE name = p_name;
END $$

delimiter ;
For the employee_add and employee_passwd procedures, we have extracted the encrypted password into a separate variable for the reasons already explained, but the employee_del procedure just contains a DELETE statement, since nothing else is needed. A binlog entry corresponding to one procedure is:
master> SHOW BINLOG EVENTS FROM 97911 LIMIT 1\G
*************************** 1. row ***************************
Log_name: master-bin.000038
Pos: 97911
Event_type: Query
Server_id: 1
End_log_pos: 98275
Info: use `test`; CREATE DEFINER=`root`@`localhost` PROCEDURE ...
1 row in set (0.00 sec)
As expected, the definition of this procedure is extended with the DEFINER clause before the definition is written to the binary log, but apart from that, the body of the procedure is left intact. Notice that the CREATE PROCEDURE statement is replicated as a Query event, as are all DDL statements.
In this regard, stored routines are similar to triggers in the way they are treated by the binary log. But invocation differs significantly from triggers. Example 4-10 calls the procedure that adds an employee and shows the resulting contents of the binary log.
master> CALL employee_add('chuck', 'chuck@example.com', 'abrakadabra');
Query OK, 1 row affected (0.00 sec)

master> SHOW BINLOG EVENTS FROM 104033\G
*************************** 1. row ***************************
   Log_name: master-bin.000038
        Pos: 104033
 Event_type: Intvar
  Server_id: 1
End_log_pos: 104061
       Info: INSERT_ID=1
*************************** 2. row ***************************
   Log_name: master-bin.000038
        Pos: 104061
 Event_type: Query
  Server_id: 1
End_log_pos: 104416
       Info: use `test`; INSERT INTO employee(name, email, password) VALUES (
             NAME_CONST('p_name',_utf8'chuck' COLLATE ...),
             NAME_CONST('p_email',_utf8'chuck@example.com' COLLATE ...),
             NAME_CONST('l_pass',_utf8'*FEB349C4FDAA307A...' COLLATE ...))
2 rows in set (0.00 sec)
In Example 4-10, there are four things that you should note:
- The CALL statement is not written to the binary log. Instead, the statements executed as a result of the call are written to the binary log. In other words, the body of the stored procedure is unrolled into the binary log.
- The statement is rewritten to not contain any references to the parameters of the stored procedure—that is, p_name, p_email, and p_password. Instead, the NAME_CONST function is used for each parameter to create a result set with a single value.
- The locally declared variable l_pass is also replaced with a NAME_CONST expression, where the second parameter contains the encrypted password.
- Just as when a statement that invokes a trigger is written to the binary log, the statement that calls the stored procedure is preceded by an Intvar event holding the insert ID used when adding the employee to the log table.
Since neither the parameter names nor the locally declared names are available outside the stored routine, NAME_CONST is used to associate the name of the parameter or local variable with the constant value used when executing the function. This guarantees that the value can be used in the same way as the parameter or local variable. However, this change is not significant; currently it offers no advantages over using the parameters directly.
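If you have not encountered NAME_CONST before, it simply returns its second argument under the column name given by the first, which you can verify directly with a throwaway query:

master> SELECT NAME_CONST('price', 100);
+-------+
| price |
+-------+
|   100 |
+-------+
1 row in set (0.00 sec)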
Stored Functions
Stored functions share many similarities with stored procedures and some similarities with triggers. Similar to both stored procedures and triggers, stored functions have a DEFINER clause that is normally (but not always) used when the CREATE FUNCTION statement is written to the binary log.
In contrast to stored procedures, stored functions can return scalar values and you can therefore embed them in various places in SQL statements. For example, consider the definition of a stored routine in Example 4-11, which extracts the email address of an employee given the employee’s name. The function is a little contrived—it is significantly more efficient to just execute statements directly—but it suits our purposes well.
delimiter $$

CREATE FUNCTION employee_email(p_name CHAR(64))
   RETURNS CHAR(64)
   DETERMINISTIC
BEGIN
   DECLARE l_email CHAR(64);
   SELECT email INTO l_email FROM employee WHERE name = p_name;
   RETURN l_email;
END $$

delimiter ;
This stored function can be used conveniently in other statements, as shown in Example 4-12. In contrast to stored procedures, stored functions have to specify a characteristic—such as DETERMINISTIC, NO SQL, or READS SQL DATA—if they are to be written to the binary log.
master> CREATE TABLE collected (
     ->    name CHAR(32),
     ->    email CHAR(64)
     -> );
Query OK, 0 rows affected (0.09 sec)

master> INSERT INTO collected(name, email)
     -> VALUES ('chuck', employee_email('chuck'));
Query OK, 1 row affected (0.01 sec)

master> SELECT employee_email('chuck');
+-------------------------+
| employee_email('chuck') |
+-------------------------+
| chuck@example.com       |
+-------------------------+
1 row in set (0.00 sec)
When it comes to calls, stored functions are replicated in the same manner as triggers: as part of the statement that executes the function. For instance, the binary log doesn’t need any events preceding the INSERT statement in Example 4-12, but it will contain the context events necessary to replicate the stored function inside the INSERT.
What about SELECT? Normally, SELECT statements are not written to the binary log, since they don’t change any data, but a SELECT containing a stored function, as in Example 4-13, is an exception.
CREATE TABLE log(log_id INT AUTO_INCREMENT PRIMARY KEY, msg TEXT);

delimiter $$
CREATE FUNCTION log_message(msg TEXT) RETURNS INT
   DETERMINISTIC
BEGIN
   INSERT INTO log(msg) VALUES(msg);
   RETURN LAST_INSERT_ID();
END $$
delimiter ;

SELECT log_message('Just a test');
When executing the stored function, the server notices that it adds a row to the log table and marks the statement as an “updating” statement, which means that it will be written to the binary log. So, for the slightly artificial example in Example 4-13, the binary log will contain the following events:
*************************** 7. row ***************************
   Log_name: mysql-bin.000001
        Pos: 845
 Event_type: Query
  Server_id: 1
End_log_pos: 913
       Info: BEGIN
*************************** 8. row ***************************
   Log_name: mysql-bin.000001
        Pos: 913
 Event_type: Intvar
  Server_id: 1
End_log_pos: 941
       Info: LAST_INSERT_ID=1
*************************** 9. row ***************************
   Log_name: mysql-bin.000001
        Pos: 941
 Event_type: Intvar
  Server_id: 1
End_log_pos: 969
       Info: INSERT_ID=1
*************************** 10. row ***************************
   Log_name: mysql-bin.000001
        Pos: 969
 Event_type: Query
  Server_id: 1
End_log_pos: 1109
       Info: use `test`; SELECT `test`.`log_message`(_utf8'Just a test' COLLATE...
*************************** 11. row ***************************
   Log_name: mysql-bin.000001
        Pos: 1105
 Event_type: Xid
  Server_id: 1
End_log_pos: 1132
       Info: COMMIT /* xid=237 */
Events
The events feature is a MySQL extension, not part of standard SQL. Events, which should not be confused with binlog events, are handled by a stored program that is executed regularly by a special event scheduler.
Similar to all other stored programs, definitions of events are also logged with a DEFINER clause. Since events are invoked by the event scheduler, they are always executed as the definer and do not pose a security risk in the way that stored functions do.
When events are executed, the statements are written to the binary log directly.
Since the events will be executed on the master, they are automatically disabled on the slave and will therefore not be executed there. If the events were not disabled, they would be executed twice on the slave: once by the master executing the event and replicating the changes to the slave, and once by the slave executing the event directly.
Warning
Because the events are disabled on the slave, it is necessary to enable these events if the slave, for some reason, is promoted to replace the master.
So, for example, when promoting a slave as described in Chapter 5, don’t forget to enable the events that were replicated from the master. This is easiest to do using the following statement:
UPDATE mysql.event SET Status = 'ENABLED' WHERE Status = 'SLAVESIDE_DISABLED';
The purpose of the WHERE clause is to enable only the events that were disabled because they were replicated from the master. There might be events that were disabled for other reasons, and those should remain disabled.
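If you want to check which events exist and what their status is before running the statement, you can query the event table directly. A minimal sketch:

SELECT db, name, status FROM mysql.event;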
Special Constructions
Even though statement-based replication is normally straightforward, some special constructions have to be handled with care. Recall that for the statement to be executed correctly on the slave, the context has to be correct for the statement. Even though the context events discussed earlier handle part of the context, some constructions have additional context that is not transferred as part of the replication process.
The LOAD_FILE function
The LOAD_FILE
function
allows you to fetch a file and use it as part of an
expression. Although quite convenient at times, the file has to exist
on the slave server for the statement to replicate correctly, because
the file itself is not transferred during replication the way the file
given to LOAD DATA INFILE
is. With some
ingenuity, you can rewrite a statement involving the LOAD_FILE
function either to use the
LOAD DATA INFILE
statement
or to define a user-defined variable to hold the
contents of the file. For example, take the following statement that
inserts a document into a table:
master>INSERT INTO document(author, body)
->VALUES ('Mats Kindahl', LOAD_FILE('go_intro.xml'));
You can rewrite this statement to use LOAD DATA INFILE
instead. In this case, you
have to take care to specify character strings that cannot exist in
the document as field and line delimiters, since you are going to read
the entire file contents as a single column.
master>LOAD DATA INFILE 'go_intro.xml' INTO TABLE document
->FIELDS TERMINATED BY '@*@' LINES TERMINATED BY '&%&'
->(author, body) SET author = 'Mats Kindahl';
An alternative is to store the file contents in a user-defined variable and then use it in the statement.
master> SET @document = LOAD_FILE('go_intro.xml');
master> INSERT INTO document(author, body) VALUES
     ->    ('Mats Kindahl', @document);
Nontransactional Changes and Error Handling
So far we have considered only transactional changes and have not looked at error handling at all. For transactional changes, error handling is pretty uncomplicated: a statement that tries to change transactional tables and fails will not have any effect at all on the table. That’s the entire point of having a transactional system—so the changes that the statement attempts to introduce can be safely ignored. The same applies to transactions that are rolled back: they have no effect on the tables and can therefore simply be discarded without risking inconsistencies between the master and the slave.
A specialty of MySQL is its support for nontransactional storage engines. Nontransactional engines can offer some speed advantages, because the storage engine does not have to administer the transactional log that the transactional engines use, and they allow some optimizations on disk access. From a replication perspective, however, nontransactional engines require special considerations.
The most important aspect to note is that replication cannot handle arbitrary nontransactional engines, but has to make some assumptions about how they behave. Some of those limitations are lifted with the introduction of row-based replication in version 5.1—a subject that will be covered in Row-Based Replication—but even in that case, it cannot handle arbitrary storage engines.
One of the features that complicates the issue further, from a replication perspective, is that it is possible to mix transactional and nontransactional engines in the same transaction, and even in the same statement.
To continue with the example used earlier, consider Example 4-14, where the
log
table from Example 4-5 is given a
nontransactional storage engine while the employee
table is given a transactional one. We use the nontransactional MyISAM storage engine for the
log
table to improve its speed, while keeping the
transactional behavior for the employee
table.
We can further extend the example to track unsuccessful attempts
to add employees by creating a pair of insert triggers: a before trigger
and an after trigger. If an administrator sees an entry in the log with
a status
field of FAIL
, it means the before trigger ran, but the
after trigger did not, and therefore an attempt to add an employee
failed.
CREATE TABLE employee (
   name CHAR(64) NOT NULL,
   email CHAR(64),
   password CHAR(64),
   PRIMARY KEY (email)
) ENGINE = InnoDB;

CREATE TABLE log (
   id INT AUTO_INCREMENT,
   email CHAR(64),
   message TEXT,
   status ENUM('FAIL', 'OK') DEFAULT 'FAIL',
   ts TIMESTAMP,
   PRIMARY KEY (id)
) ENGINE = MyISAM;

delimiter $$
CREATE TRIGGER tr_employee_insert_before BEFORE INSERT ON employee
FOR EACH ROW
BEGIN
   INSERT INTO log(email, message)
      VALUES (NEW.email, CONCAT('Adding employee ', NEW.name));
   SET @LAST_INSERT_ID = LAST_INSERT_ID();
END $$
delimiter ;

CREATE TRIGGER tr_employee_insert_after AFTER INSERT ON employee
FOR EACH ROW
   UPDATE log SET status = 'OK' WHERE id = @LAST_INSERT_ID;
What are the effects of this change on the binary log?
To begin, let’s consider the INSERT
statement
from Example 4-6.
Assuming the statement is not inside a transaction and AUTOCOMMIT
is 1, the statement will be a
transaction by itself. If the statement executes without errors,
everything will proceed as planned and the statement will be written to
the binary log as a Query
event.
Now, consider what happens if the INSERT
is repeated with the same employee.
Since the email
column is the primary key, this will
generate a duplicate key error when the insertion is attempted, but what
will happen with the statement? Is it written to the binary log or
not?
Let’s have a look…
master> SET @pass = PASSWORD('xyzzy');
Query OK, 0 rows affected (0.00 sec)

master> INSERT INTO employee(name,email,password)
    ->    VALUES ('chuck','chuck@example.com',@pass);
ERROR 1062 (23000): Duplicate entry 'chuck@example.com' for key 'PRIMARY'

master> SELECT * FROM employee;
+-------+-------------------+-------------------------------------------+
| name  | email             | password                                  |
+-------+-------------------+-------------------------------------------+
| chuck | chuck@example.com | *151AF6B8C3A6AA09CFCCBD34601F2D309ED54888 |
+-------+-------------------+-------------------------------------------+
1 row in set (0.00 sec)

master> SHOW BINLOG EVENTS FROM 38493\G
*************************** 1. row ***************************
   Log_name: master-bin.000038
        Pos: 38493
 Event_type: User var
  Server_id: 1
End_log_pos: 38571
       Info: @`pass`=_utf8 0x2A31353141463642384333413641413...
*************************** 2. row ***************************
   Log_name: master-bin.000038
        Pos: 38571
 Event_type: Query
  Server_id: 1
End_log_pos: 38689
       Info: use `test`; INSERT INTO employee(name,email,password)...
2 rows in set (0.00 sec)
As you can see, the statement is written to the binary log even
though the employee
table is transactional and the
statement failed. Looking at the contents of the table using the SELECT
reveals that there is still a single employee, proving the statement was
rolled back—so why is the statement written to the binary log?
Looking into the log
table will reveal the
reason.
master> SELECT * FROM log;
+----+------------------+----------------------+--------+---------------------+
| id | email            | message              | status | ts                  |
+----+------------------+----------------------+--------+---------------------+
|  1 | mats@example.com | Adding employee mats | OK     | 2010-01-13 15:50:45 |
|  2 | mats@example.com | Name change from ... | OK     | 2010-01-13 15:50:48 |
|  3 | mats@example.com | Password change      | OK     | 2010-01-13 15:50:50 |
|  4 | mats@example.com | Removing employee    | OK     | 2010-01-13 15:50:52 |
|  5 | mats@example.com | Adding employee mats | OK     | 2010-01-13 16:11:45 |
|  6 | mats@example.com | Adding employee mats | FAIL   | 2010-01-13 16:12:00 |
+----+------------------+----------------------+--------+---------------------+
6 rows in set (0.00 sec)
Look at the last line, where the status is FAIL
. This line was added to the table by the
before trigger tr_employee_insert_before
. For the binary log
to faithfully represent the changes made to the database on the master,
it is necessary to write the statement to the binary log if there are
any nontransactional changes present in the statement or in triggers
that are executed as a result of executing the statement. Since the
statement failed, the after trigger tr_employee_insert_after
was not executed, and
therefore the status is still FAIL
from the execution of the before trigger.
Since the statement failed on the master, information about the
failure needs to be written to the binary log as well. The MySQL server
handles this by using an error code field in the Query
event to register the exact error code
that caused the statement to fail. This field is then written to the
binary log together with the event.
The error code is not visible when using the SHOW BINLOG
EVENTS
command, but you can view it using the mysqlbinlog
tool, which we will cover later in the chapter.
Logging Transactions
You have now seen how individual statements are written to the binary log, along with context information, but not how transactions are logged. In this section, we will briefly cover how the server logs transactions.
A transaction can start under a few different circumstances:
When AUTOCOMMIT=1 and a statement accessing a transactional table starts to execute. Note that a statement that writes only to nontransactional tables—for example, only to MyISAM tables—does not start a transaction.

When AUTOCOMMIT=0 and the previous transaction was committed or aborted, either implicitly (by executing a statement that does an implicit commit) or explicitly by using COMMIT or ROLLBACK.
Not every statement that is executed after the transaction has started is part of that transaction. The exceptions require special care from the binary log.
Nontransactional statements are by their very definition not part of the transaction. When they are executed, they take effect immediately and do not wait for the transaction to commit. This also means that it is not possible to roll them back. They don’t affect an open transaction: any transactional statement executed after the nontransactional statement is still added to the currently open transaction.
In addition, several statements do an implicit commit. These can be separated into three groups based on the reason they do an implicit commit.
- Statements that write files

Most DDL statements (CREATE, ALTER, etc.), with some exceptions, do an implicit commit of any outstanding transaction before starting to execute and an implicit commit after they have finished. These statements modify files in the filesystem and are for that reason not transactional.

- Statements that modify tables in the mysql database

All statements that create, drop, or modify user accounts or privileges for users do an implicit commit and cannot be part of a transaction. Internally, these statements modify tables in the mysql database, which are all nontransactional.

In MySQL versions earlier than 5.1.3, these statements did not cause an implicit commit, but because they were writing to nontransactional tables, they were treated as nontransactional statements. As you will soon see, this caused some inconsistencies, so implicit commits were added for these statements over the course of several versions.

- Statements that require implicit commits for pragmatic reasons

Statements that lock tables, statements that are used for administrative purposes, and LOAD DATA INFILE cause implicit commits in various situations because the implementation requires this to make them work correctly.
Statements that cause an implicit commit are clearly not part of any transaction, because any open transaction is committed before execution starts. You can find a complete list of statements that cause an implicit commit in the online MySQL Reference Manual.
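To illustrate the effect, here is a minimal sketch using the employee table from Example 4-14; the table name dummy is made up for the illustration. The CREATE TABLE statement implicitly commits the open transaction, so the final ROLLBACK has nothing left to undo:

START TRANSACTION;
INSERT INTO employee(name, email, password)
   VALUES ('lars', 'lars@example.com', PASSWORD('secret'));
CREATE TABLE dummy (i INT);   -- DDL: implicitly commits the transaction above
ROLLBACK;                     -- no effect; the INSERT is already committed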
Transaction Cache
The binary log can have statements in a different order from their actual execution, because it combines all the statements in each transaction to keep them together. Multiple sessions can execute simultaneous transactions on a server, and the transactional storage engines maintain their own transactional logs to make sure each transaction executes correctly. These logs are not visible to the user. In contrast, the binary log shows all transactions from all sessions in the order in which they were committed as if each executed sequentially.
To ensure each transaction is written as a unit to the binary log, the server has to separate statements that are executing in different threads. When committing a transaction, the server writes all the statements that are part of the transaction to the binary log as a single unit. For this purpose, the server keeps a transaction cache for each thread, as illustrated in Figure 4-4. Each statement executed for a transaction is placed in the transaction cache, and the contents of the transaction cache are then copied to the binary log and emptied when the transaction commits.
Statements that contain nontransactional changes require special attention. Recall from our previous discussion that nontransactional statements do not cause the current transaction to terminate, so the changes introduced by the execution of a nontransactional statement have to be recorded somewhere without closing the currently open transaction. The situation is further complicated by statements that simultaneously affect transactional and nontransactional tables. These statements are considered transactional but include changes that are not part of the transaction.
Statement-based replication cannot handle this correctly in all situations, so a best-effort approach has been taken. We’ll describe the measures taken by the server, followed by the issues you have to be aware of in order to avoid the replication problems that are left over.
How nontransactional statements are logged
When no transaction is open, nontransactional statements are written to the binary log at the end of execution of the statement and do not “transit” in the transaction cache before ending up in the binary log. If, however, a transaction is open, the rules for how to handle the statement are as follows:
Rule 1: If the statement is marked as transactional, it is written to the transaction cache.
Rule 2: If the statement is not marked as transactional and there are no statements in the transaction cache, the statement is written directly to the binary log.
Rule 3: If the statement is not marked as transactional, but there are statements in the transaction cache, the statement is written to the transaction cache.
The third rule might seem strange, but you can understand the
reasoning if you look at Example 4-15. Returning to our
employee
and log
tables,
consider the statements in Example 4-15, where a modification of a
transactional table comes before modification of a nontransactional
table in the transaction.
1 START TRANSACTION;
2 SET @pass = PASSWORD('xyzzy');
3 INSERT INTO employee(name,email,password)
     VALUES ('mats','mats@example.com', @pass);
4 INSERT INTO log(email, message)
     VALUES ('root@example.com', 'This employee was bad');
5 COMMIT;
Following rule 3, the statement on line 4 is written to the transaction cache even though the table is nontransactional. If the statement were written directly to the binary log, it would end up before the statement in line 3 because the statement in line 3 would not end up in the binary log until a successful commit in line 5. In short, the slave’s log would end up containing the comment added by the DBA in line 4 before the actual change to the employee in line 3, which is clearly inconsistent with the master. Rule 3 avoids such situations. The left side of Figure 4-5 shows the undesired effects if rule 3 did not apply, whereas the right side shows what actually happens thanks to rule 3.
Rule 3 involves a trade-off. Because the nontransactional statement is cached while the transaction executes, there is a risk that two transactions will update a nontransactional table on the master in a different order from that in which they are written to the binary log.
This situation can arise when there is a dependency between the first transactional and the second nontransactional statement of the transaction, but this cannot generally be handled by the server because it would require parsing each statement completely, including code in all triggers invoked, and performing a dependency analysis. Although technically possible, this would add extra processing to all statements during an open transaction and would therefore affect performance, perhaps significantly. Because the problem can almost always be avoided by designing transactions properly and ensuring that there are no dependencies of this kind in the transaction, the overhead was not added to MySQL.
How to avoid replication problems with nontransactional statements
The best strategy for avoiding problems is not to use nontransactional tables. However, if they are required by the application, a strategy for avoiding the dependencies discussed in the previous section is to ensure that statements affecting nontransactional tables are written first in the transaction. In this case, the statements will be written directly to the binary log, because the transaction cache is empty (refer to rule 2 in the preceding section). The statements are known to have no dependencies.
If you need any values from these statements later in the transaction, you can assign them to temporary tables or variables. After that, the real contents of the transaction can be executed, referencing the temporary tables or variables.
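A minimal sketch of this rewriting, using the employee and log tables from Example 4-14: the nontransactional INSERT comes first, so it is written directly to the binary log (rule 2), and a value needed later is saved in a user-defined variable.

START TRANSACTION;
INSERT INTO log(email, message)   -- nontransactional table first (rule 2)
   VALUES ('mats@example.com', 'Adding employee mats');
SET @log_id = LAST_INSERT_ID();   -- save a value needed later in the transaction
INSERT INTO employee(name, email, password)
   VALUES ('mats', 'mats@example.com', PASSWORD('xyzzy'));
COMMIT;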
Distributed Transaction Processing Using XA
MySQL version 5.0 lets you coordinate transactions involving different resources by using the X/Open Distributed Transaction Processing model, XA. Although currently not very widely used, XA offers attractive opportunities for coordinating all kinds of resources with transactions.
In version 5.0, the server uses XA internally to coordinate the binary log and the storage engines.
A set of commands allows the client to take advantage of XA synchronization as well. XA allows different statements entered by different users to be treated as a single transaction. On the other hand, it imposes some overhead, so some administrators turn it off globally.
Instructions for working with the XA protocol are beyond the scope of this book, but we will give a brief introduction to XA here before describing how it affects the binary log.
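As a taste of the client-side interface, the following sketch shows a global transaction bracketed by XA statements and committed in two phases. The XID 'trx1' and the row values are arbitrary examples:

XA START 'trx1';
INSERT INTO employee(name, email, password)
   VALUES ('jan', 'jan@example.com', PASSWORD('secret'));
XA END 'trx1';
XA PREPARE 'trx1';   -- phase 1: prepare and write to durable storage
XA COMMIT 'trx1';    -- phase 2: commit the prepared transaction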
XA includes a transaction manager that coordinates a set of resource managers so that they commit a global transaction as an atomic unit. Each transaction is assigned a unique XID, which is used by the transaction manager and the resource managers. When used internally in the MySQL server, the transaction manager is usually the binary log and the resource managers are the storage engines. The process of committing an XA transaction is shown in Figure 4-8 and consists of two phases.
In phase 1, each storage engine is asked to prepare for a commit. When preparing, the storage engine writes any information it needs to commit correctly to safe storage and then returns an OK message. If any storage engine replies negatively—meaning that it cannot commit the transaction—the commit is aborted and all engines are instructed to roll back the transaction.
After all storage engines have reported that they have prepared
without error, and before phase 2 begins, the transaction cache is
written to the binary log. In contrast to normal transactions, which are
terminated with a normal Query
event with
a COMMIT
, an XA transaction is
terminated with an Xid
event containing
the XID.
In phase 2, all the storage engines that were prepared in phase 1 are asked to commit the transaction. When committing, each storage engine will report that it has committed the transaction in stable storage. It is important to understand that the commit cannot fail: once phase 1 has passed, the storage engine has guaranteed that the transaction can be committed and therefore is not allowed to report failure in phase 2. A hardware failure can, of course, cause a crash, but since the storage engines have stored the information in durable storage, they will be able to recover properly when the server restarts. The restart procedure is discussed in the section The Binary Log and Crash Safety.
After phase 2, the transaction manager is given a chance to discard any shared resources, should it choose to. The binary log does not need to do any such cleanup actions, so it does not do anything special with regard to XA at this step.
In the event that a crash occurs while committing an XA
transaction, the recovery procedure in Figure 4-9 will take place when the server
is restarted. At startup, the server will open the last binary log and
check the Format_description
event. If the binlog-in-use
flag
is set (described in Binlog File Rotation), it
indicates that the server crashed and XA recovery has to be
executed.
The server starts by walking through the binary log that was just
opened and finding the XIDs of all transactions in the binary log by
reading the Xid
events. Each storage
engine loaded into the server will then be asked to commit the
transactions in this list. For each XID in the list, the storage engine
will determine whether a transaction with that XID is prepared but not
committed, and commit it if that is the case. If the storage engine has
prepared a transaction with an XID that is not in
this list, the XID obviously did not make it to the binary log before
the server crashed, so the transaction should be rolled back.
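The internal recovery between the binary log and the storage engines happens automatically at restart. For client-initiated XA transactions, you can list transactions that are prepared but neither committed nor rolled back with:

XA RECOVER;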
Binary Log Group Commit
Writing data to disk takes time; quite a lot, actually. Disk access is orders of magnitude slower than main memory access: disk access times are measured in milliseconds, while main memory access times are measured in nanoseconds. For that reason, operating systems have elaborate memory management systems that keep parts of files in memory and avoid performing more writes to disk than necessary. Since database systems have to be safe from crashes, however, it is necessary to force the data to be written to disk when a transaction is committed.
To avoid the performance impact resulting if each transaction has to be written to disk, multiple independent transactions can be grouped together and written to disk as a single unit, which is called group commit. Because the disk write time is mainly a consequence of the time it takes to move the disk head to the correct position on the disk, and not the amount of data written, this improves performance significantly.
It is, however, not sufficient to commit the transaction data in the storage engine efficiently; the binary log also has to be written efficiently. For that reason, binary log group commit was added in MySQL 5.6. With this, multiple independent transactions can be committed as a group to the binary log, improving performance significantly.
In order for online backup tools like MySQL Enterprise Backup to work correctly, it is important that transactions are written to the binary log in the same order as they are prepared in the storage engine.
The implementation is built around a set of stages that each transaction has to pass through before being fully committed; you can see an illustration in Figure 4-10. Each stage is protected by a mutex, so there can never be more than one thread in each stage.
Each stage handles one part of the commit procedure. The first stage flushes the transaction caches of the threads to the file pages, the second stage executes a sync to get the file pages to disk, and the last stage commits all transactions.
To move sessions between stages in an ordered manner, each stage has an associated queue where sessions can queue up for processing. Each stage queue is protected by a mutex that is held briefly while manipulating the queue. You can see the names of the mutexes for each stage in Table 4-1, in case you want to use the performance schema to monitor performance of the commit phase.
Stage  | Stage mutex | Stage queue mutex
-------+-------------+------------------
Flush  | LOCK_log    | LOCK_flush_queue
Sync   | LOCK_sync   | LOCK_sync_queue
Commit | LOCK_commit | LOCK_commit_queue
Sessions normally run in separate threads, so any session thread that wants to commit a transaction enqueues the session in the flush stage queue. A session thread then proceeds through each stage in the following way:
If the session thread is enqueued to a non-empty stage queue, it is a follower and will go and wait for its transaction to be committed by some other session thread.
If the session thread is enqueued to an empty stage queue, it is a (stage) leader and will bring all the sessions registered for the stage through the stage.

The leader removes all the sessions from the queue in a single step. The order of the sessions is maintained, but after this step, new sessions can enqueue to the stage queue.
The stage processing is done as follows:
For the flush queue, the transaction of each session will be flushed to the binary log in the same order as they enqueued to the flush queue.
For the commit stage, the transactions are committed in the storage engines in the same order as they registered.
The sessions are enqueued to the queue for the next stage in the same order as they were registered for this stage.
Note that the queue might not be empty, which means that a session thread has “caught up” with a preceding leader thread. In this case, the session thread that was the leader and did the processing will become a follower for the next stage and let the leader already queued up handle the processing for the combined sets of sessions. Note that a leader can become a follower, but a follower can never become a leader.
As explained, new leader threads merge their session queues with
older threads that are in the middle of commit procedures. This lets the
system adapt dynamically to changing situations. Normally, the sync
stage is the most expensive stage, so many batches of sessions will pass
through the flush stage and “pile up” in the sync stage queue, after
which a single leader thread will bring all the sessions through that
stage. However, if battery-backed caches are used, fsync
calls are very cheap, so in that case,
sessions might not pile up at this stage.
Transactions have to be committed in order so that online backup methods (such as XtraBackup or MySQL Enterprise Backup) work correctly. For this reason, the commit stage commits the transactions in the same order as they are written to the binary log.
However, it is also possible to commit transactions in parallel in the commit stage. This means that the commits are done in an arbitrary order. This does not affect correctness in normal operations, but it means that you should not take an online backup while doing parallel commits.
Benchmarks did not show any measurable improvement in performance when committing transactions in parallel, so the default is to always commit in order. This is also the recommended setting, unless you have special needs (or just want to test it).
You can control whether transactions should be ordered using binlog_order_commits
. If this is
set to OFF
, transactions are committed in parallel
and the last stage is skipped (the threads commit themselves in an
arbitrary order instead of waiting for the leader).
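binlog_order_commits is a dynamic global variable, so ordered commits can be switched off and on at runtime. A minimal sketch:

SET GLOBAL binlog_order_commits = OFF;  -- commit in parallel, in arbitrary order
SET GLOBAL binlog_order_commits = ON;   -- default: commit in binary log order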
The flush stage includes an optimization when reading sessions
from the queue. Instead of entering the flush stage and taking the
entire queue for processing, the leader “skims” one session at a time
from the queue, flushes it, and then repeats the procedure. The idea
behind this optimization is that as long as there are sessions
enqueueing, there is no point in advancing to the sync stage, and since
flushing the transaction to the binary log takes time, performing it on
each session provides an opportunity for more sessions that need
committing to enqueue themselves. This optimization showed significant
performance improvements in throughput. But because it can affect
latency, you can control how long the leader thread will keep “skimming”
sessions from the flush queue using the
binlog_max_flush_queue_time
variable, which gives the number of microseconds that the leader
should keep skimming the queue. Once the timer expires, the entire queue will be
grabbed and the leader reverts to executing the procedure described in
the preceding list for the stage. This means that the time given is not
the maximum latency of a transaction: the leader will have to flush all
the transactions that are registered for the queue before moving to the
sync stage.
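For example, to let the leader skim the flush queue for at most 100 microseconds before grabbing the entire queue, you could set (the value here is just an example):

SET GLOBAL binlog_max_flush_queue_time = 100;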
Row-Based Replication
The primary goal of replication is to keep the master and the slave synchronized so they have the same data. As you saw earlier, replication offers a number of special features to ensure the results are as close as possible to being identical on master and slave: context events, session-specific IDs, etc.
Despite this, there are still some situations that statement-based replication can’t currently handle correctly:
If an UPDATE, DELETE, or INSERT statement contains a LIMIT clause, it may cause problems if a database crashes during execution.

If there is an error during execution of a nontransactional statement, there is no guarantee that the effects are the same on the master and the slave.

If a statement contains a call to a UDF, there is no way to ensure the same value is used on the slave.

If the statement contains any nondeterministic function—such as USER, CURRENT_USER, or CONNECTION_ID—results may differ between master and slave.

If a statement updates two tables with autoincrement columns, it will not work correctly, because only a single last insert ID can be replicated, which will then be used for both tables on the slave, while on the master, the insert ID for each table will be used individually.
In these cases, it is better to replicate the actual data being inserted into the tables, which is what row-based replication does.
Instead of replicating the statement that performs the changes, row-based replication replicates each row being inserted, deleted, or updated separately, with the values that were used for the operation. Since the row that is sent to the slave is the same row that is sent to the storage engine, it contains the actual data being inserted into the table. Hence there are no UDFs to consider, no autoincrement counters to keep track of, and no partial execution of statements to take into consideration—just data, plain and simple.
Row-based replication opens up an entirely new set of scenarios that you just cannot accomplish with statement-based replication. However, you must also be aware of some differences in behavior.
When choosing between statement-based and row-based replication, consider the following:
Do you have statements that update a lot of rows, or do the statements usually only change or insert a few rows?
If the statement changes a lot of rows, statement-based replication will have more compact statements and may execute faster. But since the statement is executed on the slave as well, this is not always true. If the statement has a complex optimization and execution plan, it might be faster to use row-based replication, because the logic for finding rows is much faster.
If the statement changes or inserts only a few rows, row-based replication is potentially faster because there is no parsing involved and all processing goes directly to the storage engine.
Do you need to see which statements are executed? The events for handling row-based replication are hard to decode, to say the least. In statement-based replication, the statements are written into the binary log and hence can be read directly.[5]
Statement-based replication has a simple replication model: just execute the same statement on the slave. This has existed for a long time and is familiar to many DBAs. Row-based replication, on the other hand, is less familiar to many DBAs and can therefore potentially be harder to fix when replication fails.
If data is different on master and slave, executing a statement can yield different results on each. Sometimes this is intentional—in this case, statement-based replication can and should be used—but sometimes it is not intentional, and the inconsistency can be prevented through row-based replication.
Enabling Row-based Replication
You can control the format used when writing to the binary log with the
binlog-format
option. This option can take the values
STATEMENT
, MIXED
, or ROW
(see Options for Row-Based Replication
for more information). It is also available as both a global and a session
server variable, which makes it possible for a session to temporarily switch
replication format (but you need the SUPER
privilege to change this variable, even
the session version). However, to ensure that row-based replication is
used when starting the server, you need to stop the master and update
the configuration file. Example 4-16
shows the addition needed to enable row-based replication.
[mysqld]
user = mysql
pid-file = /var/run/mysqld/mysqld.pid
socket = /var/run/mysqld/mysqld.sock
port = 3306
basedir = /usr
datadir = /var/lib/mysql
tmpdir = /tmp
log-bin = master-bin
log-bin-index = master-bin.index
server-id = 1
binlog-format = ROW
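Because the option is also available as a session variable, a user with the SUPER privilege can switch the format temporarily without restarting the server. A minimal sketch:

SET SESSION binlog_format = 'ROW';        -- requires SUPER
-- Statements executed here are logged using the row-based format.
SET SESSION binlog_format = 'STATEMENT';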
Using Mixed Mode
Mixed-mode replication is recommended for MySQL version 5.1 and later, but the
default value for the binlog-format
option is STATEMENT
. This might seem odd, but that
decision was made to avoid problems for users who upgrade from versions
5.0 or earlier. Because those versions had no row-based replication and
users had to use statement-based replication, the MySQL developers did
not want servers to make a sudden switch. If the servers suddenly
started sending out row-based replication events when they were
upgraded, the deployment would likely be a mess. To reduce the number of
factors that an upgrading DBA has to consider, the default for this
option remains STATEMENT
.
However, if you use one of the template files distributed with
MySQL version 5.1, you will notice the binlog-format
option has the value MIXED
, per the
recommendation.
The principles behind mixed-mode replication are simple: use statement-based replication normally and switch to row-based replication for unsafe statements. We have already examined the kinds of statements that can lead to problems and why. To summarize, mixed-mode currently switches to row-based replication for statements that are unsafe to replicate as statements: for example, statements that invoke a UDF or a nondeterministic function, or that update more than one table with an AUTO_INCREMENT column.
This list of unsafe constructions is, by necessity, incomplete: it is extended as new constructions are discovered to be unsafe. For a complete and accurate list, refer to the online MySQL Reference Manual.
Binary Log Management
The events mentioned thus far represent some real change of data that occurred on the master. However, things can happen that do not represent any change of data on the master but can still affect replication. For example, no events can be written to the binary log while the server is not running, so anything that changes the datafiles during that time, such as restoring a backup or otherwise manipulating the files directly, will not be represented in the binary log. Such changes are not replicated, simply because the server is not running. However, the fact that the server stopped can be recorded in the binary log with a special event, precisely so that a reader can recognize a potential “gap” in the binary log where things could have happened.
Events are needed for other purposes as well. Because the binary logs consist of multiple files, it is necessary to split the groups at convenient places to form the sequence of binlog files. To handle this safely, special events (rotate events) are added to the binary log.
The Binary Log and Crash Safety
As you have seen, changes to the binary log do not correspond to changes to the master databases on a one-to-one basis. It is important to keep the databases and the binary log mutually consistent in case of a crash. In other words, there should be no changes committed to the storage engine that are not written to the binary log, and vice versa.
Nontransactional engines introduce problems right away. For example, it is not possible to guarantee consistency between the binary log and a MyISAM table because MyISAM is nontransactional and the storage engine will carry through any requested change long before any attempt to log the statement.
But for transactional storage engines, MySQL includes measures to make sure that a crash does not cause the binary log to lose too much information.
As we described in Logging Statements, events are written to the binary log before releasing the locks on the table, but after all the changes have been given to the storage engine. So if there is a crash before the storage engine releases the locks, the server has to ensure that any changes recorded to the binary log are actually in the table on the disk before allowing the statement (or transaction) to commit. This requires coordination with standard filesystem synchronization.
Because disk accesses are very expensive compared to memory accesses, operating systems are designed to cache parts of the file in a dedicated part of the main memory—usually called the page cache—and wait to write file data to disk until necessary. Writing to disk becomes necessary when another page must be loaded from disk and the page cache is full, but it can also be requested by an application by doing an explicit call to write the pages of a file to disk.
Recall from the earlier description of XA that when the first
phase is complete, all data has to be written to durable storage—that
is, to disk—for the protocol to handle crashes correctly. This means
that every time a transaction is committed, the page cache has to be
written to disk. This can be very expensive and, depending on the
application, not always necessary. To control how often the data is
written to disk, you can set the sync-binlog
option. This option takes
an integer specifying how often to write the binary log to disk. If the
option is set to 5, for instance, the binary log will be written to disk
on every fifth commit group of statements or transactions. The default
value is 0, which means that the binary log is not explicitly written to
disk by the server, but happens at the discretion of the operating
system.
Note
Note that since the introduction of binary log group commit in
MySQL version 5.6, there is a sync with each commit group and not with
each transaction or statement. This means that with
sync-binlog=1
several transactions will be written to
disk in a batch. You can read more about binary log group commit in
Binary Log Group Commit.
For storage engines that support XA, such as InnoDB, setting the
sync-binlog
option to 1 means that you will not lose
any transactions under normal crashes. For engines that do not support
XA, you might lose at most one transaction.
If, however, every group is written to disk, it means that the
performance suffers, usually a lot. Disk accesses are notoriously slow
and caches are used for precisely the purpose of improving the
performance because they remove the need to always write data to disk.
If you are prepared to risk losing a few transactions or
statements—either because you can handle the work it takes to recover
this manually or because it is not important for the application—you can
set sync-binlog
to a higher value or leave it at the
default.
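As an illustration, enabling a sync with every commit group requires only one more line in a configuration file like the one in Example 4-16:

[mysqld]
# Sync the binary log to disk with every commit group.
sync-binlog = 1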
Binlog File Rotation
MySQL starts a new file to hold binary log events at regular intervals. For practical and administrative reasons, it wouldn’t work to keep writing to a single file—operating systems have limits on file sizes. As mentioned earlier, the file to which the server is currently writing is called the active binlog file.
Switching to a new file is called binary log rotation or binlog file rotation, depending on the context.
Four main activities cause a rotation:
- The server stops
Each time the server starts, it begins a new binary log. We’ll discuss why shortly.
- The binlog file reaches a maximum size
If the binlog file grows too large, it will be automatically rotated. You can control the maximum size of the binlog files using the max-binlog-size server variable.

- The binary log is explicitly flushed
The
FLUSH LOGS
command writes all logs to disk and creates a new file to continue writing the binary log. This can be useful when administering recovery images for PITR. Reading from an open binlog file can have unexpected results, so it is advisable to force an explicit flush before trying to use binlog files for recovery.

- An incident occurred on the server
In addition to stopping altogether, the server can encounter incidents that cause the binary log to be rotated. These incidents sometimes require special manual intervention from the administrator, because they can leave a “gap” in the replication stream. It is easier for the DBA to handle the incident if the server starts on a fresh binlog file after an incident.
The first event of every binlog file is the Format_description
event, which describes the server that wrote the file along with
information about the contents and status of the file.
Three items are of particular interest here:
- The
binlog-in-use
flag Because a crash can occur while the server is writing to a binlog file, it is critical to indicate when a file was closed properly. Otherwise, a DBA could replay a corrupted file on the master or slave and cause more problems. To provide assurance about the file’s integrity, the
binlog-in-use
flag is set when the file is created and cleared after the final event (Rotate
) has been written to the file. Thus, any program can see whether the binlog file was properly closed.

- Binlog file format version
Over the course of MySQL development, the format for the binary log has changed several times, and it will certainly change again. Developers increment the version number for the format when significant changes—notably changes to the common headers—render new files unreadable to previous versions of the server. (The current format, starting with MySQL version 5.0, is version 4.) The binlog file format version field lists its version number; if a different server cannot handle a file with that version, it simply refuses to read the file.
- Server version
This is a string denoting the version of the server that wrote the file. The server version used to run the examples in this chapter was “5.5.31-0ubuntu0.12.04.1,” for instance. As you can see, the string is guaranteed to include the MySQL server version, but it also contains additional information related to the specific build. In some situations, this information can help you or the developers figure out and resolve subtle bugs that can occur when replicating between different versions of the server.
To rotate the binary log safely even in the presence of crashes, the server uses a write-ahead strategy and records its intention in a temporary file called the purge index file (this name was chosen because the file is used while purging binlog files as well, as you will see). Its name is based on that of the index file, so for instance if the name of the index file is master-bin.index, the name of the purge index file is master-bin.~rec~. After creating the new binlog file and updating the index file to point to it, the server removes the purge index file.
Note
In versions of MySQL earlier than 5.1.43, rotation or binlog file purging could leave orphaned files; that is, the files might exist in the filesystem without being mentioned in the index file. Because of this, old files might not be purged correctly, leaving them around and requiring manual cleaning of the files from the directory.
The orphaned files do not cause a problem for replication, but can be considered an annoyance. The procedure shown in this section ensures that no files are orphaned in the event of a crash.
In the event of a crash, if a purge index file is present on the server, the server can compare the purge index file and the index file when it restarts and see what was actually accomplished compared to what was intended.
Before MySQL version 5.6 it was possible that the binary log could be left partially written. This could occur if only a part of the cache was written to the binary log before the server crashed. With MySQL version 5.6, the binary log will be trimmed and the partially written transaction removed from the binary log. This is safe because the transaction is not committed in the storage engine until it has been written completely to the binary log.
Incidents
The term “incidents” refers to events that don’t change data on a server but must be written to the binary log because they have the potential to affect replication. Most incidents don’t require special intervention from the DBA—for instance, servers can stop and restart without changes to database files—but there will inevitably be some incidents that call for special action.
Currently, there are two incident events that you might discover in a binary log:
Stop
Indicates that the server was stopped through normal means. If the server crashed, no stop event will be written, even when the server is brought up again. This event is written in the old binlog file (restarting the server rotates to a new file) and contains only a common header; no other information is provided in the event.
When the binary log is replayed on the slave, it ignores any
Stop
events. Normally, the fact that the server stopped does not require special attention and replication can proceed as usual. If the server was switched to a new version while it was stopped, this will be indicated in the next binlog file, and the server reading the binlog file will then stop if it cannot handle the new version of the binlog format. In this sense, the Stop
event does not represent a “gap” in the replication stream. However, the event is worth recording because someone might manually restore a backup or make other changes to files before restarting replication, and the DBA replaying the file could find this event in order to start or stop the replay at the right time.

Incident
An event type introduced in version 5.1 as a generic incident event. In contrast with the
Stop
event, this event contains an identifier to specify what kind of incident occurred. It is used to indicate that the server was forced to perform actions that almost guarantee that changes are missing from the binary log. For example, incident events in version 5.1 are written if the database was reloaded or if a nontransactional event was too big to fit in the binlog file. MySQL Cluster generates this event when one of the nodes had to reload the database and could therefore be out of sync.
When the binary log is replayed on the slave, it stops with an error if it encounters an
Incident
event. In the case of the MySQL Cluster reload event, it indicates a need to resynchronize the cluster and probably to search for events that are missing from the binary log.
Purging the Binlog File
Over time, the server will accumulate binlog files unless old ones are purged from the filesystem. The server can automatically purge old binary logs from the filesystem, or you can explicitly tell the server to purge the files.
To make the server automatically purge old binlog files, set
the expire-logs-days
option—which is
available as a server variable as well—to the number of days that you
want to keep binlog files. Remember that as with all server variables,
this setting is not preserved between restarts of the server. So if you
want the automatic purging to keep going across restarts, you have to
add the setting to the my.cnf file
for the server.
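A minimal sketch of such a configuration, keeping one week of binlog files:

[mysqld]
# Automatically purge binlog files older than seven days.
expire-logs-days = 7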
To purge the binlog files manually, use the PURGE BINARY LOGS
command, which comes in two forms:
PURGE BINARY LOGS BEFORE
datetime
This form of the command will purge all files that are before the given date. If
datetime
is in the middle of a logfile (which it usually is), all files before the one holdingdatetime
will be purged.PURGE BINARY LOGS TO '
filename
'This form of the command will purge all files that precede the given file. In other words, all files before
filename
in the output fromSHOW MASTER LOGS
will be removed, leavingfilename
as the first binlog file.
Binlog files are purged when the server starts or when a binary
log rotation is done. If the server discovers files that require
purging, either because a file is older than
expire-logs-days
or because a PURGE BINARY LOGS
command was executed, it
will start by writing the files that the server has decided are ripe for
purging to the purge index file (for example, master-bin.~rec~). After that, the files are
removed from the filesystem, and finally the purge index file is removed.
In the event of a crash, the server can continue removing files by comparing the contents of the purge index file and the index file and removing all files that were not removed because of a crash. As you saw earlier, the purge index file is used when rotating as well, so if a crash occurs before the index file can be properly updated, the new binlog file will be removed and then re-created when the rotate is repeated.
The mysqlbinlog Utility
One of the more useful tools available to an administrator is the client program mysqlbinlog. This is a small program that can investigate the contents of binlog files as well as relay logfiles (we will cover the relay logs in Chapter 8). In addition to reading binlog files locally, mysqlbinlog can also fetch binlog files remotely from other servers.
mysqlbinlog is a very useful tool when investigating problems with replication, but you can also use it to implement PITR, as demonstrated in Chapter 3.
The mysqlbinlog tool normally outputs the contents of the binary log in a form that can be executed by sending it to a running server. When statement-based replication is employed, the statements executed are emitted as SQL statements. For row-based replication, which was introduced earlier in this chapter, mysqlbinlog generates some additional data necessary to handle row-based replication. This section focuses entirely on statement-based replication, so we will use the command with options to suppress output needed to handle row-based replication.
Some options to mysqlbinlog will be explained in this section, but for a complete list, consult the online MySQL Reference Manual.
Basic Usage
Let’s start with a simple example where we create a binlog file and then look at it using mysqlbinlog. We will start up a client connected to the master and execute the following commands to see how they end up in the binary log:
mysqld1> RESET MASTER;
Query OK, 0 rows affected (0.01 sec)

mysqld1> CREATE TABLE employee (
      ->    id INT AUTO_INCREMENT,
      ->    name CHAR(64) NOT NULL,
      ->    email CHAR(64),
      ->    password CHAR(64),
      ->    PRIMARY KEY (id)
      -> );
Query OK, 0 rows affected (0.00 sec)

mysqld1> SET @password = PASSWORD('xyzzy');
Query OK, 0 rows affected (0.00 sec)

mysqld1> INSERT INTO employee(name,email,password)
      ->    VALUES ('mats','mats@example.com',@password);
Query OK, 1 row affected (0.01 sec)

mysqld1> SHOW BINARY LOGS;
+--------------------+-----------+
| Log_name           | File_size |
+--------------------+-----------+
| mysqld1-bin.000038 |       670 |
+--------------------+-----------+
1 row in set (0.00 sec)
Let’s now use mysqlbinlog to dump the contents of the binlog file mysqld1-bin.000038, which is where all the commands ended up. The output shown in Example 4-17 has been edited slightly to fit the page.
$ sudo mysqlbinlog \
>   --short-form \
>   --force-if-open \
>   --base64-output=never \
>   /var/lib/mysql1/mysqld1-bin.000038
 1 /*!40019 SET @@session.max_insert_delayed_threads=0*/;
 2 /*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
 3 DELIMITER /*!*/;
 4 ROLLBACK/*!*/;
 5 use test/*!*/;
 6 SET TIMESTAMP=1264227693/*!*/;
 7 SET @@session.pseudo_thread_id=999999999/*!*/;
 8 SET @@session.foreign_key_checks=1, @@session.sql_auto_is_null=1,
   @@session.unique_checks=1, @@session.autocommit=1/*!*/;
 9 SET @@session.sql_mode=0/*!*/;
10 SET @@session.auto_increment_increment=1, @@session.auto_increment_offset=1/*!*/;
11 /*!\C utf8 *//*!*/;
12 SET @@session.character_set_client=8,@@session.collation_connection=8,
   @@session.collation_server=8/*!*/;
13 SET @@session.lc_time_names=0/*!*/;
14 SET @@session.collation_database=DEFAULT/*!*/;
15 CREATE TABLE employee (
16    id INT AUTO_INCREMENT,
17    name CHAR(64) NOT NULL,
18    email CHAR(64),
19    password CHAR(64),
20    PRIMARY KEY (id)
21 ) ENGINE=InnoDB
22 /*!*/;
23 SET TIMESTAMP=1264227693/*!*/;
24 BEGIN
25 /*!*/;
26 SET INSERT_ID=1/*!*/;
27 SET @`password`:=_utf8 0x2A31353141463... COLLATE `utf8_general_ci`/*!*/;
28 SET TIMESTAMP=1264227693/*!*/;
29 INSERT INTO employee(name,email,password)
30    VALUES ('mats','mats@example.com',@password)
31 /*!*/;
32 COMMIT/*!*/;
33 DELIMITER ;
34 # End of log file
35 ROLLBACK /* added by mysqlbinlog */;
36 /*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
To get this output, we use three options:
short-form
With this option, mysqlbinlog prints only information about the SQL statements issued, and leaves out comments with information about the events in the binary log. This option is useful when mysqlbinlog is used only to play back the events to a server. If you want to investigate the binary log for problems, you will need these comments and should not use this option.
force-if-open
If the binlog file is not closed properly, either because the binlog file is still being written to or because the server crashed, mysqlbinlog will print a warning that this binlog file was not closed properly. This option prevents the printing of that warning.
base64-output=never
This prevents mysqlbinlog from printing base64-encoded events. If mysqlbinlog has to print base64-encoded events, it will also print the
Format_description
event of the binary log to show the encoding used. For statement-based replication, this is not necessary, so this option is used to suppress that event.
In Example 4-17, lines 1–4 contain the preamble printed in every output. Line 3 sets a delimiter that is unlikely to occur elsewhere in the file. The delimiter is also designed to appear as a comment in processing languages that do not recognize the setting of the delimiter.
The rollback on line 4 is issued to ensure the output is not accidentally put inside a transaction because a transaction was started on the client before the output was fed into the client.
We can skip momentarily to the end of the output—lines 33–35—to see the counterpart to lines 1–4. They restore the values set in the preamble and roll back any open transaction. This is necessary in case the binlog file was truncated in the middle of a transaction, to prevent any SQL code following this output from being included in a transaction.
The USE
statement on line 5 is printed whenever the database is changed. Even
though the binary log specifies the current database before each SQL
statement, mysqlbinlog shows only the
changes to the current database. When a USE
statement appears, it is the first line of
a new event.
The first line that is guaranteed to be in the output for each
event is SET TIMESTAMP
, as
shown on lines 6 and 23. This statement gives the timestamp when the event started executing in seconds
since the epoch.
Lines 7–14 contain general settings, but like USE
on line 5, they are printed only for the
first event and whenever their values change.
Because the INSERT
statement
on lines 29–30 is inserting into a table with an
autoincrement column using a user-defined variable, the INSERT_ID
session
variable on line 26 and the user-defined variable on line 27 are set
before the statement. This is the result of the Intvar
and User_var
events in the binary log.
If you omit the short-form
option, each event in
the output will be preceded by some comments about the event that generated the lines. You can
see these comments, which start with hash marks (#) in Example 4-18.
$ sudo mysqlbinlog \
>   --force-if-open \
>   --base64-output=never \
>   /var/lib/mysql1/mysqld1-bin.000038
. . .
 1 # at 386
 2 #100123  7:21:33 server id 1  end_log_pos 414   Intvar
 3 SET INSERT_ID=1/*!*/;
 4 # at 414
 5 #100123  7:21:33 server id 1  end_log_pos 496   User_var
 6 SET @`password`:=_utf8 0x2A313531...838 COLLATE `utf8_general_ci`/*!*/;
 7 # at 496
 8 #100123  7:21:33 server id 1  end_log_pos 643   Query   thread_id=6
      exec_time=0   error_code=0
 9 SET TIMESTAMP=1264227693/*!*/;
10 INSERT INTO employee(name,email,password)
11    VALUES ('mats','mats@example.com',@password)
12 /*!*/;
13 # at 643
14 #100123  7:21:33 server id 1  end_log_pos 670   Xid = 218
15 COMMIT/*!*/;
16 DELIMITER ;
17 # End of log file
18 ROLLBACK /* added by mysqlbinlog */;
19 /*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
The first line of the comment gives the byte position of the
event, and the second line contains other information about the event.
Consider, for example, the INSERT
statement line:
# at 496
#100123  7:21:33 server id 1  end_log_pos 643   Query   thread_id=6   exec_time=0   error_code=0
The various parts of the comments have the following meanings:
at 496
The byte position where the event starts (i.e., the first byte of the event).
100123 7:21:33
The timestamp of the event as a
datetime
(date plus time). This is the time when the query started executing or when the events were written to the binary log.
server_id 1
The server ID of the server that generated the event.
end_log_pos 643
The byte position of the event that follows this event. By taking the difference between this value and the position where the event starts, you can get the length of the event (a worked example follows this list).
Query
The type of event. In Example 4-18, you can see several different types of events, such as
User_var
,Intvar
, andXid
.
The fields after these are event-specific, and hence different for
each event. For the Query
event, we
can see two additional fields:
thread_id=6
The ID of the thread that executed the event. This is used to handle thread-specific statements, such as queries that access temporary tables. The thread ID is also used to set the pseudo_thread_id session variable; a line setting this variable is printed when the event is thread-specific and the thread ID differs from the previously printed one.
exec_time=0
The execution time of the query in seconds.
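As noted for end_log_pos, subtracting the position where an event starts from end_log_pos gives the event's length. For the Query event above, which starts at byte 496 with end_log_pos 643, a quick shell check:
$ echo $(( 643 - 496 ))
147
This matches the value you will later see in the Size field of the hexdump of the same event in Example 4-19.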
Example 4-17 and Example 4-18 dump the output of a single file, but mysqlbinlog accepts multiple files as well. If several binlog files are given, they are processed in order.
The files are printed in the order you request them, and there is no checking that the Rotate event ending each file refers to the next file in the sequence. The responsibility for ensuring that these binlog files form part of a real binary log lies with the user.
Thanks to the way the binlog files are named, submitting multiple files to mysqlbinlog—such as by using * as a file-globbing wildcard—is usually not a problem. However, let’s look at what happens when the binlog file
counter, which is used as an extension to the filename, goes from 999999
to 1000000:
$ ls mysqld1-bin.[0-9]*
mysqld1-bin.000007 mysqld1-bin.000011 mysqld1-bin.000039
mysqld1-bin.000008 mysqld1-bin.000035 mysqld1-bin.1000000
mysqld1-bin.000009 mysqld1-bin.000037 mysqld1-bin.999998
mysqld1-bin.000010 mysqld1-bin.000038 mysqld1-bin.999999
As you can see, the last binlog file to be created is listed before the two binlog files that are earlier in binary log order. So it is worth checking the names of the files before you use wildcards.
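If you want to rely on globbing anyway, one defensive approach is to sort the file names numerically by their counter extension before passing them to mysqlbinlog. A sketch (file names illustrative):
$ mysqlbinlog --base64-output=never \
>   $(ls mysqld1-bin.[0-9]* | sort -t . -k 2 -n)
Here sort -t . -k 2 -n orders the files by the numeric value of the extension rather than lexicographically, so mysqld1-bin.1000000 sorts after mysqld1-bin.999999.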
Since your binlog files are usually pretty large, you won’t want to print the entire contents of the binlog files and browse them. Instead, there are a few options you can use to limit the output so that only a range of the events is printed.
start-position=
bytepos
The byte position of the first event to dump. Note that if several binlog files are supplied to mysqlbinlog, this position will be interpreted as the position in the first file in the sequence.
If an event does not start at the position given, mysqlbinlog will still try to interpret the bytes starting at that position as an event, which usually leads to garbage output.
stop-position=
bytepos
The byte position of the last event to print. If no event ends at that position, the last event printed will be the event with a position that precedes
bytepos
. If multiple binlog files are given, the position will be the position of the last file in the sequence.
start-datetime=
datetime
Prints only events that have a timestamp at or after
datetime
. This will work correctly when multiple files are given—if all events of a file are before the datetime
, all events will be skipped—but there is no checking that the events are printed in order according to their timestamps.
stop-datetime=
datetime
Prints only events that have a timestamp before
datetime
. This is an exclusive range, meaning that if an event is marked2010-01-24 07:58:32
and that exact datetime is given, the event will not be printed.
Note that since the timestamp of the event uses the start time of the statement but events are ordered in the binary log based on the commit time, it is possible to have events with a timestamp that comes before the timestamp of the preceding event. Since mysqlbinlog stops at the first event with a timestamp outside the range, there might be events that aren’t displayed because they have timestamps before
datetime
.
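For example, to extract everything recorded during a five-minute window (say, for point-in-time recovery), you can combine the two datetime options like this (the datetimes and file name are illustrative):
$ sudo mysqlbinlog \
>   --base64-output=never \
>   --start-datetime="2010-01-24 07:55:00" \
>   --stop-datetime="2010-01-24 08:00:00" \
>   /var/lib/mysql1/mysqld1-bin.000038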
Reading remote files
In addition to reading files on a local filesystem, the mysqlbinlog utility can read binlog files
from a remote server. It does this by using the same mechanism that
the slaves use to connect to a master and ask for events. This can be
practical in some cases, since it does not require a shell account on
the machine to read the binlog files, just a user on the server
with REPLICATION
SLAVE
privileges.
To handle remote reading of binlog files, include the read-from-remote-server
option
along with a host and user for connecting to the server, and
optionally a port (if different from the default) and a
password.
When reading from a remote server, give just the name of the binlog file, not the full path.
So to read the Query
event
from Example 4-18 remotely, the
command would look something like the following (the server prompts
for a password, but it is not output when you enter it):
$ sudo mysqlbinlog \
>   --read-from-remote-server \
>   --host=master.example.com \
>   --base64-output=never \
>   --user=repl_user --password \
>   --start-position=294 \
>   mysqld1-bin.000038
Enter password:
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;
/*!40019 SET @@session.max_insert_delayed_threads=0*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
DELIMITER /*!*/;
# at 294
#130608 22:09:19 server id 1  end_log_pos 0     Start: binlog v 4, server v 5.5.31-0ubuntu0.12.04.1-log created 130608 22:09:19
# at 294
#130608 22:13:08 server id 1  end_log_pos 362   Query   thread_id=53  exec_time=0   error_code=0
SET TIMESTAMP=1370722388/*!*/;
SET @@session.pseudo_thread_id=53/*!*/;
SET @@session.foreign_key_checks=1, @@session.sql_auto_is_null=0...
SET @@session.sql_mode=0/*!*/;
SET @@session.auto_increment_increment=1, @@session.auto_increment_offset...
/*!\C utf8 *//*!*/;
SET @@session.character_set_client=33,@@session.collation_connection=33...
SET @@session.lc_time_names=0/*!*/;
SET @@session.collation_database=DEFAULT/*!*/;
BEGIN
/*!*/;
# at 362
#130608 22:13:08 server id 1  end_log_pos 390   Intvar
SET INSERT_ID=1/*!*/;
# at 390
#130608 22:13:08 server id 1  end_log_pos 472   User_var
SET @`password`:=_utf8 0x2A31353141463642384333413641413039434643434244333...
# at 472
#130608 22:13:08 server id 1  end_log_pos 627   Query   thread_id=53  exec_time=0
use `test`/*!*/;
SET TIMESTAMP=1370722388/*!*/;
INSERT INTO employee(name, email, password)
   VALUES ('mats', 'mats@example.com', @password)
/*!*/;
# at 627
#130608 22:13:08 server id 1  end_log_pos 654   Xid = 175
COMMIT/*!*/;
DELIMITER ;
# End of log file
ROLLBACK /* added by mysqlbinlog */;
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;
Reading raw binary logfiles
The mysqlbinlog utility is very useful for investigating binary logs, but it can also be used for taking backups of binlog files. This is very useful when you do not have shell access to the machine but do have a user with REPLICATION SLAVE privileges. In this case, the binlog files can be read from the server, but they should not be parsed, just saved to files.
The normal way to use mysqlbinlog for taking a backup of binlog files remotely is:
mysqlbinlog --raw --read-from-remote-server \
   --host=master.example.com --user=repl_user \
   master-bin.000012 master-bin.000013 ...
In this example, the binlog files master-bin.000012 and master-bin.000013 will be read and saved in
the current directory. Note that you have to use
--read-from-remote-server
together with
--raw
. Using --raw
without
--read-from-remote-server
is pointless since that
would be the same as using a plain file copy. The most interesting options for controlling the behavior of mysqlbinlog are described below. You can find a detailed description of how to use mysqlbinlog for backups in the MySQL Reference Manual.
--result-file=
prefix
This option gives the prefix to use when constructing the files to write to. The prefix can be a directory name (with trailing slash), or any other prefix. It defaults to the empty string, so if this option is not used, files will be written using the same name as they have on the master.
--to-last-log
Normally, only the files given on the command line are transferred, but if this option is provided, only the starting binary log file has to be given. After that, mysqlbinlog will transfer all files after the first one.
--stop-never
Do not stop reading after reaching the end of the last log; keep waiting for more events. This option is useful when taking a backup that is to be used for point-in-time recovery. See Backup and MySQL Replication for a detailed treatment of backing up for point-in-time recovery.
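Putting these options together, a continuous remote backup of the binary log might look like the following sketch. The host, user, directory, and starting file are illustrative; --stop-never keeps the connection open so new events are appended as the master writes them:
$ mysqlbinlog --raw --read-from-remote-server \
>   --host=master.example.com --user=repl_user --password \
>   --stop-never --result-file=/backup/binlogs/ \
>   master-bin.000012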
Interpreting Events
Sometimes, the standard information printed by mysqlbinlog is not sufficient for spotting a
problem, so it is necessary to go into the details of the event and
investigate its content. To handle such situations, you can pass the
hexdump
option to tell mysqlbinlog to
write the actual bytes of the events.
Before going into the details of the events, here are some general rules about the format of the data in the binary log:
- Integer data
Integer fields in the binary log are stored in little-endian order, so you have to read integer fields backward. This means that, for example, the 32-bit block 03 01 00 00 represents the hexadecimal number 0x103 (see the example following this list).
- String data
String data is usually stored with both a length field and a terminating null byte. Sometimes the length appears just before the string, and sometimes it is stored in the post header.
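To see the integer rule in action, you can let the shell do the conversion. Reading the block 03 01 00 00 backward gives 0x00000103, which printf converts to decimal:
$ printf '%d\n' 0x00000103
259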
This section will cover the most common events, but an exhaustive
reference concerning the format of all the events is beyond the scope of
this book. Check the Binary Log
section in the MySQL Internals Manual for an exhaustive list of
all the events available and their fields. The most common of all the
events is the Query
event, so let’s concentrate on it first. Example 4-19 shows the output for such
an event.
$ sudo mysqlbinlog \
>   --force-if-open \
>   --hexdump \
>   --base64-output=never \
>   /var/lib/mysql1/mysqld1-bin.000038
. . .
1   # at 496
2   #100123  7:21:33 server id 1  end_log_pos 643
3   # Position  Timestamp   Type   Master ID        Size      Master Pos    Flags
4   # 1f0   6d 95 5a 4b   02   01 00 00 00   93 00 00 00   83 02 00 00   10 00
5   # 203   06 00 00 00 00 00 00 00  04 00 00 1a 00 00 00 40 |................|
6   # 213   00 00 01 00 00 00 00 00  00 00 00 06 03 73 74 64 |.............std|
7   # 223   04 08 00 08 00 08 00 74  65 73 74 00 49 4e 53 45 |.......test.INSE|
8   # 233   52 54 20 49 4e 54 4f 20  75 73 65 72 28 6e 61 6d |RT.INTO.employee|
9   # 243   65 2c 65 6d 61 69 6c 2c  70 61 73 73 77 6f 72 64 |.name.email.pass|
10  # 253   29 0a 20 20 56 41 4c 55  45 53 20 28 27 6d 61 74 |word....VALUES..|
11  # 263   73 27 2c 27 6d 61 74 73  40 65 78 61 6d 70 6c 65 |.mats...mats.exa|
12  # 273   2e 63 6f 6d 27 2c 40 70  61 73 73 77 6f 72 64 29 |mple.com...passw|
13  # 283   6f 72 64 29                                      |ord.|
14  # Query   thread_id=6   exec_time=0   error_code=0
SET TIMESTAMP=1264227693/*!*/;
INSERT INTO employee(name,email,password)
  VALUES ('mats','mats@example.com',@password)
The first two lines and line 14 are comments listing basic
information that we discussed earlier. Notice that when you use the
hexdump
option, the general information and the
event-specific information are split into two lines, whereas they are
merged in the normal output.
Lines 3 and 4 list the common header:
Timestamp
The timestamp of the event as an integer, stored in little-endian format.
Type
A single byte representing the type of the event. Some event types are given in the MySQL Internals Manual, but to get the values for your specific server, you need to look in the source code (in the file sql/log_event.h).
Master ID
The server ID of the server that wrote the event, written as an integer. For the event shown in Example 4-19, the server ID is 1.
Size
The size of the event in bytes, written as an integer.
Master Pos
The same as
end_log_pos
(i.e., the start of the event following this event).
Flags
This field has 16 bits reserved for general flags concerning the event. The field is mostly unused, but it stores the
binlog-in-use
flag. As you can see in Example 4-19, thebinlog-in-use
flag is set, meaning that the binary log was not closed properly (in this case, because we didn’t flush the logs before calling mysqlbinlog).
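If you want to inspect these 19 header bytes yourself, you can dump them straight from the binlog file with a tool such as xxd. A sketch, assuming the Query event at position 496 from Example 4-19 (the path is illustrative):
$ xxd -s 496 -l 19 /var/lib/mysql1/mysqld1-bin.000038
The first 4 bytes are the timestamp, followed by the type byte, the 4-byte server ID, the 4-byte size, the 4-byte position of the next event, and the 2-byte flags, with all integers in little-endian order.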
After the common header come the post header and body for the
event. As already mentioned, an exhaustive coverage of all the events is
beyond the scope of this book, but we will cover the most important and
commonly used events: the Query
and Format_description
log
events.
Query event post header and body
The Query
event is by
far the most used and also the most complicated event
issued by the server. Part of the reason is that it has to carry a lot
of information about the context of the statement when it was
executed. As already demonstrated, integer variables, user variables,
and random seeds are covered using specific events, but it is also
necessary to provide other information, which is part of this
event.
The post header for the Query
event consists of five fields. Recall that these fields are of fixed
size and that the length of the post header is given in the Format_description
event for the binlog
file, meaning that later MySQL versions may add additional fields if
the need should arise.
- Thread ID
A four-byte unsigned integer representing the thread ID that executed the statement. Even though the thread ID is not always necessary to execute the statement correctly, it is always written into the event.
- Execution time
The number of seconds from the start of execution of the query to when it was written to the binary log, expressed as a four-byte unsigned integer.
- Database name length
The length of the database name, stored as an unsigned one-byte integer. The database name is stored in the event body, but the length is given here.
- Error code
The error code resulting from execution of the statement, stored as a two-byte unsigned integer. This field is included because, in some cases, statements have to be logged to the binary log even when they fail.
- Status variables length
The length of the block in the event body storing the status variables, stored as a two-byte unsigned integer. This status block is sometimes used with a
Query
event to store various status variables, such asSQL_MODE
.
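As a quick sanity check of this layout, the thread ID (the first post-header field) can be read directly from the file by skipping the 19-byte common header and reading a 4-byte integer. A sketch assuming the Query event at position 496 from Example 4-19 and a little-endian host, since od prints integers in host byte order:
$ od -An -j $(( 496 + 19 )) -N 4 -t u4 /var/lib/mysql1/mysqld1-bin.000038
          6
The result matches the thread_id=6 shown in the mysqlbinlog comments.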
The event body consists of the following fields, which are all of variable length.
- Status variables
A sequence of status variables. Each status variable is represented by a single integer followed by the value of the status variable. The interpretation and length of each status variable value depends on which status variable it concerns. Status variables are not always present; they are added only when necessary. Some examples of status variables follow:
Q_SQL_MODE_CODE
This status variable contains the SQL mode that was in effect when the statement was executed.
Q_AUTO_INCREMENT
This status variable contains the values of
auto_increment_increment
andauto_increment_offset
used for the statement, assuming that they are not the default of 1.Q_CHARSET
This status variable contains the character set code and collation used by the connection and the server when the statement was executed.
- Current database
The name of the current database, stored as a null-terminated string. Notice that the length of the database name is given in the post header.
- Statement text
The statement that was executed. The length of the statement can be computed from the information in the common header and the post header. This statement is normally identical to the original statement written, but in some cases, the statement is rewritten before it is stored in the binary log. For instance, as you saw earlier in this chapter, triggers and stored procedures are stored with
DEFINER
clauses specified.
Format description event post header and body
The Format_description
event
records important information about the binlog file
format, the event format, and the server. Because it has to remain
robust between versions—it
should still be possible to interpret it even if the binlog format
changes—there are some
restrictions on which changes are allowed.
One of the more important restrictions is that the common header
of both the Format_description
event and the Rotate
event is
fixed at 19 bytes. This means that it is not possible to
extend the event with new fields in the common header.
The post header and event body for the Format_description
event contain the
following fields:
- Binlog file version
The version of the binlog file format used by this file. For MySQL versions 5.0 and later, this is 4.
- Server version string
A 50-byte string storing server version information. This is usually the three-part version number followed by information about the options used for the build, “5.5.31-0ubuntu0.12.04.1,” for instance.
- Creation time
A four-byte integer holding the creation time—the number of seconds since the epoch—of the first binlog file written by the server since startup. For later binlog files written by the server, this field will be zero.
This scheme allows a slave to determine that the server was restarted and that the slave should reset state and temporary data—for example, close any open transactions and drop any temporary tables it has created.
- Common header length
The length of the common header for all events in the binlog file except the
Format_description
andRotate
events. As described earlier, the length of the common header for theFormat_description
andRotate
events is fixed at 19 bytes.
- Post-header lengths
This is the only variable-length field of the
Format_description
log event. It holds an array containing the size of the post header for each event type as a one-byte integer. The value 255 is reserved, so the maximum length of a post header is 254 bytes.
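You can verify some of this from the command line. A binlog file starts with a 4-byte magic number (0xfe followed by the letters "bin"), and the Format_description event begins right after it, so its 2-byte binlog file version field sits at offset 4 + 19 = 23. A sketch, again assuming a little-endian host and an illustrative path:
$ xxd -l 4 /var/lib/mysql1/mysqld1-bin.000038
00000000: fe62 696e                                .bin
$ od -An -j 23 -N 2 -t u2 /var/lib/mysql1/mysqld1-bin.000038
      4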
Binary Log Options and Variables
A set of options and variables allow you to configure a vast number of aspects of binary logging.
Several options control such properties as the name of the binlog files and the index file. Most of these options can be manipulated as server variables as well. Some have already been mentioned earlier in the chapter, but here you will find more details on each:
expire-logs-days=
days
The number of days that binlog files should be kept. Files that are older than the specified number will be purged from the filesystem when the binary log is rotated or the server restarts.
By default this option is 0, meaning that binlog files are never removed.
log-bin[=
basename
]
The binary log is turned on by adding the
log-bin
option in the my.cnf file, as explained in Chapter 3. In addition to turning on the binary log, this option gives a base name for the binlog files; that is, the portion of the filename before the dot. If an extension is provided, it is removed when forming the base name of the binlog files.
If the option is specified without a
basename
, the base name defaults tohost
-bin
wherehost
is the base name—that is, the filename without directory or extension—of the file given by thepid-file
option, which is usually the hostname as given by gethostname(2). For example, ifpid-file
is /usr/run/mysql/master.pid, the default name of the binlog files will be master-bin.000001, master-bin.000002, etc.
Since the default value for the
pid-file
option includes the hostname, it is strongly recommended that you give a value to thelog-bin
option. Otherwise the binlog files will change names when the hostname changes (unlesspid-file
is given an explicit value).
log-bin-index[=
filename
]
Gives a name to the index file. This can be useful if you want to place the index file in a different place from the default.
The default is the same as the base name used for
log-bin
. For example, if the base name used to create binlog files ismaster-bin
, the index file will be named master-bin.index.
Similar to the situation for the
log-bin
option, the hostname will be used for constructing the index filename, meaning that if the hostname changes, replication will break. For this reason, it is strongly recommended that you provide a value for this option.
log-bin-trust-function-creators
When creating stored functions, it is possible to create specially crafted functions that allow arbitrary data to be read and manipulated on the slave. For this reason, creating stored functions requires the SUPER privilege. However, since stored functions are very useful in many circumstances, the DBA might trust anyone with CREATE ROUTINE privileges not to write malicious stored functions. This option lifts the SUPER privilege requirement for creating stored functions (CREATE ROUTINE is still required).
binlog-cache-size=
bytes
The size of the in-memory part of the transaction cache in bytes. The transaction cache is backed by disk, so whenever the size of the transaction cache exceeds this value, the remaining data will go to disk.
This can potentially create a performance problem, so increasing the value of this option can improve performance if you use many large transactions.
Note that just allocating a very large buffer might not be a good idea, because that means that other parts of the server get less memory, which might cause performance degradation.
max-binlog-cache-size=
bytes
Use this option to restrict the size of each transaction in the binary log. Since large transactions can potentially block the binary log for a long time, they will cause other threads to convoy on the binary log and can therefore create a significant performance problem. If the size of a transaction exceeds
bytes
, the statement will be aborted with an error.
max-binlog-size=
bytes
Specifies the size of each binlog file. When writing a statement or transaction would exceed this value, the binlog file is rotated and writing proceeds in a new, empty binlog file.
Notice that if the transaction or statement exceeds
max-binlog-size
, the binary log will be rotated, but the transaction will be written to the new file in its entirety, exceeding the specified maximum. This is because transactions are never split between binlog files.
sync-binlog=
period
Specifies how often to write the binary log to disk using fdatasync(2). The value given is the number of transaction commits for each real call to fdatasync(2). For instance, if a value of 1 is given, fdatasync(2) will be called for each transaction commit, and if a value of 10 is given, fdatasync(2) will be called after each 10 transaction commits.
A value of zero means that there will be no calls to fdatasync(2) at all and that the server trusts the operating system to write the binary log to disk as part of the normal file handling.
read-only
Prevents any client threads—except the slave thread and users with
SUPER
privileges—from updating any data on the server (this does not include temporary tables, which can still be updated). This is useful on slave servers to allow replication to proceed without data being corrupted by clients that connect to the slave.
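To tie these options together, here is a sketch of how the binary logging section of a my.cnf file might look. The values are illustrative, not recommendations:
[mysqld]
log-bin = master-bin
log-bin-index = master-bin.index
expire-logs-days = 7
max-binlog-size = 1G
binlog-cache-size = 32K
sync-binlog = 1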
Options for Row-Based Replication
Use the following options to configure row-based replication:
binlog-format
The
binlog-format
option can be set to use one of the following modes:
STATEMENT
This will use the traditional statement-based replication for all statements.
ROW
This will use the shiny new row-based replication for all statements that insert or change data (data manipulation language, or DML, statements). However, statement-based replication must still be used for statements that create tables or otherwise alter the schema (data definition language, or DDL, statements).
MIXED
This is intended to be a safe version of statement-based replication and is the recommended mode to use with MySQL version 5.1 and later. In mixed-mode replication, the server writes the statements to the binary log as statements, but switches to row-based replication if a statement is considered unsafe through one of the criteria we have discussed in this chapter.
The variable also exists as a global server variable and as a session variable. When starting a new session, the global value is copied to the session variable and then the session variable is used to decide how to write statements to the binary log.
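For example, to make all new sessions use mixed-mode replication while forcing the current session to log rows, you could issue the following (changing the variable at either scope requires SUPER privileges):
mysql> SET GLOBAL binlog_format = MIXED;
mysql> SET SESSION binlog_format = ROW;
mysql> SELECT @@global.binlog_format, @@session.binlog_format;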
binlog-max-row-event-size
Use this option to specify when to start a new event for holding the rows. Because the events are read fully into memory when being processed, this option is a rough way of controlling the size of row-holding events so that not too much memory is used when processing the rows.
binlog-rows-query-log-events
(new in MySQL 5.6)
This option causes the server to add an informational event to the binary log before the row events. The informational event contains the original query that generated the rows.
Note that, because unrecognized events cause the slave to stop,
any slave server before MySQL 5.6.2 will not recognize the event and
hence stop replicating from the master. This means that if you intend to
use an informational event, you need to upgrade all slaves to MySQL
5.6.2 (or later) before enabling
binlog-rows-query-log-events
.
Starting with MySQL 5.6.2, informational events can be added for other purposes as well, but they will not cause the slave to stop. They are intended to allow information to be added to the binary log for those readers (including slaves) that can have use for it, but they should not in any way change the semantics of execution and can therefore be safely ignored by slaves that do not recognize them.
Conclusion
Clearly, there is much to the binary log, including its uses, its composition, and the techniques for working with it. We presented these concepts and more in this chapter, including how to control the binary log’s behavior. The material in this chapter builds a foundation for a deeper understanding of the mechanics of the binary log and its importance in logging changes to data.
[4] In some special cases, covered in How nontransactional statements are logged, nontransactional statements can be part of a group.
[5] MySQL 5.6 supports logging the statement that generated the rows to the binary log together with the rows. See the note in Options for Row-Based Replication.