If you rigorously check the results of your queries and
updates, you’ll catch many of the problems that could otherwise go
undetected for weeks and cause a lot of grief when the problems finally
grow too large to miss. But problems do creep up on you. Sometimes a
SELECT
suddenly starts returning wrong
results, but your experiments with the query just confirm there is nothing
wrong with it.
In this case, you need to imitate user actions, but in reverse order, until you find the source of the error. If you are lucky, you will catch the problem in a single step. Usually it will take multiple steps, and sometimes a very long time.
A lot of these issues happen because the data is different on
the master and slave in a replication environment. One common problem is duplicate values where they are supposed
to be unique (e.g., if a user relies on an INSERT ON DUPLICATE KEY
UPDATE
statement but a table has a different structure on the
master and slave). For such setups, the user usually notices the problem
later when SELECT
statements query the
slave, instead of noticing them when the INSERT
takes place. Things become even worse
when this happens during circular replication.
To illustrate this problem, we’ll work with a stored procedure that inserts into a table from a temporary table that was created to hold the results of other selects. This is another example of a common technique when a user wants to handle data from large tables without the risk of modifying data inadvertently or blocking other applications that are using the large tables.
Let’s create our table and populate it with temporary values. In a real application, the temporary table would hold a result set from some calculation that is waiting to be stored in the main table:
CREATE TABLE t1(f1 INT) ENGINE=InnoDB; CREATE TEMPORARY TABLE t2(f1 INT) ENGINE=InnoDB;
Now create values in the temporary table:
INSERT INTO t2 VALUES(1),(2),(3);
The stored routine moves data from the temporary table into the main table. It checks first to make sure something is in the temporary table before doing the move. Our version looks like this:
CREATE PROCEDURE p1() BEGIN DECLARE m INT UNSIGNED DEFAULT NULL; CREATE TEMPORARY TABLE IF NOT EXISTS t2(f1 INT) ENGINE=InnoDB; SELECT MAX(f1) INTO m FROM t2; IF m IS NOT NULL THEN INSERT INTO t1(f1) SELECT f1 FROM t2; END IF; END |
This routine creates the temporary table if it does not exist when the routine is called. This prevents errors that would be caused if the temporary table does not exist, but at the same time leads to new issues, as we will see.
Note
The example uses the MAX
function just to check whether there is at least one row in the table. I
prefer MAX
to COUNT
because InnoDB tables do not store the
number of rows they contain, but calculate this value every time the
COUNT
function is called. Therefore,
MAX(indexed_field)
is faster than
COUNT
.
If a slave restarted after the first insert but before the stored procedure call, the temporary table on the slave would be empty and the main table on the slave would have no data. In that case, we will get the following on the master:
mysql> SELECT * FROM t1;
+------+
| f1 |
+------+
| 1 |
| 2 |
| 3 |
+------+
3 rows in set (0.03 sec)
Whereas on the slave we get:
mysql> SELECT * FROM t1;
Empty set (0.00 sec)
Even worse, if we insert into t1
after the stored procedure call runs, we will have a total mess in the
slave’s data.
Suppose we notice the error in an application that reads data from the main table. We now need to find out how data has been inserted into the slave table: was it a direct update on the slave, or was data replicated from the master?
Warning
MySQL replication does not check data consistency for you, so updates of the same objects using both the SQL replication thread and the user thread on the slave leave the data different from the master, which in turn can lead to failure during later replicated events.
Because we imitated this situation in our example, we know at this point why data corruption happened: the slave was restarted after the first insert but before the stored procedure call. In a real-life situation, issues tend to be noticed later when a user issues a select:
mysql> SELECT * FROM t1;
Empty set (0.00 sec)
When you see unexpected results from SELECT
, you need to find out whether this is
caused by the query itself or by something that went wrong earlier. The
query just shown is so simple that it couldn’t produce errors unless the
table was corrupted, so we must try to go back to see how the table was
modified.
Our generic example is in a replicated environment with a read-only slave, so we can be sure that the wrong data arose in one of two ways: either the master inserted the wrong data, or the data was corrupted during replication.
So check first whether the master has the wrong data:
master> SELECT * FROM t1;
+------+
| f1 |
+------+
| 1 |
| 2 |
| 3 |
+------+
3 rows in set (0.03 sec)
The master’s data is correct, so the source of the problem lies in the replication layer.
But why did it happen? Replication seems to be running fine,[5] so we suspect a logic error on the master. Having discovered a possible source of the problem, you need to analyze the stored procedure and the calls on the master to find a fix.
As I said before, the slave server was restarted after events that insert data into the temporary table were replicated and emptied the temporary table, but before the stored procedure call that selects and inserts data into the main table. So the slave just re-created an empty temporary table and inserted no data.
In this case, you can either switch to row-based replication or rewrite the procedure so it does not rely on the existence of the temporary table. Another approach is to truncate and then refill the table so that a sudden restart will not leave the slave without data.
One might think that this example is very artificial and that you can’t predict when a server will suddenly restart. This is correct, but restarts are sure to happen from time to time. Therefore, you need to worry about such errors.
Actually, a slave replicates binary log events one by one, and when data is created within an atomic event (e.g., a transaction or stored procedure call), the slave would not be affected by this issue. But again, this example was just a simple one to show the concept behind events that do happen in real life.
When you experience a problem on a statement that you know is correct, check what your application did before you saw the problem.
More details on replication errors are in Chapter 5.
We’re done for now with problems related to wrong results. Next we’ll turn to some other problems that occur frequently.
Get MySQL Troubleshooting now with the O’Reilly learning platform.
O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.