SQL_SLAVE_SKIP_COUNTER = 1 fails; setting @@gtid_slave_pos to skip past a given GTID position

nel*_*aro 6 mysql replication mariadb gtid

I recently broke replication. When I tried to skip past the bad transaction, I got the following:

MariaDB [(none)]> STOP SLAVE;
Query OK, 0 rows affected (0.05 sec)

MariaDB [(none)]> SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1;
ERROR 1966 (HY000): When using parallel replication and GTID with multiple replication domains, @@sql_slave_skip_counter cannot be used. Instead, setting @@gtid_slave_pos explicitly can be used to skip to after a given GTID position.
MariaDB [(none)]> select @@gtid_slave_pos;
+---------------------------------------------+
| @@gtid_slave_pos                            |
+---------------------------------------------+
| 0-1051-1391406,1-1050-1182069,57-1051-98897 |
+---------------------------------------------+
1 row in set (0.00 sec)

MariaDB [(none)]> show variables like '%_pos%';
+----------------------+---------------------------------------------------------+
| Variable_name        | Value                                                   |
+----------------------+---------------------------------------------------------+
| gtid_binlog_pos      | 0-1051-1391406,2-1051-4474,57-1051-98897                |
| gtid_current_pos     | 0-1051-1391406,1-1050-1182069,2-1051-4474,57-1051-98897 |
| gtid_slave_pos       | 0-1051-1391406,1-1050-1182069,57-1051-98897             |
| wsrep_start_position | 00000000-0000-0000-0000-000000000000:-1                 |
+----------------------+---------------------------------------------------------+

What do I need to do to resolve this?

Update 1

MariaDB [(none)]> show variables like '%gtid%';
+------------------------+------------------------------------------+
| Variable_name          | Value                                    |
+------------------------+------------------------------------------+
| gtid_binlog_pos        | 1-1050-4820789,2-1051-379101,3-1010-3273 |
| gtid_binlog_state      | 1-1050-4820789,2-1051-379101,3-1010-3273 |
| gtid_current_pos       | 1-1050-4819948,2-1051-379101,3-1010-3273 |
| gtid_domain_id         | 3                                        |
| gtid_ignore_duplicates | OFF                                      |
| gtid_seq_no            | 0                                        |
| gtid_slave_pos         | 1-1050-4819948,2-1051-379101,3-1010-3273 |
| gtid_strict_mode       | OFF                                      |
| last_gtid              |                                          |
| wsrep_gtid_domain_id   | 0                                        |
| wsrep_gtid_mode        | OFF                                      |
+------------------------+------------------------------------------+

Following the error message's advice to set @@gtid_slave_pos, I tried the following:

MariaDB [(none)]> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: [redacted]
                  Master_User: [redacted]
                  Master_Port: 3306
                Connect_Retry: 5
              Master_Log_File: binary.000591
          Read_Master_Log_Pos: 526511543
               Relay_Log_File: tmsdb-relay-bin.001239
                Relay_Log_Pos: 4
        Relay_Master_Log_File: binary.000591
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 1062
                   Last_Error: Could not execute Write_rows_v1 event on table [redacted] Duplicate entry '1134890' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log binary.000591, end_log_pos 60726493
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 60724897
              Relay_Log_Space: 465787660
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 1062
               Last_SQL_Error: Could not execute Write_rows_v1 event on table [redacted] Duplicate entry '1134890' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log binary.000591, end_log_pos 60726493
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 1050
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
                   Using_Gtid: Current_Pos
                  Gtid_IO_Pos: 1-1050-4827753,2-1051-379101,3-1010-3273
      Replicate_Do_Domain_Ids: 
  Replicate_Ignore_Domain_Ids: 
                Parallel_Mode: optimistic
1 row in set (0.00 sec)

Using the gtid_slave_pos variable:

MariaDB [(none)]> select @@gtid_slave_pos\G;
*************************** 1. row ***************************
@@gtid_slave_pos: 1-1050-4819948,2-1051-379101,3-1010-3273

MariaDB [(none)]> stop slave;
Query OK, 0 rows affected (0.21 sec)

MariaDB [(none)]> SET GLOBAL gtid_slave_pos='1-1050-4819948,2-1051-379101,3-1010-3274';
Query OK, 0 rows affected (0.10 sec)

MariaDB [(none)]> start slave;
Query OK, 0 rows affected (0.21 sec)

When I checked the status after running the above, I got: Got fatal error 1236 from master when reading data from binary log: 'Error: connecting slave requested to start from GTID 3-1010-3274, which is not in the master's binlog'

MariaDB [(none)]> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: 
                  Master_Host: 10.56.228.64
                  Master_User: maxscale
                  Master_Port: 3306
                Connect_Retry: 5
              Master_Log_File: binary.000591
          Read_Master_Log_Pos: 60724897
               Relay_Log_File: tmsdb-relay-bin.001239
                Relay_Log_Pos: 4
        Relay_Master_Log_File: binary.000591
             Slave_IO_Running: No
            Slave_SQL_Running: Yes
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 60724897
              Relay_Log_Space: 249
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 1236
                Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Error: connecting slave requested to start from GTID 3-1010-3274, which is not in the master's binlog'
               Last_SQL_Errno: 0
               Last_SQL_Error: 
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 1050
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
                   Using_Gtid: Current_Pos
                  Gtid_IO_Pos: 1-1050-4819948,2-1051-379101,3-1010-3274
      Replicate_Do_Domain_Ids: 
  Replicate_Ignore_Domain_Ids: 
                Parallel_Mode: optimistic
1 row in set (0.00 sec)

I was able to restore it to its previous state with:

MariaDB [(none)]> stop slave;
Query OK, 0 rows affected (0.01 sec)

MariaDB [(none)]> SET GLOBAL gtid_slave_pos='1-1050-4819948,2-1051-379101,3-1010-3273';
Query OK, 0 rows affected (0.09 sec)

MariaDB [(none)]> start slave;
Query OK, 0 rows affected (0.06 sec)

nel*_*aro 1

In production, I found that Parallel_Mode was the most likely cause of my problem.

I suggest using a mode other than optimistic.

MariaDB [(none)]> select @@slave_parallel_mode\G
*************************** 1. row ***************************
@@slave_parallel_mode: optimistic

If you get the following error:

pt-slave-restart 
2018-02-09T10:39:19  tmsdb-relay-bin.000388           4 1032 
DBD::mysql::st execute failed: When using parallel replication and GTID with multiple replication domains, @@sql_slave_skip_counter can not be used. Instead, setting @@gtid_slave_pos explicitly can be used to skip to after a given GTID position. [for Statement "SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1"] at /bin/pt-slave-restart line 5122.

In the log I saw the following:

tail /var/log/mariadb.log
2018-02-09 10:35:46 139919003784960 [ERROR] Slave SQL: Could not execute Update_rows_v1 event on table [tablename]; Can't find record in '[tablename]', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log binary.000953, end_log_pos 264325215, Gtid 1-1050-13462991, Internal MariaDB error code: 1032
2018-02-09 10:35:46 139919003784960 [Warning] Slave: Can't find record in '[tablename]' Error_code: 1032
2018-02-09 10:35:46 139919003784960 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'binary.000953' position 262879171; GTID position '1-1050-13462990,2-1051-379101,3-1010-3273'
2018-02-09 10:35:46 139918776985344 [Note] Slave SQL thread exiting, replication stopped in log 'binary.000953' at position 262879171; GTID position '1-1050-13462990,2-1051-379101,3-1010-3273'
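The [ERROR] line above names the GTID of the event that failed (1-1050-13462991), one past the position where the SQL thread stopped in domain 1. As a minimal sketch, assuming the log line format shown above, that GTID can be pulled out with a small Python helper. The name `failing_gtid` is hypothetical and not part of any MariaDB tool.

```python
import re

# Matches "Gtid <domain>-<server>-<sequence>" as it appears in the error log line above.
GTID_RE = re.compile(r"\bGtid (\d+)-(\d+)-(\d+)")

def failing_gtid(log_line):
    """Return (domain_id, server_id, seq_no) for the GTID named in a log line, or None."""
    match = GTID_RE.search(log_line)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())

line = ("[ERROR] Slave SQL: Could not execute Update_rows_v1 event on table [tablename]; "
        "Can't find record in '[tablename]', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; "
        "the event's master log binary.000953, end_log_pos 264325215, "
        "Gtid 1-1050-13462991, Internal MariaDB error code: 1032")
print(failing_gtid(line))  # (1, 1050, 13462991)
```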

To restart the slave after it fails, you can do the following.
Stop all slave_parallel_threads and disable slave_parallel_mode:

MariaDB [(none)]> stop slave;
Query OK, 0 rows affected (0.35 sec)
MariaDB [(none)]> set global slave_parallel_threads = 0;
Query OK, 0 rows affected (0.00 sec)
MariaDB [(none)]> set global slave_parallel_mode = none;
Query OK, 0 rows affected (0.00 sec)
MariaDB [(none)]> Start SLAVE;
Query OK, 0 rows affected (0.00 sec)    
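Going only by the wording of ERROR 1966, the skip counter seems to be refused when parallel replication is active and the GTID position spans more than one replication domain. That would explain why setting slave_parallel_threads to 0 helps. A rough Python check under that assumption follows; `skip_counter_blocked` is a hypothetical helper, and the server's real condition may differ.

```python
def skip_counter_blocked(gtid_slave_pos, parallel_threads):
    """Guess whether sql_slave_skip_counter would be refused.

    Assumption (taken from the ERROR 1966 text, not from the server source):
    it is refused when parallel threads are enabled and the GTID position
    contains more than one replication domain.
    """
    domains = {gtid.split("-")[0] for gtid in gtid_slave_pos.split(",") if gtid}
    return parallel_threads > 0 and len(domains) > 1

pos = "1-1050-13462990,2-1051-379101,3-1010-3273"
print(skip_counter_blocked(pos, 4))  # True: parallel threads enabled, three domains
print(skip_counter_blocked(pos, 0))  # False: after SET GLOBAL slave_parallel_threads = 0
```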

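I now use pt-slave-restart to restart the slave, because when all I want is to get the slave running again, I don't have to think about sequence numbers and a whole bunch of other things that take a long time.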

pt-slave-restart

It will now run without the error, and you can stop it with Ctrl-C once you are happy that your slave has caught up.

This is not much different from running the following by hand, but it does it automatically:

STOP SLAVE;  
SET GLOBAL sql_slave_skip_counter = 1;  
START SLAVE;  
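If you would rather keep parallel replication on, the route the error message points to is setting @@gtid_slave_pos by hand. As a sketch only, and assuming the failing event's GTID is known (for example 1-1050-13462991 from the log above), the new position keeps every other domain unchanged and replaces the entry for the failing domain with the failing GTID. In the question, domain 3 was bumped instead, which appears to be why the master reported that 3-1010-3274 was not in its binlog. `skip_gtid` is a hypothetical helper written for illustration.

```python
def skip_gtid(slave_pos, failing):
    """Return a gtid_slave_pos string that places the slave just after `failing`.

    slave_pos: current @@gtid_slave_pos, e.g. "1-1050-13462990,2-1051-379101"
    failing:   GTID of the event to skip, e.g. "1-1050-13462991"
    """
    fail_domain = failing.split("-")[0]
    result = []
    replaced = False
    for gtid in (g for g in slave_pos.split(",") if g):
        if gtid.split("-")[0] == fail_domain:
            # Same domain as the failing event: move this entry to the failing GTID.
            result.append(failing)
            replaced = True
        else:
            result.append(gtid)
    if not replaced:
        # The domain was not in the position yet, so add it.
        result.append(failing)
    return ",".join(result)

current = "1-1050-13462990,2-1051-379101,3-1010-3273"
new_pos = skip_gtid(current, "1-1050-13462991")
print(f"SET GLOBAL gtid_slave_pos = '{new_pos}';")
```

The printed statement would be run between STOP SLAVE and START SLAVE, as in the question's session. Like the skip counter, this discards the failing transaction, so the underlying data difference between master and slave still needs to be fixed separately.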

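If you need parallel threads, you can re-enable them once the slave has caught up or has moved past the event that caused the problem. I would try a different slave_parallel_mode, such as conservative: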

MariaDB [(none)]> stop slave;
Query OK, 0 rows affected (0.01 sec)
MariaDB [(none)]> set global slave_parallel_threads = 4;
Query OK, 0 rows affected (0.00 sec)
MariaDB [(none)]> set global slave_parallel_mode = conservative;
Query OK, 0 rows affected (0.00 sec)
MariaDB [(none)]> start slave;
Query OK, 0 rows affected (0.09 sec)