Found 3 slaves:
XXX -> 10.10.10.10:3306
YYY -> 10.10.10.11:3306
ZZZ -> 10.10.10.12:3306
Will check slave lag on:
XXX -> 10.10.10.10:3306
YYY -> 10.10.10.11:3306
ZZZ -> 10.10.10.12:3306

Operation, tries, wait:
  analyze_table, 10, 1
  copy_rows, 10, 0.25
  create_triggers, 10, 1
  drop_triggers, 10, 1
  swap_tables, 10, 1
  update_foreign_keys, 10, 1
Altering `gmivideo`.`video`...
Creating new table...
Created new table gmivideo._video_new OK.
Altering new table...
Altered `gmivideo`.`_video_new` OK.
2019-12-31T16:48:53 Creating triggers...
2019-12-31T16:48:54 Created triggers OK.
2019-12-31T16:48:54 Copying approximately 3575122 rows...
2019-12-31T16:59:23 Copied rows OK.
2019-12-31T16:59:23 Analyzing new table...
2019-12-31T16:59:23 Swapping tables...
2019-12-31T16:59:23 Swapped original and new tables OK.
2019-12-31T16:59:23 Dropping old table...
2019-12-31T16:59:24 Dropped old table `gmivideo`.`_video_old` OK.
2019-12-31T16:59:24 Dropping triggers...
2019-12-31T16:59:24 Dropped triggers OK.
# Event              Count
# ================== =====
# INSERT               285
# mysql_warning_1062   181
# unknown_error          1
Successfully altered `gmivideo`.`video`.

 

!!!! Avoid using pt-online-schema-change to alter a table that is referenced by other tables' foreign keys: it can leave the referencing tables pointing at a wrongly renamed table name:

Example: a production incident

 

/usr/bin/pt-online-schema-change --alter "<ALTER clause, without the ALTER TABLE keyword>" --charset <charset of the original table> \
--no-check-unique-key-change --no-check-replication-filters \
--critical-load Threads_connected:2600,Threads_running:1500 --max-load Threads_connected:2600,Threads_running:1500 \
--recurse=1 --check-interval 5 --alter-foreign-keys-method=auto --execute --statistics --max-lag 300 \
--recursion-method=processlist --progress percentage,1 \
--user=<db user> --password=<db password> --host=<primary host> --port=3306 D=<database>,t=<table>

 

pt-online-schema-change --alter "add index idx_token (token)" --charset utf8mb4 --no-check-unique-key-change \
--no-check-replication-filters --critical-load Threads_connected:2600,Threads_running:1500 --max-load Threads_connected:2600,Threads_running:1500 \
--recurse=1 --check-interval 5 --alter-foreign-keys-method=auto --execute --statistics --max-lag 300 \
--recursion-method=processlist --progress percentage,1 --user=user --password=password --host=10.10.10.10 --port=3306 D=card,t=gmivideo_user_9
Found 1 slaves:
10.10.10.10 -> 10.10.10.10:3306
Will check slave lag on:
10.10.10.10 -> 10.10.10.10:3306
Operation, tries, wait:
  analyze_table, 10, 1
  copy_rows, 10, 0.25
  create_triggers, 10, 1
  drop_triggers, 10, 1
  swap_tables, 10, 1
  update_foreign_keys, 10, 1
No foreign keys reference `card`.`table_name`; ignoring --alter-foreign-keys-method.
Altering `card`.`table_name`...
Creating new table...
Created new table card._table_name_new OK.
Altering new table...
Altered `card`.`_table_name_new` OK.
2019-12-31T21:54:14 Creating triggers...
2019-12-31T21:54:15 Created triggers OK.
2019-12-31T21:54:15 Copying approximately 582800 rows...
Copying `card`.`table_name`:   2% 02:18 remain
Copying `card`.`table_name`:   3% 01:41 remain
Copying `card`.`table_name`:   5% 01:22 remain
Copying `card`.`table_name`:   7% 01:10 remain
Copying `card`.`table_name`:   9% 01:03 remain
Copying `card`.`table_name`:  12% 01:05 remain
Copying `card`.`table_name`:  14% 01:00 remain
Copying `card`.`table_name`:  16% 00:56 remain
Copying `card`.`table_name`:  18% 00:52 remain
Copying `card`.`table_name`:  20% 00:49 remain
Copying `card`.`table_name`:  23% 00:46 remain
Copying `card`.`table_name`:  25% 00:44 remain
Copying `card`.`table_name`:  27% 00:42 remain
Copying `card`.`table_name`:  29% 00:42 remain
Copying `card`.`table_name`:  32% 00:40 remain
Copying `card`.`table_name`:  34% 00:38 remain
Copying `card`.`table_name`:  36% 00:36 remain
Copying `card`.`table_name`:  39% 00:34 remain
Copying `card`.`table_name`:  41% 00:32 remain
Copying `card`.`table_name`:  43% 00:30 remain
Copying `card`.`table_name`:  46% 00:30 remain
Copying `card`.`table_name`:  48% 00:28 remain
Copying `card`.`table_name`:  51% 00:26 remain
Copying `card`.`table_name`:  53% 00:25 remain
Copying `card`.`table_name`:  56% 00:23 remain
Copying `card`.`table_name`:  58% 00:22 remain
Copying `card`.`table_name`:  60% 00:20 remain
Copying `card`.`table_name`:  63% 00:19 remain
Copying `card`.`table_name`:  65% 00:17 remain
Copying `card`.`table_name`:  68% 00:16 remain
Copying `card`.`table_name`:  70% 00:15 remain
Copying `card`.`table_name`:  73% 00:13 remain
Copying `card`.`table_name`:  75% 00:12 remain
Copying `card`.`table_name`:  77% 00:11 remain
Copying `card`.`table_name`:  80% 00:10 remain
Copying `card`.`table_name`:  82% 00:08 remain
Copying `card`.`table_name`:  84% 00:07 remain
Copying `card`.`table_name`:  86% 00:06 remain
Copying `card`.`table_name`:  89% 00:05 remain
Copying `card`.`table_name`:  91% 00:04 remain
Copying `card`.`table_name`:  93% 00:03 remain
Copying `card`.`table_name`:  95% 00:02 remain
Copying `card`.`table_name`:  98% 00:00 remain
2019-12-31T21:55:10 Copied rows OK.
2019-12-31T21:55:10 Analyzing new table...
2019-12-31T21:55:10 Swapping tables...
2019-12-31T21:55:10 Swapped original and new tables OK.
2019-12-31T21:55:10 Dropping old table...
2019-12-31T21:55:11 Dropped old table `card`.`_table_name_old` OK.
2019-12-31T21:55:11 Dropping triggers...
2019-12-31T21:55:11 Dropped triggers OK.
# Event              Count
# ================== =====
# INSERT                49
# mysql_warning_1062     5
Successfully altered `card`.`table_name`.
 

 

 

Change executed in production:


shell> /usr/bin/pt-online-schema-change --alter "add index idx_app_id (app_id)" --print --set-vars lock_wait_timeout=60 \
--charset=utf8 --chunk-time 1 --critical-load Threads_connected:10000,Threads_running:5000 \
--max-load Threads_connected:10000,Threads_running:5000 --sleep=1 --recurse=1 --check-interval 5 \
--no-drop-new-table --no-drop-old-table --alter-foreign-keys-method=none --force --execute \
--statistics --max-lag=3 --no-version-check --recursion-method=processlist --progress percentage,1 \
--user=root --password=123456 --host=10.10.10.10 --port=3306 D=database_name,t=table_name

 

 

 

 ##############

1. Only one table's structure can be changed per run.

2. --charset must be the original table's character set. Do not get this wrong; confirm the original table's charset first.

######################

Before MySQL 5.5, schema changes such as adding an index or modifying a column required locking the table, blocking writes for the duration, which is a disaster for large tables. Things improved from 5.5 and especially 5.6 with Online DDL support (see the linked article), yet in practice ALTER TABLE can still run into metadata lock (MDL) contention. pt-online-schema-change is part of Percona Toolkit; by working around native DDL, it alters a table's structure online without locking it.

1. How pt-osc works

  1. Create a new, empty table with the same structure as the table to be altered (the pre-ALTER structure)
  2. Run the ALTER TABLE statement against the new table (fast, since the table is empty)
  3. Create three triggers on the original table, one each for INSERT, UPDATE and DELETE
  4. Copy rows from the original table to the new table in chunks; during the copy, writes made to the original table are applied to the new table through the triggers
  5. RENAME the original table to the old name, then RENAME the temporary table to the original name
  6. If foreign keys reference the table, inspect the referencing tables and handle them according to --alter-foreign-keys-method
  7. Finally, drop the old table (the default)
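The copy-plus-triggers idea above can be sketched as a toy in-memory model (Python, purely illustrative; real pt-osc does all of this in SQL with triggers, chunked INSERT ... SELECT, and RENAME TABLE):

```python
def ptosc_flow(original, writes_mid_copy):
    """original: {pk: row}. writes_mid_copy: (op, pk, row) tuples that hit
    the original table while the chunked copy is running."""
    new = {}                              # step 1: empty shadow table
    # step 2 (ALTER on the empty table) only changes structure, omitted here

    def trigger(op, pk, row):             # step 3: the three "triggers"
        if op == "delete":
            new.pop(pk, None)             # DELETE trigger: DELETE IGNORE
        else:
            new[pk] = row                 # INSERT/UPDATE triggers: REPLACE INTO

    boundaries = sorted(original)         # step 4: chunked copy by primary key
    for i, pk in enumerate(boundaries):
        if pk in original and pk not in new:   # INSERT LOW_PRIORITY IGNORE
            new[pk] = original[pk]
        if i == 0:                        # simulate concurrent writes mid-copy
            for op, wpk, wrow in writes_mid_copy:
                if op == "delete":
                    original.pop(wpk, None)
                else:
                    original[wpk] = wrow
                trigger(op, wpk, wrow)

    return new                            # steps 5-7: swap and drop, omitted

table = {1: "a", 2: "b", 3: "c"}
copied = ptosc_flow(table, [("update", 3, "c2"), ("insert", 4, "d"), ("delete", 2, None)])
print(copied == table, copied)   # True {1: 'a', 3: 'c2', 4: 'd'}
```

The point of the toy: no matter how updates, inserts, and deletes interleave with the chunked copy, the new table ends up identical to the original.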

2. Commonly used options

Only some of the most common options are covered here.

  • --host=xxx --user=xxx --password=xxx
    Connection details; may be abbreviated -h xxx -u xxx -p xxx. The password can be entered interactively with --ask-pass.
  • --alter
    The schema-change clause, without the ALTER TABLE keyword. As with native DDL, multiple changes can be given, separated by commas.
    • In almost all cases the table needs a primary key or unique index, because the tool creates DELETE and UPDATE triggers on the old table to keep the new table current while it runs, and syncing is faster with a primary key. The one exception: when the alter itself creates a primary key on column c1, the DELETE trigger is keyed on c1.
    • The clause does not support RENAME to rename the table.
    • Native ALTER cannot rename an index; you must DROP then ADD it, and the same applies in pt-osc. (MySQL 5.7 supports RENAME INDEX old_index_name TO new_index_name.)
      To rename a column, however, never DROP then ADD, or the whole column's data is lost. Use CHANGE col1 col1_new type constraint, keeping the type and constraints identical; otherwise it counts as a column-type change and cannot be done online.
    • If the clause is ADD COLUMN with NOT NULL, a DEFAULT value must be given, or it will fail.
    • To drop a foreign key (named fk_foo), prefix its name with an underscore: --alter "DROP FOREIGN KEY _fk_foo"
  • D=db_name,t=table_name
    The database and table to run the DDL against.

  • --max-load
    Defaults to Threads_running=25. After each chunk is copied, the tool checks SHOW GLOBAL STATUS to see whether the watched metrics exceed the given thresholds; if so, it pauses. Multiple conditions can be given, comma-separated, each of the form metric=MAX_VALUE or metric:MAX_VALUE. If MAX_VALUE is omitted, the tool sets it to 120% of the metric's current value.
    Because copying rows may lock some of them, Threads_running is an excellent gauge of current database load.
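The threshold parsing and check described here can be sketched as follows (hypothetical helper names; the real tool implements this in Perl against SHOW GLOBAL STATUS):

```python
def parse_max_load(spec, current_status):
    """Parse a spec like 'Threads_running:25,Threads_connected' into {var: max}.
    A bare variable name gets 120% of its current value, as the docs describe."""
    limits = {}
    for item in spec.split(","):
        if ":" in item or "=" in item:
            var, _, val = item.replace("=", ":").partition(":")
            limits[var] = int(val)
        else:
            limits[item] = int(current_status[item] * 1.2)
    return limits

def overloaded(limits, current_status):
    """True if any status variable exceeds its threshold -> pause copying."""
    return any(current_status[v] > m for v, m in limits.items())

status = {"Threads_running": 30, "Threads_connected": 400}
limits = parse_max_load("Threads_running:25,Threads_connected", status)
print(limits)                      # {'Threads_running': 25, 'Threads_connected': 480}
print(overloaded(limits, status))  # True (30 > 25)
```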

  • --max-lag
    Defaults to 1s. After each chunk is copied, the tool checks the lag (Seconds_Behind_Master) of every replica; if any lag exceeds this value, copying pauses until all replicas are below it. Used together with --check-interval, which sets how long to sleep (default 1s) before re-checking once a replica's lag exceeds max-lag, e.g. --max-lag=5 --check-interval=2.
    As anyone familiar with percona-toolkit knows, --recursion-method can supply the replica DSN records. Also note: if a replica is stopped, the tool will wait forever, until replication resumes and the lag drops below the threshold.
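The throttle loop just described can be sketched like this (illustrative only; the real check queries Seconds_Behind_Master on every discovered replica):

```python
import time

def wait_for_replicas(get_lags, max_lag=1, check_interval=1, sleep=time.sleep):
    """Block until every replica's lag is <= max_lag. A stopped replica
    reports lag None and, as noted above, would make us wait forever."""
    waited = 0
    while True:
        lags = get_lags()
        if all(lag is not None and lag <= max_lag for lag in lags):
            return waited
        sleep(check_interval)         # --check-interval
        waited += check_interval

# Simulated replica whose lag drains over successive checks: 5s, 3s, 1s
samples = iter([[5], [3], [1]])
print(wait_for_replicas(lambda: next(samples), max_lag=1, check_interval=2,
                        sleep=lambda s: None))   # 4 (two 2s sleeps before lag <= 1)
```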

  • --chunk-time
    Defaults to 0.5s: while copying rows, the chunk size is adjusted dynamically so that each chunk takes about 0.5s to copy, adapting to changes in server performance.
    Dynamic sizing can be disabled with --chunk-size, which copies a fixed number of rows each time (default 1000 if specified) and takes precedence over chunk-time.
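The dynamic sizing amounts to: next chunk rows ≈ observed copy rate × target chunk time. A minimal sketch (the real tool smooths the rate with a weighted average of recent chunks; this uses only the last one):

```python
def next_chunk_size(prev_rows, prev_seconds, target_time=0.5,
                    min_rows=1, max_rows=500_000):
    """Scale the next chunk so it should take about target_time seconds,
    based on the last chunk's rows/second, clamped to sane bounds."""
    rate = prev_rows / max(prev_seconds, 1e-6)   # rows per second
    return max(min_rows, min(max_rows, int(rate * target_time)))

# Last chunk: 1000 rows took 2s (too slow) -> shrink; 1000 rows in 0.1s -> grow
print(next_chunk_size(1000, 2.0))   # 250
print(next_chunk_size(1000, 0.1))   # 5000
```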

  • --set-vars
    pt-osc runs the DDL in a session of its own; set-vars sets session variables before the alter runs. The default is --set-vars "wait_timeout=10000,innodb_lock_wait_timeout=1,lock_wait_timeout=60".
    DDL through pt-osc is slower, so a change you estimate at 2.5h may well not finish in that time; remember to raise wait_timeout.

  • --dry-run
    Creates and alters the new table, but does not create triggers, copy data, or swap the tables. Nothing is really executed; you can inspect the generated statements to understand the steps and details. Best combined with --print.

  • --execute
    Specify this to actually modify the table, i.e. really run the alter. Exactly one of --dry-run and --execute must be given; they are mutually exclusive.

3. Caveats (limitations)

3.1 The original table must not already have triggers

This is easy to understand: pt-osc creates three triggers on the original table, and a table cannot have two triggers of the same type at once, so for simplicity and generality the tool simply refuses. Supporting tables that already have triggers would be feasible, though: capture the original trigger definitions, rewrite them, and apply them to the new table in the final stage.

3.2 Can writing to the new table via triggers cause inconsistency or errors?

This was actually a worry of mine: if an UPDATE on t1 triggers an update of the new table, but that row has not yet been copied over, isn't that an anomaly? By enabling general_log, you can see the triggers it creates:

6165 Query  CREATE TRIGGER `pt_osc_confluence_sbtest3_del` AFTER DELETE ON `confluence`.`sbtest3`
            FOR EACH ROW DELETE IGNORE FROM `confluence`.`_sbtest3_new` WHERE `confluence`.`_sbtest3_new`.`id` <=> OLD.`id`
6165 Query  CREATE TRIGGER `pt_osc_confluence_sbtest3_upd` AFTER UPDATE ON `confluence`.`sbtest3`
            FOR EACH ROW REPLACE INTO `confluence`.`_sbtest3_new` (`id`, `k`, `c`, `pad`) VALUES (NEW.`id`, NEW.`k`, NEW.`c`, NEW.`pad`)
6165 Query  CREATE TRIGGER `pt_osc_confluence_sbtest3_ins` AFTER INSERT ON `confluence`.`sbtest3`
            FOR EACH ROW REPLACE INTO `confluence`.`_sbtest3_new` (`id`, `k`, `c`, `pad`) VALUES (NEW.`id`, NEW.`k`, NEW.`c`, NEW.`pad`)

And the copy operation is:

6165 Query INSERT LOW_PRIORITY IGNORE INTO `confluence`.`_sbtest3_new` (`id`, `k`, `c`, `pad`)
SELECT `id`, `k`, `c`, `pad` FROM `confluence`.`sbtest3` FORCE INDEX(`PRIMARY`) WHERE ((`id` >= '4692805')) AND ((`id` <= '4718680'))
LOCK IN SHARE MODE /*pt-online-schema-change 46459 copy nibble*/

 

An UPDATE on the original table becomes a REPLACE INTO of the entire row in the new temporary table, achieving update-if-present, insert-if-absent. Combined with the INSERT IGNORE used by the copy, this guarantees a row never fails because of duplication.
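The interplay can be shown with a tiny toy model (Python, not real SQL; `replace_into` and `insert_ignore` are stand-ins for the SQL statements above): whichever side runs first, the trigger's newer version of the row wins.

```python
def replace_into(table, pk, row):    # trigger path: overwrite or insert
    table[pk] = row

def insert_ignore(table, pk, row):   # copy path: keep any existing row
    table.setdefault(pk, row)

# Case 1: the trigger fires before the chunk containing the row is copied
t_new = {}
replace_into(t_new, 1, "v2")         # UPDATE replicated by the trigger
insert_ignore(t_new, 1, "v1")        # stale copy arrives later, ignored
print(t_new[1])                      # v2

# Case 2: the chunk is copied first, then the row is updated
t_new = {}
insert_ignore(t_new, 1, "v1")
replace_into(t_new, 1, "v2")
print(t_new[1])                      # v2
```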

3.3 Why are foreign keys so special?

Suppose t1 is the table being altered, t2 has a foreign key referencing t1, and _t1_new is the new temporary table produced by altering t1.
The foreign keys at issue are not those on t1 itself, but those on the child table t2. The problem arises when t1 is renamed: from t2's point of view, t1 "no longer exists", its foreign key reference breaks, and the rename is not allowed.
pt-osc provides the --alter-foreign-keys-method option to decide how to handle this situation:

  • rebuild_constraints, the preferred method
    • It first rebuilds the references via ALTER TABLE t2 DROP FOREIGN KEY fk1, ADD ... (as _fk1), pointing at the new table
    • Then RENAME TABLE t1 TO t1_old, _t1_new TO t1, swapping the names transparently to clients
    • Then drops the old table t1_old
      However, if the child table t2 is so large that the ALTER might run too long, drop_swap may be forced instead.
      The relevant logic lives mainly in the determine_alter_fk_method, rebuild_constraints and swap_tables functions of the pt-online-schema-change script.
  • drop_swap
    • Disable foreign key checks on the session (FOREIGN_KEY_CHECKS=0)
    • Then DROP the original table t1
    • Then RENAME _t1_new TO t1
      This is faster and does not block requests, but it is risky. First, between the DROP and the RENAME, t1 does not exist, and any query hitting it errors out. Second, if the RENAME fails after the old table is already dropped (a bug, or whatever reason), it is too late, though that is rare.
      Our development standards dictate that even when tables logically reference each other, we do not enforce it with foreign key constraints in the table definitions.
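The auto decision between the two methods can be caricatured as follows (the thresholds and function signature here are made up for illustration; the real logic is in pt-online-schema-change's determine_alter_fk_method):

```python
def choose_fk_method(child_row_counts, chunk_size=1000, max_chunks=5000):
    """Prefer rebuild_constraints, unless rebuilding any child table's
    foreign key would mean too many chunk-sized units of work, in which
    case fall back to drop_swap (faster but riskier, as described above)."""
    for rows in child_row_counts:
        if rows / chunk_size > max_chunks:   # child ALTER would run too long
            return "drop_swap"
    return "rebuild_constraints"

print(choose_fk_method([100_000]))        # rebuild_constraints
print(choose_fk_method([100_000_000]))    # drop_swap
```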

3.4 Assess disk capacity before use

Using OSC roughly doubles the table's disk footprint, indexes included.
Under Row Based Replication, a copy also goes into the binlog. Don't casually use --set-vars to set sql_log_bin=0: at this session level the ALTER statement would then not replicate, and it still has to run on the replicas, unless you have other plans for them.

4. pt-osc versus native 5.6 online DDL: how to choose

  • Online DDL is costly when it must copy the table; avoid it in that case
  • pt-osc cannot be used when the table has triggers
  • For changes to indexes, foreign keys and column names, prefer online DDL and specify ALGORITHM=INPLACE
  • Use pt-osc for everything else, despite the data copy
  • pt-osc is roughly half the speed of online DDL, because it throttles itself based on load
  • Either way, run the change during off-peak hours
  • In special cases, exploit replication: alter the replica first, switch primary and replica, then alter the former primary
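The rules of thumb above can be folded into a tiny helper (a sketch of this article's advice only, not an official Percona recommendation; the function and its arguments are invented for illustration):

```python
def choose_ddl_tool(table_has_triggers, inplace_supported):
    """inplace_supported: True when native online DDL can perform the change
    with ALGORITHM=INPLACE (e.g. index/FK/column-name changes in 5.6+)."""
    if inplace_supported:
        return "online DDL (ALGORITHM=INPLACE)"
    if table_has_triggers:
        return "native DDL in an off-peak window (pt-osc cannot be used)"
    return "pt-osc (slower, but throttled and non-blocking)"

print(choose_ddl_tool(False, True))
print(choose_ddl_tool(True, False))
print(choose_ddl_tool(False, False))
```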

A figure from the Percona blog illustrates the flow:

[Figure: DDL flow]

5. 示例

  • Add a new column
    Full output of the run:

    [root@ssd-34 sysbench]# pt-online-schema-change --user=user --password=password --host=10.0.201.34  --alter "ADD COLUMN f_id int default 0" D=confluence,t=sbtest3 --print --execute
    No slaves found. See --recursion-method if host ssd-34 has slaves.
    Not checking slave lag because no slaves were found and --check-slave-lag was not specified.
    Operation, tries, wait:
    analyze_table, 10, 1
    copy_rows, 10, 0.25
    create_triggers, 10, 1
    drop_triggers, 10, 1
    swap_tables, 10, 1
    update_foreign_keys, 10, 1
    Altering `confluence`.`sbtest3`...
    Creating new table... ==> create the new table
    CREATE TABLE `confluence`.`_sbtest3_new` (
    `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
    `k` int(10) unsigned NOT NULL DEFAULT '0',
    `c` char(120) COLLATE utf8_bin NOT NULL DEFAULT '',
    `pad` char(60) COLLATE utf8_bin NOT NULL DEFAULT '',
    PRIMARY KEY (`id`),
    KEY `k_3` (`k`)
    ) ENGINE=InnoDB AUTO_INCREMENT=5000001 DEFAULT CHARSET=utf8 COLLATE=utf8_bin MAX_ROWS=1000000
    Created new table confluence._sbtest3_new OK.
    Altering new table... ==> alter the new table's structure with the DDL
    ALTER TABLE `confluence`.`_sbtest3_new` ADD COLUMN f_id int default 0
    Altered `confluence`.`_sbtest3_new` OK.
    2016-05-24T20:54:23 Creating triggers... ==> create 3 triggers on the old table
    CREATE TRIGGER `pt_osc_confluence_sbtest3_del` AFTER DELETE ON `confluence`.`sbtest3` FOR EACH ROW
    DELETE IGNORE FROM `confluence`.`_sbtest3_new` WHERE `confluence`.`_sbtest3_new`.`id` <=> OLD.`id`
    CREATE TRIGGER `pt_osc_confluence_sbtest3_upd` AFTER UPDATE ON `confluence`.`sbtest3` FOR EACH ROW
    REPLACE INTO `confluence`.`_sbtest3_new` (`id`, `k`, `c`, `pad`) VALUES (NEW.`id`, NEW.`k`, NEW.`c`, NEW.`pad`)
    CREATE TRIGGER `pt_osc_confluence_sbtest3_ins` AFTER INSERT ON `confluence`.`sbtest3` FOR EACH ROW
    REPLACE INTO `confluence`.`_sbtest3_new` (`id`, `k`, `c`, `pad`) VALUES (NEW.`id`, NEW.`k`, NEW.`c`, NEW.`pad`)
    2016-05-24T20:54:23 Created triggers OK.
    2016-05-24T20:54:23 Copying approximately 4485573 rows... ==> copy data to the new table in chunks
    INSERT LOW_PRIORITY IGNORE INTO `confluence`.`_sbtest3_new` (`id`, `k`, `c`, `pad`)
    SELECT `id`, `k`, `c`, `pad` FROM `confluence`.`sbtest3` FORCE INDEX(`PRIMARY`) WHERE ((`id` >= ?)) AND ((`id` <= ?))
    LOCK IN SHARE MODE /*pt-online-schema-change 44155 copy nibble*/
    SELECT /*!40001 SQL_NO_CACHE */ `id` FROM `confluence`.`sbtest3` FORCE INDEX(`PRIMARY`) WHERE ((`id` >= ?)) ORDER BY `id` LIMIT ?, 2 /*next chunk boundary*/
    Copying `confluence`.`sbtest3`: 36% 00:52 remain
    Copying `confluence`.`sbtest3`: 69% 00:26 remain
    2016-05-24T20:56:01 Copied rows OK.
    2016-05-24T20:56:01 Analyzing new table...
    2016-05-24T20:56:01 Swapping tables... ==> swap the original and new tables
    RENAME TABLE `confluence`.`sbtest3` TO `confluence`.`_sbtest3_old`, `confluence`.`_sbtest3_new` TO `confluence`.`sbtest3`
    2016-05-24T20:56:01 Swapped original and new tables OK.
    2016-05-24T20:56:01 Dropping old table... ==> drop the old table
    DROP TABLE IF EXISTS `confluence`.`_sbtest3_old`
    2016-05-24T20:56:02 Dropped old table `confluence`.`_sbtest3_old` OK.
    2016-05-24T20:56:02 Dropping triggers...
    DROP TRIGGER IF EXISTS `confluence`.`pt_osc_confluence_sbtest3_del`;
    DROP TRIGGER IF EXISTS `confluence`.`pt_osc_confluence_sbtest3_upd`;
    DROP TRIGGER IF EXISTS `confluence`.`pt_osc_confluence_sbtest3_ins`;
    2016-05-24T20:56:02 Dropped triggers OK.
    Successfully altered `confluence`.`sbtest3`.
  • Change a column type

    pt-online-schema-change h=10.0.201.34,P=3306,u=jacky,p=xxx,D=confluence,t=sbtest3 \
    --alter "CHANGE pad f_pad varchar(60) NOT NULL DEFAULT '' " \
    --print --dry-run

    pt-online-schema-change -ujacky -p xxx -h "10.0.201.34" D=confluence,t=sbtest3 \
    --alter "CHANGE pad f_pad varchar(60) NOT NULL DEFAULT '' " \
    --execute
  • Add / drop an index
    Run it in the background:

    pt-online-schema-change --user=user --ask-pass --host=10.0.201.34  \
    --alter "DROP KEY cid, ADD KEY idx_corpid_userid(f_corp_id,f_user_id) " \
    D=confluence,t=sbtest3 --print --execute
  • Change the primary key

Quite a few tables in my environment were designed without an auto-increment id, using composite primary keys instead. pt-osc handles dropping and adding primary keys specially; see here for details.

6. Error handling

1. Triggers already exist

[zx@mysql-5 ~]$ pt-online-schema-change -u user -p password -h 10.0.200.195 \
--alter="MODIFY COLUMN f_receiver varchar NOT NULL DEFAULT '' AFTER f_user_id" \
--dry-run D=db_name,t=table_name
The table `db_name`.`table_name` has triggers. This tool needs to create its own triggers, so the table cannot already have triggers.

 

The table has triggers on it, so pt-osc is not applicable.

2. no-version-check

$ pt-online-schema-change -uuser -ppassword --alter "add key id_provice(f_provice)" \
D=db_name,t=tb_name -h rdsxxxxxx.mysql.rds.aliyuncs.com
Can't use an undefined value as an ARRAY reference at /usr/bin/pt-online-schema-change line 7335.

 

I hit this error when running against Alibaba Cloud RDS. I assumed I had made a syntax mistake, but the same command ran fine against a vanilla 5.6 server; adding the --no-version-check option fixed it. See https://help.aliyun.com/knowledge_detail/13098164.html. I didn't dig deeper; presumably pt gets malformed data back from RDS when it checks the MySQL server version, which breaks the parsing.

 

pt-online-schema-change relies on the trigger mechanism to implement lock-free DDL on a table. Now imagine a primary with one replica and a big table that needs a DDL change: to verify how long the operation takes, you first run pt-online-schema-change against the replica.

Once that checks out, you run the change on the primary. Naturally, before running on the replica you would enable session-level sql_log_bin=0 to keep the change out of the binlog. But the official documentation makes the following clear: in a primary/replica setup with row-format binlogs, the trigger-dependent pt-online-schema-change will cause data loss, because the incremental changes cannot be collected through triggers.

In other words, when the primary's binlog format is row, triggers on the replica do not take effect: the changes produced by the primary's own triggers are already written to the binlog and replicated to the replica, and if the replica's triggers also fired, the rows would be duplicated. In summary: under a replicated architecture, if the primary's binlog format is row, pt-online-schema-change must be run on the primary to avoid primary/replica inconsistency. If the format is statement, you can run pt-online-schema-change on the replica first and then on the primary. That said, the official recommendation today is to set binlog_format to row rather than statement, to avoid certain kinds of primary/replica inconsistency.

 

 

#############################################################

If slow queries are running against the table on the primary, lock wait timeouts will occur.

#############################################################

# pt-online-schema-change --alter "add index idx_action_id (action_id)" --charset utf8 --no-check-unique-key-change --no-check-replication-filters \
--critical-load Threads_connected:2600,Threads_running:1500 --max-load Threads_connected:2600,Threads_running:1500 \
--recurse=1 --check-interval 5 --alter-foreign-keys-method=auto --execute --statistics --max-lag 300 \
--recursion-method=processlist --progress percentage,1 --user=igoodful --password=123456 --host=10.10.10.10 --port=3307 D=apple,t=current
################
No slaves found. See --recursion-method if host 10.10.10.10 has slaves.
Not checking slave lag because no slaves were found and --check-slave-lag was not specified.
Operation, tries, wait:
  analyze_table, 10, 1
  copy_rows, 10, 0.25
  create_triggers, 10, 1
  drop_triggers, 10, 1
  swap_tables, 10, 1
  update_foreign_keys, 10, 1
No foreign keys reference `apple`.`current`; ignoring --alter-foreign-keys-method.
Altering `apple`.`current`...
Creating new table...
Created new table apple._current_new OK.
Altering new table...
Altered `apple`.`_current_new` OK.
2020-08-10T09:44:02 Creating triggers...
2020-08-10T09:54:11 Dropping triggers...
2020-08-10T09:54:11 Dropped triggers OK.
2020-08-10T09:54:11 Dropping new table...
2020-08-10T09:54:11 Dropped new table OK.
# Event             Count
# ================= =====
# INSERT                0
# lock_wait_timeout    10
`apple`.`current` was not altered.
Error creating triggers: 2020-08-10T09:54:11 DBD::mysql::db do failed: Lock wait timeout exceeded; try restarting transaction
[for Statement "CREATE TRIGGER `pt_osc_apple_current_del` AFTER DELETE ON `apple`.`current` FOR EACH ROW DELETE IGNORE FROM `apple`.`_current_new` WHERE `apple`.`_current_new`.`event_id` <=> OLD.`event_id`"]
at /usr/local/bin/pt-online-schema-change line 11133.

 

################################################

If the replica has a slow query:

################################################

1. On the primary:

# (apple) > show tables;
+-----------------+
| Tables_in_apple |
+-----------------+
| current         |
| event           |
| event_receiver  |
+-----------------+
3 rows in set (0.00 sec)

Mon Aug 10 14:40:41 2020
#############################
#############################
2. On the replica:

# (apple) > show tables;
+-----------------+
| Tables_in_apple |
+-----------------+
| _current_new    |
| current         |
| event           |
| event_receiver  |
+-----------------+
4 rows in set (0.00 sec)

Mon Aug 10 14:41:59 2020

# Run a slow query against this table on the replica:
select a.* from event as a inner join current b 
on a.`event_id` = b.`event_id` and a.`event_ts` = b.`event_ts` 
inner join event_receiver c on a.`id` = c.`event_id` 
where (c.`username` = 'zhangli16' and a.`priority` in ('0', '1', '2') and b.`status` = 'PROBLEM') 
group by a.`id` 
order by a.`event_ts` desc limit 100 offset 0;

############################################################################
We find that the replica has only created the new table and applied the schema change to it; the triggers have not been created yet, and the new table holds no data.

Meanwhile, replication lag keeps growing: the replica's SQL thread is blocked, so the binlog pulled from the primary cannot be applied, and every other DDL and DML statement replicated from the primary hangs behind it.

Once the slow query on the replica is killed, things recover quickly.

 

 

$ pt-online-schema-change --alter "add index idx_action_id (action_id)" --print --charset utf8 --chunk-time 1 \
--critical-load Threads_connected:10000,Threads_running:5000 --max-load Threads_connected:10000,Threads_running:5000 \
--recurse=1 --check-interval 5 --alter-foreign-keys-method=none --force --execute --statistics --max-lag 3 \
--recursion-method=processlist --progress percentage,1 --user=igoodful --password=123456 --host=10.10.10.10 --port=3306 --recursion-method=hosts D=apple,t=current
Cannot connect to A=utf8,P=3308,h=,p=...,u=igoodful
No slaves found. See --recursion-method if host 10.10.10.10 has slaves.
Not checking slave lag because no slaves were found and --check-slave-lag was not specified.
Operation, tries, wait:
  analyze_table, 10, 1
  copy_rows, 10, 0.25
  create_triggers, 10, 1
  drop_triggers, 10, 1
  swap_tables, 10, 1
  update_foreign_keys, 10, 1
Not updating foreign keys because --alter-foreign-keys-method=none. Foreign keys that reference the table will no longer work.
Altering `apple`.`current`...
Creating new table...
CREATE TABLE `apple`.`_current_new` (
  `event_id` varchar(64) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',
  `status` varchar(32) COLLATE utf8_unicode_ci DEFAULT NULL,
  `endpoint` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
  `current_step` int(11) DEFAULT NULL,
  `event_ts` varchar(32) COLLATE utf8_unicode_ci DEFAULT NULL,
  `metric` varchar(64) COLLATE utf8_unicode_ci DEFAULT NULL,
  `pushed_tags` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
  `left_val` double DEFAULT NULL,
  `func` varchar(32) COLLATE utf8_unicode_ci DEFAULT NULL,
  `op` varchar(4) COLLATE utf8_unicode_ci DEFAULT NULL,
  `right_val` double DEFAULT NULL,
  `max_step` int(11) DEFAULT NULL,
  `priority` int(11) DEFAULT NULL,
  `note` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
  `tpl_id` bigint(20) DEFAULT NULL,
  `action_id` bigint(20) DEFAULT NULL,
  `expression_id` bigint(20) DEFAULT NULL,
  `strategy_id` bigint(20) DEFAULT NULL,
  `last_values` varchar(1024) COLLATE utf8_unicode_ci DEFAULT NULL,
  `db_ts` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (`event_id`),
  KEY `idx_endpoint` (`endpoint`),
  KEY `idx_metric` (`metric`) USING BTREE,
  KEY `idx_event_ts` (`event_ts`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
Created new table apple._current_new OK.
Altering new table...
ALTER TABLE `apple`.`_current_new` add index idx_action_id (action_id)
Altered `apple`.`_current_new` OK.
2020-08-10T15:14:36 Creating triggers...
2020-08-10T15:14:36 Created triggers OK.
2020-08-10T15:14:36 Copying approximately 908393 rows...
INSERT LOW_PRIORITY IGNORE INTO `apple`.`_current_new` (`event_id`, `status`, `endpoint`, `current_step`, `event_ts`, `metric`, `pushed_tags`, `left_val`, `func`, `op`, `right_val`, `max_step`, `priority`, `note`, `tpl_id`, `action_id`, `expression_id`, `strategy_id`, `last_values`, `db_ts`)
SELECT `event_id`, `status`, `endpoint`, `current_step`, `event_ts`, `metric`, `pushed_tags`, `left_val`, `func`, `op`, `right_val`, `max_step`, `priority`, `note`, `tpl_id`, `action_id`, `expression_id`, `strategy_id`, `last_values`, `db_ts` FROM `apple`.`current` FORCE INDEX(`PRIMARY`)
WHERE ((`event_id` >= ?)) AND ((`event_id` <= ?)) LOCK IN SHARE MODE /*pt-online-schema-change 112747 copy nibble*/
SELECT /*!40001 SQL_NO_CACHE */ `event_id` FROM `apple`.`current` FORCE INDEX(`PRIMARY`) WHERE ((`event_id` >= ?)) ORDER BY `event_id` LIMIT ?, 2 /*next chunk boundary*/
Copying `apple`.`current`:   1% 01:12 remain
Copying `apple`.`current`:   2% 01:13 remain
Copying `apple`.`current`:   3% 01:16 remain
Copying `apple`.`current`:   5% 01:14 remain
Copying `apple`.`current`:   6% 01:13 remain
Copying `apple`.`current`:   7% 01:11 remain
Copying `apple`.`current`:   9% 01:10 remain
Copying `apple`.`current`:  10% 01:09 remain
Copying `apple`.`current`:  11% 01:08 remain
Copying `apple`.`current`:  12% 01:07 remain
Copying `apple`.`current`:  14% 01:06 remain
Copying `apple`.`current`:  15% 01:04 remain
Copying `apple`.`current`:  16% 01:03 remain
Copying `apple`.`current`:  18% 01:02 remain
Copying `apple`.`current`:  19% 01:01 remain
Copying `apple`.`current`:  20% 01:00 remain
Copying `apple`.`current`:  22% 00:59 remain
Copying `apple`.`current`:  23% 01:01 remain
Copying `apple`.`current`:  24% 01:00 remain
Copying `apple`.`current`:  25% 00:59 remain
Copying `apple`.`current`:  27% 00:56 remain
Copying `apple`.`current`:  28% 00:55 remain
Copying `apple`.`current`:  29% 00:54 remain
Copying `apple`.`current`:  31% 00:53 remain
Copying `apple`.`current`:  32% 00:52 remain
Copying `apple`.`current`:  33% 00:51 remain
Copying `apple`.`current`:  35% 00:49 remain
Copying `apple`.`current`:  36% 00:48 remain
Copying `apple`.`current`:  37% 00:49 remain
Copying `apple`.`current`:  39% 00:48 remain
Copying `apple`.`current`:  40% 00:45 remain
Copying `apple`.`current`:  41% 00:44 remain
Copying `apple`.`current`:  43% 00:43 remain
Copying `apple`.`current`:  44% 00:42 remain
Copying `apple`.`current`:  46% 00:40 remain
Copying `apple`.`current`:  47% 00:39 remain
Copying `apple`.`current`:  49% 00:38 remain
Copying `apple`.`current`:  50% 00:37 remain
Copying `apple`.`current`:  52% 00:35 remain
Copying `apple`.`current`:  53% 00:34 remain
Copying `apple`.`current`:  55% 00:33 remain
Copying `apple`.`current`:  56% 00:32 remain
Copying `apple`.`current`:  58% 00:30 remain
Copying `apple`.`current`:  59% 00:29 remain
Copying `apple`.`current`:  61% 00:28 remain
Copying `apple`.`current`:  63% 00:26 remain
Copying `apple`.`current`:  64% 00:25 remain
Copying `apple`.`current`:  66% 00:24 remain
Copying `apple`.`current`:  67% 00:23 remain
Copying `apple`.`current`:  69% 00:22 remain
Copying `apple`.`current`:  70% 00:21 remain
Copying `apple`.`current`:  71% 00:20 remain
Copying `apple`.`current`:  73% 00:19 remain
Copying `apple`.`current`:  74% 00:18 remain
Copying `apple`.`current`:  75% 00:17 remain
Copying `apple`.`current`:  77% 00:16 remain
Copying `apple`.`current`:  78% 00:15 remain
Copying `apple`.`current`:  79% 00:15 remain
Copying `apple`.`current`:  81% 00:14 remain
Copying `apple`.`current`:  82% 00:13 remain
Copying `apple`.`current`:  83% 00:12 remain
Copying `apple`.`current`:  85% 00:11 remain
Copying `apple`.`current`:  86% 00:10 remain
Copying `apple`.`current`:  87% 00:09 remain
Copying `apple`.`current`:  88% 00:08 remain
Copying `apple`.`current`:  90% 00:07 remain
Copying `apple`.`current`:  91% 00:06 remain
Copying `apple`.`current`:  92% 00:05 remain
Copying `apple`.`current`:  94% 00:04 remain
Copying `apple`.`current`:  95% 00:03 remain
Copying `apple`.`current`:  97% 00:02 remain
Copying `apple`.`current`:  98% 00:00 remain
2020-08-10T15:15:52 Copied rows OK.
2020-08-10T15:15:52 Analyzing new table...
2020-08-10T15:15:52 Swapping tables...
RENAME TABLE `apple`.`current` TO `apple`.`_current_old`, `apple`.`_current_new` TO `apple`.`current`
2020-08-10T15:15:52 Swapped original and new tables OK.
2020-08-10T15:15:52 Dropping old table...
SET foreign_key_checks=0
DROP TABLE IF EXISTS `apple`.`_current_old`
2020-08-10T15:15:52 Dropped old table `apple`.`_current_old` OK.
2020-08-10T15:15:52 Dropping triggers...
DROP TRIGGER IF EXISTS `apple`.`pt_osc_apple_current_del`
DROP TRIGGER IF EXISTS `apple`.`pt_osc_apple_current_upd`
DROP TRIGGER IF EXISTS `apple`.`pt_osc_apple_current_ins`
2020-08-10T15:15:52 Dropped triggers OK.
# Event  Count
# ====== =====
# INSERT    76
Successfully altered `apple`.`current`.

##############################################

Command usage:

##############################################

 


/usr/bin/pt-online-schema-change --alter "ADD sequence_id bigint(21) DEFAULT 0 COMMENT 'change-operation sequence number'" \
--print --charset utf8 --chunk-time 1 --critical-load Threads_connected:10000,Threads_running:5000 --max-load Threads_connected:10000,Threads_running:5000 \
--recurse=1 --check-interval 5 --alter-foreign-keys-method=none --force --execute --statistics --max-lag 3 --recursion-method=processlist --progress percentage,1 \
--user=igoodful --password=123456 --host=10.10.10.10 --port=3306 D=apple,t=coupon_config



If the command fails partway, you must drop the newly created table and the triggers left on the original table.
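The leftovers follow the naming conventions visible in the logs above: three triggers named pt_osc_&lt;db&gt;_&lt;table&gt;_{del,upd,ins} and a _&lt;table&gt;_new table, so the cleanup SQL can be generated mechanically. A sketch (the helper function is mine, not part of the toolkit):

```python
def cleanup_statements(db, table):
    """Generate the DROP statements to clean up after a failed pt-osc run,
    following pt-osc's pt_osc_<db>_<table>_{del,upd,ins} / _<table>_new naming."""
    stmts = [f"DROP TRIGGER IF EXISTS `{db}`.`pt_osc_{db}_{table}_{op}`;"
             for op in ("del", "upd", "ins")]
    stmts.append(f"DROP TABLE IF EXISTS `{db}`.`_{table}_new`;")
    return stmts

for s in cleanup_statements("apple", "current"):
    print(s)
```

For the `apple`.`current` example above, this prints DROP TRIGGER statements for the _del, _upd and _ins triggers, then a DROP TABLE for `_current_new`.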

 

Note the INSERT LOW_PRIORITY IGNORE INTO syntax:

insert low_priority ignore into:

With the LOW_PRIORITY keyword, execution of the INSERT is delayed until no other clients are reading from the table.

That includes clients that start reading while existing readers are still active; the INSERT LOW_PRIORITY statement keeps waiting for them too.

Under heavy read load, a client issuing INSERT LOW_PRIORITY may therefore wait a very long time, possibly forever. (LOW_PRIORITY only matters for storage engines that use table-level locking, such as MyISAM; InnoDB ignores it.)