#########################
业务反映某一天这个Lock wait timeout异常特别严重,于是找到我来排查,这不很明显的吗?就是等待锁超时异常了,但是呢,业务不知道哪个地方出现了锁等待,也不清楚什么sql导致的,拉一堆人马在那里等待我救命
于是,我迅雷不及掩耳之势查找了mysql错误日志和中间件日志,业务也说了什么时候出现的,因此直接定位到对应时间段周围:
percona版本的mysql服务的错误日志如下:
2021-07-13T14:21:27.442042+08:00 3262227 [Note] InnoDB: Lock wait timeout info: Requested thread id: 3262227 Requested trx id: 48830402 Requested query: update apple set mac_last_heartbeat = 1626157276000 where ext_id = 105790 Total blocking transactions count: 1 Blocking transaction number: 1 Blocking thread id: 3232905 Blocking query id: 1092250130 Blocking trx id: 48830301
根据错误日志得出:
1)时间是:2021-07-13T14:21:27.442042+08:00,
2)问题是:mysql的3262227会话线程被3232905会话线程给阻塞了
3)3262227会话线程上的什么sql被阻塞了?3262227会话线程上的“update apple set mac_last_heartbeat = 1626157276000 where ext_id = 105790”这个sql语句被阻塞住了,
但是呢,这里只知道3232905会话线程的id,其余信息对解决问题没啥帮助,于是排查kingshard上的业务sql,中间件日志记录了业务发起sql的时间,执行时间,来源ip,目的ip,会话线程id,SQL语句,
因此根据3232905会话线程这个id去找到这个会话在该时间段周围的所有sql,于是如下:
数据库中间件kingshard的日志如下:
2021/07/13 14:21:05 - OK - 1.7ms - 10.10.10.20:44465,conn_id=7479433|COMMIT 2021/07/13 14:21:05 - OK - 1.5ms - 10.10.10.20:44465->10.10.10.10:3306,mysql_connect_id=3232905|select current_timestamp 2021/07/13 14:21:05 - OK - 1.7ms - 10.10.10.20:44465->10.10.10.10:3306,mysql_connect_id=3232905|insert into apple (ext_id, int_id, txnid, mac_db, mac_table, mac_partition, mac_lock_state, mac_lock_type, mac_last_heartbeat, mac_user, mac_host) values (105790, 1,0, 'dw_business', 'dwd_ord_ord_df', null, 'w', 'e', 1626157265000, 'work', 'zjy-hadoop-prc-st2203.bj') 2021/07/13 14:21:05 - OK - 1.6ms - 10.10.10.20:44465->10.10.10.10:3306,mysql_connect_id=3232905|select ext_id, int_id, mac_db, mac_table, mac_partition, mac_lock_state, mac_lock_type, txnid from apple where ext_id = 105790 2021/07/13 14:21:05 - OK - 1.5ms - 10.10.10.20:44465->10.10.10.10:3306,mysql_connect_id=3232905|SAVEPOINT `__66a6cb85_17a856cffba__3b3e` 2021/07/13 14:21:05 - OK - 1.6ms - 10.10.10.20:44465->10.10.10.10:3306,mysql_connect_id=3232905|select ext_id, int_id, mac_db, mac_table, mac_partition, mac_lock_state, mac_lock_type, txnid from apple where mac_db in ('dw_business') and (mac_table is null or mac_table in('dwd_ord_ord_df')) and ext_id <= 105790 2021/07/13 14:21:05 - OK - 1.5ms - 10.10.10.20:44465,conn_id=7479433|ROLLBACK 2021/07/13 14:21:05 - OK - 0.0ms - 10.10.10.20:44465,conn_id=7479433|COMMIT 2021/07/13 14:21:05 - OK - 0.0ms - 10.10.10.20:44465,conn_id=7479433|ROLLBACK 2021/07/13 14:21:05 - OK - 0.0ms - 10.10.10.20:44465->0.0.0.0:6853,conn_id=7479433|SET autocommit=1 2021/07/13 14:21:07 - Connected - conn_id=7543980, autopilot@127.0.0.1:15509 () capability: 238213 2021/07/13 14:21:07 - Quit - conn_id=7543980, 127.0.0.1:15509 2021/07/13 14:21:11 - Connected - conn_id=7543981, autopilot@127.0.0.1:15667 () capability: 238213 2021/07/13 14:21:11 - Quit - conn_id=7543981, 127.0.0.1:15667 2021/07/13 14:21:14 - OK - 0.0ms - 10.10.10.20:43325->0.0.0.0:6853,conn_id=7309803|SET autocommit=0
### 上面日志表示:
# 3232905这个会话线程先执行了select current_timestamp语句,
# 然后执行了insert into apple语句,
# 然后又鬼使神差地执行了SAVEPOINT `__66a6cb85_17a856cffba__3b3e`语句,
# 最后执行了一个select语句
# 然后就没了然后
................
####下面这句是3262227会话线程的执行结果:ERROR,耗时10s多,来源ip是10.10.10.20,这个是业务方的服务所在机器的ip,目的ip为10.10.10.10,这个是mysql集群的主库ip地址,sql语句如下所示:
2021/07/13 14:21:27 - ERROR - 10981.2ms - 10.10.10.20:45954->10.10.10.10:3306,mysql_connect_id=3262227|update apple set mac_last_heartbeat = 1626157276000 where ext_id = 105790
########################