问题描述:每日新用户的次日留存率
问题分解:
每日 group by date
新用户 (user_id,date) IN (SELECT user_id,MIN(date) FROM login GROUP BY user_id;
次日留存率=同时符合新用户+次日(第二天)有登录记录 =(user_id,date) IN (SELECT user_id,MIN(date) FROM login GROUP BY user_id AND (user_id,date) IN (SELECT user_id,DATE_ADD(date,INTERVAL -1 DAY) FROM login)
具体代码:
SELECT date,IFNULL(ROUND(SUM(CASE WHEN (user_id,date) IN (SELECT user_id,DATE_ADD(date,INTERVAL -1 DAY) FROM login) AND (user_id,date) IN (SELECT user_id,MIN(date) FROM login GROUP BY user_id) THEN 1 ELSE 0 END)/ SUM(CASE WHEN (user_id,date) IN (SELECT user_id,MIN(date) FROM login GROUP BY user_id) THEN 1 ELSE 0 END),3),0) AS p FROM login GROUP BY date ORDER BY date;
借鉴https://blog.nowcoder.net/n/1d803b9c5b774cc6b43589de947524a0?f=comment
------------------分割线--------------------------------
注意:
SELECT子句中计算新用户次日留存率中的
(user_id,date) IN (SELECT user_id,DATE_ADD(date,INTERVAL -1 DAY) FROM login) AND (user_id,date) IN (SELECT user_id,MIN(date) FROM login GROUP BY user_id)
可能会有聪明的同学可能会用下面的进行代替,然后会发现结果不一样
(user_id,date) IN (SELECT user_id,DATE_ADD(MIN(date),INTERVAL 1 DAY) FROM login GROUP BY user_id
修改后代码如下
SELECT date,IFNULL(ROUND(SUM(case when (user_id,date)in(select user_id,date_add(min(date),interval 1 day)from login group by user_id) then 1 else 0 end)/ SUM(CASE WHEN (user_id,date) IN (SELECT user_id,MIN(date) FROM login GROUP BY user_id) THEN 1 ELSE 0 END),3),0) AS p FROM login GROUP BY date ORDER BY date;
执行结果并非我们所愿
date | p |
---|---|
2020-10-12 | 0.000 |
2020-10-13 | 0.000 |
2020-10-14 | 0.000 |
2020-10-15 | 0.000 |
那原因是什么呢?我一开始也觉得那么替换是没问题,通过DATE_ADD(MIN())确实在保证新用户的同时增加次日留存的约束,但结果完全不正确。
所以我一个个试验,一开始以为是date_add函数和聚合函数MIN无法联合使用(其实是可以的),然后分别将代码中‘分子’部分‘分母’部分语句分别用常数代替排查后,发现分子分母都没问题,这时我对比两个语句的区别
(user_id,date) IN (SELECT user_id,MIN(date) FROM login GROUP BY user_id AND (user_id,date) IN (SELECT user_id,DATE_ADD(date,INTERVAL -1 DAY) FROM login)
(user_id,date) IN (SELECT user_id,DATE_ADD(MIN(date),INTERVAL 1 DAY) FROM login GROUP BY user_id
分别代入数据进行对比,就找到原因了。
首先我们理清整个SQL语句的执行顺序:case中的select子查询——from,group By ——CASE 的判断,SUM求和,ROUND取数
问题就出在CASE表达式的判断这一环节,分子分母部分执行case后的子查询后,执行结果随后和整个SQL子句中的(user_id,date)遍历对比。
date为2020-10-12的记录经过DATE_ADD(……)之后就变成2020-10-13,所以当遍历到2020-10-12时就没有记录可以匹配的上(因为2020-10-12是最早的纪录),属于12的记录被推后到13;而分母部分依然匹配子查询之外遍历的date.
方案2:
SELECT t.date,ROUND(COUNT(DISTINCT l.user_id)/COUNT(t.user_id),3) AS p FROM (SELECT user_id,MIN(date) AS date FROM login GROUP BY user_id) t LEFT JOIN login l ON t.user_id = l.user_id AND l.date = DATE_ADD(t.date,INTERVAL 1 DAY) GROUP BY t.date UNION SELECT date,0 as p FROM login WHERE date NOT IN (SELECT MIN(date) FROM login GROUP BY user_id) ORDER BY date ASC;
方案3:
SELECT date,ROUND(COUNT(d.user_id)/COUNT(c.user_id),3) AS p FROM (SELECT a.date,b.user_id FROM (SELECT DISTINCT l1.date from login l1) a LEFT JOIN (SELECT l2.user_id,MIN(l2.date) as f_date FROM login l2 GROUP BY l2.user_id) b ON a.date = b.f_date) c LEFT JOIN (SELECT l3.user_id FROM login l3 CROSS JOIN login l4 ON l3.user_id = l4.user_id AND DATE_ADD(l3.date,INTERVAL 1 DAY) = l4.date) d ON c.user_id = d.user_id GROUP BY c.date;