问题描述:每日新用户的次日留存率
问题分解

每日 group by date
新用户 (user_id,date) IN (SELECT user_id,MIN(date) FROM login GROUP BY user_id;
次日留存率=同时符合新用户+次日(第二天)有登录记录
=(user_id,date) IN (SELECT user_id,MIN(date) FROM login GROUP BY user_id AND (user_id,date) IN (SELECT user_id,DATE_ADD(date,INTERVAL -1 DAY) FROM login)

具体代码:

SELECT date,IFNULL(ROUND(SUM(CASE WHEN (user_id,date) IN 
                                                        (SELECT user_id,DATE_ADD(date,INTERVAL -1 DAY) FROM login)
                                  AND (user_id,date) IN (SELECT user_id,MIN(date) FROM login GROUP BY user_id) 
                                  THEN 1 ELSE 0 END)/
                          SUM(CASE WHEN (user_id,date) IN (SELECT user_id,MIN(date) FROM login GROUP BY user_id) 
                                   THEN 1 ELSE 0 END),3),0) AS p
FROM login
GROUP BY date
ORDER BY date;

借鉴https://blog.nowcoder.net/n/1d803b9c5b774cc6b43589de947524a0?f=comment
------------------分割线--------------------------------
注意
SELECT子句中计算新用户次日留存率中的

(user_id,date) IN 
(SELECT user_id,DATE_ADD(date,INTERVAL -1 DAY) FROM login)
 AND (user_id,date) IN (SELECT user_id,MIN(date) FROM login GROUP BY user_id)

可能会有聪明的同学可能会用下面的进行代替,然后会发现结果不一样

(user_id,date) IN   
(SELECT user_id,DATE_ADD(MIN(date),INTERVAL 1 DAY) FROM login GROUP BY user_id

修改后代码如下

SELECT date,IFNULL(ROUND(SUM(case when (user_id,date)in(select user_id,date_add(min(date),interval 1 day)from login group by user_id) then 1 else 0 end)/
                                                SUM(CASE WHEN (user_id,date) IN (SELECT user_id,MIN(date) FROM login GROUP BY user_id) THEN 1 ELSE 0 END),3),0) AS p
FROM login
GROUP BY date
ORDER BY date;

执行结果并非我们所愿

date p
2020-10-12 0.000
2020-10-13 0.000
2020-10-14 0.000
2020-10-15 0.000

那原因是什么呢?我一开始也觉得那么替换是没问题,通过DATE_ADD(MIN())确实在保证新用户的同时增加次日留存的约束,但结果完全不正确。
所以我一个个试验,一开始以为是date_add函数和聚合函数MIN无法联合使用(其实是可以的),然后分别将代码中‘分子’部分‘分母’部分语句分别用常数代替排查后,发现分子分母都没问题,这时我对比两个语句的区别

(user_id,date) IN (SELECT user_id,MIN(date) FROM login GROUP BY user_id AND (user_id,date) IN (SELECT user_id,DATE_ADD(date,INTERVAL -1 DAY) FROM login)
(user_id,date) IN   
(SELECT user_id,DATE_ADD(MIN(date),INTERVAL 1 DAY) FROM login GROUP BY user_id

分别代入数据进行对比,就找到原因了。
首先我们理清整个SQL语句的执行顺序:case中的select子查询——from,group By ——CASE 的判断,SUM求和,ROUND取数
问题就出在CASE表达式的判断这一环节,分子分母部分执行case后的子查询后,执行结果随后和整个SQL子句中的(user_id,date)遍历对比。
date为2020-10-12的记录经过DATE_ADD(……)之后就变成2020-10-13,所以当遍历到2020-10-12时就没有记录可以匹配的上(因为2020-10-12是最早的纪录),属于12的记录被推后到13;而分母部分依然匹配子查询之外遍历的date.

方案2:

SELECT t.date,ROUND(COUNT(DISTINCT l.user_id)/COUNT(t.user_id),3) AS p
FROM (SELECT user_id,MIN(date) AS date FROM login GROUP BY user_id) t
LEFT JOIN login l
ON t.user_id = l.user_id
AND l.date = DATE_ADD(t.date,INTERVAL 1 DAY)
GROUP BY t.date
UNION
SELECT date,0 as p
FROM login 
WHERE date NOT IN (SELECT MIN(date) FROM login GROUP BY user_id)
ORDER BY date ASC;

方案3:

SELECT date,ROUND(COUNT(d.user_id)/COUNT(c.user_id),3) AS p
FROM (SELECT a.date,b.user_id 
            FROM (SELECT DISTINCT l1.date from login l1) a
            LEFT JOIN (SELECT l2.user_id,MIN(l2.date) as f_date FROM login l2 
            GROUP BY l2.user_id) b
            ON a.date = b.f_date) c
LEFT JOIN (SELECT l3.user_id
                        FROM login l3 CROSS JOIN login l4
                        ON l3.user_id = l4.user_id
                        AND DATE_ADD(l3.date,INTERVAL 1 DAY) = l4.date) d
ON c.user_id = d.user_id
GROUP BY c.date;