做一次忘一次,这次终于整理好思路啦!

先来看看知识点:

  • 滑动窗口函数:min(字段1) over(partition by 字段1 order by 字段2 rows between a and b

a取值:unbounded preceding(划分排序后前面所有行)/ n preceding / current row

b取值:unbounded following / n following/ 0 following(=current row)

注意:1️⃣rows between n preceding and current row 可以简写为:rows n preceding

2️⃣当不对rows进行限制时,默认的是rows unbouded preceding

  • 时间序列窗口函数 :lead(字段1,n[,default]) over(partition by 字段1 order by 字段2

字段1一般是时间,表示将字段1向后移动n行得到的值;default表示当值为空时设定的默认值,可省略。

lead窗口函数常用于解决连续登录问题、次日留存问题,非常重要!

还有一个相对的函数:(表示向前移动n行,其他不变)

lag(字段1,n[,default]) over(partition by 字段1 order by 字段2

看完下面的讲解后,可以再来看看能回答下面两个问题吗?

  • 为什么where date_format(dt,"%Y%m") = '202111'不写在临时表t2中?

因为写在t2中,如果一个用户在11月以前首次登录过一次,min(dt) over(partition by uid)取到的日期依旧是11月的,也就是错把旧用户当成了新用户。

  • if(datediff(lead(dt,1) over(partition by uid order by dt) 1,0) as if_next_retention 这样写是否可以运行通过?

是可以运行通过的,但这并不说明代码严谨,只是牛客的测试用例较少了,也是我第一次犯的错误。

有下面简单的情况:

101 | 2021-11-01

101 | 2021-11-02

102 | 2021-11-02

101 | 2021-11-03

此时三行对应的if_newif_next_retention如下:

101 | 2021-11-01 | 1 | 1

101 | 2021-11-02 | 0 | 1

102 | 2021-11-02 | 1 | 0

101 | 2021-11-03 | 1 | 0

但是101不是2021-11-02次日留存的新用户,就会出错。

整体思路如下:

step1:取出用户的活跃时间表,代码如下:

with t1 as(

select uid, date(in_time) as dt

from tb_user_log

union

select uid, date(out_time)

from tb_user_log)

解释:

1️⃣当in_time 和 out_time跨天时,都算做活跃日,因此需要合并起来。

2️⃣用union自动去重,使得t1表(uid,dt)是唯一的

3️⃣union后的select语句中的date(out_time)不需要别名,直接默认union前的select对应的。

step2:判断用户是否新用户以及是否次日留存。

with t2 as(

select uid

,dt

,if(dt =min(dt) over(partition by uid),1,0) as if_new

,if(datediff(lead(dt,1) over(partition by uid order by dt) and dt =min(dt) over(partition by uid), dt)=1,1,0) as if_next_retention

from t1)

解释:

1️⃣if_new字段:min(dt) over(partition by uid)取出每个用户的最小登录日期,判断它与dt是否相同,相同的话标记为1,否则为0;

2️⃣if_next_retention字段:lead(dt,1) over(partition by uid order by dt) 取出每个用户下一次登录日期, 用datediff函数判断出它 与dt的差是否为1,是1的话就说明次日活跃(但非新用户次日活跃),因此日期差等于1且dt =min(dt) over(partition by uid), dt)=1 才说明是新用户次日活跃,这样标记为1,否则标记为0

3️⃣if_new=1 说明是当日新用户;if_next_retention=1说明是次日留存的新用户。

step3:计算次日留存率

select dt, round(sum(if_next_retention)/sum(if_new),2) as as uv_left_rate

from t2

where date_format(dt,"%Y%m") = '202111'

group by dt

having sum(if_new)>0

order by dt

解释:

1️⃣对dt分组,sum(if_new)是当天活跃新用户,sum(if_next_retention)是次日留存新用户,作比即可。

2️⃣where限制住11月数据

3️⃣having sum(if_new)>0是保证当天没有新用户就不输出了。

完整代码如下:

select dt, round(sum(if_next_retention)/sum(if_new),2) as uv_left_rate

from(

select uid,dt

,if(dt =min(dt) over(partition by uid),1,0) as if_new

,if(datediff(lead(dt,1) over(partition by uid order by dt),dt)=1 and dt =min(dt) over(partition by uid),1,0) as if_next_retention

from(

select uid,

date(in_time) as dt

from tb_user_log

union

select uid,

date(out_time)

from tb_user_log

) as t1

) as t2

where date_format(dt,"%Y%m") = '202111'

group by dt

having sum(if_new)>0

order by dt