题目考核知识点:开窗函数中“移动平均”的变体用法(MAX、SUM)
解题思路:
第一步:先计算出每个类别,在每天的总点赞数、总转发数。这一步需要注意一点,video_id 2001和2002属于“旅游”类别,因此不要GROUP BY video_id,而是GROUP BY tag
SELECT i.tag,
DATE(l.start_time) AS dt,
SUM(l.if_like) AS like_per_dt,
SUM(l.if_retweet) AS retweet_per_dt
FROM tb_user_video_log AS l INNER JOIN tb_video_info AS i
ON l.video_id = i.video_id
GROUP BY i.tag, DATE(l.start_time)
第二步:针对每个类别,在每一天中,对应的“前七天的总点赞数”、“前七天中的单天最大转发数”,这里就需要用到开窗函数中的移动计算 —— 即,聚合函数与ROWS N PRECEDING的结合
SELECT t.tag,
t.dt,
SUM(t.like_per_dt) OVER (PARTITION BY t.tag ORDER BY dt ASC ROWS 6 PRECEDING) AS sum_like_cnt,
MAX(t.retweet_per_dt) OVER (PARTITION BY t.tag ORDER BY dt ASC ROWS 6 PRECEDING) AS max_retweet_cnt
FROM
/* 基于第一步的结果表,再嵌套一层查询,对其使用开窗函数,对类别进行分组,组内按照日期升序,并
进行移动计算 —— 注意,是ROWS 6 PRECEDING 而非 7,因为当前行也会被计算在内,所以ROWS 6其实是
包含7条记录的 */
(SELECT i.tag,
DATE(l.start_time) AS dt,
SUM(l.if_like) AS like_per_dt,
SUM(l.if_retweet) AS retweet_per_dt
FROM tb_user_video_log AS l INNER JOIN tb_video_info AS i
ON l.video_id = i.video_id
GROUP BY i.tag, DATE(l.start_time)) AS t
第三步:基于第二步的结果,从中选取出2021-10-01、10-02、10-03的结果即可
SELECT t2.*
FROM
(SELECT t.tag,
t.dt,
SUM(t.like_per_dt) OVER (PARTITION BY t.tag ORDER BY dt ASC ROWS 6 PRECEDING) AS sum_like_cnt,
MAX(t.retweet_per_dt) OVER (PARTITION BY t.tag ORDER BY dt ASC ROWS 6 PRECEDING) AS max_retweet_cnt
FROM
(SELECT i.tag,
DATE(l.start_time) AS dt,
SUM(l.if_like) AS like_per_dt,
SUM(l.if_retweet) AS retweet_per_dt
FROM tb_user_video_log AS l INNER JOIN tb_video_info AS i
ON l.video_id = i.video_id
GROUP BY i.tag, DATE(l.start_time)) AS t) AS t2
WHERE t2.dt BETWEEN '2021-10-01' AND '2021-10-03'
ORDER BY t2.tag DESC, t2.dt ASC;
过程中发现另外一个易犯错的点,仅供参考:
开窗函数只能书写在SELECT子句中,并且,当WHERE和GROUP BY也同时出现时,开窗函数是对WHERE、GROUP BY子句处理后的结果进行操作的,即:开窗函数在WHERE、GROUP BY之后才执行
在书写第二步时,本打算在外层嵌套时,直接使用GROUP BY + HAVING来一步到位,实际发现会出现错误的结果:
查询语句如下:
SELECT t.tag,
t.dt,
SUM(t.like_per_dt) OVER (PARTITION BY t.tag ORDER BY dt ASC ROWS 6 PRECEDING) AS sum_like_cnt,
MAX(t.retweet_per_dt) OVER (PARTITION BY t.tag ORDER BY dt ASC ROWS 6 PRECEDING) AS max_retweet_cnt
FROM
(SELECT i.tag,
DATE(l.start_time) AS dt,
SUM(l.if_like) AS like_per_dt,
SUM(l.if_retweet) AS retweet_per_dt
FROM tb_user_video_log AS l INNER JOIN tb_video_info AS i
ON l.video_id = i.video_id
GROUP BY i.tag, DATE(l.start_time)) AS t
/* 想通过GROUP BY tag, dt来直接解决问题,实际上是有错误的 */
GROUP BY t.tag, t.dt
HAVING t.dt BETWEEN '2021-10-01' AND '2021-10-03';
只会得到结果:
tag | dt | sum_like_cnt | max_retweet_cnt |
---|---|---|---|
旅游 | 2021-10-01 | 2 | 1 |
旅游 | 2021-10-02 | 3 | 3 |
旅游 | 2021-10-03 | 4 | 3 |
这里的原因很简单,原先第一层子查询,会得出结果(如下),是没问题的:
tag | dt | like_per_dt | retweet_per_dt |
---|---|---|---|
旅游 | 2021-09-24 | 1 | 0 |
旅游 | 2021-09-25 | 1 | 2 |
旅游 | 2021-09-26 | 0 | 1 |
旅游 | 2021-09-27 | 1 | 0 |
旅游 | 2021-09-28 | 0 | 1 |
旅游 | 2021-09-29 | 0 | 1 |
旅游 | 2021-09-30 | 1 | 1 |
旅游 | 2021-10-01 | 2 | 1 |
旅游 | 2021-10-02 | 1 | 3 |
旅游 | 2021-10-03 | 1 | 0 |
而在使用GROUP BY t.tag, t.dt HAVING t.dt BETWEEN '2021-10-01' AND '2021-10-03'之后,实际上会先截取出 2021-10-01、10-02、10-03三个记录行,随后才对三个记录上进行开窗函数的计算,因此也会得出上述的错误结果。
若非需要使用GROUP BY来解决问题,则需像上述第三步一样,在最外一层嵌套查询中使用GROUP BY,而非在第二层就直接使用 —— 但是这种方法依旧不如原方法直接、易阅读,并且再GROUP BY一次其实也是没有必要的,因此下述查询语句我是不推荐的(虽然也能运作):
SELECT t2.*
FROM
(SELECT t.tag,
t.dt,
SUM(t.like_per_dt) OVER (PARTITION BY t.tag ORDER BY dt ASC ROWS 6 PRECEDING) AS sum_like_cnt,
MAX(t.retweet_per_dt) OVER (PARTITION BY t.tag ORDER BY dt ASC ROWS 6 PRECEDING) AS max_retweet_cnt
FROM
(SELECT i.tag, DATE(l.start_time) AS dt, SUM(l.if_like) AS like_per_dt, SUM(l.if_retweet) AS retweet_per_dt
FROM tb_user_video_log AS l INNER JOIN tb_video_info AS i
ON l.video_id = i.video_id
GROUP BY i.tag, DATE(l.start_time)) AS t) AS t2
GROUP BY t2.tag, t2.dt
HAVING t2.dt BETWEEN '2021-10-01' AND '2021-10-03'
ORDER BY t2.tag DESC, t2.dt ASC;