题目考核知识点:开窗函数中“移动平均”的变体用法(MAX、SUM)


解题思路:

第一步:先计算出每个类别,在每天的总点赞数、总转发数。这一步需要注意一点,video_id 2001和2002属于“旅游”类别,因此不要GROUP BY video_id,而是GROUP BY tag

SELECT i.tag, 
		DATE(l.start_time) AS dt,
        SUM(l.if_like) AS like_per_dt, 
        SUM(l.if_retweet) AS retweet_per_dt
FROM tb_user_video_log AS l INNER JOIN tb_video_info AS i
ON l.video_id = i.video_id
GROUP BY i.tag, DATE(l.start_time)

第二步:针对每个类别,在每一天中,对应的“前七天的总点赞数”、“前七天中的单天最大转发数”,这里就需要用到开窗函数中的移动计算 —— 即,聚合函数与ROWS N PRECEDING的结合

SELECT t.tag,
       t.dt,
       SUM(t.like_per_dt) OVER (PARTITION BY t.tag ORDER BY dt ASC ROWS 6 PRECEDING) AS sum_like_cnt,
       MAX(t.retweet_per_dt) OVER (PARTITION BY t.tag ORDER BY dt ASC ROWS 6 PRECEDING) AS max_retweet_cnt
FROM
    /* 基于第一步的结果表,再嵌套一层查询,对其使用开窗函数,对类别进行分组,组内按照日期升序,并
    进行移动计算 —— 注意,是ROWS 6 PRECEDING 而非 7,因为当前行也会被计算在内,所以ROWS 6其实是
    包含7条记录的 */
        (SELECT i.tag, 
        		DATE(l.start_time) AS dt, 
                SUM(l.if_like) AS like_per_dt, 
                SUM(l.if_retweet) AS retweet_per_dt
        FROM tb_user_video_log AS l INNER JOIN tb_video_info AS i
        ON l.video_id = i.video_id
        GROUP BY i.tag, DATE(l.start_time)) AS t

第三步:基于第二步的结果,从中选取出2021-10-01、10-02、10-03的结果即可

SELECT t2.*
FROM
    (SELECT t.tag,
           t.dt,
           SUM(t.like_per_dt) OVER (PARTITION BY t.tag ORDER BY dt ASC ROWS 6 PRECEDING) AS sum_like_cnt,
           MAX(t.retweet_per_dt) OVER (PARTITION BY t.tag ORDER BY dt ASC ROWS 6 PRECEDING) AS max_retweet_cnt
    FROM
        (SELECT i.tag, 
        		DATE(l.start_time) AS dt, 
                SUM(l.if_like) AS like_per_dt, 
                SUM(l.if_retweet) AS retweet_per_dt
        FROM tb_user_video_log AS l INNER JOIN tb_video_info AS i
        ON l.video_id = i.video_id
        GROUP BY i.tag, DATE(l.start_time)) AS t) AS t2
WHERE t2.dt BETWEEN '2021-10-01' AND '2021-10-03'
ORDER BY t2.tag DESC, t2.dt ASC;

过程中发现另外一个易犯错的点,仅供参考:

开窗函数只能书写在SELECT子句中,并且,当WHERE和GROUP BY也同时出现时,开窗函数是对WHERE、GROUP BY子句处理后的结果进行操作的,即:开窗函数在WHERE、GROUP BY之后才执行

在书写第二步时,本打算在外层嵌套时,直接使用GROUP BY + HAVING来一步到位,实际发现会出现错误的结果:

查询语句如下:

SELECT t.tag,
       t.dt,
       SUM(t.like_per_dt) OVER (PARTITION BY t.tag ORDER BY dt ASC ROWS 6 PRECEDING) AS sum_like_cnt,
       MAX(t.retweet_per_dt) OVER (PARTITION BY t.tag ORDER BY dt ASC ROWS 6 PRECEDING) AS max_retweet_cnt
FROM
  (SELECT i.tag, 
  		  DATE(l.start_time) AS dt, 
          SUM(l.if_like) AS like_per_dt, 
          SUM(l.if_retweet) AS retweet_per_dt
  FROM tb_user_video_log AS l INNER JOIN tb_video_info AS i
  ON l.video_id = i.video_id
  GROUP BY i.tag, DATE(l.start_time)) AS t
/* 想通过GROUP BY tag, dt来直接解决问题,实际上是有错误的 */
GROUP BY t.tag, t.dt
HAVING t.dt BETWEEN '2021-10-01' AND '2021-10-03';

只会得到结果:

tag dt sum_like_cnt max_retweet_cnt
旅游 2021-10-01 2 1
旅游 2021-10-02 3 3
旅游 2021-10-03 4 3

这里的原因很简单,原先第一层子查询,会得出结果(如下),是没问题的:

tag dt like_per_dt retweet_per_dt
旅游 2021-09-24 1 0
旅游 2021-09-25 1 2
旅游 2021-09-26 0 1
旅游 2021-09-27 1 0
旅游 2021-09-28 0 1
旅游 2021-09-29 0 1
旅游 2021-09-30 1 1
旅游 2021-10-01 2 1
旅游 2021-10-02 1 3
旅游 2021-10-03 1 0

而在使用GROUP BY t.tag, t.dt HAVING t.dt BETWEEN '2021-10-01' AND '2021-10-03'之后,实际上会先截取出 2021-10-01、10-02、10-03三个记录行,随后才对三个记录上进行开窗函数的计算,因此也会得出上述的错误结果。

若非需要使用GROUP BY来解决问题,则需像上述第三步一样,在最外一层嵌套查询中使用GROUP BY,而非在第二层就直接使用 —— 但是这种方法依旧不如原方法直接、易阅读,并且再GROUP BY一次其实也是没有必要的,因此下述查询语句我是不推荐的(虽然也能运作):

SELECT t2.*
FROM
    (SELECT t.tag,
           t.dt,
           SUM(t.like_per_dt) OVER (PARTITION BY t.tag ORDER BY dt ASC ROWS 6 PRECEDING) AS sum_like_cnt,
           MAX(t.retweet_per_dt) OVER (PARTITION BY t.tag ORDER BY dt ASC ROWS 6 PRECEDING) AS max_retweet_cnt
    FROM
        (SELECT i.tag, DATE(l.start_time) AS dt, SUM(l.if_like) AS like_per_dt, SUM(l.if_retweet) AS retweet_per_dt
        FROM tb_user_video_log AS l INNER JOIN tb_video_info AS i
        ON l.video_id = i.video_id
        GROUP BY i.tag, DATE(l.start_time)) AS t) AS t2
GROUP BY t2.tag, t2.dt
HAVING t2.dt BETWEEN '2021-10-01' AND '2021-10-03'
ORDER BY t2.tag DESC, t2.dt ASC;