分组

关于分组统计的一点思考。分组或者叫聚合运算,相当于map/reduce中的reduce,字段运算相当于map。关于聚合运算中的一些常见用途,比如去重,要用到的几个进行分组,然后再此结果,则查询一次。

分组

聚集表达式

aggregate_name (expression [ , ... ] [ order_by_clause ] ) [ FILTER ( WHERE filter_clause ) ]
aggregate_name (ALL expression [ , ... ] [ order_by_clause ] ) [ FILTER ( WHERE filter_clause ) ]
aggregate_name (DISTINCT expression [ , ... ] [ order_by_clause ] ) [ FILTER ( WHERE filter_clause ) ]
aggregate_name ( * ) [ FILTER ( WHERE filter_clause ) ]
aggregate_name ( [ expression [ , ... ] ] ) WITHIN GROUP ( order_by_clause ) [ FILTER ( WHERE filter_clause ) ]

3个分组函数

http://postgres.cn/docs/13/queries-table-expressions.html#QUERIES-GROUPING-SETS

另参见Greenplum企业应用实践

grouping sets

group by grouping sets((c1),(c2))

rollup

group by rollup(c1,c2,c3)

类似于

1 2 3 4 5 
1 2 3 4 
1 2 3 
1 2 
1
空

cube

group by rollup(c1,c2,c3)

取组合 3、2、1、0个元素。

聚集函数

http://postgres.cn/docs/13/functions-aggregate.html

实战

最分组,最早、最晚

场景:以订单号为分组,取每条订单号,最早的操作人、最晚的操作人。

准备数据如下:

CREATE table mydemo (order_id int,oper_id int,created_time timestamp without time zone);

INSERT INTO mydemo
VALUES
	(1, 1, '2021-09-16 16:09'),
	(1, 3, '2021-09-16 16:10'),
	(1, 2, '2021-09-16 16:11'),
	(2, 3, '2021-09-16 16:12'),
	(2, 1, '2021-09-16 16:13'),
	(2, 1, '2021-09-16 16:14');
	
SELECT * from mydemo;

-- 先测试,将时间转换为统一的文本格式,然后跟要取的字段拼接在一起。
SELECT 
    order_id,
    max(to_char(created_time,'YYYYMMDDHH24MISS')||oper_id ) as max_oper,
    min(to_char(created_time,'YYYYMMDDHH24MISS')||oper_id ) as min_oper
FROM mydemo
GROUP BY order_id ;

最终实现代码如下:

SELECT order_id,
    substring(max_oper from 15) as lastet_oper,
    substring(min_oper from 15) as earlist_oper
FROM (
	SELECT 
			order_id,
			max(to_char(created_time,'YYYYMMDDHH24MISS')||oper_id ) as max_oper,
			min(to_char(created_time,'YYYYMMDDHH24MISS')||oper_id ) as min_oper
	FROM mydemo
	GROUP BY order_id 
) a ;
  • 需要注意的点,文字排序

    合理的字符排序,是上面的正确输出的保证。

-- 字符串比较顺序,输出的是 1111  简单理解,按字符排序
SELECT max(id) from unnest(ARRAY['1111','11102']) as id;
  • 最优结果

聚合的时候,居然还能用到按其他字段排序,这不就是我想要的功能嘛!!!!

参见:手册4.2.7聚集表达式

http://postgres.cn/docs/13/sql-expressions.html#SYNTAX-AGGREGATES

SELECT 
    order_id,
    array_agg(oper_id order by created_time ASC)
FROM mydemo
GROUP BY order_id ;
  • 其他实现

其他的实现,有考虑过,分组、排序后,将操作人字段合并成一个字段。但是发现,需要对订单排序,然后还要能组出数组的第一个,最后一个。数组的集合array_agg还必须是有序的……。

其他方法,待挖掘。

网上其他老哥,帮忙给的方案,(数据格式跟我的格式不太一样)。

select *
from mydemo
join (
select row_number() over(partition by order_id order by create_time) as min_row,
          row_number() over(partition by order_id order by create_time desc) as max_row,
          *
from mydemo ) as compute_mydemo on (compute_mydemo.id = mydemo.id)
where compute_mydemo.min_row =1 or compute_mydemo.max_row=1

另见行构造器

貌似也能解决部分问题。

select array_agg(ROW(ship_id,scan_site,ins_db_tm)) from test_guiji2 group by ship_id limit 10;

聚合字段+去重

其他

有序集聚集的调用例子:

SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY income) FROM households;
 percentile_cont
-----------------
           50489

如果指定了FILTER,那么只有对*filter_clause*计算为真的输入行会被交给该聚集函数,其他行会被丢弃。例如:

SELECT
    count(*) AS unfiltered,
    count(*) FILTER (WHERE i < 5) AS filtered
FROM generate_series(1,10) AS s(i);
 unfiltered | filtered
------------+----------
         10 |        4
(1 row)