分组
关于分组统计的一点思考。分组或者叫聚合运算,相当于map/reduce中的reduce,字段运算相当于map。关于聚合运算中的一些常见用途,比如去重,要用到的几个进行分组,然后再此结果,则查询一次。
分组
聚集表达式
aggregate_name (expression [ , ... ] [ order_by_clause ] ) [ FILTER ( WHERE filter_clause ) ]
aggregate_name (ALL expression [ , ... ] [ order_by_clause ] ) [ FILTER ( WHERE filter_clause ) ]
aggregate_name (DISTINCT expression [ , ... ] [ order_by_clause ] ) [ FILTER ( WHERE filter_clause ) ]
aggregate_name ( * ) [ FILTER ( WHERE filter_clause ) ]
aggregate_name ( [ expression [ , ... ] ] ) WITHIN GROUP ( order_by_clause ) [ FILTER ( WHERE filter_clause ) ]
3个分组函数
http://postgres.cn/docs/13/queries-table-expressions.html#QUERIES-GROUPING-SETS
另参见Greenplum企业应用实践
grouping sets
group by grouping sets((c1),(c2))
rollup
group by rollup(c1,c2,c3)
类似于
1 2 3 4 5
1 2 3 4
1 2 3
1 2
1
空
cube
group by rollup(c1,c2,c3)
取组合 3、2、1、0个元素。
聚集函数
http://postgres.cn/docs/13/functions-aggregate.html
实战
最分组,最早、最晚
场景:以订单号为分组,取每条订单号,最早的操作人、最晚的操作人。
准备数据如下:
CREATE table mydemo (order_id int,oper_id int,created_time timestamp without time zone);
INSERT INTO mydemo
VALUES
(1, 1, '2021-09-16 16:09'),
(1, 3, '2021-09-16 16:10'),
(1, 2, '2021-09-16 16:11'),
(2, 3, '2021-09-16 16:12'),
(2, 1, '2021-09-16 16:13'),
(2, 1, '2021-09-16 16:14');
SELECT * from mydemo;
-- 先测试,将时间转换为统一的文本格式,然后跟要取的字段拼接在一起。
SELECT
order_id,
max(to_char(created_time,'YYYYMMDDHH24MISS')||oper_id ) as max_oper,
min(to_char(created_time,'YYYYMMDDHH24MISS')||oper_id ) as min_oper
FROM mydemo
GROUP BY order_id ;
最终实现代码如下:
SELECT order_id,
substring(max_oper from 15) as lastet_oper,
substring(min_oper from 15) as earlist_oper
FROM (
SELECT
order_id,
max(to_char(created_time,'YYYYMMDDHH24MISS')||oper_id ) as max_oper,
min(to_char(created_time,'YYYYMMDDHH24MISS')||oper_id ) as min_oper
FROM mydemo
GROUP BY order_id
) a ;
需要注意的点,文字排序
合理的字符排序,是上面的正确输出的保证。
-- 字符串比较顺序,输出的是 1111 简单理解,按字符排序
SELECT max(id) from unnest(ARRAY['1111','11102']) as id;
- 最优结果
聚合的时候,居然还能用到按其他字段排序,这不就是我想要的功能嘛!!!!
参见:手册4.2.7聚集表达式
http://postgres.cn/docs/13/sql-expressions.html#SYNTAX-AGGREGATES
SELECT
order_id,
array_agg(oper_id order by created_time ASC)
FROM mydemo
GROUP BY order_id ;
- 其他实现
其他的实现,有考虑过,分组、排序后,将操作人字段合并成一个字段。但是发现,需要对订单排序,然后还要能组出数组的第一个,最后一个。数组的集合array_agg还必须是有序的……。
其他方法,待挖掘。
网上其他老哥,帮忙给的方案,(数据格式跟我的格式不太一样)。
select *
from mydemo
join (
select row_number() over(partition by order_id order by create_time) as min_row,
row_number() over(partition by order_id order by create_time desc) as max_row,
*
from mydemo ) as compute_mydemo on (compute_mydemo.id = mydemo.id)
where compute_mydemo.min_row =1 or compute_mydemo.max_row=1
另见行构造器
貌似也能解决部分问题。
select array_agg(ROW(ship_id,scan_site,ins_db_tm)) from test_guiji2 group by ship_id limit 10;
聚合字段+去重
其他
有序集聚集的调用例子:
SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY income) FROM households;
percentile_cont
-----------------
50489
如果指定了FILTER,那么只有对*filter_clause*计算为真的输入行会被交给该聚集函数,其他行会被丢弃。例如:
SELECT
count(*) AS unfiltered,
count(*) FILTER (WHERE i < 5) AS filtered
FROM generate_series(1,10) AS s(i);
unfiltered | filtered
------------+----------
10 | 4
(1 row)