PG data computation practice
Notes on operational SQL against GP (Greenplum): which statements burn the most time, and some common techniques. For example, how to deduplicate, and where deduplication costs the most. Counterintuitively, more query conditions do not necessarily mean more time; because the conditions shrink the data, the total time can actually drop.
Use joins to cut the amount of data that has to be deduplicated.
Every flavor of SQL subquery can be rewritten as a join. SQL is also a classic map/reduce model: "map" is the per-column computation, and "reduce" is the grouping and aggregation.
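The map/reduce reading of SQL can be sketched in a few lines of Python (a toy illustration, not how GP executes anything): the "map" step is the per-row expression work of the SELECT list, and the "reduce" step is GROUP BY plus the aggregate.

```python
from collections import defaultdict

# toy rows: (group key, value)
rows = [("A", 1), ("A", 2), ("B", 3)]

# "map": per-row computation, like expressions in a SELECT list
mapped = [(k, v * 10) for k, v in rows]

# "reduce": grouping + aggregation, like GROUP BY k ... SUM(v * 10)
totals = defaultdict(int)
for k, v in mapped:
    totals[k] += v

print(dict(totals))  # {'A': 30, 'B': 30}
```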
Operations
DISTRIBUTED
First, the source table has DISTRIBUTED BY (ship_id). Select rows matching certain conditions into a new table, and see how the distribution clause affects the runtime.
DROP TABLE IF EXISTS test_scan_qian;
CREATE TABLE test_scan_qian WITH (
APPENDONLY = TRUE,
COMPRESSLEVEL = 5,
ORIENTATION = COLUMN,
COMPRESSTYPE = ZLIB
) AS SELECT
*
FROM
(
SELECT
ship_id,
scan_site,
scan_emp,
ins_db_tm,
scan_tm,
scan_typ,
udf_1,
udf_4,
ROW_NUMBER () OVER (
PARTITION BY ship_id
ORDER BY
ins_db_tm
) AS rn
FROM
tb_scan
WHERE
ins_db_tm >= '2021-07-09'
AND ins_db_tm < '2021-07-10'
AND scan_typ = '10'
AND rmk_id = '1'
) AS T
WHERE
T.rn = 1
DISTRIBUTED BY (ship_id); -- test the last clause: does DISTRIBUTED BY change the runtime?
With DISTRIBUTED
SELECT 54062499
Time: 23343.444 ms
Without DISTRIBUTED
SELECT 54062499
Time: 22795.694 ms
Apparently no noticeable difference.
Single queries
Partition pruning by time
All of the following run against one table partitioned by time.
-- sign-off scans, within a 1-day window
SELECT 54062499
Time: 23343.444 ms
-- dispatch scans, within a 10-day window
SELECT 515138696
Time: 217047.224 ms
-- pickup scans, within a 30-day window
SELECT 1630545559
Time: 761096.137 ms
Runtime scales roughly with the data volume.
Since these all hit the same table, could they be finished in a single pass to save time?
Comparing difference sets
Compare a difference set once on the primary key.
select count(1) from test_scan_qian09 as a left join test_scan_qian as b on a.ship_id = b.ship_id where b.ship_id is null ;
Profiling the data sample
sum(case when zyfifthyz is not null then 1 else 0 end) is simply count(column).
select
sum(case when zyfifthyz is not null then 1 else 0 end) as zyfifthyz_cnt,
sum(case when rule5a is not null then 1 else 0 end) as rule5a_cnt,
sum(case when zyfifthgj is not null then 1 else 0 end) as zyfifthgj_cnt,
sum(case when rule5b is not null then 1 else 0 end) as rule5b_cnt,
sum(case when zyfifthdsd is not null then 1 else 0 end) as zyfifthdsd_cnt,
sum(case when rule5c is not null then 1 else 0 end) as rule5c_cnt,
sum(case when sfcalwd is not null then 1 else 0 end) as sfcalwd_cnt,
sum(case when sfcalpsm is not null then 1 else 0 end) as sfcalpsm_cnt,
sum(case when sfcalyz is not null then 1 else 0 end) as sfcalyz_cnt,
sum(case when sfcalgj is not null then 1 else 0 end) as sfcalgj_cnt,
sum(case when sfcaldsd is not null then 1 else 0 end) as sfcaldsd_cnt,
sum(case when cnmdwd is not null then 1 else 0 end) as cnmdwd_cnt,
sum(case when cnpsm is not null then 1 else 0 end) as cnpsm_cnt,
sum(case when cnfifthcode is not null then 1 else 0 end) as cnfifthcode_cnt,
sum(case when edmwlwd is not null then 1 else 0 end) as edmwlwd_cnt,
sum(case when edmwlpsm is not null then 1 else 0 end) as edmwlpsm_cnt,
sum(case when edmhisqcwd is not null then 1 else 0 end) as edmhisqcwd_cnt,
sum(case when edmhisqcpsm is not null then 1 else 0 end) as edmhisqcpsm_cnt,
sum(case when sdmrgzdwd is not null then 1 else 0 end) as sdmrgzdwd_cnt,
sum(case when sdmrgzdpsm is not null then 1 else 0 end) as sdmrgzdpsm_cnt,
sum(case when sdmgjcrgzdwd is not null then 1 else 0 end) as sdmgjcrgzdwd_cnt,
sum(case when sdmgjcrgzdpsm is not null then 1 else 0 end) as sdmgjcrgzdpsm_cnt,
sum(case when sdmkeysdicpsfwwd is not null then 1 else 0 end) as sdmkeysdicpsfwwd_cnt,
sum(case when sdmkeysdicpsfwpsm is not null then 1 else 0 end) as sdmkeysdicpsfwpsm_cnt,
sum(case when sfqccalwd is not null then 1 else 0 end) as sfqccalwd_cnt,
sum(case when sfqccalpsm is not null then 1 else 0 end) as sfqccalpsm_cnt,
sum(case when edmhisfcroadnumberwd is not null then 1 else 0 end) as edmhisfcroadnumberwd_cnt,
sum(case when edmhisfcroadnumberpsm is not null then 1 else 0 end) as edmhisfcroadnumberpsm_cnt,
sum(case when edmhisfcpoiwd is not null then 1 else 0 end) as edmhisfcpoiwd_cnt,
sum(case when edmhisfcpoipsm is not null then 1 else 0 end) as edmhisfcpoipsm_cnt
from test_alldata;
select count(ship_id),count(distinct ship_id) from addr_real where ldsj >= '2021-07-10' and ldsj < '2021-07-11';
-- results
zyfifthyz_cnt | 24260067
rule5a_cnt | 59316603
zyfifthgj_cnt | 10092057
rule5b_cnt | 59316603
zyfifthdsd_cnt | 21988239
rule5c_cnt | 59316603
sfcalwd_cnt | 40757859
sfcalpsm_cnt | 40757859
sfcalyz_cnt | 8530955
sfcalgj_cnt | 2099311
sfcaldsd_cnt | 6041209
cnmdwd_cnt | 21258103
cnpsm_cnt | 20862872
cnfifthcode_cnt | 3772426
edmwlwd_cnt | 17365571
edmwlpsm_cnt | 17365571
edmhisqcwd_cnt | 44155107
edmhisqcpsm_cnt | 44155107
sdmrgzdwd_cnt | 0
sdmrgzdpsm_cnt | 0
sdmgjcrgzdwd_cnt | 213877
sdmgjcrgzdpsm_cnt | 49756
sdmkeysdicpsfwwd_cnt | 16013344
sdmkeysdicpsfwpsm_cnt | 15912223
sfqccalwd_cnt | 0
sfqccalpsm_cnt | 0
edmhisfcroadnumberwd_cnt | 14759277
edmhisfcroadnumberpsm_cnt | 14759277
edmhisfcpoiwd_cnt | 46548511
edmhisfcpoipsm_cnt | 46548511
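The equivalence claimed above, sum(case when col is not null then 1 else 0 end) = count(col), can be sanity-checked anywhere, e.g. on SQLite:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE t (c TEXT);
INSERT INTO t VALUES ('x'), (NULL), ('y'), (NULL);
""")

row = con.execute("""
SELECT sum(CASE WHEN c IS NOT NULL THEN 1 ELSE 0 END) AS via_case,
       count(c) AS via_count
FROM t
""").fetchone()

print(row)  # (2, 2) -- count(column) already skips NULLs
```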
Finding which rows are duplicated
select * from test_alldata_20210710 where ship_id in (select ship_id from test_alldata_20210710 group by ship_id having count(ship_id) > 1 limit 10);
-- or via a self-join?
Ways to deduplicate
select distinct on (col) * from sometable order by col, other_col desc;
Roughly: the DISTINCT ON columns act as the grouping key, and which row survives follows the ORDER BY; the first row of each group is kept. Note that the ORDER BY must begin with the DISTINCT ON columns.
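DISTINCT ON is PostgreSQL/GP syntax; the same "first row of each group under some ordering" can be spelled with ROW_NUMBER, which also runs on SQLite (invented sample data):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sometable (grp TEXT, ts INTEGER, val TEXT);
INSERT INTO sometable VALUES
  ('a', 2, 'new'), ('a', 1, 'old'), ('b', 5, 'only');
""")

# equivalent of: SELECT DISTINCT ON (grp) * FROM sometable ORDER BY grp, ts DESC
rows = con.execute("""
SELECT grp, val FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY grp ORDER BY ts DESC) AS rn
  FROM sometable
) t WHERE rn = 1 ORDER BY grp
""").fetchall()

print(rows)  # [('a', 'new'), ('b', 'only')]
```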
Reusing a computed column
The usual solution is a subquery: compute once, then SELECT over the result again.
http://cn.voidcc.com/question/p-tvmbteju-s.html
Can the SELECT list reference its own computed columns directly? MySQL can; the syntax looks like this:
roughly, @var := expression assigns the expression to a variable, and @var then references it directly.
SELECT @b := UPPER(hostname) as b ,SUBSTR(@b,1,3) as c from access_log limit 10;
-- another usage
set @hello =1;
select @hello;
Does GP have anything similar? Tested; apparently not.
Deduplicated counts on a big table
Performance of a deduplicated count over one column of a big table:
select count(recv_addr) from tu_doc_info;
-[ RECORD 1 ]------
count | 28018141557
Time: 4371634.524 ms
select count(1) from (select 1 from tu_doc_info group by recv_addr) as t;
-[ RECORD 1 ]-----
count | 3898994670
Time: 29168870.915 ms
For comparison, counting a partitioned MySQL table of roughly 20 million rows takes about 20-30 seconds.
Fast deduplication
For a column like recv_addr: take its md5 and store that in a separate table; whether a group by or distinct over the digest is then faster is untested.
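The digest step can be sketched with Python's hashlib: every address, however long, collapses to a fixed 32-character key, so the later group by compares short fixed-size values. Whether that actually speeds up GP would need measuring, and md5 collisions make digest-based distinct counts approximate in the worst case.

```python
import hashlib

# invented sample addresses; the first and last are duplicates
addrs = [
    "1 Long Street, Some District, Some City" * 3,
    "2 Other Road, Some City",
    "1 Long Street, Some District, Some City" * 3,
]

# each digest is a fixed 32-char hex key regardless of the address length
digests = [hashlib.md5(a.encode("utf-8")).hexdigest() for a in addrs]

print(len(set(digests)))  # 2 distinct addresses
```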
Counting duplicates
-- this dedup runs a bit faster than a bare distinct, and the results match
select count(1) from (select count(1) from ydwr_base_salesman_sorting group by branch_code ,salesman_id ) as f;
-- total row count
select count(1) from ydwr_base_salesman_sorting;
-- which (branch_code, salesman_id) pairs are duplicated
select branch_code,salesman_id from ydwr_base_salesman_sorting group by branch_code ,salesman_id having count(1) > 1 ;
-- feed the previous step's result into a join to see exactly which records are involved
select * from ydwr_base_salesman_sorting as a right join (select branch_code,salesman_id from ydwr_base_salesman_sorting group by branch_code ,salesman_id having count(1) > 1) as b
on a.branch_code = b.branch_code and a.salesman_id = b.salesman_id;
-- a window function could probably do this too, but how to aggregate over the 2 columns at once?
select * from ydwr_base_salesman_sorting ;
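As for the open question in the comment above: a window function can aggregate over both columns at once with COUNT(*) OVER (PARTITION BY branch_code, salesman_id), which tags every row with the size of its group. A SQLite sketch with invented rows:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE s (branch_code TEXT, salesman_id TEXT);
INSERT INTO s VALUES ('b1','s1'), ('b1','s1'), ('b2','s2');
""")

# every row carries its (branch_code, salesman_id) group size; duplicates have cnt > 1
rows = con.execute("""
SELECT * FROM (
  SELECT branch_code, salesman_id,
         COUNT(*) OVER (PARTITION BY branch_code, salesman_id) AS cnt
  FROM s
) t WHERE cnt > 1 ORDER BY branch_code
""").fetchall()

print(rows)  # [('b1', 's1', 2), ('b1', 's1', 2)]
```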
Deduplicating the address table
When extracting data, deduplication is unavoidable. Deduplicate first, or join first? Compute derived columns first, last, or in the middle after deduplication?
Scenario
test_scan_qian is the main table of the final join, so use it up front to shrink the dedup input.
addr_real_uniq is the target table: deduplicated data, ready for the final join.
Joining test_scan_qian is not strictly required for the dedup itself, but without it the data volume is too large; the extra condition shrinks the data, and it makes the final join smaller too.
Runtimes
Code 1 extracts on its own, with no join-based filtering; the data volume is large, hence the long runtime.
Code 2 was meant to shrink the data with a join, but used LEFT JOIN instead of RIGHT JOIN, so the total scanned did not drop while join overhead was added.
Code 3 switches to RIGHT JOIN: the data volume drops and the total time falls to about 2/3.
Code 4 is roughly the same as code 3 but moves the condition from ON into WHERE; the time drops a bit further (oddly), presumably because both versions still get partition pruning.
On ON versus WHERE: when joining many tables, putting conditions in ON seems more convenient.
Code 5 changes the join type from RIGHT JOIN to INNER JOIN.
Code 6 uses the FROM a, b WHERE ... form.
Execution times are all roughly the same; the engine presumably normalizes these variants into the same plan.
Codes 7 and 8 strip work from code 5 (first the outer rn = 1 filter, then the window) to isolate the bare join cost. Result: no time saved, and more rows produced. So adding conditions does not necessarily add runtime, and joining less, then deduplicating once at the end (assuming few duplicates), might even be faster.
Reduce scan passes; reduce join passes.
Row counts
Codes 1 and 2 match; codes 3/4/5/6 are close, yet code 3 has 1 extra row. Likely cause: with RIGHT JOIN and the filter in ON, q rows with no addr_real match survive as all-NULL a.* rows, they all land in the tm = NULL partition, and that partition contributes one rn = 1 row; the WHERE filter in codes 4-6 discards those NULL rows.
Runtime correlates with total rows and with the number of columns.
Code 1
-- a7 address dedup
DROP TABLE IF EXISTS addr_real_uniq;
CREATE TABLE addr_real_uniq WITH (
APPENDONLY = TRUE,
COMPRESSLEVEL = 5,
ORIENTATION = COLUMN,
COMPRESSTYPE = ZLIB
) AS
select
*
from
(
select
*,
-- CAINIAO>YDEC>JD>PDD>BM>YJ>PDDGJ>其他>OOS_NON_ELE,OOS>PC_REAL_NAME_NON_LEL,PC_REAL_NAME>COD
-- as ldsj_val,
ROW_NUMBER () OVER (
PARTITION BY tm
ORDER BY
case
when ldly = 'CAINIAO' then 1
when ldly = 'YDEC' then 2
when ldly = 'JD' then 3
when ldly = 'PDD' then 4
when ldly = 'BM' then 5
when ldly = 'YJ' then 6
when ldly = 'PDDGJ' then 7
when ldly = '其他' then 8
when ldly = 'OOS_NON_ELE' then 9
when ldly = 'OOS' then 10
when ldly = 'PC_REAL_NAME_NON_LEL' then 11
when ldly = 'PC_REAL_NAME' then 12
when ldly = 'COD' then 13
else 14
end ASC,ldsj desc
) AS rn
FROM
addr_real
WHERE ldsj >= '2021-08-15 00:00:00'
) t
where rn = 1
DISTRIBUTED BY (tm)
;
SELECT 549681179
Time: 1484293.820 ms
Code 2
DROP TABLE IF EXISTS addr_real_uniq;
CREATE TABLE addr_real_uniq WITH (
APPENDONLY = TRUE,
COMPRESSLEVEL = 5,
ORIENTATION = COLUMN,
COMPRESSTYPE = ZLIB
) AS
select
*
from
(
select
a.*,
-- CAINIAO>YDEC>JD>PDD>BM>YJ>PDDGJ>其他>OOS_NON_ELE,OOS>PC_REAL_NAME_NON_LEL,PC_REAL_NAME>COD
-- as ldsj_val,
ROW_NUMBER () OVER (
PARTITION BY tm
ORDER BY
case
when ldly = 'CAINIAO' then 1
when ldly = 'YDEC' then 2
when ldly = 'JD' then 3
when ldly = 'PDD' then 4
when ldly = 'BM' then 5
when ldly = 'YJ' then 6
when ldly = 'PDDGJ' then 7
when ldly = '其他' then 8
when ldly = 'OOS_NON_ELE' then 9
when ldly = 'OOS' then 10
when ldly = 'PC_REAL_NAME_NON_LEL' then 11
when ldly = 'PC_REAL_NAME' then 12
when ldly = 'COD' then 13
else 14
end ASC,ldsj desc
) AS rn
FROM
addr_real as a
LEFT JOIN test_scan_qian as q on a.tm = q.ship_id
WHERE ldsj >= '2021-08-15 00:00:00'
) t
where rn = 1
DISTRIBUTED BY (tm)
;
The join above went the wrong way (it should have been RIGHT JOIN): no rows were eliminated, extra join work was added, and the runtime got longer.
SELECT 549681179
Time: 1606665.073 ms
Code 3
DROP TABLE IF EXISTS addr_real_uniq;
CREATE TABLE addr_real_uniq WITH (
APPENDONLY = TRUE,
COMPRESSLEVEL = 5,
ORIENTATION = COLUMN,
COMPRESSTYPE = ZLIB
) AS
select
*
from
(
select
a.*,
-- CAINIAO>YDEC>JD>PDD>BM>YJ>PDDGJ>其他>OOS_NON_ELE,OOS>PC_REAL_NAME_NON_LEL,PC_REAL_NAME>COD
-- as ldsj_val,
ROW_NUMBER () OVER (
PARTITION BY tm
ORDER BY
case
when ldly = 'CAINIAO' then 1
when ldly = 'YDEC' then 2
when ldly = 'JD' then 3
when ldly = 'PDD' then 4
when ldly = 'BM' then 5
when ldly = 'YJ' then 6
when ldly = 'PDDGJ' then 7
when ldly = '其他' then 8
when ldly = 'OOS_NON_ELE' then 9
when ldly = 'OOS' then 10
when ldly = 'PC_REAL_NAME_NON_LEL' then 11
when ldly = 'PC_REAL_NAME' then 12
when ldly = 'COD' then 13
else 14
end ASC,ldsj desc
) AS rn
FROM
addr_real as a
RIGHT JOIN test_scan_qian as q on a.tm = q.ship_id and ldsj >= '2021-08-15 00:00:00'
) t
where rn = 1
DISTRIBUTED BY (tm)
;
SELECT 48430255
Time: 524957.187 ms
Code 4
DROP TABLE IF EXISTS addr_real_uniq;
CREATE TABLE addr_real_uniq WITH (
APPENDONLY = TRUE,
COMPRESSLEVEL = 5,
ORIENTATION = COLUMN,
COMPRESSTYPE = ZLIB
) AS
select
*
from
(
select
a.*,
-- CAINIAO>YDEC>JD>PDD>BM>YJ>PDDGJ>其他>OOS_NON_ELE,OOS>PC_REAL_NAME_NON_LEL,PC_REAL_NAME>COD
-- as ldsj_val,
ROW_NUMBER () OVER (
PARTITION BY tm
ORDER BY
case
when ldly = 'CAINIAO' then 1
when ldly = 'YDEC' then 2
when ldly = 'JD' then 3
when ldly = 'PDD' then 4
when ldly = 'BM' then 5
when ldly = 'YJ' then 6
when ldly = 'PDDGJ' then 7
when ldly = '其他' then 8
when ldly = 'OOS_NON_ELE' then 9
when ldly = 'OOS' then 10
when ldly = 'PC_REAL_NAME_NON_LEL' then 11
when ldly = 'PC_REAL_NAME' then 12
when ldly = 'COD' then 13
else 14
end ASC,ldsj desc
) AS rn
FROM
addr_real as a
RIGHT JOIN test_scan_qian as q on a.tm = q.ship_id
where ldsj >= '2021-08-15 00:00:00'
) t
where rn = 1
DISTRIBUTED BY (tm)
;
SELECT 48430254
Time: 487740.310 ms
Code 5
DROP TABLE IF EXISTS addr_real_uniq;
CREATE TABLE addr_real_uniq WITH (
APPENDONLY = TRUE,
COMPRESSLEVEL = 5,
ORIENTATION = COLUMN,
COMPRESSTYPE = ZLIB
) AS
select
*
from
(
select
a.*,
-- CAINIAO>YDEC>JD>PDD>BM>YJ>PDDGJ>其他>OOS_NON_ELE,OOS>PC_REAL_NAME_NON_LEL,PC_REAL_NAME>COD
-- as ldsj_val,
ROW_NUMBER () OVER (
PARTITION BY tm
ORDER BY
case
when ldly = 'CAINIAO' then 1
when ldly = 'YDEC' then 2
when ldly = 'JD' then 3
when ldly = 'PDD' then 4
when ldly = 'BM' then 5
when ldly = 'YJ' then 6
when ldly = 'PDDGJ' then 7
when ldly = '其他' then 8
when ldly = 'OOS_NON_ELE' then 9
when ldly = 'OOS' then 10
when ldly = 'PC_REAL_NAME_NON_LEL' then 11
when ldly = 'PC_REAL_NAME' then 12
when ldly = 'COD' then 13
else 14
end ASC,ldsj desc
) AS rn
FROM
addr_real as a
INNER JOIN test_scan_qian as q on a.tm = q.ship_id
where ldsj >= '2021-08-15 00:00:00'
) t
where rn = 1
DISTRIBUTED BY (tm)
;
SELECT 48430254
Time: 480104.767 ms
Code 6
Switched the FROM form.
DROP TABLE IF EXISTS addr_real_uniq;
CREATE TABLE addr_real_uniq WITH (
APPENDONLY = TRUE,
COMPRESSLEVEL = 5,
ORIENTATION = COLUMN,
COMPRESSTYPE = ZLIB
) AS
select
*
from
(
select
a.*,
-- CAINIAO>YDEC>JD>PDD>BM>YJ>PDDGJ>其他>OOS_NON_ELE,OOS>PC_REAL_NAME_NON_LEL,PC_REAL_NAME>COD
-- as ldsj_val,
ROW_NUMBER () OVER (
PARTITION BY tm
ORDER BY
case
when ldly = 'CAINIAO' then 1
when ldly = 'YDEC' then 2
when ldly = 'JD' then 3
when ldly = 'PDD' then 4
when ldly = 'BM' then 5
when ldly = 'YJ' then 6
when ldly = 'PDDGJ' then 7
when ldly = '其他' then 8
when ldly = 'OOS_NON_ELE' then 9
when ldly = 'OOS' then 10
when ldly = 'PC_REAL_NAME_NON_LEL' then 11
when ldly = 'PC_REAL_NAME' then 12
when ldly = 'COD' then 13
else 14
end ASC,ldsj desc
) AS rn
FROM
addr_real as a,test_scan_qian as q
-- join condition expressed in WHERE instead
where a.tm = q.ship_id and ldsj >= '2021-08-15 00:00:00'
) t
where rn = 1
DISTRIBUTED BY (tm)
;
SELECT 48430254
Time: 473906.605 ms
Code 7
Compared with code 5:
drop the outer rn = 1 dedup and observe the inner cost.
After the window, the dedup WHERE costs little; the difference is small.
DROP TABLE IF EXISTS addr_real_uniq2;
CREATE TABLE addr_real_uniq2 WITH (
APPENDONLY = TRUE,
COMPRESSLEVEL = 5,
ORIENTATION = COLUMN,
COMPRESSTYPE = ZLIB
) AS
select
a.*,
ROW_NUMBER () OVER (
PARTITION BY tm
ORDER BY
case
when ldly = 'CAINIAO' then 1
when ldly = 'YDEC' then 2
when ldly = 'JD' then 3
when ldly = 'PDD' then 4
when ldly = 'BM' then 5
when ldly = 'YJ' then 6
when ldly = 'PDDGJ' then 7
when ldly = '其他' then 8
when ldly = 'OOS_NON_ELE' then 9
when ldly = 'OOS' then 10
when ldly = 'PC_REAL_NAME_NON_LEL' then 11
when ldly = 'PC_REAL_NAME' then 12
when ldly = 'COD' then 13
else 14
end ASC,ldsj desc
) AS rn
FROM
addr_real as a
INNER JOIN test_scan_qian as q on a.tm = q.ship_id
where ldsj >= '2021-08-15 00:00:00'
DISTRIBUTED BY (tm)
;
SELECT 55299456
Time: 476894.489 ms
Code 8
Also drop the window function, to see the pure join cost.
DROP TABLE IF EXISTS addr_real_uniq2;
CREATE TABLE addr_real_uniq2 WITH (
APPENDONLY = TRUE,
COMPRESSLEVEL = 5,
ORIENTATION = COLUMN,
COMPRESSTYPE = ZLIB
) AS
select
a.*
FROM
addr_real as a
INNER JOIN test_scan_qian as q on a.tm = q.ship_id
where ldsj >= '2021-08-15 00:00:00'
DISTRIBUTED BY (tm)
;
SELECT 55299456
Time: 470603.757 ms
i1: dedup with the help of a join
Join the main table, which is the primary data source.
What follows is really just an inner join, written rather verbosely.
DROP TABLE IF EXISTS i1_order_unqi;
CREATE TABLE i1_order_unqi WITH (
APPENDONLY = TRUE,
COMPRESSLEVEL = 5,
ORIENTATION = COLUMN,
COMPRESSTYPE = ZLIB
) AS
SELECT
*
FROM
(
SELECT
i.*,
ROW_NUMBER () OVER (
PARTITION BY i.ship_id
ORDER BY
batch_time DESC
) AS rn
FROM
i1_order as i
LEFT JOIN test_scan_qian as q ON q.ship_id = i.ship_id and i.batch_time >= '<?=dt('-7 day')?>'
WHERE q.ship_id IS NOT NULL
) t
WHERE
rn = 1
DISTRIBUTED BY (ship_id);
Silly mistake:
RIGHT JOIN test_scan_qian as q ON q.ship_id = i.ship_id and i.batch_time >= '<?=dt('-7 day')?>'
is still wrong; what it effectively is, is an INNER JOIN:
INNER JOIN test_scan_qian as q ON q.ship_id = i.ship_id and i.batch_time >= '<?=dt('-7 day')?>'
Dedup with IN
CREATE TABLE statements omitted; in both cases the final result is stored DISTRIBUTED BY ship_id as an intermediate table.
After the optimization, the data volume drops to about 1/135 and the runtime shrinks by nearly 10x; the effect is dramatic. The small-table-drives-the-join question for in/join/exists was not pursued here.
- after optimization
select ship_id,string_agg(ship_type,'|') as ship_type from (select ship_id,ship_type from marked_ship
where ship_id in (select ship_id from test_scan_qian ) -- this line was added
group by ship_id,ship_type ) as f group by ship_id
SELECT 10724200
Time: 62214.511 ms
- before optimization
select ship_id,string_agg(ship_type,'|') as ship_type from (select ship_id,ship_type from marked_ship group by ship_id,ship_type ) as f group by ship_id
SELECT 1357038779
Time: 524996.890 ms
Fetching columns
Summary
Fewer columns do improve speed, but not by much: e.g. from 900 s to 800 s. The join (INNER JOIN, shrinking the total) brought it down to about 200 s.
0 Relevant data volumes
select count(*) from test_scan_qian;
count
----------
50196784
select max(store_time),min(store_time) from test_doc;
max | min
---------------------+---------------------
2021-08-26 23:49:52 | 2021-07-24 03:29:29
1 The original dedup requirement
DROP TABLE IF EXISTS test_doc;
CREATE TABLE test_doc WITH (
APPENDONLY = TRUE,
COMPRESSLEVEL = 5,
ORIENTATION = COLUMN,
COMPRESSTYPE = ZLIB
) AS SELECT
*
FROM
(
SELECT
d.doc_sn,
d.doc_src,
d.store_time,
d.recv_prov,
d.recv_city,
d.recv_ctry,
d.recv_addr,
d.send_prov,
d.send_city,
d.send_ctry,
d.send_addr,
ROW_NUMBER () OVER (
PARTITION BY doc_sn
ORDER BY
dtime DESC
) AS rn
FROM
tu_doc_info AS d
WHERE
dtime >= '2021-07-27 00:00:00'
AND dtime < '2021-08-27 00:00:00'
) AS T
WHERE
T.rn = 1
DISTRIBUTED BY (doc_sn);
SELECT 1531072496
Time: 937538.918 ms
2 Fewer columns
Fewer columns do cut the runtime, but the improvement ratio is not especially high.
DROP TABLE IF EXISTS test_doc;
CREATE TABLE test_doc WITH (
APPENDONLY = TRUE,
COMPRESSLEVEL = 5,
ORIENTATION = COLUMN,
COMPRESSTYPE = ZLIB
) AS SELECT
*
FROM
(
SELECT
d.doc_sn,
d.doc_src,
d.store_time,
d.recv_prov,
d.recv_city,
d.recv_ctry,
d.recv_addr,
-- d.send_prov,
-- d.send_city,
-- d.send_ctry,
-- d.send_addr,
ROW_NUMBER () OVER (
PARTITION BY doc_sn
ORDER BY
dtime DESC
) AS rn
FROM
tu_doc_info AS d
WHERE
dtime >= '2021-07-27 00:00:00'
AND dtime < '2021-08-27 00:00:00'
) AS T
WHERE
T.rn = 1
DISTRIBUTED BY (doc_sn);
SELECT 1531072496
Time: 833351.744 ms
3 Shrink the data with a join
DROP TABLE IF EXISTS test_doc;
CREATE TABLE test_doc WITH (
APPENDONLY = TRUE,
COMPRESSLEVEL = 5,
ORIENTATION = COLUMN,
COMPRESSTYPE = ZLIB
) AS SELECT
*
FROM
(
SELECT
d.doc_sn,
d.doc_src,
d.store_time,
d.recv_prov,
d.recv_city,
d.recv_ctry,
d.recv_addr,
d.send_prov,
d.send_city,
d.send_ctry,
d.send_addr,
ROW_NUMBER () OVER (
PARTITION BY doc_sn
ORDER BY
dtime DESC
) AS rn
FROM
tu_doc_info AS d
INNER JOIN test_scan_qian as q ON q.ship_id = d.doc_sn
WHERE
dtime >= '2021-07-27 00:00:00'
AND dtime < '2021-08-27 00:00:00'
) AS T
WHERE
T.rn = 1
DISTRIBUTED BY (doc_sn);
SELECT 50106931
Time: 221289.113 ms
Few columns, very fast
-- bagging info; tm may be a big-bag number or a waybill number
DROP TABLE IF EXISTS test_jibao;
CREATE TABLE test_jibao WITH (
APPENDONLY = TRUE,
COMPRESSLEVEL = 5,
ORIENTATION = COLUMN,
COMPRESSTYPE = ZLIB
) AS SELECT
*
FROM
(
SELECT
case when j.grp_ship_id is not null then j.grp_ship_id else a.ship_id end as tm, -- the key for load/unload lookups (big-bag number takes priority over the waybill number)
a.ship_id,
j.grp_ship_id AS pac_id, -- big-bag number
j.scan_tm AS pac_time,
j.scan_site AS pac_sitecode,
CASE
WHEN j.scan_typ IN ('3', '03') THEN
'分拨集包'
WHEN j.scan_typ IN ('5', '05') THEN
'网点集包'
END AS pac_class,
ROW_NUMBER () OVER (
PARTITION BY j.ship_id
ORDER BY
j.ins_db_tm DESC
) AS rn
FROM
test_scan_qian as a
LEFT JOIN tb_scan as j on j.ship_id = a.ship_id
WHERE
j.ins_db_tm >= '2021-08-23 00:00:00'
AND j.ins_db_tm < '2021-09-02 00:00:00'
AND j.scan_typ in ('3','03','5','05')
) AS T
WHERE
T.rn = 1
DISTRIBUTED BY (tm);
A runtime almost too short to believe.
SELECT 37564347
Time: 67449.129 ms
A strange problem
-- load (97) / unload (96) scan info
DROP TABLE IF EXISTS test_zhuangxie;
CREATE TABLE test_zhuangxie WITH (
APPENDONLY = TRUE,
COMPRESSLEVEL = 5,
ORIENTATION = COLUMN,
COMPRESSTYPE = ZLIB
) AS SELECT
*
FROM
(
SELECT
a.*,
z.scan_site AS sffb_id, -- originating distribution center
x.scan_site AS mdfb_id, -- destination site
ROW_NUMBER () OVER (
PARTITION BY a.ship_id
ORDER BY
z.ins_db_tm ASC, -- earliest load scan
x.ins_db_tm DESC -- latest unload scan
) AS rn1
FROM
test_jibao as a
LEFT JOIN tb_scan as z on z.ship_id = a.ship_id -- load scans
LEFT JOIN tb_scan as x on x.ship_id = a.ship_id -- unload scans
WHERE
z.ins_db_tm >= '2021-08-23 00:00:00'
AND z.ins_db_tm < '2021-09-02 00:00:00'
AND z.scan_typ = '97'
AND x.ins_db_tm >= '2021-08-23 00:00:00'
AND x.ins_db_tm < '2021-09-02 00:00:00'
AND x.scan_typ = '96'
) AS T
WHERE
T.rn1 = 1
DISTRIBUTED BY (tm);
SELECT 57880
Time: 149169.754 ms
develop=# select count(*) from test_jibao ;
count
----------
37564347
Surprisingly, this filtered out most of the rows: the WHERE conditions on z and x reject NULLs, which silently turns both LEFT JOINs into inner joins.
DROP TABLE IF EXISTS test_zhuangxie;
CREATE TABLE test_zhuangxie WITH (
APPENDONLY = TRUE,
COMPRESSLEVEL = 5,
ORIENTATION = COLUMN,
COMPRESSTYPE = ZLIB
) AS SELECT
*
FROM
(
SELECT
a.*,
z.scan_site AS sffb_id, -- originating distribution center
x.scan_site AS mdfb_id, -- destination site
ROW_NUMBER () OVER (
PARTITION BY a.ship_id
ORDER BY
z.ins_db_tm ASC, -- earliest load scan
x.ins_db_tm DESC -- latest unload scan
) AS rn1
FROM
test_jibao as a
LEFT JOIN tb_scan as z on z.ship_id = a.ship_id -- load scans
AND z.ins_db_tm >= '2021-08-23 00:00:00'
AND z.ins_db_tm < '2021-09-02 00:00:00'
AND z.scan_typ = '97'
LEFT JOIN tb_scan as x on x.ship_id = a.ship_id -- unload scans
AND x.ins_db_tm >= '2021-08-23 00:00:00'
AND x.ins_db_tm < '2021-09-02 00:00:00'
AND x.scan_typ = '96'
) AS T
WHERE
T.rn1 = 1
DISTRIBUTED BY (tm);
SELECT 37564347
Time: 100181.922 ms
Effect of compressed tables
The benefits: compression shrinks the size, which cuts disk IO, and column storage suits warehouse workloads well.
Experimentally, writing into a compressed table does not increase the runtime.
Compressed table
DROP TABLE IF EXISTS test_scan_fen;
CREATE TABLE test_scan_fen WITH (
APPENDONLY = TRUE,
COMPRESSLEVEL = 5,
ORIENTATION = COLUMN,
COMPRESSTYPE = ZLIB
) AS SELECT
ship_id,
scan_site,
-- scan_emp, replaced by the delv_emp column
delv_emp,
scan_tm,
rmk_inf,
-- delivery-remark code
CASE
WHEN rmk_inf SIMILAR to '[a-zA-Z0-9]{2}' THEN
rmk_inf
ELSE
null
END AS mark_code,
-- is the remark code a flexible-dispatch one, i.e. does it start with X/Y/W/Z
case when CHAR_LENGTH (rmk_inf) = 2 and (left(upper(rmk_inf),1)='X' or left(upper(rmk_inf),1)='Y' or left(upper(rmk_inf),1)='W' or left(upper(rmk_inf),1)='Z') then
1
else
0
end as is_var_code
FROM
(
SELECT
ship_id,
scan_site,
-- scan_emp,
delv_emp,
scan_tm,
rmk_inf,
ROW_NUMBER () OVER (
PARTITION BY ship_id
ORDER BY
scan_tm DESC
) AS rn
FROM
tb_scan
WHERE
ins_db_tm >= '2021-08-17 00:00:00'
AND ins_db_tm < '2021-08-27 00:00:00'
AND scan_typ = '24'
) AS T
WHERE
T.rn = 1
DISTRIBUTED BY (ship_id);
SELECT 483512499
Time: 217465.246 ms
Plain table
DROP TABLE IF EXISTS test_scan_fen;
CREATE TABLE test_scan_fen AS
SELECT
ship_id,
scan_site,
-- scan_emp, replaced by the delv_emp column
delv_emp,
scan_tm,
rmk_inf,
-- delivery-remark code
CASE
WHEN rmk_inf SIMILAR to '[a-zA-Z0-9]{2}' THEN
rmk_inf
ELSE
null
END AS mark_code,
-- is the remark code a flexible-dispatch one, i.e. does it start with X/Y/W/Z
case when CHAR_LENGTH (rmk_inf) = 2 and (left(upper(rmk_inf),1)='X' or left(upper(rmk_inf),1)='Y' or left(upper(rmk_inf),1)='W' or left(upper(rmk_inf),1)='Z') then
1
else
0
end as is_var_code
FROM
(
SELECT
ship_id,
scan_site,
-- scan_emp,
delv_emp,
scan_tm,
rmk_inf,
ROW_NUMBER () OVER (
PARTITION BY ship_id
ORDER BY
scan_tm DESC
) AS rn
FROM
tb_scan
WHERE
ins_db_tm >= '2021-08-17 00:00:00'
AND ins_db_tm < '2021-08-27 00:00:00'
AND scan_typ = '24'
) AS T
WHERE
T.rn = 1
DISTRIBUTED BY (ship_id);
SELECT 483512499
Time: 209393.916 ms
Views
DROP VIEW IF EXISTS test_scan_fen;
ERROR: "test_scan_fen" is not a view
HINT: Use DROP TABLE to remove a table, DROP EXTERNAL TABLE if external, or DROP FOREIGN TABLE if foreign.
CREATE VIEW test_scan_fen AS SELECT
ERROR: relation "test_scan_fen" already exists
The errors above show that a view shares the table namespace: the names collide.
DROP VIEW IF EXISTS test_scan_fen;
CREATE VIEW test_scan_fen AS
SELECT
...
Creating a view is fast: it only stores the SQL and does not execute it.
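That a view stores only the SQL is easy to demonstrate: rows inserted after CREATE VIEW still show up, because the view's SELECT re-runs on every read (SQLite sketch):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE base (x INTEGER);
INSERT INTO base VALUES (1);
CREATE VIEW v AS SELECT count(*) AS n FROM base;
""")
con.execute("INSERT INTO base VALUES (2)")

# the view re-runs its SELECT, so it sees the row added after CREATE VIEW
n = con.execute("SELECT n FROM v").fetchone()[0]
print(n)  # 2
```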
Create views, compute in a single pass
See sql/table_job.sql and sql/view_job.sql for the details.
The view variant takes about 77% of the time: an improvement, but apparently not a big one.
test_scan_qian; changed to a view
test_scan_fen; changed to a view
test_scan_lan; changed to a view
test_doc; changed to a view
test_scan_doc; unchanged, the output table.
View approach
SELECT 50196784
Time: 1717532.500 ms
Direct table approach:
started 16:12, done around 16:50, roughly 38 minutes.
2208822 ms in total (the sum of the step timings below)
SELECT 50196784
Time: 39995.549 ms
psql:/yd/table_job.sql:49: NOTICE: table "test_scan_fen" does not exist, skipping
DROP TABLE
Time: 0.843 ms
SELECT 483512499
Time: 248721.958 ms
psql:/yd/table_job.sql:102: NOTICE: table "test_scan_lan" does not exist, skipping
DROP TABLE
Time: 0.731 ms
SELECT 1506757081
Time: 700615.860 ms
psql:/yd/table_job.sql:135: NOTICE: table "test_doc" does not exist, skipping
DROP TABLE
Time: 0.663 ms
SELECT 1531072496
Time: 1022834.503 ms
DROP TABLE
Time: 24.487 ms
SELECT 50196784
Time: 196633.527 ms
Verifying a CSV import
# check the disk usage at this point
du -sh
# count total lines; the exported csv was written straight from pdo fetch -> fputcsv, so no header row, but there is a trailing blank line
# this method is not always exact: a newline inside a field would be miscounted as a row
cat marked_ship_20210*|wc -l
On the psql side
Check that the total counts match and that one sample record looks sane:
select * from marked_ship where insert_db_date >= '2021-07-11' limit 1;
select count(1) from marked_ship where insert_db_date >= '2021-07-11';
If everything matches, the files can be deleted:
rm -f marked_ship_20210*
du -sh
in/exists/join performance comparison
test_scan_qian: the big table, about 50 million rows.
i1_record: comparatively small, but with many duplicates; 20310127 rows.
join
Reportedly, anything written as a subquery can generally be rewritten as a join.
DROP TABLE IF EXISTS i1_record_unqi;
CREATE TABLE i1_record_unqi WITH (
APPENDONLY = TRUE,
COMPRESSLEVEL = 5,
ORIENTATION = COLUMN,
COMPRESSTYPE = ZLIB
) AS
SELECT
*
FROM
(
SELECT
i.*,
ROW_NUMBER () OVER (
PARTITION BY i.ship_id
ORDER BY
batch_time DESC
) AS rn
FROM
i1_record as i
LEFT JOIN test_scan_qian as q ON q.ship_id = i.ship_id and i.batch_time >= '2021-08-23'
WHERE q.ship_id IS NOT NULL
) t
WHERE
rn = 1
DISTRIBUTED BY (ship_id);
SELECT 11620
Time: 48267.562 ms
- in
DROP TABLE IF EXISTS i1_record_unqi;
CREATE TABLE i1_record_unqi WITH (
APPENDONLY = TRUE,
COMPRESSLEVEL = 5,
ORIENTATION = COLUMN,
COMPRESSTYPE = ZLIB
) AS
SELECT
*
FROM
(
SELECT
i.*,
ROW_NUMBER () OVER (
PARTITION BY i.ship_id
ORDER BY
batch_time DESC
) AS rn
FROM
i1_record as i
where i.batch_time >= '2021-08-23'
and ship_id in (select ship_id from test_scan_qian )
) t
WHERE
rn = 1
DISTRIBUTED BY (ship_id);
SELECT 11620
Time: 50033.728 ms
- exists
DROP TABLE IF EXISTS i1_record_unqi;
CREATE TABLE i1_record_unqi WITH (
APPENDONLY = TRUE,
COMPRESSLEVEL = 5,
ORIENTATION = COLUMN,
COMPRESSTYPE = ZLIB
) AS
SELECT
*
FROM
(
SELECT
i.*,
ROW_NUMBER () OVER (
PARTITION BY i.ship_id
ORDER BY
batch_time DESC
) AS rn
FROM
i1_record as i
where i.batch_time >= '2021-08-23'
and exists (select 1 from test_scan_qian where ship_id = i.ship_id )
) t
WHERE
rn = 1
DISTRIBUTED BY (ship_id);
SELECT 11620
Time: 51298.095 ms
Note: sometimes an extra condition makes things faster. Dropping the WHERE below actually slows the query down:
LEFT JOIN test_scan_qian as q ON q.ship_id = i.ship_id and i.batch_time >= '2021-08-23'
-- WHERE q.ship_id IS NOT NULL
SELECT 1971098
Time: 62557.812 ms
Extension: besides the 3 forms above, there are other subquery shapes.
Subquery 1
select * from (
select
a.*,
( select aid from b where b.aid=a.id limit 1) as aid
from a
) as t
where t.aid is null;
Subquery 2:
select * from a where (select aid from b where b.aid=a.id limit 1) is null;
Counting by date
select to_char(store_time::timestamp,'YYYY-MM-DD'),count(1) from test_doc group by to_char(store_time::timestamp,'YYYY-MM-DD') order by to_char(store_time::timestamp,'YYYY-MM-DD');
-- shorthand: group and order by ordinal position
select to_char(store_time::timestamp,'YYYY-MM-DD'),count(1) from test_doc group by 1 order by 1;
-- for some reason I keep writing the nested form; with this small data volume its runtime matches the two above
select dt,count(1) from (select to_char(store_time::timestamp,'YYYY-MM-DD') as dt from test_doc) t group by dt;
Rough runtime survey
Dispatch: 200+ s to pull 0.5 billion rows, extract 6 columns, compute the derived columns, and dedup. (+ a join condition brings it down to 94 s)
Sign-off: 600 s to pull 1.5 billion rows, roughly 4 columns, with dedup.
(with the join condition, the time halves)
SELECT 55019979
Time: 227152.571 ms
Order entry: 960 s over 1.5 billion rows.
(with the join condition, time drops to 1/4: the scan costs, but fetching columns costs more.)
SELECT 55351804
Time: 236165.105 ms
With the source volumes cut this much, the final join of the big tables should be much faster as well.
The three-way join dropped from 200 s to about 1/4 (two of the tables went from 1.5 billion rows down to ~50 million).
SELECT 55415034
Time: 48538.021 ms
A further optimization would be to finish the multi-table scans in one pass, which should cut the time further; but how? A custom function? Indexes? One complex combined SQL?
So: the target is about 50 million rows. If an intermediate result is many times larger than that, something is wrong; joins can always be used to cap the intermediate output.
Counting per bucket
Plain truncation/rounding arithmetic could replace the CASE ladder below; at 0.01 granularity the ladder would otherwise need a hundred WHEN branches.
select case
when i1l_ftconfid = '' then -1
when i1l_ftconfid::numeric >= 1.1 then 1.1
when i1l_ftconfid::numeric >= 1 then 1
when i1l_ftconfid::numeric >= 0.9 then 0.9
when i1l_ftconfid::numeric >= 0.8 then 0.8
when i1l_ftconfid::numeric >= 0.7 then 0.7
when i1l_ftconfid::numeric >= 0.6 then 0.6
when i1l_ftconfid::numeric >= 0.5 then 0.5
when i1l_ftconfid::numeric >= 0.4 then 0.4
when i1l_ftconfid::numeric >= 0.3 then 0.3
when i1l_ftconfid::numeric >= 0.2 then 0.2
when i1l_ftconfid::numeric >= 0.1 then 0.1
when i1l_ftconfid::numeric >= 0 then 0
else
-1
end as possible
,count(1)
from test_data_uniq_20210831
group by 1
order by 1
;
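The arithmetic alternative mentioned above: floor-style integer math buckets the value in one expression, so 0.01 granularity needs no extra WHEN branches. A SQLite sketch (the empty-string and out-of-range handling of the original would still need a small CASE on top):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE scores (conf REAL);
INSERT INTO scores VALUES (0.91), (0.95), (0.42), (0.05);
""")

# CAST(conf*10 AS INTEGER)/10.0 maps each non-negative value
# to the lower edge of its 0.1-wide bucket
rows = con.execute("""
SELECT CAST(conf * 10 AS INTEGER) / 10.0 AS bucket, count(1)
FROM scores GROUP BY 1 ORDER BY 1
""").fetchall()

print(rows)  # [(0.0, 1), (0.4, 1), (0.9, 2)]
```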
The plan was to add a percentage column on top, like this:
select possible,cnt,(cnt::numeric)/(sum(cnt))::numeric from
( the statement above ) t ;
It fails with:
column "t.possible" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: select possible,cnt,cnt::numeric,sum(cnt) from
Reason: sum() is an aggregate, so every other selected column must itself be aggregated or listed in GROUP BY.
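A window aggregate sidesteps the GROUP BY requirement: sum(cnt) OVER () attaches the grand total to every row without grouping, so the percentage becomes a plain expression (SQLite sketch with invented bucket counts):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE buckets (possible REAL, cnt INTEGER);
INSERT INTO buckets VALUES (0.9, 30), (0.5, 50), (0.1, 20);
""")

# sum(cnt) OVER () is the grand total, available on every row
rows = con.execute("""
SELECT possible, cnt, 1.0 * cnt / sum(cnt) OVER () AS pct
FROM buckets ORDER BY possible
""").fetchall()

print(rows)  # [(0.1, 20, 0.2), (0.5, 50, 0.5), (0.9, 30, 0.3)]
```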
Dedup conditions
The first version of the SQL:
-- bagging info; tm may be a big-bag number or a waybill number
DROP TABLE IF EXISTS test_jibao;
CREATE TABLE test_jibao WITH (
APPENDONLY = TRUE,
COMPRESSLEVEL = 5,
ORIENTATION = COLUMN,
COMPRESSTYPE = ZLIB
) AS SELECT
*
FROM
(
SELECT
case when j.grp_ship_id is not null then j.grp_ship_id else a.ship_id end as tm, -- the key for load/unload lookups (big-bag number takes priority over the waybill number)
a.ship_id,
j.grp_ship_id AS pac_id, -- big-bag number
j.scan_tm AS pac_time,
j.scan_site AS pac_sitecode,
CASE
WHEN j.scan_typ IN ('3', '03') THEN
'分拨集包'
WHEN j.scan_typ IN ('5', '05') THEN
'网点集包'
END AS pac_class,
ROW_NUMBER () OVER (
PARTITION BY j.ship_id
ORDER BY
j.ins_db_tm DESC
) AS rn
FROM
test_scan_qian as a
LEFT JOIN tb_scan as j on j.ship_id = a.ship_id
WHERE
j.ins_db_tm >= '2021-08-26 00:00:00'
AND j.ins_db_tm < '2021-09-05 00:00:00'
AND j.scan_typ in ('3','03','5','05')
) AS T
WHERE
T.rn = 1
DISTRIBUTED BY (tm);
-- load (97) / unload (96) scan info
DROP TABLE IF EXISTS test_zhuangxie;
CREATE TABLE test_zhuangxie WITH (
APPENDONLY = TRUE,
COMPRESSLEVEL = 5,
ORIENTATION = COLUMN,
COMPRESSTYPE = ZLIB
) AS SELECT
*
FROM
(
SELECT
a.*,
z.scan_site AS sffb_id, -- originating distribution center
x.scan_site AS mdfb_id, -- destination site
ROW_NUMBER () OVER (
PARTITION BY a.ship_id
ORDER BY
z.ins_db_tm ASC, -- earliest load scan
x.ins_db_tm DESC -- latest unload scan
) AS rn1
FROM
test_jibao as a
LEFT JOIN tb_scan as z on z.ship_id = a.ship_id -- load scans
AND z.ins_db_tm >= '2021-08-26 00:00:00'
AND z.ins_db_tm < '2021-09-05 00:00:00'
AND z.scan_typ = '97'
LEFT JOIN tb_scan as x on x.ship_id = a.ship_id -- unload scans
AND x.ins_db_tm >= '2021-08-26 00:00:00'
AND x.ins_db_tm < '2021-09-05 00:00:00'
AND x.scan_typ = '96'
) AS T
WHERE
T.rn1 = 1
DISTRIBUTED BY (tm);
The main table test_scan_qian above got filtered, so the result has fewer rows than the very first version. Moreover the computed tm can contain duplicates, which makes the next join awkward.
The main table qian has no duplicates, but the join key is "big-bag number first, waybill number second", and the bagging info itself can contain duplicates (so step one is to make the bagging info unique). Even assuming the bagging info is unique, joining qian on the computed key means qian's join-key values can repeat. So the order has to be:
- dedup the bagging info on its own first
- compute qian's join key and redistribute by it
- join the bagging info against qian
- re-key the joined result back to qian's primary key
The way to cut steps is to compute the join key on the fly, and to keep each step to its own single concern: dedup first, rather than join-then-dedup, which can produce results that differ from expectations.
An alternative approach: in the qian table, handle rows that have a big-bag number by bag and the rest by waybill, and then...
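One small simplification for the computed key discussed above: case when grp_ship_id is not null then grp_ship_id else ship_id end is exactly coalesce(grp_ship_id, ship_id). A SQLite sketch of joining on such a key (invented rows):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE qian (ship_id TEXT, grp_ship_id TEXT);
INSERT INTO qian VALUES ('w1', 'bag9'), ('w2', NULL);
CREATE TABLE jibao (tm TEXT, site TEXT);
INSERT INTO jibao VALUES ('bag9', 's100'), ('w2', 's200');
""")

# coalesce picks grp_ship_id when present, else falls back to ship_id
rows = con.execute("""
SELECT q.ship_id, j.site
FROM qian q
JOIN jibao j ON j.tm = COALESCE(q.grp_ship_id, q.ship_id)
ORDER BY q.ship_id
""").fetchall()

print(rows)  # [('w1', 's100'), ('w2', 's200')]
```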
A bad SQL: querying by a non-distribution key
select
develop-# ship_id,
develop-# grp_ship_id,
develop-# rmk_id,
develop-# rmk_inf,
develop-# scan_site,
develop-# fst_scan_site,
develop-# pre_scan_site,
develop-# nxt_scan_site,
develop-# scan_typ,
develop-# scan_tm,
develop-# ins_db_tm
develop-# from tb_scan
develop-# where ins_db_tm >='2021-09-01' and ins_db_tm < '2021-09-06' and ship_id = '1202767564898' or grp_ship_id = '9401966434871'
develop-# order by ins_db_tm;
^CCancel request sent
ERROR: canceling statement due to user request
Time: 434568.440 ms
select
ship_id,
grp_ship_id,
rmk_id,
rmk_inf,
scan_site,
fst_scan_site,
pre_scan_site,
nxt_scan_site,
scan_typ,
scan_tm,
ins_db_tm
from tb_scan
where ins_db_tm >='2021-09-01' and ins_db_tm < '2021-09-06' and ship_id in ('1202767564898','9401966434871')
order by ins_db_tm;
Time: 36705.112 ms
The difference is dramatic: filtering on the distribution key is very fast, but adding a non-distribution-key condition (the OR on grp_ship_id) makes it astonishingly slow. (speculation)
The join key below is a computed expression, which also amounts to querying by a non-distribution key, yet the performance is still fine.
SELECT
*
FROM
(
SELECT
a.*,
z.scan_site AS sffb_id, -- originating distribution center
x.scan_site AS mdfb_id, -- destination site
ROW_NUMBER () OVER (
PARTITION BY a.ship_id
ORDER BY
z.ins_db_tm ASC, -- earliest load scan
x.ins_db_tm DESC -- latest unload scan
) AS rn1
FROM
test_jibao as a
LEFT JOIN tb_scan as z on z.ship_id = (case when a.pac_id is not null then a.pac_id else a.ship_id end ) -- load scans
AND z.ins_db_tm >= '2021-08-26 00:00:00'
AND z.ins_db_tm < '2021-09-06 00:00:00'
AND z.scan_typ = '97'
LEFT JOIN tb_scan as x on x.ship_id = (case when a.pac_id is not null then a.pac_id else a.ship_id end ) -- unload scans
AND x.ins_db_tm >= '2021-08-26 00:00:00'
AND x.ins_db_tm < '2021-09-06 00:00:00'
AND x.scan_typ = '96'
) AS T
WHERE
T.rn1 = 1
Computed grouping columns
select pac_pre,count(1) from test_zhuangxie group by substring(pac_id from 1 for 2) pac_pre; -- invalid: an alias cannot be declared inside GROUP BY
The correct ways to reference the computed column (both work on GP):
select substring(pac_id from 1 for 2) as pac_pre,count(1) as cnt from test_zhuangxie group by 1 order by 2;
-- the output-column alias also works
select substring(pac_id from 1 for 2) as pac_pre,count(1) as cnt from test_zhuangxie group by pac_pre order by 2;
ClickHouse even allows aliases defined inside the GROUP BY itself:
SELECT
deploy,
fullurl,
count(*) as count
FROM ncompaas6rows
LEFT JOIN temp_target_ip ON ncompaas6rows.dest_ip == temp_target_ip.ip AND ncompaas6rows.dest_port == temp_target_ip.port
WHERE temp_target_ip.ip != ''
GROUP BY
replaceRegexpOne(
replaceRegexpOne('http://' || hostname || url, '[?&].*$', ''),
'/\\d{{11,15}}($|/.*$)',
''
) AS fullurl,
deploy AS deploy
ORDER BY
deploy ASC,
count DESC
;
Compression ratio
Roughly a 3.3x compression ratio.
The pg_relation_size call below apparently does not work on a partitioned table (the parent relation reports only its own size, not its partitions').
-- 37 GB, 50 million rows
select pg_size_pretty(pg_relation_size('public.test_data_uniq_20210907'));
-- 123 GB
ls -lh over the exported csv files
Creating a view adds a dependency, so the underlying table can no longer be dropped directly
A plain DROP TABLE fails; beware.
psql:fetch.sql:14: ERROR: cannot drop append only columnar table test_scan_qian because other objects depend on it
DETAIL: view mytestss2_view depends on append only columnar table test_scan_qian
HINT: Use DROP ... CASCADE to drop the dependent objects too.
Scan-trail problem
- distribution centers
create table test_fenbo as select mc,bm,lb from gs where bm in (select distinct zzz from wdzzz_cw );
- query 1 day of data: not many rows, yet slow
DROP TABLE IF EXISTS test_guiji;
CREATE TABLE test_guiji WITH (
APPENDONLY = TRUE,
COMPRESSLEVEL = 5,
ORIENTATION = COLUMN,
COMPRESSTYPE = ZLIB
) AS SELECT
*
FROM
(
SELECT
a.ship_id,
a.scan_site
FROM
tb_scan AS a
INNER JOIN test_scan_qian AS q ON q.ship_id = a.ship_id
INNER JOIN test_fenbo AS f ON f.bm::text = a.scan_site
WHERE
a.ins_db_tm >= '2021-09-15'
AND a.ins_db_tm < '2021-09-16'
) AS T
DISTRIBUTED BY (ship_id);
SELECT 56093895
Time: 194421.786 ms
- Query 10 days: the range is longer, but an extra scan-type filter was added, so rows can be pruned without any join.
DROP TABLE IF EXISTS test_guiji2;
CREATE TABLE test_guiji2 WITH (
APPENDONLY = TRUE,
COMPRESSLEVEL = 5,
ORIENTATION = COLUMN,
COMPRESSTYPE = ZLIB
) AS SELECT
*
FROM
(
SELECT
a.ship_id,
a.scan_site
FROM
tb_scan AS a
INNER JOIN test_scan_qian AS q ON q.ship_id = a.ship_id
INNER JOIN test_fenbo AS f ON f.bm::text = a.scan_site
WHERE
a.ins_db_tm >= '2021-09-05'
AND a.ins_db_tm < '2021-09-16'
AND a.scan_typ not in ('10','14','24')
) AS T
DISTRIBUTED BY (ship_id);
SELECT 155809387
Time: 277863.165 ms
- Extend to 30 days and test again.
DROP TABLE IF EXISTS test_guiji2;
CREATE TABLE test_guiji2 WITH (
APPENDONLY = TRUE,
COMPRESSLEVEL = 5,
ORIENTATION = COLUMN,
COMPRESSTYPE = ZLIB
) AS SELECT
*
FROM
(
SELECT
a.ship_id,
a.scan_site
FROM
tb_scan AS a
INNER JOIN test_scan_qian AS q ON q.ship_id = a.ship_id
INNER JOIN test_fenbo AS f ON f.bm::text = a.scan_site
WHERE
a.ins_db_tm >= '2021-08-15'
AND a.ins_db_tm < '2021-09-16'
AND a.scan_typ not in ('10','14','24')
) AS T
DISTRIBUTED BY (ship_id);
With 3x the time range, I expected roughly 3x the runtime.
It did not turn out that way; the runtime is acceptable, though the data quality has not been checked.
SELECT 156105801
Time: 467436.559 ms
Regex matching
- A simple Chinese-literal test
CREATE OR REPLACE FUNCTION test_ch(addr text) RETURNS text AS $$
return '河南'
$$LANGUAGE plpythonu;
select test_ch('1') ;
- Python regex test
CREATE OR REPLACE FUNCTION test_ch(addr text) RETURNS text AS $$
import re
pat = re.compile('.{1,5}省')
sch = pat.search(addr) # search
if sch and sch.group(0):
return sch.group(0)
return ''
$$LANGUAGE plpythonu;
This turned out not to work (likely because plpythonu is Python 2, where the pattern is applied to UTF-8 bytes, so .{1,5} counts bytes rather than characters):
select test_ch('河南') ;
select test_ch('湖南省');
- Regex test, observation group 1 (greedy)
CREATE OR REPLACE FUNCTION parse_addr(addr text) RETURNS text AS $$
#coding=utf-8
if 'pattn' not in GD:
import re
pattn =[]
pattn.append( re.compile('(.*省)(.*市)(.*区)'))
pattn.append( re.compile('(.*省)(.*市)(.*县)'))
pattn.append( re.compile('(.*省)(.*市)(.*市)'))
GD['pattn'] = pattn
pat = GD['pattn']
sch = pat[0].search(addr) # search
if sch and sch.group(0):
return sch.group(0)
sch = pat[1].search(addr) # search
if sch and sch.group(0):
return sch.group(0)
sch = pat[2].search(addr) # search
if sch and sch.group(0):
return sch.group(0)
return ''
$$LANGUAGE plpythonu;
After loading each function, all observation groups are benchmarked with the statement below.
DROP TABLE IF EXISTS test_parse_addr;
CREATE TABLE test_parse_addr WITH (
APPENDONLY = TRUE,
COMPRESSLEVEL = 5,
ORIENTATION = COLUMN,
COMPRESSTYPE = ZLIB
) AS
select doc_sn,recv_addr, parse_addr(recv_addr) from tu_doc_info where dtime >= '2021-09-16' and dtime < '2021-09-17' and recv_addr is not null
DISTRIBUTED BY (doc_sn);
Note: the recv_addr is not null filter is there to prevent null-argument errors when calling the function.
SELECT 80780849
Time: 58160.892 ms
Performance looks acceptable; involving Python does not appear to slow the computation down much.
- Regex test, observation group 2 (non-greedy)
Switch to non-greedy matching, since the .{1,5} syntax did not work.
CREATE OR REPLACE FUNCTION parse_addr(addr text) RETURNS text AS $$
#coding=utf-8
if 'pattn' not in GD:
import re
pattn =[]
pattn.append( re.compile('(.*?省)(.*?市)(.*?区)'))
pattn.append( re.compile('(.*?省)(.*?市)(.*?县)'))
pattn.append( re.compile('(.*?省)(.*?市)(.*?市)'))
GD['pattn'] = pattn
pat = GD['pattn']
sch = pat[0].search(addr) # search
if sch and sch.group(0):
return sch.group(0)
sch = pat[1].search(addr) # search
if sch and sch.group(0):
return sch.group(0)
sch = pat[2].search(addr) # search
if sch and sch.group(0):
return sch.group(0)
return ''
$$LANGUAGE plpythonu;
Results:
-- same test statement as before
SELECT 80780849
Time: 165485.472 ms
-- second run
SELECT 80780849
Time: 161170.369 ms
-- third run: re-test the original greedy version; the time drops back, so the regex itself explains the gap
SELECT 80780849
Time: 62675.341 ms
So the regex pattern has a substantial impact on overall runtime.
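The greedy/non-greedy gap can be reproduced in plain Python 3 `re`, outside the database; the sample address is invented, and absolute timings will differ from the cluster numbers above:

```python
import re
import timeit

addr = "河南省郑州市金水区某某路1号"  # invented sample; real input was recv_addr

greedy = re.compile('(.*省)(.*市)(.*区)')
lazy = re.compile('(.*?省)(.*?市)(.*?区)')

# Both patterns extract the same span on this input
assert greedy.search(addr).group(0) == lazy.search(addr).group(0) == "河南省郑州市金水区"

# Relative cost of the two modes (numbers vary by machine and input)
t_greedy = timeit.timeit(lambda: greedy.search(addr), number=10_000)
t_lazy = timeit.timeit(lambda: lazy.search(addr), number=10_000)
print(f"greedy {t_greedy:.4f}s  non-greedy {t_lazy:.4f}s")
```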
Testing the gap between IN, joins, Python functions, and SQL functions
Conclusion: the IN-subquery and INNER JOIN reverse-lookup styles cost about the same, because the IN subquery is tiny. The Python function is markedly slower at large volumes; avoid it unless necessary. Strangest of all, inlining literal values into IN gave a wrong total and a longer runtime, which suggests IN over a literal list is an array traversal. SQL, then, is likely better suited to join-style computation.
- Setup: prepare the base data first.
DROP VIEW IF EXISTS test_guiji_view;
DROP TABLE IF EXISTS test_fenbo;
CREATE TABLE test_fenbo AS SELECT
mc,
bm::text as bm,
lb
FROM
gs
WHERE
bm IN (
SELECT DISTINCT zzz FROM wdzzz_cw
)
DISTRIBUTED BY (bm);
The INNER JOIN approach
The joined table is small.
DROP TABLE IF EXISTS tmp_guiji;
CREATE TABLE tmp_guiji AS
SELECT
a.ship_id,
a.scan_site,
a.ins_db_tm
FROM
tb_scan AS a
INNER JOIN test_scan_qian AS q ON q.ship_id = a.ship_id
INNER JOIN test_fenbo AS f ON f.bm = a.scan_site
WHERE
a.ins_db_tm >= '2021-08-18 00:00:00'
AND a.ins_db_tm < '2021-09-18 00:00:00'
AND a.scan_typ not in ('10','14','24')
DISTRIBUTED BY (ship_id);
SELECT 156322459
Time: 328684.450 ms
The IN-subquery approach
DROP TABLE IF EXISTS tmp_guiji;
CREATE TABLE tmp_guiji AS
SELECT
a.ship_id,
a.scan_site,
a.ins_db_tm
FROM
tb_scan AS a
INNER JOIN test_scan_qian AS q ON q.ship_id = a.ship_id
WHERE
a.ins_db_tm >= '2021-08-18 00:00:00'
AND a.ins_db_tm < '2021-09-18 00:00:00'
AND a.scan_typ not in ('10','14','24')
AND a.scan_site IN (SELECT bm FROM test_fenbo)
DISTRIBUTED BY (ship_id);
SELECT 156322459
Time: 321444.458 ms
IN with inlined literal values
To build the literal list, run select string_agg(bm,'#') from test_fenbo; and then manually replace # with ','. Why the detour? Quote-escaping issues.
My expectation was that performance would be about the same; if it takes longer than before, that suggests IN over literals degenerates into an array traversal.
DROP TABLE IF EXISTS tmp_guiji;
CREATE TABLE tmp_guiji AS
SELECT
a.ship_id,
a.scan_site,
a.ins_db_tm
FROM
tb_scan AS a
INNER JOIN test_scan_qian AS q ON q.ship_id = a.ship_id
WHERE
a.ins_db_tm >= '2021-08-18 00:00:00'
AND a.ins_db_tm < '2021-09-18 00:00:00'
AND a.scan_typ not in ('10','14','24')
AND a.scan_site IN (1,2,3)
DISTRIBUTED BY (ship_id);
SELECT 149342271
Time: 362041.943 ms
The Python-function approach
Only about 150 million rows end up selected, but roughly 12 billion rows are scanned in total; the function is invoked that many times, so call overhead becomes the bottleneck.
At smaller volumes the cost may be masked by write I/O or other work. A separate test matching 3 regexes over 50 million rows took about 60 s with greedy matching and 160 s with non-greedy; whether re's engine is implemented in C I have not verified.
Conclusion: Python is usable, but the data volume has to be weighed first.
-- the check at call time uses is_fenbo
CREATE OR REPLACE FUNCTION is_fenbo(bm text) RETURNS bool AS $$
#coding=utf-8
if 'fenbo' not in GD:
fenbo = {
}
GD['fenbo'] = fenbo
mapping = GD['fenbo']
return bm in mapping
$$LANGUAGE plpythonu;
select is_fenbo('461001') ;
SELECT 156322459
Time: 886813.411 ms
As for implementing the function in PL/pgSQL instead: it turned out not to be necessary, and performance would likely be poor anyway.
Ad-hoc tables with unnest + ARRAY
The unnest + ARRAY pattern defines a virtual inline table, which the query can then filter against.
bm IN (
--# SELECT DISTINCT zzz FROM wdzzz_cw
SELECT DISTINCT zzz FROM wdzzz_cw union
SELECT zzz FROM unnest((ARRAY[528455,518000])::int[]) as zzz
)
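SQLite has no unnest/ARRAY, but a VALUES-backed CTE plays the same inline-table role; a rough analogue of the trick above:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# A virtual two-row table built from literal values, usable inside IN (...)
rows = con.execute(
    "WITH site(zzz) AS (VALUES (528455), (518000)) "
    "SELECT zzz FROM site ORDER BY zzz").fetchall()

assert rows == [(518000,), (528455,)]
```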
CASE with IS NOT NULL
CREATE table mydemo1 (id int);
INSERT INTO mydemo1 VALUES (1),(null);
-- wrong: id cannot be hoisted into the simple CASE form; neither "not null" nor "is not null" parses here
select case id when is not null then id else 100 end as myid from mydemo1;
-- correct
select case when id is not null then id else 100 end as myid from mydemo1;
Example:
case when gj.sffb_id is not null then gj.sffb_id else zx.sffb_id end as sffb_id,
case when gj.sffb_id is not null then gj.sffb_name else zx.sffb_name end as sffb_name,
case when gj.mdfb_id is not null then gj.mdfb_id else zx.mdfb_id end as mdfb_id,
case when gj.mdfb_id is not null then gj.mdfb_name else zx.mdfb_name end as mdfb_name,
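For the simple pattern, case when x is not null then x else y end is exactly what coalesce(x, y) does; a SQLite sketch reusing the mydemo1 example. (The gj/zx columns above still need CASE, since the tested column differs from the returned one.)

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE mydemo1 (id INTEGER)")
con.executemany("INSERT INTO mydemo1 VALUES (?)", [(1,), (None,)])

rows = con.execute(
    "SELECT CASE WHEN id IS NOT NULL THEN id ELSE 100 END AS via_case, "
    "       COALESCE(id, 100) AS via_coalesce "
    "FROM mydemo1").fetchall()

# Both expressions agree: the NULL row falls back to 100
assert rows == [(1, 1), (100, 100)]
```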
Computing per-group percentages
The first idea: compute the group counts, select mycat,count(*) as cnt from mytable group by 1; then wrap that result in an outer query to derive the total and the percentage, select *,cnt/sum(cnt) as percent from (previous result) f. But an aggregate like sum() cannot be applied that way in a non-aggregated select.
The total via a subquery
- Scalar subquery
SELECT
rmk_id,
COUNT (*) AS cnt,
round(
(
(COUNT(*)) :: DECIMAL / (
SELECT
COUNT (*)
FROM
test_data_uniq_20210922
)
) * 100,
2
) AS percent
FROM
test_data_uniq_20210922
GROUP BY
rmk_id
ORDER BY
cnt DESC;
Another approach
Groundwork: first study the following usage.
OVER applies to the final output of the grouped query: with no GROUP BY it runs over the raw rows, but since this query has GROUP BY, a window function in the SELECT list operates on the grouped result.
SELECT
rmk_id,
COUNT (*) AS cnt,
-- the grand total across all groups
sum(count(*)) over() ,
-- the number of groups (the window runs over the grouped rows)
count(1) over()
FROM
test_data_uniq_20210922
GROUP BY
rmk_id
ORDER BY
cnt DESC;
Computing the percentage
Performance is no different from the subquery version. If the total could be zero, a case when guard against division by zero is still needed.
SELECT
rmk_id,
COUNT (*) AS cnt,
round(count(*)::decimal * 100/sum(count(*)) over(), 2) AS percent
FROM
mytable
GROUP BY
1
ORDER BY
2 DESC;
With multiple grouping columns:
SELECT
rmk_id,
a7_ldly,
COUNT (*) AS cnt,
round(count(*)::decimal * 100/sum(count(*)) over(), 2) AS percent
FROM
mytable
GROUP BY
1,2
ORDER BY
3 DESC;
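The window-function percentage can be verified on toy data in SQLite (3.25+ for window functions); the table name and rows are invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE mytable (rmk_id TEXT)")
con.executemany("INSERT INTO mytable VALUES (?)",
                [("1",), ("1",), ("1",), ("2",)])

# sum(count(*)) OVER () runs over the grouped result, yielding the grand total
rows = con.execute(
    "SELECT rmk_id, count(*) AS cnt, "
    "       round(count(*) * 100.0 / sum(count(*)) OVER (), 2) AS percent "
    "FROM mytable GROUP BY 1 ORDER BY 2 DESC").fetchall()

assert rows == [("1", 3, 75.0), ("2", 1, 25.0)]
```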
Beyond these, a WITH clause can precompute the total up front, but the effect seems about the same.
Pseudo-types cannot be stored in a table
CREATE TABLE gj WITH (
APPENDONLY = TRUE,
COMPRESSLEVEL = 5,
ORIENTATION = COLUMN,
COMPRESSTYPE = ZLIB
) AS
select
ship_id,
max(ins_db_tm),
min(ins_db_tm),
array_agg(row(
grp_ship_id,
scan_site,
scan_emp,
ins_db_tm,
scan_tm,
delv_emp,
scan_typ
) order by ins_db_tm
)
from tb_scan where ins_db_tm >='2021-08-26' and ins_db_tm < '2021-09-26'
and ship_id in (select ship_id from test_scan_qian)
group by ship_id
DISTRIBUTED BY (ship_id);
ERROR: column "array_agg" has pseudo-type record[]
Prefix analysis
SELECT
SUBSTRING (grp_ship_id FROM 1 FOR 2) AS pre,
COUNT (1) AS cnt
FROM
tb_scan
WHERE
ins_db_tm >= '2021-08-20'
AND ins_db_tm < '2021-09-27'
AND grp_ship_id IS NOT NULL
GROUP BY
1
ORDER BY
2;
Add a filter column to cut the row count and speed things up
Without a filter, far more records have to be written.
DROP TABLE IF EXISTS test_qianshou_package;
CREATE TABLE test_qianshou_package WITH (
APPENDONLY = TRUE,
COMPRESSLEVEL = 5,
ORIENTATION = COLUMN,
COMPRESSTYPE = ZLIB
) AS
select ship_id,grp_ship_id from test_qianshou_guiji group by ship_id,grp_ship_id
DISTRIBUTED BY (ship_id);
SELECT 127765499
Time: 23501.989 ms
With a filter condition added:
DROP TABLE IF EXISTS test_qianshou_package;
CREATE TABLE test_qianshou_package WITH (
APPENDONLY = TRUE,
COMPRESSLEVEL = 5,
ORIENTATION = COLUMN,
COMPRESSTYPE = ZLIB
) AS
select ship_id,grp_ship_id from test_qianshou_guiji where grp_ship_id is not null group by ship_id,grp_ship_id
DISTRIBUTED BY (ship_id);
SELECT 67652643
Time: 10959.580 ms
Comparing the two runs: half the rows, half the time. Very likely the system is bottlenecked on write throughput. It also suggests that extra computed columns can be added at load time to serve as filter or partition conditions, avoiding the cost of joining large tables later.
Subqueries: SELECT-list subquery vs FROM subquery vs WITH temp table
FROM subquery
This is the conventional approach. The aggregation happens inside the FROM subquery, and columns not covered by the aggregation cannot appear in the outer SELECT, so everything aggregate-related is pushed into it. Note that the filter conditions also go inside the subquery, which prunes early. The inner ORDER BY would normally be pointless, but the LIMIT makes it necessary.
select gs.mc as scan_name ,a.*
from (
select scan_site,count(1) as cnt from tb_scan
where ins_db_tm >='2021-10-18' and ins_db_tm < '2021-10-19'
group by scan_site
order by cnt limit 100
) a
left join gs on gs.bm::text = a.scan_site;
Computed column in the SELECT list
This form needs no FROM subquery: the outer table drives lookups into the inner one, and it is simpler to write; a style I had not used much before.
Note that scan_name cannot produce duplicates, because its lookup key scan_site is the grouping key. (Does this also help performance?)
select (select mc from gs where bm::text = a.scan_site ) as scan_name,scan_site,count(1) as cnt from tb_scan as a
where ins_db_tm >='2021-10-18' and ins_db_tm < '2021-10-19'
group by scan_site
order by cnt limit 100;
WITH temp table
It occurred to me that this form also works; it avoids the FROM subquery entirely.
A WITH temp table can usually replace a FROM subquery, and the SQL below shows how structurally similar the two are.
with a as (
select scan_site,count(1) as cnt from tb_scan
where ins_db_tm >='2021-10-18' and ins_db_tm < '2021-10-19'
group by scan_site
order by cnt limit 100
)
select gs.mc as scan_name,a.* from a
left join gs on gs.bm::text = a.scan_site;
Question: with multiple CTEs in one WITH clause, can a later CTE use an earlier CTE's result? A quick test says yes. Which raises a rather bold idea: could every preceding query be rewritten as a chain of temp tables, so that literally "one SQL statement" completes the whole extraction job?
WITH temp tables solve many nested-query problems.
Summary:
The two forms perform about the same; both run in roughly 65 seconds, the latter only a few seconds faster.
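The "later CTE reads an earlier CTE" behavior is easy to confirm in SQLite with invented data:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tb_scan (scan_site TEXT)")
con.executemany("INSERT INTO tb_scan VALUES (?)",
                [("a",), ("a",), ("b",), ("c",), ("c",), ("c",)])

rows = con.execute("""
    WITH site_cnt AS (
        SELECT scan_site, count(1) AS cnt FROM tb_scan GROUP BY scan_site
    ),
    top_sites AS (               -- a later CTE can read an earlier one
        SELECT * FROM site_cnt ORDER BY cnt DESC LIMIT 2
    )
    SELECT scan_site, cnt FROM top_sites ORDER BY cnt DESC
""").fetchall()

assert rows == [("c", 3), ("a", 2)]
```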
do-while
- Grouping
-- query 1 day
select count(distinct tm),count(tm) from addr_real where ldsj >='2021-10-19' and ldsj < '2021-10-20' and tm like '317%' ;
-- query n days
select to_char(ldsj,'YYYY-MM-DD') as day, count(distinct tm) as distinct_cnt,count(tm) as cnt from addr_real where ldsj >='2021-10-01' and tm like '317%' group by 1 order by 1;
The query above groups by what is effectively the partition key, so running n separate commands, one per day, might be faster. Put another way: when grouping, try to use the group value as a filter condition as well.
If a loop is undesirable, a subquery seems workable too.
- The subquery idea.
It can be written as below, though it does not feel efficient:
select (select count(distinct tm) from addr_real where <conditions> ) from <date list> as a
- Or write a do-loop directly.
Extension:
What about grouping the data by day and then joining? A do-loop, a subquery, or something else?
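The per-day do-loop idea can be sketched in client code: one statement per day replaces the single GROUP BY (table and dates invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE addr_real (ldsj TEXT, tm TEXT)")
con.executemany("INSERT INTO addr_real VALUES (?, ?)", [
    ("2021-10-19", "317a"), ("2021-10-19", "317a"), ("2021-10-19", "317b"),
    ("2021-10-20", "317c"),
])

# One statement per day (the do-loop idea) instead of a single GROUP BY
per_day = {}
for day in ("2021-10-19", "2021-10-20"):
    cnt, dcnt = con.execute(
        "SELECT count(tm), count(DISTINCT tm) FROM addr_real "
        "WHERE ldsj = ? AND tm LIKE '317%'", (day,)).fetchone()
    per_day[day] = (dcnt, cnt)

assert per_day == {"2021-10-19": (2, 3), "2021-10-20": (1, 1)}
```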
A colleague's code: SELECT INTO + count(distinct case)
The code below surprised me; my own instinct would have been insert into table from (select * from table1 union all select * from table2 ...) f.
It turns out SELECT INTO targets the first table and can then be followed directly by UNION ALL of the others. (I knew the SELECT INTO syntax, but combining it with UNION ALL was new to me.)
select * into cyf_etl_dta_202110242526
from cyf_etl_dta_20211024
union all
select * from cyf_etl_dta_20211025
union all
select * from cyf_etl_dta_20211026
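SQLite has no SELECT INTO, but the equivalent CREATE TABLE ... AS SELECT with UNION ALL demonstrates the same one-statement merge; the table names mirror the example, with invented contents:

```python
import sqlite3

con = sqlite3.connect(":memory:")
for name, n in [("cyf_etl_dta_20211024", 2),
                ("cyf_etl_dta_20211025", 3),
                ("cyf_etl_dta_20211026", 1)]:
    con.execute(f"CREATE TABLE {name} (v INTEGER)")
    con.executemany(f"INSERT INTO {name} VALUES (?)", [(i,) for i in range(n)])

# One statement materializes the union of the three daily tables
con.execute("""
    CREATE TABLE cyf_etl_dta_202110242526 AS
    SELECT * FROM cyf_etl_dta_20211024
    UNION ALL SELECT * FROM cyf_etl_dta_20211025
    UNION ALL SELECT * FROM cyf_etl_dta_20211026
""")

total = con.execute("SELECT count(*) FROM cyf_etl_dta_202110242526").fetchone()[0]
assert total == 6
```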
Several ways to count
select
Dtime::DATE as date,
min(Dtime) as min_date,
max(Dtime) as max_date,
count(*),
count(distinct Doc_Sn) as ttl, -- overall
count(case when Doc_Src in ('cainiao') then Doc_Sn end) as cainiao, -- cainiao, matching rows
count(distinct case when Doc_Src in ('cainiao') then Doc_Sn end) as cainiao_distinct -- cainiao, distinct docs
from tu_doc_info
where Dtime>='2021-10-24' and Dtime<'2021-10-27'
group by Dtime::DATE
order by Dtime::DATE
Besides these:
sum(case when <condition> then 1 else 0 end)
count(<condition> or null), e.g. count(sex = '男' or null)
Those are the main counting idioms.
I had not realized count could even combine distinct with case when.
SQL syntax clearly composes further than I had dared to push it.
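All the counting idioms above, including distinct + case when and the condition-or-null trick, can be exercised on toy rows in SQLite:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tu_doc_info (Doc_Sn TEXT, Doc_Src TEXT)")
con.executemany("INSERT INTO tu_doc_info VALUES (?, ?)", [
    ("s1", "cainiao"), ("s1", "cainiao"), ("s2", "cainiao"), ("s3", "other"),
])

row = con.execute("""
    SELECT
        count(*),                                                   -- all rows
        count(DISTINCT Doc_Sn),                                     -- distinct docs
        count(CASE WHEN Doc_Src = 'cainiao' THEN Doc_Sn END),       -- matching rows
        count(DISTINCT CASE WHEN Doc_Src = 'cainiao' THEN Doc_Sn END),
        sum(CASE WHEN Doc_Src = 'cainiao' THEN 1 ELSE 0 END),       -- sum-as-count
        count(Doc_Src = 'cainiao' OR NULL)                          -- condition-or-null
    FROM tu_doc_info
""").fetchone()

assert row == (4, 3, 3, 2, 3, 3)
```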
Ideas
The scan table splits by type into pickup (揽件), distribution (分发), and delivery confirmation (签收): one physical table that is logically several. For each type we need the earliest or latest record, restricted to data within one month.
Optimization ideas, from simplest to most complex:
- Build indexes.
- First extract the unique records across many days, mainly to simplify multi-day lookups.
- Rework the SQL: previously each type was deduplicated separately; instead deduplicate during the join. Would that be faster?
- Partition optimization: DISTRIBUTED BY the primary key plus partitioning on the primary key; ignore time for now and process all waybills first?
- Use GP's JSON support to collapse multiple rows into one per key, avoiding three passes over the table?
- A custom function: scan the table once and assemble pickup/distribution/delivery per waybill?
- Precompute 1-2 months of pickup/distribution/delivery data in advance.
- The min/max is always computed over a rolling range, e.g. one month. Could the extrema be maintained incrementally, sliding a window over the existing month of data?