PG Data Computation Practices

Notes on operational SQL against GP (Greenplum): which statements are performance-heavy, and common operating patterns, e.g. how to deduplicate and where deduplication costs the most. One might expect that more query conditions mean more elapsed time; in practice, extra conditions often shrink the data and make the final runtime shorter.

Use joins to reduce the amount of data that needs deduplication.

Most SQL subqueries can be rewritten as joins. SQL is also essentially a map/reduce model: map corresponds to per-row computations on fields, while reduce corresponds to grouping and aggregating the data.
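As a toy illustration (reusing column names from the scan table that appears later), the map/reduce reading of a query looks like this: the select-list expressions are the map step, GROUP BY plus the aggregates are the reduce step.

```sql
-- map: per-row field computations in the select list
-- reduce: grouping and aggregating with GROUP BY
SELECT
    scan_site,
    to_char(scan_tm, 'YYYY-MM-DD') AS scan_day,  -- map
    count(1) AS cnt                              -- reduce
FROM tb_scan
GROUP BY scan_site, to_char(scan_tm, 'YYYY-MM-DD');
```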

Operations

DISTRIBUTED

First, the original table is declared with DISTRIBUTED BY (ship_id). Select the rows matching certain conditions into a new table, and compare the elapsed time with and without a distribution clause.

DROP TABLE IF EXISTS test_scan_qian;
CREATE TABLE test_scan_qian WITH (
    APPENDONLY = TRUE,
    COMPRESSLEVEL = 5,
    ORIENTATION = COLUMN,
    COMPRESSTYPE = ZLIB
) AS SELECT
    *
FROM
    (
        SELECT
            ship_id,
            scan_site,
            scan_emp,
            ins_db_tm,
            scan_tm,
            scan_typ,
            udf_1,
            udf_4,
            ROW_NUMBER () OVER (
                PARTITION BY ship_id
                ORDER BY
                    ins_db_tm
            ) AS rn
        FROM
            tb_scan
        WHERE
            ins_db_tm >= '2021-07-09'
        AND ins_db_tm < '2021-07-10'
        AND scan_typ = '10'
        AND rmk_id = '1'
    ) AS T
WHERE
    T.rn = 1
DISTRIBUTED BY (ship_id);  -- test the effect of this final DISTRIBUTED clause on elapsed time

With DISTRIBUTED

SELECT 54062499
Time: 23343.444 ms

Without DISTRIBUTED

SELECT 54062499
Time: 22795.694 ms

Apparently, no significant difference.

One-pass querying

Time partition pruning

All the queries below run against a single table partitioned by time.

-- sign-off scans, 1-day window
SELECT 54062499
Time: 23343.444 ms

-- dispatch scans, 10-day window
SELECT 515138696
Time: 217047.224 ms

-- pickup scans, 30-day window
SELECT 1630545559
Time: 761096.137 ms

Elapsed time scales roughly with the row count.

Since all of these hit the same table, could they be finished in a single pass to save time?

Comparing difference sets

Compare the difference set once via the primary key.

select count(1) from test_scan_qian09 as a  left join test_scan_qian as b on a.ship_id = b.ship_id where b.ship_id is null ; 
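The same difference set can be written as a NOT EXISTS anti-join; on most planners both spellings produce the same anti-join plan (table names as above):

```sql
select count(1)
from test_scan_qian09 as a
where not exists (
    select 1 from test_scan_qian as b where b.ship_id = a.ship_id
);
```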

Profiling the sample data

sum(case when zyfifthyz is not null then 1 else 0 end) is simply count(column).

select 
sum(case when zyfifthyz is not null then 1 else 0 end) as  zyfifthyz_cnt,
sum(case when rule5a is not null then 1 else 0 end) as  rule5a_cnt,
sum(case when zyfifthgj is not null then 1 else 0 end) as  zyfifthgj_cnt,
sum(case when rule5b is not null then 1 else 0 end) as  rule5b_cnt,
sum(case when zyfifthdsd is not null then 1 else 0 end) as  zyfifthdsd_cnt,
sum(case when rule5c is not null then 1 else 0 end) as  rule5c_cnt,
sum(case when sfcalwd is not null then 1 else 0 end) as  sfcalwd_cnt,
sum(case when sfcalpsm is not null then 1 else 0 end) as  sfcalpsm_cnt,
sum(case when sfcalyz is not null then 1 else 0 end) as  sfcalyz_cnt,
sum(case when sfcalgj is not null then 1 else 0 end) as  sfcalgj_cnt,
sum(case when sfcaldsd is not null then 1 else 0 end) as  sfcaldsd_cnt,
sum(case when cnmdwd is not null then 1 else 0 end) as  cnmdwd_cnt,
sum(case when cnpsm is not null then 1 else 0 end) as  cnpsm_cnt,
sum(case when cnfifthcode is not null then 1 else 0 end) as  cnfifthcode_cnt,
sum(case when edmwlwd is not null then 1 else 0 end) as  edmwlwd_cnt,
sum(case when edmwlpsm is not null then 1 else 0 end) as  edmwlpsm_cnt,
sum(case when edmhisqcwd is not null then 1 else 0 end) as  edmhisqcwd_cnt,
sum(case when edmhisqcpsm is not null then 1 else 0 end) as  edmhisqcpsm_cnt,
sum(case when sdmrgzdwd is not null then 1 else 0 end) as  sdmrgzdwd_cnt,
sum(case when sdmrgzdpsm is not null then 1 else 0 end) as  sdmrgzdpsm_cnt,
sum(case when sdmgjcrgzdwd is not null then 1 else 0 end) as  sdmgjcrgzdwd_cnt,
sum(case when sdmgjcrgzdpsm is not null then 1 else 0 end) as  sdmgjcrgzdpsm_cnt,
sum(case when sdmkeysdicpsfwwd is not null then 1 else 0 end) as  sdmkeysdicpsfwwd_cnt,
sum(case when sdmkeysdicpsfwpsm is not null then 1 else 0 end) as  sdmkeysdicpsfwpsm_cnt,
sum(case when sfqccalwd is not null then 1 else 0 end) as  sfqccalwd_cnt,
sum(case when sfqccalpsm is not null then 1 else 0 end) as  sfqccalpsm_cnt,
sum(case when edmhisfcroadnumberwd is not null then 1 else 0 end) as  edmhisfcroadnumberwd_cnt,
sum(case when edmhisfcroadnumberpsm is not null then 1 else 0 end) as  edmhisfcroadnumberpsm_cnt,
sum(case when edmhisfcpoiwd is not null then 1 else 0 end) as  edmhisfcpoiwd_cnt,
sum(case when edmhisfcpoipsm is not null then 1 else 0 end) as  edmhisfcpoipsm_cnt
from test_alldata;
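If the Greenplum version is new enough (the FILTER clause arrived in PostgreSQL 9.4), the same per-column counts can be written more compactly; count(column) also works, since count skips NULLs. A sketch for the first few columns:

```sql
select
    count(*) filter (where zyfifthyz is not null) as zyfifthyz_cnt,
    count(*) filter (where rule5a is not null)    as rule5a_cnt,
    count(zyfifthgj)                              as zyfifthgj_cnt  -- count() skips NULLs
from test_alldata;
```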

select count(ship_id),count(distinct ship_id)  from addr_real  where ldsj >= '2021-07-10' and ldsj < '2021-07-11';

-- results
zyfifthyz_cnt             | 24260067
rule5a_cnt                | 59316603
zyfifthgj_cnt             | 10092057
rule5b_cnt                | 59316603
zyfifthdsd_cnt            | 21988239
rule5c_cnt                | 59316603
sfcalwd_cnt               | 40757859
sfcalpsm_cnt              | 40757859
sfcalyz_cnt               | 8530955
sfcalgj_cnt               | 2099311
sfcaldsd_cnt              | 6041209
cnmdwd_cnt                | 21258103
cnpsm_cnt                 | 20862872
cnfifthcode_cnt           | 3772426
edmwlwd_cnt               | 17365571
edmwlpsm_cnt              | 17365571
edmhisqcwd_cnt            | 44155107
edmhisqcpsm_cnt           | 44155107
sdmrgzdwd_cnt             | 0
sdmrgzdpsm_cnt            | 0
sdmgjcrgzdwd_cnt          | 213877
sdmgjcrgzdpsm_cnt         | 49756
sdmkeysdicpsfwwd_cnt      | 16013344
sdmkeysdicpsfwpsm_cnt     | 15912223
sfqccalwd_cnt             | 0
sfqccalpsm_cnt            | 0
edmhisfcroadnumberwd_cnt  | 14759277
edmhisfcroadnumberpsm_cnt | 14759277
edmhisfcpoiwd_cnt         | 46548511
edmhisfcpoipsm_cnt        | 46548511

Finding which rows are duplicated

select * from test_alldata_20210710 where ship_id in (select ship_id from  test_alldata_20210710 group by ship_id having count(ship_id) > 1 limit 10);

-- or with a self-join?

Deduplication approaches

select distinct on (some_col) * from sometable order by some_col, other_col desc;

Roughly: the columns inside ON act as the grouping key; which row is kept depends on the ORDER BY, and the first row per group is returned. Note the ORDER BY must start with the DISTINCT ON columns.
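A concrete sketch with the scan table from above: keep, for each ship_id, the row with the latest scan_tm.

```sql
select distinct on (ship_id) *
from tb_scan
order by ship_id, scan_tm desc;
```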

Reusing computed columns

The usual solution is a subquery: compute the expression once, then select from the result.

http://cn.voidcc.com/question/p-tvmbteju-s.html

Can a later expression in the same select list reference an earlier alias directly? MySQL allows this via user variables. The syntax is roughly:

`@var := expression` assigns the expression's value to a variable, which later expressions reference as @var (note that MySQL does not guarantee the evaluation order of user variables within one SELECT).

SELECT  @b := UPPER(hostname) as b ,SUBSTR(@b,1,3) as c from access_log limit 10;

-- another usage
set @hello = 1;
select @hello;

Does GP have similar syntax? Tested it; apparently not.
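In PostgreSQL/Greenplum the portable equivalent is to compute the expression once in a subquery (or CTE) and reference the alias in the outer query; names follow the MySQL example above:

```sql
SELECT b, SUBSTR(b, 1, 3) AS c
FROM (SELECT UPPER(hostname) AS b FROM access_log) AS t
LIMIT 10;
```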

Dedup counting on a large table

Deduplicated counting on one column of a large table performs as follows:

select count(recv_addr) from tu_doc_info;                           
-[ RECORD 1 ]------
count | 28018141557
Time: 4371634.524 ms

select count(1) from (select 1 from tu_doc_info group by recv_addr) as t;       
-[ RECORD 1 ]-----
count | 3898994670

Time: 29168870.915 ms

For comparison, counting a MySQL partitioned table of 20+ million rows takes roughly 20-30 seconds.

Fast deduplication

For a column like recv_addr: take its md5 and store it into a side table, then group by or distinct on the hash; how that performs is untested.
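A sketch of the idea, untested here: materialize the md5 of the long address once, then dedup on the short fixed-width hash (this counts distinct hashes, which can collide in principle).

```sql
DROP TABLE IF EXISTS tu_doc_addr_md5;
CREATE TABLE tu_doc_addr_md5 WITH (
    APPENDONLY = TRUE,
    COMPRESSLEVEL = 5,
    ORIENTATION = COLUMN,
    COMPRESSTYPE = ZLIB
) AS
SELECT md5(recv_addr) AS addr_md5
FROM tu_doc_info
DISTRIBUTED BY (addr_md5);

-- dedup count on the 32-character hash instead of the long text column
SELECT count(1) FROM (SELECT 1 FROM tu_doc_addr_md5 GROUP BY addr_md5) AS t;
```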

Counting duplicates

--  this dedup count runs somewhat faster than a plain distinct, with identical results
select count(1) from (select count(1) from ydwr_base_salesman_sorting group by branch_code ,salesman_id ) as f;
-- total row count
select count(1) from ydwr_base_salesman_sorting;
-- which (branch_code, salesman_id) pairs are duplicated
select branch_code,salesman_id from ydwr_base_salesman_sorting group by branch_code ,salesman_id having count(1) > 1 ;

-- feed the previous result into a join to see the concrete duplicated rows
select * from ydwr_base_salesman_sorting as a right join (select branch_code,salesman_id from ydwr_base_salesman_sorting group by branch_code ,salesman_id having count(1) > 1) as b
on a.branch_code = b.branch_code and a.salesman_id = b.salesman_id;

-- a window function could also work, but how to aggregate over both columns at once?
select * from ydwr_base_salesman_sorting ;

Deduplicating the address table

When extracting data, deduplication is unavoidable. Should we dedup first or join first? Should computed columns be evaluated early, at the very end, or after an intermediate dedup?

Scenario

test_scan_qian is the driving table of the final join, so use it to shrink the data during dedup.

addr_real_uniq is the target table: the deduplicated data we want, ready for the final join.

The dedup does not strictly require joining test_scan_qian, but without it the data volume is too large; adding the join condition shrinks the data, which conveniently also makes the final join cheaper.

Elapsed times

Code 1 extracts the data on its own, with no join to shrink it; the data volume is large, so it takes the longest.

Code 2 was meant to shrink the data with a join, but it used LEFT JOIN instead of RIGHT JOIN, so the total amount of data scanned did not drop and the join only added computation time.

Code 3 switches to RIGHT JOIN, so the data volume drops and the total time falls to about 2/3.

Code 4 is broadly the same as code 3, but the filter moved from ON into WHERE and the time dropped further (counterintuitively); presumably both forms still benefit from partition pruning.

As for whether a condition belongs in ON or in WHERE: when joining many tables, keeping it in ON tends to be more convenient.

Code 5 changes the join type again, from RIGHT JOIN to INNER JOIN.

Code 6 uses the FROM a, b WHERE ... form.

Execution times are all broadly similar; the engine presumably normalizes these spellings into the same plan, so they end up in the same place.

Codes 7 and 8 start from code 5 and strip work away: first the outer dedup filter, then the window function, leaving just the bare join. The time did not improve, while the output row count grew. Which suggests that adding conditions does not necessarily add runtime; conversely, joining less and deduplicating only at the very end (assuming duplicates are few) might well be faster.

Reduce the number of scans and the number of joins.

Row counts

Codes 1 and 2 match, and codes 3-6 are almost identical, yet code 3 returns one extra row. Why? Most likely because code 3 keeps the filter inside the ON of a RIGHT JOIN: unmatched right-side rows survive with NULL left-side columns, they all share tm IS NULL, and exactly one of them passes rn = 1.

Elapsed time correlates with: total row count and number of columns.

Code 1

-- a7 address dedup
DROP TABLE IF EXISTS addr_real_uniq;
CREATE TABLE addr_real_uniq WITH (
    APPENDONLY = TRUE,
    COMPRESSLEVEL = 5,
    ORIENTATION = COLUMN,
    COMPRESSTYPE = ZLIB
) AS 
select 
*
from 
(
    select
        *,
        -- CAINIAO>YDEC>JD>PDD>BM>YJ>PDDGJ>其他>OOS_NON_ELE,OOS>PC_REAL_NAME_NON_LEL,PC_REAL_NAME>COD
        -- as ldsj_val,
        ROW_NUMBER () OVER (
            PARTITION BY tm
            ORDER BY
                case 
                when  ldly = 'CAINIAO' then 1
                when  ldly = 'YDEC' then 2
                when  ldly = 'JD' then 3
                when  ldly = 'PDD' then 4
                when  ldly = 'BM' then 5
                when  ldly = 'YJ' then 6
                when  ldly = 'PDDGJ' then 7
                when  ldly = '其他' then 8
                when  ldly = 'OOS_NON_ELE' then 9
                when  ldly = 'OOS' then 10
                when  ldly = 'PC_REAL_NAME_NON_LEL' then 11
                when  ldly = 'PC_REAL_NAME' then 12
                when  ldly = 'COD' then 13
                else  14
            end ASC,ldsj desc
        ) AS rn
    FROM
        addr_real
    WHERE ldsj >= '2021-08-15 00:00:00'
) t
where rn = 1
DISTRIBUTED BY (tm)
;
SELECT 549681179
Time: 1484293.820 ms

Code 2

DROP TABLE IF EXISTS addr_real_uniq;
CREATE TABLE addr_real_uniq WITH (
    APPENDONLY = TRUE,
    COMPRESSLEVEL = 5,
    ORIENTATION = COLUMN,
    COMPRESSTYPE = ZLIB
) AS 
select 
*
from 
(
    select
        a.*,
        -- CAINIAO>YDEC>JD>PDD>BM>YJ>PDDGJ>其他>OOS_NON_ELE,OOS>PC_REAL_NAME_NON_LEL,PC_REAL_NAME>COD
        -- as ldsj_val,
        ROW_NUMBER () OVER (
            PARTITION BY tm
            ORDER BY
                case 
                when  ldly = 'CAINIAO' then 1
                when  ldly = 'YDEC' then 2
                when  ldly = 'JD' then 3
                when  ldly = 'PDD' then 4
                when  ldly = 'BM' then 5
                when  ldly = 'YJ' then 6
                when  ldly = 'PDDGJ' then 7
                when  ldly = '其他' then 8
                when  ldly = 'OOS_NON_ELE' then 9
                when  ldly = 'OOS' then 10
                when  ldly = 'PC_REAL_NAME_NON_LEL' then 11
                when  ldly = 'PC_REAL_NAME' then 12
                when  ldly = 'COD' then 13
                else  14
            end ASC,ldsj desc
        ) AS rn
    FROM
        addr_real as a
    LEFT JOIN test_scan_qian as q on a.tm = q.ship_id
    WHERE ldsj >= '2021-08-15 00:00:00'
) t
where rn = 1
DISTRIBUTED BY (tm)
;

The join above was the wrong way around (it should have been RIGHT JOIN): the data did not shrink, the join added computation, and the time got even worse.

SELECT 549681179
Time: 1606665.073 ms

Code 3

DROP TABLE IF EXISTS addr_real_uniq;
CREATE TABLE addr_real_uniq WITH (
    APPENDONLY = TRUE,
    COMPRESSLEVEL = 5,
    ORIENTATION = COLUMN,
    COMPRESSTYPE = ZLIB
) AS 
select 
*
from 
(
    select
        a.*,
        -- CAINIAO>YDEC>JD>PDD>BM>YJ>PDDGJ>其他>OOS_NON_ELE,OOS>PC_REAL_NAME_NON_LEL,PC_REAL_NAME>COD
        -- as ldsj_val,
        ROW_NUMBER () OVER (
            PARTITION BY tm
            ORDER BY
                case 
                when  ldly = 'CAINIAO' then 1
                when  ldly = 'YDEC' then 2
                when  ldly = 'JD' then 3
                when  ldly = 'PDD' then 4
                when  ldly = 'BM' then 5
                when  ldly = 'YJ' then 6
                when  ldly = 'PDDGJ' then 7
                when  ldly = '其他' then 8
                when  ldly = 'OOS_NON_ELE' then 9
                when  ldly = 'OOS' then 10
                when  ldly = 'PC_REAL_NAME_NON_LEL' then 11
                when  ldly = 'PC_REAL_NAME' then 12
                when  ldly = 'COD' then 13
                else  14
            end ASC,ldsj desc
        ) AS rn
    FROM
        addr_real as a
    RIGHT JOIN test_scan_qian as q on a.tm = q.ship_id and ldsj >= '2021-08-15 00:00:00'
) t
where rn = 1
DISTRIBUTED BY (tm)
;
SELECT 48430255
Time: 524957.187 ms

Code 4

DROP TABLE IF EXISTS addr_real_uniq;
CREATE TABLE addr_real_uniq WITH (
    APPENDONLY = TRUE,
    COMPRESSLEVEL = 5,
    ORIENTATION = COLUMN,
    COMPRESSTYPE = ZLIB
) AS 
select 
*
from 
(
    select
        a.*,
        -- CAINIAO>YDEC>JD>PDD>BM>YJ>PDDGJ>其他>OOS_NON_ELE,OOS>PC_REAL_NAME_NON_LEL,PC_REAL_NAME>COD
        -- as ldsj_val,
        ROW_NUMBER () OVER (
            PARTITION BY tm
            ORDER BY
                case 
                when  ldly = 'CAINIAO' then 1
                when  ldly = 'YDEC' then 2
                when  ldly = 'JD' then 3
                when  ldly = 'PDD' then 4
                when  ldly = 'BM' then 5
                when  ldly = 'YJ' then 6
                when  ldly = 'PDDGJ' then 7
                when  ldly = '其他' then 8
                when  ldly = 'OOS_NON_ELE' then 9
                when  ldly = 'OOS' then 10
                when  ldly = 'PC_REAL_NAME_NON_LEL' then 11
                when  ldly = 'PC_REAL_NAME' then 12
                when  ldly = 'COD' then 13
                else  14
            end ASC,ldsj desc
        ) AS rn
    FROM
        addr_real as a
    RIGHT JOIN test_scan_qian as q on a.tm = q.ship_id 
    where ldsj >= '2021-08-15 00:00:00'
) t
where rn = 1
DISTRIBUTED BY (tm)
;
SELECT 48430254
Time: 487740.310 ms

Code 5

DROP TABLE IF EXISTS addr_real_uniq;
CREATE TABLE addr_real_uniq WITH (
    APPENDONLY = TRUE,
    COMPRESSLEVEL = 5,
    ORIENTATION = COLUMN,
    COMPRESSTYPE = ZLIB
) AS 
select 
*
from 
(
    select
        a.*,
        -- CAINIAO>YDEC>JD>PDD>BM>YJ>PDDGJ>其他>OOS_NON_ELE,OOS>PC_REAL_NAME_NON_LEL,PC_REAL_NAME>COD
        -- as ldsj_val,
        ROW_NUMBER () OVER (
            PARTITION BY tm
            ORDER BY
                case 
                when  ldly = 'CAINIAO' then 1
                when  ldly = 'YDEC' then 2
                when  ldly = 'JD' then 3
                when  ldly = 'PDD' then 4
                when  ldly = 'BM' then 5
                when  ldly = 'YJ' then 6
                when  ldly = 'PDDGJ' then 7
                when  ldly = '其他' then 8
                when  ldly = 'OOS_NON_ELE' then 9
                when  ldly = 'OOS' then 10
                when  ldly = 'PC_REAL_NAME_NON_LEL' then 11
                when  ldly = 'PC_REAL_NAME' then 12
                when  ldly = 'COD' then 13
                else  14
            end ASC,ldsj desc
        ) AS rn
    FROM
        addr_real as a
    INNER JOIN test_scan_qian as q on a.tm = q.ship_id 
    where ldsj >= '2021-08-15 00:00:00'
) t
where rn = 1
DISTRIBUTED BY (tm)
;
SELECT 48430254
Time: 480104.767 ms

Code 6

Changed the FROM clause form.

DROP TABLE IF EXISTS addr_real_uniq;
CREATE TABLE addr_real_uniq WITH (
    APPENDONLY = TRUE,
    COMPRESSLEVEL = 5,
    ORIENTATION = COLUMN,
    COMPRESSTYPE = ZLIB
) AS 
select 
*
from 
(
    select
        a.*,
        -- CAINIAO>YDEC>JD>PDD>BM>YJ>PDDGJ>其他>OOS_NON_ELE,OOS>PC_REAL_NAME_NON_LEL,PC_REAL_NAME>COD
        -- as ldsj_val,
        ROW_NUMBER () OVER (
            PARTITION BY tm
            ORDER BY
                case 
                when  ldly = 'CAINIAO' then 1
                when  ldly = 'YDEC' then 2
                when  ldly = 'JD' then 3
                when  ldly = 'PDD' then 4
                when  ldly = 'BM' then 5
                when  ldly = 'YJ' then 6
                when  ldly = 'PDDGJ' then 7
                when  ldly = '其他' then 8
                when  ldly = 'OOS_NON_ELE' then 9
                when  ldly = 'OOS' then 10
                when  ldly = 'PC_REAL_NAME_NON_LEL' then 11
                when  ldly = 'PC_REAL_NAME' then 12
                when  ldly = 'COD' then 13
                else  14
            end ASC,ldsj desc
        ) AS rn
    FROM
        addr_real as a,test_scan_qian as q
    -- join condition expressed in WHERE instead
    where a.tm = q.ship_id and ldsj >= '2021-08-15 00:00:00'
) t
where rn = 1
DISTRIBUTED BY (tm)
;
SELECT 48430254
Time: 473906.605 ms

Code 7

Compared against code 5.

Drop the outer dedup filter to see the cost of the inner part alone.

After the window function, the final dedup WHERE costs little; the difference is small.

DROP TABLE IF EXISTS addr_real_uniq2;
CREATE TABLE addr_real_uniq2 WITH (
    APPENDONLY = TRUE,
    COMPRESSLEVEL = 5,
    ORIENTATION = COLUMN,
    COMPRESSTYPE = ZLIB
) AS 
    select
        a.*,
        ROW_NUMBER () OVER (
            PARTITION BY tm
            ORDER BY
                case 
                when  ldly = 'CAINIAO' then 1
                when  ldly = 'YDEC' then 2
                when  ldly = 'JD' then 3
                when  ldly = 'PDD' then 4
                when  ldly = 'BM' then 5
                when  ldly = 'YJ' then 6
                when  ldly = 'PDDGJ' then 7
                when  ldly = '其他' then 8
                when  ldly = 'OOS_NON_ELE' then 9
                when  ldly = 'OOS' then 10
                when  ldly = 'PC_REAL_NAME_NON_LEL' then 11
                when  ldly = 'PC_REAL_NAME' then 12
                when  ldly = 'COD' then 13
                else  14
            end ASC,ldsj desc
        ) AS rn
    FROM
        addr_real as a
    INNER JOIN test_scan_qian as q on a.tm = q.ship_id 
    where ldsj >= '2021-08-15 00:00:00'
DISTRIBUTED BY (tm)
;
SELECT 55299456
Time: 476894.489 ms

Code 8

Also drop the window function, to measure the pure join cost.

DROP TABLE IF EXISTS addr_real_uniq2;
CREATE TABLE addr_real_uniq2 WITH (
    APPENDONLY = TRUE,
    COMPRESSLEVEL = 5,
    ORIENTATION = COLUMN,
    COMPRESSTYPE = ZLIB
) AS 
    select
        a.*
    FROM
        addr_real as a
    INNER JOIN test_scan_qian as q on a.tm = q.ship_id 
    where ldsj >= '2021-08-15 00:00:00'
DISTRIBUTED BY (tm)
;
SELECT 55299456
Time: 470603.757 ms

i1: dedup with the help of a join

Join against the main table, which is the primary data source.

The query below effectively expresses an inner join, just more verbosely.

DROP TABLE IF EXISTS i1_order_unqi;
CREATE TABLE i1_order_unqi WITH (
    APPENDONLY = TRUE,
    COMPRESSLEVEL = 5,
    ORIENTATION = COLUMN,
    COMPRESSTYPE = ZLIB
) AS 
SELECT
    *
FROM
    (
        SELECT
            i.*, 
            ROW_NUMBER () OVER (
                PARTITION BY i.ship_id
                ORDER BY
                batch_time DESC
            ) AS rn
        FROM
            i1_order as i
        LEFT JOIN test_scan_qian as q ON q.ship_id = i.ship_id and i.batch_time >= '<?=dt('-7 day')?>'
        WHERE q.ship_id IS NOT NULL
    ) t
WHERE
    rn = 1
DISTRIBUTED BY (ship_id);

Silly mistake

RIGHT JOIN test_scan_qian as q ON q.ship_id = i.ship_id and i.batch_time >= '<?=dt('-7 day')?>'

That is still wrong as well; what it really should be is an INNER JOIN:

INNER JOIN test_scan_qian as q ON q.ship_id = i.ship_id and i.batch_time >= '<?=dt('-7 day')?>'

Dedup via IN

CREATE TABLE statements omitted; every final result is stored in the database distributed by ship_id, as an intermediate table.

After optimization the data volume dropped to 1/135 and the elapsed time fell almost 10x, a very clear win. No effort was spent here on the in/join/exists small-table-driving question.

  • After optimization
select  ship_id,string_agg(ship_type,'|') as ship_type from (select ship_id,ship_type from marked_ship 
where ship_id in (select ship_id from test_scan_qian ) -- the added line
group by ship_id,ship_type ) as f group by  ship_id
SELECT 10724200
Time: 62214.511 ms
  • Before optimization
select  ship_id,string_agg(ship_type,'|') as ship_type from (select ship_id,ship_type from marked_ship group by ship_id,ship_type ) as f group by  ship_id
SELECT 1357038779
Time: 524996.890 ms

Column selection

Summary

Selecting fewer columns improves speed, but not by much, e.g. from 900s to 800s. Shrinking the total row count with a join (INNER JOIN), however, brought it down to about 200s.

0 Related data volumes

select count(*) from test_scan_qian;
  count   
----------
 50196784
 
 
 select max(store_time),min(store_time) from test_doc;
         max         |         min         
---------------------+---------------------
 2021-08-26 23:49:52 | 2021-07-24 03:29:29

1 Original dedup requirement

DROP TABLE IF EXISTS test_doc;
CREATE TABLE test_doc WITH (
    APPENDONLY = TRUE,
    COMPRESSLEVEL = 5,
    ORIENTATION = COLUMN,
    COMPRESSTYPE = ZLIB
) AS SELECT
    *
FROM
    (
        SELECT
            d.doc_sn,
            d.doc_src,
            d.store_time,
            d.recv_prov,
            d.recv_city,
            d.recv_ctry,
            d.recv_addr,
            d.send_prov,
            d.send_city,
            d.send_ctry,
            d.send_addr,
            ROW_NUMBER () OVER (
                PARTITION BY doc_sn
                ORDER BY
                    dtime DESC
            ) AS rn
        FROM
            tu_doc_info AS d
        WHERE
            dtime >= '2021-07-27 00:00:00'
        AND dtime < '2021-08-27 00:00:00'
    ) AS T
WHERE
    T.rn = 1
DISTRIBUTED BY (doc_sn);
SELECT 1531072496
Time: 937538.918 ms

2 Fewer columns

Selecting fewer columns does reduce elapsed time, but the improvement ratio is not especially high.

DROP TABLE IF EXISTS test_doc;
CREATE TABLE test_doc WITH (
    APPENDONLY = TRUE,
    COMPRESSLEVEL = 5,
    ORIENTATION = COLUMN,
    COMPRESSTYPE = ZLIB
) AS SELECT
    *
FROM
    (
        SELECT
            d.doc_sn,
            d.doc_src,
            d.store_time,
            d.recv_prov,
            d.recv_city,
            d.recv_ctry,
            d.recv_addr,
            -- d.send_prov,
            -- d.send_city,
            -- d.send_ctry,
            -- d.send_addr,
            ROW_NUMBER () OVER (
                PARTITION BY doc_sn
                ORDER BY
                    dtime DESC
            ) AS rn
        FROM
            tu_doc_info AS d
        WHERE
            dtime >= '2021-07-27 00:00:00'
        AND dtime < '2021-08-27 00:00:00'
    ) AS T
WHERE
    T.rn = 1
DISTRIBUTED BY (doc_sn);
SELECT 1531072496
Time: 833351.744 ms

3 Join to shrink the data volume

DROP TABLE IF EXISTS test_doc;
CREATE TABLE test_doc WITH (
    APPENDONLY = TRUE,
    COMPRESSLEVEL = 5,
    ORIENTATION = COLUMN,
    COMPRESSTYPE = ZLIB
) AS SELECT
    *
FROM
    (
        SELECT
            d.doc_sn,
            d.doc_src,
            d.store_time,
            d.recv_prov,
            d.recv_city,
            d.recv_ctry,
            d.recv_addr,
            d.send_prov,
            d.send_city,
            d.send_ctry,
            d.send_addr,
            ROW_NUMBER () OVER (
                PARTITION BY doc_sn
                ORDER BY
                    dtime DESC
            ) AS rn
        FROM
            tu_doc_info AS d
        INNER JOIN test_scan_qian as q ON q.ship_id = d.doc_sn
        WHERE
            dtime >= '2021-07-27 00:00:00'
        AND dtime < '2021-08-27 00:00:00'
    ) AS T
WHERE
    T.rn = 1
DISTRIBUTED BY (doc_sn);
SELECT 50106931
Time: 221289.113 ms

Few columns, very fast

-- bagging info; tm may be a bag number or a waybill number
DROP TABLE IF EXISTS test_jibao;
CREATE TABLE test_jibao WITH (
    APPENDONLY = TRUE,
    COMPRESSLEVEL = 5,
    ORIENTATION = COLUMN,
    COMPRESSTYPE = ZLIB
) AS SELECT
    *
FROM
    (
        SELECT
            case when j.grp_ship_id is not null then j.grp_ship_id else a.ship_id end as tm,  -- the id used for load/unload lookup (bag number takes priority over waybill number)
            a.ship_id,
            j.grp_ship_id AS pac_id, -- bag number
            j.scan_tm AS pac_time,
            j.scan_site AS pac_sitecode, 
            CASE
            WHEN j.scan_typ IN ('3', '03') THEN
                '分拨集包'
            WHEN j.scan_typ IN ('5', '05') THEN
                '网点集包'
            END AS pac_class,
            ROW_NUMBER () OVER (
                PARTITION BY j.ship_id
                ORDER BY
                    j.ins_db_tm DESC
            ) AS rn
        FROM
            test_scan_qian as a
        LEFT JOIN tb_scan as j on j.ship_id = a.ship_id
        WHERE
            j.ins_db_tm >= '2021-08-23 00:00:00'
        AND j.ins_db_tm < '2021-09-02 00:00:00'
        AND j.scan_typ in ('3','03','5','05')
    ) AS T
WHERE
    T.rn = 1
DISTRIBUTED BY (tm);

The elapsed time is almost too short to believe.

SELECT 37564347
Time: 67449.129 ms

A strange problem

-- load (97) / unload (96) scan info
DROP TABLE IF EXISTS test_zhuangxie;
CREATE TABLE test_zhuangxie WITH (
    APPENDONLY = TRUE,
    COMPRESSLEVEL = 5,
    ORIENTATION = COLUMN,
    COMPRESSTYPE = ZLIB
) AS SELECT
    *
FROM
    (
        SELECT
            a.*,
            z.scan_site AS sffb_id,  -- origin distribution hub
            x.scan_site AS mdfb_id,  -- destination site
            ROW_NUMBER () OVER (
                PARTITION BY a.ship_id
                ORDER BY
                    z.ins_db_tm ASC, -- load scan: take the earliest
                    x.ins_db_tm DESC -- unload scan: take the latest
            ) AS rn1
        FROM
            test_jibao as a
        LEFT JOIN tb_scan as z on z.ship_id = a.ship_id   -- load
        LEFT JOIN tb_scan as x on x.ship_id = a.ship_id   -- unload
        WHERE
            z.ins_db_tm >= '2021-08-23 00:00:00'
        AND z.ins_db_tm < '2021-09-02 00:00:00'
        AND z.scan_typ = '97'
        AND x.ins_db_tm >= '2021-08-23 00:00:00'
        AND x.ins_db_tm < '2021-09-02 00:00:00'
        AND x.scan_typ = '96'
    ) AS T
WHERE
    T.rn1 = 1
DISTRIBUTED BY (tm);
SELECT 57880
Time: 149169.754 ms
develop=# select count(*) from test_jibao ;                                                                                                                                                                
  count   
----------
 37564347

Surprisingly, a lot of data was filtered out. The conditions on z and x sit in WHERE, which discards the NULL-extended rows of the LEFT JOINs and effectively turns them into inner joins. Moving the conditions into the ON clauses keeps every test_jibao row:

DROP TABLE IF EXISTS test_zhuangxie;
CREATE TABLE test_zhuangxie WITH (
    APPENDONLY = TRUE,
    COMPRESSLEVEL = 5,
    ORIENTATION = COLUMN,
    COMPRESSTYPE = ZLIB
) AS SELECT
    *
FROM
    (
        SELECT
            a.*,
            z.scan_site AS sffb_id,  -- origin distribution hub
            x.scan_site AS mdfb_id,  -- destination site
            ROW_NUMBER () OVER (
                PARTITION BY a.ship_id
                ORDER BY
                    z.ins_db_tm ASC, -- load scan: take the earliest
                    x.ins_db_tm DESC -- unload scan: take the latest
            ) AS rn1
        FROM
            test_jibao as a
        LEFT JOIN tb_scan as z on z.ship_id = a.ship_id   -- load
            AND z.ins_db_tm >= '2021-08-23 00:00:00'
            AND z.ins_db_tm < '2021-09-02 00:00:00'
            AND z.scan_typ = '97'
        LEFT JOIN tb_scan as x on x.ship_id = a.ship_id   -- unload
            AND x.ins_db_tm >= '2021-08-23 00:00:00'
            AND x.ins_db_tm < '2021-09-02 00:00:00'
            AND x.scan_typ = '96'
    ) AS T
WHERE
    T.rn1 = 1
DISTRIBUTED BY (tm);
SELECT 37564347
Time: 100181.922 ms

Effect of compressed tables

The benefits of a compressed table: compression shrinks the on-disk size, which reduces disk IO; column orientation also suits warehouse-style workloads very well.

Experiments show that writing into a compressed table does not increase the elapsed time.

Compressed table

DROP TABLE IF EXISTS test_scan_fen;
CREATE TABLE test_scan_fen WITH (
    APPENDONLY = TRUE,
    COMPRESSLEVEL = 5,
    ORIENTATION = COLUMN,
    COMPRESSTYPE = ZLIB
) AS SELECT
    ship_id,
    scan_site,
    -- scan_emp, column replaced by delv_emp
    delv_emp,
    scan_tm,
    rmk_inf,
    -- remark dispatch code
    CASE
    WHEN rmk_inf SIMILAR to '[a-zA-Z0-9]{2}' THEN
        rmk_inf
    ELSE
        null
    END AS mark_code,
    -- whether the remark dispatch code is a flexible-route one, i.e. starts with X/Y/W/Z
    case when  CHAR_LENGTH (rmk_inf) = 2 and (left(upper(rmk_inf),1)='X' or left(upper(rmk_inf),1)='Y' or left(upper(rmk_inf),1)='W' or left(upper(rmk_inf),1)='Z') then
        1
    else 
        0
    end as is_var_code
FROM
    (
        SELECT
            ship_id,
            scan_site,
            --  scan_emp,
            delv_emp,
            scan_tm,
            rmk_inf,
            ROW_NUMBER () OVER (
                PARTITION BY ship_id
                ORDER BY
                    scan_tm DESC
            ) AS rn
        FROM
            tb_scan
        WHERE
            ins_db_tm >= '2021-08-17 00:00:00'
        AND ins_db_tm < '2021-08-27 00:00:00'
        AND scan_typ = '24'
    ) AS T
WHERE
    T.rn = 1
DISTRIBUTED BY (ship_id);
SELECT 483512499
Time: 217465.246 ms

Plain table

DROP TABLE IF EXISTS test_scan_fen;
CREATE TABLE test_scan_fen AS
SELECT
    ship_id,
    scan_site,
    -- scan_emp, column replaced by delv_emp
    delv_emp,
    scan_tm,
    rmk_inf,
    -- remark dispatch code
    CASE
    WHEN rmk_inf SIMILAR to '[a-zA-Z0-9]{2}' THEN
        rmk_inf
    ELSE
        null
    END AS mark_code,
    -- whether the remark dispatch code is a flexible-route one, i.e. starts with X/Y/W/Z
    case when  CHAR_LENGTH (rmk_inf) = 2 and (left(upper(rmk_inf),1)='X' or left(upper(rmk_inf),1)='Y' or left(upper(rmk_inf),1)='W' or left(upper(rmk_inf),1)='Z') then
        1
    else 
        0
    end as is_var_code
FROM
    (
        SELECT
            ship_id,
            scan_site,
            --  scan_emp,
            delv_emp,
            scan_tm,
            rmk_inf,
            ROW_NUMBER () OVER (
                PARTITION BY ship_id
                ORDER BY
                    scan_tm DESC
            ) AS rn
        FROM
            tb_scan
        WHERE
            ins_db_tm >= '2021-08-17 00:00:00'
        AND ins_db_tm < '2021-08-27 00:00:00'
        AND scan_typ = '24'
    ) AS T
WHERE
    T.rn = 1
DISTRIBUTED BY (ship_id);
SELECT 483512499
Time: 209393.916 ms

Views

DROP VIEW IF EXISTS test_scan_fen;

ERROR: "test_scan_fen" is not a view
HINT: Use DROP TABLE to remove a table, DROP EXTERNAL TABLE if external, or DROP FOREIGN TABLE if foreign.

CREATE VIEW test_scan_fen AS SELECT

relation “test_scan_fen” already exists

The errors above show that a view shares the same namespace as a table.

DROP VIEW IF EXISTS test_scan_fen;
CREATE VIEW test_scan_fen AS
SELECT
   ...

Creating a view is fast: it only stores the SQL, nothing actually executes.

Views: computing in one pass

See sql/table_job.sql and sql/view_job.sql for details.

The view run takes about 77% of the direct-table run; faster, but apparently not by much.

test_scan_qian;   changed to a view
test_scan_fen;    changed to a view
test_scan_lan;    changed to a view
test_doc;         changed to a view
test_scan_doc;    unchanged; the output table.

View approach

SELECT 50196784
Time: 1717532.500 ms

Direct table approach.

From 16:12 to 16:50, roughly 38 minutes.

2208822 ms in total:

SELECT 50196784
Time: 39995.549 ms
psql:/yd/table_job.sql:49: NOTICE:  table "test_scan_fen" does not exist, skipping
DROP TABLE
Time: 0.843 ms
SELECT 483512499
Time: 248721.958 ms
psql:/yd/table_job.sql:102: NOTICE:  table "test_scan_lan" does not exist, skipping
DROP TABLE
Time: 0.731 ms
SELECT 1506757081
Time: 700615.860 ms
psql:/yd/table_job.sql:135: NOTICE:  table "test_doc" does not exist, skipping
DROP TABLE
Time: 0.663 ms
SELECT 1531072496
Time: 1022834.503 ms
DROP TABLE
Time: 24.487 ms
SELECT 50196784
Time: 196633.527 ms

Verifying a CSV import

# check current disk usage of the files
du -sh
# count total lines; the exported csv files were written directly via pdo fetch -> fputcsv: no header row, but a trailing blank line
# not always exact: a blank line inside a field value could be miscounted
cat marked_ship_20210*|wc -l  

psql checks

Check that the total row count matches, and that a sample record looks normal.

select * from marked_ship where insert_db_date >= '2021-07-11' limit 1; 
select count(1) from marked_ship where insert_db_date >= '2021-07-11'; 

If everything matches, the files can be deleted:

rm -f marked_ship_20210* 
du -sh

in/exists/join performance comparison

test_scan_qian is the big table, roughly 50 million rows.

i1_record is comparatively small, but with many duplicates: 20,310,127 rows.

  • join

    Reportedly, most subqueries can be rewritten as a join.

DROP TABLE IF EXISTS i1_record_unqi;
CREATE TABLE i1_record_unqi WITH (
    APPENDONLY = TRUE,
    COMPRESSLEVEL = 5,
    ORIENTATION = COLUMN,
    COMPRESSTYPE = ZLIB
) AS 
SELECT
    *
FROM
    (
        SELECT
            i.*, 
            ROW_NUMBER () OVER (
                PARTITION BY i.ship_id
                ORDER BY
                batch_time DESC
            ) AS rn
        FROM
            i1_record as i
        LEFT JOIN test_scan_qian as q ON q.ship_id = i.ship_id and i.batch_time >= '2021-08-23'
        WHERE q.ship_id IS NOT NULL
    ) t
WHERE
    rn = 1
DISTRIBUTED BY (ship_id);

SELECT 11620
Time: 48267.562 ms

  • in

DROP TABLE IF EXISTS i1_record_unqi;
CREATE TABLE i1_record_unqi WITH (
    APPENDONLY = TRUE,
    COMPRESSLEVEL = 5,
    ORIENTATION = COLUMN,
    COMPRESSTYPE = ZLIB
) AS 
SELECT
    *
FROM
    (
        SELECT
            i.*, 
            ROW_NUMBER () OVER (
                PARTITION BY i.ship_id
                ORDER BY
                batch_time DESC
            ) AS rn
        FROM
            i1_record as i
        where i.batch_time >= '2021-08-23'
            and ship_id in (select ship_id from test_scan_qian )
    ) t
WHERE
    rn = 1
DISTRIBUTED BY (ship_id);

SELECT 11620
Time: 50033.728 ms

  • exists

DROP TABLE IF EXISTS i1_record_unqi;
CREATE TABLE i1_record_unqi WITH (
    APPENDONLY = TRUE,
    COMPRESSLEVEL = 5,
    ORIENTATION = COLUMN,
    COMPRESSTYPE = ZLIB
) AS 
SELECT
    *
FROM
    (
        SELECT
            i.*, 
            ROW_NUMBER () OVER (
                PARTITION BY i.ship_id
                ORDER BY
                batch_time DESC
            ) AS rn
        FROM
            i1_record as i
        where i.batch_time >= '2021-08-23'
            and exists (select 1 from test_scan_qian where ship_id = i.ship_id )
    ) t
WHERE
    rn = 1
DISTRIBUTED BY (ship_id);

SELECT 11620
Time: 51298.095 ms

Note: sometimes adding a condition actually makes things faster. Removing the WHERE below, for example, made the query slower:

        LEFT JOIN test_scan_qian as q ON q.ship_id = i.ship_id and i.batch_time >= '2021-08-23'
        -- WHERE q.ship_id IS NOT NULL
        
SELECT 1971098
Time: 62557.812 ms

Extension: besides the three forms above, there are other kinds of subqueries.

Subquery 1

select * from (
  select
    a.*,
    ( select aid from b where b.aid=a.id limit 1) as aid 
  from a
) as t
where t.aid is null;

Subquery 2:

select * from a where (select aid from b where b.aid=a.id limit 1) is null;

Counting by date

select to_char(store_time::timestamp,'YYYY-MM-DD'),count(1) from test_doc group by to_char(store_time::timestamp,'YYYY-MM-DD') order by to_char(store_time::timestamp,'YYYY-MM-DD');

-- shorthand: group and order by select-list position
select to_char(store_time::timestamp,'YYYY-MM-DD'),count(1) from test_doc group by 1 order by 1;

-- 不知道为什么,自己总喜欢写成嵌套的形式    可能数据量太小,耗时跟上面两种差不多。
select dt,count(1) from (select to_char(store_time::timestamp,'YYYY-MM-DD') as dt from test_doc) t group by dt;

Rough timing

Dispatch: 200+ seconds to read 0.5 billion rows, extract 6 columns plus the computed columns, and dedup. (Adding a join condition brings it down to 94 seconds.)

Sign-off: 600 seconds to read 1.5 billion rows, about 4 columns, with dedup.

(Adding the join condition halves the time.)

SELECT 55019979
Time: 227152.571 ms

Order entry: 960 seconds to read 1.5 billion rows.

(Adding the join condition cuts the time to 1/4; the scan is costly, but fetching the columns costs even more.)

SELECT 55351804
Time: 236165.105 ms

With the source data now much smaller, joining the several big tables afterwards should also be considerably faster.

Three-table join: down from 200 seconds to about a quarter of that (two of the tables shrank from 1.5 billion rows to about 50 million).

SELECT 55415034
Time: 48538.021 ms

Further optimization would ideally finish the scans of all the tables in a single pass, which should cut the time further. But how? A custom function? Indexes? A hand-written complex combined SQL?

So with a target of roughly 50 million rows, if the intermediate data is many times larger than that, something is wrong. Joins can be used throughout to shrink the total intermediate output.

Counting rows per bucket

The CASE ladder below could be replaced with ordinary floor/rounding arithmetic; at 0.01 precision it would otherwise need about 100 WHEN branches.

select case 
    when i1l_ftconfid = '' then -1
    when i1l_ftconfid::numeric >= 1.1 then 1.1
    when i1l_ftconfid::numeric >= 1 then 1
    when i1l_ftconfid::numeric >= 0.9 then 0.9
    when i1l_ftconfid::numeric >= 0.8 then 0.8
    when i1l_ftconfid::numeric >= 0.7 then 0.7
    when i1l_ftconfid::numeric >= 0.6 then 0.6
    when i1l_ftconfid::numeric >= 0.5 then 0.5
    when i1l_ftconfid::numeric >= 0.4 then 0.4
    when i1l_ftconfid::numeric >= 0.3 then 0.3
    when i1l_ftconfid::numeric >= 0.2 then 0.2
    when i1l_ftconfid::numeric >= 0.1 then 0.1
    when i1l_ftconfid::numeric >= 0 then 0
    else 
    -1
 end as possible
,count(1)
from test_data_uniq_20210831 
group by 1
order by 1
;
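As noted above, the WHEN ladder can be replaced by floor arithmetic: floor(x / step) * step yields the lower bucket edge at any precision, so 0.01-wide buckets need no extra branches. A hedged Python sketch of the idea (it ignores the 1.1 cap of the original, and the bucket/step names are made up):

```python
import math
from collections import Counter

def bucket(v, step=0.1):
    # '' and negatives fall into the -1 bucket, as in the CASE ladder;
    # otherwise floor to the lower edge of the step-wide bucket.
    if v == "":
        return -1
    x = float(v)
    if x < 0:
        return -1
    return round(math.floor(x / step) * step, 10)

data = ["0.95", "0.91", "0.42", "", "1.0"]   # invented i1l_ftconfid values
counts = Counter(bucket(v) for v in data)
print(sorted(counts.items()))
```

In SQL the same expression would be floor(i1l_ftconfid::numeric / 0.01) * 0.01 for 0.01 precision.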

I then wanted to add percentage information, like this:

select possible,cnt,(cnt::numeric)/(sum(cnt))::numeric from 
( the statement above ) t  ;

The error:

column "t.possible" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: select possible,cnt,cnt::numeric,sum(cnt) from 

Cause: as the error says, once the aggregate sum(cnt) appears in the select list, every non-aggregated column (possible, cnt) must appear in GROUP BY or be wrapped in an aggregate; that is standard SQL aggregation semantics. A window function, sum(cnt) OVER (), avoids the restriction.

Dedup conditions

The first version of the SQL:

-- packing info; tm may be a bag number or a waybill number
DROP TABLE IF EXISTS test_jibao;
CREATE TABLE test_jibao WITH (
    APPENDONLY = TRUE,
    COMPRESSLEVEL = 5,
    ORIENTATION = COLUMN,
    COMPRESSTYPE = ZLIB
) AS SELECT
    *
FROM
    (
        SELECT
            case when j.grp_ship_id is not null then j.grp_ship_id else a.ship_id end as tm,  -- the number used for load/unload lookup (bag number takes precedence over waybill number)
            a.ship_id,
            j.grp_ship_id AS pac_id, -- bag number
            j.scan_tm AS pac_time,
            j.scan_site AS pac_sitecode, 
            CASE
            WHEN j.scan_typ IN ('3', '03') THEN
                '分拨集包'
            WHEN j.scan_typ IN ('5', '05') THEN
                '网点集包'
            END AS pac_class,
            ROW_NUMBER () OVER (
                PARTITION BY j.ship_id
                ORDER BY
                    j.ins_db_tm DESC
            ) AS rn
        FROM
            test_scan_qian as a
        LEFT JOIN tb_scan as j on j.ship_id = a.ship_id
        WHERE
            j.ins_db_tm >= '2021-08-26 00:00:00'
        AND j.ins_db_tm < '2021-09-05 00:00:00'
        AND j.scan_typ in ('3','03','5','05')
    ) AS T
WHERE
    T.rn = 1
DISTRIBUTED BY (tm);





-- load (97) / unload (96) scan info
DROP TABLE IF EXISTS test_zhuangxie;
CREATE TABLE test_zhuangxie WITH (
    APPENDONLY = TRUE,
    COMPRESSLEVEL = 5,
    ORIENTATION = COLUMN,
    COMPRESSTYPE = ZLIB
) AS SELECT
    *
FROM
    (
        SELECT
            a.*,
            z.scan_site AS sffb_id,  -- origin hub
            x.scan_site AS mdfb_id,  -- destination hub
            ROW_NUMBER () OVER (
                PARTITION BY a.ship_id
                ORDER BY
                    z.ins_db_tm ASC, -- load: take the earliest
                    x.ins_db_tm DESC -- unload: take the latest
            ) AS rn1
        FROM
            test_jibao as a
        LEFT JOIN tb_scan as z on z.ship_id = a.ship_id   -- load
            AND z.ins_db_tm >= '2021-08-26 00:00:00'
            AND z.ins_db_tm < '2021-09-05 00:00:00'
            AND z.scan_typ = '97'
        LEFT JOIN tb_scan as x on x.ship_id = a.ship_id   -- unload
            AND x.ins_db_tm >= '2021-08-26 00:00:00'
            AND x.ins_db_tm < '2021-09-05 00:00:00'
            AND x.scan_typ = '96'
    ) AS T
WHERE
    T.rn1 = 1
DISTRIBUTED BY (tm);

The result above is filtered through the main table test_scan_qian, so the data shrinks and no longer matches the first version. Worse, the computed tm can contain duplicates, which makes the next join awkward.

The main table qian has no duplicates, but it joins to the packing table preferring the bag number and falling back to the waybill number. The packing table itself can contain duplicates, so the first step is to make it unique. Even assuming it is unique, qian's computed join key may not be determined up front; and once it is determined, that key will inevitably repeat across some qian rows. So the order has to be:

  • dedup the packing table on its own first
  • compute qian's join key, and redistribute by that key
  • join the packing table with the new qian
  • bring the joined result back to qian's primary key

The way to cut steps is to compute the join key dynamically. Let each step handle its own logic: for example dedup first, rather than dedup after the join, which can produce results that differ from expectations.

Another angle: in the qian table, rows with a bag number are handled by bag and the rest by waybill, and then...

A bad SQL: filtering on a non-distribution key

select 
ship_id,
grp_ship_id,
rmk_id,
rmk_inf,
scan_site,
fst_scan_site,
pre_scan_site,
nxt_scan_site,
scan_typ,
scan_tm,
ins_db_tm
from tb_scan 
where ins_db_tm >='2021-09-01' and ins_db_tm < '2021-09-06' and ship_id = '1202767564898' or grp_ship_id = '9401966434871'
order by ins_db_tm;
^CCancel request sent
ERROR:  canceling statement due to user request
Time: 434568.440 ms
select 
ship_id,
grp_ship_id,
rmk_id,
rmk_inf,
scan_site,
fst_scan_site,
pre_scan_site,
nxt_scan_site,
scan_typ,
scan_tm,
ins_db_tm
from tb_scan 
where ins_db_tm >='2021-09-01' and ins_db_tm < '2021-09-06' and ship_id in ('1202767564898','9401966434871')
order by ins_db_tm;
Time: 36705.112 ms

The difference is dramatic: querying by the distribution key is fast, while adding the non-distribution key is remarkably slow. (A guess.) Note, though, that the slow query's WHERE clause mixes AND and OR without parentheses; since AND binds tighter than OR, the grp_ship_id branch is not constrained by the time range at all, which by itself can force a far larger scan.

The join key below contains a computed expression, which effectively queries a non-distribution key, yet the performance is still acceptable.

SELECT
    *
FROM
    (
        SELECT
            a.*,
            z.scan_site AS sffb_id,  -- origin hub
            x.scan_site AS mdfb_id,  -- destination hub
            ROW_NUMBER () OVER (
                PARTITION BY a.ship_id
                ORDER BY
                    z.ins_db_tm ASC, -- load: take the earliest
                    x.ins_db_tm DESC -- unload: take the latest
            ) AS rn1
        FROM
            test_jibao as a
        LEFT JOIN tb_scan as z on z.ship_id = (case when a.pac_id is not null then a.pac_id else a.ship_id end )   -- load
            AND z.ins_db_tm >= '2021-08-26 00:00:00'
            AND z.ins_db_tm < '2021-09-06 00:00:00'
            AND z.scan_typ = '97'
        LEFT JOIN tb_scan as x on x.ship_id = (case when a.pac_id is not null then a.pac_id else a.ship_id end )   -- unload
            AND x.ins_db_tm >= '2021-08-26 00:00:00'
            AND x.ins_db_tm < '2021-09-06 00:00:00'
            AND x.scan_typ = '96'
    ) AS T
WHERE
    T.rn1 = 1

Computed grouping columns

-- invalid: an alias cannot be declared inside GROUP BY like this
select pac_pre,count(1) from test_zhuangxie group by substring(pac_id from 1 for 2) pac_pre;

Correct ways to reference the computed field:

Under GP these work fine.

select substring(pac_id from 1 for 2) as pac_pre,count(1) as cnt from test_zhuangxie group by 1 order by 2;

-- referencing the alias by name also works
select substring(pac_id from 1 for 2) as pac_pre,count(1) as cnt from test_zhuangxie group by pac_pre order by 2;

ClickHouse, however, allows this????

SELECT
    deploy,
    fullurl,
    count(*) as count
FROM ncompaas6rows
LEFT JOIN temp_target_ip ON ncompaas6rows.dest_ip == temp_target_ip.ip AND ncompaas6rows.dest_port == temp_target_ip.port
WHERE temp_target_ip.ip != ''
GROUP BY
replaceRegexpOne(
    replaceRegexpOne('http://' || hostname || url, '[?&].*$', ''),
    '/\\d{{11,15}}($|/.*$)',
    ''
    ) AS fullurl,
deploy AS deploy
ORDER BY 
    deploy ASC,
    count DESC
;

Compression ratio

Roughly a 3.3x compression ratio.

The select below does not seem to work on a partitioned table (likely because pg_relation_size of the parent does not include the child partitions).

-- 37 GB, 50 million rows
select pg_size_pretty(pg_relation_size('public.test_data_uniq_20210907'));
-- 123G
ls -lh  -- on the exported csv file

Views create dependencies, so the underlying table can no longer be dropped directly

A plain DROP TABLE fails. Beware...

psql:fetch.sql:14: ERROR:  cannot drop append only columnar table test_scan_qian because other objects depend on it
DETAIL:  view mytestss2_view depends on append only columnar table test_scan_qian
HINT:  Use DROP ... CASCADE to drop the dependent objects too.

Scan-trajectory problem

  • hub list
create table test_fenbo as select mc,bm,lb from gs where bm  in (select distinct zzz from wdzzz_cw );
  • Query 1 day of data: the volume is small, yet it is slow.
DROP TABLE IF EXISTS test_guiji;
CREATE TABLE test_guiji WITH (
    APPENDONLY = TRUE,
    COMPRESSLEVEL = 5,
    ORIENTATION = COLUMN,
    COMPRESSTYPE = ZLIB
)  AS SELECT
    *
FROM
    (
        SELECT
            a.ship_id,
            a.scan_site
        FROM
            tb_scan AS a
        INNER JOIN test_scan_qian AS q ON q.ship_id = a.ship_id
        INNER JOIN test_fenbo AS f ON f.bm::text = a.scan_site
        WHERE
            a.ins_db_tm >= '2021-09-15'
        AND a.ins_db_tm < '2021-09-16'
    ) AS T
DISTRIBUTED BY (ship_id);
SELECT 56093895
Time: 194421.786 ms
  • Query 10 days: the range is longer, but an extra type column filters rows away without any join.
DROP TABLE IF EXISTS test_guiji2;
CREATE TABLE test_guiji2 WITH (
    APPENDONLY = TRUE,
    COMPRESSLEVEL = 5,
    ORIENTATION = COLUMN,
    COMPRESSTYPE = ZLIB
)  AS SELECT
    *
FROM
    (
        SELECT
            a.ship_id,
            a.scan_site
        FROM
            tb_scan AS a
        INNER JOIN test_scan_qian AS q ON q.ship_id = a.ship_id
        INNER JOIN test_fenbo AS f ON f.bm::text = a.scan_site
        WHERE
            a.ins_db_tm >= '2021-09-05'
        AND a.ins_db_tm < '2021-09-16'
        AND a.scan_typ not in ('10','14','24')
    ) AS T
DISTRIBUTED BY (ship_id);
SELECT 155809387
Time: 277863.165 ms
  • Extend to 30 days and test again.
DROP TABLE IF EXISTS test_guiji2;
CREATE TABLE test_guiji2 WITH (
    APPENDONLY = TRUE,
    COMPRESSLEVEL = 5,
    ORIENTATION = COLUMN,
    COMPRESSTYPE = ZLIB
)  AS SELECT
    *
FROM
    (
        SELECT
            a.ship_id,
            a.scan_site
        FROM
            tb_scan AS a
        INNER JOIN test_scan_qian AS q ON q.ship_id = a.ship_id
        INNER JOIN test_fenbo AS f ON f.bm::text = a.scan_site
        WHERE
            a.ins_db_tm >= '2021-08-15'
        AND a.ins_db_tm < '2021-09-16'
        AND a.scan_typ not in ('10','14','24')
    ) AS T
DISTRIBUTED BY (ship_id);

The range tripled, so I expected roughly triple the time.
In practice it was less. The time is acceptable, though the data quality is unclear.

SELECT 156105801
Time: 467436.559 ms

Regex matching

  • a simple Chinese-text test
CREATE OR REPLACE FUNCTION test_ch(addr text) RETURNS text AS $$
return '河南'
$$LANGUAGE plpythonu;
select test_ch('1') ;
  • a Python regex test
CREATE OR REPLACE FUNCTION test_ch(addr text) RETURNS text AS $$
import re
pat = re.compile('.{1,5}省')

sch = pat.search(addr)  # search
if sch and sch.group(0):
    return sch.group(0)
return ''
$$LANGUAGE plpythonu;

This turned out not to work (possibly because plpythonu runs Python 2 and matches bytes, so . consumes single bytes of multi-byte UTF-8 text; that is a guess):

select test_ch('河南') ;
select test_ch('湖南省');
  • regex test, observation group 1
CREATE OR REPLACE FUNCTION parse_addr(addr text) RETURNS text AS $$
#coding=utf-8
if 'pattn' not in GD:
    import re
    pattn =[]
    pattn.append( re.compile('(.*省)(.*市)(.*区)'))
    pattn.append( re.compile('(.*省)(.*市)(.*县)'))
    pattn.append( re.compile('(.*省)(.*市)(.*市)'))
    GD['pattn'] = pattn
pat = GD['pattn']
sch = pat[0].search(addr)  # search
if sch and sch.group(0):
    return sch.group(0)
sch = pat[1].search(addr)  # search
if sch and sch.group(0):
    return sch.group(0)
sch = pat[2].search(addr)  # search
if sch and sch.group(0):
    return sch.group(0)
return ''
$$LANGUAGE plpythonu;

With the function loaded, every observation group is computed with the query below.

DROP TABLE IF EXISTS test_parse_addr;
CREATE TABLE test_parse_addr WITH (
    APPENDONLY = TRUE,
    COMPRESSLEVEL = 5,
    ORIENTATION = COLUMN,
    COMPRESSTYPE = ZLIB
) AS 
select doc_sn,recv_addr, parse_addr(recv_addr) from tu_doc_info where dtime >= '2021-09-16' and dtime < '2021-09-17' and recv_addr is not null
DISTRIBUTED BY (doc_sn);

Note: recv_addr is not null guards against errors from calling the function with null.

SELECT 80780849
Time: 58160.892 ms

Performance looks decent; Python taking part in the computation does not seem to slow things down much.

  • regex test, observation group 2

Drop the greedy mode, since the .{1,5} syntax did not seem to be supported.

CREATE OR REPLACE FUNCTION parse_addr(addr text) RETURNS text AS $$
#coding=utf-8
if 'pattn' not in GD:
    import re
    pattn =[]
    pattn.append( re.compile('(.*?省)(.*?市)(.*?区)'))
    pattn.append( re.compile('(.*?省)(.*?市)(.*?县)'))
    pattn.append( re.compile('(.*?省)(.*?市)(.*?市)'))
    GD['pattn'] = pattn
pat = GD['pattn']
sch = pat[0].search(addr)  # search
if sch and sch.group(0):
    return sch.group(0)
sch = pat[1].search(addr)  # search
if sch and sch.group(0):
    return sch.group(0)
sch = pat[2].search(addr)  # search
if sch and sch.group(0):
    return sch.group(0)
return ''
$$LANGUAGE plpythonu;

Results:

-- the same test statement

SELECT 80780849
Time: 165485.472 ms

-- second run
SELECT 80780849
Time: 161170.369 ms

-- third run: re-testing the original (greedy) script, confirming its earlier timing...
SELECT 80780849
Time: 62675.341 ms

It seems the regex choice has a sizable impact on the total time.
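The gap between the two observation groups matches how the patterns search: (.*省) grabs greedily and backtracks from the right, while (.*?省) advances character by character. A quick pure-Python check of the two patterns (Python 3, outside the database, so this only approximates the plpythonu behavior):

```python
import re

# The two pattern families from the observation groups.
greedy = re.compile(r'(.*省)(.*市)(.*区)')
lazy = re.compile(r'(.*?省)(.*?市)(.*?区)')

addr = '河南省郑州市金水区某某路1号'   # invented sample address
g = greedy.search(addr)
l = lazy.search(addr)
# On a simple address both find the same span; the runtime difference
# comes from how much backtracking each needs across millions of rows.
print(g.group(0), l.group(0))
```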

Comparing IN, joins, Python functions, and SQL functions

Conclusion: the IN subquery and the INNER JOIN lookup cost about the same, because the IN subquery is very small. Python is noticeably less efficient at very large volumes; avoid it unless necessary. Strangest of all, inlining the literal values into IN produced the wrong total and took longer... So SQL is probably better suited to join-style computation.

  • basics: prepare the data first.
DROP VIEW IF EXISTS test_guiji_view;
DROP TABLE IF EXISTS test_fenbo;
CREATE TABLE test_fenbo AS SELECT
    mc,
    bm::text as bm,
    lb
FROM
    gs
WHERE
    bm IN (
        SELECT DISTINCT zzz FROM wdzzz_cw
    )
DISTRIBUTED BY (bm);

The INNER JOIN approach

The join is against a small table.

DROP TABLE IF EXISTS tmp_guiji;
CREATE TABLE tmp_guiji AS
SELECT
    a.ship_id,
    a.scan_site,
    a.ins_db_tm
FROM
    tb_scan AS a
INNER JOIN test_scan_qian AS q ON q.ship_id = a.ship_id
INNER JOIN test_fenbo AS f ON f.bm = a.scan_site
WHERE
    a.ins_db_tm >= '2021-08-18 00:00:00'
AND a.ins_db_tm < '2021-09-18 00:00:00'
AND a.scan_typ not in ('10','14','24')
DISTRIBUTED BY (ship_id);
;
SELECT 156322459
Time: 328684.450 ms

The IN-subquery approach

DROP TABLE IF EXISTS tmp_guiji;
CREATE TABLE tmp_guiji AS
SELECT
    a.ship_id,
    a.scan_site,
    a.ins_db_tm
FROM
    tb_scan AS a
INNER JOIN test_scan_qian AS q ON q.ship_id = a.ship_id
 
WHERE
    a.ins_db_tm >= '2021-08-18 00:00:00'
AND a.ins_db_tm < '2021-09-18 00:00:00'
AND a.scan_typ not in ('10','14','24')
AND a.scan_site IN (SELECT bm FROM test_fenbo)
DISTRIBUTED BY (ship_id);
SELECT 156322459
Time: 321444.458 ms

IN with inlined literal values

To build the value list I ran select string_agg(bm,'#') from test_fenbo; and then replaced # with ',' by hand. Why so roundabout? Escaping issues...

My feeling was the performance should be about the same; the fact that it scanned more than before suggests IN over a literal list is evaluated as an array traversal.

DROP TABLE IF EXISTS tmp_guiji;
CREATE TABLE tmp_guiji AS
SELECT
    a.ship_id,
    a.scan_site,
    a.ins_db_tm
FROM
    tb_scan AS a
INNER JOIN test_scan_qian AS q ON q.ship_id = a.ship_id
 
WHERE
    a.ins_db_tm >= '2021-08-18 00:00:00'
AND a.ins_db_tm < '2021-09-18 00:00:00'
AND a.scan_typ not in ('10','14','24')
AND a.scan_site IN (1,2,3)
DISTRIBUTED BY (ship_id);
SELECT 149342271
Time: 362041.943 ms

Python function

Although the selected data is about 150 million rows, the total scanned is around 12 billion, meaning the function is called that many times; the call overhead becomes the bottleneck.

At small volumes the cost may be masked by writes or other work. From earlier tests: matching 3 regexes in Python over 50 million rows took 60 s greedy and 160 s non-greedy; whether re is implemented in C I have not checked.

Conclusion: Python is usable, but mind the data volume.

-- at call time, filter with is_fenbo

CREATE OR REPLACE FUNCTION is_fenbo(bm text) RETURNS bool AS $$
#coding=utf-8
if 'fenbo' not in GD:
    fenbo = {
    }
    GD['fenbo'] = fenbo
mapping = GD['fenbo']
return bm in mapping 
$$LANGUAGE plpythonu;

select is_fenbo('461001') ;
SELECT 156322459
Time: 886813.411 ms

As for implementing the computation with a PL/pgSQL function instead: on reflection, probably unnecessary, and the performance is likely no better.

A literal table via unnest + ARRAY

The unnest + ARRAY pattern defines a virtual table, which the query then filters from.

bm IN (
        --# SELECT DISTINCT zzz FROM wdzzz_cw
        SELECT DISTINCT zzz FROM wdzzz_cw  union 
        SELECT zzz FROM unnest((ARRAY[528455,518000])::int[]) as zzz
    )

case ... is not null

CREATE table mydemo1 (id int);

INSERT INTO mydemo1 VALUES (1),(null);

-- wrong: the column cannot be hoisted before WHEN like this; neither not null nor is not null parses here
select  case id when is not null then id else 100 end as myid from mydemo1;
-- correct
select  case when id is not null then id else 100 end as myid from mydemo1;

Example

case when gj.sffb_id is not null then gj.sffb_id else zx.sffb_id end as sffb_id,
case when gj.sffb_id is not null then gj.sffb_name else zx.sffb_name end as sffb_name,
case when gj.mdfb_id is not null then gj.mdfb_id else zx.mdfb_id end as mdfb_id,
case when gj.mdfb_id is not null then gj.mdfb_name else zx.mdfb_name end as mdfb_name,

(The first and third lines are exactly coalesce(gj.sffb_id, zx.sffb_id) and coalesce(gj.mdfb_id, zx.mdfb_id); the name lines are not, since they test the id but return the name.)

Computing group percentages

The original idea: first compute the group counts, select mycat,count(*) as cnt from mytable group by 1; then wrap that in another layer to get the total and the percentage, select *,cnt/sum(cnt) as percent from (previous result) f. But sum cannot appear directly in a non-aggregate select list...

Total via a (correlated) subquery

  • correlated subquery
SELECT
    rmk_id,
    COUNT (*) AS cnt,
    round(
        (
            (COUNT(*)) :: DECIMAL / (
                SELECT
                    COUNT (*)
                FROM
                    test_data_uniq_20210922
            )
        ) * 100,
        2
    ) AS percent
FROM
    test_data_uniq_20210922
GROUP BY
    rmk_id
ORDER BY
    cnt DESC;
  • another approach

    As preparation, first study the usage below.

    over() applies to the final result of the grouped query; with no grouping, it applies to the raw rows. Since this query already has GROUP BY, a window function in the select list operates on the grouped result.

SELECT
    rmk_id,
    COUNT (*) AS cnt,
    -- the line below is the grand total
    sum(count(*)) over() ,
    -- the line below is the number of groups
    count(1) over() 
FROM
    test_data_uniq_20210922
GROUP BY
    rmk_id
ORDER BY
    cnt DESC;

Computing the percentage

Performance is the same as the correlated subquery. If the total could be zero, the division still needs a CASE WHEN guard.

SELECT
    rmk_id,
    COUNT (*) AS cnt,
    round(count(*)::decimal * 100/sum(count(*)) over(), 2) AS percent
FROM
    mytable
GROUP BY
    1
ORDER BY
    2 DESC;

With multiple grouping columns:

SELECT
    rmk_id,
    a7_ldly,
    COUNT (*) AS cnt,
    round(count(*)::decimal * 100/sum(count(*)) over(), 2) AS percent
FROM
    mytable
GROUP BY
    1,2
ORDER BY
    3 DESC;
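As a cross-check on the sum(count(*)) over() trick, the same percentage computation in plain Python over invented rmk_id values:

```python
from collections import Counter

rows = ["a", "a", "b", "c", "a"]      # invented rmk_id values
counts = Counter(rows)                # count(*) per group
total = sum(counts.values())          # what sum(count(*)) over() yields

# round(count(*)::decimal * 100 / total, 2) per group
percents = {k: round(c * 100 / total, 2) for k, c in counts.items()}
print(percents)  # {'a': 60.0, 'b': 20.0, 'c': 20.0}
```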

A WITH clause can also precompute the total, but the effect seems about the same.

Pseudo-types cannot be stored

CREATE TABLE gj WITH (
    APPENDONLY = TRUE,
    COMPRESSLEVEL = 5,
    ORIENTATION = COLUMN,
    COMPRESSTYPE = ZLIB
) AS 
select
    ship_id,
    max(ins_db_tm),
    min(ins_db_tm),
    array_agg(row(
            grp_ship_id,
            scan_site,
            scan_emp,
            ins_db_tm,
            scan_tm,
            delv_emp,
            scan_typ
        ) order by ins_db_tm 
    )
from tb_scan where ins_db_tm >='2021-08-26' and ins_db_tm < '2021-09-26' 
and ship_id in (select ship_id from test_scan_qian)
group by ship_id
DISTRIBUTED BY (ship_id);
ERROR:  column "array_agg" has pseudo-type record[]

(Presumably, casting each row to text, or using a concrete composite type instead of row(...), would avoid the pseudo-type.)

Prefix analysis

SELECT
	SUBSTRING (grp_ship_id FROM 1 FOR 2) AS pre,
	COUNT (1) AS cnt
FROM
	tb_scan
WHERE
	ins_db_tm >= '2021-08-20'
AND ins_db_tm < '2021-09-27'
AND grp_ship_id IS NOT NULL
GROUP BY
	1
ORDER BY
	2;

Adding a filter column to cut the total and speed things up

Without the filter, many more records must be written.

DROP TABLE IF EXISTS test_qianshou_package;
CREATE TABLE test_qianshou_package WITH (
    APPENDONLY = TRUE,
    COMPRESSLEVEL = 5,
    ORIENTATION = COLUMN,
    COMPRESSTYPE = ZLIB
) AS

select ship_id,grp_ship_id from test_qianshou_guiji group by ship_id,grp_ship_id

DISTRIBUTED BY (ship_id);

SELECT 127765499
Time: 23501.989 ms

Now with the filter:

DROP TABLE IF EXISTS test_qianshou_package;
CREATE TABLE test_qianshou_package WITH (
    APPENDONLY = TRUE,
    COMPRESSLEVEL = 5,
    ORIENTATION = COLUMN,
    COMPRESSTYPE = ZLIB
) AS

select ship_id,grp_ship_id from test_qianshou_guiji where grp_ship_id is not null group by ship_id,grp_ship_id

DISTRIBUTED BY (ship_id);

SELECT 67652643
Time: 10959.580 ms

Comparing the two: half the rows, half the time. Very likely the system is bottlenecked on write throughput. It also suggests that at load time some computed columns can be added up front, to serve as filter or partition conditions and avoid the cost of joining big tables.

Subqueries: scalar subquery in the select list – FROM subquery – WITH temp table

FROM subquery

This is the conventional approach. The FROM subquery is there because it performs the aggregation: non-aggregated columns cannot appear in the outer select list, so they go inside. Note that the conditions sit inside the subquery, which prunes early. The inner ORDER BY would normally be useless, but the LIMIT makes it necessary.

select gs.mc as scan_name ,a.*
from (
    select scan_site,count(1) as cnt from tb_scan 
    where ins_db_tm >='2021-10-18' and ins_db_tm < '2021-10-19'
    group by scan_site
    order by  cnt limit 100 
)  a
left join gs on gs.bm::text = a.scan_site;

A computed field in SELECT

This form needs no FROM subquery: the outer table drives lookups into the inner one, and it is simpler to write, though I was not used to writing it this way.

Note: the scan_name lookup is correlated on a column that takes part in the grouping, so it does not produce duplicate rows. (Does that help performance?)

select  (select mc from gs where bm::text = a.scan_site ) as scan_name,scan_site,count(1) as cnt from tb_scan as a
where ins_db_tm >='2021-10-18' and ins_db_tm < '2021-10-19'
group by scan_site
order by  cnt limit 100;

WITH temp table

It then occurred to me that this form also avoids the FROM subquery.

A WITH CTE can usually replace a FROM subquery, and looking at the SQL below, the two are in fact very similar.

with a as (
    select scan_site,count(1) as cnt from tb_scan 
    where ins_db_tm >='2021-10-18' and ins_db_tm < '2021-10-19'
    group by scan_site
    order by  cnt limit 100
)
select gs.mc as scan_name,a.* from a
left join gs on gs.bm::text = a.scan_site;

Question: when there are several CTEs, can a later one use an earlier one's result? A quick test says yes. Which suggests a bold idea: could all of the earlier queries be rewritten as CTEs, so that literally one SQL statement completes the whole extraction task?

CTEs solve many nested-query problems.

Summary:

The two perform about the same, around 65 seconds overall; the latter is only a few seconds faster.

do-while

  • grouping
--  query 1 day
select count(distinct tm),count(tm) from addr_real  where ldsj >='2021-10-19' and  ldsj < '2021-10-20' and tm like '317%' ;

-- query n days
select to_char(ldsj,'YYYY-MM-DD') as day, count(distinct tm) as distinct_cnt,count(tm) as cnt from addr_real  where ldsj >='2021-10-01' and tm like '317%'  group by 1 order by 1;

The query above groups by what is effectively the partition key, so running n separate commands might be faster. Put differently: when grouping, try to use the partition column as the condition.

But rather than a for loop, a subquery might also work.

  • the subquery idea

It can be written as below, though it feels inefficient.

select (select count(distinct tm) from addr_real where <conditions> ) from <date list> as a
  • write a DO loop directly

Extension:

What if the data needs to be grouped by day and then joined? A DO loop, a subquery, or something else?

A colleague's code: select into + count(distinct case)

The code below surprised me. My own instinct would have been insert into table from (select * from table1 union all select * from table2 ...) f.

It can apparently be written like this: SELECT INTO the first table, then UNION ALL the remaining tables directly. (I knew the SELECT INTO syntax, but combining it with UNION ALL was new to me.)

select * into cyf_etl_dta_202110242526
from cyf_etl_dta_20211024
union all 
select * from cyf_etl_dta_20211025
union all
select * from cyf_etl_dta_20211026

Several ways to count

select 
Dtime::DATE as date,
min(Dtime) as min_date,
max(Dtime) as max_date,
count(*), 
count(distinct Doc_Sn) as ttl, -- overall
count(case when Doc_Src in ('cainiao') then Doc_Sn end) as cainiao, -- cainiao
count(distinct case when Doc_Src in ('cainiao') then Doc_Sn end) as cainiao_uniq -- cainiao, distinct
from tu_doc_info
where Dtime>='2021-10-24' and Dtime<'2021-10-27'
group by Dtime::DATE
order by Dtime::DATE

Besides these:

sum( case when <condition> then 1 else 0 end )

count(<condition> or null), e.g. count(sex = '男' or null)

Those are the main ways to compute conditional counts.

I had not realized count's argument could even combine distinct with case when.

Apparently SQL syntax can be written this elaborately; I just had not dared to combine things this way.
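All of these conditional-count spellings rely on aggregates skipping NULL. A small Python model of the same three counts over invented rows:

```python
# Invented rows: doc 3 matches twice, so distinct matters.
rows = [{"doc": 1, "src": "cainiao"},
        {"doc": 2, "src": "other"},
        {"doc": 3, "src": "cainiao"},
        {"doc": 3, "src": "cainiao"}]

# count(case when src = 'cainiao' then doc end):
# non-matches become NULL and count skips them
count_case = sum(1 for r in rows if r["src"] == "cainiao")

# sum(case when src = 'cainiao' then 1 else 0 end): same number, via flags
sum_case = sum(1 if r["src"] == "cainiao" else 0 for r in rows)

# count(distinct case when ... then doc end): dedup the matching keys first
count_distinct = len({r["doc"] for r in rows if r["src"] == "cainiao"})

print(count_case, sum_case, count_distinct)  # 3 3 2
```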

Ideas

The scan table splits by type into pickup, dispatch, and sign-off. So although it is one physical table, it can be viewed as several. For each type, take the earliest or the latest record, restricted to the last month of data.

From simplest to most complex, the candidate optimizations:

  • build indexes

  • extract the unique records for several days up front, mainly to make multi-day reads convenient

  • optimize the SQL: previously each type was deduped on its own; this time, dedup during the join itself. Does that speed things up?

  • partition tuning: distribute by the key plus partition by the key; set time aside for now and process all waybills first?

  • GP's json support: aggregate the multiple rows per key into json first, to avoid three passes over the table?

  • a custom function: one full-table scan that assembles each waybill's pickup, dispatch, and sign-off?

  • precompute the 1–2 months of pickup, sign-off, and receipt data in advance.

    Each run computes extremes over a rolling range, e.g. one month. Could the extremes instead be computed incrementally, sliding a window over the existing month of data?
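The rolling-extreme question in the last bullet is the classic sliding-window maximum problem: with a monotonic deque, the max over a rolling range is maintained in amortized O(1) per element instead of rescanning the whole month. A hedged sketch over a plain list of values (the function and its arguments are made up for illustration):

```python
from collections import deque

def rolling_max(values, window):
    """Sliding-window maximum via a monotonic deque:
    each element is pushed and popped at most once."""
    dq = deque()            # holds indices; values[dq[0]] is the current max
    out = []
    for i, v in enumerate(values):
        while dq and values[dq[-1]] <= v:
            dq.pop()        # drop elements dominated by the new value
        dq.append(i)
        if dq[0] <= i - window:
            dq.popleft()    # drop elements that slid out of the window
        if i >= window - 1:
            out.append(values[dq[0]])
    return out

print(rolling_max([3, 1, 4, 1, 5, 9, 2], 3))  # [4, 4, 5, 9, 9]
```

Applying this per ship_id, ordered by time, would give the rolling extreme without recomputing the full month each run; whether that beats a partitioned SQL rescan in GP would need measuring.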