发布于: Nov 30, 2022

【概要】Amazon Redshift 的表设计是获取良好性能的基础,要想做出正确的设计一方面需要读者了解 Amazon Redshift 的运作原理和表设计最佳实践,另一方面更需要读者对自身业务的深入理解,有时候表设计不仅仅是技术,甚至是一种艺术。

接下来,我们将做几个实验来验证此前的理论,实验所用的集群硬件为 2 台 dc1.large 实例。请注意实验的数据集不同,硬件条件不同,查询语句不同,实验结果自然也会有所不同,但是我们可以从实验中尝试总结出一些规律,以便在今后的工作中可以避免一些明显的错误。

 

准备

我们创建两个表 lineorder_ec 和 lineorder_ei,两个表有相同的数据,计 600037902 行,表大小均为 33G 左右 ,分配方式为 even,排序键选择 lo_orderkey,lo_orderdate,lo_custkey,lo_suppkey 四个列,两个表中 lineorder_ec 选择复合排序,lineorder_ei 选择交错排序,其建表语句如下:

CREATE TABLE lineorder_ec

(

  lo_orderkey          INTEGER NOT NULL,

  lo_linenumber        INTEGER NOT NULL,

  lo_custkey           INTEGER NOT NULL,

  lo_partkey           INTEGER NOT NULL,

  lo_suppkey           INTEGER NOT NULL,

  lo_orderdate         INTEGER NOT NULL,

  lo_orderpriority     VARCHAR(15) NOT NULL,

  lo_shippriority      VARCHAR(1) NOT NULL,

  lo_quantity          INTEGER NOT NULL,

  lo_extendedprice     INTEGER NOT NULL,

  lo_ordertotalprice   INTEGER NOT NULL,

  lo_discount          INTEGER NOT NULL,

  lo_revenue           INTEGER NOT NULL,

  lo_supplycost        INTEGER NOT NULL,

  lo_tax               INTEGER NOT NULL,

  lo_commitdate        INTEGER NOT NULL,

  lo_shipmode          VARCHAR(10) NOT NULL

)

DISTSTYLE even

compound SORTKEY

(

lo_orderkey,

lo_orderdate,

lo_custkey,

lo_suppkey

);

CREATE TABLE lineorder_ei

(

  lo_orderkey          INTEGER NOT NULL,

  lo_linenumber        INTEGER NOT NULL,

  lo_custkey           INTEGER NOT NULL,

  lo_partkey           INTEGER NOT NULL,

  lo_suppkey           INTEGER NOT NULL,

  lo_orderdate         INTEGER NOT NULL,

  lo_orderpriority     VARCHAR(15) NOT NULL,

  lo_shippriority      VARCHAR(1) NOT NULL,

  lo_quantity          INTEGER NOT NULL,

  lo_extendedprice     INTEGER NOT NULL,

  lo_ordertotalprice   INTEGER NOT NULL,

  lo_discount          INTEGER NOT NULL,

  lo_revenue           INTEGER NOT NULL,

  lo_supplycost        INTEGER NOT NULL,

  lo_tax               INTEGER NOT NULL,

  lo_commitdate        INTEGER NOT NULL,

  lo_shipmode          VARCHAR(10) NOT NULL

)

DISTSTYLE even

interleaved SORTKEY

(

lo_orderkey,

lo_orderdate,

lo_custkey,

lo_suppkey

);

测试

在 2 个表上分别执行 7 个查询,从不同维度筛选数据,譬如订单号、订单日期、顾客编号以及以上条件的组合。筛选数据后求取符合条件所有订单税值的最小值和最大值

第一个查询涉及排序键的第一列,第二个查询涉及排序键的第二列,第三个查询涉及排序键的第三列,第四个查询涉及排序键的第四列,第五个查询涉及排序键的第一列和第二列,第六个查询涉及排序键的第一和第三列,第七个查询涉及排序键的第二和第三列,针对两张表的测试语句如下:

select max(lo_tax),min(lo_tax) from liuyiyi.lineorder_ec where lo_orderkey=6 ;

select max(lo_tax),min(lo_tax) from liuyiyi.lineorder_ei where lo_orderkey=6 ;

select max(lo_tax),min(lo_tax) from liuyiyi.lineorder_ec where lo_orderdate=19960102 ;

select max(lo_tax),min(lo_tax) from liuyiyi.lineorder_ei where lo_orderdate=19960102 ;

select max(lo_tax),min(lo_tax) from liuyiyi.lineorder_ec where lo_custkey=1112441 ;

select max(lo_tax),min(lo_tax) from liuyiyi.lineorder_ei where lo_orderkey=1112441 ;

select max(lo_tax),min(lo_tax) from liuyiyi.lineorder_ec where lo_suppkey=287503 ;

select max(lo_tax),min(lo_tax) from liuyiyi.lineorder_ei where lo_suppkey=287503 ;

select max(lo_tax),min(lo_tax) from liuyiyi.lineorder_ec where  lo_orderkey=7 and lo_orderdate=19960110 ;

select max(lo_tax),min(lo_tax) from liuyiyi.lineorder_ei where  lo_orderkey=7 and lo_orderdate=19960110 ;

select max(lo_tax),min(lo_tax) from liuyiyi.lineorder_ec where  lo_orderkey=7 and lo_custkey=782686 ;

select max(lo_tax),min(lo_tax) from liuyiyi.lineorder_ei where  lo_orderkey=7 and lo_custkey=782686 ;

select max(lo_tax),min(lo_tax) from liuyiyi.lineorder_ec where  lo_orderdate=19960110 and lo_custkey=782686 ;

select max(lo_tax),min(lo_tax) from liuyiyi.lineorder_ei where  lo_orderdate=19960110 and lo_custkey=782686 ;

结果

查询涉及列

组合排序

交错排序

第一列

2.58

7.39

第二列

8.69

7.47

第三列

8.72

7.48

第四列

7.72

7.32

第一列和第三列

2.58

10.24

第一列和第三列

2.59

11.21

第二列和第三列

10.42

10.54

通过实验,我们可以看到:

  • 针对组合排序键的查询 where 子句是否涉及主列对查询性能的影响极其明显,只涉及辅助排序列的查询性能较差。
  • 交错排序键在涉及排序键列不同列组合的各种查询中表现比较平均。
  • 对于组合排序键来说,针对主列,主列和第一个辅助排序列以及涉及组合排序键所有列的查询,从执行计划中可以看到没有 xn sort 步骤,直接 send to leader,而交错排序还是有 xn sort 步骤,sort key 是 lo_orderkey

 

 

准备

实验的数据集按星型模型设计,其中 sales 表是事实表, times,customers 以及 products 是维度表。其中 sales 事实表大小 3G,times 维度表大小 0.3G,products 维度表大小 0.2G,customers 维度表大小 0.2G。实验中我们对事实表尝试 KEY 和 EVEN 两种分配方式,维度表尝试 KEY、EVEN 和 ALL 分配,执行相同查询,针对事实表和维度表不同数据分配的排列组合,测试查询效果,进而使读者能直观认识到不同数据分配对查询性能的影响。

请注意表名后缀,后缀 _e 的表采取 EVEN 分配,后缀 _k 的表采取 KEY 分配,而后缀 _a 的表采取 ALL 分配。因为建表语句较长,为了不分散读者注意力,我将建表语句置于后面附录部分,读者可自行参阅。

测试

接下来的查询均是将事实表 sales 和维度表 times,customers 以及 products 进行联接,而后按季度,产品名称,产品价格,客户信用卡额度,客户婚姻状况进行分组,显示汇总销售数据,查询内容相同,但是涉及表的数据分配不同,以此测试事实表和维度表不同的数据分配对相同查询语句性能的影响。同样地,因为测试语句较长,为了不分散读者注意力,我将测试语句置于后面附录部分,读者可自行参阅。具体每个查询涉及表的数据分配特征如下所示:

  • 事实表数据 key 分配 + 交错排序键,维度表数据全部 key 分配
  • 事实表数据 key 分配 + 复合排序键,维度表数据全部 key 分配
  • 事实表数据 even 分配,维度表数据全部 key 分配
  • 事实表数据 key 分配 + 交错排序键,维度表 times 数据 key 分配且与事实表键值相同,其他维度表数据全部 all 分配
  • 事实表数据 key 分配 + 复合排序键,维度表 times 数据 key 分配且与事实表键值相同,其他维度表数据全部 all 分配
  • 事实表数据 even 分配,维度表数据全部 all 分配
  • 事实表数据 even 分配,维度表数据全部 even 分配
  • 事实表数据 even 分配,维度表数据全部 key 分配

结果

查询涉及表的数据组织

cost

Join类型

重分配类型

事实表数据key分配+交错排序键,维度表数据全部key分配

31121437712.42

Hash join

DS_BCAST_INNER+ DS_DIST_INNER +DS_DIST_NONE

事实表数据key分配+复合排序键,维度表数据全部key分配

31121082423.82

Hash join+Merge Join

DS_BCAST_INNER+ DS_DIST_INNER+ DS_DIST_NONE

事实表数据even分配,维度表数据全部key分配

371776381.48

Hash join

DS_BCAST_INNER+ DS_DIST_INNER

事实表数据key分配+交错排序键,维度表times数据 key分配且与事实表键值相同,其他维度表数据全部all分配

4086534.53

Hash join

DS_DIST_ALL_NONE

+DS_DIST_NONE

事实表数据key分配+复合排序键,维度表times 数据key分配且与事实表键值相同,其他维度表数据全部all分配

2552664.38

Hash join+Merge Join

DS_DIST_ALL_NONE

+DS_DIST_NONE

事实表数据even分配,维度表数据全部all分配

4084534.58

Hash join

DS_DIST_ALL_NONE

事实表数据even分配,维度表数据全部even分配

31413686534.53

Hash join

DS_BCAST_INNER

事实表数据even分配,维度表数据全部key分配

31413686534.53

Hash join

DS_BCAST_INNER+ DS_DIST_INNER

我们针对以上结果做出分析,可以总结出以下规律:

  • 我们可以看到,通过实验验证了此前介绍的表设计最佳实践,事实表的数据做 EVEN 或 KEY 分配,维度表的数据做 ALL 分配效果不错。
  • 如果维度表的数据全部做 KEY 或者 EVEN 分配,可以清楚地看到性能极差。
  • 使执行语句 cost 最小的数据分配方式是事实表和其中一个维度表的数据都以 KEY 分配,且根据共同列分配事实表和这个维度表的数据(事实表做组合排序),其他维度表的数据则选择 ALL 分配,如果以上条件不变,仅仅事实表选择交错排序则性能比选择会组合排序时下降。
  • 事实表和其中一个维度表的数据都以 KEY 分配,根据共同列分配事实表和这个维度表的数据(事实表做组合排序)时,执行计划中的联接类型会出现 merge join,其他情况下则都是 hash join。
  • 只要是维度表的数据做 ALL 分配或者可以根据共同列分配事实表和这个维度表的数据,那么就可以不做重分配,执行计划中可以看到 DS_DIST_ALL_NONE +DS_DIST_NONE,否则一定会重分配,执行计划中会看到 DS_BCAST_INNER 或 DS_DIST_INNER。

限于章节,关于执行计划中联接方式和数据重分配的意义以及执行计划各要素的说明我们将在下一篇博客《对症下药 - Amazon Redshift 调优漫谈》中详细介绍。

 

事实表

/*even 分配的事实表*/

CREATE TABLE IF NOT EXISTS sh.sales_e(

prod_id NUMERIC(38,18) NOT NULL,

cust_id NUMERIC(38,18) NOT NULL,

time_id TIMESTAMP WITHOUT TIME ZONE NOT NULL,

channel_id NUMERIC(38,18) NOT NULL,

promo_id NUMERIC(38,18) NOT NULL,

quantity_sold NUMERIC(10,2) ENCODE ZSTD NOT NULL,

amount_sold NUMERIC(10,2) ENCODE ZSTD NOT NULL

)

DISTSTYLE EVEN

SORTKEY

(

time_id,

channel_id,

cust_id,

prod_id,

promo_id

);

/*key 分配的事实表,复合排序键*/

CREATE TABLE IF NOT EXISTS sh.sales_c(

prod_id NUMERIC(38,18) NOT NULL,

cust_id NUMERIC(38,18) NOT NULL,

time_id TIMESTAMP WITHOUT TIME ZONE NOT NULL,

channel_id NUMERIC(38,18) NOT NULL,

promo_id NUMERIC(38,18) NOT NULL,

quantity_sold NUMERIC(10,2) NOT NULL,

amount_sold NUMERIC(10,2) NOT NULL

)

DISTSTYLE KEY

DISTKEY

(

time_id

)

compound SORTKEY

(

time_id,

channel_id,

prod_id,

promo_id,

cust_id

);

/*key 分配的事实表,交错排序键*/

CREATE TABLE IF NOT EXISTS sh.sales_k(

prod_id NUMERIC(38,18) NOT NULL,

cust_id NUMERIC(38,18) NOT NULL,

time_id TIMESTAMP WITHOUT TIME ZONE NOT NULL,

channel_id NUMERIC(38,18) NOT NULL,

promo_id NUMERIC(38,18) NOT NULL,

quantity_sold NUMERIC(10,2) ENCODE ZSTD NOT NULL,

amount_sold NUMERIC(10,2) ENCODE ZSTD NOT NULL

)

DISTSTYLE KEY

DISTKEY

(

time_id

)

interleaved SORTKEY

(

time_id,

channel_id,

cust_id,

prod_id,

promo_id

);

维度表

/*all 分配的维度表*/

CREATE TABLE IF NOT EXISTS sh.customers_a(

cust_id NUMERIC(38,18) NOT NULL,

cust_first_name CHARACTER VARYING(20) ENCODE ZSTD NOT NULL,

cust_last_name CHARACTER VARYING(40) ENCODE ZSTD NOT NULL,

cust_gender CHARACTER VARYING(1) NOT NULL,

cust_year_of_birth SMALLINT NOT NULL,

cust_marital_status CHARACTER VARYING(20),

cust_street_address CHARACTER VARYING(40) ENCODE ZSTD NOT NULL,

cust_postal_code CHARACTER VARYING(10) ENCODE ZSTD NOT NULL,

cust_city CHARACTER VARYING(30) ENCODE ZSTD NOT NULL,

cust_city_id NUMERIC(38,18) ENCODE ZSTD NOT NULL,

cust_state_province CHARACTER VARYING(40) ENCODE ZSTD NOT NULL,

cust_state_province_id NUMERIC(38,18) ENCODE ZSTD NOT NULL,

country_id NUMERIC(38,18) ENCODE ZSTD NOT NULL,

cust_main_phone_number CHARACTER VARYING(25) ENCODE ZSTD NOT NULL,

cust_income_level CHARACTER VARYING(30) ENCODE ZSTD,

cust_credit_limit NUMERIC(38,18) ENCODE ZSTD,

cust_email CHARACTER VARYING(30) ENCODE ZSTD,

cust_total CHARACTER VARYING(14) ENCODE ZSTD NOT NULL,

cust_total_id NUMERIC(38,18) ENCODE ZSTD NOT NULL,

cust_src_id NUMERIC(38,18) ENCODE ZSTD,

cust_eff_from TIMESTAMP WITHOUT TIME ZONE ENCODE ZSTD,

cust_eff_to TIMESTAMP WITHOUT TIME ZONE ENCODE ZSTD,

cust_valid CHARACTER VARYING(1) ENCODE ZSTD

)

DISTSTYLE all

SORTKEY

(

cust_id,

cust_gender,

cust_marital_status,

cust_year_of_birth

);

CREATE TABLE IF NOT EXISTS sh.products_a(

prod_id INTEGER NOT NULL,

prod_name CHARACTER VARYING(50) ENCODE ZSTD NOT NULL,

prod_desc CHARACTER VARYING(4000) ENCODE ZSTD NOT NULL,

prod_subcategory CHARACTER VARYING(50) NOT NULL,

prod_subcategory_id NUMERIC(38,18) ENCODE ZSTD NOT NULL,

prod_subcategory_desc CHARACTER VARYING(2000) ENCODE ZSTD NOT NULL,

prod_category CHARACTER VARYING(50) NOT NULL,

prod_category_id NUMERIC(38,18) ENCODE ZSTD NOT NULL,

prod_category_desc CHARACTER VARYING(2000) ENCODE ZSTD NOT NULL,

prod_weight_class SMALLINT ENCODE ZSTD NOT NULL,

prod_unit_of_measure CHARACTER VARYING(20) ENCODE ZSTD,

prod_pack_size CHARACTER VARYING(30) ENCODE ZSTD NOT NULL,

supplier_id INTEGER ENCODE ZSTD NOT NULL,

prod_status CHARACTER VARYING(20) NOT NULL,

prod_list_price NUMERIC(8,2) ENCODE ZSTD NOT NULL,

prod_min_price NUMERIC(8,2) ENCODE ZSTD NOT NULL,

prod_total CHARACTER VARYING(13) ENCODE ZSTD NOT NULL,

prod_total_id NUMERIC(38,18) ENCODE ZSTD NOT NULL,

prod_src_id NUMERIC(38,18) ENCODE ZSTD,

prod_eff_from TIMESTAMP WITHOUT TIME ZONE ENCODE ZSTD,

prod_eff_to TIMESTAMP WITHOUT TIME ZONE ENCODE ZSTD,

prod_valid CHARACTER VARYING(1) ENCODE ZSTD

)

DISTSTYLE all

SORTKEY

(

prod_id,

prod_category,

prod_status,

prod_subcategory

);

CREATE TABLE IF NOT EXISTS sh.times_a(

time_id TIMESTAMP WITHOUT TIME ZONE NOT NULL,

day_name CHARACTER VARYING(9) ENCODE ZSTD NOT NULL,

day_number_in_week SMALLINT ENCODE ZSTD NOT NULL,

day_number_in_month SMALLINT ENCODE ZSTD NOT NULL,

calendar_week_number SMALLINT ENCODE ZSTD NOT NULL,

fiscal_week_number SMALLINT ENCODE ZSTD NOT NULL,

week_ending_day TIMESTAMP WITHOUT TIME ZONE ENCODE ZSTD NOT NULL,

week_ending_day_id NUMERIC(38,18) ENCODE ZSTD NOT NULL,

calendar_month_number SMALLINT ENCODE ZSTD NOT NULL,

fiscal_month_number SMALLINT ENCODE ZSTD NOT NULL,

calendar_month_desc CHARACTER VARYING(8) ENCODE ZSTD NOT NULL,

calendar_month_id NUMERIC(38,18) ENCODE ZSTD NOT NULL,

fiscal_month_desc CHARACTER VARYING(8) ENCODE ZSTD NOT NULL,

fiscal_month_id NUMERIC(38,18) ENCODE ZSTD NOT NULL,

days_in_cal_month NUMERIC(38,18) ENCODE ZSTD NOT NULL,

days_in_fis_month NUMERIC(38,18) ENCODE ZSTD NOT NULL,

end_of_cal_month TIMESTAMP WITHOUT TIME ZONE ENCODE ZSTD NOT NULL,

end_of_fis_month TIMESTAMP WITHOUT TIME ZONE ENCODE ZSTD NOT NULL,

calendar_month_name CHARACTER VARYING(9) ENCODE ZSTD NOT NULL,

fiscal_month_name CHARACTER VARYING(9) ENCODE ZSTD NOT NULL,

calendar_quarter_desc CHARACTER VARYING(7) ENCODE ZSTD NOT NULL,

calendar_quarter_id NUMERIC(38,18) ENCODE ZSTD NOT NULL,

fiscal_quarter_desc CHARACTER VARYING(7) ENCODE ZSTD NOT NULL,

fiscal_quarter_id NUMERIC(38,18) ENCODE ZSTD NOT NULL,

days_in_cal_quarter NUMERIC(38,18) ENCODE ZSTD NOT NULL,

days_in_fis_quarter NUMERIC(38,18) ENCODE ZSTD NOT NULL,

end_of_cal_quarter TIMESTAMP WITHOUT TIME ZONE ENCODE ZSTD NOT NULL,

end_of_fis_quarter TIMESTAMP WITHOUT TIME ZONE ENCODE ZSTD NOT NULL,

calendar_quarter_number SMALLINT ENCODE ZSTD NOT NULL,

fiscal_quarter_number SMALLINT ENCODE ZSTD NOT NULL,

calendar_year SMALLINT ENCODE ZSTD NOT NULL,

calendar_year_id NUMERIC(38,18) ENCODE ZSTD NOT NULL,

fiscal_year SMALLINT ENCODE ZSTD NOT NULL,

fiscal_year_id NUMERIC(38,18) ENCODE ZSTD NOT NULL,

days_in_cal_year NUMERIC(38,18) ENCODE ZSTD NOT NULL,

days_in_fis_year NUMERIC(38,18) ENCODE ZSTD NOT NULL,

end_of_cal_year TIMESTAMP WITHOUT TIME ZONE ENCODE ZSTD NOT NULL,

end_of_fis_year TIMESTAMP WITHOUT TIME ZONE ENCODE ZSTD NOT NULL

)

DISTSTYLE all

SORTKEY

(

time_id

);

/*key 分配的维度表*/

CREATE TABLE IF NOT EXISTS sh.customers_k(

cust_id NUMERIC(38,18) NOT NULL,

cust_first_name CHARACTER VARYING(20) ENCODE ZSTD NOT NULL,

cust_last_name CHARACTER VARYING(40) ENCODE ZSTD NOT NULL,

cust_gender CHARACTER VARYING(1) NOT NULL,

cust_year_of_birth SMALLINT NOT NULL,

cust_marital_status CHARACTER VARYING(20),

cust_street_address CHARACTER VARYING(40) ENCODE ZSTD NOT NULL,

cust_postal_code CHARACTER VARYING(10) ENCODE ZSTD NOT NULL,

cust_city CHARACTER VARYING(30) ENCODE ZSTD NOT NULL,

cust_city_id NUMERIC(38,18) ENCODE ZSTD NOT NULL,

cust_state_province CHARACTER VARYING(40) ENCODE ZSTD NOT NULL,

cust_state_province_id NUMERIC(38,18) ENCODE ZSTD NOT NULL,

country_id NUMERIC(38,18) ENCODE ZSTD NOT NULL,

cust_main_phone_number CHARACTER VARYING(25) ENCODE ZSTD NOT NULL,

cust_income_level CHARACTER VARYING(30) ENCODE ZSTD,

cust_credit_limit NUMERIC(38,18) ENCODE ZSTD,

cust_email CHARACTER VARYING(30) ENCODE ZSTD,

cust_total CHARACTER VARYING(14) ENCODE ZSTD NOT NULL,

cust_total_id NUMERIC(38,18) ENCODE ZSTD NOT NULL,

cust_src_id NUMERIC(38,18) ENCODE ZSTD,

cust_eff_from TIMESTAMP WITHOUT TIME ZONE ENCODE ZSTD,

cust_eff_to TIMESTAMP WITHOUT TIME ZONE ENCODE ZSTD,

cust_valid CHARACTER VARYING(1) ENCODE ZSTD

)

DISTSTYLE KEY

DISTKEY

(

cust_id

)

SORTKEY

(

cust_id,

cust_gender,

cust_marital_status,

cust_year_of_birth

);


CREATE TABLE IF NOT EXISTS sh.products_k(

prod_id INTEGER NOT NULL,

prod_name CHARACTER VARYING(50) ENCODE ZSTD NOT NULL,

prod_desc CHARACTER VARYING(4000) ENCODE ZSTD NOT NULL,

prod_subcategory CHARACTER VARYING(50) NOT NULL,

prod_subcategory_id NUMERIC(38,18) ENCODE ZSTD NOT NULL,

prod_subcategory_desc CHARACTER VARYING(2000) ENCODE ZSTD NOT NULL,

prod_category CHARACTER VARYING(50) NOT NULL,

prod_category_id NUMERIC(38,18) ENCODE ZSTD NOT NULL,

prod_category_desc CHARACTER VARYING(2000) ENCODE ZSTD NOT NULL,

prod_weight_class SMALLINT ENCODE ZSTD NOT NULL,

prod_unit_of_measure CHARACTER VARYING(20) ENCODE ZSTD,

prod_pack_size CHARACTER VARYING(30) ENCODE ZSTD NOT NULL,

supplier_id INTEGER ENCODE ZSTD NOT NULL,

prod_status CHARACTER VARYING(20) NOT NULL,

prod_list_price NUMERIC(8,2) ENCODE ZSTD NOT NULL,

prod_min_price NUMERIC(8,2) ENCODE ZSTD NOT NULL,

prod_total CHARACTER VARYING(13) ENCODE ZSTD NOT NULL,

prod_total_id NUMERIC(38,18) ENCODE ZSTD NOT NULL,

prod_src_id NUMERIC(38,18) ENCODE ZSTD,

prod_eff_from TIMESTAMP WITHOUT TIME ZONE ENCODE ZSTD,

prod_eff_to TIMESTAMP WITHOUT TIME ZONE ENCODE ZSTD,

prod_valid CHARACTER VARYING(1) ENCODE ZSTD

)

DISTSTYLE KEY

DISTKEY

(

prod_id

)

SORTKEY

(

prod_id,

prod_category,

prod_status,

prod_subcategory

);



CREATE TABLE IF NOT EXISTS sh.times_k(

time_id TIMESTAMP WITHOUT TIME ZONE NOT NULL,

day_name CHARACTER VARYING(9) ENCODE ZSTD NOT NULL,

day_number_in_week SMALLINT ENCODE ZSTD NOT NULL,

day_number_in_month SMALLINT ENCODE ZSTD NOT NULL,

calendar_week_number SMALLINT ENCODE ZSTD NOT NULL,

fiscal_week_number SMALLINT ENCODE ZSTD NOT NULL,

week_ending_day TIMESTAMP WITHOUT TIME ZONE ENCODE ZSTD NOT NULL,

week_ending_day_id NUMERIC(38,18) ENCODE ZSTD NOT NULL,

calendar_month_number SMALLINT ENCODE ZSTD NOT NULL,

fiscal_month_number SMALLINT ENCODE ZSTD NOT NULL,

calendar_month_desc CHARACTER VARYING(8) ENCODE ZSTD NOT NULL,

calendar_month_id NUMERIC(38,18) ENCODE ZSTD NOT NULL,

fiscal_month_desc CHARACTER VARYING(8) ENCODE ZSTD NOT NULL,

fiscal_month_id NUMERIC(38,18) ENCODE ZSTD NOT NULL,

days_in_cal_month NUMERIC(38,18) ENCODE ZSTD NOT NULL,

days_in_fis_month NUMERIC(38,18) ENCODE ZSTD NOT NULL,

end_of_cal_month TIMESTAMP WITHOUT TIME ZONE ENCODE ZSTD NOT NULL,

end_of_fis_month TIMESTAMP WITHOUT TIME ZONE ENCODE ZSTD NOT NULL,

calendar_month_name CHARACTER VARYING(9) ENCODE ZSTD NOT NULL,

fiscal_month_name CHARACTER VARYING(9) ENCODE ZSTD NOT NULL,

calendar_quarter_desc CHARACTER VARYING(7) ENCODE ZSTD NOT NULL,

calendar_quarter_id NUMERIC(38,18) ENCODE ZSTD NOT NULL,

fiscal_quarter_desc CHARACTER VARYING(7) ENCODE ZSTD NOT NULL,

fiscal_quarter_id NUMERIC(38,18) ENCODE ZSTD NOT NULL,

days_in_cal_quarter NUMERIC(38,18) ENCODE ZSTD NOT NULL,

days_in_fis_quarter NUMERIC(38,18) ENCODE ZSTD NOT NULL,

end_of_cal_quarter TIMESTAMP WITHOUT TIME ZONE ENCODE ZSTD NOT NULL,

end_of_fis_quarter TIMESTAMP WITHOUT TIME ZONE ENCODE ZSTD NOT NULL,

calendar_quarter_number SMALLINT ENCODE ZSTD NOT NULL,

fiscal_quarter_number SMALLINT ENCODE ZSTD NOT NULL,

calendar_year SMALLINT ENCODE ZSTD NOT NULL,

calendar_year_id NUMERIC(38,18) ENCODE ZSTD NOT NULL,

fiscal_year SMALLINT ENCODE ZSTD NOT NULL,

fiscal_year_id NUMERIC(38,18) ENCODE ZSTD NOT NULL,

days_in_cal_year NUMERIC(38,18) ENCODE ZSTD NOT NULL,

days_in_fis_year NUMERIC(38,18) ENCODE ZSTD NOT NULL,

end_of_cal_year TIMESTAMP WITHOUT TIME ZONE ENCODE ZSTD NOT NULL,

end_of_fis_year TIMESTAMP WITHOUT TIME ZONE ENCODE ZSTD NOT NULL

)

DISTSTYLE KEY

DISTKEY

(

time_id

)

SORTKEY

(

time_id

);

/*even 分配的维度表*/

CREATE TABLE IF NOT EXISTS sh.customers_e(

cust_id NUMERIC(38,18) NOT NULL,

cust_first_name CHARACTER VARYING(20) ENCODE ZSTD NOT NULL,

cust_last_name CHARACTER VARYING(40) ENCODE ZSTD NOT NULL,

cust_gender CHARACTER VARYING(1) NOT NULL,

cust_year_of_birth SMALLINT NOT NULL,

cust_marital_status CHARACTER VARYING(20),

cust_street_address CHARACTER VARYING(40) ENCODE ZSTD NOT NULL,

cust_postal_code CHARACTER VARYING(10) ENCODE ZSTD NOT NULL,

cust_city CHARACTER VARYING(30) ENCODE ZSTD NOT NULL,

cust_city_id NUMERIC(38,18) ENCODE ZSTD NOT NULL,

cust_state_province CHARACTER VARYING(40) ENCODE ZSTD NOT NULL,

cust_state_province_id NUMERIC(38,18) ENCODE ZSTD NOT NULL,

country_id NUMERIC(38,18) ENCODE ZSTD NOT NULL,

cust_main_phone_number CHARACTER VARYING(25) ENCODE ZSTD NOT NULL,

cust_income_level CHARACTER VARYING(30) ENCODE ZSTD,

cust_credit_limit NUMERIC(38,18) ENCODE ZSTD,

cust_email CHARACTER VARYING(30) ENCODE ZSTD,

cust_total CHARACTER VARYING(14) ENCODE ZSTD NOT NULL,

cust_total_id NUMERIC(38,18) ENCODE ZSTD NOT NULL,

cust_src_id NUMERIC(38,18) ENCODE ZSTD,

cust_eff_from TIMESTAMP WITHOUT TIME ZONE ENCODE ZSTD,

cust_eff_to TIMESTAMP WITHOUT TIME ZONE ENCODE ZSTD,

cust_valid CHARACTER VARYING(1) ENCODE ZSTD

)

DISTSTYLE even

SORTKEY

(

cust_id,

cust_gender,

cust_marital_status,

cust_year_of_birth

);


CREATE TABLE IF NOT EXISTS sh.products_e(

prod_id INTEGER NOT NULL,

prod_name CHARACTER VARYING(50) ENCODE ZSTD NOT NULL,

prod_desc CHARACTER VARYING(4000) ENCODE ZSTD NOT NULL,

prod_subcategory CHARACTER VARYING(50) NOT NULL,

prod_subcategory_id NUMERIC(38,18) ENCODE ZSTD NOT NULL,

prod_subcategory_desc CHARACTER VARYING(2000) ENCODE ZSTD NOT NULL,

prod_category CHARACTER VARYING(50) NOT NULL,

prod_category_id NUMERIC(38,18) ENCODE ZSTD NOT NULL,

prod_category_desc CHARACTER VARYING(2000) ENCODE ZSTD NOT NULL,

prod_weight_class SMALLINT ENCODE ZSTD NOT NULL,

prod_unit_of_measure CHARACTER VARYING(20) ENCODE ZSTD,

prod_pack_size CHARACTER VARYING(30) ENCODE ZSTD NOT NULL,

supplier_id INTEGER ENCODE ZSTD NOT NULL,

prod_status CHARACTER VARYING(20) NOT NULL,

prod_list_price NUMERIC(8,2) ENCODE ZSTD NOT NULL,

prod_min_price NUMERIC(8,2) ENCODE ZSTD NOT NULL,

prod_total CHARACTER VARYING(13) ENCODE ZSTD NOT NULL,

prod_total_id NUMERIC(38,18) ENCODE ZSTD NOT NULL,

prod_src_id NUMERIC(38,18) ENCODE ZSTD,

prod_eff_from TIMESTAMP WITHOUT TIME ZONE ENCODE ZSTD,

prod_eff_to TIMESTAMP WITHOUT TIME ZONE ENCODE ZSTD,

prod_valid CHARACTER VARYING(1) ENCODE ZSTD

)

DISTSTYLE even

SORTKEY

(

prod_id,

prod_category,

prod_status,

prod_subcategory

);

CREATE TABLE IF NOT EXISTS sh.times_e(

time_id TIMESTAMP WITHOUT TIME ZONE NOT NULL,

day_name CHARACTER VARYING(9) ENCODE ZSTD NOT NULL,

day_number_in_week SMALLINT ENCODE ZSTD NOT NULL,

day_number_in_month SMALLINT ENCODE ZSTD NOT NULL,

calendar_week_number SMALLINT ENCODE ZSTD NOT NULL,

fiscal_week_number SMALLINT ENCODE ZSTD NOT NULL,

week_ending_day TIMESTAMP WITHOUT TIME ZONE ENCODE ZSTD NOT NULL,

week_ending_day_id NUMERIC(38,18) ENCODE ZSTD NOT NULL,

calendar_month_number SMALLINT ENCODE ZSTD NOT NULL,

fiscal_month_number SMALLINT ENCODE ZSTD NOT NULL,

calendar_month_desc CHARACTER VARYING(8) ENCODE ZSTD NOT NULL,

calendar_month_id NUMERIC(38,18) ENCODE ZSTD NOT NULL,

fiscal_month_desc CHARACTER VARYING(8) ENCODE ZSTD NOT NULL,

fiscal_month_id NUMERIC(38,18) ENCODE ZSTD NOT NULL,

days_in_cal_month NUMERIC(38,18) ENCODE ZSTD NOT NULL,

days_in_fis_month NUMERIC(38,18) ENCODE ZSTD NOT NULL,

end_of_cal_month TIMESTAMP WITHOUT TIME ZONE ENCODE ZSTD NOT NULL,

end_of_fis_month TIMESTAMP WITHOUT TIME ZONE ENCODE ZSTD NOT NULL,

calendar_month_name CHARACTER VARYING(9) ENCODE ZSTD NOT NULL,

fiscal_month_name CHARACTER VARYING(9) ENCODE ZSTD NOT NULL,

calendar_quarter_desc CHARACTER VARYING(7) ENCODE ZSTD NOT NULL,

calendar_quarter_id NUMERIC(38,18) ENCODE ZSTD NOT NULL,

fiscal_quarter_desc CHARACTER VARYING(7) ENCODE ZSTD NOT NULL,

fiscal_quarter_id NUMERIC(38,18) ENCODE ZSTD NOT NULL,

days_in_cal_quarter NUMERIC(38,18) ENCODE ZSTD NOT NULL,

days_in_fis_quarter NUMERIC(38,18) ENCODE ZSTD NOT NULL,

end_of_cal_quarter TIMESTAMP WITHOUT TIME ZONE ENCODE ZSTD NOT NULL,

end_of_fis_quarter TIMESTAMP WITHOUT TIME ZONE ENCODE ZSTD NOT NULL,

calendar_quarter_number SMALLINT ENCODE ZSTD NOT NULL,

fiscal_quarter_number SMALLINT ENCODE ZSTD NOT NULL,

calendar_year SMALLINT ENCODE ZSTD NOT NULL,

calendar_year_id NUMERIC(38,18) ENCODE ZSTD NOT NULL,

fiscal_year SMALLINT ENCODE ZSTD NOT NULL,

fiscal_year_id NUMERIC(38,18) ENCODE ZSTD NOT NULL,

days_in_cal_year NUMERIC(38,18) ENCODE ZSTD NOT NULL,

days_in_fis_year NUMERIC(38,18) ENCODE ZSTD NOT NULL,

end_of_cal_year TIMESTAMP WITHOUT TIME ZONE ENCODE ZSTD NOT NULL,

end_of_fis_year TIMESTAMP WITHOUT TIME ZONE ENCODE ZSTD NOT NULL

)

DISTSTYLE even

SORTKEY

(

time_id

);

测试查询语句

/* 1 事实表 key 分配+交错排序键,维度表全 key 分配 */

explain select b.calendar_quarter_desc,e.prod_name,e.prod_list_price,c.cust_credit_limit,c.cust_marital_status,sum(a.amount_sold) sale_cnt

from sh.sales_k a,sh.times_k b,sh.customers_k c,sh.products_k e

where

    a.time_id=b.time_id

    and a.prod_id=e.prod_id

    and a.cust_id=c.cust_id

group by b.calendar_quarter_desc,e.prod_name,e.prod_list_price,c.cust_credit_limit,c.cust_marital_status limit 80;

/* 2 事实表 key 分配+复合排序键,维度表全 key 分配 */

explain select b.calendar_quarter_desc,e.prod_name,e.prod_list_price,c.cust_credit_limit,c.cust_marital_status,sum(a.amount_sold) sale_cnt

from sh.sales_c a,sh.times_k b,sh.customers_k c,sh.products_k e

where

   a.time_id=b.time_id

    and a.prod_id=e.prod_id

    and a.cust_id=c.cust_id

group by b.calendar_quarter_desc,e.prod_name,e.prod_list_price,c.cust_credit_limit,c.cust_marital_status limit 80;

/* 3 事实表 even 分配,维度表全 key 分配 */

explain select b.calendar_quarter_desc,e.prod_name,e.prod_list_price,c.cust_credit_limit,c.cust_marital_status,sum(a.amount_sold) sale_cnt

from sh.sales_e a,sh.times_k b,sh.customers_k c,sh.products_k e

where

    a.time_id>=to_timestamp('20010101','yyyymmdd') and a.time_id<=to_timestamp('20011231','yyyymmdd')

    and a.time_id=b.time_id

    and a.prod_id=e.prod_id

    and a.cust_id=c.cust_id

group by b.calendar_quarter_desc,e.prod_name,e.prod_list_price,c.cust_credit_limit,c.cust_marital_status limit 80;

/* 4 事实表 key 分配+交错排序键,维度表 times key 分配且与事实表键值相同,其他维度表 all 分配 */

explain select to_char(a.time_id,'yyyymm'),b.prod_name,b.prod_list_price,c.cust_credit_limit,c.cust_marital_status,sum(a.amount_sold) sale_cnt

from sh.sales_k a,sh.products_a b,sh.customers_a c,sh.times_k e

where

    a.time_id=e.time_id

    and a.prod_id=b.prod_id

    and a.cust_id=c.cust_id

group by to_char(a.time_id,'yyyymm'),b.prod_name,b.prod_list_price,c.cust_credit_limit,c.cust_marital_status limit 80;

/* 5 事实表 key 分配+复合排序键,维度表 times key 分配且与事实表键值相同,其他维度表 all 分配 注意 merge join cost 最小*/

explain select to_char(a.time_id,'yyyymm'),b.prod_name,b.prod_list_price,c.cust_credit_limit,c.cust_marital_status,sum(a.amount_sold) sale_cnt

from sh.sales_c a,sh.products_a b,sh.customers_a c,sh.countries_a d,sh.times_k e

where

    a.time_id>=to_timestamp('20010101','yyyymmdd') and a.time_id<=to_timestamp('20011231','yyyymmdd')

    and a.time_id=e.time_id

    and a.prod_id=b.prod_id

    and a.cust_id=c.cust_id

group by to_char(a.time_id,'yyyymm'),b.prod_name,b.prod_list_price,c.cust_credit_limit,c.cust_marital_status limit 80;

/* 6 事实表 even 分配,维度表全 all 分配 */

explain select to_char(a.time_id,'yyyymm'),b.prod_name,b.prod_list_price,c.cust_credit_limit,c.cust_marital_status,sum(a.amount_sold) sale_cnt

from sh.sales_e a,sh.products_a b,sh.customers_a c,sh.times_a e

where

    a.time_id=e.time_id

    and a.prod_id=b.prod_id

    and a.cust_id=c.cust_id

group by to_char(a.time_id,'yyyymm'),b.prod_name,b.prod_list_price,c.cust_credit_limit,c.cust_marital_status limit 80;

/* 7 事实表 even 分配,维度表全 even 分配 */

explain select to_char(a.time_id,'yyyymm'),b.prod_name,b.prod_list_price,c.cust_credit_limit,c.cust_marital_status,sum(a.amount_sold) sale_cnt

from sh.sales_e a,sh.products_e b,sh.customers_e c,sh.times_e e

where

   a.time_id=e.time_id

    and a.prod_id=b.prod_id

    and a.cust_id=c.cust_id

group by to_char(a.time_id,'yyyymm'),b.prod_name,b.prod_list_price,c.cust_credit_limit,c.cust_marital_status limit 80;

/* 8 事实表 even 分配,维度表全 key 分配 */

explain select to_char(a.time_id,'yyyymm'),b.prod_name,b.prod_list_price,c.cust_credit_limit,c.cust_marital_status,sum(a.amount_sold) sale_cnt

from sh.sales_e a,sh.products_k b,sh.customers_k c,sh.times_k e

where

    a.time_id=e.time_id

    and a.prod_id=b.prod_id

    and a.cust_id=c.cust_id

group by to_char(a.time_id,'yyyymm'),b.prod_name,b.prod_list_price,c.cust_credit_limit,c.cust_marital_status limit 80;

Amazon Redshift 的表设计是获取良好性能的基础,要想做出正确的设计一方面需要读者了解 Amazon Redshift 的运作原理和表设计最佳实践,另一方面更需要读者对自身业务的深入理解,有时候表设计不仅仅是技术,甚至是一种艺术。总之,正如俗语所说“善始方能善终”,希望读者在阅读本文后,在使用 Amazon Redshift 之前能多一份思考,多一分准备,能够最大限度地发挥 Amazon Redshift 的潜力,助力我们的业务成长。

 

相关文章