Discussion: Re: Our trial to TPC-DS but optimizer made unreasonable plan


Re: Our trial to TPC-DS but optimizer made unreasonable plan

From: Kouhei Kaigai
Date:
> On Mon, Aug 17, 2015 at 9:40 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> > I think the SortSupport logic provides a reasonable way to solve this
> > kind of problem. For example, btint4sortsupport() installs a function
> > pointer to the fast version of the comparator (btint4fastcmp), which
> > takes two Datum arguments without indirect memory references.
> > This mechanism would also make sense for the HashAggregate logic, to
> > reduce the cost of function invocations.
> >
> > Please comment on this idea.
> 
> It's possible that this can work, but it might be a good idea to run
> 'perf' on this query and find out where the CPU time is actually
> going.  That might give you a clearer picture of why the HashAggregate
> is slow.
>
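
For reference, the hook I have in mind is tiny; the sketch below is modeled on the existing btint4sortsupport()/btint4fastcmp() pair in src/backend/access/nbtree/nbtcompare.c, so treat it as an illustration rather than an exact copy of the source:

#include "postgres.h"

#include "fmgr.h"
#include "utils/sortsupport.h"

/*
 * Fast comparator: called through a plain function pointer with two
 * Datum arguments, so there is no FunctionCallInfo setup and no
 * indirect memory reference per comparison.
 */
static int
btint4fastcmp(Datum x, Datum y, SortSupport ssup)
{
	int32		a = DatumGetInt32(x);
	int32		b = DatumGetInt32(y);

	if (a > b)
		return 1;
	else if (a == b)
		return 0;
	else
		return -1;
}

/*
 * SQL-callable sort support function: all it does is install the fast
 * comparator into the SortSupport state.
 */
Datum
btint4sortsupport(PG_FUNCTION_ARGS)
{
	SortSupport ssup = (SortSupport) PG_GETARG_POINTER(0);

	ssup->comparator = btint4fastcmp;
	PG_RETURN_VOID();
}

The idea in the quoted mail is to let the grouping/HashAggregate path obtain equality and hash functions in this Datum-level form as well, rather than going through the regular fmgr call convention for every pair of grouping keys.
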
I tried running one of the CTE portions with perf enabled.

HashAggregate still takes about 490 sec, in spite of the roughly 70 sec taken by the underlying join.


tpcds100=# explain analyze
select c_customer_id customer_id
      ,c_first_name customer_first_name
      ,c_last_name customer_last_name
      ,c_preferred_cust_flag customer_preferred_cust_flag
      ,c_birth_country customer_birth_country
      ,c_login customer_login
      ,c_email_address customer_email_address
      ,d_year dyear
      ,sum(((ss_ext_list_price - ss_ext_wholesale_cost - ss_ext_discount_amt) + ss_ext_sales_price) / 2) year_total
      ,'s' sale_type
  from customer
      ,store_sales
      ,date_dim
 where c_customer_sk = ss_customer_sk
   and ss_sold_date_sk = d_date_sk
 group by c_customer_id
      ,c_first_name
      ,c_last_name
      ,c_preferred_cust_flag
      ,c_birth_country
      ,c_login
      ,c_email_address
      ,d_year
;
                                            QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=18194948.40..21477516.00 rows=262605408 width=178)
                (actual time=483480.161..490763.640 rows=9142442 loops=1)
   Group Key: customer.c_customer_id, customer.c_first_name, customer.c_last_name,
              customer.c_preferred_cust_flag, customer.c_birth_country, customer.c_login,
              customer.c_email_address, date_dim.d_year
   ->  Custom Scan (GpuJoin)  (cost=101342.54..9660272.64 rows=262605408 width=178)
                              (actual time=2430.787..73116.553 rows=268562375 loops=1)
         Bulkload: On (density: 100.00%)
         Depth 1: Logic: GpuHashJoin, HashKeys: (ss_sold_date_sk), JoinQual: (ss_sold_date_sk = d_date_sk),
                  nrows (287997024 -> 275041999, 95.50% expected 95.47%)
         Depth 2: Logic: GpuHashJoin, HashKeys: (ss_customer_sk), JoinQual: (ss_customer_sk = c_customer_sk),
                  nrows (275041999 -> 268562375, 93.25% expected 91.18%)
         ->  Custom Scan (BulkScan) on store_sales  (cost=0.00..9649559.60 rows=287996960 width=38)
                                                    (actual time=17.141..52757.354 rows=287997024 loops=1)
         ->  Seq Scan on date_dim  (cost=0.00..2705.49 rows=73049 width=16)
                                   (actual time=0.030..20.597 rows=73049 loops=1)
         ->  Seq Scan on customer  (cost=0.00..87141.74 rows=2000074 width=156)
                                   (actual time=0.010..585.861 rows=2000000 loops=1)
 Planning time: 1.558 ms
 Execution time: 492113.558 ms
(11 rows)


The perf output is below. Contrary to my expectation, the largest share was consumed
by bpchareq (6.76%) plus bcTruelen (8.23%). The other big cluster is probably
TupleHashTableHash (1.11%) -> slot_getattr (4.29%) -> slot_deform_tuple (3.58%).


# ========
# captured on: Thu Aug 20 09:52:24 2015
# hostname : ayu.kaigai.gr.jp
# os release : 2.6.32-504.23.4.el6.x86_64
# perf version : 2.6.32-504.23.4.el6.x86_64.debug
# arch : x86_64
# nrcpus online : 48
# nrcpus avail : 48
# cpudesc : Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz
# cpuid : GenuineIntel,6,63,2
# total memory : 396795400 kB
# cmdline : /usr/bin/perf record -a -e cycles
# event : name = cycles, type = 0, config = 0x0, config1 = 0x0, config2 = 0x0, excl_usr = 0, excl_kern = 0, excl_host = 0, excl_guest = 1, precise_ip = 0, attr_mmap2 = 0, attr_mmap = 1, attr_mmap_data = 0
# HEADER_CPU_TOPOLOGY info available, use -I to display
# HEADER_NUMA_TOPOLOGY info available, use -I to display
# pmu mappings: cpu = 4, tracepoint = 2, software = 1
# ========
#
# Samples: 2M of event 'cycles'
# Event count (approx.): 1558291468259
#
# Overhead  Command   Shared Object      Symbol
# ........  ........  .................  ....................................
#
     8.23%  postgres  postgres           [.] bcTruelen
     6.76%  postgres  postgres           [.] bpchareq
     4.92%  postgres  postgres           [.] pg_detoast_datum
     4.29%  postgres  postgres           [.] slot_getattr
     4.07%  postgres  postgres           [.] AllocSetAlloc
     3.58%  postgres  postgres           [.] slot_deform_tuple
     3.39%  postgres  postgres           [.] div_var
     3.35%  postgres  postgres           [.] hash_search_with_hash_value
     3.11%  postgres  postgres           [.] hash_any
     2.62%  postgres  postgres           [.] make_result
     2.50%  postgres  postgres           [.] add_abs
     2.24%  postgres  postgres           [.] ExecAgg
     2.23%  postgres  postgres           [.] init_var_from_num
     2.09%  postgres  postgres           [.] pg_detoast_datum_packed
     2.07%  postgres  postgres           [.] ExecMakeFunctionResultNoSets
     1.95%  postgres  [vsyscall]         [.] 0x000000000000014c
     1.88%  postgres  libc-2.12.so       [.] memcpy
     1.83%  postgres  postgres           [.] execTuplesMatch
     1.71%  postgres  postgres           [.] sub_abs
     1.70%  postgres  [kernel.kallsyms]  [k] copy_user_generic_string
     1.48%  postgres  pg_strom.so        [.] pgstrom_data_store_insert_block
     1.41%  postgres  postgres           [.] palloc
     1.38%  postgres  postgres           [.] texteq
     1.29%  postgres  [vdso]             [.] 0x0000000000000890
     1.11%  postgres  postgres           [.] TupleHashTableHash
      :
(Only entries larger than 1.0% are shown.)
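
The top two entries are the char(n) comparison path. Roughly speaking, every equality check on a char(n) grouping key has to detoast both values and strip their trailing blanks before it can compare the bytes. The sketch below is a simplified illustration of that logic (the helper names here are made up; the real code is bpchareq()/bcTruelen() in src/backend/utils/adt/varchar.c):

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* length ignoring trailing blanks -- the bcTruelen() part of the cost */
static int
bpchar_true_len(const char *s, int len)
{
    while (len > 0 && s[len - 1] == ' ')
        len--;
    return len;
}

/* equality ignoring trailing blanks -- the bpchareq() part of the cost */
static bool
bpchar_eq(const char *s1, int len1, const char *s2, int len2)
{
    int     l1 = bpchar_true_len(s1, len1);
    int     l2 = bpchar_true_len(s2, len2);

    /* values with different significant lengths can never be equal */
    if (l1 != l2)
        return false;

    return memcmp(s1, s2, l1) == 0;
}

int
main(void)
{
    /* equal under char(n) semantics: trailing blanks are insignificant */
    printf("%d\n", bpchar_eq("AAA  ", 5, "AAA", 3));    /* prints 1 */
    return 0;
}

Because the hash aggregate has to re-check the grouping keys of every input row against the matching hash-table entry, that trailing-blank scan is repeated for hundreds of millions of rows per char(n) column here, which matches the bcTruelen/bpchareq share of the profile.
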


Indeed, 6 of the 8 grouping keys in this query use the bpchar data type, so it is
natural that its comparison functions consume a large portion of the CPU cycles.
Do we have any ideas for how the backend could assist these queries?


tpcds100=# \d customer
                  Table "public.customer"
         Column         |         Type          | Modifiers
------------------------+-----------------------+-----------
 c_customer_sk          | bigint                | not null
 c_customer_id          | character(16)         | not null
 c_current_cdemo_sk     | bigint                |
 c_current_hdemo_sk     | bigint                |
 c_current_addr_sk      | bigint                |
 c_first_shipto_date_sk | bigint                |
 c_first_sales_date_sk  | bigint                |
 c_salutation           | character(10)         |
 c_first_name           | character(20)         |
 c_last_name            | character(30)         |
 c_preferred_cust_flag  | character(1)          |
 c_birth_day            | bigint                |
 c_birth_month          | bigint                |
 c_birth_year           | bigint                |
 c_birth_country        | character varying(20) |
 c_login                | character(13)         |
 c_email_address        | character(50)         |
 c_last_review_date_sk  | bigint                |


Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


Re: Our trial to TPC-DS but optimizer made unreasonable plan

From: Peter Geoghegan
Date:
On Wed, Aug 19, 2015 at 6:08 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> Indeed, 6 of the 8 grouping keys in this query use the bpchar data type, so it is
> natural that its comparison functions consume a large portion of the CPU cycles.
> Do we have any ideas for how the backend could assist these queries?

With abbreviated keys, char(n) is very significantly slower than
varchar(n) (or text). I've been meaning to revisit my doc patch to warn
users of this (we already more or less advise specifically against
using char(n)). Abbreviation and a few other tricks could easily make
an enormous difference.

There is no very good reason why the same optimizations I applied to
text could not also be applied to bpchar, I think; abbreviation could
remove much of the special char(n) effort as well. Someone simply needs
to do the work. I'm actually more concerned about the problems this
causes for third-party benchmarks than about the problems for real
users.
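
To illustrate the abbreviated-key idea (a conceptual sketch only, not the actual abbreviation code in varlena.c, and the helper names below are made up): the leading bytes of each value, with the trailing blanks that char(n) treats as insignificant stripped first, are packed into a single Datum-sized integer, so most comparisons become one cheap integer comparison and only ties fall back to the full comparator.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical abbreviation for a char(n) value: strip trailing blanks
 * (the part bcTruelen() does today on every comparison) and pack up to
 * the first 8 significant bytes into one 64-bit key, most significant
 * byte first so that unsigned comparison preserves byte order.
 */
static uint64_t
bpchar_abbrev(const char *s, int len)
{
    uint64_t    abbrev = 0;

    while (len > 0 && s[len - 1] == ' ')
        len--;

    for (int i = 0; i < 8 && i < len; i++)
        abbrev |= (uint64_t) (unsigned char) s[i] << (8 * (7 - i));

    return abbrev;
}

/* cheap first-pass comparison; only a tie needs the authoritative comparator */
static int
bpchar_abbrev_cmp(uint64_t a, uint64_t b)
{
    return (a > b) - (a < b);
}

int
main(void)
{
    /* char(20)-style values: payload padded with trailing blanks */
    const char *v1 = "JONES               ";
    const char *v2 = "JONAS               ";
    uint64_t    k1 = bpchar_abbrev(v1, (int) strlen(v1));
    uint64_t    k2 = bpchar_abbrev(v2, (int) strlen(v2));

    printf("abbreviated cmp = %d\n", bpchar_abbrev_cmp(k1, k2));    /* prints 1 */
    return 0;
}

This is only a byte-wise sketch; a real implementation would have to respect the collation, as the text abbreviation support does.
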

-- 
Peter Geoghegan