PARTIAL REDISTRIBUTION AND PARTIAL DUPLICATION(PRPD)
Dealing with
data skew can be considered a chief activity in Teradata performance
optimization. Although, skew on table level usually can be avoided by choosing
a proper Primary Index, skew arising during query execution is often a major
issue (so-called spool skew).
Spool skew is
often emerging when join tables (or spools). On a typical Teradata System,
the join process is executed in parallel by all available AMPs at the
same time.
Let’s consider
the following join example:
B … Big table
with 100 Million rows
S … Small table
with 10.000 rows
We assume, that
both tables are joined on a column which is neither the primary index of B nor
the primary index of S. In such a case, the optimizer may choose to rehash and
redistribute both tables. This is a good approach if the join column values if
both tables are evenly distributed across all AMPs.
But we assume,
that the big table B has exactly the same value in the join column in about
99% of the rows i.e. 99 Million of rows out of 100 Million rows
carry the same value in the join column.
As of
now, we run into a big performance problem, caused by skew.
By rehashing
the rows, 99 Million rows are distributed to only one AMP (as they carry the
same hash value).
The receiving
AMP will be completely overloaded (Hot AMP situation) and may run out of
spool space.
Until Teradata
Release 13, it was exclusively your task as performance specialist to solve
suchlike dynamic skewing problems, by forcing the optimizer into a different
execution plan.
Sometimes this could be solved by adding statistics, some of the time a complete rewrite of the SQL statement was the only opportunity.
Sometimes this could be solved by adding statistics, some of the time a complete rewrite of the SQL statement was the only opportunity.
Starting
with Teradata 14, a sophisticated method for resolving such data
skew issues was implemented.
Each AMP
receives information about skewed values (in our example this is the big table
B).
Biased rows of
the big table B, stay locally on their AMPs, only rows not causing skew, are
redistributed by rehashing over the the join column(s).
Similarly, not
biased rows from the small table S, are rehashed on their join column(s)
but rows which would cause skew after redistribution, are duplicated to all
AMPs.
As soon as the
partial redistribution and partial duplication of the tables has been completed
(PRPD), each AMP generates a first result set by joining the contents of
the redistributed spools .
In a next step,
each AMP generates a second result set by joining the contents of the
duplicated spool of table S and the local spool of table B.
Starting with
Teradata 14, PRPD relives the performance specialist from some skew related
problems, but the Teradata optimizer needs to know about biased values, hence
it is important to keep the statistics fresh.