PDW Shallow Dive – Part 3

Welcome to part three of the PDW Shallow Dive. In this section I am going to explore partitioning in PDW: we will look at how to partition a table on the appliance and how to use partition switching. I had also hoped to look at the CTAS statement, and specifically the mega merge statement, but due to time constraints I will cover that in a fourth part of the series.

Introducing Partitioning on the PDW

Partitioning is used to help manage large tables. When you partition a table you track logical subsets of rows with metadata. Unlike SMP SQL Server there is no partition function or partition scheme; you simply define the partitioning when you create the table, or use ALTER TABLE to add partitioning to an existing table, by selecting which column to partition by and setting your boundary values. This does not affect which distribution or compute node the data is stored on. Rather, it allows you to manage the table in these subsets of data, for example partitioning by month and then archiving an older month by switching its partition to an archive table.

Let’s look at a simple example of a partitioned fact table; the code looks like this:

CREATE TABLE [dbo].[FactSales] (
[OnlineSalesKey] bigint NULL,
[DateKey] datetime NULL,
[StoreKey] int NULL,
[ProductKey] int NULL,
[PromotionKey] int NULL,
[CurrencyKey] int NULL,
[CustomerKey] int NULL,
[SalesOrderNumber] varchar(28) COLLATE Latin1_General_100_CI_AS_KS_WS NULL,
[SalesOrderLineNumber] int NULL,
[SalesQuantity] int NULL,
[SalesAmount] money NULL,
[ReturnQuantity] int NULL,
[ReturnAmount] money NULL,
[DiscountQuantity] int NULL,
[DiscountAmount] money NULL,
[TotalCost] money NULL,
[UnitCost] money NULL,
[UnitPrice] money NULL,
[ETLLoadID] int NULL,
[LoadDate] datetime NULL,
[UpdateDate] datetime NULL
)
WITH (CLUSTERED COLUMNSTORE INDEX, DISTRIBUTION = HASH([OnlineSalesKey]),
PARTITION ([DateKey] RANGE LEFT FOR VALUES ('Dec 31 2006 12:00AM', 'Dec 31 2007 12:00AM', 'Dec 31 2008 12:00AM', 'Dec 31 2009 12:00AM', 'Dec 31 2010 12:00AM', 'Dec 31 2011 12:00AM')));

It doesn’t have to be dates that are used to partition tables, although this is the most common option; you can also partition tables by set values such as product categories or sales channels. Understanding the data you are working with, and how it will be analysed, is critical to setting up partitioning well. The key options when setting up partitioning are:

  • The column – as mentioned before, it is critical to identify the best column to partition your data by; in the example above we use the [DateKey] column.
  • Range – this specifies which side of each boundary value a partition takes. It defaults to Left (lower values) but you can choose Right (higher values); in the example above we use the default, Left. This means each value specified is the last value in its partition, e.g. our first partition will hold all data with a date on or before Dec 31 2006 12:00AM. (A RANGE RIGHT sketch follows this list.)
  • The boundary values – the values in the list that you partition the table by. They cannot be NULL and must match, or be implicitly convertible to, the data type of the partitioning column.
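
To make the Left/Right distinction concrete, below is a minimal sketch using RANGE RIGHT. It is not part of the original FactSales example: the table name and the trimmed-down column list are mine, and only the partition clause matters. With RANGE RIGHT each boundary value becomes the first value of the partition to its right, so the first partition holds everything before Jan 1 2007 12:00AM.

CREATE TABLE [dbo].[FactSales_RangeRight] (
[OnlineSalesKey] bigint NULL,
[DateKey] datetime NULL,
[SalesAmount] money NULL
)
WITH (CLUSTERED COLUMNSTORE INDEX, DISTRIBUTION = HASH([OnlineSalesKey]),
-- RANGE RIGHT: each boundary value starts a new partition rather than ending one
PARTITION ([DateKey] RANGE RIGHT FOR VALUES ('Jan 1 2007 12:00AM', 'Jan 1 2008 12:00AM', 'Jan 1 2009 12:00AM')));

With these boundaries the split per calendar year is the same as the RANGE LEFT example above; the boundary row simply sits at the start of each partition instead of the end.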

Partitioning by example

In this section we will go step by step through a basic partitioning example. We create three tables: a main table partitioned on a part_id column, a second table used for archiving data out of the main table, and finally a staging table that we will use to move data into the main table.

  1. Create our factTableSample table with a partitioning option, using a part_id column:

    create table factTableSample
    (
        id int not null,
        col1 varchar(50),
        part_id int not null
    )
    with
    (
        distribution = replicate,
        partition (part_id range left for values (1, 2, 3, 4, 5))
    );

  2. Now we will insert some basic data into the table:

    insert into factTableSample
    select 1 id, 'row1' col1, 1 part_id
    union all
    select 2 id, 'row2' col1, 2 part_id
    union all
    select 3 id, 'row3' col1, 3 part_id
    union all
    select 4 id, 'row4' col1, 4 part_id
    union all
    select 5 id, 'row5' col1, 5 part_id;

    Below is what those rows then look like:


  3. Let’s take a look at how the data is stored in that table and how many rows are in each partition:

    -- Check partitions and the number of rows
    SELECT o.name, pnp.index_id, pnp.partition_id, pnp.rows,
           pnp.data_compression_desc, pnp.pdw_node_id
    FROM sys.pdw_nodes_partitions AS pnp
    JOIN sys.pdw_nodes_tables AS NTables
        ON pnp.object_id = NTables.object_id
        AND pnp.pdw_node_id = NTables.pdw_node_id
    JOIN sys.pdw_table_mappings AS TMap
        ON NTables.name = TMap.physical_name
    JOIN sys.objects AS o
        ON TMap.object_id = o.object_id
    WHERE o.name = 'factTableSample'
    ORDER BY o.name, pnp.pdw_node_id, pnp.index_id, pnp.partition_id;

    The results show that all but the final partition have one row in them. However, you will also notice that 24 rows are returned. This is because we created a replicated table, so each of the 6 partitions appears once per node across our 4 nodes (6 x 4 = 24). For more information on replicated tables review part 1 of the series.


  4. Now we will create an archive table to which we can move our older data:

    create table factTableSample_Archive
    (
        id int not null,
        col1 varchar(50),
        part_id int not null
    )
    with (distribution = replicate);

  5. Now we can switch out partition 5 to the archive table and look at the results:

    alter table factTableSample switch partition 5 to factTableSample_Archive;


    We can clearly see that the 5th partition has been moved from the main fact table (factTableSample) to the archive table (factTableSample_Archive).

  6. Next we create a new table that we will use to load staging data into:

    create table factTableStage
    with
    (
        distribution = replicate,
        partition (part_id range left for values (1, 2, 3, 4, 5))
    )
    as
    select 1 id,
           cast('row6' as varchar(50)) col1,
           5 part_id;

This adds a single row into partition 5:


  7. Now let’s switch the stage data into partition 5 of the main fact table and look at the results:

    alter table factTableStage switch partition 5 to factTableSample partition 5;


We see that partition 5 is now populated with the data from the stage table.
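
To confirm the result without reading the 24 per-node rows, the earlier catalog query can be collapsed to one line per partition. This is a sketch rather than part of the original walkthrough: it assumes sys.pdw_nodes_partitions exposes a partition_number column (as sys.partitions does), and because these tables are replicated, every node holds an identical copy, so MAX(rows) per partition gives the logical row count.

-- Logical rows per partition, collapsing the replicated per-node copies
-- (partition_number is assumed to exist on sys.pdw_nodes_partitions)
SELECT o.name, pnp.partition_number, MAX(pnp.rows) AS logical_rows
FROM sys.pdw_nodes_partitions AS pnp
JOIN sys.pdw_nodes_tables AS NTables
    ON pnp.object_id = NTables.object_id
    AND pnp.pdw_node_id = NTables.pdw_node_id
JOIN sys.pdw_table_mappings AS TMap
    ON NTables.name = TMap.physical_name
JOIN sys.objects AS o
    ON TMap.object_id = o.object_id
WHERE o.name IN ('factTableSample', 'factTableSample_Archive', 'factTableStage')
GROUP BY o.name, pnp.partition_number
ORDER BY o.name, pnp.partition_number;

Run after step 7, this should show the archive table holding the row switched out in step 5 and partition 5 of factTableSample holding the 'row6' row switched in from the stage table.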

To summarise, using partitioning on the PDW is not overly complex, and the alter table statements used to switch the partitions are very fast. If we had used distributed tables in the above examples it would have had no effect on any of the code we used (see the sketch below). Just like with SMP SQL Server, it is important to plan your partitioning strategy in line with the analysis requirements and the distribution of data in the data warehouse. In the next part of the series I will dig into the CTAS (create table as select) statement and show an example of using it to perform a merge into a dimension table. I hope you find the above useful; if you want any code examples please feel free to email me at matt@coeo.com and I will send you a copy of the full script.
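
As a footnote to the point about distributed tables, here is a hedged sketch (the table name factTableSample_Distributed is mine) of the sample table created as a hash-distributed table rather than a replicated one. Only the DISTRIBUTION option changes; the partition clause and the alter table switch statements above would be written in exactly the same way.

-- Same sample table, but hash-distributed instead of replicated
create table factTableSample_Distributed
(id int not null, col1 varchar(50), part_id int not null)
with
(
    distribution = hash(id),
    partition (part_id range left for values (1, 2, 3, 4, 5))
);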


Building BI Projects with Agility

Let me introduce you to an interesting story from the foreword of Mike Cohn’s book “Succeeding with Agile: Software Development with Scrum”. In the story, trappers would stop off at the Hudson Bay store to get all their supplies before setting off into the wilderness to make their fortune. However, once they had done their shopping at the supply store they would go only a few miles and then set up camp. The idea was that it was better to identify what they had missed on that initial shop and only have to go back a few miles, rather than be lost in the wilderness without an important piece of equipment. Although most trappers, like most consultants, feel their planning is brilliant, the really good ones prepare in a way that means they don’t end up frozen in the Rockies!

I like this analogy for projects, but does it really resonate with BI? For me it certainly does. On an older BI project we had units of measure (UOMs) to handle in our solution. Applying a Kimball design pattern, I asked the company whether there really were only 3 UOMs they wanted to handle. The reply was a confident yes, with confirmation that they had not reported on more than three in the last 5 years. Great! So based on this confident feedback I built the UOMs into columns in the fact table(s) rather than applying a UOM dimension. I created MDX for time intelligence for each date hierarchy and for each UOM; for three UOMs this wasn’t a hardship.

However, a few months later the customer requested my time to work on an enhancement request to add another 5 UOMs. Unfortunately, the nature of consultancy “time & materials” means the customer doesn’t want to spend time (and therefore money) on redesigning the core solution to include a UOM dimension. So I was left in MDX script hell, not with the usual complexities of writing MDX but with copy and paste, and the longest MDX script in the western world (there must be longer ones in China, with the thousands of characters in their writing system!).

If we had been more agile and delivered an earlier version of this cube (as this was the deliverable) then perhaps the business would have seen how easy things now were with these UOMs and suggested they would like to slice the data by more. We then would have had time to refactor the design and implement our UOM dimension.

BUT not all projects fall easily into a completely agile process. For example, did the trappers go to the Hudson Bay store, buy just a sleeping bag (forgive me if these weren’t available in 1890), head off to camp to check it worked, buy a kettle on day two, and then realise on day three that they needed some tea bags? I don’t think so. Yet many software developers who talk of pure Agile say that there is value in delivering a web page with a username text box, no password box and no logon button. I don’t agree; I would argue that there is value in QA and UAT at that point, but NO value in that delivery to the business.

With BI projects there may be some value, depending on the business requirements, in delivering a product dimension and letting the business see it; I certainly found this with retailers, who could then identify missing attributes and start to design the hierarchies they wanted, and with customers that had a customer dimension for the first time. Of course there was value in applying a new business definition of a unique customer and then seeing that list of customers for the first time.

However, the key to both of these examples is that the biggest business value for these customers came from looking at sales orders over time for these products, or from identifying how customer recency affected the actions of this new list of customers. To deliver the biggest value to the business we have to make sure we have the tent, sleeping bag, cooking stove and, at the very least, a kettle to boil water: everything needed for a basic camping trip in the Rockies (I am not a camper). For me this is the high-level design of the BI project, and I suggest that to deliver any value to the business the high-level design needs to include everything you know you need at the start of the project. On later camping trips it is likely you will have more experience, so the initial design, or shopping trip, will be more detailed and there will be fewer trips back to the store.

From that point on, continually delivering value based on business priority and need is achievable. Breaking complex tasks down into work items and grouping these into sprints works really well, but not if a sprint is just a bundle of unrelated work items. It is much better for a sprint to focus on a group of items that, together, deliver something valuable to the business. To do this you must be interacting constantly with the business. In a recent project we had an excellent MI team who helpfully worked with us to group work items under a set of business headings, all relating to a key area that would add value to the team and therefore the business. We could then focus sprints around these headings and, once each was deployed, the MI team could start to see value in the BI project.

But the key on a new project is that without that high-level design and that first big effort of choosing what you need, building it and then heading out into the wilderness, you are simply going camping with just a sleeping bag and will make more trips to the supply store than are really needed. Get good requirements from the right people (not IT), build as comprehensive a design as your experience and the requirements allow, and then deliver the first cut for business review as early as possible. To help with this I often try not to worry about control processes, partitioning strategies, incremental ETL loads and so on; I move those to later stages of the build so we can focus purely on delivering that first cut to the business.

In summary, I do not think that a single “pure” process has been required across all my projects. But being more agile, by using best practices from across the various methodologies, has helped me become more and more consistent in delivering high-business-value solutions to my customers. And just as evolving your development processes helps improve delivery, always look to learn from each project to help evolve your way of working; for me this is one of the best things being Agile teaches us.