PDW Shallow Dive – Part 2 – ETL

Welcome to part two of my shallow dive into the PDW. In this section we take a look at the options available for ETL with the appliance. Here we will cover the following topics:

  1. DWLoader.
  2. SSIS and PDW.

DWLoader

The DWLoader application is a bit like the PDW version of BCP: it is a command line application built for very fast import and export of logically structured flat files. The tool can be run from a script or the command line to import flat files to the appliance or to export data from the appliance to flat files. Example syntax of the DWLoader application:

dwloader.exe -i ${loadfile} -m -fh 1 -M {mode} -b 100000 -rt value -rv 1000 -R ${loadfile}.rejects -E -e ASCII -t , -r \r\n -S $server -U $user -P $password -T {targetdatabase}${table}

Key switches:

  1. -i – the location of one or more source files to load.
  2. -M – specifies the load mode: fastappend, append, upsert or reload. The default mode is append. To use fastappend you must also specify -m, the multi-transaction option. The key difference between append and fastappend is that the latter doesn’t use a temporary table.
  3. -fh – the number of lines to ignore at the beginning of the file (i.e. header rows).
  4. -b – sets the batch size; the default is 10,000 rows.
  5. -rv – specifies either the number of rows or the percentage of data allowed to fail before the load halts (-rt sets whether this is a value or a percentage).
  6. -R – the load failure file; if the file already exists it will be overwritten.
  7. -E – converts empty strings to NULL. The default is not to convert.
  8. -e – the character encoding; ASCII is the default, and the other options are UTF8, UTF16 or UTF16BE.
  9. -t – specifies the field delimiter.
  10. -r – specifies the row delimiter.
  11. -S – sets the target appliance, normally by IP address.
  12. -U and -P – the username and password for the appliance (until the next PDW update, PDW only supports SQL security, so this would be your SQL username and password).
  13. -T – the three part name of the destination table, e.g. dbname.dbo.table1 (dbo is the only schema supported in the current version of PDW).
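
When loads are scripted, it can help to assemble the argument list in code rather than by string concatenation. The following is a minimal Python sketch that uses only the switches described above; the function name build_dwloader_args, its parameters and their defaults are hypothetical, and it assumes dwloader.exe can be found on the PATH of the machine running the script.

```python
def build_dwloader_args(load_file, target_table, server, user, password,
                        mode="append", header_rows=1, batch_size=10000,
                        reject_value=1000, field_delim=",",
                        row_delim="\\r\\n", encoding="ASCII"):
    """Assemble a dwloader argument list from the switches listed above.

    Hypothetical helper: names and defaults are illustrative only.
    """
    args = [
        "dwloader.exe",
        "-i", load_file,
        "-T", target_table,            # three part name, e.g. dbname.dbo.table1
        "-S", server,
        "-U", user,
        "-P", password,
        "-M", mode,
        "-fh", str(header_rows),
        "-b", str(batch_size),
        "-rt", "value",                # treat -rv as a row count
        "-rv", str(reject_value),
        "-R", load_file + ".rejects",
        "-e", encoding,
        "-t", field_delim,
        "-r", row_delim,               # passed as the literal text \r\n
    ]
    if mode == "fastappend":
        # fastappend requires the multi-transaction switch.
        args.append("-m")
    return args


example = build_dwloader_args("sales.csv", "dbname.dbo.table1",
                              "10.0.0.1", "loader", "secret",
                              mode="fastappend")
```

The resulting list can be handed to subprocess.run(example, check=True) so that no shell quoting is involved.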

Once the import or export has started there is not much to see; however, you can track progress through the PDW dashboard.

A common problem with the DWLoader tool is that it can’t work out the encoding of the flat file (e.g. Unicode or UTF-8) for itself. If you aren’t sure about the encoding, open the file in the SSIS Import/Export wizard, which will let you work it out.
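
As an alternative first check, the byte order mark at the start of a file often reveals its encoding. The sketch below is illustrative only: the function name guess_dwloader_encoding is hypothetical, the mapping of a little-endian mark to the UTF16 option is an assumption, and a file with no mark at all cannot be classified this way.

```python
import codecs


def guess_dwloader_encoding(path):
    """Guess a value for the -e switch from the file's byte order mark.

    Returns None when there is no mark, because ASCII and unmarked
    UTF-8 cannot be told apart from the first bytes alone.
    UTF-32 files are not handled.
    """
    with open(path, "rb") as handle:
        head = handle.read(4)
    if head.startswith(codecs.BOM_UTF8):
        return "UTF8"
    if head.startswith(codecs.BOM_UTF16_LE):
        return "UTF16"      # assumption: UTF16 means little-endian
    if head.startswith(codecs.BOM_UTF16_BE):
        return "UTF16BE"
    return None
```

A None result means the file needs a closer look, for example in the wizard mentioned above.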

SSIS and PDW

First let’s remind ourselves where SSIS fits in the PDW world:

The important point the diagram shows is that SSIS runs on the landing zone server, NOT on the PDW directly. This is critical: whereas in traditional ETL solutions you look to SSIS to manage all of the transformations, when you have a PDW appliance as your database you will want to do that work on the appliance itself. It also limits the memory pressure caused by pulling large data volumes into the SSIS data flow. It is much better to use SSIS as the control flow for the ETL solution and to do the advanced transformations of the data on the PDW.

To use SSIS with the PDW you use the same tools and processes as you would against normal SQL Server. Visual Studio with SSDT is used to create an SSIS project. Within the project you can specify a source adapter that is an OLE DB source pointing to the PDW; however, you will see that SSIS thinks this is a normal SQL Server source, because the icons for the tables do not show as distributed or replicated, just as tables. This is by design. It is good practice to build your source query as a stored procedure on the PDW so that you can test it before using it in the SSIS project. The big difference in SSIS is the destination adapter, which is PDW specific; an example is shown in the diagram below:

When using the PDW destination adapter it is critical to make sure the source columns match the data types of the destination columns. The adapter is not flexible or forgiving like other destination adapters in SSIS (e.g. SQL Server Destination) and won’t do any automatic type conversion. Important settings when using the PDW destination adapter include:

  • Connection Manager – you will need to create a new, PDW specific, connection manager:
    • Servername will be the appliance name and port number or the appliance IP and port number.
    • User and password will be your SQL Server PDW login information.
    • Destination database name is the target database on the PDW.
    • Staging database name is the stage database you want to use; if using FastAppend this must be empty.
  • Destination Table – the table you wish to load the data into.
  • Loading Mode – if the loading mode is fast append make sure you un-select “roll-back load on table update or insert failure”
    • By unchecking this box, loads from the staging table to a distributed destination table are performed in parallel across the distributions; this is quick but not transaction safe. It is the same as using the -m switch in DWLoader.
    • Checking the box loads data serially across distributions within each compute node and in parallel across the compute nodes. This is slower but is transaction safe.
  • Finally you can map your columns from input to destination – HOWEVER MAKE SURE YOU MATCH DATA TYPES!
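
Because the adapter will not convert types for you, a pre-flight comparison of column types can catch mismatches before a package is run. The following Python sketch is a hypothetical helper, not part of SSIS or PDW; it assumes you have already collected the declared types of the source and destination columns into two dictionaries, and it compares them as plain text.

```python
def find_type_mismatches(source_columns, destination_columns):
    """Return (column, source_type, destination_type) for every mismatch.

    Both arguments map column name -> declared type, e.g. "int" or
    "varchar(50)". Types are compared as case-insensitive text; a column
    missing from the destination is reported with destination_type None.
    """
    mismatches = []
    for name, source_type in source_columns.items():
        destination_type = destination_columns.get(name)
        same = (destination_type is not None and
                source_type.strip().lower() == destination_type.strip().lower())
        if not same:
            mismatches.append((name, source_type, destination_type))
    return mismatches


source = {"SaleId": "int", "Amount": "decimal(18,2)", "Note": "varchar(50)"}
destination = {"SaleId": "INT", "Amount": "money", "Note": "varchar(50)"}
problems = find_type_mismatches(source, destination)
```

Here problems contains a single entry for Amount, the only column whose declared types differ.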

Below is an example of the package running to import a simple file:

You can then follow the performance of the load through the PDW dashboard just as with the DWLoader:

In summary, when dealing with ETL on the PDW, push as much of the data transformation to the PDW as possible. Use tools not normally available in SQL Server or SSIS, such as CTAS (Create Table As Select), and try to limit the use of the Update statement. SSIS works by pulling data from the data source adapter into memory on the SSIS server and processing it there; understanding this is critical to managing the memory use and performance of SSIS.

My suggestion for using SSIS with PDW is to use it for the control flow and for any basic data preparation of types not supported on the appliance, for example the XML data type. Then use the power of the PDW to do the advanced transformations. In a recent POC for a customer we loaded around 300 GB of data in less than one hour using DWLoader and then transformed 2 million XML documents into a relational data warehouse using SSIS and PDW stored procedures. We did hit an issue with the lack of Merge statement support in PDW, so to upsert into a dimension table we created a “mega-merge” statement using CTAS. In Part 3 of this series I will cover how that worked and also how partitions are managed on the PDW.
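
To give a flavour of how an upsert can be expressed as a create-table-as-select rather than a Merge, here is a small runnable Python sketch. It is an illustration under stated assumptions, not the mega-merge itself: SQLite stands in for the appliance, the table and column names are made up, and on the PDW the statement would also specify how the new table is distributed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (customer_id INTEGER, name TEXT);
    CREATE TABLE stg_customer (customer_id INTEGER, name TEXT);
    INSERT INTO dim_customer VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO stg_customer VALUES (2, 'Robert'), (3, 'Cy');

    -- Build the merged table in one set-based statement: every staged
    -- row wins, plus existing rows that have no staged replacement.
    CREATE TABLE dim_customer_new AS
        SELECT customer_id, name FROM stg_customer
        UNION ALL
        SELECT d.customer_id, d.name
        FROM dim_customer AS d
        WHERE NOT EXISTS (SELECT 1 FROM stg_customer AS s
                          WHERE s.customer_id = d.customer_id);

    -- Swap the new table in place of the old one.
    DROP TABLE dim_customer;
    ALTER TABLE dim_customer_new RENAME TO dim_customer;
""")

rows = conn.execute(
    "SELECT customer_id, name FROM dim_customer ORDER BY customer_id"
).fetchall()
print(rows)  # [(1, 'Ann'), (2, 'Robert'), (3, 'Cy')]
```

Customer 1 is carried over untouched, customer 2 is updated and customer 3 is inserted, all without an Update or Merge statement.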

If you have any comments or questions please feel free to post them here for me or tweet me at @mattasimpson

PDW Shallow Dive – Part 1

There are very few resources on the internet for PDW, so I wanted to do a short series that focuses on the basics. I am not going into deep detail on the parallel engine or the commercial aspects; instead I want to focus on what PDW is, where it sits in the Microsoft BI stack and what I found particularly useful as a Microsoft Business Intelligence specialist.

In part 1 of the series we will look at the following topics:

  1. Introduction to the PDW.
  2. PDW Exploration.

Introduction to PDW

This is not a sales pitch, but if you are reading this blog you probably already see the value in an eminently scalable and highly performant data warehousing appliance. As a rough idea of cost, the current base offering from Microsoft comes in at around the cost of a high performance Fast Track Data Warehouse with tin and software licenses; for more information reach out to eddie@coeo.com, who can help with any commercial questions around the PDW.

Ok so what is it?! Ultimately the PDW is an appliance – that’s it hope you liked it…

… Just kidding! It is an appliance that uses SQL Server 2012 (from v2). The appliance is highly scalable, using SAN-like technology and virtualisation software for redundancy. It can run on either Dell or HP hardware; both, and any future hardware providers, have to adhere to strict Microsoft guidelines and deliver performance to a set level. In that way it is not dissimilar to the Fast Track Data Warehouse. In fact Microsoft learnt from all the work that went into recommending specific, optimum hardware for Fast Track and built on that with the PDW appliance.

How does it differ from SMP SQL? Well with a single instance of SQL Server you get one buffer, one space for all your user requests (queries) to go through. This means that if one query is requested that user gets the whole buffer and it’s quick! However the more users requesting results or the more complex the queries then the buffer soon fills and users start having to wait.

With the PDW you get multiple instances of SQL Server working in parallel, meaning multiple buffers! User traffic can be managed by the management node to optimise performance. This MPP (Massively Parallel Processing) platform offers scalability, high concurrency, support for complex workloads and redundancy in a single appliance:

With the diagram example above we would have 6 SQL buffers to work with in parallel! The appliance comes in a single rack with a control node, a management node and a number of SQL Server instances (depending on which variant of the appliance you buy); there is also a redundant node for failover should one of your nodes fail. The DMS is the Data Movement Service, which controls how data is stored and moved across the various instances. More on this later.

PDW Exploration

Ok so you know why you want one, let’s assume you get one, how do you go about developing on it? The first thing to note is that Microsoft has tried to make sure all the complexities of MPP are hidden from us. Ultimately we use our existing SQL Server and BI skills to develop against the appliance. To highlight this the graphic below shows that, once connected, the PDW is just seen as a slightly different version of SQL Server. It is good practice to use Visual Studio (and I find 2012 best at the moment) to work with the PDW.

Now you don’t access the PDW directly on the management or control node. In your rack you will also have an SMP instance of SQL Server running on a Windows server; this is commonly called the landing zone or landing server. A typical architecture for a PDW rack may be:

To start work on the PDW we connect to the landing zone and then use SQL Server Data Tools (SSDT) in Visual Studio to connect to our server. For example it may look like:

The next important concept is how you will store your data. There are two ways to store it. The first is distributed, where the data is spread across all the SQL Server nodes. The second is replicated, where the data is copied to each SQL Server node. Because SQL Server still needs all the data on a single node to return the final results, it uses the DMS (Data Movement Service) to move data around as needed to complete a query. For smaller tables (generally dimension tables) it is best to use replicated so that less DMS work is needed. For large fact tables, however, it is much better to distribute the data so the PDW can run the query across multiple nodes, use the processing power of each, and then pull the partial results together to present the final result. The diagram below shows a distributed fact table; by looking at the icon of the table in the SQL Server Object Explorer you can also see what is replicated and what is distributed:

With distributed tables the important option is the key on which you distribute the data; above you can see we use OnlineSalesKey. To make the right choice it is critical to understand the data and how it is likely to be used, although the key can be changed very quickly. This option has the biggest bearing on the performance of a distributed table. The most important thing is to use a key that spreads data volumes evenly and is frequently used when querying the data. You will also notice above that we use the Clustered ColumnStore Index (which is writeable) and that we partition the table; we will talk more about both of these options in later blog posts.
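
The effect of the distribution key on balance can be illustrated with a toy model. The Python sketch below is not how the appliance works internally: the hash function, the node count of 4 and the StoreKey column are assumptions chosen purely for illustration. It contrasts a high-cardinality key with a key that has only two distinct values, and shows what replication means by comparison.

```python
import hashlib


def distribute(rows, key, node_count):
    """Assign each row to one node by hashing its distribution key.

    SHA-256 is a stand-in so the result is repeatable; it is not the
    hash function the appliance uses.
    """
    nodes = [[] for _ in range(node_count)]
    for row in rows:
        digest = hashlib.sha256(str(row[key]).encode("utf-8")).digest()
        slot = int.from_bytes(digest[:4], "big") % node_count
        nodes[slot].append(row)
    return nodes


def replicate(rows, node_count):
    """Give every node its own full copy of the rows."""
    return [list(rows) for _ in range(node_count)]


sales = [{"OnlineSalesKey": i, "StoreKey": i % 2} for i in range(1000)]

by_sales_key = [len(n) for n in distribute(sales, "OnlineSalesKey", 4)]
by_store_key = [len(n) for n in distribute(sales, "StoreKey", 4)]
copies = [len(n) for n in replicate(sales, 4)]

print("distributed on OnlineSalesKey:", by_sales_key)
print("distributed on StoreKey:      ", by_store_key)
print("replicated:                   ", copies)
```

With only two distinct StoreKey values, at most two of the four nodes can ever receive rows, so at least half of the nodes sit idle during a query; a key with many distinct values gives the hash a chance to spread rows across every node.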

The second option is replicated and you can see an example of a replicated table below:

This makes our date dimension table replicated on all of the SQL Server nodes, which means less DMS work and so faster queries. Changing between the two modes is easy using a PDW command called CTAS (Create Table As Select), which lets you quickly move data into a newly defined table; more on CTAS in later blog posts.

Finally in this section I wanted to show the central management portal for the PDW. This is a dashboard that can be accessed via the browser, and an example is below:

From this dashboard you will be notified of any potential issues with the appliance; note the red exclamation mark in the health section. It is also where you can look at query plans in detail to see how the PDW engine is managing a query, and it is a good place to monitor data loads, which we talk about in detail in part 2 of this series. Finally you can use the performance monitor to see what is happening in real time on your appliance. Below is an example of the performance monitor as I ran a Select * from dimFactSales, which contains 10 million records. As the table is distributed you can see that all the SQL Server nodes are working at roughly the same time and at roughly the same levels:

So overall I hope I have shown you how relatively simple the PDW is. Under the covers we have a huge amount of impressive hardware and software that is optimised purely for data warehouse loads, yet above the covers we have our tried and tested SQL Server interface. My final thought in this part of the series is that with the PDW you have to slightly rethink your traditional BI strategy. For example, with SSIS you don’t want to pull huge volumes of data from the PDW, manage them in the SSIS pipeline and then push them back onto the PDW; it is now much better to use SSIS as a control flow and let the PDW do more of the work. Also, given the lack of a Merge statement and the generally ordinary performance of update statements, it is critical to get used to leveraging the power of the PDW through the CTAS statement. In part 3 of this series I will expand on our mega-merge process, which makes full use of the CTAS statement to update a dimension table with millions of records in seconds. I hope you like the start of this series; if you have any specific questions please leave a comment and I will endeavour to answer as soon as I can.

Microsoft gets serious about Big Data

Everyone is talking about Big Data. What it is, how important it is and why it should be part of your strategy or roadmap. But what does it really mean? I can only talk from my experiences with various customers and what it means for them. For some it means large volumes of difficult to use data: weblogs, social media streams or XML documents, perhaps all in different formats or from different suppliers. Whilst the business is desperate to extract customer insight, understand brand perception or see how location affects sales, the BI team are struggling to make use or sense of this voluminous, complex data. For many other customers it simply means a traditional data warehouse that has just got too big!

Currently people are attempting to tackle big data in two ways. Firstly, a lot of businesses are starting to adjust the way they deliver their BI projects to be more agile. Business users were complaining that the BI system was slow and that enhancement requests were put on such a huge backlog that, by the time they were delivered, the need had passed. So BI teams are adopting agile methodologies such as Kanban, not only to deliver value more regularly but to tie those deliveries back to clear business need. Secondly, businesses are dabbling in the open source world of Hadoop, HDFS, MapReduce etc. Often this leads to leaning on traditional development to hand craft bespoke components to extract any value from the data. In my opinion this is giving people a taste of what is possible with big data, but it is proving immensely difficult in terms of the skills needed to deliver something production ready. I liken it to writing a Shakespeare size play full of code that only a few people really grasp but everyone thinks is cool. If you have a theatre full of advanced developers and can cope with the risk of using bespoke code and open source components then this is an option.

For those of us with simpler desires, welcome to the Microsoft Parallel Data Warehouse (PDW), Microsoft’s premium data warehousing appliance. It is eminently scalable and performance is literally off the chart.

Figure 1 – PDW versus SMP SQL across 8 real life customer DWH queries

Figure 2 – PDW Performance Dashboard highlighting all SQL nodes being used to run a query

The PDW offers the ability to spread data across multiple instances of SQL Server 2012. It offers blistering performance and the ability to rethink how you model data, and it gives you the tools to make your current data warehouse techniques more efficient. The appliance is easy to use, with the complexity of parallelism abstracted behind the familiar SQL Server interface. Using the existing MS BI stack you can migrate your existing warehouse quickly and deliver a scalable platform that can grow up to 7 racks full of powerful Dell or HP hardware. Support is within a 4 hour window and can come through the hardware provider or from Microsoft directly.

The pièce de résistance, however, is the ability to co-host a Hadoop cluster in the PDW appliance, so you can store and query your unstructured data alongside your traditional data warehouse. On top of that, by using Microsoft’s PolyBase technology you will be able to build T-SQL queries that join up your structured and unstructured data. The great thing here is that Microsoft has hidden all the complexity of adding data and building queries across it, so all you need to do is grab all the invaluable big data sources you have, load them onto your PDW appliance, and bob’s your uncle and customer insight is your aunt!