Microsoft BI – 2015 Highlights

It's been a great year for BI! Power BI has come of age, we've seen exciting SQL Server 2016 CTP releases, and the cloud has matured for analytics, data science and big data.

For me Power BI is the biggest news of 2015. POCs run in H1 of 2015 found it wanting: basic functionality was missing, and the confusion of wrapping it in Office 365 made it too much for businesses to consider. However, with the GA release and the numerous updates since, it has finally delivered on its vision and given Microsoft an end-to-end, enterprise BI solution for the first time in its history; including multidimensional connectivity!

Microsoft also made some great tactical manoeuvres, including the purchase of Datazen and Revolution Analytics (R), as well as their excellent Data Culture series. Datazen is a good tool in its own right, with great dashboard creation capability and impressive mobile delivery on all devices/platforms. It will integrate nicely with SSRS to deliver a modern reporting experience via mobile in SQL Server 2016. R is the buzz of 2015, a great statistical analysis tool that will really enhance SQL Server as the platform of choice for analytics as well as RDBMS workloads. In fact, you can already leverage its capability in Power BI today!

Cloud. Microsoft finally realised that trying to drag businesses into the cloud was not the correct strategy; a hybrid approach is what is required. Give businesses the best of both worlds: allow them to benefit from their existing investments but "burst" into the cloud either for scale or for new, as yet untested, capability. SQL Server 2014's ability to store some data files in Azure, perhaps old data kept purely for compliance, is a great example of this. ExpressRoute's ability to offer a fast connection between on-premises and cloud is brilliant. Or go and experiment with Machine Learning, made simple by Microsoft's Azure offering.

I was also excited to see the PDW hit the cloud with Azure SQL Data Warehouse. An MPP platform is the closest my customers have needed to get to Big Data, but the initial outlay of circa half a million quid was a bit steep. With the cloud offering, companies get all the benefits with a minimal investment and a near-infinite ability to scale. But do consider the speed of making data available, as it could be limited by Internet connections.

So, in summary, an awesome year for Microsoft BI, with the future looking great! I still feel Microsoft lacks SSAS in the cloud, but perhaps Power BI will gain that scale in 2016. Overall I envisage Microsoft as a strong leader in the next Gartner Magic Quadrant release for BI, and I can't wait for SQL Server 2016's full release!

The future (2016 at least) is bright, the future is hybrid cloud…

Figure – MS BI Current World

APS (PDW) – Extracting Load and Query Stats


Hi, this is a short blog post that may be useful to users of the PDW who want a full list of load statistics and query statistics. The main place to get statistics is the PDW dashboard, but in a lot of cases this is not enough. It is even worse if best practice has not been implemented and labels are not used for queries; then the dashboard becomes less use than a chocolate teapot.
So, in order to extract load information from the APS, the following query is rather useful. Note that this will only pull back information on backups, restores and loads; if you loaded data using "insert into", for example, that information would not show in the results.

SELECT
    r.[run_id], r.[name], r.[submit_time], r.[start_time], r.[end_time],
    r.[total_elapsed_time], r.[operation_type], r.[mode], r.[database_name],
    r.[table_name], l.[name], r.[session_id], r.[request_id], r.[status],
    r.[progress],
    CASE WHEN r.[command] IS NULL THEN q.[command] ELSE r.[command] END AS [command],
    r.[rows_processed], r.[rows_rejected], r.[rows_inserted]
FROM sys.pdw_loader_backup_runs r
JOIN sys.sql_logins l ON r.principal_id = l.principal_id
LEFT OUTER JOIN sys.dm_pdw_exec_requests q ON r.[request_id] = q.[request_id]
WHERE r.[operation_type] = 'LOAD'
--AND l.[name] = 'someusername'
ORDER BY CASE UPPER(r.[status])
             WHEN 'RUNNING' THEN 0
             WHEN 'QUEUED' THEN 1
             ELSE 2
         END ASC,
         ISNULL(r.[submit_time], SYSDATETIME()) DESC
OPTION (label = 'Rd_DataLoads')

Note that in the preceding statement a line is commented out. This line can be used to find loads completed by a specific user. The DMV for loads, sys.pdw_loader_backup_runs, stores all loads over time and persists after a region restart. Again, best practice should be in place where users log into the PDW with their own user (or Windows authentication if possible), NOT sa! Finally, note the use of labels:

OPTION (label = 'some comment in here')
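If you adopt a consistent prefix for your labels, finding specific workloads later becomes a simple filter on the label column. Here is a minimal sketch of that idea; the 'Rd_' prefix is just an illustration, not a convention from the original post:

-- Hedged sketch: find recent requests tagged with a given label prefix.
SELECT q.[request_id], q.[status], q.[submit_time], q.[total_elapsed_time], q.[label]
FROM sys.dm_pdw_exec_requests q
WHERE q.[label] LIKE 'Rd_%'
ORDER BY q.[submit_time] DESC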

Labels can then be used either in the queries you run on this page or even through the dashboard, where you can see a column for labels. I would recommend your team apply some naming conventions or standards for labelling. The next query below is useful for pulling back a list of all queries that have been executed on the APS:

SELECT q.[request_id], q.[status], q.[submit_time] AS start_time, q.[end_time],
       q.[total_elapsed_time], q.[command], q.[error_id], q.[session_id],
       s.[login_name], q.[label]
FROM sys.dm_pdw_exec_requests q
INNER JOIN sys.dm_pdw_exec_sessions s ON s.[session_id] = q.[session_id]
WHERE LEFT(s.client_id, 9) <> '127.0.0.1'
ORDER BY [start_time] DESC
OPTION (label = 'Rd_Query_history')

This will give you a list of all queries performed by all users of the APS. However, the DMV used, sys.dm_pdw_exec_requests, only stores up to 10,000 rows, so depending on the usage of the APS the query above may only give you a very recent snapshot of query performance. My recommendation for both of the above queries would be to set up a SQL Agent job on the loading server to extract these stats from the PDW into a dedicated stats database on the loading server. You could then use SSRS or any other tool to do some proactive monitoring of large/long loads and queries, for example. At worst you have a nice log of data over time should you start to get feedback about degrading performance.
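As a rough illustration of that approach, here is a minimal sketch of the kind of statement such an agent job might run. It assumes a linked server called [PDW] pointing at the appliance and a local database called LoadStats with a matching PdwLoadHistory table; all of those names are hypothetical, not part of the original post:

-- Hedged sketch: copy recent load stats from the appliance into a local history table.
-- [PDW], LoadStats and dbo.PdwLoadHistory are illustrative names.
INSERT INTO LoadStats.dbo.PdwLoadHistory
    ([run_id], [name], [submit_time], [end_time], [total_elapsed_time],
     [operation_type], [status], [rows_processed], [rows_rejected], [rows_inserted])
SELECT *
FROM OPENQUERY([PDW],
    'SELECT r.[run_id], r.[name], r.[submit_time], r.[end_time], r.[total_elapsed_time],
            r.[operation_type], r.[status], r.[rows_processed], r.[rows_rejected], r.[rows_inserted]
     FROM sys.pdw_loader_backup_runs r
     WHERE r.[operation_type] = ''LOAD''');

Schedule something like that regularly, and do the same for the query history DMV, which is where the 10,000-row limit really bites.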

PDW to APS – Market Transition?

So, on 15th April 2014, Microsoft announced, albeit quietly, that the PDW, Parallel Data Warehouse, was being renamed to APS, Analytics Platform System. For me it was sad as I really liked the name PDW; it rolled off the tongue and, in true Ronseal style, did what it said on the tin. However, I can fully understand the re-branding as the market for BI is changing. I do not feel that BI is dead and Big Data is its replacement, but I do feel there is a change in requirements from the business, and traditional BI alone doesn't meet these new requirements. These newer requirements include the ability to get data insight faster, regularly adding new data sources and the ability to link insights from the warehouse to big data (unstructured data sources). There is also a need from the business for proper analytics; yes, Excel is still key but customers need more. Self-service BI is required so analysts can do their job of analysing and presenting insight rather than spending 90% of their time preparing data. Mobile BI is needed for delivering those important dashboards and reports into the hands of the decision makers, who, unfortunately, use iPads!

There is an argument that the current Microsoft BI stack already caters for this. Apply SMP SQL Server in its 2012/14 guise (with columnstore and partitioning), SSAS (cubes, tabular models), Power BI (with advanced visualisations and HTML5 support, sort of) and an agile development approach to the project and, et voila! And for a lot of customers this is great; in fact I think this is the most complete all-round Microsoft BI stack since the birth of SSAS. But for others it doesn't offer the ability to manage VERY large/wide data sets or the confidence that it can cope with expected growth and acquisitions. Hardware costs spiral, and if you have to go down the route of scale-out the licensing costs also start to make this prohibitive. Then for big data you are looking at a second solution: Hadoop, HDInsight or Cloudera, for example. Bringing both sources together can still be achieved by using Power Query and Power Pivot, and then you could productionise this using Tabular models; however, that is still a stretch and needs you to add at least some structure to the big data side.

To help with this Microsoft released AU1 for the PDW, but realised that with this release the PDW was no longer just a parallel data warehouse; it was more than this and, in line with the market movements, it was more about general analytics, not just datawarehousing. Being an appliance, it is really a platform: a high performance, scalable analytics platform. Hence the not so nice acronym APS. But, like Ronseal, it still does what it says on the tin!

Features added to APS v1 (PDW AU1):
- HDInsight, inside the appliance
- PolyBase v2:
  - linking to the HDInsight region of the appliance (using pushdown)
  - HDInsight in Azure (no pushdown)
- Integrated user authentication through Active Directory
- Transparent data encryption
- Seamless scale-out capabilities

Power BI – Musings in May 2014

Having just finished a customer POC using Power BI, I wanted to share my thoughts on the toolset. I feel confident that Microsoft are moving in the right direction with Power BI and that its objectives, delivering self-service and mobile BI, are exactly what a lot of my customers want. Doing that wrapped in the familiar Excel, and bundled up with new licensing model options under Office 365, makes it an exciting consideration.

Where to Begin?

My first issue with Power BI is that the overall messaging and marketing is confusing. I am fairly competent, but trying to set up my Office 365 trial with Power BI took some getting my head around. However, I am also a typical IT man in that I try never to read the getting started guide. After an hour of faffing around I went to the Power BI getting started guide and it helped, a lot!

To understand it more I put together the following diagram that hopefully helps. Please note this is not an overview of the whole of Power BI, just the elements we covered in this POC:

PowerBI

Basically, Power BI is an app that runs inside SharePoint Online. You get access to SharePoint Online if you sign up for an Office 365 subscription (paid monthly or annually). With Office 365 you can use Office apps online or download them to your desktop. To be able to use the Power BI Excel add-ins (Power View, Power Pivot, Power Query, Power Map) you will need Office 365 Professional Plus. Unfortunately, even for users that will simply consume reports/dashboards through your Power BI site, you will need the Power BI add-on to your subscription (tenant). Again, this is something that should be looked at by Microsoft.

Power BI components, utilised during this POC:

Power Pivot – probably needs no introduction; it is a data modelling add-in for Excel that allows users to bring together data from multiple sources, relate it, extend it and add calculations using DAX.

Power View – a dashboarding and visualisation tool (perhaps the same thing) that can source data from Excel, Power Pivot or SSAS Tabular databases (in SharePoint on-premises, integrated mode, it runs inside SSRS and can also work with SSAS Multidimensional databases). Power View has some great charting functions and allows you to relate dashboard items and use advanced filters, Bing Maps, a play axis and slicers across the whole dashboard.

Power Query – this is a self-service ETL tool (to some degree). It allows you to connect out to the internet (except Twitter at this point, contrary to all the pre-sales demos showing Twitter feeds; to get this currently you will need a third-party connector) and grab tables and lists of data. Once you have the data you can add it to Power Pivot models and then analyse it using Excel and Power View, for example.

Power Map – this is my least used tool; it looks great in demos but I have yet to see a use over and above Bing Maps in Power View. It ultimately works in a similar way to Bing Maps in Power View, in that you can plot locations and look at measures on a map. The key benefit of Power Map is that you can record a "tour", where you capture your analysis around the map and then save it out to a video, for example.

So, to get started with Power BI you get yourself a trial of Office 365, add the Power BI functionality to it, follow the getting started guide and then start building out some Power Pivot models in Excel that you can use as a source for reports and dashboards in Power View.

The final piece of the Power BI puzzle that I found really great is the Windows 8.1 Power BI app (it will be available for iOS and Android later this year). It allows the user to browse to the SharePoint Online site and feature reports from that site in the app.

What went right, what went wrong?

So, back to the proof of concept. It took us about half a day to construct a very basic SharePoint site with the Power BI app enabled. We downloaded and installed Office 365 Pro Plus on our desktop, and then it took a lot more time to try and come up with useful content. Our major issue is that we were trying to use this to surface our SSAS 2012 Multidimensional cube. We have a large(ish) cube (around 40 GB) with a lot of data. However, the biggest flaw with Power BI, today, is that you cannot connect to on-premises SSAS databases directly. You have to either create subsets of data for each and every report you need in an embedded Power Pivot model, or try and create OData feeds of the data you want to use. There is a potentially useful download called the Microsoft Data Management Gateway that does allow you to set up an encrypted link between your on-premises SQL Server or Oracle databases (and OData feeds), but as yet this doesn't allow for connection to SSAS, so we were not able to make use of it.

The other massive benefit for my customer, and most other customers, is the ability to have true mobile BI. Ultimately the pieces they need to access are Excel, Power View and SSRS. And unfortunately where they need them is on an iPad! This customer actually wanted to see Excel and Power View reports on the Nexus as well. Power View can be rendered in HTML5; however, it warns you when you use this view that some things may not work (most notably the play axis, although I still can't find a real-world use for this), and we found it most cumbersome around Bing Maps in Power View. Having done some mobile app development I know how hard it is to make sure an app functions and offers a similar user experience across browsers and devices, but with Microsoft's resources they need to get this right.

Finally, the biggest drawback of the whole POC was performance and the usability of actually getting to a report. From a mobile BI perspective, what we want is for our analyst to create a report in Excel: check (once we gave them a Power Pivot model). The ability of said analyst to publish the Power View dashboard to the Power BI site: check. Finally, an alert for our end users that there was a new report and to go look at it: nope. Ok, so can the analysts simply click a link to our Power BI site and look at a list of reports? Not quite! They have to log in to our SharePoint Online team site (we don't want this at all) and then launch Power BI from the left-hand links. The user experience here is not just too many clicks but also a conflicting look and feel. The team site can be customised and made to look almost corporate, in true SharePoint fashion; however, the Power BI site cannot be, and looks blue and white. Don't get me wrong, it looks ok, but this is a real-world business. The point of our mobile BI piece is for senior execs to be able to launch a report from their iPad on the golf course and get a glance at how their business is doing; if they have to go through a Microsoft-branded page they are not going to be terribly impressed. And the speed… the lack of it. Loading the team site: slow. Loading the Power BI site: slow. Loading the reports: slower. Even when running HTML5, as opposed to Silverlight, interacting with a Power View dashboard was slow. This needs to be looked at and fixed. Alternatively, give us a link directly to a single report; we can make those look corporate, and hopefully we can push them to use Surfaces and IE with Silverlight so interaction is fast!

There were other general things that can be summarised here:
1. Inability to link to on-premise data in the form of the SSAS cube.
2. General site performance – loading the Team site was slow (took nn), loading the Power BI page took nn, loading reports took too long.
3. Too many clicks to get to a report.
4. No ability to share a report via an email link (to take the user straight to the report).
5. Power BI site not able to be customised in line with the SharePoint look/feel, based on corporate requirements.
6. Featured Reports option not working on Android and Apple devices.
7. No ability to remove Featured Reports.
8. Inability to connect directly to Twitter feeds from Power Query and therefore link them to a Power Pivot model and visualise through Power View.
9. Bing Maps viewed excellently in the Windows app and through Silverlight, but through HTML5 they were not responsive enough: issues with pinching to zoom, moving the map around with touch, and changing a filter not always refreshing the map points.
10. Interacting with drillable charts in Power View (in HTML5) on the iPad and Nexus was tricky; sometimes the drill worked, other times it highlighted the slice/column.

To Power BI or Not to Power BI?

In short, my customer chose not to Power BI at this time, and I agree with them. Obviously, depending on your customers' or business needs this decision may be different, but with the issues we encountered it wasn't a viable option for end-user reporting and dashboarding. BUT… I have it on great authority that a lot of the issues we found will be fixed in coming release(s). In fact, if you read Chris Webb's blog post, a few of the issues we encountered were mentioned at the recent PASS Summit.

I have advised my customer to wait until said release(s) and to try again with our POC upon that potential new release (perhaps mid-July, as the rumour mill suggests?). This, coupled with the relative cost versus alternatives such as QlikView and Tableau AND the fact that most companies are looking for a suitable upgrade path for MS Office, means that people will use this toolset, and rightly so. If Microsoft can make v2 with all the required enhancements they will have a truly amazing BI stack, imo the best in the marketplace. Let's cross our fingers and hope July is not just sunny, but that Power BI v2 comes out and knocks us all down!

As usual, any feedback or tips are very much appreciated!

P.S – It would also be wonderful if I could extend my trial. As it stands, I am looking at having to re-do all my POC work, from scratch, in July!

P.P.S – Check out Microsoft's latest guide on BI tool use.

PDW Shallow Dive – Part 3

Welcome to part three of the PDW Shallow Dive. In this section I am going to explore partitioning in PDW. We will look at how to partition a table on the appliance and how to use partition switching. I was hoping to also look at the CTAS function and specifically the mega-merge statement, but due to time constraints I will have to cover that in a fourth part of the series.

Introducing Partitioning on the PDW

Partitioning is used to help manage large tables. When partitioning a table you track logical subsets of rows with metadata. Unlike SMP SQL Server, there is no partition function or scheme; you simply set partitioning when you create the table, or use ALTER TABLE to add it, by selecting which column to partition by and setting your boundary values. This does not affect which distribution or compute node the data is stored on. Rather, it allows you to manage the table with these subsets of data, for example partitioning by month and then archiving an older month by switching that partition to an archive table.

Let’s look at a simple example of a fact table that is partitioned, the code looks like this:

CREATE TABLE [dbo].[FactSales] (
[OnlineSalesKey] bigint NULL,
[DateKey] datetime NULL,
[StoreKey] int NULL,
[ProductKey] int NULL,
[PromotionKey] int NULL,
[CurrencyKey] int NULL,
[CustomerKey] int NULL,
[SalesOrderNumber] varchar(28) COLLATE Latin1_General_100_CI_AS_KS_WS NULL,
[SalesOrderLineNumber] int NULL,
[SalesQuantity] int NULL,
[SalesAmount] money NULL,
[ReturnQuantity] int NULL,
[ReturnAmount] money NULL,
[DiscountQuantity] int NULL,
[DiscountAmount] money NULL,
[TotalCost] money NULL,
[UnitCost] money NULL,
[UnitPrice] money NULL,
[ETLLoadID] int NULL,
[LoadDate] datetime NULL,
[UpdateDate] datetime NULL
)
WITH (CLUSTERED COLUMNSTORE INDEX, DISTRIBUTION = HASH([OnlineSalesKey]),
PARTITION ([DateKey] RANGE LEFT FOR VALUES ('Dec 31 2006 12:00AM', 'Dec 31 2007 12:00AM', 'Dec 31 2008 12:00AM', 'Dec 31 2009 12:00AM', 'Dec 31 2010 12:00AM', 'Dec 31 2011 12:00AM')));

It doesn't have to be dates that are used to partition tables, although this is the most common option; you can partition tables by set values such as product categories or sales channels. Understanding the data you are working with, and how it will be analysed, is critical to setting up partitioning. The key options for setting up partitioning are:

  • The column – as mentioned before, it is critical to identify the best column to partition your data by; in the example above we use the [DateKey] column.
  • Range – this specifies the boundary behaviour of the partition. It defaults to LEFT (boundary values belong to the lower partition) but you can choose RIGHT (boundary values belong to the higher partition). In the example above we use the default, LEFT, which means each value specified will be the last value in its partition, e.g. our first partition will hold all data with a date up to and including Dec 31 2006 12:00AM. A short sketch of the RIGHT variant follows this list.
  • The boundary values – the values in the list that you will partition the table by; these cannot be NULL and must match, or be implicitly convertible to, the data type of the partitioning column.
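For comparison, here is a minimal sketch of the same partitioning clause declared with RANGE RIGHT; the table is cut down and illustrative rather than taken from the example above:

CREATE TABLE [dbo].[FactSales_RangeRight] (
    [OnlineSalesKey] bigint NULL,
    [DateKey] datetime NULL,
    [SalesAmount] money NULL
)
WITH (CLUSTERED COLUMNSTORE INDEX, DISTRIBUTION = HASH([OnlineSalesKey]),
PARTITION ([DateKey] RANGE RIGHT FOR VALUES ('Jan 1 2007 12:00AM', 'Jan 1 2008 12:00AM', 'Jan 1 2009 12:00AM')));
-- With RANGE RIGHT the first partition holds everything before Jan 1 2007 12:00AM,
-- and each boundary date becomes the first value of the partition to its right.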

Partitioning by example

In this section we will go step by step through a basic partitioning example. We create three tables: a main table partitioned on a part_id column, a second table used for archiving data from our main table, and finally a staging table that we will use to move data into our main table.

  1. Create our factTableSample with a partitioning option, using a part_id column:

    create table factTableSample
    (id int not null, col1 varchar(50), part_id int not null)
    with (distribution = replicate,
          partition(part_id range left for values(1,2,3,4,5)));

  2. Now we will insert some basic data into the table:

    insert into factTableSample
    select 1 id, 'row1' col1, 1 part_id
    union all
    select 2 id, 'row2' col1, 2 part_id
    union all
    select 3 id, 'row3' col1, 3 part_id
    union all
    select 4 id, 'row4' col1, 4 part_id
    union all
    select 5 id, 'row5' col1, 5 part_id

    Below shows what those rows then look like:


  3. Let's take a look at how the data is stored in that table, and how many rows are in each partition:

    --Check partitions and the number of rows
    SELECT o.name, pnp.index_id, pnp.partition_id, pnp.rows,
           pnp.data_compression_desc, pnp.pdw_node_id
    FROM sys.pdw_nodes_partitions AS pnp
    JOIN sys.pdw_nodes_tables AS NTables
        ON pnp.object_id = NTables.object_id
        AND pnp.pdw_node_id = NTables.pdw_node_id
    JOIN sys.pdw_table_mappings AS TMap
        ON NTables.name = TMap.physical_name
    JOIN sys.objects AS o
        ON TMap.object_id = o.object_id
    WHERE o.name = 'factTableSample'
    ORDER BY o.name, pdw_node_id, pnp.index_id, pnp.partition_id

    The results show that all but the final partition have 1 row in them. However, you will also notice there are 24 rows returned. This is because we created a replicated table that is replicated across our 4 nodes (6 partitions x 4 nodes). For more information on replicated tables review part 1 of the series.


  4. Now we will create an archive table to which we can move our older data:

    create table factTableSample_Archive
    (id int not null,
     col1 varchar(50),
     part_id int not null)
    with (distribution = replicate);

  5. Now we can switch out partition 5 to the archive table and look at the results:

    alter table factTableSample switch partition 5 to factTableSample_Archive;


    We can clearly see that the 5th partition has been moved from the main fact table (factTableSample) to the archive table (factTableSample_Archive).

  6. Next we create a new table which we would use to load staging data into:

    create table factTableStage
    with (distribution = replicate,
          partition(part_id range left for values(1,2,3,4,5)))
    as
    select 1 id,
           cast('row6' as varchar(50)) col1,
           5 part_id

This adds a single row into partition 5:


  7. Now let's switch the stage data into partition 5 of the main fact table and look at the results:

    alter table factTableStage switch partition 5 to factTableSample partition 5;


We see that partition 5 is now populated with the data from the stage table.

To summarise, we can see that using partitioning on the PDW is not overly complex. The ALTER TABLE statements used to switch the partitions are very fast. If we had used distributed tables in the above examples it would have had no effect on any of the code we used. Just like with SMP SQL Server, it is important to plan your partitioning strategy in line with the analysis requirements and the distribution of data in the datawarehouse. In the next part of the series I will dig into the CTAS (create table as select) statement and show an example of using this to do a merge into a dimension table. I hope you find the above useful; if you want any code examples please feel free to email me at matt@coeo.com and I will get you a copy of the full script.
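As a footnote to the remark above about distributed tables, here is a minimal sketch of the same sample table declared as distributed rather than replicated (the choice of id as the distribution column is purely illustrative); every partition switch statement above would run unchanged:

create table factTableSample_Distributed
(id int not null, col1 varchar(50), part_id int not null)
with (distribution = hash(id),
      partition(part_id range left for values(1,2,3,4,5)));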


PDW Shallow Dive – Part 2 – ETL

Welcome to part two of my shallow dive into the PDW. In this section we take a look at the options available for ETL with the appliance. Here we will cover the following topics:

  1. DWLoader.
  2. SSIS and PDW.

DWLoader

The DWLoader application is a bit like the PDW version of BCP: a command-line application that is all about very fast import and export of logically structured flat files. The tool can be used via a script or the command line to import flat files to the appliance or export data from the appliance to flat files. Example syntax for the DWLoader application:

dwloader.exe -i ${loadfile} -m -fh 1 -M {mode} -b 100000 -rt value -rv 1000 -R ${loadfile}.rejects -E -e ASCII -t , -r \r\n -S $server -U $user -P $password -T {targetdatabase}${table}

Key switches:

  1. -i – the location of one or more source files to load.
  2. -M – specifies the load mode; options are fastappend, append, upsert or reload. The default mode is append. To use fastappend you must also specify -m, which is the multi-transaction option. The key difference between append and fastappend is that the latter doesn't use a temporary table.
  3. -fh – the number of lines to ignore at the beginning of the file (i.e. header rows).
  4. -b – sets the batch size; the default is 10,000 rows.
  5. -rv – specifies either the number of rows or the % of data allowed to fail before halting the load (-rt sets whether this is a value or a %).
  6. -R – the load failure file; if the file already exists it will be overwritten.
  7. -E – converts empty strings to NULL. The default is not to convert.
  8. -e – the character encoding; ASCII is the default but other options are UTF8, UTF16 or UTF16BE.
  9. -t – specifies the field delimiter.
  10. -r – the row delimiter.
  11. -S – the target appliance, normally an IP address.
  12. -U and -P – the username and password for the appliance (until the next PDW update, PDW only supports SQL security, so this would be your SQL username and password).
  13. -T – the three-part name of the destination table, e.g. dbname.dbo.table1 (dbo is the only schema supported in the current version of PDW).

Once the import or export is started there is not much to see; however, you can track progress through the PDW dashboard.

A common problem with the DWLoader tool is that it isn't able to work out the file format of the flat file (i.e. Unicode or UTF-8). If you aren't sure about the file type, open the file in the SSIS Import/Export Wizard and it will help you work out the format.

SSIS and PDW

First let's remind ourselves where SSIS fits in the PDW world:

The important point the diagram shows is that SSIS runs on the landing zone server, NOT on the PDW directly. This is quite critical because, whereas in traditional ETL solutions you look to SSIS to manage all of the transformations, when you have a PDW appliance as your database you will want to do that work on the appliance itself. This also limits the memory impact of pulling large data volumes into the SSIS data flow. It is much better to use SSIS as the control flow for the ETL solution and to do the advanced transformations of the data on the PDW.

To use SSIS with the PDW you use the same tools and processes as you would against normal SQL Server. Visual Studio with SSDT is used to create an SSIS project. Within the project you can specify a source adapter that will be an OLE DB source pointing to the PDW; however, you will see that SSIS thinks this is a normal SQL Server source because the icons for the tables do not show as distributed or replicated, just tables. This is by design, and it is good practice to build your source query as a stored procedure on the PDW so that you can test it before using it in the SSIS project. The big difference in SSIS is the destination adapter, which is PDW specific; an example is shown in the diagram below:

Using the PDW destination adapter it is critical to make sure the source columns match the data type of the destination columns; the adapter is not flexible or forgiving like other destination adapters in SSIS (i.e. SQL Server Destination), it won’t do any automatic type conversion. Important settings when using the PDW destination adapter include:

  • Connection Manager – you will need to create a new, PDW specific, connection manager:
    • Servername will be the appliance name and port number or the appliance IP and port number.
    • User and password will be your SQL Server PDW login information.
    • Destination database name is the target database on the PDW.
    • Staging database name is the stage database you want to use, if using FastAppend this must be empty.
  • Destination Table – the table you wish to load the data into.
  • Loading Mode – if the loading mode is fast append make sure you un-select “roll-back load on table update or insert failure”
    • By unchecking this box, loads from the staging table to a distributed destination table are performed in parallel across the distributions; this is quick but not transaction safe. It is the same as using the -m switch in DWLoader.
    • Checking the box loads data serially across distributions within each compute node and in parallel across the compute nodes. This is slower but is transaction safe.
  • Finally you can map your columns from input to destination – HOWEVER MAKE SURE YOU MATCH DATA TYPES!

Below is an example of the package running to import a simple file:

You can then follow the performance of the load through the PDW dashboard just as with the DWLoader:

In summary, when dealing with ETL on the PDW, push as much of the data transformation to the PDW as possible. Use tools not normally available in SQL Server or SSIS, such as CTAS (Create Table As Select), and try to limit the use of the UPDATE statement. Understanding that SSIS works by pulling data from the data source adapter into memory on the SSIS server and processing it there is critical to managing the memory and performance of SSIS.
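As a flavour of what pushing work to the appliance looks like, here is a minimal, hypothetical CTAS sketch (the table and column names are illustrative, not from the POC) that builds a transformed fact table directly on the PDW instead of routing the rows through an SSIS data flow:

-- Hedged sketch: transform staged rows into a new distributed fact table on the appliance.
CREATE TABLE [dbo].[FactSales_Transformed]
WITH (CLUSTERED COLUMNSTORE INDEX, DISTRIBUTION = HASH([OnlineSalesKey]))
AS
SELECT s.[OnlineSalesKey],
       s.[DateKey],
       s.[SalesQuantity],
       s.[SalesAmount] - s.[DiscountAmount] AS [NetSalesAmount]
FROM [dbo].[FactSales_Stage] s
WHERE s.[SalesAmount] IS NOT NULL;

An SSIS Execute SQL Task can simply call a stored procedure wrapping a statement like this, so SSIS stays in charge of the control flow while the heavy lifting happens on the appliance.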

My suggestion for using SSIS with PDW is to use it for the control flow and for any basic data preparation for types not supported on the appliance, for example the XML data type. Then use the power of the PDW to do the advanced transformations. In a recent POC for a customer we were loading around 300 GB of data in less than one hour using DWLoader, and then transforming 2 million XML documents into a relational data warehouse using SSIS and PDW stored procedures. We did have an issue with the lack of MERGE statement support in PDW, so to upsert into a dimension table we created a "mega-merge" statement using the CTAS function. In Part 3 of this series I will cover how that worked and also how partitions are managed on the PDW.

If you have any comments or questions please feel free to post them here for me or tweet me at @mattasimpson

PDW Shallow Dive – Part 1

There is very little resource on the internet for PDW, so I wanted to do a short series that focuses on the basics. I am not going into deep detail on the parallel engine or the commercial aspects; instead I want to focus on what PDW is, where it sits in the Microsoft BI stack and what I found particularly useful as a Microsoft Business Intelligence specialist.

In part 1 of the series we will look at the following topics:

  1. Introduction to the PDW.
  2. PDW Exploration.

Introduction to PDW

So, this is not a sales pitch; if you are looking at this blog you already see the value in having an eminently scalable and super performant datawarehousing appliance. To give an idea of cost, the current base offering from Microsoft comes in at around the cost of a high performance Fast Track Data Warehouse, including tin and software licenses, but for more info reach out to eddie@coeo.com who can help with any commercial questions around the PDW.

Ok, so what is it?! Ultimately the PDW is an appliance – that's it, hope you liked it…

… Just kidding! It is an appliance that uses SQL Server 2012 (from v2). The appliance is highly scalable, using SAN-like technology and virtualisation software for redundancy. The appliance can be based on either Dell or HP hardware; both, and any future hardware providers, have to adhere to strict Microsoft guidelines and offer performance to a set level, in a way not dissimilar to the Fast Track Data Warehouse. In fact Microsoft learnt from all the work that went into recommending specific, optimum hardware for Fast Track and built on that with the PDW appliance.

How does it differ from SMP SQL? Well, with a single instance of SQL Server you get one buffer, one space for all your user requests (queries) to go through. This means that if one query is requested, that user gets the whole buffer and it's quick! However, the more users requesting results, or the more complex the queries, the sooner the buffer fills and users start having to wait.

With the PDW you get multiple instances of SQL Server working in parallel, meaning multiple buffers! User traffic can be managed by the management node to optimise performance. This MPP (Massively Parallel Processing) platform offers scalability, high concurrency, complex workloads and redundancy in a single appliance:

With the diagram example above we would have 6 SQL buffers to work with in parallel! The appliance comes in a single rack with a control node, a management node and a number of SQL instances (depending on which variant of the appliance you buy); oh, and there is also a redundant node for added failover should one of your nodes fail. The DMS is the Data Movement Service; this is what controls how data is stored on, and moved between, the various instances. More on this later.

PDW Exploration

Ok, so you know why you want one; let's assume you get one, how do you go about developing on it? The first thing to note is that Microsoft has tried to make sure all the complexities of MPP are hidden from us. Ultimately we use our existing SQL Server and BI skills to develop against the appliance. To highlight this, the graphic below shows that, once connected, the PDW is just seen as a slightly different version of SQL Server. It is good practice to use Visual Studio (I find 2012 best at the moment) to work with the PDW.

Now, you don't access the PDW directly on the management or control node. In your rack you will also have an SMP instance of SQL Server running on a Windows server; this is commonly called the landing zone or landing server. A typical architecture for a PDW rack may be:

To start work on the PDW we connect to the landing zone and then use Visual Studio Data Tools to connect to our server. For example it may look like:

The next important concept is how you will store your data. There are two ways to store data. The first is distributed, where data is spread across all the SQL Server nodes. The second is replicated, where the data is copied to each SQL Server node. Because SQL Server still needs all the data on a single node to return the final results, the appliance uses the DMS (Data Movement Service) to move data around as needed to complete a query. For smaller tables (generally dimension tables) it is best to use replicated so less DMS work is needed. For large fact tables, however, it is much better to distribute the data so the PDW can run the query across multiple nodes, use the processing power of each, and then pull the results together to present the final result. The diagram below shows a distributed fact table; by looking at the icon of the table in the SQL Server Object Explorer you can also see what is replicated and what is distributed:

With distributed tables the important option is the key on which you want to distribute the data; above you can see we use OnlineSalesKey. To make the right choice it is critical to understand the data and also its likely usage, although the key can be changed very quickly. This option will have the biggest bearing on the performance of a distributed table. It is important to use a key that spreads the data volumes evenly and that is frequently used when querying the data. You will also notice above that we use the Clustered Columnstore Index (which is writeable) and we partition the table; we will talk more about both of these options in later blog posts. A sketch of the kind of DDL behind such a table follows.
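For reference, here is a cut-down sketch of the kind of DDL behind such a distributed fact table; the table name, column list and boundary dates are illustrative, and only the storage options matter here:

CREATE TABLE [dbo].[FactOnlineSales] (
    [OnlineSalesKey] bigint NOT NULL,
    [DateKey] datetime NULL,
    [SalesAmount] money NULL
)
WITH (CLUSTERED COLUMNSTORE INDEX,
      DISTRIBUTION = HASH([OnlineSalesKey]),
      PARTITION ([DateKey] RANGE LEFT FOR VALUES ('Dec 31 2012 12:00AM', 'Dec 31 2013 12:00AM')));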

The second option is replicated and you can see an example of a replicated table below:

This will make our date dimension table replicated on all of the SQL Server nodes, which means less DMS work and therefore faster queries. Changing storage modes can be done easily by using a PDW command called CTAS (Create Table As Select), which lets you quickly move data into a newly defined table; more on CTAS in later blog posts, but a small taster is sketched below.
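As that taster, here is a minimal, hypothetical sketch (table and column names are illustrative) of using CTAS to rebuild a replicated dimension as a hash-distributed table, with RENAME OBJECT as one way to swap the new table in:

-- Hedged sketch: change a table's storage mode with CTAS, then swap names.
CREATE TABLE [dbo].[DimDate_Distributed]
WITH (DISTRIBUTION = HASH([DateKey]))
AS
SELECT *
FROM [dbo].[DimDate];

RENAME OBJECT [dbo].[DimDate] TO [DimDate_Old];
RENAME OBJECT [dbo].[DimDate_Distributed] TO [DimDate];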

Finally in this section I wanted to show the central management portal for the PDW; this is a dashboard that can be accessed via the browser, and an example is below:

From this dashboard you will be notified of any potential issues with the appliance; note the red exclamation mark in the health section. It is also where you can look at query plans in detail to see how the PDW engine is managing a query, and it is a good place to monitor data loads, which we talk about in detail in part 2 of this series. Finally, you can use the performance monitor to see what is happening in real time on your appliance. Below is an example of the performance monitor as I ran a SELECT * from dimFactSales, which contains 10 million records. As the table is distributed, you can see that all the SQL Server nodes are working at roughly the same time/levels:

So, overall, I hope I have shown you how relatively simple the PDW is. Under the covers we have a huge amount of impressive hardware and software that is optimised purely for datawarehouse loads, yet above the covers we have our tried and tested SQL Server interface. My final thought in this part of the blog series is that with the PDW you have to slightly rethink your traditional BI strategy; for example, using SSIS you don't want to pull huge volumes of data from the PDW, manage them in the SSIS pipeline and then push them back onto the PDW. It is now much better to use SSIS as a control flow and let the PDW do more of the work. Also, given the lack of a MERGE statement, and the generally ordinary performance of UPDATE statements, it is critical to get used to leveraging the power of the PDW through the CTAS statement. In part 3 of this series I will expand on our mega-merge process, which makes full use of the CTAS statement to update a dimension table with millions of records in seconds. I hope you like the start of this series, and if you have any specific questions please leave a comment and I will endeavour to answer as soon as I can.

Microsoft gets serious about Big Data

Everyone is talking about Big Data. What it is, how important it is and why it should be part of your strategy or roadmap. But what does it really mean? I can only talk from my experiences with various customers and what it means for them. For some it means large volumes of difficult to use data: weblogs, social media streams or XML documents, perhaps all in different formats or from different suppliers. Whilst the business is desperate to extract customer insight, understand brand perception or see how location affects sales, the BI team are struggling to make use or sense of this voluminous, complex data. For many other customers it simply means a traditional data warehouse that has just got too big!

Currently people are attempting to tackle big data in two ways. Firstly, a lot of businesses are starting to adjust the way they deliver their BI projects to be more agile. Business users were complaining that the BI system was slow and that any enhancement request was put on a huge backlog; by the time it got delivered, the need had passed. So BI teams are adopting agile methodologies such as Kanban, not only to deliver value more regularly but to tie those deliveries back to clear business need. Secondly, businesses are dabbling in the open source world of Hadoop, HDFS, MapReduce etc. Often this means leaning on traditional development to hand-craft bespoke components to extract any value from the data. In my opinion this is giving people a taste of what is possible with big data, but it is proving immensely difficult in terms of the skills needed to deliver something production ready. I liken it to writing a Shakespeare-sized play full of code that only a few people really grasp but everyone thinks is cool. If you have a theatre full of advanced developers and can cope with the risk of using bespoke code and open source components then this is an option.

For those of us with simpler desires, welcome to the Microsoft Parallel Data Warehouse (PDW), Microsoft's premium data warehousing appliance. It is eminently scalable and performance is literally off the chart.


Figure 1 – PDW versus SMP SQL across 8 real life customer DWH queries


Figure 2 – PDW Performance Dashboard highlighting all SQL nodes being used to run a query

The PDW offers the ability to spread data across multiple instances of SQL Server 2012. It offers blistering performance, lets you rethink how you model data, and gives you the tools to make your current datawarehouse techniques more efficient. The appliance is easy to use, with the complexity of parallelism abstracted behind the familiar SQL Server. Using the existing MS BI stack you can migrate your existing warehouse quickly and deliver a scalable platform that can grow up to 7 racks full of powerful Dell or HP hardware. Support is within a 4 hour window and can be direct through the hardware provider or through Microsoft.

The pièce de résistance, however, is the ability to co-host a Hadoop cluster in the PDW appliance. Thus you can store and query your unstructured data alongside your traditional datawarehouse. On top of that, by using Microsoft's PolyBase technology you will be able to build T-SQL queries that join up your structured and unstructured data. The great thing here is that Microsoft has hidden all the complexity of adding data and building queries across it, so all you need to do is go and grab all the invaluable big data sources you have, load them onto your PDW appliance, and Bob's your uncle and customer insight is your aunt!