Choosing the Right BI Partner

Having worked as a Senior Consultant for Thorogood Associates and a Principal Consultant at Coeo, I feel well qualified to write this guide to help you when selecting a new BI partner. I spent around 6 years in consultancy in total and have worked in and around BI for over 15 years. During my consultancy time I worked with over 100 companies, including around 20 green field projects. The majority of these companies chose to work with us because of a vendor recommendation or a personal connection with existing staff, and most engaged before ever meeting the consultants who would be working with them or investigating the suitability of the partner. This is not to say that the engagements didn’t work, they often did, but I have often questioned customers’ selection criteria and wondered whether there were cheaper or better ways for them to approach their BI needs. Would you buy a new car without a test drive first? Would you employ a BI developer or architect without an interview? If the answer is no to both, then this blog may be of some use to you. I now sit on the “other side of the fence” and I think there are many things worth considering before signing up with a BI partner, and a test drive might well be one of those!

Top 5 things to consider when selecting your BI partner:

1. Are they a certified partner to your chosen, or preferred technology vendor(s)?

There is little point talking to non-certified partners, but it is also important to understand the partner programme your potential vendor runs. For example, anyone and everyone can become a Microsoft partner, easily; becoming a silver or gold BI partner, however, requires the partner to complete a number of significant steps. Question the badges on a partner’s website and check the vendor’s own partner listings for up-to-date information and the steps required to achieve the advertised status.

If you haven’t decided on a technology or vendor, then make sure your potential BI partner is either certified with multiple vendors, or with none at all but has experience of vendor selection processes. Alternatively, and ideally, select two potential partners per potential vendor and make this part of your vendor selection process.

Beware vendor recommendations: at this point the engagement with the vendor is sales led. Are they the best people to recommend a partner to you?

2. What are their staff credentials, experience (ideally in your industry), and certifications?

These should be current – an MCDBA in SQL 2005 is not relevant any more. They should also be relevant to your requirements; PRINCE2 certifications matter little when you will provide your own project management. Utilise LinkedIn: it is a fantastic way to validate some of the key people at your potential partner. Very early in the process ask about the BI team and do some research; later in the process, meet these people! Investigate industry experts in BI – talking to them in open groups on Twitter or LinkedIn can help you get recommendations or confirm your thoughts on the potential partners. Look at relevant industry events for sponsorship and for lead technical experts sharing their knowledge with everyone. Finally, asking to speak to similar companies they have worked with and/or researching their case studies is also a valid and useful approach.

Ultimately, treat this review as you would an interview for a BI specialist. Most companies I have worked with – around 90% – ultimately want to create their own internal BI competency. To that end, your choice of partner is about extending, or even starting, your own BI team. Finding a partner that has experience of doing this, and has approaches for aiding it, is also important. Meeting, interviewing and validating these people is just common sense.

3. Does their engagement model fit to your requirements?

In my opinion it should be flexible, not cookie-cutter BS that you could probably find on SlideShare. Critically, you should know your requirements and, if knowledge transfer is important, understand how the partner manages this or builds it into their process. Does your business have ever-changing requirements and priorities? Can this partner support agile – and I don’t mean they do stand-ups and split work into sprints; true agile is not just about this – and have you validated their agile credentials, if agile is important to your business or department?

Do they force project management upon you? Do they charge for this or is it wrapped up as part of the price, day rate etc.? Not that PM is bad – I think it is hugely important and useful on projects. All partners should be quizzed about how they deliver the capabilities a normal PM function would bring, BUT this should be in line with your expectations and preferred way of working.

4. Do they have recent, reference-able and relevant examples of working on these kind of projects before?

This should include any key areas important to your business, e.g. master data or data quality. Perhaps also similar experience in helping transition you to your own BI team, in the fullness of time. Can you talk with their previous customers? If yes, then I like to talk to the business analyst or BI team lead rather than a PM. Talking to a key business user is also interesting if your choice is linked to a specific vendor.

5. How do they manage risk?

This is a broad area and difficult to gauge, but there are some key areas to think about during your selection process. Key man dependency – how does the partner make sure that, if their star consultant leaves, your project and deadlines are not affected? Changing requirements – what are their processes and how do they manage the reality of changing requirements? No BI project EVER fits a static, waterfall approach. If a partner talks about being able to manage this but only at extra cost, close the conversation immediately. These are best-practice areas that all partners should include as part of their general way of working.

Apart from these areas to consider, what’s absolutely key is understanding what you want to deliver for your business and what the potential return or value of that delivery is. If either of these things is not clear then it is not time to engage with a third party – unless, perhaps, they have good experience with these early stages of work and can provide management consultancy to help build that initial business case. In my experience not many can do this from scratch, and you probably need a less technical partner than the typical BI partner. I don’t mean you have to have a completed business case or a full technical specification, but you need some idea and definitely buy-in from your business stakeholders.

Costs

Up front it is more than OK to ask for rate cards. If the partner doesn’t work that way, then ask for example costs of previous projects with similar requirements and deadlines, or ask for their average rate in their last financial year or across their last two BI projects – trust me, they ALL know this information. It is critical to make sure, for both you and the partner, that you are in the right ballpark right from the off. At an early stage it is completely OK for these costs to be indicative and you shouldn’t start budgeting around these numbers; all good proposal stages require a more accurate assessment of need to identify more accurate costs. At an early stage it is also important to think about the cost approach: time and materials or a fixed price. In my experience an indicative/umbrella budget with a T&M proposal gives you the best price and is the most flexible way to work with a partner. Fixed prices can work, but due to the risk and nature of these engagements expect to pay for a hefty upfront design stage and for the partner (rightly) adding a percentage on top of normal costs to mitigate potential slippage/risk.

Top 3 Actions to complete whilst selecting the right partner:

1. Meet the team – and not just you, include key members of your internal tech stakeholders and potential internal delivery team members. DON’T just meet sales, or technical pre-sales.

2. See examples of their work, relevant to yours – ideally they may want to run a mini POC with your data (perhaps with vendor-supported funding)

3. Constantly review your requirements, for the project and business, against what you are learning about the partner.  Have your internal stakeholders be part of these reviews.

Time

This process should take as long as it needs, but match the length of time to the size of the project and the complexity of your requirements. I have found that even for a small piece of work this process can take as long as it does for a multi-month green field project. Critically, the time this process takes is up to you and should NEVER be dictated by your potential partners. Certainly question the lead time a potential partner needs before they are available to start initiation or development kick-off, BUT don’t be driven by this. Your project can wait a month for the right partner; it has probably waited a lot longer prior to this!

In Summary…

This process is a difficult one, filled with the usual mix of technical review and emotional, gut-feel judgements. The more people you get involved in this process, at the relevant times, the better. If you are forming a BICC, have your most important business stakeholders be part of the process. If you or someone in your IT team knows a good BI person (validate this), then have them talk about their experiences and recommendations for BI partners. Finally, I have seen value in matching your company size, or project size, with the partner’s size. IMG Group (now Hitachi) always seemed to work well with larger companies that were more mature in BI, whereas Coeo worked much better with less BI-mature companies, often of a smaller size.

I do have a spreadsheet that I used recently for vendor selection which I am more than happy to share – please comment with your details and I will email it over. I am also happy to help, where I can, with my experience. Good BI people are a close-knit bunch in the UK and I am more than happy to share my experience and recommendations, for what it is worth to you! 🙂

And as always thank you for reading, I hope this is of some help!

Building a Star Schema in Power BI

So it has been a while since the GA release of PowerBI.com and Power BI Desktop (for those that remember the ProClarity and Panorama Desktop tools, this choice of name still makes me giggle!) and I thought it about time I put the modelling capabilities of Power BI Desktop to the test. I am a strong believer that the data model is the MOST important part of any BI delivery. Without it we get different answers to the same question and don’t have a consistent experience for end users or analysts.

Power BI purports to be a one-stop shop for data blending, modelling and presentation. I have no doubt about its presentation capabilities: the dashboards look and feel great, and with the GA release the ability to do basic things such as add images and customise colour schemes makes it a good competitor at the front end of your BI stack. I am also confident that, by integrating Power Query connectors and transformation capabilities into the tool, basic ETL and data blending is going to be no problem. In fact I would say it is the best way available, certainly the most cost effective, to bring together cloud and on-premise data into a single place. However, modelling just feels wrong in a front-end tool! To put it to the test I had some basic Coeo sales data that I wanted to play around with and try to model in a way that would let me deliver a monthly sales dashboard.

I found many useful features that really helped with modelling, including adding a column as a new named query, de-duping the list and then converting it to a table. Features like Group By, or replacing values (e.g. nulls) with “No Service” or “Unknown”, allow us to achieve a basic star schema! However, when I needed to generate a fact table at a different grain, I am not ashamed to say I had to go back into SQL Server and use tables and T-SQL to transform, match and load the data into my new schema. This gave me the quickest and easiest way to then build my sales dashboard!
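As a rough illustration of that last step, here is a minimal T-SQL sketch of re-graining a flat CRM extract into a monthly fact table; the table and column names are hypothetical, not the actual Coeo schema.

-- Hypothetical example: summarise a flat CRM extract to a monthly sales grain
SELECT
    c.CompanyKey,
    s.ServiceKey,
    DATEFROMPARTS(YEAR(f.InvoiceDate), MONTH(f.InvoiceDate), 1) AS SalesMonth,
    SUM(f.InvoiceAmount) AS SalesAmount
INTO dbo.FactMonthlySales
FROM dbo.FlatCrmExtract AS f
JOIN dbo.DimCompany AS c ON c.CompanyName = f.CompanyName
JOIN dbo.DimService AS s ON s.ServiceName = ISNULL(f.ServiceName, 'No Service')
GROUP BY
    c.CompanyKey,
    s.ServiceKey,
    DATEFROMPARTS(YEAR(f.InvoiceDate), MONTH(f.InvoiceDate), 1);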

I am going to blog in more detail about some of the fun features I found along the way around modelling, DAX measures, date dimension funnies and also about how we are going to build the architecture to support the Coeo Data Warehouse in the cloud utilising Azure Data Factory, SQL DWH, Power BI and Datazen! So watch this space for more! In the meantime, below are the steps I used to create my basic star schema.

1. Starting in Power BI Desktop with a flat extract of data from our CRM system:

FlatDataExtract

 

2. Now Create a dimension, e.g. Company!

CreateCompDim

3. Remove Duplicates, as it is a dimension table and we only want unique values:

dedupe

4. Now convert it to a table, we can do more with it then!

converttotable

5. Such as rename the column to something meaningful!

Rename

6. Now simply close and load the queries to the model:

close and load

And we have a mini star schema! Simply repeat for all the dimensions you want to generate. It also works nicely for date dimensions – just remember to set the correct data types, and you can even add columns to form a mini date dimension!

relationships
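As an aside, if you end up back in SQL Server anyway (as I did for the fact table), a minimal sketch of a date dimension built there might look like the following; the date range and the handful of columns are hypothetical and deliberately small.

-- Hypothetical minimal date dimension built with a recursive CTE
;WITH Dates AS (
    SELECT CAST('2014-01-01' AS DATE) AS DateValue
    UNION ALL
    SELECT DATEADD(DAY, 1, DateValue) FROM Dates WHERE DateValue < '2016-12-31'
)
SELECT
    CONVERT(INT, FORMAT(DateValue, 'yyyyMMdd')) AS DateKey,
    DateValue,
    YEAR(DateValue)            AS CalendarYear,
    MONTH(DateValue)           AS MonthNumber,
    DATENAME(MONTH, DateValue) AS MonthName
INTO dbo.DimDate
FROM Dates
OPTION (MAXRECURSION 0);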

Integrating Hadoop and the Data Warehouse

The objective of any data warehouse should include:

  1. Identify all possible data assets
  2. Select the assets that have actionable content and are accessible
  3. Model the selected assets into a high performance data model
  4. Expose the data assets that are most effective for decision making

New data assets are now available that may meet some of the above criteria but are difficult, or impossible, to manage using RDBMS technology. Examples of these are:

  1. Unstructured, semi-structured or machine structured data
  2. Evolving schemas, just in time schemas
  3. Links, Images, Genomes, Geo-Positions, Log Data

These data assets can be described as Big Data and this blog looks at Big Data stored in a Hadoop cluster.

In very few words Hadoop is an open source distributed storage and processing framework. There are a number of different software vendor implementations of Hadoop. The different Hadoop implementations should be investigated depending on your requirements.

Figure 1 highlights the key differences, and similarities, between relational database management systems (RDBMS) and Hadoop.

Figure 1 – Differences between RDBMS and Hadoop

The three layers that can be used to describe both systems are Storage, Metadata and Query. With a typical RDBMS system, these layers are “glued” together within the overall application, for example SQL Server or Oracle. In Hadoop, however, these layers work independently, allowing each layer to be accessed in multiple ways; this is what enables super-scalable performance.

Exploring Data between the Data Warehouse and Hadoop Cluster

Often there is an unknown quality or value in the Hadoop data. To start to identify value or explore the possibility of gaining new insight from the Hadoop data, it is useful to be able to query the data directly and alongside the existing data warehouse. To query by conformed dimensions, for example, is extremely powerful and can help to query Hadoop data based on well-governed dimension data.

This “exploration” can be relatively slow, compared to simply querying Hadoop with Hive or Impala directly, or by queries against a dimensional modelled data warehouse. However, this gives us an opportunity to explore data before we worry about leveraging an ETL process to extract, transform and load the data into our ultimate data warehouse.

To do this exploration there are two main options:

Option 1 – Mash Ups

By leveraging tools such as Power BI (Power Query and Power Pivot) or Alteryx Designer, you are able to bring together data from a Hadoop cluster and an RDBMS data warehouse. The data can be modelled and calculations added. Finally, the data can be queried to start to identify possible insights.

Option 2 – Direct Querying

There are some technologies, such as Microsoft Polybase or Teradata QueryGrid, that allow you to use the SQL query language to add temporary structure to Hadoop data and join it to data warehouse data. My hope is that Microsoft brings Polybase from the MPP appliance, APS, into SMP SQL Server in its next release. This technology is perfect for people not wishing to learn Java, Python, Sqoop and Linux.
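To make that concrete, here is a minimal sketch of the Polybase approach using the external table syntax that later shipped in SQL Server 2016 (the APS syntax differs slightly); the data source location, file format and all table and column names are hypothetical.

-- Hypothetical: expose a folder of delimited log files in Hadoop as an external table
CREATE EXTERNAL DATA SOURCE HadoopCluster
WITH (TYPE = HADOOP, LOCATION = 'hdfs://namenode:8020');

CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (FORMAT_TYPE = DELIMITEDTEXT, FORMAT_OPTIONS (FIELD_TERMINATOR = ','));

CREATE EXTERNAL TABLE dbo.ext_WebLogs (
    CustomerKey INT,
    EventDate DATETIME,
    Url VARCHAR(500)
)
WITH (LOCATION = '/logs/web/', DATA_SOURCE = HadoopCluster, FILE_FORMAT = CsvFormat);

-- Join the Hadoop data to conformed dimensions already governed in the warehouse
SELECT d.CalendarMonth, c.CustomerSegment, COUNT(*) AS PageViews
FROM dbo.ext_WebLogs AS w
JOIN dbo.DimCustomer AS c ON c.CustomerKey = w.CustomerKey
JOIN dbo.DimDate AS d ON d.DateValue = CAST(w.EventDate AS DATE)
GROUP BY d.CalendarMonth, c.CustomerSegment;

The join to DimCustomer and DimDate is the “query by conformed dimensions” idea described in the exploration section above.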

Extending the Data Warehouse

The exploration options above are useful but limited. Performance will be constrained either by the Hadoop cluster and the lack of structure in the data, or by the RDBMS data warehouse. If the exploration reveals insight, then the next logical step is to bring the useful data together into a single data warehouse.

Initially you may wish to use existing ETL tools, such as SSIS or Information Builders, or go directly to what these tools often leverage underneath, which is Sqoop. This will allow you to bring data out of the Hadoop cluster, and you can then use Pig, for example, to transform the data into a dimensional model in your existing RDBMS data warehouse. This allows you to benefit from the proven performance of a dimensional model. I refer to this data as your “known unknowns”.

Secondly, you may wish to move your data warehouse or, more often, create your new data warehouse in Hadoop. This can be a sensible option when you compare the performance of the Hadoop architecture with standard RDBMS architecture. You can also still leverage your SQL skills, using tools such as Hive or Impala to analyse the data. To further improve performance, you can add some semi-permanent structure to the data using Parquet. Parquet is a file format that uses columnar storage methods similar to existing in-memory columnar engines such as Vertipaq. This allows us to apply dimensional modelling techniques to our data and benefit from conformed dimensions, for example.

In Summary

Ultimately, we should not ignore Big Data and Hadoop. The “Internet of Things” alone will mean the volume, variety and velocity of data available to our businesses will stretch traditional RDBMS data warehouses to the maximum. Will they cope? Do existing techniques, such as dimensional modelling, still work? The answer is probably yes, to both. Dr Ralph Kimball, in his webinar series with Cloudera last year, likened it to XML data when it first arrived: it was tough to manage, and it took RDBMS vendors 10 years to integrate XML into their applications. However, why wait? With the tools mentioned in the exploration section, and there are many more, you have the ability to easily investigate Big Data and mix it with your existing data warehouse. As BI professionals, the more value we can add to the business, the easier it becomes to justify investment in better hardware, more storage and advanced tools.

References and Useful Links:

Cloudera and Ralph Kimball: http://cloudera.com/content/cloudera/en/resources/library/recordedwebinar/building-a-hadoop-data-warehouse-video.html

SSIS and Hadoop: http://sqlmag.com/blog/use-ssis-etl-hadoop

Power Query and Hadoop: http://msbiacademy.com/?p=6641

Microsoft Polybase: http://blogs.technet.com/b/dataplatforminsider/archive/2014/04/30/change-the-game-with-aps-and-polybase.aspx

Teradata and Hadoop: http://www.teradata.co.uk/Teradata-Portfolio-for-Hadoop/?LangType=2057&LangSelect=true

Introduction to Flume and Sqoop: http://www.guru99.com/introduction-to-flume-and-sqoop.html

Parquet (Hadoop): http://parquet.incubator.apache.org/

Power BI – Musings in May 2014

Having just finished a customer POC using Power BI, I wanted to share my thoughts on the toolset. I feel confident that Microsoft are moving in the right direction with Power BI and that its objectives – delivering self-service and mobile BI – are exactly what a lot of my customers want. Doing that wrapped in the familiar Excel, and bundled up with new licensing model options with Office 365, makes it an exciting consideration.

Where to Begin?

My first issue with Power BI is that the overall messaging and marketing is confusing. I am fairly competent, but trying to set up my Office 365 trial with Power BI took some getting my head around. However, I am also a typical IT man in that I try never to read the getting started guide. After an hour of faffing around I went to the Power BI getting started guide and it helped, a lot!

To understand it more I put together the following diagram that hopefully helps, please note this is not an overview of the whole of Power BI just the elements we covered in this POC:

PowerBI

Basically, Power BI is an app that runs inside SharePoint Online. You get access to SharePoint Online if you sign up for an Office 365 subscription (paid monthly or annually). With Office 365 you can use Office apps online or download them to your desktop. To be able to use the Power BI Excel add-ins (Power View, Power Pivot, Power Query, Power Map) you will need Office 365 Professional Plus. Unfortunately, even users that will simply consume reports/dashboards through your Power BI site will need the Power BI add-on to your subscription (tenant). Again, this is something that should be looked at by Microsoft.

Power BI components, utilised during this POC:

Power Pivot – probably needs no introduction but is a data modelling add-in for Excel that allows users to bring together data from multiple sources, relate it, extend it and add calculations using DAX.

Power View – a dashboarding and visualisation tool (perhaps the same thing) that can source data from Excel, Power Pivot or SSAS Tabular DBs (in SharePoint on-premise, integrated mode it runs inside SSRS and can work with SSAS Multidimensional DBs). Power View has some great charting functions and allows relating dashboard items, advanced filters, Bing maps, play axis and slicers across the whole dashboard.

Power Query – this is a self-service ETL tool (to some degree). It allows you to connect out to the internet and grab tables and lists of data (except from Twitter at this point, contrary to all the pre-sales demos showing Twitter feeds – to get this currently you need a third-party connector). Once you have the data you can add it to Power Pivot models and then analyse it using Excel and Power View, for example.

Power Map – this is my least-used tool; it looks great in demos but I have yet to see a use over and above Bing Maps in Power View. It ultimately works in a similar way to Bing Maps in Power View, in that you can plot locations and look at measures on a map. The key benefit of Power Map is that you can record a “tour” of your analysis around the map and then save this out as a video, for example.

So to get started with Power BI you go get yourself a trial of Office 365, add the Power BI functionality to it, follow the getting started guide and then start building out some Power Pivot models in Excel that you can use as a source for reports and dashboards in Power View.

The final piece of the Power BI puzzle that I found really great is the Windows 8.1 Power BI app. This will also be available for iOS and Android later this year. It allows the user of the app to browse to the SharePoint Online site and feature reports from that site in the app.

What went right, what went wrong?

So back to the proof of concept. It took us about half a day to construct a very basic SharePoint site with the Power BI app enabled. We downloaded and installed Office 365 Pro Plus on our desktops, and then it took a lot more time to try to come up with useful content. Our major issue is that we were trying to use this to surface our SSAS 2012 Multidimensional cube. We have a large(ish) cube (around 40GB) with a lot of data. However, the biggest flaw with Power BI, today, is that you cannot connect to on-premise SSAS databases directly. You have to either create subsets of data for each and every report you need in an embedded Power Pivot model, or try to create OData feeds of the data you want to use. There is a potentially useful download called the Microsoft Data Management Gateway that does allow you to set up an encrypted link to your on-premise SQL Server or Oracle databases (and OData feeds), but as yet this doesn’t allow for connection to SSAS, so we were not able to make use of it.

The other massive benefit for my customer, and most other customers, is the ability to have true mobile BI. Ultimately the pieces they need to access are Excel, Power View and SSRS, and unfortunately where they need them is on an iPad! This customer actually wanted to see Excel and Power View reports on the Nexus as well. Power View can be rendered in HTML5, however it warns you when you use this view that some things may not work (most notably the play axis, although I still can’t find a real-world use for this); we found it most cumbersome around Bing Maps in Power View. Having done some mobile app development I know how hard it is to make sure an app functions and offers a similar user experience across browsers and devices, but with Microsoft’s resources they need to get this right.

Finally, the biggest drawback of the whole POC was performance, and the usability of actually getting to a report. From a mobile BI perspective, what we want is for our analyst to create a report in Excel – check (once we gave them a Power Pivot model). The ability of said analyst to publish the Power View dashboard to the Power BI site – check. Finally, an alert to our end users that there is a new report to go and look at – nope. OK, so can the analysts simply click a link to our Power BI site and look at a list of reports? Not quite! They have to log in to our SharePoint Online team site (we don’t want this at all) and then launch Power BI from the left-hand links. The user experience here is not just too many clicks but also a conflicting look and feel. The team site can be customised and made to look almost corporate, in true SharePoint fashion, however the Power BI site cannot be, and looks blue and white. Don’t get me wrong, it looks OK, but this is a real-world business. The point of our mobile BI piece is for senior execs to be able to launch a report from their iPad on the golf course and get a glance at how their business is doing; if they have to go through a Microsoft-branded page they are not going to be terribly impressed.

And the speed… the lack of it. Loading the team site: slow. Loading the Power BI site: slow. Loading the reports: slower. Even when running HTML5 as opposed to Silverlight, interacting with a Power View dashboard was slow. This needs to be looked at and fixed. Alternatively, give us a link directly to a single report – we can make those look corporate, and hopefully we can push users towards Surfaces with IE and Silverlight so interaction is fast!

There were other general things that can be summarised here:
1. Inability to link to on-premise data in the form of the SSAS cube.
2. General site performance – Loading the Team site was slow (took nn), loading the Power BI page took nn, loading reports took too long.
3. Too many clicks to get to a report
4. No ability to share a report via email link (to take the user straight to the report)
5. Power BI site not able to be customized inline with SharePoint look/feel, based on corporate requirements.
6. Featured Reports option not working on Android and Apple devices.
7. No ability to remove Featured Reports.
8. Inability to connect directly to Twitter feeds from Power Query and therefore link to a Power Pivot model and visualize through Power View.
9. Bing Maps displayed excellently in the Windows app and through Silverlight, but through HTML5 they were not responsive enough: issues with pinching to zoom and moving the map around with touch, and when changing a filter the map points did not always refresh.
10. Interacting with drillable charts in Power View (in HTML5) on the iPad and Nexus was tricky; sometimes the drill worked, other times it just highlighted the slice/column.

To Power BI or Not to Power BI?

In short, my customer chose not to Power BI at this time, and I agree with them. Obviously, depending on your customers’ or business needs, this decision may be different. But with the issues we encountered it wasn’t a viable option for end-user reporting and dashboarding. BUT… I have it on great authority that a lot of the issues we found will be fixed in coming release(s). In fact, if you read Chris Webb’s blog post, a few of the issues we encountered were mentioned at the recent PASS summit.

I have advised my customer to wait for said release(s) and to try again with our POC on the potential new release (perhaps mid-July, as the rumour mill suggests?). This, coupled with the relative cost versus alternatives such as QlikView and Tableau AND the fact that most companies are looking for a suitable upgrade path for MS Office, means that people will use this toolset, and rightly so. If Microsoft can make v2 with all the required enhancements they will have a truly amazing BI stack – imo the best in the marketplace. Let’s cross our fingers and hope July is not just sunny but that Power BI v2 comes out and knocks us all down!

As usual any feedback or tips very much appreciated!

P.S – It would also be so wonderful if I could extend my trial. Now I am looking at having to re-do all my POC work, from scratch, in July!

P.P.S – Check out Microsoft’s latest guide on the BI Tool Use

PDW Shallow Dive – Part 3

Welcome to part three of the PDW shallow dive. In this section I am going to explore partitioning in PDW. We will look at how to partition a table on the appliance and how to use partition switching. I was also hoping to look at the CTAS function, and specifically the mega-merge statement, but due to time constraints I will have to cover that in a fourth part of the series.

Introducing Partitioning on the PDW

Partitioning is used to help manage large tables. When partitioning a table you track logical subsets of rows with metadata. Unlike SMP SQL Server there is no partition function or scheme; you simply set partitioning when you create the table, or use ALTER TABLE to add it, by selecting which column to partition by and setting your boundary values. This does not affect which distribution or compute node the data is stored on. Rather, it allows you to manage the table with these subsets of data – for example, partition by month and then archive an older month by switching its partition to an archive table.

Let’s look at a simple example of a fact table that is partitioned, the code looks like this:

CREATE TABLE [dbo].[FactSales] (
[OnlineSalesKey] bigint NULL,
[DateKey] datetime NULL,
[StoreKey] int NULL,
[ProductKey] int NULL,
[PromotionKey] int NULL,
[CurrencyKey] int NULL,
[CustomerKey] int NULL,
[SalesOrderNumber] varchar(28) COLLATE Latin1_General_100_CI_AS_KS_WS NULL,
[SalesOrderLineNumber] int NULL,
[SalesQuantity] int NULL,
[SalesAmount] money NULL,
[ReturnQuantity] int NULL,
[ReturnAmount] money NULL,
[DiscountQuantity] int NULL,
[DiscountAmount] money NULL,
[TotalCost] money NULL,
[UnitCost] money NULL,
[UnitPrice] money NULL,
[ETLLoadID] int NULL,
[LoadDate] datetime NULL,
[UpdateDate] datetime NULL
)
WITH (CLUSTERED COLUMNSTORE INDEX, DISTRIBUTION = HASH([OnlineSalesKey]),
PARTITION ([DateKey] RANGE LEFT FOR VALUES ('Dec 31 2006 12:00AM', 'Dec 31 2007 12:00AM', 'Dec 31 2008 12:00AM', 'Dec 31 2009 12:00AM', 'Dec 31 2010 12:00AM', 'Dec 31 2011 12:00AM')));

It doesn’t have to be dates that are used to partition tables, although this is the most common option; you can partition tables by set values such as product categories or sales channels. Understanding the data you are working with, and how it will be analysed, is critical to setting up partitioning. The key options when setting up partitioning are:

  • The column – as mentioned before, it is critical to identify the best column to partition your data by; in the example above we use the [DateKey] column.
  • Range – this specifies which side of a boundary value belongs to which partition. It defaults to LEFT (lower values) but you can choose RIGHT (higher values); in the example above we use the default, LEFT. This means each value specified is the last value in its partition, e.g. our first partition holds all data with a date up to and including Dec 31 2006 12:00AM (see the RANGE RIGHT sketch just after this list).
  • The boundary values – the list of values you partition the table by. These cannot be NULL and must match, or be implicitly convertible to, the data type of the partitioning column.
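To illustrate the LEFT/RIGHT difference, below is a cut-down, hypothetical variant of the table above using RANGE RIGHT, where each boundary value becomes the first value of its partition rather than the last.

-- With RANGE RIGHT each boundary value is the first value in its partition,
-- so all of calendar year 2007 lands in the partition that starts at 'Jan 01 2007'.
CREATE TABLE [dbo].[FactSales_RangeRight] (
[OnlineSalesKey] bigint NULL,
[DateKey] datetime NULL,
[SalesAmount] money NULL
)
WITH (CLUSTERED COLUMNSTORE INDEX, DISTRIBUTION = HASH([OnlineSalesKey]),
PARTITION ([DateKey] RANGE RIGHT FOR VALUES ('Jan 01 2007 12:00AM', 'Jan 01 2008 12:00AM', 'Jan 01 2009 12:00AM', 'Jan 01 2010 12:00AM', 'Jan 01 2011 12:00AM', 'Jan 01 2012 12:00AM')));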

Partitioning by example

In this section we will go step by step through a basic partitioning example. We create three tables: a main table partitioned on a part_id column, a second table used for archiving data out of our main table, and finally a table that we will use to switch data into our main table.

  1. Create our factTableSample with a partitioning option, using a part_id column:

    create table factTableSample
    (id int not null, col1 varchar(50), part_id int not null)
    with (distribution = replicate,
          partition(part_id range left for values(1,2,3,4,5)));

  2. Now we will insert some basic data into the table:

    insert into factTableSample
    select 1 id, 'row1' col1, 1 part_id
    union all
    select 2 id, 'row2' col1, 2 part_id
    union all
    select 3 id, 'row3' col1, 3 part_id
    union all
    select 4 id, 'row4' col1, 4 part_id
    union all
    select 5 id, 'row5' col1, 5 part_id

    Below shows what those rows then look like:


  3. Let’s take a look at how the data is stored in that table and how many rows are in each partition:

    --Check partitions and the number of rows
    SELECT o.name, pnp.index_id, pnp.partition_id, pnp.rows,
           pnp.data_compression_desc, pnp.pdw_node_id
    FROM sys.pdw_nodes_partitions AS pnp
    JOIN sys.pdw_nodes_tables AS NTables
        ON pnp.object_id = NTables.object_id
        AND pnp.pdw_node_id = NTables.pdw_node_id
    JOIN sys.pdw_table_mappings AS TMap
        ON NTables.name = TMap.physical_name
    JOIN sys.objects AS o
        ON TMap.object_id = o.object_id
    WHERE o.name = 'factTableSample'
    ORDER BY o.name, pdw_node_id, pnp.index_id, pnp.partition_id

    The results show that all but the final partition have 1 row in them. You will also notice there are 24 rows returned; this is because we created a replicated table, so each of the 6 partitions appears on each of our 4 nodes. For more information on replicated tables review part 1 of the series.


  4. Now we will create an archive table into which we can move our older data:

    create table factTableSample_Archive
    (id int not null,
     col1 varchar(50),
     part_id int not null)
    with (distribution = replicate);

  5. Now we can switch out partition 5 to the archive table and look at the results:

    alter table factTableSample switch partition 5 to factTableSample_Archive;


    We can clearly see that the 5th partition has been moved from the main fact table (factTableSample) to the archive table (factTableSample_Archive).

  6. Next we create a new table which we would use to load staging data into:

    create table factTableStage
    with (distribution = replicate,
          partition(part_id range left for values(1,2,3,4,5)))
    as
    select 1 id,
           cast('row6' as varchar(50)) col1,
           5 part_id

This adds a single row into partition 5:


  7. Now let’s switch the stage data into partition 5 of the main fact table and look at the results:

    alter table factTableStage switch partition 5 to factTableSample partition 5;


We see that partition 5 is now populated with the data from the stage table.

To summarise, using partitioning on the PDW is not overly complex. The ALTER TABLE statements used to switch the partitions are very fast. If we had used distributed tables in the above examples it would have had no effect on any of the code we used. Just as with SMP SQL Server, it is important to plan your partitioning strategy in line with the analysis requirements and the distribution of data in the data warehouse. In the next part of the series I will dig into the CTAS (create table as select) statement and show an example of using it to do a merge into a dimension table. I hope you find the above useful; if you want any code examples please feel free to email me at matt@coeo.com and I will get you a copy of the full script.
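As a taster for that next part, here is a minimal, hypothetical sketch of the general CTAS-based upsert pattern (not the actual mega-merge statement, which will be covered properly in part 4).

-- Build a new copy of the dimension: all staging rows plus unchanged existing rows
CREATE TABLE dbo.DimCustomer_New
WITH (DISTRIBUTION = REPLICATE)
AS
SELECT s.CustomerKey, s.CustomerName, s.Segment        -- new and changed rows from staging
FROM dbo.StageCustomer AS s
UNION ALL
SELECT d.CustomerKey, d.CustomerName, d.Segment        -- unchanged rows from the current dimension
FROM dbo.DimCustomer AS d
WHERE NOT EXISTS (SELECT 1 FROM dbo.StageCustomer AS s2 WHERE s2.CustomerKey = d.CustomerKey);

-- Swap the new table in; RENAME OBJECT is a metadata-only operation on PDW
RENAME OBJECT dbo.DimCustomer TO DimCustomer_Old;
RENAME OBJECT dbo.DimCustomer_New TO DimCustomer;
DROP TABLE dbo.DimCustomer_Old;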


Building BI Projects with Agility

Let me introduce you to an interesting story from the foreword of Mike Cohn’s book “Succeeding with Agile: Software Development with Scrum”. In the story, trappers would stop off at the Hudson Bay store to get all their supplies before setting off into the wilderness to make their fortune. However, once they had done their shopping at the supply store they would only go a few miles and then set up camp. The idea was that it was better to identify what they had missed in that initial shop and only have to go back a few miles, rather than being lost in the wilderness without an important piece of equipment. Although most trappers, like most consultants, feel their planning is brilliant, the really good ones prepare in a way that means they don’t end up frozen in the Rockies!

I like this analogy for projects, but does it really resonate with BI? For me it certainly does. On an older BI project we had units of measure (UOMs) to handle in our solution. Applying a Kimball design pattern, I asked the company: are there really only 3 UOMs you want to handle? The reply was a confident yes, with confirmation that they had not reported on more than three in the last 5 years. Great! So based on this confident feedback I built the UOMs into columns in the fact table(s) rather than applying a UOM dimension. I created MDX for time intelligence for each date hierarchy and for each UOM; for three UOMs this wasn’t a hardship.

However, a few months later the customer requested my time for an enhancement request to add another 5 UOMs. Unfortunately, the nature of consultancy “time & materials” means the customer doesn’t want to spend time (and therefore money) on redesigning the core design and including a UOM dimension. So I was left in MDX script hell – not with the usual complexities of writing MDX, but with copy-paste and the longest MDX script in the western world (there must be longer ones in China, with the thousands of characters in their writing system!).

If we had been more agile and delivered an earlier version of this cube (as this was the deliverable), then perhaps the business would have seen how easy things now were with these UOMs and suggested they would like to slice the data by more. We would then have had time to refactor the design and implement our UOM dimension.
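For illustration, here is a minimal, hypothetical T-SQL sketch of the two designs being contrasted (not the actual customer schema): UOM columns baked into the fact table versus a proper UOM dimension.

-- Design 1: one measure column per unit of measure (what was originally built)
CREATE TABLE dbo.FactSales_UomColumns (
    DateKey INT NOT NULL,
    ProductKey INT NOT NULL,
    QuantityCases DECIMAL(18, 4) NULL,
    QuantityKg DECIMAL(18, 4) NULL,
    QuantityLitres DECIMAL(18, 4) NULL    -- every new UOM means a new column plus more MDX
);

-- Design 2: a UOM dimension and a single measure column (the refactor)
CREATE TABLE dbo.DimUnitOfMeasure (
    UomKey INT NOT NULL PRIMARY KEY,
    UomCode VARCHAR(10) NOT NULL,
    UomDescription VARCHAR(50) NOT NULL
);

CREATE TABLE dbo.FactSales_UomDimension (
    DateKey INT NOT NULL,
    ProductKey INT NOT NULL,
    UomKey INT NOT NULL,                  -- a new UOM is now just another dimension row
    Quantity DECIMAL(18, 4) NULL
);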

BUT not all projects fall easily into a completely agile process. For example, did the trappers go to the Hudson store and buy just a sleeping bag (forgive me if these weren’t available in 1890), then head off to camp to check it worked, then buy a kettle on day two, before realising on day three that they needed some tea bags… I don’t think so. Yet many software developers who talk of pure agile say that there is value in delivering a web page with a username text box, no password box and no logon button… I don’t agree, and would argue that while there is value in QA and UAT at that point, there is NO value in that delivery to the business.

With BI projects there may be some value, depending on the business requirements, in delivering a product dimension and letting the business see it; I certainly found this with retailers that could then identify missing attributes and start to design the hierarchies they wanted, or with customers that had a customer dimension for the first time. Of course there was value in applying a new business definition of a unique customer and then seeing that list of customers for the first time.

However, the key to both of these examples is that the biggest business value for these customers was in looking at sales orders over time for those products, or in identifying how customer recency affected the actions of this new list of customers. To deliver the biggest value to the business we have to make sure we have the tent, sleeping bag, cooking stove and, at least, a kettle to boil water – everything needed for a basic camping trip in the Rockies (I am not a camper). For me this is the high-level design of the BI project, and I suggest that to deliver any value to the business the high-level design needs to include everything you know you need at the start of the project. On later camping trips it is likely you will have more experience, the initial design or shopping trip will be more detailed, and there will be fewer trips back to the store.

From that point on, continually delivering value based on business priority/need is achievable. Breaking down complex tasks into work items and grouping these into sprints works really well, but not just as a container for unrelated work items. It is much better for a sprint to focus on a group of items that, together, allow delivery of something that is valuable to the business. To do this you must be interacting constantly with the business. In a recent project we had an excellent MI team who helpfully worked with us to group work items under a set of business headings, all relating to a key area that would add value to the team and therefore the business. We could then focus sprints around these headings and, once deployed, the MI team could start to see value in the BI project.

But the key on a new project is that without that high-level design, and that first big effort of choosing what you need, building it and then heading out into the wilderness, you are simply going camping with just a sleeping bag and will make more trips to the supply store than are really needed. Get good requirements from the right people (not IT), build as comprehensive a design as your experience and the requirements allow, and then deliver the first cut for business review as early as possible. To help with this I often try not to worry about control processes, partitioning strategies, incremental ETL loads etc.; I move those to later stages of the build so we can focus purely on delivering that first cut to the business.

In summary, I do not think that a single “pure” process has been the right fit across all my projects. But being more agile, by using best practices from across the various methodologies, has helped me to be more and more consistent in delivering high business value solutions to my customers. And just as evolving your development processes helps improve delivery, always look to learn from each project to help evolve your way of working – this, for me, is one of the best things being agile teaches us.

 

 

 

PDW Shallow Dive – Part 2 – ETL

Welcome to part two of my shallow dive into the PDW. In this section we take a look at the options available for ETL with the appliance. Here we will cover the following topics:

  1. DWLoader.
  2. SSIS and PDW.

DWLoader

The DWLoader application is a bit like the PDW version of BCP: it is a command line application designed for very fast import and export of logically structured flat files. The tool can be used via a script or the command line to import flat files to the appliance or export data from the appliance to flat files. Example syntax of the DWLoader application:

dwloader.exe -i ${loadfile} -m -fh 1 -M {mode} -b 100000 -rt value -rv 1000 -R ${loadfile}.rejects -E -e ASCII -t , -r \r\n -S $server -U $user -P $password -T {targetdatabase}${table}

Key switches:

  1. -i – the location of one or more source files to load.
  2. -M – specifies the load mode; options are fastappend, append, upsert or reload. The default mode is append. To use fastappend you must specify -m, which is the multi-transaction option. The key difference between append and fastappend is that the latter doesn’t use a temporary table.
  3. -fh – the number of lines to ignore at the beginning of the file (i.e. header rows).
  4. -b – sets the batch size; the default is 10,000 rows.
  5. -rv – specifies either the number of rows or the % of data allowed to fail before halting the load (-rt sets whether this is a value or a %).
  6. -R – the load failure file; if the file already exists it will be overwritten.
  7. -E – converts empty strings to NULL. The default is not to convert.
  8. -e – the character encoding; ASCII is the default but other options are UTF8, UTF16 or UTF16BE.
  9. -t – specifies the field delimiter.
  10. -r – specifies the row delimiter.
  11. -S – sets the target appliance, normally an IP address.
  12. -U and -P – the username and password for the appliance (until the next PDW update PDW only supports SQL security, so this would be your SQL username and password).
  13. -T – the three-part name for the destination table, e.g. dbname.dbo.table1 (dbo is the only schema supported in the current version of PDW).

Once the import or export is started there is not much to see; however, you can track progress through the PDW dashboard.

A common problem with the DWLoader tool is that it isn’t able to work out the file format of the flat file (i.e. Unicode or UTF-8). If you aren’t sure about the file type, open the file in the SSIS Import/Export wizard and it will help you work out the format.

SSIS and PDW

First, let’s remind ourselves where SSIS fits in the PDW world:

The important point the diagram shows is that SSIS runs on the landing zone server, NOT on the PDW directly. This is quite critical because, whereas in traditional ETL solutions you would look to SSIS to manage all of the transformations, when you have a PDW appliance as your database you will want to do that work on the actual appliance. This also limits the memory impact of pulling large data volumes into the SSIS data flow. It is much better to use SSIS as the control flow for the ETL solution and to do the advanced transformations of the data on the PDW.
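As a rough illustration of that split, the sketch below (hypothetical names, not from a real engagement) shows a transformation written as a stored procedure on the PDW using CTAS; the SSIS package would simply call it from an Execute SQL Task in the control flow.

CREATE PROCEDURE dbo.LoadFactSalesFromStage
AS
BEGIN
    -- Do the heavy lifting on the appliance with CTAS rather than in an SSIS data flow
    CREATE TABLE dbo.FactSales_New
    WITH (CLUSTERED COLUMNSTORE INDEX, DISTRIBUTION = HASH(OnlineSalesKey))
    AS
    SELECT s.OnlineSalesKey,
           d.DateKey,
           p.ProductKey,
           s.SalesQuantity * s.UnitPrice AS SalesAmount
    FROM dbo.StageSales AS s
    JOIN dbo.DimDate AS d ON d.DateValue = s.OrderDate
    JOIN dbo.DimProduct AS p ON p.ProductCode = s.ProductCode;

    -- In practice you would then swap the new table in (e.g. with RENAME OBJECT)
END;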

To use SSIS with the PDW you use the same tools and processes as you would against normal SQL Server. Visual Studio with SSDT is used to create an SSIS project. Within the project you can specify a source adapter that will be an OLE DB source pointing to the PDW; however, you will see that SSIS thinks this is a normal SQL Server source because the icons for the tables do not show as distributed or replicated – just tables. This is by design, and it is good practice to build your source query as a stored procedure on the PDW so that you can test it before using it in the SSIS project. The big difference in SSIS is the destination adapter; this is PDW specific and an example is shown in the diagram below:

Using the PDW destination adapter, it is critical to make sure the source columns match the data types of the destination columns; the adapter is not flexible or forgiving like other destination adapters in SSIS (e.g. the SQL Server Destination) and it won’t do any automatic type conversion. Important settings when using the PDW destination adapter include:

  • Connection Manager – you will need to create a new, PDW specific, connection manager:
    • Servername will be the appliance name and port number or the appliance IP and port number.
    • User and password will be your SQL Server PDW login information.
    • Destination database name is the target database on the PDW.
    • Staging database name is the stage database you want to use, if using FastAppend this must be empty.
  • Destination Table – the table you wish to load the data into.
  • Loading Mode – if the loading mode is fast append make sure you un-select “roll-back load on table update or insert failure”
    • By unchecking this box, loads from the staging table to a distributed destination table are performed in parallel across the distributions; this is quick but not transaction safe. It is the same as using the -m switch in DWLoader.
    • Checking the box loads data serially across distributions within each compute node and in parallel across the compute nodes. This is slower but is transaction safe.
  • Finally you can map your columns from input to destination – HOWEVER MAKE SURE YOU MATCH DATA TYPES!

Below is an example of the package running to import a simple file:

You can then follow the performance of the load through the PDW dashboard just as with the DWLoader:

In summary, when dealing with ETL on the PDW, push as much of the data transformation to the PDW as possible. Use tools not normally available in SQL Server or SSIS, such as CTAS (Create Table As Select), and try to limit the use of the UPDATE statement. Understanding that SSIS works by pulling data from the source adapter into memory on the SSIS server and processing it there is critical to managing the memory use and performance of SSIS.

My suggestion for using SSIS with PDW is to use it for the control flow and for any basic data preparation for types not supported on the appliance, for example the XML data type. Then use the power of the PDW to do the advanced transformations. In a recent POC for a customer we were loading around 300GB of data in less than one hour using DWLoader and then transforming 2 million XML documents into a relational data warehouse using SSIS and PDW stored procedures. We did have an issue with the lack of MERGE statement support in PDW, so to upsert into a dimension table we created a “mega-merge” statement using CTAS. In Part 3 of this series I will cover how that worked and also how partitions are managed on the PDW.

If you have any comments or questions please feel free to post them here for me or tweet me at @mattasimpson