Microsoft BI – 2015 Highlights

It’s been a great year for BI! Power BI has come of age, there have been exciting SQL Server 2016 CTP releases, and the cloud has matured for analytics, data science and big data.

For me, Power BI is the biggest news of 2015. POCs run in H1 of 2015 found it wanting: basic functionality was missing, and the confusion of wrapping it in Office 365 made it too much for businesses to consider. However, with the GA release and the numerous updates since, it has finally delivered on its vision and given Microsoft an end-to-end, enterprise BI solution for the first time in its history; including multidimensional connectivity!

Microsoft also made some great tactical manoeuvres, including the purchase of Datazen and Revolution R, as well as their excellent Data Culture series. Datazen is a good tool in its own right, with great dashboard creation capability and impressive mobile delivery on all devices and platforms. It will integrate nicely with SSRS to deliver a modern mobile reporting experience in SQL 2016. R is the buzz of 2015: a great statistical analysis tool that will really enhance SQL Server as the platform of choice for analytics as well as RDBMS workloads. In fact, you can already leverage its capability in Power BI today!

Cloud. So Microsoft finally realised that trying to drag businesses into the cloud was not the correct strategy; a hybrid approach is what is required. Give businesses the best of both worlds: allow them to benefit from their existing investments but “burst” into the cloud either for scale or for new, as yet untested, capability. SQL 2014’s ability to store some data files in Azure, perhaps old data kept purely for compliance, is a great example of this. ExpressRoute’s ability to offer a fast way to connect on-premises with the cloud is brilliant. Or go and experiment with Machine Learning, made simple by Microsoft’s Azure offering.

For me, I was also excited to see the PDW hit the cloud with Azure SQL Data Warehouse. An MPP platform is the closest my customers have needed to get to Big Data, but the initial outlay of circa half a million quid was a bit steep. With the cloud offering, companies get all the benefits with a minimal investment and a near-infinite ability to scale. But do consider the speed of making data available, as it could be limited by Internet connections.

So, in summary, an awesome year for Microsoft BI with the future looking great! I still feel Microsoft lacks SSAS in the cloud, but perhaps Power BI will gain that scale in 2016. Overall I envisage Microsoft featuring as a strong leader in the next Gartner quadrant release for BI, and I can’t wait for SQL 2016’s full release!

The future (2016 at least) is bright, the future is hybrid cloud…

[Image: MS BI Current World]

Integrating Hadoop and the Data Warehouse

The objectives of any data warehouse should include:

  1. Identifying all possible data assets
  2. Selecting the assets that have actionable content and are accessible
  3. Modelling the assets into a high-performance data model
  4. Exposing the data assets that are most effective for decision making

New data assets are now available that may meet some of the above criteria but are difficult, or impossible, to manage using RDBMS technology. Examples of these are:

  1. Unstructured, semi-structured or machine structured data
  2. Evolving schemas, just in time schemas
  3. Links, Images, Genomes, Geo-Positions, Log Data

These data assets can be described as Big Data and this blog looks at Big Data stored in a Hadoop cluster.

In very few words, Hadoop is an open source distributed storage and processing framework. There are a number of different software vendor implementations of Hadoop, and which one suits you best should be investigated depending on your requirements.

Figure 1 highlights the key differences, and similarities, between relational database management systems (RDBMS) and Hadoop.

Figure 1 – Differences between RDBMS and Hadoop

The three layers that can be used to describe both systems are Storage, Metadata and Query. In a typical RDBMS, these layers are “glued” together by the overall application, for example SQL Server or Oracle. In Hadoop, however, the layers work independently, allowing multiple tools to access each layer; this is what gives Hadoop its super-scalable performance.
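
To make that independence concrete, here is a minimal Hive sketch (the table name, columns and HDFS path are illustrative assumptions): the CREATE statement only writes metadata over files that already sit in the storage layer, and other query engines can keep reading the same files.

-- HiveQL sketch: schema-on-read over files already in HDFS
CREATE EXTERNAL TABLE web_clicks (
    click_time STRING,
    user_id    STRING,
    url        STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/raw/web_clicks';
-- Dropping this table removes only the metadata; the underlying files stay
-- in HDFS and remain available to Impala, Pig or MapReduce jobs.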

Exploring Data between the Data Warehouse and Hadoop Cluster

Often the quality or value of the Hadoop data is unknown. To start to identify value, or to explore the possibility of gaining new insight from the Hadoop data, it is useful to be able to query it directly and alongside the existing data warehouse. Querying by conformed dimensions, for example, is extremely powerful: it lets you slice Hadoop data by well-governed dimension data.

This “exploration” can be relatively slow, compared to simply querying Hadoop with Hive or Impala directly, or by queries against a dimensional modelled data warehouse. However, this gives us an opportunity to explore data before we worry about leveraging an ETL process to extract, transform and load the data into our ultimate data warehouse.

To do this exploration there are two main options:

Option 1 – Mash Ups

By leveraging tools such as Power BI (Power Query and Power Pivot) or Alteryx Designer, you are able to bring together data from a Hadoop cluster and an RDBMS data warehouse. The data can be modelled and calculations added. Finally, the data can be queried to start to identify possible insights.

Option 2 – Direct Querying

There are some technologies, such as Microsoft Polybase or Teradata QueryGrid, that allow you to use the SQL query language to add temporary structure to Hadoop data and join it to data warehouse data. My hope is that Microsoft brings Polybase from the MPP appliance, APS, into SMP SQL Server in its next release. This technology is perfect for people not wishing to learn Java, Python, Sqoop and Linux.
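
As a rough sketch of what direct querying can look like (this follows the Polybase syntax documented for SQL Server 2016 and later APS updates; the HDFS address, file layout and all object names are illustrative assumptions), you declare an external table over the Hadoop files and then join it straight to a conformed dimension in the warehouse:

-- Register the Hadoop cluster and describe the file layout (illustrative values)
CREATE EXTERNAL DATA SOURCE HadoopCluster
WITH (TYPE = HADOOP, LOCATION = 'hdfs://10.0.0.10:8020');

CREATE EXTERNAL FILE FORMAT PipeDelimitedText
WITH (FORMAT_TYPE = DELIMITEDTEXT, FORMAT_OPTIONS (FIELD_TERMINATOR = '|'));

-- Temporary structure over the raw files; no data is copied into the warehouse
CREATE EXTERNAL TABLE dbo.WebClicks_Ext (
    ClickDate   DATE,
    CustomerKey INT,
    Url         VARCHAR(500)
)
WITH (LOCATION = '/data/web_clicks/', DATA_SOURCE = HadoopCluster, FILE_FORMAT = PipeDelimitedText);

-- Explore the Hadoop data through a well-governed conformed dimension
SELECT d.[CustomerSegment], COUNT(*) AS Clicks
FROM dbo.WebClicks_Ext e
JOIN dbo.DimCustomer d ON d.[CustomerKey] = e.[CustomerKey]
GROUP BY d.[CustomerSegment];

The appeal is that the whole exploration stays in T-SQL, so no Java or MapReduce knowledge is needed.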

Extending the Data Warehouse

The exploration options above are useful but limited: performance will be constrained either by the Hadoop cluster and the lack of structure on the data, or by the RDBMS data warehouse. If the exploration reveals insight, the next logical step is to bring the useful data together into a single data warehouse.

Initially you may wish to use existing ETL tools, such as SSIS or Information Builders, or go directly to what these tools often leverage under the covers: Sqoop. This will allow you to bring data out of the Hadoop cluster, and you can then use Pig, for example, to transform the data into a dimensional model in your existing RDBMS data warehouse. This allows you to benefit from the proven performance of a dimensional model. I refer to this data as your “known unknowns”.

Secondly, you may wish to move your data warehouse or, more often, create your new data warehouse in Hadoop. This can be a sensible option when you compare the performance of the Hadoop architecture with a standard RDBMS architecture. You can also still leverage your SQL skills, using tools such as Hive or Impala, to analyse the data. However, to further improve performance, you can add some semi-permanent structure to the data using Parquet. Parquet is a file format that uses columnar storage methods similar to existing in-memory columnar engines such as VertiPaq. This allows you to apply dimensional modelling techniques to the data and benefit from conformed dimensions, for example.
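
As a minimal HiveQL sketch of that last step (the table and column names are assumptions, and exact syntax varies by Hive version), a raw text-backed table can be rewritten into a Parquet-backed fact table in a single statement:

-- Rewrite a raw table into columnar Parquet storage for faster analytics
CREATE TABLE fact_web_clicks
STORED AS PARQUET
AS
SELECT click_date, customer_key, url
FROM   web_clicks_raw;

From there, dimension tables can be built the same way and conformed keys used much as they would be in an RDBMS warehouse.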

In Summary

Ultimately, we should not ignore Big Data and Hadoop. The “Internet of Things” alone will mean the volume, variety and velocity of data available to our businesses will stretch traditional RDBMS data warehouses to the maximum. Will they cope? Do existing techniques, such as dimensional modelling, still work? The answer is probably yes, to both. Dr Ralph Kimball, in his webinar series with Cloudera last year, likened it to XML data when it first arrived: it was tough to manage, and it took RDBMS vendors ten years to integrate XML into their applications. However, why wait? With the tools mentioned in the exploration section, and there are many more, you can easily investigate Big Data and mix it up with your existing data warehouse. As BI professionals, the more value we add to the business, the easier it becomes to justify investment in better hardware, more storage and advanced tools.

References and Useful Links:

Cloudera and Ralph Kimball: http://cloudera.com/content/cloudera/en/resources/library/recordedwebinar/building-a-hadoop-data-warehouse-video.html

SSIS and Hadoop: http://sqlmag.com/blog/use-ssis-etl-hadoop

Power Query and Hadoop: http://msbiacademy.com/?p=6641

Microsoft Polybase: http://blogs.technet.com/b/dataplatforminsider/archive/2014/04/30/change-the-game-with-aps-and-polybase.aspx

Teradata and Hadoop: http://www.teradata.co.uk/Teradata-Portfolio-for-Hadoop/?LangType=2057&LangSelect=true

Introduction to Flume and Sqoop: http://www.guru99.com/introduction-to-flume-and-sqoop.html

Parquet (Hadoop): http://parquet.incubator.apache.org/

APS (PDW) – Extracting Load and Query Stats

Hi, this is a short blog post that may be useful to users of the PDW who want a full list of load statistics and query statistics. The main place to get statistics is the PDW dashboard, but in a lot of cases this is not enough. It is even worse if best practice has not been followed and labels are not used for queries; then the dashboard becomes less useful than a chocolate teapot.
So, in order to extract load information from the APS, the following query is rather useful. Note that this will only pull back information on backups, restores and loads; if you loaded data using “insert into”, for example, that information would not show in the results.

SELECT
    r.[run_id], r.[name], r.[submit_time], r.[start_time], r.[end_time],
    r.[total_elapsed_time], r.[operation_type], r.[mode], r.[database_name],
    r.[table_name], l.[name], r.[session_id], r.[request_id], r.[status], r.[progress],
    CASE WHEN r.[command] IS NULL THEN q.[command] ELSE r.[command] END AS [command],
    r.[rows_processed], r.[rows_rejected], r.[rows_inserted]
FROM sys.pdw_loader_backup_runs r
JOIN sys.sql_logins l ON r.principal_id = l.principal_id
LEFT OUTER JOIN sys.dm_pdw_exec_requests q ON r.[request_id] = q.[request_id]
WHERE r.[operation_type] = 'LOAD'
--AND l.[name] = 'someusername'
ORDER BY
    CASE UPPER(r.[status]) WHEN 'RUNNING' THEN 0 WHEN 'QUEUED' THEN 1 ELSE 2 END ASC,
    ISNULL(r.[submit_time], SYSDATETIME()) DESC
OPTION (label = 'Rd_DataLoads')

Note that in the preceding statement a line is commented out. This line can be used to find loads completed by a specific user. The DMV for loads, sys.pdw_loader_backup_runs, stores all loads over time and persists after a region restart. Again, best practice should be in place so that users log into the PDW with their own user (or Windows auth if possible), NOT sa! Finally, note the use of labels:

OPTION (label = 'some comment in here')
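
Once a query or load is labelled it can be tracked down directly in the request DMV; a minimal sketch using the label from the load query above:

SELECT [request_id], [status], [submit_time], [total_elapsed_time], [command]
FROM sys.dm_pdw_exec_requests
WHERE [label] = 'Rd_DataLoads'
ORDER BY [submit_time] DESC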

Labels can then be used either in queries like the ones on this page or through the dashboard, where there is a column for labels. I would recommend your team apply some naming conventions or standards for labelling. The next query below is useful for pulling back a list of all the queries that have been executed on the APS:

SELECT
    q.[request_id], q.[status], q.[submit_time] AS start_time, q.[end_time],
    q.[total_elapsed_time], q.[command], q.[error_id], q.[session_id],
    s.[login_name], q.[label]
FROM sys.dm_pdw_exec_requests q
INNER JOIN sys.dm_pdw_exec_sessions s ON s.[session_id] = q.[session_id]
WHERE LEFT(s.client_id, 9) <> '127.0.0.1'
ORDER BY [start_time] DESC
OPTION (label = 'Rd_Query_history')

This will give you a list of all queries performed by all users of the APS. However, the DMV used, sys.dm_pdw_exec_requests, only stores up to 10,000 rows, so depending on the usage of the APS the query above may only give you a very recent snapshot of query performance. My recommendation for both of the above queries would be to set up a SQL Agent job on the loading server to extract these stats from the PDW into a dedicated stats database on the loading server. You could then use SSRS, or any other tool, to do some proactive monitoring of large or long-running loads and queries. At worst, you have a nice log of data over time should you start to get feedback about degrading performance.
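
As a minimal sketch of that approach (the linked server name APS_PROD, the ApsStats database and the QueryHistory table are all assumptions, and the history table would need creating up front with matching columns), the SQL Agent job step could simply append the latest DMV contents to the local history table:

-- Illustrative only: assumes a linked server APS_PROD pointing at the appliance
INSERT INTO ApsStats.dbo.QueryHistory
SELECT src.*
FROM OPENQUERY(APS_PROD,
    'SELECT q.[request_id], q.[status], q.[submit_time], q.[end_time],
            q.[total_elapsed_time], q.[command], q.[error_id],
            q.[session_id], s.[login_name], q.[label]
     FROM sys.dm_pdw_exec_requests q
     JOIN sys.dm_pdw_exec_sessions s ON s.[session_id] = q.[session_id]') AS src
WHERE src.[request_id] NOT IN (SELECT [request_id] FROM ApsStats.dbo.QueryHistory)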

PDW to APS – Market Transition?

So, on 15th April 2014 Microsoft announced, albeit quietly, that the PDW, Parallel Data Warehouse, was being renamed to APS, the Analytics Platform System. For me it was sad, as I really liked the name PDW; it rolled off the tongue and, in true Ronseal style, did what it said on the tin. However, I can fully understand the re-branding, as the market for BI is changing. I do not feel that BI is dead and Big Data is its replacement, but I do feel there is a change in requirements from the business, and traditional BI alone doesn’t meet these new requirements. These newer requirements include the ability to get data insight faster, regularly adding new data sources, and the ability to link insights from the warehouse to big data (unstructured data sources). There is also a need from the business for proper analytics; yes, Excel is still key, but customers need more. Self-service BI is required so analysts can do their job of analysing and presenting insight rather than spending 90% of their time preparing data. Mobile BI is needed for delivering those important dashboards and reports into the hands of the decision makers, who, unfortunately, use iPads!

There is an argument that the current Microsoft BI stack already caters for this. Apply SMP SQL Server in its 2012/14 guise (with columnstore and partitioning), with SSAS (cubes, tabular models), Power BI (with advanced visualisations and HTML5 support, sort of), and an agile development approach to the project, et voilà! And for a lot of customers this is great; in fact I think this is the most complete all-round Microsoft BI stack since the birth of SSAS. But for others it doesn’t offer the ability to manage VERY large or wide data sets, or the confidence that it can cope with expected growth and acquisitions. Hardware costs spiral, and if you have to go down the scale-out route the licensing costs also start to make it prohibitive. Then for big data you are looking at a second solution: Hadoop, HDInsight or Cloudera, for example. Bringing both sources together can still be achieved by using Power Query and Power Pivot, and you could then productionise this using Tabular models; however, that is still a stretch and needs you to add at least some structure to the big data side.

To help with this, Microsoft released AU1 for the PDW, but realised that with this release the PDW was no longer just a parallel data warehouse; it was more than this and, in line with the market movements, it was about general analytics, not just data warehousing. Being an appliance, it is really a platform: a high-performance, scalable analytics platform. Hence the not-so-nice acronym APS. But, like Ronseal, it still does what it says on the tin!

Features added in APS v1 (PDW AU1):
- HDInsight, inside the appliance
- Polybase V2
  - linking to the HDInsight area of the appliance (using push down)
  - HDInsight in Azure (no push down)
- Integrated user authentication through Active Directory
- Transparent data encryption
- Seamless scale-out capabilities