Gartner BI & Analytics Conference – Modern Architecture

In an excellent session this afternoon, we were taken on the journey towards a best-practice implementation of BI/Analytics architecture. For the last two years Gartner have recommended that the three key areas are the Information Portal, the Analytics Workbench and the Data Science Laboratory, highlighted in the rather shaky picture below:

[Image: three-tier BI architecture]

Each tier offers different benefits to the business, which can be summarised by the key roles and processes that a modern architecture needs to accommodate. These are shown in the image below:

[Image: modern architecture roles and processes]

Finally, by following this you could end up with the holy grail, if this is indeed your business's holy grail:

[Image: modern architecture overview]

It was interesting in this session how they talked about vendors and again reiterated that no single tool or vendor can deliver this whole picture. However, they also pointed out that the magic quadrants should not be looked at in isolation: niche and smaller vendors that sit in the lower left, or do not quite make the board at all, may well suit your business's needs very well. Understanding your business needs, values and potential outcomes AGAIN seems the ultimate place to “bet your house” when it comes to delivering a successful BI/Analytics program. For businesses just venturing out into BI, I would also quote Neil Chandler: this will be a core competency of your business going forward and will require “indefinite investment”. Don't let this scare the financiers, but do make sure people realise that delivering the information portal, or a specific data discovery tool, is not the end of the BI/Analytics journey. I much prefer the comments of Bibby's Lead Architect, Richard Smith, which have our internal program focussed on creating “an enduring BI capability”; to be enduring, you must react to change, both within the business and within the marketplace.

Gartner BI & Analytics Summit London – Day 1

What a great day: very exciting, and lots learnt from industry experts, with very little bias towards any one vendor or specific technology. This post is the first of two in which I attempt to mind-dump some of the key takeaways I have picked up and found useful here at the conference.

BICC – Business Intelligence Competency Centre

Is dead… long live the ACE – Analytics Community of Excellence. In the keynote, Neil Chandler suggested four things wrong with the BICC: business, intelligence, competency and centre! Although it is still the difference between successful and non-successful BI programs, it does not encompass enough of the modern world of BI and Analytics. The key aim of the approach is to drive BI programs from business outcomes, but mostly it fails at this and becomes an efficiency drive, not linked to delivering actual business value. It also does not encompass a new wave of change in the business, self-service, which is near impossible to manage centrally. Finally, it has not been built around the new future of analytical applications, which are focussed on algorithms and a scientific approach to the running of businesses, and our lives!

Whilst there is a lot in words, the evolution from BICC to ACE is not just about words; it is about evolving and improving a good concept and bringing it up to date and in line with business needs. Analytics now embraces BI and gives us a larger maturity scale for businesses in our ever-changing world. Community takes away the need for central control and helps with the self-service that is already in place. Finally, Excellence is about striving towards something we perceive as the ultimate goal, not just a list of competencies based around technology.

One thing that hasn't changed, and cannot be ignored, is that you MUST FOCUS on BUSINESS OUTCOMES to achieve excellence in your BI or Analytics program.

Algorithms are KEY

A key theme of the keynote centred on algorithms and their use in BI and Analytics, as well as in day-to-day life. Guessing which classical piece of music was generated by a computer and which by Bach highlighted how far compute power and science have progressed, and there is a real feeling from Gartner that, through the use of algorithms and the tools that support them, we can automate, improve and gain valuable insight into our businesses. Algorithms are used all over your business today; take some time to document them and look for tooling to support their automation and improvement. Citizen data science communities may spring up around this.

Other notes from the keynote:

  • IoT algorithms are set to generate $15 billion by 2018
  • Over half of organisations will be leveraging algorithms by 2018
  • By 2020 50% of Analytics leaders will be able to link their programs to real business value.
  • The best analytics leaders can formulate new questions as well as answer existing business ones. They also fail, learn and push the envelope; I would add that they fail fast, as the time for multi-year BI programs has gone.
  • Data Management and Data Integration tools are converging, but not as fast as you may imagine.
  • BI and Analytics tools are also converging, but it was noted that NO single vendor can support all BI and Analytics needs, and the buyers of each are currently different.
  • Quadrant analysts highlight IBM’s Watson Analytics as a good example of an analytical application.
  • Microsoft’s Power BI v1 (that horrible O365/SharePoint Online-linked tool) failed, but v2 has gained traction and is having a negative impact on Tableau’s performance.

Still a lot of learning to go, even on day 1 but I wanted to share this for those not fortunate enough to attend this amazing event!

 

Integrating Hadoop and the Data Warehouse

The objectives of any data warehouse should include:

  1. Identifying all possible data assets
  2. Selecting the assets that have actionable content and are accessible
  3. Modelling the assets into a high-performance data model
  4. Exposing the data assets that are most effective for decision making

New data assets are now available that may meet some of the above criteria but are difficult, or impossible, to manage using RDBMS technology. Examples of these are:

  1. Unstructured, semi-structured or machine structured data
  2. Evolving schemas, just in time schemas
  3. Links, Images, Genomes, Geo-Positions, Log Data

These data assets can be described as Big Data and this blog looks at Big Data stored in a Hadoop cluster.

In very few words, Hadoop is an open-source distributed storage and processing framework. There are a number of different software vendor implementations of Hadoop, which should be investigated depending on your requirements.

Figure 1 highlights the key differences, and similarities, between relational database management systems (RDBMS) and Hadoop.

Figure 1 – Differences between RDBMS and Hadoop

The three layers that can be used to describe both systems are Storage, Metadata and Query. In a typical RDBMS, these layers are “glued” together by the overall application, for example SQL Server or Oracle. In Hadoop, however, these layers work independently, allowing multiple forms of access to each layer and therefore highly scalable performance.

Exploring Data between the Data Warehouse and Hadoop Cluster

Often there is an unknown quality or value in the Hadoop data. To start to identify value, or to explore the possibility of gaining new insight from the Hadoop data, it is useful to be able to query the data directly and alongside the existing data warehouse. Querying by conformed dimensions, for example, is extremely powerful, as it lets you slice loosely structured Hadoop data by well-governed dimension data.
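
As a rough sketch of what “querying by conformed dimensions” means in practice, the SQL below joins raw Hadoop event data to a governed customer dimension. All table and column names are hypothetical, and the query assumes both tables are reachable from a single SQL engine via one of the two options described below.

    -- Illustrative only: slice loosely structured Hadoop data by a
    -- well-governed, conformed warehouse dimension (names are hypothetical)
    SELECT d.CustomerSegment,
           d.Region,
           COUNT(*) AS EventCount
    FROM   hadoop_web_events AS e        -- raw data surfaced from the Hadoop cluster
    JOIN   dbo.DimCustomer   AS d        -- conformed dimension from the warehouse
           ON e.customer_id = d.CustomerKey
    GROUP BY d.CustomerSegment, d.Region
    ORDER BY EventCount DESC;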

This “exploration” can be relatively slow compared with querying Hadoop directly through Hive or Impala, or with queries against a dimensionally modelled data warehouse. However, it gives us an opportunity to explore data before we worry about building an ETL process to extract, transform and load the data into our ultimate data warehouse.

To do this exploration there are two main options:

Option 1 – Mash Ups

By leveraging tools such as Power BI (Power Query and Power Pivot) or Alteryx Designer, you are able to bring together data from a Hadoop cluster and an RDBMS data warehouse. The data can be modelled and calculations added. Finally, the data can be queried to start to identify possible insights.

Option 2 – Direct Querying

There are some technologies, such as Microsoft Polybase or Teradata QueryGrid, that allow you to use the SQL query language to add temporary structure to Hadoop data and join it to data warehouse data. My hope for Microsoft is that Polybase is brought from the MPP appliance, APS, into SMP SQL Server in its next release. This technology is perfect for people not wishing to learn Java, Python, Sqoop and Linux.
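
To give a feel for this, here is a minimal T-SQL sketch in the style of Polybase on APS/SQL Server. The data source, file format, table and column names are all hypothetical, and the exact syntax will depend on your version.

    -- Register the Hadoop cluster as an external data source (hypothetical names)
    CREATE EXTERNAL DATA SOURCE HadoopCluster
    WITH (TYPE = HADOOP, LOCATION = 'hdfs://namenode:8020');

    -- Describe how the raw files are delimited
    CREATE EXTERNAL FILE FORMAT CsvFormat
    WITH (FORMAT_TYPE = DELIMITEDTEXT,
          FORMAT_OPTIONS (FIELD_TERMINATOR = ','));

    -- Add temporary structure over the files sitting in Hadoop
    CREATE EXTERNAL TABLE dbo.ext_WebClicks
    (
        CustomerKey INT,
        ClickDate   DATE,
        Url         NVARCHAR(400)
    )
    WITH (LOCATION = '/data/clickstream/',
          DATA_SOURCE = HadoopCluster,
          FILE_FORMAT = CsvFormat);

    -- Query the Hadoop data alongside a conformed warehouse dimension
    SELECT d.CalendarMonth, COUNT(*) AS Clicks
    FROM dbo.ext_WebClicks AS f
    JOIN dbo.DimDate       AS d ON f.ClickDate = d.FullDate
    GROUP BY d.CalendarMonth;

The appeal is clear: everything above is plain SQL, with no Java, Python, Sqoop or Linux in sight.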

Extending the Data Warehouse

The exploration options above are useful but limited. Performance will be constrained either by the Hadoop cluster and the lack of structure on the data, or by the RDBMS data warehouse. If the exploration reveals insight, the next logical step will be to bring the useful data together into a single data warehouse.

Initially you may wish to use existing ETL tools, such as SSIS or Information Builders, or go directly to what these tools often leverage underneath, which is Sqoop. This will allow you to bring data across from the Hadoop cluster, and you can then use Pig, for example, to transform the data into a dimensional model in your existing RDBMS data warehouse, benefiting from the proven performance of a dimensional model. I refer to this data as your “known unknowns”.
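
The extraction itself is typically Sqoop's job; as a hedged illustration of the subsequent transform step, written here in T-SQL rather than Pig, the sketch below loads a staging table (already populated from the Hadoop cluster) into a fact table by resolving surrogate keys against conformed dimensions. All object names are hypothetical.

    -- Hypothetical example: staging.WebClicks has been landed from Hadoop
    -- (e.g. via Sqoop or SSIS); resolve surrogate keys and load the fact table.
    INSERT INTO dbo.FactWebClicks (DateKey, CustomerKey, Url, ClickCount)
    SELECT d.DateKey,
           ISNULL(c.CustomerKey, -1) AS CustomerKey,    -- -1 = unknown member
           s.Url,
           COUNT(*) AS ClickCount
    FROM staging.WebClicks     AS s
    JOIN dbo.DimDate           AS d ON d.FullDate   = s.ClickDate
    LEFT JOIN dbo.DimCustomer  AS c ON c.CustomerId = s.CustomerId
    GROUP BY d.DateKey, ISNULL(c.CustomerKey, -1), s.Url;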

Secondly, you may wish to move your data warehouse or, more often, create your new data warehouse in Hadoop. This can be a sensible option when you compare the performance of the Hadoop architecture with standard RDBMS architecture. You can still leverage your SQL skills, using tools such as Hive or Impala to analyse the data. To further improve performance, you can add some semi-permanent structure to the data using Parquet, a file format that uses columnar storage methods similar to existing in-memory columnar engines such as VertiPaq. This allows you to apply dimensional modelling techniques to the data and benefit from conformed dimensions, for example.
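
As a rough HiveQL sketch of that idea (the tables and schema are hypothetical), a raw table can be rewritten into a Parquet-backed, dimensionally modelled fact and then queried with familiar SQL through Hive or Impala:

    -- HiveQL sketch: persist raw data as a Parquet-backed fact table
    CREATE TABLE fact_web_clicks
    STORED AS PARQUET
    AS
    SELECT customer_key,
           click_date,
           url
    FROM   raw_web_clicks;               -- loosely structured source data in Hadoop

    -- Join to a conformed dimension that has also been landed in the cluster
    SELECT d.region, COUNT(*) AS clicks
    FROM   fact_web_clicks AS f
    JOIN   dim_customer    AS d ON f.customer_key = d.customer_key
    GROUP BY d.region;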

In Summary

Ultimately, we should not ignore Big Data and Hadoop. The “Internet of Things” alone will mean the volume, variety and velocity of data available to our businesses will stretch traditional RDBMS data warehouses to the maximum. Will they cope? Do existing techniques, such as dimensional modelling, still work? The answer is probably yes, to both. Dr Ralph Kimball, in his webinar series with Cloudera last year, likened it to XML data when it first arrived: it was tough to manage, and it took RDBMS vendors ten years to integrate XML into their applications. However, why wait? With the tools mentioned in the exploration section, and there are many more, you can easily investigate Big Data and mix it up with your existing data warehouse. As BI professionals, the more value we can add to the business, the easier it will be to secure investment in better hardware, more storage and more advanced tools.

References and Useful Links:

Cloudera and Ralph Kimball: http://cloudera.com/content/cloudera/en/resources/library/recordedwebinar/building-a-hadoop-data-warehouse-video.html

SSIS and Hadoop: http://sqlmag.com/blog/use-ssis-etl-hadoop

Power Query and Hadoop: http://msbiacademy.com/?p=6641

Microsoft Polybase: http://blogs.technet.com/b/dataplatforminsider/archive/2014/04/30/change-the-game-with-aps-and-polybase.aspx

Teradata and Hadoop: http://www.teradata.co.uk/Teradata-Portfolio-for-Hadoop/?LangType=2057&LangSelect=true

Introduction to Flume and Sqoop: http://www.guru99.com/introduction-to-flume-and-sqoop.html

Parquet (Hadoop): http://parquet.incubator.apache.org/