Ian Abramson's Data and Database Blog: May 2016

Thursday, May 19, 2016

Innovative Thinking in the Age of Big Data and IoT

The recent Collaborate16 conference for Oracle professionals covered many of the issues and challenges we face today, and how innovation is changing the way we create and use information. We live in the Age of Innovation. The change we have seen in the past 10 years has been unprecedented in our history. These changes have reached into our everyday lives as we have moved from the Printed Age to the Digital Age. The change has not been gradual; it has been revolutionary. We as information professionals see this change every day and need to adapt to the changes in technology.

Disruptive technologies are not new. The car was once seen as a disruptive form of transportation; it ultimately put an end to the horse-drawn carriage and in many ways changed how we saw trains. In the information area we saw email supplant traditional postal mail. In the past, people sent thoughtful and detailed letters to each other, and the updates they carried arrived days or weeks after the events occurred. Email changed the way we communicate. Today we can communicate our thoughts and feelings via email almost instantly, which has caused the postal service to rethink the products and services it offers. Even email has been affected by disruptive technologies such as text messaging and apps like Snapchat and Instagram.

Consider the message we got in the movie "The Graduate", in which a smug Los Angeles businessman takes aside the young Dustin Hoffman and declares, "I just want to say one word to you -- just one word -- 'plastics.'" If the same movie were produced today, the line might be: "I just want to say two words to you... just two words... 3D printing." We have moved from the use of plastics by industry to the point where individuals can now control the plastic. We can use 3D printing to produce machines, artwork and even a complete house. There is no limit to what can be done.

A recent survey from TechPro Research identified the drivers for innovation. The list below illustrates their results:

http://zdnet1.cbsistatic.com/hub/i/2015/05/21/1035bb67-e148-4bff-8caa-4dbcc6f9d020/6ecdca0b07bfd36c0803e78a3f58d86c/innovation-tech-vendors-3.png

http://www.zdnet.com/article/it-innovations-four-horsemen-google-apple-amazon-microsoft/

The survey shows us that Cloud has been the most disruptive technology in spurring innovation. I find this interesting, as I expected the emergence of Big Data, or data as a foundation of life, to be the most disruptive, but the numbers show us that Cloud has had a significantly greater impact on innovation.

How is it that the Cloud has made such an impact on innovation? Although it should be obvious, the Cloud has allowed us to create solutions that are not encumbered by our own local limitations. In the Cloud we can deploy solutions using new technologies and see if they work, with limited cost and risk exposure. We now have an environment whose elasticity lets us store more data than we previously considered possible. Beyond supporting the quick deployment of solutions, the Cloud also provides a storage area that can supply analysts with data when they need it, at a cost that can be very competitive.

The leadership at Oracle has clearly shown that Cloud is its future as well. Larry Ellison, Oracle’s executive chairman and CTO, stated, “We are in the middle of a generational shift to the cloud, it is no less important than the move to personal computing”. Oracle has delivered on this vision by creating a cloud environment where you can get a database, deploy a Hadoop cluster or run Oracle’s E-Business Suite. The transformation has been quick, and what Oracle now offers makes clear where its focus lies. The question is whether customers will be willing to move to a SaaS and PaaS approach and away from in-house software and application management.

Innovation comes in many forms, but the one thing innovation always brings is change. We must remember that although innovation is meant to provide benefits, not all innovations will succeed. Regardless of whether any single idea succeeds, we must keep looking at new ideas to allow our organizations to evolve.

Thursday, May 12, 2016

Bimodal Business Intelligence – The Next Evolution

Recently the talk in the Business Intelligence and IT world has been about bimodal IT and bimodal BI. The concept of bimodal IT is that there are two kinds of service that need to be provided to information producers and consumers. The first is, of course, Mode 1. This mode is the set of traditional methods and approaches which we have all used in IT and BI. It is formal and well-defined. It is a system which is reliable, and the information within it is well-governed. Mode 1 would comprise your enterprise applications as well as data warehouses and formal reporting. The challenge with Mode 1 is that it can take months or years to complete before the full value can be realized.

Mode 2 is more aligned with an Agile approach to development and is geared towards the users rather than IT. Mode 2 looks to address needs as they are raised. The introduction of a more self-service approach means that data consumers can find what they need when they need it, and do not have to wait for a formal report to be created. Users have specialized tools available to them which support this Agile approach by being intuitive and easy to navigate for less technical users. Because Mode 2 is focused on the user and not on IT, governance must be incorporated into all processes to verify that data is being used appropriately. As Gartner put it, bimodal IT is like comparing a marathon runner to a sprinter.

Why does this matter to us? In the data-based world we live in, we have the ability to leverage information which previously had been difficult to access. Unless the IT department determined that there was some value in the data, it might not be included in the formal data stores. This is the foundation of data warehouses and of descriptive and diagnostic reporting, and we accept that in this environment speed to analysis is not the top priority. For data analysts in companies that need to react quickly, this approach struggles, and this is where Mode 2 comes into play. Mode 2 aligns with the need to answer almost random questions or simulate new scenarios which previously had not been defined. The ability to react quickly to change, and not simply to produce a product but to build the right product, is where Mode 2 helps us the most. I consider Mode 2 to be an Agile approach to information collection, analysis and consumption. It puts the power of analysis in the hands of the business users with little or no support from IT. They can use analytic tools built to appeal to their needs, with a robust interface for producing impactful reports and dashboards. The leading tools which have emerged in this area include Tableau, Qlik and Spotfire, and more recently Oracle has joined them with a Cloud offering named Oracle Data Visualization (DV). All of these tools align with the business psyche: they let users ask questions as they arise and address immediate and changing needs through a simple and powerful interface for reporting and visualizations.

At the conference I had the opportunity to see Oracle’s Data Visualization product and attend sessions about how this was Oracle’s answer to Tableau; some even said it would be a Tableau killer. I am not convinced that this is true today, but we need to consider that these are the early days for DV and it does have some compelling features, such as its compatibility with OBIEE. Oracle’s Data Visualization is cloud-based and provides very similar functionality. The product allows for analysis of data from multiple data sources, from Hive and Impala to Oracle, Redshift and Excel, and brings the data together in a mashup approach for displaying the results of the analysis. In addition, the product features the ability to create guided analysis via the use of story books. The challenge today is that there are few if any BI product companies which provide a single product to support both Mode 1 and Mode 2. The result is that we have one set of tools for Mode 1 BI, such as OBIEE, Cognos and Business Objects, and another for Mode 2, such as Tableau, Qlik, Spotfire, SAP’s Lumira and Oracle’s Data Visualization.

The problem is how we make them all work together. How does one transition from Mode 2 to Mode 1 when analysis is required on a regular basis? How does Mode 1 satisfy immediate needs? With products from different vendors, we often have to redevelop in Mode 1 the work that data analysts have already done in Mode 2. Looking to a single vendor may reduce this problem, but at this time that is not the reality. Instead we are in a time when we use heterogeneous products to service specific needs. Tools which span the bimodal needs of a business are on their way, but they are not here yet. We must create an environment where data discovery and analytics can be optimized in this bimodal approach, but it must be done with care, as data curation will be needed to instill the confidence that data is being used properly.

Monday, May 2, 2016

Data Quality Using R

Data Quality is a valuable approach to making sure that your data meets your needs and provides you with reliable information when it is needed. The problem is that most organizations are hesitant to embark on the data quality journey due to the costs. The challenge is how to reduce the cost of data quality. Most organizations realize the value data has, but the costs of software, people and time end many initiatives before they begin.

Companies put together data management plans which require them to make investments. These investments tend to be in the area of data quality tools. Organizations consider Data Quality products from Informatica, SAS, SAP, Trillium and Oracle, all of which provide tools that support this need using a well-curated approach. These tools are add-ons or options to existing ETL toolsets, and generally they are expensive and complex to implement. The advantage they bring, however, is tight integration with ETL tools. Some companies find that the barriers to entry are too significant and therefore ignore or reduce their data quality efforts.

We need to find a better way, or at least one which is more cost-effective. Many enterprises today are enamored with open-source solutions. These are the companies that are actively embarking on open-source projects like Hadoop and the various projects within its ecosystem. One tool that has emerged and is commonly paired with big data is R. As defined by Wikipedia and the Inside-R Community, R is a programming language and software environment for statistical computing and graphics supported by the R Foundation for Statistical Computing. You do data analysis in R by writing scripts and functions in the R programming language. R is a complete, interactive, object-oriented language: designed by statisticians, for statisticians. The language provides objects, operators and functions that make the process of exploring, modeling, and visualizing data a natural one. Complete data analyses can often be represented in just a few lines of code. The challenge is that R works on data held in memory, so it is limited by the memory available. You need to keep this in mind when you choose a sample for your tests.
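
One way to choose a sample from a file that is too large to hold in memory is reservoir sampling, which reads the data once and keeps a fixed-size uniform random sample. The sketch below is an illustration rather than a prescribed method. It is written in Python using only the standard library, and the name `reservoir_sample` and its parameters are assumptions made for this example.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(rows: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Keep k rows chosen uniformly at random from a stream of unknown length.

    Only k rows are held in memory at any time, so the input can be far
    larger than what the analysis tool could load in one piece.
    """
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    sample: List[T] = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Row i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive at both ends
            if j < k:
                sample[j] = row
    return sample
```

In practice the `rows` argument could be an open file handle or a `csv.reader`, and the resulting sample could be written out and handed to R for profiling.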

At the recent Collaborate16 conference, I attended a session by Michelle Kolbe, of Red Pill Analytics, about using R for Data Quality. I found the idea to be an eye-opener. A viable solution to our problem was right in front of us. She proposed that R provides the necessary features and analytics to support a robust data quality approach. R functions and packages which are applicable to data quality include summary, distributions, histograms, regressions and dataQualityR. These allow us to profile data quickly and easily. The summary function profiles each column in the file or table and provides information including min, max, median, distributions and other useful measures. Combining graphical and text mining packages allows for advanced visualizations like word clouds and clustering, which can be more telling because they provide visual clues leading to outliers and other questionable data. This functionality within R can empower a Data Quality program: by defining a standard framework, one can build a library of data quality processes for which you only have to provide the file or table you wish to analyze.
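
To make the idea of column profiling concrete, here is a minimal sketch of the kind of per-column output described above. It is written in Python with the standard library rather than in R, it is not the output format of R's summary, and the names `profile_columns`, `MISSING` and `SAMPLE` are assumptions made for this example, as is the choice of which strings count as missing.

```python
import csv
import io
import statistics
from typing import Dict, List, Optional

# Assumption: these strings mark a missing value in the input file.
MISSING = {"", "NA", "NULL"}


def to_numbers(values: List[str]) -> Optional[List[float]]:
    """Return the values as floats, or None if any value is not numeric."""
    try:
        return [float(v) for v in values]
    except ValueError:
        return None


def profile_columns(csv_text: str) -> Dict[str, Dict[str, object]]:
    """Profile every column: row count, missing count, distinct count,
    and min / max / median / mean for columns that are fully numeric."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns: Dict[str, List[str]] = {name: [] for name in (reader.fieldnames or [])}
    for row in reader:
        for name in columns:
            columns[name].append((row.get(name) or "").strip())

    profile: Dict[str, Dict[str, object]] = {}
    for name, values in columns.items():
        present = [v for v in values if v not in MISSING]
        stats: Dict[str, object] = {
            "rows": len(values),
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }
        numbers = to_numbers(present)
        if numbers:  # skip text columns and columns with no values
            stats.update(
                min=min(numbers),
                max=max(numbers),
                median=statistics.median(numbers),
                mean=statistics.mean(numbers),
            )
        profile[name] = stats
    return profile


# Made-up sample data: one blank age, one implausible age, one "NA" city.
SAMPLE = """id,age,city
1,34,Toronto
2,,Ottawa
3,29,Toronto
4,410,NA
"""

if __name__ == "__main__":
    for column, stats in profile_columns(SAMPLE).items():
        print(column, stats)
```

Even on this tiny sample the profile surfaces the two things a data quality review would look at first: the missing values in `age` and `city`, and a maximum age far outside a plausible range.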

The choice to use R will probably be one which IT initiates, and it should support business needs and data requirements. It will not serve as a stand-alone solution, but rather as part of a solution. We can look at R as a low-cost alternative to the mainstream DQ tools, but it is not a complete solution and will require a data processing product such as Python. The idea I propose is that R could act as the initial tool to profile data and identify data issues. With the construction of a framework to make it easy to use, you will have the start of an advanced approach to quality.
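
As one possible shape for such a framework, the sketch below builds a small library of reusable checks that can be run against any table supplied as a list of rows. It is written in Python, the companion tool mentioned above, and the same structure should carry over to R functions. Every name here (`not_missing`, `in_range`, `unique`, `run_checks`) and the 0 to 120 age range are assumptions made for illustration.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]
Check = Callable[[List[Row]], List[str]]  # a check returns a list of issues


def not_missing(column: str) -> Check:
    """Build a check that flags rows where the column is empty."""
    def check(rows: List[Row]) -> List[str]:
        return [
            f"row {i}: {column} is missing"
            for i, row in enumerate(rows, 1)
            if not (row.get(column) or "").strip()
        ]
    return check


def in_range(column: str, low: float, high: float) -> Check:
    """Build a check that flags non-numeric or out-of-range values."""
    def check(rows: List[Row]) -> List[str]:
        issues = []
        for i, row in enumerate(rows, 1):
            raw = (row.get(column) or "").strip()
            if not raw:
                continue  # empty values are left to not_missing
            try:
                value = float(raw)
            except ValueError:
                issues.append(f"row {i}: {column}={raw!r} is not numeric")
                continue
            if not low <= value <= high:
                issues.append(f"row {i}: {column}={raw} outside [{low}, {high}]")
        return issues
    return check


def unique(column: str) -> Check:
    """Build a check that flags values already seen in an earlier row."""
    def check(rows: List[Row]) -> List[str]:
        first_seen: Dict[object, int] = {}
        issues = []
        for i, row in enumerate(rows, 1):
            value = row.get(column)
            if value in first_seen:
                issues.append(
                    f"row {i}: {column}={value!r} duplicates row {first_seen[value]}"
                )
            else:
                first_seen[value] = i
        return issues
    return check


def run_checks(rows: List[Row], checks: Dict[str, Check]) -> Dict[str, List[str]]:
    """Run every named check against the rows and collect the issues."""
    return {name: check(rows) for name, check in checks.items()}


# A standard set of checks defined once and reused for each new table.
STANDARD_CHECKS: Dict[str, Check] = {
    "id unique": unique("id"),
    "age present": not_missing("age"),
    "age plausible": in_range("age", 0, 120),
}
```

The design choice is that the checks are defined once and the analyst only supplies the data, for example `run_checks(rows, STANDARD_CHECKS)` with rows read through `csv.DictReader`.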

The age of Big Data has created the opportunity which R presents, and you should consider it for your organization. Data Quality is an old problem, and we now have new ways to find and fix it. Recently the CTO at Canadian Tire stated “We aren’t interested in best practices. That’s for our competitors. What we’re looking at is next practices.” This is how we need to look at R; it is the next practice which may become a best practice.