Monday, May 2, 2016

Data Quality Using R


Data Quality is a valuable approach to making sure that your data meets your needs and provides you with reliable information when it is needed. The problem is that most organizations are hesitant to embark on the data quality journey because of the cost. The challenge, then, is how to reduce the cost of data quality. Most organizations realize the value of their data, but the cost of software, people and time ends many initiatives before they begin.

Companies put together data management plans which require them to make investments. These investments tend to be in the area of data quality tools. Organizations consider Data Quality products from Informatica, SAS, SAP, Trillium and Oracle, all of which provide tools that support this need using a well-curated approach. These tools are add-ons or options to existing ETL toolsets, and generally they are expensive and complex to implement. The advantage they bring, however, is tight integration with ETL tools. Some companies find that the barriers to entry are too significant and therefore ignore or reduce their data quality efforts.

We need to find a better way, or at least one which is more cost effective. Many enterprises today are enamored with open-source solutions. These are the companies that are actively embarking on open-source projects like Hadoop and the various projects within its ecosystem. One tool which has emerged as a common companion to big data is R. As defined by Wikipedia and the Inside-R Community, R is a programming language and software environment for statistical computing and graphics, supported by the R Foundation for Statistical Computing. You do data analysis in R by writing scripts and functions in the R programming language. R is a complete, interactive, object-oriented language, designed by statisticians, for statisticians. The language provides objects, operators and functions that make the process of exploring, modeling, and visualizing data a natural one. Complete data analyses can often be represented in just a few lines of code. The challenge is that R is limited by memory, because by default it holds the data it works on in RAM. You need to take this into account and choose an appropriately sized sample for your tests.

At the recent Collaborate16 conference, I attended a session by Michelle Kolbe, of Red Pill Analytics, about using R for Data Quality. I found the idea to be an eye-opener: a viable solution to our problem was right in front of us. She proposed that R provides the necessary features and analytics to support a robust data quality approach. R features which are applicable to data quality include summaries, distributions, histograms, regressions and the dataQualityR package. These allow us to profile data quickly and easily. The summary() function profiles each column of a data frame loaded from a file or table and, for numeric columns, provides measures including min, max, median, mean and quartiles. Combining graphical and text mining packages allows for advanced visualizations like word clouds and clustering, which can be more telling because they provide visual clues that lead to outliers and other questionable data. This functionality within R can empower a Data Quality program. By defining a standard framework, one can build a library of data quality processes to which you only have to provide the file or table you wish to analyze.
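As a rough illustration of the kind of profiling described above, here is a minimal sketch using only base R. The `customers` data frame and its columns are made up for this example; in practice the data would come from something like read.csv() or a database query. It is not taken from the session, just one way the idea could look.

```r
# Hypothetical sample data (an assumption for illustration only).
customers <- data.frame(
  id    = 1:8,
  age   = c(34, 29, 41, NA, 38, 250, 27, 33),  # 250 is a deliberate outlier
  state = c("NY", "CA", "ny", "TX", NA, "CA", "TX", "NY"),
  stringsAsFactors = FALSE
)

# Per-column profile: min, quartiles, median, mean, max and NA counts
# for numeric columns; length and type for character columns.
print(summary(customers))

# Number of missing values in each column.
missing_counts <- colSums(is.na(customers))
print(missing_counts)

# A frequency table exposes inconsistent codes such as "ny" vs "NY".
state_freq <- table(customers$state, useNA = "ifany")
print(state_freq)

# Flag numeric outliers with the boxplot (1.5 * IQR) rule.
age_outliers <- boxplot.stats(customers$age)$out
print(age_outliers)

# In an interactive session, a histogram gives the visual version.
if (interactive()) {
  hist(customers$age, main = "Age distribution", xlab = "Age")
}
```

Even on a tiny table like this, the summary, the frequency table and the outlier check each surface a different kind of issue: missing values, inconsistent coding and an implausible age.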

The choice to use R will probably be one that IT initiates, and it should support business needs and data requirements. It will not serve as a stand-alone solution, but rather as part of one. We can look at R as a low-cost alternative to the mainstream DQ tools, but it is not a complete solution and will require a data processing tool such as Python alongside it. The idea I propose is that R could act as the initial tool to profile data and identify issues. With a framework constructed to make it easy to use, you will have the start of an advanced approach to quality.
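To make the framework idea a little more concrete, here is a small sketch of what one reusable building block might look like. The function name `profile_data()`, the `orders` sample data and the file name in the last comment are all hypothetical; this is one possible shape for such a helper under those assumptions, not an established package or API.

```r
# profile_data() is a hypothetical helper, not part of base R or any package.
# It takes a data frame and returns one row of quality metrics per column.
profile_data <- function(df) {
  out <- data.frame(
    column      = names(df),
    type        = vapply(df, function(x) class(x)[1], character(1)),
    n_missing   = vapply(df, function(x) sum(is.na(x)), integer(1)),
    pct_missing = vapply(df, function(x) round(100 * mean(is.na(x)), 1), numeric(1)),
    n_distinct  = vapply(df, function(x) length(unique(x[!is.na(x)])), integer(1)),
    stringsAsFactors = FALSE
  )
  rownames(out) <- NULL  # drop the column names that vapply carried over
  out
}

# Made-up sample data standing in for a real file or table.
orders <- data.frame(
  order_id = c(1, 2, 3, 4),
  amount   = c(19.99, NA, 5.00, 19.99),
  status   = c("shipped", "open", NA, "open"),
  stringsAsFactors = FALSE
)

report <- profile_data(orders)
print(report)

# With a real file the call would look the same (path is hypothetical):
# report <- profile_data(read.csv("orders.csv"))
```

Because the helper only needs a data frame, the same call works whether the data came from a file or a table, which is the point of building a shared library of checks.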

The age of Big Data has created the opportunity which R presents, and you should consider it for your organization. Data Quality is an ancient problem, and we now have new ways to find and fix it. Recently the CTO at Canadian Tire stated “We aren’t interested in best practices. That’s for our competitors. What we’re looking at is next practices.” This is how we need to look at R; it is the next practice which may become a best practice.
