Data Quality is a valuable approach to ensuring that your data meets your needs and provides reliable information when it is needed. The problem is that most organizations hesitate to embark on the data quality journey because of the cost. Most organizations recognize the value of their data, but the cost of software, people, and time ends many initiatives before they begin. The challenge, then, is how to reduce the cost of data quality.
Companies put together data management plans that require them to make investments, and those investments tend to be in data quality tools. Organizations consider Data Quality products from Informatica, SAS, SAP, Trillium, and Oracle, all of which support this need with a well-curated approach. These tools are add-ons or options to existing ETL toolsets, and they are generally expensive and complex to implement; the advantage they bring is tight integration with those ETL tools. Some companies find these barriers to entry too significant and therefore reduce or abandon their data quality efforts.
We need to find a better way, or at least one that is more cost effective. Many enterprises today are enamored with open source; these are the companies actively embarking on projects like Hadoop and the various projects within its ecosystem. One tool that has emerged as a common companion to big data is R. As defined by Wikipedia and the Inside-R community, R is a programming language and software environment for statistical computing and graphics, supported by the R Foundation for Statistical Computing. You do data analysis in R by writing scripts and functions in the R language. R is a complete, interactive, object-oriented language, designed by statisticians for statisticians, and it provides objects, operators, and functions that make exploring, modeling, and visualizing data a natural process. Complete data analyses can often be expressed in just a few lines of code. The challenge is that R works in memory, so you need to consider that limit when choosing a sample for your tests.
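As a rough illustration of working within that limit, the sketch below caps how many rows are loaded for profiling. The file name and row cap are hypothetical; adjust them to your data and available memory.

```r
# A minimal sketch: load only a capped number of rows so the working
# set fits in memory. "customers.csv" and the 100,000-row cap are
# hypothetical; substitute your own file and limit.
sample_df <- read.csv("customers.csv", nrows = 100000,
                      stringsAsFactors = FALSE)
# nrows takes the first N rows; for a true random sample you would
# read in chunks and sample, or pre-select row numbers.
str(sample_df)   # quick look at column types and example values
```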
At the recent Collaborate16 conference, I attended a session by Michelle Kolbe of Red Pill Analytics on using R for data quality. The idea was an eye-opener: a viable solution to our problem was right in front of us. She proposed that R provides the features and analytics needed to support a robust data quality approach. R capabilities applicable to data quality include the summary() function, distribution and histogram plots, regression, and the dataQualityR package. These allow us to profile data quickly and easily. The summary() function profiles each column in a file or table, reporting the minimum, maximum, median, quartiles, and other useful measures. Combining graphical and text-mining packages enables advanced visualizations such as word clouds and clustering, which can be more telling by providing visual clues that lead to outliers and other questionable data. This functionality can empower a Data Quality program: by defining a standard framework, you can build a library of data quality processes where the only input required is the file or table you wish to analyze.
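Here is a minimal sketch of that profiling step, assuming the sampled data frame from above and the dataQualityR package from CRAN; the checkDataQuality() call and its argument names are as I recall them, so verify against the package documentation. The histogram column name is hypothetical.

```r
# Per-column profile: min, max, mean, median, quartiles, NA counts.
summary(sample_df)

# Distribution of one numeric column; "order_amount" is a hypothetical
# column name, substitute one from your own data.
hist(sample_df$order_amount, main = "Order amount distribution")

# dataQualityR writes fuller numeric and categorical profiles,
# including missing- and unique-value counts, to CSV reports.
library(dataQualityR)
checkDataQuality(data = sample_df,
                 out.file.num = "dq_numeric.csv",
                 out.file.cat = "dq_categorical.csv")
```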
The choice to use R will probably be one that IT initiates, and it should support business needs and data requirements, but R will not serve as a stand-alone solution; it is part of one. We can look at R as a low-cost alternative to the mainstream DQ tools, but it is not a complete solution and will need to be paired with a data processing language such as Python. What I propose is that R act as the initial tool to profile data and identify data issues. With a framework built to make it easy to use, you will have the start of an advanced approach to quality.
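To make the framework idea concrete, here is a minimal sketch of a reusable profiling routine where the only input is the file to analyze; the function name and defaults are my own illustration, not an established library.

```r
# A reusable data quality step: the caller supplies only the file path.
profile_file <- function(path, max_rows = 100000) {
  df <- read.csv(path, nrows = max_rows, stringsAsFactors = FALSE)
  print(summary(df))                            # per-column profile
  na_pct <- round(100 * colMeans(is.na(df)), 1)
  print(na_pct)                                 # percent missing per column
  invisible(df)                                 # return data for further checks
}

profile_file("customers.csv")                   # hypothetical input file
```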
The age of Big Data has created the opportunity that R presents, and you should consider it for your organization. Data quality is an old problem, but we now have new ways to find and fix it. Recently the CTO at Canadian Tire stated, “We aren’t interested in best practices. That’s for our competitors. What we’re looking at is next practices.” This is how we need to look at R: it is the next practice that may become a best practice.