Ian Abramson's Data and Database Blog: May 2015

Tuesday, May 26, 2015

Where is Data Taking Us?

I recently spoke at the Data Summit in NYC, where the focus was to “Unleash the Power of Your Data”. This was a great message for the event, as we live in a time when people extol the virtues of data, yet many struggle to grasp how data can change a business and how they can harness its power. Today’s organizations have a good handle on how well-defined data can be used to present information on how they are doing. We produce reports on sales and on costs, and we can easily discover what has happened, but the real business challenge is understanding why it happened and how to make the future better. Predicting the future is the real silver bullet that companies need, but one that will continue to be elusive. The conference weighed this question in a number of different ways, but all had the same undertone: data is a resource which must drive value.

In my presentation, “Analytics in the Time of Big Data”, I presented how BI and Analytics have changed over the years. The evolution is from reporting for the masses, which was reactive and summarized what was happening, to a future where results and effects can be accurately predicted, creating an environment where the right decision or action can be taken. Reporting is changing and growing, and we need to understand the opportunities these changes provide.

The most basic type of reporting or analytics is known as Descriptive Analytics. It is the most common method of reporting we encounter, but in many ways it is the most limited form of analytics. In Descriptive Analytics we provide a description of what has happened. It provides the pulse of the business and is illustrated through standard reports and dashboards. The next is Predictive Analytics, which uses statistics, modeling, data mining and machine learning to analyze current data and predict what may occur. This provides a way to figure out what may happen in the future based on the experiences of the many. We see this type of analytics every day we are on the Web: recommendations for content and products are based on what we do and what we read and watch. The final and most productive form is Prescriptive Analytics, which takes predictive analytics and adds actionable data from both internal and external sources, along with the ability to capture feedback to improve future analytics. The Oil and Gas industry uses prescriptive analytics to determine the best place to drill, the best methods to use and the production that the well will deliver. This progression is natural, but significant resistance and a lack of knowledge are restricting businesses' chances of evolving, and this was widely discussed during the event. The change is needed if businesses are going to compete, and companies need to start now, as the journey is not a short one.
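To make the three levels concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the monthly sales figures are invented, the "model" is a plain least-squares trend line, and the margin and holding-cost parameters are hypothetical. Real predictive and prescriptive systems use far richer models and feedback loops; this only shows how each level builds on the one before.

```python
from statistics import mean

# Hypothetical monthly unit sales (invented numbers, not real data).
monthly_sales = [100, 110, 120, 130, 140, 150]


def describe(sales):
    """Descriptive: summarize what has already happened."""
    return {"total": sum(sales), "average": mean(sales)}


def predict_next(sales):
    """Predictive: fit a least-squares trend line and extrapolate one period."""
    n = len(sales)
    xs = range(n)
    x_bar, y_bar = mean(xs), mean(sales)
    slope = (
        sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, sales))
        / sum((x - x_bar) ** 2 for x in xs)
    )
    intercept = y_bar - slope * x_bar
    return intercept + slope * n


def prescribe(forecast, stock_options, unit_margin=5.0, unit_holding_cost=2.0):
    """Prescriptive: recommend the stocking level with the best expected profit.

    unit_margin and unit_holding_cost are hypothetical business parameters.
    """
    def profit(stock):
        sold = min(stock, forecast)
        unsold = max(stock - forecast, 0)
        return sold * unit_margin - unsold * unit_holding_cost

    return max(stock_options, key=profit)


summary = describe(monthly_sales)
forecast = predict_next(monthly_sales)
recommended_stock = prescribe(forecast, [140, 160, 180])
```

The descriptive step only looks backwards, the predictive step extrapolates, and the prescriptive step turns the forecast into an action. Capturing how the recommendation actually performed and feeding it back into the model is the part this sketch leaves out.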

The challenge is to provide access to the data in a form which allows for the integration of all your data in one place, in a governed and controlled area. We need to build data repositories which can support these needs, but is it realistic to think that all of your data will be in Hadoop or in an RDBMS? We need a solution which provides a federated view of all your data in a way that is meaningful for the business; this is known as data virtualization. With data virtualization, the structure of the data may not be well understood at first, but as we use and apply it we give it meaning and create useful business definitions and relationships for use by the entire business. Consider this a Schema-on-the-fly approach which provides flexibility while keeping the core data available. It also gives users the ability to define the data that they need when they need it. It includes merging data from both internal and external sources in a way that simplifies data access for users by hiding the complexity of the data. Now we have the ability to solve the integration and availability of the data, but we will still need to change how we do analytics and look to our mathematicians to help us expand what we do and how we provide our services.
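As a rough illustration of the idea, the Python sketch below puts a thin "virtual view" over two sources that stay in their native formats. The source contents, field names and the VirtualView class are all hypothetical, and a real data virtualization product does far more (security, query pushdown, caching). The point is only that business names and types are applied when the data is read, not when it is stored.

```python
import csv
import io
import json

# Two hypothetical sources with different native formats and field names.
INTERNAL_CSV = "cust_id,amt\nC1,120.50\nC2,80.00\n"
EXTERNAL_JSON = (
    '[{"customer": "C1", "segment": "retail"},'
    ' {"customer": "C2", "segment": "wholesale"}]'
)


def read_internal():
    """Yield raw rows from the internal source exactly as stored."""
    yield from csv.DictReader(io.StringIO(INTERNAL_CSV))


def read_external():
    """Yield raw records from the external source exactly as stored."""
    yield from json.loads(EXTERNAL_JSON)


class VirtualView:
    """Applies business names and types at read time; no data is copied ahead of time."""

    def __init__(self, reader, mapping):
        self.reader = reader
        # mapping: business name -> (source field, converter)
        self.mapping = mapping

    def rows(self):
        for raw in self.reader():
            yield {
                name: convert(raw[field])
                for name, (field, convert) in self.mapping.items()
            }


sales = VirtualView(
    read_internal,
    {"customer_id": ("cust_id", str), "sale_amount": ("amt", float)},
)
segments = VirtualView(
    read_external,
    {"customer_id": ("customer", str), "segment": ("segment", str)},
)


def federated_sales_by_segment():
    """Combine both views using only the shared business definitions."""
    segment_of = {r["customer_id"]: r["segment"] for r in segments.rows()}
    totals = {}
    for r in sales.rows():
        segment = segment_of.get(r["customer_id"], "unknown")
        totals[segment] = totals.get(segment, 0.0) + r["sale_amount"]
    return totals
```

A user of federated_sales_by_segment() never sees cust_id, amt or the JSON layout. If a definition changes, only the mapping changes and the underlying data stays where it is.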

The evolution of reporting is underway, and we need to look at how we can provide our business analysts, business users and consumers with the information they need, when they need it, to empower them to make better and more accurate decisions.

Wednesday, May 20, 2015

Big Data and the Data Model

In the Age of Big Data we have seen a revolution in the quantity and variety of data which we collect. With this comes the question of how data should be stored. What methods and approaches should we use to define the structures of the information contained in a Hadoop or HBase cluster? At the start of the time of Big Data, we were happy to store information in files in a flat format, with little thought to how this information would be used or how it should be stored. The key was to get data into the cluster and let the Data Scientists at the data. We were not concerned that queries could take minutes to hours, as we now had access to data which we never had before. This worked for a time, but as Big Data matured we needed to consider performance and functionality, and the data model became the place where we can begin this optimization.

The idea that we can work without a model and define things on the fly is truly one of the strengths of HDFS and Hive, but in reality, for most purposes, we need to define a need and a use case. This allows us to formalize a process or prepare data for use in a meaningful way. This is where a data model can exist and flourish within Big Data. Loading files gives you the ability to quickly gain access to new data. For data which is processed to address specific questions, you can assume that it will have a formal structure, which I would suggest would be similar to a star schema. The question is really whether you design as you would have before, using facts and dimensions, or design using a denormalized approach where we store all the raw data and codes together with all the attributes required to support analytics. The argument against the denormalized approach is that we are going to use even more disk space due to the Cartesian product we may have created. It will be a wide file, which also comes with its costs. In addition, when using compressed files, joining them requires that the data be uncompressed before it can be joined. In favor of denormalization, the data is available in a single place, there are no joining concerns, and the data can be compressed to alleviate storage concerns.
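The trade-off can be seen on a toy scale with the sketch below. It uses Python's built-in sqlite3 module purely as a stand-in so the example runs anywhere; it is not Hive, and the table names, columns and rows are invented for illustration. The same sales are stored once as a small star schema (a fact table plus a product dimension) and once as a single wide, denormalized table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Star schema: descriptive attributes live once, in the dimension.
    CREATE TABLE dim_product (
        product_code TEXT PRIMARY KEY, product_name TEXT, category TEXT);
    CREATE TABLE fact_sales (
        product_code TEXT, sale_date TEXT, amount REAL);
    -- Denormalized: codes and attributes are repeated on every row.
    CREATE TABLE sales_wide (
        product_code TEXT, product_name TEXT, category TEXT,
        sale_date TEXT, amount REAL);
""")

# Hypothetical sample data.
products = [
    ("P1", "Widget", "Hardware"),
    ("P2", "Gadget", "Hardware"),
    ("P3", "Manual", "Books"),
]
sales = [
    ("P1", "2015-05-01", 10.0),
    ("P1", "2015-05-02", 15.0),
    ("P2", "2015-05-01", 20.0),
    ("P3", "2015-05-03", 5.0),
]
conn.executemany("INSERT INTO dim_product VALUES (?, ?, ?)", products)
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", sales)

# Pay the join cost once, at load time, to build the wide table.
conn.execute("""
    INSERT INTO sales_wide
    SELECT f.product_code, d.product_name, d.category, f.sale_date, f.amount
    FROM fact_sales f
    JOIN dim_product d ON d.product_code = f.product_code
""")

# Star schema: every analytic query has to join.
star_totals = conn.execute("""
    SELECT d.category, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product d ON d.product_code = f.product_code
    GROUP BY d.category
    ORDER BY d.category
""").fetchall()

# Denormalized: the same answer with no join at query time.
wide_totals = conn.execute("""
    SELECT category, SUM(amount)
    FROM sales_wide
    GROUP BY category
    ORDER BY category
""").fetchall()
```

Both queries return the same totals. The difference is where the cost lands: the wide table repeats the product name and category on every sale row, while the star schema stores them once and joins on every query.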

I would approach it in a way where I load all the objects I can into a RAW data area. In that area I would include all data feeds, including lookup and reference data as well as any business rules required. This becomes our landing area. From there the data can be moved using some combination of Spark, Hive and Pig to bring it to a more workable format. My suggestion would be to generate the dimension objects from source. In addition, I would design fact tables using a denormalized approach. In terms of physical layout, you must consider that data will need to be distributed and partitioned; both of these decisions will impact performance, so consider this in your design. In the working/staging objects, I would include all codes and descriptions along with fact data. This information is constructed within the confines of the working area. We can then store this information for use down the line. I would even suggest that keeping all your intermediate work may provide some value. Adding indexes to these Hive objects can further enhance query performance.
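The flow from landing area to staging can be sketched in a few lines. On a cluster this work would be done with tools such as Spark, Hive or Pig. The plain Python below is only a stand-in so the steps are easy to follow, and the feed contents, field names and the staging/sales path are hypothetical. The key=value directory naming follows the layout Hive uses for partitioned tables.

```python
from collections import defaultdict

# RAW / landing area: feeds kept as they arrived (hypothetical sample data).
raw_sales = [
    {"store_code": "S1", "product_code": "P1", "sale_date": "2015-05-01", "amount": "10.0"},
    {"store_code": "S1", "product_code": "P2", "sale_date": "2015-05-01", "amount": "20.0"},
    {"store_code": "S2", "product_code": "P1", "sale_date": "2015-05-02", "amount": "15.0"},
]
raw_product_lookup = [
    {"code": "P1", "description": "Widget"},
    {"code": "P2", "description": "Gadget"},
]


def build_product_dimension(lookup_rows):
    """Generate the dimension object directly from the source lookup feed."""
    return {row["code"]: row["description"] for row in lookup_rows}


def build_staging_facts(sales_rows, product_dim):
    """Build denormalized fact rows that carry both the code and its description."""
    for row in sales_rows:
        code = row["product_code"]
        yield {
            "store_code": row["store_code"],
            "product_code": code,
            "product_description": product_dim.get(code, "UNKNOWN"),
            "sale_date": row["sale_date"],
            "amount": float(row["amount"]),
        }


def partition_by(rows, key, base="staging/sales"):
    """Group rows under Hive-style partition paths such as base/key=value."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[f"{base}/{key}={row[key]}"].append(row)
    return dict(partitions)


product_dim = build_product_dimension(raw_product_lookup)
staging_rows = list(build_staging_facts(raw_sales, product_dim))
partitions = partition_by(staging_rows, "sale_date")
```

Partitioning by sale_date is just one choice. The right partition key depends on how the data will be filtered most often, which is why the usage questions need answering before the layout is fixed.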

The bottom line is that you do need to put thought into how you will place data within the HDFS cluster. You also need to consider how information will be used and therefore give some structure to the data you hold to truly empower its use. This combination can provide some radical improvements in performance and productivity, and it moves the idea of the data warehouse in Hadoop one step closer to reality.