Wednesday, May 20, 2015

Big Data and the Data Model

In the age of Big Data we have seen a revolution in the quantity and variety of data we collect. With it comes the question of how that data should be stored. What methods and approaches should we use to define the structure of the information held in a Hadoop or HBase cluster? In the early days of Big Data, we were happy to store information in flat files with little thought about how it would be used or how it should be stored. The key was to get data into the cluster and let the data scientists at it. We were not concerned that queries could take minutes or even hours, because we now had access to data we had never had before. This worked for a time, but as Big Data matured we needed to consider performance and functionality, and the data model is the place where we can begin this optimization.

The idea that we can work without a model and define things on the fly is truly one of the strengths of HDFS and Hive, but in reality, for most purposes, we need to define a need and a use case. This allows us to formalize a process or prepare data for use in a meaningful way. This is where a data model can exist and flourish within Big Data. Loading files gives you the ability to quickly gain access to new data; for data which is processed to address specific questions, you can assume it will have a formal structure, which I would suggest would be similar to a star schema. The real question is whether you design as you would have before, using facts and dimensions, or whether you design using a denormalized approach, where we store all the raw data and codes together with all the attributes required to support analytics. The argument against the denormalized approach is that we are going to use even more disk space due to the Cartesian product we may have created. It will be a wide file, which also comes with its costs. In addition, when using compressed files, joining them requires that the data be uncompressed before it can be joined. In favor of denormalization, the data is available in a single place, there are no joining concerns, and the data can be compressed to alleviate storage concerns.
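
To make the trade-off concrete, here is a minimal sketch in plain Python rather than anything run on a real cluster. The table contents, the column names and the `denormalize` helper are all made up for illustration. It shows a small fact set stored star-schema style, with only a product code, and the same rows after the dimension attributes have been copied onto every fact row.

```python
# Hypothetical data: every name and value here is illustrative only.

# Dimension: one entry per product code (star-schema style).
product_dim = {
    "P1": {"product_name": "Widget", "category": "Hardware"},
    "P2": {"product_name": "Gadget", "category": "Electronics"},
}

# Facts: each row carries only the code, not the descriptive attributes.
sales_facts = [
    {"sale_id": 1, "product_code": "P1", "amount": 10.0},
    {"sale_id": 2, "product_code": "P2", "amount": 25.5},
    {"sale_id": 3, "product_code": "P1", "amount": 7.25},
]


def denormalize(facts, dim, key):
    """Return wide rows: each fact plus the attributes of its dimension entry.

    Facts whose code is missing from the dimension are kept unchanged.
    """
    wide = []
    for fact in facts:
        attrs = dim.get(fact[key], {})
        wide.append({**fact, **attrs})
    return wide


wide_rows = denormalize(sales_facts, product_dim, "product_code")

for row in wide_rows:
    print(row)
```

Notice that the product name and category are now repeated on every row that uses the same code. That repetition is the extra width and disk space discussed above, and it is also what removes the need for a join at query time.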

I would approach it by loading all the objects I can into a RAW data area. In that area I would include all data feeds, including lookup and reference data, as well as any business rules required. This becomes our landing area. From there the data can be moved, using some combination of Spark, Hive and Pig, into a more workable format. My suggestion would be to generate the dimension objects from source and, in addition, to design the fact tables using a denormalized approach. In terms of physical layout, you must consider that the data will need to be distributed and partitioned; both of these decisions will impact performance, so account for them in your design. In the working/staging objects, I would include all codes and descriptions along with the fact data. This information is constructed within the confines of the working area, and we can then store it for use down the line. I would even suggest that keeping all your intermediate work may provide some value. Adding indexes to these Hive objects can further enhance query performance.
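
As a rough illustration of the partitioning decision, the sketch below groups rows by the directory they would land in, using the `column=value` directory naming that Hive uses for partitioned tables. It is plain Python and touches no cluster. The base path `/data/staging/sales`, the choice of `year` and `month` as partition columns, and both helper functions are assumptions made purely for this example.

```python
from collections import defaultdict


def partition_path(base, row, partition_cols):
    """Build a Hive-style partition directory such as base/year=2015/month=5."""
    parts = [f"{col}={row[col]}" for col in partition_cols]
    return "/".join([base.rstrip("/")] + parts)


def group_by_partition(rows, base, partition_cols):
    """Group rows by the partition directory each one would be written to."""
    groups = defaultdict(list)
    for row in rows:
        groups[partition_path(base, row, partition_cols)].append(row)
    return dict(groups)


# Hypothetical staging rows; names and values are illustrative only.
staged_rows = [
    {"sale_id": 1, "year": 2015, "month": 4, "amount": 10.0},
    {"sale_id": 2, "year": 2015, "month": 5, "amount": 25.5},
    {"sale_id": 3, "year": 2015, "month": 5, "amount": 7.25},
]

layout = group_by_partition(staged_rows, "/data/staging/sales", ["year", "month"])

for path, rows in sorted(layout.items()):
    print(path, len(rows))
```

A query that filters on the partition columns only needs to read the matching directories, which is why the choice of partition columns deserves thought up front.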

The bottom line is that you do need to put thought into how you will place data within the HDFS cluster. You also need to consider how the information will be used, and therefore give some structure to the data you hold to truly empower its use. This combination can provide some radical improvements in performance and productivity, and it moves the idea of the data warehouse in Hadoop one step closer to reality.