In the age of Big Data we have seen
a revolution in the quantity and variety of data we collect. With it comes
the question of how that data should be stored. What methods and approaches should
we use to define the structure of the information held in a Hadoop or
HBase cluster? In the early days of Big Data, we were happy to store
information in flat files with little thought to how it would be used or
organized. The key was to get data into the cluster and let the data scientists
at it. We were not concerned that queries could take minutes or even hours,
because we now had access to data we had never had before. This worked for a
time, but as Big Data matured we needed to consider performance and
functionality, and the data model is the place where that optimization begins.
The idea that we can work
without a model and define things on the fly is truly one of the strengths of
HDFS and Hive, but in reality, for most purposes, we need to define a need and
a use case. This allows us to formalize a process and prepare data for use in a
meaningful way. This is where a data model can exist and flourish within Big
Data. Loading files gives you quick access to new data, but for data that is
processed to address specific questions you can assume it will have a formal
structure, which I would suggest should be similar to a star schema. The real
question is whether you design as you would have before, using facts and
dimensions, or whether you design using a denormalized approach, where all the
raw data and codes are stored together with all the attributes required to
support analytics. The argument against the denormalized approach is that we
will use even more disk space due to the Cartesian product we may have created.
It will also be a wide file, which comes with its own costs. In addition, when
using compressed files, joining requires that the data be decompressed before
it can be joined. In favor of denormalization: the data is available in a
single place, there are no joins to worry about, and compression can alleviate
the storage concern.
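To make the trade-off concrete, here is a minimal PySpark sketch (the table and column names are hypothetical, not from any particular system) that takes a fact feed plus two lookup feeds, joins them once into a wide denormalized table, and writes the result as compressed Parquet so that analytic queries no longer pay for the joins at read time.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("denormalized-sales")
             .enableHiveSupport()
             .getOrCreate())

    # Hypothetical star-schema inputs: one fact feed plus two lookup feeds.
    sales   = spark.table("raw.sales")           # sale_id, sale_date, product_cd, store_cd, qty, amount
    product = spark.table("raw.product_lookup")  # product_cd, product_desc, category
    store   = spark.table("raw.store_lookup")    # store_cd, store_desc, region

    # Denormalize: carry the codes and their descriptions on every fact row.
    wide = (sales
            .join(product, "product_cd", "left")
            .join(store, "store_cd", "left"))

    # Compression (Snappy Parquet here) offsets much of the extra width on disk.
    (wide.write
         .mode("overwrite")
         .option("compression", "snappy")
         .parquet("/data/analytics/sales_wide"))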
I would approach it by loading
all the objects I can into a RAW data area. In that area I would include all
data feeds, including lookup and reference data, as well as any business rules
required. This becomes our landing area. From there the data can be moved,
using some combination of Spark, Hive and Pig, into a more workable format. My
suggestion would be to generate the dimension objects from source, and to
design the fact tables in a denormalized fashion.
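As a sketch of what generating a dimension from source might look like, continuing the hypothetical PySpark session and table names from the sketch above (any business rules would also be applied at this step):

    # Derive a conformed product dimension straight from the raw lookup feed.
    product_dim = (spark.table("raw.product_lookup")
                   .select("product_cd", "product_desc", "category")
                   .dropDuplicates(["product_cd"]))

    product_dim.write.mode("overwrite").saveAsTable("work.dim_product")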
In terms of physical layout, you must consider that the data will need to be
distributed and partitioned; both of these decisions will impact performance,
so account for them in your design.
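Those layout decisions can be expressed at write time. A sketch, again with hypothetical names, assuming the wide fact from the earlier example carries a sale_date column to partition on and is often joined or filtered on store_cd:

    # Partition on a column that matches how the data is queried, and bucket on
    # a frequently used key to control how rows are distributed across files.
    fact_wide = spark.read.parquet("/data/analytics/sales_wide")

    (fact_wide.write
              .mode("overwrite")
              .partitionBy("sale_date")
              .bucketBy(32, "store_cd")
              .sortBy("store_cd")
              .saveAsTable("work.fact_sales_wide"))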
In the working/staging objects, I would include all codes and descriptions
along with the fact data. This information is constructed within the confines
of the working area and can then be stored for use down the line. I would even
suggest that keeping all of your intermediate work may provide some value.
Adding indexes to these Hive objects can further enhance query performance.
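Keeping the intermediate work is cheap to arrange if each step in the pipeline simply persists its output as a named table in the working area. A short sketch, with hypothetical table names, continuing the session above:

    # Persist each intermediate step so it can be inspected or reused
    # downstream instead of being recomputed.
    cleaned = spark.table("raw.sales").dropna(subset=["product_cd", "store_cd"])
    cleaned.write.mode("overwrite").saveAsTable("work.sales_cleaned")

    with_product = cleaned.join(spark.table("work.dim_product"), "product_cd", "left")
    with_product.write.mode("overwrite").saveAsTable("work.sales_with_product")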
The bottom line is that you do need
to put thought into how you will place data within the HDFS cluster. You also
need to consider how the information will be used, and therefore give some
structure to the data you hold in order to truly empower its use. This
combination can provide some radical improvements in performance and
productivity, and it moves the idea of the data warehouse in Hadoop one step
closer to reality.