Thursday, October 22, 2015
The convergence of information, technology, and analytics in the era of Big Data is revolutionizing how we approach data solutions. We now have the capability to collect all of our data into a scalable repository that can serve people with the information they need to make more informed decisions. The challenge is that many companies have been building data warehouses and reporting systems for years. Recent studies have shown that organizations, both small and large, are investing more in analytics while maintaining or reducing their traditional data warehouse spending. To me this seems an odd combination: how could one reduce data warehouse spending and increase analytics? My conclusion is that companies are maintaining their current data warehouses while building new data solutions to complement them. So, all things considered, the real story may be that data spending is increasing overall, just in new ways.
This split focus between analytics and data warehousing is at odds with an enterprise approach to data. Organizations must consider how they can evolve their existing data warehouses and enable better analytics, not through a revolution but through an evolution. We must look at how Big Data is impacting our business and data processes and adjust how we architect our data warehouses.
According to Forbes, the average spending on data projects in 2015 was $7.4 million. Enterprise organizations spent $13.8M, while SMBs spent an average of $1.6M. With this level of funding, the value must be realized. So how can we evolve our data warehouses? The key is finding the space where Big Data makes the most sense.
Today many organizations are at some point along the path of developing data hubs and landing areas using Hadoop technology. We tend to concentrate on the landing and staging areas, where we can make the most impact with the least disruption. By replacing these components of the data lifecycle, we can build a new region where data is collected and prepared to meet analytic needs, replacing areas which were based in a relational database at significant cost. The new, expanded landing and staging areas are built with Big Data and analytics in mind, in addition to the traditional needs of business reporting. Data architects like myself are looking to create an environment which collects all of the data and then prepares it into a conformed arrangement, where data can be served up in a structured manner to supply the data warehouse, while also providing an environment where structured and unstructured data can supply raw information to data scientists and data analysts for their analysis. This approach is intuitive, but it also enables a better data architecture, because we are separating the various parts of our solution. By separating the landing and staging areas, we can use the technology which best suits them today while being mindful of the future. The same applies to high-performance analytic platforms: we may choose an RDBMS like Oracle or Netezza, which today is the most appropriate platform for traditional BI, but tomorrow could bring a new technology too appealing to ignore. By separating the functionality from the technology, we can evolve our data warehouse in a more agile way.
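To make the landing-versus-staging separation concrete, here is a minimal sketch in plain Python with an invented order feed (the feed, field names, and conforming rules are all hypothetical, not from the post): the landing step keeps data exactly as received, and the staging step conforms it into typed, named records ready for both the warehouse and analysts.

```python
# Hypothetical sketch: a landing zone keeps raw files as-is, while a staging
# zone conforms them into records ready for the warehouse and for analysts.
import csv
import io

RAW_FEED = """order_id,cust,amt,ts
1001,ACME,250.00,2015-10-01
1002,globex,99.50,2015-10-02
"""

def land(feed):
    """Landing: keep the data exactly as received, one raw line per record."""
    return feed.strip().splitlines()

def stage(raw_lines):
    """Staging: conform the raw lines into typed, consistently named records."""
    reader = csv.DictReader(io.StringIO("\n".join(raw_lines)))
    return [
        {"order_id": int(r["order_id"]),
         "customer": r["cust"].upper(),   # conform naming and casing
         "amount": float(r["amt"]),
         "order_date": r["ts"]}
        for r in reader
    ]

staged = stage(land(RAW_FEED))
print(staged[0]["customer"])  # ACME
```

The point of the split is that the landing step never loses information, so the same raw lines can feed both the conformed warehouse path and ad-hoc analysis.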
The use case of replacing your landing and staging areas with Hadoop serves many purposes, including reducing costs and extending capacity, but primarily it creates a new environment to support modern advanced analytics. This data evolution is needed to ensure that your data warehouse changes with the times rather than getting left behind in a highly competitive business world. Now is the time to consider renewing your data warehouse architecture and to see how Big Data can help elevate your business reporting and analytics.
Tuesday, May 26, 2015
I recently spoke at the Data Summit in NYC, where the focus was to “Unleash the Power of Your Data”. This was a great message for the event, as we live in a time where people are extolling the virtues of data, yet many still struggle to grasp how data can change a business and how they can harness its power. Today’s organizations have a good handle on how well-defined data can be used to present information on how they are doing. We produce reports on sales and on costs; we can easily discover what has happened. But the real business challenge is why it happened and how to make the future better. Predicting the future is the real silver bullet that companies need, but one that will continue to be elusive. The conference weighed this question in a number of different ways, but all with the same undertone: data is a resource which must drive value.
In my presentation, “Analytics in the Time of Big Data”, I presented how BI and analytics have changed over the years: an evolution from reporting for the masses, which was reactive and summarized what was happening, to a future where results and effects can be accurately predicted, creating an environment where the right decision or action can be taken. Reporting is changing and growing, and we need to understand the opportunities it provides.
The basic type of reporting or analytics is known as Descriptive Analytics, and it is the most common method of reporting we encounter, but in many ways it is the most limited form of analytics. In Descriptive Analytics we provide a description of what has happened. It provides the pulse of the business and is illustrated through standard reports and dashboards. The next is Predictive Analytics, which uses statistics, modeling, data mining, and machine learning to analyze current data and predict what may occur. This provides a way to figure out what may happen in the future based on the experiences of the many. We see this type of reporting every day on the Web, in recommendations for content and products based on what we do, read, and watch. The final and most productive form is Prescriptive Analytics, which takes predictive analytics and adds actionable data from both internal and external sources, along with the ability to capture feedback that improves future analytics. In the oil and gas industry, prescriptive analytics is used to determine the best place to drill, the best methods to use, and the production a well will yield. This progression is natural, but there is significant resistance and a lack of knowledge restricting businesses' chances of evolving, and this was widely discussed during the event. The change is needed if businesses are going to compete, but companies need to start now, as the journey is not a short one.
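The descriptive-versus-predictive distinction can be shown in a few lines. This is a toy illustration with invented sales figures, not anything from the talk: the descriptive step summarizes the past, and the predictive step fits a simple least-squares trend line to project the next period.

```python
# Toy illustration: descriptive analytics summarizes what happened;
# predictive analytics fits a model to project what may happen next.
monthly_sales = [100, 110, 125, 135, 150, 160]  # hypothetical six months

# Descriptive: a summary of the past.
average = sum(monthly_sales) / len(monthly_sales)

# Predictive: an ordinary least-squares trend line, projected to month 7.
n = len(monthly_sales)
xs = range(n)
x_mean = sum(xs) / n
y_mean = average
slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, monthly_sales))
         / sum((x - x_mean) ** 2 for x in xs))
intercept = y_mean - slope * x_mean
forecast = intercept + slope * n  # predicted sales for month 7

print(round(average, 1), round(forecast, 1))  # 130.0 173.0
```

Prescriptive analytics would go one step further: feed the forecast into a decision rule (e.g., how much stock to order) and capture the outcome as feedback for the next model.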
The challenge is to provide access to the data in a form which allows for the integration of all your data in one governed and controlled place. We need to build data repositories which can support these needs, but in reality, is it realistic to think all of your data will be in Hadoop or in an RDBMS? We need a solution which provides a federated view of all your data in a way that is meaningful for the business; this is known as data virtualization. With data virtualization, the structure of the data may not be well understood at first, but through use and application we give it meaning and create useful business definitions and relationships for the entire business. Consider this a schema-on-the-fly approach: it provides flexibility while keeping the core data available, and it gives users the ability to define the data they need, when they need it. It includes merging data from both internal and external sources in a way that simplifies access for users by hiding the complexity in the data. We now have the ability to solve the integration and availability of the data, but we will still need to change how we do analytics and look to our mathematicians to help us expand what we do and how we provide our services.
The evolution of reporting is underway, and we need to look at how we can provide our business analysts, business users, and consumers with the information they need, when they need it, to empower them to make better and more accurate decisions.
Wednesday, May 20, 2015
In the age of Big Data we have seen a revolution in the quantity and variety of data we collect. With this comes the question of how data should be stored. What methods and approaches should we use to define the structures of the information contained in a Hadoop or HBase cluster? At the start of the Big Data era, we were happy to store information in flat files with little thought to how the information would be used or how it should be stored. The key was to get data into the cluster and let the data scientists at it. We were not concerned that queries could take minutes to hours, as we now had access to data we never had before. This worked for a time, but as Big Data matured we needed to consider performance and functionality, and the data model is where this optimization can begin.
The idea that we can work without a model and define things on the fly is truly one of the strengths of HDFS and Hive, but in reality, for most purposes we need to define a need and a use case. This allows us to formalize a process or prepare data for use in a meaningful way, and it is where a data model can exist and flourish within Big Data. Loading files gives you the ability to quickly gain access to new data; for data which is processed to address specific questions, you can assume it will have a formal structure, which I would suggest would be similar to a star schema. The real question is: do you design as you would have before, using facts and dimensions, or should you design using a denormalized approach, where we store all the raw data and codes together with all the attributes required to support analytics? The argument against denormalization is that we will use even more disk space due to the Cartesian product we may have created; it will be a wide file, which comes with its own costs. In addition, when using compressed files, joining requires that the data be uncompressed first. In favor of denormalization: the data is available in a single place, there are no joining concerns, and the data can be compressed to alleviate storage concerns.
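The trade-off can be seen in miniature with invented product data (the table and column names here are illustrative only): the star-schema layout keeps fact rows small and pays the join cost on every read, while the denormalized layout repeats the dimension attributes on every fact row so no join is needed.

```python
# Hypothetical star-schema vs. denormalized comparison.
# Star schema: small fact rows plus a dimension lookup, joined at query time.
dim_product = {10: {"name": "Widget", "category": "Hardware"}}
fact_sales = [{"product_key": 10, "qty": 3, "amount": 30.0}]

def star_query():
    # The join cost is paid here, on every read.
    return [{**row, **dim_product[row["product_key"]]} for row in fact_sales]

# Denormalized: one wide row, attributes repeated (more disk, no join).
wide_sales = [{"product_key": 10, "qty": 3, "amount": 30.0,
               "name": "Widget", "category": "Hardware"}]

# Same answer either way; the difference is storage versus read-time work.
print(star_query() == wide_sales)  # True
```

In a compressed Hive file the repeated `name` and `category` values compress very well, which is part of why the wide layout is less costly than it first appears.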
I would approach it by loading all the objects I can into a RAW data area. In that area I would include all data feeds, including lookup and reference data, as well as any business rules required. This becomes our landing area. From there, the data can be moved using some combination of Spark, Hive, and Pig to bring it into a more workable format. My suggestion would be to generate the dimension objects from source, and to design the fact tables using a denormalized approach. In terms of physical layout, you must consider that data will need to be distributed and partitioned; both of these decisions will impact performance, so account for them in your design. In the working/staging objects, I would include all codes and descriptions along with the fact data. This information is constructed within the confines of the working area, and we can then store it for use down the line. I would even suggest that keeping all your intermediate work may provide some value. Adding indexes to these Hive objects can further enhance query performance.
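The flow just described can be sketched in a few lines of Python with invented feeds (the table names, codes, and the month-level partition key are my own assumptions for illustration): everything lands in RAW, a dimension is generated from the reference feed, and the working-area step builds a denormalized fact that carries codes and descriptions together, grouped by a partition column.

```python
# Sketch of the RAW -> working-area flow, with invented feeds and names:
# reference data and fact data both land raw, then a denormalized,
# partitioned fact is built with codes and descriptions side by side.
from collections import defaultdict

raw_status_lookup = [("S1", "Shipped"), ("S2", "Returned")]  # reference feed
raw_orders = [  # fact feed exactly as landed
    {"order_id": 1, "status_code": "S1", "amount": 20.0, "order_date": "2015-05-01"},
    {"order_id": 2, "status_code": "S2", "amount": 15.0, "order_date": "2015-05-02"},
]

status_dim = dict(raw_status_lookup)  # dimension generated from source

def build_fact(rows):
    """Denormalized fact: code + description together, partitioned by month."""
    partitions = defaultdict(list)
    for r in rows:
        wide = {**r,
                "status_desc": status_dim[r["status_code"]],
                "partition_month": r["order_date"][:7]}  # partition key
        partitions[wide["partition_month"]].append(wide)
    return partitions

fact = build_fact(raw_orders)
print(sorted(fact))  # ['2015-05']
```

In a real cluster the partition key would map to a Hive table partition, so queries that filter on month only read the relevant files.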
The bottom line is that you do need to put thought into how you place data within the HDFS cluster. You also need to consider how the information will be used, and therefore apply some structure to the data you hold, to truly empower its use. This combination can provide some radical improvements in performance and productivity, and it moves the idea of the data warehouse in Hadoop one step closer to reality.