Monday, May 27, 2013

Designing a Future-Proof Data Solution



The challenge we face today when designing solutions is this: how do we avoid the pitfalls of constant design changes? How can we reduce the impact of change on our data designs? Is it even possible?
The design of a data warehouse has been well discussed and debated over the years. The battle between Ralph Kimball and Bill Inmon is legendary, and the choice of an Information Factory versus a Dimensional approach continues to be one which every new data warehouse needs to consider. In this discussion, though, the choice is immaterial: whichever design approach you choose, you will still need to consider how the design will be developed. Can we build the design incrementally? Can we minimize the impact on the overall project and minimize regression testing? This is the constant challenge we face when developing data solutions at EPAM, especially when using Agile practices to drive our project success. The key is to design and develop once and to evolve the design as you go, but there are some key considerations to make during design to optimize it and minimize the refactoring that design changes may require.

The first item you must consider is using Design Patterns for data warehouse modelling. This approach says that all objects will be built from templates which address most of the needs within your design, meaning that similar tables in the design follow predefined patterns. At the most basic level we predefine what a dimension and a fact will look like: each will include a surrogate key and the attributes it requires, plus control fields that let us manage how and when data is processed. For more complex facts or dimensions we also provide templates which support the different types of slowly changing dimensions and manage quickly changing facts; both of these more advanced methods let us manage the data effectively and consistently. For relationships we define intermediate “bridge” tables, which provide a reliable way to relate facts and dimensions, improve performance, extend future query capabilities, and form a key part of allowing the model to work across multiple subject areas. A minimal sketch of such templates follows.
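As a minimal sketch of what such templates might look like (the table and column names here are illustrative, not from a specific project), here is an Oracle dimension pattern with a surrogate key, SCD Type 2 tracking, and control fields, plus a bridge table for multi-valued relationships:

    -- Dimension template: surrogate key, business attributes,
    -- SCD Type 2 tracking, and processing control fields
    CREATE TABLE dim_customer (
        customer_sk      NUMBER        NOT NULL,   -- surrogate key
        customer_nk      VARCHAR2(50)  NOT NULL,   -- natural key from the source
        customer_name    VARCHAR2(200),
        customer_segment VARCHAR2(50),
        effective_date   DATE          NOT NULL,   -- SCD Type 2: row validity start
        expiry_date      DATE,                     -- SCD Type 2: row validity end
        current_flag     CHAR(1)       DEFAULT 'Y',
        source_system    VARCHAR2(30),             -- control: originating system
        batch_id         NUMBER,                   -- control: load batch
        load_date        DATE          DEFAULT SYSDATE,
        CONSTRAINT pk_dim_customer PRIMARY KEY (customer_sk)
    );

    -- Bridge template: relates a fact to a multi-valued dimension
    CREATE TABLE br_customer_group (
        customer_group_sk NUMBER NOT NULL,   -- key carried on the fact
        customer_sk       NUMBER NOT NULL,   -- group member
        weight_factor     NUMBER(5,4),       -- optional allocation weight
        CONSTRAINT pk_br_customer_group
            PRIMARY KEY (customer_group_sk, customer_sk)
    );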

The second consideration is to Design with the Future in Mind. Here you are faced with the choice of building based only on defined requirements, with little consideration for future ones. In the Agile context this seems like the obvious approach: design what you need when you need it. The concept of Just-in-Time Design has been discussed and developed over the past few years. However, when we put this into practice, the reality is that you want to define your facts and dimensions as completely as possible the first time you design them, so that you design for future needs in addition to the ones you have at the moment. This will result in additional attributes which might only be used in a much later sprint but are defined now in order to reduce refactoring. It may also be necessary to define additional dimensions up front, to minimize the rework of adding dimension keys to fact tables later. The key is to design what you need when you need it while providing as much forward thinking in your object definitions as early in the process as possible, as the sketch below illustrates.
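One way to express this (a hypothetical example; dim_promotion, fact_sales, and the reserved key 0 are my assumptions, not from a specific project) is to define the fact with a foreign key for a dimension a later sprint will populate, defaulted to a reserved “Unknown” member so existing loads need no rework:

    -- A dimension defined ahead of need (minimal form)
    CREATE TABLE dim_promotion (
        promotion_sk   NUMBER NOT NULL,
        promotion_name VARCHAR2(100),
        CONSTRAINT pk_dim_promotion PRIMARY KEY (promotion_sk)
    );

    -- Reserve a default 'Unknown' member for rows with no promotion yet
    INSERT INTO dim_promotion (promotion_sk, promotion_name)
    VALUES (0, 'Unknown');

    -- The fact carries the key now; it stays 0 until the
    -- promotion feed is built in a later sprint
    CREATE TABLE fact_sales (
        date_sk       NUMBER NOT NULL,
        customer_sk   NUMBER NOT NULL,
        product_sk    NUMBER NOT NULL,
        promotion_sk  NUMBER DEFAULT 0 NOT NULL,
        sales_amount  NUMBER(12,2),
        quantity      NUMBER(10),
        batch_id      NUMBER,
        load_date     DATE   DEFAULT SYSDATE
    );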

The final suggestion I have to future-proof things is to ensure that your data warehouse is designed to support the integration of multiple data sources right from the start, adding the attributes and ETL functionality that support this approach up front. The data warehouse is really all about providing the business with an integrated and reliable solution; therefore you must design with the goal of integration from the beginning, as the sketch below shows.
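Building on the dimension template above (again illustrative; the sequence seq_dim_customer is an assumption), one simple device is to treat the natural key as unique only within its source system, so the same warehouse table can absorb multiple feeds:

    -- Natural keys are only unique within a source system, so qualify them
    CREATE UNIQUE INDEX ux_dim_customer_src
        ON dim_customer (source_system, customer_nk, effective_date);

    -- Each feed stamps its own identity as rows are loaded
    INSERT INTO dim_customer
        (customer_sk, customer_nk, customer_name,
         effective_date, current_flag, source_system, batch_id)
    VALUES
        (seq_dim_customer.NEXTVAL, 'C-1001', 'Acme Ltd',
         SYSDATE, 'Y', 'CRM', 42);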

Ultimately, the design and development of a data warehouse requires the data architect and data modeler to look to the future. They need to anticipate data requirements and define the data objects and relationships as completely as possible right from the start. By designing with the end in mind, while allowing the design to evolve with business needs, you can avoid many of the pitfalls of data warehouse design.


Wednesday, February 13, 2013

Master of Your Data using the Database

I was recently involved in a project for an organization that needed one thing: a master customer list and a master product list to enable cross-organization analysis. This may seem like a simple task (create a single customer and a single product record), but it is not.

As JFK said about going to the moon, "We choose to go to the moon in this decade and do the other things, not because they are easy, but because they are hard." The same is true of customer and product integration and mastering. MDM as a technology and a process is not easy; it is hard. But it provides so much value in the end that the journey is worth it.

The challenge of MDM is focused squarely on creating a technical solution which enables the business to automate the process of matching customers and products into a single master list. It can take significant effort to reach the point where the matching rules you define are meaningful and effective.

The project I was involved in required us to create a cost-effective solution that did not include a matching product like DataFlux or Trillium, but was instead based on the database and the ETL tool. Our database of choice was Oracle, which provides some SQL extensions to support matching. We implemented the matching within an ETL tool (Talend) which further extended the capabilities we had in the database. A number of functions were considered, and the following database and ETL functionality was used in our cleansing and matching approach (a short sketch follows the list):
  1. Regular expressions were used to find patterns, and to remove and alter text, to standardize names and addresses
  2. Equi-joins and other join types to match records on the standardized values
  3. The SOUNDEX function (or Metaphone, via the ETL tool) in combination with other matches to enable fuzzy matches
  4. Jaro-Winkler and Levenshtein edit-distance functions (in Oracle, the UTL_MATCH package) for fuzzy match scoring
  5. ETL tool functionality which further extends the base database functionality
All of these functions can help you find the right matches in your database and build your own MDM solution, one that leverages the investments you have already made in your database and ETL tool without a large additional software purchase.
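As a rough sketch of how these pieces fit together (stg_customer and its columns are illustrative, not from the actual project), the following Oracle SQL standardizes names with REGEXP_REPLACE, then uses SOUNDEX as a cheap phonetic blocking join and the UTL_MATCH package to score candidate pairs:

    -- Standardize: uppercase, strip punctuation, collapse repeated spaces
    UPDATE stg_customer
       SET name_std = REGEXP_REPLACE(
                          REGEXP_REPLACE(UPPER(customer_name), '[^A-Z0-9 ]', ''),
                          ' {2,}', ' ');

    -- Score candidate pairs: phonetic blocking plus similarity scores (0-100)
    SELECT a.customer_id                                              AS cust_a,
           b.customer_id                                              AS cust_b,
           UTL_MATCH.JARO_WINKLER_SIMILARITY(a.name_std, b.name_std)  AS jw_score,
           UTL_MATCH.EDIT_DISTANCE_SIMILARITY(a.name_std, b.name_std) AS ed_score
      FROM stg_customer a
      JOIN stg_customer b
        ON SOUNDEX(a.name_std) = SOUNDEX(b.name_std)   -- limits the comparison set
       AND a.customer_id < b.customer_id               -- evaluate each pair once
     WHERE UTL_MATCH.JARO_WINKLER_SIMILARITY(a.name_std, b.name_std) >= 90;

Pairs scoring at or above the threshold become candidate matches; in practice the threshold and the blocking rule are tuned iteratively against known duplicates.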

I will be presenting this solution at COLLABORATE13 in Denver in April, and this entry should help you as you consider an alternative approach to matching, one which will be critical to your MDM solution.