Data Warehouse 3.0 “the Interactive Data Repository”
Data Warehouse 3.0 was born of issues resurfacing from the earlier data warehouse generations, namely Data Warehouse 1.0 (the data mart) and Data Warehouse 2.0 (the schema). I am continually asked what the differences between 1.0, 2.0 and 3.0 are, and why those differences are important. What follows is an illustration that may provide some understanding.
Data Warehouse 1.0 (the Data Mart)
Data marts are often referred to as a “subset” of a data warehouse; unfortunately, data marts, and, in fact, most instances of Data Warehouse 1.0, are no more than a SQL-table representation of the processing system providing the data, such as a trading system or an accounting system. These data marts were born of the generally poor reporting of the originating processing engines built in the last century. The benefit of a data mart is that you can report at all, but you are limited to silos of data with extremely limited navigation (for example, you could not navigate from performance data to position data, to tax lots, or to allocations, because the data would sit in three different data marts with different references and different primary and foreign keys). Put simply, when creating a data mart, a developer typically takes an existing extract from a processing engine (e.g. trade history) and loads it into a SQL table verbatim, with minimal or no “warehouse” functionality.
Data Warehouse 2.0 (the Schema)
The requirement to “navigate” and “drill through” is sometimes called data mining. Data mining, the need to provide analytics across data sets, has driven the difficult task of designing holistic schemas. However, schema-based data warehouses typically suffer from at least one of three issues: 1) the schema is poorly designed, so the desired mining / reporting / analytics are practically impossible to achieve; 2) viewing and extracting data is prohibitively slow; or 3) loading data is too slow, resulting in incomplete data sets.
There are several different approaches for loading data into warehouses, the more popular being to load data through staging tables (i.e. starting with un-typed data and loading/transforming the data into hierarchical groups, often called dimensions, and into facts and aggregate facts). The combination of facts and dimensions is often called a star schema – a very trendy approach 15 years ago.
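As a minimal sketch of the star-schema idea (table and column names here are illustrative, not taken from any particular product), a central fact table carries the numeric measures and foreign keys into the descriptive dimension tables, and a report query joins the fact back out through its dimensions:

```python
import sqlite3

# Illustrative star schema: one fact table keyed to two dimension tables.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_security (security_id INTEGER PRIMARY KEY, ticker TEXT, sector TEXT);
CREATE TABLE dim_date     (date_id INTEGER PRIMARY KEY, trade_date TEXT);
CREATE TABLE fact_trade   (security_id INTEGER REFERENCES dim_security,
                           date_id     INTEGER REFERENCES dim_date,
                           quantity    REAL, price REAL);
""")
con.execute("INSERT INTO dim_security VALUES (1, 'ACME', 'Industrials')")
con.execute("INSERT INTO dim_date VALUES (10, '2012-05-01')")
con.execute("INSERT INTO fact_trade VALUES (1, 10, 100.0, 25.50)")

# A report query joins the fact to its dimensions to produce a readable row.
row = con.execute("""
    SELECT s.ticker, d.trade_date, t.quantity * t.price AS notional
    FROM fact_trade t
    JOIN dim_security s ON s.security_id = t.security_id
    JOIN dim_date d     ON d.date_id     = t.date_id
""").fetchone()
print(row)  # ('ACME', '2012-05-01', 2550.0)
```

The same join pattern scales out to many dimensions (account, currency, strategy), which is what made the shape so popular for reporting.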
The task of creating an effective data warehouse in the securities industry was indeed very challenging, and, as a result, in the early 2000s there were only two securities industry oriented data warehouses available (Eagle Investment Systems and Netik LLC). Eagle was typically deployed at large asset managers, whereas Netik was much more successful within asset services companies (e.g. Custodians, Prime Brokers, Fund Administrators, and Exchanges) where the data volumes were very large.
The key components of data warehouse 2.0 that differentiated it from data warehouse 1.0 are:
- Industry-specific schemas
The combination of normalized reference data with de-normalized numerical results was the practical solution the securities industry needed. De-normalized data resulted in user-friendly representations of data that were quick to access and understand, whereas the normalized approach resulted in multiple tables that were more complex to understand but facilitated data conformity where it was necessary (e.g. performance attribution). Both normalized and de-normalized data were prevalent in this period.
- Flexible ETL
Extract, Transform and Load (“ETL”) is the process of collecting data: a) extracting data from source systems; b) transforming the data into a defined schema; and c) loading the data into a new physical database.
- Atomic Data Sets
One of the key challenges for most data warehouse engineers is the question of where data can be made atomic (a data set with 100% integrity, including cross-references to other data sets). An example would be the prior day’s trades adding up to the beginning-of-day positions.
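That trades-to-positions example can be sketched as a simple reconciliation check (the securities and quantities below are made up): derive today’s opening positions from the prior close plus the day’s trades, and flag any security where the derived figure disagrees with what the source system reports.

```python
# Hypothetical atomicity check: prior close + day's trades must equal
# the beginning-of-day positions reported by the source system.
prior_positions   = {"ACME": 1000, "GLOBEX": 250}            # end of prior day
days_trades       = [("ACME", +100), ("ACME", -50), ("GLOBEX", +10)]
opening_positions = {"ACME": 1050, "GLOBEX": 260}            # per source system

derived = dict(prior_positions)
for security, qty in days_trades:
    derived[security] = derived.get(security, 0) + qty

# Any mismatch means the data set is not atomic and should be flagged.
breaks = {s: (derived[s], opening_positions.get(s))
          for s in derived if derived[s] != opening_positions.get(s)}
print(breaks)  # {} -- the set reconciles
```

A warehouse that enforces checks like this at load time can guarantee the integrity that makes cross-data-set navigation trustworthy.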
The problem with Data Warehouse 2.0
Typically, views of data were created by programmers; data was extracted into Excel or various business intelligence solutions without true attention to business needs. The real problem was that views had to be programmed and were not user-friendly for business analysts or operational users. In general, Data Warehouse 2.0 achieved results that were better than straight accounting (source-system) based reports, but the ideal of truly interconnected data was never achieved.
In addition, the cost of ownership was very high, and the creation of new views for users was also very expensive and slow to deploy – typically this was a frustrating process.
Data Warehouse 3.0 (The Open User)
The core driver of Data Warehouse 3.0 is the openness of the warehouse to users: the combination and interaction of many non-traditional data points (such as media and documents), and the ability to perform analytics on data in real time. The three Data Warehouse 2.0 requirements above (industry-specific schemas, flexible ETL and atomic data sets) still exist; however, ETL has now been extended to include real-time data feeds, such as exchange prices, and cloud-based documents. Data Warehouse 3.0 also includes:
- In-memory non-SQL based data access
This is a substantial difference between Data Warehouse 2.0 (with its dreadful DBMS-specific SQL stored procedures) and Data Warehouse 3.0, which renders and accesses data in memory. The benefit is super-fast queries and data interactions. “Mashing” (discussed below) requires substantially more data hits and therefore would not be possible without the speed achievements of Data Warehouse 3.0. InvestCloud’s approach has again been substantively unique, with the further introduction of message-based queries and interactions. The thinking in Data Warehouse 2.0 moved the warehouse beyond an Operational Data Store (ODS) or data mart, and therefore a second system acting as an ODS was always needed. With Data Warehouse 3.0 this is no longer the case: the ODS need is met simultaneously with the need for quick access to normalized and de-normalized data.
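A toy illustration of the in-memory idea (this is a generic sketch, not InvestCloud’s implementation): once data is rendered into hash-based indexes in memory, a “query” becomes a dictionary lookup plus a filter, with no round trip to a SQL engine or stored procedure.

```python
from collections import defaultdict

# Illustrative in-memory store: position records indexed by account,
# so account-level queries avoid any round trip to a SQL engine.
positions = [
    {"account": "A1", "ticker": "ACME",   "qty": 100},
    {"account": "A1", "ticker": "GLOBEX", "qty": 50},
    {"account": "A2", "ticker": "ACME",   "qty": 75},
]

by_account = defaultdict(list)
for p in positions:
    by_account[p["account"]].append(p)

# A "query" is now an O(1) index lookup plus an in-memory projection.
a1_tickers = [p["ticker"] for p in by_account["A1"]]
print(a1_tickers)  # ['ACME', 'GLOBEX']
```

The speed difference matters precisely because interactive, mashed-up views issue many small queries per screen rather than one large batch report.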
- Big Data & Real-Time Data
The differentiator of our Big Data solution is the “aspect” of the data. Today, we store all of the aspects prescribed in Data Warehouse 2.0; however, the warehouse further links to market information (social media, news feeds and regulatory reporting). The benefit is a massive amount of data that can be used for investment decision-making: a) an Information Advantage (e.g. access to data that can be difficult to gather and collate); b) an Analytical Advantage (historical and pro forma models); and c) a Behavioral Advantage (push notifications related to market and data events). In the past, this information could be accessed, but it was un-integrated, never in one place, and there was little or no concept of history or time-series data management for anything other than portfolio data (i.e. positions and trades).
- Views and Interaction (Generation and Personalization)
Big Data and real-time data engender the potential for thousands of views of this powerful and rich data set. Accordingly, a new requirement exists: giving users the ability to personalize their views and then store and recall those customized views, for everything from simply adding or removing data columns to detaching views into applets (small, powerful functional components). This also includes allowing users to assemble personalized dashboards, and all of this must be achieved without one second of programming/coding. All of the views in Data Warehouse 3.0 are generated and, therefore, the number of permutations is endless. Today, InvestCloud has over 2,000 views of information that have been built without one programmer writing one line of code.
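The key to generated, personalized views is treating a view definition as data rather than code. A minimal sketch (the column names and view structure are invented for illustration): the user’s column choices are stored as configuration and applied at render time, so no programming is involved in creating a new view.

```python
# Hypothetical view personalization stored as data, not code.
base_columns = ["ticker", "quantity", "price", "market_value", "sector"]
user_view = {"name": "my_positions", "hide": ["sector"], "add": ["pnl"]}

def render_columns(base, view):
    """Apply a stored personalization (hidden/added columns) to a base view."""
    cols = [c for c in base if c not in view["hide"]]
    return cols + [c for c in view["add"] if c not in cols]

print(render_columns(base_columns, user_view))
# ['ticker', 'quantity', 'price', 'market_value', 'pnl']
```

Because the personalization is just stored data, it can be saved, recalled and shared per user, and the same generator can emit any number of view permutations.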
Applets empower the user with control and the ability to combine many data dimensions and aspects from Data Warehouse 3.0. Mashing also enables many different third-party components to be assembled to provide an even more powerful experience.
. . . and finally, the three “Ts”:
- T[…] – something that starts with the letter T.
- T[…] – something else that starts with the letter T.
- T[…] – finally, something else starting with the letter T.
The above three “Ts” are the “secret sauce” of Data Warehouse 3.0 and have been designed and implemented via InvestCloud. The challenge for most data warehouse developers is that the three “Ts” are not that obvious, nor will we publish them here. However, they are fundamental to design and usability.
I hope our illustration of the points above is useful and provides an insight into Data Warehousing 3.0 and InvestCloud. In InvestCloud SaaS we have implemented Data Warehouse 3.0, including all of the “Ts”. Today, the time it takes to implement Data Warehouse 3.0 via InvestCloud SaaS is weeks, not months (or years, as was the case with Data Warehouse 2.0).
The final problem we had with Data Warehouse 2.0 was the enterprise deployment model (now resolved in 3.0). Organizations that implement enterprise software struggle with applying updates and patches. There is a tremendous, typically undocumented, cost of ownership for enterprise software, born of the time IT teams must spend maintaining, updating and re-testing. InvestCloud, through our SaaS, PaaS and BPaaS cloud models, now implements updates for all customers weekly, rather than yearly. We take on the cost of testing and updating, and our clients save considerable time and money.
Original article at: http://investcloudsolutions.blogspot.com/2012/05/data-warehousing-30-understand.html