Under 5 min. | Summary for data warehouse, lake, and virtualization

What is a data warehouse?

In a basic sense, a data warehouse(DW) is a fairly large warehouse filled with data rather than physical products.

It is not the same as the database.

  • Data in the data warehouse comes from elsewhere. It’s not created there. It may come from our operational systems or external sources.
  • Data is copied, not moved.

Data remains in our source systems and then copies are made and sent into the data warehouse(DW).

Rules about how we manage, govern and store data in DW -

In 1990, Bill Inmon came up with the rules for data warehousing.

Data Warehousing (DW) is a process for collecting and managing data from varied sources to provide meaningful business insights.

  • Integrate: Data is sent from various resources to DW.
  • Subject-oriented: regardless of how many systems and which data is coming from which systems. More on this later.
  • Time variant: It doesn't only contain historical data but also current data.
  • Non- Volatile: It remains stable in between the data refreshes.

Traditionally, we periodically load data in the data warehouse and we do so in batches. Suppose once in a day, we pull in data from all the sources.

Between the last refresh and upcoming refresh, DW remains as it is irrespective of the no. of transactions happening in our transactional systems at that time.

and in this time, we could do strategic planning without the data changing underneath us.

Why did we build this DW?

  • data-driven decision-making
  • one-stop shopping.

To support data-driven decision-making rather than rely on experience or intuition or hunches.

we need a view of the past, present, and what the future will likely be with regards to various areas of organizations and to get the unknown, something important and interesting from the data.

if we consider all these views of our business as represented through our data as one, we actually have a discipline as Business Intelligence(BI).

BI + DW = Package1

and second is the idea of one-stop shopping.

Before DW, any data-driven decision-making requires going after the data either in the original applications or extracting files that drew data from one or a couple of those applications.

so that’s how with the help of DW, we are more focused on analyzing the data rather than gather it from here and there, over and over again.

Data warehouse vs Data lake

The idea of data warehouse goes back to the early 1990s and the data lake is the more recent term, both are valuable tools.

DW is built on the top of a relational database like Microsoft SQL Server, Oracle, etc

RDBMS can be used for transactional systems and applications, not just for data warehousing.

Sometimes we build DW on the top of MDBMS known as cube.

In contrast, a data lake is built on top of some sort of big data environment.

  • Volume: helps us manage extremely large volumes of data in data lakes, even larger than what we typically would include in the largest data warehouse.
  • Velocity: rapid intake of new and changed data wrt. DW.
  • Variety: DW works mostly with structured data whereas big data also supports semi (e-mail, blogs, etc) and unstructured data (audio, video) as well.

BI + DW/DL = Package 2

Data Virtualisation

In the 1980s, before data warehouses, data-driven decisions are made through extract files. Let’s call them a mini data warehouse pulling data from various applications but without any coordination of data structure, rules, or pretty much anything else. Far more time was spent on gathering data instead of analyzing data.

DDBMS

In the late 1980s, computer system companies started working on distributed data management systems (DDBMS).

The idea of a DDBMS was a complete failure as the technology in the late 1980s was not good enough to support the distributed usage of data, although the concept was a solid one. Even distributed database is very much the earlier incarnation of how Hadoop works like a big data engine.

That led us to two paths -

  • Data Warehousing

The idea was if we can’t reach out and grab the data on-demand from original applications or databses. let's preemptively pull what we think we need and integrate the copies of those data together in a separate environment known as a data warehouse.

  • Data Virtualization

Another group was not happy with the idea of copying data, so they came up with a modified version of DDBMS. It can be thought of as a read-only distributed database idea versus the original distributed database idea of, not only reading but also updating it in place via those indexes.

Unlike DW, we don’t copy the data into a separate database instead we access it from original locations at the time we need for reports and analytics.

It has many names over the years for the same concept such as virtual data warehousing, enterprise information integration, enterprise data access, and now data virtualization.

  • Simple Transformations
    If we have data that needs simple or no transformation in BI and analytics.
  • Small number of data sources
  • Relaxed Response time to our queries and reports.

BI + DW/DL/DV = Package 3

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Pranav Bansal

Pranav Bansal

Highly curious generalist with a bag full of ideas!