Big data has to be one of the most over-hyped and over-used phrases during the last few years. The ability to analyse near unimaginably large sets of data brings the promise of gaining amazing insights that were previously out of reach. As with many buzz-words it is quite difficult to find a clear definition of big data. One way of looking at it however is as data sets that are too large to be processed by traditional tools such as relational databases. If relational databases are no use for big data then what should we use instead? One of the answers is the data warehouse.
Data warehouses are essentially a form of database, but differ from more familiar relational databases in a number of significant ways. A traditional relational database is typically used for applications such as online shopping websites. They are optimised for large numbers of short transactions (imagine the number of sales taking place on a typical well-known brands website every minute) and for maintaining data integrity (vital for financial transactions).
Data warehouses on the other hand are designed facilitate data analysis. This means that they need to deal with a much lower volumes of transaction when compared with a relational database, but the queries used are often much more complex and frequently involve aggregating large amounts of data.
The major problem with data warehouses is that they are typically very expensive and time consuming to procure and configure. In an attempt to tackle this problem public Cloud providers have started to make managed data warehouse solutions available. These have the advantage of being very quick to provision (less than a day as opposed to weeks or months for an on-premises solution) and having no up-front costs. Being a managed service you also do not need to worry about applying patches and updates; and backups are typically handled automatically as well.
For example, Amazon Web Services (AWS) <https://aws.amazon.com/> provides Amazon Redshift <https://aws.amazon.com/redshift/>. This is a fully managed, Cloud based data warehouse service designed for analysis of large data sets. Redshift is a columnar <https://en.wikipedia.org/wiki/Column-oriented_DBMS> based data warehouse which can scale from a few hundred gigabytes to petabytes or more. It is perfect for loading huge amounts of data from multiple disparate sources and then analysing it. It can ingest data very efficiently using a parallel load process from AWS S3 <https://aws.amazon.com/s3/> buckets, but it also integrates well with other AWS services such as Data Pipeline <https://aws.amazon.com/datapipeline/>, Kinesis <https://aws.amazon.com/kinesis/> and Amazon Machine Learning <https://aws.amazon.com/machine-learning/>. Third-party data analysis tools such as Looker <https://looker.com/> can also easily be used in conjunction with Redshift.
Data warehouses can be a near essential tool for the analysis of large volumes of data.
Using a Cloud based solution allows you to provision a very powerful “warehouse” in a fraction of the time, and cost of a traditional on-premise solution. This makes them an obvious choice allowing you to get started with your analysis both quickly and cheaply.
Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone things everyone else is doing it, so everyone claims they are doing it.