Why Big Data should be a priority at the moment
The belief that we live in a data age rests on a simple observation: with each passing fraction of a second, almost everything we see, hear and touch is generating data. PCs, mobile handsets, tablets, GPS devices, servers and sensors attached to vehicles, buildings and satellites produce huge amounts of data, which are stored across multiple databases and data stores worldwide. The speed at which this data is generated is increasing rapidly, and so are the volume and the variety of structured and unstructured data types. Together, these are commonly referred to as Big Data.
In theory, Big Data means having tons and tons of data at your disposal. However, in practice, this data is next to useless if businesses cannot apply analytics to benefit and gain insight from it. Businesses must find a ‘mechanism’ that can perform collaborative filtering to ‘net’ vital information and trends and to answer key questions.
But how easy is it to find this mechanism? Frameworks have already been developed to store and retrieve such data using MapReduce algorithms. Apache Hadoop is one of the frameworks of choice, and some of the large tech and social media companies have shown a keen interest in it.
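To make the MapReduce idea concrete, here is a minimal single-machine sketch of the pattern in plain Python. It is not Hadoop code and does not use any Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) and the sample sentences are invented for illustration. A real framework would run the same three steps in parallel across many machines.

```python
from collections import defaultdict


def map_phase(record):
    """Emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all values by key, as a framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(records):
    """Run map, shuffle and reduce in sequence over an iterable of text records."""
    pairs = (pair for record in records for pair in map_phase(record))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input, standing in for lines of a much larger file.
    sample = ["big data is big", "data is everywhere"]
    print(word_count(sample))
```

The point of splitting the work this way is that each map call looks at one record only and each reduce call looks at one key only, so the steps can be spread over many servers without the pieces needing to know about each other.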
Hadoop is so popular because of the speed at which it can load large volumes and varieties of data compared with other ETL (extract-transform-load) tools and frameworks. Hadoop can load data much more quickly than relational databases because it has no gatekeeping rules. It stores the data on its HDFS (Hadoop Distributed File System) without needing to match the destination structure with the source structure, whereas a relational database would require ‘Table B’ to have the same structure as ‘Table A’ in order to transfer data across. Instead, Hadoop splits files into blocks that are spread across numerous data nodes, and MapReduce jobs then process the data as key-value pairs. Initially, querying data on HDFS was slow and involved a great deal of effort, and a more SQL-like language was needed to make the task practical. Over the years, however, many contributions have been made to this open-source software framework. Currently there are many ‘SQL on Hadoop’ tools, such as Apache Hive, Stinger, Apache Drill and Spark SQL, which have improved data-retrieval performance considerably.
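The ‘no gatekeeping rules’ point is often described as schema-on-read: data is accepted in whatever shape it arrives, and structure is applied only when a question is asked. The sketch below illustrates that idea in plain Python under simplifying assumptions. The RawStore class, its methods and the sample records are hypothetical and do not correspond to any HDFS or Hive API.

```python
import json


class RawStore:
    """Toy stand-in for a file store that accepts records without checking them."""

    def __init__(self):
        self.lines = []

    def load(self, line):
        # Schema-on-read: no structure check at load time, so loading is cheap.
        self.lines.append(line)

    def query(self, field):
        # Structure is applied only now, at read time.
        results = []
        for line in self.lines:
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # not valid JSON, so this query ignores it
            if isinstance(record, dict) and field in record:
                results.append(record[field])
        return results


if __name__ == "__main__":
    store = RawStore()
    store.load('{"user": "a", "clicks": 3}')
    store.load('{"user": "b"}')       # missing a field, still accepted
    store.load("free text log line")  # not JSON at all, still accepted
    print(store.query("clicks"))
    print(store.query("user"))
```

The trade-off is visible even at this scale: loading never fails, but every query has to cope with records that do not fit, which is part of why early querying on Hadoop took so much effort.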
Hadoop’s architecture is very scalable, allowing thousands of servers to work together instead of relying on one high-specced server. Hadoop is also fault tolerant, storing multiple copies of each piece of data on different nodes. If one node goes down, the framework automatically switches over to another node that holds a copy of the data.
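The fault-tolerance idea can be sketched in a few lines of Python. This is a toy model, not HDFS: the TinyCluster class and the node names are hypothetical, placement is simple round-robin rather than HDFS's rack-aware policy, and the only detail taken from Hadoop itself is the default of three copies per block.

```python
class TinyCluster:
    """Toy model of block replication across nodes (not real HDFS behaviour)."""

    def __init__(self, node_names, replication=3):
        if replication > len(node_names):
            raise ValueError("need at least as many nodes as replicas")
        self.names = list(node_names)
        self.blocks = {name: {} for name in self.names}  # node -> {block_id: data}
        self.alive = set(self.names)
        self.replication = replication
        self.cursor = 0  # round-robin starting point for the next write

    def write(self, block_id, data):
        """Store the block on `replication` distinct nodes and return their names."""
        count = len(self.names)
        targets = [self.names[(self.cursor + i) % count] for i in range(self.replication)]
        self.cursor = (self.cursor + 1) % count
        for name in targets:
            self.blocks[name][block_id] = data
        return targets

    def fail(self, name):
        """Mark a node as down; its copies become unreachable."""
        self.alive.discard(name)

    def read(self, block_id):
        """Return the block from the first live node that holds a copy."""
        for name in self.names:
            if name in self.alive and block_id in self.blocks[name]:
                return self.blocks[name][block_id]
        raise LookupError("no live replica for block " + repr(block_id))


if __name__ == "__main__":
    cluster = TinyCluster(["n1", "n2", "n3", "n4"])
    cluster.write("b1", "payload")
    cluster.fail("n1")
    cluster.fail("n2")
    print(cluster.read("b1"))  # still readable from the remaining replica
```

With three copies, the block survives the loss of any two of the nodes that hold it. Only when every replica is down does the read fail, which is the situation replication is designed to make unlikely.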
In recent times, Hadoop has received a lot of buy-in from the big-player analytics and visualisation vendors. Cloudera is partnering with NoSQL database vendor MongoDB, and Hortonworks has collaborated with Tableau. Other vendors, such as QlikView, Spotfire and MicroStrategy, offer ODBC configuration to connect to Hadoop. Cloudera has released a universal ODBC driver that enables the connection of many applications, such as Teradata Parallel Transporter (TPT), SSIS, IBM DataStage, Ab Initio, Informatica PowerCenter, SAP Data Services, Business Objects, OBIEE, Cognos, SAS, SPSS, Unica, Linked Server, Oracle Database Gateway and many more.
This all sounds very exciting and makes a convincing case that Hadoop is the way to go. However, although Big Data has been around for years, it has still not taken off in the South African market as initially predicted. If you or your company is equipped with a skill set similar to that of these large tech and social media companies, then implementing a Big Data solution is not an issue. The question, however, is how many companies are actually equipped with such a skill set. The answer is not many, and the market is showing a massive demand for Big Data resources. Global management consulting firm McKinsey & Company has predicted that by the year 2018, the shortfall of Big Data experts will be anywhere from 140,000 to 190,000.
The good news is that tertiary institutions are now including Big Data in their curricula and post-graduate degrees, so we can expect the Big Data resource pool to grow over time as these students complete their studies. Until this happens, companies are investing in their current staff complement to fill the gap. However, learning Big Data via frameworks such as Apache Hadoop involves quite a steep learning curve compared with traditional, structured BI involving SQL, ETL and OLAP tools, and to a certain degree it requires a mindset shift.
Companies do have other options. They can opt for a cloud solution like Microsoft’s HDInsight, which runs on its Azure platform. This option has its pros and cons. On the plus side, it collects and stores your data at a reasonable price on your HDInsight cluster, and there is no need to understand the background processing and architecture of Hadoop. Microsoft has also developed PolyBase, a way to combine and query non-relational data (HDInsight) together with structured data in SQL Server Parallel Data Warehouse. This integrates perfectly with the latest version of Microsoft Excel and an assortment of Power BI tools, such as Power Pivot and Power Query. One of the biggest negatives of this option is the cost involved in processing and retrieving the answers you are looking for. There is also the security element of storing sensitive data on an off-site server, perhaps halfway across the globe. Implementing data governance can also be tricky with regard to unstructured data and defining which users are authorised to see which data.
Another option is for companies to engage consulting firms that have already invested time, research and development in Hadoop and its architecture. Consultancy firms allow their staff autonomy to play around with cutting-edge technologies and often give them carte blanche to find innovative ways of solving tomorrow’s business needs. Companies can then draw on this expertise and knowledge without spending unnecessary time and money.
Whichever option you choose to invest in, I believe Big Data is the must-have component that should be found in every company’s BI architectural framework. Due to the current skills shortage and a reluctance to opt for a cloud solution, Big Data may not be a priority for companies at the moment, but it will definitely become a must in the future. The quicker businesses get their Big Data solution up and running, the sooner they will be able to gain a substantial edge over their competitors. The race is on for companies to find the answers of tomorrow based on the data that was always there but could never before be analysed.