Big Data and the Vs
After having many discussions about the topic of Big Data with Entelect’s CTO, Martin Naude, I decided to visit my go-to website, Wikipedia, for a ‘tidy’, encyclopedic definition. According to Wiki, the concept of Big Data has been around since 2001. Initially, there were three V's that defined Big Data. These were volume, velocity and variety. However, this is where the main dilemma begins… no one can define any of these V's for me. There is no concise definition of what each one means or entails. So after all the research and more discussions, I have decided to throw the dictionary (and Wiki) out of the window on this one and come up with simpler, more realistic definitions of all the V's and at the same time, hopefully make sense of Big Data.
While researching the ‘volume’ side of Big Data, I found quite a few definitions. Some say the data must be terabytes, petabytes or even larger. However, ‘volume’ was set as a pillar back in 2001, and when I look back on my career, in the early 2000s if someone had 100 gigabytes of data in a data store, he or she was on the bleeding edge and would probably have been asked to keynote a conference to explain how he or she actually got it right. Today, we are in a more privileged environment. For just US$3500 we can buy a small-footprint hard drive that can store all the music the world has ever produced. With the uptake of solid-state drives, we are able to equip a server with the ability to store and access tons of data without breaking the bank. So my definition of volume is more about the ability to store and access large volumes of data, and not purely a focus on the amount of data that constitutes Big Data.
‘Velocity’ refers to rapidly changing data that is processed at a very quick rate. It won’t come as a shock to find that I have an issue with this one, too! Where is the line in the sand to say we are processing quickly: is it 6 gigabytes a second, as in the case of the Large Hadron Collider, is it 1,000 transactions a second, or is it once a month? I believe that raw speed matters less than the ability to process, analyse and output the transactions at the rate a given use case requires, over the required timeframe. For example, transactional fraud detection is important and we may need near-real-time algorithms to run against transactions as they are processed; otherwise, our risk exposure would increase dramatically. Geofencing applications need near-real-time processing, too. However, if we want to analyse a debtors’ book, a turnaround of a day or two to run certain scenarios and put plans in place is perfectly acceptable.
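To make the fraud example concrete, a near-real-time check is often nothing more exotic than a sliding-window rule applied as each transaction arrives. The sketch below is purely illustrative – the class name, thresholds and fields are my own, not any particular fraud engine’s API:

```python
from collections import deque
import time


class VelocityCheck:
    """Flag a card that exceeds max_txns transactions within window_s seconds.

    A minimal sliding-window rule: real fraud engines layer many such
    rules with scoring models, but the near-real-time shape is the same.
    """

    def __init__(self, max_txns=5, window_s=60):
        self.max_txns = max_txns
        self.window_s = window_s
        self.history = {}  # card_id -> deque of transaction timestamps

    def is_suspicious(self, card_id, ts=None):
        ts = ts if ts is not None else time.time()
        q = self.history.setdefault(card_id, deque())
        # Drop timestamps that have slid out of the window
        while q and ts - q[0] > self.window_s:
            q.popleft()
        q.append(ts)
        return len(q) > self.max_txns
```

Because the window is pruned on every call, memory stays bounded per card and each check is effectively constant time – which is what makes running it inline with transaction processing feasible.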
When we look at velocity, we also need to take into account several external factors. These include infrastructure and Internet speeds, storage ability and processing speeds. This is where software meets hardware and fortunately, we are in the middle of a software boom: new data storage engines are being released every month and we now have the likes of MongoDB and Hadoop, along with a host of NoSQL and NewSQL engines, that challenge our preconceptions about storing data in non-traditional formats.
‘Variety’ is the one ‘V’ that has really matured over the last ten years. We now have the Internet of Things (IoT) and sensor-driven manufacturing, we have the ability to pull in social media information to take our understanding of our customers to the next level, and we also have many more line-of-business systems than we did ten years ago. This is where the real challenge starts from a Business Intelligence (BI), analytics and Big Data point of view. We can no longer pick a technology and run with it. Historically, organisations were a Microsoft or an Oracle outfit, but they are now faced with the challenge of running side-by-side appliances appropriate for the variety of data they are attempting to consolidate and relate. Today, organisations need to start diversifying from a single-vendor solution (although I don’t believe the traditional relational database will ever die) and face a new challenge: taking that traditional relational data, relating it to unstructured data (for example, on a Hadoop appliance), and still providing the results through a mechanism that business users find intuitive and can understand.
I am not sure what happened in 2012, but another V was added to the list – ‘veracity’. This means that the data needs to conform to the truth and has to be trusted. This, to me, brings everything back together and forms the foundation for a Big Data implementation. In everything we are doing, we need to ensure that what we are providing to business conforms to the truth and can be trusted. If we look at some successful Big Data applications and implementations, we just need to ask the question ‘what if the data was wrong?’ and we will quickly understand the importance of this. When President Obama ran his micro-targeting campaigns, there was very little room for error and the data had to be correct. This is also the case with Amazon.com, which is trialling delivering goods before customers order them. To do this effectively, the company has written algorithms to predict what customers are going to order (and when they will order it) that are so good that if the data is only 90 per cent accurate, the company stands to lose a lot of money. Then we have Uber, a company that is setting a benchmark for others, innovating across several industries as far more than just another taxi service. Uber published its algorithm for predicting a client’s destination, and can now begin to offer car-pool services. The value of accurate and correct data is easily understood.
With all these V’s in place, there seem to be many more marketable V's that can easily be strapped onto the existing foundation. ‘Visualisation’ is important: can we provide business with the information in a format that makes sense? Businesses are spoilt with the number of technologies available to use for a presentation layer. However, this also brings another conundrum – which visualisation tool do they use? And this is where we start balancing cutting edge with bleeding edge. At least once a month, I speak to a potential client who mentions a new visualisation tool and asks if we at Entelect have expertise in it. Each tool has its own pros and cons, and here is where we need to take a step back and define what functionality we are actually looking for in a visualisation tool.
I have done countless implementations where self-service BI was a hard requirement, so the tools were selected and the universe of data provided. When I checked in a few months later, however, the company had employed report writers because users didn’t actually want to write their own reports, or the IT department was suffering headaches because everyone was applying different filters to the data and expecting the same result. Do we actually need interactive charts and graphs, with drill-down and drill-through, bottom-up and top-down reporting? For many, these requirements are a must – until you mention the price of the potential tool, at which point they quickly move to ‘show value first and add the sexy visualisations later’. I have never seen a successful revenue assurance implementation run on anything other than raw data. What I am trying to emphasise is that companies need to keep their short-term goals in mind and, before they fork out a lot of money on a visualisation tool, make sure that the path they are going down is going to add value to the business.
‘Value’ has also been added to the list of V's. Now that we have all the data and it is available to look at, how do we actually extract value from it? New roles are being created within organisations: we now have chief data officers, and the role of the data scientist is also gaining business traction. Once again, before businesses go out and spend money on someone with a quants or risk qualification, they need to look internally for the person I like to call the ‘Miss Congeniality’ of the company. This is someone who has been at the company for many years and worked across the majority of its departments. This is the person who best understands the lower levels of the business – the one who understands how the insurance department influences the credit department, and the implications thereof. This person will create the foundation before the data scientists arrive.
In the field of data analytics, many companies have invested in out-of-the-box analytics tools that provide a perceived ‘easy way’ to analyse data. It is quick to push out a time-series forecast or a decision tree, or to cluster the customer base. This is great, but keep in mind that it offers no competitive advantage: out-of-the-box tools are available to anyone who is willing to pay for them. Know up front that if you want something different, it is going to have to be bespoke.
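To show how little magic sits behind a one-click ‘cluster the base’ feature – and why anything beyond it quickly becomes bespoke work – here is a toy one-dimensional k-means sketched in plain Python. The function name, parameters and data are purely illustrative:

```python
import random


def kmeans_1d(values, k=2, iters=20, seed=0):
    """Toy 1-D k-means: the kind of segmentation an out-of-the-box
    tool performs in one click, written out by hand."""
    rng = random.Random(seed)
    centroids = rng.sample(values, k)  # pick k starting centroids
    for _ in range(iters):
        # Assign each value to its nearest centroid
        clusters = [[] for _ in range(k)]
        for v in values:
            idx = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[idx].append(v)
        # Move each centroid to the mean of its cluster
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)
```

On well-separated data such as `[1, 2, 3, 100, 101, 102]` this settles on centroids of 2 and 101 in a couple of iterations. The bespoke effort starts the moment a business wants anything more than this generic recipe: domain-specific distance measures, mixed data types, or segments that actually map to how the company sells.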
So now we have all the theory and the best-case examples. The number of success stories I have read is countless; the number I have seen properly implemented is far smaller. We all know the benefits of using this data to drive sales, increase revenue during down times and engage customers more effectively.
In South Africa we have access to exactly the same technology as the rest of the world, and we have people in this country who are able to create custom algorithms and stand up as true data scientists. The question remains: why are we still behind the curve on this, and why are we not the benchmark? My pledge to the data community of South Africa is to invest in a Big Data competency over the next 12 months and to keep you updated every step of the way. I hope to learn what actually works and what doesn’t in practical, real-world environments. With a bit of luck and a lot of hard work, we will be able to position South Africa as a world leader in Big Data.