Thursday 10 January 2013

Future for big data is small, fast and powered by end-users


I was intrigued by this article on the hype around big data: http://venturebeat.com/2012/12/21/big-data-club/

Last year I was invited to speak at a Corporate IT Forum workshop on MIS, which included plenty of big data debate. Some of the attendees were bemoaning a lack of 'accessible' big data technology, along the lines of 'we have petabytes of data to process and nothing to do it with', whereas others saw this as completely irrelevant because their organisations weren't generating that kind of data in the first place.

At Sabisu we do a lot of work with organisations that generate a lot of data. Some of it is well structured, lots is badly structured, lots is unstructured. But even these guys don't really have 'big data' issues along the lines of the article linked above. Virtually everyone we talk to has plain old data issues - the same ones they had 10 years ago, just on a bigger scale - but not multi-petabyte big data issues.

To put it into perspective, a big enterprise might have 30,000 users, all storing Excel/Word/PowerPoint docs and emails. Facebook has a billion, all storing video and photos. So chill out. Your organisation probably has an MIS or data management problem, but not a big data problem.

That's not to say the technology and techniques pioneered by the Facebooks and Googles of this world don't have value. Every organisation would benefit from working with unstructured, non-relational data in a distributed, resilient architecture... and that's what I take 'big data technology' to mean.

As a definition that's pretty sloppy. The fact is that distributed algorithms have been around a while. They've just not been 'accessible', which brings us back to our friends running the IT functions at 'normal' sized enterprises.

Our friends are being sold - and are buying - huge data warehouses that cost a fortune. It's in the vendors' interests to push the need for big data capability even where a 'normal' sized enterprise doesn't need it - and I don't believe most do.

I suggest that 99% of enterprises could function magnificently on 5% of the KPIs they currently capture. Most of the KPIs have little operational relevance. Most of the data-warehouse manipulation and exploitation is a waste of time. The reason for this is that the end-users cannot ask the questions they need to ask - there is no interface in place, so they ask a question they can get an answer to instead.

Sure, you have an MIS system. And it's self-service, right? And your users love it, right? So how many of those KPIs affect your organisation's bottom line?

Here's where 'accessible' implementations of unstructured, non-relational, distributed data processing will change things. Users will be able to ask questions that directly affect the bottom line, and it won't matter whether the right cube has been built, or batch job run, or ETL function completed, or whatever; the answer will be constructed on the fly by millions of small worker algorithms distributed throughout the IT architecture. A toy sketch of the pattern follows.
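To make that a little more concrete, here's a minimal sketch of the pattern: a few small workers each scan a shard of raw text and their partial answers are combined on the fly. It's not any particular product's implementation; the data, the names and the word-count 'question' are all invented, and Python's multiprocessing stands in for a genuinely distributed architecture.

```python
# Toy sketch: answer an ad-hoc question by fanning work out to small
# workers and combining their partial results on the fly, rather than
# waiting for a pre-built cube or batch job. All data here is made up.
from multiprocessing import Pool
from collections import Counter

# Pretend these are shards of unstructured log/document text spread
# across the architecture.
SHARDS = [
    "pump P-101 tripped twice; throughput down 3%",
    "throughput recovered after P-101 restart",
    "compressor C-201 vibration alarm; no throughput impact",
]

def map_shard(shard: str) -> Counter:
    """Worker step: extract a partial answer (word frequencies) from one shard."""
    return Counter(word.strip(";,.%").lower() for word in shard.split())

def reduce_counts(partials) -> Counter:
    """Combine the partial answers into the final result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

if __name__ == "__main__":
    with Pool(processes=3) as pool:
        partials = pool.map(map_shard, SHARDS)
    answer = reduce_counts(partials)
    print(answer.most_common(5))  # e.g. which equipment tags dominate the chatter
```

The point isn't the word counts; it's that the answer is assembled from partial results the moment the question is asked, rather than after a warehouse load has run.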


In this way, companies can exploit the data they already have but can't get to - the data in spreadsheets, documents and presentations, along with the structured and unstructured line-of-business data. Data Scientists will be roving consultants, building pre-packaged algorithms that users can exploit easily - something like the sketch below.
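As an illustration of what such a pre-packaged algorithm might look like - a single function an end-user could point at a folder of spreadsheets they already have - here's a minimal sketch. It assumes pandas and openpyxl are installed; the file names, column headings and the KPI itself are all invented for the example.

```python
# Hypothetical 'pre-packaged algorithm': roll a KPI up from whatever
# spreadsheets already exist, with no cube, batch job or ETL step.
# pandas/openpyxl assumed installed; the data below is fabricated.
from pathlib import Path
import pandas as pd

def monthly_downtime_hours(folder: Path) -> pd.Series:
    """Sum a 'Downtime (h)' column across every .xlsx file in a folder."""
    frames = [pd.read_excel(f) for f in folder.glob("*.xlsx")]
    combined = pd.concat(frames, ignore_index=True)
    return combined.groupby("Month")["Downtime (h)"].sum()

if __name__ == "__main__":
    # Fabricate two small spreadsheets so the sketch runs end to end.
    demo = Path("demo_spreadsheets")
    demo.mkdir(exist_ok=True)
    pd.DataFrame({"Month": ["Jan", "Feb"], "Downtime (h)": [4.5, 2.0]}).to_excel(
        demo / "site_a.xlsx", index=False)
    pd.DataFrame({"Month": ["Jan", "Feb"], "Downtime (h)": [1.0, 3.5]}).to_excel(
        demo / "site_b.xlsx", index=False)
    print(monthly_downtime_hours(demo))
```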


