Thursday, 10 January 2013

Future for big data is small, fast and powered by end-users


I was intrigued by this article on the hype around big data: http://venturebeat.com/2012/12/21/big-data-club/

Last year I was invited to speak at a Corporate IT Forum workshop on MIS with lots of big data debate included. Some of the attendees were bemoaning a lack of 'accessible' big data technology, along the lines of 'we have petabytes of data to process and nothing to do it with', whereas others saw this as absolutely irrelevant as their organisations weren't generating this kind of data in the first place.

At Sabisu we do a lot of work with organisations that generate a lot of data. Some is structured well, lots is structured badly, lots is unstructured. But even these guys don't really have 'big data' issues along the lines of the link above. Virtually everyone we talk to has plain old data issues - the same ones they had 10 years ago, just on a bigger scale - but not multi-petabyte big data issues.

To put it into perspective, a big enterprise might have 30,000 users, all storing Excel/Word/Ppt docs and emails. Facebook has a billion, all storing video and photos. So chill out. Your organisation probably has an MIS or data management problem but not a big data problem.

That's not to say the technology and techniques pioneered by the Facebooks and Googles of this world don't have value. Every organisation would benefit from working with unstructured, non-relational data in a distributed, resilient architecture...and that's what I take big data technology to mean.

As a definition that's pretty sloppy. The fact is that distributed algorithms have been around a while. They've just not been 'accessible', which brings us back to our friends running the IT functions at 'normal' sized enterprises.

Our friends are being sold - and are buying - huge data-warehouses that cost a fortune. It is in the interests of the vendors to push the need for big data capability even if a 'normal' sized enterprise doesn't need it. And I don't believe they do.

I suggest that 99% of enterprises could function magnificently on 5% of the KPIs they currently capture. Most of the KPIs have little operational relevance. Most of the data-warehouse manipulation and exploitation is a waste of time. The reason for this is that the end-users cannot ask the questions they need to ask - there is no interface in place, so they ask a question they can get an answer to instead.

Sure, you have an MIS system. And it's self-service, right? And your users love it, right? So how many of those KPIs affect your organisation's bottom-line?

Here's where 'accessible' implementations of unstructured, non-relational, distributed data processing will change things. Users would be able to ask questions that directly affect the bottom-line and it won't matter whether the right cube has been built, or batch job run, or ETL function completed, or whatever; the answer will be constructed on the fly by millions of small worker algorithms distributed throughout the IT architecture.


In this way, companies can exploit the data they already have but can't get to - the data in spreadsheets, documents, presentations along with the structured/unstructured line-of-business data. Data Scientists will be roving consultants, building pre-packaged algorithms that users can exploit easily.  



Wednesday, 9 January 2013

Running kit list

Chatting to the guys in work about getting in shape...not that I am at the moment.

Here's my rough kit list:

http://www.wiggle.co.uk/ronhill-pursuit-short/

http://www.wiggle.co.uk/ronhill-pursuit-tight/




I'm running in old model Inov-8 Terraflys at the moment which they seem to have discontinued. These are the nearest equivalent I think and will be my next shoe:

http://www.inov-8.com/New/Global/Product-View-Trailroc-255.html?L=26

(The new model Terraflys have a bigger heel/toe differential.)

Typically what I wear is roughly temp dependent:
  • >11C = short sleeve top, shorts
  • Between 3C and 11C = long sleeve top, shorts
  • <3C = long sleeve top, tights
  • <0C = long sleeve top, tights, base layer (gloves, beanie if req'd)

I take into account wind/rain by regarding it as lowering the temp slightly. I've never got it wrong. Nothing I run in is waterproof - I'm only ever 30 mins or so from a warm car/house. Mountain running...well, I'd have more kit.


Tuesday, 8 January 2013

CAD/CAM as an analogy for data processing algorithms

My last couple of posts (here, here) have been focused on manufacturing automation as an analogy for big data software techniques along with some discussions on the topic.

Here I thought I'd just jot down how CAD/CAM in particular might be an analogy for algorithmic data processing.

CAD/CAM is all about manufacturing physical things required for a physical process and ultimately, a physical product:




Algorithms are all about manufacturing digital things (e.g., datasets, results) required for a business process and ultimately, some sort of product (digital or physical):





The process is the same, but results in a digital artifact for use in a business process.

I'm sure that there's an argument that what's been done here is to essentially abstract each process to such a degree that it's not representative. But the fact is that production is automated in manufacturing by assigning small, discrete packages of work to many actors, with as much parallelisation as possible, in order to produce a high quality output.
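To make the 'small, discrete packages of work to many actors' idea concrete, here's a minimal Python sketch. The package contents, the worker function and the pool size are all made up for illustration; the point is only the shape: split the job, farm it out, combine the results.

```python
from concurrent.futures import ThreadPoolExecutor


def process_package(package):
    """One small, discrete unit of work: total a batch of readings."""
    return sum(package)


def run_in_parallel(packages, max_workers=4):
    """Hand each package to a pool of workers and collect results in order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_package, packages))


# Hypothetical work packages (illustrative numbers only).
packages = [[1, 2, 3], [4, 5], [6], [7, 8, 9, 10]]
partials = run_in_parallel(packages)  # one partial result per package
total = sum(partials)                 # the assembled 'product'
```

A thread pool keeps the sketch simple; for genuinely CPU-heavy work you'd more likely reach for a process pool or a cluster, but the division of labour is the same.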

Seems a good analogy to me.



Monday, 7 January 2013

Riding/walking with your iPhone - s/ware, settings, kit

With a bit of care the iPhone can be an asset when you're outdoors - I use it for mapping and tracking runs, walks and bike rides. Strangely, the communications aspect of it is least important; as you'll see later I disable wifi and cellular data to conserve the battery a lot of the time.

Clearly, if you're walking (in particular) you need a proper map and compass. Cross-check with them regularly - you can't be certain that the iPhone hasn't gone quietly nuts.

What I want is:

  • A detailed map, available off-line so it's not sucking battery, using data allowance or relying on 3G when out in the great outdoors
  • A track of my activity
  • Options to conserve battery


There are quite a few integrated mapping & activity solutions out there but I don't want all my eggs in one basket; I've got a suite of software which allows me to chop and change mapping & activity apps.

1. Map it

Personally I like to map the ride using the classic GMap Pedometer:

http://www.gmap-pedometer.com/

I don't have a GMap account, so I simply save the map as public and keep its shortcut for later.

Then I click a shortcut with this URL I got from here. This opens a little JavaScript window which generates a GPX file which I then save with a .GPX extension.
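Before sending the file anywhere it can be worth checking that it actually contains a route. Here's a rough Python sketch that pulls the points out of a GPX document. The sample coordinates are invented, and I'm assuming the file uses the standard GPX point elements (trkpt, rtept or wpt); I haven't checked exactly which of these the GMap Pedometer export produces.

```python
import xml.etree.ElementTree as ET


def read_points(gpx_text):
    """Return (lat, lon) for every track, route or way point in a GPX document."""
    root = ET.fromstring(gpx_text)
    points = []
    for element in root.iter():
        # Strip the XML namespace, which differs between GPX 1.0 and 1.1.
        tag = element.tag.rsplit("}", 1)[-1]
        if tag in ("trkpt", "rtept", "wpt"):
            points.append((float(element.get("lat")), float(element.get("lon"))))
    return points


# Made-up two-point track, just to show the shape of the file.
sample = """<?xml version="1.0"?>
<gpx version="1.1" xmlns="http://www.topografix.com/GPX/1/1">
  <trk><trkseg>
    <trkpt lat="53.40" lon="-2.98"/>
    <trkpt lat="53.41" lon="-2.97"/>
  </trkseg></trk>
</gpx>"""

points = read_points(sample)
```

If the list comes back empty, the export hasn't worked and there's no point emailing it.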

I use Gaia GPS on my iPhone, which means I can email my GPX file to upload@gaiagps.com as described here. It comes back as a link in an email - opening that link imports it into Gaia.

The reason for using Gaia is that it allows you to download all the maps in advance so you don't rely on a mobile connection to download the maps as you go. The maps have been pretty good so far and it's got a lot of the OpenCycle routes shown already as a map overlay.

That gives you all the maps and a track ready to follow.

2. Turn everything off

That's the rule; turn everything you don't need off on the iPhone. I've not tried simply engaging flight mode because that might impact the GPS. Instead I turn off:


  • Wifi (no sense in it seeking a network when you're in the great outdoors)
  • Cellular data - all of it (because you don't need it)
  • Phone (because hunting for the next cell costs power)
  • Auto lock screen (because there's nothing more irritating than it going to the lock screen when you want to know whether it's a right or left turn next)
  • Bluetooth - not required.

Ideally there'd be a 'profile' setting which would allow me to do this in one go.

When I move to a Bluetooth cadence sensor clearly that'll have to stay on.

The latest version of Gaia also allows you to turn off automatic GPS acquisition, so it'll only acquire and record your position on request. This is a great battery saver.

If you can turn off the screen then doing so will save a lot of battery. With the screen on constantly but all other settings as above, my iPhone 4 burns at most about 10% an hour. Screen off it'll last all day.

3. Onto the bike/trail

Hopefully you've got an iPhone holder and a way of keeping it dry on your handlebars, or you're going to fall off trying to retrieve it. If you're on a signed course, you could put it in a saddle bag or triathlon bag and forget about it (see link below).

iPhone 5 users will find limited mounting options at the moment - Topeak have a mount scheduled for spring '13.

Walkers can put it in a pocket or backpack. I've got one of these which is padded and just fits my iPhone 5 (it's fine with a 4/4S) though these would do fine too if it's going in a pocket not surrounded by scratchy things.

If you're out and about for a long time (say, days walking or a few hours on the bike) then an external battery pack is an option. I get 4 full charges out of one of these at a cost of 300g or so. On the bike you can put it in a triathlon bag near the bars and have it plugged in for the duration if required.
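As a back-of-envelope check, the figures in this post (about 10% an hour with the screen on, 4 full charges from the pack) can be turned into a rough runtime estimate. This Python sketch assumes the burn rate is constant and that each recharge really is a full 100%, which is optimistic.

```python
def hours_of_use(burn_percent_per_hour, full_charges=0, start_percent=100):
    """Estimate runtime from the starting charge plus any external recharges."""
    total_percent = start_percent + 100 * full_charges
    return total_percent / burn_percent_per_hour


# Figures taken from this post: ~10% an hour screen-on, 4 charges from the pack.
phone_only = hours_of_use(10)                   # 10.0 hours
with_pack = hours_of_use(10, full_charges=4)    # 50.0 hours
```

Since 10% an hour is the worst case I've seen, treat these as a lower bound for screen-on use.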

I've tried solar chargers - don't bother.

I'll keep running with this and try to build some sort of profile of the likely performance, screen on/off and Bluetooth on/off.

Friday, 4 January 2013

Manufacturing automation shows us the future of big data in most companies

Here I wrote about big data software techniques as an analogy to manufacturing automation, and then in practice:
http://onelesscut.blogspot.co.uk/2013/01/big-data-software-techniques-in.html

The analogy is perhaps more interesting than the practice. What do robots bring to manufacturing and how does the analogy with big data software techniques play out in the future?

Regardless of repetition, robots bring:

  • Accuracy and quality - they execute repetitive jobs to a high standard with repeatable results
  • Speed and efficiency - they're great at crunching through repetitive tasks
  • Reliability - they're 'always on'
  • Low cost - see Reliability; also they can replace people (hey, it's true) 

They also allow integration up the design and manufacturing process, with the encoding of a physical form into a digital representation with CAD/CAM.

Robots are getting more complex. They're getting smaller, cheaper and more autonomous. They can handle jobs that perhaps required human intervention a few years ago. Most of the advances in robotics appear to be down to advances in software, e.g., signal processing, logic, or whatever.

Big data software techniques are like our robots. They bring:

  • Accuracy and quality - algorithms manage distribution and execution of repetitive jobs to a high standard with repeatable results
  • Speed and efficiency - they're great at crunching through repetitive tasks
  • Reliability - distributed processes give greater resilience, and if you get the algorithm right once it can be applied to truly massive datasets
  • Low cost - you can do an awful lot with less (smaller, cheaper) computing power, and for some tasks they can replace people manually sifting unstructured data.

Where now?

Well, first off I think we may see Data Scientists being moved off the big data frontline, away from the data itself and back towards widely applicable algorithms. Scientists are usually first in to new areas of learning but they're quickly supplanted by engineers. As it was with robotics.

Once the (software) engineers have got these techniques working for industry, their role will move to be supportive, with end-users taking the lead. As CAD/CAM is a way for a subject matter expert to apply their knowledge of the physical domain so as to optimise manufacturing capability, so big data software techniques will allow subject matter experts to apply their algorithms to improve a process - sales, production or whatever.

That sounds a bit like marketing flannel, so here are some examples:

1. "Hey computer, I'm worried about benzene contamination in my product. Should I be?"

[Computer starts complex, distributed log analysis of a few million lines of real-time data.]

2. "Hey computer, find out how much product we've had to flare off and how much it cost."

[In the future, everyone says 'Hey, computer'. Computer finds all the flaring incidents, exactly how much product was sent to flare from where, the value of each product.]

3. "Hey computer, can you reduce my electricity bill?"

[Computer looks at efficiency of every component in a process, tries to optimise usage taking into account the effect on other parts of the process.]

This is elegant as it doesn't need a multi-petabyte dataset for these techniques to show value; it's about using the existing data, reforming it and translating it into new forms on the fly through algorithms.

Ultimately this feeds right back up to the design process for new facilities, processes and even businesses. As Google extends our individual knowledge and even our memories, we'll rely on algorithms chewing through lots of data to be our enterprise memory.

Thursday, 3 January 2013

Big data software techniques in automation - as an analogy & in use

I really liked this post (HT @stewarttownsend):

http://gigaom.com/data/why-big-data-might-be-more-about-automation-than-insights/

As we (@sabisu) do a lot of work in the process industries (oil & gas, chems, manufacturing) the analogy of 'big data' techniques to robotic manufacturing processes really worked for me.

What do robots do in manufacturing? We give them small tasks which require relentless repetition and accuracy. Robots don't do anything you couldn't do by hand but they're many times faster, more reliable and less expensive.

That's a pretty good analogy with many of the big data software tools which involve the distribution of work to many processing nodes. This work is repetitive, requires accuracy, speed, reliability and low cost (in all senses). It's a good fit. For example, the 'reduce' part of MapReduce is in fact a software robot, executing an algorithm. For Hadoop clusters, read any redundant architecture in your manufacturing process.

It's easy to see this translating to the real world. Need to aggregate all the flaring incidents at your petrochems plant? MapReduce will do what your Climate Change Manager might spend hours collating.
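Here's roughly what that flaring aggregation looks like in the map/reduce style, as a single-machine Python sketch. The record layout, unit names and tonnages are invented; a real job would run the same two steps across a cluster (e.g. on Hadoop) rather than in one process.

```python
from collections import defaultdict

# Hypothetical flaring log: (unit, product, tonnes sent to flare).
incidents = [
    ("unit_a", "ethylene", 1.5),
    ("unit_b", "propylene", 0.5),
    ("unit_a", "ethylene", 2.0),
    ("unit_b", "ethylene", 1.0),
]


def map_incident(record):
    """Map step: emit a (key, value) pair for each incident."""
    unit, product, tonnes = record
    return (product, tonnes)


def reduce_by_key(pairs):
    """Reduce step: sum the values for each key."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


flared = reduce_by_key(map(map_incident, incidents))
# flared is {'ethylene': 4.5, 'propylene': 0.5}
```

Change the key in the map step to the unit and you get flaring by location instead of by product, with no other changes.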

Need resilience in data acquisition from real-time manufacturing systems? Plenty of options.

Need someone to check all the logs for incidences of benzene contamination 1.2 standard deviations over the mean? Get an algo to do it.
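The benzene check is only a few lines of Python. The readings below are made up, and I'm assuming the population standard deviation over the whole log; a real check might use a rolling window or a sample standard deviation instead.

```python
from statistics import mean, pstdev


def flag_high_readings(readings, k=1.2):
    """Return (index, value) for readings more than k std devs above the mean."""
    mu = mean(readings)
    sigma = pstdev(readings)  # population standard deviation (an assumption)
    threshold = mu + k * sigma
    return [(i, r) for i, r in enumerate(readings) if r > threshold]


# Made-up benzene readings; the units don't matter for the sketch.
readings = [2.0, 2.0, 2.0, 2.0, 12.0]
flagged = flag_high_readings(readings)
# mean is 4.0, std dev is 4.0, threshold is 8.8, so only the last reading is flagged
```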

Now, in fact I think most 'big data' technology is needlessly expensive and perhaps sub-optimal for some of these use-cases but the analogy holds.


Wednesday, 2 January 2013

Top films of 2012

Every year Mark Kermode gives his top 10 films of the year...well, top 12 for 2012...and with a few other honourable mentions thrown in.

Here's a full list to solve those 'what shall we watch tonight' conversations...

You've Been Trumped
Holy Motors
The Raid
(Argo)
Beasts of the Southern Wild
Martha Marcy May Marlene
(Liberal Arts)
Life of Pi
Even the Rain
(The Angels' Share)
The Dark Knight Rises
Amour
Skyfall
(The Grey)
(Moonrise Kingdom)
A Royal Affair
(The Hunt)
Berberian Sound Studio

I'm also going to squeeze in Last Shop Standing if I can - about the decline of independent record shops in the UK.