Wednesday 23 January 2013

The first thing you should do with Facebook Graph Search

Just reading this (HT LettersOfNote) and got to thinking: what am I going to do with Graph Search?

(If you don't know what Facebook's Graph Search is...well, check out the link above and perhaps this. It's interesting and scary in equal measure.)

First, I'm going to show my 5-year-old a couple of innocent searches. We're going to talk about privacy. (I know, I'm such a crazy, fun parent.)

I'm beginning to think that privacy education will be a serious advantage as kids grow up. It may become the most important bit of education they get, because Graph Search proves that anything online with your name on it is a potential threat. One misplaced post could be the end of all opportunities in life.

Then I'm going to work through some less palatable searches, particularly of my colleagues. Why? So we can pre-emptively address anything threatening that shows up. Of course, if it turns up some really unpalatable stuff we may need to have a chat, but it's unlikely that employees of a tech-savvy start-up are going to post anything in that category.

What it will do is force everyone to assess how much slack they cut people. If you have an employee in the EDL, is that a problem for you? If your employees are friends with lots of competitor employees, does that matter?

Then we're going to talk within the company about our attitude to Graph Search. Sales people will see it as a powerful tool to be used by the dark side; those in recruitment will be terrified and excited in equal measure. We need to decide just how far into each other's lives we want to look.

I can't help wondering whether it's a bit like Sauron's ring; even those who would use it for good are ultimately sucked into the bad.

Perhaps you start by trying to weed out the racists, but ultimately you end up weeding out everyone you disagree with?

Perhaps Graph Search will increase prejudice? Perhaps it'll play on prejudices you don't know you have?

Thursday 17 January 2013

It's not the fast miles that count

Runners call it building a 'base'...cyclists tend to look a bit morose, or guilty, usually addressing a loved one or cycling companion with the confession: "I'm not doing the miles".

These days there's a constant focus on speed. It seems expected that you'll have a visionary idea on the train, text it to your development team while the Starbucks drone grunts his/her way through making you a latte, and have a prototype awaiting you by the time you've swiped in.

Well, here's the truth:

  • Starbucks coffees tend to be mostly sweet milk.
  • No lasting change has ever been effected in a morning.
  • If you need a swipe card to get into your office, you're probably not an innovator.

See, just as the runners, cyclists and triathletes out there have to get up early and do the miles, innovation is 90% base mileage. It's all about grinding out the 1% improvements in all areas, day after day. (See Team Sky Procycling.)

Over time, those 1% gains accrete into a win - however you define it.

I hear stories cranked out about Twitter/Facebook/<insert favourite> being invented in 10 minutes on a hack day. Yeah? And the infrastructure? In fact the most crucial bit, the business model, still isn't finished for most of these dominant platforms, some six or seven years after they hit the market.

Even the initial idea moves on. Facebook now is not Facebook as it was. Twitter likewise. They evolve because they know that you don't conceive a complete vision in a single blinding flash; you refine it over time.

The code is the least important bit. The infrastructure is the second least important.

The implementation is what gives the software life. The business model is what gives the software a future.

Tuesday 15 January 2013

Is the end in sight for bloated MI/BI systems?

There are two reasons you do not need a big, monolithic, singular, global data warehouse.

1. Distributed processing

The techniques and architectures exist to move the aggregation process close to the data - in fact, right on top of it. This makes it quick, efficient and easy to modify without significant overhead - unlike cube-dependent data warehouses.

Hybrid-cloud architectures make it possible to aggregate on a local basis while making the aggregates available over the cloud to the global enterprise, which is then dealing with relatively small datasets.

So there's no need to spend a fortune on the communications and storage involved with pulling data to a central location, and no need to take the risk of having all your data centralised onto a single point of failure.
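
Here's a minimal Python sketch of what I mean; the site names, readings and the publish_aggregate helper are all made up.

    # Aggregate close to the data: each site rolls up its own raw readings
    # and publishes only a tiny summary record to the shared cloud store.
    from statistics import mean

    def aggregate_site(readings):
        # Roll up raw local sensor readings into a small summary.
        values = [r["value"] for r in readings]
        return {"count": len(values), "min": min(values),
                "max": max(values), "mean": mean(values)}

    def publish_aggregate(site, summary, cloud_store):
        # Ship only the summary - a few numbers - over the wire.
        cloud_store[site] = summary

    cloud_store = {}
    local_readings = {
        "site_a": [{"value": v} for v in (98.2, 99.1, 97.6)],
        "site_b": [{"value": v} for v in (101.4, 100.9)],
    }
    for site, readings in local_readings.items():
        publish_aggregate(site, aggregate_site(readings), cloud_store)

    print(cloud_store)  # the 'global' dataset is now just a handful of numbers

However big the raw data gets at each site, only the summary crosses the network.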

So no need for a big data warehouse.

2. Cloud power

At Sabisu we do a lot of work in the process industries, where the perennial question is: do we connect our essential production systems to the cloud?

Sure, you can take advantage of the virtualisation and outsourcing available for risk mitigation and cost reduction. But in fact you're just shifting the risk to the communications provider, and you're unlikely to find multi-tenant cost benefits, because you're going to want a very private cloud indeed for all your valuable process data.

Cloud computing is valuable to our customers because it gives effectively unlimited, immediately available processing power. This means that all those clever data network modelling techniques that were once the preserve of those with entire datacentres at their disposal are now accessible to anyone with a bit of budget.

So what we have now is an opportunity to try new analysis techniques that don't need an expensive, on-premise data warehouse. All you need is enough communications capability to get the dataset you want to analyse to the cloud or, as described in (1) above, to get the right level of aggregate to the cloud.

In fact, you don't need to persist any data in the cloud; you can reconstruct the set of results later, if required, by supplying the raw/aggregated data.

So, distributed processing and cloud power: an antidote for bloated MIS, perhaps?


Thursday 10 January 2013

Future for big data is small, fast and powered by end-users


I was intrigued by this article on the hype around big data: http://venturebeat.com/2012/12/21/big-data-club/

Last year I was invited to speak at a Corporate IT Forum workshop on MIS with lots of big data debate included. Some of the attendees were bemoaning a lack of 'accessible' big data technology, along the lines of 'we have petabytes of data to process and nothing to do it with', whereas others saw this as absolutely irrelevant as their organisations weren't generating this kind of data in the first place.

At Sabisu we do a lot of work with organisations that generate a lot of data. Some is structured well, lots is structured badly, lots is unstructured. But even these guys don't really have 'big data' issues along the lines of the link above. Virtually everyone we talk to has plain old data issues - the same ones they had 10 years ago, just on a bigger scale - but not multi-petabyte big data issues.

To put it into perspective, a big enterprise might have 30,000 users, all storing Excel/Word/PowerPoint docs and emails. Facebook has a billion, all storing video and photos. So chill out. Your organisation probably has an MIS or data management problem, but not a big data problem.

That's not to say the technology and techniques pioneered by the Facebooks and Googles of this world don't have value. Every organisation would benefit from working with unstructured, non-relational data in a distributed, resilient architecture...and that's what I take 'big data technology' to mean.

As a definition that's pretty sloppy. The fact is that distributed algorithms have been around a while. They've just not been 'accessible', which brings us back to our friends running the IT functions at 'normal' sized enterprises.

Our friends are being sold - and are buying - huge data warehouses that cost a fortune. It's in the vendors' interests to push the need for big data capability even if a 'normal' sized enterprise doesn't need it. And I don't believe most do.

I suggest that 99% of enterprises could function magnificently on 5% of the KPIs they currently capture. Most of the KPIs have little operational relevance. Most of the data warehouse manipulation and exploitation is a waste of time. The reason for this is that the end-users cannot ask the questions they need to ask - there is no interface in place, so they ask a question they can get an answer to instead.

Sure, you have an MIS. And it's self-service, right? And your users love it, right? So how many of those KPIs affect your organisation's bottom line?

Here's where 'accessible' implementations of unstructured, non-relational, distributed data processing will change things. Users will be able to ask questions that directly affect the bottom line, and it won't matter whether the right cube has been built, or batch job run, or ETL function completed, or whatever; the answer will be constructed on the fly by millions of small worker algorithms distributed throughout the IT architecture.
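
To make that less abstract, here's a toy Python sketch; the question, the data and the worker function are all invented, and a real system would distribute over many machines rather than local processes.

    # Many small workers, one ad hoc question: each worker answers for its
    # chunk of data and the partial answers are combined on the fly.
    from multiprocessing import Pool

    def count_term(args):
        # One small worker: count a term's occurrences in one chunk.
        term, chunk = args
        return chunk.lower().count(term)

    def ask(term, chunks):
        # Build the answer on demand - no cube, no batch job, no ETL.
        with Pool() as pool:
            partials = pool.map(count_term, [(term, c) for c in chunks])
        return sum(partials)

    if __name__ == "__main__":
        documents = ["Margin improved in Q3...", "KPI review: margin flat..."]
        print(ask("margin", documents))  # 2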


In this way, companies can exploit the data they already have but can't get to - the data in spreadsheets, documents and presentations, along with the structured/unstructured line-of-business data. Data Scientists will be roving consultants, building pre-packaged algorithms that users can exploit easily.

Wednesday 9 January 2013

Running kit list

Chatting to the guys at work about getting in shape...not that I am at the moment.

Here's my rough kit list:

http://www.wiggle.co.uk/ronhill-pursuit-short/

http://www.wiggle.co.uk/ronhill-pursuit-tight/

I'm running in old-model Inov-8 Terraflys at the moment, which they seem to have discontinued. These are the nearest equivalent, I think, and will be my next shoe:

http://www.inov-8.com/New/Global/Product-View-Trailroc-255.html?L=26

(The new model Terraflys have a bigger heel/toe differential.)

Typically, what I wear is roughly temperature-dependent:
  • >11C = short sleeve top, shorts
  • Between 3C and 11C = long sleeve top, shorts
  • <3C = long sleeve top, tights
  • <0C = long sleeve top, tights, base layer (gloves, beanie if req'd)

I take wind/rain into account by regarding it as lowering the temperature slightly - see the toy sketch below. I've never got it wrong. Nothing I run in is waterproof - I'm only ever 30 minutes or so from a warm car/house. Mountain running...well, I'd have more kit.
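
For fun, here's the above as a toy Python function; the two-degree adjustment for wind/rain is my guess at 'slightly', not a measured number.

    # Kit chooser following the temperature bands above.
    def kit_for(temp_c, wet_or_windy=False):
        effective = temp_c - 2 if wet_or_windy else temp_c  # 'slightly' colder
        if effective > 11:
            return ["short sleeve top", "shorts"]
        if effective >= 3:
            return ["long sleeve top", "shorts"]
        if effective >= 0:
            return ["long sleeve top", "tights"]
        return ["long sleeve top", "tights", "base layer", "gloves", "beanie"]

    print(kit_for(7, wet_or_windy=True))  # ['long sleeve top', 'shorts']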


Tuesday 8 January 2013

CAD/CAM as an analogy for data processing algorithms

My last couple of posts (here, here) have been focused on manufacturing automation as an analogy for big data software techniques along with some discussions on the topic.

Here I thought I'd just jot down how CAD/CAM in particular might be an analogy for algorithmic data processing.

CAD/CAM is all about manufacturing physical things required for a physical process and, ultimately, a physical product.

Algorithms are all about manufacturing digital things (e.g., datasets, results) required for a business process and, ultimately, some sort of product, digital or physical.

The process is the same, but results in a digital artifact for use in a business process.

I'm sure there's an argument that I've abstracted each process to such a degree that it's no longer representative. But the fact is that manufacturing automates production by assigning small, discrete packages of work to many actors, with as much parallelisation as possible, in order to produce a high-quality output.
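
To illustrate, here's a toy Python sketch of that pattern: small, discrete packages of work handed to a pool of actors in parallel. The part specs and the QA check are invented.

    # Work packages farmed out to a pool of 'actors' (worker processes).
    from multiprocessing import Pool

    def machine_part(spec):
        # One actor executes one small, discrete package of work.
        return {"part": spec["name"], "passes_qa": spec["tolerance_mm"] <= 0.1}

    if __name__ == "__main__":
        work_packages = [
            {"name": "bracket", "tolerance_mm": 0.05},
            {"name": "housing", "tolerance_mm": 0.20},
        ]
        with Pool() as pool:
            finished = pool.map(machine_part, work_packages)
        print(finished)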

Seems a good analogy to me.

Monday 7 January 2013

Riding/walking with your iPhone - s/ware, settings, kit

With a bit of care the iPhone can be an asset when you're outdoors - I use it for mapping and tracking runs, walks and bike rides. Strangely, the communications aspect of it is the least important; as you'll see later, I disable wifi and cellular data a lot of the time to conserve the battery.

Clearly, if you're walking (in particular) you need a proper map and compass. Cross-check with them regularly - you can't be certain that the iPhone hasn't gone quietly nuts.

What I want is:

  • A detailed map, available off-line so it's not sucking battery, using data allowance or relying on 3G when out in the great outdoors
  • A track of my activity
  • Options to conserve battery


There are quite a few integrated mapping & activity solutions out there but I don't want all my eggs in one basket; I've got a suite of software which allows me to chop and change mapping & activity apps.

1. Map it

Personally I like to map the ride using the classic GMap Pedometer:

http://www.gmap-pedometer.com/

I don't have a GMap account, so I simply save the map as public and keep the shortcut for later.

Then I click a shortcut with this URL I got from here. This opens a little JavaScript window which generates a GPX file which I then save with a .GPX extension.

I use Gaia GPS on my iPhone, which means I can email my GPX file to upload@gaiagps.com as described here. It comes back as a link in an email - opening that link imports it into Gaia.

The reason for using Gaia is that it allows you to download all the maps in advance, so you don't rely on a mobile connection to download maps as you go. The maps have been pretty good so far, and it's already got a lot of the OpenCycle routes shown as a map overlay.

That gives you all the maps and a track ready to follow.
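
Incidentally, there's nothing magic inside a GPX file; it's just XML with a list of track points. Here's a minimal Python sketch that writes one (the route coordinates are made up):

    # Write a bare-bones GPX track file by hand.
    GPX = """<?xml version="1.0" encoding="UTF-8"?>
    <gpx version="1.1" creator="sketch" xmlns="http://www.topografix.com/GPX/1/1">
      <trk><name>{name}</name><trkseg>
    {points}
      </trkseg></trk>
    </gpx>"""

    def write_gpx(name, latlons, path):
        points = "\n".join(f'    <trkpt lat="{lat}" lon="{lon}"></trkpt>'
                           for lat, lon in latlons)
        with open(path, "w") as f:
            f.write(GPX.format(name=name, points=points))

    write_gpx("Sunday loop", [(53.3900, -2.5970), (53.3912, -2.5945)], "route.gpx")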

2. Turn everything off

That's the rule: turn off everything you don't need on the iPhone. I've not tried simply engaging flight mode because that might impact the GPS. Instead I turn off:


  • Wifi (no sense in it searching for a network when you're in the great outdoors)
  • Cellular data - all of it (because you don't need it)
  • Phone (because hunting for the next cell costs power)
  • Auto lock screen (because there's nothing more irritating than it going to the lock screen when you want to know whether it's a right or left turn next)
  • Bluetooth (not required)

Ideally there'd be a 'profile' setting which would allow me to do this in one go.

When I move to a Bluetooth cadence sensor, Bluetooth will clearly have to stay on.

The latest version of Gaia also allows you to turn off automatic GPS acquisition, so it'll only acquire and record your position on request. This is a great battery saver.

If you can turn the screen off, doing so will save a lot of battery. With the screen on constantly but all other settings as above, my iPhone 4 burns at most about 10% an hour. Screen off, it'll last all day.

3. Onto the bike/trail

Hopefully you've got an iPhone holder and a way of keeping it dry on your handlebars, or you're going to fall off trying to retrieve it. If you're on a signed course, you could put it in a saddle bag or triathlon bag and forget about it (see link below).

iPhone 5 users will find limited mounting options at the moment - Topeak have a mount scheduled for spring '13.

Walkers can put it in a pocket or backpack. I've got one of these, which is padded and just fits my iPhone 5 (it's fine with a 4/4S), though these would do fine too if it's going in a pocket not surrounded by scratchy things.

If you're out and about for a long time (say, days walking or a few hours on the bike) then an external battery pack is an option. I get 4 full charges out of one of these at a cost of 300g or so. On the bike you can put it in a triathlon bag near the bars and have it plugged in for the duration if required.

I've tried solar chargers - don't bother.

I'll keep running with this and try to build some sort of profile of the likely performance, screen on/off and Bluetooth on/off.

Friday 4 January 2013

Manufacturing automation shows us the future of big data in most companies

Here I wrote about big data software techniques as an analogy to manufacturing automation, and then in practice:
http://onelesscut.blogspot.co.uk/2013/01/big-data-software-techniques-in.html

The analogy is perhaps more interesting than the practice. What do robots bring to manufacturing and how does the analogy with big data software techniques play out in the future?

Repetition aside, robots bring:

  • Accuracy and quality - they execute repetitive jobs to a high standard with repeatable results
  • Speed and efficiency - they're great at crunching through repetitive tasks
  • Reliability - they're 'always on'
  • Low cost - see Reliability; also they can replace people (hey, it's true) 

They also allow integration further up the design and manufacturing process, encoding a physical form into a digital representation via CAD/CAM.

Robots are getting more complex. They're getting smaller, cheaper and more autonomous. They can handle jobs that perhaps required human intervention a few years ago. Most of the advances in robotics appear to be down to advances in software, e.g., signal processing, logic, or whatever.

Big data software techniques are like our robots. They bring:

  • Accuracy and quality - algorithms manage distribution and execution of repetitive jobs to a high standard with repeatable results
  • Speed and efficiency - they're great at crunching through repetitive tasks
  • Reliability - distributed processes give greater resilience, but also, if you get the algorithm right once, it can be applied to truly massive datasets
  • Low cost - you can do an awful lot with less (smaller, cheaper) computing power, and for some tasks they can replace people manually sifting unstructured data.
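
A bare-bones map/reduce skeleton in Python shows the point: write the per-item algorithm once and it applies unchanged whether the dataset has ten rows or ten billion. The readings and the threshold are invented.

    # Map: the repetitive job, applied to every reading.
    # Reduce: the 'software robot' folding partial results together.
    from functools import reduce

    def mapper(reading):
        return 1 if reading > 100.0 else 0  # classify one reading

    def reducer(total, flag):
        return total + flag

    readings = [98.7, 103.2, 99.9, 110.5]  # imagine billions of these
    print(reduce(reducer, map(mapper, readings), 0))  # 2 readings out of range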

Where now?

Well, first off I think we may see Data Scientists being moved off the big data front line, away from the data itself and back towards widely applicable algorithms. Scientists are usually first into new areas of learning, but they're quickly supplanted by engineers. As it was with robotics.

Once the (software) engineers have got these techniques working for industry, their role will become supportive, with end-users taking the lead. Just as CAD/CAM is a way for a subject matter expert to apply their knowledge of the physical domain to optimise manufacturing capability, so big data software techniques will allow subject matter experts to apply their algorithms to improve a process - sales, production or whatever.

That sounds a bit like marketing flannel, so here are some examples:

1. "Hey computer, I'm worried about benzene contamination in my product. Should I be?"

[Computer starts complex, distributed log analysis of a few million lines of real-time data.]

2. "Hey computer, find out how much product we've had to flare off and how much it cost."

[In the future, everyone says 'Hey, computer'. Computer finds all the flaring incidents, exactly how much product was sent to flare from where, and the value of each product.]

3. "Hey computer, can you reduce my electricity bill?"

[Computer looks at efficiency of every component in a process, tries to optimise usage taking into account the effect on other parts of the process.]

This is elegant because it doesn't need a multi-petabyte dataset to show value; it's about using the existing data, reforming it and translating it into new forms on the fly through algorithms.
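
Example 2 is the easiest to sketch in code. With invented record fields and numbers, the aggregation amounts to little more than this:

    # Total product sent to flare, and its value, grouped by product.
    from collections import defaultdict

    events = [
        {"product": "ethylene", "tonnes": 4.2, "price_per_tonne": 900},
        {"product": "propylene", "tonnes": 1.1, "price_per_tonne": 850},
        {"product": "ethylene", "tonnes": 0.7, "price_per_tonne": 900},
    ]

    totals = defaultdict(lambda: {"tonnes": 0.0, "cost": 0.0})
    for e in events:
        totals[e["product"]]["tonnes"] += e["tonnes"]
        totals[e["product"]]["cost"] += e["tonnes"] * e["price_per_tonne"]

    for product, t in totals.items():
        print(product, t)  # ethylene {'tonnes': 4.9, 'cost': 4410.0} ...

The hard part isn't the arithmetic; it's getting the events out of the source systems in the first place.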

Ultimately this feeds right back up to the design process for new facilities, processes and even businesses. As Google extends our individual knowledge and even our memories, we'll rely on algorithms chewing through lots of data to be our enterprise memory.

Thursday 3 January 2013

Big data software techniques in automation - as an analogy & in use

I really liked this post (HT @stewarttownsend):

http://gigaom.com/data/why-big-data-might-be-more-about-automation-than-insights/

As we (@sabisu) do a lot of work in the process industries (oil & gas, chems, manufacturing), the analogy of 'big data' techniques to robotic manufacturing processes really worked for me.

What do robots do in manufacturing? We give them small tasks which require relentless repetition and accuracy. Robots don't do anything you couldn't do by hand but they're many times faster, more reliable and less expensive.

That's a pretty good analogy with many of the big data software tools, which involve the distribution of work to many processing nodes. This work is repetitive and requires accuracy, speed, reliability and low cost (in all senses). It's a good fit. For example, the 'reduce' part of MapReduce is in fact a software robot executing an algorithm. For Hadoop clusters, read any redundant architecture in your manufacturing process.

It's easy to see this translating to the real world. Need to aggregate all the flaring incidents at your petrochems plant? MapReduce will do what your Climate Change Manager might spend hours collating.

Need resilience in data acquisition from real-time manufacturing systems? Plenty of options.

Need someone to check all the logs for incidences of benzene contamination 1.2 standard deviations over the mean? Get an algo to do it.
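
That last one really is only a few lines. A Python sketch, with an invented set of readings:

    # Flag readings more than 1.2 standard deviations above the mean.
    from statistics import mean, stdev

    readings = [0.8, 0.9, 1.1, 0.7, 2.6, 0.9, 1.0]  # ppm benzene, made up
    threshold = mean(readings) + 1.2 * stdev(readings)

    incidents = [(i, r) for i, r in enumerate(readings) if r > threshold]
    print(round(threshold, 2), incidents)  # 1.93 [(4, 2.6)]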

Now, in fact I think most 'big data' technology is needlessly expensive and perhaps sub-optimal for some of these use cases, but the analogy holds.


Wednesday 2 January 2013

Top films of 2012

Every year Mark Kermode gives his top 10 films...well, top 12 in this case, for 2012...with a few other honourable mentions thrown in.

Here's a full list to solve those 'what shall we watch tonight' conversations...

You've Been Trumped
Holy Motors
The Raid
(Argo)
Beasts of the Southern Wild
Martha Marcy May Marlene
(Liberal Arts)
Life of Pi
Even the Rain
(Angel's Share)
The Dark Knight Rises
Amour
Skyfall
(The Grey)
(Moonrise Kingdom)
A Royal Affair
(The Hunt)
Berberian Sound Studio

I'm also going to squeeze in Last Shop Standing if I can - about the decline of independent record shops in the UK.