Friday 30 September 2011

The perils of over-engineering


A couple of weeks ago, as referred to obliquely on our Sabisu blog, we upgraded our development and QA environments. Our hand was forced: we lost the development environment irretrievably, thanks to over-engineering in the earliest days of the company. Something we were sure couldn't possibly harm us in the future did exactly that.

The lessons learned can be summarised thus:

1. If you do experience an exponential increase in activity, you can probably find the funding for more capable infrastructure.

2. Advanced technology steals time. Use the simplest technology you can find that will do the job.

3. Only use technology you understand and have a track record in configuring successfully. If you're an infrastructure guy/company, fine - if not, stick to what you're good at and outsource the rest.

Anyway, here's the story.

When you set up a company, particularly a boot-strapping start-up, you are short of everything apart from ambition - that's what makes it fun. The two things you're critically short of are money and time, but you can't help planning for the 'hockey-stick' eventuality: a graph that suddenly swings from gradual, linear growth to exponential growth.

So we originally built two hardware platforms in-house on high-grade consumer kit rather than industry-standard hardware. (This is a problem in itself: high-grade consumer kit often has the features of professional hardware at a lower price, but that cost saving reappears as a time cost because, inevitably, the hardware is less reliable.)

Then we created three virtual environments on each machine: mirrored VMs for Development, QA and the fileserver/domain controller. Each virtual environment was snapshotted in its entirety on a schedule. Data was backed up to a spare hard drive in one machine and the source code was backed up off-site to Amazon Web Services.
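For illustration, the off-site piece is the kind of job that stays simple - a nightly archive pushed to cloud storage. A sketch, where the repository path, bucket name and use of boto3 are assumptions for the example rather than a record of what we ran at the time:

    # Hedged sketch: archive a source repository and push it to S3.
    # REPO_DIR, BUCKET and the use of boto3 are illustrative assumptions.
    import subprocess
    from datetime import date

    import boto3  # pip install boto3; AWS credentials via the usual mechanisms

    REPO_DIR = "/srv/svn/sabisu"        # hypothetical repository path
    BUCKET = "example-offsite-backups"  # hypothetical S3 bucket

    archive = "/tmp/source-%s.tar.gz" % date.today().isoformat()
    subprocess.check_call(["tar", "czf", archive, "-C", REPO_DIR, "."])
    boto3.client("s3").upload_file(archive, BUCKET, archive.split("/")[-1])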

(Worth saying that the live system has always been hosted to the highest corporate standards by a third party infrastructure specialist.) 

In the worst-case scenario we'd lose a hardware platform, but we'd never lose any data - or so we thought.

As a little start-up we didn't have the time or people to define and implement the kind of processes you need to manage this kind of environment. So, one day, a VM ran out of room mid-snapshot and paused in something of a dither.

We didn't expect this. Because we didn't fully understand how snapshots grow, we assumed each one would be a few GB at most, and we had no process in place to monitor storage usage. Nor did we have a process for bringing back a failed VM. It was catastrophic.
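With hindsight, even a crude free-space check on the snapshot volume, run from cron, would have caught it early. A minimal sketch of that kind of check, assuming a POSIX host; the mount point and threshold are illustrative:

    # Warn when the snapshot volume is running low on space.
    # MOUNT and THRESHOLD are illustrative assumptions, not our actual values.
    import shutil
    import sys

    MOUNT = "/var/lib/vm-snapshots"  # hypothetical snapshot volume
    THRESHOLD = 0.20                 # warn when less than 20% is free

    usage = shutil.disk_usage(MOUNT)
    free = usage.free / usage.total
    if free < THRESHOLD:
        print("WARNING: only %.0f%% free on %s" % (free * 100, MOUNT), file=sys.stderr)
        sys.exit(1)
    print("OK: %.0f%% free on %s" % (free * 100, MOUNT))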

As a backup, the snapshot was useless; it took several hours of crunching to get the most recent snapshot back and loaded, only to find it was too old to be any use. The VM was effectively bricked because its data was stale, which made the mirrored platforms irrelevant.

The VMs were a waste of time; we would have been better off with almost any alternative. As it happens we reconstructed the source code from the online Amazon Web Services backups and each developer's work-in-progress copy, and rebuilt the hardware platforms as decent standalone Development servers - a setup we actually understand.

Friday 23 September 2011

How we set up Trac for agile development


Following this post, I had a couple of queries about how we set up Trac for managing Sabisu. Hope this helps.

Where does Trac live?
We installed it on our Dev server, so we have control over the environment. It could live anywhere, but that just seemed to make sense. We expose certain elements of our dev environment to the internet through the Sabisu platform, so that's how we get remote access.

How do you divide your product into Trac?
Initially we divided the platform up at the top level, so we had a separate instance of Trac for each major application: the platform itself, Chronos time logging, Forms and so on. However, this made it difficult to see ticket allocations across the team, so now everything is in a single project.

We then split the platform up using Components, e.g. Chat, Chronos, Communities Functions, Communities View and so on, through to Widget Editor. Every component is assigned to a different member of the team to make default work allocation easy.
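Component setup is a one-off trac-admin job. A sketch of the sort of script that would do it - the environment path and the owner names are made up for the example:

    # Register components and their default owners via trac-admin.
    # The environment path and owner names are illustrative assumptions.
    import subprocess

    ENV = "/srv/trac/sabisu"  # hypothetical path to the Trac environment

    components = {
        "Chat": "alice",
        "Chronos": "bob",
        "Communities Functions": "carol",
        "Communities View": "dave",
        "Widget Editor": "erin",
    }

    for name, owner in components.items():
        # 'component add <name> <owner>' is a standard trac-admin command
        subprocess.check_call(["trac-admin", ENV, "component", "add", name, owner])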

Milestones
We use Milestones a lot. Every milestone corresponds to a release and we allocate each a codename because it's easier to say, "We're moving ticket 192 to the Nestor release" and have everyone know what you mean. We try to get a balance of about 20-30 tickets per release and we release new revisions on a weekly basis.
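Milestones are just as quick to create from a script. Here's a sketch - the environment path, the codenames beyond Nestor and the Monday due dates are illustrative assumptions (and check the date format your Trac version accepts):

    # Create a run of weekly, codenamed release milestones via trac-admin.
    # The environment path, codenames and start date are illustrative assumptions.
    import subprocess
    from datetime import date, timedelta

    ENV = "/srv/trac/sabisu"                         # hypothetical Trac environment
    codenames = ["Nestor", "Odysseus", "Patroclus"]  # example release codenames
    first_release = date(2011, 10, 3)                # a Monday

    for week, name in enumerate(codenames):
        due = (first_release + timedelta(weeks=week)).isoformat()
        # 'milestone add <name> [due]' is a standard trac-admin command;
        # the ISO due-date format is assumed to be accepted by your Trac version.
        subprocess.check_call(["trac-admin", ENV, "milestone", "add", name, due])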

Priorities
Our priorities, running from highest to lowest: blocker, critical, major, minor, trivial, cosmetic. If part of the system is inaccessible, or we can't complete a test, that's the highest priority. At the other end of the scale, cosmetic indicates something that's genuinely cosmetic - if it affects the UX in any way it's major or minor.

Severities
Our severities, running from most to least severe: Multiple Customer Outage, Customer Outage, Customer Inconvenience, Irritant, Risky to leave, One for later.
For any severity lower than Customer Outage there's generally a workaround available. 'Risky to leave' tends to be architectural or infrastructure work, but there's no reason why it should be limited to that.
Also, a ticket regarded as an 'outage' isn't necessarily a Blocker; it could be that the functionality is accessible but fails.

Ticket Types
A couple of interesting categories here: Defect, Enhancement, Live Incident, PoC.
Of course, Live Incident and Defect are both important categories. However, in conjunction with the Severity and Priority we can direct our efforts properly; a Live Incident could be relatively minor and addressed at a later date without significant impact.
The 'PoC' type denotes 'proof of concept' work. This is usually pure R&D that needs productisation at a later date, typically through a series of Enhancement tickets.

The Priority, Severity and Ticket Type fields work together; the most serious ticket is a Live Incident causing a Multiple Customer Outage and preventing access to part of the system (a Blocker).

Resolutions
Very dull: Fixed, Invalid, Wontfix, Duplicate, Worksforme, Unable to Replicate.
I hate the Worksforme resolution because it's not a resolution of any kind…but I tolerate it because sometimes you just can't reproduce a user-reported defect.
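All four of these lists (priorities, severities, ticket types and resolutions) are plain Trac enumerations, so they can be loaded in one go with trac-admin. A sketch, with an illustrative environment path; you'd remove the unwanted defaults separately:

    # Add the custom priorities, severities, ticket types and resolutions.
    # The environment path is an illustrative assumption; remove unwanted
    # defaults separately (e.g. 'trac-admin <env> priority remove ...').
    import subprocess

    ENV = "/srv/trac/sabisu"  # hypothetical Trac environment

    enums = {
        "priority": ["blocker", "critical", "major", "minor", "trivial", "cosmetic"],
        "severity": ["Multiple Customer Outage", "Customer Outage",
                     "Customer Inconvenience", "Irritant", "Risky to leave",
                     "One for later"],
        "ticket_type": ["Defect", "Enhancement", "Live Incident", "PoC"],
        "resolution": ["Fixed", "Invalid", "Wontfix", "Duplicate",
                       "Worksforme", "Unable to Replicate"],
    }

    for field, values in enums.items():
        for value in values:
            # 'priority add', 'severity add', 'ticket_type add' and
            # 'resolution add' are all standard trac-admin commands
            subprocess.check_call(["trac-admin", ENV, field, "add", value])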

We don't use Versions and don't link Trac to SVN, though it's perfectly feasible - it's just not something we've needed to do.

Comments or thoughts welcome.

Friday 16 September 2011

Five rules for organising an agile, timeboxed product dev team


Over at the official Sabisu blog we outline some guidelines we use to manage the development of the product. I thought it might be good to expand on why we established them, what they really mean in practice, and what they give us.

1. Work to the next release. It’s always next Monday.

The 'release early' philosophy is well established in agile software development; the sooner you can expose your work to your customers and react to their needs the better. Weekly releases allow us to turn around requests, incorporate customer feedback and incrementally improve the user experience.

Early in the lifecycle we tried to go to weekly releases, but such a rapid release cycle caused a dip in quality as we tried to work in complex back-end code too fast. Now that we have a mature platform and processes, the weekly releases make more sense. We take care not to expose users to complex functionality until it's usable, but the functionality is gradually constructed behind the scenes as we go.

2. Incidents first. Defects second. Then enhancements. Always.

Many IT teams will hit incidents first; if your customers can't use your product for some reason, that needs to be sorted.

However, we only work on enhancements once we've got through the defects; the only defects put on hold are those waiting on a third party.

This is in stark contrast to a lot of development teams, where enhancements and defects are worked on simultaneously. The problem with that approach is that (i) no one wants to work on defects when enhancements are on offer and (ii) regression testing is tough.

Our approach does mean that in some releases there's little new functionality. We think that's a good thing; we concentrate on quality.
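Expressed as code rather than policy, the rule is just a sort on ticket type. A toy sketch using our Trac ticket types - the ticket data here is made up:

    # Toy sketch: order an open ticket list so Live Incidents come first,
    # then Defects, then Enhancements. The ticket data below is made up.
    WORK_ORDER = {"Live Incident": 0, "Defect": 1, "Enhancement": 2, "PoC": 3}

    def triage(tickets):
        """Return open tickets in the order they should be picked up."""
        open_tickets = [t for t in tickets if t["status"] != "closed"]
        return sorted(open_tickets, key=lambda t: WORK_ORDER.get(t["type"], 99))

    tickets = [
        {"id": 192, "type": "Enhancement", "status": "new"},
        {"id": 201, "type": "Live Incident", "status": "accepted"},
        {"id": 188, "type": "Defect", "status": "new"},
    ]
    print([t["id"] for t in triage(tickets)])  # -> [201, 188, 192]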


3. Every work item & every update goes into Trac.

Our defect/incident/enhancement/release tracking tool is Trac: an open-source, simple but full-featured tool that we use for day-to-day management. Everything is logged, graded in terms of priority and severity, assigned to a developer and allocated to a release. We tend to work about 25-30 Trac tickets into each release, with some developers taking only 2 or 3 effort-intensive tickets.

Developer updates, testing notes, screenshots and anything else relevant go into Trac. If we need a progress report, need to write release notes or are affected by a live incident, Trac will tell us what changes to commitments have to be made. Of course, it's all auditable if we need to trace the route of a change back through the process.

This means we can forecast when a new feature or defect fix is going to be made available, and if it has to move, all the relevant parties are informed.
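To give an idea of how cheap that forecast is, a quick release report can be pulled straight from Trac's CSV export of the ticket query. A hedged sketch - the Trac URL, anonymous query access and the milestone name are assumptions for the example:

    # Hedged sketch: count tickets by status for one release milestone, using
    # Trac's CSV export of the ticket query. BASE and MILESTONE are assumptions.
    import csv
    import io
    import urllib.parse
    import urllib.request
    from collections import Counter

    BASE = "http://dev.example.com/trac"  # hypothetical Trac base URL
    MILESTONE = "Nestor"                  # example release codename

    query = urllib.parse.urlencode(
        [("milestone", MILESTONE), ("col", "id"), ("col", "summary"),
         ("col", "status"), ("col", "owner"), ("format", "csv")])
    url = BASE + "/query?" + query
    with urllib.request.urlopen(url) as response:
        # utf-8-sig tolerates the BOM some Trac versions prepend to CSV output
        rows = list(csv.DictReader(io.TextIOWrapper(response, encoding="utf-8-sig")))

    by_status = Counter(row["status"] for row in rows)
    print("%s: %d tickets, %s" % (MILESTONE, len(rows), dict(by_status)))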

4. The work plan gives the high level resource allocation.

Basically, the development team moves too fast for project plans to remain current, so we have a high-level work plan that simply shows who's allocated to which customer (if they're off production) or which release (if they're on production).


Having spent a lot of years as a project/programme manager trying to squeeze huge plans onto a small screen in Primavera, MS Project or whatever, it's difficult for me to say this but...detailed project planning isn't something we do at Sabisu.

There's really no point. We work on such short timescales that a detailed project plan isn't much use beyond 3 weeks, and all the detail is in Trac anyway. All we'd be doing is shifting data from Trac into a Gantt chart, and by the time we'd shifted the data the work would be done.

So it's easier just to hit Trac for a report of what's done and what's assigned to the next release. As long as we can get the 'big' bits of functionality into the main code trunk in a safe and sensible way, we'll be OK.

You might legitimately ask how we plan the implementation of significant amounts of new functionality; there's a planning process implied in getting the work into the build. The answer is simply that it happens offline, outside Trac, and the work is broken down into simple, achievable pieces of independent functionality before it goes into Trac. If a function is too big to be completed inside a release window, it gets broken down further into chunks that will fit.

Any bespoke customer work is dealt with in the same way; we chunk it, tell the customer when they can expect each chunk and go for it.

(Now, it's particularly interesting that back in my day at Motorola we were expected to give the PMO a four-week forecast for task completion, separate from the plan. I wonder if the data from the Primavera implementation didn't quite cut it?)

Beyond the high-level work plan we have a long-term roadmap which guides us in choosing the right features to bring into the product.

Timesheets are done through our own Sabisu application, but they're principally there for invoicing customers, as we do some bespoke work; they mean we can be very accurate about how much time we've spent on each task.

5. Flex enhancements out to meet the release date (timeboxing).

Finally, we flex scope all the time. Making the Monday release with quality code is more important than shoehorning in new functionality. Generally it means the new feature is delayed a week, and we've never encountered functionality so time-critical that it's worth endangering code quality for.



Friday 9 September 2011

The limits of metadata in the manufacturing enterprise

Moving on from the previous topic of curation being a better fit for manufacturing needs than 'conventional' BI, we should really look at the other data produced in large volumes within any enterprise: documents.

Documents tend to be produced by a fat client on an end-user OS, both of which imbue them with metadata and place them in a taxonomy.
Often both the metadata and the taxonomy are of varying usefulness and accuracy: taxonomies get corrupted, folders duplicated and metadata rendered invalid by server processes. Still, an end user can make a value judgement about a document and an enterprise search tool has something to index, meaning that from a list of apparently similar documents returned by a search query a user can make an educated guess as to which is the valuable item.

Once that valuable item has been located, the user might well share the location of the file with a distribution list...

...and that's curation picking up where enterprise search has failed.

When ERP data is considered, you'll find little metadata of value to the end user. Again, a common scenario is that an enterprise search returns possibilities and the end user selects and publicises those of value to the wider community.

Manufacturing systems also generate very little metadata, as they're designed around a single purpose, e.g. logging data in real time. The metadata is limited to what's necessary to make sense of the reading - you could argue it's not metadata at all. Clearly, enterprise search has nothing to offer here, but expert end users do; they can identify the key trends and highlight them to a wider community.

Of course, there's a network effect as more curation takes place; as more data is linked together by expert judgement, the value of the network increases disproportionately with each link created.

Just as internet search engines are devalued by systematic metadata corruption (link stuffing, spamming, or any other 'black hat' SEO practice), so enterprise search is devalued by closed, proprietary or legacy systems producing unlinkable data.

And just as on the internet the value of curated content (usually) outweighs that of content returned by a search algorithm, so it will be in the enterprise, where the editors or curators are experts in the technical aspects of their business.

Friday 2 September 2011

Why conventional BI fails manufacturing enterprises but curation succeeds

Back here I was describing what the terms democratisation, syndication and curation mean in the Enterprise 2.0 environment.

Of these, curation is particularly important to the process industries and perhaps to manufacturing as a whole. And here's why.

The data generated in a manufacturing environment can be thought of as broadly falling into the following categories: documents, ERP data and manufacturing data.

Whilst it's tempting to exclude documents from any BI discussion, it's a mistake to do so; whether in Lotus Notes, SQL or elsewhere, this is where day-to-day manufacturing decisions, events and instructions are stored. They represent a key data source for understanding trends, yet they're often ignored by BI solutions.

ERP data is typically proprietary and stored deep in an inaccessible database designed with system and process integrity rather than data reuse in mind, to be accessed only by a vendor-specific MIS client.

Manufacturing systems data is generally generated with very little metadata by proprietary systems that are designed around a single purpose, e.g., to log data in real-time.

As any business intelligence vendor will tell you, the value of collecting such data is in the analysis of trends: identifying the series of points that demand action. Yet the value of such analysis is multiplied by deriving relationships between trends, e.g. an interesting manufacturing trend may become a critical decision point when placed against an ERP trend. Causal relationships are what drive effective decisions - decisions which may require considering ERP data alongside manufacturing data alongside operational documentation.

This is precisely where conventional BI fails in the manufacturing environment; it's usually vendor aligned and incapable of dealing with proprietary data from multiple sources.

It's also difficult for end users to get to grips with, which means the enterprise can't leverage the expertise within the wider user base; conventional BI relies on users to be experts in the construction of queries, when their expertise is the construction of manufacturing processes.

It's end users, expert on the business process but inexpert on BI tools, who will spot these relationships and must be empowered to act.

This is curation; without meaningful metadata to make connections algorithmically, expert human filtering and nomination is the only way a community of users can be notified of a relevant trend. This is the real data that needs user collaboration, selected by a user who appreciates the nuances of the community's shared interests.

These users must have easy access to data from multiple proprietary sources; a level playing field that promotes mash-ups and comparisons. End users must be able to identify their own causal relationships and share their findings immediately with the wider community, driving quick decisions and developing knowledge that is in turn utilised in the future. There can be no reliance on IT to enable this process - it has to be in the hands of the end-users so they can act quickly.

In this way, data can be socialised; business intelligence can become social business intelligence; communities can benefit from shared expertise, expertly applied to their data.