A couple of weeks ago, as referred to obliquely on our Sabisu blog, we upgraded our development and QA environments. Our hand was forced: we'd lost the development environment irretrievably, thanks to some over-engineering in the earliest days of the company. The thing that couldn't possibly hurt us in the future did.
The lessons learned can be summarised thus:
1. If you do experience an exponential increase in activity, you can probably find the funding for more capable infrastructure.
2. Advanced technology steals time. Use the simplest technology you can find that will do the job.
3. Only use technology you understand and have a track record in configuring successfully. If you're an infrastructure guy/company, fine - if not, stick to what you're good at and outsource the rest.
Anyway, here's the story.
When you set up a company, particularly a bootstrapping start-up, you are short of everything apart from ambition - that's what makes it fun. The two things you're critically short of are money and time, but you can't help planning for the 'hockey-stick' eventuality: the graph that suddenly swings from gradual linear growth to exponential.
So we originally built two hardware platforms in-house on high-grade consumer kit rather than industry-standard hardware. (This is a problem in itself: high-grade consumer kit often has the features of professional hardware at a lower price, but that cost saving reappears as a time cost because, inevitably, the hardware is less reliable.)
Then we created three virtual environments on each machine: mirrored VMs for Development, QA and the fileserver/DC. Each virtual environment was snapshotted in its entirety on a schedule. Data was backed up to a spare hard drive in one machine and the source code was backed up off-site to Amazon Web Services.
(Worth saying that the live system has always been hosted to the highest corporate standards by a third party infrastructure specialist.)
In the worst case scenario, we'd lose a hardware platform but we'd never lose any data.
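For what it's worth, the off-site source backup was the simplest part of the whole arrangement. A minimal sketch of that kind of scheduled push to S3 - Python with boto3, and the bucket name and paths here are placeholders for illustration rather than anything we actually used - looks roughly like this:

```python
#!/usr/bin/env python3
"""Rough sketch of a scheduled off-site source backup (run from cron).
Bucket name and paths are illustrative only; assumes boto3 is installed
and AWS credentials are already configured."""
import datetime
import tarfile

import boto3

SOURCE_DIR = "/srv/source"            # hypothetical source tree
BUCKET = "example-offsite-backups"    # hypothetical S3 bucket


def main() -> None:
    stamp = datetime.date.today().isoformat()
    archive = f"/tmp/source-{stamp}.tar.gz"

    # Bundle the working tree into a dated archive.
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(SOURCE_DIR, arcname="source")

    # Push it off-site; S3 holds a copy even if every local machine dies.
    boto3.client("s3").upload_file(archive, BUCKET, f"source/{stamp}.tar.gz")


if __name__ == "__main__":
    main()
```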
As a little start-up we didn't have the time or people to define and implement the kind of processes you need to manage this kind of environment. So one day a VM paused mid-snapshot, in something of a dither, having run out of disk space.
We didn't expect this: we'd assumed each snapshot would be a few GB at most, because we didn't fully understand how snapshots grow and had no processes in place to monitor storage usage. Nor had we defined any process for bringing back a failed VM. It was catastrophic.
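The storage monitoring we were missing didn't need to be anything clever. A sketch of the sort of check that would have warned us - again Python, with the mount point and threshold invented for illustration - is just:

```python
#!/usr/bin/env python3
"""Minimal free-space check for the volume holding the VM snapshots.
The path and threshold are illustrative, not our real configuration."""
import shutil
import sys

# Hypothetical mount point holding the snapshots.
DATASTORE = "/vmstore"
# Warn when less than 20% of the volume remains free.
MIN_FREE_FRACTION = 0.20


def main() -> int:
    usage = shutil.disk_usage(DATASTORE)
    free_fraction = usage.free / usage.total
    if free_fraction < MIN_FREE_FRACTION:
        # In practice this would email or page someone rather than print.
        print(f"WARNING: {DATASTORE} is down to {free_fraction:.0%} free "
              f"({usage.free // 2**30} GiB of {usage.total // 2**30} GiB)")
        return 1
    return 0


if __name__ == "__main__":
    sys.exit(main())
```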
As a backup, the snapshot was useless: it took several hours of crunching to get the most recent one restored and loaded, only to find it was too old to be any use. The VM was effectively bricked because its data was stale, which made the multiple platforms irrelevant.
The VMs were a waste of time; we would have been better off with almost any alternative. As it happens, we reconstructed the source code from the off-site Amazon Web Services backups and each developer's work-in-progress copy, and rebuilt the hardware platforms as decent standalone Development servers - technology we actually understand.