Six Hours Of Downtime

Earlier today a power transient knocked both of our systems offline. I noticed it at the house and quickly figured out that it hit the office, too.

Recall the configuration of our systems:

  • HP Z620 workstations
  • Dual NAS-grade disks
  • ZFS-based mirroring
  • Multiple VMs hosting data

The multiple VM configuration we use, three per workstation, is in place for two reasons. Elasticsearch has requirements on the number of available systems before it will behave properly, such that two systems alone are not workable. The second reason is the roughly 32 gig heap limit for a Java virtual machine. These 192 gig systems easily support three large VMs and a good-sized ZFS cache.
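For the curious, the "two systems are not workable" constraint is just the usual quorum rule for master election: a majority of the master-eligible nodes has to be reachable, and with only two members a single failure stops the whole cluster. A rough sketch of the arithmetic, with node counts that are illustrative rather than our exact layout:

  # Quorum rule behind "two systems alone are not workable": a majority of
  # master-eligible nodes must be up before the cluster will elect a master.

  def quorum(master_eligible_nodes: int) -> int:
      """Majority needed to elect a master: floor(n / 2) + 1."""
      return master_eligible_nodes // 2 + 1

  for n in (2, 3, 5):
      q = quorum(n)
      print(f"{n} nodes: quorum of {q}, tolerates losing {n - q}")

  # 2 nodes: quorum of 2, tolerates losing 0   <- one failure stops everything
  # 3 nodes: quorum of 2, tolerates losing 1
  # 5 nodes: quorum of 3, tolerates losing 2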

We tested this architecture prior to putting any data on it. That involved pulling the power to the shared switch, turning down or otherwise degrading a single system, and in general mistreating things to understand their failure modes.

But we never simply pulled the plug on both systems at once. As it was a sag rather than a full outage, one of the workstations restarted spontaneously, while the other required a manual power-on. Both systems booted cleanly, but Elasticsearch as an operational service was nowhere to be found after the VMs were restarted, and the bug hunt began.

After ninety minutes of head-scratching and digging it became apparent that the disks pinned at 100% utilization were busy with some built-in maintenance procedure that had to complete before the cluster would even start to recover in an observable fashion. Shutting down two of the three VMs on each system let the remaining one finish; turning the others back on afterwards let the cluster begin to recover.
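What we ended up doing by hand, and should have done first, was ask the cluster what it was busy with rather than watching the disk lights. A minimal sketch of that check, assuming plain HTTP on the usual port 9200 (the address below is a placeholder, point it at any node):

  # Poll cluster health and active shard recoveries instead of guessing from
  # disk activity. localhost:9200 is an assumed address for one cluster node.

  import json
  import urllib.request

  BASE = "http://localhost:9200"

  def get(path: str) -> str:
      with urllib.request.urlopen(BASE + path) as resp:
          return resp.read().decode()

  # Overall state (red/yellow/green) plus shard counts still in flight.
  health = json.loads(get("/_cluster/health"))
  print(health["status"], health["initializing_shards"], health["unassigned_shards"])

  # Per-shard recovery progress, limited to recoveries still running.
  print(get("/_cat/recovery?v&active_only=true"))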

The cluster would likely have recovered on its own eventually, but the disk contention from three VMs per workstation all trying to validate data at once appears to be far slower than letting those VMs recover serially. We’ll do something about this revealed architectural fault the next time we expand, although it isn’t clear at this time precisely what that will be.
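One stopgap worth considering in the meantime is throttling recovery itself, so that co-resident VMs stop fighting over the same mirrored disks. A rough sketch of the cluster settings call; the bandwidth cap and concurrency values are placeholders, not something we have tuned:

  # Cap recovery bandwidth and concurrent recoveries per node via the
  # cluster settings API. Values here are placeholders for illustration.

  import json
  import urllib.request

  BASE = "http://localhost:9200"   # assumed address of one cluster node

  settings = {
      "transient": {
          "indices.recovery.max_bytes_per_sec": "40mb",
          "cluster.routing.allocation.node_concurrent_recoveries": 1,
      }
  }

  req = urllib.request.Request(
      BASE + "/_cluster/settings",
      data=json.dumps(settings).encode(),
      headers={"Content-Type": "application/json"},
      method="PUT",
  )
  with urllib.request.urlopen(req) as resp:
      print(resp.read().decode())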

We lost no data, but we do have a six-hour gap in our streaming coverage, and we lost six hours of processing time on the current batch of user IDs. This could have been something much worse. We’ll be ordering power conditioners in the near future, and a weekly fifteen-minute outage on Sunday for simultaneous ZFS snapshots of the data VMs seems like a wise precaution.
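Something like the following would cover that snapshot window. It is only a sketch: the virsh calls assume a libvirt-style hypervisor, and the domain and dataset names are placeholders rather than our actual layout.

  # Weekly snapshot window: stop the data VMs, snapshot their ZFS datasets,
  # bring them back. Domain and dataset names below are hypothetical.

  import subprocess
  from datetime import date

  VMS = {
      "es-vm1": "tank/vm/es-vm1",
      "es-vm2": "tank/vm/es-vm2",
      "es-vm3": "tank/vm/es-vm3",
  }
  STAMP = date.today().strftime("weekly-%Y%m%d")

  def run(*cmd):
      print(" ".join(cmd))
      subprocess.run(cmd, check=True)

  for domain in VMS:
      run("virsh", "shutdown", domain)      # graceful guest shutdown (async)

  # In practice: wait for the guests to actually power off before snapshotting.

  for domain, dataset in VMS.items():
      run("zfs", "snapshot", f"{dataset}@{STAMP}")

  for domain in VMS:
      run("virsh", "start", domain)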