When manned missions to space are executed, the equipment that makes these missions possible has one primary mission: to protect the astronauts, because they are the most important asset in the program. To that end, redundancies are built directly into the system to ensure that failures do not place the astronauts at risk. This approach has led to a remarkable success rate, despite a few tragic episodes. In business, the most important asset to the company is the data, and maintaining the availability of that data should be the first requirement of every storage product.
When IT departments consider a new storage product for their environment, they should first and foremost be concerned with the availability of that product. All too often we hear of businesses that lost millions of dollars because their systems were down. Every data storage technology should have two primary goals: keep the data available and ensure the integrity of the data. This is why single points of failure and lack of resiliency feature so often in the Fear, Uncertainty, and Doubt (FUD) that is all too common between competitors in the storage industry.
resiliency – The ability of a storage element to preserve data integrity and availability of access despite the unavailability of one or more of its storage devices.
Building a resilient data storage architecture is all about reducing single points of failure to eliminate the impact lost devices have on production workloads, and reliably recovering data services after the loss of non-redundant components. While there are many challenges to designing and building a scale-out storage architecture, one advantage is the ability to create different layers of failure domains. The two main failure domains are the individual node and the cluster of nodes: availability must be maintained at both levels.
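To make the idea of node-level failure domains concrete, here is a minimal sketch of replica placement that guarantees no single node holds every copy of a block. All names (`place_replicas`, the node labels) are illustrative assumptions, not SimpliVity's actual implementation.

```python
# Hypothetical sketch: spread copies of a block across distinct nodes
# so that losing any one node-level failure domain never removes all
# copies. This is illustrative only, not SimpliVity code.

def place_replicas(block_id, nodes, copies=2):
    """Choose `copies` distinct nodes for a block."""
    if len(nodes) < copies:
        raise ValueError("not enough nodes to satisfy redundancy")
    # Simple deterministic spread: rotate by block id to balance load.
    start = hash(block_id) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(copies)]

nodes = ["node-a", "node-b", "node-c"]
replicas = place_replicas("blk-42", nodes)
assert len(set(replicas)) == 2  # two distinct nodes hold the block
```

The key property is simply that the chosen nodes are always distinct, so one node failure leaves at least one intact copy.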
Wherever possible, you want to eliminate single points of failure within a node in order to avoid any downtime of the node itself. The use of standard x86 servers as the basic building block for scale-out storage means you get a platform that has already been architected with highly redundant components, including power supplies, ECC memory, and multiple NIC ports. Disks are statistically the most likely component to fail in a server, which is why SimpliVity utilizes RAID controllers to minimize the performance impact after the loss of a disk and to increase the number of disks that can be lost before availability is affected.
Of course, even with all these redundancies, nodes do fail, so making sure the node itself isn’t a single point of failure is equally important. This is why SimpliVity does not acknowledge a write back to the VM until it has been committed to two different SimpliVity nodes. At this point, the block of data is stored in RAM on the OmniStack Accelerator Card in each node. This RAM is backed by supercapacitors that can be used to flush the data to flash storage on the card should the node lose power. Once fully processed, the block is stored down to the RAID-protected disks for permanent storage. In this way, every single block of data is protected from the loss of an entire node, and within the node from both disk and power failures, from the moment the block is committed.
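The write-acknowledgment rule above can be sketched in a few lines. This is a simplified model under stated assumptions, not SimpliVity code: the `Node` class, `commit_to_nvram`, and `destage` are hypothetical names standing in for the supercap-backed card RAM and the later move to RAID-protected disks.

```python
# Illustrative model of the write path described above: a write is
# acknowledged to the VM only after the block is held in protected
# NVRAM on two different nodes. Names are hypothetical.

class Node:
    def __init__(self, name):
        self.name = name
        self.nvram = []   # supercap-backed RAM on the accelerator card
        self.disks = []   # RAID-protected permanent storage

    def commit_to_nvram(self, block):
        self.nvram.append(block)
        return True

    def destage(self):
        # Fully processed blocks later move from NVRAM to the disks.
        self.disks.extend(self.nvram)
        self.nvram.clear()

def write(block, local, remote):
    """Return the ack to the VM only once two nodes hold the block."""
    ok_local = local.commit_to_nvram(block)
    ok_remote = remote.commit_to_nvram(block)
    return ok_local and ok_remote  # ack back to the VM

a, b = Node("node-a"), Node("node-b")
assert write("blk-1", a, b)     # ack: block is on two nodes
a.destage()
assert a.disks == ["blk-1"]     # permanent RAID-protected copy
```

The point of the model is the ordering: the acknowledgment depends on both commits, so from the moment the VM sees the ack, no single node loss can lose the block.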
Of course, some components are simply impractical to make redundant within a single node. The most obvious example is the motherboard itself. Rather than placing two motherboards in a single server, IT systems use multiple servers with automated workload failover (e.g., VMware vSphere HA, Microsoft Server Cluster Services) to handle the loss of a single motherboard. This is the approach we took to provide resiliency for the OmniStack Accelerator Card: just as multiple nodes protect against the failure of a single node/motherboard, multiple nodes protect against the failure of a single OmniStack Accelerator Card.
If an OmniStack Accelerator Card fails, the associated OmniStack Virtual Controller is also shut down, and its IP address is failed over to another OmniStack Virtual Controller. This allows the VMs on the node with the failed OmniStack Accelerator Card to continue to run with uninterrupted access to their data. All the data is available on one of the other nodes, so no loss of storage or application availability occurs. Technologies like vSphere HA and Microsoft Server Cluster Services, by contrast, usually require a restart of the VM or application services.
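A minimal sketch of that IP failover step, assuming a simple health-tracking table of controllers: when one controller goes down, its storage IP is reassigned to a surviving controller so VMs keep addressing the same storage target. The controller names and data layout here are illustrative, not SimpliVity's actual mechanism.

```python
# Hypothetical sketch of virtual-controller IP failover: the failed
# controller's IPs move to a healthy survivor so storage access to
# the same addresses continues uninterrupted. Names are illustrative.

controllers = {
    "ovc-1": {"ips": ["10.0.0.11"], "healthy": True},
    "ovc-2": {"ips": ["10.0.0.12"], "healthy": True},
}

def fail_over(failed, controllers):
    """Mark a controller failed and move its IPs to a healthy peer."""
    controllers[failed]["healthy"] = False
    survivor = next(
        name for name, c in controllers.items()
        if c["healthy"] and name != failed
    )
    controllers[survivor]["ips"] += controllers[failed]["ips"]
    controllers[failed]["ips"] = []
    return survivor

survivor = fail_over("ovc-1", controllers)
assert "10.0.0.11" in controllers[survivor]["ips"]
```

Because the IP address itself moves, clients of the failed controller reconnect to the same address without being reconfigured or restarted.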
If the OmniStack Accelerator Card failure is permanent and the card needs to be replaced, there is no dependency on the original card to retrieve or interpret the data on the disks. All data, metadata, and index tables are persisted on HDD or SSD, so a replacement OmniStack Accelerator Card can read the existing data. In the rare case that data was corrupted, or could have been corrupted, it can all be rebuilt from the copies on the remaining nodes.
SimpliVity utilizes the OmniStack Accelerator Card to provide the predictable, peak performance that enterprise customers require, making it a critical component of the SimpliVity OmniStack platform. Combined with our native data protection, this approach ensures a resilient data storage architecture that customers can rely on to protect their data, even during failures.