In my previous blog, I wrote about the trade-offs made when considering different approaches to compression. In this blog, we will consider the same trade-offs for deduplication. At the highest level, deduplication and compression are both technologies that reduce the amount of capacity data consumes. Let's drill down into what deduplication does and what resources are required.
Deduplication is a technology that, as the name suggests, eliminates duplicate data patterns within a set of data. It became common in the early 2000s in backup-to-disk (B2D) devices. Several vendors, like Data Domain and Diligent Technologies, brought products to market that allowed weeks or months of backup data to be stored on disk. These products were the catalyst for a major shift in the industry, as B2D became a viable alternative to backup to tape. The technologies were designed to pack as much data as possible onto each HDD, and the overall benefits began to tip the scale in favor of disk over tape.
The objective of deduplication in backup was to save capacity. The capacity savings outweighed some of the negatives, like performance overhead. However, since deduplication was being performed on backup data—and not production data, where a performance impact would be more disruptive—performance didn't matter as much (as long as backup windows were met). And, luckily, overall backup performance was greatly improved with backup to disk.
Purpose-built deduplication appliances used in backup invest CPU resources in order to reduce HDD capacity consumption. Unfortunately, that approach to deduplication does not deliver the low latency required to perform the same processing on production data.
Deduplication technologies compare new data to the existing data stored and eliminate the redundant data. As with compression, there is no way to confirm the benefit of deduplication without first investing resources. In other words, after investing resources to deduplicate the data, there may be no savings. There are multiple ways to implement deduplication, but for the purposes of this discussion, I will focus on inline and post-processing deduplication.
Inline deduplication requires CPU resources to process the incoming data to determine if it is unique. If it is unique, then it is written to the storage media. If it is a duplicate, then the data block does not need to be stored. Instead, only a metadata update is required. The result? Fewer IOPS are required and capacity consumption is decreased. Unfortunately, the overhead of deduplication processing introduces latency into the write process. This slows down write operations and decreases the number of IOPS available to service application IO.
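To make the mechanics concrete, here is a minimal sketch of hash-based inline deduplication. The 4 KB block size, the SHA-256 fingerprint, and the in-memory dictionary are assumptions chosen for illustration, not a description of how any particular product implements it.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size, purely for illustration

class InlineDedupStore:
    """Toy model of inline deduplication: fingerprint before writing."""

    def __init__(self):
        self.blocks = {}     # fingerprint -> unique block actually written to media
        self.block_map = []  # logical layout: one fingerprint per incoming block

    def write(self, data):
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            fp = hashlib.sha256(block).hexdigest()   # CPU cost paid inline, in the write path
            if fp not in self.blocks:
                self.blocks[fp] = block              # unique block: pay the write IO
            # duplicate blocks cost no media write, only this metadata update
            self.block_map.append(fp)

store = InlineDedupStore()
store.write(b"A" * 8192 + b"B" * 4096 + b"A" * 4096)  # the last 4 KB repeats earlier data
print(len(store.block_map), "logical blocks,", len(store.blocks), "written to media")
# -> 4 logical blocks, 2 written to media
```

The write path pays the fingerprinting cost up front, but the duplicate block never generates a media write—only the metadata map grows.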
It is also possible to write the data to the storage media and then, at a later time, process the data to determine if it can be deduplicated. This is called post-processing deduplication, and the trade-offs are quite different. CPU resources are still required; however, in this case, deduplication also consumes additional storage IOPS. This is because the duplicate data is always written to disk first; those disk operations would have been eliminated with inline deduplication. The results of post-processing deduplication are an eventual reduction in capacity and fewer IOPS available for business applications. The operations required to process the data compete with business applications trying to perform IO.
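For contrast, a similarly simplified sketch of post-processing deduplication follows. Every incoming block is written first, and the later pass consumes additional reads that compete with application IO; again, the block size and fingerprinting details are illustrative assumptions.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size, for illustration only

class PostProcessStore:
    """Toy model of post-processing deduplication: write everything, clean up later."""

    def __init__(self):
        self.disk = []        # every incoming block lands on media, duplicates included
        self.writes = 0
        self.extra_reads = 0

    def write(self, data):
        for i in range(0, len(data), BLOCK_SIZE):
            self.disk.append(data[i:i + BLOCK_SIZE])
            self.writes += 1          # duplicate blocks still cost a media write

    def dedup_pass(self):
        """Background pass that reads the data back and keeps one copy per fingerprint."""
        seen = set()
        kept = []
        for block in self.disk:
            self.extra_reads += 1     # IO consumed by dedup, not by business applications
            fp = hashlib.sha256(block).hexdigest()
            if fp not in seen:
                seen.add(fp)
                kept.append(block)
        self.disk = kept              # capacity is only reclaimed after the pass completes

store = PostProcessStore()
store.write(b"A" * 8192 + b"B" * 4096 + b"A" * 4096)
store.dedup_pass()
print(store.writes, "writes,", store.extra_reads, "extra reads,", len(store.disk), "blocks kept")
# -> 4 writes, 4 extra reads, 2 blocks kept
```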
There is another, less obvious challenge with post-processing deduplication: it is much more difficult to deliver maximum efficiency in a multi-site environment. If data replication is set up from a primary site to a disaster recovery location, then either the replication task needs to wait for the post-processing deduplication to complete or it needs to send the redundant data to the remote site. If redundant data is sent to the remote site, then it will need to be processed again on the other side of the WAN.
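A rough, purely hypothetical calculation shows why this matters. Suppose a replication job covers 1,000 logical 4 KB blocks, of which only 300 are unique (invented numbers):

```python
BLOCK = 4096
logical_blocks = 1000   # hypothetical backup stream
unique_blocks = 300     # hypothetical number of blocks that are actually unique

# Replicating before post-processing has run: every block crosses the WAN,
# and the remote site must fingerprint all of them again.
bytes_without_dedup = logical_blocks * BLOCK

# Replicating already-deduplicated data: only unique blocks are sent.
bytes_with_dedup = unique_blocks * BLOCK

print(f"{bytes_without_dedup / bytes_with_dedup:.1f}x more WAN traffic when duplicates are sent")
# -> 3.3x more WAN traffic when duplicates are sent
```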
In a hyperconverged infrastructure environment, efficiency is critical. All of the infrastructure applications and business applications are sharing the same resource pool. Each time the data is processed, it consumes resources that are then unavailable to run business applications.
SimpliVity takes a different approach: deduplication is done at inception, once and forever. SimpliVity’s Data Virtualization Platform delivers inline deduplication, but without a performance penalty. That’s because the OmniStack Accelerator allows all of the “heavy lifting” to be offloaded from the host CPUs. This leaves as much CPU as possible available to run the business applications. The OmniStack Accelerator also delivers extremely predictable performance, which allows SimpliVity’s hyperconverged infrastructure to deliver predictable performance for the business applications.
Deduplication is performed by many different data center services and infrastructure components, such as WAN optimization, backup, replication, and production storage. While it's possible to consolidate these services in hyperconverged infrastructure by running each one as a virtual appliance in the virtual environment, it's extremely inefficient. That's because each virtual appliance processes the same data independently—usually having to "rehydrate" it to its original state and then deduplicate it again as it moves through its lifecycle or is moved between sites. When the same data is processed multiple times, more resources are consumed and the infrastructure inherently becomes less efficient.
The SimpliVity "once and forever" approach means that data is processed once and then remains in an efficient format forever, whether for production storage, backup, replication, movement to the cloud, etc. The Data Virtualization Platform also extends across many sites. This means the data is not only efficient locally; "once and forever" also means that redundant data will not be sent to a remote site when data is replicated. Even the first data transfer will be deduplicated against the data already at the remote site.
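As a generic illustration of deduplication-aware replication (not a description of SimpliVity's actual protocol), a source site can compare fingerprints with the remote site and ship only the blocks the remote site does not already hold, even on the very first transfer:

```python
import hashlib

def replicate(source_blocks, remote_fingerprints):
    """Send only blocks whose fingerprints the remote site does not already hold.

    source_blocks: blocks already deduplicated at the source site
    remote_fingerprints: set of fingerprints known to exist at the remote site
    """
    to_send = []
    for block in source_blocks:
        fp = hashlib.sha256(block).hexdigest()
        if fp in remote_fingerprints:
            continue                   # remote already has it: only a metadata reference is needed
        to_send.append(block)
        remote_fingerprints.add(fp)    # remote will hold it once this transfer completes
    return to_send

# Even a "first" transfer is checked against whatever already exists remotely.
remote = {hashlib.sha256(b"B" * 4096).hexdigest()}
sent = replicate([b"A" * 4096, b"B" * 4096], remote)
print(len(sent), "block(s) actually cross the WAN")  # -> 1
```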
Next up in the blog series, we’ll look at deduplicating IO.