Popular wisdom holds that the primary value of data deduplication is in storage capacity savings. This is not surprising given that the first generation of data deduplication companies was focused on a single use case: backup to disk. It was the right solution at the time. Disk drives were getting larger and less expensive, and deduplication was the enabling technology that made backup to disk price competitive with backup to tape for many customers. The deduplication engine in these devices was designed to reduce the disk capacity required to store weeks or months of backup data. These solutions were never intended for primary storage applications.
Primary storage has a completely different set of requirements. Latency is far more important than throughput. Random performance is generally more important than sequential performance. Caching becomes a critical feature. Whereas disk-based backup devices are optimized for a single workload of sequential write streams with periodic sequential reads for restore, primary storage has to handle a mixed workload that services many different data center applications at the same time.
What value can deduplication technology deliver for primary storage? It goes far beyond just capacity. Deduplication can optimize HDD capacity, HDD IOPS, SSD efficiency, DRAM efficiency, and WAN efficiency. HDD capacity is the most well-known benefit given the recent history of deduplication, but I would argue it is the least valuable of the group for primary storage. Let’s assume a 3TB HDD can deliver 100 IOPS (to make the math simple). If I create a new 20GB vmdk that contains 10GB of data that already exists in the system, I will save 10GB of capacity (about 0.33% of one HDD), but more importantly I will eliminate the IOPS required to write those 10GB to disk (100% of the drive’s IOPS for well over a minute). The IOPS savings are significantly more valuable.
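To make that trade-off concrete, here is the arithmetic as a short Python sketch. The drive capacity and IOPS figures come from the example above; the 1MB IO size is my own illustrative assumption, not a figure from the article.

```python
# Back-of-the-envelope math for the example above: deduplicating 10GB
# that already exists saves little capacity but a lot of drive time.
HDD_CAPACITY_GB = 3000   # one 3TB drive
HDD_IOPS = 100           # round number from the example
IO_SIZE_MB = 1           # assumed size of each write IO (my assumption)

duplicate_gb = 10        # portion of the new vmdk that already exists

capacity_saved_pct = duplicate_gb / HDD_CAPACITY_GB * 100      # ~0.33%
ios_avoided = duplicate_gb * 1024 / IO_SIZE_MB                 # 10,240 IOs
seconds_busy = ios_avoided / HDD_IOPS                          # ~102 seconds

print(f"capacity saved: {capacity_saved_pct:.2f}% of one HDD")
print(f"IOs avoided:    {ios_avoided:,.0f}")
print(f"drive busy for: {seconds_busy:.0f} seconds at 100 IOPS")
```

Under these assumptions, avoiding the duplicate write frees roughly a minute and a half of the drive’s full attention while saving a third of one percent of its capacity.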
Deduplication also impacts SSD and DRAM efficiency. If a system deduplicates intelligently, there should never be two copies of the same piece of data in DRAM. The same holds true for SSD. The cost per GB of DRAM and SSD is much higher than that of HDD, so here the capacity efficiency is very powerful: the effective DRAM and SSD size becomes greater than the physical size. The more efficient these components are, the more requests they can fulfill, the fewer HDD IOPS are required, and the faster the system performs.
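One way to picture this is a cache keyed by content rather than by logical address: two blocks with identical contents map to the same key and share a single cached copy. Here is a minimal Python sketch of that idea (the class and its layout are my own illustration of the general technique, not a description of any particular product’s internals):

```python
import hashlib

class ContentAddressedCache:
    """Cache keyed by a fingerprint of block contents, so identical
    blocks from different volumes/offsets share a single cached copy."""

    def __init__(self):
        self._blocks = {}   # fingerprint -> block data
        self._index = {}    # (volume, offset) -> fingerprint

    def put(self, volume, offset, data):
        fp = hashlib.sha256(data).hexdigest()
        self._index[(volume, offset)] = fp
        self._blocks.setdefault(fp, data)   # a duplicate adds no new entry

    def get(self, volume, offset):
        fp = self._index.get((volume, offset))
        return self._blocks.get(fp) if fp else None

cache = ContentAddressedCache()
cache.put("vm1.vmdk", 0, b"same 4K block")
cache.put("vm2.vmdk", 4096, b"same 4K block")  # cached once, not twice
assert len(cache._blocks) == 1
```

Eviction, locking, and persistence are omitted; the point is simply that the index can hold many logical addresses while the data itself lives in the cache once.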
Let’s not forget about the WAN. The Wide Area Network link is an expensive and limited resource. Storage replication has become extremely common, and replicating data over the WAN is the only practical way to obtain disaster recovery with a Recovery Point Objective (RPO) of less than 24 hours. If all of the data in the primary storage system is already deduplicated, there should be no need to transfer data to the DR site that already exists there. For this to work, the data needs to be deduplicated in real time, and the data centers need to communicate efficiently so that only the unique data is transferred, at a fine granularity. This dramatically reduces the storage replication burden on the WAN, achieving bandwidth reductions of 50 to 90 percent.
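The exchange can be as simple as shipping fingerprints first and data only on demand. Below is a rough Python sketch of that pattern; the 8KB chunk size and the dictionary standing in for the DR site are hypothetical stand-ins for a real wire protocol:

```python
import hashlib

CHUNK = 8192  # assumed fine-grained unit; real systems vary

def replicate(data, dr_store):
    """Ship fingerprints first; transfer only chunks the DR site lacks."""
    bytes_sent = 0
    for i in range(0, len(data), CHUNK):
        chunk = data[i:i + CHUNK]
        fp = hashlib.sha256(chunk).hexdigest()
        if fp not in dr_store:      # stands in for "which hashes do you have?"
            dr_store[fp] = chunk    # stands in for the actual WAN transfer
            bytes_sent += len(chunk)
    return bytes_sent

dr_site = {}
payload = b"A" * CHUNK * 9 + b"B" * CHUNK   # 10 chunks, only 2 unique
print(replicate(payload, dr_site), "of", len(payload), "bytes sent")
print(replicate(payload, dr_site), "bytes sent on the second pass")  # 0
```

In this toy run, a payload that is mostly duplicate chunks moves only the two unique chunks on the first pass and nothing at all on the second, which is exactly the effect described above.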
Not all systems deduplicate the same way. Unless the deduplication is done in real time, at a fine granularity, at the very core of the system, it will not be capable of enabling all of the functionality described above. SimpliVity OmniCube deduplication is systemic: the system is simply not capable of creating duplicate data. This means that OmniCube benefits from all five optimizations: HDD capacity, HDD IOPS, SSD efficiency, DRAM efficiency, and WAN efficiency. Next time you are looking at a product that claims to have deduplication technology, ask whether it benefits the platform in all five of these areas.
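To see what real-time, fine-grained deduplication in the write path means in the abstract, consider this minimal sketch (the reference counting and the in-memory stand-in for disk are my illustrative assumptions, not OmniCube’s actual design):

```python
import hashlib

class InlineDedupStore:
    """Write path that fingerprints data before anything hits disk,
    so a duplicate block is never materialized."""

    def __init__(self):
        self._disk = {}       # fingerprint -> block (stands in for HDD/SSD)
        self._refcount = {}   # fingerprint -> number of logical references

    def write(self, block):
        fp = hashlib.sha256(block).hexdigest()
        if fp in self._disk:
            self._refcount[fp] += 1   # duplicate: metadata update only, no IO
        else:
            self._disk[fp] = block    # unique: the only physical write
            self._refcount[fp] = 1
        return fp                     # the logical layer maps address -> fp

store = InlineDedupStore()
a = store.write(b"block contents")
b = store.write(b"block contents")    # second write costs no disk IO
assert a == b and len(store._disk) == 1
```

Because the fingerprint lookup happens before the write, the duplicate never consumes disk capacity, disk IOPS, cache space, or replication bandwidth in the first place.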
And one more thing… SimpliVity OmniCube also compresses all of the data, making it that much more efficient.
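Compression stacks naturally on top of deduplication: once only unique blocks remain, each one can be compressed before it is stored. A tiny sketch, with zlib standing in for whatever algorithm a real system would use:

```python
import zlib

# Only unique blocks survive deduplication; compression then shrinks
# each of those before it is written. zlib is a stand-in algorithm.
unique_block = b"only unique blocks reach this point " * 100
stored = zlib.compress(unique_block)
print(f"{len(unique_block)} bytes -> {len(stored)} bytes stored")
```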