In this blog we will outline the general concept of data deduplication, focus on the two most common forms of deduplication, and explain why VMcom chose fixed block length deduplication.
The basics about deduplication
If you have a data block on your hard drive and use a deduplication tool while performing a backup, the tool will match every incoming block against the blocks already backed up. Where a block repeats, the tool will not back it up again because it already exists; instead, it will create a reference to the first copy for future recovery. Deduplication should always work at the block level, not the file level. Say you are backing up two separate files and only the last 5% of each file is different. File-level deduplication would store both files in full, because neither file as a whole matches the other. Block-level deduplication backs up only one instance of the 95% the files have in common, plus a separate instance of each file's differing 5%.
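To make the file-level vs. block-level difference concrete, here is a minimal Python sketch (not VMcom's actual code): it splits each file into blocks, stores only unique blocks keyed by checksum, and keeps per-file reference lists. The 4 KB block size and SHA-256 checksum are illustrative choices, not anything the post specifies.

```python
import hashlib

BLOCK = 4096  # illustrative block size for the demo

def split_blocks(data: bytes):
    return [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]

def backup(files: dict):
    """Block-level dedup: store each unique block once, keep references."""
    store = {}    # checksum -> block payload (the "backup disk")
    catalog = {}  # filename -> ordered list of block checksums
    for name, data in files.items():
        refs = []
        for block in split_blocks(data):
            digest = hashlib.sha256(block).hexdigest()
            store.setdefault(digest, block)  # write only if block is new
            refs.append(digest)              # always record a reference
        catalog[name] = refs
    return store, catalog

# Two files that share 95% of their content (19 of 20 blocks)
common = b"".join(bytes([i]) * BLOCK for i in range(19))
file_a = common + b"\xAA" * BLOCK
file_b = common + b"\xBB" * BLOCK
store, catalog = backup({"a.vmdk": file_a, "b.vmdk": file_b})
```

A file-level approach would store all 40 blocks; here the store holds 21 unique blocks (19 shared + 1 per file), and either file can be reassembled from its reference list.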
Fixed block length deduplication vs. Variable block length deduplication
Fixed block length deduplication (Fixed Block) – the way VMcom does it.
Fixed Block means the data is processed in chunks of the same defined length. VMcom uses Fixed Block because it requires fewer checksums. This makes them easier to calculate, which in turn means less CPU demand. And CPU costs money. VMcom uses 1 MB block lengths by default. During the backup, VMcom checks each incoming chunk against the checksums in an internal table of blocks already backed up. If there is a duplicate, VMcom simply creates a reference to where it is stored and doesn't back up the repeated block. If it is a new, unique block, VMcom writes it to the hard drive like all previous unique blocks and creates a checksum and table entry for matching all future incoming blocks. This read/write process repeats until the defined range of data is backed up.
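The loop described above can be sketched as follows. This is a simplified illustration under our own assumptions, not VMcom's implementation: a persistent checksum table stands in for the internal table of backed-up blocks, an in-memory buffer stands in for the backup disk, and SHA-256 is our assumed checksum.

```python
import hashlib
import io

BLOCK_SIZE = 1024 * 1024  # 1 MB, matching the default block length

class FixedBlockTarget:
    """Minimal sketch of a fixed-block dedup backup target."""

    def __init__(self):
        self.table = {}             # checksum -> offset of the stored block
        self.disk = io.BytesIO()    # stands in for the backup hard drive

    def backup(self, stream):
        """Read fixed-size chunks; write only blocks not seen before."""
        refs, written = [], 0
        while True:
            chunk = stream.read(BLOCK_SIZE)
            if not chunk:
                break
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.table:      # new, unique block
                self.table[digest] = self.disk.tell()
                self.disk.write(chunk)
                written += 1
            refs.append(digest)               # reference either way
        return refs, written

# First run writes every block; a repeat run writes nothing new.
data = b"".join(bytes([i]) * BLOCK_SIZE for i in range(3))
target = FixedBlockTarget()
refs1, w1 = target.backup(io.BytesIO(data))
refs2, w2 = target.backup(io.BytesIO(data))
```

Because the table survives between runs, the second backup of unchanged data produces only references and zero writes.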
Variable block length deduplication (Variable Block)
Variable Block, as the name implies, can have a different length for every data block in the backup. It uses specialized algorithms to split the data into chunks of varying size, achieving a somewhat higher deduplication ratio than Fixed Block. These algorithms, however, can be far more CPU-intensive. And CPU costs money. For backup solutions running Variable Block deduplication, you will often need to purchase high-end HW to handle the processing. On the flip side, the higher number of checksums in Variable Block results in more granular deduplication and thus a smaller backup footprint.
Boiled down, what does it mean?
If you go with a backup solution using Fixed Block deduplication, you'll be able to run it on more generic, cheaper HW and still get a good processing rate. But your storage footprint will be larger than with Variable Block.
If you go with a backup solution using Variable Block deduplication, you'll need to run it on more expensive, specialized HW to get a good processing rate. But your storage footprint will be smaller than with Fixed Block.
Why did VMcom go with Fixed Block?
Because we at VMcom are real IT admins and know how things work in practice. CPU and memory for processing are more expensive than storage; in fact, companies often use older, end-of-life HW for backup storage.
Costs aside, you need to get through as much backup as possible in as short a time as possible, so you can return CPU to business-critical processes. And we believe that is more easily achieved across a wider range of HW with Fixed Block than with Variable Block.
If you actually need or want Variable Block, you are better off buying specialized hardware like EMC DataDomain or HPE StoreOnce to handle the deduplication. These and similar appliances can be easily used with VMcom, giving you the best of both worlds. They come at a price, though.