Friday, December 12, 2008

Storage de-duplication

Right now de-duplication is being used mainly as a way to shrink backup data sets. I have made the Data Domain 5 TB appliance a key part of my backup strategy in the past. I was able to write around 50 TB of data to the appliance even though it only had 5 TB of actual disk, roughly a 10:1 reduction.

How this works isn't something I want to go into in depth. Basically, the system breaks incoming data into chunks, compares each chunk to what is already in the system, and if the chunk is already there it stores a pointer instead of storing the data again. A rough sketch of the idea is below.
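Here is a minimal, hypothetical sketch of chunk-level de-duplication in Python. It is not Data Domain's actual algorithm (which I'm not going into); it just shows the chunk/hash/pointer idea, with a fixed chunk size and an in-memory dictionary standing in for the real chunk store.

```python
import hashlib

CHUNK_SIZE = 4096   # assumed fixed chunk size; real systems often use variable-size chunks

chunk_store = {}    # chunk hash -> chunk bytes; each unique chunk is stored exactly once


def dedupe_write(data: bytes) -> list:
    """Write a byte stream, returning the list of chunk hashes (pointers)."""
    pointers = []
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in chunk_store:
            chunk_store[digest] = chunk    # new data: store the chunk
        pointers.append(digest)            # duplicate data: keep only the pointer
    return pointers


def read_back(pointers: list) -> bytes:
    """Reassemble the original stream from its chunk pointers."""
    return b"".join(chunk_store[d] for d in pointers)
```

Writing the same data twice adds nothing new to the chunk store the second time, which is where the space savings come from.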

Since we were backing up entire VMs, we were getting 20x de-dupe ratios because of the redundancy across, for example, each VM's C: drive.

My whole issue is that de-duplication is now thought of as a point product: you buy a Data Domain, and anything you write to it is de-duped. I don't think that line of thought makes sense.

I think two things must change. The first is that de-duplication should be thought of as a feature or service: you own a Clariion array, a DMX, or an EVA, and you can simply enable de-duplication on it. It is a service, not a dedicated hardware solution like Data Domain.

With Data Domain you end up with silos of de-duplicated data. For that reason, the second thing I think will have to happen soon is global de-duplication. I have thought about this for as long as I can remember understanding storage; it has yet to arrive in full swing, but it must happen.

Data growth rates are so high, and there is so much waste, that a global de-duplication system must happen: something that sits above all of your physical storage, at the same level as (or as part of) storage virtualization, and de-duplicates data across the entire enterprise. Then you don't have to worry about separate de-dupe silos.

The virtual layer can handle data access and the pointers. My bet is that the ratios will be significant, though probably not as high as with backup, for obvious reasons. Below is a rough sketch of what such a layer might look like.
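To make the idea concrete, here is a hedged sketch of a global de-duplication layer: a virtualization tier that keeps one enterprise-wide index of chunk hashes and maps every logical volume, whichever array it lands on, to a single physical copy of each chunk. All of the class and method names here are hypothetical, not any vendor's API.

```python
import hashlib


class GlobalDedupLayer:
    """Hypothetical enterprise-wide de-dupe layer sitting above several arrays."""

    def __init__(self, arrays):
        self.arrays = arrays    # e.g. {"clariion": {}, "dmx": {}, "eva": {}}
        self.index = {}         # chunk hash -> (array name, key) of the one physical copy
        self.volumes = {}       # volume name -> list of chunk hashes (pointers)

    def write(self, volume, data, chunk_size=4096):
        pointers = []
        for off in range(0, len(data), chunk_size):
            chunk = data[off:off + chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.index:
                # First copy anywhere in the enterprise: place it on the emptiest array.
                target = min(self.arrays, key=lambda a: len(self.arrays[a]))
                self.arrays[target][digest] = chunk
                self.index[digest] = (target, digest)
            pointers.append(digest)
        self.volumes.setdefault(volume, []).extend(pointers)

    def read(self, volume):
        parts = []
        for digest in self.volumes[volume]:
            array, key = self.index[digest]
            parts.append(self.arrays[array][key])
        return b"".join(parts)
```

The point is that the index is global: a chunk already sitting on the DMX is never stored again when the same data is written to a volume backed by the EVA.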

Along with this comes the need to access de-duplicated data at high speed. This is the hard part, and it can be partly handled through caching or tiering within the de-duplicated environment, as the sketch below illustrates.
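As one possible illustration of the caching piece (again an assumption, not a real product's design): keep the hottest de-duplicated chunks in a small in-memory LRU cache so that reads of popular pointers do not always hit the slower back-end tier.

```python
from collections import OrderedDict


class ChunkCache:
    """Hypothetical LRU cache of hot chunks in front of a slower chunk store."""

    def __init__(self, backing_store, capacity=1024):
        self.backing_store = backing_store   # chunk hash -> bytes, e.g. the chunk_store above
        self.capacity = capacity
        self.cache = OrderedDict()           # keeps chunks in recency order

    def get(self, digest):
        if digest in self.cache:
            self.cache.move_to_end(digest)   # cache hit: mark as most recently used
            return self.cache[digest]
        chunk = self.backing_store[digest]   # cache miss: fetch from the slower tier
        self.cache[digest] = chunk
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)   # evict the least recently used chunk
        return chunk
```

Because de-duplicated data is shared, a single cached chunk can serve reads from many different volumes, which is exactly why caching and tiering help here.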
