Saturday 4 February 2012

Deduplication



Deduplication is a technology for controlling the rate of data growth. The average UNIX or Windows disk volume contains thousands or even millions of duplicate data objects, which consume a lot of valuable space. By eliminating redundant data objects and referencing just the original object, deduplication delivers a huge benefit through storage space efficiencies.
A question customers often ask is how much deduplicated data is supported by their system.
MAXIMUM FLEXIBLE VOLUME SIZE
The maximum flexible volume size limitation for deduplication varies based on the platform (this number depends primarily on the amount of system memory). When this limit is reached, writes to the volume fail just as they would with any other volume after it is full. This could be important to consider if the flexible volumes are ever moved to a different platform with a smaller maximum flexible volume size. Table 5 shows the maximum usable flexible volume size limits (including any snap reserve space) for the different NetApp storage system platforms. For versions of Data ONTAP prior to 7.3.1, if a volume ever gets larger than this limit and is later shrunk to a smaller size, deduplication cannot be enabled on that volume.
The maximum shared data limit per volume for deduplication is 16TB, regardless of the platform type. Once this limit is reached, there is no more deduplication of data in the volume, but writes to the volume continue to work successfully until the volume is completely full.
Table 6 shows the maximum total data limit per deduplicated volume for each platform. This is the maximum amount of data that can be stored in a deduplicated volume. This limit is equal to the maximum volume size plus the maximum shared data limit. For example, an R200 system that can have a deduplicated volume of up to 4TB in size can store 20TB of data in that volume; that is 4TB + 16TB = 20TB.
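To make the arithmetic above concrete, here is a small Python sketch of my own (not any NetApp tool) that adds the platform's maximum volume size to the fixed 16TB shared data limit; the only platform value used is the 4TB R200 figure quoted above, and anything else would have to come from Table 6.

# Sketch: maximum total data in a deduplicated volume
# = maximum flexible volume size for the platform (see Table 5/6)
#   + maximum shared data limit (16TB regardless of platform).
MAX_SHARED_DATA_TB = 16  # same limit on every platform

def max_total_data_tb(max_volume_size_tb):
    """Maximum amount of data that can be stored in a deduplicated volume."""
    return max_volume_size_tb + MAX_SHARED_DATA_TB

print(max_total_data_tb(4))  # R200: 4TB + 16TB = 20TB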
The next important question customers ask is how much actual space saving they will get by running deduplication.
NetApp provides the Space Savings Estimation Tool (SSET), which can scan the data and report the actual space saving deduplication would achieve, but it works on a maximum of 2TB of data. You can download this tool from now.netapp.com.
Keep in mind that deduplication also needs space for its own metadata:
1.  Volume deduplication overhead - for each volume with deduplication enabled, up to 2% of the logical amount of data written to that volume is required in order to store the volume deduplication metadata.
2.  Aggregate deduplication overhead - for each aggregate that contains any volumes with deduplication enabled, up to 4% of the logical amount of data contained in all of those volumes is required in order to store the aggregate deduplication metadata. For example, if 100GB of data is to be deduplicated within a single volume, then there should be 2GB of available space within the volume and 4GB of space available within the aggregate. As a second example, consider a 2TB aggregate containing four 400GB volumes, three of which are to be deduplicated, with 100GB, 200GB, and 300GB of data respectively. The volumes will need 2GB, 4GB, and 6GB of space within the respective volumes, and the aggregate will need a total of 24GB ((4% of 100GB) + (4% of 200GB) + (4% of 300GB) = 4 + 8 + 12 = 24GB) of space available; the sketch after this list reproduces this arithmetic.
Note: The amount of space required for deduplication metadata depends on the amount of data being deduplicated within the volumes, not on the size of the volumes or the aggregate.
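To make the 2%/4% rule easier to follow, here is a small Python sketch that reproduces the second example above; the function names are illustrative only and are not part of any NetApp tool.

# Deduplication metadata overhead rule described above:
#   - volume overhead:    up to 2% of the logical data deduplicated in that volume
#   - aggregate overhead: up to 4% of the logical data across all deduplicated
#                         volumes in the aggregate
VOLUME_OVERHEAD_RATE = 0.02
AGGREGATE_OVERHEAD_RATE = 0.04

def volume_metadata_gb(logical_data_gb):
    """Space needed inside one volume for its own dedupe metadata."""
    return logical_data_gb * VOLUME_OVERHEAD_RATE

def aggregate_metadata_gb(logical_data_per_volume_gb):
    """Space needed inside the aggregate for all its deduplicated volumes."""
    return sum(gb * AGGREGATE_OVERHEAD_RATE for gb in logical_data_per_volume_gb)

# Second example from the text: three volumes with 100GB, 200GB, and 300GB of data.
volumes = [100, 200, 300]
print([volume_metadata_gb(gb) for gb in volumes])  # [2.0, 4.0, 6.0] GB per volume
print(aggregate_metadata_gb(volumes))              # 24.0 GB in the aggregate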
For deduplication to provide the most benefit when used in conjunction with Snapshot copies, consider the following best practices: 
1.  Run deduplication before creating new Snapshot copies. 
2.  Remove unnecessary Snapshot copies maintained in deduplicated volumes.
3.  If possible, reduce the retention time of Snapshot copies maintained in deduplicated volumes.
4.  Schedule deduplication only after significant new data has been written to the volume.
5.  Configure appropriate reserve space for the Snapshot copies.
6.  If the space used by Snapshot copies grows to more than 100%, df -s reports incorrect results, because some space from the active file system is being taken away by the Snapshot copies, and therefore the actual savings from deduplication are not reported.

Some Facts About Deduplication.
Deduplication consumes system resources and can alter the data layout on disk. Due to the application’s I/O pattern and the effect of deduplication on the data layout, the read and write I/O performance can vary. The space savings and the performance impact depend on the application and the data contents.
NetApp recommends that the performance impact due to deduplication be carefully considered and measured in a test setup and taken into sizing considerations before deploying deduplication in performance-sensitive solutions. For information about the impact of deduplication on other applications, contact the specialists at NetApp for their advice and test results of your particular application with deduplication.
If there is a small amount of new data, run deduplication infrequently, because there's no benefit in running it frequently in such a case, and it consumes system resources. How often you run it depends on the rate of change of the data in the flexible volume.
The more concurrent deduplication processes you're running, the more system resources are consumed.
Given the previous two items, the best option is to do one of the following:
-  Use the auto mode so that deduplication runs only when significant additional data has been written to each flexible volume. (This tends to naturally spread out when deduplication runs.)
-  Stagger the deduplication schedule for the flexible volumes so that it runs on alternate days, reducing the possibility of running too many concurrent sessions (a rough sketch of this idea follows this list).
-  Run deduplication manually.
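As an illustration of the staggering idea in the second option, the Python sketch below spreads a set of hypothetical flexible volumes round-robin across the days of the week so that no single night carries all the sessions. The volume names and the day@hour strings are placeholders for whatever schedule format your Data ONTAP version accepts; the sketch does not issue any storage commands.

# Illustrative only: assign each flexible volume a weekly deduplication slot,
# cycling through the days of the week so runs are spread out rather than
# all starting on the same night.
DAYS = ["sun", "mon", "tue", "wed", "thu", "fri", "sat"]

def stagger_schedule(volumes, start_hour=23):
    """Map each volume to a day@hour slot, round-robin across the week."""
    return {vol: f"{DAYS[i % len(DAYS)]}@{start_hour}"
            for i, vol in enumerate(volumes)}

volumes = [f"/vol/vol{n}" for n in range(1, 11)]  # ten made-up volumes
for vol, slot in stagger_schedule(volumes).items():
    print(vol, "->", slot)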
If Snapshot copies are required, run deduplication before creating them to minimize the amount of data locked into the copies. (Make sure that deduplication has completed before creating the copy.) If a Snapshot copy is created on a flexible volume before deduplication has had a chance to complete on that flexible volume, the result can be lower space savings.
For deduplication to run properly, you need to leave some free space for the deduplication metadata.
IMPACT ON THE SYSTEM DURING THE DEDUPLICATION PROCESS
The deduplication operation runs as a low-priority background process on the system. However, it can still affect the performance of user I/O and other applications running on the system. The number of deduplication processes that are running and the phase that each process is in can affect the performance of other applications on the system (up to eight deduplication processes can actively run at any time on a system). Here are some observations about running deduplication on a FAS3050 system:
-  With eight deduplication processes running and no other processes running, deduplication uses 15% of the CPU in its least invasive phase, and nearly all of the available CPU in its most invasive phase.
-  When one deduplication process is running, there is a 0% to 15% performance degradation on other applications.
-  With eight deduplication processes running, there may be a 15% to more than 50% performance penalty on other applications running on the system.
