Best practices for culling your data to save time and money
By Parke McManis, Esq., Managing Director, CDS Mid-Atlantic.
Culling can be an extremely helpful way to reduce your dataset to a manageable size. There are many aspects to culling and, like many things in life, you get as much out of it as you put into it. The process uses several techniques in tandem to remove as many documents as possible from a collection before the data is processed and before the costly, time-consuming task of review begins. Its value lies in saving both time and money. Because eDiscovery service providers offer culling at a cost significantly lower than that of processing, savings can appear early on if culling leaves you with a significantly smaller amount of data to process. This assumes you are able to cull enough data before processing to justify the cost of culling. The biggest savings, however, come from having less data to review, since review is typically the most expensive and lengthy part of any eDiscovery project.
The industry average for culling is about a 70% reduction in size: if you start with 100 GB of data, culling should reduce that population to roughly 30 GB. This number is an average, and individual results can vary significantly depending on the contents of the dataset and the thoroughness of the culling. Maximizing culling requires an intimate knowledge of both the facts of the case and the tools available, and joining the two using best practices.
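The arithmetic behind these two points can be sketched in a few lines. This is only an illustration: the per-GB rates are hypothetical placeholders, not real vendor pricing, and the sketch assumes culling is billed on the full collection while processing is billed only on what remains.

```python
def remaining_gb(collected_gb: float, reduction_rate: float) -> float:
    """Size left to process after culling removes `reduction_rate` of the data."""
    if not 0.0 <= reduction_rate <= 1.0:
        raise ValueError("reduction_rate must be between 0 and 1")
    return collected_gb * (1.0 - reduction_rate)


def culling_pays_off(collected_gb: float, reduction_rate: float,
                     cull_rate_per_gb: float, process_rate_per_gb: float) -> bool:
    """True when culling plus processing the remainder costs less than
    processing the whole collection without culling.

    Assumed pricing model (hypothetical): culling is charged per GB collected,
    processing is charged per GB that survives culling.
    """
    with_culling = (collected_gb * cull_rate_per_gb
                    + remaining_gb(collected_gb, reduction_rate) * process_rate_per_gb)
    without_culling = collected_gb * process_rate_per_gb
    return with_culling < without_culling


# 100 GB at the 70% average leaves about 30 GB to process.
print(round(remaining_gb(100, 0.70), 2))  # 30.0

# Placeholder rates in arbitrary cost units: culling 10/GB, processing 50/GB.
print(culling_pays_off(100, 0.70, 10.0, 50.0))  # True: a strong cull justifies its cost
print(culling_pays_off(100, 0.10, 10.0, 50.0))  # False: too little was removed
```

Under this assumed model, culling pays for itself before review even begins whenever the reduction rate exceeds the ratio of the culling rate to the processing rate.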
There are several methods used in culling, including the following:
De-NISTing removes industry-accepted “junk” files (mostly program and system files that do not contain user-generated data), which would otherwise clutter up your review. The name refers to the National Institute of Standards and Technology (NIST), whose National Software Reference Library catalogs known software files. This process is almost always done during culling.
Removing duplicates is a great way to reduce the size of the collection that has to be processed and reviewed. An important decision, however, is whether to perform a global dedupe or a custodial dedupe. Custodial deduping compares a custodian’s documents only against that same custodian’s documents. Global deduping removes duplicates across all custodians. The main advantage of custodial deduping is that it keeps each custodian’s entire collection intact, whereas global deduping maximizes the number of duplicative documents that are removed.
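De-NISTing is commonly implemented as a hash lookup: each file's hash is compared against a reference list of known program and system files, and matches are dropped. The sketch below assumes SHA-1 hashes and an in-memory set standing in for that reference list; the function names and sample files are hypothetical, and a real workflow would load the list from the provider's or NIST's published data.

```python
import hashlib


def sha1_hex(content: bytes) -> str:
    """Hex-encoded SHA-1 digest of a file's contents."""
    return hashlib.sha1(content).hexdigest()


def de_nist(files: dict[str, bytes], known_hashes: set[str]) -> dict[str, bytes]:
    """Keep only files whose hash is NOT on the known-file list."""
    return {
        name: data
        for name, data in files.items()
        if sha1_hex(data) not in known_hashes
    }


# Hypothetical collection: one system file, one user-generated document.
collection = {
    "system/driver.dll": b"vendor-supplied binary",
    "docs/memo.txt": b"Draft memo from the custodian",
}

# Stand-in for the reference list of known system-file hashes.
known = {sha1_hex(b"vendor-supplied binary")}

print(list(de_nist(collection, known)))  # ['docs/memo.txt']
```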
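The difference between the two dedupe scopes comes down to what counts as "already seen." In this minimal sketch, a global dedupe keys on the content hash alone, while a custodial dedupe keys on the custodian plus the content hash. The `Doc` type and sample data are hypothetical, and hashing raw content with SHA-256 is a simplifying assumption; production tools often hash emails on selected metadata fields instead.

```python
import hashlib
from typing import NamedTuple


class Doc(NamedTuple):
    custodian: str
    name: str
    content: bytes


def dedupe(docs: list[Doc], scope: str = "global") -> list[Doc]:
    """Return docs with duplicates removed, keeping the first occurrence.

    scope="global"    -> a document is a duplicate if ANY custodian already has it
    scope="custodial" -> a document is a duplicate only within the same custodian
    """
    if scope not in ("global", "custodial"):
        raise ValueError("scope must be 'global' or 'custodial'")
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(doc.content).hexdigest()
        key = digest if scope == "global" else (doc.custodian, digest)
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept


docs = [
    Doc("alice", "report.docx", b"Q3 report"),
    Doc("alice", "report (copy).docx", b"Q3 report"),
    Doc("bob", "report.docx", b"Q3 report"),
    Doc("bob", "notes.txt", b"Meeting notes"),
]

print(len(dedupe(docs, "global")))     # 2: one report overall, plus bob's notes
print(len(dedupe(docs, "custodial")))  # 3: bob keeps his own copy of the report
```

The custodial result is larger, but it still shows that bob held the report, which is the trade-off described above.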
The purpose of using search terms is to find relevant documents. The challenge is identifying the right terms, which requires intimate knowledge of the facts of the case. Finalizing the search terms may also require agreement among the parties, which is often reached in the meet and confer or in a subsequent discovery agreement.
In selecting appropriate search terms, consider the following:
- Don’t make your terms overly broad. Try to think of ways that the terms could give you false positives. For example, CDS worked on a case where one of the agreed-upon search terms turned out to be the last name of an attorney who was involved with the case. This not only produced a large number of false positives, but those false positives were also potentially privileged documents.
- Use terms that are as unique as possible so they only bring back documents that are potentially responsive.
- Be careful with wildcards and root extenders. Wildcards can increase the breadth of your search terms, such as searching for “import*” to find various forms of the verb import, such as “imported” or “importing.” This can be a very helpful way to broaden your search when needed, but be sure not to inadvertently make the search overly broad or it will bring back false positives. For example, the “import*” search would also hit on the words “important” and “importance.”
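The “import*” problem is easy to reproduce. The sketch below translates a trailing-wildcard term into a regular expression and compares it with a narrower pattern that lists the intended verb forms explicitly. The helper names and sample sentence are hypothetical, and review platforms have their own search syntax, so treat this as an illustration of the false-positive risk rather than any tool's actual behavior.

```python
import re


def wildcard_to_regex(term: str) -> re.Pattern:
    """Translate a search term with '*' wildcards into a whole-word,
    case-insensitive regular expression."""
    escaped = re.escape(term).replace(r"\*", r"\w*")
    return re.compile(rf"\b{escaped}\b", re.IGNORECASE)


def hits(pattern: re.Pattern, text: str) -> list[str]:
    """Lowercased words in `text` matched by `pattern`, in order of appearance."""
    return [m.group(0).lower() for m in pattern.finditer(text)]


SAMPLE = "We imported the goods. It is important to pay importing fees."

# Broad wildcard: also catches "important".
broad = wildcard_to_regex("import*")
print(hits(broad, SAMPLE))   # ['imported', 'important', 'importing']

# Narrower alternative: enumerate the verb and noun forms that are wanted.
narrow = re.compile(r"\bimport(?:s|ed|ing|ers?)?\b", re.IGNORECASE)
print(hits(narrow, SAMPLE))  # ['imported', 'importing']
```

Running a proposed term against a sample of the collection in this way is one inexpensive check for false positives before the terms are finalized.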
Filtering by date range is another common culling method, but you don’t want your date range to be too big or too small. If the range is too broad, it can defeat the purpose of having one, because it may not remove enough documents to justify the cost of culling. If the range is too narrow, you may have to “go back to the well” and look for more documents. Depending on how long it takes to realize that the range needs to be broadened, and on your vendor’s data retention policy, the original data may no longer be available, which adds time and cost if it has to be re-collected or restored from backup.
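A date filter itself is simple; the judgment lies in choosing the boundaries. This minimal sketch keeps documents whose date falls inside an inclusive range and reports what share of the collection was removed, which gives a quick read on whether a proposed range is too broad to be worth applying. The document tuples and dates are hypothetical, and the sketch assumes each document carries a single reliable date.

```python
from datetime import date


def cull_by_date(docs: list[tuple[str, date]], start: date, end: date) -> list[tuple[str, date]]:
    """Keep documents dated within [start, end], inclusive on both ends."""
    return [(name, d) for name, d in docs if start <= d <= end]


def share_removed(total: int, kept: int) -> float:
    """Fraction of the collection removed by the filter (0.0 if the collection is empty)."""
    return 0.0 if total == 0 else (total - kept) / total


docs = [
    ("contract.pdf", date(2019, 3, 14)),
    ("email_001.msg", date(2020, 6, 1)),
    ("email_002.msg", date(2020, 11, 30)),
    ("invoice.xlsx", date(2022, 1, 5)),
]

kept = cull_by_date(docs, date(2020, 1, 1), date(2020, 12, 31))
print([name for name, _ in kept])      # ['email_001.msg', 'email_002.msg']
print(share_removed(len(docs), len(kept)))  # 0.5
```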
These are just a few of the techniques available to cull data. One of the benefits of working with an eDiscovery service provider is that you can work together on customized workflows and leverage proprietary ECA tools. Depending on your needs, there may be additional steps that can be taken to reduce your data, saving you time and money.
The right Early Case Assessment toolkit can provide a variety of methods for investigating and quickly learning about your data, which can result in significant cost savings in terms of lower volumes of data to process and host as well as lower downstream cost during document review.
Contact the CDS Advisory Services team to discuss how you can streamline your next eDiscovery project.