CDS Study: Custom Deduping Solution is Big Win for Client

May 1, 2019


When a recollection of data introduced different formats and deduplication complications, CDS was hired to develop custom deduplication options to maximize efficiency of a large-scale managed review.


Our client, an Am Law 100 firm, had an end-client who self-collected data, which was then sent to another vendor for processing: extracting date and domain values and searching for keywords. The case team realized some documents were missing, so the end-client recollected the data. This recollection resulted in two footers on every document. The case team then had the new recollection processed and deduplicated against the original collection, but deduplication failed because of several problems, including the double footers and differences in header formatting, line breaks, and date/subject field values (some of which were null). Together, these content and formatting issues ruled out traditional deduplication. CDS was brought in to find a solution and spare the client the time and expense of recollecting and/or reprocessing one or both sets of data (and potentially restarting their deduplication efforts from scratch).


The CDS team used several technologies to develop a solution for the client. Email threading by itself produced only minimal results, so we combined Relativity’s email threading structured analytics with concept clustering to force similar clusters to be batched together and streamline the review workflow. In addition, we performed quality checks to ensure the Email Thread Group relational field pulled in all family members along with their clusters. In effect, we created a custom deduplication by grouping together similar email threads that would not otherwise have threaded automatically or deduplicated out.
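The batching idea can be illustrated with a minimal Python sketch. It is not the Relativity workflow itself: the field names `cluster` and `thread_group`, the function `build_batches`, and the batch size are all hypothetical, chosen only to show how documents sharing a cluster and thread group might be kept together when review batches are cut.

```python
from collections import defaultdict


def build_batches(docs, batch_size=500):
    """Cut review batches without splitting a (cluster, thread group) pair.

    `docs` is a list of dicts with hypothetical keys "id", "cluster",
    and "thread_group". A group larger than `batch_size` becomes its
    own oversized batch rather than being split.
    """
    groups = defaultdict(list)
    for doc in docs:
        groups[(doc["cluster"], doc["thread_group"])].append(doc["id"])

    batches, current = [], []
    # Sorting by (cluster, thread_group) keeps similar clusters adjacent.
    for key in sorted(groups):
        ids = groups[key]
        if current and len(current) + len(ids) > batch_size:
            batches.append(current)
            current = []
        current.extend(ids)
    if current:
        batches.append(current)
    return batches
```

Keeping each group whole means a reviewer sees every message in a thread, and its near neighbors in the same cluster, in one sitting, at the cost of slightly uneven batch sizes.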


By using Relativity’s email threading and analytics tools, our team identified and culled out tens of thousands of duplicative and/or irrelevant system files from priority review, saving the client significant time and money.

We started with 861,791 documents, analyzed 558,349 email documents, identified 436,490 unique emails, and culled out 54,447 duplicative emails. For priority review purposes, we further culled by client date and relevance terms, ultimately analyzing 83,409 email documents, which yielded 75,822 inclusive emails and 7,587 duplicative emails. We also created an analytics index grouping to streamline review via threaded/chronological batches.

Concurrently, our Advisory Services team ran additional custom deduplication, focusing on client-requested fields and using field value normalization to manually “deduplicate” a population of around 125,000 parent emails (from the second, later-in-time recollected dataset noted previously). This identified around 83,000 duplicative parent emails and around 42,000 unique parent emails. Ultimately, we identified approximately 120,000 emails plus family/attachments to remove from review as duplicative. We also identified 77,834 JAR/system/junk email files, which we were able to exclude from all final batch results.
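To make the field value normalization step concrete, here is a minimal Python sketch of the general technique. It is an illustration under stated assumptions, not the workflow CDS ran: the field names (`sender`, `subject`, `sent_date`), the normalization rules, and the helper functions are all hypothetical. The idea is to normalize a chosen set of fields, hash the result into a key, and treat any later email with the same key as a duplicate.

```python
import hashlib
import re


def normalize(value):
    """Collapse whitespace and case; treat a null value as empty."""
    if value is None:
        return ""
    return re.sub(r"\s+", " ", value).strip().lower()


def dedup_key(email, fields=("sender", "subject", "sent_date")):
    """Hash the normalized values of the selected (hypothetical) fields."""
    joined = "\x1f".join(normalize(email.get(f)) for f in fields)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def split_duplicates(emails, fields=("sender", "subject", "sent_date")):
    """Return (unique, duplicates), keeping the first email seen per key."""
    seen = set()
    unique, duplicates = [], []
    for email in emails:
        key = dedup_key(email, fields)
        if key in seen:
            duplicates.append(email)
        else:
            seen.add(key)
            unique.append(email)
    return unique, duplicates
```

Because the key is built from selected fields rather than the full document, differences such as extra footers or line breaks in the body do not affect the match. The trade-off is that null or sparsely populated fields can make distinct emails collide, so results from an approach like this would need quality checks before anything is removed from review.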

Contact CDS today to learn about how our suite of Analytics tools can be used to help you in your next matter.

About the Author

Devon Crosbie, Esq

Devon Crosbie is a UNC School of Law graduate and Relativity Master who began his career as a licensed attorney and eDiscovery/litigation support professional in 2007. Since then, he has been strategically leveraging a broad array of tools – while coordinating all aspects of the EDRM process, from data retention, collection, privacy, and security, through production and presentation – to help clients and internal teams alike produce defensible, effective workflows and results.
