    CDS Case Study: Custom Deduping Solution is Big Win for Client

    May 1, 2019


    When a recollection of data introduced different formats and deduplication complications, CDS was hired to develop a custom deduplication approach to maximize the efficiency of a large-scale managed review.


    Our client, an Am Law 100 firm, had an end-client who self-collected data, which was then sent to another vendor for processing: extracting the date and domain values and searching for keywords. The case team realized documents were missing, so the end-client recollected the data. The recollection resulted in two footers on every document. The case team had the recollection processed and deduplicated against the original collection, but deduplication failed for several reasons, including the double footers and differences in header formatting, line breaks, and Subject/Sent Date field values (e.g., some were null). Together, these content and formatting issues meant that traditional deduplication was not an option. CDS was brought in to find a solution so the client would not have to spend time and money recollecting and reprocessing one or both data sets and potentially restarting deduplication from scratch.


    The CDS team used several technologies to develop a solution for the client. Email threading by itself produced only minimal results, so we combined Relativity’s email threading (structured analytics) with concept clustering to batch similar clusters together and streamline the review workflow. In addition, we performed QC to ensure the Email Thread Group relational field pulled in all family members and similar concept clusters. In effect, we created a custom deduplication by grouping together similar email threads that would not otherwise have threaded or deduplicated automatically.
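
    The batching idea described above can be illustrated with a minimal Python sketch. It is not Relativity's API and not the actual CDS workflow; the field names `cluster`, `thread_group`, and `sent`, and the function `build_batches`, are assumptions made for illustration. The sketch keeps every thread group whole, places groups from the same cluster next to each other, and orders each group chronologically.

```python
from collections import defaultdict


def build_batches(documents, batch_size):
    """Pack documents into review batches without splitting a thread group.

    Each document is a dict with hypothetical keys:
    'cluster', 'thread_group', and 'sent' (an ISO date string or None).
    """
    # Group documents by (cluster, thread group) so related items stay together.
    groups = defaultdict(list)
    for doc in documents:
        groups[(doc["cluster"], doc["thread_group"])].append(doc)

    batches, current = [], []
    # Sorting the keys keeps groups from the same cluster adjacent.
    for key in sorted(groups):
        # Chronological order within a group; null dates sort first.
        members = sorted(groups[key], key=lambda d: d.get("sent") or "")
        # Start a new batch rather than split a group across two batches.
        if current and len(current) + len(members) > batch_size:
            batches.append(current)
            current = []
        current.extend(members)
    if current:
        batches.append(current)
    return batches
```

    One design choice in this sketch: a thread group larger than `batch_size` becomes its own oversized batch, on the assumption that keeping a thread with one reviewer matters more than uniform batch sizes.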


    By using Relativity’s email threading and analytics tools, our team identified thousands of system-file documents and culled them from priority review, saving the client significant time and money.

    We started with 861,791 documents, analyzed 558,349 email documents, identified 436,490 unique emails, and culled 54,447 duplicative emails. For initial review purposes, we culled further by client date and relevance terms, ultimately analyzing 83,409 email documents, of which 75,822 were inclusive emails and 7,587 were duplicative. We also created an analytics index grouping to streamline review via threaded/chronological batches.

    At the same time, our Advisory Services team ran additional custom deduplication focused on client-requested fields, using field value normalization to manually “deduplicate” a population of ~125,000 parent emails (from the second, later-in-time data set noted above). This identified ~82,000 duplicative parent emails and ~42,000 unique parent emails. Ultimately, we identified approximately 120,000 emails plus family attachments to remove from review as duplicative. We also identified 77,834 JAR, system, and junk email files, which we excluded from all final batch results.
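
    As a rough illustration of deduplication by field value normalization, the Python sketch below normalizes a few fields and hashes the result, so copies that differ only in case, line breaks, or repeated footers share one key. It is a sketch under stated assumptions, not the process CDS ran: the field names, the `FOOTER_RE` pattern, and the helper functions are hypothetical, and the fields the client actually requested are not given here.

```python
import hashlib
import re

# Hypothetical footer pattern; the real footers in the matter are not known.
FOOTER_RE = re.compile(r"(?im)^[ \t]*confidential\b.*$")


def normalize_text(value):
    """Lowercase, drop assumed footer lines, and collapse all whitespace."""
    if value is None:
        return ""  # treat null field values as empty
    value = FOOTER_RE.sub("", value)
    return re.sub(r"\s+", " ", value).strip().lower()


def dedupe_key(email, fields=("sender", "subject", "body")):
    """Hash the normalized values of the chosen (assumed) fields."""
    parts = [normalize_text(email.get(field)) for field in fields]
    # The unit separator keeps adjacent field values from running together.
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()


def split_duplicates(emails, fields=("sender", "subject", "body")):
    """Return (unique, duplicates), keeping the first email seen per key."""
    seen = set()
    unique, duplicates = [], []
    for email in emails:
        key = dedupe_key(email, fields)
        if key in seen:
            duplicates.append(email)
        else:
            seen.add(key)
            unique.append(email)
    return unique, duplicates
```

    The default field list leaves out Sent Date because some of those values were null in this matter; comparing on a field that is populated in one copy and null in the other would keep true duplicates apart.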

    Contact CDS today to learn how our suite of Analytics tools can help you in your next matter.