Our Insights

Thought Leadership and Industry Trends

CDS Study: Custom Deduping Solution is Big Win for Client

May 1, 2019

Synopsis:

When a data recollection introduced formatting differences that complicated deduplication, CDS was hired to develop custom deduplication options to maximize the efficiency of a large-scale managed review.

Challenge:

Our client, an Am Law 100 firm, had an end-client who self-collected data, which was then sent to another vendor for processing, including extracting date and domain values and searching for keywords. The case team realized some documents were missing, so the end-client recollected the data. This recollection resulted in two footers on every document. The case team then had the recollection processed and deduplicated against the original collection, but deduplication failed because of several problems, including the double footers and differences in header formatting, line breaks, and date/subject field values (some of which were null). The combination of these content and formatting issues meant that traditional deduplication was not an option. CDS was brought in to find a solution so the client would not have to spend time and money recollecting and/or reprocessing one or both sets of data (and potentially restarting their deduplication efforts from scratch).
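To illustrate why exact-match deduplication breaks down in a situation like this, the following is a minimal sketch rather than the processing vendor's actual method. Processing tools typically hash selected metadata fields and body text; this sketch hashes the whole text for simplicity. The sample emails, the `content_hash` helper, and the footer text are all hypothetical.

```python
import hashlib


def content_hash(text: str) -> str:
    """Hash the exact text, as a simple exact-match deduplication would."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Hypothetical sample data, not taken from the matter described here.
original = (
    "Subject: Q3 forecast\n"
    "Date: 2018-04-02\n"
    "\n"
    "Please see attached.\n"
    "-- Confidential --"
)

# The same message after a recollection: different line breaks,
# a null date value, and a second footer.
recollected = (
    "Subject: Q3 forecast\r\n"
    "Date:\r\n"
    "\r\n"
    "Please see attached.\r\n"
    "-- Confidential --\r\n"
    "-- Confidential --"
)

# Any one of these differences is enough to change the hash,
# so the two copies are not recognized as duplicates.
hashes_match = content_hash(original) == content_hash(recollected)
print(hashes_match)
```

Because the comparison is all-or-nothing, a single extra footer or an empty date field makes two copies of the same email look unrelated.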

Solution:

The CDS team used several technologies to develop a solution for the client. Email threading by itself produced only minimal results, so we combined Relativity email threading structured analytics with concept clustering to force similar clusters into the same batches and streamline the review workflow. In addition, we performed quality checks to ensure the Email Thread Group relational field pulled in all family members along with their clusters. In effect, we created a custom deduplication by grouping together similar email threads that would not otherwise have threaded automatically and/or deduplicated out.

Results:

By using Relativity’s email threading and analytics tools, our team identified and culled out tens of thousands of duplicative and/or irrelevant system files from priority review, saving the client significant time and money.

We started with 861,791 documents, analyzed 558,349 email documents, identified 436,490 unique emails, and culled out 54,447 duplicative emails. For priority review, we further culled by client date and relevance terms, ultimately analyzing 83,409 email documents, which yielded 75,822 inclusive emails and 7,587 duplicative emails. We also created an analytics index grouping to streamline review via threaded/chronological batches.

In parallel, our Advisory Services team ran additional custom deduplication by focusing on client-requested fields and using field value normalization to manually “deduplicate” a population of around 125,000 parent emails (from the second, later-in-time recollected dataset noted previously). This identified around 83,000 duplicative parent emails and around 42,000 non-duplicative/unique parent emails. Ultimately, we identified approximately 120,000 emails plus family/attachments to remove from review as duplicative. We also identified 77,834 JAR/system/junk email files, which we excluded from all final batch results.
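The general idea of deduplicating on normalized field values can be sketched as follows. This is an illustrative example only, not the workflow CDS ran: the field names (`from`, `subject`, `sent`), the normalization rules, and the sample records are all assumptions, since the client-requested fields are not listed here.

```python
import re


def normalize(value):
    """Lowercase, collapse whitespace, and treat a null value as empty."""
    if value is None:
        return ""
    return re.sub(r"\s+", " ", value).strip().lower()


def dedupe_key(email):
    """Build a comparison key from hypothetical, assumed fields."""
    return (
        normalize(email.get("from")),
        normalize(email.get("subject")),
        normalize(email.get("sent")),
    )


def split_duplicates(emails):
    """Keep the first email seen for each key; flag later ones as duplicates."""
    seen = set()
    unique, duplicates = [], []
    for email in emails:
        key = dedupe_key(email)
        if key in seen:
            duplicates.append(email)
        else:
            seen.add(key)
            unique.append(email)
    return unique, duplicates


# Hypothetical parent emails; ids 1 and 2 differ only in case and whitespace.
emails = [
    {"id": 1, "from": "Ann@Example.com", "subject": "Q3  Forecast", "sent": "2018-04-02 09:15"},
    {"id": 2, "from": "ann@example.com ", "subject": "q3 forecast\r\n", "sent": "2018-04-02 09:15"},
    {"id": 3, "from": "bob@example.com", "subject": "Q3 forecast", "sent": "2018-04-02 10:00"},
]

unique, duplicates = split_duplicates(emails)
print([e["id"] for e in unique], [e["id"] for e in duplicates])
```

In this sketch a null field normalizes to an empty string, so records with missing values can collide with one another. A real workflow would need a deliberate rule for nulls, which is one reason results like these are usually quality-checked before documents are removed from review.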

Contact CDS today to learn about how our suite of Analytics tools can be used to help you in your next matter.

About the Author

Devon Crosbie, Esq

Devon Crosbie is a UNC School of Law graduate and Relativity Master who began his career as a licensed attorney and eDiscovery/litigation support professional in 2007. Since then, he has been strategically leveraging a broad array of tools – while coordinating all aspects of the EDRM process, from data retention, collection, privacy, and security, through production and presentation – to help clients and internal teams alike produce defensible, effective workflows and results.
