Our Insights

Thought Leadership and Industry Trends

eDiscovery Data Processing 101

May 23, 2018

Before any raw or native data can be reviewed for eDiscovery, it must first be “processed.” For those new to eDiscovery, the jargon and steps involved in data processing can be confusing. Our goal is to demystify data processing by describing what happens during this phase of eDiscovery at a high level and explaining some common processing terms and phrases.

Initial Specifications:

In many matters, especially those involving government agencies, parties will provide Processing Specifications (a set of instructions outlining how data should be processed) or Production Specifications (a set of instructions outlining how data must be produced). Production Specifications, or “specs,” often affect data processing because certain information or metadata that will be produced must be captured at the outset, during processing.

Processing Steps:

During the Data Processing phase, documents are ingested (copied) into a review platform (database), where the following steps take place:

Native files are unpacked/expanded so that a single file becomes multiple separate records. For example, a single email with an attachment becomes two records: a parent email and a child attachment. This is what is meant by the commonly used term Extracting Attachments.
Metadata from each record is extracted, then both preserved (retaining original values) and normalized (for example, all dates conform to the same format).
Each piece of extracted metadata is placed into its own field, which becomes viewable and searchable within the review platform. This allows users to search for all emails from a certain individual, or to find only Microsoft Word documents.
Finally, each separate record receives a Control Number or Document ID (a unique number used to track documents during review and again when they are produced or Bates-stamped multiple times, so they can be identified even if they share a file name).
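The steps above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not how any particular processing tool works: the `Record` class, the `process_email` function, the `DOC` prefix, and the field names `DateSentOriginal` and `DateSent` are all assumptions made for the example, as is the choice to normalize dates to UTC in ISO 8601 format.

```python
from dataclasses import dataclass, field
from datetime import timezone
from email import message_from_bytes, policy
from email.message import EmailMessage
from email.utils import parsedate_to_datetime
from itertools import count
from typing import Iterator, List, Optional


@dataclass
class Record:
    control_number: str            # unique ID assigned during processing
    parent: Optional[str]          # control number of the parent record, if any
    metadata: dict = field(default_factory=dict)


def process_email(raw: bytes, ids: Iterator[int], prefix: str = "DOC") -> List[Record]:
    """Expand one email into a parent record plus one child record per attachment."""
    msg = message_from_bytes(raw, policy=policy.default)
    sent = parsedate_to_datetime(str(msg["Date"]))
    parent = Record(
        control_number=f"{prefix}{next(ids):07d}",
        parent=None,
        metadata={
            "From": str(msg["From"]),
            "Subject": str(msg["Subject"]),
            # Preserved: the value exactly as it appeared in the header.
            "DateSentOriginal": str(msg["Date"]),
            # Normalized: every date ends up in one format (UTC, ISO 8601).
            "DateSent": sent.astimezone(timezone.utc).isoformat(),
        },
    )
    records = [parent]
    for part in msg.iter_attachments():
        records.append(Record(
            control_number=f"{prefix}{next(ids):07d}",
            parent=parent.control_number,
            metadata={
                "FileName": part.get_filename(),
                "ContentType": part.get_content_type(),
            },
        ))
    return records


# Build a small sample email with one attachment (made-up data).
sample = EmailMessage()
sample["From"] = "alice@example.com"
sample["To"] = "bob@example.com"
sample["Subject"] = "Q3 forecast"
sample["Date"] = "Wed, 23 May 2018 09:30:00 -0400"
sample.set_content("See attached.")
sample.add_attachment(b"placeholder bytes", maintype="application",
                      subtype="octet-stream", filename="forecast.xlsx")

records = process_email(sample.as_bytes(), count(1))
for record in records:
    print(record.control_number, record.parent, record.metadata)
```

The single email yields two records: the parent email and its child attachment, linked through the `parent` field. Because every value sits in its own named field, a search such as "all emails from a certain individual" reduces to a filter on `From`.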
Other common processing elements include:

Optical Character Recognition (OCR) takes an image, often a scanned document or a PDF that is stored as a picture rather than as text, and automatically detects each character and word, making the document searchable by keyword.
Embedded Object Extraction is also performed. Embedded objects are files contained within other files. The most common examples are a Word document inserted into another Word document as a clickable icon, and an Excel file whose underlying data and numbers are embedded as a visual chart in a PowerPoint file. Extracting embedded objects means that, in the second example, the Excel file is extracted from the PowerPoint file and loaded as a separate record. The Excel file would be considered a “child” of the PowerPoint. If embedded objects are not extracted, content and files could be produced that were not easily viewable during review. Certain embedded objects called Inline Images (images pasted into the body of another document, such as email signature logos) are often excluded from the embedded object extraction process.
A Processing Time Zone is also chosen to determine the time zone in which dates and times are displayed on documents that are imaged or produced, since a time zone is not typically specified.
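To see why the Processing Time Zone matters, consider how the same stored timestamp is displayed under different choices. The sketch below is illustrative only: the zone names and the `render_timestamp` helper are assumptions, and it uses fixed UTC offsets to stay dependency-free, whereas a real workflow would more likely use named zones that account for daylight saving time (for example, via Python's `zoneinfo` module).

```python
from datetime import datetime, timedelta, timezone

# Hypothetical processing time zones, expressed as fixed UTC offsets.
PROCESSING_ZONES = {
    "UTC": timezone.utc,
    "Eastern Daylight": timezone(timedelta(hours=-4), "EDT"),
    "Pacific Daylight": timezone(timedelta(hours=-7), "PDT"),
}


def render_timestamp(stored_utc: datetime, zone_name: str) -> str:
    """Format a stored UTC timestamp as it would display in the chosen zone."""
    zone = PROCESSING_ZONES[zone_name]
    return stored_utc.astimezone(zone).strftime("%Y-%m-%d %H:%M %Z")


# One email, sent at a single moment in time (made-up value).
sent = datetime(2018, 5, 24, 2, 15, tzinfo=timezone.utc)

rendered = {name: render_timestamp(sent, name) for name in PROCESSING_ZONES}
for name, text in rendered.items():
    print(f"{name:>16}: {text}")
```

The email displays as sent on May 24 in UTC but on May 23 in the two U.S. zones, which is why the zone should be agreed before documents are imaged or date filters are applied.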

Using the extracted metadata fields, documents are then culled (filtered out or set to the side). The most common document culling techniques are: date filtering (only including documents within a certain time frame), deduplication (removing 100% identical documents that might exist within the same custodian’s files or in multiple custodians’ files), deNISTing (removing known system files that do not contain any user-generated content), and Search Terms (only including documents that contain a certain keyword or meet a search condition, such as Term X within 2 words of Term A). When deduplication is performed, in most cases a DupeCustodian field is created that captures the names of the custodians whose copy of a file was deduped out, so it can be determined who else had a copy of each document.
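The culling techniques above can be combined in a short sketch. Everything here is an assumption made for illustration: the document dictionaries, the `cull` and `within_n_words` helpers, the use of SHA-256 over file content to define "100% identical," the small hash set standing in for a reference list of known system files, and the reading of "within N words" as a difference of at most N in word position. Real platforms may define each of these differently.

```python
import hashlib
import re
from datetime import date


def file_hash(content: bytes) -> str:
    """Fingerprint a file; identical content always yields the same hash."""
    return hashlib.sha256(content).hexdigest()


def within_n_words(text: str, term_a: str, term_b: str, n: int) -> bool:
    """True if term_a and term_b appear at most n word positions apart."""
    words = re.findall(r"\w+", text.lower())
    pos_a = [i for i, w in enumerate(words) if w == term_a.lower()]
    pos_b = [i for i, w in enumerate(words) if w == term_b.lower()]
    return any(abs(a - b) <= n for a in pos_a for b in pos_b)


def cull(docs, start, end, known_system_hashes):
    """Apply date filtering, deNISTing, and deduplication, in that order."""
    kept = {}
    for doc in docs:
        if not (start <= doc["date"] <= end):
            continue                      # date filter: outside the time frame
        digest = file_hash(doc["content"])
        if digest in known_system_hashes:
            continue                      # deNIST: known system file
        if digest in kept:                # duplicate of a document already kept
            first = kept[digest]
            name = doc["custodian"]
            if name != first["custodian"] and name not in first["DupeCustodian"]:
                first["DupeCustodian"].append(name)
            continue
        kept[digest] = {**doc, "DupeCustodian": []}
    return list(kept.values())


# Made-up sample data: two identical memos, one old file, one system file.
memo = b"Budget memo: revenue forecast revised upward."
system_file = b"placeholder system file bytes"
docs = [
    {"custodian": "Alice", "date": date(2018, 3, 1), "content": memo},
    {"custodian": "Bob", "date": date(2018, 3, 2), "content": memo},
    {"custodian": "Alice", "date": date(2016, 1, 5), "content": b"Old newsletter"},
    {"custodian": "Bob", "date": date(2018, 4, 1), "content": system_file},
]

survivors = cull(docs, date(2018, 1, 1), date(2018, 12, 31),
                 known_system_hashes={file_hash(system_file)})
hits = [d for d in survivors
        if within_n_words(d["content"].decode(), "revenue", "revised", 2)]

for doc in hits:
    print(doc["custodian"], doc["DupeCustodian"])
```

Only Alice's copy of the memo survives, and its DupeCustodian field records that Bob also held a copy, so that information is not lost when Bob's duplicate is removed.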

Once all of these steps are completed, data can be uploaded (or published) to the document review platform and indexed. At this point, CDS Project Managers and Consultants help clients efficiently and cost-effectively decide on the best approach to accomplish review goals and get through their data sets. Learn more about CDS’ Early Case Assessment toolkit or contact us regarding how we can help you manage your eDiscovery.

About the Author

CDS Staff

Our leadership team and advisory consultants, project managers, and technical experts assist clients through all phases of the eDiscovery process.
