eDiscovery Data Processing 101
By Brian Pick, Client Director at CDS, and Pete Lwin, Senior Project Engineer at CDS.
Before any raw or native data can be reviewed for eDiscovery, it must first be “processed.” For those new to eDiscovery, the jargon and steps involved in data processing can be confusing. Our goal is to demystify data processing by describing what happens during this phase of eDiscovery at a high level and explaining some common processing terms and phrases.
In many matters, especially those involving government agencies, parties will provide Processing Specifications (a set of instructions outlining how data should be processed) or Production Specifications (a set of instructions outlining how data must be produced). Production Specifications, or “specs,” often affect data processing because certain information or metadata that must be produced has to be captured at the outset, during processing.
During the Data Processing phase, documents are ingested into (copied into) a review platform (database), where the following steps take place:
- Native files are unpacked/expanded so that a single file becomes multiple separate records. For example, a single email with an attachment becomes two records: a parent email and a child attachment. This is what is meant by the commonly used term Extracting Attachments. Metadata from each record is extracted, preserved (retaining original values), and normalized (for example, all dates conform to the same format). Each extracted metadata value is placed into its own field, which becomes viewable and searchable within the review platform. This allows users to search for all emails from a certain individual, or to find only Microsoft Word documents. Finally, each separate record receives a Control Number or Document ID (a unique number used to track documents during review and again when they are produced or Bates stamped, possibly multiple times), so documents can be identified regardless of whether they share the same file name.
- Other common processing elements include:
  - Optical Character Recognition (OCR), which takes an image, often a scanned document or a PDF stored as a picture file rather than as text, and automatically detects each character and word, making the document searchable by keyword.
  - Embedded Object Extraction. Embedded objects are files contained within other files. Common examples are a Word document inserted into another Word document as a clickable icon, or an Excel file whose underlying data and numbers appear as a visual chart in a PowerPoint file. Extracting embedded objects means that the Excel file is extracted from the PowerPoint file and loaded as a separate record; the Excel file is considered a “child” of the PowerPoint. If embedded objects are not extracted, content and files could be produced that were not easily viewable during review. Certain embedded objects, called Inline Images (images pasted into the body of another document, such as email signature logos), are often excluded from the embedded object extraction process.
  - A Processing Time Zone, which is chosen to determine the time zone in which dates and times are displayed on documents that are imaged or produced, since a time zone is not typically specified.
- Using the extracted metadata fields, documents are then culled (filtered out or set to the side). The most common document culling techniques are: date filtering (only including documents within a certain time frame), deduplication (removing 100% identical documents that exist more than once within the same custodian’s files or across multiple custodians’ files), deNISTing (removing known system files that do not contain any user-generated content), and Search Terms (only including documents that contain a certain keyword or satisfy a search condition, such as Term X within 2 words of Term A). When deduplication is performed, in most cases a DupeCustodian field is created that captures the names of the custodians whose copy of a file was deduped out, so it can be determined who else had a copy of each document.
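
To make the culling step more concrete, here is a minimal Python sketch of deNISTing, date filtering, and deduplication with a DupeCustodian field. It is an illustration only, not how any particular processing platform works. The record layout (`control_number`, `custodian`, `date`, `content`), the `cull` function, and the sample data are all hypothetical. It also assumes that duplicates are identified by hashing a record's raw bytes with SHA-256; real tools may use other hash algorithms or hash selected metadata fields instead.

```python
import hashlib
from datetime import datetime, timezone


def cull(records, start, end, known_system_hashes=frozenset()):
    """Apply deNISTing, date filtering, and deduplication to a list of records."""
    survivors = {}  # content hash -> first record seen with that content
    for rec in records:
        digest = hashlib.sha256(rec["content"]).hexdigest()
        if digest in known_system_hashes:
            continue  # deNISTing: drop files whose hash is on the known-system list
        if not (start <= rec["date"] <= end):
            continue  # date filtering: drop records outside the time frame
        if digest in survivors:
            # Duplicate: keep the first copy and note who else held one.
            kept = survivors[digest]
            other = rec["custodian"]
            if other != kept["custodian"] and other not in kept["DupeCustodian"]:
                kept["DupeCustodian"].append(other)
            continue
        survivors[digest] = {**rec, "DupeCustodian": []}
    return list(survivors.values())


# Hypothetical sample data, for illustration only.
utc = timezone.utc
records = [
    {"control_number": "DOC-0001", "custodian": "Alice",
     "date": datetime(2021, 3, 1, tzinfo=utc), "content": b"Q1 forecast"},
    {"control_number": "DOC-0002", "custodian": "Bob",
     "date": datetime(2021, 3, 2, tzinfo=utc), "content": b"Q1 forecast"},
    {"control_number": "DOC-0003", "custodian": "Alice",
     "date": datetime(2019, 1, 1, tzinfo=utc), "content": b"old memo"},
    {"control_number": "DOC-0004", "custodian": "Bob",
     "date": datetime(2021, 5, 5, tzinfo=utc), "content": b"placeholder system file"},
]
system_hashes = {hashlib.sha256(b"placeholder system file").hexdigest()}

result = cull(
    records,
    start=datetime(2021, 1, 1, tzinfo=utc),
    end=datetime(2021, 12, 31, tzinfo=utc),
    known_system_hashes=system_hashes,
)
for rec in result:
    print(rec["control_number"], rec["custodian"], rec["DupeCustodian"])
```

With this sample data, only DOC-0001 survives: DOC-0002 is an exact duplicate, so Bob is recorded in DOC-0001's DupeCustodian field; DOC-0003 falls outside the date range; and DOC-0004 matches the list of known system-file hashes. Timezone-aware dates are used so that the date comparison does not depend on the machine's local time zone.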
Once all of these steps are completed, data can be uploaded (or published) to the document review platform and indexed. At this point, CDS Project Managers and Consultants help clients efficiently and cost-effectively decide on the best approach to accomplish review goals and get through their data sets. Learn more about CDS’ Early Case Assessment toolkit or contact us regarding how we can help you manage your eDiscovery.