
    Our Insights

    Thought Leadership and Industry Trends

    Data Analytics FAQs with CDS’ Advisory Services Team

    August 19, 2019

    Data analytics tools offer significant opportunities to improve the accuracy and efficiency of document reviews. Analytics features such as Technology Assisted Review (“TAR”) can be used to supplement or even replace a human-based linear review. Others, such as email threading or near-duplicate identification, can be run quickly to help accelerate the review process. While these tools have sophisticated capabilities, it is important to understand when and how they should be used. The most frequent questions we get from clients include the following:

    1. Can we de-duplicate our documents with a near duplicates analysis?

    Our team is asked this question a few times each week. Clients frequently find themselves needing to review a set of documents where a forensic, hash-value deduplication is not available (e.g., images and OCR text rather than native files). In other cases, clients do have native files but are nonetheless seeing many documents with duplicative content.

    A near-duplicates analysis may work depending on the circumstances. The first and most important factor is the type of documents involved. Near-duplicates analysis is a text comparison, so it will not be helpful for scanned documents where the OCR engine has attempted to read handwriting and produced unreliable text. It is also not the best tool for email, but fortunately email threading has a duplicate-identification component of its own.

    After confirming the data is appropriate for near-duplicates analysis, the next question to ask is: what is the goal of the review?

    • Production of documents. In a review for production to an opposing party in civil litigation or a regulatory investigation, near duplicates are unlikely to be acceptable review exclusions. The analytics software has no way to tell whether the 5% or 10% difference in text between two documents is relevant, so excluding anything less than an exact match risks excluding relevant and unique content. A different workflow, such as predictive coding, might be worth considering.
    • Investigation or fact development. In a review without production requirements, removing near duplicates might be a great solution. When you do find particularly interesting documents, it is easy to pull up their near duplicates to confirm nothing important was missed.
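    To make the “text comparison” point above concrete, here is a toy sketch of how a near-duplicates score can be computed. The shingling-and-Jaccard approach below is an illustrative assumption, not how any particular review platform actually implements the feature:

```python
# Toy near-duplicate scoring: compare documents as sets of overlapping
# word "shingles". Real review platforms use more sophisticated methods;
# this sketch only illustrates why the analysis is a text comparison.

def shingles(text: str, k: int = 3) -> set:
    """Break a document into overlapping k-word shingles."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def similarity(a: str, b: str) -> float:
    """Jaccard similarity of the two shingle sets (0.0 to 1.0)."""
    sa, sb = shingles(a), shingles(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

doc1 = "Please review the attached draft agreement and send comments by Friday."
doc2 = "Please review the attached final agreement and send comments by Friday."
gibberish = "xq7 zzv 93k qpl"  # what OCR of handwriting can look like

print(similarity(doc1, doc2))       # two drafts sharing most of their text
print(similarity(doc1, gibberish))  # no shared text at all
```

    Because the garbled OCR of handwriting shares no text with real prose, its similarity to everything is zero, which is why near-duplicates analysis cannot help with those documents.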
    2. We’re interested in using TAR 2.0 to automatically prioritize our document review. How do we incorporate document families into the workflow?

    For those unfamiliar with the term, TAR 2.0 (or Continuous Active Learning) is a machine-learning-driven review workflow in which the software learns from reviewers’ prior coding decisions to serve the documents most likely to be responsive to your review team. Because of the tremendous efficiency gains it can create, it is becoming increasingly popular with review teams.
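    The Continuous Active Learning loop can be sketched in a few lines. The word-counting “model” below is a stand-in assumption for the statistical classifier a real TAR 2.0 engine would use, and the documents are invented for illustration:

```python
# A highly simplified sketch of the Continuous Active Learning loop:
# code a document, update the model, and always present the
# highest-scoring unreviewed document next.

from collections import Counter

docs = {
    "d1": "merger pricing discussion with counsel",
    "d2": "lunch order for the team offsite",
    "d3": "draft merger agreement pricing terms",
    "d4": "parking pass renewal form",
}
truly_relevant = {"d1", "d3"}  # stand-in for the reviewer's judgment

relevant_words = Counter()  # toy "model": words seen in relevant documents
reviewed, found = [], []

def score(doc_id: str) -> int:
    """Score a document by its overlap with words from relevant documents."""
    return sum(relevant_words[w] for w in set(docs[doc_id].split()))

# Seed with one coded document, then loop until everything is reviewed.
relevant_words.update(docs["d1"].split())
reviewed.append("d1")
found.append("d1")

while len(reviewed) < len(docs):
    next_doc = max((d for d in docs if d not in reviewed), key=score)
    reviewed.append(next_doc)
    if next_doc in truly_relevant:  # reviewer codes the document relevant
        found.append(next_doc)
        relevant_words.update(docs[next_doc].split())

print(reviewed)  # relevant documents surface before irrelevant ones
```

    Even in this toy version, the second relevant document is served before either irrelevant one, which is the source of the efficiency gains described above.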

    We are often asked how to handle document families while taking advantage of this workflow. In a traditional review, reviewers are presented with batches in which each email is immediately followed by its attachments. If any document in a family is relevant, the entire family is eligible for production, because this context is typically considered necessary to meet the “usual course of business” formatting requirement in FRCP 34(b)(2)(E)(i).

    As noted earlier, in a TAR 2.0 workflow the software presents the review team with the highest-scoring (most likely relevant) documents. Each document is scored independently of its attachments, so it is unlikely that a reviewer will see a parent email and its attachment consecutively as they would in traditional review batches. Consider, as an example, an email between two co-workers on the same team that reads only “See attached” and carries a highly relevant attachment. The high-scoring attachment is presented to the review team early in the review and marked relevant. The parent email, on the other hand, likely resembles hundreds of other emails whose attachments are irrelevant to the case at hand. It receives a low score from the TAR software, perhaps so low that it falls below a review cut-off chosen after a sufficient percentage of the responsive documents has already been found.

    Clients are reasonably concerned about emails that might be simultaneously excluded from review and eligible for production, but there is some good news. First, these types of families are usually a small minority. While not all documents in a family score equally, family members generally score highly enough to merit review before a cut-off decision is made. Second, there are some easy workflow options to ensure these documents are reviewed before production.

    • For smaller reviews, it might be easiest to use Relativity’s related items panel and quickly review the attachments of any relevant documents. Once that is complete, the reviewer can navigate back to the automated stream.
    • For larger reviews, a quick and easy approach is to create a search for any unreviewed attachments of relevant documents and have the team review them, including for sensitive information, once the TAR cut-off has been reached.
    • For large and more complicated reviews where the team is coding for several issues during the first pass, an alternative workflow that has proven to be very efficient is to have a small team dedicated to the TAR 2.0 training and coding only for relevance. A “second level” team can review the highest scoring documents plus their families to apply the first pass at privilege, confidentiality, issues, etc.
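    The search described in the second bullet amounts to a simple family-completion check, sketched below. The field names (family_id, coded) are hypothetical, not any review platform’s actual schema:

```python
# Find uncoded members of production-eligible families: any document in a
# family with at least one relevant member that was never reviewed needs
# eyes before production. Field names here are illustrative only.

documents = [
    {"id": "E1", "family_id": "F1", "coded": None},          # "See attached" parent
    {"id": "A1", "family_id": "F1", "coded": "relevant"},     # hot attachment
    {"id": "E2", "family_id": "F2", "coded": "not relevant"},
    {"id": "A2", "family_id": "F2", "coded": None},
]

# Families with at least one document coded relevant are production-eligible.
relevant_families = {d["family_id"] for d in documents if d["coded"] == "relevant"}

# Any uncoded member of those families must be reviewed before production.
needs_review = [
    d["id"] for d in documents
    if d["family_id"] in relevant_families and d["coded"] is None
]
print(needs_review)  # flags the low-scoring "See attached" parent email
```

    Note that A2 is not flagged: its family contains no relevant document, so it is not production-eligible and can safely remain unreviewed.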

    As data analytics tools continue to evolve, it is increasingly important to consult eDiscovery experts for guidance on how they may help streamline your review. Contact CDS today to learn about how our suite of Analytics tools can assist you in your next matter.

    About the Author

    Dan Diette, Esq., Data Scientist, CDS

    Dan is an eDiscovery Data Scientist specializing in Technology Assisted Review and eDiscovery Analytics at CDS. He has over five years of experience focusing on the application of machine learning and predictive coding technology to eDiscovery. He has designed TAR workflows and validation reporting that have been presented to and approved by the DOJ and FTC for HSR Second Requests, as well as in multi-billion-dollar civil litigation in federal courts. Dan has managed the Technology Assisted Review process for all of CDS’s large and complex Second Request reviews during his tenure at CDS. Dan is also an attorney admitted to the New York State Bar.

         ddiette@cdslegal.com