Our Insights

Thought Leadership and Industry Trends


Technology, Tools and Techniques for Mining Cloud Archives

Jul 22, 2021

In our recent webinar, Crate Digging: Finding Relevant Materials in a Universe of Accessible Cloud Backup, experts discussed the latest technology and best practices for handling archived data. They also discussed how the accessibility of cloud storage will affect the courts’ interpretation of proportionality in eDiscovery.

Read on for a lightly edited transcript of their conversation, Part II of a series. To start with Part I, click here. To watch the entire recorded webinar, click here.

Moderator
Chris O’Connor, Director of eDiscovery Strategy, CDS

Panelists
William Wallace Belt, Jr., Managing Director, CDS
Lindsey Lanier, Product Management Director, VerQu, A Relativity Company
Pete Lwin, Senior Project Engineer, CDS
John Rabiej, partnering with GW Humphreys Complex Litigation Center
Adam Rogers, Senior Forensic Analyst, CDS

Relativity Solutions for Accessing Archived Data

Chris O’Connor:
I’d like to ask Lindsey Lanier, Product Management Director at VerQu, how we access archived data. Relativity Collect, which is online now in the RelativityOne platform, includes a vast array of connectors. What are you guys doing in terms of connecting disparate systems? Can you explain how we’re able to connect to these systems?

Lindsey Lanier:
VerQu started out as a data migration company. Our first flagship product was Phoenix Migrator. We started out with Enterprise Vault, and as more opportunities came through to pull data out of these legacy systems and into more modern, next-generation cloud archives, we continued creating new sources and destinations to support our clients. So, our experience is in these projects and in directly accessing these archives.

We’ve seen a huge benefit through Relativity, and bringing our tools and teams together gives customers a way to more easily pull content from a regulated organization’s system of record, like Proofpoint or Enterprise Vault, either proactively or reactively.

Our VerQu product suite allows for filtering via API or SDK on these archive targets based on various criteria like date range, custodian and aliases, or specific message classes, as Adam mentioned. In certain cases, we’d like to be able to target specific streams like Teams, Slack and email, and then query the indexes directly with search terms so that we can retrieve smaller data sets to push into RelativityOne for processing, indexing and review.
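
The filtering criteria described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the VerQu or Relativity API: the `ArchiveItem` record, its field names, and the `select_items` function are all hypothetical, and a real archive would apply these filters server-side through its own API or SDK.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ArchiveItem:
    """Hypothetical record describing one archived message."""
    item_id: str
    addresses: frozenset   # every sender/recipient address on the item
    sent: datetime         # timezone-aware sent time
    message_class: str     # e.g. "IPM.Note" for ordinary email


def select_items(items, aliases, start, end, message_classes):
    """Keep items that match a custodian alias, fall inside
    [start, end), and carry one of the wanted message classes."""
    wanted = {alias.lower() for alias in aliases}
    return [
        item for item in items
        if start <= item.sent < end
        and item.message_class in message_classes
        and wanted & {address.lower() for address in item.addresses}
    ]


# Example: one custodian known by two aliases, calendar year 2021, email only.
utc = timezone.utc
sample = [
    ArchiveItem("a", frozenset({"Jane.Doe@example.com"}),
                datetime(2021, 3, 1, tzinfo=utc), "IPM.Note"),
    ArchiveItem("b", frozenset({"jdoe@example.com"}),
                datetime(2019, 3, 1, tzinfo=utc), "IPM.Note"),
]
hits = select_items(
    sample,
    aliases=["jane.doe@example.com", "jdoe@example.com"],
    start=datetime(2021, 1, 1, tzinfo=utc),
    end=datetime(2022, 1, 1, tzinfo=utc),
    message_classes={"IPM.Note"},
)
```

Narrowing by custodian, date and message class before running search terms is what keeps the set pushed downstream small.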

Chris O’Connor:
Right. So, how are you connecting these things like Office 365, for Teams and Slack? What do those connections look like? Are they simple APIs? Is there a more advanced approach?

Lindsey Lanier:
In addition to pulling data out of archives, the VerQu communication capture platform, Hydra, enables a lot of clients to pull data from internal and external communication streams. So, we can pull things like email, enterprise social, chat and financial messaging streams, from sources like Office 365, Microsoft Exchange or Microsoft Teams. Bloomberg’s obviously a big one. As we continue to work through the VerQu and Relativity integration, we’re going to end up with more than 30 cloud data sources for RelOne Collect and Relativity Trace. It’s just a seamless ingestion to the platform. And by leveraging Relativity Short Message Format (RSMF) for more sources rather than treating the data like email as we have historically, we no longer lose the context of communication and collaboration data.

For those that aren’t familiar with Relativity Short Message Format (RSMF), it was a standard developed by Relativity specifically for chat type communications, and what that means is it provides extra context for conversations like the threads, the reactions, and the emojis that are used in communication, the edits and the deletes that happen, and basically just giving a full rich review experience to make the reviewers’ and the customers’ lives easier.

Chris O’Connor:
Great. You talked a little bit about being able to run those terms. I know a number of these systems leverage different indexing profiles, and it’s not just the way they allow for searching; the indices themselves are built on different systems. So, as opposed to running terms in Slack and then trying to run terms in Teams, which use different indices, this opportunity allows us to connect to these sources, apply some restrictive collection techniques such as date restrictions per individual, and then bring the data back in and run a single search. What is the goal of being able to put this together so that we can examine data in a reasonable way?

I know a lot of people from the defense bar have concerns about the ballooning costs that John got into. How does Relativity not only enable me to retrieve information, but as best as I can tell, retrieve the information that’s relevant for the investigation or litigation?

Lindsey Lanier:
Relativity provides an end-to-end solution from collection through staging, processing and review. We offer the same intuitive user interface across all data sources, so that any user can easily collect from a wide range of cloud applications without any specialized or deep forensic training.

That means anyone on the team can identify all of the custodian’s data and collect what’s needed in a defensible and efficient way without having to handle the data or pull from users manually.

Within processing, Relativity indexes and ingests a lot of different types of data so clients can search and ultimately get to review. And then following processing, customers can run OCR on things like non searchable PDFs and images and things like that. And then all of this information gets indexed so they can run searches with terms and work to get a subset of the data created for review.

The goal here is to simplify the acquisition of the data first, and then ingest only what you need.

Chris O’Connor:
Can you show that this system is being used solely for disaster recovery and not being leveraged for active compliance purposes or something of a similar nature?

Lindsey Lanier:
Sure. RelativityOne exposes a number of different reports within the system, one specifically being RelOne auditing. All changes made in the system are audited. Everything that a user does within the application gets logged, and then our framework exposes those audit logs, which makes visible everything that was done within our collect jobs, within the workspaces and at the instance level.

Additionally, RelativityOne Collect provides reports that detail all the items that were collected, and a summary of everything that was pulled from a given job based on the target or target status. You can even filter that down and look at what you pulled back per custodian to get subtotals from each target.
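
Per-custodian totals of this kind can be derived from any item-level collection report. The sketch below assumes a hypothetical report in which each row is a dictionary with `custodian`, `target` and `status` keys; these names are illustrative and are not the column names of the RelativityOne Collect reports.

```python
from collections import Counter


def summarize(report_rows):
    """Count collected items per (custodian, target, status)."""
    counts = Counter()
    for row in report_rows:
        counts[(row["custodian"], row["target"], row["status"])] += 1
    return counts


# Hypothetical item-level rows from a single collection job.
rows = [
    {"custodian": "J. Doe", "target": "Mailbox", "status": "Collected"},
    {"custodian": "J. Doe", "target": "Mailbox", "status": "Collected"},
    {"custodian": "J. Doe", "target": "Teams", "status": "Error"},
    {"custodian": "A. Roe", "target": "Mailbox", "status": "Collected"},
]
summary = summarize(rows)
for (custodian, target, status), total in sorted(summary.items()):
    print(f"{custodian:8} {target:8} {status:10} {total}")
```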

Now, A Data Mining Deep Dive

Chris O’Connor:
Adam, same question to you. When we approach these systems, apart from RelativityOne Collect, what will allow us to make determinations on use? Are there log files that can be investigated?

Adam Rogers:
I would take a step back here and look at the traditional solution, which would be to conduct custodial interviews and discuss with IT to see how the system is being used.

But from the technology side, we can check system settings to see how often the backups are being created, when they’re being created and what is being archived. There are logs and system settings that we can review to see how the system is being used, and we can make a determination from there. It would be an opinion or expert advice.

Chris O’Connor:
If I pull the collection using RelativityOne Collect, what am I getting from this? What are these reports and dashboards showing us?

Lindsey Lanier:
The UI is completely customizable. Customers can determine what’s presented in their dashboards. Some examples would be item level reports of what was collected with metadata. We also provide hash values to show what was collected for defensibility, and a summary report for those who want high-level information of what was collected.
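
As a rough illustration of why hash values support defensibility, the sketch below builds an item-level manifest for a folder of collected files using SHA-256 from the Python standard library. The function names and manifest layout are assumptions made for this example; they do not describe how RelativityOne Collect computes or reports its hashes.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large items do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder):
    """Return one row per file: relative path, size in bytes, SHA-256."""
    root = Path(folder)
    rows = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rows.append({
                "path": str(path.relative_to(root)),
                "bytes": path.stat().st_size,
                "sha256": sha256_of_file(path),
            })
    return rows
```

Re-running `build_manifest` on the same folder later and comparing the rows shows whether any item changed after collection.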

Chris O’Connor:
If I have one unique collection workspace covering multiple matters, I can set it up so that I’m monitoring all the aspects at the same time. And I could probably run multiple collections at the same time, right?

Lindsey Lanier:
Exactly.

Chris O’Connor:
One more question. When I pull data from a cloud source using Collect, I’m not altering metadata, am I?

Lindsey Lanier:
Short answer is no. It’s designed to be read-only on the application. A good example is when we’re doing Office 365 collections. There are seven permissions granted to the app that we have to register within Azure, and they’re all read-only, so nothing can change.

Chris O’Connor:
I’m going to turn it over to Pete Lwin, Senior Project Engineer at CDS, our processing expert. Pete, you’ve been doing this for about 15 years.

Tell us some of the things that you encounter when you’re connecting to different systems and pulling from these archives. What happens to data and what are some challenges we’ve encountered over the years?

Pete Lwin:
What we have seen so far are issues with timestamps, where the same file coming from two collection systems carries different timestamps, and also some issues with missing information within the body of files from the collection systems. We have seen small differences in the timestamps, some of milliseconds and some of a few seconds. Sometimes we have seen body text missing, and sometimes even a tiny extra space introduced somewhere in the process.

Chris O’Connor:
You mentioned timestamp differences. Are there things we can do to correct that?

Pete Lwin:
Yes. There are some workarounds for those. What we do is ignore milliseconds or seconds, or sometimes minutes, when we calculate the hash values for the deduplication process. Another thing we can try is to run near-duplicate detection, but that’s not part of exact deduplication. In the end, though, there’s no actual way of correcting the data natively.
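
The workaround described above, ignoring seconds or milliseconds when hashing for deduplication, might look something like the following sketch. The choice of fields, the field separator, and the whitespace normalization of the body are assumptions for illustration; each processing tool defines its own deduplication hash.

```python
import hashlib
from datetime import datetime, timezone


def dedup_key(sender, recipients, subject, sent, body, precision="minute"):
    """Build a deduplication hash that tolerates small timestamp drift.

    `sent` must be timezone-aware. With precision="minute", seconds and
    microseconds are dropped before hashing; with "second", only
    microseconds are dropped.
    """
    sent_utc = sent.astimezone(timezone.utc)
    if precision == "minute":
        sent_utc = sent_utc.replace(second=0, microsecond=0)
    elif precision == "second":
        sent_utc = sent_utc.replace(microsecond=0)
    else:
        raise ValueError("precision must be 'minute' or 'second'")

    # Collapse runs of whitespace so a stray extra space does not change the hash.
    normalized_body = " ".join(body.split())
    parts = [
        sender.lower(),
        ";".join(sorted(r.lower() for r in recipients)),
        subject.strip(),
        sent_utc.isoformat(),
        normalized_body,
    ]
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()


# The same message exported by two systems, a few seconds apart.
first = dedup_key("a@example.com", ["b@example.com"], "Q3 numbers",
                  datetime(2021, 7, 22, 10, 0, 5, 120000, tzinfo=timezone.utc),
                  "See attached.")
second = dedup_key("a@example.com", ["b@example.com"], "Q3 numbers",
                   datetime(2021, 7, 22, 10, 0, 7, 900000, tzinfo=timezone.utc),
                   "See  attached. ")
```

Truncation is a blunt tool: two copies whose timestamps straddle a minute boundary still hash differently, which is one reason near-duplicate analysis remains a useful second pass.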

Chris O’Connor:
Something else that can happen is that email attachments can be separated from the parent email during the archiving process. How do we reconnect those? Is it even possible?

Pete Lwin:
Mainly it’s not going to be possible. We have seen messages coming out of archiving systems that are sometimes missing their attachment files. We believe it mainly has to do with the settings on the system, or sometimes the system itself can cause that kind of issue. If the attachment somehow got exported along with the parent file, there are other ways of manipulating the data to connect them back, but that’s going to be time consuming.

Chris O’Connor:
You mentioned extra spaces earlier, perhaps in the text of the email. What about email stubbing? That’s the opposite, right? I’m going to have less coming out of the archive than I have in the live system. How do I account for that?

Pete Lwin:
Email stubbing is basically keeping track of the archived data for a file sent to the external storage or cloud-based storage. A computer-generated file would get created, and then that file will be available for the user to have immediate access while the original file or the bigger portion of the file will be stored somewhere else externally. Email stubbing is basically a process of saving some space since you’re not saving the entire file or the data on the drive, instead you’re saving a link or stub file for the users.

Chris O’Connor:
And what is that doing to deduplication on a global level?

Pete Lwin:
With email stubbing, we’ve also seen messages coming out without all of their attachments. You can end up with one copy of a message without the attachment file and another copy with the attachment. So, that will cause deduplication issues.
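
One possible way to handle the stub-versus-full-copy problem is to group copies by a stable identifier and keep the most complete one. The sketch below assumes each message has already been parsed into a dictionary with hypothetical `message_id` and `attachments` keys, and it assumes the Message-ID survived archiving, which is not guaranteed.

```python
def prefer_complete_copies(messages):
    """For each Message-ID, keep the copy that has the most attachments."""
    best = {}
    for message in messages:
        key = message["message_id"]
        current = best.get(key)
        if current is None or len(message["attachments"]) > len(current["attachments"]):
            best[key] = message
    return list(best.values())


# A stubbed copy and a full copy of the same email, plus an unrelated email.
copies = [
    {"message_id": "<1@example.com>", "source": "archive", "attachments": []},
    {"message_id": "<1@example.com>", "source": "mailbox", "attachments": ["q3.xlsx"]},
    {"message_id": "<2@example.com>", "source": "archive", "attachments": []},
]
kept = prefer_complete_copies(copies)
```

The attachment count is only a proxy for completeness; a production workflow would more likely compare attachment hashes before suppressing a copy.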

Chris O’Connor:
And then email layering. Let’s talk briefly about this. What does it do to the dataset if you end up with layered emails coming out of archives?

Pete Lwin:
Email layering . . . we call it journaling. It’s when the system attaches an original email to another container email, and then sends that to the archiving system. When those emails get collected for processing, we will have issues with the deduplication process, and we would also have issues with having too much published in the database because of the container emails. For the deduplication process, there are workarounds, but they’re all time consuming.
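
One common workaround for journaled mail is to unwrap the container and process only the embedded original. The sketch below uses Python’s standard `email` package and assumes the original is carried as a `message/rfc822` attachment inside the container, which is a typical layout but should be confirmed for the archive in question. The function name `unwrap_journal` is hypothetical.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def unwrap_journal(raw_bytes):
    """Return the original message embedded in a journal container.

    If no message/rfc822 part is found, the parsed message itself is
    returned, so ordinary emails pass through unchanged.
    """
    outer = message_from_bytes(raw_bytes, policy=policy.default)
    for part in outer.walk():
        if part.get_content_type() == "message/rfc822":
            # The payload of a message/rfc822 part is a one-element list
            # holding the embedded message.
            return part.get_payload(0)
    return outer


# Build a small container email around an original, then unwrap it.
original = EmailMessage()
original["Subject"] = "Q3 numbers"
original["Message-ID"] = "<1@example.com>"
original.set_content("See attached.")

container = EmailMessage()
container["Subject"] = "Journal report"
container.set_content("Sender: a@example.com\nRecipient: b@example.com")
container.add_attachment(original)

recovered = unwrap_journal(container.as_bytes())
```

Hashing the recovered message rather than the container lets a journaled copy deduplicate against the same email collected from a mailbox, and keeps the container itself out of the review database.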

Costs and Burdens of ESI: Factoring for Proportionality

Chris O’Connor:
We’re talking about increased costs, and manual manipulation for this data. These are things you want to consider when you’re investing in cloud archive systems.

Let’s take a step back to John and Bill. We covered what the rule makers envisioned when they wrote Rule 26. How are the parties addressing their understanding of accessibility in meet and confer negotiations? At the 26(f) conference, what’s going on, and what are some things that parties should be considering, especially when it comes to proportionality? How is the time burden factored in?

Bill Belt:
I think the points that Pete, Adam and Lindsey raised are good ones. In Rule 26 terms, there’s both a burden and a value or benefit to getting data out of archives. One of the things I’m going to highlight, which I am often brought in to talk about, is the time burden. Pete talked about that. He said that it’s going to take more time and it’s going to cost more money.

To one of our original points, this is not just a button that you press. You find a large volume of data sitting in an archive somewhere. Number one, the volume in and of itself implies costs, because you need to understand what’s in there: what’s responsive and what’s not, what’s privileged and what’s not. But beyond that, the technology itself is giving you data with inherent complexities.

There may be workarounds for them, and some of the technical solutions that Lindsey spoke about are great, and they’re changing all the time and probably getting better and better, but still there is the time factor. What I try to make sure people are aware of is that by doing this, you’re not only increasing the cost to the producing party, but you’re probably increasing the time it’s going to take to get discovery done many times over.

What you’re asking for is potentially what you’re going to complain about later on as a document dump. You’re going to get a much more voluminous production than what you have in mind right now.

Chris O’Connor:
This is really a cost-benefit conversation that we should be insisting on between both sides, because it sounds like the plaintiffs are also going to be disadvantaged at some point, right? If they get the data they want, but it comes with a giant mountain of other data, that may not be the best way of approaching it. And from the defense side, obviously it’s a time and cost consideration at the outset.

John Rabiej:
Chris, let me pick up from something that Lindsey and you said about burdens and costs getting lower for accessing information from these archived data sources. Bill says, “Well, there’s still costs involved there.” That’s going to be the issue. When the rule makers were considering this in 2006, the line was pretty clear between backup tapes and active sources/databases when it came to accessibility.

There were a lot of costs for all the new steps involved, so what Lindsey’s telling me is that those steps are starting to get ameliorated, and it’s becoming easier to access data.

The closer the active and the archive come together with regard to burden and cost . . . there are still costs now, but maybe eventually they’re going to be the same and there’s not going to be much of a difference in getting the information from archive data. At that point, the other rule provision kicks in, the one on unreasonably cumulative and redundant discovery. And that’s where you’ll be focusing.

But at this point, it looks like there’s still additional costs involved, no matter where you place it, and so then Bill’s concerns are 100% right, you start applying the proportionality, start looking at the costs and burdens in much more detail.

Chris O’Connor:
Should business factor in the legal implications of the use of these systems at the outset? Should an IT department who’s about to deploy a cloud archiving system, for whatever purpose, be having conversations with lawyers to make sure from a legal perspective, they understand the implications of deploying the system?

John Rabiej:
Well, I can tell you from the rules committee perspective, because this issue is touched upon quite often, and we hear about it from in-house counsel all the time.

Legal does not drive the business purposes for information. The business is going to make its IT decision based on what it needs the information for, and how it’s going to be using it. I think the point you’re making is that, though legal doesn’t drive it, you still ought to be aware of the consequences.

And there are some consequences. If what Lindsey is saying becomes a reality down the road, then you’re going to have to start considering the impact of that, because then the provision, Rule 26(b)(2)(B), becomes irrelevant because the costs and burdens are the same. You’re going to have to look at proportionality, and you’re going to have to look instead at whether the discovery is unreasonably cumulative. So, the analysis becomes different.

Bill Belt:
You do often hear that legal is not brought to the table as early as it should be, so the answer is yes, they should consider it. And we are seeing more of that: more and more people in-house who are experts in Relativity, in the collection tools that Adam’s talking about and in the processing that Pete’s talking about. So, they’re raising the concerns and being heard a little bit more often, especially as Office 365 starts to become more prevalent and issues with Slack and Teams become more common.

The voice of legal is probably getting a little bit more traction or having a little bit more of an audience. But John’s point is still correct that businesses run their businesses but they should consult Legal.

Click here to read Part III: The GW Proportionality Initiative: A New Framework for eDiscovery.

About the Author

CDS Staff

Our leadership team and advisory consultants, project managers, and technical experts assist clients through all phases of the eDiscovery process.


