If there were ever a single eDiscovery trend that everyone could agree on, it would without a doubt be that data volumes in litigation grow year over year without fail. In fact, some 2019 surveys indicated that increasing data volumes are the biggest concern among eDiscovery practitioners today.
The size of the average collection is burgeoning, due not only to an increase in the daily volume of email, but also to the changing nature of the workplace. For example, our work product and communications might be synced across multiple devices or duplicated in several backup storage locations, not to mention that, with data storage costs dropping, we are probably preserving far more data than before.
Corporate IT and Legal departments are quickly waking up to this reality. We have seen more and more calls in the industry for better information governance and data management, as well as closer scrutiny of outside counsel’s eDiscovery workflows. Nonetheless, a case that would have been considered “large” in terms of gigabyte volume or document count just five years ago would probably fail to raise any eyebrows today. This, combined with rising cost awareness and external legislative pressure on issues of data privacy and data security, is creating a perfect catalyst for a change in the way eDiscovery is done. Clients today are looking to seize on every tool and workflow available to reduce the number of documents they need to look at and drive down the overall discovery spend – albeit in a safe and defensible fashion.
No discussion of data reduction strategies today should go without mentioning the machine learning and technology assisted review solutions available to clients – and FRONTEO’s own AI tool, KIBIT Automator, has been showing incredible strength in that arena. But we can also rely on simple statistical techniques and analysis to significantly reduce the burden of discovery, increase the accuracy of document review, and gain a greater measure of control over the litigation process.
Statistics, at a broad level, is the science of data, and a core part of it deals with giving people the tools to make educated inferences about a large data set by observing only a small portion of that data. That makes it a perfect partner to the field of eDiscovery, where this is precisely what we want to do most of the time. While an in-depth discussion of the merits and applications of different statistical methods, or the underlying math, is beyond the scope of this blog post, we want to provide an overview of the different ways you can leverage one of the simplest statistical techniques – sampling – to enhance your workflows.
There are two overarching categories of sampling that you can employ in a given case. Conveniently named, they are ‘probability sampling’ and ‘non-probability sampling’. For the purposes of this article, we will focus on the most popular sampling technique from each category.
Probability Sampling – Simple Random Sampling
Simple random sampling is a process by which a subset of a population is selected at random, in such a way that each individual item has an equal probability of being chosen. The main advantage of this randomization is that, because every item had an equal chance of being drawn, the sample can be said to be representative of the population it was taken from, and inferences about the whole population can be made from observations of the sample. The sample itself can be orders of magnitude smaller than the population yet still offer valid, defensible insights into the whole set.
This difference between the number of documents sampled and the size of the population the sample represents allows eDiscovery practitioners to test, at relatively low cost, assumptions that can impact large document volumes.
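As a quick illustration, the draw itself is easy to script. The sketch below is ours, not a prescribed tool – the document IDs, sample size, and function name are all hypothetical – but it shows the essential property: a fixed random seed makes the draw reproducible, which helps when you later need to show your work.

```python
import random

def simple_random_sample(population, sample_size, seed=None):
    """Draw a simple random sample: every item has an equal
    chance of selection, without replacement."""
    rng = random.Random(seed)  # a fixed seed makes the draw auditable/reproducible
    return rng.sample(population, sample_size)

# Hypothetical population of one million document IDs.
doc_ids = [f"DOC-{i:06d}" for i in range(1_000_000)]

# A 400-document sample can still support inferences about the full million.
sample = simple_random_sample(doc_ids, 400, seed=42)
```

Running the same call with the same seed yields the same 400 documents every time, so the sample can be recreated and defended after the fact.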
Typical use cases for simple random sampling in an eDiscovery workflow are broad:
Culling – document volumes in an ECA environment can easily exceed millions, and introducing various criteria for removing documents (for example, removing emails from certain domains, or those generated by out-of-office replies, calendar invites, and so on) can significantly reduce the final review volume. Statistically sampling the discarded population and evaluating the results can lend much-needed credibility and defensibility to the process.
Search term application – this process can be quite painless if search terms are well established and agreeable to both parties. When they are not, however, or if certain search terms appear over-broad and bring in thousands more documents than others, it may be worth applying simple random sampling to the outliers and testing their yield (quite simply, how many relevant documents those terms actually return) to give yourself some quantitative ammunition during search term negotiations. This can also be used to great effect in testing search term adjustments – for example, sampling the difference between your set of proposed search terms and the opposing side’s, even on a per-term basis if need be, can be a powerful tool in your arsenal.
Machine Learning / Technology Assisted Review – simple random sampling is typically used to validate the accuracy and completeness of AI decisions in Technology Assisted Review (TAR) processes. This usually happens by measuring metrics like recall and precision via a control set, or the elusion rate via an elusion test, or some combination of the two. It can also be used to generate the initial training set for the assisted review platform, though it is not the only way to generate a training set, nor is it always the most efficient one. (More on this in the section on judgmental sampling in Part II of this blog post.)
Quality Control – random sampling is probably the most popular way of conducting quality control on documents queued up for production. Many document review service providers use random sampling to generate their Quality Control (QC) sets from the population of documents tagged as responsive, privileged, confidential, or any other important criterion that needs to be double-checked. Although random sampling is a tried-and-true approach, it can add significant inefficiency and cost to the QC workflow. This was one of the reasons we developed our KIBIT Automator AI software to use AI-generated heat maps to prioritize the groups of documents most in need of verification.
Understanding your dataset – although this item is last on our list, it is by no means the least important. Understanding the data you have collected can be critical for many reasons. Measuring the prevalence of relevant documents in the document population, for example, can drive important decisions about the use of TAR technology, document review staffing, and litigation logistics planning. A quick random sample of the potential review population can give you insight into all of the above and much more. It can also flag potential problems with processed documents (if your vendor hasn’t caught them already), identify unexpected foreign languages and, in general, give you early insight into the type of data your review team will be covering.
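To make the prevalence and search-term-yield ideas above concrete, here is a short Python sketch. The counts are hypothetical and the function is our own; it uses the normal-approximation confidence interval, which is a common (though not the only) choice for estimating a proportion from a simple random sample.

```python
import math

def estimate_prevalence(sample_hits, sample_size, z=1.96):
    """Estimate the proportion of relevant documents in a population
    from a simple random sample, with a normal-approximation
    confidence interval (z = 1.96 gives roughly 95% confidence)."""
    p = sample_hits / sample_size
    margin = z * math.sqrt(p * (1 - p) / sample_size)
    return p, max(0.0, p - margin), min(1.0, p + margin)

# Hypothetical result: 60 of 400 randomly sampled documents were relevant.
point, low, high = estimate_prevalence(60, 400)
print(f"Estimated prevalence: {point:.1%} (95% CI: {low:.1%} to {high:.1%})")
```

The same arithmetic applies to a per-term yield test: sample the documents a single search term returns, code them, and estimate what fraction are actually relevant.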
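The TAR validation metrics mentioned above are just as simple to compute once the sampled documents have been human-coded. The helper functions and counts below are our own illustrative sketch, not any particular platform’s implementation:

```python
def recall_precision(true_pos, false_pos, false_neg):
    """Recall and precision from a human-coded control set:
    recall    = relevant docs the system found / all relevant docs,
    precision = relevant docs the system found / all docs it flagged."""
    recall = true_pos / (true_pos + false_neg)
    precision = true_pos / (true_pos + false_pos)
    return recall, precision

def elusion(relevant_in_discard_sample, discard_sample_size):
    """Elusion: the fraction of a random sample drawn from the
    predicted-non-relevant (discard) pile that turns out to be relevant."""
    return relevant_in_discard_sample / discard_sample_size

# Hypothetical counts from a TAR validation round.
r, p = recall_precision(true_pos=80, false_pos=40, false_neg=20)
e = elusion(relevant_in_discard_sample=3, discard_sample_size=500)
```

A low elusion rate on a properly drawn random sample of the discard pile is often the key piece of evidence that a TAR cutoff decision was defensible.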
In our next post, we’ll look in more detail at the technical execution of a random sampling process.