Considering the Impact of Richness on Control Set Metrics

A more thorough and introductory explanation of this topic was originally featured on the ACEDS Blog.
Attorneys often calculate recall and precision estimates based on a control set to validate the TAR process. A control set is a random sample of documents excluded from the TAR model and used to evaluate the predictive strength of the model. The sample size used for validation is often calculated based on a specified confidence level (CL) and margin of error (MoE) using the normal approximation for binomial proportion confidence intervals.

Many eDiscovery professionals erroneously expect the MoE used for sampling purposes to apply to their recall and precision estimates. The problem with this expectation is that the sample size used to calculate these proportions does not match the size given by the sampling calculation with which they began.

It’s crucial to remember that statistics allow us to estimate a range that contains the true rate with which we’re concerned. A statistic without a confidence interval or MoE can be meaningless or even misleading. For this reason, a document universe with very low richness can make it very difficult to calculate useful estimates of recall and precision using a control set.

Because precision and recall are proportions of only the documents designated as responsive by either the software or a human reviewer, you need to make sure you have enough responsive documents in your random sample to calculate useful estimates of these validation metrics. That’s not too onerous if 30% to 34% of your documents are responsive, but things often get tricky and inefficient when only 1% to 3% of all the documents are responsive.
Consider the following examples:
Let’s imagine we’re starting with 100,000 unreviewed documents. The sampling feature of my review software gives me 383 documents for my control set using 95% CL, 5% MoE. The associate on the case reviews them with perfect accuracy and identifies 123 as responsive, 260 as non-responsive.
After training the machine learning algorithm, we find that the software categorized our control set such that we have:

Remembering that recall and precision are calculated as:

Our statistics indicate 75% recall and 70% precision, so our sample size suggests we’re at 70% – 80% recall and 65% – 75% precision, right?

This is where many go wrong.
Leaving aside questions about probability theory, which confidence interval formula to use, and mathematical proofs, our sample size and 32% rate of responsiveness suggest around 27% to 37% of all the documents are responsive. However, since our sample size for recall is 123, not 383, we can only estimate with 95% confidence that recall is anywhere between 66% and 82%. Remember that it’s just as likely it’s really 67% as it is that it’s 79%. Are you comfortable defending the decision to produce only 2 out of every 3 responsive documents? Our precision estimate is also fairly wide at 62% to 78%.

We’d have to triple the size of the control set, and we still might not have enough responsive documents to calculate a 5% confidence interval, but we should probably be in good shape if we include at least 1425 total in the control set. While that’s quite a bit more than 383, it’s still a practical number. In this scenario it may be reasonable to conclude that a control set is appropriate for validation, but make sure it’s large enough and that you (or your data science consultant) are calculating the actual confidence intervals for your validation metrics.

Now let’s imagine the same scenario, but this time the associate coded with perfect accuracy and found that 8 were responsive and 375 were not responsive. She then trained the model and we’re looking at:

Now recall and precision are calculated as:

Same 75% recall and now 67% precision, but, with a 95% CL, our range is anywhere from 35% to 97% using a conservative estimate for recall, and our precision estimate is equally useless at 30% to 93%. We’d probably need around 9,500 to 38,000 documents in our control set to estimate these within our desired 5% interval!

It rarely makes sense to include so many documents in a control set, so in situations like these it’s best to note the proportion of responsive documents to help guide workflow, then count on elusion testing for validation.
About the author:
Lilith Bat-Leah, CEDS, is a data science director at FRONTEO, a publicly traded global technology and services company specializing in artificial intelligence, cross-border litigation, managed review, and consulting for the eDiscovery market. She is a founding board member and vice president of the ACEDS Chicago Chapter and a contributor to the EDRM TAR guidelines. Lilith graduated from Northwestern University magna cum laude and has been CEDS certified since 2015.