Recently, we shared the first set of terms commonly used in predictive coding, both to help you navigate the field and to give you a firm foundation for understanding what service providers are talking about. This week, we look at six additional terms commonly heard when predictive coding is up for discussion.
6. “Recall.” One of the two key metrics for measuring predictive-coding accuracy, recall is the percentage of documents that belong to a category (responsive, privileged, hot, etc.) that were accurately identified. It is a measure of completeness: of all the documents that should have been found, what percentage were found? Academic studies have measured attorney recall at below 50%. To the author’s knowledge, the only specific recall level ever approved by judicial order is 75%. An extended discussion of recall is available here.
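The arithmetic behind recall is simple. As a minimal sketch (the review counts below are hypothetical, not from any real matter):

```python
def recall(true_positives, false_negatives):
    """Recall = relevant documents found / all relevant documents."""
    return true_positives / (true_positives + false_negatives)

# Suppose 10,000 documents were truly responsive and the review found 7,500.
print(recall(7_500, 2_500))  # 0.75 — the 75% level mentioned above
```

Note that recall says nothing about how many nonresponsive documents were swept in along the way; that is what precision measures.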
7. “Precision.” The second key metric for measuring predictive-coding accuracy, precision refers to the percentage of documents identified as belonging to a category that do, in fact, belong to the category. It is a measure of over-inclusiveness: of all the documents tagged as responsive, what percentage actually were responsive? An extended discussion of precision is available here.
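Precision is computed the same way, but against the documents the software flagged rather than the documents that truly exist in the category. A minimal sketch with hypothetical counts:

```python
def precision(true_positives, false_positives):
    """Precision = relevant documents found / all documents flagged."""
    return true_positives / (true_positives + false_positives)

# Of 9,000 documents the software flagged as responsive, 7,500 really were.
print(precision(7_500, 1_500))  # ≈ 0.83
```

High recall with low precision means the review found what it should, but buried it in false positives that still must be reviewed or produced.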
8. “Error rate.” The rate at which predictive-coding software incorrectly categorizes documents. It can be misleading, making recall and precision the preferred metrics. Imagine a document production from a population of 1 million documents, of which 1% (10,000) are relevant, and predictive-coding software identifies all documents as nonresponsive. The error rate would be an impressive-sounding 1% despite the software missing all 10,000 relevant documents. Extended discussions of error rate are available here and here.
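Working through the example above makes the distortion concrete: the only errors the software commits are the 10,000 relevant documents it misses, so the error rate looks tiny even though recall is zero.

```python
population = 1_000_000
relevant = 10_000  # 1% of the population

# Software marks every document nonresponsive: each relevant doc is an error.
errors = relevant
error_rate = errors / population   # 0.01 → a deceptively good-looking 1%

# Recall exposes the failure: no relevant document was found.
found = 0
recall_rate = found / relevant     # 0.0 → 0%
print(error_rate, recall_rate)
```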
9. “Prevalence” or “richness.” The proportion of documents in a population that belong to a given category. For example, in the context of production, suppose that 1 million documents have been collected, of which 100,000 are actually relevant and must be produced. Prevalence would be 10% (100,000 / 1,000,000). The lower the prevalence, the larger the training set generally must be.
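The calculation from the example above, sketched in code (the counts are the hypothetical ones given in the text):

```python
collected = 1_000_000
relevant = 100_000

prevalence = relevant / collected
print(f"{prevalence:.0%}")  # 10%
```

Intuitively, at low prevalence a random sample contains few relevant documents, so more documents must be reviewed before the software sees enough positive examples to learn from.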
10. “Random set” or “control set.” The control set is a random sample that attempts to represent the universe of documents the predictive-coding software will analyze. The control set is used to measure accuracy. Some software providers, such as Dagger Analytics, allow parties to use part or all of the training set as the control set, reducing the number of documents requiring review. The random set is typically a few hundred to a few thousand documents, depending on prevalence and the desired margins of error.
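The “few hundred to a few thousand” range comes from standard sample-size arithmetic for estimating a proportion. A rough sketch using the textbook simple-random-sample formula (the prevalence and margin-of-error inputs are illustrative assumptions, and real control-set sizing may differ by tool):

```python
import math

def sample_size(prevalence, margin, z=1.96):
    """Documents needed to estimate a proportion at 95% confidence (z=1.96)."""
    return math.ceil(z**2 * prevalence * (1 - prevalence) / margin**2)

# 10% prevalence, +/- 2% margin of error:
print(sample_size(0.10, 0.02))  # 865 documents
```

Tightening the margin of error or moving prevalence toward 50% pushes the required sample size up, which is why the text ties control-set size to both factors.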
11. “Concept clustering.” A type of Technology Assisted Review that uses linguistic analysis to group documents relating to the same concept (for example, documents dealing with marketing). Groups can then be sampled and bulk-categorized (e.g., as responsive or nonresponsive), analyzed for inconsistent coding, or assigned to reviewers with the appropriate expertise.
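Commercial tools use far more sophisticated linguistic analysis, but the core idea, grouping documents whose vocabulary overlaps, can be sketched with simple word-count vectors and cosine similarity. Everything below (the threshold, the sample documents, the greedy grouping) is an illustrative assumption:

```python
from collections import Counter
import math

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def cluster(docs, threshold=0.4):
    """Greedily group documents similar to a cluster's first member."""
    vectors = [Counter(d.lower().split()) for d in docs]
    clusters = []  # each cluster is a list of document indices
    for i, v in enumerate(vectors):
        for c in clusters:
            if cosine(v, vectors[c[0]]) >= threshold:
                c.append(i)
                break
        else:
            clusters.append([i])
    return clusters

docs = [
    "marketing plan for the fall campaign",
    "marketing budget for the campaign",
    "server outage incident report",
]
print(cluster(docs))  # [[0, 1], [2]] — the two marketing documents group together
```

Once grouped, each cluster can be sampled, bulk-tagged, or routed to a reviewer, as described above.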
Have you heard a term in a predictive coding discussion that needs clarification and isn’t covered here? Send us a tweet with the term and we’ll get you a definition. You can find us on Twitter here…