Cutting Through the Confusion Around Predictive Coding
January 16, 2013 10:00 am

By Todd Marlin and Piyush Dixit

The modern litigation and regulatory landscape has produced an explosion of electronically stored data retained in the business environment. As a result, corporate entities, regardless of industry, face mounting financial and resource outlays when handling discovery of that data in response to formal litigation, regulatory investigation or government investigation. To combat the skyrocketing costs and time that legal teams must devote to such endeavors, many practitioners are turning to a technological “savior” commonly referred to as predictive coding.

Recent court rulings in Da Silva Moore v. Publicis Groupe, Global Aerospace v. Landow Aviation, Kleen Products v. Packaging Corp of America and In Re: Actos Products Liability Litigation have provided guidance on appropriate uses of predictive coding.

While this perceived acceptance has reduced some of the initial skepticism among corporate and outside counsel, many companies and law firms are still reluctant to embrace predictive coding as a viable solution. Much of this reluctance is attributable to ambiguity about what predictive coding actually is.

The confusion is compounded by the wide array of predictive coding offerings promoted by vendors looking to gain footing in this emerging market. Some providers refer to their offering as predictive coding, while others promote their solution as technology-assisted review (TAR), computer-assisted review, automated document review or any number of other catchphrases and acronyms.

Technology-assisted review

TAR is generally understood as any solution in which non-linear review methodologies are employed to cull and reduce datasets and identify documents likely to be responsive. This definition is fairly broad and spans a wide spectrum of strategies, from simple to complex. On the lower end of the scale, solutions are built around deduplication (eliminating redundant information) and keyword searching, which still require substantial manual, eyes-on review. Other TAR solutions assess the frequency and position of language within each document to cluster documents by topic and concept and organize them into stratified review sets.
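The lower end of that spectrum can be illustrated with a minimal Python sketch of deduplication followed by keyword culling. The function names, the whitespace and case normalization, and the sample documents are assumptions made for illustration; commercial review platforms handle near-duplicates, metadata and search syntax in far more detail.

```python
import hashlib

def deduplicate(documents):
    """Keep the first copy of each document, comparing normalized text by hash."""
    seen = set()
    unique = []
    for doc in documents:
        # Assumption: documents that differ only in case or spacing are duplicates.
        normalized = " ".join(doc.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

def keyword_cull(documents, keywords):
    """Return only the documents that contain at least one keyword."""
    terms = [k.lower() for k in keywords]
    return [doc for doc in documents if any(t in doc.lower() for t in terms)]

# Hypothetical four-document corpus.
corpus = [
    "Pricing agreement with the distributor attached.",
    "pricing agreement  with the distributor attached.",
    "Lunch menu for Friday.",
    "Please review the rebate schedule.",
]
unique_docs = deduplicate(corpus)
review_set = keyword_cull(unique_docs, ["pricing", "rebate"])
```

Every document that survives both steps still goes to a human reviewer, which is why this approach reduces volume without removing the need for manual review.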

Training a machine

Moving farther up the technology chain are advanced solutions that leverage machine-based learning to extrapolate decisions across the corpus of documents at issue. This is largely achieved by relying on subject matter experts who train the analytic engine through the use of exemplars commonly referred to as “seed” or training sets. The results of the computer-generated decisions are then sampled by the case team to assess overall precision of the system’s classifications. Finally, there are active learning solutions that combine the use of mathematics and linguistics with seed sets to generate statistical confidence thresholds as they relate to relevance.
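The seed-set workflow described above can be sketched in a few lines of Python. This is an illustrative toy rather than any vendor's method: it assumes a simple naive Bayes word model as the "analytic engine", a 0.5 responsiveness cutoff, a seed set containing both responsive and non-responsive examples, and a hypothetical `reviewer_label` function standing in for the case team's manual check.

```python
import math
import random
from collections import Counter

def tokenize(text):
    return text.lower().split()

def train(seed_set):
    """seed_set: (text, is_responsive) pairs coded by subject matter experts."""
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter()
    for text, label in seed_set:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts[True]) | set(word_counts[False])
    return word_counts, doc_counts, vocab

def score(model, text):
    """Estimated probability that a document is responsive."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    log_prob = {}
    for label in (True, False):
        total_words = sum(word_counts[label].values())
        lp = math.log(doc_counts[label] / total_docs)
        for word in tokenize(text):
            if word in vocab:  # words never seen in the seed set are ignored
                # Add-one smoothing avoids zero probabilities.
                lp += math.log((word_counts[label][word] + 1)
                               / (total_words + len(vocab)))
        log_prob[label] = lp
    top = max(log_prob.values())
    exp_true = math.exp(log_prob[True] - top)
    exp_false = math.exp(log_prob[False] - top)
    return exp_true / (exp_true + exp_false)

def sample_precision(predicted, reviewer_label, sample_size, rng):
    """Share of a random sample of machine-coded documents the reviewers confirm."""
    if not predicted:
        return 0.0
    sample = rng.sample(predicted, min(sample_size, len(predicted)))
    return sum(1 for doc in sample if reviewer_label(doc)) / len(sample)

# Hypothetical seed set and corpus.
seed_set = [
    ("pricing agreement with distributor", True),
    ("rebate schedule and pricing terms", True),
    ("lunch menu for friday", False),
    ("office parking reminder", False),
]
corpus = [
    "revised pricing terms for distributor",
    "friday parking update",
    "rebate agreement draft",
    "holiday lunch reminder",
]
model = train(seed_set)
predicted = [doc for doc in corpus if score(model, doc) >= 0.5]

# Stand-in for human reviewers checking the sampled documents.
reviewer_label = lambda doc: "pricing" in doc or "rebate" in doc
precision = sample_precision(predicted, reviewer_label, 2, random.Random(0))
```

The point of the sketch is the division of labor: experts code a small set, the model extends those decisions to the rest of the corpus, and a sample of the machine's calls goes back to people for verification.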

As human review teams manually check documents at the threshold levels and determine whether the classifications are accurate, the machine is further trained and its output is refined until a statistically sound classification of documents is achieved. The latter two approaches, machine learning and active learning, are more accurate descriptions of how predictive coding should be perceived. They should be used as a baseline when contemplating the use of predictive coding for litigation, a regulatory response or an investigation.
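One way to put a number on "statistically sound" is to compute a confidence interval around the precision observed in a reviewed sample and keep training until the lower bound clears an agreed target. The Python sketch below uses the Wilson score interval; the choice of that interval, the 95% confidence level (z = 1.96), the 90% target and the sample figures are all assumptions made for illustration, not a standard prescribed by the courts or by any particular tool.

```python
import math

def wilson_interval(confirmed, sample_size, z=1.96):
    """Wilson score interval for the share of sampled documents confirmed by reviewers."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    p = confirmed / sample_size
    z2 = z * z
    denom = 1 + z2 / sample_size
    center = (p + z2 / (2 * sample_size)) / denom
    half_width = z * math.sqrt(
        p * (1 - p) / sample_size + z2 / (4 * sample_size ** 2)
    ) / denom
    return center - half_width, center + half_width

def needs_more_training(confirmed, sample_size, target_precision):
    """True while the lower confidence bound is still below the agreed target."""
    low, _ = wilson_interval(confirmed, sample_size)
    return low < target_precision

# Hypothetical round: reviewers confirm 92 of 100 sampled machine-coded documents.
low, high = wilson_interval(confirmed=92, sample_size=100)
keep_training = needs_more_training(92, 100, target_precision=0.90)
```

With a sample of only 100 documents the interval is wide, so an observed 92% does not by itself demonstrate that precision exceeds 90%; a larger sample or another training round would be needed to narrow it.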

Keep your own needs in mind

Each of these solutions has its own benefits and drawbacks and should be evaluated against your particular needs. For example, where the data volumes or scope are of reasonable size, full-scale predictive coding may not be necessary and the dataset can be sufficiently culled using keywords or concept clustering. Where data volumes make a linear or keyword-based methodology cost- and time-prohibitive, a machine learning or active learning approach may be more suitable, either to prioritize the set of documents for eyes-on review or to identify the document set to be sampled prior to production.

When deciding on an appropriate solution, it’s important that the primary decision-makers selecting the provider understand what they are buying. The reality is that not all predictive coding solutions are the same, and one person’s predictive coding may be another’s TAR.

The views expressed herein are those of the author and do not necessarily reflect the views of Ernst & Young LLP. This material has been prepared for general information only and is not intended to be relied upon as accounting, tax or other professional advice. Please refer to your advisors for specific advice.
