Category: E-Discovery

My Prediction: Predictive Coding Will Help Information Governance Get Real

Last week I attended a “Predictive Coding Boot Camp” produced by the E-Discovery Journal and presented by Karl Schieneman of Review Less and Barry Murphy.  I’ve participated in many workshops, seminars, discussions, and webinars on the topic, but this half-day seminar went the deepest of any of them into the legal, technology, and case strategy implications of using technology to minimize the cost of human document review in e-discovery. It was a solid event.

(But, I wasn’t there to learn about e-discovery. I’ll tell you why I was there in a moment.)

You see how I snuck in an implied definition above? Because, whatever you call it – predictive coding, technology-assisted review, computer-assisted review, or magic – isn’t that the problem that we are trying to solve?  To defensibly reduce the number of documents that a human needs to review during e-discovery? There are a number of way to get there using technology, but the goal is the same.

What does e-discovery have to do with IG?

To review, in civil litigation, both sides have an obligation to produce information to the other side that is potentially relevant to the lawsuit. In the old days, this was a mostly a printing, photocopying, and shipping problem. Today it is primarily a volume, complexity, and cost problem. Although discovery of physical evidence and paper records is obviously still part of the process, electronic evidence naturally dominates.

So, how does a litigant determine whether a given document is potentially relevant and must be produced, or if it is irrelevant, privileged, or otherwise does not need to be produced to the other side?

If I sue my mechanic because he screwed up my transmission repair, the process is pretty simple. I will bring bills, receipts, and other stuff I think is relevant to my lawyer, my mechanic will do the same, our attorneys will examine the documents, determine a case strategy, produce responsive evidence to the other side, perhaps conduct some depositions, and – in real life – a settlement offer will likely be negotiated. In a case like this, there are probably only one or two people who have responsive information, there isn’t much information, and the information is pretty simple.

Now, what happens if 10,000 people want to sue a vehicle manufacturer because their cars seemingly have a habit of accelerating on their own, causing damage, loss, and even death? In a case like this, the process of finding, selecting, and producing responsive information will likely be a multi-year effort costing millions of dollars. The most expensive part of this process has traditionally been the review process. Which of the millions of email messages the manufacturer has in its email archive are related to the case? Which CAD drawings? Which presentations that management used to drive key quality control decisions? Which server logs?

Before we got smart and started applying smart software to this problem, the process was linear, i.e., we made broad cuts based on dates, custodians, departments etc. and then human reviewers – expensive attorneys in fact – would look at each document and make a classification decision. The process was slow, incredibly expensive, and not necessarily that accurate.

Today, we have the option to apply software to the problem. Software that is based on well-known, studied and widely used algorithms and statistical models. Software that, used correctly, can defensibly bring massive time and cost savings to the e-discovery problem. (There are many sources of the current state of case law on predictive coding, such as this.)  Predictive coding software, for example, uses a small set of responsive documents to train the coding engine to find similar documents in the much larger document pool. The results can be validated through sampling and other techniques, but the net result is that the right documents can potentially be found much more quickly and cheaply.

Of course predictive coding is just a class of technology. It is a tool. An instrument. And, as many aspiring rock gods have learned, owning a vintage Gibson Les Paul and a Marshal stack will not in and of itself guarantee that your rendition of Stairway to Heaven at open mic night will, like, change the world, man.

So why did I go to the Predictive Coding Bootcamp? I went because I believe that Information Governance will only be made real when we find a way to apply the technologies and techniques of predictive coding to IG. In other words, to the continuous, day-to-day management of business information. Here’s why:

Human classification of content at scale is a fantasy.

I have designed, implemented, and advocated many different systems for human-based classification of business records at dozens of clients over the last decade. In some limited circumstances, they do work, or at least they improve upon an otherwise dismal situation. However, it has become clear to me (and certainly others) that human based-classification methods alone will not solve this problem for most organizations in most situations moving forward. Surely by now we all understand why. There is too much information. The river is flowing too quickly, and the banks have gotten wider. Expecting humans to create dams in the river and siphon of the records is frankly, unrealistic and counterproductive.

Others have come to the same conclusion. For example, yesterday I was discussing this concept with Bennett B. Borden (Chair of the Information Governance and eDiscovery practice at Drinker Biddle & Reath) at the MER Conference in Chicago, where he provided the opening keynote.  Here’s what Bennett had to say:

“We’ve been using these tools for years in the e-discovery context. We’ve figured out how to use them in some of the most exacting and high-stakes situations you can imagine.  Using them in an IG context is an obvious next step and quite frankly probably a much easier use case in some ways. IG does present different challenges, but they are primarily challenges of corporate culture and change management, rather than legal or technical challenges.”

The technology has been (and continues to be) refined in a high-stakes environment.

E-discovery is often akin to gladiatorial combat. It is often conducted under incredible time pressures, with extreme scrutiny of each decision and action by a both and enemy and a judge.  The context of IG in most organizations is positively pastoral by comparison. Yes, there are of course enormous potential consequences for failure in IG, but most organizations have wide legal latitude to design and implement reasonable IG programs as they see fit. Records retention schedules and policies, for example, are rarely scrutinized by regulators outside of a few specific industries.

I recently talked about this issue with Dean Gonsowski, Associate General Counsel at Recommind. Recommind is a leader in predictive coding software for the e-discovery market and is now turning its attention to the IG market in a serious way. Here’s what Dean had to say:

“E­-discovery is the testing ground for cutting-edge information classification technology. Predictive coding technology has been intensively scrutinized by the bench and the bar. The courts have swung from questioning if the process was defensible to stating that legal professionals should be using it. The standard in IG is one of reasonableness, which may be a lower standard than the one you must meet in litigation.”

There is an established academic and scientific community.

The statistical methods, algorithms, and other techniques embodied by predictive coding software are the product of a mature and developing body of academic research and publishing. The science is well-understood (at least by people much, much smarter that me). TREC is a great example of this. It is a program sponsored by the US government and overseen by a program committee consisting of representatives from government, industry, and academia. It conducts research and evaluation of the tools and techniques at the heart of predictive coding. The way that this science is implemented by the software vendors who commercialize it varies widely, so purchasers must learn to ask intelligent questions.TREC and other groups help with this as well.

I will soon be writing more about the application of predictive coding technology to IG, but today I wanted to provide an introduction to the concept and the key reasons why I think it points the way forward to IG. Let me know your thoughts.

What is Big Data to Information Governance Professionals?

Spring is in the air in New York City. Here’s a picture of the beautiful Magnolia trees in Prospect Park. Magnolia Trees in Prospect Park

Last week, I was pleased to help lead the discussion at The Cowen Group’s Leadership Breakfast in Manhattan. I’ve been spending a lot of time thinking and writing about Big Data lately, and jumped at the chance to hear what this community was thinking about it. Then, this week we did it again in Washington, DC.

It was a great group of breakfasters – predominantly law firm attendees, with a mix of in-house lawyers, consultants, and at least one journalist. The discussion was fast ride through a landscape of emotional responses to Big Data: excitement, skepticism, curiosity, confusion, optimism, confusion, and ennui. Just like every other discussion I have had about Big Data.

We spent a lot of time talking about what, exactly, Big Data is. The problem with this discussion is that, like most technology marketing terms, it can mean something or nothing at all. How can a bunch of smart people having breakfast in the same room one morning be expected to define Big Data when the people who are paid to create such definitions leave us feeling . . .  confused?

Here’s how Gartner defines Big Data:

Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.

 Here’s how McKinsey defines it:

‘Big data’ refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze. This definition is intentionally subjective . . .

Forrester:

Big Data is the frontier of a firm’s ability to store, process, and access (SPA) all the data it needs to operate effectively, make decisions, reduce risks, and serve customers.

Huh? No wonder we were confused as we scarfed our bacon and eggs.

Big Data is a squishy term, and for lawyers without a serious technology or data science background it is even squishier.

The concepts behind it are not new. However, there are some relatively new elements. One is the focus on unstructured data (e.g., documents, email messages, social media) instead of data stored in enterprise databases (the traditional focus of “Business Intelligence.) Two is the technologies that store, manage, and process data in a way that is not just incrementally better, bigger, or faster, but that are profoundly different (new file systems; aggregating massive pools of unstructured data instead of databases; storage on cheap connected hard drives, etc.). Three is newly commercialized tools and methods for performing analysis on these pools of unstructured data (even data that you don’t own) to draw business conclusions. There is a lot of skepticism about the third point – specifically about the ease with which truly insightful and accurate predictions can be generated from Big Data. Even Nate Silver –  famous for accurately predicting the outcome of the 2012 US Presidential Election with data – cautions that even though data is growing exponentially, the “amount of useful information almost certainly isn’t.” Also, correlative insights often get sold as causative insights.

Big Data is a lot of things to a lot of people. But what is it to e-discovery professionals? I think there are three pieces to the Big Data discussion that are relevant for this community.

  1. Is Data Good or Bad? In the world of Big Data, all data is good and more data is better. A well-known data scientist was recently quoted in the New York Times as saying, “Storing things is cheap. I’ve tended to take the attitude, ‘Don’t throw electronic things away.” To a data scientist this makes sense. After all, statistical analysis gets better with more (good) data. However, e-discovery professionals know that storage is not cheap when its full potential lifecyle is calculated, such as a company spending “$900,000 to produce an amount of data that would consume less than one-quarter of the available capacity of an ordinary DVD.” Data itself is of course neither good or bad, but e-discovery professional need to help Big Data proponents understand that data most definitely can have a downside. I wrote about this tension extensively here.

  2. Data Analytics for E-Discovery. Though not often talked about, I believe there is serious potential for some parties in the e-discovery process to analyse the data flowing through its process and to monetize that analysis. What correlations could a smart data scientist investigate between the nature of the data collected and produced across multiple cases and their outcomes and costs. Could useful predictions be made? Could e-discovery processes be improved and routinized? I have some idea, but no firm answers. We should dig into this further, as a community.

  3. Privacy and Accessibility. What does “readily available” mean in our age — an age where a huge chunk of all human knowledge can be accessed in seconds using a device you carry around in your pocket? Does better access to information simply offer speed and convenience, or does it offer something more profound? When a local newspaper posted the names and addresses of gun permit holders on an interactive map in the wake of the Sandy Hook Elementary School shooting, there was a huge outcry –  despite the fact that this information is publicly available, by law. This is a critical emerging issue as the pressure to consolidate and mine unstructured information to gain business insight collides with expectations of privacy and confidentiality.

Simply put legal and ediscovery professionals need to be at the table when Big Data discussions are happening. They bring a critical perspective that no one else offers.

Monica Bay provided an overview of the event, artfully putting it in context of what is going on across the legal industry.

By the way, my article about accessing and getting rid of information in the Big Data era has been syndicated to the National Law Journal, under the title, “Data’s Dark Side, and How to Live With It.” Check it out here. You can also check out my podcast discussion with Monica Bay about the article here.

Video: The Leadership Vacuum in Information Governance

As regular readers know, one of my favorite topics is the leadership vacuum in information governance. Who really is steering the ship in most organizations? Is it the CIO? It it the legal department? Is it a new leader like a Chief Digital Officer? This is a critical question, and you should be asking it. I provide some further thoughts on this topic in the video below –  check it out.

5 Questions about Information Governance in 5 Minutes: What’s Your Favorite Information Governance Story?

Here is the fifth and final (except for a bonus video coming soon) in our five-part video series where I asked 30 Information Governance the same 5 questions. This video is the longest of the five, as I ask our interviewees to tell us their favorite story about IG –  something that illustrates what it is, why it is hard, challenges they have faced and so on. There are some great stories, so get yourself a fresh cup of coffee and a snack and enjoy.