I recently provided a briefing to a group of e-discovery professionals about Big Data and why it matters to them, and I thought there might be some value in sharing my notes.
1. What is Big Data?
Gartner: Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision-making.
McKinsey: ‘Big data’ refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze. This definition is intentionally subjective . . .
It is subjective, but has definable elements
The data itself: large, unstructured information
The infrastructure: “Internet scale” in the enterprise
The analysis: Asking questions using very large data sets
2. Why Does Big Data Matter to E-Discovery Professionals?
Data scientists and technologists do not understand the risk side of information
You need to be at the table to educate them on:
The legal and business value of deleting information
The privacy requirements and implications
E-Discovery implications of too much data
The technologies of Big Data may process and manipulate information in a way that affects its accessibility and evidentiary value – you need to be aware of this and guide your clients appropriately
3. Does Big Data offer value to the legal community?
Performing sophisticated analysis on large pools of data is not exclusive to any particular industry – there is no reason it could not be applied to the legal community (and already is being used in some limited ways)
Relatively speaking, most law firms do not generate massive amounts of data in their day-to-day operations
In e-discovery, the technology innovations of Big Data could be helpful in very large cases to help with storage and processing tasks
4. What are some examples of Big Data in action?
President Obama’s data-driven election campaign.
An online travel company showing more expensive travel options to those who used higher-priced Macintosh computers to access its website.
Fraud Detection: Targeting $3.5 trillion in fraud from banking, healthcare, utilities, and government.
The City of New York finding those responsible for dumping cooking oil and grease into the sewers by analysing data from the Business Integrity Commission, a city agency that certifies that all local restaurants have a carting service to haul away their grease. With a few quick calculations, comparing restaurants that did not have a carter with geo-spatial data on the sewers, they generated a list of statistically likely suspects to track down dumpers with a 95% success rate.
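To make the New York example a little more concrete, here is a minimal Python sketch of the general idea: filter to restaurants with no carter on file, then rank them by how many clogged sewer points sit nearby. The city's actual method is not described here, so everything in this block (the record layout, the coordinates, the radius, and the function names) is a hypothetical illustration, not their calculation.

```python
import math

# Hypothetical records: (restaurant name, has a registered carter, x, y).
# Coordinates are made-up grid positions, not real geography.
restaurants = [
    ("Restaurant A", True, 0.0, 0.0),
    ("Restaurant B", False, 1.0, 1.0),
    ("Restaurant C", False, 9.0, 9.0),
]

# Hypothetical locations of grease-clogged sewer segments.
clogged_sewers = [(1.2, 0.9), (0.8, 1.1), (5.0, 5.0)]


def nearby_clogs(x, y, clogs, radius):
    """Count clogged sewer points within `radius` of position (x, y)."""
    return sum(1 for cx, cy in clogs if math.hypot(cx - x, cy - y) <= radius)


def likely_suspects(restaurants, clogs, radius=1.0):
    """Rank restaurants without a carter by the number of clogs near them."""
    suspects = [
        (nearby_clogs(x, y, clogs, radius), name)
        for name, has_carter, x, y in restaurants
        if not has_carter
    ]
    # Highest clog count first.
    return sorted(suspects, reverse=True)


suspects = likely_suspects(restaurants, clogged_sewers)
for clog_count, name in suspects:
    print(f"{name}: {clog_count} clogged sewer points nearby")
```

The point is only that two ordinary data sets, joined on location, can produce a short list worth a human inspector's time.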
5. What professional and career opportunities does Big Data represent for e-discovery professionals?
Organizations need people who understand the risk side of the equation and who can provide practical guidance
Your clients may have Big Data projects that right now, today, are creating unmonitored, unmitigated risk; you need to be able to help them identify and manage that risk
Big Data focuses on unstructured information, i.e., the documents, email messages and other information that the e-discovery community knows well. These same skills and techniques can be very useful to business-led Big Data projects.
Last week I attended a “Predictive Coding Boot Camp” produced by the E-Discovery Journal and presented by Karl Schieneman of Review Less and Barry Murphy. I’ve participated in many workshops, seminars, discussions, and webinars on the topic, but this half-day seminar went the deepest of any of them into the legal, technology, and case strategy implications of using technology to minimize the cost of human document review in e-discovery. It was a solid event.
(But, I wasn’t there to learn about e-discovery. I’ll tell you why I was there in a moment.)
You see how I snuck in an implied definition above? Because, whatever you call it – predictive coding, technology-assisted review, computer-assisted review, or magic – isn’t that the problem that we are trying to solve? To defensibly reduce the number of documents that a human needs to review during e-discovery? There are a number of ways to get there using technology, but the goal is the same.
What does e-discovery have to do with IG?
To review, in civil litigation, both sides have an obligation to produce information to the other side that is potentially relevant to the lawsuit. In the old days, this was mostly a printing, photocopying, and shipping problem. Today it is primarily a volume, complexity, and cost problem. Although discovery of physical evidence and paper records is obviously still part of the process, electronic evidence naturally dominates.
So, how does a litigant determine whether a given document is potentially relevant and must be produced, or if it is irrelevant, privileged, or otherwise does not need to be produced to the other side?
If I sue my mechanic because he screwed up my transmission repair, the process is pretty simple. I will bring bills, receipts, and other stuff I think is relevant to my lawyer, my mechanic will do the same, our attorneys will examine the documents, determine a case strategy, produce responsive evidence to the other side, perhaps conduct some depositions, and – in real life – a settlement offer will likely be negotiated. In a case like this, there are probably only one or two people who have responsive information, there isn’t much information, and the information is pretty simple.
Now, what happens if 10,000 people want to sue a vehicle manufacturer because their cars seemingly have a habit of accelerating on their own, causing damage, loss, and even death? In a case like this, the process of finding, selecting, and producing responsive information will likely be a multi-year effort costing millions of dollars. The most expensive part of this process has traditionally been the review process. Which of the millions of email messages the manufacturer has in its email archive are related to the case? Which CAD drawings? Which presentations that management used to drive key quality control decisions? Which server logs?
Before we got smart and started applying smart software to this problem, the process was linear, i.e., we made broad cuts based on dates, custodians, departments, etc., and then human reviewers – expensive attorneys in fact – would look at each document and make a classification decision. The process was slow, incredibly expensive, and not necessarily that accurate.
Today, we have the option to apply software to the problem. Software that is based on well-known, studied and widely used algorithms and statistical models. Software that, used correctly, can defensibly bring massive time and cost savings to the e-discovery problem. (There are many sources of the current state of case law on predictive coding, such as this.) Predictive coding software, for example, uses a small set of responsive documents to train the coding engine to find similar documents in the much larger document pool. The results can be validated through sampling and other techniques, but the net result is that the right documents can potentially be found much more quickly and cheaply.
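For readers who want to see the shape of the idea, here is a minimal Python sketch of "train on a small responsive set, then rank the larger pool." It uses a plain bag-of-words cosine similarity, which I have swapped in purely for illustration; commercial predictive coding tools use far more sophisticated models. The documents, function names, and scoring approach are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a document and split it into word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train_profile(seed_documents):
    """Build one term-frequency profile from the reviewer-coded seed set."""
    profile = Counter()
    for doc in seed_documents:
        profile.update(tokenize(doc))
    return profile


def cosine_similarity(a, b):
    """Cosine similarity between two term-frequency Counters (0.0 to 1.0)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(profile, pool):
    """Score every document in the pool against the profile, best first."""
    scored = [(cosine_similarity(profile, Counter(tokenize(doc))), doc) for doc in pool]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical seed set: documents a reviewer already coded as responsive.
seed = [
    "Throttle sensor fault caused unintended acceleration during testing",
    "Engineering memo on acceleration complaints and throttle software",
]

# Hypothetical pool of unreviewed documents.
pool = [
    "Lunch menu for the cafeteria this week",
    "Customer reports sudden acceleration; throttle sensor suspected",
    "Quarterly parking permit renewal reminder",
]

ranked = rank_documents(train_profile(seed), pool)
for score, doc in ranked:
    print(f"{score:.2f}  {doc}")
```

In this toy pool, the customer complaint about sudden acceleration rises to the top and the cafeteria menu and parking reminder score zero. A real workflow would then sample from both the high-scoring and low-scoring documents and have a human check them, which is the validation step mentioned above.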
Of course, predictive coding is just a class of technology. It is a tool. An instrument. And, as many aspiring rock gods have learned, owning a vintage Gibson Les Paul and a Marshall stack will not in and of itself guarantee that your rendition of Stairway to Heaven at open mic night will, like, change the world, man.
So why did I go to the Predictive Coding Bootcamp? I went because I believe that Information Governance will only be made real when we find a way to apply the technologies and techniques of predictive coding to IG. In other words, to the continuous, day-to-day management of business information. Here’s why:
Human classification of content at scale is a fantasy.
I have designed, implemented, and advocated many different systems for human-based classification of business records at dozens of clients over the last decade. In some limited circumstances, they do work, or at least they improve upon an otherwise dismal situation. However, it has become clear to me (and certainly others) that human-based classification methods alone will not solve this problem for most organizations in most situations moving forward. Surely by now we all understand why. There is too much information. The river is flowing too quickly, and the banks have gotten wider. Expecting humans to create dams in the river and siphon off the records is, frankly, unrealistic and counterproductive.
Others have come to the same conclusion. For example, yesterday I was discussing this concept with Bennett B. Borden (Chair of the Information Governance and eDiscovery practice at Drinker Biddle & Reath) at the MER Conference in Chicago, where he provided the opening keynote. Here’s what Bennett had to say:
“We’ve been using these tools for years in the e-discovery context. We’ve figured out how to use them in some of the most exacting and high-stakes situations you can imagine. Using them in an IG context is an obvious next step and quite frankly probably a much easier use case in some ways. IG does present different challenges, but they are primarily challenges of corporate culture and change management, rather than legal or technical challenges.”
The technology has been (and continues to be) refined in a high-stakes environment.
E-discovery is often akin to gladiatorial combat. It is often conducted under incredible time pressures, with extreme scrutiny of each decision and action by both an enemy and a judge. The context of IG in most organizations is positively pastoral by comparison. Yes, there are of course enormous potential consequences for failure in IG, but most organizations have wide legal latitude to design and implement reasonable IG programs as they see fit. Records retention schedules and policies, for example, are rarely scrutinized by regulators outside of a few specific industries.
I recently talked about this issue with Dean Gonsowski, Associate General Counsel at Recommind. Recommind is a leader in predictive coding software for the e-discovery market and is now turning its attention to the IG market in a serious way. Here’s what Dean had to say:
“E-discovery is the testing ground for cutting-edge information classification technology. Predictive coding technology has been intensively scrutinized by the bench and the bar. The courts have swung from questioning if the process was defensible to stating that legal professionals should be using it. The standard in IG is one of reasonableness, which may be a lower standard than the one you must meet in litigation.”
There is an established academic and scientific community.
The statistical methods, algorithms, and other techniques embodied by predictive coding software are the product of a mature and developing body of academic research and publishing. The science is well-understood (at least by people much, much smarter than me). TREC is a great example of this. It is a program sponsored by the US government and overseen by a program committee consisting of representatives from government, industry, and academia. It conducts research and evaluation of the tools and techniques at the heart of predictive coding. The way that this science is implemented by the software vendors who commercialize it varies widely, so purchasers must learn to ask intelligent questions. TREC and other groups help with this as well.
I will soon be writing more about the application of predictive coding technology to IG, but today I wanted to provide an introduction to the concept and the key reasons why I think it points the way forward to IG. Let me know your thoughts.
Early this year I was lucky enough (thanks to a great sponsor) to carve out some significant research and writing time to answer a complicated (and maybe even complex) set of questions: what does unstructured information really cost? How do we answer this question? Which kinds of costs should be included in the answer? Can we use this answer to drive desirable Information Governance behaviors?
I looked at existing models for structured data, studied the emerging Big Data market, talked to clients and experts, and developed some answers to these questions that I think are actually pretty novel. You can download the entire paper here now (at the website of Nuix, the sponsor), and you can also follow along here as I discuss some of the key ideas and findings over the next few weeks.
A PowerPoint slide (with notes) is available for download here: IG PowerPoint Slide of the Day from Barclay T Blair-10 Factors Driving Unstructured Information Cost. If you do use it, I would appreciate you letting me know how and where.
Unstructured information is ubiquitous. It is typically not the product of a single-purpose business application. It often has no clearly defined owner. It is endlessly duplicated and transmitted across the organization. Determining where and how unstructured information generates cost is difficult.
However, it is possible. Our research shows that there are at least ten key factors that drive the total cost of owning unstructured information. These ten factors identify where organizations typically spend money throughout the lifecycle of managing unstructured information. These factors are listed in Figure 1, along with examples of elements that typically increase cost (“Cost Drivers,” on the left side) and elements that typically reduce costs (“Cost Reducers,” on the right hand side).
- E-Discovery. Finding, processing, and producing information to support lawsuits, investigations and audits. Unstructured information is typically the most common target in e-discovery, and a poorly managed information environment can add millions of dollars in cost to large lawsuits. Simply reviewing a gigabyte of information for litigation can cost $14,000.[i]
- Disposition. Getting rid of information that no longer has value because it is duplicate, out of date, or has no value to the business. In poorly managed information environments, just “separating the wheat from the chaff” can cost large organizations millions of dollars. For enterprises with frequent litigation, the possibility of throwing away the wrong piece of information only increases risk and cost. Better management and smart information governance tools drive costs down.
- Classification and Organization. Keeping unstructured information organized so that employees can use it. Also necessary so management rules supporting privacy, privilege, confidentiality, retention, and other requirements can be applied.
- Digitization and Automation. Many business processes continue to be a combination of digital, automated steps and paper-based, manual steps. Automating and digitizing these processes requires investment, but also can drive significant returns. For example, studies have shown that automating Accounts Payable “can reduce invoice processing costs by 90 percent.”[ii]
- Storage and Network Infrastructure. The cost of the devices, networks, software, and labor required to store unstructured information. Although the cost of the baseline commodity (i.e., a gigabyte of storage space) continues to fall, for most organizations overall volume growth and complexity means that storage budgets go up each year. For example, between 2000 and 2010, organizations more than doubled the amount they spent on storage-related software even though the cost of raw hard drive space dropped by almost 100 times.[iii]
- Information Search, Access, and Collaboration. The cost of hardware, software, and services designed to ensure that information is available to those who need it, when they need it. This typically includes enterprise content management systems, enterprise search, case management, and the infrastructure necessary to support employee access and use of these systems.
- Migration. The cost of moving unstructured information from outdated systems to current systems. In poorly-managed information environments, the cost of migration can be very high – so high that some organizations maintain legacy systems long after they are no longer supported by the vendor just to avoid (more likely, to simply defer) the migration cost and complexity.
- Policy Management and Compliance. The cost of developing, implementing, enforcing, and maintaining information governance policies on unstructured information. Good policies, consistently enforced, will drive down the total cost of owning unstructured information.
- Discovering and Structuring Business Processes. The cost of identifying, improving, and routinizing business processes that are currently ad hoc and disorganized. Typical examples include contract management and accounts receivable as well as revenue-related activities such as sales and customer support. Moving from informal, email and document-based processes to fixed workflows drives down cost.
- Knowledge Capture and Transfer. The cost of capturing critical business knowledge held at the department and employee level and putting that information in a form that enables other employees and parts of the organization to benefit from it. Examples include intranets and their more contemporary cousins such as wikis, blogs, and enterprise social media platforms.
[i] Nicholas M. Pace, Laura Zakaras, “Where the Money Goes: Understanding Litigant Expenditures for Producing Electronic Discovery,” RAND Institute for Civil Justice, 2012. Online at http://www.rand.org/content/dam/rand/pubs/monographs/2012/RAND_MG1208.pdf
[ii] “A Detailed Guide to Imaging and Workflow ROI,” The Accounts Payable Network, 2010
“As soon as IT sets it up so that people can self provision and create these new sites, it’s always amazing to see how it proliferates . . .”
Bill Gates, speech at the first Microsoft SharePoint Conference, May 15, 2006
One of the key attractions of SharePoint – for IT at least – is the ease with which users can set up and use SharePoint sites with little to no involvement from IT. While this may drive adoption of the product and reduce the burden on IT departments, it can make IG more challenging, as sites can be set up with little or no enterprise control or insight into the information.
This is the role of SharePoint governance – the rules and processes organizations must adopt to ensure that they are leveraging the strengths of SharePoint, but also maximizing the value – and minimizing the risk – associated with the information within SharePoint.
We cover this concept of SharePoint governance in the latest entry in our OpenText Executive Brief series. Click here to download the new brief from the OpenText website.
Click here for more information about the series, and for links to the other Briefs in the series.