Girding For Battle: A Clash Is Brewing Between Big Data and E-Discovery

In late 2012, I was honored to provide a feature editorial for Law Technology News, a fine publication helmed by Monica Bay. You can read it online here (with free registration) or you can read it in full below.

Girding for Battle: A clash is brewing between Big Data and e-discovery

When was the last time you sat at your computer and deleted old files? Yesterday? Never? Don’t remember? Before today’s ubiquitous search engines, there was practical value in being a filer rather than a piler — it was difficult to find a document in a filing cabinet without an index.

Today’s sophisticated search engines obviate the need to manually index. Search technology is wonderful if we know what we are looking for, but is it an information management panacea? Information is growing at an astonishing rate, so much so that the numbers used to communicate growth projections are now so huge that they are almost meaningless.

Until recently, this unfettered growth was generally viewed as hazardous. It drives up storage costs, makes it difficult to find the wheat among the chaff, and increases electronic data discovery risk and cost, the argument goes. The resulting mantra: “We need to categorize it, control it, and clean it up!” Companies have spent decades paralyzed by a near inability to adapt modernist paper records management programs to decidedly postmodern information systems. Today, no part of the organization (including IT) exerts centralized command-and-control over data, and we have yet to find an easy replacement for the file clerk.

Enter Big Data, where uncontrollable information growth is no longer viewed as evil, or even a necessary evil. In the Big Data world, system administrators now treat bursting databases and file shares not as a shameful secret shared sotto voce in committee meetings, but as something to brag about. In Big Data, information has no downside. It is exalted in Davos, where the World Economic Forum recently “declared data a new class of economic asset, like currency or gold.” It’s been profiled by The New York Times. Proponents call it “the new oil,” proclaiming it presents the biggest opportunities since the dawn of the internet.

So why does Big Data matter to the legal community? Because it heralds a new battle, over a single question: Should we keep the information we create forever, or should we throw some of it away? The answer used to be simple: it was not feasible to keep everything. The cost was too high, the effort too great. Overburdened systems fail. Information overload reduces productivity. Data must be migrated from old to new systems, with great difficulty and expense.

The chance that you might have a smoking gun buried in the data creates too high a risk of liability. After all, if we learned one lesson from the seminal EDD cases metastasizing from the bankruptcies of Enron (Andersen v. U.S., 544 U.S. 696, 704 (2005)) and Sunbeam (Coleman (Parent) Holdings, Inc. v. Morgan Stanley & Co., Inc., No. 502003CA005045XXOCAI (Fla. Cir. Ct., March 1, 2005)), it is that data skeletons in the closet can be spooky.

But Big Data changes the calculus. Apache Hadoop, open-source software modeled on techniques Google published for processing data at internet scale and developed extensively at Yahoo, brings that scale and speed to just about any organization, and it can be run on cheap, off-the-shelf disk drives. Tools to analyze the data (some first commercialized in EDD) are accessible and powerful, promising profound new business and societal insights drawn from the vast pools of data. The fundamental promise of Big Data is that it enables insights into business (and the world) that were not possible before. Proponents see Big Data creating a better world, one fulfilling the promise of the internet itself.

But Big Data advocates downplay the downsides of data, and specifically, the EDD challenges. In the near-Nirvana contemplated by some Big Data proponents, all data is good and more data is better. In EDD, the opposite is usually true.

A recent study by the Pew Research Center about the future of Big Data was positive overall, but acknowledged concerns related to privacy, social control, misinformation, civil rights abuses, and the possibility of simply being overwhelmed by the deluge of data. Within legal, the burden of finding, processing, and producing Big Data in EDD is a foreign concept to most Big Data advocates. Perhaps this is because the Big Data enthusiasm cycle has not yet reached the “trough of disillusionment” where the hype faces the reality of corporate culture and complex legal and compliance requirements.

Records management doctrines specify that organizations should clearly define the business or legal purpose of a piece of information when it is created. That analysis determines whether, for how long, and in what form the data should be kept. Records retention schedules are intended to provide a measure of defensibility against spoliation claims, as they evince an intent to delete a record based on a proactive and standardized calculation of its value, rather than a reactive determination based on fears about bad evidence. Many organizations have attempted to play records management catch-up in advance of pending litigation and have paid the price.

Big Data advocates argue that the economies of scale now make it feasible and desirable to capture and store information that currently has no clear or definable business value. Although large organizations have long collected and analyzed data (using business intelligence software), proponents argue that Big Data is different. They posit that cheaper storage and technical innovations make it easier and faster than ever before to analyze that data, eliminating the need to identify the business purpose of data before it is collected and retained.

With Big Data, no rigid “schema” or organizational approach is necessary before capturing content (unlike in a traditional database). Data professionals now (or in the future) can ask open-ended questions of the data. That includes questions that may not be germane now, but may be critical in an unpredictable future.

As a result, more data will be kept longer, in a manner that is unmoored from records management tenets. Without a doubt, this philosophy will complicate the governance and e-discovery of data.


So, when was the last time you sat down in front of your computer and deleted old files? In the world of Big Data, this is not only unnecessary, it’s undesirable. And it’s a waste of time.

Should we keep everything forever? Absolutely not. Too much information still has a downside. It is a liability, as well as an asset. Information has risk. Information has real, unavoidable legal and regulatory requirements. Information has a bite that Big Data proponents ignore at their peril.

But the good news: The same tools and infrastructure that empower the potentially profound insights of Big Data can and should be employed to help organizations make informed decisions about data retention. A vast amount of unstructured data in many organizations (over half, according to some studies) is duplicate, outdated, transitory junk that has no business value. Getting rid of this information en masse, without dragging every employee into the process, is now possible.
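As a rough illustration of how an automated first pass at finding duplicate and outdated files might look, here is a minimal Python sketch. It is not drawn from any particular product: the function names and the `max_age_days` threshold are hypothetical, and the seven-year default is only a placeholder, not a recommended retention period. It treats files as duplicates only when their contents are byte-for-byte identical, and treats a file as "outdated" purely by its last-modified time.

```python
import hashlib
import os
import time
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_cleanup_candidates(root, max_age_days=365 * 7, now=None):
    """Report exact duplicates and old files under `root`.

    max_age_days is a hypothetical placeholder; a real cutoff would come
    from the organization's retention schedule.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    by_digest = defaultdict(list)  # content hash -> paths with that content
    stale = []                     # paths last modified before the cutoff
    for folder, _subdirs, names in os.walk(root):
        for name in sorted(names):
            path = os.path.join(folder, name)
            if not os.path.isfile(path):
                continue
            by_digest[file_digest(path)].append(path)
            if os.path.getmtime(path) < cutoff:
                stale.append(path)
    # Keep only groups where two or more files share identical content.
    duplicates = [sorted(paths) for paths in by_digest.values() if len(paths) > 1]
    return duplicates, stale
```

Calling `find_cleanup_candidates("/path/to/share")` returns a list of duplicate groups and a list of stale files. The output is only a list of candidates for review: whether anything can actually be deleted still depends on the retention schedule and on any legal holds in effect.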

E-discovery is the place where the cost of information management myopia becomes painfully visible, and is why EDD has consistently driven innovation in handling and understanding vast amounts of data. However, even with these innovations, the risk and cost of information in EDD is undeniable, and is correlated to the overall volume of information in the organization.

These are the contours of the coming battle between Big Data and e-discovery. It is a philosophical and cultural battle. It is the responsibility of EDD and information governance attorneys and practitioners to gird themselves for this battle. Learn about Big Data, and inform the discussion and decisions in your organization.

Reprinted with permission from Law Technology News. Further duplication prohibited.

Common Big Data Use Cases

  • Sentiment analysis. Analyzing sentiment on social media networks in order to improve marketing campaigns and customer service programs.
  • Fraud Detection. Analyzing transactions for patterns and events that may indicate fraud (familiar to anyone who has received a phone call from their credit card company when first using the card outside their home country).
  • Retail pricing optimization. Setting the price of a product based on sophisticated analysis of purchasing patterns, customer demographics, and geographic demand variations.

Who should you talk to?

Big Data projects are likely being planned in your organization, or your client’s organization right now. Here are some people and places to pay attention to:

  • Marketing and customer service. A common real-world application of Big Data techniques today is social media sentiment analysis. These programs are typically driven by marketing or customer service groups.
  • IT: Information Security. The IT professionals responsible for information security may already be collecting and analyzing log files from the hundreds or thousands of devices that generate them in the company. This may not technically be a Big Data project yet, but find out what their plans are for correlation with other data sources that may give rise to privacy and other concerns.
  • Data scientists and analysts. If your organization is currently hiring data scientists or analysts, there is a good chance that Big Data projects are ongoing. Find out who these people are and learn about their plans. Not only is their work typically very interesting, it may also have serious legal and regulatory implications related to retention, privacy, and e-discovery.

Chart: Drawing the Battle Lines

Primary Postures

| Factor | Big Data | Information Governance, E-Discovery |
|---|---|---|
| Primary motivation | Business value | Legal risk |
| Prevalent attitude towards information | More data is an opportunity | More data is expensive and risky |
| Information type focused on | Databases, moving towards unstructured information | Documents, email, and unstructured information, moving towards databases |
| Bleeding edge analysis | How much is a piece of data worth? | Is this a future smoking gun? |
| Biggest potential downside | Unintended consequences of analysis (e.g., civil rights violations); cost in litigation | Throwing away documents that in aggregate reveal valuable business insight |


Author: Barclay T. Blair 

