Last week, I was pleased to help lead the discussion at The Cowen Group’s Leadership Breakfast in Manhattan. I’ve been spending a lot of time thinking and writing about Big Data lately, and jumped at the chance to hear what this community was thinking about it. Then, this week we did it again in Washington, DC.
It was a great group of breakfasters – predominantly law firm attendees, with a mix of in-house lawyers, consultants, and at least one journalist. The discussion was a fast ride through a landscape of emotional responses to Big Data: excitement, skepticism, curiosity, confusion, optimism, confusion, and ennui. Just like every other discussion I have had about Big Data.
We spent a lot of time talking about what, exactly, Big Data is. The problem with this discussion is that, like most technology marketing terms, it can mean something or nothing at all. How can a bunch of smart people having breakfast in the same room one morning be expected to define Big Data when the people who are paid to create such definitions leave us feeling . . . confused?
Here’s how Gartner defines Big Data:
Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
Here’s how McKinsey defines it:
‘Big data’ refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze. This definition is intentionally subjective . . .
Big Data is the frontier of a firm’s ability to store, process, and access (SPA) all the data it needs to operate effectively, make decisions, reduce risks, and serve customers.
Huh? No wonder we were confused as we scarfed our bacon and eggs.
Big Data is a squishy term, and for lawyers without a serious technology or data science background it is even squishier.
The concepts behind it are not new. However, there are some relatively new elements. The first is the focus on unstructured data (e.g., documents, email messages, social media) instead of data stored in enterprise databases (the traditional focus of “Business Intelligence”). The second is the set of technologies that store, manage, and process data in ways that are not just incrementally better, bigger, or faster, but profoundly different (new file systems; aggregating massive pools of unstructured data instead of databases; storage on cheap connected hard drives, etc.). The third is newly commercialized tools and methods for performing analysis on these pools of unstructured data (even data that you don’t own) to draw business conclusions. There is a lot of skepticism about the third point – specifically about the ease with which truly insightful and accurate predictions can be generated from Big Data. Even Nate Silver – famous for accurately predicting the outcome of the 2012 US Presidential Election with data – cautions that even though data is growing exponentially, the “amount of useful information almost certainly isn’t.” Also, correlative insights often get sold as causative insights.
Big Data is a lot of things to a lot of people. But what is it to e-discovery professionals? I think there are three pieces to the Big Data discussion that are relevant for this community.
Is Data Good or Bad? In the world of Big Data, all data is good and more data is better. A well-known data scientist was recently quoted in the New York Times as saying, “Storing things is cheap. I’ve tended to take the attitude, ‘Don’t throw electronic things away.’” To a data scientist this makes sense. After all, statistical analysis gets better with more (good) data. However, e-discovery professionals know that storage is not cheap when the cost of its full potential lifecycle is calculated, as in the case of a company spending “$900,000 to produce an amount of data that would consume less than one-quarter of the available capacity of an ordinary DVD.” Data itself is of course neither good nor bad, but e-discovery professionals need to help Big Data proponents understand that data most definitely can have a downside. I wrote about this tension extensively here.
Data Analytics for E-Discovery. Though not often talked about, I believe there is serious potential for some parties in the e-discovery process to analyze the data flowing through it and to monetize that analysis. What correlations could a smart data scientist investigate between the nature of the data collected and produced across multiple cases and their outcomes and costs? Could useful predictions be made? Could e-discovery processes be improved and routinized? I have some ideas, but no firm answers. We should dig into this further, as a community.
Privacy and Accessibility. What does “readily available” mean in our age — an age where a huge chunk of all human knowledge can be accessed in seconds using a device you carry around in your pocket? Does better access to information simply offer speed and convenience, or does it offer something more profound? When a local newspaper posted the names and addresses of gun permit holders on an interactive map in the wake of the Sandy Hook Elementary School shooting, there was a huge outcry – despite the fact that this information is publicly available, by law. This is a critical emerging issue as the pressure to consolidate and mine unstructured information to gain business insight collides with expectations of privacy and confidentiality.
Simply put, legal and e-discovery professionals need to be at the table when Big Data discussions are happening. They bring a critical perspective that no one else offers.
By the way, my article about accessing and getting rid of information in the Big Data era has been syndicated to the National Law Journal, under the title, “Data’s Dark Side, and How to Live With It.” Check it out here. You can also check out my podcast discussion with Monica Bay about the article here.