Last week, I was pleased to help lead the discussion at The Cowen Group’s Leadership Breakfast in Manhattan. I’ve been spending a lot of time thinking and writing about Big Data lately, and jumped at the chance to hear what this community was thinking about it. Then, this week we did it again in Washington, DC.
It was a great group of breakfasters – predominantly law firm attendees, with a mix of in-house lawyers, consultants, and at least one journalist. The discussion was a fast ride through a landscape of emotional responses to Big Data: excitement, skepticism, curiosity, confusion, optimism, and ennui. Just like every other discussion I have had about Big Data.
We spent a lot of time talking about what, exactly, Big Data is. The problem with this discussion is that, like most technology marketing terms, it can mean something or nothing at all. How can a bunch of smart people having breakfast in the same room one morning be expected to define Big Data when the people who are paid to create such definitions leave us feeling . . . confused?
Here’s how Gartner defines Big Data:
Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
Here’s how McKinsey defines it:
‘Big data’ refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze. This definition is intentionally subjective . . .
And here's a third definition making the rounds:
Big Data is the frontier of a firm's ability to store, process, and access (SPA) all the data it needs to operate effectively, make decisions, reduce risks, and serve customers.
Huh? No wonder we were confused as we scarfed our bacon and eggs.
Big Data is a squishy term, and for lawyers without a serious technology or data science background it is even squishier.
The concepts behind it are not new. However, there are some relatively new elements. The first is the focus on unstructured data (e.g., documents, email messages, social media) instead of data stored in enterprise databases (the traditional focus of "Business Intelligence"). The second is technology that stores, manages, and processes data in ways that are not just incrementally better, bigger, or faster, but profoundly different (new file systems; aggregating massive pools of unstructured data instead of databases; storage on cheap connected hard drives; etc.). The third is newly commercialized tools and methods for performing analysis on these pools of unstructured data (even data that you don't own) to draw business conclusions. There is a lot of skepticism about the third point – specifically about the ease with which truly insightful and accurate predictions can be generated from Big Data. Even Nate Silver – famous for accurately predicting the outcome of the 2012 US Presidential Election with data – cautions that even though data is growing exponentially, the "amount of useful information almost certainly isn't." Also, correlative insights often get sold as causative insights.
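The "correlation sold as causation" worry has a statistical root: as the number of data series grows, the number of pairwise comparisons grows quadratically, so strong correlations appear by pure chance. A minimal sketch (illustrative numbers only, not from any study cited here):

```python
import random
import statistics

random.seed(42)

def correlation(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# 200 completely independent random series, 20 points each
series = [[random.gauss(0, 1) for _ in range(20)] for _ in range(200)]

# Count pairs that look "strongly correlated" (|r| > 0.6) by chance alone
spurious = sum(
    1
    for i in range(len(series))
    for j in range(i + 1, len(series))
    if abs(correlation(series[i], series[j])) > 0.6
)
print(f"Spurious 'strong' correlations among unrelated series: {spurious}")
```

None of these series has anything to do with any other, yet dozens of pairs look meaningfully related – exactly the trap Silver warns about.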
Big Data is a lot of things to a lot of people. But what is it to e-discovery professionals? I think there are three pieces to the Big Data discussion that are relevant for this community.
Is Data Good or Bad? In the world of Big Data, all data is good and more data is better. A well-known data scientist was recently quoted in The New York Times as saying, "Storing things is cheap. I've tended to take the attitude, 'Don't throw electronic things away.'" To a data scientist this makes sense. After all, statistical analysis gets better with more (good) data. However, e-discovery professionals know that storage is not cheap when its full potential lifecycle is calculated, such as a company spending "$900,000 to produce an amount of data that would consume less than one-quarter of the available capacity of an ordinary DVD." Data itself is of course neither good nor bad, but e-discovery professionals need to help Big Data proponents understand that data most definitely can have a downside. I wrote about this tension extensively here.
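The data scientist's "more data is better" intuition is easy to demonstrate: the error of a simple estimate shrinks roughly with the square root of the sample size. A minimal sketch (all numbers hypothetical):

```python
import random
import statistics

random.seed(0)

true_mean = 10.0

def estimate_error(n, trials=200):
    """Average absolute error of the sample mean over several trials."""
    errors = []
    for _ in range(trials):
        sample = [random.gauss(true_mean, 5.0) for _ in range(n)]
        errors.append(abs(statistics.mean(sample) - true_mean))
    return statistics.mean(errors)

small, large = estimate_error(10), estimate_error(1000)
print(f"avg error with n=10:   {small:.3f}")
print(f"avg error with n=1000: {large:.3f}")
```

The thousand-point sample pins down the true value roughly ten times more tightly than the ten-point sample – but note the statistician's premise holds only for *good* data, which is exactly where the e-discovery cost argument bites.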
Data Analytics for E-Discovery. Though not often talked about, I believe there is serious potential for some parties in the e-discovery process to analyze the data flowing through that process and to monetize the analysis. What correlations could a smart data scientist investigate between the nature of the data collected and produced across multiple cases and those cases' outcomes and costs? Could useful predictions be made? Could e-discovery processes be improved and routinized? I have some ideas, but no firm answers. We should dig into this further, as a community.
Privacy and Accessibility. What does “readily available” mean in our age — an age where a huge chunk of all human knowledge can be accessed in seconds using a device you carry around in your pocket? Does better access to information simply offer speed and convenience, or does it offer something more profound? When a local newspaper posted the names and addresses of gun permit holders on an interactive map in the wake of the Sandy Hook Elementary School shooting, there was a huge outcry – despite the fact that this information is publicly available, by law. This is a critical emerging issue as the pressure to consolidate and mine unstructured information to gain business insight collides with expectations of privacy and confidentiality.
Simply put, legal and e-discovery professionals need to be at the table when Big Data discussions are happening. They bring a critical perspective that no one else offers.
By the way, my article about accessing and getting rid of information in the Big Data era has been syndicated to the National Law Journal, under the title, “Data’s Dark Side, and How to Live With It.” Check it out here. You can also check out my podcast discussion with Monica Bay about the article here.
A few weeks ago, I mentioned that I was working on a new feature article for Law Technology News about how making more and more data “easily accessible” is both essential for Big Data to fulfill its promise and also a huge risk to privacy, intellectual property, and so on.
The promise of Big Data is based on a central assumption: that information will be easily, quickly, and cheaply available, on a grand scale. The plumbing of Big Data — the technology infrastructure — is designed to bring internet scale to enterprise data. Some of the surprising insights that data scientists hope to gain from Big Data analytics come from correlating information from disparate sources, in a context that was never imagined when the information was first created — such as correlating the type of computer used to book a trip with how much a traveler is willing to pay for a hotel room. Or using prescription drug history to screen health insurance applicants.
The problem of protecting privacy, intellectual property, and other rights will only grow more complex as our ability to access and process information becomes more sophisticated.
I also write about how these issues came to the forefront in the wake of the shooting tragedy at Sandy Hook Elementary School in Newtown, CT, and explore emerging technology that allows electronic content to “self-destruct.”
The article has now been published, and you can read it here (free registration required).
I was also interviewed about the article by Monica Bay, Editor-In-Chief of LTN, on Law Technology Now. You can listen to our discussion on the embedded podcast below.
Author: Barclay T. Blair
Update: Interesting article from NY Mag claiming that Snapchat is “absolutely blowing up right now” on Wall Street because “the chances of incriminating material ending up in the hands of a boss or a compliance officer – or in a Daily Intelligencer story, for that matter – are low.”
This weekend I was finishing up my next opinion piece for the fine Law Technology News. My piece is about how making more and more data “easily accessible” is both essential for Big Data to fulfill its promise and also a huge risk to privacy, intellectual property, and so on. Look for that in the next issue.
Part of what inspired me to write about this was the success of Snapchat, a mobile app that lets users “chat” using photographs instead of text. Neat idea, but the twist is that the images automatically disappear after 1-10 seconds (the time is set by the sender). As you would imagine, Snapchat has gained a reputation as a teenage sexting tool, despite some indications otherwise. I set it up to see what all the fuss was about, and cajoled my wife to install it as well. Frankly I would say that any service that automatically deletes any self-portrait I have taken after turning 40 is doing me a huge favor. Anyway, Snapchat was quickly copied by Facebook, with its Poke application, although Poke seems to be less popular than Snapchat to date.
I did some more digging around in this space, and it turns out there are a number of startups focused on so-called self-destructing messages. For example:
- Vaporstream offers “secure recordless messaging” technology aimed at enterprise users
- A startup involving Phil Zimmerman, crypto-hero and creator of PGP, called Silent Circle offers secure mobile voice and messaging, including “burn notices” for text messages
- Burn Note: self-destructing email
- Wickr: self-destructing texts, pictures, video
- Gryphn: self-destructing text messages, with screenshot capability disabled
- Privnote: web-based, self-destructing notes
- Tigertext: enterprise-focused secure texting with message timers
- Burner: temporary phone numbers for calling and texting (hat tip to Bill Potter at The Cowen Group for pointing me to the last two on this list)
The category of “disappearing email” has been around at least since the late 1990s. In that era, a company called “Disappearing Inc.” got a lot of attention, but was not successful. A similar company called Hushmail from that era is still around, but suffered from some bad press when email that users thought had been “disappeared” was turned over in the course of a lawsuit. In any case, neither company ushered in a new era where email automagically goes away. However, given this new crop of startups, I wonder: were these 90s companies ahead of their time, poorly managed, or just a bad idea?
On the corporate side, I don’t see a large appetite for this kind of technology. I have had this conversation with clients many times, and although they love the idea in concept, they are very worried that using the technology will create the appearance of evil (just as the first thought we naturally have about Snapchat is that it must really be for sexting). Executives in particular feel that the use of the technology creates the impression of having something to hide. Perhaps if email had had this capability from the beginning, the risk would not be there. Corporate culture is conservative by nature, and no company wants to draw attention to itself in this area.
This fear is not without justification. Many general counsels are fearful of deleting any corporate email messages at all, which is why many of the world’s largest and “well-managed” companies have hundreds of terabytes of old email sticking around. Remember that in the world we live in, prosecutors sometimes chastise companies for not keeping all their messages forever because, after all, tape storage is “almost free.” There certainly is a case to be made that spoliation fears are generally overblown, given the number of times spoliation actually leads to a fine or judgment, but the fear of throwing away the wrong thing is not groundless. Getting rid of junk defensibly requires a logical, justifiable process.
Unless an organization is in a highly classified environment, I think most general counsels and their litigation partners would tremble at the thought of explaining why most of the company used “normal” email but their executives/salespeople/take your pick used “special” email that disappears. It does not pass the smell test. Selective use is problematic.
On top of that, you have users who find operational benefit from having records of their business activities in email. You also have the emerging world of Big Data, where email in aggregate potentially has big value if you get it onto Internet-scale infrastructure and point the right tool at it.
In any case, check out the full piece when it runs in the next issue of Law Technology News.
Author: Barclay T. Blair
In late 2012 I was honored to provide a feature editorial for Law Technology News, a fine publication helmed by Monica Bay. You can read it online here (with free registration) or you can read it in full below.
Girding for Battle: A clash is brewing between Big Data and e-discovery
When was the last time you sat at your computer and deleted old files? Yesterday? Never? Don’t remember? Before today’s ubiquitous search engines, there was practical value in being a filer rather than a piler — it was difficult to find a document in a filing cabinet without an index.
Today’s sophisticated search engines obviate the need to manually index. Search technology is wonderful if we know what we are looking for, but is it an information management panacea? Information is growing at an astonishing rate, so much so that the numbers used to communicate growth projections are now so huge that they are almost meaningless.
Until recently, this unfettered growth was generally viewed as hazardous. It drives up storage costs, makes it difficult to find the wheat among the chaff, and increases electronic data discovery risk and cost, the argument goes. The resulting mantra: “We need to categorize it, control it, and clean it up!” Companies have spent decades paralyzed by a near inability to adapt modernist paper records management programs to decidedly postmodern information systems. Today, no part of the organization (including IT) exerts centralized command-and-control over data, and we have yet to find an easy replacement for the file clerk.

Enter Big Data, where uncontrollable information growth is no longer viewed as evil, or even a necessary evil. In the Big Data world, system administrators now treat bursting databases and file shares not as a shameful secret shared sotto voce in committee meetings, but as something to brag about. In Big Data, information has no downside. It is exalted in Davos, where the World Economic Forum recently “declared data a new class of economic asset, like currency or gold.” It’s been profiled by The New York Times. Proponents call it “the new oil,” proclaiming it presents the biggest opportunities since the dawn of the internet.
So why does Big Data matter to the legal community? Because it heralds a new battle, over a single question: Should we keep the information we create forever, or should we throw some of it away? The answer used to be simple: it was not feasible to keep everything. The cost was too high, the effort too great. Overburdened systems fail. Information overload reduces productivity. Data must be migrated from old to new systems, with great difficulty and expense.
The chance that you might have a smoking gun buried in the data creates too high a risk of liability. After all, if we learned one lesson from the seminal EDD cases metastasizing from the bankruptcies of Enron (Andersen v. U.S., 544 U.S. 696, 704 (2005)) and Sunbeam (Coleman (Parent) Holdings, Inc. v. Morgan Stanley & Co., Inc., No. 502003CA005045XXOCAI (Fla. Cir. Ct., March 1, 2005)), it is that data skeletons in the closet can be spooky.
But Big Data changes the calculus. The software used by Google and Yahoo to index the internet is open source, called Apache Hadoop. This brings internet scale and speed to just about any organization, and it can be run on cheap, off-the-shelf disk drives. Tools to analyze the data (some first commercialized in EDD) are accessible and powerful, promising profound new business and societal insights drawn from the vast pools of data. The fundamental promise of Big Data is that it enables insights into business (and the world) that were not possible before. Proponents see Big Data creating a better world, one fulfilling the promise of the internet itself.
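The programming model Hadoop popularized, MapReduce, is simple enough to sketch in miniature. Below is a toy, single-machine illustration of the classic word-count example; the real system distributes the same map, shuffle, and reduce phases across clusters of inexpensive machines, which is where the “internet scale” comes from:

```python
from collections import defaultdict

documents = [
    "big data changes the calculus",
    "big data brings internet scale",
]

# Map phase: each document independently emits (word, 1) pairs.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group the emitted pairs by key (the word).
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: sum the counts for each word.
counts = {word: sum(values) for word, values in groups.items()}
print(counts)
```

Each phase operates on independent chunks of data, so no single machine ever needs to hold or understand the whole corpus – the property that lets commodity hardware index the internet.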
But Big Data advocates downplay the downsides of data, and specifically, the EDD challenges. In the near-Nirvana contemplated by some Big Data proponents, all data is good and more data is better. In EDD, the opposite is usually true.
A recent study by the Pew Research Center about the future of Big Data was positive overall, but acknowledged concerns related to privacy, social control, misinformation, civil rights abuses, and the possibility of simply being overwhelmed by the deluge of data. Within legal, the burden of finding, processing, and producing Big Data in EDD is a foreign concept to most Big Data advocates. Perhaps this is because the Big Data enthusiasm cycle has not yet reached the “trough of disillusionment” where the hype faces the reality of corporate culture and complex legal and compliance requirements.
Records management doctrines specify that organizations should clearly define the business or legal purpose of a piece of information when created. That analysis determines whether, for how long, and in what form the data should be kept. Records retention schedules are intended to provide a measure of defensibility against spoliation claims, as they evince an intent to delete a record based on a proactive and standardized calculation of its value, rather than a reactive determination based on fears about bad evidence. Many organizations have attempted to play records management catch up in advance of pending litigation and have paid the price.
Big Data advocates argue that the economies of scale now make it feasible and desirable to capture and store information that currently has no clear or definable business value. Although large organizations have long collected and analyzed data (using business intelligence software), proponents argue that Big Data is different. They posit that cheaper storage and technical innovations make it easier and faster than ever before to analyze that data, eliminating the need to identify the business purpose of data before it is collected and retained.
With Big Data, no rigid “schema” or organizational approach is necessary before capturing content (unlike in a traditional database). Data professionals now (or in the future) can ask open-ended questions of the data. That includes questions that may not be germane now, but may be critical in an unpredictable future.
As a result, more data will be kept longer, in a manner that is unmoored from records management tenets. Without a doubt, this philosophy will complicate the governance and e-discovery of data.
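The “no rigid schema” point can be made concrete. In this hypothetical sketch, heterogeneous records are captured as-is, and structure is imposed only at the moment a question is finally asked (an approach often called “schema-on-read”):

```python
from collections import Counter

# Heterogeneous records kept without any predefined schema
raw_capture = [
    {"type": "email", "from": "a@example.com", "subject": "Q3 forecast"},
    {"type": "tweet", "user": "@example", "text": "love the new product"},
    {"type": "email", "from": "b@example.com", "subject": "Budget"},
]

# An open-ended question posed after the fact: how many records of each
# type did we capture? No field had to be declared in advance.
by_type = Counter(record["type"] for record in raw_capture)
print(by_type)
```

Contrast this with a traditional database, where a record with an undeclared field is simply rejected at write time. It is precisely this capture-first, decide-later posture that unmoors retention from records management tenets.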
So, when was the last time you sat down in front of your computer and deleted old files? In the world of Big Data, this is not only unnecessary, it’s undesirable. And it’s a waste of time.
Should we keep everything forever? Absolutely not. Too much information still has a downside. It is a liability, as well as an asset. Information has risk. Information has real, unavoidable legal and regulatory requirements. Information has a bite that Big Data proponents ignore at their peril.
But the good news: The same tools and infrastructure that empower the potentially profound insights of Big Data can and should be employed to help organizations make informed decisions about data retention. A vast amount of unstructured data in many organizations (over half, according to some studies) is duplicate, outdated, transitory junk that has no business value. Getting rid of this information en masse, without dragging every employee into the process, is now possible.
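One such tool-assisted approach is flagging exact duplicates by content hash, a basic building block of defensible bulk cleanup. A minimal sketch, with hypothetical file names and contents:

```python
import hashlib

files = {
    "reports/q3_final.docx": b"quarterly numbers ...",
    "archive/q3_final_copy.docx": b"quarterly numbers ...",  # exact duplicate
    "notes/meeting.txt": b"action items ...",
}

seen = {}        # content digest -> first path seen with that content
duplicates = []  # (duplicate path, original path) pairs

for path, content in files.items():
    digest = hashlib.sha256(content).hexdigest()
    if digest in seen:
        duplicates.append((path, seen[digest]))
    else:
        seen[digest] = path

print(duplicates)
```

Because the hash is computed from content rather than file names or locations, the process scales across shares and mailboxes without any employee having to review the copies – the standardized, proactive calculation that defensible deletion requires.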
E-discovery is the place where the cost of information management myopia becomes painfully visible, and is why EDD has consistently driven innovation in handling and understanding vast amounts of data. However, even with these innovations, the risk and cost of information in EDD is undeniable, and is correlated to the overall volume of information in the organization.
These are the contours of the coming battle between Big Data and e-discovery. It is a philosophical and cultural battle. It is the responsibility of EDD and information governance attorneys and practitioners to gird themselves for this battle. Learn about Big Data, and inform the discussion and decisions in your organization.
Reprinted with permission from Legal Technology News. Further duplication prohibited.
Common Big Data Use Cases
- Sentiment analysis. Analyzing sentiment on social media networks in order to improve marketing campaigns and customer service programs.
- Fraud detection. Analyzing transactions for patterns and events that may indicate fraud (familiar to anyone who has received a phone call from their credit card company when first using the card outside their home country).
- Retail pricing optimization. Setting the price of a product based on sophisticated analysis of purchasing patterns, customer demographics, and geographic demand variations.
Who should you talk to?
Big Data projects are likely being planned in your organization, or your client’s organization right now. Here are some people and places to pay attention to:
- Marketing and customer service. A common current real-world application of Big Data techniques is social media sentiment analysis. These programs are typically driven by marketing or customer service groups.
- IT: Information security. The IT professionals responsible for information security may already be collecting and analyzing log files from the hundreds or thousands of devices in the company that generate them. This may not technically be a Big Data project yet, but find out what their plans are for correlation with other data sources that may give rise to privacy and other concerns.
- Data scientists and analysts. If your organization is currently hiring data scientists or analysts, there is a good chance that Big Data projects are ongoing. Find out who these people are and learn about their plans. Not only is their work typically very interesting, it may also have serious legal and regulatory implications related to retention, privacy, and e-discovery.
Chart: Drawing the Battle Lines
| Factor | Big Data | Information Governance, E-Discovery |
| --- | --- | --- |
| Primary motivation | Business value | Legal risk |
| Prevalent attitude towards information | More data is an opportunity | More data is expensive and risky |
| Information type focused on | Databases, moving towards unstructured information | Documents, email, and unstructured information, moving towards databases |
| Bleeding edge analysis | How much is a piece of data worth? | Is this a future smoking gun? |
| Biggest potential downside | Unintended consequences of analysis (e.g., civil rights violations); cost in litigation | Throwing away documents that in aggregate reveal valuable business insight |
Author: Barclay T. Blair