“Information retrieval is a significant problem for businesses. Further, the extent of the problem worsens with increasing size of the document collection [and] the less formal the information stored.”
Information Retrieval in Business: An Unmet Challenge[i]
Unstructured information, at its simplest, is information that does not reside in the rows and columns of a database. Any database user understands that the meaning of a field in a database is a combination of what the row and the column each mean, such as the price of a widget on a certain date. However, unlike the structured information that resides in databases, unstructured information does not always have a predetermined form, business purpose, use, value, or security classification.
As a result, managing unstructured information is tricky. Many long-established techniques for database administration simply do not apply. This complexity also makes calculating the total cost of unstructured information difficult.
Unstructured information comes in many forms, including word processing documents, spreadsheets, social media posts, and log files automatically generated by computer servers. Some unstructured information has more structure than others (email messages, for example, all have a header, subject line, and message body). Some call this information semi-structured information, but for our purposes, we will use the term unstructured information to include semi-structured information as well.
The volume of unstructured information is growing dramatically. Analysts estimate that, over the next decade, the amount of data worldwide will grow by 44 times (from 0.8 zettabytes to 35 zettabytes; 1 zettabyte = 1 trillion gigabytes).[ii] However, the volume of unstructured information will actually grow 50% faster than structured data. Analysts also estimate that fully 90% of unstructured information will require formal governance and management by 2020. In other words, the problem of unstructured information governance is growing faster than the problem of data volume itself.
What makes unstructured information so challenging? There are several factors, including:
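As a quick check on the arithmetic in the estimate above, the following minimal Python sketch recomputes the growth multiple and the unit conversion. The variable names are illustrative only, and the decimal (SI) definitions of the units are an assumption, since the study's exact unit convention is not stated here.

```python
# Assumption: decimal (SI) units, i.e. 1 ZB = 10**21 bytes and 1 GB = 10**9 bytes.
BYTES_PER_ZETTABYTE = 10**21
BYTES_PER_GIGABYTE = 10**9

start_zb = 0.8   # starting worldwide data volume cited above, in zettabytes
end_zb = 35.0    # projected worldwide data volume cited above, in zettabytes

# How many times larger the projected volume is than the starting volume.
growth_multiple = end_zb / start_zb

# How many gigabytes fit in one zettabyte under the SI assumption.
gigabytes_per_zettabyte = BYTES_PER_ZETTABYTE // BYTES_PER_GIGABYTE

print(f"Growth multiple: about {round(growth_multiple)}x")
print(f"Gigabytes per zettabyte: {gigabytes_per_zettabyte:,}")
```

Under these assumptions the multiple rounds to the 44 times quoted, and one zettabyte works out to one trillion gigabytes.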
- Horizontal vs. Vertical. Unstructured information is typically not clearly attached to a department or a business function. Unlike the vertical focus of an ERP database, for example, an email system serves multiple business functions – from employee communication to filing with regulators – for all parts of the business. Unstructured information is much more horizontal, making it difficult to develop and apply business rules.
- Formality. The tools and applications used to create unstructured information often engender informality and the sharing of opinions that can be problematic in litigation, investigations, and audits – as has been repeatedly demonstrated in front page stories over the past decade. This problem is not likely to get any easier as social media technologies and mobile devices become more common in the enterprise.
- Management Location. Unstructured information does not have a single, obvious home. Although email systems rely on central messaging servers, email is just as likely to be found on a file share, mobile device, or laptop hard drive. This makes the application of management rules more difficult than the application of the same rules in structured systems, where there is a close marriage between the application and the database.
- “Ownership” Issues. No employee thinks that they “own” data in an accounts receivable system the way they “own” their email or the documents stored on their hard drive. Although such information generally has a single owner, i.e., the organization itself, this mindset can make imposing management rules on unstructured information more challenging than imposing them on structured data.
- Classification. The business purpose of a database is generally determined prior to its design. By contrast, the business purpose of unstructured information is difficult to infer from the application that created or stores it. A word processing file stored in a collaboration environment could be a multi-million dollar contract or a lunch menu. As such, classifying unstructured content is more complex and expensive than classifying structured information.
Taken together, these factors reveal a simple truth: managing unstructured information is a separate and distinct discipline from managing databases. Moreover, determining the costs and benefits of owning and managing unstructured information is a unique – but essential – challenge.
[i] Michael D. Gordon, “Information Retrieval in Business: An Unmet Challenge,” The University of Michigan, 1991. Online at http://deepblue.lib.umich.edu/handle/2027.42/35654
[ii] International Data Corporation, “The 2011 Digital Universe Study,” June 2011. Online at http://www.emc.com/collateral/demos/microsites/emc-digital-universe-2011/index.htm
We have worked with the eDJ Group in the past to survey the market about Information Governance attitudes and practices, and I am pleased to be working together on a new survey. This time we have an additional partner – ARMA International – which is excellent.
Our new survey asks some of the same questions we asked previously so that we can track year-over-year changes, but we are also digging into some new areas like big data and predictive coding. Please take a moment to complete the survey. We will be releasing the results publicly, and this kind of data is good for all of us as we try to move the information governance ball down the field (unlike the NY Giants this year – what the heck?).
Check out the results of our previous surveys to get a flavor of the kind of insight that we expect to get from the survey.
“We need to take automation to another level, leaving human or manual efforts behind, to increase productivity and lower cost for clients in all areas of the information governance spectrum.”
Jason R. Baron, Of Counsel, Information Governance & eDiscovery Group, Drinker Biddle & Reath LLP
Jason R. Baron, Director of Litigation for NARA and a widely recognized and highly respected authority on e-discovery and electronic records, has left NARA to join the information governance and e-discovery practice of Drinker Biddle & Reath. He is joining a stacked deck: the group already includes Bennett B. Borden and Jay Brudz as chairs.
I have known Jason for many years, and not only is he a class act, he is one of the few people who can truly be credited with driving and changing our thinking about e-discovery and information governance. Jason has a long list of accomplishments, but most significant for me is the tireless academic and evangelism work he has done to drive understanding of advanced search, predictive coding, and other techniques that help to automate information governance. Automation is the future of information governance, and it is a future that only exists because of people like Jason.
I had the pleasure to interview Jason about his big career change (he was at NARA for 13 years), and loved to see how excited he is about the future of information governance.
Highlights of our discussion include:
- Jason was NARA’s first Director of Litigation, which speaks both to the changes to the information landscape in the past decade and to Jason’s expertise.
- Jason played a key role in developing a directive that requires all federal agencies to move to all digital form for permanent electronic records by the end of the decade.
- NARA will soon be managing upwards of a billion White House email messages – forever.
- Jason believes that predictive coding and other advanced search and document review methods will drive significant automation of information governance in the coming years.
My Interview with Jason R. Baron
Why now? Why are you leaving your role at NARA to go into private practice?
Well, I can tell you it has nothing to do with being placed on furlough! For the past 13 years, I have considered my position at NARA to be a dream job for any lawyer. As NARA’s first appointed Director of Litigation, I have had the opportunity to work with high-ranking officials and lawyers throughout government, including in the White House Counsel’s Office, on landmark cases involving electronic recordkeeping and e-discovery issues.
I also have been particularly privileged to work with Archivist David Ferriero and others in crafting a number of high-visibility initiatives in the records and information governance space, including the Archivist’s Managing Government Records Directive (August 2012), which includes an “end of the decade” mandate to federal agencies requiring that all permanent electronic records created after 2019 are preserved in electronic or digital form. With this background and experience, I think I can now be of even greater help in facilitating adoption of industry best practices that meet the Archivist’s various mandates. I also wanted to work on cutting edge e-discovery and information governance matters in a wider context.
What was it that attracted you to Drinker Biddle & Reath? Did you consider other firms or other career paths?
The biggest attraction was knowing that I share the same vision with Bennett B. Borden and Jay Brudz, Co-chairs of Drinker Biddle’s Information Governance and eDiscovery Group. Collectively, we see e-discovery challenges as only part of a more systemic “governance” problem. Big Data is only getting bigger, and I believe our group at Drinker Biddle is on the leading edge of law firms in recognizing the challenge and offering innovative solutions to clients. Of course, there are any number of other firms in e-discovery and other “hot” areas, and I have friends and colleagues at a number of firms and corporations who I have had discussions with. I’d like to think that my closest peers in this area will act as strategic partners with me in any number of educational forums, and I look forward to that prospect.
What will your role at Drinker Biddle be? What will you focus on?
As Of Counsel to the Information Governance and eDiscovery Group, I expect to be most heavily involved in helping to build out three areas of practice. First, providing legal services to those private sector actors that are involved in large IT-related engagements with the federal sector, and wish to optimize information governance requirements. Second, consulting on records and information governance initiatives in the private sector, especially employing cutting-edge automated technologies (predictive coding, auto categorization, and the like). Third and finally, I hope to take on special master assignments in the area of e-discovery, as the need arises, and would consider it a great honor to do so.
What do you think about the future of NARA and its role as the federal government transitions to the digital world?
As I said earlier, NARA is leading the way in issuing policies that will result in electronic capture of all e-mail records by the end of 2016, as well as ensuring that all electronic records appraised as “permanent” are preserved in future federal digital archives. NARA has shown leadership in issuing an important joint directive with OMB in 2012, which followed on the heels of President Obama’s Memorandum on Managing Government Records dated November 2011.
If NARA doesn’t lead in the area of setting information governance policies for federal applications, including in the cloud, it risks becoming an irrelevant player in the digital age. The present Archivist of the US and other senior leaders inside NARA are committed to doing everything they can to avoid that fate.
What are the key initiatives that you are working on right now?
My plate is full: Along with a few others, I have been involved in finishing up an update of The Sedona Conference’s 2007 Search Commentary and 2009 Commentary on Achieving Quality in E-Discovery. Over the next few weeks I will be criss-crossing the United States to participate in some excellent forums, including the upcoming EDI Summit in Santa Monica in October, where I am moderating a panel on “Beyond ISO 9001,” all about standards in the e-discovery and information governance space; and the inaugural IT-Lex Conference in Orlando, where I have been invited to speak, along with Ralph Losey and Maura Grossman, on the future of predictive coding.
You will also find me at ARMA 2013 in Las Vegas, at Georgetown’s Advanced E-Discovery Institute, and of course at LegalTech next February, all wonderful venues to get a message out about cutting edge issues in these areas.
What do you think is the most interesting thing happening in the IG space today?
I am most excited about bringing the “good news” of predictive coding and other advanced search and document review methods to a wider records and information governance audience, and intend to speak at any number of upcoming forums on how to do so. We need to take automation to another level, leaving human or manual efforts behind, to increase productivity and lower cost for clients in all areas of the information governance spectrum.
Do you think that organizations will ever achieve the promise of IG? What will it take to get there?
Woody Allen says there are two types of people in the world: those who believe the glass is half full, and those who say it is half poison.
I am optimistic about us doing better in the space – if lawyers can think outside of the box in adopting best practices from other disciplines, including artificial intelligence and information retrieval. A reality check is in order, however, given that predictions about the “future” of anything tend to be overly optimistic (where are the cars that glide over highways, or the cities on the moon, both of which were predicted at the 1964 World’s Fair to have already happened?).
And the first mention of “yottabytes” by an op-ed columnist in the New York Times occurred in the last couple of weeks. As I mentioned earlier, the world of big data is only getting bigger and more complex. I think lawyers in this area can give solid guidance to help clients do better in this “real” world, and certainly hope to do so with the great team already in place at Drinker Biddle.
What was the biggest structural or philosophical change that you observed at NARA during your career there?
I recall going to what was billed as an “e-mail summit” meeting a half decade ago, in which the really great people assembled could not believe that most end users failed to print out email for placement in traditional hard copy files. Archivists and records managers by their very nature are just too good at doing so! However, NARA has come a long way since then, in pushing capture and filter policies for email (the so-called recent “Capstone” initiative), as well as the digital mandate by 2019 I mentioned earlier. These really do represent policy shifts that hold out the potential for leading many agencies to adopt new ways of doing business.
What do you think that private organizations can learn from NARA’s experiences in trying to manage and control the information explosion?
NARA certainly has unique challenges. For example, it needs to preserve and provide access on a permanent basis to what I have estimated will soon be upwards of a billion White House emails. What the private sector can learn from NARA’s (and the White House’s) experience in this area is that in an era where massive and ever-increasing volumes of data flow through corporate networks, there need to be technological solutions put into place to be able to filter out low-value data, to guard privacy interests, and to provide greater access through advanced means of search and categorization.
NARA knows that it needs to confront all of these issues, and is now engaging in outreach to the private sector in an effort to find solutions in the public space. (BB note: I recently attended one of these meetings, and will be writing about it soon.) Corporations of all sizes also need to confront information governance issues before a black swan event occurs that materially affects the bottom line.
What was the most interesting challenge or case you faced at NARA?
I have written and spoken at length about dealing with U.S. v. Philip Morris (the RICO tobacco case), and so won’t repeat what I have said about my experience searching through 20 million White House emails, and starting on my quest in search of better search methods. My time at NARA has just been one fascinating experience after another, and not just involving electronic records of course, so it’s hard to choose.
At one point I found myself in the back room of Christie’s auction house in Manhattan with a senior archivist, poring over a massive Excel spreadsheet that listed 5000 documents taken from Franklin Roosevelt’s White House by his trusted secretary Grace Tully. We had to decide which documents should have ended up at the Roosevelt Library in Hyde Park. An auction of paintings worth millions was about to take place and all around us people were shouting, “Where are the Picassos?” and “What about the Matisses?” It was definitely surreal.
And yes, after drafting a Complaint and working with the US Attorney’s Office in the Southern District, we ended up settling the dispute over the Grace Tully collection (where the owners were represented by, among others, former Rep. Elizabeth Holtzman working at a mid-Manhattan law firm), with timely assistance from passage of a special bill in Congress allowing for a favorable valuation of the collection. From one week to the next, I never knew what new disputes involving the history of the 19th and 20th centuries I would be involved with.
I will be providing the keynote address at a half-day seminar hosted by Sita Corp, SAP, and HP at the New York Athletic Club on October 15, 2013, from 8:30 to 10:30 am.
I am going to be talking about the challenges of Information Governance in a Big Data world.
Register now at: http://ow.ly/po2mm