On June 6, 2013, the US National Archives and Records Administration published a call for comments on its draft Bulletin regarding a proposed “Capstone” approach to email retention at federal agencies. NARA was having technical problems with its comment system when I tried to submit my comments, so based on their instructions I have submitted my comments to them directly by email, and I am also posting them here.
You can find the request for comment and the draft Bulletin on NARA’s website.
Feedback on NARA’s Capstone Email Records Management Bulletin
As requested, I am providing comments on the “Capstone” approach to email management outlined in the June 6, 2013 draft NARA Bulletin provided above. Thank-you for the opportunity to provide input on this important issue.
I am the founder and principal of an information governance consulting firm based in New York. Since 2001 I have advised many organizations and government agencies on the development and implementation of email retention strategies.
Based on my experience and research, I believe that most organizations currently fall into one of two email records management camps.
The first camp does very little. While they may impose mailbox size limitations, they provide sparse guidance to employees who are forced to delete messages to meet these quotas. Consequently, business records are likely lost – especially if no storage space is allocated for retention of records that simply happen to reside in the email system. Others allow – or turn a blind eye to – the practice of employees exporting email messages out of the corporate email system so they can be tucked away in shared drives, thumb drives, or taken home for “safekeeping.” This practice results in an effective loss of management control over records found in the email system, and can greatly increase collection costs and increase spoliation risk in e-discovery.
The second camp “manages” email, but treats all email messages equally, regardless of their content. Some – seeking to minimize the cost and potential risk of email – automatically purge all email older than 30, 60, or 90 days. In the absence of a method to capture email messages containing record content, records are surely lost – violating laws that require retention of specified records, regardless of their form. Others – perhaps inspired by SEC Rules 17a-3 and 17a-4 and the email archiving software industry that those Rules singlehandedly created – capture a copy of all messages sent and received and keep them in a separate archive for a fixed period of time. This approach ignores the reality that such an archive will undoubtedly contain both trivial content and critical business records. From a compliance perspective, this may be just fine if you are a broker-dealer subject to these unique, email-specific Rules, but is less fine if you are, like most of the business world, subject to retention rules that do not exempt or treat email in special way, but rather require identification and retention of business records regardless of the form they take.
There are of course other approaches to email retention, one of which is outlined in your draft Bulletin. As I understand it, Capstone is a role-based method that uses the role of the email creator/recipient as a predictor of the content of that user’s account. In the past I have advocated such an approach to clients as a pragmatic method for improving otherwise nascent email records management practices.
NARA should certainly be commended for embracing such pragmatism, and in recognizing that complex user classification systems are often impractical and lightly adopted.
However, I would like to share two additional ideas that may be helpful as NARA finalizes its guidance.
First, while a knowledge worker’s role can certainly be a predictor of an email message’s content, our research has shown us the limits of this approach. We have assessed role-based approaches at client organizations by analyzing actual email accounts sampled from a range of user roles. We have then estimated the percentage of email content that would require retention under the client’s own retention rules. Across a range of users we have found as little as 5% and as much as 95% record content. There is certainly some correlation between the percentage of record content and the role of the user, but it is not always categorical. For example, some users are mostly information processors, and thus may have an extremely high percentage of email records in their inboxes.
Consider for example, a claims processor who receives a partially completed claims form attached to an email message, opens that form and completes it using information they possess, and than sends the completed form to an employee who represents the next link in the processing chain. This scenario is very common, even in large organizations. Assuming that these completed claim forms are records, and that they are not otherwise captured in a content management system, this user’s email account is quite important from a records management perspective.
However, a Capstone system based solely on seniority (i.e., “officials at or near the top of an agency,” as described in the Bulletin) may miss this important account and result in such records disappearing as “temporary” records. Conversely, senior officials may have a relatively low percentage of record content in their email system when they use other systems to communicate their decisions, document those decisions formally, or otherwise use other official or formal systems to complete their work. Capture and permanent retention of their entire account then, would result in retention of largely trivial content.
These issues can in part be addressed by careful examination of the way email is used by each agency and its users, as mentioned in the Bulletin.
Second, I wonder if NARA is turning away from a content-based approach to record identification and retention too soon – in fact, act just at the time in history when technology to enable semi-automated, content-based approaches is becoming widely available. Our clients are currently evaluating and implementing technology from OpenText and Recommind (there are other providers in the market as well) that marries human and machine intelligence to remove the classification burden from the user. Such systems are by no means trivial to implement and configure, but I believe that they point the way forward for email records management. The effectiveness of automated statistical methods for content classification has been demonstrated effectively in the intensely observed world of US civil litigation; a demonstration that I believe provides a foundation for it application to the records management problem.
Further, while the Capstone method would seem – as noted in your Memo – to foster compliance with the “OMB/NARA M-12-18 Managing Government Records Directive” requirement to “manage both permanent and temporary email records in an accessible electronic format,” I wonder to what extent it addresses the spirit of Section A3 of the Directive to “investigate and stimulate applied research in automated technology to reduce the burden of records management responsibilities?”
Once again, thank-you for the opportunity to provide feedback on this important Bulletin, and I am confident that NARA will continue to provide leadership as federal agencies continue this critical transition.
Last week I attended a “Predictive Coding Boot Camp” produced by the E-Discovery Journal and presented by Karl Schieneman of Review Less and Barry Murphy. I’ve participated in many workshops, seminars, discussions, and webinars on the topic, but this half-day seminar went the deepest of any of them into the legal, technology, and case strategy implications of using technology to minimize the cost of human document review in e-discovery. It was a solid event.
(But, I wasn’t there to learn about e-discovery. I’ll tell you why I was there in a moment.)
You see how I snuck in an implied definition above? Because, whatever you call it – predictive coding, technology-assisted review, computer-assisted review, or magic – isn’t that the problem that we are trying to solve? To defensibly reduce the number of documents that a human needs to review during e-discovery? There are a number of way to get there using technology, but the goal is the same.
What does e-discovery have to do with IG?
To review, in civil litigation, both sides have an obligation to produce information to the other side that is potentially relevant to the lawsuit. In the old days, this was a mostly a printing, photocopying, and shipping problem. Today it is primarily a volume, complexity, and cost problem. Although discovery of physical evidence and paper records is obviously still part of the process, electronic evidence naturally dominates.
So, how does a litigant determine whether a given document is potentially relevant and must be produced, or if it is irrelevant, privileged, or otherwise does not need to be produced to the other side?
If I sue my mechanic because he screwed up my transmission repair, the process is pretty simple. I will bring bills, receipts, and other stuff I think is relevant to my lawyer, my mechanic will do the same, our attorneys will examine the documents, determine a case strategy, produce responsive evidence to the other side, perhaps conduct some depositions, and – in real life – a settlement offer will likely be negotiated. In a case like this, there are probably only one or two people who have responsive information, there isn’t much information, and the information is pretty simple.
Now, what happens if 10,000 people want to sue a vehicle manufacturer because their cars seemingly have a habit of accelerating on their own, causing damage, loss, and even death? In a case like this, the process of finding, selecting, and producing responsive information will likely be a multi-year effort costing millions of dollars. The most expensive part of this process has traditionally been the review process. Which of the millions of email messages the manufacturer has in its email archive are related to the case? Which CAD drawings? Which presentations that management used to drive key quality control decisions? Which server logs?
Before we got smart and started applying smart software to this problem, the process was linear, i.e., we made broad cuts based on dates, custodians, departments etc. and then human reviewers – expensive attorneys in fact – would look at each document and make a classification decision. The process was slow, incredibly expensive, and not necessarily that accurate.
Today, we have the option to apply software to the problem. Software that is based on well-known, studied and widely used algorithms and statistical models. Software that, used correctly, can defensibly bring massive time and cost savings to the e-discovery problem. (There are many sources of the current state of case law on predictive coding, such as this.) Predictive coding software, for example, uses a small set of responsive documents to train the coding engine to find similar documents in the much larger document pool. The results can be validated through sampling and other techniques, but the net result is that the right documents can potentially be found much more quickly and cheaply.
Of course predictive coding is just a class of technology. It is a tool. An instrument. And, as many aspiring rock gods have learned, owning a vintage Gibson Les Paul and a Marshal stack will not in and of itself guarantee that your rendition of Stairway to Heaven at open mic night will, like, change the world, man.
So why did I go to the Predictive Coding Bootcamp? I went because I believe that Information Governance will only be made real when we find a way to apply the technologies and techniques of predictive coding to IG. In other words, to the continuous, day-to-day management of business information. Here’s why:
Human classification of content at scale is a fantasy.
I have designed, implemented, and advocated many different systems for human-based classification of business records at dozens of clients over the last decade. In some limited circumstances, they do work, or at least they improve upon an otherwise dismal situation. However, it has become clear to me (and certainly others) that human based-classification methods alone will not solve this problem for most organizations in most situations moving forward. Surely by now we all understand why. There is too much information. The river is flowing too quickly, and the banks have gotten wider. Expecting humans to create dams in the river and siphon of the records is frankly, unrealistic and counterproductive.
Others have come to the same conclusion. For example, yesterday I was discussing this concept with Bennett B. Borden (Chair of the Information Governance and eDiscovery practice at Drinker Biddle & Reath) at the MER Conference in Chicago, where he provided the opening keynote. Here’s what Bennett had to say:
“We’ve been using these tools for years in the e-discovery context. We’ve figured out how to use them in some of the most exacting and high-stakes situations you can imagine. Using them in an IG context is an obvious next step and quite frankly probably a much easier use case in some ways. IG does present different challenges, but they are primarily challenges of corporate culture and change management, rather than legal or technical challenges.”
The technology has been (and continues to be) refined in a high-stakes environment.
E-discovery is often akin to gladiatorial combat. It is often conducted under incredible time pressures, with extreme scrutiny of each decision and action by a both and enemy and a judge. The context of IG in most organizations is positively pastoral by comparison. Yes, there are of course enormous potential consequences for failure in IG, but most organizations have wide legal latitude to design and implement reasonable IG programs as they see fit. Records retention schedules and policies, for example, are rarely scrutinized by regulators outside of a few specific industries.
I recently talked about this issue with Dean Gonsowski, Associate General Counsel at Recommind. Recommind is a leader in predictive coding software for the e-discovery market and is now turning its attention to the IG market in a serious way. Here’s what Dean had to say:
“E-discovery is the testing ground for cutting-edge information classification technology. Predictive coding technology has been intensively scrutinized by the bench and the bar. The courts have swung from questioning if the process was defensible to stating that legal professionals should be using it. The standard in IG is one of reasonableness, which may be a lower standard than the one you must meet in litigation.”
There is an established academic and scientific community.
The statistical methods, algorithms, and other techniques embodied by predictive coding software are the product of a mature and developing body of academic research and publishing. The science is well-understood (at least by people much, much smarter that me). TREC is a great example of this. It is a program sponsored by the US government and overseen by a program committee consisting of representatives from government, industry, and academia. It conducts research and evaluation of the tools and techniques at the heart of predictive coding. The way that this science is implemented by the software vendors who commercialize it varies widely, so purchasers must learn to ask intelligent questions.TREC and other groups help with this as well.
I will soon be writing more about the application of predictive coding technology to IG, but today I wanted to provide an introduction to the concept and the key reasons why I think it points the way forward to IG. Let me know your thoughts.
Here is the fourth video in our five-part series where I asked 30 Information Governance experts the same question, then produced a 5 minute video of their responses. As you watch the series, it is very interesting to see the common threads that weave through the answers, depending on the role and the type of organization the interviewee comes from.
Thanks to everyone who joined our information governance webinar on Thursday. Barry and I had fun go through the results of the joint ViaLumina and eDiscovery Journal 2011 Information Governance Survey. Judging by some of the questions, you found it valuable as well. Attendees to the webinar will get a complimentary copy of the survey report. Also, Barry and I will be continue to blog about results from the survey here and over at Barry’s blog.
I’ll let you know when the webinar recording is available, but in the meantime, here are some selected audio outtakes that I though you might enjoy.
On addressing information governance cynics (we’re looking at you, IT) when defining the term:
[haiku url=”https://barclaytblair.files.wordpress.com/2011/09/minnesota-blue-apples.m4a” title=”Minnesota Blue Apples”]
(Minnesota blue apples? What can I say?)
On the role of litigators and the corporate governance vacuum
[haiku url=” https://barclaytblair.files.wordpress.com/2011/09/the-role-of-litigators.m4a” title=”The Role of Litigators in Information Governance”]
More to come . . .