What Is Unstructured Information & Why Is It So Challenging?

“Information retrieval is a significant problem for businesses. Further, the extent of the problem worsens with increasing size of the document collection [and] the less formal the information stored.”

Information Retrieval in Business: An Unmet Challenge[i]

Unstructured information, at its simplest, is information that does not reside in the rows and columns of a database. Any database user understands that the meaning of a field in a database is a combination of what the row and the column each mean, such as the price of a widget on a certain date. However, unlike the structured information that resides in databases, unstructured information does not always have a predetermined form, business purpose, use, value, or security classification.

As a result, managing unstructured information is tricky. Many long-established techniques for database administration simply do not apply. This complexity also makes calculating the total cost of unstructured information difficult.

Unstructured information comes in many forms, including word processing documents, spreadsheets, social media posts, and log files automatically generated by computer servers. Some unstructured information has more structure than others (email messages, for example, all have a header, subject line, and message body). Some call this information semi-structured information, but for our purposes, we will use the term unstructured information to include semi-structured information as well.

The volume of unstructured information is growing dramatically. Analysts estimate that, over the next decade, the amount of data worldwide will grow about 44-fold (from 0.8 zettabytes to 35 zettabytes; 1 zettabyte = 1 trillion gigabytes).[ii] However, the volume of unstructured information will actually grow 50% faster than structured data. Analysts also estimate that fully 90% of unstructured information will require formal governance and management by 2020. In other words, the problem of unstructured information governance is growing faster than the problem of data volume itself.
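To make the scale of that forecast concrete, here is a minimal sketch of the arithmetic behind the numbers quoted above. The start and end volumes and the ten-year span come from the estimate itself; the assumption that growth compounds evenly year over year is mine, added only to show what the forecast implies on an annual basis.

```python
# Figures quoted in the analyst estimate above.
GB_PER_ZB = 10**12      # 1 zettabyte = 1 trillion gigabytes (decimal units)
start_zb = 0.8
end_zb = 35.0
years = 10              # "over the next decade"

# Overall growth multiple: 35 / 0.8 = 43.75, which rounds to the quoted 44x.
growth_factor = end_zb / start_zb

# ASSUMPTION (not from the source): growth compounds evenly each year.
implied_annual_growth = growth_factor ** (1 / years) - 1

print(f"Growth multiple: {growth_factor:.2f}x")
print(f"Implied annual growth: {implied_annual_growth:.1%}")
print(f"End volume: {end_zb * GB_PER_ZB:.2e} GB")
```

Under the even-compounding assumption, the forecast works out to data volumes growing by a bit under half every year for a decade.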

What makes unstructured information so challenging? There are several factors, including:

  • Horizontal vs. Vertical. Unstructured information is typically not clearly attached to a department or a business function. Unlike the vertical focus of an ERP database, for example, an email system serves multiple business functions – from employee communication to filing with regulators – for all parts of the business. Unstructured information is much more horizontal, making it difficult to develop and apply business rules.
  • Formality. The tools and applications used to create unstructured information often engender informality and the sharing of opinions that can be problematic in litigation, investigations, and audits – as has been repeatedly demonstrated in front-page stories over the past decade. This problem is not likely to get any easier as social media technologies and mobile devices become more common in the enterprise.
  • Management Location. Unstructured information does not have a single, obvious home. Although email systems rely on central messaging servers, email is just as likely to be found on a file share, mobile device, or laptop hard drive. This makes the application of management rules more difficult than the application of the same rules in structured systems, where there is a close marriage between the application and the database.
  • “Ownership” Issues. No employee thinks that they “own” data in an accounts receivable system the way they “own” their email or the documents stored on their hard drive. Although such information generally has a single owner, i.e., the organization itself, this mindset can make the imposition of management rules more challenging for unstructured information than for structured data.
  • Classification. The business purpose of a database is generally determined prior to its design. By contrast, the business purpose of unstructured information is difficult to infer from the application that created or stores the information. A word processing file stored in a collaboration environment could be a multi-million dollar contract or a lunch menu. As such, classifying unstructured content is more complex and expensive than classifying structured information.

Taken together, these factors reveal a simple truth: managing unstructured information is a separate and distinct discipline from managing databases. Moreover, determining the costs and benefits of owning and managing unstructured information is a unique – but essential – challenge.

Sometimes my primary value as a consultant is as a calm voice in the raging storm of information-related problems that my clients face. Most information governance practitioners are assaulted from multiple sides today. Legal wants e-discovery support (and they want it yesterday). IT wants help hiding from legal every time there is a new lawsuit. The business is pushing back on the new policy that requires employees to . . . gasp . . . take some responsibility for the information they create. In the midst of this, the IG practitioner is supposed to be crafting and implementing an IG strategy. Simply getting started can seem impossible.

In the second piece of our OpenText Executive Brief series, I address the challenge of getting started with IG. I’ve lost count of the number of times I have heard the cliché “boiling the ocean” in introductory meetings with clients struggling to take control of their information. The combination of IT complexity, the massive mountain of legacy content, organizational change, and legal uncertainty can make it very difficult to figure out where to start. We’ve provided a practical way to think about IG, and some practical tips on getting started. Check out the new brief here (one-time registration required).

Today’s PowerPoint Slide: The Origins of Information Governance By the Numbers

Rather than keep them locked away in my private treasure trove, and rather than simply dump them into a SlideShare queue where they make no sense without the verbal presentation that they were designed to enhance, I thought I would start sharing some of my PowerPoint slides with you, along with my thinking behind them.

Here’s the first one.

I created this slide to illustrate the information governance problem.

Let’s start on the bottom tine of the trident in this graphic, which shows us that the cost of raw hard disk space is now about one-hundredth of what it was 10 years ago (it costs less than 1% of what it did then). Dramatic, but obvious to anyone who has purchased a computer in the last decade.

Now, look at the middle tine – it shows us that the money we spend on enterprise storage equipment has remained relatively unchanged over those same ten years. At first, this doesn’t seem all that dramatic. In fact, when I first show this to people, they are surprised that the enterprise storage hardware numbers haven’t gone up dramatically. The reality is that the numbers fluctuate significantly with economic conditions, like any commodity. The other factor is that we are starting the comparison near the peak of an unprecedented boom in IT spending (i.e., the dot-com years). However, this misses the larger point, which is quite startling: we are spending as much on storage 10 years later, when the price of the raw materials – disk drives – has dropped to 1% of what it was.

Let’s put this in perspective with an analogy. The average American drives 12,000 miles each year. At a rate of 30 mpg, that means he/she uses 400 gallons of fuel. At current prices of $3.00 per gallon, he/she spends $1,200 each year on gas. Now, if the price of gas dropped as much as the price of hard drives has – from $3.00 per gallon to 3 cents per gallon – that same $1,200 would buy 40,000 gallons, enough to drive 1.2 million miles per year, not 12,000. And that is exactly what we have been doing with digital information. The cost of hard drives has dropped to less than 1% of what it was, yet we have continued to spend the same amount of money. Clearly, we are “driving” more.
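For readers who want to check the analogy, here is a small sketch of the arithmetic. The mileage, fuel economy, and prices are the figures from the paragraph above; the function names are my own, made up for illustration, and prices are held in whole cents so the arithmetic stays exact.

```python
def annual_fuel_budget_cents(miles, mpg, price_cents_per_gallon):
    """Yearly fuel spend, in cents, for a given mileage and fuel economy."""
    gallons = miles // mpg
    return gallons * price_cents_per_gallon


def miles_for_budget(budget_cents, mpg, price_cents_per_gallon):
    """How far a fixed fuel budget goes at a given price per gallon."""
    gallons = budget_cents // price_cents_per_gallon
    return gallons * mpg


# Today: 12,000 miles at 30 mpg and $3.00 (300 cents) per gallon.
budget = annual_fuel_budget_cents(12_000, 30, 300)

# Same budget after a 100-fold price drop, to 3 cents per gallon.
miles_at_cheap_price = miles_for_budget(budget, 30, 3)

print(f"Annual budget: ${budget / 100:,.2f}")
print(f"Miles on the same budget at 3 cents/gallon: {miles_at_cheap_price:,}")
```

Holding the budget constant while the unit price falls 100-fold means buying 100 times as much, which is the point of the analogy: flat storage spending implies roughly 100 times more raw capacity purchased.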

The third tine at the top of the graphic shows a natural consequence of this – the market for software to manage all this data is growing dramatically – more than doubling in the same decade. This tracks well to the growth in interest and investment in information governance. Managing all this information is no longer a storage problem – it’s about how well we can manage, harness, and govern that information.

10 Reasons Information Governance Makes Sense: Reason #1

The Economist Intelligence Unit, in a recent study on information governance, found that the single biggest worldwide challenge to successful adoption of information governance is the difficulty of identifying its benefits and costs. In other words, the difficulty of making the case for information governance (IG).

My next series of posts is designed to help with this problem. Although there is no magic formula or perfect argument for IG, there are many reasons that it makes sense today, and will make sense well into the future. These posts won’t try to advance an airtight argument, nor will they propose a detailed financial model. Instead, my posts will be based on observations I have made working in this market over the past decade.[1. Last year I wrote an eBook that laid out what I thought were the ten best reasons for organizations to invest in information governance. That eBook is available for download on the FCS IG website here, but based on several requests from readers, I thought I would adapt some of the content for my blog here.]

#1. We Can’t Keep Everything Forever

“Information workers, who comprise about 63% of the U.S. work force, are each bombarded with 1.6 gigabytes of information on average every day through emails, reports, blogs, text messages, calls and more. . .”

“Don’t You Dare Email This Story,” Wall Street Journal[2. Andrea Coombes, “Don’t You Dare Email This Story,” Wall Street Journal, May 17, 2009. Online at http://online.wsj.com/article/SB124252211780027326.html]

In Brief. IG makes sense because it enables organizations to get rid of unnecessary information in a defensible manner. Organizations need a sensible way to dispose of information in order to reduce the cost and complexity of their IT environments. Having unnecessary information around only makes it more difficult and expensive to harness information that has value.

Most statistics on the volume of digital information organizations create contain numbers so large that they are hard to comprehend (for example, “the digital universe” is 281 exabytes in size[3. International Data Corporation, “The Diverse and Exploding Digital Universe: An Updated Forecast of Worldwide Information Growth Through 2011,” March 2008.]). Organizations experience 30, 50, or even 100 per cent annual growth in the volume of information they store. The trend doesn’t seem to be slowing down. Although the cost of storage hardware continues to drop, storage hardware costs are just the beginning. According to International Data Corporation, the total cost of storage ownership “far outweighs the initial purchase price” of the hardware, and includes factors such as migration, outage, performance, information governance, environmental, data protection, maintenance, and staff costs.[4. Nick Sundby, “Storage Economics: Assessing the Real Cost of Storage,” International Data Corporation, December 2008.]
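Annual growth rates like these are easier to grasp once they are compounded. The sketch below does that for the three rates mentioned above. The 100 TB starting volume and the five-year horizon are hypothetical values chosen only for illustration, and the function name is my own.

```python
def projected_volume(start_tb, annual_growth, years):
    """Stored volume after compounding a fixed annual growth rate."""
    return start_tb * (1 + annual_growth) ** years


# HYPOTHETICAL inputs: a 100 TB starting point and a 5-year horizon.
START_TB = 100
YEARS = 5

for rate in (0.30, 0.50, 1.00):
    volume = projected_volume(START_TB, rate, YEARS)
    print(f"{rate:.0%} annual growth: {volume:,.0f} TB after {YEARS} years")
```

Over five years, those rates multiply the starting volume by roughly 3.7, 7.6, and 32 times respectively, which is why storage that is only kept “for now” accumulates so quickly.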

Organizations often claim that they are just keeping a piece of information “for now.” Without a firm plan in place, this really means “keeping it forever.” After all, unless you plan on keeping a piece of information forever, you will need to make a destruction decision about it at some point. Will that destruction decision be easier or more difficult in the future? In three, five, or ten years, will:

  • You have the software that created the information?
  • You have the hardware to read the media that the information is stored on?
  • The employee who created it still be working at the company?
  • The department that the employee worked in still exist?
  • Anyone remember anything about the project that the document was created for?
  • Litigation be filed that requires the preservation of that information?

IG, with its legal and compliance foundations, provides a defensible approach to disposing of unnecessary information. The combination of good policies around retention of information during normal business operations and preservation of information during litigation or regulatory investigation protects your organization. The law doesn’t require us to keep everything forever, but only IG provides a defensible framework to help us get rid of the information we don’t want and aren’t required to keep.

NOTE: Stay tuned for the next nine reasons.