“Information retrieval is a significant problem for businesses. Further, the extent of the problem worsens with increasing size of the document collection [and] the less formal the information stored.”
Information Retrieval in Business: An Unmet Challenge[i]
Unstructured information, at its simplest, is information that does not reside in the rows and columns of a database. Any database user understands that the meaning of a field in a database is a combination of what the row and the column each mean, such as the price of a widget on a certain date. However, unlike the structured information that resides in databases, unstructured information does not always have a predetermined form, business purpose, use, value, or security classification.
As a result, managing unstructured information is tricky. Many long-established techniques for database administration simply do not apply. This complexity also makes calculating the total cost of unstructured information difficult.
Unstructured information comes in many forms, including word processing documents, spreadsheets, social media posts, and log files automatically generated by computer servers. Some kinds have more structure than others (email messages, for example, all have a header, subject line, and message body). Some call this semi-structured information, but for our purposes, we will use the term unstructured information to include semi-structured information as well.
The volume of unstructured information is growing dramatically. Analysts estimate that, over the next decade, the amount of data worldwide will grow 44-fold (from 0.8 zettabytes to 35 zettabytes; 1 zettabyte = 1 trillion gigabytes).[ii] Moreover, the volume of unstructured information will grow 50% faster than that of structured data. Analysts also estimate that fully 90% of unstructured information will require formal governance and management by 2020. In other words, the problem of unstructured information governance is growing faster than the problem of data volume itself.
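As a quick sanity check on the figures quoted above, the following minimal sketch recomputes the growth multiple and the unit conversion. It assumes decimal (SI) units, and the constant and function names are illustrative only; they do not come from the cited study.

```python
# Figures quoted in the text: worldwide data growing from 0.8 ZB to 35 ZB.
START_ZB = 0.8
END_ZB = 35.0

# Assumption: decimal (SI) units, 1 GB = 10**9 bytes and 1 ZB = 10**21 bytes.
BYTES_PER_GB = 10**9
BYTES_PER_ZB = 10**21
GB_PER_ZB = BYTES_PER_ZB // BYTES_PER_GB  # 10**12, i.e. one trillion


def growth_factor(start, end):
    """Return how many times larger `end` is than `start`."""
    return end / start


def zb_to_gb(zettabytes):
    """Convert zettabytes to gigabytes under the decimal-unit assumption."""
    return zettabytes * GB_PER_ZB


if __name__ == "__main__":
    # Rounded to the nearest whole multiple, matching the quoted "44 times".
    print("Growth multiple:", round(growth_factor(START_ZB, END_ZB)))
    print("Gigabytes in one zettabyte:", zb_to_gb(1))
```

Under these assumptions the ratio is just under 44, which is consistent with the rounded multiple the analysts report.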
What makes unstructured information so challenging? There are several factors, including:
- Horizontal vs. Vertical. Unstructured information is typically not clearly attached to a department or a business function. Unlike the vertical focus of an ERP database, for example, an email system serves multiple business functions – from employee communication to filing with regulators – for all parts of the business. Unstructured information is much more horizontal, making it difficult to develop and apply business rules.
- Formality. The tools and applications used to create unstructured information often engender informality and the sharing of opinions that can be problematic in litigation, investigations, and audits – as has been repeatedly demonstrated in front-page stories over the past decade. This problem is not likely to get any easier as social media technologies and mobile devices become more common in the enterprise.
- Management Location. Unstructured information does not have a single, obvious home. Although email systems rely on central messaging servers, email is just as likely to be found on a file share, mobile device, or laptop hard drive. This makes the application of management rules more difficult than the application of the same rules in structured systems, where there is a close marriage between the application and the database.
- “Ownership” Issues. No employee thinks that they “own” data in an accounts receivable system the way they “own” their email or the documents stored on their hard drive. Although such information generally has a single owner, i.e., the organization itself, this mindset can make imposing management rules on unstructured information more challenging than imposing them on structured data.
- Classification. The business purpose of a database is generally determined prior to its design. By contrast, the business purpose of unstructured information is difficult to infer from the application that created or stores it. A word processing file stored in a collaboration environment could be a multi-million dollar contract or a lunch menu. As such, classifying unstructured content is more complex and expensive than classifying structured information.
Taken together, these factors reveal a simple truth: managing unstructured information is a separate and distinct discipline from managing databases. Moreover, determining the costs and benefits of owning and managing unstructured information is a unique – but essential – challenge.
[i] Michael D. Gordon, “Information Retrieval in Business: An Unmet Challenge,” The University of Michigan, 1991. Online at http://deepblue.lib.umich.edu/handle/2027.42/35654
[ii] International Data Corporation, “The 2011 Digital Universe Study,” June 2011. Online at http://www.emc.com/collateral/demos/microsites/emc-digital-universe-2011/index.htm