I was discussing these issue again today - so I thought this old paper must still be relevant….
There is a growing consensus that semi-structured and unstructured data sources contain information critical to the business [1, 3] and must be made accessible both for business intelligence and operational needs. It is also clear that amount of relevant unstructured business data is growing, and will continue to grow in the foreseeable future. That trend is converging with the “opening” of business data through standardized XML formats and industry specific XML data standards (e.g. ACORD in insurance, HL7 in healthcare). These two trends are expanding the types of data that need to be handled by BI and integration tools, and are straining their transformation capabilities. This mismatch between existing transformation capabilities and these emerging needs is opening the door for a new type of “universal” data transformation products that will allow transformations to be defined for all classes of data (e.g., structured, semi-structured, unstructured), without writing code, and deployed to any software application or platform architecture.
The Problem with Unstructured Data
The terms semi-structured data and unstructured data can mean different things in different contexts. In this article I will stick to a simple definition for both. First when I use the terms unstructured or semi-structured data I mean text based information, not video or sound, which has no explicit meta data associated with it, but does have implicit meta-data that can be understood by a human (e.g. a purchase order sent by fax has no explicit meta-data, but a human can extract the relevant data items from the document). The difference between semi-structured and unstructured is whether portions of the data have associated meta-data, or there is no meta-data at all. From now on I will use the term unstructured data to designate both semi-structured and unstructured data.
The problem is that both unstructured data and XML are not naturally handled by the current generation of BI and integration tools – especially Extract, Transform, Load (ETL) technologies. ETL grew out of the need to create data warehouses from production database, which means that it is geared towards handling large amounts of relational data, and very simple data hierarchies. However in a world that is moving towards XML, instead of being able to assume well-structured data with little or no hierarchy in both the source and target, the source and target will be very deeply hierarchical and probably have very different hierarchies. It is clear that the next generation of integration tools will need to do a much better job of inherently supporting both unstructured and XML data.
XML as a Common Denominator
By first extracting the information from unstructured data sources into XML format, it is possible to treat integration of unstructured data similarly to integration with XML. Also, structured data has a “natural” XML structure that can be used to describe it (i.e. a simple reflection of the source structure) so using XML as the common denominator for describing unstructured data and structured data makes integration simpler to manage.
Using XML as the syntax for the different data types allows a simple logical flow for combining structured XML and unstructured data (see Figure 1):
1. extract data from structured sources into a “natural” XML stream,
2. extract data from unstructured sources into an XML stream,
3. transform the two streams as needed (cleansing, lookup etc.)
4. map the XMLs into the target XML.
This flow is becoming more and more pervasive in large integration projects, hand-in-hand with the expansion of XML and unstructured data use-cases. These use cases fall outside the sweet spot of current ETL and Enterprise Application Integration (EAI) integration architectures – the two standard integration platforms in use today. The reason is that both ETL and EAI have difficulty with steps 1 and 4. Step 1 is problematic since there are very few tools on the market that can easily “parse” unstructured data into XML and allow it to be combined with structured data. Step 4 is also problematic since current integration tools also have underpowered mapping tools that fall apart when hierarchy changes, or when other complex mappings, are needed. All of today’s ETL and EAI tools require hand coding to meet these challenges.
Figure 1: A standard flow for combing structured, unstructured and XML information
The Importance of Parsing
Of course, when working with unstructured data, it is intuitive that parsing the data to extract the relevant information is a basic requirement. Hand-coding a parser is difficult, error-prone and tedious work, which is why it needs to be a basic part of any integration tool (ETL or EAI). Given its importance it is surprising that integration tool vendors have only started to address this requirement.
The Importance of Mapping
The importance of powerful mapping capabilities is less intuitively obvious. However, in an XML world, mapping capability is critical. As XML is becoming more pervasive, XML schemas are looking less like structured schemas and are becoming more complex, hierarchically deep and differentiated.
This means that the ability to manipulate and change the structure of data by complex mapping of XML to XML is becoming more and more critical for integration tools. They will need to provide visual, codeless design environments to allow developers and business analysts to address complex mapping, and a runtime that naturally supports it.
Unstructured data is needed both by BI and application integration, and the transformations needed to get the information from the unstructured source data can be complex, these use cases will push towards the requirement of “transformation reusability” – the ability to transform the data once (from unstructured to XML, or from XML to XML) and reuse the transformation in various integration platforms and scenarios. The will cause a further blurring of the lines between the ETL and EAI use cases.
Customer data is a simple example use case. The example is to take customer information from various sources, merge it and then put the result into an XML application the uses the data. In this case structured customer data is extracted from a database (e.g. a central CRM system), merged with additional data from unstructured sources (e.g. branch information about that customer stored in a spreadsheet), which is then mapped to create a target XML representation. The resulting XML can be used as input into a customer application, migrate data to a different customer DB or create a file to be shipped to a business partner.
Given the trends outlined above there are some pretty safe bets about where integration tools and platforms will be going in the next 12-24 months:
1. Better support for parsing of unstructured data.
2. Enhanced mapping support, with support for business analyst end-users
3. Enhanced support for XML use cases.
4. A blurring of the line separating ETL integration products from EAI integration products (especially around XML and unstructured use cases)
5. Introduction of a new class of integration products that focus on the XML and unstructured use case. These “universal” data transformation products will allow transformations to be defined for all classes of data (e.g., structured, semi-structured, unstructured), without writing code, and deployed to any software application or platform architecture.
 Knightsbridge Solutions LLP – Top 10 Trends in Business Intelligence for 2006
 ACM Queue, Vol. 3 No. 8 - October 2005, Dealing with Semi-Structured Data (the whole issue)
 DM review - The Problem with Unstructured Data by Robert Blumberg and Shaku Atre, February 2003 Issue