
Archive for the ‘unstructured information’ Category

Unstructured, Semi-Structured and Structured Processes

Sunday, February 15th, 2009

I was thinking about a comment from Dennis Byron where he asks (and answers) “Are there ten times as many unstructured processes in the world as structured processes just as there is ten times as much unstructured data as structured data?”

So I thought I’d try to take this analogy a bit further. Before I do that, I’ll define business process using a modified Wikipedia definition: “A business process or business method is a collection of related activities or tasks that produce a specific service or product (serve a particular goal) for a particular customer or customers.” Wikipedia actually used the term “structured activity”, but I don’t understand what that means, so I left it out. So now on to the different types of processes:

  • Unstructured processes - every instance of the process can be different from the others, based on the environment, the content and the skills of the people involved. These are always human processes. They may have a framework or guideline driving them, but only as a recommendation.
  • Structured processes - a rigorously defined process with an end-to-end model that takes into account all the process instance permutations. No process instance can stray from the process model. Just like structured data: there is a specific data model associated with the data, the data cannot stray from that model, and if it does, the data is invalid (see the sketch after this list).
  • Semi-structured processes - processes in which a portion of the process is structured, and unstructured processes are sometimes invoked (during exceptions, or when the model doesn’t hold).
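To make the structured-data side of that analogy concrete, here is a minimal sketch in Python (the record and field names are hypothetical, purely for illustration) of what “the data cannot stray from the model” means: a record either conforms to the declared model or it is rejected, whereas an unstructured note has no model to violate.

    # A minimal sketch of the structured-data analogy: the model (schema) is
    # fixed, and any record that strays from it is simply invalid.
    # The field names here are hypothetical, for illustration only.
    ORDER_MODEL = {"order_id": int, "customer": str, "amount": float}

    def is_valid(record: dict) -> bool:
        """A record is valid only if it has exactly the modeled fields, typed as declared."""
        return (set(record) == set(ORDER_MODEL) and
                all(isinstance(record[f], t) for f, t in ORDER_MODEL.items()))

    print(is_valid({"order_id": 17, "customer": "Acme", "amount": 99.5}))  # True
    print(is_valid({"order_id": 17, "note": "call me about this order"}))  # False - strays from the model

    # An unstructured note, by contrast, has no model to validate against:
    note = "Spoke with Acme; they want to change the order, details TBD."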

While thinking that through I came to the conclusion that, as opposed to data, there really is no such thing as a true structured business process once you get people involved (and most business processes require people sooner or later). If you really want an end-to-end model of a business process that works, the best you can hope for is a semi-structured process.

Bottom Up vs. Top Down Process Understanding, or Another Difference between BPM and HPM

Tuesday, September 16th, 2008

I was at the Gartner BPM conference last week. Walking around the vendor showcase, one thing that struck me was how similar all of the vendors’ offerings seemed to me (with a few exceptions). Sure, some were traditional enterprise software, some were SaaS, and some vendors stressed one set of features while others stressed a different set - but to me, all the vendors in a given BPM space (document centric vs. integration centric) looked pretty similar. I am guessing most people interested in a BPMS feel the same way.

What interested me was the process they prescribed for creating a BPMS-based application to implement an existing business process. For most, the first step is to create a model describing the process, using a BPMN modeling tool. The model is usually created by a business analyst (usually someone in the IT department) who understands the process. This model is a high-level description of the business process, used to bridge the gap between the business (they understand the process) and IT (they understand implementation and data). What struck me was how much the methodology reminded me of the “traditional” top-down ways of creating software. Since it is very difficult to automatically create the actual complete, executable production system from the BPMN model, the model serves as a requirements definition for the development phase, which is handled by IT. Any end-user iteration and understanding is around the BPMN model - a very abstract description of the process to be implemented. This is then handed to the IT folks for implementation, with the standard lag of months between requirements and actual system.

This will work fine for processes that are rigorously defined, unchanging and complete; it may work for processes that are rigorously defined with a small number of exceptions; and it will completely break down for ad-hoc, unstructured Human Processes. The reason is that these ad-hoc human processes are not well defined, and exceptions are the rule. The only way to approach them is the same way you approach building human-intensive software - iteratively, working intensely with customers on working prototypes, either low-fidelity or high-fidelity. John Gould and Stephen Boies taught me long ago that iterating on the spec (i.e. requirements or model) just doesn’t work. I also learned that if you are implementing an existing process, you want to keep it as familiar as possible to the users - which means letting users continue to use whatever they are used to (or feels natural to them) whenever possible.

This is why I think that existing BPMS vendors won’t do well in the ad-hoc, unstructured Human Process space. It will require a much lighter-weight, more flexible (or bottom-up) environment where processes can be easily created, modified and tested in the field - with the turnaround between versions (including the initial version) measured in days (or hours) instead of weeks or months. I personally believe that the more Human Process Management Systems let users remain within familiar user environments (currently eMail and MS Office tools; Wikis and other tools in the future), the easier it will be to get these systems accepted by the organization and end users.

More on Human Process Management

Wednesday, August 27th, 2008

I have been thinking some more about Human Process Management, especially how it differs from Business Process Management. Certainly one key difference is whether the process is structured or not - i.e. whether you can prescribe the execution of the process based on some model of the business.

It is clear that there are a number of mainstream business processes that lend themselves to such a model (e.g. ERP, CRM), but I claim that most processes in an organization are human2human (or people2people) processes, and they tend to be ad-hoc and dynamic. It turns out that even structured processes have a large number of exceptions - which tend to be handled in a relatively ad-hoc, case-by-case manner.

I was reading an old blog post by Pravin Indurkar where he looked at B2B EDI purchase order transactions at a small business and found that though there was one standard process, there were 65(!) different variations depending on the nature of the order. While it would be possible to model all the possible process paths, it certainly would be time-consuming and expensive. My guess is that as soon as you model the 65 different possibilities, there will be a 66th and a 67th that arise to take into account variations and business conditions - so these are usually handled through an unmanaged human exception process. These human processes are either exclusively human-to-human processes (collaboration) or human processes that invoke various systems as part of the process (what Barry Briggs calls human-down processes).

These types of Human Process are far too fluid and dynamic to be made part of an Enterprise BPM system - and tend to be handled through email - yet another cause of email Information Overload…

Linking Documents and Process

Thursday, August 7th, 2008

I have been thinking about documents and their usage context in organizations. Knowing how a document is used is just as important as knowing its content, yet today’s document repositories don’t really know the usage context of the documents they store. At best they let users try to make up the gap with tagging and descriptions. Most human-centric organizational processes entail the use of various documents as a natural part of the process - either as input to the process (e.g. research or background) or as output (e.g. a findings report). So the link between documents and their process context is a natural one, and critical if you really want to understand the document.

So it is surprising to me that this hasn’t come up more as an issue in document management systems - the need to really connect documents with the flow of the human-centric process that uses them - even if the process is an ad-hoc one executed (as most are) over email.

You could decide to implement every process as a workflow in a document management system - but for many processes that would be overkill (especially the ad-hoc kind), would just take too long, and would require too many IT resources - not to mention that it would require users to learn a new way of doing things. If you decide to keep the documents in a standard repository, then you lose the connection to the process that used or generated the documents - which means that you really can’t understand how a document is actually used in an organizational context…
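As a thought experiment, here is a minimal sketch (in Python; all names and fields are hypothetical) of the kind of usage-context record a repository would need to keep alongside each document - which process touched it, in what role, and when - so the document’s organizational context isn’t lost:

    from dataclasses import dataclass, field
    from datetime import datetime

    # Hypothetical records linking a stored document to the processes that use it.
    @dataclass
    class ProcessLink:
        process_id: str   # e.g. an ad-hoc email thread or a workflow instance
        role: str         # "input" (research, background) or "output" (findings report)
        actor: str        # who used or produced the document
        timestamp: datetime

    @dataclass
    class Document:
        doc_id: str
        title: str
        process_links: list[ProcessLink] = field(default_factory=list)

    # The same document, understood through its process context:
    report = Document("doc-42", "Q2 findings report")
    report.process_links.append(
        ProcessLink("email-thread-317", "output", "analyst@example.com", datetime(2008, 8, 1)))
    report.process_links.append(
        ProcessLink("budget-review-9", "input", "cfo@example.com", datetime(2008, 8, 5)))

    for link in report.process_links:
        print(f"{report.title}: {link.role} of {link.process_id} by {link.actor}")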

eMail and Human Process Management

Monday, July 14th, 2008

Zvi referred me to an interesting post on ReadWriteWeb, Is Email In Danger? by Alex Iskold, and in many ways the comments were just as interesting as the article. It is clear that email vs. Twitter vs. IM vs. wiki is a topic that interests people. Even though those tools overlap in functionality, I’d bet each will find its proper place and there won’t be one winner. It would be interesting to see the best practices that are forming around when people use which tool - just like FedEx, US Mail and email all coexist comfortably…

Personally, I am sure that at least in a corporate setting, email is not going to be replaced in the foreseeable future. The main reason is that email has become more than just “electronic mail” - it has become the implicit mechanism of choice for managing many (if not most) of the Human Processes in most organizations.

Using email for unstructured, human-centric processes is both its strength and its weakness. The fact that email is amenable to so many diverse, unstructured processes (and all without IT support) is a huge benefit; the downside is that email isn’t really optimized for managing those processes (but rather for single messages) - so we get Information Overload in our inbox. Threaded conversations are an interesting innovation, but they don’t solve the problem either.

Think about it - in many companies there are specialty systems for the “standard, heavy-duty” processes (like ERP, CRM), but for the other processes (or, as someone coined them, the “outside SAP” - OSAP - processes), what does everybody use? eMail! Even if you have a system in place for a specific process, how do you handle exceptions? eMail! How do you work across organizational silos (or across companies)? eMail!

So as I said, I don’t think eMail will be going away any time soon.

Personalized Feeds (or more on Open APIs)

Friday, October 5th, 2007

I just read an interesting study on the problems with existing news RSS feeds from the University of Maryland’s International Center for Media and Public Relations. I think it is a great example of how users can’t depend on the organization that creates the content to provide access to that content in the form or format most useful to them, and why the ability for users to create their own feeds is so valuable. To quote from the study:

“This study found that depending on what users want from a website, they may be very disappointed with that website’s RSS.  Many news consumers go online in the morning to check what happened in the world overnight—who just died, who’s just been indicted, who’s just been elected, how many have been killed in the latest war zone.  And for many of those consumers the quick top five news stories aggregated by Google or Yahoo! are all they want.  But later in the day some of those very same consumers will need to access more and different news for use in their work—they might be tracking news from a region or tracking news on a particular issue.

It is for that latter group of consumers that this RSS study will be most useful.  Essentially, the conclusion of the study is that if a user wants specific news on any subject from any of the 19 news outlets the research team looked at, he or she must still track the news down website by website.”

Bottom line: as long as we depend on publishers as both content providers and access providers, we as consumers of content won’t be able to get what we need in the way we need it - just like with APIs. The only way to solve the problem is to allow users or some unaffiliated community to create the access to content (or the API), as opposed to limiting that ability to the publisher alone. As web 2.0 paradigms catch on with the masses, turning more and more of us into prosumers, this will become more and more of an issue. Publishers that try to control access will lose out to those that give users the ability to tailor the content to their own needs. Publishers need to understand that this benefits both them and the users.
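To illustrate what a user-created feed could look like, here is a minimal sketch (Python standard library only; the feed URLs and the topic keyword are placeholders) that pulls several publishers’ RSS feeds and keeps only the items matching a subject the user is tracking - exactly the per-topic access the study says the publishers’ own feeds don’t provide:

    import urllib.request
    import xml.etree.ElementTree as ET

    # Placeholder feed URLs - substitute any real RSS 2.0 feeds.
    FEEDS = [
        "https://example.com/world/rss.xml",
        "https://example.org/news/rss.xml",
    ]
    TOPIC = "election"  # the subject this user is tracking

    def matching_items(feed_url: str, topic: str):
        """Yield (title, link) for RSS items whose title or description mentions the topic."""
        with urllib.request.urlopen(feed_url) as resp:
            root = ET.parse(resp).getroot()
        for item in root.iter("item"):
            title = item.findtext("title", default="")
            desc = item.findtext("description", default="")
            if topic.lower() in (title + " " + desc).lower():
                yield title, item.findtext("link", default="")

    for url in FEEDS:
        for title, link in matching_items(url, TOPIC):
            print(f"{title}\n  {link}")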

I see signs that this is actually starting to happen (in a small way), with the NYTimes and WSJ both announcing personal portals for their users. The jump to personalized feeds isn’t that unthinkable…

Structured, Semi-Structured and Unstructured Data in Business Applications

Monday, July 16th, 2007

I was discussing these issues again today - so I thought this old paper must still be relevant…
 
There is a growing consensus that semi-structured and unstructured data sources contain information critical to the business [1, 3] and must be made accessible both for business intelligence and operational needs. It is also clear that the amount of relevant unstructured business data is growing, and will continue to grow for the foreseeable future. That trend is converging with the “opening” of business data through standardized XML formats and industry-specific XML data standards (e.g. ACORD in insurance, HL7 in healthcare). These two trends are expanding the types of data that need to be handled by BI and integration tools, and are straining their transformation capabilities. This mismatch between existing transformation capabilities and emerging needs is opening the door for a new type of “universal” data transformation product that will allow transformations to be defined for all classes of data (structured, semi-structured, unstructured), without writing code, and deployed to any software application or platform architecture.

The Problem with Unstructured Data
The terms semi-structured data and unstructured data can mean different things in different contexts. In this article I will stick to a simple definition for both. First, when I use the terms unstructured or semi-structured data I mean text-based information (not video or sound) that has no explicit meta-data associated with it, but does have implicit meta-data that can be understood by a human (e.g. a purchase order sent by fax has no explicit meta-data, but a human can extract the relevant data items from the document). The difference between semi-structured and unstructured is whether portions of the data have associated meta-data, or there is no meta-data at all. From now on I will use the term unstructured data to designate both semi-structured and unstructured data.

The problem is that neither unstructured data nor XML is naturally handled by the current generation of BI and integration tools – especially Extract, Transform, Load (ETL) technologies. ETL grew out of the need to create data warehouses from production databases, which means it is geared towards handling large amounts of relational data with very simple data hierarchies. However, in a world that is moving towards XML, instead of being able to assume well-structured data with little or no hierarchy in both the source and the target, the source and target will be deeply hierarchical and will probably have very different hierarchies. It is clear that the next generation of integration tools will need to do a much better job of inherently supporting both unstructured and XML data.

XML as a Common Denominator
By first extracting the information from unstructured data sources into XML format, it is possible to treat integration of unstructured data similarly to integration with XML. Also, structured data has a “natural” XML structure that can be used to describe it (i.e. a simple reflection of the source structure), so using XML as the common denominator for describing both unstructured and structured data makes integration simpler to manage.

Using XML as the syntax for the different data types allows a simple logical flow for combining structured XML and unstructured data (see Figure 1; a minimal end-to-end sketch follows the list):
1. extract data from structured sources into a “natural” XML stream,
2. extract data from unstructured sources into an XML stream,
3. transform the two streams as needed (cleansing, lookup, etc.), and
4. map the XML streams into the target XML.
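Here is a minimal sketch of that four-step flow in Python (standard library only; the table schema, field names and sample data are all hypothetical): a customer row is pulled from a database into a “natural” XML reflection, a free-text note is parsed into XML, both streams are lightly transformed, and the two are mapped into one target document.

    import re
    import sqlite3
    import xml.etree.ElementTree as ET

    # Step 1: extract structured data into a "natural" XML reflection of the source.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE customer (id INTEGER, name TEXT, city TEXT)")
    db.execute("INSERT INTO customer VALUES (1, 'Acme Corp', 'Boston')")
    row = db.execute("SELECT id, name, city FROM customer").fetchone()
    structured = ET.Element("customer")
    for col, val in zip(("id", "name", "city"), row):
        ET.SubElement(structured, col).text = str(val)

    # Step 2: extract ("parse") unstructured text into an XML stream.
    note = "Order 1234 from Acme Corp, total $99.50"
    m = re.search(r"Order (\d+) from (.+?), total \$([\d.]+)", note)
    unstructured = ET.Element("order")
    for tag, val in zip(("number", "customer", "total"), m.groups()):
        ET.SubElement(unstructured, tag).text = val

    # Step 3: transform the streams as needed (here, a trivial cleansing step).
    structured.find("name").text = structured.find("name").text.upper()

    # Step 4: map both streams into the target XML.
    target = ET.Element("customerRecord")
    target.append(structured)
    target.append(unstructured)
    print(ET.tostring(target, encoding="unicode"))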

This flow is becoming more and more pervasive in large integration projects, hand-in-hand with the expansion of XML and unstructured data use cases. These use cases fall outside the sweet spot of current ETL and Enterprise Application Integration (EAI) architectures – the two standard integration platforms in use today. The reason is that both ETL and EAI have difficulty with steps 1 and 4. Step 1 is problematic because there are very few tools on the market that can easily “parse” unstructured data into XML and allow it to be combined with structured data. Step 4 is also problematic because current integration tools have underpowered mapping facilities that fall apart when hierarchy changes, or other complex mappings, are needed. All of today’s ETL and EAI tools require hand coding to meet these challenges.

[Figure 1: A standard flow for combining structured, unstructured and XML information]

The Importance of Parsing
Of course, when working with unstructured data, it is intuitive that parsing the data to extract the relevant information is a basic requirement. Hand-coding a parser is difficult, error-prone and tedious work, which is why parsing needs to be a basic part of any integration tool (ETL or EAI). Given its importance, it is surprising that integration tool vendors have only started to address this requirement.
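To make the hand-coding pain concrete, here is a minimal sketch (Python; the fax layout and field names are invented for illustration) of the kind of parser that turns the fax purchase order example from earlier into XML - workable for one layout, but exactly the brittle, tedious code a parsing-capable integration tool should replace:

    import re
    import xml.etree.ElementTree as ET

    # A hypothetical OCR'd fax purchase order: the meta-data is implicit,
    # obvious to a human, invisible to a relational loader.
    FAX_TEXT = """\
    PURCHASE ORDER 7731
    Vendor: Acme Corp
    Date: 2007-07-16
      10 x Widget A @ 4.00
       5 x Widget B @ 12.50
    """

    def parse_purchase_order(text: str) -> ET.Element:
        """Extract the implicit fields of one fax layout into explicit XML."""
        po = ET.Element("purchaseOrder")
        po.set("number", re.search(r"PURCHASE ORDER (\d+)", text).group(1))
        ET.SubElement(po, "vendor").text = re.search(r"Vendor: (.+)", text).group(1)
        ET.SubElement(po, "date").text = re.search(r"Date: (\S+)", text).group(1)
        items = ET.SubElement(po, "items")
        for qty, desc, price in re.findall(r"(\d+) x (.+?) @ ([\d.]+)", text):
            item = ET.SubElement(items, "item", quantity=qty, unitPrice=price)
            item.text = desc
        return po

    print(ET.tostring(parse_purchase_order(FAX_TEXT), encoding="unicode"))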

The Importance of Mapping
The importance of powerful mapping capabilities is less intuitively obvious. However, in an XML world, mapping capability is critical. As XML becomes more pervasive, XML schemas are looking less like structured schemas and are becoming more complex, hierarchically deep and differentiated.

This means that the ability to manipulate and change the structure of data by complex mapping of XML to XML is becoming more and more critical for integration tools. They will need to provide visual, codeless design environments that let developers and business analysts address complex mappings, and a runtime that naturally supports them.
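As an illustration of what a hierarchy-changing mapping involves (written by hand in Python here; a mapping tool would generate the equivalent visually, and the element names are invented), this sketch regroups a flat, order-centric document into a customer-centric hierarchy:

    import xml.etree.ElementTree as ET
    from collections import defaultdict

    # Source: a flat, order-centric hierarchy.
    SOURCE = ET.fromstring("""
    <orders>
      <order id="1" customer="Acme"><total>99.50</total></order>
      <order id="2" customer="Initech"><total>12.00</total></order>
      <order id="3" customer="Acme"><total>45.25</total></order>
    </orders>""")

    # Map to a target with a different hierarchy: customers own their orders.
    by_customer = defaultdict(list)
    for order in SOURCE.iter("order"):
        by_customer[order.get("customer")].append(order)

    target = ET.Element("customers")
    for name, orders in sorted(by_customer.items()):
        cust = ET.SubElement(target, "customer", name=name)
        for order in orders:
            ET.SubElement(cust, "order", id=order.get("id")).text = order.findtext("total")

    print(ET.tostring(target, encoding="unicode"))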

Unstructured data is needed by both BI and application integration, and the transformations needed to get information out of unstructured source data can be complex. These use cases will push towards a requirement of “transformation reusability” – the ability to define a transformation once (from unstructured to XML, or from XML to XML) and reuse it in various integration platforms and scenarios. This will cause a further blurring of the lines between the ETL and EAI use cases.

Customer data is a simple example use case: take customer information from various sources, merge it, and then feed the result into an XML application that uses the data. In this case structured customer data is extracted from a database (e.g. a central CRM system) and merged with additional data from unstructured sources (e.g. branch information about that customer stored in a spreadsheet), which is then mapped to create a target XML representation. The resulting XML can be used as input to a customer application, to migrate data to a different customer DB, or to create a file to be shipped to a business partner.
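A minimal sketch of that merge (Python standard library; the CRM fields, CSV layout and join key are all assumed for illustration): CRM rows and a branch spreadsheet export are joined on a customer id, then mapped into one target XML record per customer.

    import csv
    import io
    import xml.etree.ElementTree as ET

    # Structured source: rows as they might come from a central CRM database.
    crm_rows = [{"cust_id": "C1", "name": "Acme Corp", "segment": "enterprise"}]

    # Semi-structured source: a branch spreadsheet exported to CSV.
    branch_csv = io.StringIO("cust_id,branch,contact\nC1,Boston,J. Smith\n")
    branch = {row["cust_id"]: row for row in csv.DictReader(branch_csv)}

    # Merge on the shared customer id and map into the target XML.
    target = ET.Element("customers")
    for row in crm_rows:
        cust = ET.SubElement(target, "customer", id=row["cust_id"])
        ET.SubElement(cust, "name").text = row["name"]
        ET.SubElement(cust, "segment").text = row["segment"]
        extra = branch.get(row["cust_id"])
        if extra:  # enrich with branch-level data when available
            ET.SubElement(cust, "branch").text = extra["branch"]
            ET.SubElement(cust, "contact").text = extra["contact"]

    print(ET.tostring(target, encoding="unicode"))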

Looking Ahead
Given the trends outlined above, there are some pretty safe bets about where integration tools and platforms will be going in the next 12-24 months:
1. Better support for parsing of unstructured data.
2. Enhanced mapping support, including support for business analyst end users.
3. Enhanced support for XML use cases.
4. A blurring of the line separating ETL integration products from EAI integration products (especially around XML and unstructured use cases).
5. The introduction of a new class of integration products focused on the XML and unstructured use cases. These “universal” data transformation products will allow transformations to be defined for all classes of data (structured, semi-structured, unstructured), without writing code, and deployed to any software application or platform architecture.

References
[1] Knightsbridge Solutions LLP, “Top 10 Trends in Business Intelligence for 2006.”
[2] ACM Queue, Vol. 3, No. 8, October 2005: “Dealing with Semi-Structured Data” (the whole issue).
[3] Robert Blumberg and Shaku Atre, “The Problem with Unstructured Data,” DM Review, February 2003.