Text analysis + content management = insight

We have long wondered why more content management vendors don’t fully embrace text analysis (or even enterprise search for that matter).

These guardians of most organizations’ unstructured data were beaten to the punch in terms of exploiting text by business intelligence companies, which are more accustomed to manipulating structured data. It’s great that the BI companies are starting (slowly) to embrace the idea of unlocking the value locked within unstructured text, but it’s somewhat bizarre that content management vendors didn’t get there first.

We said this many years ago, most coherently in mid-2005 in our report called Text-aware applications: the endgame for unstructured data (the clue’s in the title).

In that report we said:

“…while the penetration of content management systems is relatively high when compared with other ways of managing unstructured data, these systems do little at present to help analyze that unstructured data.”

and somewhat optimistically:

“Indeed, despite the CMS’s [content management systems] ability to organize, most implementations rarely attempt to push into anything that could be considered a semantic understanding of the content. This may be set to change, however, with some vendors, such as EMC, making headway in automatically parsing documents at a deeper level than just file-level metadata.”

That was a tad premature on our part.

Think about the main players and what they do to understand what resides in the documents they ‘manage.’

EMC Documentum – it has its content intelligence services classification engine, sure, and it bought a federated search product many moons ago, but neither is exactly front and center in its product strategy. And ILM (try searching on that now at EMC and see what you get) only dealt with file-level metadata, not semantic metadata. However, the X-Hive acquisition was an interesting one from this standpoint (see below for more on XML databases).

Vignette – bar an OEM relationship with Autonomy (which most vendors have), nothing much doing here, despite the need for Web content management to increase its understanding of the text it’s managing to make websites more attractive to advertisers (think of using text analysis to build links to other content automatically to keep visitors on the site longer).

Interwoven – Metatagger isn’t exactly at the bleeding edge any more, although the idea is sound.

IBM Filenet – here there is hope. IBM has taken a classifier it got from its iPhrase acquisition and used it to do initial classification to help determine what should or should not be deemed a record. IBM has all sorts of text analysis toys to play with and we expect more from it in the future.

Open Text – it once had five search engines, and was a pioneer in that space. But I’m not aware of anything it does to extract meaning from the content it manages.

Autonomy – Its tagline is ‘Meaning-based computing.’ It owns a powerful classification engine but now also owns records management and a bunch of other stuff. It’s the one company that checks most of the boxes here (but isn’t a document or Web content management vendor). But as the company currently refuses to talk to us, we’re in the dark as to which bit fits where and are unable to tell our clients what benefits Autonomy could bring them as a result. If the company cares to get in touch with me, I’m here.

This post was prompted partly by a recent conversation I had with Nstein. It is morphing from being a struggling text analysis vendor laden with debt (it’s publicly traded in Canada, so the numbers don’t lie) to a fast-growing combination of Web content management, digital asset management (via acquisitions in 2006 and 2007) and text analysis, built atop an XML database licensed from IxiaSoft. It’s focusing exclusively on the largest publishing companies, using the text analysis to automatically create links between new and archived content (thus pushing it up Google rankings). It competes with Mark Logic and Interwoven, mainly.
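
For readers who want a feel for what automatic linking between new and archived content involves, here is a minimal sketch in Python. It is not a description of how Nstein or any other vendor does it; it assumes a plain TF-IDF weighting with cosine similarity, and the function names (tokenize, tfidf_vectors, cosine, suggest_links) and the sample headlines are all made up for illustration. A real product would add entity extraction, stemming, language detection and a good deal more.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, using smoothed TF-IDF."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def suggest_links(new_article, archive, top_n=2):
    """Rank archived articles (title -> text) by similarity to a new article."""
    titles = list(archive)
    vectors = tfidf_vectors([archive[t] for t in titles] + [new_article])
    new_vec = vectors[-1]
    scored = [(t, cosine(new_vec, v)) for t, v in zip(titles, vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


if __name__ == "__main__":
    # Hypothetical archive, invented for this example.
    archive = {
        "Rates held steady": "The central bank held interest rates steady as inflation cooled.",
        "Cup final preview": "The two football clubs meet in the cup final on Saturday.",
        "Housing market slows": "Mortgage costs rose after the bank lifted interest rates, slowing the housing market.",
    }
    new_article = "The central bank raised interest rates again to fight inflation."
    for title, score in suggest_links(new_article, archive):
        print(f"{title}: {score:.2f}")
```

Even this crude approach ranks the two stories about interest rates above the football preview, which is the basic behaviour a publisher wants when tying a new story to its archive. It is still word matching, though, so it shares the weakness described in the next paragraph.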

Any Gmail user who looks in their spam folder and sees ads for “Spam Swiss Pie – Bake 45-55 minutes or until eggs are set” can appreciate that crude keyword matching against content is next to useless.

There’s so much more that can be done here and so much insight being left on the table, whether it be in better website management to attract readers, voice of the customer analysis tied to BI, or government intelligence.

Tools that manage content need to understand that content – its language, its meaning, its sentiment. Otherwise, they are missing a trick.
