Goodbye Newssift, we hardly knew you

Newssift, which was set up by the FT Search unit within the Financial Times and launched in March 2009, was shut down recently after just a few months in full operation. We looked at it from time to time and, obviously, looked at it closely at launch, talked to the executive in charge of it (who also appears to have left the FT) and to most of the vendors that supplied the technology – namely Endeca, Lexalytics, Nstein and Reel Two.

But the fact that people like us only looked at it from time to time was indicative of the problems the site apparently had. According to a source at one of the technology suppliers, the site was sticky once people had stuck around on it for a first visit, but new users were hard to come by, and those that didn’t persist through that first visit didn’t find a reason to come back.

But Newssift’s loss is a bit of a blow to those in the text analysis industry, as it was supposed to be a flagship application of the technology, brought to market by one of the pre-eminent publishers in the world. That combination apparently wasn’t enough to make it succeed.

The thing that reminded us to look at it was yesterday’s acquisition of Nstein Technologies by fellow Canadian content management player (and roll-up machine) Open Text for $35m, which seems to be primarily a case of Open Text consolidating its industry as long as it can get a good price, rather than a deal done for customers or technology.

We probably should have been using Newssift daily rather than relying on M&A to jog our memory as to its existence. But we weren’t, and now it’s gone.

Text analysis + content management = insight

We have long wondered why more content management vendors don’t fully embrace text analysis (or even enterprise search for that matter).

These guardians of most organizations’ unstructured data were beaten to the punch in exploiting text by business intelligence companies, which are more accustomed to manipulating structured data. While it’s great that the BI companies are (slowly) starting to embrace the idea of unlocking the value held within unstructured text, it’s somewhat bizarre that content management vendors didn’t get there first.

We said this many years ago, most coherently in mid-2005 in our report Text-aware applications: the endgame for unstructured data (the clue’s in the title).

In that report we said:

“…while the penetration of content management systems is relatively high when compared with other ways of managing unstructured data, these systems do little at present to help analyze that unstructured data.”

and somewhat optimistically:

“Indeed, despite the CMS’s [content management systems] ability to organize, most implementations rarely attempt to push into anything that could be considered a semantic understanding of the content. This may be set to change, however, with some vendors, such as EMC, making headway in automatically parsing documents at a deeper level than just file-level metadata.”

That was a tad premature on our part.

Think about the main players and what they do to understand what resides in the documents they ‘manage.’

EMC Documentum – it has its Content Intelligence Services classification engine, sure, and it bought a federated search product many moons ago, but neither is exactly front and center in its product strategy. And ILM (try searching on that now at EMC and see what you get) only dealt with file-level metadata, not semantic metadata. However, the X-Hive acquisition was an interesting one from this standpoint (see below for more on XML databases).

Vignette – bar an OEM relationship with Autonomy (which most vendors have), there’s not much doing here, despite the need for Web content management to increase its understanding of the text it’s managing in order to make websites more attractive to advertisers (think of using text analysis to build links to other content automatically to keep visitors on the site longer).

Interwoven – Metatagger isn’t exactly at the bleeding edge any more, although the idea is sound.

IBM FileNet – here there is hope. IBM has taken a classifier it got from its iPhrase acquisition and used it to do initial classification, to help determine what should or should not be deemed a record. IBM has all sorts of text analysis toys to play with, and we expect more from it in the future.

Open Text – it once had five search engines, and was a pioneer in that space. But I’m not aware of anything it does to extract meaning from the content it manages.

Autonomy – Its tagline is ‘Meaning-based computing.’ It owns a powerful classification engine but now also owns records management and a bunch of other stuff. It’s the one company that checks most of the boxes here (but isn’t a document or Web content management vendor). But as the company currently refuses to talk to us, we’re in the dark as to which bit fits where and are unable to tell our clients what benefits Autonomy could bring them as a result. If the company cares to get in touch with me, I’m here.

This post was prompted partly by a recent conversation I had with Nstein. It is morphing from a struggling text analysis vendor laden with debt (it’s publicly traded in Canada, so the numbers don’t lie) into a fast-growing combination of Web content management, digital asset management (via acquisitions in 2006 and 2007) and text analysis, built atop an XML database licensed from IxiaSoft. It’s focusing exclusively on the largest publishing companies, using the text analysis to automatically create links between new and archived content (thus pushing it up the Google rankings). It competes mainly with Mark Logic and Interwoven.
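To make that linking idea concrete, here is a minimal sketch of entity-overlap linking, assuming the analysis engine has already tagged each article with the entities it mentions. The function names, threshold and toy archive are invented for illustration; this is the general idea, not Nstein’s actual implementation.

```python
# Minimal sketch: surface related archive pieces for a new article by
# comparing the entities each one mentions. All names and data are toy
# assumptions, not any vendor's implementation.

def entity_overlap(a: set[str], b: set[str]) -> float:
    """Jaccard similarity between two sets of extracted entities."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def related_archive_links(new_entities: set[str],
                          archive: dict[str, set[str]],
                          threshold: float = 0.25,
                          limit: int = 5) -> list[str]:
    """Return archive URLs whose entity sets overlap the new article's."""
    scored = [(entity_overlap(new_entities, ents), url)
              for url, ents in archive.items()]
    scored = [(score, url) for score, url in scored if score >= threshold]
    return [url for _, url in sorted(scored, reverse=True)[:limit]]

# Toy archive: in practice the entities come from the text-analysis engine.
archive = {
    "/2009/03/newssift-launch": {"Newssift", "Financial Times", "Endeca"},
    "/2007/11/nstein-dam-deal": {"Nstein", "digital asset management"},
    "/2010/02/open-text-nstein": {"Open Text", "Nstein", "Financial Times"},
}
print(related_archive_links({"Nstein", "Open Text", "Lexalytics"}, archive))
```

The payoff is that every new story automatically points readers (and crawlers) at the relevant back catalogue, which is exactly the stickiness and search visibility publishers are after.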

Any Gmail user who looks in their spam folder and sees ads for “Spam Swiss Pie – Bake 45-55 minutes or until eggs are set” can appreciate that crude keyword matching against content is next to useless.
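To see why, here is a toy sketch of keyword-only matching; the ad inventory and matching rule are invented for illustration and are not how Gmail actually serves ads. The string ‘spam’ sitting in a junk folder is enough to surface the pie recipe, because nothing about meaning is consulted.

```python
# Toy illustration of bare keyword matching against content: a junk-mail
# message matches a recipe ad purely on the word "spam". Ads and rules are
# invented for the example.

ADS = {
    "Spam Swiss Pie": ["spam", "recipe", "bake"],
    "Email security suite": ["phishing", "malware", "inbox"],
}

def match_ads(message: str) -> list[str]:
    """Return ads whose keywords literally appear in the message text."""
    words = message.lower().split()
    return [ad for ad, keywords in ADS.items()
            if any(keyword in words for keyword in keywords)]

junk_mail = "This message was moved to your Spam folder: cheap watches inside"
print(match_ads(junk_mail))  # prints ['Spam Swiss Pie']: keyword hit, meaning miss
```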

There’s so much more that can be done here, and so much insight being left on the table, whether it be better website management to attract readers, voice-of-the-customer analysis tied to BI, or government intelligence.

Tools that manage content need to understand that content – its language, its meaning, its sentiment. Otherwise, they are missing a trick.
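As a closing illustration, here is a minimal sketch of what that could look like at ingest time: attach entities and a crude tone score to the stored record, rather than just file-level metadata. The entity list, sentiment lexicon and field names are toy assumptions standing in for a real text analysis engine.

```python
# Minimal sketch of enriching content at ingest so the repository stores
# meaning alongside the file. The entity list and sentiment word lists are
# toy stand-ins for a real text-analysis engine.

KNOWN_ENTITIES = {"Open Text", "Nstein", "Financial Times", "Newssift"}
POSITIVE = {"growth", "success", "strong"}
NEGATIVE = {"loss", "shut", "struggling"}

def enrich(doc_id: str, text: str) -> dict:
    """Attach language-level metadata to a document record."""
    words = set(text.lower().split())
    entities = [e for e in KNOWN_ENTITIES if e.lower() in text.lower()]
    sentiment = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return {
        "id": doc_id,
        "entities": sorted(entities),     # what the document is about
        "sentiment": sentiment,           # crude tone score
        "word_count": len(text.split()),  # basic structural metadata
    }

record = enrich("newssift-postmortem",
                "Newssift was shut down; a loss for text analysis.")
print(record)
```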