Entries from September 2009

Information governance Q&A

Our webinar last week on information governance went well and generated some interesting questions. I didn’t get to all of the questions on the call, so I’ll take the opportunity to briefly answer some of them here, including some of the more interesting ones I did answer live. Most of these topics are covered in much more detail in our recently published report on information governance, which also spawned the webinar. The full recorded webinar is available online as well.

Q: Can you talk to any trends you see in terms of who in an organization is purchasing governance/e-discovery tools?

This is something covered in some detail in the report itself. In general, there’s some difference in purchasing patterns between “governance” and “e-discovery.” If the use case being addressed in a particular procurement process is specifically reactive e-discovery – meaning the ability to respond to a specific legal discovery request – then the process is likely to have heavy involvement from the legal department, if not full ownership by that team with IT in a supporting role.

Governance is generally broader and is likely to involve more underlying pieces of technology (e.g., archiving, records management, indexing tools for distributed data and e-discovery / early case assessment).  There’s certainly no single approach to governance and most organizations are in the earliest of stages in terms of putting in place some kind of broader governance strategy.  Procurement is still likely to be tied to more tactical requirements and the specifics of those requirements will dictate who’s involved (e.g., e-discovery is more likely to be run by legal, as noted above, while an email archiving decision is more likely to be led by IT with legal involvement).  Generally speaking, hashing out broader governance strategies may well involve IT (email management, storage, ECM and search folks), legal, compliance officers, records managers and security personnel, among others.

Q: What are your thoughts about how far right along EDRM the big ECM vendors will move?

So far, ECM vendors are focusing on the far left of the electronic discovery reference model (EDRM). This has expanded in the last twelve months or so from a far more limited focus solely on the “information management” process step to greater capabilities for data identification, collection, preservation, and some review and analysis. This is likely to continue, though I’d be surprised to see ECM vendors move much beyond this. Identification, collection and preservation will be key areas in the short term (EMC’s recent Kazeon buy is a good example of how ECM vendors will look to better handle distributed data). Review and analysis capabilities are likely to remain in the area of early case assessment, with the expectation that a winnowed-down set of data will still be turned over to external counsel for further review and analysis. That’s where most ECM vendors are likely to stop, though not all; Autonomy, for example, also plays directly in the legal market with iManage and Discovery Mining.

Q: Can you explain a bit more what you mean by “litigation readiness”?   What processes does this cover?

I guess this is a phrase I use a lot when talking about information governance and perhaps I didn’t explain it well enough on the webinar.  Litigation readiness is really just one reason organizations are interested in information governance.  Poor information governance makes it difficult to respond efficiently and cost effectively to e-discovery.  There are a number of processes involved in better preparing for litigation, but ideally, organizations need to have some high-level understanding of what data exists, where it is and who has access to it.  That’s a whole lot easier said than done of course, particularly when you need to include data on desktops, laptops, shared file drives and so forth.  The processes generally need to encompass maintaining some kind of index of what resides on all those devices and how that data will be captured and secured if needed.  That needs to be combined of course with more formalized management of data in archives and records management systems, with some consistency in terms of retention and disposition policies (that are standardized and enforced) across sources.  Few organizations have a very good handle on this sort of thing across repositories and unmanaged devices today, but those that are more often involved in litigation are likely to be more litigation-ready.
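
To make the “index of what resides on those devices” idea concrete, here is a deliberately minimal Python sketch of a file-metadata inventory (the paths and fields are invented for illustration, and real governance products add remote collection, access controls and content indexing on top of this):

```python
import csv
import hashlib
import os
from datetime import datetime, timezone

def inventory(root, out_csv):
    """Record basic metadata for every file under root.

    A toy stand-in for the data inventory described above, not any
    vendor's product; the field choices and paths are illustrative.
    """
    with open(out_csv, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["path", "size_bytes", "modified_utc", "sha256"])
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    stat = os.stat(path)
                    with open(path, "rb") as fh:
                        digest = hashlib.sha256(fh.read()).hexdigest()
                except OSError:
                    continue  # skip unreadable files rather than fail
                modified = datetime.fromtimestamp(
                    stat.st_mtime, tz=timezone.utc).isoformat()
                writer.writerow([path, stat.st_size, modified, digest])

inventory("/shared/finance", "inventory.csv")  # hypothetical share path
```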

Q: Is Information Governance of primary interest in the US or are companies in Europe also concerned? I.e. is there an opportunity for vendors beyond the US?

Information governance as it relates to litigation readiness is of primary interest in the US and in the parts of Europe that have similar discovery or disclosure requirements for electronic information. In geographies that don’t yet have such strict requirements for electronic discovery, governance may still be of interest, but for different reasons. Compliance with specific regulations (e.g., privacy-related legislation) can be a concern, for example, as can IP protection or other types of security. So there is certainly opportunity for vendors in specific markets, such as archiving, but the drivers might be different.

That’s probably enough for one blog post.  Again, those interested in the full webinar can find it here.

Let’s talk about info governance

This Thursday I’ll host a short webinar to discuss some of the findings from our recently published report on the emerging information governance market. This report looks at how archiving, records management and e-discovery technologies are coming together to help organizations get a better handle on internal data for litigation readiness and compliance purposes.

The webinar is free and open to anyone, so please feel free to join if you’re interested in this topic.

During the webinar, I’ll outline some of the trends we uncovered while doing our research for this report, look at the vendor landscape and M&A activity in this area, and briefly discuss some of the technologies that we think will be important in this sector moving forward.

Here’s the info and registration link:

The Rise of Information Governance webinar

Thursday, September 24, 2009

12:00 – 1:00 PM EDT

Register here

Recorded versions of our webcasts are available on our site a short while after the events are over.

Autonomy pops up to pronounce an RDBMS revolution is afoot

In one of those Autonomy announcements that seemingly appear out of nowhere, the company has declared its intention to “transform” the relational database market by applying its text analysis technology to content stored within databases. The tool is called IDOL Structured Probabilistic Engine (SPE) because it applies the same Bayesian probabilistic inferencing technology that IDOL uses on unstructured information.

The quote from CEO Mike Lynch grandly proclaims this to be Autonomy’s “second fundamental technology” – IDOL itself being the first. That’s quite a claim; we’re endeavoring to find out more and will report back on exactly how it works and what it can do.
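
We have no visibility yet into how SPE actually models structured data, but to give a flavor of what Bayesian probabilistic inferencing over database records can mean in general, here is a deliberately toy Python sketch of generic naive Bayes scoring over rows (the table, columns and data are all invented, and this is not a description of Autonomy’s implementation):

```python
import math
from collections import defaultdict

# Toy illustration of probabilistic inference over structured records:
# estimate P(label | attribute values) from labeled rows using naive
# independence assumptions. Generic naive Bayes, NOT Autonomy's SPE
# internals; all data is invented.

rows = [
    {"region": "EMEA", "product": "archiving", "renewed": "yes"},
    {"region": "EMEA", "product": "search",    "renewed": "no"},
    {"region": "US",   "product": "archiving", "renewed": "yes"},
    {"region": "US",   "product": "search",    "renewed": "no"},
]

def train(rows, label):
    value_counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    label_counts = defaultdict(int)
    for row in rows:
        y = row[label]
        label_counts[y] += 1
        for col, val in row.items():
            if col != label:
                value_counts[y][col][val] += 1
    return value_counts, label_counts

def most_probable(value_counts, label_counts, query):
    total = sum(label_counts.values())
    scores = {}
    for y, n in label_counts.items():
        logp = math.log(n / total)  # class prior
        for col, val in query.items():
            # add-one smoothing so unseen values don't zero out a class
            logp += math.log((value_counts[y][col][val] + 1) / (n + 2))
        scores[y] = logp
    return max(scores, key=scores.get)

value_counts, label_counts = train(rows, "renewed")
print(most_probable(value_counts, label_counts,
                    {"region": "EMEA", "product": "archiving"}))  # -> "yes"
```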

Overall, though, this is part of a push by Autonomy – along with Attivio, Endeca, Exalead and some others – into the search-based application market. The underlying premise of that market is database offloading: using a search engine rather than a relational database to sort and query information. It holds great promise, partly because it bridges enterprise search and business intelligence, but also because of the prospect of cost savings for customers, who can freeze their investments in relational database licenses, reduce them, or even eliminate them.
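
To make the offloading idea concrete, here is a minimal Python sketch (invented data, not any particular vendor’s engine) contrasting a SQL-style row scan with the precomputed postings lists a search engine maintains:

```python
from collections import defaultdict

# Hypothetical product data; in the offloading scenario this would live
# in a search index rather than (or alongside) an RDBMS table.
records = [
    {"id": 1, "color": "red",  "size": "M"},
    {"id": 2, "color": "red",  "size": "L"},
    {"id": 3, "color": "blue", "size": "M"},
]

# RDBMS-style approach: scan every row for each query.
sql_style = {r["id"] for r in records
             if r["color"] == "red" and r["size"] == "M"}

# Search-style approach: build inverted postings lists once, then
# answer filters by intersecting small sets of document IDs.
index = defaultdict(set)
for r in records:
    for field in ("color", "size"):
        index[(field, r[field])].add(r["id"])

search_style = index[("color", "red")] & index[("size", "M")]
assert sql_style == search_style

# Facet counts, the bread and butter of search-based applications:
print({val: len(ids) for (field, val), ids in index.items()
       if field == "color"})  # -> {'red': 2, 'blue': 1}
```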

Of course, if enterprise search licenses then become so expensive as to nullify the cost benefit, customers will reject the idea – something search vendors need to be wary of.

Users can apply to join the beta program at a very non-Autonomy-looking website.

The future of the database is… plaid?

Oracle has introduced a hybrid column-oriented storage option for Exadata with the release of Oracle Database 11g Release 2.

Ever since Mike Stonebraker and fellow researchers at MIT, Brandeis University, the University of Massachusetts and Brown University presented C-Store (PDF), a column-oriented database, at the 31st VLDB Conference in 2005, the database industry has debated the relative merits of row- and column-store databases.

While row-based databases dominate the operational database market, column-based databases have made inroads in the analytic database space, with Vertica (based on C-Store) as well as Sybase, Calpont, Infobright, Kickfire, ParAccel and SenSage pushing column-based data warehousing products, based on the argument that column-based storage favors the read performance required for query processing.
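
The storage-layout argument is easy to see in miniature. Here is a toy Python sketch (an invented table, not any vendor’s engine) of the same data in row and column form:

```python
# Why analytic queries favor column stores, in miniature: summing one
# attribute walks every full record in a row layout, but only a single
# contiguous array in a column layout. Data is invented for illustration.

# Row-oriented layout: each record stored together (good for
# transactional reads and writes of whole records).
rows = [(1, "widget", 9.99), (2, "gadget", 4.50), (3, "gizmo", 7.25)]
total_row_store = sum(r[2] for r in rows)  # touches every row in full

# Column-oriented layout: each attribute stored together (good for
# scans, and for compression, since values in a column are similar).
ids    = [1, 2, 3]
names  = ["widget", "gadget", "gizmo"]
prices = [9.99, 4.50, 7.25]
total_column_store = sum(prices)           # reads only the needed column

assert total_row_store == total_column_store
```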

The debate took a fresh twist recently when former SAP chief executive Hasso Plattner presented a paper (PDF) calling for the use of in-memory column-based databases for both analytical and transaction processing.

As interesting as that is in theory, of more immediate interest is the fact that Oracle – so often the target of column-based database vendors – has introduced a hybrid column-oriented storage option with the release of Oracle Database 11g Release 2.

As Curt Monash recently noted, there are a couple of approaches emerging to hybrid row/column stores.

Oracle’s approach, as revealed in a white paper (PDF), has been to add new hybrid columnar compression capabilities to its Exadata Storage servers.

This approach maintains row-based storage in the Oracle Database itself while using column-based storage to improve compression rates in Exadata, with Oracle claiming compression ratios of up to 10x with no loss of query performance, and up to 40x for historical data.

As Oracle’s Kevin Closson explains in a blog post: “The technology, available only with Exadata storage, is called Hybrid Columnar Compression. The word hybrid is important. Rows are still used. They are stored in an object called a Compression Unit. Compression Units can span multiple blocks. Like values are stored in the compression unit with metadata that maps back to the rows.”
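
Based purely on that description – rows batched into compression units, like values stored together, metadata mapping back to rows – here is a toy Python sketch of the general idea (the unit size, data and encoding are all invented; Oracle’s actual on-disk format is proprietary):

```python
import ast
import zlib

# Toy sketch of the hybrid scheme Closson describes: batch rows into a
# "compression unit", pivot the batch so like values sit together, and
# compress each column run, keeping metadata that maps back to rows.
# Unit size and encoding are invented; Oracle's format is proprietary.

UNIT_SIZE = 4  # rows per compression unit (real units span disk blocks)

rows = [
    ("US", "open", 10.0), ("US", "open", 10.5),
    ("US", "closed", 9.8), ("EU", "open", 11.2),
    ("EU", "closed", 11.0), ("EU", "closed", 10.9),
]

def build_units(rows):
    units = []
    for start in range(0, len(rows), UNIT_SIZE):
        batch = rows[start:start + UNIT_SIZE]
        columns = list(zip(*batch))  # pivot rows -> columns
        units.append({
            "row_range": (start, start + len(batch) - 1),  # map back to rows
            "columns": [zlib.compress(repr(col).encode()) for col in columns],
        })
    return units

def read_unit(unit):
    columns = [ast.literal_eval(zlib.decompress(blob).decode())
               for blob in unit["columns"]]
    return list(zip(*columns))  # pivot columns -> rows

units = build_units(rows)
assert read_unit(units[0]) == rows[:4]
```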

Vertica took a different hybrid approach with the release of Vertica Database 3.5, which introduced FlexStore, a new version of its column-store engine, including the ability to group small numbers of columns or rows together to reduce input/output bottlenecks. Grouping can be done automatically based on data size (grouped rows can use up to 1MB) to improve query performance for whole rows, or specified based on the nature of the column data (for example, bid, ask and date columns for a financial application).

Likewise, the Ingres VectorWise project (previously mentioned here) will create a new storage engine for the Ingres Database, positioned as a platform for data-warehouse and analytic workloads, making use of vectorized execution, which sees multiple instructions processed simultaneously. The VectorWise architecture uses Partition Attributes Across (PAX), which similarly groups multiple rows into blocks to improve processing, while storing the data within each block in columns.
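
Both FlexStore’s grouping and PAX turn on the same trade-off: columns that are queried together should be stored together. Here is a toy Python sketch of that grouping decision (the column names and access log are invented, and neither vendor has published its algorithm):

```python
from collections import Counter
from itertools import combinations

# Toy version of the grouping trade-off: co-locate the columns that are
# most often queried together so such queries do one read, not several.
# The access log and column names are invented for illustration.

access_log = [            # each entry: the columns one query touched
    {"bid", "ask", "date"},
    {"bid", "ask", "date"},
    {"symbol"},
    {"bid", "ask"},
    {"symbol", "date"},
]

pair_counts = Counter()
for cols in access_log:
    for pair in combinations(sorted(cols), 2):
        pair_counts[pair] += 1

# Greedily pick the most frequently co-accessed pair to store together.
best_pair, count = pair_counts.most_common(1)[0]
print(f"store {best_pair} together (co-accessed {count} times)")
```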

Update – Daniel Abadi has provided an overview of the different approaches to hybrid row-column architectures and suggests something I had suspected: that Oracle is also using the PAX approach, except outside the core database, while Vertica is using what he calls a fine-grained hybrid approach. He also speculates that Microsoft may end up going a third route, fractured mirrors.

Perhaps the future of the database is neither row- nor column-based, but plaid.