Yet another "Microsoft acquires Powerset" blog

Powerset and MSFT have finally announced the rumored acquisition: MSFT has acquired Powerset for close to $100 million. Powerset is a semantic search engine that builds on linguistic technology licensed from Xerox PARC. I tried Powerset while it was in beta and again when it was (almost abruptly) released, with search limited to Wikipedia and Freebase.

The use of semantic technologies for horizontal search engines looks highly improbable. The web is too diverse to be nailed by a single hammer. A related article on making search relevance more meaningful is in this earlier post.

MSFT plans to integrate the Powerset team with its search relevance team and explore advanced search capabilities while taking the Powerset technology beyond Wikipedia. MSFT has been making noise about enterprise and local search, both of which are vertical in nature. This acquisition can add value by making MSFT’s vertical search offerings smarter, and it will probably scare the Mountain View behemoth. Whether MSFT takes Powerset’s saplings and nurtures them into its Redmond forests, only time will tell.

On a related note, I find results from Hakia more useful. Try a sample search for “is EPS the right measure for stock performance” on both Powerset and Hakia and you will see the difference. Hakia gives results much more relevant to EPS: its very first page shows results about using EPS to measure stock performance. Powerset returned results that had nothing to do with EPS in the sense of earnings per share. They were purely keyword matches, mostly for Extended Play (EP) vinyl records.

In general, most semantic searches seem to work better on Hakia than on Powerset. I am surprised that Hakia was not approached by suitors if the motivation was to ramp up the semantic abilities of search engines. Or maybe it was!

UIMA and the Semantic Search

IBM’s Unstructured Information Management Architecture (UIMA) was released to the open source community in early 2006, when the entire source code was made available on SourceForge. After spending more than a year on SourceForge, UIMA is now part of the Apache Incubator.

UIMA is pitched to become the first and only open standard for unstructured information management. In short, UIMA is a framework for building analytics solutions for the new world of structured-unstructured information sharing. Other frameworks like CALAIS are narrowly focussed on Semantic Web technologies rather than providing a framework for building rich text analytics applications.

UIMA allows developers to build applications around analysis technologies and chain the processing steps through its framework. Each component in the framework is an annotator. Consider, for example, an application that identifies person names in a text document. The algorithm can be implemented as an annotator that reads and writes the Common Analysis Structure (CAS), for instance through the JCas interface if you are using Apache UIMA. A CAS is a general representation schema that can store arbitrary data structures for the analysis of documents. Using the CAS, the span of an annotation can be represented easily, typically as begin and end offsets into the document text. The data can be passed through several Analysis Engines (AEs) as long as each of them complies with its descriptor. Details on using UIMA and how to build Aggregate Analysis Engines are available here.
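To make the annotator-and-CAS idea concrete, here is a minimal sketch in plain Python. It is not the UIMA API: `SimpleCas`, `Annotation`, `token_annotator`, `person_name_annotator` and `run_pipeline` are hypothetical names invented for this illustration, and the "title followed by capitalized words" rule is only a toy stand-in for a real person-name algorithm. What it tries to show is the shape of the thing: a shared structure holding the text plus typed spans, and annotators chained so that a later one builds on what an earlier one wrote.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Annotation:
    type: str   # annotation type, e.g. "Token" or "PersonName"
    begin: int  # start offset into the document text (inclusive)
    end: int    # end offset into the document text (exclusive)


@dataclass
class SimpleCas:
    """Hypothetical stand-in for a CAS: document text plus typed spans."""
    text: str
    annotations: List[Annotation] = field(default_factory=list)

    def covered_text(self, ann: Annotation) -> str:
        return self.text[ann.begin:ann.end]

    def select(self, type_name: str) -> List[Annotation]:
        return [a for a in self.annotations if a.type == type_name]


# In this sketch an annotator is just a function that adds spans to the CAS.
Annotator = Callable[[SimpleCas], None]

TITLES = {"Dr", "Mr", "Mrs", "Ms"}  # assumption: toy list of honorifics


def token_annotator(cas: SimpleCas) -> None:
    """Mark every run of letters as a Token."""
    for m in re.finditer(r"[A-Za-z]+", cas.text):
        cas.annotations.append(Annotation("Token", m.start(), m.end()))


def person_name_annotator(cas: SimpleCas) -> None:
    """Toy rule: an honorific, then '. ', then capitalized tokens."""
    tokens = cas.select("Token")  # relies on token_annotator having run
    i = 0
    while i < len(tokens):
        title = tokens[i]
        if cas.covered_text(title) not in TITLES:
            i += 1
            continue
        end = None
        prev_end, expected_gap = title.end, ". "
        j = i + 1
        while j < len(tokens):
            tok = tokens[j]
            gap = cas.text[prev_end:tok.begin]
            if gap != expected_gap or not cas.covered_text(tok)[0].isupper():
                break
            end, prev_end, expected_gap = tok.end, tok.end, " "
            j += 1
        if end is not None:
            cas.annotations.append(Annotation("PersonName", title.begin, end))
        i = j


def run_pipeline(text: str, annotators: List[Annotator]) -> SimpleCas:
    """Pass one CAS through the annotators in order, like a chained engine."""
    cas = SimpleCas(text)
    for annotate in annotators:
        annotate(cas)
    return cas


cas = run_pipeline(
    "Dr. Ada Lovelace met Mr. Charles Babbage in London.",
    [token_annotator, person_name_annotator],
)
names = [cas.covered_text(a) for a in cas.select("PersonName")]
print(names)
```

Run as is, this should list the two names with their titles. In real UIMA the wiring is declared in descriptors and the type system rather than in a Python list, but the division of labour is the same: annotators never talk to each other directly, only through the spans they leave in the shared structure.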

One of the most exciting engagements will be between UIMA and Semantic Search. Semantic Search is the next generation of search technology: it uses metadata (read: information) created through advanced text analytics to enable ‘contextual’ search. The underlying technologies from NLP, machine learning and statistics have existed for decades and have been explored in fine detail by the research community. With the increasing adoption of enabling frameworks like UIMA, it is now easier to develop scalable solutions using advanced research tools.

Some useful links to learn how UIMA can be used for building advanced text analytics solutions:
1. Background information on UIMA
2. UIMA and Semantic Search

Undercover information: when at IBM, I was part of the gang that developed ProAct, a UIMA-based customer satisfaction analysis technology.