Prediction Markets are here…

Well, they always were. We are only beginning to notice their presence.

Dus Ka Dum (Power of 10) is a reality show hosted by the popular Indian actor Salman Khan. The show progresses by asking the player questions whose answers are percentages. A possible question is “What percentage of votes will the Congress Government win if India has a general election today?”. Answering five questions correctly brings Rs. 10 crores ($2.5M). An answer counts as correct if it falls within a particular range of the true value. The first question allows a 40% window around the correct answer and wins you $250, the second allows a 30% window and wins $2,500, the third allows 20% and returns $25,000, and the fourth returns $250,000 with a 10% window. The bull’s eye is the fifth answer, with a $2.5M cash prize.
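The scoring rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a "40% window" is read as 20 percentage points on either side of the correct answer, the player keeps the prize of the last round cleared, and the fifth round is left out because its window is not given above. The function names are made up for this sketch.

```python
# Rounds as described above: (window in percentage points, prize in USD).
# The fifth "bull's eye" round is omitted because its window is not stated.
ROUNDS = [(40, 250), (30, 2_500), (20, 25_000), (10, 250_000)]


def is_correct(guess, actual, window):
    """Return True if guess lies inside a window centred on actual.

    Assumption: the window is the total width, so half of it lies on
    each side of the correct answer, clipped to the 0-100 range.
    """
    half = window / 2
    low = max(0.0, actual - half)
    high = min(100.0, actual + half)
    return low <= guess <= high


def winnings(guesses, actuals):
    """Prize of the last round cleared before the first miss (assumed rule)."""
    prize = 0
    for (window, amount), guess, actual in zip(ROUNDS, guesses, actuals):
        if not is_correct(guess, actual, window):
            break
        prize = amount
    return prize
```

For instance, a guess of 45 against a true value of 30 clears the first round under this reading, because the accepted range is 10 to 50.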

Tata Nano features in the Top 10 Tech Cars

IEEE Spectrum, in its April 2008 issue, featured the ‘Top 10 Tech Cars’. The list included the recent eye candy of India’s common man: the Tata Nano. With Tata Motors planning to go into full production later in 2008, this small-car revolution is obviously creating excitement.

There are concerns that the Nano's low price, by making cars affordable to many more people, will create an environmental hazard. I wonder how much of a concern that really is, because the low cost would also make it easy to replace the over-polluting auto-rickshaw and would offer much better comfort for local city travel. It is interesting to note that a petrol-driven three-wheeler costs around Rs. 90,000, which is only Rs. 10-15K less than the Nano. With double the top speed of an auto-rickshaw, the Nano would also mean faster travel and less traffic congestion, the higher speed probably more than compensating for the extra space it would take on the roads.

As an optimist, I look forward to seeing the Nano hit the roads and replace the more polluting vehicles on Indian roads. In the process, it would also create a better life for the common man.

Personalized Search and Disambiguation – The answers to search engine relevance

There is a flurry of announcements around horizontal and vertical search engines offering ‘better’ and ‘more relevant’ content retrieval. SEOmoz.org has an interesting article on the ability (or inability) of search engines to return relevant results. In this entire discussion of retrieving more relevant results, there are two seemingly related problems: ‘personalized search’ and ‘disambiguation (of search terms)’.

Personalized search is the ability to combine the search engine's results with the user’s profile in order to return the results most relevant to that user. The user profile can be built from the following information:

  • User browsing history (searches, click-throughs, time spent on a site, etc.)
  • Desktop data (files, content, most-accessed files)
  • User feedback (asking users to rank order the results based on their preferences)
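As a rough illustration of how such a profile could be combined with search results, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `Result` class, the per-source weights, the `alpha` blending factor, and the URLs are hypothetical and are not taken from any real search engine.

```python
from dataclasses import dataclass, field


@dataclass
class Result:
    url: str
    base_score: float  # relevance score assigned by the search engine
    topics: set = field(default_factory=set)


def build_profile(browsing_topics, desktop_topics, feedback_topics):
    """Merge the three signal sources into per-topic weights.

    The weights (1.0, 0.5, 2.0) are arbitrary illustrative choices.
    """
    weights = {}
    sources = ((browsing_topics, 1.0), (desktop_topics, 0.5), (feedback_topics, 2.0))
    for topics, weight in sources:
        for topic in topics:
            weights[topic] = weights.get(topic, 0.0) + weight
    return weights


def personalize(results, profile, alpha=0.1):
    """Re-rank results by base score plus a boost from the user profile."""
    def score(result):
        boost = sum(profile.get(topic, 0.0) for topic in result.topics)
        return result.base_score + alpha * boost
    return sorted(results, key=score, reverse=True)


# Hypothetical results for the ambiguous query "jaguar".
results = [
    Result("cars.example/jaguar", 0.9, {"cars"}),
    Result("zoo.example/jaguar", 0.8, {"wildlife"}),
]
profile = build_profile(["wildlife", "wildlife"], ["wildlife"], [])
ranked = personalize(results, profile)
```

With this profile the wildlife page moves ahead of the car page, which also shows how personalization can help disambiguate a search term.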


UIMA and the Semantic Search

IBM’s Unstructured Information Management Architecture (UIMA) was released to the open source community in early 2006, when the entire source code was made available on SourceForge. After spending more than a year at SourceForge, UIMA is now part of the Apache Incubator.

UIMA is pitched to become the first and only open standard for unstructured information management. In short, UIMA is a framework for building analytics solutions for the new world of structured-unstructured information sharing. Other frameworks like CALAIS are narrowly focussed on Semantic Web technologies rather than providing a framework for building rich text analytics applications.

UIMA allows developers to build applications around analysis technologies and chain the processing steps through its framework. Each component in the framework is an annotator. Consider, for example, an application that identifies person names in a text document. The algorithm can be implemented as an annotator that works against UIMA's interface to the Common Analysis Structure (CAS), which is JCas if you are using Apache UIMA. A CAS is a general representation schema and can store arbitrary data structures for the analysis of documents. Using the CAS, the span of an annotation can be represented easily. The data can be passed through several Analysis Engines (AEs) as long as each of them complies with its descriptor. Details on using UIMA and how to build Aggregate Analysis Engines are available here.
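To make the annotator idea concrete, here is a toy Python analogue. It is not the Apache UIMA API (which is Java and driven by XML descriptors). `ToyCas`, `Annotation`, and the two annotators are hypothetical names invented for this sketch, and the person-name rule is deliberately naive.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Annotation:
    type: str
    begin: int  # character offset where the span starts
    end: int    # character offset just past the end of the span


@dataclass
class ToyCas:
    """Shared container: the document text plus all annotations so far."""
    text: str
    annotations: list = field(default_factory=list)

    def covered_text(self, ann):
        return self.text[ann.begin:ann.end]

    def select(self, type_name):
        return [a for a in self.annotations if a.type == type_name]


def title_annotator(cas):
    """Mark honorifics such as 'Dr.' or 'Mr.' as Title spans."""
    for m in re.finditer(r"\b(?:Mr|Ms|Dr)\.", cas.text):
        cas.annotations.append(Annotation("Title", m.start(), m.end()))


def person_annotator(cas):
    """Mark a Title followed by a capitalised word as a Person span."""
    for title in cas.select("Title"):
        m = re.match(r"\s+[A-Z][a-z]+", cas.text[title.end:])
        if m:
            cas.annotations.append(Annotation("Person", title.begin, title.end + m.end()))


def run_pipeline(text, annotators):
    """Rough stand-in for an aggregate engine: run annotators in order."""
    cas = ToyCas(text)
    for annotate in annotators:
        annotate(cas)
    return cas


cas = run_pipeline("Dr. Rao met Mr. Khan in Delhi.", [title_annotator, person_annotator])
people = [cas.covered_text(a) for a in cas.select("Person")]
```

The point of the design is that the second annotator never re-scans the raw text for titles. It reads the spans the first annotator left in the shared structure, which is the kind of chaining the paragraph above describes.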

One of the most exciting engagements will be between UIMA and semantic search. Semantic search is the next generation of search technology, using metadata (read: information) created through advanced text analytics to enable ‘contextual’ search. The underlying technologies from NLP, machine learning, and statistics have existed for decades and have been explored in fine detail by the research community. With the increasing adoption of enabling frameworks like UIMA, it is now easier to develop scalable solutions using advanced research tools.

Some useful links to learn how UIMA can be used for building advanced text analytics solutions:
1. Background information on UIMA
2. UIMA and Semantic Search

Undercover information: When at IBM, I was part of the gang that developed ProAct, a UIMA-based customer satisfaction analysis technology.