Mobile Visual Search


Overview

Mobile visual search refers to a new class of applications in which a picture snapped with a handheld device becomes the search query. Like text-based search, this enables an almost limitless range of applications, from product reviews to comparison shopping, from tourist guides for landmarks and museums to navigation applications, from multimedia extensions of printed text to automatic translation of restaurant menus, and many more. Recently deployed systems, such as Google Goggles, Kooaba, or Amazon's SnapTell, upload the image to a server for matching against a visual database. Alas, these systems are much too slow for interactive use, and recognition results are mixed.

System Architecture

In our research, we have focused on compact feature descriptors for visual bag-of-words search. The response time of mobile visual search critically depends on how much information must be transferred over the wireless link. For applications where a database of visual words is stored on the phone, the number of database entries that can be held similarly depends on the compactness of the feature descriptors. Compressed descriptors also protect privacy, either for a query sent to an untrusted server or for a database that is shared with others: the smaller the query, the harder it is to infer anything beyond what is already stored in the database.
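To make the size trade-off concrete, here is a minimal sketch of bag-of-words quantization, in which each local descriptor is mapped to the ID of its nearest visual word so that the query becomes a short list of small integers instead of raw floating-point descriptors. The vocabulary size, descriptor dimension, and function name are illustrative assumptions, not parameters of our system.

    import numpy as np

    def quantize_to_visual_words(descriptors, vocabulary):
        """Map each local descriptor to the ID of its nearest visual word
        (vocabulary centroid), using brute-force squared-L2 search."""
        # |x - c|^2 = |x|^2 - 2 x.c + |c|^2, evaluated by broadcasting
        d2 = (
            (descriptors ** 2).sum(axis=1, keepdims=True)
            - 2.0 * descriptors @ vocabulary.T
            + (vocabulary ** 2).sum(axis=1)
        )
        return d2.argmin(axis=1)

    # Illustrative sizes: 500 features, 128-D descriptors, 1000-word vocabulary.
    rng = np.random.default_rng(0)
    descriptors = rng.random((500, 128)).astype(np.float32)
    vocabulary = rng.random((1000, 128)).astype(np.float32)

    word_ids = quantize_to_visual_words(descriptors, vocabulary)

    # The query payload is now ~500 small integers (plus feature locations)
    # instead of 500 x 128 floats -- the motivation for compact descriptors.
    print(word_ids.nbytes, "bytes of word IDs vs", descriptors.nbytes, "bytes raw")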

We initially investigated compression of state-of-the-art feature descriptors, such as SIFT and SURF, and then developed a new descriptor, CHoG, which outperforms SIFT in recognition accuracy while being 20X smaller. We also developed a highly efficient novel technique that encodes the locations of the features in the image, which are required for the geometric consistency check, by ordering the bag-of-words. For even smaller image queries, we can simply use tree histograms, which work well for smaller databases. To speed up queries against very large databases, we compress the inverted index, which must be kept in memory at the server for fast search; our technique reduces the number of servers required by 6-8X.
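As one illustration of why compressing the inverted index shrinks the memory footprint, the sketch below stores each visual word's posting list as gaps between sorted image IDs, coded with a variable-byte scheme. This is a generic textbook technique, not our exact method, and all names and numbers are assumptions.

    def vbyte_encode(numbers):
        """Variable-byte encode non-negative integers: 7 payload bits per
        byte, least-significant group first, high bit set on the final byte."""
        out = bytearray()
        for n in numbers:
            while n >= 128:
                out.append(n & 0x7F)
                n >>= 7
            out.append(n | 0x80)  # final byte carries the stop bit
        return bytes(out)

    def compress_posting_list(image_ids):
        """A posting list is a sorted list of image IDs; the gaps between
        consecutive IDs are small, so they compress well with short codes."""
        ids = sorted(image_ids)
        gaps = [ids[0]] + [b - a for a, b in zip(ids, ids[1:])]
        return vbyte_encode(gaps)

    # Illustrative: database images in which one visual word appears.
    postings = [3, 17, 18, 250, 100000]
    blob = compress_posting_list(postings)
    print(len(blob), "bytes instead of", len(postings) * 4, "bytes of raw 32-bit IDs")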

Demonstration

We have implemented a real-time recognition system for a database of over 1M media covers (CDs, DVDs, books) as well as packaged products. We are currently integrating our fast recognition engine with a personalized recommendation system based on the PrPl platform for an enriched shopping experience. Using the same system, we have also explored the recognition of videos from a single camera-phone snapshot, which requires an even larger database. For wireless uplink speeds below a few hundred kbps, it is faster to extract CHoG features on the handheld and send them to the server; for faster links, one should instead send a JPEG image to the server. As handheld processors become more powerful, this crossover rate increases proportionally. With our implementation, feature extraction can be performed on a typical smartphone in 1 second. With a recognition rate of over 95%, our system is not only the fastest but also the most accurate reported in the literature.
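A back-of-envelope model of that crossover, assuming illustrative payload sizes and timings rather than measured values, is sketched below. Total latency is on-device extraction plus upload plus server-side matching; features win on slow links because they are small, while JPEG wins on fast links because it skips the on-device extraction.

    def response_time(payload_bits, uplink_kbps, extraction_s=0.0, server_s=0.3):
        """Latency model: on-device processing + upload + server matching."""
        return extraction_s + payload_bits / (uplink_kbps * 1000.0) + server_s

    # Assumptions: a 50 kB JPEG query vs. ~4 kB of compressed CHoG-style
    # features that take 1 s to extract on the handset.
    JPEG_BITS = 50_000 * 8
    FEATURE_BITS = 4_000 * 8
    EXTRACT_S = 1.0

    for kbps in (50, 100, 200, 400, 800):
        t_jpeg = response_time(JPEG_BITS, kbps)
        t_feat = response_time(FEATURE_BITS, kbps, extraction_s=EXTRACT_S)
        better = "features" if t_feat < t_jpeg else "JPEG"
        print(f"{kbps:>4} kbps: JPEG {t_jpeg:5.2f} s, features {t_feat:5.2f} s -> {better}")

With these assumed numbers, the crossover falls at a few hundred kbps, consistent with the behavior described above.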

Publications