The Science of Ranking Search Results

When a user visits a website and types in his search keywords, he expects to find relevant information immediately. We take such easy access to information for granted. Behind the scenes, a large amount of work goes into retrieving the most relevant pages and listing them in order of relevance. This is called ranking of search results.

There is a huge economic incentive for good ranking. For a web search or social networking company, it means advertising revenue: the user stays on the website, becomes interested in related ads, and clicks on them. For an E-commerce company, it means a successful sale: if the user finds the product he is looking for, he will most likely buy it. In addition, for all these companies it is a user experience issue. When the user feels happy after finding the information he needs, he is more likely to come back, more likely to explore the website, and more likely to buy something.

Given the importance of ranking search results, a large amount of research and engineering effort has poured into this field. The problem fits nicely within a field of computer science called Information Retrieval (IR). As an organized research community, the field dates back at least to 1978, when the first SIGIR (Special Interest Group on Information Retrieval) conference was held. In this first SIGIR conference[1], all the important concepts were already there: retrieval from databases and file systems, document retrieval, and relevance feedback from the user. But the boom in IR only started after Internet search engines (in 1993[2]) and E-commerce (in 1994[3]) came into existence. Big commercial interest and fast-growing online data fuel the development of this field.

The interface between a user and the website is a search box, where the user enters some keywords. The system has to respond with results immediately. How do we present the most relevant information based on a few keywords? The premise is that simple string matching will match thousands of pages (documents). After all, there are a limited number of English words, but there are millions of products and billions of web pages. It is unavoidable that a simple keyword match will give us thousands of equally plausible results. But there is only limited space to display results on a web page, and users are not willing to go through page after page to find the result they want (unless they desperately need the answer). Therefore ranking of search results becomes crucial.

Ranking of web pages is almost a solved problem, thanks to Google’s ingenious PageRank[4] algorithm and its continuous refinement. It relies on a social phenomenon: an important page tends to have more pages linking to it. Further improvement of this algorithm is mostly done inside Google or a few other big search engine companies.
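
The link-counting idea can be sketched in a few lines. This is only a toy version of the commonly published power-iteration formulation, not Google's production algorithm; the function name, the damping factor of 0.85, and the four-page link graph are assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy PageRank. `links` maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page keeps a small baseline score (the "random jump" term).
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its score evenly to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical link graph: "home" is linked from every other page.
toy_links = {
    "home": ["about", "product"],
    "about": ["home"],
    "product": ["home"],
    "blog": ["home", "product"],
}
ranks = pagerank(toy_links)
for page, score in sorted(ranks.items(), key=lambda item: item[1], reverse=True):
    print(page, round(score, 3))
```

In this toy graph the page with the most incoming links ("home") ends up with the highest score, and the page nobody links to ("blog") ends up with the lowest.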

Ranking of product pages is still an ongoing task faced by most E-commerce companies. There is no inherent linking among these pages, as they are all generated internally, so PageRank cannot be used here. But there is other information we can exploit:

  1. Syntactic correlation between the query and the pages retrieved
  2. Searches by other users for similar products, and their clicks after the ranked results were presented (confirmation from those users)
  3. The behavior of the current user (such as pages visited and visit sequence)

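Item 1 is measured by TF-IDF (term frequency-inverse document frequency) and cosine similarity. TF-IDF weights a term by how often it appears in a page and discounts terms that appear in many pages; cosine similarity then compares the weighted query vector with each weighted page vector.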
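
A minimal sketch of Item 1 follows, assuming whitespace tokenization and a smoothed IDF formula (one of several common variants). The product titles, the query, and all function names are made up for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Simplifying assumption: lowercase and split on whitespace.
    return text.lower().split()


def build_idf(documents):
    """Smoothed inverse document frequency for every term in the collection."""
    n = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + df)) + 1.0 for term, df in doc_freq.items()}


def tfidf_vector(text, idf):
    """Sparse TF-IDF vector; terms unseen in the collection are ignored."""
    counts = Counter(tokenize(text))
    return {term: count * idf[term] for term, count in counts.items() if term in idf}


def cosine(a, b):
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, documents):
    """Return (score, document) pairs, most similar first."""
    idf = build_idf(documents)
    query_vec = tfidf_vector(query, idf)
    scored = [(cosine(query_vec, tfidf_vector(doc, idf)), doc) for doc in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical product titles.
products = [
    "red running shoes for men",
    "blue running shorts",
    "leather office shoes",
    "stainless steel water bottle",
]
results = rank_by_similarity("red running shoes", products)
for score, title in results:
    print(round(score, 3), title)
```

The title that shares all three query terms comes first, and the title that shares none gets a score of zero.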

Item 2 is measured by the correlation between a query and its most frequently clicked pages. This correlation can be built by mining user search logs.
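
One simple way to approximate Item 2 is to count, for each query, how often each page was clicked, and to reorder candidates by their share of clicks. The log format (pairs of query and clicked page), the page identifiers, and the function names below are assumptions; a real search log would need cleaning and far more data.

```python
from collections import Counter, defaultdict


def build_click_model(log):
    """Count clicks per (query, page) from an iterable of (query, clicked_page) pairs."""
    clicks = defaultdict(Counter)
    for query, page in log:
        clicks[query.strip().lower()][page] += 1
    return clicks


def rerank_by_clicks(query, candidates, clicks):
    """Order candidate pages by the share of past clicks they received for this query."""
    counts = clicks.get(query.strip().lower(), Counter())
    total = sum(counts.values())

    def click_share(page):
        return counts[page] / total if total else 0.0

    # sorted() is stable, so pages with equal share keep their original order.
    return sorted(candidates, key=click_share, reverse=True)


# Hypothetical search log: other users mostly clicked page "p2" for this query.
search_log = [
    ("running shoes", "p2"),
    ("running shoes", "p2"),
    ("Running Shoes", "p2"),
    ("running shoes", "p1"),
    ("water bottle", "p3"),
]
model = build_click_model(search_log)
reranked = rerank_by_clicks("running shoes", ["p1", "p2", "p3"], model)
print(reranked)
```

For a query that never appears in the log, every candidate has a share of zero and the original order is preserved, so this signal only adds information where past users have confirmed a result.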

Item 3 is measured by the correlation between the pages the user has visited and the retrieved product pages. Potentially, a classifier can be built that maps the visited pages and the visiting sequence to a category (of query type); that category can then be applied directly to rank the resulting product pages. The classifier can be trained on the logs of all users, using click streams that led to a final product page click or a product purchase. Since we are building a classifier, other information about the user can be added to the model: gender, age group, geographical region, income level and so on. Such user-specific information is not easy to get and may not add much predictive power.
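
As a rough illustration of such a classifier, here is a small naive Bayes model over visited pages. Naive Bayes is only one possible choice and is not prescribed above; it also ignores the visiting sequence. The class name, page names, and categories are hypothetical.

```python
import math
from collections import Counter, defaultdict


class ClickstreamClassifier:
    """Naive Bayes over visited pages, with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.category_counts = Counter()
        self.page_counts = defaultdict(Counter)
        self.vocabulary = set()

    def train(self, sessions):
        # Each session is (visited_pages, category of the product finally clicked or bought).
        for visited_pages, category in sessions:
            self.category_counts[category] += 1
            for page in visited_pages:
                self.page_counts[category][page] += 1
                self.vocabulary.add(page)

    def predict(self, visited_pages):
        total_sessions = sum(self.category_counts.values())
        vocab_size = len(self.vocabulary)
        best_category, best_score = None, float("-inf")
        for category, count in self.category_counts.items():
            # Log prior plus smoothed log likelihood of each visited page.
            score = math.log(count / total_sessions)
            denominator = sum(self.page_counts[category].values()) + vocab_size
            for page in visited_pages:
                score += math.log((self.page_counts[category][page] + 1) / denominator)
            if score > best_score:
                best_category, best_score = category, score
        return best_category


# Hypothetical training sessions mined from user logs.
training_sessions = [
    (["home", "shoes-guide", "running-tips"], "footwear"),
    (["home", "running-tips", "sale"], "footwear"),
    (["home", "laptop-reviews", "tech-news"], "electronics"),
    (["tech-news", "laptop-reviews", "sale"], "electronics"),
]
classifier = ClickstreamClassifier()
classifier.train(training_sessions)
predicted = classifier.predict(["running-tips", "shoes-guide"])
print(predicted)
```

The predicted category could then be used to boost product pages in that category for the current user.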

The third type of ranking is advertisement display. Assuming you give the user correct (optimally ranked) results on your page, how would you display advertisements next to these results? This is similar to ranking product pages, in the sense that there is no inherent PageRank among advertisement items. Therefore the methods applied to product page ranking can also apply to advertisement ranking: compute the similarity between the query and the ad page, gather historical data on confirmed results, and map user clicks to ad clicks. Advertisements bring some new dimensions, such as the auction bid (price) the advertiser paid for these keywords (or simply a stated preference for these keywords) and the clickthrough on these ads. The goal of ad display is not the highest relevance but the highest total revenue from all the ads. In general, this assumes advertisers have already taken care of the relevance issue.
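
To make the revenue goal concrete, one common simplification is to score each ad by its bid multiplied by its estimated clickthrough rate, which is the expected revenue per impression under pay-per-click pricing. The sketch below assumes that pricing model; the `Ad` fields, the smoothing prior, and the numbers are made up.

```python
from dataclasses import dataclass


@dataclass
class Ad:
    name: str
    bid_per_click: float  # what the advertiser pays for one click
    clicks: int           # historical clicks
    impressions: int      # historical number of times the ad was shown


def estimated_ctr(ad, prior_clicks=1.0, prior_impressions=100.0):
    # Smoothed clickthrough rate, so ads with no history get a modest default.
    return (ad.clicks + prior_clicks) / (ad.impressions + prior_impressions)


def expected_revenue(ad):
    # Expected revenue per impression = bid per click * probability of a click.
    return ad.bid_per_click * estimated_ctr(ad)


def select_ads(ads, slots):
    """Pick the ads with the highest expected revenue for the available slots."""
    return sorted(ads, key=expected_revenue, reverse=True)[:slots]


# Hypothetical ads competing for two slots next to the search results.
candidate_ads = [
    Ad("high-bid", bid_per_click=2.00, clicks=10, impressions=1000),
    Ad("popular", bid_per_click=0.50, clicks=90, impressions=900),
    Ad("new", bid_per_click=1.00, clicks=0, impressions=0),
]
chosen = select_ads(candidate_ads, slots=2)
print([ad.name for ad in chosen])
```

Here the ad with the highest bid does not take the first slot, because the cheaper but more frequently clicked ad is expected to earn more per impression.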

Sometimes we have to match ads with non-search pages, for example a news page the user clicked on, an email page, or a social networking page. This is called “contextual advertising”, where there are no explicit keywords for the advertiser to bid on. The content provider has to decide which ad to display.

Given the rapidly growing data on the Internet, the field of information retrieval will keep growing. It is an exciting time to be in this field and to create new enabling technology.


[1] http://www.informatik.uni-trier.de/~ley/db/conf/sigir/sigir78.html

[2] W3Catalog, the first full-scale web search engine, was released in 1993.

[3] Amazon was founded in 1994.

[4] See Wikipedia entry on PageRank: http://en.wikipedia.org/wiki/Page_rank
