Tuesday, May 5, 2009

Keyword Extraction as a contextual targeting tool

This post is actually about the online ads industry. In the following paragraphs I suggest a different approach to one of the most fundamental problems in contextual advertising. This approach was recently incorporated into ContextIn's targeting technology, and I hope to demonstrate its effect on targeting quality.

[A reader who is familiar with the online ads market can skip to the next paragraph]
The use of keywords is very common in the online ads market. In a nutshell, a list of representative keywords [1] is extracted from the web page, and an ad is displayed based on one or more of these keywords. A popular example is Google's AdSense service, which, together with its twin service, Google AdWords, provides a complete keyword-based advertising system. In this example, advertisers buy keywords from Google, coupling their ads with specific keywords. Publishers put Google's tag (a piece of HTML code) in their web pages. When Google's code is called from a specific page, Google analyzes the page, extracts representative keywords, and displays ads that are coupled with these keywords [2].
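As a rough illustration of the flow just described, here is a minimal sketch in Python. It is not how Google or any real ad system works: the frequency-based extractor, the stop-word list, and the `ADS_BY_KEYWORD` inventory are all assumptions invented for this example.

```python
import re
from collections import Counter

# Hypothetical stop-word list and ad inventory, invented for illustration.
STOP_WORDS = {"the", "a", "an", "to", "in", "of", "and", "is", "for", "on", "with"}
ADS_BY_KEYWORD = {
    "barcelona": "Cheap flights to Barcelona",
    "hotel": "Book a hotel room online",
}

def extract_keywords(text, top_n=3):
    """Return the top_n most frequent non-stop-word terms (a naive extractor)."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return [word for word, _ in counts.most_common(top_n)]

def match_ads(text):
    """Return the ads coupled with the extracted keywords, in keyword order."""
    return [ADS_BY_KEYWORD[k] for k in extract_keywords(text) if k in ADS_BY_KEYWORD]

page = ("Barcelona travel guide: where to stay in Barcelona, "
        "which hotel to book, and what to see in Barcelona.")
print(match_ads(page))
```

Note that the ad is chosen purely by looking up the extracted keywords; nothing else about the page is taken into account.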

The problem of how to extract keywords, and which features to use in order to choose the best keywords from a web page, has been thoroughly researched. Numerous papers have been written on this problem, and a discussion of the different solutions is beyond the scope of this post. It is safe to say that there are good solutions for extracting keywords from content.

However, people who make their living out of contextual advertising (such as myself) tend to claim that relying solely on keywords for targeting is not good enough. One problem is ads appearing in Negative Context, as demonstrated at MikeOnAds. Another problem is that the ad is determined only by the keywords, and so may be unrelated to the semantic context of the page. For example, a page containing the word 'Barcelona' will usually display ads about travel to Barcelona, regardless of whether the page content is about travel, the FC Barcelona soccer team, or the famous song by Freddie Mercury and Montserrat Caballé. In the latter two cases, the travel ads will be less profitable.

One alternative suggested by critics of the keywords approach is to use document classification and NLP analysis techniques. In this alternative, the web page is classified into a semantic category. This category encapsulates the subject or the main interest of the web page. The ad database is also classified using the same taxonomy, and thus ads can be matched to web pages. This approach can avoid negative context issues, as it analyzes the web page as a whole and can "understand" negative context. And of course, the match is derived from the semantic meaning of the page.

However, this approach has two main problems. The first is that the world is simply not ready. An ad network usually does not rely only on an exclusive ads database, but also uses online ad exchanges and ad feeds as sources for ads. These feeds or exchanges, even if they support choosing ads, can usually receive only keywords. The second problem is that the granularity of the classification is usually too coarse. I believe that despite the critique expressed above, we can all agree that keyword extraction techniques do work, and a good keyword can sometimes give better ad results than a general category.

The suggested method tries to take the best of both worlds. First, understand the semantic context of the page. Second, use this understanding to create a list of keywords which are strongly connected with the main topic of the content. The advantages of understanding the semantic context remain: negative context is no longer a problem, and the chosen keywords are based on the semantic meaning of the content. So far, this new approach seems to yield some good results.
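To make the two steps concrete, here is a minimal sketch in Python. It is not ContextIn's implementation, which this post does not describe: the `TOPIC_TERMS` vocabularies and the overlap-counting "classifier" are hypothetical stand-ins for a real taxonomy and a trained classifier.

```python
import re
from collections import Counter

# Hypothetical topic vocabularies; a real system would use a trained classifier
# and a much richer taxonomy.
TOPIC_TERMS = {
    "travel": {"flight", "flights", "hotel", "hotels", "beach", "tour", "barcelona"},
    "soccer": {"match", "goal", "goals", "league", "striker", "team", "barcelona"},
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def classify(tokens):
    """Step 1: pick the topic whose vocabulary occurs most often on the page."""
    scores = {
        topic: sum(token in terms for token in tokens)
        for topic, terms in TOPIC_TERMS.items()
    }
    return max(scores, key=scores.get)

def topical_keywords(text, top_n=3):
    """Step 2: keep only the keywords that belong to the detected topic."""
    tokens = tokenize(text)
    topic = classify(tokens)
    counts = Counter(t for t in tokens if t in TOPIC_TERMS[topic])
    return topic, [word for word, _ in counts.most_common(top_n)]

page = ("Barcelona won the match with a late goal. The Barcelona striker "
        "scored as the team went top of the league.")
print(topical_keywords(page))
```

The output is still a plain keyword list, so it could be passed to a feed or exchange that accepts only keywords, but 'barcelona' now arrives together with soccer terms rather than on its own.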

Comments and suggestions are most welcome!
Yair

1. When the list includes phrases, it is sometimes referred to as "key phrases" or "key terms". In this post, the term "keywords" will be used to encompass all of these meanings.
2. This is an oversimplified description, and Google uses some secret sauce of its own to do the coupling, but for the purpose of demonstrating the technology this description is enough.