Tuesday, May 5, 2009

Keyword Extraction as a contextual targeting tool

This post is actually about the online ads industry. In the following paragraphs I suggest a different approach to one of the most fundamental problems in contextual advertising. This approach was recently incorporated into ContextIn's targeting technology, and I hope to demonstrate its effect on targeting quality.

[A reader who is familiar with the online ads market can skip to the next paragraph.]
The use of keywords is very common in the online ads market. In a nutshell, a list of representative keywords[1] is extracted from the web page, and an ad is displayed based on one or more of these keywords. A popular example is Google's AdSense service, which, together with its twin service, Google AdWords, provides a complete keyword-based advertising system. In this example, advertisers buy keywords from Google, coupling their ads with specific keywords. Publishers put Google's tag (a piece of HTML code) in their web pages. When Google's code is called from a specific page, Google analyzes the page, extracts representative keywords, and displays the ads that are coupled with these keywords[2].

The problem of how to extract keywords, and which features to use in order to choose the best keywords from a web page, has been thoroughly researched. Numerous papers have been written on it, and a discussion of the different solutions is beyond the scope of this post. It is safe to say that there are good solutions for extracting keywords from content.
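
To make the idea concrete, here is a minimal sketch of frequency-based extraction in Java. This is only an illustration with invented names; real extractors use far richer features (TF-IDF, term position, capitalization, anchor text, and so on):

import java.util.*;

public class NaiveKeywordExtractor {
    // Illustrative stop-word list; a real one would be much larger.
    private static final Set<String> STOP_WORDS = new HashSet<String>(
            Arrays.asList("the", "a", "an", "of", "to", "and", "in", "is"));

    // Returns the 'limit' most frequent non-stop-words in the text.
    public static List<String> extract(String text, int limit) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.length() > 2 && !STOP_WORDS.contains(token)) {
                Integer c = counts.get(token);
                counts.put(token, c == null ? 1 : c + 1);
            }
        }
        // Sort the terms by descending frequency.
        List<Map.Entry<String, Integer>> entries =
                new ArrayList<Map.Entry<String, Integer>>(counts.entrySet());
        Collections.sort(entries, new Comparator<Map.Entry<String, Integer>>() {
            public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
                return b.getValue() - a.getValue();
            }
        });
        List<String> keywords = new ArrayList<String>();
        for (Map.Entry<String, Integer> e : entries.subList(0, Math.min(limit, entries.size()))) {
            keywords.add(e.getKey());
        }
        return keywords;
    }
}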

However, people who make their living from contextual advertising (such as myself) tend to claim that relying solely on keywords for targeting is not good enough. One problem is ads appearing in negative context, as demonstrated at MikeOnAds. Another problem is that the ad is not related to the semantic context of the page and is determined only by the keywords. For example, a page containing the word 'Barcelona' will usually display ads about travel information to Barcelona, regardless of whether the page content is about travel, the FC Barcelona soccer team, or the famous song by Queen; in the latter cases, the travel ads will be far less profitable.

One alternative suggested by critics of the keywords approach is to use document classification and NLP analysis techniques. In this alternative, the web page is classified into a semantic category that encapsulates the subject, or main interest, of the page. The ad database is classified using the same taxonomy, and thus ads can be matched to web pages. This approach can avoid negative-context issues, since it analyzes the web page as a whole and can "understand" negative context. And of course, the match is derived from the semantic meaning of the page.

However, this approach has two main problems. The first is that the world is simply not ready. An ad network usually does not rely solely on an exclusive ads database, but uses online ad exchanges and ad feeds as sources for ads. These feeds and exchanges, even when they support a choice of ads, can usually receive only keywords. The second problem is that the granularity of the classification is usually too coarse. I believe that, despite the critique above, we can all agree that keyword extraction techniques do work, and a good keyword can sometimes give better ad results than a general category.

The suggested method tries to take the best of both worlds. First, understand the semantic context of the page. Second, use this understanding to create a list of keywords that are strongly connected to the main topic of the content. The advantages of understanding the semantic context remain: negative context is no longer a problem, and the chosen keywords are based on the semantic meaning of the content. So far, this new approach seems to yield good results.
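
As a sketch of how this could look (this is only an illustration, not ContextIn's actual implementation; the classifier interface and the per-category vocabularies are hypothetical):

import java.util.*;

public class ContextualKeywordSelector {
    // Hypothetical classifier interface; any document classifier would do.
    public interface Classifier {
        String classify(String pageText); // e.g. returns "sports" or "travel"
    }

    private final Classifier classifier;
    // Per-category sets of terms that are strongly connected to that topic.
    private final Map<String, Set<String>> categoryVocabulary;

    public ContextualKeywordSelector(Classifier classifier,
                                     Map<String, Set<String>> categoryVocabulary) {
        this.classifier = classifier;
        this.categoryVocabulary = categoryVocabulary;
    }

    // Keep only the extracted keywords that belong to the page's semantic
    // category, so 'Barcelona' on a soccer page never triggers travel ads.
    public List<String> selectKeywords(String pageText, List<String> extractedKeywords) {
        String category = classifier.classify(pageText);
        Set<String> vocabulary = categoryVocabulary.get(category);
        List<String> selected = new ArrayList<String>();
        for (String keyword : extractedKeywords) {
            if (vocabulary != null && vocabulary.contains(keyword)) {
                selected.add(keyword);
            }
        }
        return selected;
    }
}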

Comments and suggestions are most welcome!
Yair

[1] When the list includes phrases, it is sometimes referred to as "key phrases" or "key terms". In this post, the term "keywords" will be used to encompass all meanings.
[2] This is an oversimplified description, and there is some secret sauce by Google that does the coupling, but for demonstrating the technology this description is enough.

Friday, April 10, 2009

Java Singletons

This post doesn't contain any original research; instead, it describes a problem I encountered. I believe many programmers have dealt with this issue as well; however, the common solution is wrong.

Singleton is a pretty basic design pattern. Apparently, implementing it is not so basic. The trivial implementation in Java is rather straightforward:

public class Singleton {
    // DO NOT USE - not thread safe
    private static Singleton m_instance = null;

    // Making the constructor private enforces the singleton property.
    private Singleton() {
        // Do something
    }

    public static Singleton getInstance() {
        if (null == m_instance) {
            m_instance = new Singleton();
        }
        return m_instance;
    }
}
This is of course not thread safe: two threads can both see a null value, and two instances will then be created. Again, there is a trivial solution, just make getInstance synchronized:


public class Singleton {
    // Works, but not very efficient
    private static Singleton m_instance = null;

    // Making the constructor private enforces the singleton property.
    private Singleton() {
        // Do something
    }

    public static synchronized Singleton getInstance() {
        if (null == m_instance) {
            m_instance = new Singleton();
        }
        return m_instance;
    }
}

Well, this will work. However, the lock is acquired every time getInstance is called, which is bad for performance. The problem is exacerbated when many threads access the singleton object, so this is still not good enough. Luckily, there is a famous technique that addresses this issue, called the "double-checked lock". The synchronization is performed only after checking for null, and null is checked again inside the synchronized block itself, so the lock is not acquired most of the time.


public class Singleton {
    // DO NOT USE - Does not always work!
    private static Singleton m_instance = null;

    // Making the constructor private enforces the singleton property.
    private Singleton() {
        // Do something
    }

    public static Singleton getInstance() {
        if (null == m_instance) {
            createInstance();
        }
        return m_instance;
    }

    public static synchronized void createInstance() {
        // Double check - in case two threads are calling this function.
        if (null == m_instance) {
            m_instance = new Singleton();
        }
    }
}

Until recently, I wrongly believed that this is a good technique. However, it does not work! A thread can sometimes see a non-null value while the object itself is not yet fully constructed. This is when Google comes in handy, and the issue turns out to be well explained here. To sum it up, until JDK 5 there wasn't really a good solution for this issue in Java (it could be solved in languages that give the programmer the ability to explicitly create memory barriers, and a Java solution based on ThreadLocal is also suggested there). From JDK 5 on, there is a simple solution, cited from Jeremy Manson's blog: declare the instance variable volatile, which guarantees that a thread will never see a non-null reference to a partially constructed object. So here is the final solution, which works from JDK 5 and up:


public class Singleton {
    // Works from JDK 5 and up
    private static volatile Singleton m_instance = null;

    // Making the constructor private enforces the singleton property.
    private Singleton() {
        // Do something
    }

    public static Singleton getInstance() {
        if (null == m_instance) {
            createInstance();
        }
        return m_instance;
    }

    public static synchronized void createInstance() {
        // Double check - in case two threads are calling this function.
        if (null == m_instance) {
            m_instance = new Singleton();
        }
    }
}
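
As an aside, if lazy initialization is all you need, there is also a well-known lock-free alternative, the initialization-on-demand holder idiom, which leans on the JVM's guarantee that class initialization is thread safe and happens only on first use:

public class Singleton {
    private Singleton() {
        // Do something
    }

    // The holder class is loaded (and INSTANCE constructed) only on the
    // first call to getInstance(); class initialization is thread safe.
    private static class Holder {
        private static final Singleton INSTANCE = new Singleton();
    }

    public static Singleton getInstance() {
        return Holder.INSTANCE;
    }
}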

Until next time...

Thursday, April 2, 2009

He who could not be named

Until recently, I didn't know that Voldemort existed anywhere outside of the Harry Potter books. Well, looking for MySQL alternatives to boost performance can take you to weird places.

A few weeks ago, we encountered high user load on our production machines (which is essentially a good thing). We had to bring up more and more machines in order to support more and more requests per second. While bringing up more machines on the Amazon EC2 infrastructure (to be discussed in a future post) is easy, scaling up MySQL while keeping the data in sync became harder and harder. We figured that what we basically need is a big hash table that is very efficient and easily scalable. At that point, one of my colleagues, Yaron, pointed out this wonderful post by Richard Jones.

After reading the review, I decided to give project-voldemort a try. It is basically an open-source, Java-based, efficient and scalable distributed hash table. After 5 minutes of installation, and a few days of integrating it into the code, I think I'm in love. After much tuning and playing with MySQL parameters, the best I could squeeze out of it per request was still not fast enough. With Voldemort, with almost no tuning, the request handling time dropped to less than a tenth of what it took with MySQL, and I am sure it can be pushed further down with some parameter tuning. That means more users served with fewer servers, not to mention how easy it is to scale up in case of a load so high that even Voldemort can't handle it.
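
For reference, working against Voldemort from Java looks roughly like this (the bootstrap URL and store name below are placeholders, not our actual configuration):

import voldemort.client.ClientConfig;
import voldemort.client.SocketStoreClientFactory;
import voldemort.client.StoreClient;
import voldemort.client.StoreClientFactory;
import voldemort.versioning.Versioned;

public class VoldemortExample {
    public static void main(String[] args) {
        // Bootstrap URL and store name are placeholders.
        StoreClientFactory factory = new SocketStoreClientFactory(
                new ClientConfig().setBootstrapUrls("tcp://localhost:6666"));
        StoreClient<String, String> client = factory.getStoreClient("my_store");

        // put/get behave like one big hash table.
        client.put("some_key", "some_value");
        Versioned<String> value = client.get("some_key");
        System.out.println(value.getValue());
    }
}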

Admittedly, using this platform has its downsides. It's not a relational database, so it's much harder to debug your code, as there is no interface to the data itself. It also limits the operations that can be performed at the DB level: it's really nothing more and nothing less than a huge, efficient hash table, so some operations must be moved into the code. I also did not find a way to iterate over all of the data, and I'm not sure whether it's possible. However, I think the amazing performance boost compensates for these shortcomings.

I didn't check the alternatives, but I'll be happy to hear about them in the comments.