Last Friday I presented the paper «Simple but Effective Porn Query Recognition» by Fu et al. to the rest of the group. I chose this paper because it addresses an interesting line of research. Porn. Yeah, it is a paper with «porn» in its title, so it is necessary to print it, read it and explain it to others.
The authors claim that it is important for a search engine to filter porn out of what its users see, to avoid a bad influence on children (queries about war and violence, and press control, are apparently fine for growing up as a good human being, it seems). In China every query is associated with a formal ID of the user, but that is another story, and must be told on another occasion.
This paper shows a way to measure the semantic similarity between queries. For example, «gay movie» is porn and «bambi movie» is not, although they have a term in common. Another example is «Army comrade» and «gay image», because in Chinese the term «comrade» has different meanings. Which you can imagine.
So, the authors propose to use all the documents retrieved by the query: they build a term vector for each document and combine those into a composite vector for the query (the paper includes nice formulae). And this is how the semantic similarity measure works.
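To make the idea concrete, here is a minimal Python sketch of building a query vector from the documents a query retrieves. It is not the authors' formula: I am assuming plain term counts and a simple average over documents, and the names `term_vector` and `query_vector` and the sample documents are made up for illustration.

```python
from collections import Counter


def term_vector(document):
    """Count how often each lowercase, whitespace-separated term occurs."""
    return Counter(document.lower().split())


def query_vector(retrieved_documents):
    """Average the term vectors of all documents retrieved for one query.

    Assumption: plain averaging of raw counts; the paper's own weighting
    may well be different.
    """
    total = Counter()
    for document in retrieved_documents:
        total.update(term_vector(document))
    n = len(retrieved_documents)
    if n == 0:
        return {}
    return {term: count / n for term, count in total.items()}


# Hypothetical documents returned for the query «bambi movie».
docs = ["bambi is a disney movie", "bambi movie for children"]
vec = query_vector(docs)
```

With this in place, two queries that share a word but retrieve very different documents end up with very different vectors, which is the whole point.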
Once the semantic comparison is established, the authors use a k-NN algorithm to classify the queries into two classes: Porn or Not-Porn (uhm… interesting…). So, after training, the algorithm gives a new query the class of its k most similar queries, using the Euclidean distance between feature vectors.
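The classification step can be sketched in a few lines of Python. Again, this is only my reading of it: I am assuming a plain majority vote among the k nearest neighbours, sparse vectors stored as dicts, and a tiny invented training set; `euclidean` and `knn_classify` are hypothetical names, not something taken from the paper.

```python
import math
from collections import Counter


def euclidean(u, v):
    """Euclidean distance between two sparse vectors stored as dicts."""
    terms = set(u) | set(v)
    return math.sqrt(sum((u.get(t, 0.0) - v.get(t, 0.0)) ** 2 for t in terms))


def knn_classify(new_vector, labelled_vectors, k=3):
    """Return the majority label among the k nearest labelled vectors."""
    nearest = sorted(
        labelled_vectors,
        key=lambda item: euclidean(new_vector, item[0]),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Invented training data: (query vector, label) pairs.
training = [
    ({"adult": 1.0, "video": 0.5}, "porn"),
    ({"adult": 0.8, "image": 0.6}, "porn"),
    ({"disney": 1.0, "movie": 0.7}, "not-porn"),
    ({"deer": 0.9, "movie": 0.5}, "not-porn"),
    ({"children": 1.0, "movie": 0.4}, "not-porn"),
]
label = knn_classify({"movie": 0.6, "disney": 0.5}, training, k=3)
```

With two classes, an odd k avoids tied votes.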
Regarding the empirical studies, the authors use their own query repository (they work at a search engine company, Roboo Inc), so they have real data to analyze. It is interesting to note that they chose 2000 random queries for testing, and the porn queries outnumber the non-porn ones. So, as can be seen, I think it is not a good idea to hide such enormous and important information from their potential clients.
The authors also include some charts with data and numbers and stuff, and they address the problem that sometimes (just a little) the algorithm categorizes a non-porn query as porn (so P=NP!!!).
To wrap up this post, I have to say that it is an easy paper to read and, for a layman in text recognition like me, it teaches some basics of this line of research. It also includes funny examples. Like «pretty nurse». Yes. In China it is a porn query.
You can read the paper in Google Docs. Thanks to my colleague José Urquiza, who discovered the paper ;)