Eduardo Habkost raisama.net

Thu 01 Feb 2007
12h05min

Google is Smart

This surprised me:

[Screenshot: 'mandrake' highlighted when searching for 'mandriva' on Google search]

How does Google know that “mandriva” was formerly called “mandrake”, and that it should highlight “mandrake” in the results as if I had also searched for it?

I don’t think they added this to a “synonyms table” manually. I believe it was somehow detected automatically. My question is: how could Google’s software have detected this on its own?

Update (Feb 2nd 2007): I think I’ve found out how this was done. I’ve described my theory in the comments.

8 comments

By madtux on Thu 01 Feb 2007 15:10:11.

Interesting feature. I had never seen that before. How it is done is a mystery to me as well, but I can tell you: the Google folks never stop surprising me.

By Ricardo on Fri 02 Feb 2007 00:35:17.

I really really want to know how it works. According to Google[1], “…your query terms are bolded. If we expanded the range of your search using stemming technology, the variations of your search terms that we searched for will also be bolded”.

But, afaik, stemming[2] doesn’t work with related but dissimilar terms like “mandriva” and “mandrake”. Maybe it’s a mix of stemming and the synonym (“~”) operator[3], plus some of Google’s dark magic…

Note that you will not find “mandrake” if you search only for “mandriva”. I read that it does not work on single-word searches or within quotes. Why? How? Find out everything and tell me, please.

[1] http://www.google.com/help/interpret.html#J

[2] http://www.google.com/help/basics.html#stemming

[3] http://www.google.com/help/refinesearch.html#tilde

By Scott Lamb on Fri 02 Feb 2007 00:44:37.

Strange. It doesn’t do that for me at either http://www.google.com/ or http://www.google.com.br/. The latter looks a little different than in your screenshot, too. The top bar looks like this:

Pesquisar: (x) a web () páginas escritas em Português () páginas de Portugal

Where are you searching?

By Scott Lamb on Fri 02 Feb 2007 00:46:14.

Oops, that was http://www.google.com.pt/, which I tried for good measure. I meant to paste this:

Pesquisar: (x) a web (_) páginas em português (_) páginas do Brasil

By Scott Lamb on Fri 02 Feb 2007 00:51:09.

No, I take that back. I confused myself and searched for “rpm mandrake” and looked for Mandriva pages, but you did the opposite.

http://www.google.com.br/ does show me Mandrake results when searching for “rpm mandriva”, but http://www.google.com/ doesn’t (my preferences are set to English, yours apparently to Portuguese). Huh.

By AdamW on Fri 02 Feb 2007 05:39:05.

I’m guessing they do it manually. I’ve seen several other such correspondences, and I really can’t think of a way it could be done reliably without at least manual oversight of an automated system.

By Eduardo Habkost on Fri 02 Feb 2007 08:06:25.

After researching a bit, it looks like this is an instance of automatic stemming done by Google.

Ricardo said the terms are not similar, but I disagree: they have a common prefix, “mandr”. The common prefix probably triggers the same magic that performs stemming on verbs and other word variations. As described in that blog post, it is triggered on “difficult searches” (rpm mandriva is probably one of them), but not on searches that already have enough results (such as mandriva alone).

It seems that having the Google interface language set to Portuguese also helps, probably because there are few pages with the terms “rpm” and “mandriva” in Portuguese, which makes the automatic stemming more visible.

If it really is just automatic stemming at work, then I guess the relation is learned automatically, probably by noticing that similar words (with a common prefix, for example) often appear together on the same page.
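
To make the guess concrete, here is a minimal Python sketch of that kind of detection. It is only an illustration of the theory, not a description of what Google actually does: the function names, the sample pages, and the thresholds (`min_prefix`, `min_pages`) are all made up for the example.

```python
from collections import Counter
from itertools import combinations


def common_prefix_len(a: str, b: str) -> int:
    """Length of the leading substring shared by a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def related_pairs(pages, min_prefix=5, min_pages=2):
    """Return word pairs that share a prefix of at least min_prefix
    characters and appear together on at least min_pages pages.

    Both thresholds are arbitrary values chosen for this sketch.
    """
    counts = Counter()
    for text in pages:
        # Each distinct word counts once per page.
        words = sorted(set(text.lower().split()))
        for a, b in combinations(words, 2):
            if common_prefix_len(a, b) >= min_prefix:
                counts[(a, b)] += 1
    return {pair for pair, n in counts.items() if n >= min_pages}


# Hypothetical page texts, invented for illustration only.
pages = [
    "mandriva formerly mandrake linux rpm packages",
    "install rpm on mandrake or mandriva",
    "conectiva merged with mandrake",
]

print(related_pairs(pages))
```

With these sample pages, “mandrake” and “mandriva” are paired because they share “mandr” and show up together on two pages, while “conectiva” is never paired with either of them because it shares no prefix, even though it appears next to “mandrake”.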

By Ricardo on Fri 02 Feb 2007 10:56:20.

Yep, you’re right: ‘mandrake’ and ‘mandriva’ share the prefix ‘mandr’. I assumed that stemming only works with word variations (verbs, declension, etc.), but now I accept your theory: automatic stemming with similar words.

I don’t believe they have a manual table. Otherwise, why would ‘mandriva’ also search for ‘mandrake’ and not for ‘conectiva’? I challenge anyone to find a correspondence between dissimilar words by searching on Google (not counting acronyms, of course, like ‘NYC’ = ‘New York City’).
