This surprised me:

'mandrake' highlighted when searching for 'mandriva' on Google search

How does Google know that “mandriva” was formerly called “mandrake”, and that it should highlight “mandrake” in the results as if I had also searched for it?

I don’t think they added this to a “synonyms table” manually. I believe it was somehow detected automatically. My question is: how could the Google software have detected this automatically?

Update (Feb 2nd 2006): I think I’ve found how this was done. I’ve described my theory in the comments.

8 thoughts on "Google is Smart"

  1. Interesting feature. I had never seen that before. How it is done is a mystery to me as well, but I can tell you: the Google folks never stop surprising me.

  2. I really really want to know how it works. According to Google[1], “…your query terms are bolded. If we expanded the range of your search using stemming technology, the variations of your search terms that we searched for will also be bolded”.

    But, AFAIK, stemming[2] doesn’t work with dissimilar related terms like “mandriva” and “mandrake”. Maybe it’s a mix of stemming and the synonym (“~”) operator[3], plus Google’s dark magic…

    Note that you will not find “mandrake” if you search only for “mandriva”. I read that it does not work on single-word searches or within quotes. Why? How? Find out everything and tell me, please.




  3. I’m guessing they do it manually. I’ve seen several other such correspondences, and I really can’t think of a way it could be done reliably without at _least_ manual oversight of an automated system.

  4. After researching a bit, it looks like this is an instance of automatic stemming done by Google.

    Ricardo said the terms are not similar, but I disagree: they have a common prefix, “mandr”. The common prefix probably triggers the same magic that performs stemming on verbs and word variations. As described in that blog post, it is triggered on “difficult searches” (“rpm mandriva” is probably one of them), but not on searches that already have enough results (such as “mandriva” alone).

    It seems that having the Google interface language set to Portuguese also helps, probably because there are few pages in Portuguese with the terms “rpm” and “mandriva”, making the automatic stemming more visible.

    If it really is just automatic stemming at work, then I guess the related words are found automatically, probably by noticing that similar words (with a common prefix, for example) often appear together on the same page.

  5. Yep, you’re right: ‘mandrake’ and ‘mandriva’ share the prefix ‘mandr’. I had assumed that stemming only works with word variations (verbs, declensions, etc.), but now I accept your theory: automatic stemming with similar words.

    I don’t believe they have a manual table. Why would ‘mandriva’ also search for ‘mandrake’ and not for ‘conectiva’? I challenge anyone to find a correspondence between dissimilar words when searching on Google (excluding acronyms, of course, like ‘NYC’ = ‘New York City’).
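
For anyone who wants to play with the theory from comment 4 (similar words with a common prefix that often appear together on the same page), here is a toy sketch in Python. It is only a guess at the general idea, not how Google actually does it. The function names, the sample pages, and the thresholds (`min_prefix`, `min_pages`) are all made up for illustration.

```python
import re
from collections import Counter
from itertools import combinations


def common_prefix_len(a: str, b: str) -> int:
    """Length of the longest common prefix of two words."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def related_terms(pages, min_prefix=5, min_pages=2):
    """Find word pairs that share a prefix and co-occur on several pages.

    Returns {(word_a, word_b): number_of_pages}. Both thresholds are
    arbitrary values picked for this toy example.
    """
    counts = Counter()
    for text in pages:
        # Unique lowercase words on this page, sorted so each pair
        # always comes out in the same order.
        words = sorted(set(re.findall(r"[a-z]+", text.lower())))
        for a, b in combinations(words, 2):
            if common_prefix_len(a, b) >= min_prefix:
                counts[(a, b)] += 1
    return {pair: n for pair, n in counts.items() if n >= min_pages}


# Hypothetical "crawled pages", invented for this example.
pages = [
    "Mandriva Linux, formerly Mandrake Linux, uses rpm packages",
    "install rpm on mandrake or mandriva",
    "conectiva merged with mandrake to form mandriva",
    "rpm packages for fedora",
]

print(related_terms(pages))  # {('mandrake', 'mandriva'): 3}
```

With these sample pages, ‘mandrake’ and ‘mandriva’ get paired because they share ‘mandr’ and show up together on three pages, while ‘conectiva’ is never paired with ‘mandriva’ because the two have no common prefix. That would at least be consistent with the question in comment 5.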

