Indexing Asian languages

Published on 09 March 2006

An important feature of Flock is that it indexes your history. You can find any page you have visited just by searching for a single word that appears on that page. I find it very useful, but it does not work correctly for Japanese or Chinese.

To index a page, Flock needs to split the text into words. Additionally, different forms of the same verb, or the singular and plural of the same noun, should be treated as the same word. In Asian languages such as Chinese and Japanese, one big difficulty is that there are no spaces between words. Consequently, words cannot be identified without a dictionary. There is no other way.

Let’s take an example: Flockisagreatbrowser.

OK, suppose I know the words “is” and “a”. They are grammatical words, so I can at least have that much knowledge. So I get: “Flock is a greatbrowser”. Is “greatbrowser” a noun? Or maybe “great” is an adjective and “browser” a noun? Or “gre” an adjective and “atbrowser” a noun?

There is only one way to know it: use a dictionary.
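
To make the role of the dictionary concrete, here is a minimal sketch of dictionary-based segmentation using greedy longest-match. This is only an illustration: the tiny word list and the `segment` function are made up for this example, and real analysers such as Chasen and MeCab work very differently.

```python
def segment(text, dictionary):
    """Split text into words by greedy longest-match against a dictionary.

    Characters that start no known word are emitted as one-character tokens.
    """
    # Longest dictionary entry bounds how far ahead we need to look.
    max_len = max((len(word) for word in dictionary), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy dictionary, just enough for the example sentence.
TOY_DICTIONARY = {"flock", "is", "a", "great", "browser"}

print(segment("flockisagreatbrowser", TOY_DICTIONARY))
# ['flock', 'is', 'a', 'great', 'browser']
```

With only “is” and “a” in the dictionary, the same function falls back to single characters for everything else, which is exactly the ambiguity described above.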

Now, let’s talk about numbers. Chasen and its little brother MeCab, two famous grammatical analysers for Japanese, use a dictionary of 23 megabytes. Needless to say, we will never bundle that in the browser.

What can we do? I can think of two solutions. The first one is quite easy to implement but not very satisfying; the second one is trickier, but it would be great if we could do it.

  1. Consider each character as a separate word, and search the full text. 你好 is indexed as 你 好, and if I search for 你好, the system runs the phrase query “你 好”, just as you would search for “great browser”, including the quotes. A lot of projects use this approach rather than a real parser.
  2. Any operating system that provides Chinese or Japanese input ships with a dictionary for that input method. Let’s just use this dictionary. For Windows and Mac it should be feasible: just use the right API. One difficulty is that some computers may not have the input method installed (then we should fall back to 1). For Linux it would be very tricky because there are several input methods (for Japanese: Anthy, FreeWnn, Canna…).
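
The first solution can be sketched in a few lines. This is not Flock's indexer, just a rough illustration under simple assumptions: every non-space character is treated as one token, the index stores token positions per page, and the names `build_index` and `phrase_search` are invented for the example.

```python
from collections import defaultdict


def build_index(pages):
    """Build a positional index: character -> {page_id: [token positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for page_id, text in pages.items():
        # Assumption: each non-space character counts as one "word".
        tokens = [ch for ch in text if not ch.isspace()]
        for pos, ch in enumerate(tokens):
            index[ch][page_id].append(pos)
    return index


def phrase_search(index, query):
    """Return the pages where the query characters appear consecutively."""
    chars = [ch for ch in query if not ch.isspace()]
    if not chars:
        return set()
    hits = set()
    for page_id, starts in index.get(chars[0], {}).items():
        for start in starts:
            # Character k of the query must sit at position start + k.
            if all(start + k in index.get(ch, {}).get(page_id, ())
                   for k, ch in enumerate(chars)):
                hits.add(page_id)
                break
    return hits


# Made-up sample pages for the example.
pages = {"p1": "你好，世界", "p2": "好你", "p3": "你们好"}
index = build_index(pages)
print(phrase_search(index, "你好"))  # {'p1'}
```

Only the first page matches, because the other two contain both characters but not next to each other in that order. That is the difference between the phrase query “你 好” and a plain search for the two characters.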