Which method is used for language detection?


#1

When I filter Twitter for all Dutch tweets (language.tag == "nl"), I get some non-Dutch content. I understand that language detection will never be perfect and that many one- or two-word tweets are ambiguous. But I get a noticeable number of tweets in entirely different scripts, such as Asian, Cyrillic or Arabic. From eyeballing the preview, I’d say about 2-5%.

Is language detection based purely on the tweet’s content, or do you also use user metadata to boost confidence for certain languages? Also, would it be possible to introduce a confidence field on the language augmentation, so that it’s possible to use only tweets detected with high certainty?


#2

 

We currently use tri-grams for our language detection, built from a corpus of text in each supported language. This is explained in more detail on our Language Augmentation page.

Confidence is definitely an interesting idea. Currently we rate the body of text against our index of tri-grams to determine which language the interaction was written in. At present we do not use any metadata to boost confidence, but that is also a concept worth exploring.
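Just to give a feel for the approach (this is only an illustrative sketch, not our production detector; the profile size and the overlap-ratio "confidence" are example choices), a bare-bones tri-gram scorer in Python might look like this:

from collections import Counter

def trigrams(text):
    text = " " + text.lower() + " "  # pad so word boundaries produce tri-grams too
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def build_profile(corpus, size=300):
    # keep the most frequent tri-grams seen in a reference corpus for one language
    return {t for t, _ in trigrams(corpus).most_common(size)}

def detect(text, profiles):
    # score a tweet against every language profile; the overlap ratio
    # doubles as a very crude confidence value
    grams = set(trigrams(text))
    best_lang, best_ratio = None, 0.0
    for lang, profile in profiles.items():
        ratio = len(grams & profile) / max(len(grams), 1)
        if ratio > best_ratio:
            best_lang, best_ratio = lang, ratio
    return best_lang, best_ratio

# toy usage -- real profiles are built from large corpora, not single sentences
profiles = {
    "nl": build_profile("dit is een voorbeeld van een nederlandse zin"),
    "en": build_profile("this is an example of an english sentence"),
}
print(detect("dit is een nederlandse tweet", profiles))  # -> ('nl', <ratio>)

Short texts are exactly where this gets hard: a one- or two-word tweet produces only a handful of tri-grams, so the scores for several languages end up very close together, which is why ambiguous tags are difficult to avoid.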

One way to help language detection is to bring in geolocation. A filter I use regularly is:

language.tag == "nl" OR twitter.place.country_code == "NL"
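Note that OR widens the stream: you will also receive tweets geotagged in the Netherlands even when language detection disagrees. If your main goal is to cut out misclassified tweets, swapping the OR for an AND narrows the filter to interactions where both signals agree, at the cost of dropping Dutch tweets that carry no place data:

language.tag == "nl" AND twitter.place.country_code == "NL"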

We are always working to improve our services, and we are planning improvements to language detection across all our supported languages in early 2012.

 


#3

Thanks for the reply, Jason!

Isn’t it odd that languages with different scripts (Arabic, Asian) apparently rank high against Dutch n-grams?


#4

It is definitely not what you would expect! We are aware of a separate issue related to character encoding for languages that are commonly encoded in UTF-16 (Chinese, Japanese and Korean in particular), which may well play some part in this.
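In the meantime, if the stray scripts are getting in your way, a quick client-side sanity check on the script itself can help: Dutch text should consist almost entirely of Latin characters, so tweets tagged "nl" whose letters are mostly Arabic, Cyrillic or CJK can simply be discarded. A rough sketch in Python (the 0.9 threshold is just an example value, not an official recommendation):

import unicodedata

def latin_fraction(text):
    # fraction of the alphabetic characters whose Unicode name marks them as Latin
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    latin = sum(1 for ch in letters if unicodedata.name(ch, "").startswith("LATIN"))
    return latin / len(letters)

def plausibly_dutch(tweet, threshold=0.9):
    return latin_fraction(tweet) >= threshold

print(plausibly_dutch("Dit is gewoon een Nederlandse tweet"))  # True
print(plausibly_dutch("これは日本語のツイートです"))  # False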

We will update our API Issues page with each known issue when we have enough information on the problem.


#5

I see. Thanks for the clarification.