When I filter Twitter for all Dutch tweets (language.tag == "nl"), I get some non-Dutch content. I understand that language detection will never be perfect and that many one- or two-word tweets are ambiguous. But I also get a noticeable number of tweets in entirely different scripts, such as Asian scripts, Cyrillic, or Arabic. From eyeballing the preview, I’d say about 2-5%.
Is language detection based purely on tweet content, or do you also use user metadata to boost confidence for certain languages? Also, would it be possible to add a confidence field to the language augmentation, so that one can keep only the tweets classified with high certainty?
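
To make the wrong-script cases concrete: a client-side check along the following lines would catch the Cyrillic, Arabic and Asian-script tweets, though not non-Dutch tweets written in Latin script. This is only a rough sketch using Python's standard library. The function names and the 0.8 threshold are my own assumptions, and it does not touch the API's language augmentation at all.

```python
import unicodedata


def latin_ratio(text: str) -> float:
    """Return the fraction of alphabetic characters that are Latin letters.

    A character counts as Latin if its Unicode name starts with "LATIN"
    (this covers accented letters such as "é" and "ë").
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        # No letters at all (only emoji, digits, punctuation): nothing to judge.
        return 1.0
    latin = sum(1 for ch in letters if unicodedata.name(ch, "").startswith("LATIN"))
    return latin / len(letters)


def plausibly_dutch_script(text: str, threshold: float = 0.8) -> bool:
    """Hypothetical helper: True if the text is mostly written in Latin script.

    The threshold is an arbitrary assumption, not a tuned value.
    """
    return latin_ratio(text) >= threshold


if __name__ == "__main__":
    samples = [
        "Dit is één Nederlandse tweet",
        "Привет, как дела?",
        "مرحبا بالعالم",
    ]
    for sample in samples:
        print(plausibly_dutch_script(sample), sample)
```

A server-side confidence field would still be preferable, since a script check like this says nothing about short or ambiguous Latin-script tweets.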