I am collecting tweets in Brazilian Portuguese and I need to define word boundaries in the following way: “\bword\b”. However, the boundary matching doesn’t work properly when the given word is part of a longer word that contains special characters such as “ç” or “ã”. For example, if I search for tweets that contain the Portuguese word “liga” (language.tag == “pt” and twitter.text regex_partial “\bliga\b”), some of the returned tweets contain “liga” only as part of the word “ligação”.
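To illustrate what I think is happening, here is a minimal Python sketch. It is not the filtering engine itself: I don't know which regex engine runs behind regex_partial, so the ASCII-only behaviour of \b is an assumption on my part. The sample tweets, the LETTERS character range, and the matching helper are made up for the illustration. With an ASCII-only \b, “ç” is not a word character, so a boundary appears right after “liga” inside “ligação”.

```python
import re

# Made-up sample tweets for the illustration.
tweets = [
    "a liga começa hoje",
    "perdi a ligação",
    "liga pra mim",
]

# Assumed behaviour of the filter: \b only knows ASCII word characters,
# so "ç" counts as a non-word character and "liga|ção" has a boundary.
ascii_boundary = re.compile(r"\bliga\b", re.ASCII)

# Unicode-aware \b (Python 3 default for str patterns): "ç" is a word
# character, so there is no boundary inside "ligação".
unicode_boundary = re.compile(r"\bliga\b")

# Possible workaround when \b is ASCII-only: spell out the boundary with
# lookarounds over a letter range that covers Latin-1 accented letters.
# This range is an assumption and only works if the engine supports lookarounds.
LETTERS = "A-Za-zÀ-ÖØ-öø-ÿ0-9_"
explicit_boundary = re.compile(rf"(?<![{LETTERS}])liga(?![{LETTERS}])")


def matching(pattern, texts):
    """Return the texts in which the pattern is found anywhere."""
    return [t for t in texts if pattern.search(t)]


if __name__ == "__main__":
    print("ASCII \\b:  ", matching(ascii_boundary, tweets))
    print("Unicode \\b:", matching(unicode_boundary, tweets))
    print("Explicit:  ", matching(explicit_boundary, tweets))
```

In this sketch only the ASCII-only pattern returns the “ligação” tweet, which is the same unwanted result I get from the filter.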
I would like to know whether I am using word boundaries correctly, or whether there is some way to exclude unwanted tweets such as those containing the word “ligação”.
Thank you for your time.