Boundaries don't work properly


#1

Hi,

I am collecting tweets in Brazilian Portuguese and I need to define the word boundaries in the following way: “\b\b”. However, the boundary doesn’t work properly when the given word is part of a longer word that contains special characters like “ç” or “ã”. For example, if I search for tweets that contain the Portuguese word “liga” (language.tag == “pt” and twitter.text regex_partial “\bliga\b”), some of the returned tweets contain “liga” only as part of the word “ligação”.
I would like to know whether I am using the boundaries correctly, or whether there is some way to avoid unwanted tweets like those with the word “ligação”.

Thanks for your availability.


#2

This seems to be an issue with the RE2 Regex engine itself. Testing this regular expression outside of DataSift gives me the same results as you are seeing here.

If you are filtering for the word "liga", you can simply use the contains or contains_any operators to achieve this. 

 twitter.text contains "liga"

This CSDL will not match the word "ligação".


#3

Hi,

Thank you for the comment.

I have used the contains and contains_any operators before. The idea of using regular expressions is to reduce the variations, or combinations, of a word to a single search term given by the regular expression. These variations can be the presence or absence of spaces between the words, or of special characters like “ç”, “ã”, …


#4

I know this thread is quite old, but maybe this helps someone. I think the cause of the problem is that RE2, like many other regex implementations, only partly handles Unicode characters. As you can see in the RE2 syntax explanation (https://github.com/google/re2/wiki/Syntax), \b matches an ASCII word boundary. Characters like “ç” or “ã” are not part of ASCII, so they are treated as non-word characters: in “ligação” the engine sees a boundary between “liga” and “ç”, and \bliga\b matches, sadly.
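
To make this easier to see, here is a small Python sketch. It does not use RE2 or DataSift. It only assumes that Python's re.ASCII flag is a fair stand-in for an ASCII-only \b. The last pattern is a possible workaround that avoids \b and spells out the word characters by hand. The WORD_CHARS list and the matches helper are my own illustrative choices, the list only covers Latin-1 accented letters, and I have not tested whether regex_partial accepts such a pattern.

```python
import re

# ASCII-only \b, standing in for the behaviour described in the RE2 syntax page.
ASCII_BOUNDARY = re.compile(r"\bliga\b", re.ASCII)

# Unicode-aware \b (Python's default for str patterns), shown for comparison.
UNICODE_BOUNDARY = re.compile(r"\bliga\b")

# Hypothetical workaround: no \b, explicit list of characters that count as
# "part of a word". Covers ASCII plus Latin-1 accented letters such as ç and ã.
WORD_CHARS = "0-9A-Za-z_À-ÖØ-öø-ÿ"
EXPLICIT_BOUNDARY = re.compile(
    rf"(?:^|[^{WORD_CHARS}])liga(?:$|[^{WORD_CHARS}])", re.ASCII
)


def matches(pattern, text):
    """Return True if the pattern is found anywhere in the text."""
    return pattern.search(text) is not None


if __name__ == "__main__":
    for text in ("fiz uma ligação", "a liga começou"):
        print(
            text,
            matches(ASCII_BOUNDARY, text),
            matches(UNICODE_BOUNDARY, text),
            matches(EXPLICIT_BOUNDARY, text),
        )
```

With the ASCII-only boundary, “fiz uma ligação” is reported as a match, while the Unicode-aware and the explicit versions only match the standalone word “liga”. Note that the explicit version also consumes the neighbouring character, which is fine for filtering but matters if you need the exact match position.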