interaction.mentions contains "Bob" incorrectly matches other users (e.g. @Bob_Jones)


#1

I have written a query that was intended to match only if a specific user was mentioned. Instead, it matches any user with that string in their username.

The full query, with the sensitive search terms redacted, is as follows:

tag "A" {
    interaction.hashtags in "A1, A2"
    or (interaction.content contains "AWord" and interaction.content contains "A Phrase")
    or interaction.mentions contains "USERNAME"
}
tag "W" {
    interaction.hashtags in "W1, W2, W3"
}
return {
    interaction.hashtags in "A1, A2, W1, W2, W3"
    or (interaction.content contains "AWord" and interaction.content contains "A Phrase")
    or interaction.mentions contains "USERNAME"
}

I expect it not to match, for instance:
Everybody loves @USERNAME_IMITATION_ACCOUNT, because it is awesome.
#FF @Not_Username @Othernames

But it does. This seems contrary to the way tokenization is explained here: http://dev.datasift.com/csdl/tokenization-and-csdl-engine/mentions-and-links and contrary to how the CONTAINS keyword is explained. What is going on? It seems to tokenize the usernames, which is bizarre and unexpected behavior. A related issue seems to have been raised here as well: http://dev.datasift.com/discussions/operator-containsany-water-returning-tweets-water


#2

You should probably look at using the IN operator here, rather than CONTAINS or CONTAINS_ANY. Operators are not target specific, so whether you use contains or contains_any against interaction.content or against interaction.mentions, it will still work in exactly the same way: it will tokenize the content it is filtering on.

If you use the IN operator, however, the content will not be tokenized, so interaction.mentions in "username" will only ever match the exact mention "@username", and nothing that contains "username" as part of a larger string.
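
As a rough sketch of that suggestion applied to the query from the first post, only the two interaction.mentions conditions change from contains to in. This is untested, and the quoted values are the same redacted placeholders used in the original query, so treat it as an illustration rather than a verified filter:

```
tag "A" {
    interaction.hashtags in "A1, A2"
    or (interaction.content contains "AWord" and interaction.content contains "A Phrase")
    // IN compares against the whole mention, so @USERNAME_IMITATION_ACCOUNT should no longer match
    or interaction.mentions in "USERNAME"
}
tag "W" {
    interaction.hashtags in "W1, W2, W3"
}
return {
    interaction.hashtags in "A1, A2, W1, W2, W3"
    or (interaction.content contains "AWord" and interaction.content contains "A Phrase")
    or interaction.mentions in "USERNAME"
}
```

The interaction.content conditions are left on contains, since matching a word or phrase inside the text of a post is the case where tokenized matching is wanted.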


#3

The documentation for these operators does not say whether they rely on tokenization, for either “in” or “contains”.

It’s kind of critical that the documentation mention this, because this behavior was completely unexpected, and until you clarified that any use of “contains”/“contains any” will use tokenization, I would not have understood that fact.


#4

Thanks for your feedback. I will see what our Documentation Team can do to make this a little clearer.