Operator contains_any with @water returning tweets with "@ water"


We’re using contains_any as our operator, and @water is one of the many strings we look for in tweets. We were returned a tweet containing “@ water” (i.e. with a space between @ and water). See https://twitter.com/syaninbarirtan/status/292186829556441088

Similarly, we look for “@kpn” and were returned a tweet containing “@ kpn”. See https://twitter.com/di_jan/status/292186745682944002



Similarly, if contains_any has (@mndr) as one of the strings, then the following tweet “did @mndr_luxray block me? where yah been” (See https://twitter.com/MNDR/status/292186939682078720) is being matched.

It shouldn’t match, according to http://dev.datasift.com/docs/operators/contains
i.e. interaction.content contains “apple” will not match: “They served ham and pineapple pizza.”

Another example:
contains_any has (@nidji) and the following tweet was matched by DataSift:
"besok jan 9th 3 at lakeside bogor | @megaspiritrp with @nidji@gac_music @thejakasembung http:tcow6vswhj cc@nidjiholicbogor"
See the tweet: https://twitter.com/RACHELANGELINAP/status/292186956559941632


I assume you are looking for @water and @kpn as Twitter usernames? If this is the case, you need to look for them using the twitter.mentions target, not in your interaction.content or twitter.text targets.

When filtering, DataSift strips out any "@mentions" from the main body of content and treats them separately. I recently wrote a blog post on How Best to Filter for Twitter @Mentions.


OK, here’s some more context:

We want to look for @water, @kpn, and about 30k other Twitter usernames in tweets. This is how we’ve coded it right now:

  twitter.text contains_any "@water, @kpn, @ladygaga, @pepsi, @sunsbaseball"

Are you saying this is incorrect? Should this be twitter.mentions in “water, kpn, ladygaga, pepsi, sunsbaseball”?

The reason we went with twitter.text contains_any is that we also look for plain strings and hashtags, e.g. Connecticut Huskies, Georgetown Hoyas, #funtime.
Hence, we chose to have a CSDL like:

  twitter.text contains_any "Connecticut Huskies, Georgetown Hoyas, @water, @kpn, @ladygaga, @pepsi, #funtime, @sunsbaseball"

Given the way contains_any works, I expected this to return only tweets that match one of the above strings exactly. Hence I didn’t expect to see a tweet with “@ water” when the clause has “@water” (without a space).

Please let us know.


For an explanation of why you were receiving Tweets containing "@ water" (with the space), see The CSDL Engine: How it Works. This explains how DataSift treats punctuation in Tweets.
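
To make the punctuation behaviour concrete, here is a rough toy model in Python. It assumes that text is split into word tokens, that each punctuation character becomes a token of its own, and that whitespace is discarded. This is an illustrative assumption, not DataSift's actual engine, and the tokenize and contains helpers are hypothetical names. Under that assumption, "@water" and "@ water" produce the same token sequence, which would explain the matches reported above.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens and single punctuation tokens.

    Assumption (toy model only): whitespace is dropped and every
    non-alphanumeric ASCII character, including '_', is its own token.
    """
    return re.findall(r"[a-z0-9]+|[^a-z0-9\s]", text.lower())


def contains(text, phrase):
    """Return True if the phrase's tokens appear consecutively in the text."""
    haystack = tokenize(text)
    needle = tokenize(phrase)
    n = len(needle)
    if n == 0:
        return False
    # Slide a window of len(needle) tokens across the text's tokens.
    return any(haystack[i:i + n] == needle
               for i in range(len(haystack) - n + 1))


# Both spellings reduce to the tokens "@" and "water" in this model.
print(tokenize("@water") == tokenize("@ water"))
# "@mndr_luxray" splits on "_", so "@mndr" is found inside it.
print(contains("did @mndr_luxray block me? where yah been", "@mndr"))
# "pineapple" stays a single token, so "apple" is not found.
print(contains("They served ham and pineapple pizza.", "apple"))
```

In this sketch a match needs whole tokens in sequence, so "apple" still does not match "pineapple", while any punctuation boundary (a space, "_" or a second "@") is enough for "@mndr" or "@nidji" to match.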

If you need to search for both keywords and @usernames, you will need to distinguish between them in your CSDL like so:

  twitter.mentions in "water, kpn, ladygaga, pepsi, sunsbaseball" OR
  twitter.text contains_any "Connecticut Huskies, Georgetown Hoyas, #funtime"
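
With around 30k terms in one mixed list, the split can be scripted. The following is a minimal Python sketch, assuming the terms are available as a list of strings and that anything starting with "@" is a username. The build_csdl helper is a hypothetical name, and terms that themselves contain commas or double quotes are not handled.

```python
def build_csdl(terms):
    """Build a two-clause CSDL filter from a mixed list of terms.

    Terms starting with '@' go into a twitter.mentions clause (with the
    '@' removed); everything else, including hashtags, goes into a
    twitter.text contains_any clause.
    """
    mentions, keywords = [], []
    for term in terms:
        term = term.strip()
        if not term:
            continue  # skip empty entries
        if term.startswith("@"):
            mentions.append(term[1:])  # drop the leading '@'
        else:
            keywords.append(term)

    clauses = []
    if mentions:
        clauses.append('twitter.mentions in "%s"' % ", ".join(mentions))
    if keywords:
        clauses.append('twitter.text contains_any "%s"' % ", ".join(keywords))
    return " OR\n".join(clauses)


print(build_csdl(["Connecticut Huskies", "@water", "#funtime", "@kpn"]))
```

For the four sample terms this prints a twitter.mentions clause for water and kpn, followed by a twitter.text contains_any clause for the keyword and the hashtag.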

It is probably also worth noting that our operators such as 'in' and 'contains_any' work in different ways. Take a quick look at our Operators documentation for more info.


Bonus CSDL Tip:

When filtering for hashtags, if you include the '#' symbol in your CSDL filter, you will only match Tweets containing that exact hashtag. If you include just the hashtag text, without the '#' symbol, you will match both the hashtag and the plain text without the '#'. For example:

Tweet 1) "It's funtime"

Tweet 2) "It's #funtime"

This CSDL will match both Tweets:

  twitter.text contains "funtime"

This CSDL will only match Tweet 2:

  twitter.text contains "#funtime"


Thanks Jason, that helps!