Colons and punctuation


#1

I have been browsing http://dev.datasift.com/csdl-engine-how-it-works and would like to confirm the following: if we have a case-insensitive contains_any containing the word “apple” and then see the tweet “#Occupy and Granny groups come together in Redwood to protest Apple: http://dld.bz/b5Tbv”, will we get a match even though “Apple” does not occur as a word on its own, but only together with a trailing colon? Thanks!


#2

Yes, a contains_any filter containing the keyword "apple" will match the example Tweet you described. This is covered in "How do accented characters and punctuation work in DataSift".

Essentially, "apple:" is tokenized into "apple" and ":", so our filter will be able to detect "apple". 

Looking at this question the other way, suppose you were to use the following CSDL:

  twitter.text contains_any " ... , apple:, ..."

Note that here we are filtering for "apple:" with a colon. The CSDL compiler tokenizes the keywords in your CSDL the same way incoming content is tokenized for filtering, so this filter will match "apple:" or "apple :" (notice the whitespace).
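
To make the behaviour above concrete, here is a rough Python sketch of token-based matching. It is only an illustration, not DataSift's actual tokenizer: the regex, the tokenize() helper and the contains_any() function are all hypothetical, and the sketch assumes that each word and each punctuation mark becomes its own token and that matching is case insensitive. Real URL handling in particular may differ.

```python
import re

# Assumption: words and single punctuation marks are separate tokens.
# This regex is illustrative only; it is not DataSift's real tokenizer.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    return TOKEN_RE.findall(text.lower())

def contains_any(text, keywords):
    """Return True if any comma-separated keyword occurs in text as a
    consecutive run of tokens (hypothetical stand-in for the CSDL operator)."""
    tokens = tokenize(text)
    for keyword in keywords.split(","):
        needle = tokenize(keyword)  # keywords are tokenized like content
        if not needle:
            continue
        n = len(needle)
        if any(tokens[i:i + n] == needle for i in range(len(tokens) - n + 1)):
            return True
    return False

tweet = ("#Occupy and Granny groups come together in Redwood "
         "to protest Apple: http://dld.bz/b5Tbv")

print(tokenize("Apple:"))                             # ['apple', ':']
print(contains_any(tweet, "apple"))                   # True
print(contains_any(tweet, "apple:"))                  # True
print(contains_any("I like apple : pie", "apple:"))   # True
print(contains_any("pineapple: yes", "apple"))        # False
```

Because both the content and the keyword are reduced to the same token sequence, whitespace between "apple" and ":" makes no difference in this sketch, while "pineapple" stays a single token and so does not match "apple".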