Case sensitivity for non-ASCII letters


#1

We want to work with content in Turkish. However, it seems that DataSift’s case-insensitivity option does not work well for languages other than English.

I tested my hypothesis with the following query:

tag "hit" {interaction.content substr "ü"} return { interaction.content cs substr "Ü" AND NOT interaction.content cs substr "ü" }

Nothing is tagged with “hit” in the results, which suggests that DataSift’s query engine is not aware that “ü” is the lowercase version of “Ü”. You can replace “ü” with “ç”, “ş”, “é”, etc. The result is the same.

Am I missing something? Or is this the case and you expect us to search for all possible spellings of a simple non-English word?

A suggestion: Let us search in the ASCIIfied version of a text. This would be the ultimate solution for non-English languages. For example, the following query might match “cafe”, “café”, “CAFÉ”, etc.:

interaction.content ascii contains "cafe"
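To illustrate what such an operator might do, here is a minimal Python sketch. It is not DataSift code: `ascii_fold` and `ascii_contains` are hypothetical names, and the check is a plain substring test rather than CSDL's own matching rules. It assumes that folding means lowercasing plus stripping combining accents after Unicode decomposition.

```python
import unicodedata


def ascii_fold(text: str) -> str:
    """Lowercase text and strip diacritics (hypothetical helper, not a DataSift API)."""
    # casefold() is Unicode-aware lowercasing; NFKD then splits "é" into
    # "e" followed by a combining acute accent.
    decomposed = unicodedata.normalize("NFKD", text.casefold())
    # Drop the combining marks and keep only the base characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def ascii_contains(content: str, needle: str) -> bool:
    """Rough stand-in for the proposed `ascii contains` operator (assumption)."""
    return ascii_fold(needle) in ascii_fold(content)


for sample in ["cafe", "café", "CAFÉ"]:
    print(sample, ascii_contains(sample, "cafe"))
```

Letters that have no Unicode decomposition, such as the Turkish dotless “ı”, pass through this sketch unchanged and would need an explicit mapping.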

#2

After some testing using your CSDL definition, I cannot find any interactions which contain both an uppercase and a lowercase ü. Our filtering service can tell the difference between the ü character in upper and lower case. Using the 'cs' (case sensitive) operator means that we will filter for that character in that case only, so your CSDL will look for interactions which contain an uppercase Ü and do not contain a lowercase ü.

The creation of a new 'ascii' operator does sound like an interesting concept. I will raise this with our development team.


#3

Thank you for your answer, Jason.

I see that your filtering service is able to tell the difference between “ü” and “Ü” in case-sensitive mode. What I am concerned about is that it is not able to ignore this difference in case-insensitive mode.

What I was trying to achieve in the test was to find interactions containing only the uppercase “Ü” and to tag them using the case-insensitive substr operator. If the filtering service were aware that “ü” is equivalent to “Ü” (in case-insensitive mode), it would tag all of the interactions. However, this is not the case.
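The logic of the test can be sketched in Python. This is only a model: the sample strings are made up, and `substr_ci_ascii` is a guess at how a matcher that lowercases A-Z only would behave, not a description of DataSift's actual implementation.

```python
import string

# Translation table that lowercases A-Z only (assumed ASCII-only behaviour).
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def substr_ci_ascii(content: str, needle: str) -> bool:
    """Case-insensitive substring check that only knows about ASCII letters."""
    return needle.translate(_ASCII_LOWER) in content.translate(_ASCII_LOWER)


def substr_ci_unicode(content: str, needle: str) -> bool:
    """Case-insensitive substring check using Unicode case folding."""
    return needle.casefold() in content.casefold()


# Made-up sample content standing in for interaction.content.
samples = ["GÜNAYDIN", "ÜLKE", "Ünlü"]

# return { content cs substr "Ü" AND NOT content cs substr "ü" }
selected = [s for s in samples if "Ü" in s and "ü" not in s]

# tag "hit" { content substr "ü" } under each matching model
hits_ascii = [s for s in selected if substr_ci_ascii(s, "ü")]
hits_unicode = [s for s in selected if substr_ci_unicode(s, "ü")]

print(selected)      # interactions with an uppercase Ü and no lowercase ü
print(hits_ascii)    # empty, like the observed results
print(hits_unicode)  # every selected interaction is tagged
```

Under the ASCII-only model nothing is tagged, which matches what the query returned; with Unicode case folding every selected interaction is tagged, which is the behaviour expected from a case-insensitive operator.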

In one sentence: Your case-insensitive filter does not seem to work as expected for non-English letters.

(Thank you for considering implementing the “ascii” operator. That would be so useful for us.)


#4

This has been raised as a bug. You can track the issue: DataSift does not currently associate upper and lower case versions of the same accented characters


#5

Thank you very much.