Filtering data by time zone


#1

I need data for Europe only, but as so many tweets don’t have a location specified, I’m reduced to filtering out data I don’t want. I do this by specifying values for twitter.user.time_zone. However, I suspect I’m doing it wrong, as tweets for the excluded time zones still show up in my feed.

The query I’m using is this (slightly reduced to avoid clutter):

interaction.content CONTAINS_ANY "keyword1,keyword2,keyword3"
AND NOT interaction.content CONTAINS_ANY "keyword4"
AND NOT language.tag IN "ab,aa,sq,am"
AND NOT twitter.user.time_zone CONTAINS_ANY "Quito,Caracas,Mexico City,Moscow,Eastern Time (US & Canada)"
AND NOT twitter.place.country_code IN "US,JP,TR"

Is this the correct way of doing it? With the irrelevant tweets I’m getting now, the stream will quite quickly exceed my budget.


#2

You may want to include 

interaction.type == "twitter"

if you only want to receive Tweets. A large number of your results are coming from other data sources. Filtering with something like

AND NOT twitter.user.time_zone == "Moscow"

will only affect Tweets, as interactions from other data sources do not have a twitter.user.time_zone field at all, so it can never equal "Moscow" for them.

You may also want to consider changing your logic around a bit. You are using your CSDL to exclude languages, time zones, etc. You may find your DPU cost is lower if you only include the terms you want to look for.


#3

It’s not just tweets I need, unfortunately. I also need blog and forum posts, which are of course much harder to filter.

And unfortunately I need all the exclusions. I only want content for Europe, but as I can’t positively identify all European items (the majority don’t have geolocation data), I have to exclude the ones I know aren’t in Europe. The same goes for keywords: one of my keywords mostly matches non-relevant data, so I have to exclude quite a lot of other keywords.

Does this make sense? I would be accepting a higher DPU cost in exchange for paying for fewer unwanted tweets.


#4

I see. It should be easy enough to ensure your Tweets are coming from Europe by using a combination of twitter.user.time_zone, twitter.user.location, and twitter.place.country_code.
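As a rough sketch of what that combination might look like for the Twitter part of your stream: the keywords are the placeholders from your query, and the time zone, location, and country values are purely illustrative assumptions, not a complete European list.

```
// Sketch only: the values in each list are illustrative, not exhaustive
interaction.content contains_any "keyword1,keyword2,keyword3"
AND interaction.type == "twitter"
AND (
    twitter.user.time_zone in "London,Paris,Berlin"
    OR twitter.user.location contains_any "London,Paris,Berlin"
    OR twitter.place.country_code in "GB,FR,DE"
)
```

Because the three location checks are joined with OR, a Tweet only needs one of the fields to be populated with a European value to match. Note that this version returns Tweets only, so the blog and forum sources would need their own branch of the filter.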

Unfortunately, Twitter is one of the few sources that currently supply any kind of user location data, so determining whether something like a blog post is talking about your keywords in a European or non-European context may be difficult.

I would suggest adding a language filter to include all the European languages you are interested in. This will help exclude results from countries that speak languages like Japanese or Afrikaans. You are also unlikely to get much content from outside Europe in some European languages, such as German or Italian.

It may also be worth adding a list of countries or cities you do not wish to receive interactions from:
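For instance, an inclusive language filter could be sketched like this. The language codes below are just an example selection; substitute the languages you actually care about.

```
// Sketch only: the language codes are an example selection
interaction.content contains_any "keyword1,keyword2,keyword3"
AND language.tag in "de,it,fr,es,nl"
```

Since language.tag is not a Twitter-specific target, a condition like this should also apply to your blog and forum content, wherever a language was detected for the interaction.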

interaction.content contains_any "myKeyword, anotherKeyword" AND
NOT interaction.content contains_any "Vancouver, Quebec, Toronto, Canada, ..."

#5

Hi, how can I filter for tweets where twitter.user.time_zone is null?


#6

You could use:

  interaction.type == "twitter" AND
  NOT twitter.user.time_zone exists AND
  twitter.text exists

Note the use of interaction.type == "twitter" and twitter.text exists here. If I were to run NOT twitter.user.time_zone exists on its own, I would also receive interactions from data sources other than Twitter: a Facebook interaction, for example, has no twitter.user.time_zone field, so it counts as a successful match. The interaction.type == "twitter" line ensures that the interaction is a Twitter interaction. The twitter.text exists line ensures that it is not a retweet, as retweets will not contain a twitter.user.* object.

Take a look at our available CSDL Operators and Logical Operators for more details.

Warning: do not run the above CSDL as it is. Combine it with other CSDL conditions using AND, as this example on its own will return well over 100 interactions per second, potentially costing you a lot in license fees.
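As a minimal sketch of a safer version, here is the same filter narrowed with a keyword condition. The keywords are placeholders, so replace them with your own terms.

```
// Sketch only: "keyword1,keyword2,keyword3" are placeholder terms
interaction.content contains_any "keyword1,keyword2,keyword3"
AND interaction.type == "twitter"
AND NOT twitter.user.time_zone exists
AND twitter.text exists
```

The keyword condition limits the stream to Tweets that mention your terms, so the missing-time-zone check only applies to that much smaller set.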