Not getting complete Twitter Firehose!


#1

I am using the Twitter target "twitter.text", but the results I am getting are not satisfactory. Are you providing the full Firehose for Twitter? Our current API provides us around 37 tweets per minute, but with the same sort of query your tool is giving me just 4-8 tweets per hour. I thought my present API vendor was not providing me the complete Firehose, and so I was thinking of doing it in-house. Do you not provide the complete Firehose for the trial version?

The query I am using is:

twitter.text contains_any "“looking, find, searching, need, suggest, recom, recommendation, idea, advice, help, anyone, require, anybody, scout, recs, hunt”"
AND twitter.text contains_any "“conference, meeting, event, seminar, workshop, symposium, exhibit”"
AND twitter.text contains_any "“venue, room”"
AND NOT twitter.text contains_any ““wedding, free, birthday, music, gigg, play, baby, concert””

Also, your pricing structure says that you charge $0.1 per 1000. What does that mean? If it is $0.1 per 1000 tweets, then what about DPUs? The above query is 0.4 DPU, so should the charges not be according to its DPU per hour? Can you please explain your pricing structure, and point out any mistake I am making in my query?


#2

We do provide the full Twitter Firehose. Any public Tweet will be available through DataSift.

twitter.text contains_any "...keywords..." returns any Tweets containing any of your comma separated keywords. It will not return any retweets. To return both Tweets and retweets, I would recommend using interaction.content.

You should not be wrapping your keywords with a double, double quote as you are doing above. The following will do:

 

twitter.text contains_any "looking, find, searching, need, suggest, recom, recommendation, idea, advice, help, anyone, require, anybody, scout, recs, hunt"
AND twitter.text contains_any "conference, meeting, .......
 
Regarding the pricing, you are charged $0.10 per 1000 Tweets as the Twitter license cost. Some augmentations also carry license costs - details of all license costs are available at datasift.com/source.
The minimum cost of running a stream is 1 DPU ($0.20 per hour), so a stream of 0.4 DPUs will be rounded up to 1 DPU. Full billing details can be found in our documentation at Understanding Billing.
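
As a rough illustration of how the two charges above combine, here is a minimal Python sketch. It uses only the figures quoted in this reply ($0.10 per 1000 Tweets, $0.20 per DPU-hour, 1 DPU minimum). The helper name `estimate_cost` is hypothetical, and the sketch assumes the 1 DPU minimum is the only rounding rule; the Understanding Billing documentation remains the authority on how charges are actually calculated.

```python
# Rough cost estimate for one stream, using only the figures quoted above.
# Assumption: the 1 DPU minimum is the only rounding rule applied.

TWITTER_LICENSE_PER_1000 = 0.10  # USD per 1000 Tweets (Twitter license cost)
DPU_HOUR_PRICE = 0.20            # USD per DPU per hour
MIN_DPU = 1.0                    # streams under 1 DPU are billed as 1 DPU


def estimate_cost(dpu: float, hours: float, tweets: int) -> float:
    """Return an estimated cost in USD (hypothetical helper, not a DataSift API)."""
    billed_dpu = max(dpu, MIN_DPU)
    processing = billed_dpu * DPU_HOUR_PRICE * hours
    license_fee = tweets / 1000 * TWITTER_LICENSE_PER_1000
    return processing + license_fee


if __name__ == "__main__":
    # Illustrative numbers: a 0.4 DPU stream running for 48 hours, 39 Tweets.
    print(f"${estimate_cost(0.4, 48, 39):.2f}")  # prints $9.60
```

With these illustrative numbers, almost all of the cost comes from the hours the stream runs rather than from the number of Tweets delivered.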

#3

The double quotes were added by mistake in the message; the query I ran had only one set of quotes, otherwise the system would have shown me an error. So, Jason, I think the quotes are not the issue! But having seen the query, did you find any other error in it?

As for our company, we were using a normal RSS process to extract feeds from Twitter and other social networking sites. To widen our scope, we have now hired a vendor to provide us the full Firehose, but after giving him ample time, we think he is not providing the full Firehose, and the results are not satisfactory. After going through your documentation, I think we can do it in-house with your tool, but my main concern is why the query is not giving full/satisfactory results. I have tried it many times, but the results are still not encouraging.


#4

Could you give me the exact CSDL query you are running, and some examples of Tweets you are missing from your stream?


#5

This is the CSDL I am using:

twitter.text contains_any "looking, find, searching, need, suggest, recom, recommendation, idea, advice, help, anyone, require, anybody, scout, recs, hunt"
AND twitter.text contains_any "conference, meeting, event, seminar, workshop, symposium, exhibit"
AND twitter.text contains_any "venue"
AND NOT twitter.text contains_any "wedding, free, birthday, music, gigg, play, baby, concert"

I ran this query from 8th June to 10th June and got a total of 39 feeds, which is very few. These feeds consumed a total of $10, which would be very expensive given the quantity of data I am expecting to extract.

As you asked, the following are some examples of feeds that are missing from the DataSift exported file:

  1. If you need a venue for future events no matter how big or need help hosting/planning an event hit up @firstclassentva #VA #NC #DMV (Twitter Link: http://twitter.com/ALoverzDaze/statuses/211115668442062849)

  2. We are looking for a venue for a BBQ for 150 pax at the end of June; needs to be suitable for children too. Any suggestions? #eventprofsUK (Twitter Link: http://twitter.com/PurpleGrapeTeam/statuses/211102526345445376)

  3. @DishoomLondon Sadly, it is so! But I’m meeting a friend tomorrow, and yours was the suggested venue- huzzah! (Twitter Link: http://twitter.com/ireenaribena/statuses/211074291326910465)

Jason, my main concern is still the feed flow.


#6

I will look into why Tweet 1) was not received, but Tweets 2) & 3) were not received because they do not meet your CSDL criteria:

They do not match the following part of your filter:

AND twitter.text contains_any "conference, meeting, event, seminar, workshop, symposium, exhibit"

If you want to try to increase the throughput of your stream, you may want to look into making the filter a little more open by adding more keywords to the "contains_any" statements, or by changing some of the "AND" operators to "OR". If you compare the results to Twitter's Streaming API, you will find there are no more Tweets there than what you are receiving through DataSift.


#7

Well, if it is the case that
contains_any "event" AND contains_any "suggest"
cannot catch "#eventprofs" and "suggested",

then can I use
twitter.text contains_substr "conference, meeting, event, seminar" AND twitter.text contains_substr "looking, find, suggest, need"

Will this CSDL catch #eventprofs, events and suggesting, suggestions, suggested…?


#8

contains_any looks for any of the full comma-separated words or phrases - not substrings.

So the CSDL:

  twitter.text contains_any "event" AND twitter.text contains_any "suggest"

Will not match the Tweet:

 "Any suggestions? #eventprofsUK"

If you were to substitute the contains_any operators in the above filter for substr operators, then the Tweet would be matched. 
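
The difference between whole-word and substring matching can be sketched in Python. This is only an approximation of the behaviour described above, not DataSift's implementation: the helper names (`words`, `whole_word_match`, `substring_match`) are hypothetical, and the sketch assumes that matching is case-insensitive and that text is split into words on any non-alphanumeric character. It handles single-word keywords only, not phrases.

```python
import re


def words(text: str) -> set:
    """Split text into lower-cased alphanumeric tokens (assumed tokenisation)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def whole_word_match(text: str, keywords: str) -> bool:
    """Approximate contains_any: true if any keyword equals a full word in the text."""
    wanted = {k.strip().lower() for k in keywords.split(",")}
    return bool(wanted & words(text))


def substring_match(text: str, needle: str) -> bool:
    """Approximate substr: true if the needle appears anywhere in the text."""
    return needle.lower() in text.lower()


tweet = "Any suggestions? #eventprofsUK"

# Whole-word matching sees "suggestions" and "eventprofsuk", so neither keyword matches.
print(whole_word_match(tweet, "event"), whole_word_match(tweet, "suggest"))

# Substring matching finds "event" inside "#eventprofsUK" and "suggest" inside "suggestions".
print(substring_match(tweet, "event"), substring_match(tweet, "suggest"))
```

Note that substring matching is also broader than it may look: under the same assumptions, "event" would also match inside words such as "prevent" or "eventually".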


#9

Regarding Tweet 1), which was missed: it appears we did not receive it. We had some very minor network disruption around this time, so the Tweet may have been missed then. Alternatively, there is a chance the account was protected/suspended when the Tweet was sent, in which case we would not have received it.


#10

Thank you for your prompt response. I’ll try the new query and get back to you soon.


#11

Hi Jason,

Will the following query serve the purpose if, in addition to the above query, I also don’t want tweets containing:
“are you look, r you look, r u look, are u look, you are look, if you need, if u need, you in need, u in need”

CSDL:

twitter.text contains_substr "looking, find, searching, need, suggest, recom, idea, advice, help, anyone, require, anybody, scout, recs, hunt"
AND twitter.text contains_substr "conf, meeting, event, seminar, workshop, symposium, exhibit, expo, corporate"
AND twitter.text contains_substr "venue"
AND NOT twitter.text contains_any "wedding, free, birthday, music, gigg, play, baby, concert, play, family, reception"
AND twitter.text regex_partial "(are|r) (you|u) look, you are look, if (you|u) need, (you|u) in need"

??


#12

Neither regex_partial nor contains_substr works with commas. They both match everything within the double quotes. The only operator you can use in this case which will match any of your comma-separated keywords is the contains_any operator.

The contains_any operator is a very cheap operator - the price of using it only increases by 0.1 DPUs for every 10 comma-separated keywords you add. In this case you might be better off simply listing all variants of the words you are looking for:

  twitter.text contains_any "look, looking, recom, recommend, ....."

Also, you will want to add a "NOT" before the regex_partial expression at the bottom. You can group statements together for clarity:

  twitter.text contains "keyword1"
  AND NOT (
    twitter.text contains "junk" OR
    links.title contains "rubbish"
  )
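
To illustrate the point about commas in a regular expression, here is a small Python sketch. It assumes regex_partial behaves like an ordinary "match anywhere" regular expression search, which Python's `re.search` stands in for here; it has not been checked against DataSift's own regex engine, and the case-insensitive flag and the helper name `partial_match` are assumptions made for the example.

```python
import re


def partial_match(pattern: str, text: str) -> bool:
    """Return True if the pattern matches anywhere in the text (assumed semantics)."""
    return re.search(pattern, text, re.IGNORECASE) is not None


text = "are you looking for a conference venue?"

# With commas, the whole string is ONE expression: the commas and the spaces
# after them must appear literally in the text for it to match.
comma_pattern = r"(are|r) (you|u) look, you are look, if (you|u) need, (you|u) in need"

# Joining the alternatives with "|" lets any one of them match on its own.
alternation_pattern = r"(are|r) (you|u) look|you are look|if (you|u) need|(you|u) in need"

print(partial_match(comma_pattern, text))        # False: no literal commas in the text
print(partial_match(alternation_pattern, text))  # True: "are you look" is found
```

In a regular expression, alternatives are separated with "|" rather than commas, so a comma-separated list inside the quotes is treated as one long literal sequence.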


#13

I have worked on query formation with you. I think I can start on query optimization now.
There is one more thing I would like to clear up before we really start using your services. We don’t just want Twitter feeds. We would like to use your sophisticated tools to the fullest for extracting value from complete Social Data. As you claim, DataSift offers tools for collecting, filtering and analyzing this data. You have mentioned on your site that, if we are buying the data streams directly from you, we will be eligible to use your API key on your partners’ sites to access their services. To what extent can we use their services? For example, one of your partners, HStreaming, provides a scalable real-time continuous data analytics platform powered by Hadoop. Can we entrust them with our data handling needs, along with the analysis? If yes, what infrastructure do we need on our side?
Similarly, is there any alternative for this data handling and analysis need?

And can we alternatively use something like SQL to get the DataSift exported file linked directly to our system on our side, to make the daily process simpler?


#14

For your optimization, I would recommend taking a look at our CSDL Optimization Techniques blog post. If you would like your CSDL reviewed, either post it here, or let our Support Team take a look.

To use our other data sources (Facebook, Blogs, Reddit, etc), take a look at our Data Sources page. Starting to use these additional sources is as simple as clicking to turn them on. All sources can be filtered on using interaction.*. All data sources also have their own unique set of targets.

To work with our partners such as HStreaming, you will need to contact them directly.

We have found the simplest way to use DataSift is to set up a live stream straight to your database, and process the data as it comes in, rather than running exports.