Why does the Python consume-stream.py not return data -- but preview does?


#1

I am able to successfully run my stream in the free preview, and via curl.

But, when I try to run it with the Python example consume-stream.py, I get no data.

I have a similar problem with the twitter-track.py example. If I run it with a high-volume keyword, I get data. But if I run it with a low-volume (but not zero) keyword, I get no data. The same low-volume keyword works if I create a test stream and run it in the free preview on the website.

All help is appreciated.


#2

The free preview returns results from ALL data sources. Running streams through the stream editor, or through a client library using your username and API key, will only return interactions from the data sources that you have turned on.


#3

I don’t think that’s the issue. I’m focused primarily on the twitter data source – and have that activated in my account. Just rechecked, it’s on.

The free preview shows tweets that match my simple keyword query – but the Python sample never shows any data.

Thanks.


#4

Are you running the examples in the correct way?

To run consume-stream.py you must pass the stream hash into the call to the script:

    $ python consume-stream.py <stream_hash>

And to consume twitter-track.py, you need to pass keywords you wish to search for into the script:

    $ python twitter-track.py "keyword1" "keyword2"

 

Have you tried running an example like football.py? This example is probably the simplest to run.

Also, are you trying to connect through some sort of corporate firewall or proxy? This can often cause problems with HTTP/WebSocket streaming.


#5

I am having the same problem. I ran the sample on the DataSift website, the cURL command and consume-stream.py side by side. The website gets about 1 hit every 3 seconds on average, the cURL command shows every hit the website shows, and consume-stream.py shows 0 (but still shows up as streaming time on my bill). football.py works and all the tests passed successfully.


#6

Same question to you godot_labs - are you trying to run these examples on a connection through some kind of corporate firewall or proxy server? If so, take a look at my post - Can I use the DataSift Streaming API through my proxy server?


#7

I can replicate this behaviour. It is not firewall or proxy related as a) curl behaves fine b) data is returned under some circumstances.

My testing has shown that the problems appear as the filter becomes more selective. The ‘football’ example mentioned is OK because there are so many hits. Just add more criteria and the problems will appear. I can do some more tests to show specific examples etc., but first a couple of questions:

  • the API doc says that the service sends keep-alive messages, but it’s not clear to me whether this is a) all the time or b) just when there’s data coming but not yet processed. I suspect it’s the latter and that the client should drop the connection, wait (according to rules) then connect again. If it’s the former then I’m not seeing this happening.

  • When there’s no data matching the criteria over, say, a minute or so, I’m typically seeing socket.timeout errors. Is this ‘normal’ and is this a case where the client should be dropping the connection, waiting n+1 seconds and then retrying? The current implementation simply ignores these errors. Usually the server will eventually reset the connection in this case anyway.

  • as godot_labs has pointed out, this error consumes streaming time (+ my testing time) and my account balance is running low. Can you help me out by topping up my “chrisbc” account balance please? :)
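
For reference, the drop, wait and retry cycle asked about in the second question could look roughly like the sketch below. It is not the client library's actual code: `consume_with_retry`, `open_stream` and `handle` are hypothetical names, and the linear n+1 second delay is only one reading of the reconnect rules.

```python
import socket
import time


def consume_with_retry(open_stream, handle, max_attempts=5, sleep=time.sleep):
    """Read lines from a stream, reconnecting after socket.timeout.

    open_stream: hypothetical callable that connects and returns an
                 iterable of lines (it may raise socket.timeout).
    handle:      callback invoked once per received line.
    Returns the number of reconnects that were needed.
    """
    delay = 0
    retries = 0
    while retries < max_attempts:
        try:
            for line in open_stream():
                handle(line)
                delay = 0  # data arrived, so reset the backoff
            return retries  # the stream ended cleanly
        except socket.timeout:
            retries += 1
            delay += 1  # wait one second longer than last time
            sleep(delay)
    return retries
```

The `sleep` parameter exists only so the waiting can be replaced in a test; in real use the default `time.sleep` applies.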

BTW I’m seeing some more reliable behaviour from the client using the .read() method instead of the .readline() method. It looks to me that the stream ‘chunks’ are not necessarily terminating on line-ends - they can be anywhere in the data-stream, and that this is causing some problems. The client code will need to be modified if that is expected output from the server.
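
A minimal sketch of the kind of change this implies, assuming the stream is newline-delimited and that a chunk returned by .read() may end anywhere: keep the unfinished tail of each chunk and only hand on complete lines. `LineBuffer` is a hypothetical helper, not part of the client library.

```python
class LineBuffer:
    """Reassemble complete lines from arbitrarily split byte chunks."""

    def __init__(self):
        self._pending = b""  # bytes received after the last newline

    def feed(self, chunk):
        """Add a raw chunk and return the complete lines it finishes."""
        self._pending += chunk
        # Everything before the last newline is complete; the final
        # piece (possibly empty) is kept until more data arrives.
        *lines, self._pending = self._pending.split(b"\n")
        # Drop blank lines and any trailing carriage return.
        return [line.rstrip(b"\r") for line in lines if line.strip()]
```

Each complete line could then be passed to json.loads(), while a message split across two chunks simply waits in the buffer until its newline arrives.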

Thanks Chris


#8

The keep-alive messages, or 'ticks', are sent over an HTTP streaming connection (not over a WebSocket connection) if the connection has received no messages (interactions or ticks) for 30 seconds. These ticks are currently 'swallowed' by your client library - it is expected behaviour that you will not see any ticks output by the client library. Running ngrep should allow you to see these ticks being received. We plan to update how ticks are handled in an upcoming version of the client library.
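
As a rough illustration of how a client could rely on this 30-second guarantee, the sketch below flags a connection as stale once nothing at all (interaction or tick) has arrived for the tick interval plus some slack. `TickWatchdog` is a hypothetical helper, not part of the client library, and the 10-second grace period is an arbitrary assumption.

```python
import time


class TickWatchdog:
    """Track how long a stream has been silent."""

    def __init__(self, interval=30.0, grace=10.0, clock=time.monotonic):
        self._limit = interval + grace  # longest silence tolerated
        self._clock = clock
        self._last = clock()  # time of the most recent message

    def touch(self):
        """Call whenever any message, including a tick, is received."""
        self._last = self._clock()

    def is_stale(self):
        """True if the silence has exceeded interval + grace seconds."""
        return self._clock() - self._last > self._limit
```

A consumer would call touch() for every message it reads and reconnect when is_stale() returns True. If a socket timeout is used as well, it probably needs to be longer than 30 seconds so that it does not fire before a tick is due.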

Timeout errors should not be received. The HTTP keep-alive ticks are sent specifically to prevent this. Reconnections after timeouts or network errors are handled towards the end of the streamconsumer_http.py script.

We are looking into this issue, and should have a fix with our next release of the Python client library. We hope to have this fixed in the next few weeks.


#9

Hi Jason, thanks for your reply.

I had made some minor mods to the HTTP streaming client so I could see what was happening - including the ticks. This is what led me to ask about the expected behaviour. From your reply I should see either data or a tick in every 30s period. I don’t believe this was the case, so I may attempt to confirm my findings with a network sniffer as you’ve suggested.

I did observe different tick messages, including a ‘stream established’ message (can’t remember the exact wording) but oddly, I didn’t see this arrive until some data was received - which could be a long time after the connection is accepted. It looked to me like the ticks were only fired when there was data arriving on the channel.


#10

We have resolved the issues with the Python client library. Please download the updated version (v.0.4.0) from our GitHub repo: github.com/datasift/datasift-python

The httplib and urllib2 modules seemed to be buffering interactions until enough had been received to print to the screen. The switch to raw sockets seems to have solved this. If anyone has similar issues going forward, please feel free to raise another issue.