Is it possible to /pull?id=xx from multiple clients without losing data?


#1

I am running two separate clients from different machines to pull data from DataSift to pass on to different processors.

The method I use is:

  1. I use the stream hash to create a new subscription ID for the respective stream via the /push/create API.
  2. I then use this ID to make pull requests to /pull?id=

When I run a single client, everything works.

But when I start a second client with the same DataSift username and API key, the second client sometimes gets data but most of the time gets 0 interactions (no data). How can I fix this problem?

“I don’t want to lose data, and I want to pull data from multiple clients.”


#2

You will need to use the cursor parameter. When you make your first /pull call, DataSift returns a cursor to allow you to make that same pull again, and a cursor to let you know what the next payload is. As long as you can keep these cursors synchronized between your two clients, you should be able to ensure that you pull the same data across both clients. Without using the cursor, if client 1 makes a /pull request, then the other client makes a request, it will be given a different payload.
Alternatively, it is possible to consume the same filter hash multiple times for no additional cost. This means you can create the same Push subscription twice; once for each client, and consume each subscription independently.
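
As a rough sketch of the second option, the snippet below builds one /push/create request per client and a matching /pull request for each subscription. The helper names (create_url, pull_url) and the client names are made up for illustration, and the optional cursor handling is only an assumption about how the cursor parameter would be passed on the query string.

```python
from urllib.parse import urlencode

API_BASE = "https://api.datasift.com/v1"


def create_url(username, api_key, stream_hash, name):
    """Build a /push/create URL for a Pull subscription on stream_hash."""
    query = urlencode({
        "username": username,
        "api_key": api_key,
        "hash": stream_hash,
        "name": name,
        "output_type": "pull",
    })
    return f"{API_BASE}/push/create?{query}"


def pull_url(username, api_key, subscription_id, cursor=None):
    """Build a /pull URL; the cursor is only added when one is supplied."""
    params = {"username": username, "api_key": api_key, "id": subscription_id}
    if cursor is not None:
        # Assumption: the cursor is sent as a plain query parameter.
        params["cursor"] = cursor
    return f"{API_BASE}/pull?{urlencode(params)}"


# One subscription per client, all created from the same filter hash.
urls = [create_url("USERNAME", "API_KEY", "HASH", f"client-{n}") for n in (1, 2)]
```

Each client would then keep the subscription id returned by its own create call and pull only from that id, so the two clients never compete for the same payload.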


#3

Thanks for the input.

Is there any limit to this? For example, can I create 4 or 5 Push subscriptions to consume independently, or is it unlimited as long as the respective filter hash exists?


#4

There is a 200 active Push subscription limit overall on your account.


#5

Hey, I have created the same Push subscription several times (i.e. using the same hash/stream) to consume data independently. But after some time the other Push subscriptions go away, and only one Push subscription stays per unique hash/stream.

For example, I have “hash1”; using this I created:
hash1 -> id1 -> /pull?id=id1
hash1 -> id2 -> /pull?id=id2
hash1 -> id3 -> /pull?id=id3
When I checked the Push subscriptions using the /push/get API, I got 3, which is good, and their status was "active".
When I checked again after a couple of hours using the /push/get API, only one Push subscription was left and the other 2 were gone.
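
A check like the one described above could be sketched as follows. It assumes the subscriptions returned by /push/get have already been parsed into a list of dictionaries with "id" and "status" fields; the function name and the sample ids are illustrative only.

```python
def missing_subscriptions(expected_ids, subscriptions):
    """Return the expected ids that are no longer listed as active."""
    active = {s["id"] for s in subscriptions if s.get("status") == "active"}
    return sorted(set(expected_ids) - active)


# Example: three ids were created, but only id3 is still reported as active.
gone = missing_subscriptions(
    ["id1", "id2", "id3"],
    [{"id": "id3", "status": "active"}],
)
```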

Would you please let me know why this is happening? Do I need to do anything extra?


#6

You do need to ensure that you pull from Pull subscriptions frequently. A Push subscription will only remain active if you continue to consume data from it. You only need to consume data once every hour at the absolute minimum (we recommend pulling as often as you can), but if you go more than an hour without making a successful /pull request, the subscription will come to an end.
It is also worth confirming that you are not making any calls to /push/stop for these subscriptions; this will obviously stop the subscriptions.
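
The one-hour rule above can be turned into a simple client-side check. This is only a sketch: it reads the last_success and created_at fields of a subscription, and it assumes that before the first successful pull the hour is counted from created_at. The function names and the five-minute safety margin are arbitrary choices for illustration.

```python
import time

MAX_IDLE_SECONDS = 60 * 60  # one hour without a successful /pull


def last_activity(subscription):
    """Timestamp of the last successful pull, or of creation if none yet."""
    # Assumption: before the first successful pull, count from created_at.
    return subscription.get("last_success") or subscription["created_at"]


def seconds_until_expiry(subscription, now=None):
    """Seconds left before the subscription has been idle for an hour."""
    now = time.time() if now is None else now
    return MAX_IDLE_SECONDS - (now - last_activity(subscription))


def is_at_risk(subscription, now=None, margin=300):
    """True when fewer than `margin` seconds remain before the hour is up."""
    return seconds_until_expiry(subscription, now) <= margin
```

A client could run is_at_risk against each subscription it owns and log a warning well before the hour is up.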


#7
  • I re-verified that data is consumed frequently from each subscription_id. Yes, data is consumed many times an hour, and it is a continuous process.
  • There are no calls to /push/stop for the respective subscriptions.

I have a question:
If my subscription's filter hash/stream hasn’t matched any data in an hour, the subscription_ids created from that hash will have nothing to deliver for an hour.
Will this scenario force all subscription_ids that use the same hash/stream to finish and die except one (is that why only one ‘id’ always remains per hash/stream)?


#10

If your subscription does not match any interactions in an hour, this should not have any effect on whether the subscription stays alive or not; you simply need to make /pull requests to ensure that the subscription stays alive, regardless of whether your subscription has any data to deliver or not.
I’ve been running some tests with multiple /pull subscriptions, and so far have not seen either of my subscriptions drop off; they have both been running for a couple of hours now. I set up my Pull subscriptions via a curl command similar to:

curl 'https://api.datasift.com/v1/push/create?username=USERNAME&api_key=API_KEY&hash=HASH&name=PULL_TEST&output_type=pull'

I then ensure I pull data every 30 seconds by wrapping a curl command in a watch command:

watch -n30 "curl 'https://api.datasift.com/v1/pull?username=USERNAME&api_key=API_KEY&id=SUB_ID' >> rr1.json"
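
For anyone who would rather not rely on watch, the same polling idea can be sketched in Python. The function names are made up for this example; fetch and sleep are injectable so that the loop can be exercised without touching the network, and the default fetch simply performs an HTTP GET with the standard library.

```python
import time
from urllib.request import urlopen


def fetch_body(url):
    """GET the URL and return the response body as text."""
    with urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8")


def poll(url, out_path, rounds, interval=30, fetch=fetch_body, sleep=time.sleep):
    """Fetch url `rounds` times, appending each body to out_path.

    Returns the length of each body so empty pulls are easy to spot.
    """
    sizes = []
    for i in range(rounds):
        body = fetch(url)
        with open(out_path, "a", encoding="utf-8") as out:
            out.write(body)
        sizes.append(len(body))
        if i < rounds - 1:
            sleep(interval)  # wait between pulls, but not after the last one
    return sizes
```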

If a Push subscription were to stop due to some error, this error should be logged in the /push/log. We are also looking to add more metadata about the status of your subscriptions to the /push/log, such as events like “A stop request was received” to make it a little clearer why Push subscriptions may be in the state they are in.

It may be worth checking your application to ensure it does not make any calls to /push/stop that you may not be aware of.


#11

I observed the issue after creating three subscription_ids for a single hash/stream and pulling data from each of them independently and continuously.

  • I debugged from source with breakpoints inside the DataSift library class “com.datasift.client.push.DataSiftPush”, in the methods “public FutureData stop(String id)” and “public FutureData delete(String id)”. Neither breakpoint was ever hit, so the client code never called the stop API endpoint.
  • I have only one stop call in my codebase, which I use when I want to remove a subscription. I put a breakpoint on it, and log statements are already in place. It was never hit, so this code was not called.

Observations:

  • If I have three subscription_ids created from a single hash, the oldest subscription_id vanishes first, then the second oldest, and only one stays (the same code pulls from all three subscription_ids).
  • I saw a pull return 70 interactions just before pulls started returning 0 interactions. The subscription_id had vanished from DataSift, but the client code kept sending pull requests to it. I mention this to show that I was consuming data many times an hour, including a few minutes before the id vanished.

When I went to the API log:
"request_time": 1429202169, "message": "The delivery has completed"
"request_time": 1429202169, "message": "The status has changed to: finished"
"request_time": 1429201807, "message": "The status has changed to: finishing"

Status “finishing” means “The Historics query has finished or live stream has been stopped”. The client did not stop the subscription (in fact the client is unaware that the subscription_id is gone; it is still sending pull requests, as I can see from the continuous log output).
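
To make the timeline easier to read, the three log entries above can be put in chronological order and the gap between the two status changes computed. This is just a small sketch over the values quoted in this post; the function name is illustrative.

```python
from datetime import datetime, timezone

# The three log entries quoted above, in the order the log returned them.
entries = [
    {"request_time": 1429202169, "message": "The delivery has completed"},
    {"request_time": 1429202169, "message": "The status has changed to: finished"},
    {"request_time": 1429201807, "message": "The status has changed to: finishing"},
]

PREFIX = "The status has changed to: "


def status_changes(log_entries):
    """Return (timestamp, status) pairs for status-change entries, oldest first."""
    changes = [
        (e["request_time"], e["message"][len(PREFIX):])
        for e in log_entries
        if e["message"].startswith(PREFIX)
    ]
    return sorted(changes)


changes = status_changes(entries)
gap_seconds = changes[-1][0] - changes[0][0]  # time from "finishing" to "finished"

for ts, status in changes:
    print(datetime.fromtimestamp(ts, tz=timezone.utc).isoformat(), status)
```

So the subscription moved to “finishing” first and reached “finished” 362 seconds later, while the client was still sending pull requests.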


#12

In addition to the above, I ran curl tests and they FAILED:

I created three subscription_ids using curl for the same hash. I got three successful responses with new subscription_ids, and I was able to pull data until one of them automatically vanished (this was with curl, not my code):
Actual response from DataSift:
{
  "id": "xxxx",
  "output_type": "pull",
  "name": "",
  "created_at": 1429213823,
  "user_id": ,
  "hash": "",
  "hash_type": "stream",
  "output_params": {
    "format": "json_new_line",
    "acl": "private"
  },
  "status": "active",
  "last_request": null,
  "last_success": null,
  "remaining_bytes": null,
  "lost_data": false,
  "start": 1429213823,
  "end": 0
}

I pulled data in the same fashion as you described in your earlier reply:
watch -n30 "curl 'https://api.datasift.com/v1/pull?username=USERNAME&api_key=API_KEY&id=SUB_ID' >> rr1.json"

Note: I used an actual valid username, api_key and subscription_id.

So why did the Pull subscription_id vanish?


#13

Thanks for the additional information. I’ve alerted our engineering team about this. The fact that this is happening both in the Java client and via curl requests is worrying. Are you able to share your DataSift username with me? If you’d prefer to submit it in a private support ticket, you are more than welcome to raise one at support.datasift.com


#14

Thanks Jason.
I have created the request on support.datasift.com, request id: 8605 (for your reference)