How many terms can be in an IN clause, and how does the DPU cost scale with the number of terms?


#1

Is there a limit to the number of terms that can be in an IN clause? And how does the DPU usage scale with the number of terms?

Is the DPU for an IN clause proportional to the number of terms? Is it proportional to the number of characters? Is there some other formula?


#2

There is no numerical limit to the number of terms in an IN / contains_any clause. The only limit currently imposed on CSDL is that the maximum size of a CSDL query must not exceed 1MB. We do plan to remove this limit in the near future. If your CSDL filter does exceed 1MB, you can link streams together using the stream keyword.
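
As a rough sketch of that splitting step, the Python below groups a long term list into chunks whose rendered filter stays under the size limit, then joins the resulting streams with the stream keyword. The target name `interaction.content`, the helper names, and the hash placeholders are assumptions for illustration only. Each chunk would still need to be compiled separately to obtain its stream hash, and terms containing commas or quotes would need escaping that is not shown here.

```python
# Assumption: "1MB" is read as 1,048,576 bytes; lower this if unsure.
MAX_CSDL_BYTES = 1_048_576


def build_filter(terms, target="interaction.content"):
    """Render one contains_any filter (the target name is illustrative)."""
    return '{} contains_any "{}"'.format(target, ", ".join(terms))


def chunk_terms(terms, max_bytes=MAX_CSDL_BYTES, target="interaction.content"):
    """Greedily group terms so each rendered filter is at most max_bytes."""
    # Bytes used by the filter with an empty term list (target, operator, quotes).
    overhead = len(build_filter([], target).encode("utf-8"))
    chunks, current, size = [], [], overhead
    for term in terms:
        term_bytes = len(term.encode("utf-8"))
        added = term_bytes + (2 if current else 0)  # 2 bytes for the ", " separator
        if current and size + added > max_bytes:
            # Close the current chunk and start a new one with this term.
            chunks.append(current)
            current, size = [], overhead
            added = term_bytes
        if size + added > max_bytes:
            raise ValueError("a single term does not fit within max_bytes")
        current.append(term)
        size += added
    if current:
        chunks.append(current)
    return chunks


def combine_streams(hashes):
    """Join already-compiled stream hashes with the stream keyword."""
    return " OR ".join('stream "{}"'.format(h) for h in hashes)
```

For example, `combine_streams(["<hash1>", "<hash2>"])` gives `stream "<hash1>" OR stream "<hash2>"`, where each placeholder stands for the hash returned when the corresponding chunk is compiled.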

The way DPU cost scales is explained in the "Cost of Operators" section of our Understanding Billing page.


#3

Thanks Jason. Here are a couple of follow-up questions:

  1. If I combine streams using the stream keyword, will my total DPU cost be the sum of each stream’s cost? (Understanding Billing says 10 million terms in an IN clause costs 32 DPUs, but if I have to spread that across 100 streams with 100,000 terms each to stay under the 1MB limit, it looks like the sum of those DPUs will be around 800 DPUs, or over $100k/month.)
  2. Is there a mailing list I can get on to be notified when the 1MB CSDL filter limit is lifted?

#4

1. If you need to write a stream this big, contact our support team, who will be able to help you out until the 1MB limit is lifted.

2. Follow @DataSiftDev - all major platform updates are announced there, along with other tips, blog posts and examples.


#5

One more question: does the 1 MB limit for CSDL mean 1,000,000 bytes or 1,048,576 bytes (MB or MiB)?


#6

I believe it is 1,048,576 bytes, though I have not tested this.
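
Until someone confirms it, a quick way to see which side of the two readings a filter falls on is to measure its encoded size. This is only a sketch: the function name is made up, and it assumes the limit is counted in UTF-8 bytes rather than characters, which has not been verified either.

```python
MB = 1_000_000    # decimal megabyte
MIB = 1_048_576   # binary mebibyte, 2**20


def check_csdl_size(csdl):
    """Report the UTF-8 size of a CSDL string against both readings of '1MB'."""
    n = len(csdl.encode("utf-8"))
    return {"bytes": n, "within_mb": n <= MB, "within_mib": n <= MIB}
```

Staying at or below 1,000,000 bytes is safe under either reading.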