Tokenization of links


#1

If I have the CSDL: links contains “my string”.

Is the URL tokenized for matching? If so, what characters are used for separation/tokenization? Are there examples anywhere? I didn’t see any in the documentation.

Thanks


#2

All of the targets you can filter on are listed on our Links Augmentation page. To filter on the URL itself, you can use links.domain or links.url. The URL you filter on is the fully resolved URL.


#3

Ok, let me be just a touch more explicit.

I would like to know exactly how the contains operator behaves compared with the substr operator for a links.url target. Are they functionally identical in this case? If a URL ends with:
/a/series/of/words
will contains match against ser? serie? series? I know that substr will match in any of those cases, but in some cases I may want an “if and only if” style match.

I hope this makes my question slightly more clear.


#4

Let's use the following URL as our example:

    http://www.nytimes.com/2012/03/22/us/early-spring-brings-flowers-but-also-pollen-and-pests.html

In this example, links.domain == "nytimes.com"

links.url substr is used to match substrings, including any punctuation, so links.url substr "times.com/2012" and links.url substr "rings-flowers-b" would both match this URL.

When using the contains operator, links are tokenized by punctuation. You can search for "phrases" in the URL such as "spring-brings-pollen", but you must include any punctuation that separates the words in the URL. You can do this with or without spaces between the keywords and the punctuation. For example, both of the following filters will return the same results: links.url contains "pollen - and - pests" and links.url contains "pollen-and-pests"

To look for recent news stories, you may want to search for something like links.url contains "2012/03/22" to bring back any news stories from today.
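
For anyone who wants to experiment with the difference between the two operators, here is a rough Python sketch of the behaviour described above. It is only a model and not the platform's actual implementation: the tokenize, substr_match and contains_match helpers are hypothetical names, and the tokenization rule (runs of word characters, each punctuation character as its own token, whitespace ignored) is an assumption based on the examples in this post.

```python
import re

URL = ("http://www.nytimes.com/2012/03/22/us/"
       "early-spring-brings-flowers-but-also-pollen-and-pests.html")


def tokenize(text):
    # Assumed rule: each run of word characters is one token, each
    # punctuation character is its own token, whitespace is dropped.
    return re.findall(r"\w+|[^\w\s]", text)


def substr_match(url, needle):
    # Model of substr: a plain substring test, punctuation included.
    return needle in url


def contains_match(url, phrase):
    # Model of contains: the phrase's tokens must appear as a
    # consecutive run of whole tokens in the URL.
    haystack = tokenize(url)
    wanted = tokenize(phrase)
    n = len(wanted)
    if n == 0:
        return False
    return any(haystack[i:i + n] == wanted
               for i in range(len(haystack) - n + 1))


if __name__ == "__main__":
    print(substr_match(URL, "rings-flowers-b"))         # substring test
    print(contains_match(URL, "rings-flowers-b"))       # whole tokens only
    print(contains_match(URL, "pollen - and - pests"))  # spaces are ignored
    print(contains_match(URL, "2012/03/22"))
```

Under this model, a URL ending in /a/series/of/words is matched by contains "series" but not by contains "ser" or "serie", while substr matches all three.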

 


#5

That answers my question much better (probably cause I did a better job of asking it). Thanks so much for your time.