Why does the links augmentation not contain the original t.co short URL?


#1

Hi,

In order to normalize the tweet text, we are looking into replacing the short URLs (http://t.co/… and https) with the links augmentation’s normalized_urls. The augmentation does not contain the http://t.co/… form of the URL, though, so we must resort to something like:

  1. find urls with regexp from tweet content
  2. perform replace looking up normalized_urls from links augmentation
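
The two steps above can be sketched roughly as follows. This is only an illustration, not an official recipe: the pattern in TCO_RE is a guess at the current short URL format, the function name replace_short_urls is made up, and the code assumes that normalized_urls arrive in the same order as the links appear in the content.

```python
import re

# Assumed pattern for t.co short links; the real format may differ or change.
TCO_RE = re.compile(r"https?://t\.co/\w+")


def replace_short_urls(content, normalized_urls):
    """Replace each t.co link in content with the next normalized URL, in order."""
    remaining = iter(normalized_urls)

    def substitute(match):
        # Keep the short link untouched if the augmentation runs out of URLs.
        return next(remaining, match.group(0))

    return TCO_RE.sub(substitute, content)
```

Leaving unmatched short links in place, rather than raising an error, is a design choice for the sketch; a stricter pipeline might prefer to flag such interactions.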

The problem is that, since we have to use a regexp, we now have two problems. Although I can be fairly sure that the regexp will catch all short URLs in the content, I would like to be sure that a change to the short URL format will not break the system.

A simple remedy would be for the links augmentation to include the original short URL. The process would then be:

  1. Iterate over links augmentation
    1. Find short url from content
    2. Replace with normalized_url

If I’m missing something and this is already there and possible, I’m sorry for taking your time.


#2

Hi,

It looks like you are after the links.hops target. It contains every URL the link was redirected through on the way to links.url.

In the stream preview, links.hops will look something like this:

[Screenshot of links.hops in the stream preview]

So your CSDL might look something like 'links.hops substr "http://t.co"' or you can use the contains operator if you are looking for a specific t.co URL.

Hope that helps.


#3

In a fairly large set of data, I could not find a single interaction that contained the t.co URL (the original form of the URL in the content) in the hops. The sample hops given were probably formed from a tweet to a Facebook share to a tweet to…


#4

To make life even more interesting, it seems that there are Twitter interactions for which there are fewer normalized URLs in the links augmentation than there are links in the content.


#5

Warming up an old question, but we actually found the time to look into the missing normalized URLs, and found a solution too. It turns out that the links augmentation does not contain “media objects”. Thus our process for normalizing content by replacing t.co URLs is:

  for all t.co urls in content:
    if url in twitter.media.url or twitter.retweet.media.url:
      replace with twitter.media.media_url or twitter.retweet.media.media_url
    else:
      replace with next links.normalized_url
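
For anyone who wants something closer to runnable, here is a minimal Python sketch of the pseudocode above. It rests on several assumptions that are not confirmed anywhere in this thread: that the interaction is a nested dict, that twitter.media and twitter.retweet.media are lists of objects with url and media_url keys, that links.normalized_url is a list in content order, and that the pattern in TCO_RE matches the short URL format. The function names are made up for the example.

```python
import re

# Assumed pattern for t.co short links; the real format may differ or change.
TCO_RE = re.compile(r"https?://t\.co/\w+")


def media_lookup(interaction):
    """Map media short URL -> media_url for the tweet and its retweet."""
    twitter = interaction.get("twitter", {})
    # Assumed layout: both media fields are lists of dicts.
    media = list(twitter.get("media", []))
    media += twitter.get("retweet", {}).get("media", [])
    return {m["url"]: m["media_url"]
            for m in media if "url" in m and "media_url" in m}


def normalize_content(content, interaction):
    """Replace t.co links with media URLs or the next normalized URL."""
    media = media_lookup(interaction)
    remaining = iter(interaction.get("links", {}).get("normalized_url", []))

    def substitute(match):
        short = match.group(0)
        if short in media:
            # Media links are absent from the links augmentation.
            return media[short]
        # Otherwise consume the next normalized URL; keep the short link if none remain.
        return next(remaining, short)

    return TCO_RE.sub(substitute, content)
```

Media links are resolved by lookup rather than by position, so they do not consume an entry from links.normalized_url, which matches the observation that the augmentation holds fewer URLs than the content has links.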

This has worked for pretty much every interaction we have. Just adding this here in case someone else has the same problems.