To normalize the tweet text, we are looking into replacing the short URLs (http://t.co/… and the https form) with the links augmentation’s normalized_urls. The augmentation does not contain the http://t.co/… form of the URL, though, so we must resort to something like:
- find urls with regexp from tweet content
- perform replace looking up normalized_urls from links augmentation
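Roughly, in Python, the two steps above might look like the sketch below. It is only an illustration: the pattern `SHORT_URL_RE` is my guess at the short URL format, the function name `replace_short_urls` is made up, and it assumes that normalized_urls lists the links in the same order as they appear in the tweet text, which I have not verified.

```python
import re

# Assumed pattern for t.co short URLs; the real format may differ or change.
SHORT_URL_RE = re.compile(r"https?://t\.co/[A-Za-z0-9]+")


def replace_short_urls(text, normalized_urls):
    """Replace the n-th matched short URL with the n-th normalized URL.

    Assumes (unverified) that normalized_urls is in order of appearance.
    """
    urls = iter(normalized_urls)

    def substitute(match):
        # If the augmentation has fewer entries than matches,
        # leave the remaining short URLs untouched.
        return next(urls, match.group(0))

    return SHORT_URL_RE.sub(substitute, text)
```

Everything here hinges on the pattern matching exactly the short URLs that the augmentation resolved, and on the ordering assumption holding.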
The problem here is that, since we have to use a regexp, we now have 2 problems. What I mean is that, although I can be pretty sure the regexp will catch all short URLs in the content, I would like to be sure that any change to their short URL format will not break the system.
A simple remedy would be for the links augmentation to include the original short URL. The process would then be:
- Iterate over links augmentation
- Find short url from content
- Replace with normalized_url
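As a sketch of those three steps in Python: the `short_url` field is the hypothetical addition being requested here, and the shape of each entry (a dict with `short_url` and `normalized_url` keys) plus the function name `replace_from_augmentation` are assumptions for illustration, not the actual augmentation format.

```python
def replace_from_augmentation(text, links):
    """Replace each short URL in text with its normalized URL.

    `links` is assumed to be a list of dicts with a hypothetical
    "short_url" key alongside "normalized_url".
    """
    # Handle longer short URLs first so that one short URL which is a
    # prefix of another cannot clobber part of the longer one.
    ordered = sorted(
        links,
        key=lambda link: len(link.get("short_url") or ""),
        reverse=True,
    )
    for link in ordered:
        short_url = link.get("short_url")
        if short_url:  # skip entries that lack the hypothetical field
            text = text.replace(short_url, link["normalized_url"])
    return text
```

No regexp is involved, so a change in the short URL format would not matter as long as the augmentation reports the short URL exactly as it appears in the content.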
If I’m missing something and this is already there and possible, I’m sorry for taking your time.