How to map normalized links to content?


#1

Is there any way to have the links augmentation output include a mapping or indication of the substring in the content that led to a particular link? I know that the links.url property, for example, is an array and contains links in the order they were found in the content. However, I would love to be able to tell what actual string in the content (the bitly or whatever) led to that link in the links augmentation! I am hoping I don’t have to rewrite & guess at your parsing logic in order to figure that out, but I don’t see any way to get it from the output I’m seeing. Thanks!


#2

Take a look at the links.hops target - this should be exactly what you are looking for.


#3

That is what I expected, but I’m not seeing everything I’d need. Consider the interaction below.
The twitter.retweet.text contains http://t.co/1q4RVZBg, which does not appear in links.hops.
I also find it odd that interaction.content is truncated (ends with …).

Am I missing something? What I’d like to be able to do is replace the http://t.co/1q4RVZBg in the content with the final, resolved URL per links.url.

Thank you!

{"demographic": { "gender": "mostly_male" }, "interaction": { "schema": { "version": 3 }, "source": "web", "author": { "username": "rafeekalika", "name": "RAFEEK ALI", "id": 151420710, "avatar": "http://a0.twimg.com/sticky/default_profile_images/default_profile_0_normal.png", "link": "http://twitter.com/rafeekalika" }, "type": "twitter", "created_at": "Sat, 08 Dec 2012 03:53:39 +0000", "content": "RT @cushy_cms: Merry Xmas Cushy Lovers! Our gift to you: CushyCMS Pro FREE for 1 month w/ Coupon Code XMAS2012 (new subscribers only)! h ...", "id": "1e240ead97e6a380e07421ec836d7a00", "link": "http://twitter.com/rafeekalika/statuses/277259621163470848" }, "klout": { "score": 13 }, "language": { "tag": "en", "confidence": 100 }, "links": { "code": [200], "created_at": ["Sat, 08 Dec 2012 02:43:28 +0000"], "hops": [["http://www.cushycms.com/static/pro?utm_source=twitter&utm_medium=tweet&utm_campaign=CushyXmas12"]], "normalized_url": ["http://cushycms.com/en/static/pro"], "retweet_count": [0], "title": ["Pro Account \u00BB Free and simple CMS \u00BB CushyCMS"], "url": ["http://www.cushycms.com/en/static/pro"] }, "salience": { "content": { "sentiment": 8 } }, "trends": { "type": ["South Africa", "Canada", "daily", "weekly"], "content": ["xmas"], "source": ["twitter"] }, "twitter": { "id": "277259621163470848", "retweet": { "text": "Merry Xmas Cushy Lovers! Our gift to you: CushyCMS Pro FREE for 1 month w/ Coupon Code XMAS2012 (new subscribers only)! 
http://t.co/1q4RVZBg", "id": "277259621163470848", "user": { "name": "RAFEEK ALI", "statuses_count": 206, "followers_count": 37, "friends_count": 2001, "screen_name": "rafeekalika", "lang": "en", "id": 151420710, "id_str": "151420710", "created_at": "Thu, 03 Jun 2010 11:05:01 +0000" }, "source": "web", "count": 3213, "created_at": "Sat, 08 Dec 2012 03:53:39 +0000", "links": ["http://www.cushycms.com/static/pro?utm_source=twitter&utm_medium=tweet&utm_campaign=CushyXmas12"], "domains": ["www.cushycms.com"] }, "retweeted": { "id": "276250455053582339", "user": { "name": "Cushy CMS", "url": "http://www.cushycms.com", "description": "A free and truly simple CMS.", "statuses_count": 75, "followers_count": 3134, "friends_count": 90, "screen_name": "cushy_cms", "lang": "en", "time_zone": "Hawaii", "utc_offset": -36000, "listed_count": 21, "id": 47867909, "id_str": "47867909", "created_at": "Wed, 17 Jun 2009 05:32:02 +0000" }, "source": "web", "created_at": "Wed, 05 Dec 2012 09:03:35 +0000" } } }

#4

I will raise a feature request internally to see if we can get the t.co URL added to the links.hops output. However, when we process links, we do so in the order in which they appear in the Tweet. For example, the following Tweet:

Tweeting two links j.mp/UvPvhm and j.mp/VAbWCP

— Jason Dugdale (@dugtest) December 11, 2012

will give us the following JSON output (cut down for brevity):

 

{
    "links": {
        "code": [
            200,
            200
        ],
        "created_at": [
            "Tue, 11 Dec 2012 09:46:57 +0000",
            "Tue, 11 Dec 2012 08:01:34 +0000"
        ],
        "hops": [
            [
                "http://j.mp/UvPvhm",
                "http://goo.gl/WZSEv"
            ],
            [
                "http://j.mp/VAbWCP"
            ]
        ],
        "normalized_url": [
            "http://bbc.co.uk/news/uk-20677515",
            "http://bbc.co.uk/news/world-us-canada-20673941"
        ],
        "title": [
            "BBC News - 2011 Census: England and Wales population rises 7%",
            "BBC News - Ikea monkey heads to Canada primate sanctuary"
        ],
        "url": [
            "http://www.bbc.co.uk/news/uk-20677515",
            "http://www.bbc.co.uk/news/world-us-canada-20673941"
        ]
    },
    "twitter": {
        "domains": [
            "j.mp"
        ],
        "id": "278440700306153473",
        "links": [
            "http://j.mp/UvPvhm",
            "http://j.mp/VAbWCP"
        ],
        "text": "Tweeting two links http://t.co/IEduEPJL and http://t.co/hFg6pTsD"
    }
}
 
The first link in the Tweet maps to the first set of hops, and the second link to the second set of hops. 
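Given that ordering, one way to swap the t.co links in the text for their resolved URLs is to pair them up by position. The following is only a minimal sketch: the `expand_tco_links` helper and the `TCO_PATTERN` regular expression are hypothetical and written for illustration, and the pattern is a guess at what a t.co link looks like, not the parsing logic the platform uses. It also assumes that every t.co link in the text has an entry in links.url at the same position; if a link fails to resolve, that assumption may not hold.

```python
import re

# Assumption: t.co links in the text look like http(s)://t.co/<alphanumeric id>.
# This pattern is illustrative only; it is not the platform's own link parser.
TCO_PATTERN = re.compile(r"https?://t\.co/[A-Za-z0-9]+")


def expand_tco_links(text, resolved_urls):
    """Replace the i-th t.co link in text with resolved_urls[i].

    t.co links that have no counterpart in resolved_urls are left untouched.
    """
    remaining = iter(resolved_urls)

    def substitute(match):
        # Take the next resolved URL, or keep the original t.co link
        # when the list of resolved URLs has run out.
        return next(remaining, match.group(0))

    return TCO_PATTERN.sub(substitute, text)


# Cut-down interaction using the values from the sample output above.
interaction = {
    "links": {
        "url": [
            "http://www.bbc.co.uk/news/uk-20677515",
            "http://www.bbc.co.uk/news/world-us-canada-20673941",
        ]
    },
    "twitter": {
        "text": "Tweeting two links http://t.co/IEduEPJL and http://t.co/hFg6pTsD"
    },
}

expanded = expand_tco_links(
    interaction["twitter"]["text"], interaction["links"]["url"]
)
print(expanded)
```

For a retweet such as the interaction posted in #3, the text to process would be the one under twitter.retweet.text rather than twitter.text.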
 
The end of interaction.content being truncated is a known issue, and should be patched soon.

#5

Thanks Jason; just pointing out that my original question was raised because I don’t know the logic used to identify a link in the content and would rather not have to guess / implement it.


#6

Regarding adding the t.co links to the links.hops array: unfortunately, this feature request has been rejected, for the following reasons:

- Twitter passes us the links in the Tweet already resolved from the t.co link to the first hop, in the twitter.links field. If we were to start resolving from the t.co link instead, this would add an extra, unnecessary hop to every link resolution attempt. Given that we resolve several million links every day, this would add significant processing overhead and increase the lag on our filtering service.

- The t.co URL shortener is not an open platform like services such as bitly, and every t.co link is unique, making it difficult to provide click or share stats.
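Since twitter.links holds the first hop of each link, an alternative to matching by position is to key each resolved URL on the first entry of its hops chain and look up the twitter.links values against that. This is a rough sketch under one assumption: that the first element of each links.hops chain is identical to the corresponding twitter.links entry. That holds for both sample interactions in this thread, but it is not stated as a guarantee. The `resolved_by_first_hop` helper is hypothetical.

```python
def resolved_by_first_hop(links):
    """Map the first hop of each chain in links.hops to its final links.url."""
    return {
        hops[0]: url
        for hops, url in zip(links.get("hops", []), links.get("url", []))
        if hops  # skip empty chains
    }


# Values taken from the two-link sample output earlier in the thread.
links = {
    "hops": [
        ["http://j.mp/UvPvhm", "http://goo.gl/WZSEv"],
        ["http://j.mp/VAbWCP"],
    ],
    "url": [
        "http://www.bbc.co.uk/news/uk-20677515",
        "http://www.bbc.co.uk/news/world-us-canada-20673941",
    ],
}
twitter_links = ["http://j.mp/UvPvhm", "http://j.mp/VAbWCP"]

lookup = resolved_by_first_hop(links)
# Pair each link Twitter supplied with its resolved URL (None if unmatched).
pairs = [(link, lookup.get(link)) for link in twitter_links]
for original, resolved in pairs:
    print(original, "->", resolved)
```

Keying on the first hop avoids relying on array positions lining up, at the cost of depending on exact string equality between the two fields.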