I’ve noticed that some tweets contain ASCII control characters, e.g. the backspace character. This came to our attention because strings containing these characters cannot be serialised as XML. Would DataSift consider removing these non-printable characters before the interaction is sent to consumers?
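For reference, here is a minimal sketch of how a consumer could strip these characters before serialising to XML. It assumes Python and the XML 1.0 character rules (tab, line feed and carriage return are allowed; the other C0 control characters are not). The function name is hypothetical and not part of any DataSift API.

```python
import re

# Matches any character outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_XML10_INVALID = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_xml_invalid_chars(text: str) -> str:
    """Return text with characters that XML 1.0 forbids removed."""
    return _XML10_INVALID.sub("", text)


# Backspace (0x08) and vertical tab (0x0B) are dropped; tab is kept.
cleaned = strip_xml_invalid_chars("tweet\x08 text\x0b\there")
```

Removing the characters silently changes the content, so replacing them with a visible placeholder may be preferable in some pipelines.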
Could you share an example interaction containing these characters?
I can’t paste the tweet as the character won’t be visible. You have to view the interaction in a decent text editor such as Notepad++, which has symbols for these characters. Some examples: interaction ID 1e1953bab74ea780e07492fa76aab592 contains a vertical tab. 1e1aa5446d27ad00e074dc2623e6793a and 1e1b286ea0e6a800e074c396f7b82d3c contain ‘End of Medium’. 1e1b2d15c0b9af00e07470cf33cae628 contains ‘End of Transmission Block’. I found http://ostermiller.org/calc/ascii.html useful in identifying the characters.
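If a suitable editor isn't to hand, a short script can also flag the characters. The following is a rough sketch in Python; the function name and the small name table are illustrative assumptions and cover only the characters mentioned in this thread.

```python
# Names for the control characters mentioned in this thread (standard ASCII codes).
CONTROL_NAMES = {
    0x08: "Backspace",
    0x0B: "Vertical Tab",
    0x17: "End of Transmission Block",
    0x19: "End of Medium",
}


def find_control_chars(text):
    """Return (index, code point, name) for each ASCII control character in text.

    Tab, line feed and carriage return are ignored because XML 1.0 allows them.
    """
    hits = []
    for index, ch in enumerate(text):
        cp = ord(ch)
        if (cp < 0x20 and ch not in "\t\n\r") or cp == 0x7F:
            hits.append((index, cp, CONTROL_NAMES.get(cp, "control character")))
    return hits


for index, cp, name in find_control_chars("ab\x0bcd\x19"):
    print(f"offset {index}: 0x{cp:02X} ({name})")
```

Running this over the content field of an interaction would list the offset and code of each offending character.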
Do you have rough timestamps for when these interactions were received?
1e1953bab74ea780e07492fa76aab592 was received on 2012-05-03 17:18:52,
1e1aa5446d27ad00e074dc2623e6793a on 2012-05-30 13:37:55,
1e1b286ea0e6a800e074c396f7b82d3c on 2012-06-10 00:00:42,
1e1b2d15c0b9af00e07470cf33cae628 on 2012-06-10 08:53:29.
Thanks. We are looking into the cause of this now.