Incomplete Normalization of Reddit Interactions


#1

The interaction.content field is incompletely normalized for Reddit records: it still contains content-free HTML tags that the normalization process should strip out. Here’s an example:

interaction.id = "1e2b6fa812faaa00e061f254fa1d10f0"
interaction.content = "<div class="usertext-body"><div class="md"><p>Way 3edgy5me</p> </div> </div>"
reddit.content = "<div class="usertext-body"><div class="md"><p>Way 3edgy5me</p> </div> </div>"

The same tag pollution is present on the native reddit.content field, so the problem may lie further upstream.
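
For reference, here is a minimal sketch of the result I would expect from normalization, written in Python with only the standard library's html.parser. The strip_tags helper and the _TextExtractor class are hypothetical names made up for this illustration; I don't know how the actual normalization step is implemented.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment (illustrative only)."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; into text.
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_data(self, data):
        # Called for text between tags; the tags themselves are ignored.
        self._chunks.append(data)

    def text(self):
        # Collapse the whitespace runs left behind once the tags are gone.
        return " ".join("".join(self._chunks).split())


def strip_tags(html):
    """Hypothetical helper: return the text of `html` with all tags removed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# The content value from the example record above.
raw = '<div class="usertext-body"><div class="md"><p>Way 3edgy5me</p> </div> </div>'
print(strip_tags(raw))  # prints: Way 3edgy5me
```

Note that this sketch collapses all whitespace, including line breaks inside the post, and inserts no separator between adjacent block elements, so a real normalizer would probably need to be more careful about paragraph boundaries.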


#2

Could you please attach the full JSON interaction?