HTMLifying user input

I've added a comment system to my new Kansas blog. Since the target audience for that site is friends and family rather than fellow web developers, I've taken a very different approach to processing the input from comments. While this blog insists upon valid XHTML and gives very little help to comment posters aside from highlighting validation problems, my new site's comment system takes the more traditional route of disallowing HTML while automatically converting line breaks and links.

The standard way of doing this with PHP is to use the nl2br function. I've never been a big fan of this method as I prefer blocks of text to be surrounded by paragraph tags. Luckily, adding paragraph tags to blocks of text is a relatively easy task. Here's the pseudo-code, mocked up in Python because it's quicker to experiment with than PHP:

>>> text = '''... lengthy text block here ...'''
>>> paras = text.split('\n\n')
>>> paras = ['<p>%s</p>' % para.strip() for para in paras]
>>> print('\n\n'.join(paras))

The above code splits the text block on any occurrence of a double newline, then wraps each of the resulting blocks in a paragraph tag (after stripping off any remaining whitespace) before joining the blocks back together with a pair of newlines between each one - because I like to keep my HTML nicely formatted. What it doesn't do is handle any necessary <br> tags. The trick now is to replace any single line breaks with <br> without interfering with the paragraph tags. The easiest way to do this is to put the replacement inside the loop, so that only line breaks that occur within a paragraph are replaced. Here's the updated list comprehension:


>>> paras = ['<p>%s</p>' % p.strip().replace('\n', '<br>\n') for p in paras]
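
As a rough illustration, the two steps can be wrapped into one small Python 3 function. This is only a sketch: the name text_to_paragraphs is made up for the example, and skipping empty blocks is an extra assumption that the one-liners above don't make.

```python
def text_to_paragraphs(text):
    """Wrap blank-line-separated blocks in <p> tags, using <br> for single newlines."""
    paras = []
    for block in text.split('\n\n'):
        block = block.strip()
        if not block:
            continue  # assumption: drop empty blocks left by runs of blank lines
        # Only newlines inside a paragraph become <br>
        paras.append('<p>%s</p>' % block.replace('\n', '<br>\n'))
    return '\n\n'.join(paras)


sample = 'First line\nsecond line\n\nNew paragraph'
print(text_to_paragraphs(sample))
```

The sample prints two paragraphs, with the first one containing a single <br> between its two lines.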

The final job is to convert the above into PHP:

$paras = explode("\n\n", $text);
for ($i = 0, $j = count($paras); $i < $j; $i++) {
    $paras[$i] = '<p>'.
        str_replace("\n", "<br>\n", trim($paras[$i])).
        '</p>';
}
$text = implode("\n\n", $paras);

That's the line conversions handled, but there are a few other important steps. Any HTML tags entered by the user need to be either stripped out or disabled by converting them to entities. Converting them to entities carries the risk of ugly failed attempts at HTML appearing on the comments page, but stripping tags carries an equal risk of innocent parts of a legitimate comment (such as a <wink>) being discarded. I chose to go the entity conversion route but force commenters to preview their comments before posting them, a trick I picked up from Adrian's blog. The final step is to automatically convert links into <a href=""> tags. I achieve this using a pair of naive regular expressions, in the hope that the preview screen will catch any cases where they mangle a comment in a way the author didn't intend.
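
The trade-off between the two options is easy to see on a small example. The following Python 3 sketch is illustrative only: html.escape is closer to PHP's htmlspecialchars than to htmlentities, the tag-stripping pattern is deliberately naive, and the sample comment is invented.

```python
import html
import re

comment = 'Nice post <wink> but <b>bold claims</b> need proof'

# Option 1: convert markup characters to entities, so tags show up as literal text.
escaped = html.escape(comment)

# Option 2: strip anything that looks like a tag (naive pattern, for illustration).
stripped = re.sub(r'<[^>]*>', '', comment)

print(escaped)   # Nice post &lt;wink&gt; but &lt;b&gt;bold claims&lt;/b&gt; need proof
print(stripped)  # Nice post  but bold claims need proof
```

Escaping keeps the <wink> visible at the cost of also showing the failed <b> markup, while stripping silently throws the <wink> away.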

Here's the finished PHP function:

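function untrustedTextToHTML($text) {
    // Disable any HTML by converting special characters to entities
    $text = htmlentities($text);
    // Browsers submit textarea line breaks as \r\n, so normalise to \n first
    $text = str_replace(array("\r\n", "\r"), "\n", $text);
    // Wrap each block in <p> tags and turn single newlines into <br>
    $paras = explode("\n\n", $text);
    for ($i = 0, $j = count($paras); $i < $j; $i++) {
        $paras[$i] = '<p>'.
            str_replace("\n", "<br>\n", trim($paras[$i])).
            '</p>';
    }
    $text = implode("\n\n", $paras);
    // Convert http:// links
    $text = preg_replace('|\\b(http://[^\s)<]+)|',
        '<a href="$1">$1</a>', $text);
    // Convert www. links, skipping hosts that directly follow http://
    // (those were already linked by the previous step)
    $text = preg_replace('|(?<!http://)\\b(www\.[^\s)<]+)|',
        '<a href="http://$1">$1</a>', $text);
    return $text;
}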

I have no doubt it could be improved, but my tests so far have shown it to be good enough for the job at hand.
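
For experimenting outside PHP, here is a rough Python 3 port of the same pipeline. It is a sketch under a few assumptions rather than a faithful translation: the names untrusted_text_to_html, LINK_RE and _link are invented here, html.escape stands in for htmlentities, both link patterns are merged into a single pass so that a www. inside an http:// link is not matched twice, and \r\n line endings are normalised before splitting because form submissions typically arrive with them.

```python
import html
import re

# Either an http:// URL or a bare www. host, with the same stop characters
# as the PHP patterns (whitespace, ")" and "<").
LINK_RE = re.compile(r'\b(http://[^\s)<]+|www\.[^\s)<]+)')


def _link(match):
    """Build an <a> tag, adding http:// to bare www. hosts."""
    url = match.group(1)
    href = url if url.startswith('http://') else 'http://' + url
    return '<a href="%s">%s</a>' % (href, url)


def untrusted_text_to_html(text):
    # Disable any HTML the user typed
    text = html.escape(text)
    # Normalise line endings so the double-newline split works
    text = text.replace('\r\n', '\n').replace('\r', '\n')
    # Paragraphs first, then <br> for single newlines inside each one
    paras = ['<p>%s</p>' % p.strip().replace('\n', '<br>\n')
             for p in text.split('\n\n')]
    text = '\n\n'.join(paras)
    # Finally, turn URLs into links in one pass
    return LINK_RE.sub(_link, text)


print(untrusted_text_to_html('Hi <b>\r\n\r\nsee www.example.com'))
```

Like the regular expressions it is based on, the pattern is naive: trailing punctuation such as a full stop directly after a URL ends up inside the link.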

Comments:

Does this not do the nl2paras part in less code?

function newlinestoparas( $thedata ) {
    $data = preg_replace( "/\r\n\r\n/s", "</p>\n<p>", $thedata );
    $data = preg_replace( "/\r\n/s", "<br />\n", $data );
    return "\n<p>$data</p>\n";
}

zlog - 19th October 2003 10:33 - #

Had you considered using a formatting language such as Textile? The worst case if it's not written correctly is that you get the formatting characters displayed. I use Textile when posting to my blog, although I have to admit I haven't used it for posting comments yet...

I couldn't find an open source implementation of it, although you might want to see if you can use the code from Brad Choate's excellent MT Textile plugin. That said, converting it to PHP might be more trouble than it's worth...

Sam Newman - 19th October 2003 13:02 - #

That's pretty funny, since Textile was originally written in PHP. But I would not recommend Textile standalone unless you're willing to do some additional hard-ass sanitizing, since Textile allows HTML through unmodified.

Mark - 19th October 2003 15:24 - #

Textile's a pretty cool tool, but I see it as a shorthand for HTML more than anything else. It's really an extended Wiki style language which speeds up HTML entry without preventing you from doing anything.

I wanted the comment system on my new blog to be the simplest thing that could possibly work, and I also wanted it to do the most obvious thing possible with regards to the intentions of the comment author. I tend to prefer to avoid regular expressions if there's a clearer way of doing the same thing. In fact, I've got a nice trick for converting links that doesn't use regular expressions which I hope to post a bit later today.

Simon Willison - 19th October 2003 16:30 - #

The reason I recommended Textile was simply because if the Textile formatting wasn't properly entered, the worst you get is one or two funny looking characters knocking around. If you allow HTML you always run the risk of the resulting comment completely screwing the page up.

As for Textile letting HTML through, not an issue on my blog as MT sanitised HTML tags.

Sam Newman - 19th October 2003 19:24 - #

The code needs refactoring, but you might find autop useful: http://www.google.com/url?sa=D&q=http:%2F%2Fphotomatt.net%2Fscripts%2Fautop.

Matt - 19th October 2003 19:59 - #

Links from other sites:

...st over a month ago, and I've now added it to my Kansas blog simplified comments system as <a href="http://simon.incutio.com/archive/2003/10/19/htmlifying">mentioned earlier</a>. The problem is the age-old challenge of automatically converting URLs embedded in a piece of text in to links. The standard ...

Simon Willison: Converting links without regular expressions - 19th October 2003 23:25