PottyMouth
© 2007-2008 Matt Chisholm
matt dash pottymouth at mosuki dot com
What does it do?
PottyMouth transforms completely unstructured and untrusted text to valid, nice-looking, completely safe XHTML.
PottyMouth is designed to handle input text from non-technical, potentially careless or malicious users. It produces HTML that is completely safe, programmatically and visually, to include on any web page. And you don’t need to make your users read any instructions before they start typing. They don’t even need to know that PottyMouth is being used.
What is it for?
PottyMouth is ideal for displaying blog comments, text email bodies in a web mail application or mailing list web archive, or any text fields on any site with user input text, such as a social networking, dating, or community site. In short, any input which is displayed in HTML and is input as text by a non-technical and/or untrusted user. It has been in use on mosuki.com since January 2007, and on spydentify.com since January 2008.
What is it not for?
PottyMouth is not intended for HTML page generation, such as writing blog entries, where the author is an authorized and trusted user who may want to exert more control over the content of his or her post. Markdown and SmartyPants, or Textism are good solutions for trusted HTML authoring.
PottyMouth is also not intended for wikis, where the text is more heavily structured and where poorly formatted or malicious input can be quickly corrected by another user. There are many good wiki packages out there; this is not one of them.
Why should I care about…?
…unstructured text input?
The average, non-technical user doesn’t care about formatting syntax and won’t take the time to learn it. PottyMouth lets your website display any user input without having to make your users learn anything. The only “syntax” that PottyMouth uses are conventions that are ubiquitous on-line. If your site displays text input from external programs, third-party sites, or other sources like email, you can’t rely on your users to know about your site’s text formatting conventions.
…layout-safe HTML?
You want to allow your users the freedom to put whatever they want on your site. But you don’t want badly formatted text to make that text look ugly, or to screw up the layout of other elements on the page.
…untrusted text input?
If it’s possible for an untrusted or anonymous user to input text that gets inserted in HTML on your site, you need to process that text to make sure it cannot cause problems for other visitors. If your site displays text input from external programs, third-party sites, or other sources like email, you can’t control or check that text until you are displaying it.
…secure HTML?
Allowing anyone to insert raw or even limited HTML into your site is dangerous. If an attacker can insert JavaScript, media, or malicious links into your site, they can cause a user or the user's browser to perform malicious actions or send spam, on your site or on third-party sites, or they can insert DHTML id attributes or JavaScript to break your DHTML/JavaScript application. If an attacker can insert CSS into your site, they can hide or override advertisements, warnings, or instructions with their own content.
What does it prevent?
PottyMouth protects against a wide range of potential problems:
- no JavaScript or HTML insertion via <iframe> tags
- no JavaScript insertion via <script> tags
- no JavaScript insertion via event handler attributes on tags
- no JavaScript insertion via javascript: hyperlinks
- no JavaScript insertion via CSS expression()
- no overriding of site CSS via <style> tags
- no attacks via malicious href attributes in <a> or src attributes in <img>, <embed> or other media tags
- no damage to site layout via inserted CSS or width, height, or other HTML attributes
- no ability to break or compromise JavaScript applications by generating HTML tags with identifiers that collide with existing DOM identifiers
Although the problems above could be solved by simply allowing a short white-list of HTML tags and no HTML attributes whatsoever, inserting raw HTML tags is a feature that non-technical users don’t need. And PottyMouth automatically detects most of the instances where the average user would want HTML tags.
PottyMouth syntax
Although PottyMouth has no syntax that users must learn, it does parse input text to transform it to HTML. It relies on some ubiquitous text formatting conventions to do the best formatting job possible.
Paragraphs, newlines, and ad-hoc lists
PottyMouth intelligently identifies paragraph breaks, newlines, and ad-hoc lists. A sequence of more than one blank line is turned into a paragraph break. Within a single paragraph, PottyMouth distinguishes between “short” and “long” lines and treats them differently.
- A sequence of long lines is treated as a single, unbroken line, without newlines.
- A single short line between two long lines is also treated as part of the single, long line, and does not insert a newline either. This ensures that text that has been hard wrapped more than once at decreasing line lengths is repaired, and rendered as a single unbroken paragraph.
- Two or more consecutive short lines are treated as an ad-hoc list, and a line break is inserted between them. Thus list-like sequences of short lines are preserved.
Extensive testing has shown that fifty characters is a good threshold between short and long lines (this threshold is configurable if your data differs, however). Here’s an example:
This is some text that has been through a bunch of broken email programs and got hard line wrapped really badly at some point because a programmer was lazy. here's some more text where someone decided to list their favorite things: raspberries pink hair devil ducks science fiction And that list goes right in the middle of a paragraph, but does it screw up potty mouth? Nope.
becomes:
This is some text that has been through a bunch of broken email programs and got hard line wrapped really badly at some point because a programmer was lazy.
here’s some more text where someone decided to list their favorite things:
raspberries
pink hair
devil ducks
science fiction
And that list goes right in the middle of a paragraph, but does it screw up
potty mouth? Nope.
Detecting very long sequences of non-breaking characters and inserting a soft hyphen (&shy;) character that will cause natural text wrapping is planned for a future release.
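The short/long line rules above can be sketched in a few lines of Python. This is only an illustration of the heuristic as described, not PottyMouth's actual code: the function name reflow and its threshold parameter are hypothetical, and it assumes that a lone short line is always folded into the surrounding text.

```python
def reflow(paragraph, threshold=50):
    """Join hard-wrapped lines, keeping runs of short lines as a list.

    Hypothetical sketch: 'threshold' is the short/long cut-off in characters.
    Returns a list of output lines.
    """
    lines = [ln.strip() for ln in paragraph.splitlines() if ln.strip()]
    is_short = [len(ln) < threshold for ln in lines]
    out = []      # finished output lines
    flowing = []  # pieces of the line currently being rebuilt
    i = 0
    while i < len(lines):
        # Measure the run of consecutive short lines starting at i.
        j = i
        while j < len(lines) and is_short[j]:
            j += 1
        if j - i >= 2:
            # Two or more short lines in a row: an ad-hoc list.
            if flowing:
                out.append(" ".join(flowing))
                flowing = []
            out.extend(lines[i:j])
            i = j
        else:
            # A long line, or a single short line: part of the flowing text.
            flowing.append(lines[i])
            i += 1
    if flowing:
        out.append(" ".join(flowing))
    return out
```

Calling reflow(text) on one paragraph returns its repaired lines; a caller would then join them with line breaks inside a single paragraph element.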
Block quotes
PottyMouth identifies sequences of lines beginning with one or more > and groups them into nested sequences of <blockquote> and <p> tags. In other words, input like this:
A reply to a reply
More of the reply to the reply
> A reply to a message
> More of the reply to the message
>> The original message
>> More of the original message
> Last line of the reply
Last line of the reply to the reply
is rendered like this:
A reply to a reply
More of the reply to the reply
    A reply to a message
    More of the reply to the message
        The original message
        More of the original message
    Last line of the reply
Last line of the reply to the reply
You may turn off block quote detection by initializing PottyMouth with blockquote=False.
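As a rough illustration of how quote depth can drive nesting, the sketch below counts the leading > characters on each line and opens or closes <blockquote> tags as the depth changes. The names split_quote and nest_quotes are hypothetical, and the sketch simplifies by wrapping every line in its own <p>; it is not PottyMouth's implementation.

```python
import re
from html import escape

# Leading run of '>' characters (optionally separated by spaces), then the text.
QUOTE_PREFIX = re.compile(r'^((?:>\s*)*)(.*)$')

def split_quote(line):
    """Return (depth, text) for one input line."""
    prefix, text = QUOTE_PREFIX.match(line).groups()
    return prefix.count('>'), text.strip()

def nest_quotes(text):
    """Render lines as <p> elements inside nested <blockquote> elements."""
    out = []
    depth = 0
    for line in text.splitlines():
        new_depth, body = split_quote(line)
        while depth < new_depth:   # going deeper: open quotes
            out.append('<blockquote>')
            depth += 1
        while depth > new_depth:   # coming back out: close quotes
            out.append('</blockquote>')
            depth -= 1
        if body:
            out.append('<p>%s</p>' % escape(body))
    out.extend(['</blockquote>'] * depth)  # close anything still open
    return ''.join(out)
```

Because the tags are only ever emitted by the depth counter, the output is balanced regardless of what the user typed.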
Hyperlinks
PottyMouth identifies hyperlinks beginning with these protocols: http, https, webcal, feed, ftp, news, and nntp, and ending in a valid URL. Adding new protocols is trivial. It also identifies URLs beginning with www. and prepends http://.
When using PottyMouth to generate content for a web application, it expects you to provide it with a list of one or more domain names for the site. Unless you explicitly leave this name blank, PottyMouth will only hyperlink links that point to other sites.
If you want PottyMouth to hyperlink site-internal links, you must also provide it with a white-list of regular expressions that match allowed site-internal links. This allows you to denote the site-internal URLs that users can include that will become hyperlinked, and other URLs will remain as text.
For example, if you were using PottyMouth on http://www.mysite.com, and chose to allow links to posts, you would use a whitelist like https?://(www\.)?mysite\.com/viewpost\?id=\d+. Then, these URLs would get hyperlinked:
http://google.com/
http://someothersite.com/some/page.html
http://www.mysite.com/viewpost?id=1234
http://mysite.com/viewpost?id=5678
But these URLs, which might just be mis-typed, mis-encoded, or might be malicious URLs, would not get hyperlinked:
http://www.mysite.com/viweopst?id=1234 (Whoops, typo)
http://www.mysite.com/viewpost=3Fid=3D1234 (Whoops, encoding problem)
http://static.mysite.com/randomimage.gif (Whoops, disallowed host name on the same domain)
http://www.mysite.com/postcomment?content=Here%20is%20some%20spam! (Malicious)
http://www.mysite.com/delete-my-account?confirm=yes (Malicious)
While PottyMouth should not be considered a substitute for correctly protecting against the latter two types of malicious links in your software, preventing them from being automatically hyperlinked on your site raises the bar significantly for these types of attacks.
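The decision described above (external links are always allowed, site-internal links only when white-listed) might look roughly like the following. This is a stand-alone sketch, not PottyMouth's code: the function names are hypothetical, and it assumes that any host under one of the site's domains counts as internal and that a white-list pattern must match the whole URL.

```python
import re
from urllib.parse import urlsplit

def is_internal(host, site_domains):
    """True if host is one of the site's domains or a subdomain of one."""
    host = host.lower()
    return any(host == d or host.endswith('.' + d) for d in site_domains)

def should_hyperlink(url, site_domains, whitelist):
    """External URLs pass; internal URLs must fully match a white-list regex."""
    host = urlsplit(url).hostname or ''
    if not is_internal(host, site_domains):
        return True
    return any(re.fullmatch(pattern, url) for pattern in whitelist)

# The white-list from the example above.
DOMAINS = ('mysite.com',)
WHITELIST = (r'https?://(www\.)?mysite\.com/viewpost\?id=\d+',)
```

With these settings, should_hyperlink('http://static.mysite.com/randomimage.gif', DOMAINS, WHITELIST) is False, because the host is internal but the URL matches no white-listed pattern.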
You may turn off all hyperlinking by initializing PottyMouth with all_links=False, and you may turn off just email address hyperlinking with email=False.
Embedded media
PottyMouth optionally allows embedded media. URLs ending in .JPG, .JPEG, .GIF, and .PNG are considered to be embedded images, and are included as <img> tags. It also detects links to YouTube videos and embeds them using YouTube’s standard embedding syntax.
The embedded media feature is disabled by default, because it does somewhat compromise the safety of the generated HTML. Embedded media could be used to launch cross-site scripting attacks on another site, if an attacker can generate a malicious URL to the remote site that ends in JPG, GIF, or PNG. However, protecting against cross-site scripting attacks is really the responsibility of the target site, not you.
Embedded media could also be used as web bugs by a third party to collect IP addresses of visitors to your site. This could only be mitigated by running a cache which served the third party content to your site visitors, and appending the target URL onto the cache service. Adding configuration options for this is planned for a future release.
Embedded media is still relatively safe, for the following reasons:
- The URL white-listing of hyperlinks is applied before identifying hyperlinks to media, so linking to malicious site-internal URLs, or random images on the site is still not possible.
- The major browsers do not execute CSS, JavaScript or HTML if it is loaded as the src attribute of an <img> tag, so linking to malicious content is still not possible.
- By correctly setting the CSS overflow and size properties of the HTML element containing PottyMouth generated HTML, large embedded images will not interfere with page layout.
- By only allowing embedded Flash widgets from a set of sites known to produce (relatively) trustworthy Flash, the possibility of including malicious Flash is low.
You may turn off image tag creation and YouTube embedding by initializing PottyMouth with image=False, and/or youtube=False. Image and YouTube URLs are then treated as ordinary URLs (see above).
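To make the image rule concrete, here is a small sketch that decides between an <img> tag and an ordinary link based on the URL's ending. The function render_url and its embed_images flag are hypothetical names used for illustration, and the sketch assumes the URL has already passed the hyperlink white-list check.

```python
from html import escape

# Endings treated as images, compared case-insensitively.
IMAGE_SUFFIXES = ('.jpg', '.jpeg', '.gif', '.png')

def render_url(url, embed_images):
    """Return an <img> tag for image URLs when embedding is on, else a link."""
    safe = escape(url, quote=True)  # never emit the raw URL into an attribute
    if embed_images and url.lower().endswith(IMAGE_SUFFIXES):
        return '<img src="%s" />' % safe
    return '<a href="%s">%s</a>' % (safe, safe)
```

With embed_images set to False, an image URL falls through to the ordinary hyperlink branch, which matches the behaviour described for image=False.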
Bold and italic
PottyMouth identifies balanced sets of * and _ and turns them into bold (<b>) and italic (<i>) tags. This support was added because this shorthand is extremely common in text input, even from non-technical users. Bold and italic can be nested; however, they cannot be overlapped and neither can be nested, at any depth, inside itself. Un-balanced * and _ are rendered literally.
this is *bold _and italic_* or _italic *and bold*_ or just *one* or the _other_ but *I dunno _what* this_ is *supposed to* be.
produces:
this is bold and italic or italic and bold or just one or the other but I dunno _ what this _ is supposed to be.
Support for other shorthands, such as - for strikeout and = for monospaced text is a possibility, but unlikely as it requires some user knowledge, is much rarer than * and _, and would likely interfere with the normal use of those characters.
You may turn off bold and italic creation by initializing PottyMouth with bold=False, and/or italic=False.
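One way to get the balancing behaviour described above is a small stack of open markers: a marker only becomes a tag once its partner is found, and anything left unmatched stays literal. The sketch below is a simplified illustration with a hypothetical function name, emphasize; it treats every * and _ as a candidate marker, which a real parser would refine (for example around spaces or inside words).

```python
from html import escape

TAGS = {'*': 'b', '_': 'i'}

def emphasize(text):
    """Turn balanced * and _ pairs into <b> and <i>; leave the rest literal."""
    out = []    # output pieces, one per character or tag
    stack = []  # open markers as (marker, index of its piece in out)
    for ch in text:
        if ch not in TAGS:
            out.append(escape(ch))
            continue
        if any(marker == ch for marker, _ in stack):
            # Closing: drop unmatched inner markers, which stay literal,
            # so tags can nest but never overlap.
            while stack[-1][0] != ch:
                stack.pop()
            _, index = stack.pop()
            out[index] = '<%s>' % TAGS[ch]
            out.append('</%s>' % TAGS[ch])
        else:
            # Opening candidate: literal until a partner shows up.
            stack.append((ch, len(out)))
            out.append(ch)
    return ''.join(out)
```

Since a marker that is already open always closes rather than opens again, neither tag can end up nested inside itself.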
Special characters
PottyMouth renders single and double quotes, backticks, ellipses and double-dashes into the appropriate HTML entities:
'foo' ⇒ ‘foo’
"foo" ⇒ “foo”
`foo' ⇒ ‘foo’
``foo'' ⇒ “foo”
foo's ball ⇒ foo’s ball
foo... ⇒ foo… (ellipsis)
foo--bar ⇒ foo—bar (emdash)
Single dashes are not converted into dash, hyphen, minus or emdash, as it is not possible to reliably detect what is the correct character to use from context. See The Trouble With EM ’n EN for more information. Because PottyMouth is intended for non-technical, novice users, there is no syntax for distinguishing these characters.
All characters that are not valid HTML, including <, >, and &, are escaped in the output.
Support for smilies and additional special characters is a future possibility.
You may turn off smart quotes, ellipsis, and emdash detection by initializing PottyMouth with smart_quotes=False, ellipsis=False, and/or emdash=False.
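The replacements listed above can be approximated with a handful of substitutions applied after escaping. The sketch below is an assumption-laden illustration, not PottyMouth's code: the function name smarten, the use of numeric entities, and the exact regular expressions are all choices made here for the example.

```python
import re
from html import escape

def smarten(text):
    """Escape HTML-special characters, then apply typographic replacements."""
    s = escape(text, quote=False)           # <, > and & first; quotes kept
    s = s.replace('...', '&#8230;')         # ellipsis
    s = s.replace('--', '&#8212;')          # em dash
    s = re.sub(r"``(.+?)''", r'&#8220;\1&#8221;', s)   # ``foo''
    s = re.sub(r'"(.+?)"', r'&#8220;\1&#8221;', s)     # "foo"
    # 'foo' or `foo', but not an apostrophe inside a word.
    s = re.sub(r"(?<!\w)['`](.+?)'(?!\w)", r'&#8216;\1&#8217;', s)
    s = s.replace("'", '&#8217;')           # remaining apostrophes
    return s
```

Escaping runs first so that the entities produced by the later steps are not themselves escaped.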
Literal list syntax
PottyMouth identifies literal lists denoted by lines beginning with #, 1., or any number of digits followed by a period, for ordered lists, and *, -, or • (bullet), for unordered lists. The first item in the list determines whether the entire list is ordered or unordered. This:
# science fiction
# devil ducks
# raspberries
* raspberries
- pink hair
# devil ducks
• point-set topology
1. the unit
1. binary
413. ternary
becomes:
- science fiction
- devil ducks
- raspberries
- raspberries
- pink hair
- devil ducks
- point-set topology
- the unit
- binary
- ternary
Nested lists are not supported. Nested lists are an important feature in documents with heavily structured content deliberately created by careful editors who want to take the time to learn syntax and structure their content appropriately. PottyMouth is for displaying ad-hoc text input quickly by non-technical users, where flat literal lists are occasional and nested lists are vanishingly rare.
You may turn off all list support by initializing PottyMouth with all_lists=False. You may turn off just ordered lists, or just unordered lists, by initializing PottyMouth with ordered_list=False, and/or unordered_list=False. And you may turn off just numbered lists (list items beginning with a sequence of digits and a period) with numbered_list=False.
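The marker rules for literal lists can be sketched with two regular expressions, with the first item deciding the list type. This is an illustrative sketch only; the name render_list and the choice to skip lines that carry no marker are assumptions made here, not documented PottyMouth behaviour.

```python
import re
from html import escape

# Ordered markers: '#' or digits followed by a period.
ORDERED = re.compile(r'^\s*(?:#|\d+\.)\s+(.*)$')
# Unordered markers: '*', '-' or a bullet character.
UNORDERED = re.compile(r'^\s*[*\-\u2022]\s+(.*)$')

def render_list(lines):
    """Render one literal list; the first item decides <ol> versus <ul>."""
    tag = None
    items = []
    for line in lines:
        ordered = ORDERED.match(line)
        match = ordered or UNORDERED.match(line)
        if not match:
            continue  # assumption: lines without a marker are skipped
        if tag is None:
            tag = 'ol' if ordered else 'ul'
        items.append('<li>%s</li>' % escape(match.group(1)))
    if tag is None:
        return ''
    return '<%s>%s</%s>' % (tag, ''.join(items), tag)
```

Mixed markers within one list are accepted, so a list that starts with * stays unordered even if a later item begins with #.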
Usage
PottyMouth is written as a Python module. (There is also an experimental port to Ruby 1.9.) To use it, first instantiate a parser and tell it what domain it’s going to be used on:
from PottyMouth import PottyMouth
pm = PottyMouth(url_check_domains=('www.mysite.com', 'mysite.com'),
                url_white_lists=(r'https?://www\.mysite\.com/allowed/url\?id=\d+',),
                )
The parse() method returns a PottyMouth.Node object representing a <div> node, and containing <p> nodes.
div_node = pm.parse(string_to_parse)
You can then stringify them with str() or just print them:
print div_node
PottyMouth.Node objects inherit from native Python lists, so you may also iterate over their contents and convert them to whatever native XHTML objects your application requires.
The Ruby version uses an identical interface.
Configuration
You may disable specific components of PottyMouth's syntax by passing in any combination of the following keyword arguments when initializing a new PottyMouth instance:
all_links=False - disables all URL hyperlinking
image=False - disables <img> tags for image URLs
youtube=False - disables YouTube embedding
email=False - disables mailto:email@site.com URLs
all_lists=False - disables all lists (<ol> and <ul>)
unordered_list=False - disables all unordered lists (<ul>)
ordered_list=False - disables all ordered lists (<ol>)
numbered_list=False - disables '\d+\.' list elements
blockquote=False - disables '>' <blockquote>s
bold=False - disables *bold*
italic=False - disables _italics_
emdash=False - disables -- emdash
ellipsis=False - disables ... ellipsis
smart_quotes=False - disables smart quotes
All of these options are enabled by default. You only need to pass foo=False if you wish to disable one.
Download
PottyMouth is licensed under the BSD License. It requires Python, version 2.4 or newer, and is available in source or as .deb or .rpm packages. An experimental port to Ruby 1.9.0 is also available.
- PottyMouth 1.1.1 for Python >= 2.4, released 22 September 2008
- PottyMouth 1.0.2 for Ruby 1.9
- Older versions
If you have any suggestions or problems with PottyMouth, please feel free to email me at matt dash pottymouth at mosuki dot com.