Escaping and Encoding in HTML and RSS
Thursday, March 13th, 2008
Escaping and encoding are two things that most web developers need to do quite often. Unfortunately, most people are never taught when (or even how) to do so. Depending on what you’re doing, getting it wrong can be both a security risk and a bad user experience.
Encoding and escaping are similar in concept. Both ultimately try to represent values (in our context, characters) that either have special meanings or are otherwise not representable in the underlying format.
For example, the “<” character has special meaning in HTML/XML, so if you want it to actually show up, you have to escape it. To do that, you use “&lt;” in its place.
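As a rough sketch of what that looks like in practice (Python is just my choice for illustration, and the helper name escape_for_html is made up), the standard library's html.escape performs this substitution:

```python
import html


def escape_for_html(text: str) -> str:
    """Replace the characters that are special in HTML with entity references."""
    # quote=True also escapes double and single quotes, which matters
    # when the text ends up inside an attribute value.
    return html.escape(text, quote=True)


print(escape_for_html("1 < 2 & 3 > 2"))
print(escape_for_html('She said "hi" <b>'))
```

The first call turns “<”, “&” and “>” into “&lt;”, “&amp;” and “&gt;”. The second also turns the double quotes into “&quot;”.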
Another example is URLs, where “&” has special meaning (it separates query parameters). Inside a parameter value it is replaced by “%26”. “%” itself also has to be encoded (as “%25”).
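A minimal sketch of URL encoding a single query parameter value, assuming Python's urllib.parse.quote (the helper name encode_query_value is hypothetical):

```python
from urllib.parse import quote


def encode_query_value(value: str) -> str:
    """Percent-encode a string so it is safe to use as a query parameter value."""
    # safe="" means even "/" gets encoded; by default quote() leaves it alone.
    return quote(value, safe="")


print(encode_query_value("%nfinity"))
print(encode_query_value("this&that"))
```

Note that this is applied to the value only, never to the whole URL. Otherwise the “&” that really is meant to separate parameters would get encoded too.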
If you wanted to link to “http://virtualinfinity.net/dictionary?word=%nfinity&fun=true” in HTML, you’d first want to URL encode “%nfinity” to “%25nfinity”, and then you’d want to HTML encode the full URL to “http://virtualinfinity.net/dictionary?word=%25nfinity&amp;fun=true”.
Your final output would look something like <a href="http://virtualinfinity.net/dictionary?word=%25nfinity&amp;fun=true">Words ending in nfinity &amp; having fun</a>. Notice the “&amp;” in the href. Most web browsers are tolerant of a bare “&” there, but that mistake can cause you problems down the road.
A good way to know what to encode, and which method to use, is to think of each encoding as a layer. You want to put the string “%nfinity” in the URL query parameter layer, so you need to encode it with the URL encoding. You want to put the URL into an HTML document, so you need to HTML escape it. And so on and so forth.
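Here is one way the layers could be stacked in code. It is only a sketch: build_link and its parameters are names I invented, and it leans on Python's urllib.parse.quote and html.escape for the two encodings.

```python
import html
from urllib.parse import quote


def build_link(base_url: str, word: str, label: str) -> str:
    """Build an HTML link, applying one encoding per layer."""
    # Layer 1: the word goes into a URL query parameter, so URL encode it.
    encoded_word = quote(word, safe="")
    url = base_url + "?word=" + encoded_word + "&fun=true"
    # Layer 2: the URL and the label go into HTML, so HTML escape both.
    return '<a href="' + html.escape(url) + '">' + html.escape(label) + "</a>"


link = build_link(
    "http://virtualinfinity.net/dictionary",
    "%nfinity",
    "Words ending in nfinity & having fun",
)
print(link)
```

The order matters. Each value is encoded for the innermost layer first, and the result is then treated as plain text for the next layer out.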
Things can get even more interesting with RSS feeds. The text value of a <description> element can contain HTML. A naive first attempt might be something like:
<description> <a href="http://virtualinfinity.net/dictionary?word=%25nfinity&amp;fun=true">Words ending in nfinity &amp; having fun</a>. </description>
Unfortunately, this doesn’t work quite as expected. In XML, this actually creates an “a” element within the “description” element. This is *not* valid RSS. So you need to escape the contents of the description element.
<description> &lt;a href="http://virtualinfinity.net/dictionary?word=%25nfinity&amp;amp;fun=true"&gt;Words ending in nfinity &amp;amp; having fun&lt;/a&gt;. </description>
This may look funny, but it’s actually correct. It is not a typo to have “&amp;amp;”. The first “&amp;” will be converted back into “&” by the XML parser, so “&amp;amp;” becomes simply “&amp;”. After that, the HTML parser gets hold of it and converts it to the expected “&”.
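If you want to convince yourself of this, the round trip can be sketched with Python's standard library. The LinkCollector class and the variable names are my own inventions for this example. xml.etree.ElementTree stands in for the feed reader's XML parser and html.parser for its HTML parser, so treat it as an illustration rather than how any particular reader works.

```python
import html
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Inner layer: the HTML fragment, with "&" already written as "&amp;".
LINK_HTML = (
    '<a href="http://virtualinfinity.net/dictionary?word=%25nfinity&amp;fun=true">'
    "Words ending in nfinity &amp; having fun</a>."
)

# Outer layer: escape the whole fragment again so XML sees it as plain text.
description_xml = "<description>" + html.escape(LINK_HTML, quote=False) + "</description>"


class LinkCollector(HTMLParser):
    """Collects href values of <a> tags and all text content."""

    def __init__(self):
        super().__init__()
        self.hrefs = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(value for name, value in attrs if name == "href")

    def handle_data(self, data):
        self.text.append(data)


# The XML parser undoes the outer layer and gives back the HTML fragment.
recovered_html = ET.fromstring(description_xml).text

# The HTML parser undoes the inner layer and gives back the real URL and text.
collector = LinkCollector()
collector.feed(recovered_html)
collector.close()

print(description_xml)
print(collector.hrefs[0])
print("".join(collector.text))
```

The printed XML contains “&amp;amp;”, the recovered HTML contains “&amp;”, and the final href contains a plain “&”, one layer peeled off at each step.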
So, there you have it: a brief explanation of when and what to encode. How to encode is left as an exercise for the reader. (Hint: Google is your friend.)
