I zoned out over the development process part, to work on some Upcoming stuff. Sorry about that.
Now we’re on Unicode 101. Short discussion of charset vs. encoding, with a few examples: ASCII, UCS-2/UTF-16, and UTF-8. “charset” in HTTP headers and meta tags is a misnomer; it should be “encoding”. The rationale discussion uses substr… my preferred example is truncation of db strings. Truncating a string that’s in a different language will completely bust a non-Unicode implementation. Some of the problems with MySQL are in older versions (3.0 and 4.0), so those aren’t huge issues for us. JavaScript mostly handles UTF-8 successfully, except for the escape() function, so we have to implement our own UTF-8 escape() function.
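To make the truncation problem concrete for myself, here is a minimal Python sketch (my own illustration, not code from the talk; the function names are made up). It shows how cutting a UTF-8 string at a byte count can split a multi-byte character, and two ways to avoid that.

```python
def truncate_bytes(text: str, limit: int) -> bytes:
    """Naive truncation: cut the UTF-8 bytes at a fixed byte count."""
    return text.encode("utf-8")[:limit]


def truncate_chars(text: str, limit: int) -> str:
    """Character-aware truncation: cut on code points, not bytes."""
    return text[:limit]


def truncate_utf8(text: str, limit: int) -> str:
    """Fit into a byte budget without leaving half a character behind.

    errors="ignore" drops the incomplete trailing sequence left by the cut.
    """
    return text.encode("utf-8")[:limit].decode("utf-8", errors="ignore")


word = "na\u00efve"  # the i-with-diaeresis takes two bytes in UTF-8

broken = truncate_bytes(word, 3)   # ends in the middle of a character
safe = truncate_utf8(word, 3)      # drops the half character instead
```

Here `broken` is no longer valid UTF-8, so decoding it strictly raises `UnicodeDecodeError`, which is roughly what a byte-oriented substr does to a database column.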
UTF-8 and email. Content-Type headers only apply to content blocks. Headers need inline encoding, e.g. “=?utf-8?Q?…”, defined in RFC 1342 (since superseded by RFC 2047). “Q” is similar to quoted-printable; “B” is base64. This allows non-ASCII in the subject line.
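For reference, a small sketch of encoded-word headers using Python's standard `email.header` module (not what the talk used; `encode_subject` and `decode_subject` are names I made up). The library chooses between the Q and B forms itself when encoding, so I only rely on the round trip.

```python
from email.header import Header, decode_header, make_header


def encode_subject(text: str) -> str:
    """Wrap a non-ASCII subject in an encoded-word (=?utf-8?...?=)."""
    return Header(text, charset="utf-8").encode()


def decode_subject(raw: str) -> str:
    """Turn an encoded-word header value back into readable text."""
    return str(make_header(decode_header(raw)))


# The same two-syllable word in the Q form and in the B form.
q_form = "=?utf-8?Q?Caf=C3=A9?="
b_form = "=?utf-8?B?Q2Fmw6k=?="

print(decode_subject(q_form))
print(decode_subject(b_form))
print(encode_subject("Caf\u00e9 photos"))
```

Both sample headers decode to the same text, which is the point: Q and B are just two transfer forms for the same bytes.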
Any junk that you receive, you assume is Latin-1 that is misidentified.
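One way to read that rule, as a Python sketch (my interpretation, not code from the talk; `decode_untrusted` is a hypothetical name): try UTF-8 first, and if the bytes are not valid UTF-8, treat them as Latin-1, which can decode any byte sequence.

```python
def decode_untrusted(raw: bytes) -> str:
    """Decode input as UTF-8, falling back to Latin-1 for junk."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this never fails.
        return raw.decode("latin-1")


good = decode_untrusted("caf\u00e9".encode("utf-8"))
junk = decode_untrusted(b"caf\xe9")  # a lone 0xE9 is not valid UTF-8
```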
Filtering is done only at the outside, possibly except for “signed data.” I don’t really understand his example here, so I’ll ask later. You never want chars below 0x20, apart from normalized carriage returns. Carriage returns mess up XML attributes, though.
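A rough Python sketch of that control-character rule (my own, not the speaker's code; `clean_text` is a made-up name). I'm assuming "normalized" means collapsing CRLF and bare CR into a single newline, and I'm taking the rule literally, so tabs get stripped too.

```python
import re

# Everything below 0x20 except the newline (0x0A) we normalize to.
_CONTROL_CHARS = re.compile(r"[\x00-\x09\x0b-\x1f]")


def clean_text(text: str) -> str:
    """Normalize line endings, then drop remaining control characters."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return _CONTROL_CHARS.sub("", text)


sample = clean_text("a\r\nb\rc\x00d\x07")
```

Normalizing before stripping matters: if the control characters were removed first, a bare CR would vanish instead of becoming a line break.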
He has a filtering PCRE sample, but there’s a mistake in it, and he recommends using iconv to convert from UTF-8 to UTF-8.
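The closest Python analogue I can think of to the UTF-8 to UTF-8 trick is a decode and re-encode round trip (a sketch under my own assumptions, not the code from the slides; `scrub_utf8` is a hypothetical name). Here I chose to silently drop invalid bytes, which may or may not match how he configures iconv.

```python
def scrub_utf8(raw: bytes) -> bytes:
    """Round-trip through a UTF-8 decoder so only valid sequences survive."""
    # errors="ignore" discards bytes that do not form valid UTF-8.
    return raw.decode("utf-8", errors="ignore").encode("utf-8")


dirty = b"ok\xffok"            # 0xFF can never appear in UTF-8
clean = scrub_utf8(dirty)
untouched = scrub_utf8("caf\u00e9".encode("utf-8"))
```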
Discusses HTML and JavaScript filtering. Some common XSS hacks. Promotes lib_filter.
Dealing with email – receiving email is useful, very handy to support mobile blogging, support tracking. Discusses uses of pipes from /etc/aliases.
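For my own notes, here is roughly what the receiving end of such a pipe could look like in Python (a sketch, not Flickr's code). With an aliases entry along the lines of `upload: "|/usr/local/bin/handle_mail.py"` (a hypothetical address and path), the mail server hands the raw message to the script on standard input.

```python
import sys
from email import message_from_bytes
from email.message import Message


def read_piped_message(stream) -> Message:
    """Parse one raw message from a binary stream such as sys.stdin.buffer."""
    return message_from_bytes(stream.read())


def main() -> None:
    """Entry point for the pipe target: read stdin, report the basics."""
    msg = read_piped_message(sys.stdin.buffer)
    print(msg.get("From"), msg.get("Subject"))

# When installed as the pipe target, the script would end with a call to main().
```

Reading bytes rather than text is deliberate: the message's encoding is not known until its headers have been parsed.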
MIME in a nutshell – defines some content types; multipart mail can contain sub-parts. For mail with attachments, the main part is text and the rest is binary.
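To see that structure for yourself, this Python sketch (standard `email` package, not the parser from the talk; `list_parts` and the sample message are my own) walks a multipart message and lists the leaf parts with their content types and filenames.

```python
from email import message_from_bytes
from email.message import EmailMessage
from typing import List, Optional, Tuple


def list_parts(raw: bytes) -> List[Tuple[str, Optional[str]]]:
    """Return (content type, filename) for every non-container part."""
    msg = message_from_bytes(raw)
    return [
        (part.get_content_type(), part.get_filename())
        for part in msg.walk()
        if not part.is_multipart()
    ]


# Build a tiny message with a text body and one binary attachment.
sample = EmailMessage()
sample["Subject"] = "photo"
sample.set_content("hello")
sample.add_attachment(
    b"\x89PNG fake image bytes",
    maintype="image",
    subtype="png",
    filename="photo.png",
)

parts = list_parts(sample.as_bytes())
```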
Mail::mimeDecode – what the Flickr mail parser is based on. Not too broken for their use, heh. application/ms-tnef – that’s MS’s winmail.dat (Transport Neutral Encapsulation Format), only used by Outlook. A packed list of files and metadata. The spec was buried on the MS site, and it’s fairly easy to unpack if you know the spec. Some code somewhere may document or handle it.
Incoming email isn’t necessarily Latin-1 or UTF-8. Forcing a character set is kind of lame. You can find out the intended charset from the email’s Content-Type header. The spec states senders must state the Content-Type unless they’re Latin-1. Fortunately, iconv does the heavy lifting.
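A sketch of the same idea in Python, where the standard codecs play the role of iconv (my own illustration; `decode_text_part` is a made-up name). The Latin-1 fallback for a missing or unknown charset follows these notes and is an assumption on my part, not something I checked against the spec.

```python
from email import message_from_bytes
from email.message import Message


def decode_text_part(part: Message) -> str:
    """Decode a text part using the charset named in its Content-Type."""
    charset = part.get_content_charset() or "latin-1"  # assumed default
    payload = part.get_payload(decode=True)  # undoes base64 / quoted-printable
    try:
        return payload.decode(charset, errors="replace")
    except LookupError:
        # The sender named a charset this Python build does not know.
        return payload.decode("latin-1")


raw = (
    b"Content-Type: text/plain; charset=iso-8859-2\n"
    b"Content-Transfer-Encoding: quoted-printable\n"
    b"\n"
    b"=B3\n"
)
text = decode_text_part(message_from_bytes(raw))
```

Byte 0xB3 is a different letter in ISO-8859-2 than in Latin-1, so forcing one charset on every message would quietly corrupt mail like this.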
Wireless messages suck, because they do special casing but then also often append crap at the end. Attachments arrive as images but with a text/plain MIME type. Wireless carriers also append additional images: their logo and extra spacer GIFs. The worst offenders send links to images instead of actual images. Sometimes slashes are doubled up, so they break automated grabs. Try to capture weird emails and add them to a test suite, along with expected results.
The desired system is a closed test system with easily repeatable regression tests.
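A minimal sketch of what such a regression harness might look like in Python (entirely my own assumption about the layout, not how Flickr does it): each captured message lives in a `.eml` file next to a `.json` file holding the expected parse result, and `run_corpus` reports the files whose output has drifted.

```python
import json
from pathlib import Path
from typing import Callable, List


def run_corpus(corpus_dir: Path, parse: Callable[[bytes], dict]) -> List[str]:
    """Run the parser over every captured email; return names that fail."""
    failures = []
    for eml in sorted(corpus_dir.glob("*.eml")):
        expected_path = eml.with_suffix(".json")  # hypothetical naming scheme
        expected = json.loads(expected_path.read_text(encoding="utf-8"))
        actual = parse(eml.read_bytes())
        if actual != expected:
            failures.append(eml.name)
    return failures
```

Adding a newly captured weird email is then just dropping two files into the directory, which keeps the suite closed and repeatable.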
That completes the pre-lunch session, so we’re off to lunch.