Cliff Hacks Things.

Saturday, April 01, 2006

Resuming exceptions in Mongoose

Since the initial rev of Mongoose's Signal framework (the basis for its exception handling), exceptions have supported resumption. You can send the message #resume to an exception and, if it supports resumption, execution will continue as though the exception had not been signaled.

Generally, of course, resuming after an exception is a bad idea, and most exceptions don't support it. A few do, however; one of these is EncodingException.

EncodingException is signaled by a CharacterDecoder object when it encounters invalid data in its input. For example, the UTF8Decoder will signal an EncodingException if it encounters truncated, invalid, or overlong sequences in the input.

CharacterDecoders agree (in their interface) that if the EncodingException is resumed, they will insert the Unicode replacement character (U+FFFD) in their output and attempt to continue decoding.

So, let's look at two ways of handling a malformed UTF-8 byte sequence, taken from the UTF8Decoder test suite:


| fragment |
(-- Malformed sequence; the initial 0xC2 is missing a byte, and 0xA2 is spurious --)
fragment := #( 0xC2, 0xC2, 0xA1, 0xA2, 0x48, 0x65, 0x6C, 0x6C, 0x6F, 0x21 ).

UTF8Charset decode: fragment.


This code will die with an EncodingException (with a message reading "Truncated form at 1"). Now, we want our decoder to be more tolerant, so let's have it gloss over these issues.


| fragment |
(-- Same byte sequence as before --)
fragment := #( 0xC2, 0xC2, 0xA1, 0xA2, 0x48, 0x65, 0x6C, 0x6C, 0x6F, 0x21 ).

| characters |
[
    characters := UTF8Charset decode: fragment.
] on: EncodingException do: [ :ex |
    (-- Replace invalid sequence with U+FFFD and attempt to continue. --)
    ex resume.
].

characters do: [ :c |
    Console writeCharacter: c.
].
Console writeCharacter: Character newline.


So, rather than dying, this fragment prints:

�¡�Hello!
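For comparison, Python's built-in UTF-8 codec treats this byte sequence in much the same way. This is an analogy in another language, not Mongoose code: since Python exceptions can't be resumed, the policy is picked up front through the errors argument instead of inside a handler.

```python
# The same malformed byte sequence as in the Mongoose fragments above.
fragment = bytes([0xC2, 0xC2, 0xA1, 0xA2, 0x48, 0x65, 0x6C, 0x6C, 0x6F, 0x21])

# Strict decoding (the default) raises, like the first Mongoose fragment.
try:
    fragment.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError as ex:
    strict_failed = True
    print("strict decoding failed:", ex.reason, "at byte", ex.start)

# errors="replace" substitutes U+FFFD for each invalid sequence and
# keeps going, like resuming the EncodingException.
tolerant = fragment.decode("utf-8", errors="replace")
print(tolerant)
```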


In an upcoming revision, you'll be able to provide a character or character sequence to use in place of invalid input, and have some control over overlong encodings specifically.

(Readers might note some subtle syntax changes in the code fragments above. The Mongoose syntax is evolving as we build out the standard libraries; stay tuned.)

Update: god, I love this language. The enhancements are in place, in under a dozen lines of code.

If you catch an EncodingException ex, your options are as follows:

  • ex resume will resume decoding. If the exception was signaled due to an invalid byte sequence, that sequence becomes the Unicode REPLACEMENT CHARACTER (U+FFFD) in the output. If the exception was signaled due to an overlong encoding, the sequence is decoded as if it were valid. (This replicates the behavior of most (broken) UTF-8 decoders.)

  • ex resumeAndSkip will resume decoding; whatever input caused the exception will simply be ignored, and decoding will resume after it.

  • ex resumeAndSubstitute: someCharacter does exactly what it sounds like: substitutes a character of your choosing for the invalid input. So, to be like most Unix libraries, you can substitute '?'.
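The three options have rough counterparts in Python's codec error handlers, sketched below as an analogy rather than as Mongoose code. The "replace" and "ignore" handlers are built in; the handler name "substitute-question-mark" and the function behind it are made up for this example. One difference worth knowing: Python's UTF-8 codec rejects overlong encodings outright, so nothing here corresponds to decoding them as if they were valid.

```python
import codecs

# The malformed byte sequence from the fragments earlier in the post.
fragment = bytes([0xC2, 0xC2, 0xA1, 0xA2, 0x48, 0x65, 0x6C, 0x6C, 0x6F, 0x21])


def substitute_question_mark(ex):
    """Hypothetical handler: swap each invalid sequence for '?'."""
    if not isinstance(ex, UnicodeDecodeError):
        raise ex
    # Return the replacement text and the position to resume from.
    return ("?", ex.end)


codecs.register_error("substitute-question-mark", substitute_question_mark)

# Roughly "ex resume": invalid input becomes U+FFFD.
replaced = fragment.decode("utf-8", errors="replace")

# Roughly "ex resumeAndSkip": invalid input is dropped.
skipped = fragment.decode("utf-8", errors="ignore")

# Roughly "ex resumeAndSubstitute: '?'".
substituted = fragment.decode("utf-8", errors="substitute-question-mark")

for label, text in [("replace", replaced), ("skip", skipped), ("substitute", substituted)]:
    print(label, "->", text)
```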