<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
<div class="">> However, looking back and mulling it over, I think I may now have thought of a way to get cleanup to work by incrementally overwriting invalid bytes with "?" and then revalidating. That would mean you'd get more than one "?" for a multi-byte bad character, but that is not necessarily a problem (it is invalid data, so how many characters it "really" represents is undefined).<br>
</div></blockquote><div>UTF-8 clearly specifies how codepoints are to be encoded:</div><div><a href="http://en.wikipedia.org/wiki/UTF-8">http://en.wikipedia.org/wiki/UTF-8</a><br></div><div>As such, a workable solution could be:</div>
<div>- replace the "red" bytes C0/C1/F5..FF by a single question mark</div><div>- replace invalid codepoints by a single question mark</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
Following a recent discussion on the SQLite mailing list, perhaps we should replace invalid codepoints with random bytes instead of "?", in order to make corpus admins more aware of the fact that bad and unpredictable things happen if you work on invalid data!<br>
</blockquote><div>To me, this seems straight out of the "poke the user in the eye" school of usability.</div><div>Even something blatant like "[INVALID BYTE REMOVED]" and "[INVALID CODEPOINT REMOVED]" would</div>
<div>make these cases easy to detect. I think that making unpredictable and bad things happen will not make corpus admins</div><div>any more likely to have valid data in the first place (especially when that data is pulled from random webpages), but will instead give</div>
<div>them yet another unpredictable and bad problem that pops up randomly, especially if, as Roland pointed out, Unicode libraries</div><div>differ in their definition of "valid codepoint".</div><div><br></div><div>
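<div>To make the visible-marker idea concrete, here is a minimal sketch in Python. It is not the cleanup code under discussion. The handler name "mark_invalid" and the function clean_utf8 are made up for illustration, and it relies on Python's own UTF-8 decoder to decide what counts as invalid, so surrogates and out-of-range codepoints are reported as bad bytes rather than as a separate "invalid codepoint" case.</div>
<pre>
```python
import codecs

# Hypothetical marker text, taken from the suggestion in this thread.
MARKER = "[INVALID BYTE REMOVED]"


def mark_invalid(err):
    """Codec error handler: swap each undecodable byte run for MARKER."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    # Resume decoding right after the bytes the decoder complained about.
    return (MARKER, err.end)


# "mark_invalid" is an illustrative name, not a standard handler.
codecs.register_error("mark_invalid", mark_invalid)


def clean_utf8(raw):
    """Decode bytes as UTF-8, leaving a visible marker for bad input."""
    return raw.decode("utf-8", errors="mark_invalid")


print(clean_utf8(b"a\xffb"))  # a[INVALID BYTE REMOVED]b
```
</pre>
<div>Setting MARKER to "?" gives the question-mark variant. As with the incremental overwrite approach, one bad multi-byte sequence can produce several markers, because the decoder may report each offending byte separately.</div>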
Best wishes,</div><div>Yannick</div><div><br></div><div><br></div></div></div></div>