[CWB] [cwb:bugs] #70 cwb-encode breaks UTF-8 when truncating long tokens

Andrew Hardie andrewhardie at users.sourceforge.net
Wed Dec 4 06:59:48 CET 2019


- **status**: open --> closed-fixed
- **Comment**:

fixed in commit 1373



---

**[bugs:#70] cwb-encode breaks UTF-8 when truncating long tokens**

**Status:** closed-fixed
**Group:** TODO-3.5
**Created:** Tue Dec 03, 2019 02:03 PM UTC by Stefan Evert
**Last Updated:** Tue Dec 03, 2019 02:03 PM UTC
**Owner:** Andrew Hardie


cwb-encode truncates tokens exceeding the length limit CL_MAX_LINE_LENGTH, but does so by cutting off at the last possible byte position and adding a "$" marker (around line #1520 in the source code). This can split a multi-byte UTF-8 sequence in the middle of a codepoint, leaving an invalid token. Note that UTF-8 validation is carried out on the entire input line before truncation, so it does not detect broken UTF-8 introduced later.

Proposed solution: Truncation should find the last complete UTF-8 codepoint within the length limit and truncate there.
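
A minimal sketch of the proposed truncation, for illustration only. It is not the code from the actual fix. The function names `utf8_truncation_point` and `truncate_token` are hypothetical and do not come from the CWB source. The sketch assumes the token has already passed UTF-8 validation, so every byte of the form 10xxxxxx is a continuation byte belonging to the preceding lead byte.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of CWB): return the largest length
 * <= max_bytes at which s can be cut without splitting a codepoint.
 * Assumes s holds len bytes of valid UTF-8. */
static size_t utf8_truncation_point(const char *s, size_t len, size_t max_bytes)
{
  size_t cut;

  if (len <= max_bytes)
    return len;                 /* nothing to truncate */

  cut = max_bytes;
  /* s[cut] is the first byte that would be dropped. If it is a
   * continuation byte (10xxxxxx), the cut falls inside a multi-byte
   * sequence, so step back to that sequence's lead byte. */
  while (cut > 0 && ((unsigned char)s[cut] & 0xC0) == 0x80)
    cut--;
  return cut;
}

/* Hypothetical wrapper: shorten token in place to at most max_bytes
 * bytes, the last of which is the "$" truncation marker. */
static void truncate_token(char *token, size_t max_bytes)
{
  size_t len = strlen(token);

  if (len > max_bytes && max_bytes >= 1) {
    /* reserve one byte for the "$" marker */
    size_t cut = utf8_truncation_point(token, len, max_bytes - 1);
    token[cut] = '$';
    token[cut + 1] = '\0';
  }
}
```

With a limit of 4 bytes, the token "abécd" (where "é" occupies two bytes) would become "ab$" rather than "ab", a dangling lead byte, and "$". The result may therefore be up to three bytes shorter than the limit, which seems an acceptable price for keeping the token valid.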

