[CWB] CQPweb error creating frequency tables (Invalid utf8 character string)

Hardie, Andrew a.hardie at lancaster.ac.uk
Thu Nov 25 10:30:21 CET 2021


The switch from MySQL’s broken “utf8” character encoding to the not-broken “utf8mb4” was one of the main things involved in the massive 3.2 to 3.3 version move.
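
For anyone wondering why the old "utf8" trips over these characters: as far as I know, MySQL's "utf8" is an alias for utf8mb3 and stores at most three bytes per character, while any code point above U+FFFF needs four bytes in UTF-8. A minimal Python sketch of that distinction (the function name utf8_length is made up for illustration and is not part of CQPweb):

```python
def utf8_length(ch):
    """Return the number of bytes needed to encode one character in UTF-8."""
    return len(ch.encode("utf-8"))

# One sample each of a 1-, 2-, 3- and 4-byte character.
# The last one (an emoji) lies outside the BMP.
for ch in ["a", "\u00e9", "\u20ac", "\U0001F600"]:
    n = utf8_length(ch)
    print(f"U+{ord(ch):04X}: {n} byte(s), needs a 4-byte encoding: {n > 3}")
```

Only characters that come out at four bytes are affected, so purely BMP corpora should not notice the difference.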

The database changes that followed from that made v3.3 pretty buggy at first. However, it’s now reasonably stable. I would say that if you are not a technically naïve user, you are now safe to go to v3.3.11 … and have your non-BMP characters behave themselves.

One of the things that’s now non-buggy (or, at least, less buggy) is the user corpus system. Documentation is still lacking, alas. If you want to see what it’s like, it’s going to be rolled out on the Lancaster server (https://cqpweb.lancs.ac.uk) incrementally over the next three weeks (ish); feel free to have a play and let me know if you see issues.

best

Andrew.


From: cwb-bounces at sslmit.unibo.it <cwb-bounces at sslmit.unibo.it> On Behalf Of Thilo Wiertz
Sent: 18 November 2021 20:13
To: CWBdev Mailing List <cwb at sslmit.unibo.it>
Subject: Re: [CWB] CQPweb error creating frequency tables (Invalid utf8 character string)

Thanks Jörg, this saved my week!
Thanks Stephanie, while not needed for this corpus, that patch is likely to become very handy soon!

If anyone is interested in the solution, a simple regex does the trick:

import re
re_pattern = re.compile(u'[^\u0000-\uFFFF]', re.UNICODE)
def replace_non_bmp(str_var_with_wicked_chars):  # function name is illustrative
    return re_pattern.sub(u'\uFFFD', str_var_with_wicked_chars)

(Replaces all Unicode code points outside the range 0-FFFF with �)

Best,
Thilo


On 18.11.2021 at 20:04, Stefan Evert <stefanML at collocations.de> wrote:

The CQPweb v3.2 server running on my laptop has the following patch around line #347 of file lib/sql-lib.php, in function do_sql_infile_query():


$sql = "{$Config->mysql_LOAD_DATA_INFILE_command} '$filepath' INTO TABLE `$table`";

$sql .= " CHARACTER SET utf8mb4"; /* PATCH to handle characters outside BMP */
if ($no_escapes)
    $sql .= ' FIELDS ESCAPED BY \'\'';

return do_sql_query($sql);

This has helped me get Twitter corpora into CQPweb, but I don't know if it is sufficient for your data.

Best,
Stephanie



On 18 Nov 2021, at 17:47, Jörg Knappen <j.knappen at mx.uni-saarland.de> wrote:

This is a known shortcoming of the database (MySQL/MariaDB). It can only handle characters in the Basic Multilingual Plane (BMP). Emojis are outside that plane, with code points >= 0x10000, so they cause this error. (I suspect emojis because of the provenance of your data; that area also contains additional Chinese characters, mathematical and alchemical symbols, and many historic scripts.)

You can write a script searching for all characters with such codes, and you can try to replace them with some replacement strings to carry on.
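
A script along those lines could look roughly like the following Python sketch. It counts every character with a code point above U+FFFF, reports each one with its Unicode name, and substitutes a replacement string. All function names and the sample text are invented for illustration; how you read and write your actual corpus files is left out.

```python
import unicodedata
from collections import Counter

def find_non_bmp(text):
    """Count every character in `text` whose code point is above U+FFFF."""
    return Counter(ch for ch in text if ord(ch) > 0xFFFF)

def report(text):
    """Return one line per distinct non-BMP character: code point, name, count."""
    lines = []
    for ch, n in sorted(find_non_bmp(text).items()):
        name = unicodedata.name(ch, "<unnamed>")
        lines.append(f"U+{ord(ch):04X} {name}: {n}")
    return lines

def replace_non_bmp(text, replacement="\uFFFD"):
    """Replace each non-BMP character with `replacement` (default: U+FFFD)."""
    return "".join(replacement if ord(ch) > 0xFFFF else ch for ch in text)

# Hypothetical sample: two emojis and one musical symbol, all outside the BMP.
sample = "good morning \U0001F600\U0001F600 \U0001D11E"
for line in report(sample):
    print(line)
print(replace_non_bmp(sample, "[NON-BMP]"))
```

Running the report first shows which characters would be lost, which helps in deciding whether a single placeholder is enough or whether some of them deserve a more descriptive replacement string.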


