[CWB] offline-freqlists.php: Invalid utf8 character string: ''
Hardie, Andrew
a.hardie at lancaster.ac.uk
Thu Jun 25 20:47:27 CEST 2020
OK, if the originally reported error (MySQL error #1300, as per the backtrace) still happens with 3.2.42, there are three possibilities.
1. There is a UTF-8 character from outside the range U+0000-U+FFFF somewhere in the corpus, e.g. an emoji.
UTF-8 corpora cannot contain characters >= U+10000 in CQPweb 3.2 (due to a historic limitation in MySQL, getting past which was the main goal of v3.3).
OR
2. There is a character at or above U+0080 somewhere in the corpus that is encoded as Latin-1; in UTF-8 its single byte becomes a stray continuation byte or another invalid byte sequence (not allowed).
Depending on which version of core CWB you are using, it is likely to check input lines for UTF-8 validity at encoding time (versions that don't are now pretty old). If you have a CWB version that does check UTF-8, then it's not this; it's (1) or (3).
OR
3. Some other disallowed byte or character somewhere in the corpus.
So, in short, the solution is to go through the original vertical files, and identify / replace any instances of invalid UTF-8 or characters above U+FFFF (e.g. with U+FFFD, the "replacement" char). Then start over from scratch.
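As a rough illustration of that clean-up step, here is a minimal Python sketch. It assumes the vertical files are plain UTF-8 text that can be processed line by line; the names scan_line and clean_file are made up for this example and are not part of CWB or CQPweb. It only covers cases (1) and (2): bytes that are not valid UTF-8 and code points above U+FFFF are both replaced with U+FFFD. Anything under case (3) that is valid UTF-8 within the BMP would need a separate check.

```python
import sys

REPLACEMENT = "\ufffd"  # U+FFFD, the Unicode replacement character


def scan_line(raw: bytes):
    """Check one raw line of a vertical file (hypothetical helper).

    Returns (problem, cleaned_text), where problem is None,
    "invalid-utf8", or "non-bmp".
    """
    try:
        text = raw.decode("utf-8")
        problem = None
    except UnicodeDecodeError:
        # Invalid byte sequences (e.g. stray Latin-1 bytes) become U+FFFD.
        text = raw.decode("utf-8", errors="replace")
        problem = "invalid-utf8"
    if any(ord(ch) > 0xFFFF for ch in text):
        # Code points above U+FFFF (e.g. emoji) also become U+FFFD.
        problem = problem or "non-bmp"
        text = "".join(REPLACEMENT if ord(ch) > 0xFFFF else ch for ch in text)
    return problem, text


def clean_file(src_path, dst_path):
    """Write a cleaned copy of src_path to dst_path; return problem lines."""
    problems = []
    with open(src_path, "rb") as src, \
            open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for lineno, raw in enumerate(src, start=1):
            problem, text = scan_line(raw)
            if problem:
                problems.append((lineno, problem))
            dst.write(text)
    return problems


if __name__ == "__main__":
    # Usage (assumed): python clean_vrt.py input.vrt output.vrt
    if len(sys.argv) == 3:
        for lineno, problem in clean_file(sys.argv[1], sys.argv[2]):
            print(f"{sys.argv[1]}:{lineno}: {problem}")
```

The reported line numbers should make it possible to inspect each hit in the original file before re-encoding the cleaned copy from scratch.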
best
Andrew.
-----Original Message-----
From: cwb-bounces at sslmit.unibo.it <cwb-bounces at sslmit.unibo.it> On Behalf Of Jörg Knappen
Sent: 25 June 2020 16:27
To: Open source development of the Corpus WorkBench <cwb at sslmit.unibo.it>
Subject: Re: [CWB] offline-freqlists.php: Invalid utf8 character string: ''
Hello Stefan,
I just upgraded to 3.2.42; can you try it again? I can confirm that the
problem was still present with 3.2.40.
Jörg Knappen
On 2019-11-29 20:06, Stefan Fischer wrote:
> Dear all,
>
> We are trying to import a CWB-encoded corpus into CQPweb. The source
> texts are in UTF-8 and queries for non-ASCII words ([word="dieſ"])
> work both in the CWB and the CQPweb version. Unfortunately, we cannot
> complete the corpus setup as offline-freqlists.php crashes with the
> PHP backtrace below.
>
> I would be grateful for any advice.
>
> Thanks in advance,
> Stefan
>
> ----
>
> PHP debugging backtrace
> =======================
> array(4) {
> [1]=>
> array(4) {
> ["file"]=>
> string(40) "/var/www/html/cqpweb/lib/library.inc.php"
> ["line"]=>
> int(299)
> ["function"]=>
> string(20) "exiterror_mysqlquery"
> ["args"]=>
> array(3) {
> [0]=>
> int(1300)
> [1]=>
> string(33) "Invalid utf8 character string: ''"
> [2]=>
> string(227) "LOAD DATA LOCAL INFILE
> '/data2/cqpweb/cache/______tempfreq_dta_17_09_web.tbl' INTO TABLE
> `__tempfreq_dta_17_09_web` FIELDS ESCAPED BY ''
> /* from User: cqpwebAdmin | Function: corpus_make_freqtables() |
> 2019-Nov-26 04:34:59 */"
> }
> }
> [2]=>
> array(4) {
> ["file"]=>
> string(40) "/var/www/html/cqpweb/lib/library.inc.php"
> ["line"]=>
> int(423)
> ["function"]=>
> string(14) "do_mysql_query"
> ["args"]=>
> array(1) {
> [0]=>
> &string(227) "LOAD DATA LOCAL INFILE
> '/data2/cqpweb/cache/______tempfreq_dta_17_09_web.tbl' INTO TABLE
> `__tempfreq_dta_17_09_web` FIELDS ESCAPED BY ''
> /* from User: cqpwebAdmin | Function: corpus_make_freqtables() |
> 2019-Nov-26 04:34:59 */"
> }
> }
> [3]=>
> array(4) {
> ["file"]=>
> string(42) "/var/www/html/cqpweb/lib/freqtable.inc.php"
> ["line"]=>
> int(124)
> ["function"]=>
> string(21) "do_mysql_infile_query"
> ["args"]=>
> array(3) {
> [0]=>
> string(24) "__tempfreq_dta_17_09_web"
> [1]=>
> string(52) "/data2/cqpweb/cache/______tempfreq_dta_17_09_web.tbl"
> [2]=>
> bool(true)
> }
> }
> [4]=>
> array(4) {
> ["file"]=>
> string(46) "/var/www/html/cqpweb/bin/offline-freqlists.php"
> ["line"]=>
> int(133)
> ["function"]=>
> string(22) "corpus_make_freqtables"
> ["args"]=>
> array(1) {
> [0]=>
> string(13) "dta_17_09_web"
> }
> }
> }
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://liste.sslmit.unibo.it/mailman/listinfo/cwb