[CWB] CQPweb error creating frequency tables (Invalid utf8 character string)
Jörg Knappen
j.knappen at mx.uni-saarland.de
Thu Nov 18 17:47:42 CET 2021
This is a known shortcoming of the database (MySQL/MariaDB): its utf8
character set can only handle characters in the Basic Multilingual Plane
(BMP). Emojis lie outside that plane, with code points >= 0x10000, so
they cause this error. I suspect emojis because of the provenance of
your data, but the same area also holds additional Chinese characters,
mathematical and alchemical symbols, and many historic scripts.
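As a minimal illustration in Python (the sample string is made up): a
character outside the BMP is still perfectly valid UTF-8, so validity
checks will not flag it, but it encodes to four bytes, whereas the
database's utf8 (utf8mb3) character set stores at most three per
character.

```python
# Hypothetical sample: one BMP character (U+00E4) and one emoji (U+1F600).
sample = "\u00e4\U0001F600"

def is_bmp(ch):
    """True if the character lies in the Basic Multilingual Plane."""
    return ord(ch) <= 0xFFFF

for ch in sample:
    plane = "BMP" if is_bmp(ch) else "non-BMP"
    print(f"U+{ord(ch):04X}: {len(ch.encode('utf-8'))} bytes in UTF-8, {plane}")

# A UTF-8 round trip with errors ignored leaves the emoji in place,
# because there is nothing invalid about it.
assert sample.encode("utf-8").decode("utf-8", "ignore") == sample
```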
You can write a script searching for all characters with such codes, and
you can try to replace them with some replacement strings to carry on.
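Such a script could look roughly like this (a sketch in Python, not
tested against a real CQPweb import; the function names and the choice
of U+FFFD as the replacement are arbitrary and only for illustration).
It reports the line number of each offending character, which should
help to locate the affected texts, and writes a cleaned copy.

```python
import re

# Matches any code point above the BMP (U+10000 to U+10FFFF).
NON_BMP = re.compile("[\U00010000-\U0010FFFF]")

def find_non_bmp(lines):
    """Yield (line_number, character) for every non-BMP character."""
    for lineno, line in enumerate(lines, start=1):
        for match in NON_BMP.finditer(line):
            yield lineno, match.group()

def replace_non_bmp(text, replacement="\uFFFD"):
    """Replace each non-BMP character; the default replacement is an assumption."""
    return NON_BMP.sub(replacement, text)

def clean_file(src, dst, replacement="\uFFFD"):
    """Report non-BMP characters in src and write a cleaned copy to dst."""
    with open(src, encoding="utf-8") as fin, \
         open(dst, "w", encoding="utf-8") as fout:
        for lineno, line in enumerate(fin, start=1):
            for match in NON_BMP.finditer(line):
                print(f"line {lineno}: U+{ord(match.group()):04X}")
            fout.write(replace_non_bmp(line, replacement))
```

For example, clean_file("file.xml", "newfile.xml") would be run on the
XML file before it goes through the tagger and into CQPweb.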
--Jörg Knappen
On 2021-11-18 16:36, Thilo Wiertz wrote:
> Dear all,
>
> I fear I might be lost in encoding hell: I am trying to install a
> corpus on CQPweb, but get the following error message when creating
> word and annotation frequency tables (the last step of generating
> frequency lists):
>
>> An SQL query did not run successfully!
>>
>> Original query: LOAD DATA LOCAL INFILE
>> '/var/cqpdata/temp/______tempfreq_topagrar_v4.tbl' INTO TABLE
>> `__tempfreq_topagrar_v4` FIELDS ESCAPED BY '' /* from User: thilo |
>> Function: corpus_make_freqtables() | 2021-Nov-18 16:05 */
>>
>> Error # 1300: Invalid utf8 character string: ''
> The corpus contains texts parsed from a web blog. I write an XML file
> using Python lxml and run the result through TreeTagger before
> installing it on CQPweb. It sounds like an encoding problem, although I
> am doing my best to remove anything potentially broken in Python (e.g.
> running all strings through bytes(string, 'utf-8').decode('utf-8',
> 'ignore')).
>
> Checking for invalid UTF-8 characters in the input xml-file using grep
> (grep -axv '.*' file.txt) yields no results. Converting the file with
> iconv -f utf-8 -t utf-8 -c file.xml > newfile.xml makes no difference.
>
> Any suggestion how to solve or narrow down the problem (e.g. finding
> the line or text id causing the issue)?
>
> Thanks a lot!
> Thilo
>
> Server Setup:
> OS: Ubuntu 18.04
> DB: MariaDB 10.1
> CQPweb v3.2.43
> PHP: 7.2
>
> PHP debugging backtrace:
>
>> array(6) {
>>   [1]=>
>>   array(4) {
>>     ["file"]=>
>>     string(43) "/var/www/html/diskurs/lib/exiterror-lib.php"
>>     ["line"]=>
>>     int(367)
>>     ["function"]=>
>>     string(9) "exiterror"
>>     ["args"]=>
>>     array(3) {
>>       [0]=>
>>       array(3) {
>>         [0]=>
>>         string(38) "An SQL query did not run successfully!"
>>         [1]=>
>>         string(232) "Original query:
>>
>> LOAD DATA LOCAL INFILE
>> '/var/cqpdata/temp/______tempfreq_topagrar_v4.tbl' INTO TABLE
>> `__tempfreq_topagrar_v4` FIELDS ESCAPED BY ''
>>
>> /* from User: thilo | Function: corpus_make_freqtables() | 2021-Nov-18
>> 16:05 */
>>
>> "
>>         [2]=>
>>         string(48) "Error # 1300: Invalid utf8 character string: '' "
>>       }
>>       [1]=>
>>       NULL
>>       [2]=>
>>       NULL
>>     }
>>   }
>>   [2]=>
>>   array(4) {
>>     ["file"]=>
>>     string(37) "/var/www/html/diskurs/lib/sql-lib.php"
>>     ["line"]=>
>>     int(216)
>>     ["function"]=>
>>     string(18) "exiterror_sqlquery"
>>     ["args"]=>
>>     array(3) {
>>       [0]=>
>>       int(1300)
>>       [1]=>
>>       string(33) "Invalid utf8 character string: ''"
>>       [2]=>
>>       string(212) "LOAD DATA LOCAL INFILE
>> '/var/cqpdata/temp/______tempfreq_topagrar_v4.tbl' INTO TABLE
>> `__tempfreq_topagrar_v4` FIELDS ESCAPED BY ''
>>
>> /* from User: thilo | Function: corpus_make_freqtables() | 2021-Nov-18
>> 16:05 */"
>>     }
>>   }
>>   [3]=>
>>   array(4) {
>>     ["file"]=>
>>     string(37) "/var/www/html/diskurs/lib/sql-lib.php"
>>     ["line"]=>
>>     int(350)
>>     ["function"]=>
>>     string(12) "do_sql_query"
>>     ["args"]=>
>>     array(1) {
>>       [0]=>
>>       string(212) "LOAD DATA LOCAL INFILE
>> '/var/cqpdata/temp/______tempfreq_topagrar_v4.tbl' INTO TABLE
>> `__tempfreq_topagrar_v4` FIELDS ESCAPED BY ''
>>
>> /* from User: thilo | Function: corpus_make_freqtables() | 2021-Nov-18
>> 16:05 */"
>>     }
>>   }
>>   [4]=>
>>   array(4) {
>>     ["file"]=>
>>     string(43) "/var/www/html/diskurs/lib/freqtable-lib.php"
>>     ["line"]=>
>>     int(127)
>>     ["function"]=>
>>     string(19) "do_sql_infile_query"
>>     ["args"]=>
>>     array(3) {
>>       [0]=>
>>       string(22) "__tempfreq_topagrar_v4"
>>       [1]=>
>>       string(48) "/var/cqpdata/temp/______tempfreq_topagrar_v4.tbl"
>>       [2]=>
>>       bool(true)
>>     }
>>   }
>>   [5]=>
>>   array(4) {
>>     ["file"]=>
>>     string(37) "/var/www/html/diskurs/lib/execute.php"
>>     ["line"]=>
>>     int(196)
>>     ["function"]=>
>>     string(22) "corpus_make_freqtables"
>>     ["args"]=>
>>     array(1) {
>>       [0]=>
>>       string(11) "topagrar_v4"
>>     }
>>   }
>>   [6]=>
>>   array(4) {
>>     ["file"]=>
>>     string(37) "/var/www/html/diskurs/exe/execute.php"
>>     ["line"]=>
>>     int(1)
>>     ["args"]=>
>>     array(1) {
>>       [0]=>
>>       string(37) "/var/www/html/diskurs/lib/execute.php"
>>     }
>>     ["function"]=>
>>     string(7) "require"
>>   }
>> }