[CWB] CQPweb error creating frequency tables (Invalid utf8 character string)
Thilo Wiertz
thilo.wiertz at geographie.uni-freiburg.de
Thu Nov 18 16:36:01 CET 2021
Dear all,
I fear I might be lost in encoding hell: I am trying to install a corpus on CQPweb, but get the following error message when creating word and annotation frequency tables (the last step of generating frequency lists):
An SQL query did not run successfully!
Original query: LOAD DATA LOCAL INFILE '/var/cqpdata/temp/______tempfreq_topagrar_v4.tbl' INTO TABLE `__tempfreq_topagrar_v4` FIELDS ESCAPED BY '' /* from User: thilo | Function: corpus_make_freqtables() | 2021-Nov-18 16:05 */
Error # 1300: Invalid utf8 character string: ''
The corpus contains texts parsed from a web blog. I write an XML file using Python's lxml and run the result through TreeTagger before installing it on CQPweb. It sounds like an encoding problem, although I am doing my best to remove anything potentially broken in Python (e.g. running all strings through bytes(string, 'utf-8').decode('utf-8', 'ignore')).
Checking for invalid UTF-8 characters in the input XML file using grep (grep -axv '.*' file.txt) yields no results. Converting the file with iconv -f utf-8 -t utf-8 -c file.xml > newfile.xml makes no difference.
Any suggestions on how to solve or narrow down the problem (e.g. finding the line or text ID causing the issue)?
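For reference, a minimal sketch of a stricter version of that cleaning step. The encode/decode round trip leaves any valid Python string unchanged (it only drops lone surrogates), so it cannot remove characters that are valid UTF-8 but unwelcome further down the pipeline. The helper name `clean_text` and the choice of what to drop are assumptions for illustration: control characters, and characters above U+FFFF, which need four bytes in UTF-8 and do not fit MariaDB's three-byte `utf8` character set. Whether either of these is the actual cause here is not confirmed.

```python
import unicodedata


def clean_text(text: str) -> str:
    """Hypothetical helper: drop characters that may upset downstream tools."""
    # Lone surrogates cannot be encoded as UTF-8; errors="ignore" drops them.
    text = text.encode("utf-8", "ignore").decode("utf-8")
    kept = []
    for ch in text:
        if ch in "\t\n\r":
            # Keep ordinary whitespace even though it is in category Cc.
            kept.append(ch)
        elif unicodedata.category(ch) == "Cc":
            # Drop all other control characters (e.g. NUL).
            continue
        elif ord(ch) > 0xFFFF:
            # Drop characters outside the Basic Multilingual Plane
            # (4 bytes in UTF-8, e.g. emoji).
            continue
        else:
            kept.append(ch)
    return "".join(kept)
```

Dropping emoji and similar characters changes the corpus text, so replacing them with a placeholder token may be preferable depending on the analysis.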
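One way to narrow this down to line numbers, sketched in Python under some assumptions: the script reads a file as raw bytes and reports lines that fail strict UTF-8 decoding, contain a character above U+FFFF, or contain a control character. The function name `find_suspicious_lines` and the three checks are illustrative choices, not a confirmed diagnosis. It could be run on the input XML file or, if it still exists after the failure, on the temporary .tbl file named in the error message.

```python
import sys
import unicodedata


def find_suspicious_lines(path):
    """Return (line_number, reason) pairs for lines worth inspecting."""
    hits = []
    with open(path, "rb") as handle:
        for number, raw in enumerate(handle, start=1):
            try:
                line = raw.decode("utf-8")  # strict decoding
            except UnicodeDecodeError as err:
                hits.append((number, f"invalid UTF-8 at byte {err.start}"))
                continue
            for ch in line:
                if ord(ch) > 0xFFFF:
                    # Needs 4 bytes in UTF-8 (e.g. emoji).
                    hits.append((number, f"4-byte character U+{ord(ch):04X}"))
                    break
                if unicodedata.category(ch) == "Cc" and ch not in "\t\n\r":
                    hits.append((number, f"control character U+{ord(ch):04X}"))
                    break
    return hits


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python find_suspicious.py FILE
    for number, reason in find_suspicious_lines(sys.argv[1]):
        print(f"{number}: {reason}")
```

Only the first problem per line is reported, which keeps the output short enough to map line numbers back to text IDs by hand.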
Thanks a lot!
Thilo
Server Setup:
OS: Ubuntu 18.04
DB: MariaDB 10.1
CQPweb v3.2.43
PHP: 7.2
PHP debugging backtrace:
array(6) {
[1]=>
array(4) {
["file"]=>
string(43) "/var/www/html/diskurs/lib/exiterror-lib.php"
["line"]=>
int(367)
["function"]=>
string(9) "exiterror"
["args"]=>
array(3) {
[0]=>
array(3) {
[0]=>
string(38) "An SQL query did not run successfully!"
[1]=>
string(232) "Original query:
LOAD DATA LOCAL INFILE '/var/cqpdata/temp/______tempfreq_topagrar_v4.tbl' INTO TABLE `__tempfreq_topagrar_v4` FIELDS ESCAPED BY ''
/* from User: thilo | Function: corpus_make_freqtables() | 2021-Nov-18 16:05 */
"
[2]=>
string(48) "Error # 1300: Invalid utf8 character string: '' "
}
[1]=>
NULL
[2]=>
NULL
}
}
[2]=>
array(4) {
["file"]=>
string(37) "/var/www/html/diskurs/lib/sql-lib.php"
["line"]=>
int(216)
["function"]=>
string(18) "exiterror_sqlquery"
["args"]=>
array(3) {
[0]=>
int(1300)
[1]=>
string(33) "Invalid utf8 character string: ''"
[2]=>
string(212) "LOAD DATA LOCAL INFILE '/var/cqpdata/temp/______tempfreq_topagrar_v4.tbl' INTO TABLE `__tempfreq_topagrar_v4` FIELDS ESCAPED BY ''
/* from User: thilo | Function: corpus_make_freqtables() | 2021-Nov-18 16:05 */"
}
}
[3]=>
array(4) {
["file"]=>
string(37) "/var/www/html/diskurs/lib/sql-lib.php"
["line"]=>
int(350)
["function"]=>
string(12) "do_sql_query"
["args"]=>
array(1) {
[0]=>
string(212) "LOAD DATA LOCAL INFILE '/var/cqpdata/temp/______tempfreq_topagrar_v4.tbl' INTO TABLE `__tempfreq_topagrar_v4` FIELDS ESCAPED BY ''
/* from User: thilo | Function: corpus_make_freqtables() | 2021-Nov-18 16:05 */"
}
}
[4]=>
array(4) {
["file"]=>
string(43) "/var/www/html/diskurs/lib/freqtable-lib.php"
["line"]=>
int(127)
["function"]=>
string(19) "do_sql_infile_query"
["args"]=>
array(3) {
[0]=>
string(22) "__tempfreq_topagrar_v4"
[1]=>
string(48) "/var/cqpdata/temp/______tempfreq_topagrar_v4.tbl"
[2]=>
bool(true)
}
}
[5]=>
array(4) {
["file"]=>
string(37) "/var/www/html/diskurs/lib/execute.php"
["line"]=>
int(196)
["function"]=>
string(22) "corpus_make_freqtables"
["args"]=>
array(1) {
[0]=>
string(11) "topagrar_v4"
}
}
[6]=>
array(4) {
["file"]=>
string(37) "/var/www/html/diskurs/exe/execute.php"
["line"]=>
int(1)
["args"]=>
array(1) {
[0]=>
string(37) "/var/www/html/diskurs/lib/execute.php"
}
["function"]=>
string(7) "require"
}
}