[CWB] CQPweb error creating frequency tables (Invalid utf8 character string)

Thilo Wiertz thilo.wiertz at geographie.uni-freiburg.de
Thu Nov 18 16:36:01 CET 2021


Dear all,

I fear I am lost in encoding hell. I am trying to install a corpus on CQPweb, but I get the following error message when creating the word and annotation frequency tables (the last step of generating frequency lists):

An SQL query did not run successfully!

Original query: LOAD DATA LOCAL INFILE '/var/cqpdata/temp/______tempfreq_topagrar_v4.tbl' INTO TABLE `__tempfreq_topagrar_v4` FIELDS ESCAPED BY '' /* from User: thilo | Function: corpus_make_freqtables() | 2021-Nov-18 16:05 */

Error # 1300: Invalid utf8 character string: ''

The corpus contains texts parsed from a web blog. I write an XML file using Python's lxml and run the result through TreeTagger before installing it on CQPweb. It looks like an encoding problem, although I am doing my best to remove anything potentially broken in Python (e.g. running all strings through bytes(string, 'utf-8').decode('utf-8', 'ignore')).
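A note on that round trip: because it starts from a Python str, encoding to UTF-8 always yields valid bytes, so the 'ignore' handler on decoding has nothing to drop and the string comes back unchanged. A stricter filter might look like the sketch below. It rests on an assumption that is not confirmed for this installation: that the frequency table uses MariaDB's `utf8` character set, which in MariaDB 10.1 is the 3-byte `utf8mb3` and cannot store characters outside the Basic Multilingual Plane (emoji, for example). The function name `clean_for_utf8mb3` is hypothetical.

```python
import unicodedata


def clean_for_utf8mb3(text: str, replacement: str = "") -> str:
    """Return text without characters a 3-byte utf8 column cannot store.

    Assumption: the target column is utf8mb3. Control characters other
    than tab, newline and carriage return are dropped as well, since
    they are rarely wanted in corpus text.
    """
    kept = []
    for ch in text:
        code = ord(ch)
        if code > 0xFFFF:
            # Needs four bytes in UTF-8; valid UTF-8, but not utf8mb3.
            kept.append(replacement)
        elif 0xD800 <= code <= 0xDFFF:
            # Lone surrogate; cannot be encoded as UTF-8 at all.
            kept.append(replacement)
        elif unicodedata.category(ch) == "Cc" and ch not in "\t\n\r":
            # Control character such as NUL or ESC.
            kept.append(replacement)
        else:
            kept.append(ch)
    return "".join(kept)
```

Passing a visible `replacement` (for instance "?") instead of the empty string makes it easier to see afterwards where something was removed.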

Checking for invalid UTF-8 characters in the input XML file using grep (grep -axv '.*' file.txt) yields no results. Converting the file with iconv -f utf-8 -t utf-8 -c file.xml > newfile.xml makes no difference.
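Both checks only test byte-level validity, so a file can pass them and still contain characters that a 3-byte `utf8` column rejects. To narrow the problem down to a line, a minimal sketch along the following lines could help. It assumes, without confirmation from the setup above, that 4-byte UTF-8 characters are the trigger; the function name `find_suspect_lines` is hypothetical.

```python
from typing import Iterator, Tuple


def find_suspect_lines(path: str) -> Iterator[Tuple[int, str]]:
    """Yield (line number, reason) for lines that may upset the loader.

    A line is reported if it is not valid UTF-8, or if it contains
    code points above U+FFFF (four bytes in UTF-8).
    """
    with open(path, "rb") as handle:
        for lineno, raw in enumerate(handle, start=1):
            try:
                line = raw.decode("utf-8")
            except UnicodeDecodeError as err:
                yield lineno, f"invalid UTF-8 at byte offset {err.start}"
                continue
            wide = sorted({f"U+{ord(ch):04X}" for ch in line if ord(ch) > 0xFFFF})
            if wide:
                yield lineno, "4-byte characters: " + ", ".join(wide)
```

Iterating over `find_suspect_lines("file.xml")` and printing each pair gives line numbers that can be matched against text IDs in the XML. The same function can be pointed at the temporary .tbl file named in the error message, provided that file still exists after the failed import.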

Any suggestions on how to solve or narrow down the problem (e.g. finding the line or text ID causing the issue)?

Thanks a lot!
Thilo

Server Setup:
OS: Ubuntu 18.04
DB: MariaDB 10.1
CQPweb v3.2.43
PHP: 7.2

PHP debugging backtrace:
array(6) {
  [1]=>
  array(4) {
    ["file"]=>
    string(43) "/var/www/html/diskurs/lib/exiterror-lib.php"
    ["line"]=>
    int(367)
    ["function"]=>
    string(9) "exiterror"
    ["args"]=>
    array(3) {
      [0]=>
      array(3) {
        [0]=>
        string(38) "An SQL query did not run successfully!"
        [1]=>
        string(232) "Original query: 

LOAD DATA LOCAL INFILE '/var/cqpdata/temp/______tempfreq_topagrar_v4.tbl' INTO TABLE `__tempfreq_topagrar_v4` FIELDS ESCAPED BY '' 
	/* from User: thilo | Function: corpus_make_freqtables() | 2021-Nov-18 16:05 */

"
        [2]=>
        string(48) "Error # 1300: Invalid utf8 character string: '' "
      }
      [1]=>
      NULL
      [2]=>
      NULL
    }
  }
  [2]=>
  array(4) {
    ["file"]=>
    string(37) "/var/www/html/diskurs/lib/sql-lib.php"
    ["line"]=>
    int(216)
    ["function"]=>
    string(18) "exiterror_sqlquery"
    ["args"]=>
    array(3) {
      [0]=>
      int(1300)
      [1]=>
      string(33) "Invalid utf8 character string: ''"
      [2]=>
      string(212) "LOAD DATA LOCAL INFILE '/var/cqpdata/temp/______tempfreq_topagrar_v4.tbl' INTO TABLE `__tempfreq_topagrar_v4` FIELDS ESCAPED BY '' 
	/* from User: thilo | Function: corpus_make_freqtables() | 2021-Nov-18 16:05 */"
    }
  }
  [3]=>
  array(4) {
    ["file"]=>
    string(37) "/var/www/html/diskurs/lib/sql-lib.php"
    ["line"]=>
    int(350)
    ["function"]=>
    string(12) "do_sql_query"
    ["args"]=>
    array(1) {
      [0]=>
      string(212) "LOAD DATA LOCAL INFILE '/var/cqpdata/temp/______tempfreq_topagrar_v4.tbl' INTO TABLE `__tempfreq_topagrar_v4` FIELDS ESCAPED BY '' 
	/* from User: thilo | Function: corpus_make_freqtables() | 2021-Nov-18 16:05 */"
    }
  }
  [4]=>
  array(4) {
    ["file"]=>
    string(43) "/var/www/html/diskurs/lib/freqtable-lib.php"
    ["line"]=>
    int(127)
    ["function"]=>
    string(19) "do_sql_infile_query"
    ["args"]=>
    array(3) {
      [0]=>
      string(22) "__tempfreq_topagrar_v4"
      [1]=>
      string(48) "/var/cqpdata/temp/______tempfreq_topagrar_v4.tbl"
      [2]=>
      bool(true)
    }
  }
  [5]=>
  array(4) {
    ["file"]=>
    string(37) "/var/www/html/diskurs/lib/execute.php"
    ["line"]=>
    int(196)
    ["function"]=>
    string(22) "corpus_make_freqtables"
    ["args"]=>
    array(1) {
      [0]=>
      string(11) "topagrar_v4"
    }
  }
  [6]=>
  array(4) {
    ["file"]=>
    string(37) "/var/www/html/diskurs/exe/execute.php"
    ["line"]=>
    int(1)
    ["args"]=>
    array(1) {
      [0]=>
      string(37) "/var/www/html/diskurs/lib/execute.php"
    }
    ["function"]=>
    string(7) "require"
  }
}





 