[CWB] CQPweb error creating frequency tables (Invalid utf8 character string)

Stefan Evert stefanML at collocations.de
Thu Nov 18 20:04:32 CET 2021


The CQPweb v3.2 server running on my laptop has the following patch around line #347 of file lib/sql-lib.php, in function do_sql_infile_query():

		
		$sql = "{$Config->mysql_LOAD_DATA_INFILE_command} '$filepath' INTO TABLE `$table`";

		$sql .= " CHARACTER SET utf8mb4"; /* PATCH to handle characters outside BMP */
		if ($no_escapes)
			$sql .= ' FIELDS ESCAPED BY \'\'';
		
		return do_sql_query($sql);

This has helped me get Twitter corpora into CQPweb, but I don't know if it is sufficient for your data.

Best,
Stefan


> On 18 Nov 2021, at 17:47, Jörg Knappen <j.knappen at mx.uni-saarland.de> wrote:
> 
> This is a known shortcoming of the database (MySQL/MariaDB). It can only handle characters in the Basic Multilingual Plane (BMP). Emojis lie outside that plane, with code points >= 0x10000, so they cause this error. (I suspect emojis because of the provenance of your data, but that area also contains additional Chinese characters, mathematical and alchemical symbols, and many historic scripts.)
> 
> You can write a script searching for all characters with such codes, and you can try to replace them with some replacement strings to carry on.
> 
> 
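The search-and-replace script suggested in the quoted message could look roughly like the sketch below. It is only an illustration, not part of CWB or CQPweb: the function names and the "[U+XXXXX]" placeholder format are assumptions of mine, and you would want to pick a replacement string that suits your corpus and its tokenisation.

```python
import re

# Character class covering every code point outside the BMP (U+10000 and above).
NON_BMP = re.compile("[\U00010000-\U0010FFFF]")


def find_non_bmp(text):
    """Return the distinct non-BMP characters in text, sorted by code point."""
    return sorted(set(NON_BMP.findall(text)))


def replace_non_bmp(text, template="[U+{:04X}]"):
    """Replace each non-BMP character with a placeholder naming its code point.

    The default template is a hypothetical choice; any string without
    non-BMP characters would do.
    """
    return NON_BMP.sub(lambda m: template.format(ord(m.group())), text)


if __name__ == "__main__":
    sample = "ok \U0001F600 caf\u00e9"
    print(find_non_bmp(sample))
    print(replace_non_bmp(sample))
```

Characters inside the BMP, including accented letters, pass through unchanged, so the script can be run over the input text line by line before indexing.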


