[CWB] CQPweb error creating frequency tables (Invalid utf8 character string)

Thilo Wiertz thilo.wiertz at geographie.uni-freiburg.de
Thu Nov 18 21:13:03 CET 2021


Thanks Jörg, this saved my week! 
Thanks Stefan, while not needed for this corpus, that patch is likely to come in very handy soon!

If anyone is interested in the solution, a simple regex does the trick:

import re
# Match every code point above U+FFFF (outside the BMP)
re_pattern = re.compile(u'[^\u0000-\uFFFF]', re.UNICODE)
cleaned = re_pattern.sub(u'\uFFFD', str_var_with_wicked_chars)

(Replaces every Unicode code point outside the range U+0000-U+FFFF with U+FFFD, the replacement character �.)
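
In case it helps anyone who wants to see what is in the data before replacing it, here is a slightly fuller sketch along the lines Jörg suggested. It assumes Python 3 (where strings are sequences of code points); the function names and the sample string are made up for illustration and are not part of CWB or CQPweb.

```python
import re
from collections import Counter

# Matches any single code point above U+FFFF, i.e. outside the BMP.
NON_BMP = re.compile('[^\u0000-\uFFFF]')

def count_non_bmp(text):
    """Count how often each non-BMP character occurs in text."""
    return Counter(NON_BMP.findall(text))

def replace_non_bmp(text, replacement='\uFFFD'):
    """Replace every non-BMP character with the replacement string."""
    return NON_BMP.sub(replacement, text)

if __name__ == '__main__':
    # Hypothetical sample line; U+1F600 is an emoji outside the BMP.
    sample = 'nice weather \U0001F600\U0001F600 in Freiburg'
    for char, n in count_non_bmp(sample).items():
        print('U+%04X occurs %d time(s)' % (ord(char), n))
    print(replace_non_bmp(sample))
```

Counting first makes it easier to decide whether a blanket replacement by U+FFFD is acceptable or whether some characters deserve their own replacement strings.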

Best,
Thilo

> Am 18.11.2021 um 20:04 schrieb Stefan Evert <stefanML at collocations.de>:
> 
> The CQPweb v3.2 server running on my laptop has the following patch around line #347 of file lib/sql-lib.php, in function do_sql_infile_query():
> 
> 		
> 		$sql = "{$Config->mysql_LOAD_DATA_INFILE_command} '$filepath' INTO TABLE `$table`";
> 
> 		$sql .= " CHARACTER SET utf8mb4"; /* PATCH to handle characters outside BMP */
> 		if ($no_escapes)
> 			$sql .= ' FIELDS ESCAPED BY \'\'';
> 		
> 		return do_sql_query($sql);
> 
> This has helped me get Twitter corpora into CQPweb, but I don't know if it is sufficient for your data.
> 
> Best,
> Stefan
> 
> 
>> On 18 Nov 2021, at 17:47, Jörg Knappen <j.knappen at mx.uni-saarland.de> wrote:
>> 
>> This is a known shortcoming of the database (MySQL/MariaDB). With the 3-byte utf8 character set, it can only handle characters in the Basic Multilingual Plane (BMP). Emojis lie outside that plane, with code points >= 0x10000, so they cause this error. (I suspect emojis because of the provenance of your data; that area also contains additional Chinese characters, mathematical and alchemical symbols, and many historic scripts.)
>> 
>> You can write a script searching for all characters with such codes, and you can try to replace them with some replacement strings to carry on.
>> 
>> 
> 
