[CWB] unicode problems with Greek and OCS

Ruprecht von Waldenfels ruprecht.waldenfels at gmx.net
Tue Mar 10 13:02:08 CET 2015


Ah, this was supposed to be off-list. Apologies!
Ruprecht

On 10.03.2015 at 13:00, Ruprecht von Waldenfels wrote:
> Thanks, Andrew -
> this one?
> Ruprecht
>
> On 10.03.2015 at 12:55, Hardie, Andrew wrote:
>>
>> Hi Ruprecht,
>>
>> This is peculiar: the call to cl_string_canonical() that produces 
>> the error is using a string it has /taken from the lexicon/. But it 
>> ought not to be possible for the lexicon to contain bad UTF-8; 
>> cwb-encode ought to disallow it.
>>
>> Could you email me, off-list, your lexicon file? The one containing 
>> the strings, separated by nulls, not any of the other binary or 
>> compressed files.
>>
>> best
>>
>> Andrew.
>>
>> From: cwb-bounces at sslmit.unibo.it On Behalf Of Ruprecht von 
>> Waldenfels
>> Sent: 10 March 2015 11:31
>> To: cwb at sslmit.unibo.it
>> Subject: Re: [CWB] unicode problems with Greek and OCS
>>
>> Dear List,
>> here is my second problem, this time with Ancient Greek. I cannot 
>> easily reproduce it with a 2-line corpus because I don't know where 
>> the culprit is, so I am posting the CWB output instead; maybe this is 
>> already enough.
>>
>> What I am trying to do: align three documents, one Greek and two 
>> Slavic texts, using the aligVerse structural element. The two Slavic 
>> ones align fine; the Greek one gives me the following error:
>> rvw at rvw-Latitude-E6410:/data/PROIEL$ /opt/CWBUTF8/cwb/utils/cwb-align 
>> -r /data/PROIEL/Registry -S aligVerse -o out.align NTESTAMENT_GR 
>> NTESTAMENT_MN aligVerse
>> OPENING NTESTAMENT_GR [147613 tokens, 7497 <aligVerse> regions]
>> OPENING NTESTAMENT_MN [71935 tokens, 7497 <aligVerse> regions]
>> OPENING prealignment [NTESTAMENT_GR.aligVerse: 7497 regions, 
>> NTESTAMENT_MN.aligVerse: 7497 regions]
>> LEXICON SIZE: 18085 / 10132
>> FEATURE: character count, weight=1 ... [1]
>> FEATURE: Shared words, threshold=40.0%, weight=50 ... [0]
>> FEATURE: 3-grams, weight=3 ... CL: major error, invalid UTF8 string 
>> passed to cl_string_canonical...
>> CL: major error, invalid UTF8 string passed to cl_string_canonical...
>> CL: major error, invalid UTF8 string passed to cl_string_canonical...
>> [21952]
>> FEATURE: 4-grams, weight=4 ... CL: major error, invalid UTF8 string 
>> passed to cl_string_canonical...
>> CL: major error, invalid UTF8 string passed to cl_string_canonical...
>> CL: major error, invalid UTF8 string passed to cl_string_canonical...
>> CL: major error, invalid UTF8 string passed to cl_string_canonical...
>> [614656]
>> [636609 features allocated]
>> [520402 entries in source text feature map]
>> [246622 entries in target text feature map]
>> PASS 2: Setting character count weight.
>> PASS 2: Processing shared words (th=40.0%).
>> PASS 2: Processing 3-grams.
>> CL: major error, invalid UTF8 string passed to cl_string_canonical...
>> CL: major error, invalid UTF8 string passed to cl_string_canonical...
>> PASS 2: Processing 4-grams.
>> CL: major error, invalid UTF8 string passed to cl_string_canonical...
>> CL: major error, invalid UTF8 string passed to cl_string_canonical...
>> PASS 2: Creating character counts.
>> [checking pointers]
>> ERROR: fcount1[1387]=24 r->w2f1[1388]-r->w2f1[1387]=22 w=``ἥξουσιν''
>> ERROR: fcount1[1388]=50 r->w2f1[1389]-r->w2f1[1388]=52 
>> w=``ἀνακλιθήσονται''
>> ERROR: fcount1[1783]=24 r->w2f1[1784]-r->w2f1[1783]=22 w=``θάνατον''
>> ERROR: fcount1[1784]=50 r->w2f1[1785]-r->w2f1[1784]=52 
>> w=``ἐπαναστήσονται''
>> ERROR: fcount1[3037]=20 r->w2f1[3038]-r->w2f1[3037]=16 w=``δυνατά''
>> ERROR: fcount1[3039]=48 r->w2f1[3040]-r->w2f1[3039]=52 
>> w=``ἀκολουθήσαντές''
>> ERROR: fcount1[3784]=20 r->w2f1[3785]-r->w2f1[3784]=18 w=``ἤλθατε''
>> ERROR: fcount1[3785]=50 r->w2f1[3786]-r->w2f1[3785]=52 
>> w=``ἀποκριθήσονται''
>> ERROR: fcount1[4459]=32 r->w2f1[4460]-r->w2f1[4459]=30 w=``ἐπιθυμίαι''
>> ERROR: fcount1[4460]=50 r->w2f1[4461]-r->w2f1[4460]=52 
>> w=``εἰσπορευόμεναι''
>> ERROR: fcount1[4998]=20 r->w2f1[4999]-r->w2f1[4998]=18 w=``Ἤρξατο''
>> ERROR: fcount1[4999]=46 r->w2f1[5000]-r->w2f1[4999]=48 
>> w=``ἠκολουθήκαμέν''
>> ERROR: fcount1[5038]=36 r->w2f1[5039]-r->w2f1[5038]=34 w=``ἐγγίζουσιν''
>> ERROR: fcount1[5039]=50 r->w2f1[5040]-r->w2f1[5039]=52 
>> w=``εἰσπορευόμενοι''
>> ERROR: fcount1[7009]=32 r->w2f1[7010]-r->w2f1[7009]=30 w=``πλουσίους''
>> ERROR: fcount1[7010]=46 r->w2f1[7011]-r->w2f1[7010]=48 
>> w=``ἀντικαλέσωσίν''
>> ERROR: fcount1[8582]=20 r->w2f1[8583]-r->w2f1[8582]=18 w=``ἐξάγει''
>> ERROR: fcount1[8583]=50 r->w2f1[8584]-r->w2f1[8583]=52 
>> w=``ἀκολουθήσουσιν''
>> ERROR: fcount1[9942]=20 r->w2f1[9943]-r->w2f1[9942]=24 w=``ἅρματι''
>> ERROR: fcount1[9943]=56 r->w2f1[9944]-r->w2f1[9943]=52 
>> w=``ἀναγινώσκοντος''
>> ERROR: fcount1[10119]=48 r->w2f1[10120]-r->w2f1[10119]=44 
>> w=``μεταπέμψασθαί''
>> ERROR: fcount1[10120]=48 r->w2f1[10121]-r->w2f1[10120]=52 
>> w=``εἰσκαλεσάμενος''
>> ERROR: fcount1[10553]=28 r->w2f1[10554]-r->w2f1[10553]=24 w=``ἐτάραξαν''
>> ERROR: fcount1[10554]=48 r->w2f1[10555]-r->w2f1[10554]=52 
>> w=``ἀνασκευάζοντες''
>> ERROR: fcount1[10622]=24 r->w2f1[10623]-r->w2f1[10622]=20 w=``Τρῳάδος''
>> ERROR: fcount1[10623]=48 r->w2f1[10624]-r->w2f1[10623]=52 
>> w=``εὐθυδρομήσαμεν''
>> ERROR: fcount1[11159]=48 r->w2f1[11160]-r->w2f1[11159]=44 
>> w=``ἀποσπασθέντας''
>> ERROR: fcount1[11160]=52 r->w2f1[11161]-r->w2f1[11160]=56 
>> w=``εὐθυδρομήσαντες''
>> ERROR: fcount1[12054]=20 r->w2f1[12055]-r->w2f1[12054]=18 w=``πλάνης''
>> ERROR: fcount1[12055]=50 r->w2f1[12056]-r->w2f1[12055]=52 
>> w=``ἀπολαμβάνοντες''
>> ERROR: fcount1[12422]=12 r->w2f1[12423]-r->w2f1[12422]=10 w=``νοός''
>> ERROR: fcount1[12423]=50 r->w2f1[12424]-r->w2f1[12423]=52 
>> w=``αἰχμαλωτίζοντά''
>> ERROR: fcount1[14334]=40 r->w2f1[14335]-r->w2f1[14334]=38 
>> w=``ἐπαιρόμενον''
>> ERROR: fcount1[14335]=54 r->w2f1[14336]-r->w2f1[14335]=56 
>> w=``αἰχμαλωτίζοντες''
>> ERROR: fcount1[14641]=40 r->w2f1[14642]-r->w2f1[14641]=38 
>> w=``κεκυρωμένην''
>> ERROR: fcount1[14642]=50 r->w2f1[14643]-r->w2f1[14642]=52 
>> w=``ἐπιδιατάσσεται''
>> ERROR: fcount1[14878]=32 r->w2f1[14879]-r->w2f1[14878]=34 w=``προέγραψα''
>> ERROR: fcount1[14879]=54 r->w2f1[14880]-r->w2f1[14879]=52 
>> w=``ἀναγινώσκοντες''
>> ERROR: fcount1[15698]=36 r->w2f1[15699]-r->w2f1[15698]=34 
>> w=``ἐπιστεύθην''
>> ERROR: fcount1[15699]=46 r->w2f1[15700]-r->w2f1[15699]=48 
>> w=``ἐνδυναμώσαντί''
>> ERROR: fcount1[16170]=32 r->w2f1[16171]-r->w2f1[16170]=30 w=``ἀνέξονται''
>> ERROR: fcount1[16171]=50 r->w2f1[16172]-r->w2f1[16171]=52 
>> w=``ἐπισωρεύσουσιν''
>> ERROR: fcount1[16815]=32 r->w2f1[16816]-r->w2f1[16815]=30 w=``ἐνυβρίσας''
>> ERROR: fcount1[16816]=50 r->w2f1[16817]-r->w2f1[16816]=52 
>> w=``Ἀναμιμνῄσκεσθε''
>> ERROR: fcount1[17621]=40 r->w2f1[17622]-r->w2f1[17621]=42 
>> w=``ἀπεσταλμένα''
>> ERROR: fcount1[17622]=56 r->w2f1[17623]-r->w2f1[17622]=54 w=``εἴκοσι 
>> τέσσαρες''
>> ERROR: fcount1[17793]=28 r->w2f1[17794]-r->w2f1[17793]=29 w=``μάρτυσίν''
>> ERROR: fcount1[17794]=93 r->w2f1[17795]-r->w2f1[17794]=92 w=``χιλίας 
>> διακοσίας ἑξήκοντα''
>> ERROR: fcount1[17937]=24 r->w2f1[17938]-r->w2f1[17937]=26 w=``χαλινῶν''
>> ERROR: fcount1[17938]=60 r->w2f1[17939]-r->w2f1[17938]=58 w=``χιλίων 
>> ἑξακοσίων''
>> ERROR: fcount1[17967]=36 r->w2f1[17968]-r->w2f1[17967]=34 
>> w=``καυματίσαι''
>> ERROR: fcount1[17968]=50 r->w2f1[17969]-r->w2f1[17968]=52 
>> w=``ἐκαυματίσθησαν''
>>
>>
>> Again, I would be very thankful for help.
>>
>> Best!
>> Ruprecht
>>
>>
>>
>>
>>
>> On 10.03.2015 at 12:07, Ruprecht von Waldenfels wrote:
>>
>>     Hi Andrew,
>>     YES! This does solve the problem. I was thinking this setting
>>     would only concern tokens, not the lemma attribute, but now I
>>     understand that this was a wrong assumption. Thank you!
>>     I will now look at the other problem - because that, as it turns
>>     out, is unrelated.
>>     Thanks A LOT!
>>     Ruprecht
>>     On 10.03.2015 at 12:02, Hardie, Andrew wrote:
>>
>>         Is the context size measured in characters? If so, that would
>>         explain the problem, since “characters” = bytes still.
>>
>>         If changing the context width to a given number of words
>>         fixes the issue, then that is the solution.
>>
>>         I have been working on a patch to fix this, but have not
>>         completed it yet.
>>
>>         Andrew.
>>
>>         From: cwb-bounces at sslmit.unibo.it On Behalf Of Ruprecht
>>         von Waldenfels
>>         Sent: 10 March 2015 09:54
>>         To: cwb at sslmit.unibo.it
>>         Subject: [CWB] unicode problems with Greek and OCS
>>
>>         Dear List,
>>
>>         I am using CWB 3.4.8 on 64 bit Ubuntu 14.10.
>>         After encoding a text in Old Church Slavonic, I get invalid
>>         UTF-8 character errors, apparently only in sgml mode. (I also
>>         get them during alignment with the Ancient Greek translation
>>         source, which might be a related problem, but I am not sure.)
>>
>>         In order to pinpoint the problem with the Old Church Slavonic
>>         text, I have reduced the text in question to two bible
>>         verses. The text can be found here:
>>         www.parasolcorpus.org/test.txt
>>         <http://www.parasolcorpus.org/test.txt>
>>
>>         I encode the corpus with the following commands:
>>         /opt/CWBUTF8/cwb/utils/cwb-encode -d Data/ntestament_tt -f
>>         test.txt -R /data/PROIEL/Registry/ntestament_tt -c utf8 -xsB
>>         -P lemma -P id -P alig -P pos -P tag -S aligVerse:0
>>         /opt/CWBUTF8/cwb/utils/cwb-makeall -r /data/PROIEL/Registry
>>         NTESTAMENT_TT
>>
>>         There is no problem in text mode:
>>
>>
>>
>>         However, in sgml mode, some lemmas get truncated and no longer
>>         contain valid UTF-8. The lemma of the token "с҃вщаѩи" is one
>>         example. The problem does NOT appear if I search for this
>>         token itself; it appears ONLY, and consistently, if I search
>>         for a different token and the problematic token is in the
>>         result set:
>>
>>
>>         To sum up: I get the problem only if I search for a
>>         neighboring token in sgml mode. I don't get it if I search
>>         for the token itself, and I don't get it in text mode. I have
>>         reduced the problem to a 50-token text, and the problem persists.
>>
>>         Any help would be greatly appreciated!
>>         Best,
>>         Ruprecht
>>
>>
>>
>>
>>
>>
>>
>>         _______________________________________________
>>
>>         CWB mailing list
>>
>>         CWB at sslmit.unibo.it  <mailto:CWB at sslmit.unibo.it>
>>
>>         http://devel.sslmit.unibo.it/mailman/listinfo/cwb
>
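
Andrew's diagnosis in the quoted thread, that a context width given in "characters" is still counted in bytes, can be illustrated outside CWB. The following is only a minimal Python sketch of the general effect, not CWB's actual code; the function names are made up for this illustration. It shows that cutting a UTF-8 string at a byte offset can split a multi-byte character and leave invalid UTF-8, while cutting at a character (code point) offset cannot.

```python
def truncate_bytes(text: str, max_bytes: int) -> bytes:
    """Cut the UTF-8 encoding after max_bytes bytes, ignoring character boundaries."""
    return text.encode("utf-8")[:max_bytes]


def truncate_chars(text: str, max_chars: int) -> bytes:
    """Cut after max_chars code points, then encode; never splits a character."""
    return text[:max_chars].encode("utf-8")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# A Greek word of 7 code points, written with escapes so the byte length is
# unambiguous: every letter here takes 2 bytes in UTF-8 (14 bytes in total).
word = "\u03b8\u03ac\u03bd\u03b1\u03c4\u03bf\u03bd"

cut_mid_char = truncate_bytes(word, 5)   # ends halfway through the 3rd letter
cut_on_boundary = truncate_bytes(word, 4)  # happens to end between letters
cut_by_chars = truncate_chars(word, 3)   # always ends between letters

print(is_valid_utf8(cut_mid_char), is_valid_utf8(cut_on_boundary),
      is_valid_utf8(cut_by_chars))
```

This is consistent with the workaround reported in the thread: a context width given in words never cuts inside a token, so it cannot split a character.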

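For the alignment problem, Andrew asks in the quoted thread for the lexicon file "containing the strings, separated by nulls". A rough way to check such a file for bad UTF-8 before sending it could look like the sketch below. It assumes only what the thread states (entries separated by null bytes); the function names and the path in the final comment are hypothetical, and this is not part of CWB.

```python
from pathlib import Path


def find_invalid_utf8(data: bytes):
    """Return (index, raw_bytes) for each null-separated entry that is not valid UTF-8."""
    entries = data.split(b"\x00")
    # A trailing null terminator yields one empty final entry; drop it.
    if entries and entries[-1] == b"":
        entries.pop()
    bad = []
    for index, raw in enumerate(entries):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError:
            bad.append((index, raw))
    return bad


def check_lexicon(path):
    """Read a lexicon file from disk and list its invalid entries."""
    return find_invalid_utf8(Path(path).read_bytes())


# Hypothetical usage; the actual file name depends on the corpus data directory:
# for index, raw in check_lexicon("path/to/data/word.lexicon"):
#     print(index, raw)
```

If this reports nothing, the lexicon itself is clean and the invalid strings would have to arise later, for example when substrings of lexicon entries are built.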
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: image/png
Size: 77569 bytes
Desc: not available
URL: <http://devel.sslmit.unibo.it/pipermail/cwb/attachments/20150310/c09f27f1/attachment-0002.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: image/png
Size: 85145 bytes
Desc: not available
URL: <http://devel.sslmit.unibo.it/pipermail/cwb/attachments/20150310/c09f27f1/attachment-0003.png>