[CWB] unicode problems with Greek and OCS

Gabriele Brandolini gabriele.brandolini at gmail.com
Wed Mar 11 05:58:49 CET 2015


Dear Ruprecht, Andrew and Stefan

I have been following your thread about encoding Ancient Greek texts.

I would also like to encode texts in this language with CWB, especially old
texts of the Fathers of the Church. But I have not yet got a PoS tagger for
it. We have only just planned to train TreeTagger for it, and as far as I
know it isn't ready yet.

Do you, Ruprecht, know if there is one available?

About the list of Greek words in your email of 14:31: I noticed that most of
them are incorrect, because the initial letter (alpha, eta or epsilon) has
been dropped together with its accent and breathing mark.
I don't know if this has anything to do with the encoding error messages
you get.
I just wanted to point it out; maybe it can be of some help.
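
In case it helps to rule out the input file itself, here is a minimal sketch
in Python (not part of CWB) that reports the lines of a file that are not
valid UTF-8. The function name find_invalid_utf8_lines and the sample bytes
are my own illustration, and I have not run it on the PROIEL data.

```python
import io


def find_invalid_utf8_lines(stream):
    """Return (line_number, raw_bytes) pairs for lines that fail UTF-8 decoding.

    `stream` is any binary file-like object, e.g. open(path, "rb").
    """
    bad = []
    for number, raw in enumerate(stream, start=1):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError:
            bad.append((number, raw))
    return bad


if __name__ == "__main__":
    # Hypothetical sample: the second line is a Greek word whose final
    # two-byte character has lost its last byte.
    good = "ἥξουσιν\n".encode("utf-8")
    broken = "θάνατον".encode("utf-8")[:-1] + b"\n"
    sample = io.BytesIO(good + broken)
    for number, raw in find_invalid_utf8_lines(sample):
        print(f"line {number}: invalid UTF-8: {raw!r}")
```

For a real file one would pass open(path, "rb") instead of the in-memory
sample. If nothing is reported, the file is at least well-formed UTF-8, and
the errors would then have to come from a later processing step.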

Good work and good luck!

Gabriele
On 10 Mar 2015 14:31, "Ruprecht von Waldenfels" <ruprecht.waldenfels a gmx.net>
wrote:

>  Dear List,
> so my second problem, this time with Ancient Greek. I cannot easily
> reproduce this with a 2-line corpus, because I don't know where the culprit
> is. I am posting the CWB Output instead, maybe this is already enough.
>
> What I am trying to do: I am trying to align three documents, one Greek
> and two Slavic texts, using the aligVerse structural element. The two
> Slavic ones align fine, the Greek gives me the following error:
> rvw@rvw-Latitude-E6410:/data/PROIEL$ /opt/CWBUTF8/cwb/utils/cwb-align -r
> /data/PROIEL/Registry -S aligVerse -o out.align NTESTAMENT_GR NTESTAMENT_MN
> aligVerse
> OPENING NTESTAMENT_GR [147613 tokens, 7497 <aligVerse> regions]
> OPENING NTESTAMENT_MN [71935 tokens, 7497 <aligVerse> regions]
> OPENING prealignment [NTESTAMENT_GR.aligVerse: 7497 regions,
> NTESTAMENT_MN.aligVerse: 7497 regions]
> LEXICON SIZE: 18085 / 10132
> FEATURE: character count, weight=1 ... [1]
> FEATURE: Shared words, threshold=40.0%, weight=50 ... [0]
> FEATURE: 3-grams, weight=3 ... CL: major error, invalid UTF8 string passed
> to cl_string_canonical...
> CL: major error, invalid UTF8 string passed to cl_string_canonical...
> CL: major error, invalid UTF8 string passed to cl_string_canonical...
> [21952]
> FEATURE: 4-grams, weight=4 ... CL: major error, invalid UTF8 string passed
> to cl_string_canonical...
> CL: major error, invalid UTF8 string passed to cl_string_canonical...
> CL: major error, invalid UTF8 string passed to cl_string_canonical...
> CL: major error, invalid UTF8 string passed to cl_string_canonical...
> [614656]
> [636609 features allocated]
> [520402 entries in source text feature map]
> [246622 entries in target text feature map]
> PASS 2: Setting character count weight.
> PASS 2: Processing shared words (th=40.0%).
> PASS 2: Processing 3-grams.
> CL: major error, invalid UTF8 string passed to cl_string_canonical...
> CL: major error, invalid UTF8 string passed to cl_string_canonical...
> PASS 2: Processing 4-grams.
> CL: major error, invalid UTF8 string passed to cl_string_canonical...
> CL: major error, invalid UTF8 string passed to cl_string_canonical...
> PASS 2: Creating character counts.
> [checking pointers]
> ERROR: fcount1[1387]=24 r->w2f1[1388]-r->w2f1[1387]=22 w=``ἥξουσιν''
> ERROR: fcount1[1388]=50 r->w2f1[1389]-r->w2f1[1388]=52 w=``ἀνακλιθήσονται''
> ERROR: fcount1[1783]=24 r->w2f1[1784]-r->w2f1[1783]=22 w=``θάνατον''
> ERROR: fcount1[1784]=50 r->w2f1[1785]-r->w2f1[1784]=52 w=``ἐπαναστήσονται''
> ERROR: fcount1[3037]=20 r->w2f1[3038]-r->w2f1[3037]=16 w=``δυνατά''
> ERROR: fcount1[3039]=48 r->w2f1[3040]-r->w2f1[3039]=52 w=``ἀκολουθήσαντές''
> ERROR: fcount1[3784]=20 r->w2f1[3785]-r->w2f1[3784]=18 w=``ἤλθατε''
> ERROR: fcount1[3785]=50 r->w2f1[3786]-r->w2f1[3785]=52 w=``ἀποκριθήσονται''
> ERROR: fcount1[4459]=32 r->w2f1[4460]-r->w2f1[4459]=30 w=``ἐπιθυμίαι''
> ERROR: fcount1[4460]=50 r->w2f1[4461]-r->w2f1[4460]=52 w=``εἰσπορευόμεναι''
> ERROR: fcount1[4998]=20 r->w2f1[4999]-r->w2f1[4998]=18 w=``Ἤρξατο''
> ERROR: fcount1[4999]=46 r->w2f1[5000]-r->w2f1[4999]=48 w=``ἠκολουθήκαμέν''
> ERROR: fcount1[5038]=36 r->w2f1[5039]-r->w2f1[5038]=34 w=``ἐγγίζουσιν''
> ERROR: fcount1[5039]=50 r->w2f1[5040]-r->w2f1[5039]=52 w=``εἰσπορευόμενοι''
> ERROR: fcount1[7009]=32 r->w2f1[7010]-r->w2f1[7009]=30 w=``πλουσίους''
> ERROR: fcount1[7010]=46 r->w2f1[7011]-r->w2f1[7010]=48 w=``ἀντικαλέσωσίν''
> ERROR: fcount1[8582]=20 r->w2f1[8583]-r->w2f1[8582]=18 w=``ἐξάγει''
> ERROR: fcount1[8583]=50 r->w2f1[8584]-r->w2f1[8583]=52 w=``ἀκολουθήσουσιν''
> ERROR: fcount1[9942]=20 r->w2f1[9943]-r->w2f1[9942]=24 w=``ἅρματι''
> ERROR: fcount1[9943]=56 r->w2f1[9944]-r->w2f1[9943]=52 w=``ἀναγινώσκοντος''
> ERROR: fcount1[10119]=48 r->w2f1[10120]-r->w2f1[10119]=44
> w=``μεταπέμψασθαί''
> ERROR: fcount1[10120]=48 r->w2f1[10121]-r->w2f1[10120]=52
> w=``εἰσκαλεσάμενος''
> ERROR: fcount1[10553]=28 r->w2f1[10554]-r->w2f1[10553]=24 w=``ἐτάραξαν''
> ERROR: fcount1[10554]=48 r->w2f1[10555]-r->w2f1[10554]=52
> w=``ἀνασκευάζοντες''
> ERROR: fcount1[10622]=24 r->w2f1[10623]-r->w2f1[10622]=20 w=``Τρῳάδος''
> ERROR: fcount1[10623]=48 r->w2f1[10624]-r->w2f1[10623]=52
> w=``εὐθυδρομήσαμεν''
> ERROR: fcount1[11159]=48 r->w2f1[11160]-r->w2f1[11159]=44
> w=``ἀποσπασθέντας''
> ERROR: fcount1[11160]=52 r->w2f1[11161]-r->w2f1[11160]=56
> w=``εὐθυδρομήσαντες''
> ERROR: fcount1[12054]=20 r->w2f1[12055]-r->w2f1[12054]=18 w=``πλάνης''
> ERROR: fcount1[12055]=50 r->w2f1[12056]-r->w2f1[12055]=52
> w=``ἀπολαμβάνοντες''
> ERROR: fcount1[12422]=12 r->w2f1[12423]-r->w2f1[12422]=10 w=``νοός''
> ERROR: fcount1[12423]=50 r->w2f1[12424]-r->w2f1[12423]=52
> w=``αἰχμαλωτίζοντά''
> ERROR: fcount1[14334]=40 r->w2f1[14335]-r->w2f1[14334]=38 w=``ἐπαιρόμενον''
> ERROR: fcount1[14335]=54 r->w2f1[14336]-r->w2f1[14335]=56
> w=``αἰχμαλωτίζοντες''
> ERROR: fcount1[14641]=40 r->w2f1[14642]-r->w2f1[14641]=38 w=``κεκυρωμένην''
> ERROR: fcount1[14642]=50 r->w2f1[14643]-r->w2f1[14642]=52
> w=``ἐπιδιατάσσεται''
> ERROR: fcount1[14878]=32 r->w2f1[14879]-r->w2f1[14878]=34 w=``προέγραψα''
> ERROR: fcount1[14879]=54 r->w2f1[14880]-r->w2f1[14879]=52
> w=``ἀναγινώσκοντες''
> ERROR: fcount1[15698]=36 r->w2f1[15699]-r->w2f1[15698]=34 w=``ἐπιστεύθην''
> ERROR: fcount1[15699]=46 r->w2f1[15700]-r->w2f1[15699]=48
> w=``ἐνδυναμώσαντί''
> ERROR: fcount1[16170]=32 r->w2f1[16171]-r->w2f1[16170]=30 w=``ἀνέξονται''
> ERROR: fcount1[16171]=50 r->w2f1[16172]-r->w2f1[16171]=52
> w=``ἐπισωρεύσουσιν''
> ERROR: fcount1[16815]=32 r->w2f1[16816]-r->w2f1[16815]=30 w=``ἐνυβρίσας''
> ERROR: fcount1[16816]=50 r->w2f1[16817]-r->w2f1[16816]=52
> w=``Ἀναμιμνῄσκεσθε''
> ERROR: fcount1[17621]=40 r->w2f1[17622]-r->w2f1[17621]=42 w=``ἀπεσταλμένα''
> ERROR: fcount1[17622]=56 r->w2f1[17623]-r->w2f1[17622]=54 w=``εἴκοσι
> τέσσαρες''
> ERROR: fcount1[17793]=28 r->w2f1[17794]-r->w2f1[17793]=29 w=``μάρτυσίν''
> ERROR: fcount1[17794]=93 r->w2f1[17795]-r->w2f1[17794]=92 w=``χιλίας
> διακοσίας ἑξήκοντα''
> ERROR: fcount1[17937]=24 r->w2f1[17938]-r->w2f1[17937]=26 w=``χαλινῶν''
> ERROR: fcount1[17938]=60 r->w2f1[17939]-r->w2f1[17938]=58 w=``χιλίων
> ἑξακοσίων''
> ERROR: fcount1[17967]=36 r->w2f1[17968]-r->w2f1[17967]=34 w=``καυματίσαι''
> ERROR: fcount1[17968]=50 r->w2f1[17969]-r->w2f1[17968]=52
> w=``ἐκαυματίσθησαν''
>
>
> Again, I would be very thankful for help.
>
> Best!
> Ruprecht
>
>
>
>
>
> On 10.03.2015 at 12:07, Ruprecht von Waldenfels wrote:
>
> Hi Andrew,
> YES! This does solve the problem. I was thinking this setting would only
> concern tokens, not the lemma attribute, but now I understand that this was
> a wrong assumption. Thank you!
> I will now look at the other problem - because that, as it turns out, is
> unrelated.
> Thanks A LOT!
> Ruprecht
> On 10.03.2015 at 12:02, Hardie, Andrew wrote:
>
>  Is the context size measured in characters? If so, that would explain
> the problem, since “characters” = bytes still.
>
>
>
> If changing the context width to a given number of words fixes the issue,
> then that is the solution.
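
To illustrate Andrew's point above, here is a small sketch in Python (not CWB
code; the helper names are made up for this example). If a context of N
"characters" is really cut after N bytes, a multi-byte UTF-8 character can be
split in the middle, and what remains is no longer valid UTF-8. Cutting at
code-point boundaries avoids this.

```python
def truncate_bytes(text, n):
    """Cut the UTF-8 encoding of `text` after n bytes (may split a character)."""
    return text.encode("utf-8")[:n]


def truncate_chars(text, n):
    """Cut `text` after n code points, then encode; the result stays valid UTF-8."""
    return text[:n].encode("utf-8")


def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    word = "δυνατά"  # the first three letters (δ, υ, ν) take two bytes each
    # A 3-byte cut falls inside the second letter.
    print(is_valid_utf8(truncate_bytes(word, 3)))
    # A 3-code-point cut keeps whole letters.
    print(is_valid_utf8(truncate_chars(word, 3)))
```

This only shows the general mechanism; I do not know how CQP computes the
context internally, so it may not be what happens there in detail.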
>
>
>
> I have been working on a patch to fix this, but have not completed it yet.
>
>
>
> Andrew.
>
>
>
> *From:* cwb-bounces a sslmit.unibo.it *On Behalf Of* Ruprecht von Waldenfels
> *Sent:* 10 March 2015 09:54
> *To:* cwb a sslmit.unibo.it
> *Subject:* [CWB] unicode problems with Greek and OCS
>
>
>
> Dear List,
>
> I am using CWB 3.4.8 on 64 bit Ubuntu 14.10.
> After encoding a text in Old Church Slavonic, I get invalid UTF-8
> character errors; I seem to get them only in sgml mode (I also get them
> during alignment with the Ancient Greek translation source, which might be
> a related problem, but I am not sure.)
>
> In order to pinpoint the problem with the Old Church Slavonic text, I have
> reduced the text in question to two bible verses. The text can be found
> here: www.parasolcorpus.org/test.txt
>
> I encode the corpus with the following commands:
> /opt/CWBUTF8/cwb/utils/cwb-encode -d Data/ntestament_tt -f test.txt -R
> /data/PROIEL/Registry/ntestament_tt -c utf8 -xsB -P lemma -P id -P alig -P
> pos -P tag -S aligVerse:0
> /opt/CWBUTF8/cwb/utils/cwb-makeall -r /data/PROIEL/Registry NTESTAMENT_TT
>
> There is no problem in text mode:
>
>
>
> However, in sgml mode, some lemmas get truncated and do not contain valid
> utf8 anymore. For example, the lemma of "с҃вщаѩи" is such a token. This
> problem does NOT appear if I search for this token itself, it ONLY and
> consistently appears if I search for a different token and the problematic
> token is in the result set:
>
>
> To sum up: I get the problem only if I search for a neighboring token in
> sgml mode. I don't get it if I search for the token itself, and I don't get
> it in text mode. I have reduced the problem to a 50-token text, and the
> problem persists.
>
> Any help would be greatly appreciated!
> Best,
> Ruprecht
>
>
>
>
>
> _______________________________________________
> CWB mailing list
> CWB a sslmit.unibo.it
> http://devel.sslmit.unibo.it/mailman/listinfo/cwb
>
>
>
>
-------------- next part --------------
A non-text attachment was removed...
Name:        not available
Type:        image/png
Size:        85145 bytes
Description: not available
URL:         <http://devel.sslmit.unibo.it/pipermail/cwb/attachments/20150311/f0c840c1/attachment-0002.png>
-------------- next part --------------
A non-text attachment was removed...
Name:        not available
Type:        image/png
Size:        77569 bytes
Description: not available
URL:         <http://devel.sslmit.unibo.it/pipermail/cwb/attachments/20150311/f0c840c1/attachment-0003.png>

