<html>

  <head>

    <meta content="text/html; charset=UTF-8" http-equiv="Content-Type">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    <div class="moz-cite-prefix">Dear Gabriele, <br>

      <br>

      there is a web service that will do morphological analysis and

      lemmatization for Greek: <br>

      <a class="moz-txt-link-freetext" href="http://archimedes.mpiwg-berlin.mpg.de/arch/doc/xml-rpc.html">http://archimedes.mpiwg-berlin.mpg.de/arch/doc/xml-rpc.html</a><br>

      <br>

      However, it does not disambiguate homonyms. One workaround is to encode
      all possible analyses in the corpus, which is what I did (for a
      different project). This is the only resource I know of. <br>

      <br>
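      As an illustration of encoding all possibilities, the candidate
      lemmas for a token could be stored in a single pipe-delimited
      value, which is the notation CWB uses for feature-set attributes.
      This is only a sketch under that assumption: the helper name and
      the sample lemma strings are hypothetical, and the thread does not
      say that this is how the other project did it. <br>

```python
def to_feature_set(candidates):
    """Join candidate lemmas as |a|b|c|; an empty input gives the empty set "|"."""
    # Drop empty strings and duplicates, then sort for a stable order.
    unique = sorted(set(c for c in candidates if c))
    if not unique:
        return "|"
    return "|" + "|".join(unique) + "|"


# Hypothetical output of a morphological analyser for one ambiguous token.
print(to_feature_set(["lemma_b", "lemma_a", "lemma_b"]))
```

      A query can then match any one of the stored candidates, at the
      price of leaving the ambiguity unresolved. <br>
      <br>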

      I think the problem you saw with the accented characters may be
      a rendering issue on your system: here, things looked fine,
      and Stefan and Andrew confirmed that. But thanks for pointing it
      out!<br>

      <br>

      Best, <br>

      Ruprecht<br>

      <br>


      On 11.03.2015 at 05:58, Gabriele Brandolini wrote:<br>

    </div>

    <blockquote

cite="mid:CALHN0UPZSTne5Kkk4EXe7RdnVW672XBE-cejB5oeuSfUZCj9AA@mail.gmail.com"

      type="cite">

      <p dir="ltr">Dear Ruprecht, Andrew and Stefan</p>

      <p dir="ltr">I have been following your thread about encoding Ancient Greek texts.</p>
      <p dir="ltr">I would also like to encode texts in this
        language with CWB, especially old texts of the Fathers of the Church. But
        I have not yet found a PoS tagger for the language. We
        planned to train TreeTagger for it, but as far as I know it
        isn't ready yet.</p>
      <p dir="ltr">Ruprecht, do you know whether one is available?</p>
      <p dir="ltr">About the list of Greek words in your email of 14:31:
        I noticed that most of them are incorrect, because the initial letter
        (alpha, eta or epsilon) was dropped together with its accent and
        breathing mark.<br>
        I don't know whether this has anything to do with the encoding error
        messages you get.<br>
        I just wanted to point it out; maybe it is of some help.</p>

      <p dir="ltr">Good work and good luck!</p>

      <p dir="ltr">Gabriele</p>

      <div class="gmail_quote">On 10 Mar 2015 14:31, "Ruprecht von
        Waldenfels" &lt;<a moz-do-not-send="true"
          href="mailto:ruprecht.waldenfels@gmx.net">ruprecht.waldenfels@gmx.net</a>&gt;
        wrote:<br type="attribution">

        <blockquote class="gmail_quote" style="margin:0 0 0

          .8ex;border-left:1px #ccc solid;padding-left:1ex">

          <div text="#000000" bgcolor="#FFFFFF">

            <div>Dear List, <br>
              here is my second problem, this time with Ancient Greek. I
              cannot easily reproduce it with a 2-line corpus, because
              I don't know where the culprit is. I am posting the CWB
              output instead; maybe this is already enough. <br>

              <br>

              What I am trying to do: I am trying to align three

              documents, one Greek and two Slavic texts, using the

              aligVerse structural element. The two Slavic ones align

              fine, the Greek gives me the following error: <br>

              rvw@rvw-Latitude-E6410:/data/PROIEL$

              /opt/CWBUTF8/cwb/utils/cwb-align -r /data/PROIEL/Registry

              -S aligVerse -o out.align NTESTAMENT_GR NTESTAMENT_MN

              aligVerse<br>

              OPENING NTESTAMENT_GR [147613 tokens, 7497

              &lt;aligVerse&gt; regions]<br>

              OPENING NTESTAMENT_MN [71935 tokens, 7497

              &lt;aligVerse&gt; regions]<br>

              OPENING prealignment [NTESTAMENT_GR.aligVerse: 7497

              regions, NTESTAMENT_MN.aligVerse: 7497 regions]<br>

              LEXICON SIZE: 18085 / 10132<br>

              FEATURE: character count, weight=1 ... [1]<br>

              FEATURE: Shared words, threshold=40.0%, weight=50 ... [0]<br>

              FEATURE: 3-grams, weight=3 ... CL: major error, invalid

              UTF8 string passed to cl_string_canonical...<br>

              CL: major error, invalid UTF8 string passed to

              cl_string_canonical...<br>

              CL: major error, invalid UTF8 string passed to

              cl_string_canonical...<br>

              [21952]<br>

              FEATURE: 4-grams, weight=4 ... CL: major error, invalid

              UTF8 string passed to cl_string_canonical...<br>

              CL: major error, invalid UTF8 string passed to

              cl_string_canonical...<br>

              CL: major error, invalid UTF8 string passed to

              cl_string_canonical...<br>

              CL: major error, invalid UTF8 string passed to

              cl_string_canonical...<br>

              [614656]<br>

              [636609 features allocated]<br>

              [520402 entries in source text feature map]<br>

              [246622 entries in target text feature map]<br>

              PASS 2: Setting character count weight.<br>

              PASS 2: Processing shared words (th=40.0%).<br>

              PASS 2: Processing 3-grams.<br>

              CL: major error, invalid UTF8 string passed to

              cl_string_canonical...<br>

              CL: major error, invalid UTF8 string passed to

              cl_string_canonical...<br>

              PASS 2: Processing 4-grams.<br>

              CL: major error, invalid UTF8 string passed to

              cl_string_canonical...<br>

              CL: major error, invalid UTF8 string passed to

              cl_string_canonical...<br>

              PASS 2: Creating character counts.<br>

              [checking pointers]<br>

              ERROR: fcount1[1387]=24

              r-&gt;w2f1[1388]-r-&gt;w2f1[1387]=22 w=``ἥξουσιν''<br>

              ERROR: fcount1[1388]=50

              r-&gt;w2f1[1389]-r-&gt;w2f1[1388]=52 w=``ἀνακλιθήσονται''<br>

              ERROR: fcount1[1783]=24

              r-&gt;w2f1[1784]-r-&gt;w2f1[1783]=22 w=``θάνατον''<br>

              ERROR: fcount1[1784]=50

              r-&gt;w2f1[1785]-r-&gt;w2f1[1784]=52 w=``ἐπαναστήσονται''<br>

              ERROR: fcount1[3037]=20

              r-&gt;w2f1[3038]-r-&gt;w2f1[3037]=16 w=``δυνατά''<br>

              ERROR: fcount1[3039]=48

              r-&gt;w2f1[3040]-r-&gt;w2f1[3039]=52 w=``ἀκολουθήσαντές''<br>

              ERROR: fcount1[3784]=20

              r-&gt;w2f1[3785]-r-&gt;w2f1[3784]=18 w=``ἤλθατε''<br>

              ERROR: fcount1[3785]=50

              r-&gt;w2f1[3786]-r-&gt;w2f1[3785]=52 w=``ἀποκριθήσονται''<br>

              ERROR: fcount1[4459]=32

              r-&gt;w2f1[4460]-r-&gt;w2f1[4459]=30 w=``ἐπιθυμίαι''<br>

              ERROR: fcount1[4460]=50

              r-&gt;w2f1[4461]-r-&gt;w2f1[4460]=52 w=``εἰσπορευόμεναι''<br>

              ERROR: fcount1[4998]=20

              r-&gt;w2f1[4999]-r-&gt;w2f1[4998]=18 w=``Ἤρξατο''<br>

              ERROR: fcount1[4999]=46

              r-&gt;w2f1[5000]-r-&gt;w2f1[4999]=48 w=``ἠκολουθήκαμέν''<br>

              ERROR: fcount1[5038]=36

              r-&gt;w2f1[5039]-r-&gt;w2f1[5038]=34 w=``ἐγγίζουσιν''<br>

              ERROR: fcount1[5039]=50

              r-&gt;w2f1[5040]-r-&gt;w2f1[5039]=52 w=``εἰσπορευόμενοι''<br>

              ERROR: fcount1[7009]=32

              r-&gt;w2f1[7010]-r-&gt;w2f1[7009]=30 w=``πλουσίους''<br>

              ERROR: fcount1[7010]=46

              r-&gt;w2f1[7011]-r-&gt;w2f1[7010]=48 w=``ἀντικαλέσωσίν''<br>

              ERROR: fcount1[8582]=20

              r-&gt;w2f1[8583]-r-&gt;w2f1[8582]=18 w=``ἐξάγει''<br>

              ERROR: fcount1[8583]=50

              r-&gt;w2f1[8584]-r-&gt;w2f1[8583]=52 w=``ἀκολουθήσουσιν''<br>

              ERROR: fcount1[9942]=20

              r-&gt;w2f1[9943]-r-&gt;w2f1[9942]=24 w=``ἅρματι''<br>

              ERROR: fcount1[9943]=56

              r-&gt;w2f1[9944]-r-&gt;w2f1[9943]=52 w=``ἀναγινώσκοντος''<br>

              ERROR: fcount1[10119]=48

              r-&gt;w2f1[10120]-r-&gt;w2f1[10119]=44 w=``μεταπέμψασθαί''<br>

              ERROR: fcount1[10120]=48

              r-&gt;w2f1[10121]-r-&gt;w2f1[10120]=52

              w=``εἰσκαλεσάμενος''<br>

              ERROR: fcount1[10553]=28

              r-&gt;w2f1[10554]-r-&gt;w2f1[10553]=24 w=``ἐτάραξαν''<br>

              ERROR: fcount1[10554]=48

              r-&gt;w2f1[10555]-r-&gt;w2f1[10554]=52

              w=``ἀνασκευάζοντες''<br>

              ERROR: fcount1[10622]=24

              r-&gt;w2f1[10623]-r-&gt;w2f1[10622]=20 w=``Τρῳάδος''<br>

              ERROR: fcount1[10623]=48

              r-&gt;w2f1[10624]-r-&gt;w2f1[10623]=52

              w=``εὐθυδρομήσαμεν''<br>

              ERROR: fcount1[11159]=48

              r-&gt;w2f1[11160]-r-&gt;w2f1[11159]=44 w=``ἀποσπασθέντας''<br>

              ERROR: fcount1[11160]=52

              r-&gt;w2f1[11161]-r-&gt;w2f1[11160]=56

              w=``εὐθυδρομήσαντες''<br>

              ERROR: fcount1[12054]=20

              r-&gt;w2f1[12055]-r-&gt;w2f1[12054]=18 w=``πλάνης''<br>

              ERROR: fcount1[12055]=50

              r-&gt;w2f1[12056]-r-&gt;w2f1[12055]=52

              w=``ἀπολαμβάνοντες''<br>

              ERROR: fcount1[12422]=12

              r-&gt;w2f1[12423]-r-&gt;w2f1[12422]=10 w=``νοός''<br>

              ERROR: fcount1[12423]=50

              r-&gt;w2f1[12424]-r-&gt;w2f1[12423]=52

              w=``αἰχμαλωτίζοντά''<br>

              ERROR: fcount1[14334]=40

              r-&gt;w2f1[14335]-r-&gt;w2f1[14334]=38 w=``ἐπαιρόμενον''<br>

              ERROR: fcount1[14335]=54

              r-&gt;w2f1[14336]-r-&gt;w2f1[14335]=56

              w=``αἰχμαλωτίζοντες''<br>

              ERROR: fcount1[14641]=40

              r-&gt;w2f1[14642]-r-&gt;w2f1[14641]=38 w=``κεκυρωμένην''<br>

              ERROR: fcount1[14642]=50

              r-&gt;w2f1[14643]-r-&gt;w2f1[14642]=52

              w=``ἐπιδιατάσσεται''<br>

              ERROR: fcount1[14878]=32

              r-&gt;w2f1[14879]-r-&gt;w2f1[14878]=34 w=``προέγραψα''<br>

              ERROR: fcount1[14879]=54

              r-&gt;w2f1[14880]-r-&gt;w2f1[14879]=52

              w=``ἀναγινώσκοντες''<br>

              ERROR: fcount1[15698]=36

              r-&gt;w2f1[15699]-r-&gt;w2f1[15698]=34 w=``ἐπιστεύθην''<br>

              ERROR: fcount1[15699]=46

              r-&gt;w2f1[15700]-r-&gt;w2f1[15699]=48 w=``ἐνδυναμώσαντί''<br>

              ERROR: fcount1[16170]=32

              r-&gt;w2f1[16171]-r-&gt;w2f1[16170]=30 w=``ἀνέξονται''<br>

              ERROR: fcount1[16171]=50

              r-&gt;w2f1[16172]-r-&gt;w2f1[16171]=52

              w=``ἐπισωρεύσουσιν''<br>

              ERROR: fcount1[16815]=32

              r-&gt;w2f1[16816]-r-&gt;w2f1[16815]=30 w=``ἐνυβρίσας''<br>

              ERROR: fcount1[16816]=50

              r-&gt;w2f1[16817]-r-&gt;w2f1[16816]=52

              w=``Ἀναμιμνῄσκεσθε''<br>

              ERROR: fcount1[17621]=40

              r-&gt;w2f1[17622]-r-&gt;w2f1[17621]=42 w=``ἀπεσταλμένα''<br>

              ERROR: fcount1[17622]=56

              r-&gt;w2f1[17623]-r-&gt;w2f1[17622]=54 w=``εἴκοσι

              τέσσαρες''<br>

              ERROR: fcount1[17793]=28

              r-&gt;w2f1[17794]-r-&gt;w2f1[17793]=29 w=``μάρτυσίν''<br>

              ERROR: fcount1[17794]=93

              r-&gt;w2f1[17795]-r-&gt;w2f1[17794]=92 w=``χιλίας

              διακοσίας ἑξήκοντα''<br>

              ERROR: fcount1[17937]=24

              r-&gt;w2f1[17938]-r-&gt;w2f1[17937]=26 w=``χαλινῶν''<br>

              ERROR: fcount1[17938]=60

              r-&gt;w2f1[17939]-r-&gt;w2f1[17938]=58 w=``χιλίων

              ἑξακοσίων''<br>

              ERROR: fcount1[17967]=36

              r-&gt;w2f1[17968]-r-&gt;w2f1[17967]=34 w=``καυματίσαι''<br>

              ERROR: fcount1[17968]=50

              r-&gt;w2f1[17969]-r-&gt;w2f1[17968]=52

              w=``ἐκαυματίσθησαν''<br>

              <br>

              <br>
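              The "invalid UTF8 string" messages above appear while the
              3-gram and 4-gram features are built. One possible reading,
              which is an assumption and is not confirmed anywhere in this
              thread, is that the n-grams are cut by byte and not by
              character. The Python sketch below is not CWB code and all
              function names are made up; it only shows how byte-level
              n-grams of a Greek word stop being valid UTF-8. <br>

```python
def byte_ngrams(word, n):
    """All n-grams taken over the UTF-8 bytes of word."""
    data = word.encode("utf-8")
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def char_ngrams(word, n):
    """All n-grams taken over the Unicode code points of word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def count_invalid(ngrams):
    """Number of byte n-grams that are not valid UTF-8 on their own."""
    bad = 0
    for gram in ngrams:
        try:
            gram.decode("utf-8")
        except UnicodeDecodeError:
            bad += 1
    return bad


# Five Greek letters (lambda, omicron with tonos, gamma, omicron, final sigma),
# each of which takes two bytes in UTF-8.
word = "\u03bb\u03cc\u03b3\u03bf\u03c2"

print(len(char_ngrams(word, 3)))            # character 3-grams
print(len(byte_ngrams(word, 3)))            # byte 3-grams
print(count_invalid(byte_ngrams(word, 3)))  # byte 3-grams that split a character
```

              Every byte 3-gram of this word splits a two-byte character,
              so none of them decodes on its own. <br>
              <br>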

              Again, I would be very thankful for help. <br>

              <br>

              Best!<br>

              Ruprecht<br>

              <br>


              On 10.03.2015 at 12:07, Ruprecht von Waldenfels wrote:<br>

            </div>

            <blockquote type="cite">

              <div>Hi Andrew,<br>

                YES! This does solve the problem. I was thinking this

                setting would only concern tokens, not the lemma

                attribute, but now I understand that this was a wrong

                assumption. Thank you!<br>

                I will now look at the other problem - because that, as

                it turns out, is unrelated. <br>

                Thanks A LOT!<br>

                Ruprecht<br>

                On 10.03.2015 at 12:02, Hardie, Andrew wrote:<br>

              </div>

              <blockquote type="cite">

                <div>

                  <p class="MsoNormal"><span

style="font-size:10.0pt;font-family:&quot;Verdana&quot;,&quot;sans-serif&quot;;color:#1f497d">Is


                      the context size measured in characters? If so,

                      that would explain the problem, since “characters”

                      = bytes still.</span></p>

                  <p class="MsoNormal"><span

style="font-size:10.0pt;font-family:&quot;Verdana&quot;,&quot;sans-serif&quot;;color:#1f497d"> </span></p>

                  <p class="MsoNormal"><span

style="font-size:10.0pt;font-family:&quot;Verdana&quot;,&quot;sans-serif&quot;;color:#1f497d">If


                      changing the context width to a given number of

                      words fixes the issue, then that is the solution.</span></p>
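                  <p>The effect described above can be reproduced outside
                    CWB. The following Python sketch is only an illustration
                    with made-up function names, not the CWB implementation:
                    it cuts the same string once after a number of bytes and
                    once after a number of characters, and checks whether
                    the result is still valid UTF-8.</p>

```python
def truncate_bytes(text, limit):
    """Cut the UTF-8 encoding of text after `limit` bytes (may split a character)."""
    return text.encode("utf-8")[:limit]


def truncate_chars(text, limit):
    """Cut text after `limit` Unicode code points, then encode."""
    return text[:limit].encode("utf-8")


def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Five Greek letters, each encoded as two bytes in UTF-8.
word = "\u03bb\u03cc\u03b3\u03bf\u03c2"

print(is_valid_utf8(truncate_bytes(word, 3)))  # byte cut lands inside a character
print(is_valid_utf8(truncate_chars(word, 3)))  # character cut stays valid
```

                  <p>A width given in words avoids the problem because a
                    cut at a token boundary can never fall inside a
                    multi-byte character.</p>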

                  <p class="MsoNormal"><span

style="font-size:10.0pt;font-family:&quot;Verdana&quot;,&quot;sans-serif&quot;;color:#1f497d"> </span></p>

                  <p class="MsoNormal"><span

style="font-size:10.0pt;font-family:&quot;Verdana&quot;,&quot;sans-serif&quot;;color:#1f497d">I

                      have been working on a patch to fix this, but have

                      not completed it yet.</span></p>

                  <p class="MsoNormal"><span

style="font-size:10.0pt;font-family:&quot;Verdana&quot;,&quot;sans-serif&quot;;color:#1f497d"> </span></p>

                  <p class="MsoNormal"><span

style="font-size:10.0pt;font-family:&quot;Verdana&quot;,&quot;sans-serif&quot;;color:#1f497d">Andrew.</span></p>

                  <p class="MsoNormal"><span

style="font-size:10.0pt;font-family:&quot;Verdana&quot;,&quot;sans-serif&quot;;color:#1f497d"> </span></p>

                  <div>

                    <div style="border:none;border-top:solid #b5c4df

                      1.0pt;padding:3.0pt 0cm 0cm 0cm">

                      <p class="MsoNormal"><b><span

style="font-size:10.0pt;font-family:&quot;Tahoma&quot;,&quot;sans-serif&quot;;color:windowtext"

                            lang="EN-US">From:</span></b><span

style="font-size:10.0pt;font-family:&quot;Tahoma&quot;,&quot;sans-serif&quot;;color:windowtext"

                          lang="EN-US"> <a moz-do-not-send="true"

                            href="mailto:cwb-bounces@sslmit.unibo.it"

                            target="_blank">cwb-bounces@sslmit.unibo.it</a>

                          [<a moz-do-not-send="true"

                            href="mailto:cwb-bounces@sslmit.unibo.it"

                            target="_blank">mailto:cwb-bounces@sslmit.unibo.it</a>]

                          <b>On Behalf Of </b>Ruprecht von Waldenfels<br>

                          <b>Sent:</b> 10 March 2015 09:54<br>

                          <b>To:</b> <a moz-do-not-send="true"

                            href="mailto:cwb@sslmit.unibo.it"

                            target="_blank">cwb@sslmit.unibo.it</a><br>

                          <b>Subject:</b> [CWB] unicode problems with

                          Greek and OCS</span></p>

                    </div>

                  </div>

                  <p class="MsoNormal"> </p>

                  <p class="MsoNormal" style="margin-bottom:12.0pt">Dear

                    List,<br>

                    <br>

                    I am using CWB 3.4.8 on 64 bit Ubuntu 14.10.<br>

                    After encoding a text in Old Church Slavonic, I get

                    invalid UTF-8 character errors; I seem to get them

                    only in sgml mode (I also get them during alignment

                    with the Ancient Greek translation source, which

                    might be a related problem, but I am not sure.)<br>

                    <br>

                    In order to pinpoint the problem with the Old Church

                    Slavonic text, I have reduced the text in question

                    to two bible verses. The text can be found here: <a

                      moz-do-not-send="true"

                      href="http://www.parasolcorpus.org/test.txt"

                      target="_blank">www.parasolcorpus.org/test.txt</a><br>

                    <br>

                    I encode the corpus with the following commands:<br>

                    /opt/CWBUTF8/cwb/utils/cwb-encode -d

                    Data/ntestament_tt -f test.txt -R

                    /data/PROIEL/Registry/ntestament_tt -c utf8 -xsB -P

                    lemma -P id -P alig -P pos -P tag -S aligVerse:0<br>

                    /opt/CWBUTF8/cwb/utils/cwb-makeall -r

                    /data/PROIEL/Registry NTESTAMENT_TT<br>

                    <br>

                    There is no problem in text mode:<br>

                    <br>

                    <img src="cid:part6.01060805.09070608@gmx.net"

                      height="302" width="653" border="0"><br>

                    <br>

                    However, in sgml mode, some lemmas get truncated and

                    do not contain valid utf8 anymore. For example, the

                    lemma of "с҃вщаѩи" is such a token. This problem

                    does NOT appear if I search for this token itself,

                    it ONLY and consistently appears if I search for a

                    different token and the problematic token is in the

                    result set:<br>

                    <img src="cid:part7.09050408.03090109@gmx.net"

                      height="397" width="654" border="0"><br>

                    <br>

                    To sum up: I get the problem only if I search for a
                    neighboring token in sgml mode. I don't get it if I
                    search for the token itself, and I don't get it in
                    text mode. I have reduced the problem to a 50-token
                    text, and the problem persists.<br>

                    <br>

                    Any help would be greatly appreciated!<br>

                    Best, <br>

                    Ruprecht<br>

                    <br>

                    <br>

                    <br>

                  </p>
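                  <p>When narrowing down a problem like this one, it can
                    help to rule out the input file first. The Python sketch
                    below is a hypothetical helper, not part of CWB: it
                    reports every line of a file that is not valid UTF-8.
                    The demo writes its own two-line sample with one
                    deliberately broken line; in practice the function would
                    be pointed at the file passed to cwb-encode.</p>

```python
import os
import tempfile


def find_invalid_utf8_lines(path):
    """Return (line_number, error_message) pairs for lines that are not valid UTF-8."""
    problems = []
    with open(path, "rb") as handle:
        for number, raw in enumerate(handle, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError as err:
                problems.append((number, str(err)))
    return problems


def demo():
    """Write a two-line sample with one deliberately broken line and scan it."""
    good = "\u0441\u043b\u043e\u0432\u043e\tlemma\n".encode("utf-8")
    bad = b"\xd1\tlemma\n"  # lone lead byte, i.e. a character cut in half
    with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
        tmp.write(good + bad)
    try:
        return find_invalid_utf8_lines(tmp.name)
    finally:
        os.remove(tmp.name)


print([number for number, _ in demo()])
```

                  <p>If the input passes this check, the invalid bytes
                    must be introduced later, for example when a string is
                    cut for display.</p>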

                </div>

                <br>

                <fieldset></fieldset>

                <br>

                <pre>_______________________________________________

CWB mailing list

<a moz-do-not-send="true" href="mailto:CWB@sslmit.unibo.it" target="_blank">CWB@sslmit.unibo.it</a>

<a moz-do-not-send="true" href="http://devel.sslmit.unibo.it/mailman/listinfo/cwb" target="_blank">http://devel.sslmit.unibo.it/mailman/listinfo/cwb</a>

</pre>

              </blockquote>

              <br>

              <br>


            </blockquote>

            <br>

          </div>

          <br>


          <br>

        </blockquote>

      </div>

      <br>


    </blockquote>

    <br>

  </body>

</html>