<html>
  <head>
    <meta content="text/html; charset=UTF-8" http-equiv="Content-Type">
  </head>
  <body text="#000000" bgcolor="#FFFFFF">
    <div class="moz-cite-prefix">Hi Andrew,<br>
      YES! This does solve the problem. I had assumed this setting
      would only affect tokens, not the lemma attribute, but I now
      see that this assumption was wrong. Thank you!<br>
      I will now look at the other problem, which turns out to be
      unrelated. <br>
      Thanks A LOT!<br>
      Ruprecht<br>
      On 10.03.2015 at 12:02, Hardie, Andrew wrote:<br>
    </div>
    <blockquote
      cite="mid:28078EC3FBF1B940A3EF3D0D19BE351D329FC54F@EX-0-MB1.lancs.local"
      type="cite">
      <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
      <meta name="Generator" content="Microsoft Word 14 (filtered
        medium)">
      <style><!--
/* Font Definitions */
@font-face
        {font-family:Calibri;
        panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
        {font-family:Tahoma;
        panose-1:2 11 6 4 3 5 4 4 2 4;}
@font-face
        {font-family:Verdana;
        panose-1:2 11 6 4 3 5 4 4 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
        {margin:0cm;
        margin-bottom:.0001pt;
        font-size:12.0pt;
        font-family:"Times New Roman","serif";
        color:black;}
a:link, span.MsoHyperlink
        {mso-style-priority:99;
        color:blue;
        text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
        {mso-style-priority:99;
        color:purple;
        text-decoration:underline;}
span.EmailStyle17
        {mso-style-type:personal-reply;
        font-family:"Verdana","sans-serif";
        color:#1F497D;}
.MsoChpDefault
        {mso-style-type:export-only;
        font-size:10.0pt;}
@page WordSection1
        {size:612.0pt 792.0pt;
        margin:72.0pt 72.0pt 72.0pt 72.0pt;}
div.WordSection1
        {page:WordSection1;}
--></style>
      <div class="WordSection1">
        <p class="MsoNormal"><span
style="font-size:10.0pt;font-family:&quot;Verdana&quot;,&quot;sans-serif&quot;;color:#1F497D">Is
            the context size measured in characters? If so, that would
            explain the problem, since “characters” are still counted as
            bytes, so a multi-byte UTF-8 character can be cut off in the
            middle.<o:p></o:p></span></p>
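        <p class="MsoNormal"><span
style="font-size:10.0pt;font-family:&quot;Verdana&quot;,&quot;sans-serif&quot;;color:#1F497D">To
            illustrate the suspected mechanism, here is a minimal Python
            sketch. It is not CWB or CQP code; the helper names
            (cut_by_bytes, cut_by_chars, is_valid_utf8) and the width of
            5 are made up for the example. It only shows that cutting a
            UTF-8 string at a byte offset can split a two-byte Cyrillic
            character, whereas cutting at a character offset cannot.</span></p>
<pre>
```python
# Hypothetical illustration only; none of these helpers exist in CWB.

def cut_by_bytes(text, width):
    """Encode to UTF-8, then keep the first `width` bytes."""
    return text.encode("utf-8")[:width]


def cut_by_chars(text, width):
    """Keep the first `width` code points, then encode to UTF-8."""
    return text[:width].encode("utf-8")


def is_valid_utf8(data):
    """Return True if `data` decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The token quoted in the report, written with escapes so the code
# points are unambiguous. Each one takes two bytes in UTF-8.
sample = "\u0441\u0483\u0432\u0449\u0430\u0469\u0438"

if __name__ == "__main__":
    # An odd byte width ends in the middle of a two-byte character.
    print(is_valid_utf8(cut_by_bytes(sample, 5)))
    # The same width counted in code points always ends on a boundary.
    print(is_valid_utf8(cut_by_chars(sample, 5)))
```
</pre>
        <p class="MsoNormal"><span
style="font-size:10.0pt;font-family:&quot;Verdana&quot;,&quot;sans-serif&quot;;color:#1F497D">If
            the context really is cut by bytes, this would also be
            consistent with the error showing up only when the affected
            token sits at the edge of the displayed context.</span></p>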
        <p class="MsoNormal"><span
style="font-size:10.0pt;font-family:&quot;Verdana&quot;,&quot;sans-serif&quot;;color:#1F497D"><o:p> </o:p></span></p>
        <p class="MsoNormal"><span
style="font-size:10.0pt;font-family:&quot;Verdana&quot;,&quot;sans-serif&quot;;color:#1F497D">If
            changing the context width to a given number of words fixes
            the issue, then that is the solution.<o:p></o:p></span></p>
        <p class="MsoNormal"><span
style="font-size:10.0pt;font-family:&quot;Verdana&quot;,&quot;sans-serif&quot;;color:#1F497D"><o:p> </o:p></span></p>
        <p class="MsoNormal"><span
style="font-size:10.0pt;font-family:&quot;Verdana&quot;,&quot;sans-serif&quot;;color:#1F497D">I
            have been working on a patch to fix this, but have not
            completed it yet.<o:p></o:p></span></p>
        <p class="MsoNormal"><span
style="font-size:10.0pt;font-family:&quot;Verdana&quot;,&quot;sans-serif&quot;;color:#1F497D"><o:p> </o:p></span></p>
        <p class="MsoNormal"><span
style="font-size:10.0pt;font-family:&quot;Verdana&quot;,&quot;sans-serif&quot;;color:#1F497D">Andrew.<o:p></o:p></span></p>
        <p class="MsoNormal"><span
style="font-size:10.0pt;font-family:&quot;Verdana&quot;,&quot;sans-serif&quot;;color:#1F497D"><o:p> </o:p></span></p>
        <div>
          <div style="border:none;border-top:solid #B5C4DF
            1.0pt;padding:3.0pt 0cm 0cm 0cm">
            <p class="MsoNormal"><b><span
style="font-size:10.0pt;font-family:&quot;Tahoma&quot;,&quot;sans-serif&quot;;color:windowtext"
                  lang="EN-US">From:</span></b><span
style="font-size:10.0pt;font-family:&quot;Tahoma&quot;,&quot;sans-serif&quot;;color:windowtext"
                lang="EN-US"> <a class="moz-txt-link-abbreviated" href="mailto:cwb-bounces@sslmit.unibo.it">cwb-bounces@sslmit.unibo.it</a>
                [<a class="moz-txt-link-freetext" href="mailto:cwb-bounces@sslmit.unibo.it">mailto:cwb-bounces@sslmit.unibo.it</a>] <b>On Behalf Of </b>Ruprecht
                von Waldenfels<br>
                <b>Sent:</b> 10 March 2015 09:54<br>
                <b>To:</b> <a class="moz-txt-link-abbreviated" href="mailto:cwb@sslmit.unibo.it">cwb@sslmit.unibo.it</a><br>
                <b>Subject:</b> [CWB] unicode problems with Greek and
                OCS<o:p></o:p></span></p>
          </div>
        </div>
        <p class="MsoNormal"><o:p> </o:p></p>
        <p class="MsoNormal" style="margin-bottom:12.0pt">Dear List,<br>
          <br>
          I am using CWB 3.4.8 on 64-bit Ubuntu 14.10.<br>
          After encoding a text in Old Church Slavonic, I get invalid
          UTF-8 character errors, apparently only in sgml mode. (I also
          get them during alignment with the Ancient Greek translation
          source, which might be a related problem, but I am not
          sure.)<br>
          <br>
          In order to pinpoint the problem with the Old Church Slavonic
          text, I have reduced the text in question to two bible verses.
          The text can be found here:
          <a moz-do-not-send="true"
            href="http://www.parasolcorpus.org/test.txt">www.parasolcorpus.org/test.txt</a><br>
          <br>
          I encode the corpus with the following commands:<br>
          /opt/CWBUTF8/cwb/utils/cwb-encode -d Data/ntestament_tt -f
          test.txt -R /data/PROIEL/Registry/ntestament_tt -c utf8 -xsB
          -P lemma -P id -P alig -P pos -P tag -S aligVerse:0<br>
          /opt/CWBUTF8/cwb/utils/cwb-makeall -r /data/PROIEL/Registry
          NTESTAMENT_TT<br>
          <br>
          There is no problem in text mode:<br>
          <br>
          <img id="_x0000_i1025"
            src="cid:part2.04040102.05030605@gmx.net" height="302"
            width="653" border="0"><br>
          <br>
          However, in sgml mode, some lemmas get truncated and no longer
          contain valid UTF-8. For example, the lemma of the token
          "с҃вщаѩи" is affected. This problem does NOT appear if I
          search for this token itself; it appears ONLY, and
          consistently, if I search for a different token and the
          problematic token is in the result set:<br>
          <img id="_x0000_i1026"
            src="cid:part3.00090807.03010201@gmx.net" height="397"
            width="654" border="0"><br>
          <br>
          To sum up: I get the problem only if I search for a
          neighboring token in sgml mode. I don't get it if I search for
          the token itself, and I don't get it in text mode. I have
          reduced the problem to a 50-token text, and the problem
          persists.<br>
          <br>
          Any help would be greatly appreciated!<br>
          Best, <br>
          Ruprecht<br>
          <br>
          <br>
          <br>
          <o:p></o:p></p>
      </div>
      <br>
      <fieldset class="mimeAttachmentHeader"></fieldset>
      <br>
      <pre wrap="">_______________________________________________
CWB mailing list
<a class="moz-txt-link-abbreviated" href="mailto:CWB@sslmit.unibo.it">CWB@sslmit.unibo.it</a>
<a class="moz-txt-link-freetext" href="http://devel.sslmit.unibo.it/mailman/listinfo/cwb">http://devel.sslmit.unibo.it/mailman/listinfo/cwb</a>
</pre>
    </blockquote>
    <br>
  </body>
</html>