<div dir="ltr">Thanks Graham, and thanks again, Stefan. I just finished tagging a test corpus with FreeLing and encoding it with CWB, and everything worked splendidly.<div><br></div><div>FreeLing produces UTF-8 output, but as far as I can tell CWB 3.4.9 deals with it just fine using the <font face="monospace, monospace">-c utf8</font> option. Are there any gotchas I should know about with this encoding?</div><div><br></div><div>Finally, I&#39;m encoding a fair number of S attributes that describe the source of the texts, the genre and so on, with the idea of making one big corpus in which ad hoc sub-corpora can easily be queried (say, all newspapers, or all forum posts, or a certain magazine, or whatever). I&#39;ve found a bit of info on page 26 of the CQP Query Language Tutorial that you pointed me to. Is there anything else out there that might be of use for this particular purpose?</div><div><br></div><div>Cheers,</div><div>Scott</div><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Jul 18, 2016 at 2:41 AM, Stefan Evert <span dir="ltr">&lt;<a href="mailto:stefanML@collocations.de" target="_blank">stefanML@collocations.de</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class=""><br>

&gt; The one thing I can&#39;t get rid of are the empty lines between each line of verticalized text. I&#39;d banged my head against this issue before, until I finally realized that all the tools involved (sed and such) work on a line-by-line basis, and so you apparently can&#39;t process \n\n in order to convert it to just \n like this -- I&#39;d have to write the file and then read it all into memory, perform that operation and then write it again. Big performance hit there!<br>

<br>

</span>As Graham suggested, for line-by-line processing you can recognize empty lines with the regexp /^$/  (in Perl, I often use /^\s*$/ so I don&#39;t stumble over a few stray blanks) and then simply skip printing them in the output.<br>
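<br>
A minimal sketch of that line-by-line filter, written in Python rather than Perl and assuming the tagger output arrives on standard input (the function name drop_blank_lines is only illustrative):<br>

```python
import re
import sys

# Same pattern as /^\s*$/: a line that is empty or holds only whitespace.
BLANK = re.compile(r"^\s*$")


def drop_blank_lines(lines):
    """Yield only the lines that contain something other than whitespace."""
    for line in lines:
        if not BLANK.match(line):
            yield line


if __name__ == "__main__":
    # Lines are read and written one at a time, so the file is never
    # held in memory as a whole.
    sys.stdout.writelines(drop_blank_lines(sys.stdin))
```

Because it only ever looks at one line at a time, a filter like this can sit in a pipe between the tagger and the encoder without writing an intermediate file.<br>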

<span class=""><br>

&gt; Below is what my script is currently outputting. Is it valid CWB input text?<br>

<br>

</span>Looks good to me. Just make sure to pass the -s and -B flags to cwb-encode: -s makes it skip empty lines, and -B strips leading and trailing blanks.<br>

<span class=""><br>

&gt; By the way, while I&#39;m here, what&#39;s the best and most up to date info (tutorials, manuals, etc.) on encoding with CWB?<br>

<br>

</span>The official manuals are the &quot;tutorials&quot; you can find at<br>

<br>

        <a href="http://cwb.sourceforge.net/documentation.php" rel="noreferrer" target="_blank">http://cwb.sourceforge.net/documentation.php</a><br>

<br>

They are slightly out of date, but we haven&#39;t added that much in the meantime.  You can also download PDFs of the latest versions directly from the SVN repository:<br>

<br>

        <a href="https://sourceforge.net/p/cwb/code/HEAD/tree/doc/tutorials/CQP_Tutorial.pdf?format=raw" rel="noreferrer" target="_blank">https://sourceforge.net/p/cwb/code/HEAD/tree/doc/tutorials/CQP_Tutorial.pdf?format=raw</a><br>

<br>

        <a href="https://sourceforge.net/p/cwb/code/HEAD/tree/doc/tutorials/CWB_Encoding_Tutorial.pdf?format=raw" rel="noreferrer" target="_blank">https://sourceforge.net/p/cwb/code/HEAD/tree/doc/tutorials/CWB_Encoding_Tutorial.pdf?format=raw</a><br>

<br>

Best,<br>

Stefan</blockquote></div><br><br>

</div></div>