<div dir="ltr">Thanks, Stefan and Vladimír!<div><br></div><div>I'm dealing with about 1.3 million files, so efficiency is an issue here. I've managed to write a bash script that both invokes the FreeLing analyzer and edits the heck out of its output before writing it to disk, giving me what I <i>think</i> is acceptable input for CWB. </div><div><br></div><div>The one thing I can't get rid of is the empty line after each line of verticalized text. I'd banged my head against this issue before, until I finally realized that all the tools involved (sed and such) work on a line-by-line basis, and so you apparently can't process \n\n in order to convert it to just \n. I'd have to write the file, then read it all into memory, perform that operation, and write it again. Big performance hit there!</div><div><br></div><div>Below is what my script is currently outputting. Is it valid CWB input text?</div><div><br></div><div>By the way, while I'm here, what's the best and most up-to-date info (tutorials, manuals, etc.) 
on encoding with CWB?</div><div><br></div><div>Thanks!</div><div>Scott</div><div><br></div><div><div><font face="monospace, monospace" size="1"><text corpus="test" label="PROF-ACAD-CCSS" mode="professional" genre="academic" field="social sciences" source="misc"></font></div><div><font face="monospace, monospace" size="1"><s></font></div><div><font face="monospace, monospace" size="1">La<span class="" style="white-space:pre">        </span>el<span class="" style="white-space:pre">        </span>DA0FS0<span class="" style="white-space:pre">        </span>DA<span class="" style="white-space:pre">        </span>determiner<span class="" style="white-space:pre">        </span>article</font></div><div><font face="monospace, monospace" size="1"><br></font></div><div><font face="monospace, monospace" size="1">abogada<span class="" style="white-space:pre">        </span>abogado<span class="" style="white-space:pre">        </span>NCFS000<span class="" style="white-space:pre">        </span>NC<span class="" style="white-space:pre">        </span>noun<span class="" style="white-space:pre">        </span>common</font></div><div><font face="monospace, monospace" size="1"><br></font></div><div><font face="monospace, monospace" size="1">y<span class="" style="white-space:pre">        </span>y<span class="" style="white-space:pre">        </span>CC<span class="" style="white-space:pre">        </span>CC<span class="" style="white-space:pre">        </span>conjunction<span class="" style="white-space:pre">        </span>coordinating</font></div><div><font face="monospace, monospace" size="1"><br></font></div><div><font face="monospace, monospace" size="1">ex<span class="" style="white-space:pre">        </span>ex<span class="" style="white-space:pre">        </span>AQ0CN00<span class="" style="white-space:pre">        </span>AQ<span class="" style="white-space:pre">        </span>adjective<span class="" style="white-space:pre">        </span>qualificative</font></div><div><font 
face="monospace, monospace" size="1"><br></font></div><div><font face="monospace, monospace" size="1">fiscal<span class="" style="white-space:pre">        </span>fiscal<span class="" style="white-space:pre">        </span>NCCS000<span class="" style="white-space:pre">        </span>NC<span class="" style="white-space:pre">        </span>noun<span class="" style="white-space:pre">        </span>common</font></div><div><font face="monospace, monospace" size="1"><br></font></div><div><font face="monospace, monospace" size="1"><br></font></div><div><font face="monospace, monospace" size="1"></s></font></div><div><font face="monospace, monospace" size="1"><s></font></div><div><font face="monospace, monospace" size="1">La<span class="" style="white-space:pre">        </span>el<span class="" style="white-space:pre">        </span>DA0FS0<span class="" style="white-space:pre">        </span>DA<span class="" style="white-space:pre">        </span>determiner<span class="" style="white-space:pre">        </span>article</font></div><div><font face="monospace, monospace" size="1"><br></font></div><div><font face="monospace, monospace" size="1">secretaria<span class="" style="white-space:pre">        </span>secretario<span class="" style="white-space:pre">        </span>NCFS000<span class="" style="white-space:pre">        </span>NC<span class="" style="white-space:pre">        </span>noun<span class="" style="white-space:pre">        </span>common</font></div><div><font face="monospace, monospace" size="1"><br></font></div><div><font face="monospace, monospace" size="1">de<span class="" style="white-space:pre">        </span>de<span class="" style="white-space:pre">        </span>SP<span class="" style="white-space:pre">        </span>SP<span class="" style="white-space:pre">        </span>adposition<span class="" style="white-space:pre">        </span>preposition</font></div><div><font face="monospace, monospace" size="1"><br></font></div><div><font face="monospace, monospace" 
size="1">Estado<span class="" style="white-space:pre">        </span>estado<span class="" style="white-space:pre">        </span>NCMS000<span class="" style="white-space:pre">        </span>NC<span class="" style="white-space:pre">        </span>noun<span class="" style="white-space:pre">        </span>common</font></div><div><font face="monospace, monospace" size="1"><br></font></div><div><font face="monospace, monospace" size="1"></s></font></div><div><font face="monospace, monospace" size="1"><br></font></div><div><font face="monospace, monospace" size="1"></text></font></div><div><br></div><div><br></div><div><br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Sun, Jul 17, 2016 at 6:17 AM, Stefan Evert <span dir="ltr"><<a href="mailto:stefanML@collocations.de" target="_blank">stefanML@collocations.de</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">The answer to both questions is: not directly, but it's easy to write a small pre-processing script. I'm sure that many CWB users have written similar scripts over the years and someone may be willing to share a script that works with the FreeLing output format.<br>
<span class=""><br>
<br>
> 1. FreeLing's plain text vertical output separates sentences with a blank line, rather than enclosing them in any sort of tag (e.g. <s>...</s>). Can CWB be configured to recognize this type of sentence encoding?<br>
<br>
</span>Simply write a script that inserts a start tag <s> at the beginning of the corpus and then replaces every blank line with<br>
<br>
</s><br>
<s><br>
<br>
(plus the final close tag </s> at the end of the text).<br>
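<br>
A minimal Python sketch of that idea, not a tested FreeLing converter. The function name wrap_sentences is made up for illustration. It is a slight variant of the literal replacement described above: it opens <s> only when the first token of a sentence arrives, so runs of several blank lines do not produce empty sentences. It works line by line, so nothing has to be read into memory as a whole.<br>

```python
def wrap_sentences(lines):
    """Yield lines with <s> ... </s> around each blank-line-separated block.

    `lines` is any iterable of strings (e.g. an open file or sys.stdin).
    Blank lines are dropped; they only mark sentence boundaries.
    """
    open_sentence = False
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.strip():
            # First token of a new sentence: emit the start tag first.
            if not open_sentence:
                yield "<s>"
                open_sentence = True
            yield line
        elif open_sentence:
            # Blank line after at least one token closes the sentence.
            yield "</s>"
            open_sentence = False
    # Close the last sentence if the input did not end with a blank line.
    if open_sentence:
        yield "</s>"
```

As a filter this could be used roughly as: for line in wrap_sentences(sys.stdin): print(line)<br>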
<span class=""><br>
> 2. FreeLing's XML output looks a lot more complex than what I see in tutorials. It has more attributes, which shouldn't be a problem, but it also encodes each line in XML, as seen below. Can CWB be used with this?<br>
<br>
</span>This is an XML format, not one-token-per-line with XML tags as in CWB's input format. The best strategy is perhaps to write a simple Perl or Python script that parses <token> lines with a regular expression and prints the relevant information in TAB-delimited format. CWB should then be able to handle the structural XML tags as s-attributes.<br>
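<br>
A rough Python sketch of such a script follows. The attribute names form, lemma and tag are assumptions about the FreeLing XML and should be checked against the actual output; the function name token_to_tsv and the "_" placeholder for missing values are likewise only illustrative.<br>

```python
import re

# Matches name="value" pairs inside a tag.
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def token_to_tsv(line, columns=("form", "lemma", "tag"), missing="_"):
    """Convert one <token ...> line to a TAB-delimited line.

    Returns None for lines that are not <token> elements, so the caller
    can decide whether to pass them through as structural tags.
    The attribute names in `columns` are assumed, not confirmed.
    """
    if not line.lstrip().startswith("<token"):
        return None
    attrs = dict(ATTR_RE.findall(line))
    # Keep the column order fixed; fill gaps with the placeholder.
    return "\t".join(attrs.get(name, missing) for name in columns)
```

Lines for which this returns None (sentence tags and the like) would still need to be mapped to whatever structural tags the corpus should have.<br>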
<br>
Best,<br>
Stefan<br>
<br>
</blockquote></div><br>
</div></div>