<br><br><div class="gmail_quote">2012/10/11 Stefan Evert <span dir="ltr">&lt;<a href="mailto:stefanML@collocations.de" target="_blank">stefanML@collocations.de</a>&gt;</span><br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Hi Eva,<br>

<br>

Can you tell us exactly what operating system and version you&#39;re using, and how you have obtained and installed CWB?  If you&#39;re using a pre-compiled binary, please tell us which version you&#39;ve downloaded.<br>


<br></blockquote><div> </div><div>I&#39;m using a server running Debian GNU/Linux 6.0.<br>We downloaded the beta version (cwb-3.4.1).<br>I have created several corpora in several languages and have never had this problem before.<br> <br> </div>

<blockquote class="gmail_quote" style="margin:0pt 0pt 0pt 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

We also need to know exactly which commands you entered to index and compress the corpus, plus the output from each of these commands.  Perhaps this will allow us to make a guess at the source of the error.<br>

<br></blockquote><div><br>The command I use is:<br> cat $SOURCEFILE | /usr/local/cwb-3.4.1/bin/cwb-encode -c utf8 -d $DATADIR -R $REGDIR/$CORPUSNAME -xsB -P lema -P pos -V s  -S doc:0+type+title -S not:0+text<br><br>This is the output (after correcting the errors you mentioned):<br>

<br>=== Makeall: processing corpus LATIN ===<br>Registry directory: /B_NFS_P/resources/corpora/written/registry/<br>ATTRIBUTE word<br> - lexicon      OK<br> - frequencies  OK<br> - token stream OK (COMPRESSED)<br> - index        OK (COMPRESSED)<br>

ATTRIBUTE lema<br> - lexicon      OK<br> - frequencies  OK<br> - token stream OK (COMPRESSED)<br> - index        OK (COMPRESSED)<br>ATTRIBUTE pos<br> - lexicon      OK<br> - frequencies  OK<br> - token stream OK (COMPRESSED)<br>

 - index        OK (COMPRESSED)<br>========================================<br>COMPRESSING TOKEN STREAM of LATIN.word<br>- writing code descriptor block to /B_NFS_P/resources/corpora/written/data/latin/word.hcd<br>- writing compressed item sequence to /B_NFS_P/resources/corpora/written/data/latin/word.huf<br>

- writing sync (every 128 tokens) to /B_NFS_P/resources/corpora/written/data/latin/word.huf.syn<br>VALIDATING LATIN.word<br>- reading code descriptor block from /B_NFS_P/resources/corpora/written/data/latin/word.hcd<br>- reading compressed item sequence from /B_NFS_P/resources/corpora/written/data/latin/word.huf<br>

- reading sync (mod 128) from /B_NFS_P/resources/corpora/written/data/latin/word.huf.syn<br>!! You can delete the file &lt;/B_NFS_P/resources/corpora/written/data/latin/word.corpus&gt; now.<br>COMPRESSING TOKEN STREAM of LATIN.lema<br>

Error: Huffman codes too long (33 bits, current maximum is 31 bits).<br>       Please contact the CWB development team for assistance.<br>COMPRESSING INDEX of LATIN.word<br>- writing compressed index to /B_NFS_P/resources/corpora/written/data/latin/word.crc<br>

- writing compressed index offsets to /B_NFS_P/resources/corpora/written/data/latin/word.crx<br>VALIDATING LATIN.word<br>- reading compressed index from /B_NFS_P/resources/corpora/written/data/latin/word.crc<br>- reading compressed index offsets from /B_NFS_P/resources/corpora/written/data/latin/word.crx<br>

!! You can delete the file &lt;/B_NFS_P/resources/corpora/written/data/latin/word.corpus.rev&gt; now.<br>!! You can delete the file &lt;/B_NFS_P/resources/corpora/written/data/latin/word.corpus.rdx&gt; now.<br>COMPRESSING INDEX of LATIN.lema<br>

- writing compressed index to /B_NFS_P/resources/corpora/written/data/latin/lema.crc<br>- writing compressed index offsets to /B_NFS_P/resources/corpora/written/data/latin/lema.crx<br>CL: index is out of range: (aborting) token frequency == 0<br>

<br>=== Makeall: processing corpus LATIN ===<br>Registry directory: /B_NFS_P/resources/corpora/written/registry/<br>ATTRIBUTE word<br> - lexicon      OK<br> - frequencies  OK<br> - token stream OK (COMPRESSED)<br> - index        OK (COMPRESSED)<br>

ATTRIBUTE lema<br> - lexicon      OK<br> - frequencies  OK<br> - token stream OK (COMPRESSED)<br> - index        OK (COMPRESSED)<br>ATTRIBUTE pos<br> - lexicon      OK<br> - frequencies  OK<br> - token stream OK (COMPRESSED)<br>

 - index        OK (COMPRESSED)<br>========================================<br><br>Thanks <br><br>Eva<br><br></div></div>