Hi Stefan,<br>I had mixed up the versions on the server and on my computer.<br>
I was using an old script that called cwb-makeall, cwb-huffcode and
cwb-compress-rdx, followed by a second cwb-makeall (I think it was there to
check that everything was OK).<br>
I changed to cwb-make and now it works. So the error must have been related to old files as you pointed out.<br><br>Many thanks for your help<br><br>Eva Bofias<br><br><div class="gmail_quote">2012/10/12 Stefan Evert <span dir="ltr"><<a href="mailto:stefanML@collocations.de" target="_blank">stefanML@collocations.de</a>></span><br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im"><br>
On 11 Oct 2012, at 17:43, BOFÍAS ALBERCH, EVA wrote:<br>
<br>
> I'm using a server: Debian GNU/Linux 6.0<br>
> We downloaded the beta version (cwb-3.4.1 )<br>
<br>
</div>In a previous mail you stated that you're using CWB 3.0.2 -- is it possible that you've mixed up two different versions? However, file formats should be fully compatible between 3.0.x and 3.4.x, so this is unlikely to be the cause of your problems.<br>
<div class="im"><br></div></blockquote><div><br><br> </div><blockquote class="gmail_quote" style="margin:0pt 0pt 0pt 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div class="im">
> We also need to know exactly which commands you entered to index and compress the corpus, plus the output from each of these commands. Perhaps this will allow us to make a guess at the source of the error.<br>
><br>
> the command I use is:<br>
> cat $SOURCEFILE | /usr/local/cwb-3.4.1/bin/cwb-encode -c utf8 -d $DATADIR -R $REGDIR/$CORPUSNAME -xsB -P lema -P pos -V s -S doc:0+type+title -S not:0+text<br>
<br>
</div>That can't be all you're doing.<br>
<br>
For one thing, you need to define the shell variables SOURCEFILE, DATADIR, etc. for this command to do anything sensible.<br>
<br>
More importantly, this command only runs cwb-encode, which is the first step of the indexing process. You still need to run cwb-makeall (to build the actual index structures) and cwb-huffcode and cwb-compress-rdx (to compress the index files, which is where your error occurs).<br>
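As a rough sketch of the three steps named above, the sequence after cwb-encode might look as follows. The corpus name and registry path are taken from the output quoted in this thread; the run helper, the DRY_RUN switch and the function name index_and_compress are hypothetical additions so that the commands are only printed by default. The -r, -V and -A flags are given as I understand them, so please check each tool's -h output for your CWB version.<br>

```shell
# Values taken from the output quoted in this thread; adjust to your setup.
CORPUS=LATIN
REGDIR=/B_NFS_P/resources/corpora/written/registry/

# Hypothetical helper: with DRY_RUN=1 (the default here) a command is
# only printed; set DRY_RUN=0 to actually execute it.
DRY_RUN=${DRY_RUN:-1}
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Build the index, then compress token stream and index for all attributes.
# Each step runs only if the previous one succeeded.
index_and_compress() {
    run cwb-makeall -r "$REGDIR" -V "$CORPUS" &&
    run cwb-huffcode -r "$REGDIR" -A "$CORPUS" &&
    run cwb-compress-rdx -r "$REGDIR" -A "$CORPUS"
}

index_and_compress
```

Chaining the steps with && means a failure in cwb-huffcode stops the script instead of letting cwb-compress-rdx run on incomplete data.<br>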
<br>
The output you sent us (as shown below) stems from these programs, so you must be running those additional commands in some way!<br>
<br>
There are two strange things about the output:<br>
<br>
1) You seem to run cwb-makeall twice, once before compressing and once after. There's no need to run cwb-makeall a second time -- why do you do that?<br>
<br>
2) The output from the first cwb-makeall run indicates that the index structures have already been created _and_ compressed (it just says "OK" rather than "creating ..."). Those might be stale, damaged files from a previous encoding run. Did you forget to clean the data directory /B_NFS_P/resources/corpora/written/data/latin/ before re-running cwb-encode? It's quite possible that your error is due to damaged index files still lying around ...<br>
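One cautious way to clear stale files before re-running cwb-encode is sketched below. The function name clean_datadir is hypothetical, and the sketch assumes the data directory contains nothing except the index files of this one corpus, so check the path carefully before using anything like it.<br>

```shell
# Hypothetical helper: remove all regular files directly inside a corpus
# data directory, keeping the directory itself and any subdirectories.
clean_datadir() {
    dir=$1
    # Refuse to do anything if the argument is empty or not a directory.
    if [ -z "$dir" ] || [ ! -d "$dir" ]; then
        echo "no such data directory: $dir" >&2
        return 1
    fi
    find "$dir" -maxdepth 1 -type f -exec rm -f {} +
}
```

It would be called as clean_datadir "$DATADIR" immediately before the cwb-encode command, with DATADIR set to the same directory that is passed to cwb-encode -d.<br>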
<br>
By the way, this is a good reason why you should use cwb-make from the CWB/Perl modules rather than calling cwb-makeall etc. directly. cwb-make would recognise that the index files are out of date and automatically delete and rebuild them.<br>
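A minimal sketch of that alternative, assuming the CWB/Perl modules are installed and cwb-make is on the PATH: a single call takes the place of the separate cwb-makeall, cwb-huffcode and cwb-compress-rdx steps. The run helper and DRY_RUN switch are hypothetical and only print the command by default; the -r and -V flags are given as I understand them, so check cwb-make -h on your system.<br>

```shell
# Values taken from the output quoted in this thread; adjust to your setup.
CORPUS=LATIN
REGDIR=/B_NFS_P/resources/corpora/written/registry/

# Hypothetical helper: with DRY_RUN=1 (the default here) a command is
# only printed; set DRY_RUN=0 to actually execute it.
DRY_RUN=${DRY_RUN:-1}
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# One call that indexes and compresses the corpus, run after cwb-encode.
run cwb-make -r "$REGDIR" -V "$CORPUS"
```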
<br>
Best,<br>
Stefan<br>
<div class="HOEnZb"><div class="h5"><br>
<br>
><br>
> This is the output (after correcting the errors you mentioned):<br>
><br>
> === Makeall: processing corpus LATIN ===<br>
> Registry directory: /B_NFS_P/resources/corpora/written/registry/<br>
> ATTRIBUTE word<br>
> - lexicon OK<br>
> - frequencies OK<br>
> - token stream OK (COMPRESSED)<br>
> - index OK (COMPRESSED)<br>
> ATTRIBUTE lema<br>
> - lexicon OK<br>
> - frequencies OK<br>
> - token stream OK (COMPRESSED)<br>
> - index OK (COMPRESSED)<br>
> ATTRIBUTE pos<br>
> - lexicon OK<br>
> - frequencies OK<br>
> - token stream OK (COMPRESSED)<br>
> - index OK (COMPRESSED)<br>
> ========================================<br>
> COMPRESSING TOKEN STREAM of LATIN.word<br>
> - writing code descriptor block to /B_NFS_P/resources/corpora/written/data/latin/word.hcd<br>
> - writing compressed item sequence to /B_NFS_P/resources/corpora/written/data/latin/word.huf<br>
> - writing sync (every 128 tokens) to /B_NFS_P/resources/corpora/written/data/latin/word.huf.syn<br>
> VALIDATING LATIN.word<br>
> - reading code descriptor block from /B_NFS_P/resources/corpora/written/data/latin/word.hcd<br>
> - reading compressed item sequence from /B_NFS_P/resources/corpora/written/data/latin/word.huf<br>
> - reading sync (mod 128) from /B_NFS_P/resources/corpora/written/data/latin/word.huf.syn<br>
> !! You can delete the file </B_NFS_P/resources/corpora/written/data/latin/word.corpus> now.<br>
> COMPRESSING TOKEN STREAM of LATIN.lema<br>
> Error: Huffman codes too long (33 bits, current maximum is 31 bits).<br>
> Please contact the CWB development team for assistance.<br>
> COMPRESSING INDEX of LATIN.word<br>
> - writing compressed index to /B_NFS_P/resources/corpora/written/data/latin/word.crc<br>
> - writing compressed index offsets to /B_NFS_P/resources/corpora/written/data/latin/word.crx<br>
> VALIDATING LATIN.word<br>
> - reading compressed index from /B_NFS_P/resources/corpora/written/data/latin/word.crc<br>
> - reading compressed index offsets from /B_NFS_P/resources/corpora/written/data/latin/word.crx<br>
> !! You can delete the file </B_NFS_P/resources/corpora/written/data/latin/word.corpus.rev> now.<br>
> !! You can delete the file </B_NFS_P/resources/corpora/written/data/latin/word.corpus.rdx> now.<br>
> COMPRESSING INDEX of LATIN.lema<br>
> - writing compressed index to /B_NFS_P/resources/corpora/written/data/latin/lema.crc<br>
> - writing compressed index offsets to /B_NFS_P/resources/corpora/written/data/latin/lema.crx<br>
> CL: index is out of range: (aborting) token frequency == 0<br>
><br>
> === Makeall: processing corpus LATIN ===<br>
> Registry directory: /B_NFS_P/resources/corpora/written/registry/<br>
> ATTRIBUTE word<br>
> - lexicon OK<br>
> - frequencies OK<br>
> - token stream OK (COMPRESSED)<br>
> - index OK (COMPRESSED)<br>
> ATTRIBUTE lema<br>
> - lexicon OK<br>
> - frequencies OK<br>
> - token stream OK (COMPRESSED)<br>
> - index OK (COMPRESSED)<br>
> ATTRIBUTE pos<br>
> - lexicon OK<br>
> - frequencies OK<br>
> - token stream OK (COMPRESSED)<br>
> - index OK (COMPRESSED)<br>
> ========================================<br>
><br>
> Thanks<br>
><br>
> Eva<br>
><br>
<br>
</div></div></blockquote></div><br>