On Wed, Jul 24, 2013 at 2:43 AM, Stefan Evert <span dir="ltr"><<a href="mailto:stefanML@collocations.de" target="_blank">stefanML@collocations.de</a>></span> wrote:<div><br></div><div>Dear Stefan,</div><div><br></div>
<div>Thanks so much for your help. The following seems to have fixed the problem:</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
If you have "cwb-make" from the CWB/Perl modules, you can simply trash the ".crc" and ".crx" files (which contain the actual lookup index that appears to be damaged) and rebuild them with<br>
cwb-make [...] PERS-DIVER-USENET</blockquote><div><br></div><div>More testing will be needed to be sure, of course.</div><div><br></div><div>Best wishes,</div><div>Scott</div><div><br><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div class="im"><br>
On 24 Jul 2013, at 04:30, Scott Sadowsky <<a href="mailto:ssadowsky@gmail.com">ssadowsky@gmail.com</a>> wrote:<br>
<br>
> Something very strange is going on. I've replaced my index for this corpus with a third backup copy, and the following happened:<br>
><br>
> PERS-DIVER-USENET> "jai"<br>
> 0 matches.<br>
> PERS-DIVER-USENET> ".+ai"<br>
> Segmentation fault (core dumped)<br>
> Here the search for "jai", which previously caused a segfault, worked. So all seemed good. But the search returned 0 hits, instead of the 1 which is returned by the command cwb-lexdecode -f -p '.ai' PERS-DIVER-USENET. So something isn't adding up here.<br>
<br>
</div>If this is indeed a buffer overflow or something similar, triggered by a faulty index file, it is not surprising that the behaviour is somewhat erratic.<br>
<div class="im"><br>
> I suspect the next step is to rebuild the index from scratch, but that involves decompressing a ZIP file with 1.2 million files inside it, which I'd rather avoid if at all possible.<br>
<br>
</div><br>
<br>
Of course, make sure you have a backup copy of the corpus beforehand.<br>
<br>
You should also be able to rebuild the index files manually with "cwb-makeall" and "cwb-compress-rdx", but those tools sometimes get confused about which files need to be rebuilt in which order.<br>
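<br>
A rough sketch of that manual route, under assumptions not stated in this thread: DATA_DIR is a hypothetical path, the run and rebuild_index helpers are made up for illustration, the default registry is assumed, and the order shown (set the compressed index aside, recreate it with cwb-makeall, compress it again) is one plausible sequence rather than a tested recipe. It only prints the commands unless DRY_RUN=0 is set.<br>

```shell
#!/bin/sh
# Sketch only: DATA_DIR is a hypothetical path; point it at the corpus's data directory.
CORPUS="PERS-DIVER-USENET"
DATA_DIR="${DATA_DIR:-/path/to/data/pers-diver-usenet}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = actually run them

# run: print a command, and execute it only when DRY_RUN=0
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

# rebuild_index: hypothetical helper wrapping the manual rebuild steps
rebuild_index() {
    # 1. Move the possibly damaged compressed index files aside instead of deleting them.
    run mkdir -p "$DATA_DIR/damaged-index"
    for f in "$DATA_DIR"/*.crc "$DATA_DIR"/*.crx; do
        if [ -e "$f" ]; then
            run mv "$f" "$DATA_DIR/damaged-index/"
        fi
    done
    # 2. Recreate the missing index components and validate them (-V).
    run cwb-makeall -V "$CORPUS"
    # 3. Compress the index again for all attributes (-A).
    run cwb-compress-rdx -A "$CORPUS"
}

rebuild_index
```

Moving the files rather than deleting them keeps the damaged index available for later inspection; add -r with the registry directory to both CWB commands if the corpus is not in the default registry.<br>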
<br>
<br>
If you need to try re-encoding from scratch, an easier solution is<br>
<br>
cwb-decode -Cx PERS-DIVER-USENET -ALL | cwb-encode -x [...] &lt;appropriate declarations&gt;<br>
<br>
Note that the attribute declarations in the cwb-encode command will be different from the ones you used for the original encoding, because attributes on XML regions are not decoded in proper XML notation.<br>
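<br>
One way to see which declarations the decoded text calls for is to list the region tags that actually appear in the cwb-decode output before piping it into cwb-encode. The helper below is a hypothetical sketch (list_region_tags is not a CWB tool), and it assumes that every region tag sits on a line of its own starting with an angle bracket.<br>

```shell
#!/bin/sh
# list_region_tags: hypothetical helper, not part of CWB.
# Reads decoded corpus text on stdin and prints each distinct opening-tag name.
# Assumption: region tags occupy their own lines and start with "<".
list_region_tags() {
    # keep opening tags only (skip closing tags, comments, XML declarations)
    grep '^<[^/!?]' |
        # reduce each tag to its name: everything up to the first space or ">"
        sed 's/^<\([^ >]*\).*/\1/' |
        sort -u
}

# Possible usage (requires CWB; looks only at the first 1000 lines):
#   cwb-decode -Cx PERS-DIVER-USENET -ALL | head -n 1000 | list_region_tags
```

The names it prints are the ones the new cwb-encode declarations would have to cover.<br>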
<br>
<br>
Hope that one of these steps helps!<br>
<span class="HOEnZb"><font color="#888888">Stefan<br>
<br>
</font></span></blockquote></div><br><br></div>