[CWB] CWB: problems with indexing a corpus
Stephanie Evert
stefanML at collocations.de
Mon Feb 13 21:39:54 CET 2023
> I am trying to process a vrt file with cwb-encode.
> The file has pos tagging and I used the examples from CWB-manuals as a template.
>
> I run the command
> cwb-encode -f /path_to_file.vrt -d /path/datafiles -R /path/registry/corpus_name -9 -c utf8 -P pos -P lemma -S text -S p -S s
>
> and I am getting the following warnings:
> > Annotations of s-attribute <text> not stored (file /xxx.vrt, line #1, warning issued only once).
> > Annotations of s-attribute <p> not stored (file /xx.vrt, line #3, warning issued only once).
> > Annotations of s-attribute <s> not stored (file /xx.vrt, line #4, warning issued only once).
That means the start tags of these XML elements contain attribute-value pairs, which you're ignoring by declaring the regions with a plain -S flag; cwb-encode simply warns you about this fact.
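As a rough sketch of how one might see which attribute-value pairs are being dropped, the snippet below lists the attribute names found in the start tags of a small sample file and turns them into -S declarations of the form `-S text:0+id`, which make cwb-encode store the values. The sample content and the attribute names (id, year, n) are invented for illustration, and the pattern assumes one tag per line with double-quoted values.

```shell
#!/bin/sh
# Hypothetical miniature .vrt file; a real file will have different attributes.
vrt=$(mktemp)
cat > "$vrt" <<'EOF'
<text id="t1" year="2023">
<p n="1">
<s n="1">
Hello	UH	hello
world	NN	world
</s>
</p>
</text>
EOF

# Collect each start tag that carries attributes, reduced to
# "element attr1 attr2 ..." (values stripped, duplicates removed).
tag_attrs=$(grep -E '^<[A-Za-z_][^/ >]* [^>]*>' "$vrt" \
  | sed -E 's/^<//; s/="[^"]*"//g; s/>$//' \
  | sort -u)
printf '%s\n' "$tag_attrs"

# Turn every line into a declaration that keeps the annotations,
# e.g. "text id year" becomes "-S text:0+id+year".
flags=$(printf '%s\n' "$tag_attrs" \
  | awk '{ s = "-S " $1 ":0"; for (i = 2; i <= NF; i++) s = s "+" $i; print s }')
printf '%s\n' "$flags"

rm -f "$vrt"
```

Replacing the plain `-S text -S p -S s` in the cwb-encode call with declarations like these should make the warnings disappear, since the values are then stored (as s-attributes such as text_id). If the annotations are not needed, the original declarations are fine and the warnings can be ignored.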
> And the programme terminates without producing any result.
That sounds like an error, though, and completely unrelated to the warnings. After successful completion of the command, your data directory /path/datafiles should be populated with index files.
Are you sure there isn't any error message?
A first step would be to re-run cwb-encode with the -v option added (at the start, not after the attribute flags). This should print how many tokens have been read and encoded from the vrt file.
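A minimal sketch of such a re-run, assuming the placeholder paths from the original command: it puts -v first, captures everything cwb-encode prints, reports the exit status, and lists the data directory. The log file and the variable names are only illustrative, and all other flags are copied unchanged from the question.

```shell
#!/bin/sh
# Placeholder paths taken from the command in the question; adjust as needed.
VRT=/path_to_file.vrt
DATA=/path/datafiles
REG=/path/registry/corpus_name
LOG=$(mktemp)   # illustrative log file for stdout and stderr

if command -v cwb-encode >/dev/null 2>&1; then
  # -v comes first, before the -P/-S attribute declarations.
  cwb-encode -v -f "$VRT" -d "$DATA" -R "$REG" -9 -c utf8 \
    -P pos -P lemma -S text -S p -S s >"$LOG" 2>&1
  status=$?
else
  echo "cwb-encode not found in PATH" >"$LOG"
  status=127
fi

# A non-zero status or an error line in the log points to the real problem.
echo "exit status: $status"
cat "$LOG"

# After a successful run this directory should contain the index files.
ls -l "$DATA" 2>/dev/null || echo "data directory $DATA is missing or unreadable"
```

Capturing stderr together with stdout matters here because an error message is easy to miss among the repeated warnings.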
Best,
Stephanie