[CWB] cwb-scan-corpus

Stefan Evert stefanML at collocations.de
Fri Nov 1 11:01:45 CET 2019


Hi Simon,

First, you should probably enclose the POS constraints in single quotes so that your shell doesn't get confused by them. I also find the -f option more convenient than redirecting the output of cwb-scan-corpus (and I often save to a .gz file because I'm chronically short on disk space).

> On 1 Nov 2019, at 09:14, Simon Meier-Vieracker <simon.meier-vieracker at tu-dresden.de> wrote:
> 
> I succeed in filtering out >all< pos-tags starting with '$' like this:
> 
> cwb-scan-corpus CORPUS lemma+0 lemma+1 lemma+2 ?pos+0=/[^\$].+/ ?pos+1=/[^\$].+/ ?pos+2=/[^\$].+/ > trigrams.txt
> 
> But still this is not exactly what I want, because I only want to filter out '$.'

Make sure you have a sufficiently recent version of CWB installed (v3.4.11 or newer should suffice) and use negated constraints:

	cwb-scan-corpus -f trigrams.txt CORPUS lemma+0 lemma+1 lemma+2 '?pos+0!=/\$\./' '?pos+1!=/\$\./' '?pos+2!=/\$\./'
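
To see why the single quotes matter, here is a minimal sketch that needs no CWB installation. It assumes a POSIX-style shell such as bash; show_args is a hypothetical helper (not part of CWB) that simply prints each argument the way a program would receive it.

```shell
# Hypothetical helper: print each argument on its own line, exactly as a
# program such as cwb-scan-corpus would receive it from the shell.
show_args() { printf '%s\n' "$@"; }

# Unquoted: the shell treats the backslashes as quoting characters and
# removes them (and it may also try to glob on '?').
unquoted=$(show_args ?pos+0!=/\$\./)

# Single-quoted: the constraint is passed through verbatim.
quoted=$(show_args '?pos+0!=/\$\./')

printf 'unquoted: %s\nquoted:   %s\n' "$unquoted" "$quoted"
```

Without the quotes, the backslashes never reach cwb-scan-corpus, so the regular expression it sees is no longer the one you typed. In an interactive bash session the unquoted '!' may additionally trigger history expansion.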

Best,
Stefan

