[CWB] size of a subcorpus - how?
Stefan Evert
stefanML at collocations.de
Tue Apr 16 10:10:12 CEST 2013
On 15 Apr 2013, at 23:07, Pavel Vondřička <Pavel.Vondricka at ff.cuni.cz> wrote:
> yes, that is a very dirty way, I think. At the moment I prefer the one I
> tried using rcqp: compute it from the subcorpus dump in R. I am still
> mostly puzzled that CWB is missing such a basic functionality at all.
CWB is intended more as a query engine than as a complete concordancer or workbench (at least nowadays, and despite its name), so it provides mostly core functionality -- accessing a corpus, index lookup, corpus queries. Convenience functions, e.g. for subcorpus sizes or collocation analysis, are usually much easier to implement in a high-level language (Perl, Python, PHP, R, ...) than in plain C. With the R or Perl API, it's easy to tally up all regions of a subcorpus and compute its total size.
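As a rough illustration of such a tally (a sketch in Python rather than R or Perl, and not using any CWB API): assume the regions of the subcorpus are available as lines of text whose first two whitespace-separated fields are the start and end corpus positions of each region, as in the output of CQP's dump command. The function name total_size and the sample data are made up for the example.

```python
def total_size(dump_lines):
    """Sum the token counts of all regions listed in dump-style lines.

    Each line is assumed to start with two integer fields: the corpus
    position of the first and of the last token of a region. Any further
    fields (e.g. target and keyword columns) are ignored.
    """
    size = 0
    for line in dump_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or incomplete lines
        start, end = int(fields[0]), int(fields[1])
        size += end - start + 1  # both positions are inclusive
    return size


if __name__ == "__main__":
    # Made-up sample: three regions of 5, 1 and 10 tokens.
    sample = ["0\t4\t-1\t-1", "10\t10\t-1\t-1", "20\t29\t-1\t-1"]
    print(total_size(sample))
```

To apply this to a real subcorpus, the dump could first be written to a file from CQP (e.g. dump Subcorpus > "subcorpus.txt"; where the file name is only a placeholder) and the open file object passed to total_size.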
That said, on Linux or another Unix-like platform you can get these counts fairly easily from within CQP by piping the dump to awk, e.g.
Subcorpus = <speaker_language = "DE"> [] expand to speaker_language;
dump Subcorpus > "| awk '{size += $2 - $1 + 1} END {print size}'";
If you need this often, define a macro in your startup file, e.g.
define macro TotalSize (1) 'dump $0 > "| awk ''{size += $ 2 - $ 1 + 1} END {print size}''";'
/TotalSize[Subcorpus];
(A bit ugly due to quirks of the macro syntax, but at least on my Mac the example above works.)
Cheers,
Stefan