[CWB] size of a subcorpus - how?

Tue Apr 16 10:10:12 CEST 2013

On 15 Apr 2013, at 23:07, Pavel Vondřička <Pavel.Vondricka at ff.cuni.cz> wrote:

> yes, that is a very dirty way, I think. At the moment I prefer the one I
> tried using rcqp: compute it from the subcorpus dump in R. I am still
> mostly puzzled that CWB is missing such a basic functionality at all.

CWB is intended more as a query engine than as a complete concordancer or workbench (at least nowadays, and despite its name), so it mostly provides core functionality: corpus access, index lookup and corpus queries.  Convenience functions, e.g. for subcorpus sizes or collocation analysis, are usually much easier to implement in a high-level language (Perl, Python, PHP, R, ...) than in plain C.  With the R or Perl API, it's easy to tally up all regions of a subcorpus and compute its total size.
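
As a rough illustration of how little code such a tally needs, here is a Python sketch.  It does not use the R or Perl API; it assumes you already have the text output of CQP's "dump" command, one region per line, with the start and end corpus positions as the first two fields.  The function name and the sample lines are made up for the example.

```python
def total_size(dump_lines):
    """Sum the lengths of all regions in CQP 'dump' output.

    Assumes each line starts with two integer fields: the first and
    last corpus position of a region (both inclusive).
    """
    size = 0
    for line in dump_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or incomplete lines
        start, end = int(fields[0]), int(fields[1])
        size += end - start + 1  # inclusive range, hence the + 1
    return size


# Made-up sample: two regions covering positions 0-9 and 20-24
sample = ["0\t9\t-1\t-1", "20\t24\t-1\t-1"]
print(total_size(sample))  # prints 15
```

The same function can be fed an open file object if the dump has been saved to disk, since it only iterates over lines.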

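That said, on Linux or another Unix-like platform you can get these counts fairly easily from within CQP by piping the subcorpus dump through awk, which adds up (end - start + 1) over all regions, e.g.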

	Subcorpus = <speaker_language = "DE"> [] expand to speaker_language;
	dump Subcorpus > "| awk '{size += $2 - $1 + 1} END {print size}'";

If you need this often, define a macro in your startup file, e.g.

	define macro TotalSize (1) 'dump $0 > "| awk ''{size += $ 2 - $ 1 + 1} END {print size}''";'
	/TotalSize[Subcorpus];

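(A bit ugly due to quirks of the macro syntax: the single quotes of the awk program have to be doubled inside the quoted macro body, and the spaces in "$ 2" and "$ 1" keep CQP from treating them as macro arguments, while awk reads them just like $2 and $1.  At least on my Mac the example above works.)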

Cheers,
Stefan