[CWB] Counting tokens with a given s-attribute

Tue Mar 22 10:43:22 CET 2022

> I have a corpus which is divided into de facto subcorpora using an s-attribute, and I need to count the number of tokens in each subcorpus. Are there any issues with doing this by searching for [word=".+"] while selecting each of the s-attribute values and using the number of matches returned as the token count? Is there a better way to do this (ideally, one which would return all the match counts at once)?

Let's assume that the s-attribute in question is <div_cat>, for the sake of exposition.  There are three ways of obtaining the subcorpus sizes:

1) The only efficient solution is to use cwb-s-decode together with a Perl, Python or R script for aggregating counts (or use available packages in one of those programming languages for direct corpus access).
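
For illustration, here is a minimal Python sketch of the aggregation step in solution 1. It assumes that the script is fed the output of "cwb-s-decode CORPUS -S div_cat", i.e. one line per region with start position, end position and annotation value separated by TABs. The names region_sizes and main are made up for this example.

```python
import sys
from collections import Counter


def region_sizes(lines):
    """Sum the number of tokens per annotation value.

    Assumes each line looks like "start<TAB>end<TAB>value", as printed by
    cwb-s-decode for an s-attribute with annotations (an assumption of this
    sketch; check the output for your corpus first).
    """
    sizes = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        start, end, value = line.split("\t", 2)
        # region boundaries are inclusive corpus positions
        sizes[value] += int(end) - int(start) + 1
    return sizes


def main():
    # print one "value<TAB>tokens" line per category, sorted by value
    for value, n_tokens in sorted(region_sizes(sys.stdin).items()):
        print(f"{value}\t{n_tokens}")
```

Saved as a script with a final call to main(), it could be used in a pipe such as "cwb-s-decode CORPUS -S div_cat | python3 sizes.py > subcorpus_sizes.tsv" (sizes.py being a hypothetical file name).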

2) The lazy solution – if you don't care about wasting time and memory – works in CQP:

	Tokens = [];
	group Tokens match div_cat;

(You'll probably want to run "set PrettyPrint off;" first and redirect the frequency table to a TSV file.)

3) As a compromise, you can use cwb-scan-corpus on the command line. It is still relatively inefficient, but considerably faster than solution 2 and very memory-efficient:

	cwb-scan-corpus -o subcorpus_sizes.tsv CORPUS div_cat+0
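
If you want to process the resulting table further, a small Python sketch along the following lines should do. It assumes that each line of subcorpus_sizes.tsv holds the frequency count first, followed by a TAB and the div_cat value; please check this against your actual file. The function name parse_scan_counts is made up for this example.

```python
def parse_scan_counts(lines):
    """Turn "count<TAB>value" lines into a dict mapping value -> token count.

    The column order (count first) is an assumption of this sketch.
    """
    sizes = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        sizes[fields[1]] = int(fields[0])
    return sizes
```

For example, "with open('subcorpus_sizes.tsv', encoding='utf-8') as fh: sizes = parse_scan_counts(fh)" gives you all subcorpus sizes in one dictionary.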

Best,
Stephanie