[CWB] Counting tokens with a given s-attribute
Stephanie Evert
stefanML at collocations.de
Tue Mar 22 10:43:22 CET 2022
> I have a corpus which is divided into de facto subcorpora using an s-attribute, and I need to count the number of tokens in each subcorpus. Are there any issues with doing this by searching for [word=".+"] while selecting each of the s-attribute values and using the number of matches returned as the token count? Is there a better way to do this (ideally, one which would return all the match counts at once)?
Let's assume that the s-attribute in question is <div_cat>, for the sake of exposition. There are three ways of obtaining the subcorpus sizes:
1) The only efficient solution is to use cwb-s-decode together with a Perl, Python or R script for aggregating counts (or use available packages in one of those programming languages for direct corpus access).
2) The lazy solution – if you don't care about wasting time and memory – works in CQP:
Tokens = [];
group Tokens match div_cat;
(You'll probably want to run "set PrettyPrint off;" first and redirect the frequency table to a TSV file, e.g. by appending > "subcorpus_sizes.tsv" to the group command.)
3) As a compromise, you can use cwb-scan-corpus on the command line. It is still relatively inefficient, but considerably faster than solution 2 and very memory-efficient.
cwb-scan-corpus -o subcorpus_sizes.tsv CORPUS div_cat+0
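For solution 1, the aggregation step might look roughly like the Python sketch below. It assumes that the regions have been dumped with something like "cwb-s-decode CORPUS -S div_cat > regions.tsv" and that each output line has the form start, end, annotation value, separated by tabs (corpus positions are inclusive, so a region covers end - start + 1 tokens). The file name regions.tsv, the script name and the function name region_sizes are made up for illustration.

```python
import sys
from collections import Counter


def region_sizes(lines):
    """Sum token counts per annotation value.

    Each line is assumed to look like "start<TAB>end<TAB>value",
    with inclusive corpus positions.
    """
    sizes = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split at most twice so that tabs inside the value are preserved.
        start, end, value = line.split("\t", 2)
        sizes[value] += int(end) - int(start) + 1
    return sizes


# Hypothetical usage: python subcorpus_sizes.py regions.tsv
if __name__ == "__main__" and len(sys.argv) == 2:
    with open(sys.argv[1], encoding="utf-8") as fh:
        for value, n_tokens in sorted(region_sizes(fh).items()):
            print(f"{value}\t{n_tokens}")
```

This reads only one line per region rather than one per token, which is why it scales so much better than the other two solutions.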
Best,
Stephanie