[CWB] Counting tokens with a given s-attribute
Scott Sadowsky
ssadowsky at gmail.com
Tue Mar 22 15:47:27 CET 2022
Thanks so much, Stephanie! It's great to have multiple solutions.
All the best,
Scott
On Tue, Mar 22, 2022 at 6:43 AM Stephanie Evert <stefanML at collocations.de>
wrote:
> > I have a corpus which is divided into de facto subcorpora using an
> > s-attribute, and I need to count the number of tokens in each subcorpus.
> > Are there any issues with doing this by searching for [word=".+"] while
> > selecting each of the s-attribute values and using the number of matches
> > returned as the token count? Is there a better way to do this (ideally, one
> > which would return all the match counts at once)?
>
> Let's assume that the s-attribute in question is <div_cat>, for the sake
> of exposition. There are three ways of obtaining the subcorpus sizes:
>
> 1) The only efficient solution is to use cwb-s-decode together with a
> Perl, Python or R script for aggregating counts (or use available packages
> in one of those programming languages for direct corpus access).
>
> 2) The lazy solution – if you don't care about wasting time and memory –
> works in CQP:
>
> Tokens = [];
> group Tokens match div_cat;
>
> (and you'll probably want to set PrettyPrint off; and redirect the
> frequency table to a TSV file).
>
> 3) As a compromise, you can use cwb-scan-corpus on the command-line. It is
> still relatively inefficient, but considerably faster than solution 2 and
> very memory-efficient.
>
> cwb-scan-corpus -o subcorpus_sizes.tsv CORPUS div_cat+0
>
> Best,
> Stephanie
>
>
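
Solution 1 in the quoted message could be sketched roughly as follows in Python. This is only a minimal sketch, not a script from the thread. It assumes that "cwb-s-decode CORPUS -S div_cat" prints one region per line in the form start<TAB>end<TAB>value, with inclusive corpus positions; please check this against the output of your CWB version. The names subcorpus_sizes, main and regions.tsv are hypothetical and chosen purely for illustration.

```python
#!/usr/bin/env python3
"""Sum token counts per s-attribute value from saved cwb-s-decode output."""
import sys
from collections import Counter


def subcorpus_sizes(lines):
    """Aggregate region lengths by annotation value.

    Assumption: each non-empty line is start<TAB>end<TAB>value, where
    start and end are inclusive corpus positions of one region.
    """
    sizes = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        start, end, value = line.split("\t", 2)
        # Inclusive positions, so a region from 10 to 14 holds 5 tokens.
        sizes[value] += int(end) - int(start) + 1
    return sizes


def main(path):
    """Read region lines from a file and print value<TAB>token count."""
    with open(path, encoding="utf-8") as handle:
        sizes = subcorpus_sizes(handle)
    for value, n_tokens in sorted(sizes.items()):
        print(f"{value}\t{n_tokens}")


# Only runs when a file name is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

Assuming the script is saved as subcorpus_sizes.py, it might be used like this:

    cwb-s-decode CORPUS -S div_cat > regions.tsv
    python3 subcorpus_sizes.py regions.tsv

Since every region is read exactly once and only one counter per attribute value is kept in memory, this approach should scale to large corpora, which is presumably why the quoted message calls it the efficient solution.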