<div dir="ltr"><div><div>Dear Yannick and Stefan,<br><br></div>Thank you both for your feedback. I have implemented Stefan&#39;s algorithm (many thanks!) in R to calculate statistics from CWB-generated frequency tables, and it works really well. I am working on a 100-million-token corpus, so dealing with raw data (the corpus read into an R data frame) has always been extremely tedious, which made any calculation hardly feasible. I had already used &quot;cwb-scan-corpus&quot; as well as the &quot;tabulate&quot; solution; however, astonished by how fast cwb-scan-corpus is, I thought there might be some undocumented way to do this in one pass. Your clarification helped me a lot!<br>

<br>Thank you once again for your help<br></div><div></div>Chris<br><div><br></div></div><div class="gmail_extra"><br><br><div class="gmail_quote">2014-03-23 14:52 GMT+01:00 Stefan Evert <span dir="ltr">&lt;<a href="mailto:stefanML@collocations.de" target="_blank">stefanML@collocations.de</a>&gt;</span>:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class=""><br>

&gt; I was wondering whether I had missed something when reading the CWB documentation, or whether there simply is no trivial way to generate per-text corpus statistics (e.g. text_id, text_author, word_count, types_count etc.). I have already tried both the external (cwb-scan-corpus) and the internal (query = []; then tabulate) approach, but without much success. I have also started to analyse the CQPweb PHP scripts in order to see how it populates its MySQL tables with frequency data, but that is not precisely what I was looking for (I am still digging, though).<br>


<br>

</div>You seem to be asking for two different things here:<br>

<br>

(a) A metadata table that associates each text (ID) with text-level metadata such as &quot;author&quot;, &quot;genre&quot;, etc.<br>

<br>

(b) Various kinds of type-token statistics for each text.<br>

<br>

<br>

Concerning (a), if the metadata are encoded in the CWB index, you can easily &quot;tabulate&quot; them within CQP:<br>

<br>

&gt; Texts = &lt;text&gt; [];<br>

&gt; tabulate Texts match text_id, match text_author, ... ;<br>

<br>

<br>

Concerning (b), the CWB doesn&#39;t keep track of per-text corpus statistics (and it doesn&#39;t have a notion of &quot;text&quot; in the first place, anyway). CQPweb keeps full word frequency counts for each text in its internal database, from which most type-token statistics can be derived. I&#39;m not sure if there&#39;s a way to access them directly, though.<br>


<br>

To generate the necessary counts, CQPweb runs through the full corpus and collects the frequency counts in hash variables.  As Yannick suggested, it is fairly easy to do this from Perl, Python or R using the low-level corpus access APIs.  You can also do this quite efficiently from the command line with cwb-scan-corpus:<br>


<br>

        cwb-scan-corpus -o text_word_counts.gz CORPUS text_id+0 word+0<br>

<br>

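will produce frequency counts for every combination of text_id and word form that occurs in the corpus. Each output line holds the frequency, the text ID and the word form, separated by tabs. The lines are written in unsorted order, so you&#39;ll probably want to sort by the second column (the text ID) and then by descending frequency:<br>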

<br>

        cwb-scan-corpus CORPUS text_id+0 word+0 | sort -k2,2 -k1,1nr  | gzip &gt; text_word_counts.gz<br>

<br>

For type counts, check how often each text ID occurs in the table:<br>

<br>

        gzip -cd  text_word_counts.gz | cut -f2 | uniq -c<br>

<br>

For token counts, add up the word frequencies for each text ID, or get them directly from cwb-s-decode:<br>

<br>

        cwb-s-decode CORPUS -S text_id | awk &#39;{print $2 - $1 + 1, $3}&#39;<br>
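
<br>

If you would rather get both figures in one go, something along the following lines should work. It is only a sketch: per_text_stats is a made-up helper name (not a CWB tool), the three sample rows are invented, and it assumes that the table is tab-separated with the frequency in the first column and the text ID in the second, as above. Since every row is a distinct combination of text ID and word form, counting the rows per text gives the types and summing the first column gives the tokens; the input does not need to be sorted.<br>

<br>

```shell
# Summarise a cwb-scan-corpus table (freq TAB text_id TAB word) read from stdin.
# Prints one line per text: text_id TAB tokens TAB types.
# per_text_stats is a hypothetical helper name, not part of CWB.
per_text_stats() {
    awk -F'\t' '
        # each input row is one distinct (text_id, word) pair
        { tokens[$2] += $1; types[$2]++ }
        END { for (id in tokens) printf "%s\t%d\t%d\n", id, tokens[id], types[id] }
    ' | sort
}

# Invented sample rows standing in for the real table
printf '3\tt1\tthe\n1\tt1\tcat\n2\tt2\tthe\n' | per_text_stats
```

<br>

With real data, feed it the table instead: gzip -cd text_word_counts.gz | per_text_stats<br>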

<br>

<br>

Hope this helps,<br>

Stefan<br>

<br>


_______________________________________________<br>

CWB mailing list<br>

<a href="mailto:CWB@sslmit.unibo.it">CWB@sslmit.unibo.it</a><br>

<a href="http://devel.sslmit.unibo.it/mailman/listinfo/cwb" target="_blank">http://devel.sslmit.unibo.it/mailman/listinfo/cwb</a><br>

</blockquote></div><br></div>