[CWB] Limit for text elements in CQPweb?

Fri Jun 19 20:29:05 CEST 2020

> The corpus is a collection of almost 5 million Tweets/texts with around
> 140 million tokens. I could index the whole corpus without any problems
> in CWB (version 3.4.18), 

CQPweb should be able to handle this in theory.  As far as I know – Andrew will have to chime in if I'm wrong – there is no formal limit on the number of individual texts.

HOWEVER, you're not going to be happy with this corpus because CQPweb will become unusably slow, due to the way that many operations on text metadata are implemented and because of performance issues in MySQL.  (The recent CQPweb 3.3 might perform better because a lot of the internal architecture was changed.)

What we do with Twitter corpora is pre-group all tweets that share the same CQPweb-relevant metadata values (i.e. the classifications that can be used for subcorpora and frequency distributions) into collections, which are then encoded as "texts".  Accessing screen names, tweet IDs, URLs etc. then needs a little bit of magic with XML visualisations and JavaScript, but should be doable.
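
To make the pre-grouping idea concrete, here is a rough Python sketch, not our actual pipeline.  It assumes each tweet is a dict with string metadata values and an already tokenised "tokens" list; the field names ("lang", "year", "user"), the <tweet> element and the generated text IDs are all made up for illustration.  The output is a verticalised file (one token per line) with one <text> region per metadata combination and one <tweet> region per original tweet.

```python
from collections import defaultdict
from xml.sax.saxutils import escape


def attr(value):
    """Escape a metadata value for use inside a double-quoted XML attribute."""
    return escape(str(value), {'"': "&quot;"})


def group_tweets(tweets, keys):
    """Group tweets by the tuple of their values for the given metadata keys."""
    groups = defaultdict(list)
    for tweet in tweets:
        groups[tuple(tweet[k] for k in keys)].append(tweet)
    return groups


def to_vertical(groups, keys):
    """Render the groups as vertical text: one <text> per metadata combination."""
    lines = []
    ordered = sorted(groups.items(), key=lambda item: item[0])
    for n, (values, members) in enumerate(ordered, start=1):
        meta = " ".join(f'{k}="{attr(v)}"' for k, v in zip(keys, values))
        lines.append(f'<text id="t{n}" {meta}>')
        for tw in members:
            # Hypothetical <tweet> region: keeps per-tweet details available
            # even though the tweet is no longer a "text" of its own.
            lines.append(f'<tweet id="{attr(tw["id"])}" user="{attr(tw["user"])}">')
            lines.extend(tw["tokens"])  # assumed to be tokenised already
            lines.append("</tweet>")
        lines.append("</text>")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = [
        {"id": "1", "user": "alice", "lang": "en", "year": "2020", "tokens": ["Hello", "world"]},
        {"id": "2", "user": "bob", "lang": "en", "year": "2020", "tokens": ["Good", "morning"]},
        {"id": "3", "user": "carol", "lang": "de", "year": "2020", "tokens": ["Guten", "Tag"]},
    ]
    keys = ["lang", "year"]
    print(to_vertical(group_tweets(sample, keys), keys), end="")
```

With this kind of grouping the number of "texts" is bounded by the number of distinct metadata combinations rather than by the number of tweets, which is the whole point.  The attributes on <text> and <tweet> would still have to be declared when encoding the corpus.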

Best,
Stefan