[CWB] Best practices to manage big corpora in CQPweb

Stefan Evert stefanML at collocations.de
Sat Nov 14 11:38:25 CET 2020


Hi José,

Building frequency lists is the most time-consuming step of corpus installation in CQPweb and can be tedious, but your corpora are still in a reasonable size range (both in terms of token count and number of texts).

I definitely wouldn't expect a 140M corpus to take 10 hours.  One likely culprit is that you're indexing 20 p-attributes, even though CQPweb won't be able to work with them anyway (except for keyword or collocation analysis).  IIRC, CQPweb indexes unique combinations across all p-attributes, so the resulting frequency database becomes huge and very expensive to build.

If you only need them for CQP queries, a work-around could be to remove them from the registry file while installing the corpus in CQPweb (so CQPweb won't know about them) and then put them back in later (so they're available for CQP queries).
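
A minimal sketch of that workflow, assuming a corpus with the CWB ID "mycorpus" under a standard registry path, and two of your boolean attributes as examples (all names and paths here are hypothetical; adjust them to your installation):

    # hypothetical registry location and attribute names
    REG=/usr/local/share/cwb/registry/mycorpus

    # keep a pristine copy of the full registry entry
    cp "$REG" "$REG.full"

    # comment out the p-attribute declarations CQPweb shouldn't see
    sed -i -E 's/^(ATTRIBUTE (is_alpha|is_stop))$/# \1/' "$REG"

    # ... install the corpus through CQPweb's admin interface ...

    # restore the full registry so CQP can use all attributes again
    cp "$REG.full" "$REG"

(IIRC, lines starting with "#" are treated as comments in registry files, so commenting the declarations out should be enough.)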

There are two bottlenecks in building frequency lists:

a) Creating per-text frequency lists is done in PHP and uses only a single thread.  This is something you can't get around.

b) Indexing frequency tables in MySQL can take a very long time (I always feel that MySQL could do better there…). If this is your key bottleneck, you should try to optimise the configuration of your MySQL server, e.g. by making it use more threads. Are you sure that the MySQL data store is on a fast hard disk?
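
As a starting point, something along these lines might help; the variable names are real MySQL/InnoDB settings, but the values are illustrative guesses for a dedicated 16 GB machine, not tested recommendations:

    # check how much memory InnoDB may use for caching (often the
    # single most important knob) and how it flushes the log
    mysql -u root -p -e "SHOW VARIABLES LIKE 'innodb_buffer_pool_size';"
    mysql -u root -p -e "SHOW VARIABLES LIKE 'innodb_flush_log_at_trx_commit';"

    # then raise them in my.cnf and restart the server, e.g.
    #   innodb_buffer_pool_size = 8G
    #   innodb_flush_log_at_trx_commit = 2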

Can you watch "top" during the indexing and check which processes are taking up so much time?
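
For example (assuming a Linux box where the sysstat package is available):

    # per-process CPU usage; press "1" inside top for a per-core view
    top

    # disk utilisation, to see whether I/O rather than CPU is the
    # real bottleneck
    iostat -x 5

    # where MySQL actually keeps its data files, to verify the disk
    mysql -u root -p -e "SHOW VARIABLES LIKE 'datadir';"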

Best,
Stefan


> On 11 Nov 2020, at 10:06, José Manuel Martínez Martínez <chozelinek at gmail.com> wrote:
> 
> I'm currently working with several corpora in CQPweb which are fairly big (though they will stay below the 2.1 billion token limit). The corpora will contain between 7,000 and 30,000 texts, and their typical size will range from 500M to 1,500M tokens.
> 
> My server (4 cores, 16 GB RAM) is only serving CQPweb (no users for now), indexing a corpus from the command line, and running a Python script.
> 
> I've seen that the process of creating the frequency lists with offline-freqlists.php is my current bottleneck. I think the process uses at most 2 cores? With a test corpus made up of 2,300 texts and 140M tokens, it took my server around 10 hours. My next run will be on a corpus of around 8,000 texts and 500M tokens. Could this take up to 40 hours before it is ready to be used in CQPweb?
> 
> How can I optimize the process? How do you usually do it?
> Any tips and tricks on how to handle these very big corpora will be much appreciated.
> 
> I think that the part that took longest was when it started generating the frequency lists for every positional attribute. If this assumption is right, I could skip some of the positional attributes (I have twenty of them; eleven are booleans with True/False values only; the interesting ones are word, lemma, norm, pos, lower, shape, tag, dep, ent_type...).
> 