[CWB] Best practices to manage big corpora in CQPweb
José Manuel Martínez Martínez
chozelinek at gmail.com
Wed Nov 11 10:06:35 CET 2020
Hi there,
I'm currently working with several corpora in CQPweb which are fairly big
(though they will stay below the 2.1 billion token limit). The corpora will
contain between 7,000 and 30,000 texts, and their typical size will range
from 500M to 1,500M tokens.
My server (4 cores, 16GB RAM) is only serving CQPweb (no users for now),
indexing a corpus from the command line, and running a Python script.
I've seen that the process of creating the frequency lists
with offline-freqlists.php is my current bottleneck. I think the process
uses at most 2 cores? With a test corpus made up of 2,300 texts and 140M
tokens, it took my server around 10 hours. My next run will be on a corpus
of around 8,000 texts and 500M tokens. Could this take up to 40 hours
before it is ready to be used in CQPweb?
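For reference, this is roughly how I launch the job (a sketch with
placeholder paths and corpus handle "mycorpus"; I'm assuming
offline-freqlists.php takes the corpus handle as its argument, as on my
install -- check the script's own usage message for your version):

    cd /path/to/cqpweb/bin
    # detach the job from the terminal and keep a log for timing future runs
    nohup php offline-freqlists.php mycorpus > freqlist-mycorpus.log 2>&1 &
    tail -f freqlist-mycorpus.log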
How can I optimize the process? How do you usually do it?
Any tips and tricks on how to handle such very big corpora will be much
appreciated.
I think the part that took longest was when it started generating the
frequency lists for every positional attribute. If this assumption is
right, I could skip some of the positional attributes (I have twenty of
them; eleven are booleans with True/False values only; the interesting ones
are word, lemma, norm, pos, lower, shape, tag, dep, ent_type...).
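If that is the right lever, would re-encoding with fewer positional
attributes be the recommended route? A sketch of what I have in mind with
the standard CWB tools (paths and column numbers are placeholders; it
assumes a tab-separated .vrt whose boolean columns are stripped first so
the columns match the declared attributes):

    # keep only word, lemma, norm, pos, lower, shape, tag, dep, ent_type
    # (adjust the -f column numbers to wherever those columns sit in your
    # layout; XML lines contain no tabs, so cut passes them through as-is)
    cut -f1,2,3,4,5,6,7,8,9 mycorpus.vrt > mycorpus-slim.vrt

    # the first column (word) is implicit; declare only the remaining ones
    cwb-encode -d /corpora/data/mycorpus -f mycorpus-slim.vrt \
               -R /corpora/registry/mycorpus -c utf8 -xsB \
               -P lemma -P norm -P pos -P lower -P shape -P tag \
               -P dep -P ent_type

    # build and validate the CWB-level indexes and frequency data
    cwb-make -V -r /corpora/registry MYCORPUS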
Cheers,
--
José Manuel Martínez Martínez
https://chozelinek.github.io