[CWB] Best practices to manage big corpora in CQPweb
Hardie, Andrew
a.hardie at lancaster.ac.uk
Mon Nov 16 03:40:28 CET 2020
Just a couple of additions to Stefan's answers.
I've added the ability to specify that a particular annotation (p-attribute) should not have freq tables built, but only in 3.3 (trunk). (I guess you are on the 3.2 branch, José.) In 3.3, under "Manage annotation" there is a control called "Needs FT", set to "Y" by default. The only effect of switching it to "N" is that the attribute is absent from "Frequency lists", "keywords" and "collocations".
As Stefan points out, the bottleneck is in MySQL, but specifically it's an issue of disk *access*. Creating the freq table requires the creation of large temporary tables to store the results of intermediate "SELECT ..." queries. For large corpora, these tables are too big to be held in RAM, so they are written out to temporary disk space. If your MySQL daemon uses the same physical disk for temp space and for storage of the actual tables, then its read-accesses and write-accesses will constantly interrupt one another as it reads one table and writes another. This can cause MAJOR slowdown.
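You can check whether this is actually happening on your server: MySQL keeps standard status counters for temporary tables (the mysql invocations below assume you can connect as a sufficiently privileged user):

    # How many implicit temp tables has the server created, and how many of
    # those had to go to disk rather than staying in RAM?
    mysql -e "SHOW GLOBAL STATUS LIKE 'Created_tmp%';"
    # A high Created_tmp_disk_tables relative to Created_tmp_tables means the
    # intermediate results really are spilling to disk.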
Possible remedies - not tested by me, sorry, but theoretically useful!
- ensure that MySQL is using a location for temporary files which is on a *separate physical disk* from the location where the actual tables of the CQPweb database are stored (see the sketch after this list).
- or, get a faster disk (RAID?) for the single location
- or, get enough RAM to do it all without writing temp tables to disk
- or, block all use of the server by other users during freq table setup (again, to give the MySQL server connection building the freq table all available disk read/write bandwidth)
- ALSO: creating freq tables is faster for annotations that are set to case-sensitive/accent-sensitive. So, consider setting annotations to CS/AS if you don't need case/accent-insensitive searching. Again, this setting is not available in 3.2.
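A minimal sketch of the first remedy above, assuming a Linux box with a systemd-managed MySQL (the mount point is a placeholder for wherever your second disk lives):

    # Where does MySQL currently put temp files vs. actual table data?
    mysql -e "SHOW VARIABLES LIKE 'tmpdir';"
    mysql -e "SHOW VARIABLES LIKE 'datadir';"

    # In my.cnf, point tmpdir at a directory on a *different* physical disk:
    #   [mysqld]
    #   tmpdir = /mnt/otherdisk/mysql-tmp

    # The directory must exist and be writable by the mysql user; then restart:
    sudo mkdir -p /mnt/otherdisk/mysql-tmp
    sudo chown mysql:mysql /mnt/otherdisk/mysql-tmp
    sudo systemctl restart mysql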
The creation of the per-text frequency data which Stefan mentions is actually normally pretty quick compared to the freq table building, because unlike the SQL freq tables, no intermediate data is involved: it's just a filter on a pipe from cwb-decode to cwb-encode.
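Purely to illustrate the shape of that pipeline (CQPweb drives this internally; "per_text_filter" here is a made-up placeholder, not a real program):

    # Decode the corpus as a token stream, reduce it to per-text counts on the
    # fly, and re-encode the result -- no large intermediate files hit the disk.
    cwb-decode -C MYCORPUS -P word -S text_id | per_text_filter | cwb-encode ...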
best
Andrew.
-----Original Message-----
From: cwb-bounces at sslmit.unibo.it <cwb-bounces at sslmit.unibo.it> On Behalf Of Stefan Evert
Sent: 14 November 2020 10:38
To: CWBdev Mailing List <cwb at sslmit.unibo.it>
Subject: Re: [CWB] Best practices to manage big corpora in CQPweb
Hi José,
building frequency lists is the most time-consuming step of corpus installation in CQPweb and can be tedious, but your corpora are still in a reasonable size range (both wrt. token count and number of texts).
I definitely wouldn't expect a 140M corpus to take 10 hours. One possible cause is the fact that you're indexing 20 p-attributes, even though CQPweb can't make much use of them (except for keyword or collocation analysis). IIRC, CQPweb indexes unique combinations across all p-attributes, so this is going to be a huge and very expensive database.
If you only need them for CQP queries, a work-around could be to remove them from the registry file while installing the corpus in CQPweb (so CQPweb won't know about them) and then put them back in later (so they're available for CQP queries).
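For illustration, the relevant registry lines can simply be commented out and restored afterwards. A hypothetical excerpt (the boolean attribute names here are invented; substitute your own):

    ## in $CORPUS_REGISTRY/mycorpus -- hide some attributes while installing
    ATTRIBUTE word
    ATTRIBUTE lemma
    ATTRIBUTE pos
    # ATTRIBUTE is_alpha     <- commented out during CQPweb installation,
    # ATTRIBUTE is_stop      <-   uncommented again afterwards for CQP queries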
There are two bottlenecks in building frequency lists:
a) Creating per-text frequency lists is done in PHP and uses only a single thread. This is something you can't get around.
b) Indexing frequency tables in MySQL can take a very long time (I always feel that MySQL could do better there …). If this is your key bottleneck, you should try to optimise the configuration of your MySQL server, e.g. making it use more threads. Are you sure that the MySQL data store is on a fast hard disk?
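If RAM permits, two standard server variables control the point at which implicit in-memory temporary tables are converted to on-disk ones. A hypothetical starting point (the 1 GB figure is a guess, to be tuned against your actual RAM):

    # Temp tables spill to disk once they exceed the *smaller* of these two:
    mysql -e "SET GLOBAL tmp_table_size      = 1073741824;"   # 1 GB
    mysql -e "SET GLOBAL max_heap_table_size = 1073741824;"   # raise both together
    # Also check how much table data InnoDB is allowed to cache in RAM:
    mysql -e "SHOW VARIABLES LIKE 'innodb_buffer_pool_size';"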
Can you watch "top" during the indexing and check which programs are taking up so much time?
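On Linux, iotop (if installed) complements top nicely here, since it shows whether it's disk rather than CPU that is saturated:

    top              # per-process CPU: is it php, mysqld or a cwb-* tool on top?
    sudo iotop -o    # per-process disk I/O; -o shows only processes doing I/O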
Best,
Stefan
> On 11 Nov 2020, at 10:06, José Manuel Martínez Martínez <chozelinek at gmail.com> wrote:
>
> I'm currently working with several corpora in CQPweb which are fairly big (they will stay below the 2.1 billion token limit, though). The corpora will contain between 7,000 and 30,000 texts, and their typical size will range from 500M to 1,500M tokens.
>
> My server (4 cores, 16GB RAM) is only serving CQPweb (no users for now), indexing a corpus from the command line, and running a Python script.
>
> I've seen that the process of creating the frequency lists with offline-freqlists.php is my current bottleneck. I think the process uses at most 2 cores? With a test corpus made up of 2,300 texts and 140M tokens, it took my server around 10 hours. My next run will be on a corpus of around 8,000 texts and 500M tokens. Could this take up to 40 hours before it's ready to use in CQPweb?
>
> How can I optimize the process? How do you usually do it?
> Any tips and tricks on how to handle these very big corpora will be much appreciated.
>
> I think that the part that took longest was when it started generating the frequency lists for every positional attribute. If this assumption is right, I could skip some of the positional attributes (I have twenty of them; eleven are booleans with True/False values only, and the interesting ones are word, lemma, norm, pos, lower, shape, tag, dep, ent_type...).
>
_______________________________________________
CWB mailing list
CWB at sslmit.unibo.it
http://liste.sslmit.unibo.it/mailman/listinfo/cwb
More information about the CWB mailing list