[CWB] Performance limit of CQPweb?

Hardie, Andrew a.hardie at lancaster.ac.uk
Wed Mar 11 15:38:48 CET 2020


Hi Matthias,

There is not a straightforward answer to that question. There are no hard limits except the CWB maximum corpus size of 2.1 billion tokens. Rather, if the system is stretched, things will still work, but more and more slowly as the load goes up.  

Generally, the number of simultaneous users is constrained by the number of CPU cores you have. But that really does mean SIMULTANEOUS, as in pressing the "run query" button (or whatever) in the exact same second! Of course, that doesn't usually happen, as most people working on CQPweb only load a new page once every couple of minutes or so. So in practice, many more users than that can use the system at one time.
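
As a rough illustration of why occasional page loads rarely collide, here is a back-of-envelope sketch using Little's law (average requests in flight = arrival rate x time per request). The function names and all figures (100 users, a page load every 120 seconds, 5 seconds of work per load, 8 cores) are assumptions for illustration, not measurements from any CQPweb server.

```python
def mean_requests_in_flight(active_users, seconds_between_loads, seconds_per_request):
    """Little's law: average concurrency = arrival rate * time per request."""
    arrival_rate = active_users / seconds_between_loads  # requests per second
    return arrival_rate * seconds_per_request


def users_supported(cpu_cores, seconds_between_loads, seconds_per_request):
    """Number of active users at which average concurrency reaches the core count."""
    return cpu_cores * seconds_between_loads / seconds_per_request


if __name__ == "__main__":
    # Assumed figures: 100 users, one page load every 120 s, 5 s of work per load.
    print(mean_requests_in_flight(100, 120, 5))
    # Assumed figures: 8 cores with the same usage pattern.
    print(users_supported(8, 120, 5))
```

These are averages only: bursts can still put more requests in flight than there are cores, in which case requests get slower rather than failing.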

Performance does not necessarily match up directly to corpus size. For some operations, it is the number of types, or the number of texts in the corpus, that determines how fast they run. A 500-million-word corpus with 2 million types is a very different thing to a 500-million-word corpus with 20 million types.

The major bottlenecks for large corpora are:

(a) the creation of temporary frequency lists for sub-parts of the corpus. You will spot if this is a major problem because collocation of restricted queries will start to run slowly.  Creation of frequency lists requires simultaneous heavy use of disk read and disk write by the SQL daemon, so if the daemon has to share the drive bandwidth with other processes, it can get slow. One solution if that happens is to put the SQL daemon's temporary table directory on a different physical disk (not a different partition). 

(b) query restrictions and sub-corpora based on sub-text XML regions. Calculating the extent of complex restrictions can require a lot of RAM and CPU power. If you start getting crashes caused by running out of memory, or overrunning the execution time limit, you probably need to increase the amount of RAM and length of time each PHP process can use. That in turn will reduce the amount of headroom you have for multiple users to access the system simultaneously (in the narrow sense of "at the same moment").
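
The trade-off in (b) can be made concrete with simple arithmetic: raising the per-process memory allowance lowers the number of PHP processes that fit in RAM at once. This sketch assumes the worst case, in which every process uses its full allowance. The function name, the amount reserved for the OS and SQL daemon, and all figures are hypothetical.

```python
def max_concurrent_php_processes(total_ram_mb, reserved_mb, per_process_limit_mb):
    """Worst-case number of PHP processes that fit in RAM at the same moment."""
    if per_process_limit_mb <= 0:
        raise ValueError("per_process_limit_mb must be positive")
    # RAM left over once the OS, SQL daemon, etc. have taken their share.
    available_mb = max(total_ram_mb - reserved_mb, 0)
    return available_mb // per_process_limit_mb


if __name__ == "__main__":
    # Assumed machine: 32 GB RAM, 8 GB kept back for the OS and SQL daemon.
    for limit_mb in (512, 1024, 4096):
        n = max_concurrent_php_processes(32 * 1024, 8 * 1024, limit_mb)
        print(f"{limit_mb} MB per process -> at most {n} processes")
```

In plain PHP the per-process ceilings correspond to the `memory_limit` and `max_execution_time` directives; how a given CQPweb installation sets them is best checked in the admin manual.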

Finally, if there are any other applications/services running on the same machine as CQPweb, their resource needs must be considered too, which is not straightforward. (This is why for the Lancaster CQPweb server I have a terminal permanently open and running "top"!)

There are notes on all this in the admin manual. If you find places where the manual is unclear or does not explain in sufficient detail, please feel free to open a bug or a feature request about that.

Hope this helps

best

Andrew.


-----Original Message-----
From: cwb-bounces at sslmit.unibo.it <cwb-bounces at sslmit.unibo.it> On Behalf Of Fluor Matthias (fluo)
Sent: 11 March 2020 11:26
To: Open source development of the Corpus WorkBench <cwb at sslmit.unibo.it>
Subject: [CWB] Performance limit of CQPweb?


Hi,

I wanted to ask whether there are any empirical figures for performance limits, that is, how many people can access it simultaneously, and how large the corpora can be while still performing reasonably?

As a second question: where are the known bottlenecks in a "standard" CQPweb setup with large corpora (300-900+ million tokens)?

Best,
Matthias