[CWB] Best practices to manage big corpora in CQPweb

José Manuel Martínez Martínez chozelinek at gmail.com
Wed Nov 25 22:17:39 CET 2020


Dear Andrew,

I have a question regarding your comment

> - ensure that MySQL is using a location for temporary files which is on a
*separate physical disk* from the location where the actual tables of the
CQPweb database are stored.

Do you mean the location indicated by the config file variable
$cqpweb_tempdir? Or would it be the value given to MySQL configuration
variables like tmpdir?

Best,
--
José Manuel Martínez Martínez
https://chozelinek.github.io


On Wed, Nov 25, 2020 at 1:47 PM José Manuel Martínez Martínez <
chozelinek at gmail.com> wrote:

> Hi Stefan and Andrew,
>
> Thank you for the quick feedback. Now, I understand much better where I
> can improve the performance of these processes.
> Regarding the issue of MySQL writing to and reading from disk, I'm using
> Amazon's cloud solutions, in particular EFS. This is like a virtual
> network disk that can be mounted on any virtual instance. It is in
> general pretty fast, but its bandwidth is capped over time, so I think
> that after some sustained processing on the same disk (in the end a lot
> of data is transferred to and from the disk), it becomes quite slow. It
> is convenient because different machines can access the same indices, so
> I avoid data redundancy. But it has a performance cost.
>
> Yes, I'm using CQPweb 3.2.6 because I wanted to work with a very stable
> version. I'm happy to test and jump to a more recent one if it is not
> broken; I need CQPweb to be in production.
>
> I will try to optimize the process. At some point, I'll share my
> experience with the community if time permits.
>
> --
> José Manuel Martínez Martínez
> https://chozelinek.github.io
>
>
> On Mon, Nov 16, 2020 at 4:20 AM Hardie, Andrew <a.hardie at lancaster.ac.uk>
> wrote:
>
>> Just a couple of additions to Stefan's answers.
>>
>> I've added the capacity to specify that a particular annotation
>> (p-attribute) should not have freq tables built, but only in 3.3 (trunk).
>> (I guess you are on the 3.2 branch, José.) In 3.3, under "Manage
>> annotation" there is a control called "Needs FT", set to "Y" by default.
>> The only effect of switching it to "N" is that the attribute is absent
>> from "Frequency lists", "Keywords" and "Collocations".
>>
>> As Stefan points out, the bottleneck is in MySQL, but specifically it's
>> an issue of disk *access*.  Creating the freq table requires the creation
>> of large temporary tables to store the results of intermediate "select..."
>> queries. For large corpora, these tables are too big to be held in RAM, so
>> they are stored to temporary disk space. If your MySQL daemon uses the same
>> physical disk for temp space and storage of actual tables, then its
>> read-accesses and write-accesses will be constantly interrupting one
>> another to read one table and write to another. This can cause MAJOR
>> slowdown.
>>
>> Possible remedies - not tested by me, sorry, but theoretically useful!
>>
>> - ensure that MySQL is using a location for temporary files which is on a
>> *separate physical disk* from the location where the actual tables of the
>> CQPweb database are stored.
>>
>> - or, get a faster disk (RAID?) for the single location
>>
>> - or, get enough RAM to do it all without writing temp tables to disk
>>
>> - or, block all use of the server by other users during freq table setup
>> (again, to give the MySQL server connection doing the freq table all
>> available disk read/write bandwidth)
>>
>> - ALSO: creating freq tables is faster for annotations that are set to
>> case-sensitive/accent-sensitive. So, consider setting annotations to CS/AS
>> if you don't need case/accent insensitivity. Again, this is not available
>> in 3.2.
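A sketch of the first and third remedies above, with hypothetical paths and sizes (tmpdir, tmp_table_size and max_heap_table_size are real MySQL server variables, but the values here are placeholders to adapt to your hardware):

```ini
# my.cnf -- illustrative fragment, not a recommended configuration
[mysqld]
datadir = /var/lib/mysql            # actual tables (physical disk 1)
tmpdir  = /mnt/scratch/mysql-tmp    # temporary tables (physical disk 2)

# Let larger intermediate tables stay in RAM before spilling to disk;
# MySQL applies the smaller of the two limits, so raise both together.
tmp_table_size      = 2G
max_heap_table_size = 2G
```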
>>
>> The creation of the per-text frequency data which Stefan mentions is
>> actually normally pretty quick compared to the freq table building,
>> because, unlike the SQL freq tables, no intermediate data is involved:
>> it's just a filter on a pipeline from cwb-decode to cwb-encode.
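The "filter on a pipeline" idea can be mimicked in miniature. The mock input below stands in for the real pipeline (which streams cwb-decode output into cwb-encode); the "text_id<TAB>token" format and the use of awk are illustrative only:

```shell
# One pass over a decoded token stream: per-text frequencies fall out
# directly, with no intermediate tables (mock "text_id<TAB>token" input).
printf 't1\tthe\nt1\tcat\nt1\tthe\nt2\tcat\n' |
  awk -F'\t' '{ n[$1 FS $2]++ } END { for (k in n) print k, n[k] }' |
  sort
# prints each (text, token) pair with its count, e.g. "the" occurs 2x in t1
```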
>>
>> best
>>
>> Andrew.
>>
>>
>> -----Original Message-----
>> From: cwb-bounces at sslmit.unibo.it <cwb-bounces at sslmit.unibo.it> On
>> Behalf Of Stefan Evert
>> Sent: 14 November 2020 10:38
>> To: CWBdev Mailing List <cwb at sslmit.unibo.it>
>> Subject: Re: [CWB] Best practices to manage big corpora in CQPweb
>>
>>
>> Hi José,
>>
>> building frequency lists is the most time-consuming step of corpus
>> installation in CQPweb and can be tedious, but your corpora are still in a
>> reasonable size range (both wrt. token count and number of texts).
>>
>> I definitely wouldn't expect a 140M-token corpus to take 10 hours. One
>> possible cause is the fact that you're indexing 20 p-attributes, even
>> though CQPweb won't be able to work with them anyway (except to do a
>> keyword or collocation analysis). IIRC, CQPweb indexes unique combinations
>> across all p-attributes, so this is going to be a huge and very expensive
>> database.
>>
>> If you only need them for CQP queries, a work-around could be to remove
>> them from the registry file while installing the corpus in CQPweb (so
>> CQPweb won't know about them) and then put them back in later (so they're
>> available for CQP queries).
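The registry work-around could be scripted along these lines. Everything here is a mock: the file path, attribute names and the sed edit are hypothetical, standing in for a real CWB registry file whose extra ATTRIBUTE lines you'd temporarily comment out:

```shell
# Create a stand-in for a real registry file (attribute names made up)
REG=/tmp/demo_corpus_registry
cat > "$REG" <<'EOF'
ATTRIBUTE word
ATTRIBUTE lemma
ATTRIBUTE is_alpha
ATTRIBUTE is_stop
EOF

# Hide the boolean attributes from CQPweb by commenting them out
sed -i.bak -E 's/^ATTRIBUTE (is_alpha|is_stop)$/# &/' "$REG"
grep -c '^ATTRIBUTE' "$REG"    # 2 attributes remain visible

# ... install the corpus in CQPweb here ...

# Restore the original registry so CQP queries can use all attributes
mv "$REG.bak" "$REG"
grep -c '^ATTRIBUTE' "$REG"    # back to 4
```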
>>
>> There are two bottlenecks in building frequency lists:
>>
>> a) Creating per-text frequency lists is done in PHP and uses only a
>> single thread.  This is something you can't get around.
>>
>> b) Indexing frequency tables in MySQL can take a very long time (I always
>> feel that MySQL could do better there …). If this is your key bottleneck,
>> you should try to optimise the configuration of your MySQL server, e.g.
>> making it use more threads. Are you sure that the MySQL data store is on a
>> fast hard disk?
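To check how a running server is currently configured, the standard MySQL SHOW statements can be used (nothing CQPweb-specific here; run them in any MySQL client):

```sql
-- Where does the server write temporary tables?
SHOW VARIABLES LIKE 'tmpdir';
-- How large may an in-memory temp table grow before spilling to disk?
SHOW VARIABLES LIKE 'tmp_table_size';
SHOW VARIABLES LIKE 'max_heap_table_size';
-- How often have temp tables spilled to disk so far?
SHOW GLOBAL STATUS LIKE 'Created_tmp_disk_tables';
```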
>>
>> Can you watch "top" during the indexing and check which programs are
>> taking up so much time?
>>
>> Best,
>> Stefan
>>
>>
>> > On 11 Nov 2020, at 10:06, José Manuel Martínez Martínez <
>> chozelinek at gmail.com> wrote:
>> >
>> > I'm currently working with several corpora in CQPweb which are fairly
>> big (they will stay below the 2.1-billion-token limit, though). The
>> corpora will contain between 7,000 and 30,000 texts, and their typical
>> size will range from 500M to 1,500M tokens.
>> >
>> > My server (4 cores, 16GB RAM) is only serving CQPweb (no users for
>> now), indexing a corpus from the command line and running a python script.
>> >
>> > I've seen that the process of creating the frequency lists with
>> offline-freqlists.php is my current bottleneck. I think the process uses
>> at most 2 cores? With a test corpus made up of 2,300 texts and 140M
>> tokens, it took my server around 10 hours. My next run will be on a corpus
>> of around 8,000 texts and 500M tokens. Could this take up to 40 hours
>> before it is ready to be used in CQPweb?
>> >
>> > How can I optimize the process? How do you usually do it?
>> > Any tips and tricks on how to handle these very big corpora will be
>> much appreciated.
>> >
>> > I think that the part that took longest was when it started generating
>> the frequency lists for every positional attribute. If this assumption is
>> right, I could skip some of the positional attributes (I have twenty of
>> them; eleven are booleans with True/False values only, and the interesting
>> ones are word, lemma, norm, pos, lower, shape, tag, dep, ent_type...).
>> >
>>
>> _______________________________________________
>> CWB mailing list
>> CWB at sslmit.unibo.it
>> http://liste.sslmit.unibo.it/mailman/listinfo/cwb
>>
>