[CWB] Best practices to manage big corpora in CQPweb
José Manuel Martínez Martínez
chozelinek at gmail.com
Thu Nov 26 14:28:32 CET 2020
Thanks for the clarification, Andrew!
Best for now!
--
José Manuel Martínez Martínez
https://chozelinek.github.io
On Thu, Nov 26, 2020 at 2:14 PM Hardie, Andrew <a.hardie at lancaster.ac.uk>
wrote:
> Hi José
>
> I was referring to TMPDIR specified in MySQL / MariaDB configuration – see
> e.g. here https://dev.mysql.com/doc/refman/8.0/en/temporary-files.html
>
> $cqpweb_tempdir is for cqp query cache and other misc temp files. The
> mysql daemon’s temporary storage has nothing to do with this.
>
> best
>
> Andrew.
>
> *From:* cwb-bounces at sslmit.unibo.it <cwb-bounces at sslmit.unibo.it> *On
> Behalf Of* José Manuel Martínez Martínez
> *Sent:* 25 November 2020 21:18
> *To:* Open source development of the Corpus WorkBench <cwb at sslmit.unibo.it>
> *Subject:* Re: [CWB] Best practices to manage big corpora in CQPweb
>
> Dear Andrew,
>
> I have a question regarding your comment
>
> > - ensure that MySQL is using a location for temporary files which is on
> > a *separate physical disk* to the location where the actual tables for
> > the CQPweb database are.
>
> Do you mean the location indicated with the config file variable
> $cqpweb_tempdir? Or would it be the value that could be given to other
> mysql configuration variables like tmpdir?
>
> Best,
>
> --
>
> José Manuel Martínez Martínez
>
> https://chozelinek.github.io
>
> On Wed, Nov 25, 2020 at 1:47 PM José Manuel Martínez Martínez <
> chozelinek at gmail.com> wrote:
>
> Hi Stefan and Andrew,
>
> Thank you for the quick feedback. Now I understand much better where I
> can improve the performance of these processes.
>
> Regarding the issue of MySQL writing to and reading from disk: I'm using
> Amazon's cloud solutions, and in particular EFS. This is like a virtual
> local network disk that can be mounted on any virtual instance. It is in
> general pretty fast, but the full bandwidth is only available for a
> limited amount of time, so I think that after some processing on the same
> disk (in the end a lot of information is being transferred from and to the
> disk), it becomes quite slow. It is convenient because different computers
> can access the same indices, so I avoid data redundancy. But it has a
> performance cost.
>
> Yes, I'm using CQPweb 3.2.6 because I wanted to work with a very stable
> version. Happy to test and jump to a more recent one if it is not broken. I
> need CQPweb to be in production.
>
> I will try to optimize the process. At some point, I'll share my
> experience with the community if time permits.
>
> --
>
> José Manuel Martínez Martínez
>
> https://chozelinek.github.io
>
> On Mon, Nov 16, 2020 at 4:20 AM Hardie, Andrew <a.hardie at lancaster.ac.uk>
> wrote:
>
> Just a couple of additions to Stefan's answers.
>
> I've added the capacity to specify that a particular annotation
> (p-attribute) should not have freq tables built, but only in 3.3 (trunk).
> (I guess you are on the 3.2 branch, José.) In 3.3, under "Manage
> annotation" there is a control called "Needs FT", set to "Y" by default.
> The only effect of switching to N is that the attribute is absent from
> "Frequency lists", "keywords" and "collocations".
>
> As Stefan points out, the bottleneck is in MySQL, but specifically it's an
> issue of disk *access*. Creating the freq table requires the creation of
> large temporary tables to store the results of intermediate "select..."
> queries. For large corpora, these tables are too big to be held in RAM, so
> they are stored to temporary disk space. If your MySQL daemon uses the same
> physical disk for temp space and storage of actual tables, then its
> read-accesses and write-accesses will be constantly interrupting one
> another to read one table and write to another. This can cause MAJOR
> slowdown.
>
> Possible remedies - not tested by me, sorry, but theoretically useful!
>
> - ensure that MySQL is using a location for temporary files which is on a
> *separate physical disk* to the location where the actual tables for the
> CQPweb database are.
>
> - or, get a faster disk (RAID?) for the single location
>
> - or, get enough RAM to do it all without writing temp tables to disk
>
> - or, block all use of the server by other users during freq table setup
> (again, to give the MySQL server connection doing the freq table all
> available disk read/write bandwidth)
>
> - ALSO: creating freq tables is faster for annotations that are set to
> case-sensitive/accent-sensitive. So, consider setting annotations to CS/AS
> if you don't need C/A insensitivity. Again this is not available in 3.2.
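
A quick way to sanity-check the first remedy is to compare the device IDs of the two directories. This is only a minimal sketch in Python, assuming you have already looked up the real paths yourself (for instance with SHOW VARIABLES LIKE 'tmpdir' and SHOW VARIABLES LIKE 'datadir' in the mysql client). The function name on_same_device is made up for illustration. Different device IDs only tell you the directories are on different filesystems; two partitions can still share one physical disk, so treat the result as a hint, not a proof.

```python
import os
import sys


def on_same_device(path_a: str, path_b: str) -> bool:
    """Return True if both paths are on the same filesystem device.

    Hypothetical helper: it compares st_dev only, so it cannot tell
    whether two different filesystems share one physical disk.
    """
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage (paths are whatever your MySQL server reports):
    #   python check_dev.py <mysql tmpdir> <mysql datadir>
    tmp_dir, data_dir = sys.argv[1], sys.argv[2]
    if on_same_device(tmp_dir, data_dir):
        print("tmpdir and datadir are on the SAME device")
    else:
        print("tmpdir and datadir are on different devices")
```

If the two directories turn out to be on the same device, that is the situation described above in which reads and writes compete with each other.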
>
> The creation of the per-text frequency data which Stefan mentions is
> actually normally pretty quick, compared to the freq table building,
> because unlike the SQL freq tables, no intermediate data is involved: it's
> just a filter on a pipeline from cwb-decode to cwb-encode.
>
> best
>
> Andrew.
>
>
> -----Original Message-----
> From: cwb-bounces at sslmit.unibo.it <cwb-bounces at sslmit.unibo.it> On Behalf
> Of Stefan Evert
> Sent: 14 November 2020 10:38
> To: CWBdev Mailing List <cwb at sslmit.unibo.it>
> Subject: Re: [CWB] Best practices to manage big corpora in CQPweb
>
>
> Hi José,
>
> building frequency lists is the most time-consuming step of corpus
> installation in CQPweb and can be tedious, but your corpora are still in a
> reasonable size range (both wrt. token count and number of texts).
>
> I definitely wouldn't expect a 140M corpus to take 10 hours. One
> possibility is the fact that you're indexing 20 p-attributes, even though
> CQPweb won't be able to work with them anyway (except to do a keyword or
> collocation analysis). IIRC, CQPweb indexes unique combinations across all
> p-attributes, so this is going to be a huge and very expensive database.
>
> If you only need them for CQP queries, a work-around could be to remove
> them from the registry file while installing the corpus in CQPweb (so
> CQPweb won't know about them) and then put them back in later (so they're
> available for CQP queries).
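
The work-around above could be scripted roughly as follows. This is a sketch under assumptions, not a tested procedure: it assumes each p-attribute is declared in the registry file on its own line as "ATTRIBUTE <name>" and that lines starting with "#" are treated as comments. The function names, the "# HIDDEN " marker and the attribute name is_alpha are all made up for illustration. Keep a backup copy of the registry file before trying anything like this.

```python
from typing import Iterable

HIDDEN_PREFIX = "# HIDDEN "  # hypothetical marker, chosen for this sketch


def hide_attributes(registry_text: str, names: Iterable[str]) -> str:
    """Comment out the ATTRIBUTE declarations for the given p-attributes."""
    to_hide = set(names)
    out = []
    for line in registry_text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "ATTRIBUTE" and parts[1] in to_hide:
            out.append(HIDDEN_PREFIX + line)
        else:
            out.append(line)
    return "\n".join(out) + "\n"


def restore_attributes(registry_text: str) -> str:
    """Undo hide_attributes by stripping the marker again."""
    out = []
    for line in registry_text.splitlines():
        if line.startswith(HIDDEN_PREFIX):
            out.append(line[len(HIDDEN_PREFIX):])
        else:
            out.append(line)
    return "\n".join(out) + "\n"


# Example with a tiny, invented registry fragment.
sample = "ATTRIBUTE word\nATTRIBUTE is_alpha\nSTRUCTURE s\n"
hidden = hide_attributes(sample, ["is_alpha"])
restored = restore_attributes(hidden)
```

The idea would be to write the hidden version to the registry before installing the corpus in CQPweb and the restored version afterwards, so the attributes are available for CQP queries again.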
>
> There are two bottlenecks in building frequency lists:
>
> a) Creating per-text frequency lists is done in PHP and uses only a single
> thread. This is something you can't get around.
>
> b) Indexing frequency tables in MySQL can take a very long time (I always
> feel that MySQL could do better there …). If this is your key bottleneck,
> you should try to optimise the configuration of your MySQL server, e.g.
> making it use more threads. Are you sure that the MySQL data store is on a
> fast hard disk?
>
> Can you watch "top" during the indexing and check which programs are
> taking up so much time?
>
> Best,
> Stefan
>
>
> > On 11 Nov 2020, at 10:06, José Manuel Martínez Martínez
> > <chozelinek at gmail.com> wrote:
> >
> > I'm currently working with several corpora in CQPweb which are fairly
> > big (they will be below the 2.1 billion limit though). The corpora will
> > contain between 7000 and 30000 texts, and the typical size will range
> > from 500M to 1500M tokens.
> >
> > My server (4 cores, 16GB RAM) is only serving CQPweb (no users for now),
> > indexing a corpus from the command line and running a python script.
> >
> > I've seen that the process of creating the frequency lists with
> > offline-freqlists.php is my current bottleneck. I think the process uses at
> > max. 2 cores? With a test corpus made up of 2300 texts and 140M tokens, it
> > took my server around 10 hours. My next run will be on a corpus of around
> > 8000 texts and 500M tokens. Could this take up to 40 hours to be ready to
> > be used in CQPweb?
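
As a back-of-the-envelope check of that figure: if runtime scaled purely linearly with token count (an assumption that nothing in this thread confirms, since disk contention and the number of p-attributes also matter), the numbers above would extrapolate as in this small Python sketch. The function name estimate_hours is made up for illustration.

```python
def estimate_hours(ref_tokens: float, ref_hours: float, target_tokens: float) -> float:
    """Linear extrapolation: hours scale in proportion to token count."""
    return ref_hours * target_tokens / ref_tokens


# Reference run from the message: 140M tokens took about 10 hours.
estimate_500m = estimate_hours(140e6, 10.0, 500e6)
print(f"{estimate_500m:.1f} hours")  # prints "35.7 hours"
```

So roughly 36 hours for 500M tokens under that assumption, which is in the same range as the 40 hours guessed above.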
> >
> > How can I optimize the process? How do you usually do it?
> > Any tips and tricks on how to handle these very big corpora would be
> > much appreciated.
> >
> > I think that the part that took longest was when it started generating
> > the frequency lists for every positional attribute. If this assumption is
> > right, I could skip some of the positional attributes (I have twenty of
> > them; eleven of them are booleans with True/False values only; the
> > interesting ones are word, lemma, norm, pos, lower, shape, tag, dep,
> > ent_type...).
> >
>
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://liste.sslmit.unibo.it/mailman/listinfo/cwb