[CWB] Best practices to manage big corpora in CQPweb

José Manuel Martínez Martínez chozelinek at gmail.com
Wed Nov 25 13:47:26 CET 2020


Hi Stefan and Andrew,

Thank you for the quick feedback. Now I understand much better where I can
improve the performance of these processes.
Regarding the issue of MySQL writing to and reading from disk: I'm using
Amazon's cloud solutions, in particular EFS. This is essentially a network
file system that can be mounted on any virtual instance. It is in general
pretty fast, but its throughput is governed by burst credits, so I think
that after some sustained processing on the same disk (in the end a lot of
information is being transferred to and from it), the credits run out and
it becomes quite slow. It is convenient because different computers can
access the same indices, so I avoid data redundancy, but it comes at a
performance cost.
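
A quick way to confirm that burst credits are the culprit is to watch the
file system's BurstCreditBalance metric in CloudWatch while the freq tables
are building. A minimal sketch with the AWS CLI (the file system ID and the
time range are placeholders):

    aws cloudwatch get-metric-statistics \
        --namespace AWS/EFS \
        --metric-name BurstCreditBalance \
        --dimensions Name=FileSystemId,Value=fs-12345678 \
        --start-time 2020-11-24T00:00:00Z \
        --end-time 2020-11-25T00:00:00Z \
        --period 3600 --statistics Average

If the balance drains towards zero, EFS falls back to its much lower
baseline throughput, which would explain the slowdown.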

Yes, I'm using CQPweb 3.2.6 because I wanted to work with a very stable
version. I'm happy to test and jump to a more recent one if it is not
broken; I need CQPweb to be in production.

I will try to optimize the process. At some point, I'll share my experience
with the community if time permits.

--
José Manuel Martínez Martínez
https://chozelinek.github.io


On Mon, Nov 16, 2020 at 4:20 AM Hardie, Andrew <a.hardie at lancaster.ac.uk>
wrote:

> Just a couple of additions to Stefan's answers.
>
> I've added the capacity to specify that a particular annotation
> (p-attribute) should not have freq tables built, but only in 3.3 (trunk).
> (I guess you are on the 3.2 branch, José.) In 3.3, under "Manage
> annotation" there is a control called "Needs FT", set to "Y" by default.
> The only effect of switching it to "N" is that the attribute is absent
> from "Frequency lists", "Keywords" and "Collocations".
>
> As Stefan points out, the bottleneck is in MySQL, but specifically it's an
> issue of disk *access*.  Creating the freq table requires the creation of
> large temporary tables to store the results of intermediate "select..."
> queries. For large corpora, these tables are too big to be held in RAM, so
> they are stored to temporary disk space. If your MySQL daemon uses the same
> physical disk for temp space and storage of actual tables, then its
> read-accesses and write-accesses will be constantly interrupting one
> another to read one table and write to another. This can cause MAJOR
> slowdown.
>
> Possible remedies - not tested by me, sorry, but theoretically useful!
>
> - ensure that MySQL is using a location for temporary files which is on a
> *separate physical disk* from the location where the actual tables of the
> CQPweb database are stored (a config sketch for this follows the list).
>
> - or, get a faster disk (RAID?) for the single location
>
> - or, get enough RAM to do it all without writing temp tables to disk
> (also covered in the sketch below)
>
> - or, block all use of the server by other users during freq table setup
> (again, to give the MySQL server connection doing the freq table all
> available disk read/write bandwidth)
>
> - ALSO: creating freq tables is faster for annotations that are set to
> case-sensitive/accent-sensitive. So, consider setting annotations to CS/AS
> if you don't need case/accent insensitivity. Again, this is not available
> in 3.2.
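>
> A minimal my.cnf sketch for the first and third remedies above (paths and
> sizes are placeholders to adapt, not recommendations; all four variables
> are standard MySQL system variables):
>
>     [mysqld]
>     # keep implicit temp tables on a different physical disk
>     # than the data directory
>     datadir = /var/lib/mysql
>     tmpdir  = /mnt/fastdisk/mysql-tmp
>
>     # let larger intermediate tables stay in RAM before spilling
>     # to disk (the smaller of the two limits applies)
>     tmp_table_size      = 2G
>     max_heap_table_size = 2G
>
> Restart the daemon after editing, and make sure the new tmpdir is writable
> by the mysql user.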
>
> The creation of the per-text frequency data which Stefan mentions is
> actually normally pretty quick compared to the freq table building,
> because unlike the SQL freq tables, no intermediate data is involved: it's
> just a filter on a pipeline from cwb-decode to cwb-encode.
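>
> The rough shape of such a pipeline, for illustration only (the corpus
> name, paths and the filter script are placeholders; this is not CQPweb's
> actual PHP code):
>
>     # decode tokens plus text boundaries, filter, re-encode
>     cwb-decode -C MYCORPUS -P word -S text \
>         | ./per-text-freq-filter \
>         | cwb-encode -d /cwb/data/mycorpus_ptf \
>                      -R /cwb/registry/mycorpus_ptf -S text
>
> Every token is read once and written once, sequentially, so this step is
> bandwidth-bound rather than dominated by random disk access.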
>
> best
>
> Andrew.
>
>
> -----Original Message-----
> From: cwb-bounces at sslmit.unibo.it <cwb-bounces at sslmit.unibo.it> On Behalf
> Of Stefan Evert
> Sent: 14 November 2020 10:38
> To: CWBdev Mailing List <cwb at sslmit.unibo.it>
> Subject: Re: [CWB] Best practices to manage big corpora in CQPweb
>
>
> Hi José,
>
> building frequency lists is the most time-consuming step of corpus
> installation in CQPweb and can be tedious, but your corpora are still in a
> reasonable size range (both wrt. token count and number of texts).
>
> I definitely wouldn't expect a 140M corpus to take 10 hours.  One
> possibility is the fact that you're indexing 20 p-attributes, even though
> CQPweb won't be able to work with them anyway (except to do a keyword or
> collocation analysis).  IIRC, CQPweb indexes unique combinations across all
> p-attributes, so this is going to be a huge and very expensive database.
>
> If you only need them for CQP queries, a work-around could be to remove
> them from the registry file while installing the corpus in CQPweb (so
> CQPweb won't know about them) and then put them back in later (so they're
> available for CQP queries).
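>
> For illustration: each p-attribute is declared on its own line in the CWB
> registry file, so you can comment the extra ones out and restore them
> later (the path and attribute names are just examples; keep a backup of
> the file):
>
>     # e.g. in /usr/local/share/cwb/registry/mycorpus
>     ATTRIBUTE word
>     ATTRIBUTE lemma
>     ATTRIBUTE pos
>     # hidden while installing in CQPweb; uncomment afterwards for CQP:
>     # ATTRIBUTE shape
>     # ATTRIBUTE is_alpha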
>
> There are two bottlenecks in building frequency lists:
>
> a) Creating per-text frequency lists is done in PHP and uses only a single
> thread.  This is something you can't get around.
>
> b) Indexing frequency tables in MySQL can take a very long time (I always
> feel that MySQL could do better there …). If this is your key bottleneck,
> you should try to optimise the configuration of your MySQL server, e.g.
> making it use more threads. Are you sure that the MySQL data store is on a
> fast hard disk?
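>
> If you're not sure where that data store lives, you can ask the server and
> then check which device the reported paths are on (standard MySQL
> variables and Linux tools):
>
>     mysql> SELECT @@datadir, @@tmpdir;
>     $ df -h /var/lib/mysql    # substitute the reported paths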
>
> Can you watch "top" during the indexing and check which programs are
> taking up so much time?
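>
> For example (both are standard Linux tools):
>
>     top -c         # -c shows full command lines
>     iostat -x 5    # per-device utilisation, from the sysstat package
>
> A high %wa (I/O wait) in the top header, with mysqld processes mostly in
> state "D", would point at disk access rather than CPU as the limit.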
>
> Best,
> Stefan
>
>
> > On 11 Nov 2020, at 10:06, José Manuel Martínez Martínez <
> > chozelinek at gmail.com> wrote:
> >
> > I'm currently working with several corpora in CQPweb which are fairly
> > big (they will stay below the 2.1 billion token limit, though). The
> > corpora will contain between 7,000 and 30,000 texts, and the typical
> > size in tokens will range from 500M to 1,500M.
> >
> > My server (4 cores, 16GB RAM) is only serving CQPweb (no users for
> > now), indexing a corpus from the command line, and running a Python
> > script.
> >
> > I've seen that the process of creating the frequency lists with
> > offline-freqlists.php is my current bottleneck. I think the process
> > uses at most 2 cores? With a test corpus made up of 2,300 texts and
> > 140M tokens, it took my server around 10 hours. My next run will be on
> > a corpus of around 8,000 texts and 500M tokens. Could this take up to
> > 40 hours before it is ready to be used in CQPweb?
> >
> > How can I optimize the process? How do you usually do it?
> > Any tips and tricks on how to handle these very big corpora will be
> > much appreciated.
> >
> > I think that the part that took longest was when it started generating
> > the frequency lists for every positional attribute. If this assumption
> > is right, I could skip some of the positional attributes (I have twenty
> > of them; eleven are booleans with True/False values only, and the
> > interesting ones are word, lemma, norm, pos, lower, shape, tag, dep,
> > ent_type...).
> >
>
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://liste.sslmit.unibo.it/mailman/listinfo/cwb
>