[CWB] Limit for text elements in CQPweb?

Hardie, Andrew a.hardie at lancaster.ac.uk
Tue Jun 23 12:39:49 CEST 2020


Stefan's right, there's no limit - just steadily degrading performance as the number of texts grows beyond a certain point. In the past I've found things getting pretty slow at around tens of thousands of texts. But of course, much depends on your hardware.

best

Andrew.

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it <cwb-bounces at sslmit.unibo.it> On Behalf Of Stefan Evert
Sent: 19 June 2020 19:29
To: CWBdev Mailing List <cwb at sslmit.unibo.it>
Subject: Re: [CWB] Limit for text elements in CQPweb?


> The corpus is a collection of almost 5 million Tweets/texts with around
> 140 million tokens. I could index the whole corpus without any problems
> in CWB (version 3.4.18),

CQPweb should be able to handle this in theory.  As far as I know – Andrew will have to chime in if I'm wrong – there is no formal limit on the number of individual texts.

HOWEVER, you're not going to be happy with this corpus because CQPweb will become unusably slow, due to the way that many operations on text metadata are implemented and because of performance issues in MySQL.  (The recent CQPweb 3.3 might perform better because a lot of the internal architecture was changed.)

What we do with Twitter corpora is to pre-group all tweets with the same CQPweb-relevant metadata values (i.e. classifications that can be used for subcorpora and frequency distributions) into collections that are then encoded as "texts".  Accessing screen names, tweet IDs, URLs etc. needs a little bit of magic with XML visualisations and JavaScript, but should be doable.
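
A minimal sketch of the pre-grouping idea described above, not the actual scripts used for these corpora. It assumes each tweet is already tokenised and held as a dict; the field names (`id`, `user`, `lang`, `year`, `tokens`), the `<tweet>` element, and the `t000001`-style text IDs are all hypothetical choices made for illustration. Tweets that share the same values for the chosen metadata keys are collected into one `<text>` region of CWB vertical text, so the number of texts equals the number of distinct metadata combinations rather than the number of tweets.

```python
from collections import defaultdict
from xml.sax.saxutils import escape, quoteattr


def group_tweets(tweets, keys):
    """Group tweets by the tuple of their values for the given metadata keys."""
    groups = defaultdict(list)
    for tweet in tweets:
        groups[tuple(str(tweet[k]) for k in keys)].append(tweet)
    return groups


def to_vertical(groups, keys):
    """Yield vertical-text lines: one <text> per group, one <tweet> per tweet."""
    for n, values in enumerate(sorted(groups), start=1):
        # Metadata shared by the whole group becomes attributes of <text>.
        attrs = " ".join(f"{k}={quoteattr(v)}" for k, v in zip(keys, values))
        yield f'<text id="t{n:06d}" {attrs}>'
        for tweet in groups[values]:
            # Per-tweet details (hypothetical fields) stay on a nested element.
            tweet_id = quoteattr(str(tweet["id"]))
            user = quoteattr(str(tweet["user"]))
            yield f"<tweet id={tweet_id} user={user}>"
            for token in tweet["tokens"]:
                # Escape &, < and > so a token is not mistaken for a tag.
                yield escape(token)
            yield "</tweet>"
        yield "</text>"


if __name__ == "__main__":
    # Invented sample data, for illustration only.
    sample = [
        {"id": 1, "user": "a", "lang": "en", "year": 2020, "tokens": ["hello", "world"]},
        {"id": 2, "user": "b", "lang": "en", "year": 2020, "tokens": ["hi"]},
        {"id": 3, "user": "c", "lang": "de", "year": 2019, "tokens": ["hallo"]},
    ]
    keys = ["lang", "year"]
    print("\n".join(to_vertical(group_tweets(sample, keys), keys)))
```

With this layout, only the attributes on `<text>` would serve as text metadata for subcorpora and frequency distributions, while the per-tweet attributes remain available for display through XML visualisations.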

Best,
Stefan
_______________________________________________
CWB mailing list
CWB at sslmit.unibo.it
http://liste.sslmit.unibo.it/mailman/listinfo/cwb

