[CWB] Maximum corpus size

Stephanie Evert stefanML at collocations.de
Mon Feb 6 11:42:44 CET 2023


Dear Austin,

I think you've been misreading the encoding tutorial, which says that

	The maximum corpus size is 2,147,483,647 tokens (the largest value that can be stored as a signed 32-bit integer). In the CWB source code, this is represented by the macro CL_MAX_CORPUS_SIZE.

	https://cwb.sourceforge.io/files/CWB_Encoding_Tutorial/B.html

So the maximum size is a hard upper limit, and there is no indication here that it would be sensible to modify CL_MAX_CORPUS_SIZE in the source code.
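
To make the arithmetic concrete, here is a minimal sketch in Python (not CWB code) of why a value ten times larger cannot work if it has to be stored as a signed 32-bit integer, as the tutorial describes. The helper name fits_in_int32 and the little-endian byte order are assumptions chosen purely for illustration; the sketch makes no claim about CWB's actual on-disk layout.

```python
import struct

# Largest signed 32-bit integer: the figure quoted in the tutorial.
INT32_MAX = 2**31 - 1


def fits_in_int32(n: int) -> bool:
    """Return True if n can be stored as a signed 32-bit integer.

    Hypothetical helper for illustration only; "<i" means a
    little-endian signed 32-bit int (byte order chosen arbitrarily).
    """
    try:
        struct.pack("<i", n)
        return True
    except struct.error:
        # Raised when n is outside -2**31 .. 2**31 - 1.
        return False


print(INT32_MAX)                       # the tutorial's limit
print(fits_in_int32(INT32_MAX))        # the limit itself fits
print(fits_in_int32(INT32_MAX + 1))    # one more token position does not
print(fits_in_int32(INT32_MAX * 10))   # a 10x larger limit does not either
```

Raising a constant in a header file does not widen the integer type that holds the value, which is consistent with the limit still being reported as 2,147,483,647 after the change.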

Such limitations will be lifted by the new Ziggurat backend, once we finally get round to implementing it.  Things are progressing, though, so I'm inclined to say “stay tuned”.

Best,
Stephanie


> On 6 Feb 2023, at 09:53, Austin Yang <austin.yang.2014 at gmail.com> wrote:
> 
> Dear all,
> I'm trying to encode a corpus larger than 2 GiB. The CWB encoding tutorial notes that this is possible by changing CL_MAX_CORPUS_SIZE in the CWB source code. I increased the parameter (CL_MAX_CORPUS_SIZE) in the cl.h file (I'm not sure whether this is the CWB source code mentioned in the tutorial) by a factor of 10, but the CQPweb site still shows that the maximum is 2,147,483,647 tokens. Did I miss something in the tutorial? Any comments would be greatly appreciated!
> 
> CWB version 3.5.0
> 
> 
> Best,
> Austin Yang (楊承洋)
> MS in Cognitive Neuroscience, NCU
> BS in Psychology, CYCU