[CWB] Expanding existing corpora

Maarten Janssen maartenpt at gmail.com
Sat Jun 15 12:43:50 CEST 2019


Updating a CQP corpus in place is not really possible (I mean the raw CQP files, not necessarily the MySQL tables, which I know little about). There are various attempts out there to do things in parts, but in the end, because of the way the files are set up, there is no safe way of updating them: the files hold an index of values, in corpus order, and a list linking corpus positions to the numbers in that index.

Theoretically, you could ignore the corpus order and just change the index number stored at a given corpus position. But unless you know beforehand which positions to change, that would not save time, since you would still have to go through the entire corpus; and you would have to be completely certain that no new values have appeared (and that no values which were possible but unused have now come into use). And even if you managed that, there are other files that count the number of values and such, and those would have to be recompiled in any case.

So unless you have gigaword corpora, the best way is just to recompile. You might be able to get away with recompiling only the CQP corpus itself (running just cwb-encode), which is relatively fast: in my experience, a 500M corpus takes about half an hour to compile, and that is using my own encoder directly from XML, so the native cwb-encode is likely to be even faster than that.
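
To make the point about the index concrete, here is a toy Python model of a single attribute. It is only a sketch of the idea described above (a value index in corpus order, a list of index numbers per corpus position, and a separate count per value), not the actual CWB on-disk format; the function name `encode` and the sample SES values are made up for illustration.

```python
from collections import Counter


def encode(tokens):
    """Toy encoding of one attribute (hypothetical, not the real CWB format).

    Returns:
      lexicon   - list of distinct values, in order of first occurrence
      positions - for each corpus position, the index number of its value
      freqs     - for each index number, how often that value occurs
    """
    lexicon = []        # index number -> value
    id_of = {}          # value -> index number
    positions = []      # corpus position -> index number
    for tok in tokens:
        if tok not in id_of:
            id_of[tok] = len(lexicon)
            lexicon.append(tok)
        positions.append(id_of[tok])
    counts = Counter(positions)
    freqs = [counts[i] for i in range(len(lexicon))]
    return lexicon, positions, freqs


# First release: only SES values 1, 2, 3 and 6 occur.
old = ["1", "2", "1", "3", "6"]
# Later release: a text with SES 4 now appears early in the corpus.
new = ["1", "4", "2", "1", "3", "6"]

old_lex, old_pos, old_freq = encode(old)
new_lex, new_pos, new_freq = encode(new)

# The value "2" had index number 1 before and has index number 2 now,
# so every position that pointed at it must be rewritten, and the
# per-value counts have to be rebuilt as well.
print(old_lex, old_pos, old_freq)
print(new_lex, new_pos, new_freq)
```

In this toy version a single new value shifts the index numbers of everything first seen after it, which is why a full recompile tends to be simpler than trying to patch the files.
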
> I have a situation which is probably not the norm for most users here. I
> have a corpus which I will be putting online gradually, in 20 or 30
> installments over the next two years or so, as texts can be reviewed a
> second time for personally identifying or sensitive information, and such
> things can be redacted (it's a speech corpus).
> 
> When a new batch of texts is ready I process, tag and compile all the files
> that are fit for public consumption into a CQP corpus, upload the new set
> of CQP files to the server (replacing the old ones), and then re-run the
> frequency and STTR calculation scripts on the server. This updates the
> frequencies shown everywhere I've looked (test query results, corpus
> metadata, etc.) -- so far, so good.
> 
> The one thing I haven't been able to get to update, however, is the values
> of the text metadata and word-level annotation variables (as seen in the
> selection boxes of restricted queries and subcorpus creation).
> 
> Thus, if the first version of the corpus only had four of six socioeconomic
> statuses (say 1, 2, 3, 6) and a new version includes one or more speakers
> of SES 4, this new SES doesn't show up anywhere.
> 
> *Is there any way to update a corpus so that it rescans metadata like p-
> and s-attributes and their values?* My goal is to avoid having to recreate
> the corpus from scratch over and over.
> 
> Thanks in advance,
> Scott
> 
> NOTE Unless I've misunderstood something, I'm *not* adding new p- or
> s-attributes, but rather new *values* for existing p-attributes.


