[CWB] Expanding existing corpora

Mon Jun 17 00:52:10 CEST 2019

PS, Scott, if you can get your templates working, you might find check-in 1250 of interest.

cd bin
php install-corpus.php --help

From: Hardie, Andrew
Sent: 16 June 2019 22:05
To: Open source development of the Corpus WorkBench <cwb at sslmit.unibo.it>
Subject: RE: [CWB] Expanding existing corpora

cd bin

>> creating and assigning corpus and frequency list permissions

./cqpweb add_corpus_to_privilege_scope PRIVILEGE-INTEGER-ID CORPUS-HANDLE

./cqpweb remove_corpus_from_privilege_scope PRIVILEGE-INTEGER-ID CORPUS-HANDLE

./cqpweb create_corpus_default_privileges CORPUS-HANDLE

./cqpweb add_new_privilege 1 "" "Permission to use at restricted level (initially has scope over no corpora, they can be added later)"

./cqpweb add_new_privilege 2 "" "Permission to use at normal level "

./cqpweb add_new_privilege 3 "" "Permission to use at full level "

./cqpweb add_new_privilege 4 5000000 "Permission to create freq lists up to 500 K tokens"

./cqpweb grant_privilege_to_user USERNAME PRIVILEGE-INTEGER-ID

./cqpweb grant_privilege_to_group GROUP-NAME PRIVILEGE-INTEGER-ID

./cqpweb remove_grant_from_user USERNAME PRIVILEGE-INTEGER-ID

./cqpweb remove_grant_from_group GROUP-NAME PRIVILEGE-INTEGER-ID

>> setting things like inter-linear gloss views, corpus title and corpus-level metadata

./cqpweb update_corpus_visualisation_gloss CORPUS-HANDLE 1-OR-0-FOR-SHOW-IN-CONCORDANCE 1-OR-0-FOR-SHOW-IN-CONTEXT P-ATTRIBUTE-HANDLE

./cqpweb update_corpus_visualisation_translate CORPUS-HANDLE 1-OR-0-FOR-SHOW-IN-CONCORDANCE 1-OR-0-FOR-SHOW-IN-CONTEXT S-ATTRIBUTE-HANDLE

./cqpweb add_variable_corpus_metadata CORPUS-HANDLE ATTRIBUTE-DESCRIPTION VALUE-CONTENT

./cqpweb update_corpus_title CORPUS-HANDLE "new title goes here"
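
The commands above could be chained into one script per corpus version. The following is only a minimal sketch, assuming it is run from the bin directory as shown earlier. The function names, the corpus handle, the title, the privilege ID and the group name are hypothetical placeholders, and the order of the steps is an assumption rather than a documented requirement. By default it only prints the commands; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Sketch: replay the post-install setup for one corpus with ./cqpweb.
# Only subcommands quoted in this thread are used; all values are placeholders.

DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode (argument quoting is not shown),
# otherwise execute it exactly as given.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Usage: setup_corpus CORPUS-HANDLE TITLE PRIVILEGE-INTEGER-ID GROUP-NAME
setup_corpus() {
  corpus=$1
  title=$2
  privilege_id=$3
  group=$4
  run ./cqpweb create_corpus_default_privileges "$corpus"
  run ./cqpweb add_corpus_to_privilege_scope "$privilege_id" "$corpus"
  run ./cqpweb grant_privilege_to_group "$group" "$privilege_id"
  run ./cqpweb update_corpus_title "$corpus" "$title"
}

# Hypothetical example values.
setup_corpus mycorpus_v2 "My Corpus, version 2" 7 students
```

Reviewing the dry-run output before setting DRY_RUN=0 is a cheap way to check handles and privilege IDs against the live installation.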

From: cwb-bounces at sslmit.unibo.it <cwb-bounces at sslmit.unibo.it> On Behalf Of Scott Sadowsky
Sent: 16 June 2019 16:21
To: Open source development of the Corpus WorkBench <cwb at sslmit.unibo.it>
Subject: Re: [CWB] Expanding existing corpora

On Sat, Jun 15, 2019 at 6:44 AM Maarten Janssen <maartenpt at gmail.com> wrote:

Thanks very much for answering, Maarten. It's not so much the tagging, compiling, uploading and doing frequency counts that I'm trying to not repeat, since I script all that. It's everything afterwards you have to do in CQPweb -- creating and assigning corpus and frequency list permissions; setting things like inter-linear gloss views, corpus title and corpus-level metadata; generating subcorpora; and so on. And it doesn't help that I haven't yet gotten the XML or metadata templates to work.

In theory this, too, could all be scripted, since you can pass MySQL any command you want from Bash, Perl or whatever. But it would take a rather deep understanding of CQPweb, its database structures, etc., which I don't have. Hence my search for other ways to streamline things. But as Andrew made clear, copying one corpus version on top of another and then updating is definitely not the way to go!

Best wishes,
Scott

Updating a CQP corpus (not necessarily the MySQL tables, those I know little of, but the raw CQP files) is not really possible. There are various attempts out there to do things in parts, but in the end, due to the set-up of the files, there is no secure way of updating them: the files have an index of values, in corpus order, and a list linking corpus positions to the numbers in that index. Theoretically, you could ignore the corpus order and just change the index number at a corpus position, but unless you know beforehand which ones to change, it would not save time, since you still have to go through the entire corpus; and you would have to be completely certain that no extra values have appeared (and that no values which were possible but unused have come into use). And even if you managed that, there are other files that count the number of values and such, and those would have to be recompiled in any case.

So unless you have gigaword corpora, the best way is just to recompile. You might be able to get away with just recompiling the CQP corpus itself (running just cwb-encode), which is relatively fast; in my experience, a 500M corpus takes about half an hour to compile, and that is using my own encoder directly from XML, so the native cwb-encode is likely to be even faster than that.
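
The file layout described above can be illustrated with a toy model. This is only a sketch: it is not the real CWB binary format, and the function name encode_attribute and its output layout are made up for illustration. It builds an index of values in order of first occurrence, a list of index numbers per corpus position, and a count per value, from one token per line.

```shell
#!/bin/sh
# Toy model of one attribute (NOT the real CWB file format):
#   lexicon: index number = value, in order of first occurrence
#   ids:     index number at each corpus position
#   counts:  frequency of each index number
encode_attribute() {
  awk '
    BEGIN { n = 0 }
    # First time a value is seen: give it the next index number.
    !($0 in id) { id[$0] = n; lex[n] = $0; n++ }
    # Every position: record the index number and bump its count.
    { ids = ids (NR > 1 ? " " : "") id[$0]; cnt[id[$0]]++ }
    END {
      printf "lexicon:"; for (i = 0; i < n; i++) printf " %d=%s", i, lex[i]; print ""
      print "ids: " ids
      printf "counts:"; for (i = 0; i < n; i++) printf " %d", cnt[i]; print ""
    }'
}

printf '%s\n' the cat saw the dog | encode_attribute
# lexicon: 0=the 1=cat 2=saw 3=dog
# ids: 0 1 2 0 3
# counts: 2 1 1 1

# Changing only the first token alters all three structures.
printf '%s\n' a cat saw the dog | encode_attribute
# lexicon: 0=a 1=cat 2=saw 3=the 4=dog
# ids: 0 1 2 3 4
# counts: 1 1 1 1 1
```

Even in this toy version a one-token change touches the index, the position list and the counts, which is why a full recompile tends to be the safer route.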