[CWB] Expanding existing corpora
Scott Sadowsky
ssadowsky at gmail.com
Mon Jun 17 16:03:41 CEST 2019
This is fantastic, Andrew!
On Sun, Jun 16, 2019 at 6:52 PM Hardie, Andrew <a.hardie at lancaster.ac.uk>
wrote:
> PS, Scott, if you can get your templates working, you might find check-in
> 1250 of interest.
>
>
>
> cd bin
>
> php install-corpus.php --help
>
>
>
> *From:* Hardie, Andrew
> *Sent:* 16 June 2019 22:05
> *To:* Open source development of the Corpus WorkBench <cwb at sslmit.unibo.it>
> *Subject:* RE: [CWB] Expanding existing corpora
>
>
>
> cd bin
>
>
>
> >> creating and assigning corpus and frequency list permissions
>
>
>
> ./cqpweb add_corpus_to_privilege_scope PRIVILEGE-INTEGER-ID CORPUS-HANDLE
>
>
>
> ./cqpweb remove_corpus_from_privilege_scope PRIVILEGE-INTEGER-ID
> CORPUS-HANDLE
>
>
>
> ./cqpweb create_corpus_default_privileges CORPUS-HANDLE
>
>
>
> ./cqpweb add_new_privilege 1 "" "Permission to use at restricted level
> (initially has scope over no corpora, they can be added later)"
>
>
>
> ./cqpweb add_new_privilege 2 "" "Permission to use at normal level "
>
>
>
> ./cqpweb add_new_privilege 3 "" "Permission to use at full level "
>
>
>
> ./cqpweb add_new_privilege 4 5000000 "Permission to create freq lists up
> to 500 K tokens"
>
>
>
> ./cqpweb grant_privilege_to_user USERNAME PRIVILEGE-INTEGER-ID
>
>
>
> ./cqpweb grant_privilege_to_group GROUP-NAME PRIVILEGE-INTEGER-ID
>
>
>
> ./cqpweb remove_grant_from_user USERNAME PRIVILEGE-INTEGER-ID
>
>
>
> ./cqpweb remove_grant_from_group GROUP-NAME PRIVILEGE-INTEGER-ID
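
The privilege commands above could be chained into a small script that is run once per newly installed corpus. What follows is only a sketch: the ./cqpweb subcommands are the ones quoted above, but the variable names, the placeholder values, the DRY_RUN switch and the order of the steps are my own assumptions. It expects to be run from CQPweb's bin directory.

```shell
#!/bin/sh
# Sketch: script the privilege steps for one freshly installed corpus.
# Only the ./cqpweb subcommands come from the message above; every name
# and value below is a hypothetical placeholder.
set -eu

CORPUS="${CORPUS:-mycorpus}"   # hypothetical corpus handle
PRIV_ID="${PRIV_ID:-2}"        # hypothetical integer ID of an existing privilege
GROUP="${GROUP:-students}"     # hypothetical group name
DRY_RUN="${DRY_RUN:-1}"        # 1 = print the commands instead of running them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

setup_privileges() {
    # Create the default privileges for the corpus.
    run ./cqpweb create_corpus_default_privileges "$CORPUS"
    # Put the corpus within the scope of an existing privilege.
    run ./cqpweb add_corpus_to_privilege_scope "$PRIV_ID" "$CORPUS"
    # Grant that privilege to a whole group of users.
    run ./cqpweb grant_privilege_to_group "$GROUP" "$PRIV_ID"
}

setup_privileges
```

With DRY_RUN left at 1 the script only prints the three commands, which makes it easy to check them before running with DRY_RUN=0.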
>
> >> setting things like inter-linear gloss views, corpus title and
> corpus-level metadata
>
>
>
> ./cqpweb update_corpus_visualisation_gloss CORPUS-HANDLE
> 1-OR-0-FOR-SHOW-IN-CONCORDANCE 1-OR-0-FOR-SHOW-IN-CONTEXT P-ATTRIBUTE-HANDLE
>
>
>
> ./cqpweb update_corpus_visualisation_translate CORPUS-HANDLE
> 1-OR-0-FOR-SHOW-IN-CONCORDANCE 1-OR-0-FOR-SHOW-IN-CONTEXT S-ATTRIBUTE-HANDLE
>
>
>
> ./cqpweb add_variable_corpus_metadata CORPUS-HANDLE ATTRIBUTE-DESCRIPTION
> VALUE-CONTENT
>
>
>
> ./cqpweb update_corpus_title CORPUS-HANDLE "new title goes here"
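
The display and metadata commands above lend themselves to the same treatment. Here is a rough sketch that applies them to one corpus at a time, so it can be called for each new version. The subcommands and their argument order are taken from the message above; the function name, the "gloss" attribute handle, the "Version" metadata field and the DRY_RUN switch are hypothetical.

```shell
#!/bin/sh
# Sketch: set gloss view, title and one metadata item for a corpus.
# Only the ./cqpweb subcommands come from the message above; all names
# and values here are hypothetical placeholders.
set -eu

GLOSS_ATT="${GLOSS_ATT:-gloss}"   # hypothetical p-attribute holding the gloss
DRY_RUN="${DRY_RUN:-1}"           # 1 = print the commands instead of running them

# Print the command in dry-run mode (quoting is not preserved in the
# printout), otherwise execute it with its arguments intact.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: configure_corpus CORPUS-HANDLE "Corpus title" VERSION-STRING
configure_corpus() {
    handle=$1
    title=$2
    version=$3
    # Show the gloss in both the concordance (1) and the context view (1).
    run ./cqpweb update_corpus_visualisation_gloss "$handle" 1 1 "$GLOSS_ATT"
    run ./cqpweb update_corpus_title "$handle" "$title"
    run ./cqpweb add_variable_corpus_metadata "$handle" "Version" "$version"
}

configure_corpus mycorpus "My Corpus" "2019-06"
```

Calling configure_corpus in a loop over several handles would cover the case of many corpora being re-installed at once.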
>
> *From:* cwb-bounces at sslmit.unibo.it <cwb-bounces at sslmit.unibo.it> *On
> Behalf Of *Scott Sadowsky
> *Sent:* 16 June 2019 16:21
> *To:* Open source development of the Corpus WorkBench <cwb at sslmit.unibo.it>
> *Subject:* Re: [CWB] Expanding existing corpora
>
>
>
> On Sat, Jun 15, 2019 at 6:44 AM Maarten Janssen <maartenpt at gmail.com>
> wrote:
>
>
>
> Thanks very much for answering, Maarten. It's not so much the tagging,
> compiling, uploading and doing frequency counts that I'm trying to not
> repeat, since I script all that. It's everything afterwards you have to do
> in CQPweb -- creating and assigning corpus and frequency list permissions;
> setting things like inter-linear gloss views, corpus title and corpus-level
> metadata; generating subcorpora; and so on. And it doesn't help that I
> haven't yet gotten the XML or metadata templates to work.
>
>
>
> In theory this, too, could all be scripted, since you can pass MySQL any
> command you want from Bash, Perl or whatever. But it would take a rather
> deep understanding of CQPweb, its database structures, etc., which I don't
> have. Hence my search for other ways to streamline things. But as Andrew
> made clear, copying one corpus version on top of another and then updating
> is definitely not the way to go!
>
>
>
> Best wishes,
>
> Scott
>
>
>
> Updating a CQP corpus (not necessarily the MySQL tables, which I know
> little about, but the raw CQP files) is not really possible. There are
> various attempts out there to do it in parts, but in the end, given how the
> files are set up, there is no safe way of updating them: the files have an
> index of values, in corpus order, and a list linking corpus positions to
> the numbers in that index. In theory, you could ignore the corpus order and
> just change the index number at a given corpus position, but unless you
> know beforehand which positions to change, that would not save time, since
> you would still have to go through the entire corpus; and you would have to
> be completely certain that no new values have appeared (and that no values
> that were merely possible before have now come into use). And even if you
> managed that, there are other files that count the number of values and
> such, and those would have to be recompiled in any case. So unless you have
> gigaword corpora, the best way is just to recompile. You might be able to
> get away with recompiling only the CQP corpus itself (running just
> cwb-encode), which is relatively fast: in my experience, a 500M corpus
> takes about half an hour to compile, and that is using my own encoder
> directly from XML, so the native cwb-encode is likely to be even faster.
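
For reference, a full recompile of the CQP side might be scripted roughly as below. This is a sketch under several assumptions: the registry and data paths, the input file name and the attribute list (pos, lemma, s, text with an id) are invented for illustration, and cwb-makeall is added because the encoded attributes still need to be indexed after cwb-encode. It does not touch the MySQL tables that CQPweb maintains.

```shell
#!/bin/sh
# Sketch: re-encode and re-index one corpus from a vertical (.vrt) file.
# Paths, file names and the attribute list are hypothetical placeholders.
set -eu

REGISTRY="${REGISTRY:-/corpora/registry}"   # hypothetical registry directory
DATA_ROOT="${DATA_ROOT:-/corpora/data}"     # hypothetical data root
DRY_RUN="${DRY_RUN:-1}"                     # 1 = print the commands only

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: rebuild_corpus lowercase-handle input.vrt
rebuild_corpus() {
    handle=$1
    vrt=$2
    # CWB tools address the corpus by its upper-case name.
    upper=$(printf '%s' "$handle" | tr '[:lower:]' '[:upper:]')
    # Encode: -x XML-aware input, -s skip empty lines, -B strip blanks.
    run cwb-encode -d "$DATA_ROOT/$handle" -f "$vrt" -R "$REGISTRY/$handle" \
        -c utf8 -xsB -P pos -P lemma -S s -S text:0+id
    # Build the lexicon and index files, validating them (-V).
    run cwb-makeall -r "$REGISTRY" -V "$upper"
}

rebuild_corpus mycorpus mycorpus.vrt
```

Run as is, the script only prints the two commands; the attribute declarations would have to be adapted to the corpus actually being encoded before setting DRY_RUN=0.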
>
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://liste.sslmit.unibo.it/mailman/listinfo/cwb
>
--
Dr. Scott Sadowsky
Profesor Asistente de Lingüística
Pontificia Universidad Católica de Chile
ssadowsky gmail com
scsadowsky uc cl
http://sadowsky.cl/