[CWB] Expanding existing corpora

Sun Jun 16 17:21:25 CEST 2019

On Sat, Jun 15, 2019 at 6:44 AM Maarten Janssen <maartenpt at gmail.com> wrote:

Thanks very much for answering, Maarten. It's not so much the tagging,
compiling, uploading and doing frequency counts that I'm trying to avoid
repeating, since I script all that. It's everything afterwards you have to do
in CQPweb -- creating and assigning corpus and frequency list permissions;
setting things like inter-linear gloss views, corpus title and corpus-level
metadata; generating subcorpora; and so on. And it doesn't help that I
haven't yet gotten the XML or metadata templates to work.

In theory this, too, could all be scripted, since you can pass MySQL any
command you want from Bash, Perl or whatever. But it would take a rather
deep understanding of CQPweb, its database structures, etc., which I don't
have. Hence my search for other ways to streamline things. But as Andrew
made clear, copying one corpus version on top of another and then updating
is definitely not the way to go!

Best wishes,
Scott

> Updating a CQP corpus (not necessarily the MySQL tables, which I know
> little about, but the raw CQP files) is not really possible - there are various
> attempts out there to do things in parts, but in the end, due to the set-up
> of the files, there is no secure way of updating files - files have an
> index of values, in corpus order, and a list linking corpus positions to
> the numbers in that index. Theoretically, you could not care about the
> corpus order and just change a corpus position index number, but unless you
> know beforehand which to change, it would not save time since you still
> have to go through the entire corpus; and you would have to be completely
> certain no extra values have appeared (or values that were possible
> actually became used). And even if you would manage, there are other files
> that count the number of values and such, and those would have to be
> recompiled in any case. So unless you have gigaword corpora, the best way
> is just to recompile; you might be able to get away with just recompiling
> the CQP corpus itself (running just cwb-encode), which is relatively fast;
> in my experience, a 500M corpus takes about half an hour to compile, and
> that is using my own encoder directly from XML, so the native cwb-encode is
> likely to be even faster than that.
>
>
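
To make the layout described in the quoted message concrete, here is a toy
Python sketch of a single attribute: an index of distinct values in corpus
order, a list mapping each corpus position to a number in that index, and
per-value counts. This is only an illustration under simplifying assumptions,
not the actual CWB binary file format; the function name `encode` and the
sample sentences are hypothetical.

```python
from collections import Counter


def encode(tokens):
    """Toy model of one attribute (not the real CWB on-disk format)."""
    lexicon = []    # distinct values, in order of first appearance
    ids = {}        # value -> number in the lexicon
    positions = []  # corpus position -> lexicon number
    for tok in tokens:
        if tok not in ids:
            ids[tok] = len(lexicon)
            lexicon.append(tok)
        positions.append(ids[tok])
    counts = Counter(positions)
    freqs = [counts[i] for i in range(len(lexicon))]  # count per lexicon entry
    return lexicon, positions, freqs


old = "the cat sat on the mat".split()
new = "the big cat sat on the mat".split()  # one token added near the start

lex_old, pos_old, freq_old = encode(old)
lex_new, pos_new, freq_new = encode(new)

print(lex_old, pos_old, freq_old)
print(lex_new, pos_new, freq_new)
```

In this toy model, adding a single token early in the text renumbers every
later lexicon entry, shifts every later corpus position and changes the
counts, so all three structures end up being rebuilt anyway.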