[CWB] [CQPweb] Expanding existing corpora
Scott Sadowsky
ssadowsky at gmail.com
Sat Jun 15 20:22:30 CEST 2019
On Sat, Jun 15, 2019 at 10:52 AM Hardie, Andrew <a.hardie at lancaster.ac.uk>
wrote:
Hi Andrew,
> CQPweb is built on the same assumption as CWB generally: that corpora don't
> change once created.
>
That's what I imagined, and I realize mine is an edge case in the extreme.
I just wanted to be sure.
> Overwriting the indexes with an expanded version, while trying to keep the
> CQPweb extra bits intact, therefore implies undefined behaviour. I.e.,
> you're not meant to do it; having done it, all kinds of things might break;
> the low number of things that have actually broken is pure good luck! In
> particular I have no idea what kinds of inconsistencies might arise in
> things like users' saved data. As I say, undefined: anything could happen.
>
Thanks very much for the warning! I didn't realize the situation was that
dangerous, and I certainly won't be repeating it!
> But to answer the question: the safe way to add new metadata category
> values is to nuke the whole metadata table and rebuild from scratch. You
> might find an unsafe way by poking around in metadata lib php and script a
> series of calls... but that would be a pretty precarious way to do it. And
> in future I am going to shift the underlying SQL datatype of category
> metadata from strings to enums for efficiency... at which point the above
> will be super-precarious.
>
All clear!
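
One way to read "rebuild from scratch" is to regenerate the complete metadata file on every update instead of patching the existing table. The sketch below assumes a tab-delimited file with the text ID in the first column and one line per text; the function name, field names, and values are hypothetical, and the exact format CQPweb expects should be checked against its documentation.

```python
def metadata_lines(texts, fields):
    """Render all texts' metadata as tab-delimited lines, text ID first.

    `texts` maps a text ID to a dict of field values (hypothetical layout);
    `fields` fixes the column order so every rebuild is reproducible.
    """
    lines = []
    for text_id in sorted(texts):
        values = [texts[text_id][field] for field in fields]
        lines.append("\t".join([text_id, *values]))
    return "\n".join(lines) + "\n"


# Made-up example data: adding a new category value just means adding it here
# and regenerating the whole file.
texts = {
    "t2": {"genre": "news", "year": "2019"},
    "t1": {"genre": "blog", "year": "2018"},
}
table = metadata_lines(texts, ["genre", "year"])
```

Because the whole file is rewritten each time, a new category value never has to be merged into an existing table.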
> *Is there any way to update a corpus so that it rescans metadata like p-
> and s-attributes and their values?* My goal is to avoid having to
> recreate the corpus from scratch over and over.
> -------
> Fair enough, but note what you're already doing:
>
> Offline rebuild of the CWB data
> Rebuild of all the freq tables
> Reinstallation of metadata table (as per above)
>
> ... which is nearly ALL the work of installing a corpus from scratch
> already.
>
Yes, but I have everything from tagging to local corpus creation to
uploading via SSH scripted, so it takes me just a second or two :-) It's
what comes next, which mostly involves MySQL, that I haven't figured out
how to automate, and so it takes me a fair bit of time.
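
For readers who have not scripted the offline part, here is a minimal sketch of how the CWB indexing step might be driven from Python, assuming the standard `cwb-encode` and `cwb-make` command-line tools are installed. The helper names, paths, corpus name, and attribute names are hypothetical placeholders; the functions only build argument lists and run nothing themselves.

```python
def encode_command(vrt_file, data_dir, registry_dir, corpus, p_attrs, s_attrs):
    """Build a cwb-encode argument list for one corpus (nothing is executed)."""
    cmd = [
        "cwb-encode",
        "-d", data_dir,                          # directory for the index files
        "-f", vrt_file,                          # verticalised input text
        "-R", f"{registry_dir}/{corpus.lower()}",  # registry file to create
        "-c", "utf8",                            # character encoding of the input
        "-xsB",                                  # XML-aware, skip blank lines, strip whitespace
    ]
    for att in p_attrs:
        cmd += ["-P", att]                       # extra positional attributes
    for att in s_attrs:
        cmd += ["-S", att]                       # structural attributes
    return cmd


def make_command(registry_dir, corpus):
    """Build the cwb-make call that indexes and validates the encoded corpus."""
    return ["cwb-make", "-r", registry_dir, "-V", corpus.upper()]


# Hypothetical example values.
enc = encode_command("mycorp.vrt", "/corpora/data/mycorp_v2", "/corpora/registry",
                     "mycorp_v2", ["pos", "lemma"], ["text:0+id", "s"])
mk = make_command("/corpora/registry", "mycorp_v2")
```

Each list can then be handed to `subprocess.run(cmd, check=True)` so that a failing step stops the script.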
> My recommendation: set up highly specific templates, index the data
> offline, and use "new corpus from existing cwb" (can't remember the exact
> wording, sorry... I'm writing at a wedding with no laptop!). That saves you
> specifying p/s atts, as they are taken from the registry. The rest is what
> you are doing anyway. This also has the benefit of making it possible to
> preserve the earlier versions -- your users might appreciate this if ever
> they need to do an analysis on exactly the same data as an analysis they
> did last year, for instance... on my server I do this by means of dates or
> version nos. on the end of corpus handles.
>
That seems like an excellent strategy. I'll be doing just that from now on.
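
The date-stamped handles mentioned above could be generated along these lines. This is only a sketch: the function name is made up, and the restriction to lowercase ASCII letters, digits, and underscores is an assumption about what makes a safe corpus handle, not a rule taken from the CQPweb manual.

```python
import re
from datetime import date

# Assumption: a safe handle is lowercase letters, digits and underscores only.
_HANDLE_RE = re.compile(r"^[a-z][a-z0-9_]*$")


def versioned_handle(base, when):
    """Append a YYYYMMDD date stamp to a base corpus handle."""
    handle = f"{base.lower()}_{when:%Y%m%d}"
    if not _HANDLE_RE.match(handle):
        raise ValueError(f"unsuitable corpus handle: {handle!r}")
    return handle


print(versioned_handle("mycorpus", date(2019, 6, 15)))  # mycorpus_20190615
```

Installing each rebuild under a fresh handle leaves the earlier versions untouched, so an analysis from last year can still be rerun on the data it originally used.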
As always, Andrew, you've gone above and beyond the call of duty. Thanks
yet again!
Best,
Scott