[CWB] [CQPweb] Expanding existing corpora

Scott Sadowsky ssadowsky at gmail.com
Sat Jun 15 20:22:30 CEST 2019


On Sat, Jun 15, 2019 at 10:52 AM Hardie, Andrew <a.hardie at lancaster.ac.uk>
wrote:

Hi Andrew,

> CQPweb is built on the same assumption as CWB generally: that corpora don't
> change once created.
>

That's what I imagined, and I realize mine is an edge case in the extreme.
I just wanted to be sure.



> Overwriting the indexes with an expanded version, while trying to keep the
> CQPweb extra bits intact, therefore implies undefined behaviour. I.e.,
> you're not meant to do it; having done it, all kinds of things might break;
> the low number of things that have actually broken is pure good luck! In
> particular, I have no idea what kinds of inconsistencies might arise in
> things like users' saved data. As I say, undefined: anything could happen.
>

Thanks very much for the warning! I didn't realize the situation was that
dangerous, and I certainly won't be repeating it!



> But to answer the question: the safe way to add new metadata category
> values is to nuke the whole metadata table and rebuild from scratch. You
> might find an unsafe way by poking around in metadata lib php and scripting a
> series of calls... but that would be a pretty precarious way to do it. And
> in future I am going to shift the underlying SQL datatype of category
> metadata from strings to enums for efficiency... at which point the above
> will be super-precarious.
>

All clear!



> *Is there any way to update a corpus so that it rescans metadata like p-
> and s-attributes and their values?* My goal is to avoid having to
> recreate the corpus from scratch over and over.
> -------
> Fair enough, but note what you're already doing:
>
> Offline rebuild of the CWB data
> Rebuild of all the freq tables
> Reinstallation of metadata table (as per above)
>
> ... which is nearly ALL the work of installing a corpus from scratch
> already.
>

Yes, but I have everything from tagging to local corpus creation to
uploading via SSH scripted, so it takes me just a second or two :-) It's
what comes next, which mostly involves MySQL, that I haven't figured out
how to automate, and so it takes me a fair bit of time.
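
For anyone reading along, the offline indexing step can be sketched roughly as below. This is only an illustration, not my actual scripts: the function name, the paths and the attribute lists are made up, and the only real tools assumed are the standard cwb-encode and cwb-makeall command-line programs. The function just builds the argument lists, so nothing is run until you pass them to something like subprocess.run.

```python
def build_index_commands(handle, vrt_file, data_dir, registry_dir, p_attrs, s_attrs):
    """Build (but do not run) the CWB commands that index one corpus offline.

    All argument values are placeholders chosen by the caller; this helper
    is hypothetical and not part of CWB or CQPweb.
    """
    # CWB registry files are named after the lowercase corpus handle.
    registry_file = f"{registry_dir.rstrip('/')}/{handle.lower()}"

    encode = [
        "cwb-encode",
        "-d", data_dir,        # directory for the binary index files
        "-f", vrt_file,        # verticalised input text
        "-R", registry_file,   # registry entry to create
        "-c", "utf8",          # character encoding of the input
        "-xsB",                # XML-aware mode, skip blank lines, strip whitespace
    ]
    for attr in p_attrs:       # extra positional attributes, e.g. pos, lemma
        encode += ["-P", attr]
    for attr in s_attrs:       # structural attributes, e.g. s, text:0+id
        encode += ["-S", attr]

    # cwb-makeall expects the corpus name in uppercase; -V validates the result.
    makeall = ["cwb-makeall", "-r", registry_dir, "-V", handle.upper()]
    return encode, makeall
```

Each returned list can then be handed to subprocess.run(cmd, check=True), so that a failed encode stops the pipeline before anything is uploaded.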



> My recommendation: set up highly specific templates, index the data
> offline, and use "new corpus from existing cwb" (can't remember the exact
> wording, sorry... I'm writing at a wedding with no laptop!). That saves you
> specifying p/s atts, as they are taken from the registry. The rest is what
> you are doing anyway. This also has the benefit of making it possible to
> preserve the earlier versions -- your users might appreciate this if ever
> they need to do an analysis on the exact same data as an analysis they did
> last year, for instance... on my server I do this by means of dates or
> version nos. on the end of corpus handles.
>

That seems like an excellent strategy. I'll be doing just that from now on.
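
A date suffix on the corpus handle, as suggested above, is easy to generate in a script. The sketch below is only an illustration: the function name is made up, and the rule that a handle may contain only lowercase letters, digits and underscores is my assumption, so check it against your own CQPweb installation.

```python
import re
from datetime import date


def versioned_handle(base, version_date):
    """Return a corpus handle such as 'mycorpus_20190615'.

    Hypothetical helper; the allowed-character rule below is an assumption.
    """
    handle = f"{base.lower()}_{version_date:%Y%m%d}"
    # Assumed restriction: starts with a letter, then letters, digits, underscores.
    if not re.fullmatch(r"[a-z][a-z0-9_]*", handle):
        raise ValueError(f"unsuitable corpus handle: {handle!r}")
    return handle
```

Calling versioned_handle("mycorpus", date.today()) on each rebuild gives every version its own handle, so earlier versions stay installed alongside the new one.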


As always, Andrew, you've gone above and beyond the call of duty. Thanks
yet again!

Best,
Scott