[CWB] [CQPweb] Expanding existing corpora

Hardie, Andrew a.hardie at lancaster.ac.uk
Sat Jun 15 16:52:12 CEST 2019


Hi Scott,

CQPweb is built on the same assumption as CWB generally: that corpora don't change once created.

Overwriting the indexes with an expanded version, while trying to keep the CQPweb extra bits intact, therefore implies undefined behaviour.

I.e., you're not meant to do it; having done it, all kinds of things might break; the low number of things that have actually broken is pure good luck!

In particular, I have no idea what kinds of inconsistencies might arise in things like users' saved data. As I say, undefined: anything could happen.

But to answer the question: the safe way to add new metadata category values is to nuke the whole metadata table and rebuild it from scratch. You might find an unsafe way by poking around in metadata lib php and scripting a series of calls... but that would be a pretty precarious way to do it.

And in future I am going to shift the underlying SQL datatype of category metadata from strings to enums for efficiency... at which point the above will be super-precarious.

-------
Is there any way to update a corpus so that it rescans metadata like p- and s-attributes and their values? My goal is to avoid having to recreate the corpus from scratch over and over.
-------
Fair enough, but note what you're already doing:

Offline rebuild of the CWB data
Rebuild of all the freq tables
Reinstallation of metadata table (as per above)

... which is nearly ALL the work of installing a corpus from scratch already.

My recommendation: set up highly specific templates, index the data offline, and use "new corpus from existing cwb" (can't remember the exact wording, sorry; I'm writing at a wedding with no laptop!). That saves you specifying the p- and s-attributes, as they are taken from the registry. The rest is what you are doing anyway.
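
As a rough illustration of the "index the data offline" step, here is a minimal Python sketch that only builds the command lines for the standard CWB tools cwb-encode and cwb-makeall. The template contents, the attribute names (pos, lemma, text:0+id, s), the file paths and the handle speech_v2 are all assumptions made up for the example, not anything taken from Scott's actual setup.

```python
import shlex

# Hypothetical template: the attributes shared by every installment of the corpus.
TEMPLATE = {
    "p_attributes": ["pos", "lemma"],    # "word" is always encoded implicitly
    "s_attributes": ["text:0+id", "s"],  # assumed XML structure of the input
}


def encode_commands(handle, vrt_file, data_dir, registry_dir, template=TEMPLATE):
    """Return the cwb-encode and cwb-makeall invocations as argument lists."""
    encode = [
        "cwb-encode", "-d", data_dir, "-f", vrt_file,
        "-R", f"{registry_dir}/{handle}",  # registry file name is the lowercase handle
        "-c", "utf8", "-xsB",
    ]
    for att in template["p_attributes"]:
        encode += ["-P", att]
    for att in template["s_attributes"]:
        encode += ["-S", att]
    # The CWB corpus ID is the upper-case form of the registry file name.
    makeall = ["cwb-makeall", "-r", registry_dir, "-V", handle.upper()]
    return [encode, makeall]


if __name__ == "__main__":
    # Print the commands for inspection rather than running them.
    for cmd in encode_commands("speech_v2", "speech_v2.vrt",
                               "/corpora/data/speech_v2", "/corpora/registry"):
        print(shlex.join(cmd))
```

Keeping the attribute list in one place means every installment is encoded with an identical registry layout, which is what lets CQPweb read the p- and s-attributes straight from the registry.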

This also has the benefit of making it possible to preserve the earlier versions -- your users might appreciate this if ever they need to do an analysis on exactly the same data as an analysis they did last year, for instance. On my server I do this by means of dates or version nos. on the end of corpus handles.
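
A minimal sketch of the date-suffix idea in Python. The function name versioned_handle and the YYYYMMDD suffix format are hypothetical, and the restriction to lowercase letters, digits and underscores is an assumption about what makes a safe handle, so check it against your own installation.

```python
import re
from datetime import date


def versioned_handle(base, when):
    """Append a YYYYMMDD date suffix to a corpus handle."""
    handle = f"{base}_{when:%Y%m%d}".lower()
    # Assumption: handles may contain only lowercase letters, digits and underscores.
    if not re.fullmatch(r"[a-z0-9_]+", handle):
        raise ValueError(f"unsuitable corpus handle: {handle!r}")
    return handle


print(versioned_handle("speech", date(2019, 6, 15)))  # speech_20190615
```

Each installment then gets its own handle, so older versions stay installed and queryable alongside the newest one.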

best

Andrew.


From: Scott Sadowsky
Sent: Saturday 15 June, 10:20 am
Subject: [CWB] [CQPweb] Expanding existing corpora
To: Open source development of the Corpus WorkBench


I have a situation which is probably not the norm for most users here. I have a corpus which I will be putting online gradually, in 20 or 30 installments over the next two years or so, as texts can be reviewed a second time for personally identifying or sensitive information, and such things can be redacted (it's a speech corpus).

When a new batch of texts is ready I process, tag and compile all the files that are fit for public consumption into a CQP corpus, upload the new set of CQP files to the server (replacing the old ones), and then re-run the frequency and STTR calculation scripts on the server. This updates the frequencies shown everywhere I've looked (test query results, corpus metadata, etc.) -- so far, so good.

The one thing I haven't been able to get to update, however, is the set of values for the text metadata and word-level annotation variables (as seen in the selection boxes of restricted queries and subcorpus creation).

Thus, if the first version of the corpus only had four of six socioeconomic statuses (say 1, 2, 3, 6) and a new version includes one or more speakers of SES 4, this new SES doesn't show up anywhere.
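
To make the SES example concrete, here is a small Python sketch that compares two versions of a tab-separated metadata table and reports category values present only in the newer one. The header row, the column names text_id and ses, and the helper names are assumptions for illustration; a real metadata file may well have no header line.

```python
import csv
import io


def category_values(tsv_text, columns):
    """Collect the set of values seen in each named column of a tab-separated table."""
    seen = {col: set() for col in columns}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        for col in columns:
            seen[col].add(row[col])
    return seen


def new_values(old_tsv, new_tsv, columns):
    """Return, per column, the values that appear only in the newer table."""
    old = category_values(old_tsv, columns)
    new = category_values(new_tsv, columns)
    return {col: sorted(new[col] - old[col]) for col in columns}


# Toy data: the first version has SES 1, 2, 3 and 6; the second adds a speaker of SES 4.
old = "text_id\tses\nt1\t1\nt2\t2\nt3\t3\nt4\t6\n"
new = old + "t5\t4\n"
print(new_values(old, new, ["ses"]))  # {'ses': ['4']}
```

A check like this only shows which values are missing from the selection boxes; it does not update anything on the server.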

Is there any way to update a corpus so that it rescans metadata like p- and s-attributes and their values? My goal is to avoid having to recreate the corpus from scratch over and over.

Thanks in advance,
Scott

NOTE: Unless I've misunderstood something, I'm not adding new p- or s-attributes, but rather new values for existing p-attributes.

