[CWB] How to add new data to a corpus without re-indexing it
Hardie, Andrew
a.hardie at lancaster.ac.uk
Thu Jul 8 12:04:10 CEST 2021
For monitor corpora the best approach is to create a new installed corpus at each update point.
Adding more text to an existing corpus is one of those things that sounds good until you think about how it would fit with all the rest of the system. For instance, if the system administrator were to append more text to a corpus, all the data saved by users (saved queries, categorised queries, subcorpora) would suddenly no longer match the corpus it relates to. Not to mention that when the same query on the same corpus produces different results on Tuesday than it did on Monday, replicability of analyses becomes a serious headache.
So it will never be possible to append additional text to an existing indexed corpus.
What would be possible, and I’ll add it to my list for the long term, would be a function to say “create a new corpus by taking the full content of this existing corpus and adding these new files to it”, i.e. without modifying the original corpus data in any way.
But in fact, that’s already possible via the command line: use cwb-decode to save the existing corpus to a text file, add in the extra files, and run cwb-encode on the whole lot. There’s just no web interface for that use case at present.
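A minimal sketch of that command-line route, written in Python so the commands can be inspected as a dry run before anything is executed. The corpus names, paths and attribute lists are hypothetical placeholders; the p- and s-attribute declarations have to match what is declared in the old corpus's registry entry, and the decoded output may need adjusting before it re-encodes cleanly. The old corpus is only read, never modified.

```python
import shlex
import subprocess
from pathlib import Path


def plan_rebuild(old_corpus, new_corpus, new_files, workdir, registry,
                 p_attrs, s_attrs):
    """Plan decode -> merge -> encode -> index for a NEW corpus.

    p_attrs / s_attrs are assumptions supplied by the caller; they must
    mirror the attributes of the old corpus (besides the default 'word').
    """
    workdir = Path(workdir)
    dump = workdir / f"{old_corpus.lower()}.vrt"      # decoded old corpus
    merged = workdir / f"{new_corpus.lower()}.vrt"    # old text + new files
    data_dir = workdir / "data" / new_corpus.lower()  # index dir for new corpus
    reg_file = Path(registry) / new_corpus.lower()    # registry entry to create

    # Decode everything from the old corpus in compact, XML-compatible form.
    decode = ["cwb-decode", "-r", str(registry), "-Cx",
              old_corpus.upper(), "-ALL"]
    # Encode the merged text as a separate corpus.
    encode = ["cwb-encode", "-d", str(data_dir), "-f", str(merged),
              "-R", str(reg_file), "-xsB"]
    for att in p_attrs:
        encode += ["-P", att]
    for att in s_attrs:
        encode += ["-S", att]
    # Build the indexes for the new corpus.
    makeall = ["cwb-makeall", "-r", str(registry), "-V", new_corpus.upper()]

    return {"dump": dump, "merged": merged, "data_dir": data_dir,
            "inputs": [dump, *map(Path, new_files)],
            "decode": decode, "encode": encode, "makeall": makeall}


def run_plan(plan):
    """Execute a plan; requires the CWB command-line tools on the PATH."""
    plan["data_dir"].mkdir(parents=True, exist_ok=True)
    with open(plan["dump"], "wb") as out:
        subprocess.run(plan["decode"], stdout=out, check=True)
    # Concatenate the decoded corpus and the extra files into one input file.
    with open(plan["merged"], "wb") as out:
        for path in plan["inputs"]:
            data = path.read_bytes()
            if data and not data.endswith(b"\n"):
                data += b"\n"
            out.write(data)
    subprocess.run(plan["encode"], check=True)
    subprocess.run(plan["makeall"], check=True)


if __name__ == "__main__":
    # Dry run with made-up names: print the commands, execute nothing.
    plan = plan_rebuild("MONITOR_2021_06", "MONITOR_2021_07",
                        ["july_week1.vrt"], "/tmp/cwb-rebuild",
                        "/corpora/registry", ["pos", "lemma"], ["s"])
    for step in ("decode", "encode", "makeall"):
        print(shlex.join(plan[step]))
```

Calling run_plan(plan) would then carry out the steps. Giving the result a new corpus name at each update point fits the monitor-corpus approach described above, since saved queries and subcorpora on the earlier corpus stay valid.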
best
Andrew.
From: cwb-bounces at sslmit.unibo.it <cwb-bounces at sslmit.unibo.it> On Behalf Of wu liangping
Sent: 08 July 2021 10:15
To: Open source development of the Corpus WorkBench <cwb at sslmit.unibo.it>
Subject: Re: [CWB] How to add new data to a corpus without re-indexing it
Hi Andrew,
Thanks for the clarification; it all makes sense now.
As monitor/dynamic corpora are becoming more visible, it would be great to find a way to periodically update the data behind CQPweb.
Best,
WU Liangping
At 2021-07-08 16:51:20, "Hardie, Andrew" <a.hardie at lancaster.ac.uk> wrote:
Hi Liangping,
It’s not possible to append text to an existing corpus. The “add data” function allows you to add new attributes (annotation/XML) or new metadata to the existing corpus. It doesn’t allow you to extend the corpus.
best
Andrew.
From: cwb-bounces at sslmit.unibo.it <cwb-bounces at sslmit.unibo.it> On Behalf Of wu liangping
Sent: 08 July 2021 09:16
To: cwb at sslmit.unibo.it
Subject: [CWB] How to add new data to a corpus without re-indexing it
Dear all,
Has anyone managed to add new data to a corpus without re-indexing it?
The "Latest news" of a recent 3.2 branch CQPweb installation says that, since version 3.2.31, CQPweb has "[c]ompleted the feature that adds new data to a corpus without re-indexing it (this can now be done for p-attributes as well as s-attributes and corpus metadata)". However, a previous discussion back in 2012, in the thread titled "Appending text to an existing corpus", clearly says that we "need to re-index from scratch" if we want to append text to an existing corpus.
Has anyone tried the new feature with success? Or better still, is there any documentation for this new feature?
Thanks for any hints before we decide to dive into the actual code.
Best,
WU Liangping