[CWB] export corpus
Hardie, Andrew
a.hardie at lancaster.ac.uk
Sun Apr 23 20:31:15 CEST 2023
Re >> The corpus for which we need to export the tokens plus the POS and lemma tags more urgently, though, is not in an XML format.
Then you can just use cwb-decode without specifying any s-attributes. And the issue Stephanie mentions is not relevant to you.
I.e.:
cwb-decode -r /path/to/your/registry -C CORPUS_HANDLE -P word -P pos -P lemma
... replacing the placeholders in the command above as necessary (and also considering whether the output format you need is -C or -Cx or something else, for info on which see "man cwb-decode").
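For a large corpus you will probably want the output in a file rather than on the terminal. The following is only a minimal sketch of that, not an official CWB script: the function name export_corpus, the CWB_DECODE override variable and the output file name are all my own illustrative assumptions, and the attribute names word, pos and lemma are taken from the command above.

```shell
# Hypothetical helper (not part of CWB): runs cwb-decode with the same
# options as the command above and redirects its output to a file.
# CWB_DECODE is an assumed override variable for the name of the binary.
export_corpus() {
    registry=$1   # path to the registry directory
    corpus=$2     # corpus handle as it appears in the registry
    outfile=$3    # file to write the decoded text to

    # cwb-decode prints to standard output, so a plain redirect sends
    # the export straight to the file instead of the terminal.
    "${CWB_DECODE:-cwb-decode}" -r "$registry" -C "$corpus" \
        -P word -P pos -P lemma > "$outfile"
}

# Example call, with the same placeholders as above:
# export_corpus /path/to/your/registry CORPUS_HANDLE corpus_export.vrt
```

If disk space is a concern, the redirect could be swapped for a pipe into gzip; that is a design choice rather than anything CWB requires.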
Re >> I have tried to find .vrt files but I find none for that corpus. If those files only exist for corpora with XML tags, then this is perhaps not surprising. Is there any other way to export the text with the tags that doesn't involve extracting that information from a .vrt file?
I think there is a bit of a misunderstanding at the root of your questions here. The .vrt files **are not retained** in a CWB corpus index. Their content is chopped up, binary-encoded, and compressed. The index thus contains all the original information, but not in any human-readable text format. The only way to get the unencoded corpus text OUT of the index is with cwb-decode. The original vertical-format text simply **does not exist** anywhere in the CWB data folder. (The original .vrt files that you encoded will still exist, unless you have deleted them of course, but CWB has no knowledge of them once encoding is complete.)
By the way, the CQPweb export-corpus function is, in fact, no more than a pretty minimal web wrapper around cwb-decode.
I hope that clears things up a bit.
best
Andrew.
-----Original Message-----
From: cwb-bounces at sslmit.unibo.it <cwb-bounces at sslmit.unibo.it> On Behalf Of Josep M. Fontana
Sent: Sunday, April 23, 2023 6:24 PM
To: cwb at sslmit.unibo.it
Subject: Re: [CWB] export corpus
Thanks Stephanie. I realize that I might have introduced some confusion when I mixed different things in my first message.
The specific corpus that I mentioned when I talked about the export problems we were having was our installation of the XML BNC corpus. The corpus for which we need to export the tokens plus the POS and lemma tags more urgently, though, is not in an XML format. It is actually the corpus Cristina developed in her thesis; the first corpus we ever installed using CWB. This was before we started using the CQPWeb interface.
So, we didn't use the XML format for tags in that corpus. I have tried to find .vrt files but I find none for that corpus. If those files only exist for corpora with XML tags, then this is perhaps not surprising.
Is there any other way to export the text with the tags that doesn't involve extracting that information from a .vrt file?
JM
> If you do it on the command-line rather than via CQPweb, make sure you have CWB v3.5 and read Sec. 8 of the Corpus Encoding Manual carefully to see how you can reconstruct nested XML tags and attribute-value pairs in the start tags (if they have been split up by cwb-encode).
>
> Best,
> Stephanie
>
>> On 23 Apr 2023, at 01:26, Josep M. Fontana <josepm.fontana at upf.edu> wrote:
>>
>> Thanks. We'll try that.
>>
>> JM
>>
>> On 22/4/23 23:48, Hardie, Andrew wrote:
>>> With cwb-decode.
>>>
>>> best
>>>
>>> Andrew
>>>
>>> From: cwb-bounces at sslmit.unibo.it <cwb-bounces at sslmit.unibo.it> On
>>> Behalf Of Andrés Chandía
>>> Sent: Thursday, April 20, 2023 6:23 PM
>>> To: Open source development of the Corpus WorkBench
>>> <cwb at sslmit.unibo.it>
>>> Subject: [CWB] export corpus
>>>
>>> How do I export big corpus not compromising the machine resources?
>>> No data available in manuals...
_______________________________________________
CWB mailing list
CWB at sslmit.unibo.it
http://liste.sslmit.unibo.it/mailman/listinfo/cwb