[CWB] International Corpus of English
Stella Neumann
stella.neumann at ifaar.rwth-aachen.de
Sun Jan 3 11:39:57 CET 2021
Hi Stefan,
I /am/ on the mailing list and actually suggested to Florian that he check
whether someone other than myself had previously tried to prepare ICE
components for CQP. In fact, I am not so sure that very many people have. The
original, inconsistent version works OK in concordance tools such as
AntConc. Another solution I have heard of is simply to remove those parts
from the corpus that don't work properly (rather than fixing the mark-up).
I think these two options are the ones people use.
Since my approach to fixing the problems is not only very time-consuming
but also only works for three components (I ended up fixing individual
issues locally), Florian and I were wondering whether someone else had
approached the task in a more principled, replicable way while still
maintaining as much mark-up as possible.
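A replicable approach could begin with an automated diagnostic pass over
each file instead of local fixes. The following is only a minimal sketch in
Python, not the procedure used for the three components mentioned above. It
reports stray, unclosed and overlapping tags in XML-like text and escapes
bare ampersands. The function names and the sample line are invented for
illustration, and the tag pattern assumes simple tags without attributes;
real ICE files would need component-specific handling.

```python
import re

# Assumption: tags look like <name> or </name>, with no attributes and no
# self-closing forms. Anything else is treated as plain text.
TAG = re.compile(r"<(/?)([^<>/\s]+)>")


def check_tags(text):
    """Return a sorted list of (position, message) for tag nesting problems."""
    problems = []
    stack = []  # currently open tags as (name, position)
    for m in TAG.finditer(text):
        closing, name = m.group(1) == "/", m.group(2)
        if not closing:
            stack.append((name, m.start()))
            continue
        names = [n for n, _ in stack]
        if name not in names:
            problems.append((m.start(), f"stray closing tag </{name}>"))
        elif names[-1] == name:
            stack.pop()  # properly nested
        else:
            # The closing tag matches an outer element, so every element
            # opened after it is either unclosed or overlaps with it.
            idx = len(names) - 1 - names[::-1].index(name)
            for inner, pos in stack[idx + 1:]:
                problems.append(
                    (pos, f"<{inner}> overlaps or is unclosed before </{name}>")
                )
            del stack[idx:]
    for name, pos in stack:
        problems.append((pos, f"unclosed tag <{name}>"))
    return sorted(problems)


def escape_ampersands(text):
    """Escape '&' unless it already starts an entity such as &amp; or &#38;."""
    pattern = r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)"
    return re.sub(pattern, "&amp;", text)


if __name__ == "__main__":
    # Invented sample line, not taken from any ICE component.
    sample = "<text><p>Tom & Jerry <b>bold <i>both</b> italic</i></p>"
    for position, message in check_tags(sample):
        print(position, message)
    print(escape_ampersands(sample))
```

Running such a report over all files of a component first would at least
show which kinds of problems occur and how often, before deciding which of
them can be repaired by rule and which need manual correction.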
Happy New Year!
Stella
On 31.12.2020 at 15:31, cwb-request at sslmit.unibo.it wrote:
>
> Today's Topics:
>
> 1. Re: International Corpus of English (Stefan Evert)
> 2. Big Sur? (Simon Meier-Vieracker)
> 3. Re: Big Sur? (Stefan Evert)
> 4. Re: Big Sur? (Simon Meier-Vieracker)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Thu, 31 Dec 2020 15:04:17 +0100
> From: Stefan Evert <stefanML at collocations.de>
> To: CWBdev Mailing List <cwb at sslmit.unibo.it>
> Cc: Stella Neumann <stella.neumann at ifaar.rwth-aachen.de>
> Subject: Re: [CWB] International Corpus of English
> Message-ID: <518B039E-F3B9-4C43-BDFF-CC80C9D9738D at collocations.de>
> Content-Type: text/plain; charset=utf-8
>
> Dear Florian,
>
> some of the ICE components have badly ill-formed XML markup indeed, and there are also various inconsistencies in the annotation and metadata.
>
> I'm sure several people have already put ICE components in CQPweb or a similar concordancing software. I know Stella Neumann (CC:ed because she's not on this mailing list) has some ICE components indexed with CWB, but that involved quite a lot of scripting and manual correction. Perhaps she can give you some pointers - in any case, you will need different solutions for different ICE components because they're not marked up to the same standard.
>
> ICLE has no relation to the International Corpus of English.
>
> Best,
> Stefan
>
>
>> On 30 Dec 2020, at 12:50, Frenken, Florian <florian.frenken at ifaar.rwth-aachen.de> wrote:
>>
>> I realise this question may not be a perfect fit for this mailing list, but I'm not sure who or where else to ask, so here goes: Have any of you ever worked with components from the International Corpus of English? The XML-like annotations in the original files seem to be broken in many ways (e.g., inconsistent, unclosed and open tags, invalid overlaps, reserved characters in content), so preparing them for CQP turned out to be quite challenging (at least for me). It's not really that I got caught on a specific problem; I'm rather curious whether you have some general advice for correcting such ill-formed texts, perhaps from experience.
>
>
> ------------------------------
>
> Message: 2
> Date: Thu, 31 Dec 2020 14:10:00 +0000
> From: Simon Meier-Vieracker <simon.meier-vieracker at tu-dresden.de>
> To: Open source development of the Corpus WorkBench
> <cwb at sslmit.unibo.it>
> Subject: [CWB] Big Sur?
> Message-ID: <C0E980E5-9AAE-452B-87EE-0839EAD5B980 at tu-dresden.de>
> Content-Type: text/plain; charset="utf-8"
>
> Hi,
>
> does anyone have experience with CWB, TreeTagger and other software on the newest Mac OS system "Big Sur"?
> Updating to Catalina already caused a lot of problems and although I could resolve them in the end, I'd like to avoid spending hours and hours working around the security preferences...
>
> Best, Simon
>
> ------------------------------
>
> Message: 3
> Date: Thu, 31 Dec 2020 15:23:33 +0100
> From: Stefan Evert <stefanML at collocations.de>
> To: CWBdev Mailing List <cwb at sslmit.unibo.it>
> Subject: Re: [CWB] Big Sur?
> Message-ID: <225F298A-7ACA-4AE1-8B8D-C63259E5D4D5 at collocations.de>
> Content-Type: text/plain; charset=utf-8
>
>> does anyone have experience with CWB, TreeTagger and other software on the newest Mac OS system "Big Sur"?
> Unfortunately not, but I intend to get an M1 Mac soon in order to test CWB and related software thoroughly.
>
> I can't update my main laptop (or rather: don't dare to) because I'm relying on various pieces of software that might not be compatible.
>
> I don't think there should be a major problem with respect to CWB. As far as I know, Homebrew works fine on Big Sur to provide the required libraries.
>
>> Updating to Catalina already caused a lot of problems and although I could resolve them in the end, I'd like to avoid spending hours and hours working around the security preferences...
> I don't think I had any major problems at all with Catalina (at least not with command-line software). What exactly did you have to work around?
>
> Best,
> Stefan
>
> ------------------------------
>
> Message: 4
> Date: Thu, 31 Dec 2020 14:31:26 +0000
> From: Simon Meier-Vieracker <simon.meier-vieracker at tu-dresden.de>
> To: Open source development of the Corpus WorkBench
> <cwb at sslmit.unibo.it>
> Subject: Re: [CWB] Big Sur?
> Message-ID: <A48C6B02-1A52-4A56-A0A3-E2B9ED035065 at tu-dresden.de>
> Content-Type: text/plain; charset="utf-8"
>
>
>> I don't think I had any major problems at all with Catalina (at least not with command-line software). What exactly did you have to work around?
>
> Catalina started to block command-line software as well. There was a thread on this topic on this mailing list in June 2020 (subject "CWB on OS Catalina?"). It may be that this was due to my not compiling CWB myself; that's at least what I can reconstruct from reading the thread.
>
> Best, Simon
>
> ------------------------------
>
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://liste.sslmit.unibo.it/mailman/listinfo/cwb
>
>
> End of CWB Digest, Vol 166, Issue 29
> ************************************
--
Prof. Dr. Stella Neumann
Anglistische Sprachwissenschaft
RWTH Aachen University
Institut für Anglistik, Amerikanistik und Romanistik
Kármánstr. 17/19
D-52062 Aachen
Tel. +49 (0)241 80-96105