[CWB] International Corpus of English

Stefan Evert stefanML at collocations.de
Thu Dec 31 15:04:17 CET 2020


Dear Florian,

some of the ICE components have badly ill-formed XML markup indeed, and there are also various inconsistencies in the annotation and metadata.

I'm sure several people have already put ICE components into CQPweb or similar concordancing software.  I know Stella Neumann (CC:ed because she's not on this mailing list) has some ICE components indexed with CWB, but that involved quite a lot of scripting and manual correction.  Perhaps she can give you some pointers – in any case, you will need different solutions for different ICE components because they're not marked up to the same standard.

ICLE has no relation to the International Corpus of English.

Best,
Stefan


> On 30 Dec 2020, at 12:50, Frenken, Florian <florian.frenken at ifaar.rwth-aachen.de> wrote:
> 
> I realise this question may not be a perfect fit for this mailing list, but I'm not sure who or where else to ask, so here goes: Have any of you ever worked with components from the International Corpus of English? The xml-like annotations in the original files seem to be broken in many ways (e.g., inconsistent, unclosed and open tags, invalid overlaps, reserved characters in content), so preparing them for CQP turned out to be quite challenging (at least for me). It's not really that I got caught on a specific problem; I'm rather curious whether you have some general advice for correcting such ill-formed texts, perhaps from experience. 


