[CWB] International Corpus of English
Frenken, Florian
florian.frenken at ifaar.rwth-aachen.de
Wed Dec 30 12:50:36 CET 2020
Dear All,
I realise this question may not be a perfect fit for this mailing list, but I'm not sure who or where else to ask, so here goes: Have any of you ever worked with components from the International Corpus of English <http://ice-corpora.net/ice/index.html>?

The XML-like annotations in the original files seem to be broken in many ways (e.g., inconsistent and unclosed tags, invalid overlaps, reserved characters in content), so preparing them for CQP has turned out to be quite challenging (at least for me). I'm not stuck on one specific problem; rather, I'm curious whether you have some general advice for correcting such ill-formed texts, perhaps from experience. I feel like regular expressions can only go so far (though I may very well just not be sufficiently knowledgeable).

There is an International Corpus of Learner English on the Lancaster CQPweb page. Is that similar by any chance?
Best,
Florian