[CWB] Abortion on corpus creation

Hardie, Andrew a.hardie at lancaster.ac.uk
Thu Apr 27 21:33:52 CEST 2023


You might want to check that your CWB installation is up to date (i.e. version 3.5), as the most recent versions of cwb-encode simply skip a line-initial U+FEFF character wherever it appears in the input when in UTF-8 mode.

best

Andrew

From: cwb-bounces at sslmit.unibo.it <cwb-bounces at sslmit.unibo.it> On Behalf Of Graham Ranger -- UAPV
Sent: Thursday, April 27, 2023 5:45 PM
To: cwb at sslmit.unibo.it
Subject: Re: [CWB] Abortion on corpus creation

Thanks again, Stephanie. The problem was indeed the U+FEFF or BOM character lurking somewhere in the file. I thought I'd removed it with the command I usually use:

sed -i '1s/^\xEF\xBB\xBF//' myfile.txt
until I realised that this only works before the files are concatenated, as it targets only the first line. So I tried

sed -i 's/^\xEF\xBB\xBF//g' myfile.txt
and things worked from that point on!
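
For anyone hitting the same problem, here is a minimal sketch of the fix described above. It assumes GNU sed (for the \xHH escapes and the -i option) and a POSIX shell; the function names and file names are made up for illustration and are not part of CWB.

```shell
#!/bin/sh
# Sketch only: assumes GNU sed and a POSIX shell. All names are placeholders.

# Remove a UTF-8 BOM (bytes EF BB BF) from the start of every line.
# No "g" flag is needed, because "^" can match at most once per line.
strip_line_initial_boms() {
    LC_ALL=C sed -i 's/^\xEF\xBB\xBF//' "$1"
}

# Print the number of lines that begin with a BOM.
# grep exits with status 1 when the count is 0, hence "|| true".
count_line_initial_boms() {
    LC_ALL=C grep -c "^$(printf '\357\273\277')" "$1" || true
}

# Demo: two files that each start with a BOM, concatenated into one.
tmpdir=$(mktemp -d)
printf '\357\273\277<text id="a">\nfoo\n</text>\n' > "$tmpdir/a.txt"
printf '\357\273\277<text id="b">\nbar\n</text>\n' > "$tmpdir/b.txt"
cat "$tmpdir/a.txt" "$tmpdir/b.txt" > "$tmpdir/all.txt"

count_line_initial_boms "$tmpdir/all.txt"   # prints 2 (lines 1 and 4)
strip_line_initial_boms "$tmpdir/all.txt"
count_line_initial_boms "$tmpdir/all.txt"   # prints 0

rm -rf "$tmpdir"
```

The 1s/.../ form would have removed only the BOM on line 1 of the concatenated file and left the one on line 4 in place, which matches the behaviour described above. Running the count before encoding is a cheap way to check whether any stray BOMs remain.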
Best,
Graham.
On 26/04/2023 at 21:14, Stephanie Evert wrote:

On 26 Apr 2023, at 16:28, Graham Ranger -- UAPV <graham.ranger at univ-avignon.fr> wrote:



Many thanks for your help. Unfortunately, that didn't work... I've just checked: my XML tags are on different lines (though I would hope that would not make a difference) and the only spaces in the file are in the XML tag between "text" and "id".



If you can access the CWB-indexed corpus (or index it yourself on the command line with cwb-encode and cwb-make), then you could find the location of the problem with a CQP query:



  [ ! text_id ];



Best,

Stephanie

_______________________________________________
CWB mailing list
CWB at sslmit.unibo.it
http://liste.sslmit.unibo.it/mailman/listinfo/cwb
