[CWB] Abortion on corpus creation

Graham Ranger -- UAPV graham.ranger at univ-avignon.fr
Thu Apr 27 18:44:36 CEST 2023


Thanks again, Stephanie. The problem was indeed the U+FEFF or BOM 
character lurking somewhere in the file. I thought I'd removed it with 
the command I usually use:

sed -i '1s/^\xEF\xBB\xBF//' myfile.txt

until I realised that this is only any good before concatenating the 
files (as it only targets the first line). So I tried

sed -i 's/^\xEF\xBB\xBF//g' myfile.txt

and things worked from that point on!
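
For the record, here is a minimal sketch of how one might check for and strip stray BOMs after concatenating files. The `^` anchor in the command above restricts the match to BOMs at the start of a line; the variant below drops the anchor so that a BOM in the middle of a line is removed too. The helper names strip_bom and count_bom_lines and the sample file are hypothetical, made up for illustration. It builds the three BOM bytes with printf rather than relying on sed's \xHH escapes.

```shell
# strip_bom FILE: (hypothetical helper) delete every UTF-8 BOM (bytes EF BB BF)
# in FILE, wherever it occurs on a line, and rewrite FILE in place.
strip_bom() {
    bom=$(printf '\357\273\277')      # the three BOM bytes, written in octal
    sed "s/$bom//g" "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}

# count_bom_lines FILE: (hypothetical helper) print how many lines contain a BOM.
count_bom_lines() {
    bom=$(printf '\357\273\277')
    # grep -c exits non-zero when the count is 0, hence the "|| true".
    grep -c "$bom" "$1" || true
}

# Demo on a throwaway file that imitates two concatenated texts,
# each of which started with its own BOM.
sample=$(mktemp)
printf '\357\273\277<text id="a">\nfoo\n</text>\n\357\273\277<text id="b">\nbar\n</text>\n' > "$sample"
count_bom_lines "$sample"    # number of affected lines before cleaning
strip_bom "$sample"
count_bom_lines "$sample"    # number of affected lines after cleaning
rm -f "$sample"
```

Running the count before encoding makes it easy to confirm that no BOM is left in the input.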
Best,
Graham.

On 26/04/2023 at 21:14, Stephanie Evert wrote:
>
>> On 26 Apr 2023, at 16:28, Graham Ranger -- UAPV <graham.ranger at univ-avignon.fr> wrote:
>>
>> Many thanks for your help. Unfortunately, that didn't work... I've just checked: my XML tags are on different lines (though I would hope that would not make a difference) and the only spaces in the file are in the XML tag between "text" and "id".
> If you can access the CWB-indexed corpus (or index it yourself on the command-line with cwb-encode and cwb-make), then you could find the location of the problem with a CQP query
>
> 	[ ! text_id ];
>
> Best,
> Stephanie	