[CWB] Abortion on corpus creation
Graham Ranger -- UAPV
graham.ranger at univ-avignon.fr
Thu Apr 27 18:44:36 CEST 2023
Thanks again, Stephanie. The problem was indeed the U+FEFF or BOM
character lurking somewhere in the file. I thought I'd removed it with
the command I usually use:

    sed -i '1s/^\xEF\xBB\xBF//' myfile.txt

until I realised that this is only any good before concatenating the
files (as it only targets the first line). So I tried

    sed -i 's/^\xEF\xBB\xBF//g' myfile.txt

and things worked from that point on!
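
For anyone hitting the same thing, here is a minimal sketch of why the first-line-only command misses BOMs once files have been concatenated. It assumes GNU sed (for the \xHH escapes and -i) and forces the C locale so the bytes are matched literally; the file names and the count_bom helper are made up for the illustration.

```shell
#!/bin/sh
# Sketch only: hypothetical file names, GNU sed assumed.
tmp=$(mktemp -d)

# Two small files, each starting with a UTF-8 BOM (bytes EF BB BF).
printf '\357\273\277<text id="a">\nfoo\n</text>\n' > "$tmp/a.txt"
printf '\357\273\277<text id="b">\nbar\n</text>\n' > "$tmp/b.txt"

# After concatenation the second BOM sits at the start of line 4, not line 1.
cat "$tmp/a.txt" "$tmp/b.txt" > "$tmp/all.txt"

# Hypothetical helper: count the lines that contain a BOM.
count_bom() { LC_ALL=C grep -c "$(printf '\357\273\277')" "$1"; }

before=$(count_bom "$tmp/all.txt")

# First-line-only variant: removes just the BOM on line 1.
LC_ALL=C sed -i '1s/^\xEF\xBB\xBF//' "$tmp/all.txt"
after_first=$(count_bom "$tmp/all.txt")

# Every-line variant: removes a BOM at the start of any line.
LC_ALL=C sed -i 's/^\xEF\xBB\xBF//' "$tmp/all.txt"
after_all=$(count_bom "$tmp/all.txt")

echo "before=$before after_first=$after_first after_all=$after_all"
```

Two small notes: with the ^ anchor there can be at most one match per line, so the trailing g is harmless but not needed; and if one of the source files does not end in a newline, the next file's BOM lands in the middle of a line, where only an unanchored pattern would catch it.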
Best,
Graham.
On 26/04/2023 at 21:14, Stephanie Evert wrote:
>
>> On 26 Apr 2023, at 16:28, Graham Ranger -- UAPV <graham.ranger at univ-avignon.fr> wrote:
>>
>> Many thanks for your help. Unfortunately, that didn't work... I've just checked: my XML tags are on different lines (though I would hope that would not make a difference) and the only spaces in the file are in the XML tag between "text" and "id".
> If you can access the CWB-indexed corpus (or index it yourself on the command-line with cwb-encode and cwb-make), then you could find the location of the problem with a CQP query
>
> [ ! text_id ];
>
> Best,
> Stephanie
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://liste.sslmit.unibo.it/mailman/listinfo/cwb