[CWB] Abortion on corpus creation

Graham Ranger -- UAPV graham.ranger at univ-avignon.fr
Thu Apr 27 07:55:28 CEST 2023


Many thanks for this, Stephanie. Your query brought to light a "Zero 
Width No-Break Space" character (U+FEFF) (what is the use of that?). I 
think that is what has thrown whole parts of the corpus out of kilter. 
I'm going to try and remove that with sed, and will report back accordingly!
Best,
Graham.
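
For anyone hitting the same problem, here is a minimal sketch of the clean-up in Python rather than sed. U+FEFF normally serves as a byte order mark at the very start of a file. Elsewhere it is an invisible character, and if it sits in front of an XML tag in the input, the line no longer begins with "<", which is presumably why the tag was not recognised as a region boundary. The function names and file paths below are hypothetical, and the sketch assumes the input is UTF-8.

```python
from typing import Tuple

# U+FEFF: ZERO WIDTH NO-BREAK SPACE, also used as the byte order mark (BOM)
ZWNBSP = "\ufeff"


def strip_zwnbsp(text: str) -> Tuple[str, int]:
    """Return the text with every U+FEFF removed, plus the number removed."""
    count = text.count(ZWNBSP)
    return text.replace(ZWNBSP, ""), count


def clean_file(src: str, dst: str) -> int:
    """Copy src to dst without any U+FEFF characters; return how many were dropped.

    Assumes UTF-8 input. newline="" keeps the original line endings untouched.
    """
    with open(src, encoding="utf-8", newline="") as f:
        text = f.read()
    cleaned, count = strip_zwnbsp(text)
    with open(dst, "w", encoding="utf-8", newline="") as f:
        f.write(cleaned)
    return count


if __name__ == "__main__":
    # Hypothetical sample: one stray character before a tag, one inside a token.
    sample = '\ufeff<text id="a">\nfoo\ufeff\n</text>\n'
    cleaned, removed = strip_zwnbsp(sample)
    print(removed, repr(cleaned))
```

Calling clean_file("corpus.vrt", "corpus.clean.vrt") (both file names are placeholders) writes a cleaned copy and returns the number of characters removed, so the original stays available for comparison before re-encoding.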




On 26/04/2023 at 21:14, Stephanie Evert wrote:
>
>> On 26 Apr 2023, at 16:28, Graham Ranger -- UAPV<graham.ranger at univ-avignon.fr>  wrote:
>>
>> Many thanks for your help. Unfortunately, that didn't work... I've just checked: my XML tags are on different lines (though I would hope that would not make a difference) and the only spaces in the file are in the XML tag between "text" and "id".
> If you can access the CWB-indexed corpus (or index it yourself on the command-line with cwb-encode and cwb-make), then you could find the location of the problem with a CQP query
>
> 	[ ! text_id ];
>
> Best,
> Stephanie	
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://liste.sslmit.unibo.it/mailman/listinfo/cwb