[CWB] Character encoding revisited

Wed Jun 25 21:17:13 CEST 2014

Which kind of output, specifically?

For example: a concordance with context width defined in characters, a concordance with context width defined in words, a tabulation, a group, ...?

Depending on which it is, the cause could be rather different.

Also, where in the lines do the broken UTF-8 characters occur? At the beginning, at the end, in the middle, or a combination?
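
One way to answer the question about position is to scan the saved output file at the byte level. The following is a minimal Python sketch and not part of CWB; the function names find_invalid_utf8 and report are hypothetical, and it assumes one result per line. It lists each byte sequence that does not decode as UTF-8 and says whether it sits at the beginning, the end, or the middle of its line.

```python
def find_invalid_utf8(line: bytes):
    """Return (byte_offset, region) for every undecodable sequence in one line.

    region is "beginning", "end", or "middle", relative to the line
    with its trailing newline removed.
    """
    line = line.rstrip(b"\r\n")
    problems = []
    pos = 0
    while pos < len(line):
        try:
            # Strict decoding raises at the first invalid sequence.
            line[pos:].decode("utf-8")
            break  # the rest of the line is valid UTF-8
        except UnicodeDecodeError as err:
            # err.start and err.end are offsets into the slice we decoded.
            start = pos + err.start
            end = pos + err.end
            if start == 0:
                region = "beginning"
            elif end >= len(line):
                region = "end"
            else:
                region = "middle"
            problems.append((start, region))
            pos = end  # resume scanning after the bad bytes
    return problems


def report(path):
    """Print the position of every invalid UTF-8 sequence in a file."""
    with open(path, "rb") as handle:  # binary mode: no implicit decoding
        for number, raw in enumerate(handle, start=1):
            for offset, region in find_invalid_utf8(raw):
                print(f"line {number}, byte {offset}: invalid UTF-8 ({region})")
```

Calling report() on the saved output file (the path is whatever you saved it as) prints one line per problem. Note that a file consisting only of ASCII bytes is also valid UTF-8, so a file reported as ASCII is not necessarily a sign of trouble.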

Lastly, what version are you running?

Best,

Andrew.
________________________________________
From: cwb-bounces at sslmit.unibo.it [cwb-bounces at sslmit.unibo.it] on behalf of Josep M. Fontana [josepm.fontana at upf.edu]
Sent: 25 June 2014 17:41
To: cwb at sslmit.unibo.it
Subject: [CWB] Character encoding revisited

Hi,

Our corpus is encoded in UTF-8, but when I write the output of a search
to a text file I get the typical odd characters that appear when a
conversion has gone wrong. I ran the 'file' command and saw that the
text files are sometimes identified as ISO-8859 and at other times as
ASCII. Is there any way to configure things so that the UTF-8 character
set is maintained? Thanks.

Josep M.