[CWB] Unicode and convenient interface
Ruprecht von Waldenfels
ruprecht.waldenfels at sprachlit.uni-regensburg.de
Mon Jul 7 11:39:02 CEST 2008
Dear all,
as far as I can see, there are two problems with CWB that seriously impair
its usefulness. The first is its relatively clumsy interface: it's a pain to
have to retype everything the moment I discover a spelling mistake in the
first characters, and inspecting corpus results is far from ideal.
The second, more serious problem is its limited Unicode support, mainly
because regular expressions are not interpreted as they should be (e.g., ..
may match one or two characters in UTF-8 because of the variable-length
encoding; [abc] does not work in a predictable way for multi-byte characters;
etc.).
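
To make the two symptoms concrete, here is a small sketch. It does not call
CWB at all: it uses Python's re module on UTF-8 byte strings as a stand-in,
on the assumption that the matching is done byte by byte as described above.
The variable names are made up for the illustration.

```python
import re

zhe = "ж".encode("utf-8")    # b'\xd0\xb6': one character, two bytes
a_cyr = "а".encode("utf-8")  # b'\xd0\xb0': Cyrillic a, NOT in the class below

# ".." is meant to match two characters, but on bytes it matches "ж" alone.
two_dots_match_one_char = re.fullmatch(rb"..", zhe) is not None

# "[жш]" silently becomes a class over the four bytes D0 B6 D1 88.
byte_class = re.compile("[жш]".encode("utf-8"))
class_matches_whole_char = byte_class.fullmatch(zhe) is not None
class_matches_wrong_char = byte_class.match(a_cyr) is not None

print(two_dots_match_one_char, class_matches_whole_char, class_matches_wrong_char)
# prints: True False True
```

So the class fails to match a full character it lists, yet accepts the first
byte of a character it does not list.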
Now I think one could easily solve both problems by (a) encoding corpora in a
fixed-length Unicode encoding, thereby making the length of characters
predictable, and (b) writing an intelligent wrapper around CWB that transforms
regular expressions (e.g., [abc] -> (a|b|c), where a, b and c are multi-byte
characters) and provides a more convenient command-line interface.
Has anybody done something like that? Is there a more convenient interface
to CWB, perhaps semi-graphical, that preserves all its capabilities? Ideally,
one would like to have something that is more accessible, especially to the
ordinary linguist who has no programming skills and has never used Linux or
any command-line tool. Has anybody developed scripts with workarounds for the
Unicode issues? I think these two issues could be solved together fairly
easily.
Would anybody be interested in having something like that?
Thanks, and a nice day,
Ruprecht
Ruprecht v. Waldenfels
Institut für Slavistik, Universität Regensburg
Universitätsstr. 31, 93051 Regensburg
ruprecht dot waldenfels at sprachlit.uni-regensburg.de
skype: rvwaldenfels
Tel. +49 (0) 941 943 5319
Fax. +49 (0) 941 943 1991