[CWB] Unicode and convenient interface

Mon Jul 7 11:39:02 CEST 2008

Dear all, 

as far as I can see, there are two problems with CWB that seriously impair 
its usefulness. The first is its relatively clumsy interface: it's a pain to 
have to retype everything the moment I discover a spelling mistake in the 
first characters, and inspecting corpus results is far from ideal.

The second, more serious problem is its limited Unicode support, 
mainly because regular expressions are not interpreted as they should be 
(e.g., .. may match one or two characters in UTF-8 because characters vary 
in byte length; [abc] does not work in a predictable way for multi-byte 
characters; etc.).
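
A small Python sketch of the effect, assuming the matching is done byte by 
byte on UTF-8 data. Python's re module applied to bytes merely stands in for 
such an engine here; this is not CWB code, and the example words are arbitrary:

```python
import re

word = "für"                    # 3 characters
raw = word.encode("utf-8")      # 4 bytes: "ü" takes two bytes in UTF-8

# Byte-wise matching: each "." consumes one byte, so ".." covers "f"
# plus only the first half of "ü".
byte_match = re.match(rb"..", raw).group()

# Character-wise matching: ".." covers the two characters "fü".
char_match = re.match(r"..", word).group()

# A class such as [äöü], once encoded, is a class over single bytes.
# It can never match a whole "ü", but it does match the first byte of
# unrelated characters that share the same lead byte.
byte_class = re.compile("[äöü]".encode("utf-8"))
matches_whole_u = byte_class.fullmatch("ü".encode("utf-8")) is not None
matches_start_of_e_acute = byte_class.match("é".encode("utf-8")) is not None

print(byte_match, char_match, matches_whole_u, matches_start_of_e_acute)
```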

Now I think one could easily solve both problems by (a) encoding corpora in a 
fixed-length Unicode encoding, thereby making the length of characters 
predictable, and (b) writing an intelligent wrapper around CWB that transforms 
regular expressions (e.g., [abc] -> (a|b|c), where a, b and c are multi-byte 
characters) and also provides a more convenient command-line interface.
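
The rewriting in (b) could look roughly like the following Python sketch. The 
function name expand_classes is hypothetical, and the sketch assumes that only 
simple classes need rewriting: negated classes, ranges and escapes are left 
untouched, and nothing here talks to CWB itself.

```python
import re

# A simple bracket class such as [abc]. Classes containing "^", "-", "\"
# or brackets are deliberately not matched, so they pass through unchanged.
SIMPLE_CLASS = re.compile(r"\[([^\]\[\^\\-]+)\]")

def expand_classes(pattern: str) -> str:
    """Rewrite [abc] as (a|b|c) when the class holds non-ASCII characters."""
    def repl(m: re.Match) -> str:
        chars = m.group(1)
        if chars.isascii():
            # Single-byte characters already behave predictably.
            return m.group(0)
        return "(" + "|".join(re.escape(c) for c in chars) + ")"
    return SIMPLE_CLASS.sub(repl, pattern)

print(expand_classes("[äöü]ber"))   # class with multi-byte characters
print(expand_classes("[abc]d"))     # pure ASCII class, left as it is
```

An alternation works where the class does not because each alternative is a 
complete byte sequence, so the engine never matches half a character.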

Has anybody done something like that? Is there some more convenient interface 
to CWB, perhaps half-graphical, that preserves all its capabilities? Ideally, 
one would like to have something that is more accessible, especially to the 
ordinary linguist who has no programming skills and has never used Linux or 
any command-line tool. Has anybody developed scripts with workarounds for the 
Unicode issues? I think these two issues could be solved together fairly 
easily.

Would anybody be interested in having something like that?

Thanks, and a nice day, 
Ruprecht

Ruprecht v. Waldenfels
Institut für Slavistik, Universität Regensburg
Universitätsstr. 31, 93051 Regensburg
ruprecht dot waldenfels at sprachlit.uni-regensburg.de
skype: rvwaldenfels
Tel. +49 (0) 941 943 5319
Fax. +49 (0) 941 943 1991