[CWB] web-interface with aligned corpora and WebCqp::Persistent

Wed Feb 21 23:10:06 CET 2007

Hi again!

> thanks stefan and lars for your answers. I managed to load my parallel
> corpus and to add the alignments to the CQPdemo interface (just dirty
> hacking - to test if it would work ...). very nice! I will try to use it
> for the OPUS corpora as well. I can let you know when I have it on-line.

That's good to hear – so the CQPdemo code is more hackable than I  
thought after all. :o)

> about crossing alignments: I didn't know that this is supported by CWB. I
> usually used cwb-align-encode to build the alignment attributes and as far
> as I remember, crossing links are not allowed when using that tool. Am I
> correct?

Do you generate the input files for cwb-align-encode yourself?  You  
have to make sure that the regions in the _source language_ are  
ordered (gaps are allowed), but there are no restrictions for the  
corresponding target regions.  Make sure you _don't_ pass the "-C"  
flag for compatibility mode.  Recent versions use the "extended"  
alignment file format, which allows crossing links and gaps (all beta  
versions published within the last 5 years or so should support the  
extended alignment format).
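
The ordering rule can be checked before the alignment is encoded. The
following is only a minimal sketch: it assumes the alignment beads have
already been read into tuples of corpus positions (src_start, src_end,
tgt_start, tgt_end), and the function name check_source_order is made up
for this illustration. It does not parse or write the actual input file
format of cwb-align-encode.

```python
from typing import List, Tuple

# One alignment bead as inclusive corpus positions (assumed representation):
# (src_start, src_end, tgt_start, tgt_end)
Bead = Tuple[int, int, int, int]


def check_source_order(beads: List[Bead]) -> List[str]:
    """Return a list of problems; an empty list means the source side is ordered."""
    problems = []
    prev_end = -1  # end of the furthest source region seen so far
    for i, (s0, s1, _t0, _t1) in enumerate(beads):
        if s1 < s0:
            problems.append(f"bead {i}: source region ends before it starts")
        if s0 <= prev_end:
            problems.append(
                f"bead {i}: source region starts at {s0}, "
                f"but an earlier source region ends at {prev_end}"
            )
        prev_end = max(prev_end, s1)
    return problems


# Source positions 10-19 are unaligned (a gap) and the target regions cross;
# neither is reported, because only the source side has to be ordered.
ok_beads = [(0, 9, 30, 41), (20, 29, 0, 12), (30, 39, 13, 29)]

# The second source region overlaps the first one.
bad_beads = [(0, 9, 0, 5), (5, 12, 6, 9)]
```

Target positions are deliberately ignored by the check, since there are
no restrictions on that side.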

Or, as the CWB source code puts it:

     if (compatibility) {
       /* source and target regions of .alg file must be contiguous;
          store start points only; */
       /* hence we must collapse crossing alignments into one larger
          region (I know that's bullshit) */

:o)

> but with crossing links - it's no problem to represent word
> alignment in CWB as well, isn't it? cool! did anyone try this already?
> (I can imagine that indices get quite big then ...)

In principle you should be able to do that, but it would be rather  
impractical (huge index files, as you guessed, and processing would  
be very inefficient) and not very useful (since all you can do within  
CQP is run aligned queries and display aligned regions for the  
_entire_ query match, i.e. the source language match has to fit into  
a single alignment bead).
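
To make the "single alignment bead" restriction concrete, here is a small
sketch of the lookup it implies. It is an illustration of the constraint
only, not how CQP implements it; the bead tuples (src_start, src_end,
tgt_start, tgt_end) and the function name aligned_region are assumptions
made for this example.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

Bead = Tuple[int, int, int, int]  # (src_start, src_end, tgt_start, tgt_end)


def aligned_region(beads: List[Bead], match_start: int,
                   match_end: int) -> Optional[Tuple[int, int]]:
    """Return the target region for a match, or None if the match does not
    fit into a single bead. Beads must be sorted by source start."""
    starts = [b[0] for b in beads]
    i = bisect_right(starts, match_start) - 1  # last bead starting at or before the match
    if i < 0:
        return None
    s0, s1, t0, t1 = beads[i]
    if s0 <= match_start and match_end <= s1:
        return (t0, t1)
    return None  # match lies in a gap or spans more than one bead


sentence_beads = [(0, 9, 30, 41), (20, 29, 0, 12)]
# With word alignment every bead covers a single token.
word_beads = [(0, 0, 2, 2), (1, 1, 0, 0), (2, 2, 1, 1)]
```

With sentence_beads, a match at positions 3-5 maps to the target region
(30, 41). With word_beads, any match longer than one token returns None,
which is why word-level beads would not be of much use for aligned
queries.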

If anyone feels like refactoring the CWB to a more general query  
model and adding true word alignment attributes so that we can run  
queries across multiple languages, I'm game – I'm sure this would be  
fun!

> one more thing: does the charset option in registry files do anything? or
> is it just for information purposes?

At the moment, it's just for informational purposes – e.g., a Web  
interface could use it to translate between the corpus encoding and  
an "external" Unicode representation automatically (since the  
information is represented in a standardised way) – but the intention  
behind the charset "property" was to add support for non-Latin1  
character sets in the future (so that %c and %d queries work  
properly).  CQP will probably never offer to convert automatically  
between a corpus charset and e.g. Unicode for input/output.
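
A Web interface could do that translation along the following lines.
This is a sketch under assumptions: the charset names in the CODECS table
and their mapping to Python codec names are chosen for illustration, and
the two helper functions are hypothetical rather than part of the CWB or
of CQPdemo.

```python
# Assumed mapping from a registry charset value to a Python codec name.
CODECS = {
    "latin1": "iso-8859-1",
    "latin2": "iso-8859-2",
    "utf8": "utf-8",
}


def to_corpus_bytes(query: str, charset: str) -> bytes:
    """Encode a Unicode query string into the corpus encoding.

    Raises UnicodeEncodeError if the query contains characters that the
    corpus charset cannot represent."""
    return query.encode(CODECS[charset])


def from_corpus_bytes(data: bytes, charset: str) -> str:
    """Decode CQP output into Unicode for display; undecodable bytes are
    replaced instead of aborting the request."""
    return data.decode(CODECS[charset], errors="replace")
```

The interface would read the charset once per corpus, encode each query
with to_corpus_bytes before sending it to CQP, and decode every result
line with from_corpus_bytes.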

I'm hoping to provide support at least for Latin2 in time for the  
official 3.0 release, using a mapping table contributed by Tomaz  
Erjavec.  The short-term strategy will be to deactivate the %c and %d  
flags for all other charsets, so at least they won't mess up UTF-8  
byte sequences any more.

> I guess that CWB is still byte based
> and cannot really handle unicode encodings, can it?

Exactly.  The CWB can easily be extended to all ISO-8859 character  
sets (all we have to do is provide suitable mapping tables for the %c  
and %d flags), but there is no proper Unicode support for the simple  
reason that this would require us to compile against huge Unicode  
libraries with potential licensing problems.  There's also a certain  
performance penalty: regular expressions and  
case/diacritic-insensitive searching are more efficient for byte  
encodings than for Unicode (UTF-8) data.
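
The idea of a per-charset mapping table, and the way a byte-wise table
damages UTF-8, can be sketched as follows. This is an illustration only:
it derives a 256-entry lower-casing table from Python's own codecs rather
than from the tables used in the CWB, it covers case folding but not
diacritic folding, and the function names are invented for the example.

```python
def case_fold_table(codec: str) -> bytes:
    """Build a 256-entry byte-to-byte lower-casing table for a single-byte codec."""
    table = bytearray(range(256))  # identity mapping by default
    for b in range(256):
        try:
            low = bytes([b]).decode(codec).lower().encode(codec)
        except UnicodeError:
            continue  # byte undefined in this codec, or lower case not representable
        if len(low) == 1:
            table[b] = low[0]
    return bytes(table)


def fold(text: str, text_codec: str, table: bytes) -> bytes:
    """Encode text and apply the table byte by byte."""
    return text.encode(text_codec).translate(table)


def is_valid_utf8(data: bytes) -> bool:
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


latin1_table = case_fold_table("iso-8859-1")
latin2_table = case_fold_table("iso-8859-2")

# Correct for single-byte data in the matching charset:
latin1_result = fold("É", "iso-8859-1", latin1_table).decode("iso-8859-1")
latin2_result = fold("Š", "iso-8859-2", latin2_table).decode("iso-8859-2")

# Applied to UTF-8, the Latin1 table rewrites the lead byte 0xC3 of "É"
# to 0xE3, and the result is no longer a valid UTF-8 sequence.
broken = fold("É", "utf-8", latin1_table)
```

Supporting another ISO-8859 charset then amounts to supplying another
256-byte table, while UTF-8 data has to be left alone by any such table.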

A lot of people have requested Unicode support, of course, so this is  
at the top of the list of features to be added in the future  
open-source development of the CWB.

> I'm looking forward to seeing the next (first) version on sourceforge ...

Thanks! I'm trying to get it out quickly ... I'd just need a few  
quiet days to go through the source code and clean up a little ...  
like throwing out the tests, which don't work at all and have already  
confused a number of people trying to compile the CWB. :o)

Best to all of you,
Stefan