[CWB] web-interface with aligned corpora and WebCqp::Persistent
Stefan Evert
stefan.evert at uos.de
Wed Feb 21 23:10:06 CET 2007
Hi again!
> thanks stefan and lars for your answers. I managed to load my parallel
> corpus and to add the alignments to the CQPdemo interface (just dirty
> hacking - to test if it would work ...). very nice! I will try to use it
> for the OPUS corpora as well. I can let you know when I have it on-line.
That's good to hear – so the CQPdemo code is more hackable than I
thought after all. :o)
> about crossing alignments: I didn't know that this is supported by CWB. I
> usually used cwb-align-encode to build the alignment attributes and as far
> as I remember, crossing links are not allowed when using that tool. Am I
> correct?
Do you generate the input files for cwb-align-encode yourself? You
have to make sure that the regions in the _source language_ are
ordered (gaps are allowed), but there are no restrictions for the
corresponding target regions. Make sure you _don't_ pass the "-C"
flag for compatibility mode. Recent versions use the "extended"
alignment file format that allows crossing links and gaps (all beta
versions published within the last 5 years or so should support the
extended alignment format).
Or, as the CWB source code puts it:
    if (compatibility) {
      /* source and target regions of .alg file must be contiguous; store start points only; */
      /* hence we must collapse crossing alignments into one larger region (I know that's bullshit) */
:o)
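If you generate the alignment input yourself, a quick sanity check before encoding might look like the sketch below. It is only an illustration: the tuple layout (src_start, src_end, tgt_start, tgt_end) and both function names are hypothetical and are not the cwb-align-encode file format, and rejecting overlapping source regions is an extra assumption on top of the ordering requirement described above.

```python
from typing import List, Tuple

# Hypothetical bead representation: (src_start, src_end, tgt_start, tgt_end),
# inclusive corpus positions. This is NOT the cwb-align-encode input format.
Bead = Tuple[int, int, int, int]


def check_source_order(beads: List[Bead]) -> None:
    """Raise ValueError unless source regions are in ascending order.

    Gaps between source regions are fine; overlapping source regions are
    rejected (an assumption of this sketch). Target regions are only checked
    for start <= end, so they may appear in any order.
    """
    prev_end = -1
    for i, (s0, s1, t0, t1) in enumerate(beads):
        if s0 > s1 or t0 > t1:
            raise ValueError(f"bead {i}: region end precedes region start")
        if s0 <= prev_end:
            raise ValueError(f"bead {i}: source region out of order or overlapping")
        prev_end = s1


def has_crossing_links(beads: List[Bead]) -> bool:
    """True if a later source region is linked to an earlier target region."""
    return any(b[2] < a[2] for a, b in zip(beads, beads[1:]))
```

For example, the beads [(0, 4, 10, 14), (7, 9, 0, 5)] pass the source-order check (with a gap at positions 5-6) even though the two links cross on the target side.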
> but with crossing links - it's no problem to represent word
> alignment in CWB as well, isn't it? cool! did anyone try this already?
> (I can imagine that indices get quite big then ...)
In principle you should be able to do that, but it would be rather
impractical (huge index files, as you guessed, and processing would
be very inefficient) and not very useful (since all you can do within
CQP is run aligned queries and display aligned regions for the
_entire_ query match, i.e. the source language match has to fit into
a single alignment bead).
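To see why word-level beads would not buy you much, here is a rough sketch of the "match has to fit into a single bead" restriction. It is not how CQP implements the lookup; the bead tuples and the aligned_region function are hypothetical and only mimic the behaviour described above.

```python
from typing import List, Optional, Tuple

# Hypothetical bead representation: (src_start, src_end, tgt_start, tgt_end),
# inclusive corpus positions.
Bead = Tuple[int, int, int, int]


def aligned_region(beads: List[Bead], match_start: int,
                   match_end: int) -> Optional[Tuple[int, int]]:
    """Return the target region of the bead containing the whole match.

    If the match is not fully inside one source region, there is no aligned
    region to show, so the function returns None.
    """
    for s0, s1, t0, t1 in beads:
        if s0 <= match_start and match_end <= s1:
            return (t0, t1)
    return None
```

With sentence-sized beads such as [(0, 9, 0, 11), (10, 19, 12, 20)], a match at positions 3-5 maps to the target region (0, 11). With one-word beads such as [(0, 0, 1, 1), (1, 1, 0, 0)], any match longer than a single token gets no aligned region at all.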
If anyone feels like refactoring the CWB to a more general query
model and adding true word alignment attributes so that we can run
queries across multiple languages, I'm game – I'm sure this would be
fun!
> one more thing: does the charset option in registry files do anything? or
> is it just for information purposes?
At the moment, it's just for informational purposes – e.g., a Web
interface could use it to translate between the corpus encoding and
an "external" Unicode representation automatically (since the
information is represented in a standardised way) – but the intention
behind the charset "property" was to add support for non-Latin1
character sets in the future (so that %c and %d queries work
properly). CQP will probably never offer to convert automatically
between a corpus charset and e.g. Unicode for input/output.
I'm hoping to provide support at least for Latin2 in time for the
official 3.0 release, using a mapping table contributed by Tomaz
Erjavec. The short-term strategy will be to deactivate the %c and %d
flags for all other charsets, so at least they won't mess up UTF-8
byte sequences any more.
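As an illustration of the Web interface idea, a script could read the charset property and decode CQP output accordingly. This is a minimal sketch under assumptions: it assumes the property shows up in the registry file as a line of the form ##:: charset = "latin1", and the name-to-codec table is made up for this example rather than taken from the CWB.

```python
import re
from typing import Optional

# Assumed mapping from registry charset names to Python codec names;
# this table is illustrative and not part of the CWB.
CODECS = {"latin1": "iso8859_1", "latin2": "iso8859_2", "utf8": "utf_8"}

# Assumed form of the property line: ##:: charset = "latin1"
CHARSET_RE = re.compile(r'^##::\s*charset\s*=\s*"([^"]*)"', re.MULTILINE)


def registry_charset(registry_text: str) -> Optional[str]:
    """Return the declared charset name, or None if no such line is found."""
    m = CHARSET_RE.search(registry_text)
    return m.group(1) if m else None


def to_unicode(raw: bytes, charset: Optional[str]) -> str:
    """Decode raw corpus bytes; unknown or missing charsets fall back to Latin1."""
    return raw.decode(CODECS.get(charset or "", "iso8859_1"))
```

The conversion happens entirely outside CQP, which matches the point above that CQP itself does not convert between the corpus charset and Unicode.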
> I guess that CWB is still byte based
> and cannot really handle unicode encodings, can it?
Exactly. The CWB can easily be extended to all ISO-8859 character
sets (all we have to do is provide suitable mapping tables for the %c
and %d flags), but there is no proper Unicode support for the simple
reason that this would require us to compile against huge Unicode
libraries with potential licensing problems. There's also a certain
performance penalty: regular expressions and case/diacritic-
insensitive searching are more efficient for byte encodings than for
Unicode (UTF-8) data.
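The mapping-table idea, and the way it goes wrong on UTF-8, can be sketched in a few lines. This is not the CWB's actual table or code: build_fold_table and fold_case are hypothetical names, the table is derived from Python's own case rules, and only case folding (the %c side) is shown, not diacritic folding.

```python
def build_fold_table(codec: str) -> bytes:
    """Build a 256-entry byte table that lowercases each byte of a single-byte charset."""
    table = bytearray(range(256))
    for b in range(256):
        try:
            folded = bytes([b]).decode(codec).lower().encode(codec)
        except UnicodeError:
            continue  # byte undefined in this charset, or result not representable
        if len(folded) == 1:
            table[b] = folded[0]
    return bytes(table)


LATIN1_FOLD = build_fold_table("iso8859_1")


def fold_case(raw: bytes, table: bytes = LATIN1_FOLD) -> bytes:
    """Apply the table byte by byte, with no knowledge of multi-byte encodings."""
    return raw.translate(table)
```

On Latin1 data this behaves as intended: the byte for "É" (0xC9) becomes the byte for "é" (0xE9). On UTF-8 data the same table rewrites the lead byte of the two-byte sequence for "É" (0xC3 0x89) as if it were a Latin1 letter, giving 0xE3 0x89, which is no longer valid UTF-8. Supporting another ISO-8859 charset only requires a different table, e.g. build_fold_table("iso8859_2").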
A lot of people have requested Unicode support, of course, so this is
at the top of the list of features to be added in the future
open-source development of the CWB.
> I'm looking forward to seeing the next (first) version on sourceforge ...
Thanks! I'm trying to get it out quickly ... I'd just need a few
quiet days to go through the source code and clean up a little ...
like throwing out the tests, which don't work at all and have already
confused a number of people trying to compile the CWB. :o)
Best to all of you,
Stefan