<html>
<head>
<meta content="text/html; charset=windows-1252"
http-equiv="Content-Type">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<div class="moz-cite-prefix">Hi, <br>
I don't know whether this helps, but I use positional attributes
to encode word alignment, and then transform the output to display
the alignment. Essentially, you can put anything into these
positional attributes, including ranges, whether contiguous or not. The
challenge then shifts to transforming the output. <br>
<br>
My solution is to have CWB output the results as XML, with the word
alignment included in the positional attributes, and to transform
that XML using XSLT. Have a look here: <a class="moz-txt-link-freetext" href="http://www.parasolcorpus.org/KrakowMW/">http://www.parasolcorpus.org/KrakowMW/</a><br>
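<br>
As a rough illustration of the idea (a minimal Python sketch, not the actual paravoz2 code): each token line of the vertical format gets an extra tab-separated column listing the alignment links the token belongs to. The helper name to_vertical, the link ids, the "|" separator and the "_" placeholder are assumptions made for this example only.<br>
<pre>
```python
def to_vertical(tokens, links):
    """Return vertical-format lines: word TAB alignment ids.

    tokens: list of word forms, in corpus order.
    links:  dict mapping a link id to the token offsets it covers
            (the offsets may be non-contiguous).
    """
    ids_per_token = [[] for _ in tokens]
    for link_id, offsets in sorted(links.items()):
        for offset in offsets:
            ids_per_token[offset].append(link_id)
    lines = []
    for word, ids in zip(tokens, ids_per_token):
        # "_" marks a token that takes part in no alignment link
        column = "|".join(ids) if ids else "_"
        lines.append(word + "\t" + column)
    return lines


tokens = ["el", "mercado", "interbancario", "vuelva", "a",
          "poner", "se", "en", "marcha", "."]
# dif_3 covers "vuelva a poner" and "en marcha", skipping "se"
links = {"dif_2": [1, 2], "dif_3": [3, 4, 5, 7, 8]}
vertical = to_vertical(tokens, links)
print("\n".join(vertical))
```
</pre>
Because the link id is just a value on each token, a gap such as "se" needs no special handling at encoding time; grouping the tokens that share an id is left to the output transformation.<br>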
<br>
The interface is open source
(<a class="moz-txt-link-freetext" href="https://bitbucket.org/rvwfels/paravoz2">https://bitbucket.org/rvwfels/paravoz2</a>), but we just found a
bug which hasn't been fixed yet, so write to me for details if you
want to try it out (essentially, you need to follow a certain
naming convention when encoding the corpus). <br>
<br>
Best!<br>
Ruprecht<br>
<br>
<br>
On 23.06.2015 at 18:24, Jose Manuel Martinez Martinez wrote:<br>
</div>
<blockquote cite="mid:55898822.40408@gmail.com" type="cite">
<font size="+1">Dear all,<br>
<br>
I've managed to import the alignment of two corpora at sentence
level. I'd be happy to document the process for the
encoding tutorial.<br>
<br>
However, I came across an error when trying to align
structural attributes in a different corpus.<br>
<br>
> sh add_difficulties_align_test.sh <br>
Generating keys for grid regions:<br>
- TDC-AD-TEST ..... ok<br>
- TDC-TT-TEST ..... ok<br>
Processing .Error: alignment bead #4 is non-contiguous in
TDC-TT-TEST<br>
(keys: ep1_tr10_dif_3 ep1_tr10_dif_4)<br>
<br>
You can find attached a test data set to reproduce the issue. My
question is, is there a way to overcome this error?<br>
<br>
This alignment is basically a kind of "word alignment".
However, I am not aligning all words, only those words in the
source text contained within a structural attribute, and I align
them only with the structural attribute(s) containing the
translation. Sometimes, depending on the source text unit, the
translation is a non-contiguous rendering. See the example
below, especially difficulty id="ep1_tr10_dif_3" in the source
text and its translation (difficulty id="ep1_tr10_dif_3" and
difficulty id="ep1_tr10_dif_4").<br>
<br>
#-- source<br>
<br>
the<br>
<difficulty id="ep1_tr10_dif_2" type="unspec"><br>
interbank<br>
market<br>
</difficulty><br>
is<br>
<difficulty id="ep1_tr10_dif_3" type="unspec"><br>
restarted<br>
</difficulty><br>
.<br>
<br>
#-- translation<br>
<br>
el<br>
<difficulty id="ep1_tr10_dif_2" type="unspec"><br>
mercado<br>
interbancario<br>
</difficulty><br>
<difficulty id="ep1_tr10_dif_3" type="unspec"><br>
vuelva<br>
a<br>
poner<br>
</difficulty><br>
se<br>
<difficulty id="ep1_tr10_dif_4" type="unspec"><br>
en<br>
marcha<br>
</difficulty><br>
.<br>
<br>
#-- alignment<br>
<br>
ep1_tr10_dif_2 ep1_tr10_dif_2<br>
ep1_tr10_dif_3 ep1_tr10_dif_3 ep1_tr10_dif_4<br>
<br>
I also tried to wrap each word in an XML element like:<br>
<br>
<token id="ep1_tr10_t_2"><br>
mercado<br>
</token><br>
<token id="ep1_tr10_t_3"><br>
interbancario<br>
</token><br>
<token id="ep1_tr10_t_4"><br>
vuelva<br>
</token><br>
<token id="ep1_tr10_t_5"><br>
a<br>
</token><br>
<token id="ep1_tr10_t_6"><br>
poner<br>
</token><br>
<token id="ep1_tr10_t_54"><br>
se<br>
</token><br>
<token id="ep1_tr10_t_7"><br>
en<br>
</token><br>
<token id="ep1_tr10_t_8"><br>
marcha<br>
</token><br>
<br>
So the tokens involved in the alignment have to be contiguous
(not the structural elements). In the example given, this is
trivial (one token more or less...), but I have other cases
where elements are much farther apart and I don't want to include
all the tokens in between.<br>
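<br>
To make the constraint concrete, here is a minimal Python sketch of a contiguity check over token positions. It is not the CWB implementation; the function is_contiguous and the (start, end) region tuples are hypothetical and only illustrate why the bead above is rejected: "se" sits between the two target regions.<br>
<pre>
```python
def is_contiguous(regions):
    """True if the (start, end) token ranges form one unbroken span.

    Each region is an inclusive (start, end) pair of corpus positions.
    """
    ordered = sorted(regions)
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        # the next region must begin right after the previous one ends
        if next_start != prev_end + 1:
            return False
    return True


# Positions in the translation: el=0 mercado=1 interbancario=2
# vuelva=3 a=4 poner=5 se=6 en=7 marcha=8 .=9
dif_3 = (3, 5)   # "vuelva a poner"
dif_4 = (7, 8)   # "en marcha"
print(is_contiguous([dif_3, dif_4]))          # False: "se" (6) lies in between
print(is_contiguous([dif_3, (6, 6), dif_4]))  # True once "se" is included
```
</pre>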
<br>
Although my case is a bit special, I don't think this is an
infrequent scenario; see Amoia et al. 2011: <a
moz-do-not-send="true" class="moz-txt-link-freetext"
href="http://www.aclweb.org/anthology/W11-4302">http://www.aclweb.org/anthology/W11-4302</a>.<br>
<br>
Any comments or hints will be much appreciated.<br>
<br>
Cheers,<br>
<br>
jmm<br>
</font> <br>
<fieldset class="mimeAttachmentHeader"></fieldset>
<br>
<pre wrap="">_______________________________________________
CWB mailing list
<a class="moz-txt-link-abbreviated" href="mailto:CWB@sslmit.unibo.it">CWB@sslmit.unibo.it</a>
<a class="moz-txt-link-freetext" href="http://devel.sslmit.unibo.it/mailman/listinfo/cwb">http://devel.sslmit.unibo.it/mailman/listinfo/cwb</a>
</pre>
</blockquote>
<br>
</body>
</html>