<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<font size="+1">Dear all,<br>
<br>
I've managed to import the alignment of two corpora at sentence
level. I'd be happy to document the process for the
encoding tutorial.<br>
<br>
However, I came across an error when trying to align
structural attributes </font><font size="+1"><font size="+1">in a
different corpus</font>.<br>
<br>
> sh add_difficulties_align_test.sh <br>
Generating keys for grid regions:<br>
- TDC-AD-TEST ..... ok<br>
- TDC-TT-TEST ..... ok<br>
Processing .Error: alignment bead #4 is non-contiguous in
TDC-TT-TEST<br>
(keys: ep1_tr10_dif_3 ep1_tr10_dif_4)<br>
<br>
I have attached a test data set that reproduces the issue. My
question is: is there a way to overcome this error?<br>
<br>
This alignment is essentially a kind of "word alignment". However,
I am not aligning all words, only those words in the source
text that are contained within a structural attribute, and I align them
only with the structural attribute(s) containing the translation.
Sometimes, depending on the source text unit, the translation is
rendered non-contiguously. See the example below, especially </font><font
size="+1"><font size="+1">difficulty id="ep1_tr10_dif_3" in the
source text and its translation </font></font><font size="+1"><font
size="+1"><font size="+1">(difficulty id="ep1_tr10_dif_3"</font>
</font>and </font><font size="+1"><font size="+1">difficulty
id="ep1_tr10_dif_4"</font>).<br>
<br>
#-- source<br>
<br>
the<br>
<difficulty id="ep1_tr10_dif_2" type="unspec"><br>
interbank<br>
market<br>
</difficulty><br>
is<br>
<difficulty id="ep1_tr10_dif_3" type="unspec"><br>
restarted<br>
</difficulty><br>
.<br>
<br>
#-- translation<br>
<br>
el<br>
<difficulty id="ep1_tr10_dif_2" type="unspec"><br>
mercado<br>
interbancario<br>
</difficulty><br>
<difficulty id="ep1_tr10_dif_3" type="unspec"><br>
vuelva<br>
a<br>
poner<br>
</difficulty><br>
se<br>
<difficulty id="ep1_tr10_dif_4" type="unspec"><br>
en<br>
marcha<br>
</difficulty><br>
.<br>
<br>
#-- alignment<br>
<br>
ep1_tr10_dif_2 ep1_tr10_dif_2<br>
ep1_tr10_dif_3 ep1_tr10_dif_3 ep1_tr10_dif_4<br>
<br>
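To make the problem concrete, here is a minimal Python sketch of what
I assume the contiguity check amounts to. It is not the import tool's
actual code: the function name non_contiguous_beads and the hard-coded
token positions are only my own illustration of the example above.<br>
<br>
```python
def non_contiguous_beads(spans, beads):
    """Return the 1-based numbers of beads whose target regions leave a token gap.

    spans: region id -> (first_token, last_token), positions counted from 0
    beads: list of (source_ids, target_ids) pairs
    """
    flagged = []
    for number, (_source_ids, target_ids) in enumerate(beads, start=1):
        ordered = sorted(spans[region_id] for region_id in target_ids)
        for (_, prev_last), (next_first, _) in zip(ordered, ordered[1:]):
            # Contiguous means the next region starts right after the previous one ends.
            if next_first != prev_last + 1:
                flagged.append(number)
                break
    return flagged


# Token positions in the translation above (assumed, counted from 0):
# el=0, mercado=1, interbancario=2, vuelva=3, a=4, poner=5, se=6, en=7, marcha=8
target_spans = {
    "ep1_tr10_dif_2": (1, 2),
    "ep1_tr10_dif_3": (3, 5),
    "ep1_tr10_dif_4": (7, 8),
}
beads = [
    (["ep1_tr10_dif_2"], ["ep1_tr10_dif_2"]),
    (["ep1_tr10_dif_3"], ["ep1_tr10_dif_3", "ep1_tr10_dif_4"]),
]
print(non_contiguous_beads(target_spans, beads))  # prints [2]
```
<br>
In this reduced example the second bead is flagged, because the token
"se" sits between ep1_tr10_dif_3 and ep1_tr10_dif_4.<br>
<br>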
I also tried to wrap each word in an XML element like this:<br>
<br>
<token id="ep1_tr10_t_2"><br>
mercado<br>
</token><br>
<token id="ep1_tr10_t_3"><br>
interbancario<br>
</token><br>
<token id="ep1_tr10_t_4"><br>
vuelva<br>
</token><br>
<token id="ep1_tr10_t_5"><br>
a<br>
</token><br>
<token id="ep1_tr10_t_6"><br>
poner<br>
</token><br>
<token id="ep1_tr10_t_54"><br>
se<br>
</token><br>
<token id="ep1_tr10_t_7"><br>
en<br>
</token><br>
<token id="ep1_tr10_t_8"><br>
marcha<br>
</token><br>
<br>
So it is the tokens involved in the alignment that have to be
contiguous, not the structural elements. In the example given, this is
trivial (one token more or less...), but I have other cases where the
elements are much farther apart and I don't want to include all the
tokens in between.<br>
<br>
Although my case is a bit special, I don't think this is an
infrequent scenario; see Amoia et al. 2011
<a class="moz-txt-link-freetext" href="http://www.aclweb.org/anthology/W11-4302">http://www.aclweb.org/anthology/W11-4302</a>.<br>
<br>
Any comments or hints would be much appreciated.<br>
<br>
Cheers,<br>
<br>
jmm<br>
</font>
</body>
</html>