[CWB] A question about the aligning using cwb-encoding

Stefan Evert stefanML at collocations.de
Mon Jan 27 08:22:21 CET 2014


> Some first sentences were aligned as right pairs.
> But the others were not.
> It seems to be related with statistical aligning process.

You're absolutely right.  cwb-align isn't a particularly sophisticated sentence aligner, so it's likely to get some cases wrong.  You may be seeing particularly bad performance if you're using the default parameter settings, which are intended for related languages and are based on sentence length (in characters), character n-gram counts and identical words.

For Korean-English alignment, the best solution might be to get a good bilingual word list and use that as the only feature (dropping even sentence length).

> Actually I made two corpora so, that every pair sentence should have the same sentence id like <s id="100"> or <s id="10000">, in order to avoid the failure of statistical alignment.
> I am working with 60000 sentences. And I manually aligned all sentences and put the information into the xml tag "s_id".
> 
> My question is how I can make useful the manually created xml tag "s_id"?

If these are only 1:1 alignments, you can use a trick to smuggle them past cwb-align:

	cwb-align -V s_id -o alignment.txt CORPUS1 CORPUS2 s -C:1

With "-V s_id", the manually aligned sentence pairs are taken as a pre-alignment, and the statistical aligner is only run within each pair of pre-aligned regions.  Since each of those contains just a single sentence pair, it cannot further break up the bead, so the original pre-alignment is passed through.  Feature specs shouldn't matter here, so you might as well just specify -C:1 to avoid unnecessary overhead.  You can then proceed to cwb-align-encode the generated file alignment.txt as usual.
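
Since this trick relies on every sentence ID having a counterpart in the other corpus, it may be worth checking the input files before encoding.  The following is only a rough sketch in Python: it assumes the sentence tags look exactly like <s id="100"> in the input text, and it assumes that matching IDs should appear in the same order in both corpora (I'd expect the pre-alignment to need that, but please verify).  The function names are made up for illustration and are not part of CWB.

```python
import re

# Assumption: sentence start tags have exactly this form, e.g. <s id="100">
S_ID = re.compile(r'<s id="([^"]+)">')

def sentence_ids(text):
    """Return all sentence IDs in the order they occur in the input text."""
    return S_ID.findall(text)

def compare_ids(source_text, target_text):
    """Report IDs missing on either side and whether the ID order matches."""
    src = sentence_ids(source_text)
    tgt = sentence_ids(target_text)
    return {
        "only_in_source": sorted(set(src) - set(tgt)),
        "only_in_target": sorted(set(tgt) - set(src)),
        "same_order": src == tgt,
    }
```

If both lists of missing IDs are empty and "same_order" is true, every sentence has exactly one partner in the same position, which is the 1:1 situation described above.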

If you have more complex alignments (n:1 or 1:n, 2:2, ...), you could add new XML regions, e.g.

	<bead id="100"> ... </bead>

and use -V bead_id for the pre-alignment in cwb-align.
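
Adding such regions by hand is tedious, so you would probably script it.  Here is a minimal sketch in Python, assuming the input is in one-token-per-line format with <s id="..."> and </s> on lines of their own, and that you already have a table listing the sentence IDs belonging to each bead.  The function name and the shape of the "beads" table are my own invention for this example.  You'd run it once per corpus, using the same bead IDs for corresponding groups on both sides.

```python
import re

# Assumption: sentence start tags have exactly this form, e.g. <s id="100">
S_OPEN = re.compile(r'<s id="([^"]+)">')

def wrap_beads(lines, beads):
    """Wrap groups of consecutive <s> regions in <bead id="..."> regions.

    lines: lines of the input file, one token or XML tag per line
    beads: dict mapping a bead ID to the list of sentence IDs it covers,
           in corpus order (the sentences of a bead must be adjacent)
    """
    first = {ids[0]: bead_id for bead_id, ids in beads.items()}
    last = {ids[-1] for ids in beads.values()}
    out = []
    current = None  # ID of the sentence most recently opened
    for line in lines:
        m = S_OPEN.match(line.strip())
        if m:
            current = m.group(1)
            if current in first:
                # this sentence starts a bead: open the region before it
                out.append(f'<bead id="{first[current]}">')
        out.append(line)
        if line.strip() == "</s>" and current in last:
            # this sentence ends a bead: close the region after it
            out.append("</bead>")
    return out
```

Sentences that are not listed in any bead are passed through unchanged, so they simply stay outside the pre-alignment.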


If you have a recent version of the CWB/Perl interface, the best strategy is to use the cwb-align-import tool.  You'll have to provide a separate alignment file that lists the sentence IDs in source and target corpus for each alignment bead.  Complex alignments require no special treatment with this tool.  See "perldoc cwb-align-import" for usage and format details.


Best,
Stefan

