[CWB] Partial realignment of a parallel corpus

Stefan Evert stefanML at collocations.de
Fri Aug 2 10:34:58 CEST 2019


> I have a question for those with experience with parallel corpora. Say that I've spotted in a parallel corpus a mistake in the alignment of one text. Is it possible to import the right alignment only for that text using cwb-align-import? Or do I have to import the alignment for all the texts in the corpus?

CWB annotation (including alignments) cannot be updated once it has been encoded.  You will have to fix the error in the alignment source and then re-encode the complete alignment attribute.

(An exception to this rule is that cwb-s-encode allows you to update s-attributes, but that simply means it automatically merges the new data and re-encodes the attributes.)

> Is it possible to dump the alignments already encoded somehow?

At a low level, you can use cwb-align-decode to dump the alignment attribute as a sequence of region pairs. Then edit the file manually to adjust the corpus positions of the incorrect alignment and re-encode with cwb-align-encode, overwriting the previous data in the corpus.
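
The manual editing step could be scripted along the following lines. This is only a rough sketch, not part of CWB: it assumes the dump has the same text layout as the output of cwb-align, i.e. one header line followed by tab-separated lines of the form "source start, source end, target start, target end, type, optional quality". Check this against your actual file before relying on it. The names parse_bead, format_bead and replace_beads are made up for illustration.

```python
from typing import Iterable, List, Tuple

# One alignment bead: source start/end, target start/end (corpus positions),
# plus any remaining fields (e.g. type and quality), kept as opaque strings.
Bead = Tuple[int, int, int, int, List[str]]


def parse_bead(line: str) -> Bead:
    """Parse one data line of the (assumed) tab-separated dump format."""
    fields = line.split()
    s1, s2, t1, t2 = (int(x) for x in fields[:4])
    return (s1, s2, t1, t2, fields[4:])


def format_bead(bead: Bead) -> str:
    """Turn a bead back into a tab-separated line."""
    s1, s2, t1, t2, extra = bead
    return "\t".join([str(s1), str(s2), str(t1), str(t2), *extra])


def replace_beads(lines: List[str], src_from: int, src_to: int,
                  corrected: Iterable[Bead]) -> List[str]:
    """Drop every bead whose source region overlaps [src_from, src_to]
    and put the corrected beads in its place.

    lines[0] is assumed to be the header line and is passed through unchanged.
    """
    header = lines[0]
    body = [parse_bead(l) for l in lines[1:] if l.strip()]
    # Keep only beads that lie entirely before or entirely after the range.
    kept = [b for b in body if b[1] < src_from or b[0] > src_to]
    merged = sorted(kept + list(corrected), key=lambda b: b[0])
    # Sanity check (assumed requirement): regions must not overlap and must
    # appear in increasing order in both source and target corpus.
    for prev, cur in zip(merged, merged[1:]):
        if cur[0] <= prev[1] or cur[2] <= prev[3]:
            raise ValueError(f"overlapping or unordered beads: {prev} / {cur}")
    return [header] + [format_bead(b) for b in merged]
```

Write the returned lines to a new file and re-encode it with cwb-align-encode as described above. The corrected corpus positions still have to be worked out by hand for the affected text.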

Alternatively, use cwb-align-export from the CWB/Perl package to export the alignment in terms of sets of sentence IDs.  Read the manpage (perldoc cwb-align-export) and work out how to construct appropriate sentence IDs that correspond to your original input file. After manually correcting the errors, you should be able to re-encode the alignment with cwb-align-import.

> The thing is that the alignments used to create the file imported by cwb-align-import do not exist anymore. I'd like to avoid realigning the whole corpus, just to fix a few errors.

In that case, I hope that you're going to make a backup copy of the corpus before fiddling with the alignment … 

Best,
Stefan

