[CWB] Partial realignment of a parallel corpus

José Manuel Martínez Martínez chozelinek at gmail.com
Sun Aug 4 11:18:54 CEST 2019


Thanks, Stefan!

José Manuel Martínez Martínez
https://chozelinek.github.io

On 02.08.19 10:34, Stefan Evert wrote:
>> I have a question for those with experience with parallel corpora. Say that I've spotted a mistake in the alignment of one text in a parallel corpus. Is it possible to import the correct alignment for that text only, using cwb-align-import? Or do I have to import the alignment for all the texts in the corpus?
> CWB annotation (including alignments) cannot be updated once it has been encoded.  You will have to fix the error in the alignment source and then re-encode the complete alignment attribute.
>
> (An exception to this rule is that cwb-s-encode allows you to update s-attributes, but that simply means it automatically merges the new data and re-encodes the attributes.)
>
>> Is it possible to dump the alignments already encoded somehow?
> At a low level, you can use cwb-align-decode to dump the alignment attribute as a sequence of region pairs. Then edit the file manually to adjust the corpus positions of the incorrect alignment and re-encode with cwb-align-encode, overwriting the previous data in the corpus.
>
> Alternatively, use cwb-align-export from the CWB/Perl package to export the alignment in terms of sets of sentence IDs.  Read the manpage (perldoc cwb-align-export) and work out how to construct appropriate sentence IDs that correspond to your original input file. After manually correcting the errors, you should be able to re-encode the alignment with cwb-align-import.
>
>> The thing is that the alignments used to create the file imported by cwb-align-import do not exist anymore. I'd like to avoid realigning the whole corpus, just to fix a few errors.
> In that case, I hope that you're going to make a backup copy of the corpus before fiddling with the alignment …
>
> Best,
> Stefan
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://liste.sslmit.unibo.it/mailman/listinfo/cwb
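
For the low-level route quoted above (dump the region pairs, correct the rows for the misaligned text, re-encode), the manual editing step could be scripted roughly as follows. This is only a sketch under assumptions: it supposes the dump is a text file with one tab-separated row per alignment bead whose first two fields are the source start and end corpus positions, which should be checked against the actual output of cwb-align-decode before use. The function names and the sample numbers are made up for illustration and are not part of CWB.

```python
from typing import List, Optional, Tuple


def source_region(line: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the source region, or None for non-data lines.

    ASSUMPTION: fields are tab-separated and the first two are integer
    corpus positions. Header or blank lines fail the conversion and are
    treated as non-data lines.
    """
    fields = line.split("\t")
    try:
        return int(fields[0]), int(fields[1])
    except (ValueError, IndexError):
        return None


def splice_alignment(lines: List[str], src_start: int, src_end: int,
                     corrected: List[str]) -> List[str]:
    """Replace every row whose source region lies inside
    [src_start, src_end] with the rows in `corrected`.

    Rows outside that span, and non-data lines, are kept unchanged and in
    their original order. The input is assumed to be sorted by source
    position.
    """
    result: List[str] = []
    inserted = False
    for line in lines:
        region = source_region(line)
        if region is None:
            result.append(line)  # keep header / blank lines as they are
            continue
        start, end = region
        if start >= src_start and end <= src_end:
            # Row belongs to the faulty text: drop it, and emit the
            # corrected rows once, at the position of the first dropped row.
            if not inserted:
                result.extend(corrected)
                inserted = True
            continue
        if not inserted and start > src_end:
            # No old row fell inside the span; keep the output sorted.
            result.extend(corrected)
            inserted = True
        result.append(line)
    if not inserted:
        result.extend(corrected)
    return result


# Hypothetical example: the text occupying source positions 5..14 was misaligned.
old_rows = ["0\t4\t0\t5", "5\t9\t6\t12", "10\t14\t13\t17", "15\t19\t18\t22"]
new_rows = ["5\t7\t6\t9", "8\t14\t10\t17"]
fixed_rows = splice_alignment(old_rows, 5, 14, new_rows)
```

The corrected rows still have to be worked out by hand (or by realigning just that one text), and the spliced file then goes back through cwb-align-encode as described in the reply. Working on a backup copy of the corpus, as advised, remains a good idea.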