[CWB] unicode problems with Greek and OCS

Ruprecht von Waldenfels ruprecht.waldenfels at gmx.net
Tue Mar 10 16:31:15 CET 2015


Dear Andrew,
I figured it out: I specified a different positional attribute to be 
used. Instead of word I specified id, which is just a number. This seems 
to work.
Best and thanks a lot!
Ruprecht



On 10.03.2015 at 16:22, Ruprecht von Waldenfels wrote:
> I have to say, solving this problem would be a very good start!
> However, I don't understand how to NOT specify these parameters. I've 
> tried turning them to 0, but that doesn't help.
> Best!
> Ruprecht
>
>
>
> On 10.03.2015 at 15:56, Hardie, Andrew wrote:
>> Well you could always tell the aligner not to use n-grams as a feature!
>>
>>  From the manfile:
>>
>>         -1:<weight>
>>             Specifies that the appearance of shared one-letter sequences
>>             within words in the two possibly-equivalent regions should be
>>             used as features for the similarity measurement, with the
>>             specified weight.
>>
>>             The configuration flags "-1, -2, -3, -4" all specify the use
>>             of letter sequences as features, and they all work in the
>>             same way; the following general comments apply to all four of
>>             these flags.
>>
>>             Sub-word letter-sequence matching allows the presence of
>>             similar but not identical words to count as a factor in
>>             similarity. Such words are often orthographic cognates that
>>             are likely to be translation equivalents and thus evidence
>>             that the pair of regions under analysis really are
>>             equivalent.  The longer the letter sequence, the more
>>             impressive the evidence (so you would normally weight "-4"
>>             more heavily than "-3", and so on; the default configuration
>>             (see below) does not include "-1" and "-2" at all).
>>
>>             When letter sequences are compared, the comparison is
>>             case-insensitive and diacritic-insensitive.
>>
>>             Only the letters "A" to "Z" are counted for the comparison;
>>             numbers, punctuation and any other symbol will be ignored.
>>             This means that the letter-sequence features are of no use at
>>             all, and should not be used, if either or both of the corpora
>>             is in a language that does not use the Latin alphabet.
>>
>>
>> ...
>>
>>
>>
>>         The default configuration (if no flags are specified) is 
>> "-C:1 -S:50:0.4 -3:3 -4:4".
>>
>> By not specifying a configuration you are, ergo, asking the aligner 
>> to use 3grams and 4grams.
>>
>> However, though that will solve your immediate problem, it doesn't 
>> solve the bug.
>>
>> best
>>
>> Andrew.
>>
>> -----Original Message-----
>> From: cwb-bounces at sslmit.unibo.it 
>> [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Ruprecht von 
>> Waldenfels
>> Sent: 10 March 2015 14:51
>> To: cwb at sslmit.unibo.it
>> Subject: Re: [CWB] unicode problems with Greek and OCS
>>
>> So is the mistake non-fatal, or could one make it non-fatal? As you 
>> pointed out, this is a hopeless task.
>> Ruprecht
>>
>> On 10.03.2015 at 15:46, Hardie, Andrew wrote:
>>> The n-grams are for spotting corresponding words. As explained in 
>>> the manfile, the program is designed for pairs like French-German 
>>> where the alphabet is the same and there are at least a smattering 
>>> of cognate words which will be similar if not identical.
>>>
>>> For Cyrillic vs Greek the n-grams buy you nothing.
>>>
>>> Andrew.
>>>
>>> -----Original Message-----
>>> From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it]
>>> On Behalf Of Ruprecht von Waldenfels
>>> Sent: 10 March 2015 14:43
>>> To: cwb at sslmit.unibo.it
>>> Subject: Re: [CWB] unicode problems with Greek and OCS
>>>
>>> All corpora are encoded as UTF8. This looks really strange. I tried 
>>> different normalizations for unicode, namely NFKD, NFC, NFD, but all 
>>> to no avail.
>>>
>>> What are the ngrams for? There is no word alignment, so it's all 
>>> about the alignment anchors - shouldn't they be independent of the 
>>> character set?
>>> Best,
>>> Ruprecht
>>>
>>>
>>> On 10.03.2015 at 14:09, Stefan Evert wrote:
>>>> One case in which this would happen is if the _source_ corpus is 
>>>> UTF-8, but the target corpus has some other encoding.  cwb-align 
>>>> obtains the encoding from the source corpus and doesn't bother to 
>>>> check it against the target corpus.
>>>>
>>>> At first I thought that this might be due to the fact that the 
>>>> character n-gram features are in fact n-grams of bytes (so they cut 
>>>> out invalid UTF-8 sequences), but only the full strings are passed 
>>>> to cl_string_canonical().  See lines 287, 296, 509 and 527 in 
>>>> utils/feature_maps.c.
>>>>
>>>> Cheers,
>>>> Stefan
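
Stefan's first suspicion, that n-grams taken over bytes would cut multi-byte characters apart, can be illustrated with a small Python sketch. This is not CWB code: the helper names byte_ngrams and is_valid_utf8 are made up for the illustration, and per the message above the real code passes only full strings to cl_string_canonical().

```python
import unicodedata


def byte_ngrams(text: str, n: int) -> list[bytes]:
    """Return every window of n consecutive bytes of the UTF-8 encoding."""
    data = text.encode("utf-8")
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def is_valid_utf8(chunk: bytes) -> bool:
    """True if the byte string decodes as UTF-8 without error."""
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Greek "logos" in NFC: five letters, each two bytes long in UTF-8.
word = unicodedata.normalize("NFC", "λόγος")

# A 3-byte window over 2-byte characters always starts or ends
# in the middle of a character, so none of them is valid UTF-8.
trigrams = byte_ngrams(word, 3)
print(len(trigrams), [is_valid_utf8(t) for t in trigrams])
```

With two-byte characters, an odd window length can never line up with character boundaries, which is why such fragments would be rejected by any routine that expects valid UTF-8.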
>>>> _______________________________________________
>>>> CWB mailing list
>>>> CWB at sslmit.unibo.it
>>>> http://devel.sslmit.unibo.it/mailman/listinfo/cwb
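
The manfile passage quoted by Andrew describes the letter-sequence features as case-insensitive, diacritic-insensitive and limited to the letters "A" to "Z". A minimal Python sketch of that description follows. It is not the cwb-align implementation; the function names are hypothetical, and it assumes that ignored characters are simply dropped before the n-grams are taken, which the manfile does not spell out.

```python
import unicodedata


def letter_ngrams(word: str, n: int) -> set[str]:
    """Letter n-grams of a word, using only A-Z after folding."""
    # Decompose and drop combining marks: diacritic-insensitive.
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Uppercase (case-insensitive) and keep only the letters A to Z.
    # Assumption: other characters are removed rather than breaking the sequence.
    letters = "".join(ch for ch in stripped.upper() if "A" <= ch <= "Z")
    return {letters[i:i + n] for i in range(len(letters) - n + 1)}


def shared_ngrams(a: str, b: str, n: int) -> set[str]:
    """N-grams that two words have in common."""
    return letter_ngrams(a, n) & letter_ngrams(b, n)


# German and French cognates share several 4-grams.
print(sorted(shared_ngrams("Universität", "université", 4)))

# Greek and Cyrillic words contain no A-Z letters, so they yield no features.
print(letter_ngrams("λόγος", 3), letter_ngrams("слово", 3))
```

Under these assumptions a Greek or Church Slavonic token contributes no n-gram features at all, which matches Andrew's remark that for Cyrillic vs Greek the n-grams buy you nothing.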


