[CWB] unicode problems with Greek and OCS
Ruprecht von Waldenfels
ruprecht.waldenfels at gmx.net
Tue Mar 10 16:31:15 CET 2015
Dear Andrew,
I figured it out. I told the aligner to use a different positional
attribute: instead of word I specified id, which is just a number. This
seems to work.
Best and thanks a lot!
Ruprecht
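
For readers of the archive: the manfile excerpt quoted below says that the letter-sequence features fold case and diacritics and count only the letters "A" to "Z". The following is a minimal Python sketch of that rule, not the actual cwb-align code; the function names letter_ngrams and shared_ngrams are hypothetical and exist only for this illustration. It shows why such features can find something to match in a French-German pair but produce nothing at all for Greek or Cyrillic words, or for a purely numeric attribute such as id.

```python
import unicodedata


def letter_ngrams(word, n):
    """Return the set of letter n-grams of `word`, counting only A-Z.

    Illustration only: case is folded, diacritics are dropped by
    decomposing to NFD and discarding everything outside A-Z, as the
    manfile text describes. This is an assumption about the behaviour,
    not a port of the CWB source.
    """
    decomposed = unicodedata.normalize("NFD", word)
    letters = "".join(c for c in decomposed.upper() if "A" <= c <= "Z")
    return {letters[i:i + n] for i in range(len(letters) - n + 1)}


def shared_ngrams(a, b, n):
    """Letter n-grams that two words have in common."""
    return letter_ngrams(a, n) & letter_ngrams(b, n)


if __name__ == "__main__":
    # Latin-alphabet cognates share several 4-grams.
    print(sorted(shared_ngrams("Université", "Universität", 4)))
    # Greek, Cyrillic and numeric tokens contribute no letter n-grams.
    for token in ("λόγος", "слово", "12345"):
        print(token, letter_ngrams(token, 3))
```

Under these assumptions, a non-Latin corpus pair gives the n-gram features nothing to compare, which matches Andrew's remark further down that for Cyrillic vs Greek the n-grams buy you nothing.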
On 10.03.2015 at 16:22, Ruprecht von Waldenfels wrote:
> I have to say, solving this problem would be a very good start!
> However, I don't understand how NOT to specify these parameters. I've
> tried setting them to 0, but that doesn't help.
> Best!
> Ruprecht
>
>
>
> On 10.03.2015 at 15:56, Hardie, Andrew wrote:
>> Well you could always tell the aligner not to use n-grams as a feature!
>>
>> From the manfile:
>>
>> -1:<weight>
>> Specifies that the appearance of shared one-letter
>> sequences within words in the two possibly-equivalent regions should
>> be used as features for the similarity measurement, with the
>> specified weight.
>>
>> The configuration flags "-1, -2, -3, -4" all specify the
>> use of letter sequences as features, and they all work in the same
>> way; the following general comments apply to all four of these flags.
>>
>> Sub-word letter-sequence matching allows the presence of
>> similar but not identical words to count as a factor in similarity.
>> Such words are often orthographic cognates that are likely to be
>> translation equivalents and
>> thus evidence that the pair of regions under analysis
>> really are equivalent. The longer the letter sequence, the more
>> impressive the evidence (so you would normally weight "-4" more
>> heavily than "-3", and so on; the
>> default configuration (see below) does not include "-1"
>> and "-2" at all).
>>
>> When letter sequences are compared, the comparison is
>> case-insensitive and diacritic-insensitive.
>>
>> Only the letters "A" to "Z" are counted for the
>> comparison; numbers, punctuation and any other symbol will be
>> ignored. This means that the letter-sequence features are of no use
>> at all, and should not be used, if either
>> or both of the corpora is in a language that does not use
>> the Latin alphabet.
>>
>>
>> ...
>>
>>
>>
>> The default configuration (if no flags are specified) is
>> "-C:1 -S:50:0.4 -3:3 -4:4".
>>
>> By not specifying a configuration you are, ergo, asking the aligner
>> to use 3grams and 4grams.
>>
>> However, though that will solve your immediate problem, it doesn't
>> solve the bug.
>>
>> best
>>
>> Andrew.
>>
>> -----Original Message-----
>> From: cwb-bounces at sslmit.unibo.it
>> [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Ruprecht von
>> Waldenfels
>> Sent: 10 March 2015 14:51
>> To: cwb at sslmit.unibo.it
>> Subject: Re: [CWB] unicode problems with Greek and OCS
>>
>> So is the mistake non-fatal, or could one make it non-fatal? As you
>> pointed out, this is a hopeless task.
>> Ruprecht
>> On 10.03.2015 at 15:46, Hardie, Andrew wrote:
>>> The n-grams are for spotting corresponding words. As explained in
>>> the manfile, the program is designed for pairs like French-German
>>> where the alphabet is the same and there are at least a smattering
>>> of cognate words which will be similar if not identical.
>>>
>>> For Cyrillic vs Greek the n-grams buy you nothing.
>>>
>>> Andrew.
>>>
>>> -----Original Message-----
>>> From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it]
>>> On Behalf Of Ruprecht von Waldenfels
>>> Sent: 10 March 2015 14:43
>>> To: cwb at sslmit.unibo.it
>>> Subject: Re: [CWB] unicode problems with Greek and OCS
>>>
>>> All corpora are encoded as UTF8. This looks really strange. I tried
>>> different normalizations for unicode, namely NFKD, NFC, NFD, but all
>>> to no avail.
>>>
>>> What are the ngrams for? There is no word alignment, so it's all
>>> about the alignment anchors - shouldn't they be independent of the
>>> character set?
>>> Best,
>>> Ruprecht
>>>
>>>
>>> On 10.03.2015 at 14:09, Stefan Evert wrote:
>>>> One case in which this would happen is if the _source_ corpus is
>>>> UTF-8, but the target corpus has some other encoding. cwb-align
>>>> obtains the encoding from the source corpus and doesn't bother to
>>>> check it against the target corpus.
>>>>
>>>> At first I thought that this might be due to the fact that the
>>>> character n-gram features are in fact n-grams of bytes (so they cut
>>>> out invalid UTF-8 sequences), but only the full strings are passed
>>>> to cl_string_canonical(). See lines 287, 296, 509 and 527 in
>>>> utils/feature_maps.c.
>>>>
>>>> Cheers,
>>>> Stefan
>>>> _______________________________________________
>>>> CWB mailing list
>>>> CWB at sslmit.unibo.it
>>>> http://devel.sslmit.unibo.it/mailman/listinfo/cwb