<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta name="Generator" content="Microsoft Exchange Server">
<!-- converted from text --><style><!-- .EmailQuote { margin-left: 1pt; padding-left: 4pt; border-left: #800000 2px solid; } --></style>
</head>
<body>
<div>
<p dir="ltr">It isn't, because the MySQL data is always in UTF-8, even if the CWB index is in Latin-1...</p>
<p dir="ltr">Andrew.</p>
<div class="x_quote">On 31 Mar 2014 09:16, genereux &lt;genereux@clul.ul.pt&gt; wrote:<br type="attribution">
</div>
</div>
<font size="2"><span style="font-size:10pt;">
<div class="PlainText">I received feedback from the MariaDB technical team on this issue:<br>
<br>
"A case-insensitive, but accent-sensitive collation that is available <br>
in MariaDB is latin1_general_ci,<br>
<a href="http://collation-charts.org/mysql60/mysql604.latin1_general_ci.html">http://collation-charts.org/mysql60/mysql604.latin1_general_ci.html</a>.
<br>
But for unicode characters MariaDB does not have general accent <br>
sensitive collations."<br>
<br>
I've tested the latin1_general_ci collation on MariaDB (it should behave <br>
the same on MySQL) and it works as advertised.<br>
<br>
For lack of anything better, this may be a convenient temporary solution for <br>
some corpora.<br>
<br>
Best,<br>
<br>
Michel<br>
<br>
<br>
<br>
On Thu Mar 27 2014 17:03, genereux wrote:<br>
> The obvious explanation I can find for why there are so many collations<br>
> (German, Hungarian, Spanish ...) is that accent and case sensitivity<br>
> can be language-specific.<br>
> <br>
> Yet it seems to me that a collation offering ci and as across all<br>
> accented characters should be suitable for some, if not many, languages,<br>
> hence my surprise at not finding one ...<br>
> <br>
> Best regards,<br>
> <br>
> Michel<br>
> <br>
> On Thu Mar 27 2014 16:23, Hardie, Andrew wrote:<br>
>> The MariaDB collations are identical to the MySQL ones, as there have<br>
>> been no relevant changes since the fork.<br>
>> The new Firebird collations are a lot better, which I hadn't known;<br>
>> thanks for pointing it out. However, it is somewhat academic, since I<br>
>> am not about to port the whole thing to a Firebird backend!<br>
>> best<br>
>> Andrew.<br>
>> -----Original Message-----<br>
>> From: cwb-bounces@sslmit.unibo.it<br>
>> [<a href="mailto:cwb-bounces@sslmit.unibo.it">mailto:cwb-bounces@sslmit.unibo.it</a>] On Behalf Of Ciarán Ó Duibhín<br>
>> Sent: 27 March 2014 14:32<br>
>> To: Open source development of the Corpus WorkBench<br>
>> Subject: Re: [CWB] [CQPWeb] diacritics in CQPweb<br>
>> Browsing around, I see that Firebird 2.5 has UTF8 collations called<br>
>> UNICODE, UNICODE_CI and UNICODE_CI_AI (<br>
>> <a href="http://www.firebirdsql.org/file/documentation/reference_manuals/reference_material/html/langrefupd25-collations.html#langrefupd25-collations-unicode">
http://www.firebirdsql.org/file/documentation/reference_manuals/reference_material/html/langrefupd25-collations.html#langrefupd25-collations-unicode</a><br>
>> )<br>
>> For MariaDB, there are many collations containing "ci" in their<br>
>> names, but I can't see whether they are "ai" or "as" (<br>
>> <a href="https://mariadb.com/kb/en/supported-character-sets-and-collations/">https://mariadb.com/kb/en/supported-character-sets-and-collations/</a> )<br>
>> It looks like MySQL may have some catching up to do. I suppose there<br>
>> wouldn't be a repository of user-defined collations for MySQL?<br>
>> Ciarán Ó Duibhín<br>
>> ----- Original Message -----<br>
>> From: "Hardie, Andrew" &lt;a.hardie@lancaster.ac.uk&gt;<br>
>> To: "Open source development of the Corpus WorkBench" <br>
>> &lt;cwb@sslmit.unibo.it&gt;<br>
>> Sent: Thursday, March 27, 2014 12:11 AM<br>
>> Subject: Re: [CWB] [CQPWeb] diacritics in CQPweb<br>
>> <br>
>> <br>
>>> Not helpful at all, alas, as you missed the critical context that we <br>
>>> are<br>
>>> talking about the Unicode collations available *in MySQL*, on which <br>
>>> CQPweb<br>
>>> depends. These collations include one (utf8_bin) that does "level <br>
>>> 1", and<br>
>>> one (utf8_general_ci) which does "level 4", but nothing that does <br>
>>> "level<br>
>>> 3" or "level 2". That was why I was saying I would have to add one <br>
>>> myself.<br>
>>> See:<br>
>>> <a href="http://collation-charts.org/mysql60/mysql604.utf8_general_ci.european.html">
http://collation-charts.org/mysql60/mysql604.utf8_general_ci.european.html</a><br>
>>> (Note that as rotten as MySQL is on this front, so far as I can tell <br>
>>> other<br>
>>> RDBMSs are even worse, as they seem to link collations to OS <br>
>>> locales,<br>
>>> which is the last thing you want in this context)<br>
>>> best<br>
>>> Andrew.<br>
>>> -----Original Message-----<br>
>>> From: cwb-bounces@sslmit.unibo.it <br>
>>> [<a href="mailto:cwb-bounces@sslmit.unibo.it">mailto:cwb-bounces@sslmit.unibo.it</a>] On<br>
>>> Behalf Of Ciarán Ó Duibhín<br>
>>> Sent: 26 March 2014 18:00<br>
>>> To: Open source development of the Corpus WorkBench<br>
>>> Subject: Re: [CWB] [CQPWeb] diacritics in CQPweb<br>
>>> Apologies if this is not relevant, but I thought that unicode <br>
>>> sorting<br>
>>> recognized four "levels" in comparing two strings:<br>
>>> 1. account is taken of differences in accents, case and specials<br>
>>> 2. account is taken of differences in accents and case, but differences in specials are disregarded<br>
>>> 3. account is taken of differences in accents, but differences in case and specials are disregarded<br>
>>> 4. differences in accents, case and specials are disregarded<br>
>>> In these terms, what Michel wants is collation at level 3, but is <br>
>>> getting<br>
>>> collation at level 4.<br>
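<br>
As a rough illustration of the distinction being drawn here, the sketch below builds three comparison keys in Python, corresponding to what the thread calls CS/DS, CI/DS and CI/DI. The function names are made up for this example, and the code ignores "specials" altogether; it is not how MySQL or the Unicode Collation Algorithm actually computes weights.<br>
<pre>
```python
import unicodedata

def key_cs_ds(s):
    # Case-sensitive, diacritic-sensitive: compare the normalised string as it is.
    return unicodedata.normalize("NFC", s)

def key_ci_ds(s):
    # Case-insensitive, diacritic-sensitive: fold case but keep the accents.
    return unicodedata.normalize("NFC", s.casefold())

def key_ci_di(s):
    # Case-insensitive, diacritic-insensitive: decompose, drop combining
    # marks (the accents), then fold case.
    decomposed = unicodedata.normalize("NFD", s)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()

words = ["cote", "Côte", "côté"]
for w in words:
    print(w, key_cs_ds(w), key_ci_ds(w), key_ci_di(w))
```
</pre>
Two strings are "equal" under a given sensitivity when their keys match, so "Côte" and "côte" match under CI/DS while "côte" and "cote" match only under CI/DI.<br>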
>>> If the CWB developers have access to a "standard" collation <br>
>>> procedure, it<br>
>>> should take care of this requirement automatically, with the <br>
>>> additional<br>
>>> benefit that efficiency considerations can be left to the <br>
>>> implementors of<br>
>>> the standard procedure!<br>
>>> (Specials are non-alphabetic characters, including punctuation, <br>
>>> which may<br>
>>> be present in the strings.)<br>
>>> For more info, see <a href="http://en.wikipedia.org/wiki/ISO_14651">http://en.wikipedia.org/wiki/ISO_14651</a> or<br>
>>> <a href="http://www.unicode.org/reports/tr10/">http://www.unicode.org/reports/tr10/</a><br>
>>> I hope this is helpful,<br>
>>> Ciarán Ó Duibhín.<br>
>>> ----- Original Message -----<br>
>>> From: "Hardie, Andrew" &lt;a.hardie@lancaster.ac.uk&gt;<br>
>>> To: "Open source development of the Corpus WorkBench"<br>
>>> &lt;cwb@sslmit.unibo.it&gt;<br>
>>> Sent: Wednesday, March 26, 2014 5:16 PM<br>
>>> Subject: Re: [CWB] [CQPWeb] diacritics in CQPweb<br>
>>> <br>
>>> <br>
>>>> Unfortunately, at the moment as you say there is a choice between <br>
>>>> CS/DS<br>
>>>> and CI/DI, while for most linguistic purposes we want CI/DS. One of <br>
>>>> my<br>
>>>> planned developments is to introduce custom collations that can be <br>
>>>> loaded<br>
>>>> into MySQL that will allow CI/DS because I want it too! (I think I <br>
>>>> would<br>
>>>> have to define one from scratch based on automated mapping from the<br>
>>>> Unicode standard database UNIDATA.TXT).<br>
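<br>
To give an idea of what such an automated mapping might look like, here is a small sketch that derives a CI/DS equivalence table for a range of Latin letters. It assumes Python's bundled unicodedata module as a stand-in for the Unicode data files, and build_ci_ds_map is a hypothetical helper; it does not produce anything MySQL can load as a collation.<br>
<pre>
```python
import unicodedata

def build_ci_ds_map(first=0x00C0, last=0x017F):
    # Map each letter in the range to its lowercase form, so that letters
    # differing only in case share a value while accents stay distinct.
    table = {}
    for cp in range(first, last + 1):
        ch = chr(cp)
        if not unicodedata.category(ch).startswith("L"):
            continue  # skip non-letters such as the multiplication sign
        lower = ch.lower()
        if len(lower) == 1:
            # Letters whose lowercase form is more than one character
            # are left out of this simple sketch.
            table[ch] = lower
    return table

ci_ds = build_ci_ds_map()
print(ci_ds["É"], ci_ds["é"], ci_ds["Ê"])
```
</pre>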
>>>> However, I need to find out first how this will affect performance. <br>
>>>> I<br>
>>>> have<br>
>>>> tried to find out whether using a custom, rather than built-in, <br>
>>>> collation<br>
>>>> affects MySQL performance (and also what effect the complexity of <br>
>>>> the<br>
>>>> custom collation has), but cannot find much online about it. So I <br>
>>>> will<br>
>>>> need to take time to do some empirical experimentation at some <br>
>>>> point.<br>
>>>> So ---- if anyone has any info or experience about MySQL custom<br>
>>>> collations<br>
>>>> that would be very useful.<br>
>>>> best<br>
>>>> Andrew.<br>
>>>> -----Original Message-----<br>
>>>> From: cwb-bounces@sslmit.unibo.it <br>
>>>> [<a href="mailto:cwb-bounces@sslmit.unibo.it">mailto:cwb-bounces@sslmit.unibo.it</a>] On<br>
>>>> Behalf Of genereux<br>
>>>> Sent: 26 March 2014 10:32<br>
>>>> To: Open source development of the Corpus WorkBench<br>
>>>> Subject: [CWB] [CQPWeb] diacritics in CQPweb<br>
>>>> Hi,<br>
>>>> Here's an issue concerning diacritics in CQPweb.<br>
>>>> CQPweb stores frequency lists in mysql. Since there are no<br>
>>>> case-insensitive diacritic-sensitive collations currently available <br>
>>>> in<br>
>>>> mysql, a frequency list merges tokens/characters as follows:<br>
>>>> [e,é,É,Ê,E, ...] [o,ò,ó,Ô,O, ...] ...<br>
>>>> What we want is:<br>
>>>> [e,E] [é,É] [Ê,ê] [o,O] [ò,Ò] [ó,Ó] ...<br>
>>>> We can take care of the case-insensitivity programmatically outside<br>
>>>> CQPweb/mysql by turning to lowercase records before they enter the <br>
>>>> DB<br>
>>>> table. Tables holding frequency lists are then declared as 'collate<br>
>>>> utf8_bin', which takes care of diacritic-sensitivity.<br>
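<br>
The workaround just described can be sketched as follows, with a Python Counter standing in for the frequency table. The helper names are invented for this example, and the NFC normalisation step is an added assumption (it is there so that precomposed and decomposed forms of the same letter end up byte-identical, which a binary collation would otherwise treat as different).<br>
<pre>
```python
import unicodedata
from collections import Counter

def normalise(token):
    # Normalise to NFC, then lowercase, before the token enters the table.
    return unicodedata.normalize("NFC", token).lower()

def frequency_list(tokens):
    # Exact (binary) comparison of the lowercased tokens gives
    # case-insensitive, diacritic-sensitive grouping.
    return Counter(normalise(t) for t in tokens)

tokens = ["e", "E", "é", "É", "ê", "Ê", "o", "ò", "Ó"]
freq = frequency_list(tokens)
for form, count in sorted(freq.items()):
    print(form, count)
```
</pre>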
>>>> I am wondering if people involved with corpora for languages other <br>
>>>> than<br>
>>>> English have dealt with this issue in some other (more elegant) <br>
>>>> way?<br>
>>>> Thank you,<br>
>>>> Michel Généreux<br>
>>>> <br>
>>>> _______________________________________________<br>
>>>> CWB mailing list<br>
>>>> CWB@sslmit.unibo.it<br>
>>>> <a href="http://devel.sslmit.unibo.it/mailman/listinfo/cwb">http://devel.sslmit.unibo.it/mailman/listinfo/cwb</a><br>
</div>
</span></font>
</body>
</html>