[CWB] Getting [UNREADABLE] in cqpweb

Thu Mar 17 15:26:01 CET 2022

Am 2022-03-16 22:42, schrieb Hardie, Andrew:
> Hi Jörg,
> 
> The [UNREADABLE] happens when the regular expression used to split up
> the word form and the tag in the CQP output fails.
> 
> Given this is associated with NBSPs, it seems likely it is those that
> are causing the regex failure.
> 
> The issue is not NBSP in particular, I think, but the presence of any
> space inside a token. (I don't think there's any particular need to put
> NBSP in the vrt file. A normal space will do the job just as well.)
> 
> The regular expression in question is defined on l. 276 of 
> environment.php:
> 
> define('CQP_INTERFACE_WORD_REGEX', '|((<\S+?( [^>]*?)?>)*)([^ <]+)((</\S+?>)*) ?|');
> 
> You might be able to improve matters by hacking about with that, but I
> am not sure how. The central bit which defines a token is this:
> 
> ([^ <]+)
> 
> I.e., a string of anything except a space or <.
> 
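To see what the token group does in isolation, here is a minimal sketch that applies the regex quoted above to two illustrative strings. The helper token_parts() is hypothetical and not part of CQPweb, and the input strings are made up for the demonstration rather than taken from real CQP output.

```php
<?php
// Regex and constant name as quoted from environment.php in this thread.
define('CQP_INTERFACE_WORD_REGEX', '|((<\S+?( [^>]*?)?>)*)([^ <]+)((</\S+?>)*) ?|');

// Hypothetical helper (not a CQPweb function): return everything that
// capture group 4, the token part ([^ <]+), matches in the input string.
function token_parts(string $s): array
{
    preg_match_all(CQP_INTERFACE_WORD_REGEX, $s, $m);
    return $m[4];
}

// A token that contains an ordinary space is cut in two at the space.
$split = token_parts('Кыргыз Республикасынын');
// $split is ['Кыргыз', 'Республикасынын']

// Leading and trailing XML tags land in other groups, not in the token.
$tagged = token_parts('<s>Hello</s> world');
// $tagged is ['Hello', 'world']
```

This only shows the behaviour of the pattern itself; how CQPweb 3.2 uses the match afterwards is not covered here.
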
> Another possible solution is to hack line 1279 of concordance-lib.php
> 
> 			$word = '[UNREADABLE]';
> 
> Changing it to
> 
> 			$word = escape_html($cqp_source_string);
> 
> might help.
> 
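The idea behind this second hack can be sketched as a fallback: when the word/tag split fails, show the escaped raw string instead of a placeholder. Everything below is an assumption for illustration only: split_word_tag(), render_word() and escape_html_sketch() are hypothetical names, the "word/tag" shape is assumed, and the real code around line 1279 of concordance-lib.php is not shown in this thread.

```php
<?php
// Hypothetical stand-in for CQPweb's escape_html(); the real function's
// implementation is not quoted in this thread.
function escape_html_sketch(string $s): string
{
    return htmlspecialchars($s, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

// Hypothetical splitter: assumes the shape "word/tag", with the tag after
// the last slash. Returns null when the string does not have that shape.
function split_word_tag(string $cqp_source_string): ?array
{
    if (preg_match('|^(.+)/([^/]+)$|s', $cqp_source_string, $m) !== 1) {
        return null;
    }
    return [$m[1], $m[2]];
}

// Hypothetical renderer illustrating the fallback.
function render_word(string $cqp_source_string): string
{
    $split = split_word_tag($cqp_source_string);
    if ($split === null) {
        // Instead of the fixed placeholder '[UNREADABLE]',
        // fall back to the escaped source string.
        return escape_html_sketch($cqp_source_string);
    }
    return escape_html_sketch($split[0]);
}
```

With this fallback the reader sees the original token text, HTML-escaped, wherever the split does not succeed.
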
> Sorry about this, it's the result of a series of complexities that
> couldn't be solved without radical changes to concordance rendering,
> which happened between 3.2 and 3.3.
> 
> V3.3 relies on the new CQP settings "AttributeSeparator" and
> "TokenSeparator", rather than the above regex, which helps avoid a lot
> of these problems. (Except for concordance download which still uses
> the old regular expression method - that is on my TODO list of
> course.)
> 
> best
> 
> Andrew.
> 
> 
> -----Original Message-----
> From: cwb-bounces at sslmit.unibo.it <cwb-bounces at sslmit.unibo.it> On
> Behalf Of Jörg Knappen
> Sent: 16 March 2022 14:09
> To: cwb at sslmit.unibo.it
> Subject: [CWB] Getting [UNREADABLE] in cqpweb
> 
> 
Thanks Andrew,

I applied the second hack (the change to line 1279 of concordance-lib.php)
and it works fine for our Kyrgyz corpora.

Jörg Knappen

> Hi all,
> 
> I am getting a lot of "[UNREADABLE]" in cqpweb in a corpus I have. I
> have traced some instances of it to tokens containing a
> no-break-space.
> 
> The input line from the vrt file looks like
> 
> Кыргыз Республикасынын  Кыргыз Республикасы     np_top_gen
> 
> and it works fine with cqp from the command line, e.g., in a query like
> [word=".*Республикасы"] giving results like
> 
>    1367195: л алынып , анын алкагында <Кыргыз Республикасынын> «
> Насыялык маалыма
> 
> However, in cqpweb I get for the query [word=".*Республикасынын"]
> [word="«"] [word="Насыялык"]
> 
> жылдын 22 - июлундагы № 85 Мыйзамы [UNREADABLE] алынып , анын алкагында
>         [UNREADABLE] Республикасынын « Насыялык маалымат алмашуу
> жөнүндө » Мыйзамына да өзгөртүүлөр киргизилген . - Жогоруда
> көрсөтүлгөн
> 
> cqpweb is version CQPweb v3.2.42 © 2008-2020 (still, we update rather
> infrequently).
> 
> What can I do here? Use something different from a no-break-space,
> like an underscore? Or is this a bug in cqpweb?
> 
> Greetings from Saarbrücken,
> 
> Jörg Knappen
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://liste.sslmit.unibo.it/mailman/listinfo/cwb