[CWB] Getting [UNREADABLE] in cqpweb
Jörg Knappen
j.knappen at mx.uni-saarland.de
Thu Mar 17 15:26:01 CET 2022
On 2022-03-16 22:42, Hardie, Andrew wrote:
> Hi Jörg,
>
> The [UNREADABLE] happens when the regular expression used to split up
> the word form and the tag in the CQP output fails.
>
> Given this is associated with NBSPs, it seems likely it is those that
> are causing the regex failure.
>
> The issue is not NBSP particularly, I think, but the fact of any
> space. (I don't think there's any particular need to put NBSP in the
> vrt file; a normal space will do the job just as well.)
>
> The regular expression in question is defined on l. 276 of
> environment.php:
>
> define('CQP_INTERFACE_WORD_REGEX', '|((<\S+?( [^>]*?)?>)*)([^ <]+)((</\S+?>)*) ?|');
>
> You might be able to improve matters by hacking about with that, but I
> am not sure how. The central bit which defines a token is this:
>
> ([^ <]+)
>
> I.e., a string of anything except space and <.
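
To make the token pattern concrete, here is a minimal Python sketch that applies the body of the quoted expression to a made-up output line. The sample line and the helper name split_tokens are assumptions for illustration only; how CQPweb itself applies the expression in PHP is not shown in this thread, and the sketch does not attempt to reproduce the NBSP case.

```python
import re

# Body of CQP_INTERFACE_WORD_REGEX as quoted above, without the '|'
# delimiters that PHP's preg functions require.
WORD_REGEX = re.compile(r'((<\S+?( [^>]*?)?>)*)([^ <]+)((</\S+?>)*) ?')


def split_tokens(line):
    """Return group 4 (the token part) of every match in the line.

    Hypothetical helper: it only illustrates what the pattern treats
    as one token, not how CQPweb consumes the matches.
    """
    return [m.group(4) for m in WORD_REGEX.finditer(line)]


# Leading and trailing XML tags are absorbed by groups 1 and 5,
# and every ordinary space ends a token.
print(split_tokens("the <s>quick fox</s> jumps"))
# ['the', 'quick', 'fox', 'jumps']

# A single annotated token that contains an ordinary space is therefore
# cut in two.
print(split_tokens("New York"))
# ['New', 'York']
```

Because ([^ <]+) stops at the first ordinary space, a multi-word token with a space inside cannot come back as one unit from this pattern.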
>
> Another possible solution is to hack line 1279 of concordance-lib.php
>
> $word = '[UNREADABLE]';
>
> Changing it to
>
> $word = escape_html($cqp_source_string);
>
> might help.
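
The idea behind this second hack can be sketched as follows, again in Python rather than PHP: when the word/tag split fails, show the escaped raw string instead of a placeholder. The pattern WORD_TAG, the "word/TAG" input shape, and the function render_word are hypothetical stand-ins, since the actual code around line 1279 of concordance-lib.php is not quoted here; html.escape plays the role of CQPweb's escape_html.

```python
import html
import re

# Hypothetical stand-in for the word/tag split: "word/TAG" with no
# space or '<' inside the word. Not CQPweb's real expression.
WORD_TAG = re.compile(r'([^ <]+)/([^ /<]+)')


def render_word(cqp_source_string, show_raw_on_failure=True):
    """Return the HTML-safe word to display for one piece of CQP output."""
    m = WORD_TAG.fullmatch(cqp_source_string)
    if m:
        # Normal case: the split worked, show the word form only.
        return html.escape(m.group(1))
    if show_raw_on_failure:
        # Behaviour after the suggested change: show the escaped source.
        return html.escape(cqp_source_string)
    # Behaviour before the change: a fixed placeholder.
    return '[UNREADABLE]'
```

With this fallback the reader sees the unsplit source text, including any annotation attached to it, which is less tidy than a clean word form but more useful than the placeholder.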
>
> Sorry about this, it's the result of a series of complexities that
> couldn't be solved without radical changes to concordance rendering,
> which happened between 3.2 and 3.3.
>
> V3.3 relies on the new CQP settings "AttributeSeparator" and
> "TokenSeparator", rather than the above regex, which helps avoid a lot
> of these problems. (Except for concordance download which still uses
> the old regular expression method - that is on my TODO list of
> course.)
>
> best
>
> Andrew.
>
>
> -----Original Message-----
> From: cwb-bounces at sslmit.unibo.it <cwb-bounces at sslmit.unibo.it> On
> Behalf Of Jörg Knappen
> Sent: 16 March 2022 14:09
> To: cwb at sslmit.unibo.it
> Subject: [CWB] Getting [UNREADABLE] in cqpweb
>
>
Thanks Andrew,
I applied the second hack (changing line 1279 of concordance-lib.php) and it works fine for our Kyrgyz corpora.
Jörg Knappen
> Hi all,
>
> I am getting a lot of "[UNREADABLE]" in cqpweb in a corpus I have. I
> have traced some instances of it to tokens containing a
> no-break-space.
>
> The input line from the vrt file looks like
>
> Кыргыз Республикасынын Кыргыз Республикасы np_top_gen
>
> and it goes well with cqp from the command line, e.g., in a query like
> [word=".*Республикасы"] giving results like
>
> 1367195: л алынып , анын алкагында <Кыргыз Республикасынын> «
> Насыялык маалыма
>
> However, in cqpweb I get for the query [word=".*Республикасынын"]
> [word="«"] [word="Насыялык"]
>
> жылдын 22 - июлундагы № 85 Мыйзамы [UNREADABLE] алынып , анын алкагында
> [UNREADABLE] Республикасынын « Насыялык маалымат алмашуу
> жөнүндө » Мыйзамына да өзгөртүүлөр киргизилген . - Жогоруда
> көрсөтүлгөн
>
> cqpweb is version CQPweb v3.2.42 © 2008-2020 (still, we update rather
> infrequently).
>
> What can I do here? Use something different from a no-break-space,
> like an underscore? Or is this a bug in cqpweb?
>
> Greetings from Saarbrücken,
>
> Jörg Knappen
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://liste.sslmit.unibo.it/mailman/listinfo/cwb