[CWB] Getting [UNREADABLE] in cqpweb
Hardie, Andrew
a.hardie at lancaster.ac.uk
Wed Mar 16 22:42:08 CET 2022
Hi Jörg,
The [UNREADABLE] happens when the regular expression used to split up the word form and the tag in the CQP output fails.
Given this is associated with NBSPs, it seems likely it is those that are causing the regex failure.
The issue is not NBSP particularly, I think, but the presence of any space inside a token. (I don't think there's any particular need to put NBSP in the vrt file; a normal space will do the job just as well.)
The regular expression in question is defined on l. 276 of environment.php:
define('CQP_INTERFACE_WORD_REGEX', '|((<\S+?( [^>]*?)?>)*)([^ <]+)((</\S+?>)*) ?|');
You might be able to improve matters by hacking about with that, but I am not sure how. The central bit which defines a token is this:
([^ <]+)
I.e., a string of anything except space and <.
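To see what that central group does with a space inside a token, here is a minimal sketch that ports the pattern to Python's re module (the PHP "|" delimiters are dropped). The helper name tokens_seen is made up for illustration. Python matches on Unicode characters, whereas PHP without the "u" modifier matches on bytes, so this only illustrates the ordinary-space case and says nothing definite about how CQPweb itself treats NBSP.

```python
import re

# Python port of CQP_INTERFACE_WORD_REGEX from environment.php
# (pattern copied from the define() above, without the PHP '|' delimiters).
WORD_REGEX = re.compile(r'((<\S+?( [^>]*?)?>)*)([^ <]+)((</\S+?>)*) ?')


def tokens_seen(line):
    """Return what the central ([^ <]+) group captures for each match.

    Hypothetical helper for illustration; not part of CQPweb.
    """
    return [m.group(4) for m in WORD_REGEX.finditer(line)]


# A token wrapped in XML tags is extracted cleanly.
print(tokens_seen('<b>word</b> next'))        # ['word', 'next']

# One corpus token that contains an ordinary space is cut in two.
print(tokens_seen('Кыргыз Республикасынын'))  # ['Кыргыз', 'Республикасынын']
```

Anything downstream that expects one match per corpus token will then be out of step with the annotation attached to that token.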
Another possible solution is to hack line 1279 of concordance-lib.php:
$word = '[UNREADABLE]';
Changing it to
$word = escape_html($cqp_source_string);
might help.
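The effect of that change can be sketched as follows. This is a Python illustration, not CQPweb code: render_token, the WORD_TAG pattern and the word/tag layout are all assumptions made for the example, and html.escape merely stands in for escape_html.

```python
import html
import re

# Hypothetical word/tag splitter; the real pattern in concordance-lib.php may differ.
WORD_TAG = re.compile(r'^([^ ]+)/([^ /]+)$')


def render_token(cqp_source_string, show_raw_on_failure=True):
    """Return HTML for one token, with a fallback when the split fails."""
    m = WORD_TAG.match(cqp_source_string)
    if m:
        return html.escape(m.group(1))
    if show_raw_on_failure:
        # Analogue of: $word = escape_html($cqp_source_string);
        return html.escape(cqp_source_string)
    # Analogue of the existing line: $word = '[UNREADABLE]';
    return '[UNREADABLE]'


print(render_token('cat/NN'))                              # cat
print(render_token('a b/NN', show_raw_on_failure=False))   # [UNREADABLE]
print(render_token('a b/NN'))                              # a b/NN
```

Note that with the fallback the raw string is shown as CQP delivered it, so any annotation that would normally be split off stays visible.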
Sorry about this; it's the result of a series of complexities that couldn't be solved without radical changes to concordance rendering, which happened between 3.2 and 3.3.
V3.3 relies on the new CQP settings "AttributeSeparator" and "TokenSeparator", rather than the above regex, which helps avoid a lot of these problems. (The exception is concordance download, which still uses the old regular expression method; that is on my TODO list, of course.)
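Why explicit separators help can be shown with a small sketch. The separator values below (tab between tokens, U+001F between attributes), the function parse_line and the "punct" tag are assumptions chosen for illustration, not CQPweb's actual settings or data. The point is only that splitting on characters that cannot occur inside a token makes spaces in word forms harmless.

```python
# Assumed separator values; what CQPweb v3.3 actually configures is not stated here.
TOKEN_SEPARATOR = '\t'
ATTRIBUTE_SEPARATOR = '\x1f'


def parse_line(line):
    """Split a concordance line into (word, tag) pairs using explicit separators."""
    pairs = []
    for token in line.split(TOKEN_SEPARATOR):
        # partition() splits at the first separator only; the word keeps any spaces.
        word, _, tag = token.partition(ATTRIBUTE_SEPARATOR)
        pairs.append((word, tag))
    return pairs


# First token contains a no-break space (U+00A0); it survives intact.
line = 'Кыргыз\u00a0Республикасынын\x1fnp_top_gen\t«\x1fpunct'
print(parse_line(line))
```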
best
Andrew.
-----Original Message-----
From: cwb-bounces at sslmit.unibo.it <cwb-bounces at sslmit.unibo.it> On Behalf Of Jörg Knappen
Sent: 16 March 2022 14:09
To: cwb at sslmit.unibo.it
Subject: [CWB] Getting [UNREADABLE] in cqpweb
Hi all,
I am getting a lot of "[UNREADABLE]" in cqpweb in a corpus I have. I have traced some instances of it to tokens containing a no-break-space.
The input line from the vrt file looks like
Кыргыз Республикасынын Кыргыз Республикасы np_top_gen
and it goes well with cqp from the command line, e.g., in a query like [word=".*Республикасы"] giving results like
1367195: л алынып , анын алкагында <Кыргыз Республикасынын> « Насыялык маалыма
However, in cqpweb I get for the query [word=".*Республикасынын"] [word="«"] [word="Насыялык"]
жылдын 22 - июлундагы № 85 Мыйзамы [UNREADABLE] алынып , анын алкагында
[UNREADABLE] Республикасынын « Насыялык маалымат алмашуу жөнүндө » Мыйзамына да өзгөртүүлөр киргизилген . - Жогоруда көрсөтүлгөн
cqpweb is version CQPweb v3.2.42 © 2008-2020 (still, we update rather infrequently).
What can I do here? Use something different from a no-break-space, like an underscore? Or is this a bug in cqpweb?
Greetings from Saarbrücken,
Jörg Knappen