[CWB] [cwb:bugs] #74 Inconsistency in CQP regexp matching

Andrew Hardie andrewhardie at users.sourceforge.net
Mon Feb 14 09:53:15 CET 2022


1.  My vote is to add `$` and `^` to the list, backwards compatibility be damned.
2.  No, that's the whole set. Other characters with special meaning in PCRE only become special in combination with one of those (e.g. `:` is only special after `?`).
3.  No, there's another case of the same thing in `do_XMLTag()`. My inclination would be to define `CHARS_MAKING_REGEX_NONLITERAL` as a macro used in both locations.
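
A minimal sketch of what point 3 could look like. Only the macro name `CHARS_MAKING_REGEX_NONLITERAL` and the character list come from this thread; the helper `string_is_literal()` is a hypothetical name for illustration, and the real checks in `do_flagged_string()` and `do_XMLTag()` may be structured differently.

```c
#include <assert.h>
#include <string.h>

/* Existing list from the ticket, extended with '$' and '^' as proposed in point 1. */
#define CHARS_MAKING_REGEX_NONLITERAL "[](){}.*+|?\\$^"

/* Hypothetical shared helper: returns 1 if the string contains none of the
 * listed characters and can therefore be compared as a literal string,
 * 0 if it has to go through the regexp matcher. */
static int string_is_literal(const char *s)
{
  assert(s != NULL);
  return strpbrk(s, CHARS_MAKING_REGEX_NONLITERAL) == NULL;
}
```

With a single definition, both call sites would pick up any future change to the list, so the two checks could not drift apart.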




---

**[bugs:#74] Inconsistency in CQP regexp matching**

**Status:** open
**Group:** TODO-3.5
**Created:** Sat Feb 12, 2022 09:21 AM UTC by Stefan Evert
**Last Updated:** Sat Feb 12, 2022 09:21 AM UTC
**Owner:** Andrew Hardie


The query `[pos="PP$"]` matches the Penn tag `PP$`, but queries `[pos="PP$"%c]` and `[pos="PP$|PP$"]` match `PP` instead.

The reason, of course, is that the first query is matched as a literal string rather than a regexp, so `$` isn't interpreted as a metacharacter anchoring the regexp at end-of-string. CQP heuristically checks for metacharacters in `do_flagged_string()` (in `cqp/parse_actions.c`), but the list doesn't include the “useless” metacharacters `$` and `^`. This raises three questions:
1. Should we change behaviour to ensure consistency between the three queries? This might break existing applications (and users) who have unwittingly relied on the current inconsistent behaviour.
2. If we do, perhaps the current list `"[](){}.*+|?\\"` has further gaps?
3. Is `do_flagged_string()` the only place where this test is run or do we need to patch other functions as well?
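
The inconsistency can be reproduced with a small stand-alone sketch of the heuristic. This is not the actual code from `cqp/parse_actions.c`: the function name `looks_like_regex()` and its `has_flags` parameter are hypothetical, and the assumption that a flag such as `%c` forces regexp matching is inferred from the behaviour described above. Only the metacharacter list is taken from the ticket.

```c
#include <assert.h>
#include <string.h>

/* The list quoted in question 2; note that '$' and '^' are absent. */
static const char *CURRENT_METACHARS = "[](){}.*+|?\\";

/* Hypothetical model of the heuristic: returns 1 if the pattern would be
 * handed to the regexp matcher, 0 if it would be compared as a literal.
 * Assumption: any flag (e.g. %c) forces the regexp path. */
static int looks_like_regex(const char *pattern, int has_flags)
{
  assert(pattern != NULL);
  if (has_flags)
    return 1;
  return strpbrk(pattern, CURRENT_METACHARS) != NULL;
}
```

Under these assumptions, `looks_like_regex("PP$", 0)` classifies the first query as a literal, while `looks_like_regex("PP$", 1)` and `looks_like_regex("PP$|PP$", 0)` both take the regexp path, where `$` acts as an end-of-string anchor and the pattern matches `PP`.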

