[CWB] Regular expressions with word groups
Hardie, Andrew
a.hardie at lancaster.ac.uk
Wed Jul 29 01:46:39 CEST 2020
PS. The diff approach is not so very impractical.
A = [word="word of interest"];
B = [word="word of interest"] (phrase you don't want);
set B matchend match;
C = diff A B;
and then C should have only instances of the word of interest that are NOT followed by the phrase you don't want.
Quick and easy!
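To make the logic of those three steps concrete, here is a minimal Python sketch. It is a toy model, not CQP itself: the token list, the word "muerto", the phrase "en tierra" and the helper names find_a and find_b are all invented for illustration. It assumes that diff compares whole (match, matchend) ranges, which is why matchend has to be reset first.

```python
# Toy model of the CQP steps above; corpus, word and phrase are invented.
tokens = "cayo muerto en tierra y otro muerto del caballo y muerto quedo".split()

WORD = "muerto"              # stands in for [word="word of interest"]
UNWANTED = ["en", "tierra"]  # stands in for (phrase you don't want)

def find_a(toks):
    """A: every occurrence of the word, as (match, matchend) ranges."""
    return {(i, i) for i, t in enumerate(toks) if t == WORD}

def find_b(toks):
    """B: the word immediately followed by the unwanted phrase."""
    n = len(UNWANTED)
    return {(i, i + n) for i, t in enumerate(toks)
            if t == WORD and toks[i + 1:i + 1 + n] == UNWANTED}

a = find_a(tokens)
b = find_b(tokens)

# set B matchend match;  -> move the end of each range back onto its start
b_collapsed = {(start, start) for start, _end in b}

# C = diff A B;  -> ranges in A that are not also in B
c = sorted(a - b_collapsed)

print(c)
```

Without the collapsing step, the B range would span three tokens and would never equal any one-token range in A, so the difference would remove nothing.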
best
Andrew.
-----Original Message-----
From: cwb-bounces at sslmit.unibo.it <cwb-bounces at sslmit.unibo.it> On Behalf Of Hardie, Andrew
Sent: 29 July 2020 00:22
To: Open source development of the Corpus WorkBench <cwb at sslmit.unibo.it>
Subject: Re: [CWB] Regular expressions with word groups
>>> Once we allow parentheses to form groups of word sequences that are treated essentially as a single unit, however, why can't we use the same operators we use with single expressions enclosed within '[ ]'? I don't see why it shouldn't be possible.
It's not possible because they are different levels of syntax. The syntax within [ ] is a logical statement evaluating to true or false, which means you can do all the normal Boolean arithmetic, including AND, NOT, OR, and brackets for order of evaluation. So there, ! functions as you would expect in a Boolean evaluation.
Once you are outside the [ ], you are no longer writing a Boolean expression: you are writing a token-sequence-level regular expression. So you can only use regex syntax, which doesn't include ! as a "not".
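The two levels can be pictured with a small Python sketch. This is only an analogy under my own assumptions, not how CQP is implemented: each [ ] is modelled as a function from one token to True or False, and the hypothetical helper match_sequence plays the role of the token-sequence level.

```python
# Each [ ... ] is modelled as a function from one token to True/False.
def is_muerto(tok):
    return tok["word"].startswith("muert")

def not_muerto(tok):
    # like [!(word="muert.*")]: negation still tests exactly one token
    return not is_muerto(tok)

def match_sequence(tokens, start, predicates):
    """Token-sequence level: predicates must hold for consecutive tokens."""
    window = tokens[start:start + len(predicates)]
    return len(window) == len(predicates) and all(
        p(t) for p, t in zip(predicates, window))

sentence = [{"word": w} for w in "cayo muerto en tierra".split()]

# Start positions where a non-"muert..." token is followed by a "muert..." one.
hits = [i for i in range(len(sentence))
        if match_sequence(sentence, i, [not_muerto, is_muerto])]
print(hits)
```

Negating a single predicate is well defined because it always consumes one token. Negating a whole sequence is not, because there is no fixed number of tokens for the negation to consume.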
PCRE syntax (in character-sequence regexes) does allow negative lookahead with the syntax
(?!pattern)
but the regex engine at token-sequence level in CQP doesn't.
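For comparison, this is what negative lookahead looks like at character-sequence level. The sketch uses Python's re module as a stand-in, since it accepts the same (?!pattern) syntax as PCRE; the sample strings are made up.

```python
import re

# "muerto" NOT followed by " en tierra"; the lookahead consumes no characters.
pattern = re.compile(r"muerto(?! en tierra)")

samples = ["cayo muerto en tierra", "cayo muerto del caballo"]
results = [bool(pattern.search(s)) for s in samples]
print(results)
```

This only works on a single string of characters. It does not carry over to a sequence of [ ] token patterns in CQP.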
best
Andrew.
-----Original Message-----
From: cwb-bounces at sslmit.unibo.it <cwb-bounces at sslmit.unibo.it> On Behalf Of Josep M. Fontana
Sent: 28 July 2020 22:42
To: cwb at sslmit.unibo.it
Subject: Re: [CWB] Regular expressions with word groups
Thanks Maarten and to everybody who responded.
Yes. What you say makes total sense. I had assumed that, since in essence the complex pattern involving grouped expressions is entirely within a single pair of parentheses '( )', which allows one to treat it as if it were a single word within '[ ]', the ! operator would work the same way it does when applied to any expression inside square brackets.
Andrés seems to suggest enclosing everything within square brackets, but that doesn't work. In principle it shouldn't work, because the convention is that square brackets enclose a word. So it makes sense that we can't do that. Once we allow parentheses to form groups of word sequences that are treated essentially as a single unit, however, why can't we use the same operators we use with single expressions enclosed within '[ ]'? I don't see why it shouldn't be possible.
If no one has asked this before, it must mean that not many people need to do this kind of search, and of course I have no idea how hard this might be to implement. Having said this, however, I certainly think that it would be very useful. I find the idea of doing a diff, as Andrew suggests, a bit impractical.
JM
On 28/07/2020 22:41, Maarten Janssen wrote:
> I did not look in detail at the implementation in CWB - but if these
> were normal regular expressions, your query
>
> [(word="f[ei]rid.*")|(word="muert[ao].*")] !(([(pos="S.*") & (word="d.*")][word=".*el"][word="ca[buv]allo.*"])|[word="entierra"]|([word="en"][word="tierra"]))
>
> should match
>
> cayo *muerto* en tierra
>
> Namely, "muerto" for the first part of the query, and nothing for the second. There is no indication of how long the second part should be; add a word requirement after it and it even becomes ill-defined what you would mean by it. It would be different if you were looking for a specific word after it that cannot be one of several, like [!(word="en ?tierra" | word="ca[buv]allo")], but your second part has a variable word length. What you are looking for is a negative look-ahead, which you cannot get by negating the parts of what you are looking for, and given how query matches work in CWB I would be very surprised if there is a negative look-ahead...
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://liste.sslmit.unibo.it/mailman/listinfo/cwb