[CWB] Regular expressions with word groups

Josep M. Fontana josepm.fontana at upf.edu
Thu Jul 30 00:07:51 CEST 2020


Thanks Andrew,

I had not looked at this part of the tutorial and had assumed that doing 
a diff was much more involved. You are right: it is not impractical at 
all, and it is quick and easy. This is very helpful.

Josep M.

On 29/07/2020 01:46, Hardie, Andrew wrote:
> PS. The diff approach is not so very impractical.
>
> A = [word="word of interest"];
> B = [word="word of interest"] (phrase you don't want);
> set B matchend match;
> C = diff A B;
>
> and then C should have only instances of the word of interest that are NOT followed by the phrase you don't want.
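
A note on why the `set B matchend match;` step matters. The following is only a toy Python model of the recipe quoted above, not CQP itself: it assumes that named query results behave like lists of (match, matchend) corpus positions and that diff compares whole ranges, which is what the recipe suggests. The function names and the sample sentence are made up for illustration.

```python
# Toy model of the quoted A / B / C recipe. Query results are modelled as
# lists of (match, matchend) token positions. All names are hypothetical;
# none of this is CQP's actual API.

def find_sequences(tokens, pattern):
    """Return (match, matchend) pairs where the word list `pattern` occurs."""
    n = len(pattern)
    return [(i, i + n - 1)
            for i in range(len(tokens) - n + 1)
            if tokens[i:i + n] == pattern]

def set_matchend_to_match(ranges):
    """Model of `set B matchend match;`: shrink each range to its first token."""
    return [(start, start) for start, _ in ranges]

def diff(a, b):
    """Model of `C = diff A B;`: keep the ranges of a that are not in b."""
    exclude = set(b)
    return [r for r in a if r not in exclude]

tokens = "cayo muerto en tierra y otro muerto del caballo".split()

A = find_sequences(tokens, ["muerto"])                  # word of interest
B = find_sequences(tokens, ["muerto", "en", "tierra"])  # word + unwanted phrase

# Without the reset, B's range (1, 3) never equals A's range (1, 1),
# so nothing is removed from A.
without_reset = diff(A, B)

# With the reset, B collapses to (1, 1) and the unwanted hit is removed.
C = diff(A, set_matchend_to_match(B))

print(without_reset)
print(C)
```

In this model, C keeps only the second "muerto", the one not followed by "en tierra".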
>
> Quick and easy!
>
> best
>
> Andrew.
>
>
> -----Original Message-----
> From: cwb-bounces at sslmit.unibo.it <cwb-bounces at sslmit.unibo.it> On Behalf Of Hardie, Andrew
> Sent: 29 July 2020 00:22
> To: Open source development of the Corpus WorkBench <cwb at sslmit.unibo.it>
> Subject: Re: [CWB] Regular expressions with word groups
>
>
>>>> Once we allow parentheses to form groups of word sequences that are treated essentially as a single unit, however, why can't we use the same operators we use with single expressions enclosed within '[ ]'? I don't see why it shouldn't be possible.
> It's not possible because they are different levels of syntax. The syntax within [ ] is a logical statement that evaluates to true or false, so you can use all the normal Boolean operations, including AND, OR, NOT, and brackets for order of evaluation. There, ! functions as you would expect in a Boolean evaluation.
>
> Once you are outside the [ ], you are no longer writing a Boolean expression; you are writing a token-sequence-level regular expression. So you can only use regex syntax, which doesn't include ! as a "not".
>
> PCRE syntax (in character-sequence regexes) does allow negative lookahead with the syntax
>
>      (?!pattern)
>
> but the regex engine at token-sequence level in CQP doesn't.
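
For readers unfamiliar with the construct, here is a minimal character-level illustration of negative lookahead. It uses Python's `re` module as a stand-in for PCRE (both accept the `(?!pattern)` syntax), and the sample string is made up. It says nothing about CQP's token-level engine, which, as noted above, has no equivalent.

```python
import re

# Made-up sample text: "muerto" occurs twice, once followed by "en tierra".
text = "cayo muerto en tierra y otro muerto del caballo"

# Match "muerto" only when it is NOT immediately followed by " en tierra".
pattern = re.compile(r"\bmuerto\b(?! en tierra\b)")

# Character offsets at which the pattern matches.
hits = [m.start() for m in pattern.finditer(text)]
print(hits)
```

Only the second occurrence is reported, because the lookahead rejects the first one without consuming any text.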
>
> best
>
> Andrew.
>
> -----Original Message-----
> From: cwb-bounces at sslmit.unibo.it <cwb-bounces at sslmit.unibo.it> On Behalf Of Josep M. Fontana
> Sent: 28 July 2020 22:42
> To: cwb at sslmit.unibo.it
> Subject: Re: [CWB] Regular expressions with word groups
>
>
> Thanks Maarten and to everybody who responded.
>
> Yes. What you say makes total sense. I had assumed that, since the complex pattern involving grouped expressions is in essence all within a single pair of parentheses '( )', which allows one to treat it as if it were a single word within '[ ]', the ! operator would work the same way it does when applied to any expression inside square brackets.
>
> Andrés seems to suggest enclosing everything within square brackets, but that doesn't work. In principle it shouldn't work, because the convention is that square brackets enclose a single word, so it makes sense that we can't do that. Once we allow parentheses to form groups of word sequences that are treated essentially as a single unit, however, why can't we use the same operators we use with single expressions enclosed within '[ ]'? I don't see why it shouldn't be possible.
>
> If no one has asked this before, it must mean that not many people need to do this kind of search, and of course I have no idea how hard it might be to implement. Having said that, I certainly think it would be very useful. I find the idea of doing a diff, as Andrew suggests, a bit impractical.
>
> JM
>
> On 28/07/2020 22:41, Maarten Janssen wrote:
>> I did not look in detail at the implementation in CWB - but if these
>> were normal regular expressions, your query
>>
>> [(word="f[ei]rid.*")|(word="muert[ao].*")]
>> !(([(pos="S.*") & (word="d.*")][word=".*el"][word="ca[buv]allo.*"])
>>   |[word="entierra"]|([word="en"][word="tierra"]))
>>
>> should match
>>
>> cayo *muerto* en tierra
>>
>> Namely, "muerto" for the first part of the query and nothing for the second: there is no indication of how long the second part should be. Add a word requirement after it and it even becomes ill-defined what you would mean by it. It would be different if you were looking for a specific word after it that cannot be one of several, like [!(word="en ?tierra" | word="ca[buv]allo")], but your second part has a variable length in words. What you are looking for is a negative look-ahead, which you cannot get by negating the parts of what you are looking for, and given how query matches work in CWB I would be very surprised if there is a negative look-ahead...
>> _______________________________________________
>> CWB mailing list
>> CWB at sslmit.unibo.it
>> http://liste.sslmit.unibo.it/mailman/listinfo/cwb

