[CWB] Regular expressions with word groups

Josep M. Fontana josepm.fontana at upf.edu
Tue Jul 28 23:42:26 CEST 2020


Thanks, Maarten, and thanks to everybody who responded.

Yes, what you say makes total sense. I had assumed that, since the 
complex pattern involving grouped expressions is all within a single 
pair of parentheses '( )', which lets one treat it as if it were a 
single word within '[ ]', the ! operator would work the same way it 
does when it is applied to an expression inside square brackets.

Andrés seems to suggest enclosing everything within square brackets, but 
that doesn't work. In principle it shouldn't work, because the convention 
is that square brackets enclose a single word, so it makes sense that we 
can't do that. Once we are allowed to use parentheses to group sequences 
of words that are treated essentially as a single unit, however, why 
can't we use the same operators we use with single expressions enclosed 
within '[ ]'? I don't see why it shouldn't be possible.

If no one has asked this before, it must mean that not many people need 
to do this kind of search, and of course I have no idea how hard this 
might be to implement. Having said this, however, I certainly think it 
would be very useful. I find the idea of doing a diff, as Andrew 
suggests, a bit impractical.
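
For reference, this is roughly what I understand the diff approach to 
involve: collect all hits for the first part, collect the hits that are 
followed by one of the unwanted sequences, and subtract. It is only a 
rough sketch in Python over a plain token list, not CQP. The names 
(first_part, UNWANTED, diff_search) are made up for illustration, and 
the pos="S.*" constraint from my query is left out because the toy data 
has no tags.

```python
import re

def first_part(token):
    # First part of the query: word="f[ei]rid.*" | word="muert[ao].*"
    return re.fullmatch(r"(?:f[ei]rid|muert[ao]).*", token) is not None

# Unwanted continuations, one regex per token (the pos constraint is omitted).
UNWANTED = [
    [r"d.*", r".*el", r"ca[buv]allo.*"],
    [r"entierra"],
    [r"en", r"tierra"],
]

def followed_by_unwanted(tokens, i):
    # True if the tokens right after position i match one unwanted sequence.
    for seq in UNWANTED:
        window = tokens[i + 1 : i + 1 + len(seq)]
        if len(window) == len(seq) and all(
            re.fullmatch(p, t) for p, t in zip(seq, window)
        ):
            return True
    return False

def diff_search(tokens):
    # "Query A": every hit for the first part.
    all_hits = {i for i, t in enumerate(tokens) if first_part(t)}
    # "Query B": the hits that are followed by an unwanted sequence.
    bad_hits = {i for i in all_hits if followed_by_unwanted(tokens, i)}
    # The diff: A minus B, as sorted token positions.
    return sorted(all_hits - bad_hits)
```

Calling diff_search on the tokens of "cayo muerto en tierra y fue ferido 
de muerte" should keep only the position of "ferido".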

JM

On 28/07/2020 22:41, Maarten Janssen wrote:
> I did not look in detail at the implementation in CWB - but if these were normal regular expressions, your query
>
> [(word="f[ei]rid.*")|(word="muert[ao].*")] !(([(pos="S.*") & (word="d.*")][word=".*el"][word="ca[buv]allo.*"])|[word="entierra"]|([word="en"][word="tierra"]))
>
> should match
>
> cayo *muerto* en tierra
>
> Namely, "muerto" for the first part of the query, and nothing for the second: there is no indication of how long the second part should be. Add a word requirement after it and it even becomes ill-defined what you would mean by it. It would be different if you were looking for a specific word after it that cannot be one of several, like [!(word="en ?tierra" | word="ca[buv]allo")], but your second part spans a variable number of words. What you are looking for is a negative look-ahead, which you cannot do by negating the parts of what you are looking for, and given how query matches work in CWB I would be very surprised if there is a negative look-ahead...
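
To make the look-ahead point concrete, here is a small sketch using 
Python's re module on plain strings rather than CQP on a tagged corpus. 
It assumes each sentence is a string of space-separated tokens, drops 
the pos constraint, and uses a made-up helper name (hits). It only shows 
what a negative look-ahead does in ordinary regular expressions; it says 
nothing about what CWB supports.

```python
import re

# First part: a token starting with f[ei]rid or muert[ao].
# Second part: a negative look-ahead (?!...) that rejects the hit when the
# following tokens are "d... ...el ca[buv]allo...", "entierra", or "en tierra".
PATTERN = re.compile(
    r"\b(?:f[ei]rid|muert[ao])\w*\b"
    r"(?!\s+(?:d\w*\s+\w*el\s+ca[buv]allo\w*|entierra\b|en\s+tierra\b))"
)

def hits(sentence):
    # Return the matched tokens; the look-ahead itself consumes no text.
    return [m.group(0) for m in PATTERN.finditer(sentence)]
```

With this pattern, hits("cayo muerto en tierra") comes back empty, 
whereas hits("fue ferido de muerte") still returns "ferido".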

