[CWB] Regular expressions with word groups

Josep M. Fontana josepm.fontana at upf.edu
Wed Jul 29 13:50:36 CEST 2020


OK. Thanks. That's what the problem seems to be. The ! operator can have
scope over "words" (i.e., anything enclosed within '[]'), even if the |
operator is used to establish possible alternatives for the word form.
It cannot be used, however, with "word" groups enclosed within '( )'.

This makes sense at a conceptual level.

Once we have the option of using parentheses to create what are in
effect multi-word expressions in CQP searches, though, I think it would
make sense to also be able to use the ! operator to handle those
expressions as if they were words. I can see how this could be
enormously useful for people working with corpora.
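
A minimal sketch of the behaviour being requested, written in Python rather than CQP, since CQP offers no such construct according to this thread. It keeps a "head" token only when none of the excluded token sequences follows it, which is the negative look-ahead Maarten describes below. The names HEAD, EXCLUDED, follows and heads_not_followed are hypothetical, the excluded list is reduced to the two "tierra" alternatives, and each regex is matched against a whole token on the assumption that this mirrors how CQP applies regexes.

```python
import re

# Hypothetical names throughout; this is plain Python, not CQP syntax.
# Head pattern taken from the first token of the query in this thread.
HEAD = re.compile(r"f[ei]rid.*|muert[ao].*")

# Each excluded continuation is a sequence of per-token regexes.
# Only the two "tierra" alternatives are modelled here (a simplification).
EXCLUDED = [
    [r"entierra"],
    [r"en", r"tierra"],
]


def follows(tokens, start, pattern_seq):
    """Return True if pattern_seq matches the tokens beginning at index start."""
    window = tokens[start:start + len(pattern_seq)]
    if len(window) < len(pattern_seq):
        return False  # not enough tokens left for this sequence
    # Assumption: every regex must match the whole token.
    return all(re.fullmatch(p, t) for p, t in zip(pattern_seq, window))


def heads_not_followed(tokens, head=HEAD, excluded=EXCLUDED):
    """Return indices of head tokens that no excluded sequence follows."""
    return [
        i for i, tok in enumerate(tokens)
        if head.fullmatch(tok)
        and not any(follows(tokens, i + 1, seq) for seq in excluded)
    ]


print(heads_not_followed("cayo muerto en tierra".split()))     # → []
print(heads_not_followed("fue ferido en la batalla".split()))  # → [1]
```

Because each excluded sequence is checked at its own length, the variable length of the negated group is not a problem in this formulation: the head is the only thing that is actually matched.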

JM

On 29/07/2020 11:52, "Andrés Chandía" wrote:
> No, I didn't suggest what you say; I was just calling your attention
> to the difference between your RegEx and the one from the manual...
>
> Manual
> [(lemma="go") & !(word="went"%c | word="gone"%c)];
>
> Yours
> ([word="en"][word="tierra"])
>
> To match yours to the one from the manual, the regex should be:
> !(word="en" word="tierra")
>
>
>
> On Tue, 28 July 2020, 23:42, Josep M. Fontana wrote:
> > Thanks Maarten and to everybody who responded.
> >
> > Yes. What you say makes total sense. I had assumed that, since in essence
> > the complex pattern involving grouped expressions is entirely within a
> > single pair of parentheses '( )', which allows one to treat it as if it
> > were a single word within '[]', the ! operator would work the same way it
> > works when it is applied to any expression inside square brackets.
> >
> > Andrés seems to suggest enclosing everything within square brackets, but
> > that doesn't work. In principle it shouldn't work, because the convention
> > is that square brackets enclose a word. So it makes sense that we can't
> > do that. Once we allow parentheses to form groups of sequences of
> > words that are treated essentially as a single unit, however, why can't
> > we use the same operators we use with single expressions enclosed within
> > '[ ]'? I don't see why it shouldn't be possible.
> >
> > If no one has asked this before, it must mean that there are not that
> > many people who need to do this kind of search, and of course I
> > have no idea how hard this might be to implement. Having said this,
> > however, I certainly think that this would be very useful. I find the
> > idea of doing a diff, as Andrew suggests, a bit impractical.
> >
> > JM
> >
> > On 28/07/2020 22:41, Maarten Janssen wrote:
> >> I did not look in detail at the implementation in CWB - but if these
> >> were normal regular expressions, your query
> >>
> >> [(word="f[ei]rid.*")|(word="muert[ao].*")] !(([(pos="S.*") & (word="d.*")][word=".*el"][word="ca[buv]allo.*"])|[word="entierra"]|([word="en"][word="tierra"]))
> >>
> >> should match
> >>
> >> cayo *muerto* en tierra
> >>
> >> Namely, "muerto" for the first part of the query, and nothing for the
> >> second: there is no indication of how long the second part should be.
> >> Add a word requirement after it and it even becomes ill-defined what
> >> you would mean by it. It would be different if you were looking for a
> >> specific word after it that cannot be one of several, like
> >> [!(word="en ?tierra" | word="ca[buv]allo")], but your second part has
> >> a variable word length. What you are looking for is a negative
> >> look-ahead, which you cannot do by negating the parts of what you are
> >> looking for, and given how query matches work in CWB I would be very
> >> surprised if there is a negative look-ahead...
> >> _______________________________________________
> >> CWB mailing list
> >> CWB at sslmit.unibo.it
> >> http://liste.sslmit.unibo.it/mailman/listinfo/cwb