[CWB] Regular expressions with word groups
Josep M. Fontana
josepm.fontana at upf.edu
Wed Jul 29 13:50:36 CEST 2020
OK, thanks. That seems to be the problem. The ! operator can take
scope over a "word" (i.e. anything enclosed within '[ ]'), even when the |
operator is used to list alternatives for the word form.
It cannot, however, be applied to a "word" group enclosed within '( )'.
This makes sense at a conceptual level.
Once we have the option of using parentheses to create what are in
effect multi-word expressions in CQP searches, though, I think it would
make sense to be able to apply the ! operator to those
expressions as if they were words. I can see how this could be
enormously useful for people working with corpora.
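
As a rough illustration of what negating a word group would amount to, here is a minimal sketch in plain Python (not CQP, and not something this thread shows CWB to provide). The names HEAD, EXCLUDED, seq_matches and find_hits are made up for the example, tokens are assumed to be a simple list of word forms, and the pos="S.*" constraint from the original query is dropped because the sketch has no annotations.

```python
import re

# Hypothetical names throughout; this is plain Python, not CQP.
# Alternatives for the first token of the query.
HEAD = [r"f[ei]rid.*", r"muert[ao].*"]

# Word groups that must NOT follow the head word.
# Assumption: the pos="S.*" condition of the original query is omitted.
EXCLUDED = [
    [r"d.*", r".*el", r"ca[buv]allo.*"],
    [r"entierra"],
    [r"en", r"tierra"],
]


def seq_matches(tokens, start, patterns):
    """True if each pattern fully matches the token at the same offset from start."""
    window = tokens[start:start + len(patterns)]
    return len(window) == len(patterns) and all(
        re.fullmatch(p, t) for p, t in zip(patterns, window)
    )


def find_hits(tokens):
    """Positions of head words that are not followed by any excluded word group."""
    hits = []
    for i, tok in enumerate(tokens):
        if not any(re.fullmatch(p, tok) for p in HEAD):
            continue  # not a head word
        if any(seq_matches(tokens, i + 1, group) for group in EXCLUDED):
            continue  # an excluded multi-word expression follows
        hits.append(i)
    return hits
```

With these definitions, find_hits("cayo muerto en tierra".split()) comes back empty, while "cayo muerto ayer" yields the position of "muerto". The point of the sketch is only that each excluded group has a known length once it is written out, so the check after the head word is well defined.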
JM
On 29/07/2020 11:52, "Andrés Chandía" wrote:
> No, I didn't suggest what you say; I was just calling your attention
> to the difference between your RegEx and the one from the manual...
>
> Manual
> [(lemma="go") & !(word="went"%c | word="gone"%c)];
>
> Yours
> ([word="en"][word="tierra"])
>
> To match yours to the manual one, the regex should be:
> !(word="en" word="tierra")
>
>
>
> On Tue, 28 July 2020, 23:42, Josep M. Fontana wrote:
> > Thanks to Maarten and to everybody who responded.
> >
> > Yes. What you say makes total sense. I had assumed that, since in essence
> > a complex pattern involving grouped expressions sits entirely within a
> > single pair of parentheses '( )', which allows one to treat it as if it were a
> > single word within '[ ]', the ! operator would work the same way it works
> > when it is applied to any expression inside square brackets.
> >
> > Andrés seems to suggest enclosing everything within square brackets, but
> > that doesn't work. In principle it shouldn't work, because the convention
> > is that square brackets enclose a word. So it makes sense that we can't
> > do that. Once we allow parentheses to form groups of sequences of
> > words that are treated essentially as a single unit, however, why can't
> > we use the same operators we use with single expressions enclosed within
> > '[ ]'? I don't see why it shouldn't be possible.
> >
> > If no one has asked this before, it must mean that not that
> > many people need to do this kind of search, and of course I
> > have no idea how hard this might be to implement. Having said that,
> > I certainly think it would be very useful. I find the
> > idea of doing a diff, as Andrew suggests, a bit impractical.
> >
> > JM
> >
> > On 28/07/2020 22:41, Maarten Janssen wrote:
> >> I did not look in detail at the implementation in CWB - but if these were
> >> normal regular expressions, your query
> >>
> >> [(word="f[ei]rid.*")|(word="muert[ao].*")]
> >> !(([(pos="S.*") & (word="d.*")][word=".*el"][word="ca[buv]allo.*"])
> >> |[word="entierra"]|([word="en"][word="tierra"]))
> >>
> >> should match
> >>
> >> cayo *muerto* en tierra
> >>
> >> Namely, "muerto" for the first part of the query, and nothing for the
> >> second. There is no indication of how long the second part should be; add
> >> a word requirement after it and it even becomes ill-defined what you would
> >> mean by it. It would be different if you were looking for a specific word
> >> after it that cannot be one of several, like
> >> [!(word="en ?tierra" | word="ca[buv]allo")], but your second part spans a
> >> variable number of words. What you are looking for is a negative
> >> look-ahead, which you cannot do by negating the parts of what you are
> >> looking for, and given how query matches work in CWB I would be very
> >> surprised if there is a negative look-ahead...
> >> _______________________________________________
> >> CWB mailing list
> >> CWB at sslmit.unibo.it
> >> http://liste.sslmit.unibo.it/mailman/listinfo/cwb
> >
>
>
>
> _______________________
> andrés chandía