[CWB] CQP s-attributes constraints: select text_ids for searching

Fri Jun 27 18:30:34 CEST 2025

Thanks again, Stephanie!

On Fri, 27 Jun 2025 at 8:46 a.m., Stephanie Evert (
stefanML at collocations.de) wrote:

> Thank you, Stephanie, this is what I was looking for. For the regex, I
> guess I can do something like text_id = "(ID1|ID2|IDn)"
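
For the hand-written variant, a minimal Python sketch of building that alternation from a list of IDs. The function name build_text_id_regex is hypothetical, and the sketch assumes the IDs contain only letters, digits and underscores, so that no regexp escaping is needed; anything else is rejected rather than guessed at. It does not address the length limits discussed below.

```python
import re

# Assumption: IDs consist only of letters, digits and underscores, so they
# contain no regexp metacharacters and can be joined without escaping.
_SAFE_ID = re.compile(r"[A-Za-z0-9_]+")


def build_text_id_regex(ids):
    """Return an alternation such as "(ID1|ID2|IDn)" for the given text IDs."""
    unique_ids = list(dict.fromkeys(ids))  # drop duplicates, keep first-seen order
    if not unique_ids:
        raise ValueError("no text IDs given")
    for text_id in unique_ids:
        if not _SAFE_ID.fullmatch(text_id):
            raise ValueError(f"text ID needs escaping or is empty: {text_id!r}")
    return "(" + "|".join(unique_ids) + ")"
```

The returned string would then go between the double quotes of the text_id constraint.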
>
>
> You can also read them as a word list and compile the regexp automatically
> with the RE() operator:
>
> define $texts < "text_ids.txt";
> Texts = <text_id = RE($texts)> [] expand to text;
>
> But it has the same limits, namely …
>
> On the other hand, when you said “so this can be tedious (or not work at
> all) if you have a very long list of text IDs”, what exactly could fail?
> If I have, say, 100 docs, could this approach stop working?
>
>
> There is a length limit for strings in CWB, which we've been increasing
> over the years. Worse, regexp implementations often have their own limits,
> which *do not* throw an error, but silently ignore the rest of the regexp.
> So you might not be matching all IDs without ever noticing. It is probably
> a good idea to check with
>
> tabulate Texts match text_id > "test.txt";
>
> and compare test.txt with text_ids.txt.
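
The comparison itself could be scripted. A minimal Python sketch, assuming both files hold one text ID per line and are UTF-8 encoded (both are assumptions, as are the function names compare_ids and compare_files):

```python
def compare_ids(expected_lines, matched_lines):
    """Return (missing, unexpected) as sorted lists of text IDs.

    missing:    IDs that were requested but never matched
    unexpected: IDs that matched but were not requested
    """
    # Blank lines are ignored; duplicates collapse because sets are used.
    expected = {line.strip() for line in expected_lines if line.strip()}
    matched = {line.strip() for line in matched_lines if line.strip()}
    return sorted(expected - matched), sorted(matched - expected)


def compare_files(expected_path="text_ids.txt", matched_path="test.txt"):
    """Compare the requested ID list with the tabulate output on disk."""
    # Assumption: both files are UTF-8 with one ID per line.
    with open(expected_path, encoding="utf-8") as expected_file, \
            open(matched_path, encoding="utf-8") as matched_file:
        return compare_ids(expected_file, matched_file)
```

If compare_files() returns two empty lists, every requested ID was matched; a non-empty first list would point to IDs that were silently dropped.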
>
> Best,
> Stephanie
>
> PS: Some people (esp. Python users) try to optimise the regexp by
> combining shared prefixes (there's a Python package for doing this
> automatically). This is even worse, because PCRE1 (which the released
> version of CWB still uses) doesn't support deeply nested parentheses and
> will just silently discard them.