[CWB] CQP s-attributes constraints: select text_ids for searching

Fri Jun 27 17:39:29 CEST 2025

> Thank you, Stephanie, this is what I was looking for. For the regex, I guess I can do something like text_id = "(ID1|ID2|IDn)"

You can also read them as a word list and compile the regexp automatically with the RE() operator:

	define $texts < "text_ids.txt";
	Texts = <text_id = RE($texts)> [] expand to text;

But it has the same limits, namely …

> On the other hand, when you said “so this can be tedious (or not work at all) if you have a very long list of text IDs”, which thing could not work? If I have like, say, 100 docs, could this approach not work?

There is a length limit for strings in CWB, which we've been increasing over the years. Worse, regexp implementations often have their own limits, which *do not* throw an error, but silently ignore the rest of the regexp. So you might not be matching all IDs without ever noticing. It's probably a good idea to check with

	tabulate Texts match text_id > "test.txt";

and compare test.txt with text_ids.txt.
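
The comparison itself can be scripted. Here is a minimal sketch in Python, assuming both files contain one text ID per line (text_ids.txt as the word list, test.txt as written by the tabulate command); the function names read_ids and missing_ids are made up for illustration and are not part of CWB:

```python
def read_ids(path):
    """Return the set of non-empty lines in a file, stripped of whitespace."""
    with open(path, encoding="utf-8") as fh:
        return {line.strip() for line in fh if line.strip()}


def missing_ids(wanted_path, matched_path):
    """Return the IDs listed in wanted_path that never occur in matched_path."""
    return sorted(read_ids(wanted_path) - read_ids(matched_path))
```

Calling missing_ids("text_ids.txt", "test.txt") should return an empty list if every requested text was matched; any ID it does return was requested but never found, either because the regexp was cut short or because that ID doesn't occur in the corpus.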

Best,
Stephanie

PS: Some people (esp. Python users) try to optimise the regexp by combining shared prefixes (there's a Python package for doing this automatically). This is even worse, because PCRE1 (which the released version of CWB still uses) doesn't support deeply nested parentheses and will just silently discard them.
