[CWB] A few miscellaneous questions

Stefan Evert stefanML at collocations.de
Sat Nov 30 10:00:34 CET 2019


> I've got a few miscellaneous questions about searching in CQPweb using CQP syntax.
> 
> 1. I've tagged each utterance with a unique serial number that's stored in the s_utterance s-attribute. It's encoded as free text. I'd like to be able to query specific utterances by number, e.g. s_utterance="1287117", and get a single result just once -- the entirety of the utterance.
> 
> The best attempts I've been able to come up with are kludgy and return one hit per token in the utterance: [word=".+"] :: match.s_utterance="12887" and [] :: match.s_utterance="12887". 
> 
> Is there a better way to do this?


The intuitive solution is

	<s_utterance = "12887"> []* </s_utterance>

A faster and slightly safer alternative (because it also works for very long sentences) is

	<s_utterance = "12887"> [] expand to s_utterance


> 2. Performing case-sensitive queries of words is of limited use to me (and likely others). However, it's the default with CQP syntax queries. This is different from both the simple query syntax and search engine syntax, which makes it very easy to forget to add %c to every single query element.
> 
> Is there any way to set searching to be case-insensitive by default?
> 
> 
> 3. In a similar vein, searching across sentence/utterance boundaries is of limited usefulness, but it is also the default. This can, of course, be dealt with by adding within s to all queries, but that's a lot of typing over time, it's not intuitive to many users, and it's also easy to forget.
> 
> Can queries somehow be set to not cross sentence/utterance boundaries by default? 

Neither option is supported.  CQP is mainly designed as a backend query processor, not as a frontend interface for end users.  Hence there is only very limited syntactic sugar, such as being able to type "the" instead of the correct [word = "the"].

One of the advantages of this approach is that a given CQP query will (almost) always return the same result set and does not depend on external interface settings, which is essential e.g. for caching query results.  (There are two exceptions, the DefaultNonbrackAttr and MatchingStrategy options, but that's unfortunate enough as it is.)

3. could in principle be handled by a Web interface such as CQPweb, which could be configured to auto-append a suitable within clause to every query (but keep in mind that only a single within clause is allowed, so this would clash with an explicit within specified by the user).  CQPweb doesn't expect every corpus to have sentence units, though, so this could not be a global setting!
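
As a rough illustration of what such a frontend-side rewrite might look like, here is a minimal Python sketch.  It is not taken from CQPweb; the function name add_default_within and the per-corpus "unit" setting are hypothetical.  It appends a within clause only if the query doesn't already contain one, and does nothing for corpora without a suitable region attribute.

```python
import re
from typing import Optional

# Hypothetical helper for a Web frontend; not part of CQP or CQPweb.
# Matches the keyword "within" as a whole word.
_WITHIN_RE = re.compile(r"\bwithin\b", re.IGNORECASE)
# Matches double- or single-quoted strings, so that a search for the
# word form "within" is not mistaken for the keyword.
_QUOTED_RE = re.compile(r'"(?:\\.|[^"\\])*"|\'(?:\\.|[^\'\\])*\'')


def add_default_within(query: str, unit: Optional[str]) -> str:
    """Append 'within <unit>' unless the query already has a within clause.

    `unit` is an assumed per-corpus setting (e.g. "s"); None means the
    corpus has no suitable region attribute, so the query is left alone.
    Limitation: trailing clauses such as "cut" are not handled here.
    """
    if unit is None:
        return query
    body = query.strip()
    has_semicolon = body.endswith(";")
    if has_semicolon:
        body = body[:-1].rstrip()
    # Blank out quoted strings before looking for the keyword.
    if _WITHIN_RE.search(_QUOTED_RE.sub('""', body)):
        return query  # an explicit within from the user takes precedence
    return f"{body} within {unit}" + (";" if has_semicolon else "")


print(add_default_within('[word = "the"] [pos = "NN"]', "s"))
print(add_default_within('[] [] within 5 words', "s"))
```

The explicit within always wins here, which sidesteps the single-within restriction at the price of the default silently not applying to such queries.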

2. is a much harder problem, for several reasons:

 - You don't necessarily want case-insensitive matching for all attributes, e.g. POS tags might be case-sensitive, and you might want to distinguish between the lemmas "Polish" and "polish".  So you'd have to tell CQP exactly which attributes default to case-insensitive.

 - You'd have to give users a way of turning case-sensitivity back on for individual query elements, like the :C modifier in CEQL.

 - Case-insensitive matching is only supported for regular expressions, so if a user enters "?"%l, we'd get an option clash.

So there's little chance of getting support for 2. in CQP.  

Best,
Stefan


