[CWB] A few miscellaneous questions

Hardie, Andrew a.hardie at lancaster.ac.uk
Mon Dec 2 13:14:55 CET 2019


Hi Scott,

What you want in terms of manipulable auto-defaults is largely what CEQL is designed to provide. One of the new types of plugin under development is the CEQL Extender, which will enable you to override CEQL grammar rules as you see fit. The example plugin of this sort is in fact one that adds a “within s” clause (or one naming some other XML element) to every query. See

https://sourceforge.net/p/cwb/code/HEAD/tree/gui/cqpweb/trunk/lib/plugins/builtin/CeqlExtender/AddWithinRangeOfXml.php

Also, the new implementation of case sensitivity in 3.3 will allow the sensitivity defaults to be set per attribute.

It would be a bad idea, from a design perspective, to make the kinds of changes you suggest at the CQP level, where things like changing the default case sensitivity or adding special treatment for “word” would be radical and very bad for backward-compatibility.

best

Andrew.

From: cwb-bounces at sslmit.unibo.it <cwb-bounces at sslmit.unibo.it> On Behalf Of Scott Sadowsky
Sent: 30 November 2019 10:36
To: Stefan Evert <stefanML at collocations.de>
Cc: CWBdev Mailing List <cwb at sslmit.unibo.it>
Subject: [External Sender] Re: [CWB] A few miscellaneous questions


On Sat, Nov 30, 2019 at 6:00 AM Stefan Evert <stefanML at collocations.de> wrote:

Hi Stefan,

1. I've tagged each utterance with a unique serial number that's stored in the s_utterance s-attribute. It's encoded as free text. I'd like to be able to query specific utterances by number, e.g. s_utterance="1287117", and get a single result just once -- the entirety of the utterance.

The intuitive solution is <s_utterance = "12887"> []* </s_utterance>
A faster and slightly safer alternative (because it also works for very long sentences) is <s_utterance = "12887"> [] expand to s_utterance
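
For anyone generating these lookups from a script, the two query forms above can be built as plain strings. This is only a sketch: the function name utterance_queries is made up for illustration, and it assumes the serial numbers consist of ASCII digits only, so that nothing in the value needs escaping as a regular expression.

```python
import re


def utterance_queries(utt_id: str, attr: str = "s_utterance") -> dict:
    """Build both query forms for a single utterance ID.

    Hypothetical helper: it only assembles query strings and does not
    talk to CQP. IDs are assumed to be plain digit strings.
    """
    if not re.fullmatch(r"[0-9]+", utt_id):
        raise ValueError("utterance ID must consist of digits only")
    start = f'<{attr} = "{utt_id}">'
    return {
        # match the whole region token by token
        "intuitive": f"{start} []* </{attr}>",
        # match the first token, then expand the match to the region
        "expand": f"{start} [] expand to {attr}",
    }


queries = utterance_queries("12887")
print(queries["expand"])
```

The "expand" form is the one recommended above for very long utterances.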

That's perfect! Many thanks.


2. Performing case-sensitive queries of words is of limited use to me (and likely others). However, it's the default with CQP syntax queries. This is different from both the simple query syntax and search engine syntax, which makes it very easy to forget to add %c to every single query element. Is there any way to set searching to be case-insensitive by default?

3. In a similar vein, searching across sentence/utterance boundaries is of limited usefulness, but it is also the default. This can, of course, be dealt with by adding within s to all queries, but that's a lot of typing over time, it's not intuitive to many users, and it's also easy to forget. Can queries somehow be set to not cross sentence/utterance boundaries by default?

... 3. could in principle be handled by a Web interface such as CQPweb, which could be configured to auto-append a suitable within clause to every query (but keep in mind that only a single within clause is allowed, so this would clash with an explicit within specified by the user).  CQPweb doesn't expect every corpus to have sentence units, though, so this could not be a global setting!

Thanks for the detailed answers and rationales. It's good stuff to know!

Being able to append a user-specified (corpus admin-specified, really) "within" clause automatically would certainly be useful! The tagger I use doesn't output NPs, VPs or similar, so I can't think of any use for "within" except sentence/utterance boundaries. And in corpora in which such structures are indeed tagged, and a user specifies one with a "within" clause, this could simply replace the automatic sentence/utterance clause. That should even produce the same results, since NPs and such don't cross sentence boundaries.
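
The behaviour discussed here (append an admin-specified default "within" clause, but let an explicit one from the user take its place) can be sketched as a small query preprocessor. This is an illustration only, not how CQPweb or the CEQL Extender plugin does it: the function name add_default_within is hypothetical, and the sketch assumes that a "within" clause, if present, is the last thing in the query, so queries ending in "cut" or "expand to" clauses are not handled.

```python
import re
from typing import Optional

# A trailing clause such as "within s", "within 5" or "within 5 words".
_WITHIN_RE = re.compile(r"\bwithin\s+(?:\d+\s+)?\w+\s*$")


def add_default_within(query: str, default_region: Optional[str] = "s") -> str:
    """Append "within <default_region>" unless the query already ends in
    a within clause. Pass None for corpora without sentence units."""
    q = query.strip()
    has_semicolon = q.endswith(";")
    body = q.rstrip(";").rstrip()
    if default_region is None or _WITHIN_RE.search(body):
        return q  # nothing to add; the user's own clause wins
    out = f"{body} within {default_region}"
    return out + ";" if has_semicolon else out


print(add_default_within('[pos="DT"] [pos="NN"]'))
print(add_default_within('[pos="DT"] [pos="NN"] within np'))
```

Because only one "within" clause is allowed per query, the sketch never adds a second one; it leaves the user's clause in place instead.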


2. is a much harder problem, for several reasons:

 - You don't necessarily want case-insensitive matching for all attributes, e.g. POS tags might be case-sensitive, and you might want to distinguish between the lemmas "Polish" and "polish".  So you'd have to tell CQP exactly which attributes default to case-insensitive.

Right. If nothing else, I'd think the "word" attribute should default to being case-insensitive. There are certainly corner cases, as you point out, but mostly capitalization is linguistically uninteresting -- even with proper nouns, since any tagger worth its salt has NER and tags them as proper. Certainly I find myself adding (or mistakenly forgetting to add) "%c" to 99% of all my word-based queries. And an option such as CEQL's ":C" would allow the 1% of cases to be dealt with appropriately, of course. (Others' percentages will, of course, vary!)


 - You'd have to give users a way of turning off case-insensitivity for individual query elements, like the :C modifier in CEQL.

I have no idea what implementing something like this in CQP would involve, of course, but it almost seems like automatically adding "%c" in CQPweb, except when the user specifies some new flag for case sensitivity such as "%C", would be the easiest way to go about getting the same result.
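
A rough sketch of that idea as a query rewriter follows. Note the assumptions: "%C" is not an existing CQP flag, it is the hypothetical marker proposed above, so the sketch strips it before the query would be passed on; the function name default_case_insensitive is also made up. Only explicit word="..." tests are rewritten, so bare "..." shorthand patterns and all other attributes are left untouched.

```python
import re

# An explicit test on the word attribute, optionally followed by flags,
# e.g. word="polish" or word="Polish" %C or word!="cafe" %d
_WORD_TEST = re.compile(r'(\bword\s*!?=\s*"(?:[^"\\]|\\.)*")(?:\s*%([A-Za-z]+))?')


def default_case_insensitive(query: str) -> str:
    """Add %c to every word test unless it carries the hypothetical %C."""

    def fix(match):
        test = match.group(1)
        flags = match.group(2) or ""
        if "C" in flags:
            # user asked for case-sensitive matching: drop the marker,
            # since CQP itself would not recognise it
            flags = flags.replace("C", "")
        elif "c" not in flags:
            flags = "c" + flags
        return f"{test} %{flags}" if flags else test

    return _WORD_TEST.sub(fix, query)


print(default_case_insensitive('[word="polish"] [pos="NN"]'))
print(default_case_insensitive('[word="Polish" %C]'))
```

Restricting the rewrite to the word attribute reflects the point made earlier in the thread that POS tags and lemmas may need to stay case-sensitive.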

 Thanks again for your response.

Cheers,
Scott