[CWB] A few miscellaneous questions

Scott Sadowsky ssadowsky at gmail.com
Sat Nov 30 11:36:22 CET 2019


On Sat, Nov 30, 2019 at 6:00 AM Stefan Evert <stefanML at collocations.de>
wrote:

Hi Stefan,

> 1. I've tagged each utterance with a unique serial number that's stored in
>> the s_utterance s-attribute. It's encoded as free text. I'd like to be able
>> to query specific utterances by number, e.g. s_utterance="1287117", and get
>> a single result just once -- the entirety of the utterance.
>
>
> The intuitive solution is <s_utterance = "12887"> []* </s_utterance>
> A faster and slightly safer alternative (because it also works for very
> long sentences) is <s_utterance = "12887"> [] expand to s_utterance
>

That's perfect! Many thanks.



> 2. Performing case-sensitive queries of words is of limited use to me (and
>> likely others). However, it's the default with CQP syntax queries. This is
>> different from both the simple query syntax and search engine syntax, which
>> makes it very easy to forget to add %c to every single query element. Is
>> there any way to set searching to be case-insensitive by default?
>
>
> 3. In a similar vein, searching across sentence/utterance boundaries is of
>> limited usefulness, but it is also the default. This can, of course, be
>> dealt with by adding within s to all queries, but that's a lot of typing
>> over time, it's not intuitive to many users, and it's also easy to forget.
>> Can queries somehow be set to not cross sentence/utterance boundaries by
>> default?
>
>
> ... 3. could in principle be handled by a Web interface such as CQPweb,
> which could be configured to auto-append a suitable within clause to every
> query (but keep in mind that only a single within clause is allowed, so
> this would clash with an explicit within specified by the user).  CQPweb
> doesn't expect every corpus to have sentence units, though, so this could
> not be a global setting!
>

Thanks for the detailed answers and rationales. It's good stuff to know!

Being able to append a user-specified (corpus admin-specified, really)
"within" clause automatically would certainly be useful! The tagger I use
doesn't output NPs, VPs or similar, so I can't think of any use for
"within" except sentence/utterance boundaries. And in corpora in which such
structures are indeed tagged, and a user specifies one with a "within"
clause, this could simply replace the automatic sentence/utterance clause.
That should even produce the same results, since NPs and such don't cross
sentence boundaries.
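
A minimal sketch of that fallback logic, in Python. The function name and the default region "s_utterance" are hypothetical and not part of CQP or CQPweb. It only inspects the end of the query, and it assumes there is no trailing "cut" or "expand" clause.

```python
import re

# Hypothetical helper, not part of CQP or CQPweb.
# Matches a "within" clause at the very end of a query, e.g. "within s"
# or "within 2 s". A quoted token such as [word="within"] does not match,
# because the keyword must be followed by whitespace and a region name.
_WITHIN_AT_END = re.compile(r"\bwithin\s+(?:\d+\s+)?\w+\s*$")


def add_default_within(query: str, default_region: str = "s_utterance") -> str:
    """Append 'within <default_region>' unless the query already ends
    in its own within clause (only one within clause is allowed)."""
    body = query.strip()
    has_semicolon = body.endswith(";")
    if has_semicolon:
        body = body[:-1].rstrip()
    if not _WITHIN_AT_END.search(body):
        # No explicit clause from the user: fall back to the admin default.
        body = f"{body} within {default_region}"
    return body + ";" if has_semicolon else body
```

A query that already carries its own clause, such as one ending in "within np;", is returned unchanged, so the user's clause replaces the automatic one rather than clashing with it.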



> 2. is a much harder problem, for several reasons:
>
>  - You don't necessarily want case-insensitive matching for all
> attributes, e.g. POS tags might be case-sensitive, and you might want to
> distinguish between the lemmas "Polish" and "polish".  So you'd have to
> tell CQP exactly which attributes default to case-insensitive.
>

Right. If nothing else, I'd think the "word" attribute should default to
being case-insensitive. There are certainly corner cases, as you point out,
but mostly capitalization is linguistically uninteresting -- even with
proper nouns, since any tagger worth its salt has NER and tags them as
proper. Certainly I find myself adding (or mistakenly forgetting to add)
"%c" to 99% of all my word-based queries. And an option such as CEQL's ":C"
would allow the 1% of cases to be dealt with appropriately, of course.
(Others' percentages will, of course, vary!)



>  - You'd have to give users a way of turning off case-sensitivity for
> individual query elements, like the :C modifier in CEQL.
>

I have no idea what implementing something like this in CQP would involve,
of course, but it almost seems like automatically adding "%c" in CQPweb,
except when the user specifies some new flag for case sensitivity such as
"%C", would be the easiest way to go about getting the same result.

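A rough sketch of that rewriting step, in Python. The "%C" flag is the hypothetical one proposed above and does not exist in CQP, and the function name is made up for illustration. The sketch only handles explicit word = "..." tests with double-quoted strings; bare-string shorthand and other attributes are left untouched.

```python
import re

# Hypothetical preprocessing step, not part of CQP or CQPweb.
_WORD_TEST = re.compile(
    r'(\bword\s*!?=\s*"(?:[^"\\]|\\.)*")'  # word = "..." or word != "..."
    r'(?:\s*%([A-Za-z]+))?'                # optional flags such as %c or %cd
)


def default_case_insensitive(query: str) -> str:
    """Add %c to every word test unless the user wrote the proposed %C."""

    def fix(match: re.Match) -> str:
        test, flags = match.group(1), match.group(2) or ""
        if "C" in flags:
            # User asked for case-sensitive matching: drop the made-up
            # %C flag so that CQP sees an ordinary case-sensitive test.
            flags = flags.replace("C", "")
        elif "c" not in flags:
            flags = "c" + flags
        return f"{test}%{flags}" if flags else test

    return _WORD_TEST.sub(fix, query)
```

With this, [word="polish"] would be sent to CQP as [word="polish"%c], while [word="Polish"%C] would be sent as the plain case-sensitive [word="Polish"]. Tests on lemma, pos and other attributes pass through unchanged.
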
Thanks again for your response.

Cheers,
Scott