[CWB] Restrictions on lemma annotation
graham.ranger
graham.ranger at univ-avignon.fr
Mon Jun 2 12:07:45 CEST 2025
Hi Andrew,
Many thanks for this, which is extremely helpful. I was indeed not escaping the pipe in the query. The first step, then, will be for me to generate a list of these alternative lemmata for users, and to provide indications on how to formulate queries in this specific case.
I'll look into option two, but the platform is really addressed to CQPweb users, for whom I'd like to keep queries as simple as possible.
Best,
Graham.
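A minimal Python sketch of how such a list of alternative lemmata might be generated from the vrt file. It assumes tab-separated columns with the lemma in column three (as in the example line quoted further down) and an attribute named lemma; the function names are hypothetical, and only the pipe is escaped when building the query string.

```python
from collections import Counter

def alternative_lemmata(lines, lemma_col=2):
    """Count lemma values that contain a pipe (assumed: lemma is column three)."""
    counts = Counter()
    for line in lines:
        # Skip structural lines such as <s> or </text>.
        if line.startswith("<"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) > lemma_col and "|" in cols[lemma_col]:
            counts[cols[lemma_col]] += 1
    return counts

def escaped_query(lemma):
    """Build a query with the pipe escaped; the attribute name is an assumption."""
    return '[lemma="{}"]'.format(lemma.replace("|", r"\|"))

sample = ["<s>\n", "eau\tNc\teau|eaux\n", "la\tDa\tle\n", "eau\tNc\teau|eaux\n"]
for lemma, freq in alternative_lemmata(sample).most_common():
    print(lemma, freq, escaped_query(lemma), sep="\t")
```

The resulting table could be handed to users as a lookup list of ready-made queries.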
-------- Original message --------
From: "Hardie, Andrew" <a.hardie at lancaster.ac.uk>
Date: 02/06/2025 10:59 (GMT+01:00)
To: Open source development of the Corpus WorkBench <cwb at sslmit.unibo.it>
Subject: Re: [CWB] Restrictions on lemma annotation
Hi Graham
This isn’t a restriction on the lemma format. It’s simply that CQP doesn’t, by default, understand things like | as meaning an alternative in its input data.
Thus, what gets indexed is the string “eau|eaux”, so that’s what you have to search for.
In CQL
[lemma="eau\|eaux"]
Note that the pipe has to be escaped because you are searching for the pipe, not separating queryable alternatives.
In CEQL
{eau\|eaux}
The escape is there for the same reason. Or, more concisely for this specific example:
[lemma="eaux?"]
{eau[x,]}
(Or else just use a bunch of * wildcards at the start and end of every lemma query, though that will probably lose you precision in the query results.)
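To illustrate the difference between an escaped and an unescaped pipe, here is a small sketch in Python. It is only an approximation: CQP has its own regular-expression engine, and Python's re.fullmatch is used here on the assumption that a CQP pattern has to match the whole attribute value.

```python
import re

# The lemma value exactly as it was indexed (one string, pipe included).
indexed = "eau|eaux"

def matches(pattern: str, value: str) -> bool:
    # Whole-value matching, standing in for CQP's behaviour (an assumption).
    return re.fullmatch(pattern, value) is not None

escaped = matches(r"eau\|eaux", indexed)    # pipe taken literally
unescaped = matches(r"eau|eaux", indexed)   # pipe read as "eau" OR "eaux"
plain = matches(r"eau", indexed)            # the query that found nothing
wildcard = matches(r".*eau.*", indexed)     # wildcards at both ends
false_hit = matches(r".*eau.*", "beau")     # why wildcards cost precision

print(escaped, unescaped, plain, wildcard, false_hit)
```

Only the escaped pattern and the wildcard pattern find the indexed value, and the wildcard pattern also accepts unrelated lemmata such as "beau".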
HOWEVER, there is a way to get the lemma field to behave as I think you expect it to (though you would need to recode to add leading and trailing pipes to each lemma value), which is to create the p-attribute as a feature set. See the encoding manual, Sec 6, and the CQP manual, Sec 6.6. Note that the special CQP functions for feature sets aren’t accessible via CEQL.
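The recoding step could be sketched in Python roughly as follows. This assumes a tab-separated vrt file with the lemma in column three and structural lines starting with "<"; the function names are hypothetical. The attribute would still have to be declared as a feature set at encoding time, as described in the encoding manual, after which a query along the lines of [lemma contains "eau"] should work in CQP, as far as I understand the manual section cited above.

```python
def to_feature_set(lemma: str) -> str:
    """Wrap pipe-separated alternatives in leading and trailing pipes."""
    parts = [p for p in lemma.split("|") if p]
    # A single pipe stands for an empty set.
    return ("|" + "|".join(parts) + "|") if parts else "|"

def recode_line(line: str, lemma_col: int = 2) -> str:
    """Recode the lemma column of one vrt line (assumed: column three)."""
    # Leave blank lines and structural lines (XML tags) untouched.
    if not line.strip() or line.startswith("<"):
        return line
    cols = line.rstrip("\n").split("\t")
    if len(cols) > lemma_col:
        cols[lemma_col] = to_feature_set(cols[lemma_col])
    return "\t".join(cols) + ("\n" if line.endswith("\n") else "")

print(recode_line("eau\tNc\teau|eaux\n"), end="")
```

Every lemma value is wrapped, including unambiguous ones, so that the whole column has a uniform format.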
Hope that helps
Best
Andrew.
From: cwb-bounces at sslmit.unibo.it <cwb-bounces at sslmit.unibo.it> On Behalf Of Graham Ranger -- UAPV
Sent: 31 May 2025 10:43
To: Open source development of the Corpus WorkBench <cwb at sslmit.unibo.it>
Subject: [CWB] Restrictions on lemma annotation
Hello,
In a corpus I'm setting up, using TreeTagger with a parameter file for classical French, there are a number of alternative lemmata, i.e. things like:
eau Nc eau|eaux [Nc: common noun]
I'm not entirely sure why, since there is no ambiguity here, but as a result it is impossible to search for the lemma "eau".
Are there any solutions other than simply removing the pipe and what comes after it from column three of the vrt file, so that only the first choice of lemma can be queried?
Many thanks in advance.
Graham.
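For comparison, the fallback described in this message (keeping only the first choice of lemma) might be sketched in Python as below, on the assumption of tab-separated columns with the lemma in column three; the function names are hypothetical.

```python
def first_lemma(lemma: str) -> str:
    """Keep only what precedes the first pipe."""
    return lemma.split("|", 1)[0]

def strip_alternatives(line: str, lemma_col: int = 2) -> str:
    """Apply first_lemma to the lemma column of one vrt line."""
    # Structural lines (XML tags) are passed through unchanged.
    if line.startswith("<"):
        return line
    cols = line.rstrip("\n").split("\t")
    if len(cols) > lemma_col:
        cols[lemma_col] = first_lemma(cols[lemma_col])
    return "\t".join(cols) + ("\n" if line.endswith("\n") else "")

print(strip_alternatives("eau\tNc\teau|eaux\n"), end="")
```

This discards the second alternative for good, which is the cost of this option.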