[CWB] Restrictions on lemma annotation

graham.ranger graham.ranger at univ-avignon.fr
Mon Jun 2 12:07:45 CEST 2025


Hi Andrew,

Many thanks for this, which is extremely helpful. I was indeed not escaping the pipe in the query. My first step, then, will be to generate a list of these alternative lemmata for users, and to provide guidance on how to formulate queries in this specific case.

I'll look into option two, but the platform is really aimed at CQPweb users, for whom I'd like to keep queries as simple as possible.

Best,
Graham.

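
For the list of alternative lemmata, a minimal Python sketch might look like the following. It assumes a tab-separated VRT file with the lemma in column three, as in the original question; the function name, the sample lines, and the "Aq" tag are hypothetical, and the attribute name "lemma" in the printed query is an assumption about how the corpus was encoded.

```python
from collections import Counter
from typing import Iterable


def alternative_lemmata(lines: Iterable[str], lemma_col: int = 2) -> Counter:
    """Count lemma values that contain a pipe (hypothetical helper)."""
    counts: Counter = Counter()
    for line in lines:
        # Assumption: lines starting with "<" are XML tags, not tokens.
        if line.startswith("<"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) > lemma_col and "|" in cols[lemma_col]:
            counts[cols[lemma_col]] += 1
    return counts


# Made-up sample lines standing in for a real VRT file.
sample = [
    "<s>\n",
    "eau\tNc\teau|eaux\n",
    "claire\tAq\tclair\n",
    "eaux\tNc\teau|eaux\n",
    "</s>\n",
]

for lemma, n in alternative_lemmata(sample).most_common():
    # Escape the pipe so the query searches for it literally.
    query = '[lemma="' + lemma.replace("|", r"\|") + '"]'
    print(f"{lemma}\t{n}\t{query}")
```

For a real corpus, the sample list would be replaced by an open file handle on the VRT file.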
-------- Original message --------
From: "Hardie, Andrew" <a.hardie at lancaster.ac.uk>
Date: 02/06/2025 10:59 (GMT+01:00)
To: Open source development of the Corpus WorkBench <cwb at sslmit.unibo.it>
Subject: Re: [CWB] Restrictions on lemma annotation

Hi Graham
 
This isn’t a restriction on the lemma format. It’s simply that CQP doesn’t, by default, interpret a character like | in its input data as separating alternatives.
 
Thus, what gets indexed is the string “eau|eaux”, so that’s what you have to search for.
 
In CQL
 
[lemma="eau\|eaux"]
 
Note that the pipe has to be escaped because you are searching for a literal pipe, not separating alternatives within the query.
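
The difference can be illustrated outside CQP. The following is only a sketch in Python, whose `re` module is not CQP's regular expression engine; it assumes that the pattern must match the whole indexed value, which is how CQP treats attribute values. The helper name is hypothetical.

```python
import re

# The indexed lemma value is the literal string, pipe included.
indexed = "eau|eaux"


def matches_whole_value(pattern: str, value: str) -> bool:
    """Return True if the pattern matches the entire value."""
    return re.fullmatch(pattern, value) is not None


unescaped = matches_whole_value(r"eau|eaux", indexed)  # pipe read as alternation
escaped = matches_whole_value(r"eau\|eaux", indexed)   # pipe read as a literal
plain = matches_whole_value(r"eau", indexed)           # bare lemma

print(unescaped, escaped, plain)  # prints: False True False
```

Only the escaped pattern matches, because neither "eau" nor "eaux" on its own covers the whole string "eau|eaux".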
 
In CEQL
 
{eau\|eaux}

         

The pipe is escaped for the same reason. Or, more concisely for this specific example:
 
[lemma="eaux?"]
 
{eau[x,]}
 
(Or else just use a * wildcard at the start and end of every lemma query, though that will probably cost you precision in the query results.)
 
HOWEVER, there is a way to get the lemma field to behave the way I think you expect it to (though you would need to recode the data to add leading and trailing pipes to each lemma value): create the p-attribute as a feature set. See the encoding manual, Sec 6, and the CQP manual, Sec 6.6. Note that the special CQP functions for feature sets aren’t accessible via CEQL.
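
The recoding step could be sketched in Python roughly as follows. This assumes tab-separated VRT lines with the lemma in column three; both function names are hypothetical, and de-duplicating and sorting the values is an assumption about a tidy normal form rather than something stated above. Declaring the attribute as a feature set at encoding time is a separate step, covered in the encoding manual.

```python
def to_feature_set(lemma: str) -> str:
    """Wrap a pipe-separated lemma value in leading and trailing pipes."""
    # Assumption: drop empty pieces, de-duplicate, and sort the values.
    values = sorted({v for v in lemma.split("|") if v})
    if not values:
        return "|"
    return "|" + "|".join(values) + "|"


def recode_vrt_line(line: str, lemma_col: int = 2) -> str:
    """Recode the lemma column of one token line; pass other lines through."""
    # Assumption: lines starting with "<" are XML tags, not tokens.
    if line.startswith("<") or not line.strip():
        return line
    newline = "\n" if line.endswith("\n") else ""
    cols = line.rstrip("\n").split("\t")
    if len(cols) > lemma_col:
        cols[lemma_col] = to_feature_set(cols[lemma_col])
    return "\t".join(cols) + newline


print(recode_vrt_line("eau\tNc\teau|eaux\n"), end="")
print(recode_vrt_line("claire\tAq\tclair\n"), end="")
```

With this recoding, "eau|eaux" becomes "|eau|eaux|" and a single lemma such as "clair" becomes "|clair|".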

 
Hope that helps
 
Best
 
Andrew.
 
 


From: cwb-bounces at sslmit.unibo.it <cwb-bounces at sslmit.unibo.it>
On Behalf Of Graham Ranger -- UAPV
Sent: 31 May 2025 10:43
To: Open source development of the Corpus WorkBench <cwb at sslmit.unibo.it>
Subject: [CWB] Restrictions on lemma annotation


 

Hello,
In a corpus I'm setting up, using TreeTagger with a parameter file for classical French, there are a number of alternative lemmata, i.e. things like:
eau    Nc    eau|eaux [Nc: common noun]
I'm not entirely sure why, since there is no ambiguity here, but as a result it is impossible to search for the lemma "eau".
Are there any solutions other than simply removing the pipe and what comes after it from column three of the vrt file, so that only the first choice of lemma can be queried?
Many thanks in advance.
Graham.



