[CWB] Two questions

Tue Dec 10 04:20:24 CET 2019

Ad 2) Do you have a combination lemma_POS p-attribute? (e.g. leave_VERB, be_VERB, soft_ADJ)

If so, go to Freq list -> select that p-attribute -> select "ending with" -> enter the POS, e.g. VERB (if you include the underscore, escape it with a backslash, as this field expects CEQL, where a bare underscore separates word from tag) -> click "show freq list".

If not, it's trickier, but still possible via workarounds. (And, in future, by actual features! As ever, I have more ideas than time to implement.)

Method 1. 
Run a query for any verb e.g. >> _V* << in most tagsets. Go to Download ... then click on the button for "tabulation".
In the row of the form for "Col. no / 1", change the attribute column to specify your lemma attribute.
Click "Download query tabulation with above settings" 
Take the resulting text file and build a freq list from it in Your Tool Of Choice.
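
If Your Tool Of Choice happens to be Python, the last step of Method 1 could be sketched roughly as below. This assumes the downloaded tabulation is a UTF-8, tab-delimited text file with one query hit per line and the lemma in the first column (as set up in "Col. no / 1" above); the function name lemma_frequencies and the command-line usage are illustrative only, not part of CQPweb.

```python
import csv
import sys
from collections import Counter


def lemma_frequencies(lines, column=0):
    """Count the values found in one column of tab-delimited lines.

    Assumption: one query hit per line, lemma in column 0.
    """
    counts = Counter()
    # QUOTE_NONE: treat quote characters in the corpus text as ordinary data.
    for row in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) > column and row[column].strip():
            counts[row[column].strip()] += 1
    return counts


# Hypothetical usage: python lemma_freq.py tabulation.txt
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], encoding="utf-8", newline="") as handle:
        for lemma, freq in lemma_frequencies(handle).most_common():
            print(f"{lemma}\t{freq}")
```

The output is one lemma per line with its frequency, most frequent first. Whether to fold case before counting depends on how your lemma attribute was encoded, so the sketch leaves the values as they are.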

Method 2.
Run query as above. Save it. 
Go to "Manage annotation" (from the corpus entry page, in the Admin section of the left-hand menu).
Change the CEQL bindings so that the lemma is the primary annotation. (The primary annotation is usually the POS.)
Go to saved queries, click to view the saved query. 
Go to Frequency Breakdown. 
Change the dropdown that says "Frequency breakdown of words only" to "Frequency breakdown of annotation only". Then press Go.
When you're done, change the primary annotation back to what it was previously.

=====

Ad 3) This is basically what the "idlink" datatype feature is for. Idlink is not totally documented yet. In brief:

- have an xml tag for utterance (e.g. <u>) with an attribute for an ID handle for the speaker. (e.g. <u who="SPX276">.) 
- remember ID handles need to be alphanumeric + underscore only. 
- when you install the corpus, declare the u XML element and the attribute who, and give the latter the datatype of "ID link"
- after the corpus is installed, go to "manage XML". The u_who attribute will be described as not having a table installed yet.
- using the form there, install speaker metadata for u_who. This is like text metadata, but col 1 contains speaker IDs not text IDs. 
- these are the same IDs you used in the XML 
- you then have (and must declare on the form) as many columns as you like
- which can be free text or classification datatypes. Just like a text metadata table.
- That's all. To check that it has worked, go to "restricted query".
- speaker metadata fields that you defined as "classifications" should be usable here now.
- (and in distribution though there are still a few bugs being shaken out there).
- id link tickboxes are below text-based tickboxes. 
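
As a rough illustration of the speaker metadata step, here is a Python sketch that checks the ID handles and writes out a table to install. It assumes the table is a plain tab-delimited file with the speaker ID in column 1 and no header row, by analogy with text metadata; the helper names, the fields (sex, age) and the values are made up for the example, and only SPX276 comes from the XML example above.

```python
import re

# ID handles: alphanumeric plus underscore only (ASCII assumed here).
HANDLE_PATTERN = re.compile(r"[A-Za-z0-9_]+")


def is_valid_handle(handle):
    """Return True if the handle contains only letters, digits, underscore."""
    return HANDLE_PATTERN.fullmatch(handle) is not None


def build_metadata_table(speakers, fields):
    """Build tab-delimited lines: speaker ID first, then one value per field."""
    lines = []
    for speaker_id, values in speakers.items():
        if not is_valid_handle(speaker_id):
            raise ValueError(f"bad ID handle: {speaker_id!r}")
        lines.append("\t".join([speaker_id] + [values[f] for f in fields]))
    return "\n".join(lines) + "\n"


# Hypothetical data; the IDs must match the who="..." values in the XML.
speakers = {
    "SPX276": {"sex": "f", "age": "adult"},
    "SPX277": {"sex": "m", "age": "child"},
}
table = build_metadata_table(speakers, ["sex", "age"])
```

The column order passed in (here sex, then age) is the order you would then declare on the form, with each column marked as classification or free text.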

Hope this helps!

Andrew.

-----Original Message-----
From: cwb-bounces at sslmit.unibo.it <cwb-bounces at sslmit.unibo.it> On Behalf Of Graham Ranger
Sent: 09 December 2019 13:44
To: Open source development of the Corpus WorkBench <cwb at sslmit.unibo.it>
Subject: [External Sender] [CWB] Two questions

Hello to all,
Three questions, one of which is not strictly cqpweb related, but I hope you might be able to help...
1) I've been using treetagger to tag French texts, but it performs fairly unsatisfactorily, with a strong tendency to decide that capitalisation (including the first words of sentences) means proper nouns... Of course, I can switch to lowercase everywhere, but this creates a whole load of alternative problems. I'd be very interested to hear if anyone has found good methods for reliable POS and lemma tagging of French, preferably generating a treetagger-type output, since cqpweb anticipates this format.
2) Is there a simple way in cqpweb of generating lemma / POS frequencies? For example, all the verbs / adjectives, etc. in a corpus, with totals grouped together by lemmata (i.e. not "is", "be", "are", etc. as different entries but just "be")? I haven't found a way as yet, but I'm sure there must be something.
3) A last question concerns pointers for encoding indications regarding speakers. I'd like to be able to include information on speaker sex, social category, age, etc. in a corpus of fiction, with a view to studying the stylistic correlations of an author in direct speech representation. Would this best be done as speaker attributes?
Thank you in advance for any answers, suggestions.
Best,
Graham.
_______________________________________________
CWB mailing list
CWB at sslmit.unibo.it
http://liste.sslmit.unibo.it/mailman/listinfo/cwb