[CWB] Two questions

Graham Ranger graham.ranger at univ-avignon.fr
Tue Dec 10 11:59:19 CET 2019


Many thanks for all this help, Andrew. I have begun experimenting and it 
looks very much as if I'll find my answers (direct method or 
workarounds) in your suggestions!
Best,
Graham.

On 10/12/2019 at 04:20, Hardie, Andrew wrote:
> Ad 2) Do you have a combination lemma_POS p-attribute? (e.g. leave_VERB, be_VERB, soft_ADJ)
>
> If so, go to Freq list -> select that p-attribute -> select "ending with" -> enter the POS, e.g. VERB (if you include the underscore, escape it, as this field expects CEQL) -> click "show freq list".
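
If the corpus has no combined attribute yet, one option is to add it as an extra column before indexing. The following is only a minimal sketch, assuming TreeTagger-style tab-separated lines in the order word, POS, lemma; the column order and the function name are assumptions, not anything CQPweb prescribes.

```python
def add_lemma_pos_column(vertical_lines):
    """Append a lemma_POS column to tab-separated token lines.

    Assumes each token line is: word <TAB> POS <TAB> lemma.
    Lines without a tab (XML tags, blank lines) are passed through unchanged.
    """
    out = []
    for line in vertical_lines:
        line = line.rstrip("\n")
        if "\t" not in line:
            out.append(line)
            continue
        word, pos, lemma = line.split("\t")[:3]
        out.append(f"{line}\t{lemma}_{pos}")
    return out


sample = ["<s>", "She\tPRON\tshe", "left\tVERB\tleave", "</s>"]
result = add_lemma_pos_column(sample)
for row in result:
    print(row)
```

The new fourth column would then have to be declared as an additional p-attribute when the corpus is installed.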
>
> If not, it's trickier, but still possible via workarounds. (And, in future, by actual features! As ever, I have more ideas than time to implement.)
>
> Method 1.
> Run a query for any verb e.g. >> _V* << in most tagsets. Go to Download ... then click on the button for "tabulation".
> In the row of the form for "Col. no / 1", change the attribute column to specify your lemma attribute.
> Click "Download query tabulation with above settings"
> Take the resulting text file and build a freq list from it in Your Tool Of Choice.
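
For the last step of Method 1, a few lines of Python are enough. This is a sketch only: it assumes the downloaded tabulation has one query hit per line, with tab-separated columns and the lemma in the first column; the file name mentioned below is hypothetical.

```python
from collections import Counter
import io


def lemma_frequencies(lines, column=0):
    """Count the values found in one tab-separated column, one hit per line."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip blank lines and lines that are too short to have this column.
        if len(fields) > column and fields[column]:
            counts[fields[column]] += 1
    return counts


# Stand-in for the downloaded file; with real data use something like
# open("tabulation.txt", encoding="utf-8") instead (hypothetical file name).
sample = io.StringIO("be\nleave\nbe\nhave\n")
freqs = lemma_frequencies(sample)
for lemma, n in freqs.most_common():
    print(f"{lemma}\t{n}")
```

If the tabulation lists multi-word matches, each line will hold several lemmata and the counts will be per match rather than per token.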
>
> Method 2.
> Run query as above. Save it.
> Go to "Manage annotation" (from the corpus entry page, in the Admin section of the left-hand menu).
> Change the CEQL bindings so that the lemma is the primary annotation. (1ary is usually the pos)
> Go to saved queries, click to view the saved query.
> Go to Frequency Breakdown.
> Change the dropdown that says "F b of words only" to "F b of annotation only". Then press Go.
> When you're done, change back the 1ary annotation to what it was previously.
>
> =====
>
> Ad 3) This is basically what the "idlink" datatype feature is for. Idlink is not totally documented yet. In brief:
>
> - have an xml tag for utterance (e.g. <u>) with an attribute for an ID handle for the speaker. (e.g. <u who="SPX276">.)
> - remember ID handles need to be alphanumeric + underscore only.
> - when you install the corpus, declare the u XML element and the attribute who, and give the latter the datatype of "ID link"
> - after the corpus is installed, go to "manage XML". The u_who attribute will be described as not having a table installed yet.
> - using the form there, install speaker metadata for u_who. This is like text metadata, but col 1 contains speaker IDs not text IDs.
> - these are the same IDs you used in the XML
> - you then have (and must declare on the form) as many columns as you like
> - which can be free text or classification datatypes. Just like a text metadata table.
> - That's all. Has it worked? Go to "restricted query"
> - speaker metadata fields that you defined as "classifications" should be usable here now.
> - (and in distribution though there are still a few bugs being shaken out there).
> - id link tickboxes are below text-based tickboxes.
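
The speaker metadata table described above can be prepared with a short script. The sketch below assumes a tab-delimited file with the speaker ID in column 1, and it reads "alphanumeric + underscore" as ASCII letters, digits and underscore; the field names (sex, age) and the second speaker ID are invented for illustration.

```python
import re

# Assumption: "alphanumeric + underscore" means ASCII letters, digits and "_".
ID_HANDLE = re.compile(r"[A-Za-z0-9_]+")


def is_valid_handle(handle):
    """Return True if the ID handle uses only the permitted characters."""
    return ID_HANDLE.fullmatch(handle) is not None


def build_metadata_table(speakers, columns):
    """Build tab-delimited rows: speaker ID first, then the chosen fields."""
    rows = []
    for speaker_id, fields in speakers.items():
        if not is_valid_handle(speaker_id):
            raise ValueError(f"invalid ID handle: {speaker_id!r}")
        rows.append("\t".join([speaker_id] + [fields[c] for c in columns]))
    return "\n".join(rows) + "\n"


# Illustrative data; the IDs must match the who="..." values used in the XML.
speakers = {
    "SPX276": {"sex": "f", "age": "adult"},
    "SPX277": {"sex": "m", "age": "child"},
}
table = build_metadata_table(speakers, ["sex", "age"])
print(table, end="")
```

Checking the handles before upload catches IDs such as SPX-276, which would not satisfy the alphanumeric-plus-underscore rule.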
>
> Hope this helps!
>
> Andrew.
>
> -----Original Message-----
> From: cwb-bounces at sslmit.unibo.it <cwb-bounces at sslmit.unibo.it> On Behalf Of Graham Ranger
> Sent: 09 December 2019 13:44
> To: Open source development of the Corpus WorkBench <cwb at sslmit.unibo.it>
> Subject: [External Sender] [CWB] Two questions
>
> Hello to all,
> Three questions, one of which is not strictly cqpweb related, but I hope you might be able to help...
> 1) I've been using treetagger to tag French texts, but it performs fairly unsatisfactorily, with a strong tendency to decide that capitalisation (including the first words of sentences) means proper nouns... Of course, I can switch to lowercase everywhere, but this creates a whole load of alternative problems. I'd be very interested to hear if anyone has found good methods for reliable POS and lemma tagging of French, preferably generating a treetagger-type output, since cqpweb anticipates this format.
> 2) Is there a simple way in cqpweb of generating lemma / POS frequencies? For example, all the verbs / adjectives, etc. in a corpus, with totals grouped together by lemmata (i.e. not "is", "be", "are", etc. as different entries but just "be")? I haven't found a way as yet, but I'm sure there must be something.
> 3) A last question concerns pointers for encoding indications regarding speakers. I'd like to be able to include information on speaker sex, social category, age, etc. in a corpus of fiction, with a view to studying the stylistic correlations of an author in direct speech representation. Would this best be done as speaker attributes?
> Thank you in advance for any answers, suggestions.
> Best,
> Graham.
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://liste.sslmit.unibo.it/mailman/listinfo/cwb


