[CWB] encoder script for AnCora or DEFT Spanish treebanks?
John Hale
jthale at uga.edu
Mon Jul 15 21:16:04 CEST 2019
Hi everyone, on behalf of Donald Dunagan at the University of Georgia (cc),
I’m pleased to contribute CWB encoder scripts for the Spanish treebanks that I asked about previously (see below).
The attached zip archive includes a README file; it assumes that the user already has licensed copies of the DEFT Spanish Treebank from the Linguistic Data Consortium and AnCora-ES 3.0 from the Centre de Llenguatge i Computació, Universitat de Barcelona.
You will first need to adjust the pathnames that are set at the beginning of the scripts, such as INPUT1, INPUT2, CWBMAKE, DATA_DIRECTORY, etc.,
to fit your site’s filesystem configuration. You may also need to upgrade your version of awk and ensure that you have the “iconv” utility on your system.
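As a rough illustration of the kind of check that can save a failed run, here is a minimal Python sketch (not part of the attached archive) that verifies the configured paths exist and that awk and iconv are on the PATH. The variable names mirror those mentioned above, but the values are placeholders and the helper name preflight is hypothetical.

```python
import os
import shutil

# Placeholder values only: the real settings live at the top of the
# encoder scripts, and these paths are assumptions for illustration.
SETTINGS = {
    "INPUT1": "/path/to/first/input",
    "INPUT2": "/path/to/second/input",
    "DATA_DIRECTORY": "/path/to/cwb/data",
}
REQUIRED_TOOLS = ["awk", "iconv"]


def preflight(settings, tools, which=shutil.which, exists=os.path.exists):
    """Return a list of human-readable problems (empty if none are found)."""
    problems = []
    for name, path in settings.items():
        if not exists(path):
            problems.append(f"{name}: path does not exist: {path}")
    for tool in tools:
        if which(tool) is None:
            problems.append(f"{tool}: not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in preflight(SETTINGS, REQUIRED_TOOLS):
        print("WARNING:", problem)
```

This only checks that the tools are present; it does not check whether the installed awk is recent enough for the scripts.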
But if all goes well, after running them you will be able to issue queries like:
<s_clausetype="relative"> [word="que"] [pos="v"] []* <grup_nom_gen="f"> []* </grup_nom_gen> </s_clausetype> ;
which searches for relative clauses that begin with the word 'que' followed by a verb and end with a feminine noun phrase.
Perhaps this zip file could be added to the CWB web page under Import & export utilities<http://cwb.sourceforge.net/download.php#import>?
all the best,
-john
On Jun 20, 2019, at 8:13 AM, John Hale <jthale at uga.edu> wrote:
Hi, before reinventing the wheel I wanted to ask the CWB list whether anyone has already created an encoder script for the XML annotations used in the CLiC group’s Spanish corpora<http://clic.ub.edu/corpus>? This annotation system is also used in the DEFT Spanish treebank<https://catalog.ldc.upenn.edu/LDC2018T01> and documented fairly exhaustively in this English-language publication:
Soriano, B., O. Borrega, M. Taulé and M.A. Martí (2008) Guidelines,
3LB-WP-02-03, Universitat de Barcelona.
(http://clic.ub.edu/corpus/webfm_send/17)
It’s straightforward enough to thresh out the word (“wd”) attributes and morphology as positional attributes,
but my ambition is to encode the syntactic annotations as s-attributes as well, along the lines suggested in the CWB manual<http://cwb.sourceforge.net/files/CWB_Encoding_Tutorial/node7.html>.
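To make the idea concrete, here is a minimal Python sketch of the mapping: terminals carrying a “wd” attribute become one-token-per-line rows of positional attributes, and the enclosing syntactic elements become XML tags of the kind cwb-encode reads as s-attributes. The sample input, the choice of “lem” and “pos” as additional columns, and the renaming of tags (lowercasing, dots to underscores) are assumptions for illustration, not a description of the treebank files or of any existing script.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Toy input loosely modelled on the annotation style; the element and
# attribute names are illustrative assumptions, not copied from the treebanks.
SAMPLE = """<sentence>
  <S clausetype="relative">
    <relatiu wd="que" lem="que" pos="pr0cn000"/>
    <v wd="compró" lem="comprar" pos="vmis3s0"/>
    <grup.nom gen="f" num="s">
      <d wd="la" lem="el" pos="da0fs0"/>
      <n wd="casa" lem="casa" pos="ncfs000"/>
    </grup.nom>
  </S>
</sentence>"""


def to_vrt(elem, lines):
    """Append verticalized lines for elem and all of its descendants."""
    if "wd" in elem.attrib:
        # Terminal node: one token per line, tab-separated p-attributes.
        columns = [elem.get("wd"), elem.get("lem", "_"), elem.get("pos", "_")]
        lines.append("\t".join(escape(c) for c in columns))
        return
    # Non-terminal node: emit an XML region around its children.
    # Assumed renaming: lowercase the tag and turn dots into underscores.
    tag = elem.tag.replace(".", "_").lower()
    attrs = "".join(f" {k}={quoteattr(v)}" for k, v in elem.attrib.items())
    lines.append(f"<{tag}{attrs}>")
    for child in elem:
        to_vrt(child, lines)
    lines.append(f"</{tag}>")


def convert(xml_text):
    """Return the verticalized form of xml_text as a list of lines."""
    lines = []
    to_vrt(ET.fromstring(xml_text), lines)
    return lines


if __name__ == "__main__":
    print("\n".join(convert(SAMPLE)))
```

A real converter would also have to deal with terminals that lack a “wd” attribute and with whatever nesting depth the target s-attributes need; neither is handled here.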
grateful for any tips you might have,
-john
-------------- next part --------------
A non-text attachment was scrubbed...
Name: cwb_encode_spanish_treebanks.zip
Type: application/zip
Size: 5175 bytes
Desc: cwb_encode_spanish_treebanks.zip
URL: <http://liste.sslmit.unibo.it/pipermail/cwb/attachments/20190715/f0c1d5ef/attachment.zip>