[CWB] CWB and CoNLL format

Christian Chiarcos christian.chiarcos at web.de
Wed Mar 3 12:59:21 CET 2021


Dear Stefan,

thanks a lot, that's great news. Just for clarification, however: could
you make explicit which features of the CoNLL format family are supported
and which are not?

Specific questions/remarks:
- I assume you support CoNLL-X and CoNLL-U
- Does that include support for CoNLL-U metadata? (In "classical" CoNLL,
this is just skipped as a free-text comment, see
https://universaldependencies.org/format.html#sentence-boundaries-and-comments
and https://universaldependencies.org/ext-format.html.)
- Is this metadata/comment information preserved in (i.e., writable from)  
CWB?
- Is there any support for CoNLL-U Plus (i.e., user-defined extensions to
the CoNLL-U format)? Note that this requires parsing CoNLL-U-specific
metadata (which is not shared with other CoNLL formats, see
https://universaldependencies.org/ext-format.html).
- Does that support the CoNLL-U encoding of multiword tokens? (Before the
lines with regular numerical IDs, say 1 and 2, you can add a line with ID
1-2 that describes the multiword token, see
https://universaldependencies.org/format.html#words-tokens-and-empty-nodes.)
- I assume that CoNLL formats with SRL annotations aren't supported
(because they come with a variable number of columns, potentially
different for every sentence). This includes the CoNLL-2004 and CoNLL-2005
formats (among others), as well as the current PropBank "skel" format
(which differs from the CoNLL SRL formats by replacing words with
placeholders).
- Do you support the IOB(ES) formats for writing chunks (or are they just  
interpreted as strings)? These have been part of various CoNLL formats  
since 1999 and are still commonly used for chunking and named entity  
annotation.
- Is there any support for PTB-style bracket formats (or are they just  
interpreted as strings)? They have been used for phrase-structure parsing  
and semantic role labelling in different CoNLL formats (and are, again,  
still part of the current PropBank "skel" format).
- Is there any support for other relations between sentence parts such as  
coreference (CoNLL-2012 format) or between parts of different sentences  
(CoNLL-2015 format)?
- Do you require TAB as column separator (as in most of the more recent
CoNLL formats) or do you permit SPACE (as in the original CoNLL format)?
If the former, do you permit SPACE in tokens or annotations?
(Traditionally, CoNLL formats don't, but with TAB-separated values, such
spaces can technically occur.)
- Is there a strategy for escaping special characters, e.g., SPACE or TAB?  
Almost all CoNLL formats are TSV (i.e., CSV with TABs as separators), but  
I'm not sure whether any of them uses the standard CSV conventions for  
this purpose -- partially, because they pre-date the CSV specification  
(https://tools.ietf.org/html/rfc4180).
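
To make the CoNLL-U-specific cases above concrete, here is a minimal Python
sketch of the line types a reader has to tell apart: metadata and free-text
comments, regular words, multiword token ranges, and empty nodes. It only
illustrates the format as described in the UD documentation and says nothing
about how CWB handles these lines; the function name classify_line and the
sample sentence are my own assumptions.

```python
import re

# ID shapes as described in the CoNLL-U format documentation.
WORD_ID = re.compile(r"^[1-9][0-9]*$")               # regular word, e.g. "3"
RANGE_ID = re.compile(r"^[1-9][0-9]*-[1-9][0-9]*$")  # multiword token, e.g. "1-2"
EMPTY_ID = re.compile(r"^[0-9]+\.[1-9][0-9]*$")      # empty node, e.g. "2.1"


def classify_line(line):
    """Return (kind, payload) for a single line of a CoNLL-U file.

    Hypothetical helper for illustration only, not part of CWB.
    """
    line = line.rstrip("\n")
    if not line:
        return ("sentence_break", None)
    if line.startswith("#"):
        body = line[1:].strip()
        if "=" in body:
            # "# key = value" metadata, e.g. sent_id, text, global.columns
            key, value = body.split("=", 1)
            return ("metadata", (key.strip(), value.strip()))
        return ("comment", body)
    # Split on TAB only, so that fields may contain spaces.
    columns = line.split("\t")
    token_id = columns[0]
    if RANGE_ID.match(token_id):
        return ("multiword_token", columns)
    if EMPTY_ID.match(token_id):
        return ("empty_node", columns)
    if WORD_ID.match(token_id):
        return ("word", columns)
    return ("unknown", columns)


# Made-up sample with a CoNLL-U Plus column declaration (4 columns only).
sample = [
    "# global.columns = ID FORM LEMMA UPOS",
    "# sent_id = 1",
    "1-2\tvámonos\t_\t_",
    "1\tvamos\tir\tVERB",
    "2\tnos\tnosotros\tPRON",
    "",
]

for raw in sample:
    kind, payload = classify_line(raw)
    print(kind, payload)
```

A reader that treats every "#" line as a skippable comment would lose the
global.columns declaration, and one that expects plain integer IDs would
trip over the 1-2 line, which is why I am asking about these cases
separately.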

Because of inherent limitations of the CWB3 data model, the answer to some
of these questions is fairly obvious (as is how to overcome them with
CWB4), so apologies for asking explicitly. But since "support for CoNLL
format" can mean a lot of different things to potential users (depending
on what data they're most familiar with), I would ask that this be
documented as part of appendix B, and that the manual refer to it when
introducing "CoNLL format" as a term.

Best regards,
Christian

On .03.2021, at 11:45, Stefan Evert <stefanML at collocations.de> wrote:

> Dear all,
>
> as of v3.4.28 (i.e. the most recent code from the SVN repository), CWB  
> has full support for reading and writing CoNLL-style files.  Of course,  
> dependency links are still stored as token numbers in regular  
> p-attributes – this will only change with Ziggurat and CWB 4.
>
> Suitable command-line options for CoNLL format are documented in Sec. 2  
> and 7 of the current draft of the CWB Encoding Tutorial, available from  
> this link:
>
> 	https://sourceforge.net/p/cwb/code/HEAD/tree/doc/tutorials/CWB_Encoding_Tutorial.pdf?format=raw
>
> Testers are highly welcome!
>
> Best,
> Stefan
>

