[CWB] CWB and CoNLL format
Christian Chiarcos
christian.chiarcos at web.de
Wed Mar 3 12:59:21 CET 2021
Dear Stefan,
thanks a lot, that's great news. However, just for clarification: could
you make explicit which features of the CoNLL format family are supported
and which are not?
Specific questions/remarks:
- I assume you support CoNLL-X and CoNLL-U
- Does that include support for CoNLL-U metadata? (In "classical" CoNLL,
this is just skipped as a free-text comment, see
https://universaldependencies.org/format.html#sentence-boundaries-and-comments
and https://universaldependencies.org/ext-format.html)
- Is this metadata/comment information preserved in (i.e., writable from)
CWB?
- Is there any support for CoNLL-U Plus (i.e., user-defined extensions to
the CoNLL-U format)? Note that this requires parsing CoNLL-U-specific
metadata (which is not shared with other CoNLL formats, see
https://universaldependencies.org/ext-format.html).
- Do you support the CoNLL-U encoding of multi-tokens? (After lines with
regular numerical IDs, say 1 and 2, you can add a multi-token line with ID
1-2 that describes the multi-token, see
https://universaldependencies.org/format.html#words-tokens-and-empty-nodes)
- I assume that CoNLL formats with SRL annotations aren't supported
(because they come with a variable number of columns, potentially
different for every sentence). This includes the CoNLL-2004 and CoNLL-2005
formats (among others), as well as the current PropBank "skel" format
(which differs from the CoNLL SRL formats by replacing words with
placeholders).
- Do you support the IOB(ES) formats for writing chunks (or are they just
interpreted as strings)? These have been part of various CoNLL formats
since 1999 and are still commonly used for chunking and named entity
annotation.
- Is there any support for PTB-style bracket formats (or are they just
interpreted as strings)? They have been used for phrase-structure parsing
and semantic role labelling in different CoNLL formats (and are, again,
still part of the current PropBank "skel" format).
- Is there any support for other relations between sentence parts such as
coreference (CoNLL-2012 format) or between parts of different sentences
(CoNLL-2015 format)?
- Do you require TAB as the column separator (as in most of the more
recent CoNLL formats) or do you permit SPACE (as in the original CoNLL
format)? If the former, do you permit SPACE in tokens or annotations?
(Traditionally, CoNLL formats don't, but with TAB-separated values, this
can technically occur.)
- Is there a strategy for escaping special characters, e.g., SPACE or TAB?
Almost all CoNLL formats are TSV (i.e., CSV with TABs as separators), but
I'm not sure whether any of them uses the standard CSV conventions for
this purpose -- partially, because they pre-date the CSV specification
(https://tools.ietf.org/html/rfc4180).
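To make the CoNLL-U points above concrete, here is a minimal sketch (in
Python, not CWB code) of the line types a CoNLL-U reader has to tell
apart: comment/metadata lines, multi-token ranges such as 1-2, empty
nodes such as 2.1, and regular word lines. The function name
classify_line and the sample lines are made up for illustration, and a
TAB separator is assumed; this says nothing about what CWB actually does.

```python
import re

# ID shapes described in the CoNLL-U documentation:
#   "3"    regular word line
#   "1-2"  multi-token range line
#   "2.1"  empty node line
WORD_ID = re.compile(r"\d+")
RANGE_ID = re.compile(r"\d+-\d+")
EMPTY_ID = re.compile(r"\d+\.\d+")


def classify_line(line):
    """Classify one line of CoNLL-U input (hypothetical helper).

    Returns 'blank', 'comment', 'word', 'range', 'empty' or 'other'.
    Assumes TAB-separated columns with the ID in the first column.
    """
    line = line.rstrip("\n")
    if line.strip() == "":
        return "blank"      # sentence boundary
    if line.startswith("#"):
        return "comment"    # metadata such as "# sent_id = 1"
    token_id = line.split("\t", 1)[0]
    if WORD_ID.fullmatch(token_id):
        return "word"
    if RANGE_ID.fullmatch(token_id):
        return "range"
    if EMPTY_ID.fullmatch(token_id):
        return "empty"
    return "other"


# Illustrative input only; the annotation values are not taken from a corpus.
SAMPLE_LINES = [
    "# sent_id = 1",
    "# text = vámonos",
    "1-2\tvámonos\t_\t_\t_\t_\t_\t_\t_\t_",
    "1\tvamos\tir\tVERB\t_\t_\t0\troot\t_\t_",
    "2\tnos\tnosotros\tPRON\t_\t_\t1\tobj\t_\t_",
    "2.1\t_\t_\t_\t_\t_\t_\t_\t_\t_",
    "",
]

kinds = [classify_line(line) for line in SAMPLE_LINES]
print(kinds)
```

A reader that only knows "classical" CoNLL would typically drop the
comment lines and might mistake the range line for an ordinary token,
which is why it matters which of these cases are handled.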
Given the inherent limitations of the CWB3 data model, the answer to some
of these questions is fairly obvious (as is how to overcome them with
CWB4), so apologies for asking explicitly. But as "support for CoNLL
format" can mean a lot of different things to potential users (depending
on what data they're most familiar with), I would ask that this be
documented as part of appendix B and referred to in the manual where
"CoNLL format" is introduced as a term.
Best regards,
Christian
On .03.2021, at 11:45, Stefan Evert <stefanML at collocations.de> wrote:
> Dear all,
>
> as of v3.4.28 (i.e. the most recent code from the SVN repository), CWB
> has full support for reading and writing CoNLL-style files. Of course,
> dependency links are still stored as token numbers in regular
> p-attributes – this will only change with Ziggurat and CWB 4.
>
> Suitable command-line options for CoNLL format are documented in Sec. 2
> and 7 of the current draft of the CWB Encoding Tutorial, available from
> this link:
>
> https://sourceforge.net/p/cwb/code/HEAD/tree/doc/tutorials/CWB_Encoding_Tutorial.pdf?format=raw
>
> Testers are highly welcome!
>
> Best,
> Stefan
>
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://liste.sslmit.unibo.it/mailman/listinfo/cwb