[CWB] CWB and CoNLL format
Christian Chiarcos
christian.chiarcos at web.de
Tue Apr 13 16:43:40 CEST 2021
Hi Luigi,
IMHO, anything as specific as the CoNLL-U conventions for sentence
metadata should be clearly distinguished from a "general" CoNLL
importer/converter. If you just need sentence ids, a simple workaround
would be to adopt the CoNLL-2015 approach and to add sentence id to every
word as a separate column to a CoNLL(-U) file (cf. trial data under
https://www.cs.brandeis.edu/~clp/conll15st/dataset.html, second column).
Other than that, a possible strategy for CoNLL formats in general (and
compliant with formats that encode intersentential relations as sentence
offsets) would be to have *implicit* numerical sentence ids (i.e., number
of preceding sentences [as in CoNLL-2015] or number of preceding sentences
+ 1 [~ word IDs]). That could actually be a feature of a generic CoNLL
importer, as this is not specific to a particular CoNLL dialect.
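A minimal sketch of that workaround in Python, outside of CWB: sentences are numbered implicitly at blank lines and the number is appended to every token line. The function name add_sentence_ids, the first_id parameter and the choice of the last column (rather than the second, as in CoNLL-2015) are my own assumptions for illustration; the extra column would then have to be declared as one more p-attribute when encoding.

```python
def add_sentence_ids(lines, first_id=0):
    """Append an implicit sentence number to every token line.

    first_id=0 counts preceding sentences (as in CoNLL-2015);
    first_id=1 gives 1-based ids, analogous to word IDs.
    """
    sent_id = first_id
    in_sentence = False
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # Blank line: sentence boundary, advance the counter once.
            if in_sentence:
                sent_id += 1
                in_sentence = False
            yield ""
        elif line.startswith("#"):
            # Comment/metadata line: passed through unchanged.
            yield line
        else:
            # Token line: add the sentence number as a new last column.
            in_sentence = True
            yield f"{line}\t{sent_id}"
```

Usage would be along the lines of writing "\n".join(add_sentence_ids(open(path, encoding="utf-8"))) to a new file before encoding.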
Best,
Christian
On Tue, 13 Apr 2021 at 15:11, Luigi Talamo <
talamo.luigi at gmail.com> wrote:
> Dear all,
> I have recently used the new CoNLL feature and it worked like a charm.
> Using the latest version of CWB through a Docker image, I was able to
> encode a conllu file with the following command:
>
> cwb-encode -f greek-conllu-file -d /var/corpora/el_ciep/ -c utf8 -R
> /usr/local/share/cwb/registry/el_ciep -xsB -N id -L s -P lemma -P upos
> -P xpos -P feats -P head -P deprel -P deps -P misc
>
> (It is the Greek treebank of the multilingual and parallel corpus we
> are currently building here at Saarland University)
>
> I already know from Stefan's answers that #lines are ignored, but it
> would be nice to have at least the sentence id encoded - btw, how is
> the sentence boundary recognized, if there are no XML tags in the
> conllu file?
>
> cheers,
> Luigi
>
> On Wed, Mar 3, 2021 at 10:19 PM Stefan Evert <stefanML at collocations.de>
> wrote:
> >
> > Dear Christian and Maarten,
> >
> > thanks for your clarification questions, which made me realise that my
> announcement had obviously been misleading. By CoNLL support I meant that
> CWB is able to read and write the general CoNLL-style format – i.e.
> TAB-separated token-level annotation with numeric IDs in the first column
> and sentences separated by blank lines – not that it directly supports any
> particular CoNLL flavours.
> >
> > CWB has always focused on maximal flexibility and it would go against
> this principle to fix the interpretation of specific columns. It should be
> easy enough to write a small shell script or bash functions with suitable
> presets for different CoNLL formats.
> >
> > Unfortunately I've never been able to find formal documentation for a
> general CoNLL format (and neither for e.g. CoNLL-U), so it's quite possible
> that I've overlooked some features; in that case I would hope to add them to
> cwb-encode.
> >
> >
> > Regarding your specific questions:
> >
> > > - Does that include the support of CoNLL-U metadata (in "classical"
> CoNLL, this is just skipped as a free-text comment, see
> https://universaldependencies.org/format.html#sentence-boundaries-and-comments
> and https://universaldependencies.org/ext-format.html)
> >
> > These are just comment lines, and cwb-encode will ignore them – cf. the
> top section of https://universaldependencies.org/format.html, which
> clearly says that there are only token lines, blank lines and comments.
> >
> > Further down, the remark "the contents of the comments and metadata is
> basically unrestricted" clarifies that it is impossible to index these
> comments in a meaningful way. :-)
> >
> > > - Is this metadata/comment information preserved in (i.e., writable
> from) CWB?
> >
> > No, in this case pre-processing will be required to turn these lines
> into appropriate XML tags (which CoNLL should have done in the first
> place!).
> >
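A minimal sketch of such a pre-processing step, assuming that only the "# sent_id = ..." comment is of interest and that it should end up as an id attribute on an <s> tag. The function name comments_to_s_tags is hypothetical, all other comment lines are simply dropped, and I have not tested the output against the new CoNLL mode; presumably the s attribute would then be declared explicitly (e.g. -S s:0+id) instead of being derived from blank lines.

```python
from xml.sax.saxutils import escape


def comments_to_s_tags(lines):
    """Wrap each sentence in <s id="..."> ... </s>.

    The id is taken from a '# sent_id = ...' comment if present,
    otherwise it is left empty. Other comment lines are discarded.
    """
    out, tokens, sent_id = [], [], ""

    def flush():
        # Emit the buffered sentence (if any) and reset the state.
        nonlocal tokens, sent_id
        if tokens:
            out.append('<s id="%s">' % escape(sent_id, {'"': "&quot;"}))
            out.extend(tokens)
            out.append("</s>")
        tokens, sent_id = [], ""

    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            flush()  # blank line closes the current sentence
        elif line.startswith("#"):
            key, sep, value = line[1:].partition("=")
            if sep and key.strip() == "sent_id":
                sent_id = value.strip()
        else:
            tokens.append(line)
    flush()  # last sentence may lack a trailing blank line
    return out
```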
> > > - Does that support the CoNLL-U encoding of multi-tokens (after lines
> with regular numerical IDs, say 1 and 2, you can add a multi-token line
> with ID 1-2 that describes the multi-token, see
> https://universaldependencies.org/format.html#words-tokens-and-empty-nodes
> )
> >
> > That doesn't fit into the CWB data model. Actually, such input files
> will be rejected by cwb-encode because it requires the first column to be a
> number.
> >
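One possible workaround, as a sketch only: filter out every token line whose first column is not a plain integer before encoding, which removes multi-token ranges such as 1-2 as well as empty nodes such as 2.1. The function name drop_non_integer_ids is hypothetical, and note that the surface form of the multi-token is lost this way.

```python
def drop_non_integer_ids(lines):
    """Yield all lines except token lines with a non-integer ID column."""
    for line in lines:
        line = line.rstrip("\n")
        if line.strip() and not line.startswith("#"):
            # Token line: the ID is everything before the first TAB.
            token_id = line.split("\t", 1)[0]
            if not (token_id.isascii() and token_id.isdigit()):
                continue  # e.g. "1-2" (multi-token) or "2.1" (empty node)
        yield line
```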
> > > - I assume that CoNLL formats with SRL annotations aren't supported
> (because they come with a variable number of columns, potentially different
> for every sentence). This does include CoNLL-2004 and CoNLL-2005 formats
> (among others), as well as the current PropBank "skel" format (which
> differs from the CoNLL SRL formats by replacing words with placeholders).
> >
> > The columns have to be the same for the entire corpus, of course. I
> don't think changing around columns arbitrarily would give a reliable input
> format.
> >
> > Missing fields at the end of a line are simply indexed as __UNDEF__ by
> cwb-encode (without warnings).
> >
> > > - Do you support the IOB(ES) formats for writing chunks (or are they
> just interpreted as strings)? These have been part of various CoNLL formats
> since 1999 and are still commonly used for chunking and named entity
> annotation.
> >
> > They are read as a positional attribute in IOB notation, of course. I
> would convert them to chunks (if desired) after indexing, with something
> like
> >
> > cqpcl -D CORPUS 'A = (?longest) [iob = "B"] [iob = "I"]*;
> tabulate A match, matchend;' | cwb-s-encode -d data_dir -S chunk
> >
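The same idea outside of CQP, as a rough sketch: a B tag opens a chunk, any directly following I tags extend it, and each chunk is reported as a pair of start and end positions (0-based, inclusive), which as far as I know is the kind of region list that cwb-s-encode reads. The function name iob_to_spans is hypothetical, and plain B/I/O tags without type suffixes are assumed; a single B without following I counts as a one-token chunk.

```python
def iob_to_spans(tags):
    """Convert a sequence of B/I/O tags into (start, end) chunk spans."""
    spans, start = [], None
    for pos, tag in enumerate(tags):
        if tag != "I" and start is not None:
            # Anything other than I closes the chunk that is open.
            spans.append((start, pos - 1))
            start = None
        if tag == "B":
            start = pos  # B always opens a new chunk
    if start is not None:
        spans.append((start, len(tags) - 1))
    return spans
```

An I that does not follow a B or another I is ignored here, since no chunk is open at that point.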
> > > - Is there any support for PTB-style bracket formats (or are they just
> interpreted as strings)? They have been used for phrase-structure parsing
> and semantic role labelling in different CoNLL formats (and are, again,
> still part of the current PropBank "skel" format).
> >
> > They _are_ strings in the CoNLL format and are indexed as such. In my
> view, CoNLL encodes neither chunks, nor phrase structure, nor dependency
> graphs – just text columns which can later be reinterpreted as such data
> structures.
> >
> > > - Do you require TAB as column separator (as in most more recent CoNLL
> formats) or do you permit SPACE (as in the original CoNLL format) ? If the
> former, do you permit SPACE in tokens or annotations (traditionally, CoNLL
> formats don't, but with TAB-separated values it can technically
> occur)?
> >
> > CWB only accepts TAB-separated columns, so it's technically possible to
> have spaces in annotation values (but very much frowned upon).
> >
> > > - Is there a strategy for escaping special characters, e.g., SPACE or
> TAB? Almost all CoNLL formats are TSV (i.e., CSV with TABs as separators),
> but I'm not sure whether any of them uses the standard CSV conventions for
> this purpose -- partially, because they pre-date the CSV specification (
> https://tools.ietf.org/html/rfc4180).
> >
> > Is TSV a well-defined format, i.e. a variant of CSV?
> >
> > But the purpose of TSV is that one doesn't have to mess around with
> quoted fields, so TABs and newlines can't be embedded in fields. CWB also
> doesn't allow TABs or newlines in annotation values.
> >
> > > Simply because of inherent limitations of the CWB3 data model, the
> answer to some of these questions is fairly obvious (and how to overcome
> them with CWB4, as well) -- so, apologies for asking explicitly --, but as
> "support for CoNLL format" can mean a lot of different things to potential
> users (depending on what data they're most familiar with), I would ask for
> this to be documented as part of appendix B and referred to in the manual
> when introducing "CoNLL format" as a term.
> >
> > And CWB 4 will require much better (i.e. more explicit) input formats
> than CoNLL.
> >
> > But thanks for the recommendations, I'll try to remember until I have
> time to work on the manual again. I think I explained my understanding of
> "CoNLL-style format" in the manpages, but I completely agree that "full
> CoNLL support" in the encoding tutorial is misleading.
> >
> > > - I understand it takes empty lines as sentences, but does it also do
> doc and s attributes? And what does it use for the pattributes for the
> columns? (TEITOK uses what the standard describes: form, upos, xpos, feats,
> deprel, deps, head, and misc)
> >
> > As explained above, the attribute names for the columns have to be
> declared by the user running cwb-encode. Since there is no structural
> annotation in CoNLL – only comment lines with poorly specified metadata
> format – the comments are ignored.
> >
> > > - TEITOK also uses <s> (since it comes from TEI) but in UD they use
> sent - what was the motivation behind <s>? (I am trying to find out from
> the UD community whether <s> would be acceptable)
> >
> > cwb-encode actually still reads the .vrt input format, just with a few
> modifications to make it more CoNLL-friendly. So you can encode XML tags
> in the normal way.
> >
> > > - Is there also a CoNLL-U export, and if so, does that require
> anything special in the compiled corpus?
> >
> > CWB provides a full round-trip in the sense that if you encode all the
> columns as p-attributes and blank lines as sentence breaks, you can
> re-construct the input file with cwb-decode, except for the comment lines.
> >
> > Best & thanks for your responses,
> > Stefan
> >
> > _______________________________________________
> > CWB mailing list
> > CWB at sslmit.unibo.it
> > http://liste.sslmit.unibo.it/mailman/listinfo/cwb