[CWB] CWB and CoNLL format

Stefan Evert stefanML at collocations.de
Tue Apr 13 17:23:44 CEST 2021


Dear Luigi,

good to hear that the new basic CoNLL support in cwb-encode works for you.  To answer your specific questions:

> btw, how is
> the sentence boundary recognized, if there are not XML tags in the
> conllu file?

CoNLL indicates sentence boundaries by blank lines, so one of the main additions to cwb-encode was an option to recognise this convention (and it was implemented because I got bored of having to write a Perl script to turn them into <s> and </s> tags all the time).

> I already know from Stefan's answers that #lines are ignored, but it
> would be nice to have  at least the sentence id encoded -

If you mean metadata as in the examples at https://universaldependencies.org/format.html, e.g.

	# sent_id = 1
	# text = They buy and sell books.
	1   They     they    PRON    PRP    Case=Nom|Number=Plur               2   nsubj   2:nsubj|4:nsubj   _
	2   buy      buy     VERB    VBP    Number=Plur|Person=3|Tense=Pres    0   root    0:root            _

we do not intend to support this directly in cwb-encode because it's not an (even informally) standardised format.  However, you can put a small Perl, Python, … script between the input and cwb-encode that transforms the metadata comments into suitable XML tags, e.g.

	<s_id 1>
	<s_fulltext They buy and sell books.>
	1   They     they    PRON    PRP    Case=Nom|Number=Plur               2   nsubj   2:nsubj|4:nsubj   _
	2   buy      buy     VERB    VBP    Number=Plur|Person=3|Tense=Pres    0   root    0:root            _

Note these aren't proper XML tags but the simplified tag-with-value format expected by -V attributes. Then encode with flags

	… -V s_id -V s_fulltext

(omitting a :0 recursion specifier).  You don't have to bother with closing tags at the end of a sentence: the regions will automatically be closed by the next start tag.
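
A minimal sketch of such a filter in Python might look like the following. It is only an illustration, not part of CWB: the function name convert and the TAGS mapping are made up for this example, it assumes metadata comments always have the form "# key = value", and it does not escape or validate the values in any way. All other lines (tokens, blank lines, other comments) are passed through unchanged.

```python
import re

# Hypothetical mapping from CoNLL-U comment keys to s-attribute names
# (the attribute names follow the s_id / s_fulltext example above).
TAGS = {"sent_id": "s_id", "text": "s_fulltext"}

# Matches metadata comments of the assumed form "# key = value".
META = re.compile(r"^#\s*(\w+)\s*=\s*(.*)$")


def convert(lines):
    """Yield input lines, rewriting known metadata comments as <attr value> tags."""
    for line in lines:
        line = line.rstrip("\r\n")
        m = META.match(line)
        if m and m.group(1) in TAGS:
            # e.g. "# sent_id = 1" becomes "<s_id 1>"
            yield "<{} {}>".format(TAGS[m.group(1)], m.group(2).strip())
        else:
            # Token lines, blank lines and unknown comments are left as they are.
            yield line


if __name__ == "__main__":
    sample = [
        "# sent_id = 1",
        "# text = They buy and sell books.",
        "1\tThey\tthey\tPRON\tPRP\tCase=Nom|Number=Plur\t2\tnsubj\t2:nsubj|4:nsubj\t_",
        "",
    ]
    print("\n".join(convert(sample)))
```

To use this as a filter in front of cwb-encode, pass sys.stdin to convert() instead of the sample list and print each resulting line.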

Best,
Stefan


