[CWB] CWB and CoNLL format
Stefan Evert
stefanML at collocations.de
Tue Apr 13 17:23:44 CEST 2021
Dear Luigi,
good to hear that the new basic CoNLL support in cwb-encode works for you. In order to answer your specific questions:
> btw, how is
> the sentence boundary recognized, if there are not XML tags in the
> conllu file?
CoNLL indicates sentence boundaries by blank lines, so one of the main additions to cwb-encode was an option to recognise this convention (and it was implemented because I got bored of having to write a Perl script to turn them into <s> and </s> tags all the time).
> I already know from Stefan's answers that #lines are ignored, but it
> would be nice to have at least the sentence id encoded -
If you mean metadata as in the examples at https://universaldependencies.org/format.html, e.g.
# sent_id = 1
# text = They buy and sell books.
1 They they PRON PRP Case=Nom|Number=Plur 2 nsubj 2:nsubj|4:nsubj _
2 buy buy VERB VBP Number=Plur|Person=3|Tense=Pres 0 root 0:root _
we do not intend to support this directly in cwb-encode because it's not an (even informally) standardised format. However, you can put a small Perl, Python, … script between the input and cwb-encode that transforms the metadata comments into suitable XML tags, e.g.
<s_id 1>
<s_fulltext They buy and sell books.>
1 They they PRON PRP Case=Nom|Number=Plur 2 nsubj 2:nsubj|4:nsubj _
2 buy buy VERB VBP Number=Plur|Person=3|Tense=Pres 0 root 0:root _
Note these aren't proper XML tags but the simplified tag-with-value format expected by -V attributes. Then encode with flags
… -V s_id -V s_fulltext
(omitting a :0 recursion specifier). You don't have to bother with closing tags at the end of a sentence: the regions will automatically be closed by the next start tag.
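A minimal Python sketch of such a filter might look like the following. It assumes that only the "sent_id" and "text" comments are of interest and that they follow the "# key = value" pattern shown above; the names convert and TAG_FOR_KEY are made up for illustration. All other lines, including the blank lines that separate sentences and any other comments, are passed through unchanged.

```python
import re
import sys

# Assumption: only these two metadata keys are converted; the tag names
# are the ones used in the example above (s_id, s_fulltext).
TAG_FOR_KEY = {"sent_id": "s_id", "text": "s_fulltext"}

# Matches comment lines of the form "# key = value".
COMMENT = re.compile(r"^#\s*(\w+)\s*=\s*(.*)$")


def convert(lines):
    """Yield lines with selected metadata comments rewritten as <tag value>."""
    for line in lines:
        line = line.rstrip("\r\n")
        m = COMMENT.match(line)
        if m and m.group(1) in TAG_FOR_KEY:
            # Simplified tag-with-value format, no closing tag needed.
            yield "<%s %s>" % (TAG_FOR_KEY[m.group(1)], m.group(2).strip())
        else:
            # Token lines, blank lines and other comments pass through as-is.
            yield line


if __name__ == "__main__":
    # Process every file named on the command line and write to stdout.
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8") as fh:
            for out in convert(fh):
                print(out)
```

Saved under a name of your choice, the script takes one or more CoNLL-U files as arguments and writes the converted text to standard output, which can then be piped into cwb-encode with the -V flags given above. The sketch does not check whether a metadata value itself contains a ">" character, so you may want to handle that case for your data.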
Best,
Stefan