[CWB] Need help importing CONLL-U files into CWB
Bruce McKee
mckee2 at cornell.edu
Mon Jul 11 15:19:27 CEST 2022
Hello CWB experts,
We would like to import CoNLL-U formatted corpora into Corpus Workbench
v3.4.33, running under Ubuntu 20.04.4 LTS. The CoNLL-U file is an
excerpt from a corpus processed with Stanford Stanza
<https://stanfordnlp.github.io/stanza/>.
We succeeded in encoding and indexing a small sample corpus, test.conllu, but
our CQP searches return no matches, even for words that occur in the file.
See the details below and the attached test.conllu file.
We also noticed that with cwb-encode, the -N id option triggers the
following error (as does replacing it with just -n):
Invalid input line [1 I I PRON PRP Case=Nom|Number=Sing|Person=1|PronType=Prs 3 nsubj _ start_char=0|end_char=1], encoding aborted[location of error: file test.conllu, line #4]
Thoughts on how we could resolve these problems?
Thanks!
--
Bruce McKee
Research Systems Consultant
System Administrator for the Phonetics & Computational Linguistics Lab
Department of Linguistics, Cornell University
--------------------------------------------------------------------------------------------------------------------------------------------------------
====================================
Our encoding and indexing commands
====================================
DATA=/home/smith/cwb/data/test
REGISTRY=/home/smith/cwb/registry
INDEX=/home/smith/cwb/registry/test
mkdir $DATA
cwb-encode -f test.conllu -d $DATA -R $INDEX -c ascii -L s -P lemma -P upos \
    -P xpos -P feats -P head -P deprel -P deps -P misc
cwb-make -r $REGISTRY -V TEST
====================================
Corpus description command
====================================
cwb-describe-corpus -r $REGISTRY TEST
============================================================
Corpus: TEST
============================================================
description:
registry file: /home/smith/cwb/registry/test
home directory: /home/smith/cwb/data/test/
info file: /home/smith/cwb/data/test/.info
encoding: ascii
size (tokens): 69
9 positional attributes:
word lemma upos xpos
feats head deprel deps
misc
1 structural attributes:
s
0 alignment attributes:
===============================================================
cqp word searches (default cqp startup commands are in the .cqprc file)
===============================================================
$ cqp -e
System corpora:
E: EXAMPLE
T: TEST
[no corpus]> TEST;
TEST> info;
Size: 69
Charset: ascii
Properties:
language = '??'
charset = 'ascii'
No further information available about TEST
TEST> show cd;
===Context Descriptor=======================================
left context: 25 characters
right context: 25 characters
corpus position: shown
target anchors: not shown
Positional Attributes: * word
lemma
upos
xpos
feats
head
deprel
deps
misc
Structural Attributes: s
Aligned Corpora: <none>
============================================================
TEST> "same"
0 matches.
TEST> "getting"
0 matches.
TEST> [ lemma="the" ];
0 matches.
TEST>
===========================================================
Test file test.conllu (also attached to this e-mail)
===========================================================
# newdoc id = pcc_eng_test_1.0001_x00002
# sent_id = pcc_eng_test_1.0001_x00002_1
# text = I'm getting about the same thing trying to update "tf" (team fortress 2) on Ubuntu 7.10 (just updated it yesterday).
1 I I PRON PRP Case=Nom|Number=Sing|Person=1|PronType=Prs 3 nsubj _ start_char=0|end_char=1
2 'm be AUX VBP Mood=Ind|Tense=Pres|VerbForm=Fin 3 aux _ start_char=1|end_char=3
3 getting get VERB VBG Tense=Pres|VerbForm=Part 0 root _ start_char=4|end_char=11
4 about about ADV RB _ 7 advmod _ start_char=12|end_char=17
5 the the DET DT Definite=Def|PronType=Art 7 det _ start_char=18|end_char=21
6 same same ADJ JJ Degree=Pos 7 amod _ start_char=22|end_char=26
7 thing thing NOUN NN Number=Sing 3 obj _ start_char=27|end_char=32
8 trying try VERB VBG VerbForm=Ger 7 acl _ start_char=33|end_char=39
9 to to PART TO _ 10 mark _ start_char=40|end_char=42
10 update update VERB VB VerbForm=Inf 8 xcomp _ start_char=43|end_char=49
11 " " PUNCT `` _ 12 punct _ start_char=50|end_char=51
12 tf tf NOUN NN Number=Sing 10 obj _ start_char=51|end_char=53
13 " " PUNCT '' _ 12 punct _ start_char=53|end_char=54
14 ( ( PUNCT -LRB- _ 16 punct _ start_char=55|end_char=56
15 team team NOUN NN Number=Sing 16 compound _ start_char=56|end_char=60
16 fortress fortress NOUN NN Number=Sing 12 appos _ start_char=61|end_char=69
17 2 2 NUM LS NumType=Card 16 nummod _ start_char=70|end_char=71
18 ) ) PUNCT -RRB- _ 16 punct _ start_char=71|end_char=72
19 on on ADP IN _ 20 case _ start_char=73|end_char=75
20 Ubuntu Ubuntu PROPN NNP Number=Sing 10 obl _ start_char=76|end_char=82
21 7.10 7.10 NUM CD NumType=Card 20 nummod _ start_char=83|end_char=87
22 ( ( PUNCT -LRB- _ 24 punct _ start_char=88|end_char=89
23 just just ADV RB _ 24 advmod _ start_char=89|end_char=93
24 updated update VERB VBD Tense=Past|VerbForm=Part 3 parataxis _ start_char=94|end_char=101
25 it it PRON PRP Case=Acc|Gender=Neut|Number=Sing|Person=3|PronType=Prs 24 obj _ start_char=102|end_char=104
26 yesterday yesterday NOUN NN Number=Sing 24 obl:tmod _ start_char=105|end_char=114
27 ) ) PUNCT -RRB- _ 24 punct _ start_char=114|end_char=115
28 . . PUNCT . _ 3 punct _ start_char=115|end_char=116
# sent_id = pcc_eng_test_1.0001_x00002_2
# text = DSL connection near Seattle, WA.
1 DSL dsl NOUN NN Number=Sing 2 compound _ start_char=117|end_char=120
2 connection connection NOUN NN Number=Sing 0 root _ start_char=121|end_char=131
3 near near ADP IN _ 4 case _ start_char=132|end_char=136
4 Seattle Seattle PROPN NNP Number=Sing 2 nmod _ start_char=137|end_char=144
5 , , PUNCT , _ 4 punct _ start_char=144|end_char=145
6 WA WA PROPN NNP Number=Sing 4 appos _ start_char=146|end_char=148
7 . . PUNCT . _ 2 punct _ start_char=148|end_char=149
# sent_id = pcc_eng_test_1.0001_x00002_3
# text = Come to think of it, might have been "Connection Closed", I'll have to check when I'm home in 10 hours.
1 Come come VERB VB Mood=Imp|VerbForm=Fin 0 root _ start_char=150|end_char=154
2 to to PART TO _ 3 mark _ start_char=155|end_char=157
3 think think VERB VB VerbForm=Inf 1 xcomp _ start_char=158|end_char=163
4 of of ADP IN _ 5 case _ start_char=164|end_char=166
5 it it PRON PRP Case=Acc|Gender=Neut|Number=Sing|Person=3|PronType=Prs 3 obl _ start_char=167|end_char=169
6 , , PUNCT , _ 11 punct _ start_char=169|end_char=170
7 might might AUX MD VerbForm=Fin 11 aux _ start_char=171|end_char=176
8 have have AUX VB VerbForm=Inf 11 aux _ start_char=177|end_char=181
9 been be AUX VBN Tense=Past|VerbForm=Part 11 cop _ start_char=182|end_char=186
10 " " PUNCT `` _ 11 punct _ start_char=187|end_char=188
11 Connection connection NOUN NN Number=Sing 1 parataxis _ start_char=188|end_char=198
12 Closed close VERB VBN Tense=Past|VerbForm=Part 11 acl _ start_char=199|end_char=205
13 " " PUNCT '' _ 11 punct _ start_char=205|end_char=206
14 , , PUNCT , _ 1 punct _ start_char=206|end_char=207
15 I I PRON PRP Case=Nom|Number=Sing|Person=1|PronType=Prs 17 nsubj _ start_char=208|end_char=209
16 'll will AUX MD VerbForm=Fin 17 aux _ start_char=209|end_char=212
17 have have VERB VB VerbForm=Inf 1 parataxis _ start_char=213|end_char=217
18 to to PART TO _ 19 mark _ start_char=218|end_char=220
19 check check VERB VB VerbForm=Inf 17 xcomp _ start_char=221|end_char=226
20 when when SCONJ WRB PronType=Int 23 mark _ start_char=227|end_char=231
21 I I PRON PRP Case=Nom|Number=Sing|Person=1|PronType=Prs 23 nsubj _ start_char=232|end_char=233
22 'm be AUX VBP Mood=Ind|Tense=Pres|VerbForm=Fin 23 cop _ start_char=233|end_char=235
23 home home ADV RB _ 19 advcl _ start_char=236|end_char=240
24 in in ADP IN _ 26 case _ start_char=241|end_char=243
25 10 10 NUM CD NumType=Card 26 nummod _ start_char=244|end_char=246
26 hours hour NOUN NNS Number=Plur 23 obl _ start_char=247|end_char=252
27 . . PUNCT . _ 1 punct _ start_char=252|end_char=253
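A listing in a mail archive cannot show whether the fields of each token line are separated by tabs or by spaces, and the CoNLL-U format expects exactly 10 tab-separated fields per token line. The sketch below is one possible way to check the file on disk. It assumes a POSIX shell with awk; conllu_summary is a hypothetical helper name, not a CWB tool. It counts comment lines, blank lines, and token lines with and without exactly 10 tab-separated fields, and reports each irregular line on stderr.

```shell
# Hypothetical helper (not part of CWB): summarise the line types in a CoNLL-U file.
# Usage: conllu_summary FILE
conllu_summary() {
    awk -F'\t' '
        /^#/         { comments++; next }   # comment lines such as "# sent_id = ..."
        /^[ \t\r]*$/ { blanks++; next }     # blank lines separate sentences
        NF == 10     { ok++; next }         # token line with 10 tab-separated fields
                     { bad++                # any other field count is suspicious
                       printf("line %d: %d field(s)\n", NR, NF) > "/dev/stderr" }
        END { printf("comments=%d blanks=%d ok=%d bad=%d\n", comments, blanks, ok, bad) }
    ' "$1"
}
```

It could be run as conllu_summary test.conllu. As a point of comparison, the test file shown above contains 62 token lines and 7 "#" comment lines (counting each wrapped line once), which adds up to the 69 tokens reported by cwb-describe-corpus. That may mean the comment lines were encoded as tokens, though this is only a guess from the counts.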
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test.conllu
Type: application/octet-stream
Size: 7323 bytes
Desc: not available
URL: <http://liste.sslmit.unibo.it/pipermail/cwb/attachments/20220711/be717763/attachment-0001.obj>