[CWB] Need help importing CONLL-U files into CWB

Bruce McKee mckee2 at cornell.edu
Mon Jul 11 15:19:27 CEST 2022


Hello CWB experts,

We would like to import CoNLL-U formatted corpora into Corpus Workbench
v3.4.33, running under Ubuntu 20.04.4 LTS.  The CoNLL-U file is an
excerpt from a corpus processed with Stanford Stanza
<https://stanfordnlp.github.io/stanza/>.

We succeeded in encoding and indexing a small sample corpus, test.conllu,
but our CQP queries return no matches, even for words that occur in the
corpus.  See the details below and the attached test.conllu file.

We also noticed that adding the -N id option to cwb-encode triggers the
following error (as does replacing it with just -n):


Invalid input line [1       I       I       PRON    PRP     Case=Nom|Number=Sing|Person=1|PronType=Prs      3       nsubj   _       start_char=0|end_char=1], encoding aborted
[location of error: file test.conllu, line #4]

Thoughts on how we could resolve these problems?

Thanks!

--
Bruce McKee
Research Systems Consultant
System Administrator for the Phonetics & Computational Linguistics Lab
Department of Linguistics, Cornell University

--------------------------------------------------------------------------------------------------------------------------------------------------------

*====================================*
*Our Encoding and indexing commands*
*====================================*

DATA=/home/smith/cwb/data/test
REGISTRY=/home/smith/cwb/registry
INDEX=/home/smith/cwb/registry/test

mkdir $DATA

cwb-encode -f test.conllu -d $DATA -R $INDEX -c ascii -L s \
  -P lemma -P upos -P xpos -P feats -P head -P deprel -P deps -P misc

cwb-make -r $REGISTRY -V TEST
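
Before encoding, it may be worth checking that every token line in the input really has ten TAB-separated fields, since CoNLL-U specifies TABs as the field separator and a line with space-separated columns would be read as a single long field. The sketch below is only a diagnostic under that assumption; check_conllu is a hypothetical helper name and not part of CWB.

```shell
# Summarise the line types in a CoNLL-U file given as the first argument.
# Assumption: a well-formed token line has exactly 10 TAB-separated fields.
check_conllu() {
    awk -F'\t' '
        /^#/            { comments++; next }   # sentence-level comment lines
        $0 ~ /^[ \t]*$/ { blanks++;   next }   # blank sentence separators
        NF == 10        { ok++;       next }   # well-formed token lines
                        { bad++; if (!first_bad) first_bad = NR }
        END {
            # first_bad is the line number of the first malformed token line
            # (0 if there is none)
            printf "comments=%d blanks=%d ok=%d bad=%d first_bad=%d\n",
                   comments, blanks, ok, bad, first_bad
        }
    ' "$1"
}
```

Running check_conllu test.conllu prints one summary line. For comparison, the listing in this message contains 62 token lines and 7 comment lines, and 62 + 7 equals the corpus size of 69 reported by cwb-describe-corpus, which might indicate that the comment lines were encoded as tokens.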

*====================================*
*Corpus Description command*
*====================================*

cwb-describe-corpus -r $REGISTRY TEST

============================================================
Corpus: TEST
============================================================

description:
registry file:  /home/smith/cwb/registry/test
home directory: /home/smith/cwb/data/test/
info file:      /home/smith/cwb/data/test/.info
encoding:       ascii
size (tokens):  69

  9 positional attributes:
      word            lemma           upos            xpos
      feats           head            deprel          deps
      misc

  1 structural attributes:
      s

  0 alignment  attributes:


*===============================================================*
*cqp word searches (default cqp startup commands are in the .cqprc file)*
*===============================================================*

$ cqp -e
System corpora:
 E: EXAMPLE
 T: TEST
[no corpus]> TEST;
TEST> info;
Size:    69
Charset: ascii
Properties:
        language = '??'
        charset = 'ascii'

No further information available about TEST
TEST> show cd;
===Context Descriptor=======================================

left context:     25 characters
right context:    25 characters
corpus position:  shown
target anchors:   not shown

Positional Attributes:  * word
                          lemma
                          upos
                          xpos
                          feats
                          head
                          deprel
                          deps
                          misc

Structural Attributes:    s

Aligned Corpora:          <none>

============================================================
TEST> "same"
0 matches.
TEST> "getting"
0 matches.
TEST> [ lemma="the" ];
0 matches.
TEST>

*===========================================================*
*Test file test.conllu (also attached to this e-mail)*
*===========================================================*

# newdoc id = pcc_eng_test_1.0001_x00002
# sent_id = pcc_eng_test_1.0001_x00002_1
# text = I'm getting about the same thing trying to update "tf" (team fortress 2) on Ubuntu 7.10 (just updated it yesterday).
1       I       I       PRON    PRP     Case=Nom|Number=Sing|Person=1|PronType=Prs      3       nsubj   _       start_char=0|end_char=1
2       'm      be      AUX     VBP     Mood=Ind|Tense=Pres|VerbForm=Fin        3       aux     _       start_char=1|end_char=3
3       getting get     VERB    VBG     Tense=Pres|VerbForm=Part        0       root    _       start_char=4|end_char=11
4       about   about   ADV     RB      _       7       advmod  _       start_char=12|end_char=17
5       the     the     DET     DT      Definite=Def|PronType=Art       7       det     _       start_char=18|end_char=21
6       same    same    ADJ     JJ      Degree=Pos      7       amod    _       start_char=22|end_char=26
7       thing   thing   NOUN    NN      Number=Sing     3       obj     _       start_char=27|end_char=32
8       trying  try     VERB    VBG     VerbForm=Ger    7       acl     _       start_char=33|end_char=39
9       to      to      PART    TO      _       10      mark    _       start_char=40|end_char=42
10      update  update  VERB    VB      VerbForm=Inf    8       xcomp   _       start_char=43|end_char=49
11      "       "       PUNCT   ``      _       12      punct   _       start_char=50|end_char=51
12      tf      tf      NOUN    NN      Number=Sing     10      obj     _       start_char=51|end_char=53
13      "       "       PUNCT   ''      _       12      punct   _       start_char=53|end_char=54
14      (       (       PUNCT   -LRB-   _       16      punct   _       start_char=55|end_char=56
15      team    team    NOUN    NN      Number=Sing     16      compound        _       start_char=56|end_char=60
16      fortress        fortress        NOUN    NN      Number=Sing     12      appos   _       start_char=61|end_char=69
17      2       2       NUM     LS      NumType=Card    16      nummod  _       start_char=70|end_char=71
18      )       )       PUNCT   -RRB-   _       16      punct   _       start_char=71|end_char=72
19      on      on      ADP     IN      _       20      case    _       start_char=73|end_char=75
20      Ubuntu  Ubuntu  PROPN   NNP     Number=Sing     10      obl     _       start_char=76|end_char=82
21      7.10    7.10    NUM     CD      NumType=Card    20      nummod  _       start_char=83|end_char=87
22      (       (       PUNCT   -LRB-   _       24      punct   _       start_char=88|end_char=89
23      just    just    ADV     RB      _       24      advmod  _       start_char=89|end_char=93
24      updated update  VERB    VBD     Tense=Past|VerbForm=Part        3       parataxis       _       start_char=94|end_char=101
25      it      it      PRON    PRP     Case=Acc|Gender=Neut|Number=Sing|Person=3|PronType=Prs  24      obj     _       start_char=102|end_char=104
26      yesterday       yesterday       NOUN    NN      Number=Sing     24      obl:tmod        _       start_char=105|end_char=114
27      )       )       PUNCT   -RRB-   _       24      punct   _       start_char=114|end_char=115
28      .       .       PUNCT   .       _       3       punct   _       start_char=115|end_char=116

# sent_id = pcc_eng_test_1.0001_x00002_2
# text = DSL connection near Seattle, WA.
1       DSL     dsl     NOUN    NN      Number=Sing     2       compound        _       start_char=117|end_char=120
2       connection      connection      NOUN    NN      Number=Sing     0       root    _       start_char=121|end_char=131
3       near    near    ADP     IN      _       4       case    _       start_char=132|end_char=136
4       Seattle Seattle PROPN   NNP     Number=Sing     2       nmod    _       start_char=137|end_char=144
5       ,       ,       PUNCT   ,       _       4       punct   _       start_char=144|end_char=145
6       WA      WA      PROPN   NNP     Number=Sing     4       appos   _       start_char=146|end_char=148
7       .       .       PUNCT   .       _       2       punct   _       start_char=148|end_char=149


# sent_id = pcc_eng_test_1.0001_x00002_3
# text = Come to think of it, might have been "Connection Closed", I'll have to check when I'm home in 10 hours.
1       Come    come    VERB    VB      Mood=Imp|VerbForm=Fin   0       root    _       start_char=150|end_char=154
2       to      to      PART    TO      _       3       mark    _       start_char=155|end_char=157
3       think   think   VERB    VB      VerbForm=Inf    1       xcomp   _       start_char=158|end_char=163
4       of      of      ADP     IN      _       5       case    _       start_char=164|end_char=166
5       it      it      PRON    PRP     Case=Acc|Gender=Neut|Number=Sing|Person=3|PronType=Prs  3       obl     _       start_char=167|end_char=169
6       ,       ,       PUNCT   ,       _       11      punct   _       start_char=169|end_char=170
7       might   might   AUX     MD      VerbForm=Fin    11      aux     _       start_char=171|end_char=176
8       have    have    AUX     VB      VerbForm=Inf    11      aux     _       start_char=177|end_char=181
9       been    be      AUX     VBN     Tense=Past|VerbForm=Part        11      cop     _       start_char=182|end_char=186
10      "       "       PUNCT   ``      _       11      punct   _       start_char=187|end_char=188
11      Connection      connection      NOUN    NN      Number=Sing     1       parataxis       _       start_char=188|end_char=198
12      Closed  close   VERB    VBN     Tense=Past|VerbForm=Part        11      acl     _       start_char=199|end_char=205
13      "       "       PUNCT   ''      _       11      punct   _       start_char=205|end_char=206
14      ,       ,       PUNCT   ,       _       1       punct   _       start_char=206|end_char=207
15      I       I       PRON    PRP     Case=Nom|Number=Sing|Person=1|PronType=Prs      17      nsubj   _       start_char=208|end_char=209
16      'll     will    AUX     MD      VerbForm=Fin    17      aux     _       start_char=209|end_char=212
17      have    have    VERB    VB      VerbForm=Inf    1       parataxis       _       start_char=213|end_char=217
18      to      to      PART    TO      _       19      mark    _       start_char=218|end_char=220
19      check   check   VERB    VB      VerbForm=Inf    17      xcomp   _       start_char=221|end_char=226
20      when    when    SCONJ   WRB     PronType=Int    23      mark    _       start_char=227|end_char=231
21      I       I       PRON    PRP     Case=Nom|Number=Sing|Person=1|PronType=Prs      23      nsubj   _       start_char=232|end_char=233
22      'm      be      AUX     VBP     Mood=Ind|Tense=Pres|VerbForm=Fin        23      cop     _       start_char=233|end_char=235
23      home    home    ADV     RB      _       19      advcl   _       start_char=236|end_char=240
24      in      in      ADP     IN      _       26      case    _       start_char=241|end_char=243
25      10      10      NUM     CD      NumType=Card    26      nummod  _       start_char=244|end_char=246
26      hours   hour    NOUN    NNS     Number=Plur     23      obl     _       start_char=247|end_char=252
27      .       .       PUNCT   .       _       1       punct   _       start_char=252|end_char=253
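
If cwb-encode's own handling of the ID column and comment lines keeps failing, one possible fallback is to preprocess the file into a plain one-token-per-line format with explicit sentence tags. This is a minimal sketch and not a CWB feature: conllu_to_vrt is a hypothetical helper name, the input is assumed to be TAB-separated with the token ID in column 1, and multiword-token ranges or empty nodes are not treated specially.

```shell
# Convert CoNLL-U on stdin to a verticalised text on stdout.
# Assumptions: fields are TAB-separated; column 1 is the token ID.
conllu_to_vrt() {
    awk -F'\t' -v OFS='\t' '
        /^#/ { next }                            # drop comment lines
        $0 ~ /^[ \t]*$/ {                        # a blank line ends a sentence
            if (in_s) { print "</s>"; in_s = 0 }
            next
        }
        {
            if (!in_s) { print "<s>"; in_s = 1 } # open a sentence if needed
            # emit columns 2..NF, i.e. everything except the ID
            line = $2
            for (i = 3; i <= NF; i++) line = line OFS $i
            print line
        }
        END { if (in_s) print "</s>" }           # close the last sentence
    '
}
```

For example, conllu_to_vrt < test.conllu > test.vrt would write nine columns per token (FORM through MISC), matching the nine positional attributes declared above. Because the sentence boundaries are then explicit tags, the s attribute would presumably be declared with -S s rather than -L s when encoding test.vrt.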
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test.conllu
Type: application/octet-stream
Size: 7323 bytes
Desc: not available
URL: <http://liste.sslmit.unibo.it/pipermail/cwb/attachments/20220711/be717763/attachment-0001.obj>