[CWB] Need help importing CONLL-U files into CWB

Hardie, Andrew a.hardie at lancaster.ac.uk
Wed Jan 11 11:56:42 CET 2023


Hi Bruce,

(advance general note: I believe Bruce's request is the last of the messages to this list that were on my email backlog, but if anyone else has asked a question that my catchup operation has overlooked, do feel free to ask again.)

There are a few different things going on here.

First, the input file that you attached to your query is not in the correct format for either CWB or CoNLL-U, because the columns are delimited with spaces rather than TABs. This is probably due to a setting in your text editor that saves tabs as spaces. Nothing will work with this file, because the whole line will be encoded as a single column (the "word" column) and all the other columns treated as empty strings. That would also explain why your CQP searches find nothing: each "word" value is an entire input line, so a query such as "same" cannot match it.
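As a rough illustration of how the tabs might be restored, here is a small shell sketch using awk. It assumes that no field in the file contains an internal space, which holds for the sample in this thread but is not guaranteed for CoNLL-U in general. The function names and the output file name are made up for the example.

```shell
# Hypothetical helper: rebuild token lines with single TABs between fields.
# Comment lines (starting with "#") and blank lines are passed through unchanged.
# Assumption: no field contains an internal space.
conllu_spaces_to_tabs() {
  awk 'BEGIN { OFS = "\t" }
       /^#/ || NF == 0 { print; next }
       { $1 = $1; print }'
}

# Hypothetical helper: report token lines that do not have exactly 10
# TAB-separated columns; the exit status is non-zero if any are found.
conllu_check_columns() {
  awk -F '\t' '
       !/^#/ && NF > 0 && NF != 10 { printf "line %d: %d column(s)\n", NR, NF; bad = 1 }
       END { exit (bad ? 1 : 0) }'
}

# Usage (file names are examples only):
#   conllu_spaces_to_tabs < test.conllu > test.tabs.conllu
#   conllu_check_columns  < test.tabs.conllu
```

Running the column check on the original file first should flag every token line, which is a quick way to confirm that the delimiters are the problem.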

Second, indexing with this command

>> cwb-encode -f test.conllu -d $DATA -R $INDEX -c ascii -L s -P lemma -P upos -P xpos -P feats -P head -P deprel -P deps -P misc

won't work because no instruction has been included to treat the first column as an ID number. So the first column will be indexed as "word" which is not correct for your purposes.

In short, if you have an ID number column, you must use either -n or -N.

The reason encoding failed for you is that using -n or -N is effectively a "promise" to cwb-encode that the first column will contain only the digits 0-9. Because spaces have replaced the tabs (as noted above), the whole line is read as a single column, which does not meet that requirement. Hence the error message on the first token line.
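For illustration only, the encoding step might then look something like the sketch below. It is the command from the original message with -n added, wrapped in a function, and it assumes the input has already been converted to TAB-delimited form. The function name and the CWB_ENCODE variable are hypothetical conveniences rather than part of CWB, and it is worth checking the option list against cwb-encode -h for your installed version.

```shell
# Paths taken from the original message; adjust to your own installation.
DATA=/home/smith/cwb/data/test
INDEX=/home/smith/cwb/registry/test

# CWB_ENCODE is a hypothetical override (not a CWB feature) so that the
# encoder binary can be swapped out, e.g. for a dry run.
CWB_ENCODE=${CWB_ENCODE:-cwb-encode}

# Hypothetical wrapper: $1 is a TAB-delimited CoNLL-U file.
# -n tells cwb-encode that the first column holds token numbers, so the
# second column becomes "word" and the -P attributes cover columns 3 to 10.
encode_conllu() {
  "$CWB_ENCODE" -f "$1" -d "$DATA" -R "$INDEX" -c ascii -L s -n \
    -P lemma -P upos -P xpos -P feats -P head -P deprel -P deps -P misc
}

# Usage (file name is an example only):
#   encode_conllu test.tabs.conllu
```

Replacing -n with -N id, as in the original message, is the alternative form mentioned above.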

I hope this helps, although I realise that (to put it mildly) a reply 6 months later probably is not what you had hoped for.

best

Andrew.


From: cwb-bounces at sslmit.unibo.it <cwb-bounces at sslmit.unibo.it> On Behalf Of Bruce McKee
Sent: 11 July 2022 14:19
To: cwb at sslmit.unibo.it
Subject: [CWB] Need help importing CONLL-U files into CWB

Hello CWB experts;

We would like to bring CONLL-U formatted corpora into Corpus Workbench v3.4.33, running under Ubuntu 20.04.4 LTS.  The CONLL-U file is an excerpt from a Stanford STANZA<https://stanfordnlp.github.io/stanza/>-processed corpus.

We succeeded in encoding & indexing a small sample corpus test.conllu, but our cqp searches are not finding words.  See the details below and the attached test.conllu file.

We also noticed that with cwb-encode, the -N id option triggers the following error (as does replacing it with just -n):

Invalid input line [1       I       I       PRON    PRP     Case=Nom|Number=Sing|Person=1|PronType=Prs      3       nsubj   _       start_char=0|end_char=1], encoding aborted
[location of error: file test.conllu, line #4]

Thoughts on how we could resolve these problems?

Thanks!

--
Bruce McKee
Research Systems Consultant
System Administrator for the Phonetics & Computational Linguistics Lab
Department of Linguistics, Cornell University

--------------------------------------------------------------------------------------------------------------------------------------------------------

====================================
Our Encoding and indexing commands
====================================

DATA=/home/smith/cwb/data/test
REGISTRY=/home/smith/cwb/registry
INDEX=/home/smith/cwb/registry/test

mkdir $DATA

cwb-encode -f test.conllu -d $DATA -R $INDEX -c ascii -L s -P lemma -P upos -P xpos -P feats -P head -P deprel -P deps -P misc

cwb-make -r $REGISTRY -V TEST

====================================
Corpus Description command
====================================

cwb-describe-corpus -r $REGISTRY TEST

============================================================
Corpus: TEST
============================================================

description:
registry file:  /home/smith/cwb/registry/test
home directory: /home/smith/cwb/data/test/
info file:      /home/smith/cwb/data/test/.info
encoding:       ascii
size (tokens):  69

  9 positional attributes:
      word            lemma           upos            xpos
      feats           head            deprel          deps
      misc

  1 structural attributes:
      s

  0 alignment  attributes:


===============================================================
cqp word searches (default cqp startup commands are in the .cqprc file)
===============================================================

$ cqp -e
System corpora:
 E: EXAMPLE
 T: TEST
[no corpus]> TEST;
TEST> info;
Size:    69
Charset: ascii
Properties:
        language = '??'
        charset = 'ascii'

No further information available about TEST
TEST> show cd;
===Context Descriptor=======================================

left context:     25 characters
right context:    25 characters
corpus position:  shown
target anchors:   not shown

Positional Attributes:  * word
                          lemma
                          upos
                          xpos
                          feats
                          head
                          deprel
                          deps
                          misc

Structural Attributes:    s

Aligned Corpora:          <none>

============================================================
TEST> "same"
0 matches.
TEST> "getting"
0 matches.
TEST> [ lemma="the" ];
0 matches.
TEST>

===========================================================
Test file test.conllu (also attached to this e-mail)
===========================================================

# newdoc id = pcc_eng_test_1.0001_x00002
# sent_id = pcc_eng_test_1.0001_x00002_1
# text = I'm getting about the same thing trying to update "tf" (team fortress 2) on Ubuntu 7.10 (just updated it yesterday).
1       I       I       PRON    PRP     Case=Nom|Number=Sing|Person=1|PronType=Prs      3       nsubj   _       start_char=0|end_char=1
2       'm      be      AUX     VBP     Mood=Ind|Tense=Pres|VerbForm=Fin        3       aux     _       start_char=1|end_char=3
3       getting get     VERB    VBG     Tense=Pres|VerbForm=Part        0       root    _       start_char=4|end_char=11
4       about   about   ADV     RB      _       7       advmod  _       start_char=12|end_char=17
5       the     the     DET     DT      Definite=Def|PronType=Art       7       det     _       start_char=18|end_char=21
6       same    same    ADJ     JJ      Degree=Pos      7       amod    _       start_char=22|end_char=26
7       thing   thing   NOUN    NN      Number=Sing     3       obj     _       start_char=27|end_char=32
8       trying  try     VERB    VBG     VerbForm=Ger    7       acl     _       start_char=33|end_char=39
9       to      to      PART    TO      _       10      mark    _       start_char=40|end_char=42
10      update  update  VERB    VB      VerbForm=Inf    8       xcomp   _       start_char=43|end_char=49
11      "       "       PUNCT   ``      _       12      punct   _       start_char=50|end_char=51
12      tf      tf      NOUN    NN      Number=Sing     10      obj     _       start_char=51|end_char=53
13      "       "       PUNCT   ''      _       12      punct   _       start_char=53|end_char=54
14      (       (       PUNCT   -LRB-   _       16      punct   _       start_char=55|end_char=56
15      team    team    NOUN    NN      Number=Sing     16      compound        _       start_char=56|end_char=60
16      fortress        fortress        NOUN    NN      Number=Sing     12      appos   _       start_char=61|end_char=69
17      2       2       NUM     LS      NumType=Card    16      nummod  _       start_char=70|end_char=71
18      )       )       PUNCT   -RRB-   _       16      punct   _       start_char=71|end_char=72
19      on      on      ADP     IN      _       20      case    _       start_char=73|end_char=75
20      Ubuntu  Ubuntu  PROPN   NNP     Number=Sing     10      obl     _       start_char=76|end_char=82
21      7.10    7.10    NUM     CD      NumType=Card    20      nummod  _       start_char=83|end_char=87
22      (       (       PUNCT   -LRB-   _       24      punct   _       start_char=88|end_char=89
23      just    just    ADV     RB      _       24      advmod  _       start_char=89|end_char=93
24      updated update  VERB    VBD     Tense=Past|VerbForm=Part        3       parataxis       _       start_char=94|end_char=101
25      it      it      PRON    PRP     Case=Acc|Gender=Neut|Number=Sing|Person=3|PronType=Prs  24      obj     _       start_char=102|end_char=104
26      yesterday       yesterday       NOUN    NN      Number=Sing     24      obl:tmod        _       start_char=105|end_char=114
27      )       )       PUNCT   -RRB-   _       24      punct   _       start_char=114|end_char=115
28      .       .       PUNCT   .       _       3       punct   _       start_char=115|end_char=116

# sent_id = pcc_eng_test_1.0001_x00002_2
# text = DSL connection near Seattle, WA.
1       DSL     dsl     NOUN    NN      Number=Sing     2       compound        _       start_char=117|end_char=120
2       connection      connection      NOUN    NN      Number=Sing     0       root    _       start_char=121|end_char=131
3       near    near    ADP     IN      _       4       case    _       start_char=132|end_char=136
4       Seattle Seattle PROPN   NNP     Number=Sing     2       nmod    _       start_char=137|end_char=144
5       ,       ,       PUNCT   ,       _       4       punct   _       start_char=144|end_char=145
6       WA      WA      PROPN   NNP     Number=Sing     4       appos   _       start_char=146|end_char=148
7       .       .       PUNCT   .       _       2       punct   _       start_char=148|end_char=149


# sent_id = pcc_eng_test_1.0001_x00002_3
# text = Come to think of it, might have been "Connection Closed", I'll have to check when I'm home in 10 hours.
1       Come    come    VERB    VB      Mood=Imp|VerbForm=Fin   0       root    _       start_char=150|end_char=154
2       to      to      PART    TO      _       3       mark    _       start_char=155|end_char=157
3       think   think   VERB    VB      VerbForm=Inf    1       xcomp   _       start_char=158|end_char=163
4       of      of      ADP     IN      _       5       case    _       start_char=164|end_char=166
5       it      it      PRON    PRP     Case=Acc|Gender=Neut|Number=Sing|Person=3|PronType=Prs  3       obl     _       start_char=167|end_char=169
6       ,       ,       PUNCT   ,       _       11      punct   _       start_char=169|end_char=170
7       might   might   AUX     MD      VerbForm=Fin    11      aux     _       start_char=171|end_char=176
8       have    have    AUX     VB      VerbForm=Inf    11      aux     _       start_char=177|end_char=181
9       been    be      AUX     VBN     Tense=Past|VerbForm=Part        11      cop     _       start_char=182|end_char=186
10      "       "       PUNCT   ``      _       11      punct   _       start_char=187|end_char=188
11      Connection      connection      NOUN    NN      Number=Sing     1       parataxis       _       start_char=188|end_char=198
12      Closed  close   VERB    VBN     Tense=Past|VerbForm=Part        11      acl     _       start_char=199|end_char=205
13      "       "       PUNCT   ''      _       11      punct   _       start_char=205|end_char=206
14      ,       ,       PUNCT   ,       _       1       punct   _       start_char=206|end_char=207
15      I       I       PRON    PRP     Case=Nom|Number=Sing|Person=1|PronType=Prs      17      nsubj   _       start_char=208|end_char=209
16      'll     will    AUX     MD      VerbForm=Fin    17      aux     _       start_char=209|end_char=212
17      have    have    VERB    VB      VerbForm=Inf    1       parataxis       _       start_char=213|end_char=217
18      to      to      PART    TO      _       19      mark    _       start_char=218|end_char=220
19      check   check   VERB    VB      VerbForm=Inf    17      xcomp   _       start_char=221|end_char=226
20      when    when    SCONJ   WRB     PronType=Int    23      mark    _       start_char=227|end_char=231
21      I       I       PRON    PRP     Case=Nom|Number=Sing|Person=1|PronType=Prs      23      nsubj   _       start_char=232|end_char=233
22      'm      be      AUX     VBP     Mood=Ind|Tense=Pres|VerbForm=Fin        23      cop     _       start_char=233|end_char=235
23      home    home    ADV     RB      _       19      advcl   _       start_char=236|end_char=240
24      in      in      ADP     IN      _       26      case    _       start_char=241|end_char=243
25      10      10      NUM     CD      NumType=Card    26      nummod  _       start_char=244|end_char=246
26      hours   hour    NOUN    NNS     Number=Plur     23      obl     _       start_char=247|end_char=252
27      .       .       PUNCT   .       _       1       punct   _       start_char=252|end_char=253






