[CWB] Help with CWB under linux
Hardie, Andrew
a.hardie at lancaster.ac.uk
Mon Nov 30 19:36:26 CET 2009
Gassan,
It looks suspiciously as if the entire line is being encoded as a single p-attribute value rather than as three separate p-attributes, because of a problem in the input format: you appear to be using spaces to delimit the columns. The fields on each line must be separated by a single tab character, with no spaces. CWB treats spaces as part of the word.
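This would also explain the query results you report. As a rough illustration only (it uses Python's re module as a stand-in for CQP, on the assumption that CQP matches a regular expression against the whole token; the helper name token_matches is made up for this sketch):

```python
import re


def token_matches(pattern: str, token: str) -> bool:
    """Approximate CQP-style matching: the pattern must cover the whole token."""
    return re.fullmatch(pattern, token) is not None


# With space-delimited input, each whole line ends up as one "word" token.
tokens_as_encoded = ["as PRP as", "a AT0 a", "and CJC and", "part NN1 part"]

# "a.*" matches any token that starts with "a", including whole lines.
hits_prefix = [t for t in tokens_as_encoded if token_matches("a.*", t)]

# "a" matches only a token that is exactly "a", and no such token exists.
hits_exact = [t for t in tokens_as_encoded if token_matches("a", t)]

# With tab-delimited input, the word attribute would hold just the first field.
tokens_expected = ["as", "a", "and", "part"]
hits_exact_fixed = [t for t in tokens_expected if token_matches("a", t)]

print(hits_prefix)
print(hits_exact)
print(hits_exact_fixed)
```

So "a.*" finds the three lines beginning with "a", while "a" finds nothing until the word field is encoded on its own.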
In other words, you need
volunteersTABNN2TABvolunteer
or, in regex-style, volunteers\tNN2\tvolunteer
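If the file is already written with spaces, something along these lines could convert it. This is only a sketch under some assumptions: no token contains a space itself, every line starting with "<" is an XML tag to be left alone, and the function names are invented for the example.

```python
def spaces_to_tabs(line: str) -> str:
    """Join the whitespace-separated fields of a token line with single tabs.

    Assumption: lines starting with "<" are XML tags and are kept unchanged
    (apart from stripping surrounding whitespace); blank lines stay blank.
    """
    stripped = line.strip()
    if not stripped or stripped.startswith("<"):
        return stripped
    return "\t".join(stripped.split())


def convert(text: str) -> str:
    """Convert a whole space-delimited input text to tab-delimited form."""
    return "\n".join(spaces_to_tabs(line) for line in text.splitlines()) + "\n"


sample = """<text id="http://www.foo.org/index.html">
<s>
volunteers NN2 volunteer
help NN1-VVB help
</s>
</text>
"""

converted = convert(sample)
print(converted)
```

The converted text can then be written to a new file and passed to cwb-encode in place of the original.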
hope that helps!
best
Andrew.
________________________________
From: cwb-bounces at sslmit.unibo.it on behalf of Gassan Tabajah
Sent: Mon 30/11/2009 17:44
To: 'Open source development of the Corpus WorkBench'
Cc: 'Itai Alon'
Subject: RE: [CWB] Help with CWB under linux
Hi Serge,
My input format looks like this:
<corpus>
<text id="http://www.foo.org/index.html">
<s>
volunteers NN2 volunteer
work VVB work
as PRP as
part NN1 part
of PRF of
a AT0 a
team NN1 team
and CJC and
provide VVB provide
help NN1-VVB help
</s>
</text>
</corpus>
I used the following commands under the bin directory:
$ cwb-encode -d /usr/local/mycorpus -f filename.xml -R /usr/local/share/cwb/registry/mycorpus -P pos -P lemma -V text -S s -S corpus
$ cwb-makeall -V MYCORPUS
Then I ran cqp -e -> MYCORPUS
When I entered a regular expression like "a.*", I got the following output:
MYCORPUS> "a.*";
2: teer work VVB work <as PRP as> part NN1 part of
5: part of PRF of <a AT0 a> team NN1 team and
7: a team NN1 team <and CJC and> provide VVB provide
But when I tried something simple like "a", I got no matches:
MYCORPUS> "a";
0 matches.
I don't exactly understand why I got these results. Do you have any ideas?
What should the output of 'cwb-decode' be? Do you have an example of how to use
it?
(BTW, I am using cwb-2.2.b99-RC1 version under Cygwin).
Regards,
Ghassan Tabajah
SoftWare Engineer - Mila Center
Computer Science Faculty -Technion
Room 644, Tel: (829) 3969
-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On
Behalf Of Serge HEIDEN
Sent: Monday, November 30, 2009 6:58 PM
To: Open source development of the Corpus WorkBench
Subject: Re: [CWB] Help with CWB under linux
Dear Ghassan,
From: "Gassan Tabajah" <gtabajah at cs.technion.ac.il>
>> Also I noticed that the following files under "mycorpus" directory:
>> lemma.corpus, pos.corpus, word.corpus includes only <nul>'s (Is that
>> an error !?)
Yes, this is an error.
Try to use the 'cwb-decode' tool to decode your indexes independently
of using them from 'cqp'.
It seems that your 'cwb-encode' or 'cwb-makeall' process had a problem.
Are you sure of your input format? Do you have an excerpt of it?
Best,
Serge
--
Dr. Serge Heiden, slh at ens-lsh.fr, http://textometrie.ens-lsh.fr
ENS-LSH/CNRS - ICAR UMR5191, Institut de Linguistique Française
15, parvis René Descartes 69342 Lyon BP7000 Cedex, tél. +33(0)622003883
_______________________________________________
CWB mailing list
CWB at sslmit.unibo.it
http://devel.sslmit.unibo.it/mailman/listinfo/cwb