[CWB] File format of encoded cwb corpora

Serge Heiden slh at ens-lyon.fr
Fri Jul 13 15:59:00 CEST 2012


Hi Ingmar,

I don't know of any formal specification of this format,
but you can find a description in the "Corpus Encoding
Tutorial: First Steps" by Stefan Evert:
http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/CWBTutorial/cwb-tutorial.pdf

But it is simple:
- a file encodes a whole corpus
- each word (token) is encoded on a single line
- each word property is encoded in its own column
- the column separator is the TAB character
- the first column is reserved for the word form property
- by default, property values are encoded in ISO-8859-1 (Latin-1); any
other encoding must be declared in the registry file or passed to cwb-encode
- corpus structures are encoded by XML-like tags
- each start tag is encoded on a line of its own
- each end tag is encoded on a line of its own
- every start tag must have a corresponding end tag (no XML milestone,
i.e. empty, elements)
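
To make the list above concrete, here is a minimal Python sketch that reads a tiny sample in this layout and checks that the tags are balanced. The sample text, the column choice (word form, part of speech, lemma) and the function names parse_vertical and check_balanced are my own illustration, not part of CWB. It also assumes, as a simplification, that any line starting with "<" and ending with ">" is a tag.

```python
from typing import Iterable, Iterator, List, Tuple, Union

# One token = one line = a list of column values, word form first.
Token = List[str]
Event = Tuple[str, Union[str, Token]]

# Hypothetical sample: three TAB-separated columns (word, pos, lemma),
# wrapped in XML-like start and end tags, each tag on its own line.
SAMPLE = (
    "<text>\n"
    "<s>\n"
    "The\tDT\tthe\n"
    "cat\tNN\tcat\n"
    "sleeps\tVBZ\tsleep\n"
    "</s>\n"
    "</text>\n"
)


def parse_vertical(text: str) -> Iterator[Event]:
    """Yield ("open", name), ("close", name) or ("token", columns) events."""
    for line in text.splitlines():
        if not line:
            continue  # skip empty lines
        if line.startswith("</") and line.endswith(">"):
            yield ("close", line[2:-1].strip())
        elif line.startswith("<") and line.endswith(">"):
            # Keep only the element name and drop any attributes.
            yield ("open", line[1:-1].split()[0])
        else:
            yield ("token", line.split("\t"))


def check_balanced(events: Iterable[Event]) -> bool:
    """Return True if every start tag has a matching, properly nested end tag."""
    stack: List[str] = []
    for kind, value in events:
        if kind == "open":
            stack.append(value)
        elif kind == "close":
            if not stack or stack.pop() != value:
                return False
    return not stack


events = list(parse_vertical(SAMPLE))
words = [value[0] for kind, value in events if kind == "token"]
print(words)
print(check_balanced(events))
```

The first column of every token line is the word form, so collecting column 0 gives back the running text.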

Each corpus file is closely associated with a 'registry file'
declaring all word properties, all structures
and their properties, etc. This file can be generated
by the cwb-encode tool.
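
As a rough sketch of how the registry file gets generated, the Python helper below only assembles a cwb-encode command line and does not run it. The option letters (-d data directory, -f input file, -R registry file, -c character set, -P word property, -S structure) are given as I remember them from the encoding tutorial, so please check them against "cwb-encode -h" for your version. All paths and property names are hypothetical, as is the helper name build_cwb_encode_args.

```python
from typing import List, Sequence


def build_cwb_encode_args(
    data_dir: str,
    input_file: str,
    registry_file: str,
    p_attrs: Sequence[str],
    s_attrs: Sequence[str],
    charset: str = "latin1",
) -> List[str]:
    """Assemble (but do not execute) a cwb-encode argument list.

    The word form in the first column needs no declaration; p_attrs names
    the remaining columns in order, s_attrs names the XML-like structures.
    """
    args = [
        "cwb-encode",
        "-d", data_dir,       # directory for the encoded corpus data
        "-f", input_file,     # the one-word-per-line input file
        "-R", registry_file,  # registry file to be generated
        "-c", charset,        # character encoding of the property values
    ]
    for name in p_attrs:
        args += ["-P", name]  # one extra column per word property
    for name in s_attrs:
        args += ["-S", name]  # one structure per XML-like element
    return args


# Hypothetical paths and names, for illustration only.
args = build_cwb_encode_args(
    "data/example",
    "example.vrt",
    "registry/example",
    p_attrs=["pos", "lemma"],
    s_attrs=["text", "s"],
)
print(" ".join(args))
```

The resulting list could be handed to subprocess.run once the options have been verified locally.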

Best,
--slh


On 13/07/2012 15:21, Ingmar Schuster wrote:
> Hi,
>
> is there any document describing the file format cwb uses for encoded
> corpora? If not, could somebody elaborate on it a bit?
>
> Yours
> Ingmar
>
> --
> Ingmar Schuster
> Natural Language Processing Group
> Department of Computer Science
> University of Leipzig
> Johannisgasse 26
> 04103 Leipzig, Germany
>
> Tel. +49 341 9732205
>
> http://asv.informatik.uni-leipzig.de/en/staff/Ingmar_Schuster
> _______________________________________________
> CWB mailing list
> CWB at sslmit.unibo.it
> http://devel.sslmit.unibo.it/mailman/listinfo/cwb

-- 
Dr. Serge Heiden, slh at ens-lyon.fr, http://textometrie.ens-lyon.fr
ENS de Lyon/CNRS - ICAR UMR5191, Institut de Linguistique Française
15, parvis René Descartes 69342 Lyon BP7000 Cedex, tél. +33(0)622003883




