[CWB] File format of encoded cwb corpora
Serge Heiden
slh at ens-lyon.fr
Fri Jul 13 17:33:28 CEST 2012
It would be nice to produce documents
like the ones Stefan Evert wrote for the NXT Search engine:
http://www.ims.uni-stuttgart.de/projekte/nite
A) a CQP object model motivating a detailed description of the index file
architecture
(like the schema on p. 14 of the "CQP Corpus Administrator's Manual", but
with real file names to begin with)
Like this document:
Formal specification of the NITE Object Model, the abstract data model
used by the NITE XML Toolkit.
-> http://www.ltg.ed.ac.uk/NITE/documents/NiteObjectModel.v2.1.pdf
B) a CQL formal specification
Like this document:
Formal specification of NiteQL, the query language that operates over
data conforming to the NITE Object Model.
-> http://www.ltg.ed.ac.uk/NITE/documents/NiteQL.v2.1.pdf
I once started a list of all the CQL syntax features I know of
in a Google Doc, but it hasn't evolved into something readable:
https://docs.google.com/document/d/1rz39LixYl6uegx35kIj6JLYbMPEOsy2ycg4JuCBZ68Y/edit?hl=fr&pli=1
Best,
Serge
On 13/07/2012 16:37, Hardie, Andrew wrote:
> Yes, definitely a good start, and we might be able to pinch some bits from there for a full document, but it's not complete (a-attributes in particular are only partially documented), it's not in a terribly clear order, and some aspects, e.g. its emphasis on fiddling with the binary files using Unix utilities, are very out of date, so caveat lector!
>
> Andrew.
>
> -----Original Message-----
> From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Serge Heiden
> Sent: 13 July 2012 15:29
> To: cwb at sslmit.unibo.it
> Subject: Re: [CWB] File format of encoded cwb corpora
>
> For the various index files of CQP, to start I would recommend:
> IMS Corpus Workbench "CQP Corpus Administrator's Manual", Oliver Christ, Universität Stuttgart, Institut für maschinelle Sprachverarbeitung, 1994 (see p. 14 for a partial overview of the index architecture). A copy is available here:
> http://txm.sourceforge.net/doc/cwb/technical-manual.pdf
>
> --slh
>
>
> On 13/07/2012 16:08, Hardie, Andrew wrote:
>> -----Original Message-----
>> From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it]
>> On Behalf Of Stefan Evert
>>
>>>>>> There's no formal specification of the precise file format
>> Arguably there should be, however, especially if we need to change it and thus have to deal with format versioning. Moreover, having obtained (and read) a copy of the "Managing Gigabytes" book, I personally don't think the book alone adequately documents the technical details of the binary format: for a full understanding of how CWB does it, the book has to be read alongside the indexing code.
>>
>> Yet another thing for the TODO list!
>>
>> Andrew.
>> _______________________________________________
>> CWB mailing list
>> CWB at sslmit.unibo.it
>> http://devel.sslmit.unibo.it/mailman/listinfo/cwb
> --
> Dr. Serge Heiden, slh at ens-lyon.fr, http://textometrie.ens-lyon.fr ENS de Lyon/CNRS - ICAR UMR5191, Institut de Linguistique Française 15, parvis René Descartes 69342 Lyon BP7000 Cedex, tél. +33(0)622003883
>
--
Dr. Serge Heiden, slh at ens-lyon.fr, http://textometrie.ens-lyon.fr
ENS de Lyon/CNRS - ICAR UMR5191, Institut de Linguistique Française
15, parvis René Descartes 69342 Lyon BP7000 Cedex, tél. +33(0)622003883