[CWB] Linking CWB and R
Hardie, Andrew
a.hardie at lancaster.ac.uk
Thu Nov 24 19:30:10 CET 2011
>> If a layer of glue in C has to be added, don't you think this effort might be better invested in linking the two directly, without the stream protocol, in some way?
Alas, the issues that need to be solved in order to make CQP queries available as a C library are much, much greater than just writing a glue layer. The problem is that the query language and concordance generation features of CQP are bound up inextricably with the CQP interactive environment. This is an architectural mistake, and we intend to rectify it. But it is a very big job, because it involves refactoring the internals of CQP. If you have ever looked at the CQP code, you will know this is not something to be undertaken lightly. Even understanding how it does what it does is a tall order. We do fully intend to tackle it, but it's not something that can be done quickly (especially as there is a whole suite of much more urgent bugs and feature requests).
For more info, see the roadmap here:
http://cwb.svn.sourceforge.net/viewvc/cwb/cwb/trunk/doc/todo-3.5
best
Andrew.
-----Original Message-----
From: cwb-bounces at sslmit.unibo.it [mailto:cwb-bounces at sslmit.unibo.it] On Behalf Of Sylvain Loiseau
Sent: 24 November 2011 16:02
To: Stefan Evert; Open source development of the Corpus WorkBench; Bernard Desgraupes
Subject: Re: [CWB] Linking CWB and R
On 24 Nov 2011 at 16:13, Stefan Evert wrote:
>
>> There is no reason why you couldn't create R functions that call CWB libraries. The existing CL (corpus library) is designed for exactly such undertakings. In the long term, we hope to separate out other functionality like the CQP query syntax into libraries that could be accessed in the same way. There's a lot of work to get to that point, however!
>
> And we don't know when we'll get around to making such fundamental changes, so it may be a better idea -- at least for the time being -- to implement a CQi client that communicates with CQPserver.
>
> Is that what you've done in your implementation, Sylvain? Or did you write your own client-server protocol?
I wrote an R implementation that simply mimics the Perl implementation.
It's available here:
https://r-forge.r-project.org/projects/rcwb/
(at the bottom of the page, "SCM repository".)
But I'm afraid it is full of bugs and not very clean or efficient.
If you source these three files:
> source("client.R")
> source("constantes.R")
> source("server.R")
you can interact with the cqpserver using the CQi protocol:
> con <- get_cwb(server_options="-r /path/to/your/registry")
> cqi_attributes("YOUR_CORPUS", "p", con) # ask for the positional attributes, returned as a character vector
[1] "word" "pos" "func" "lemma" "id"
The first command (get_cwb) launches the cqpserver.
The rest of the files in the same directory try to define a higher-level set of CWB objects (corpus, attribute...), but I don't think the result is satisfactory so far. The file test.R shows how these objects are used.
> I've been reluctant about CQi recently since it was quickly cobbled together as an ad-hoc solution and has never been revised properly; and I'm not using it in my own research because the CWB/Perl interface is faster and more flexible. However, there does seem to be increasing interest, especially for using CWB from Java, and some people I talked to seemed to be quite happy with the current state of the CQi.
>
> I'd very much like a CQi client for R, preferably with a few higher-level wrappers so you don't always have to execute low-level CQi calls.
> The biggest hurdle, I guess, is that the code for encoding and decoding the byte stream protocol should be written in C if we want to achieve reasonable speed.
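To make the encoding/decoding step above concrete, here is a minimal C sketch of the kind of helpers such a client would need. It assumes (please check this against the CQi specification) that integers travel in network byte order and that strings are sent as a 16-bit length followed by the raw bytes. The function names are invented for illustration and are not part of CWB or of any existing CQi client.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers, not part of CWB. Assumed wire format:
   big-endian integers, strings as 16-bit length + bytes. */

/* Write a 16-bit value in big-endian order; returns bytes written. */
static size_t put_u16(uint8_t *buf, uint16_t v) {
    buf[0] = (uint8_t)(v >> 8);
    buf[1] = (uint8_t)(v & 0xFF);
    return 2;
}

/* Write a 32-bit value in big-endian order; returns bytes written. */
static size_t put_u32(uint8_t *buf, uint32_t v) {
    buf[0] = (uint8_t)(v >> 24);
    buf[1] = (uint8_t)((v >> 16) & 0xFF);
    buf[2] = (uint8_t)((v >> 8) & 0xFF);
    buf[3] = (uint8_t)(v & 0xFF);
    return 4;
}

/* Read a big-endian 16-bit value. */
static uint16_t get_u16(const uint8_t *buf) {
    return (uint16_t)(((unsigned)buf[0] << 8) | (unsigned)buf[1]);
}

/* Read a big-endian 32-bit value. */
static uint32_t get_u32(const uint8_t *buf) {
    return ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
           ((uint32_t)buf[2] << 8)  |  (uint32_t)buf[3];
}

/* Encode s as length + bytes into buf (capacity cap).
   Returns bytes written, or 0 if s is too long or buf too small. */
static size_t put_string(uint8_t *buf, size_t cap, const char *s) {
    size_t len = strlen(s);
    if (len > 0xFFFF || cap < len + 2) return 0;
    put_u16(buf, (uint16_t)len);
    memcpy(buf + 2, s, len);
    return len + 2;
}

/* Decode a length-prefixed string from buf (avail bytes readable) into
   out (capacity out_cap), adding a terminating NUL.
   Returns bytes consumed, or 0 on a short buffer. */
static size_t get_string(const uint8_t *buf, size_t avail,
                         char *out, size_t out_cap) {
    size_t len;
    if (avail < 2) return 0;
    len = get_u16(buf);
    if (avail < len + 2 || out_cap < len + 1) return 0;
    memcpy(out, buf + 2, len);
    out[len] = '\0';
    return len + 2;
}
```

Doing this per element in interpreted R code is where the time would go; in C each value is a handful of byte operations, and a whole vector can be decoded in one call before it is handed back to R.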
>> Before going further, I would be very interested in your comments and opinions about another project: linking CWB and R through calls to a CWB C library. A CWB library could be linked to an R module (and automatically installed with this module). Rather than being communicated one by one via a socket, the vector elements produced by CWB would be represented in C, with a light extension to the existing code, and wrapped in an R vector. Such an R vector would simply give access to the original data, without copying any data structure.
>
> I'm not an expert on R hacking, but I don't think you're allowed to do this. R manages its own memory, and when you create an R vector from a list of integers or strings returned by CWB, R will make a copy of the vector. Anything that gives direct access to internal CWB data would probably require very advanced R hacking skills.
If the CWB code can be packaged, extended with some linking code, and the whole compiled into a library, it may not be that hard (?).
A C array in a library can be wrapped so that R code using the library sees it as an R vector; this is what I infer from: http://cran.r-project.org/doc/manuals/R-exts.html#Interface-functions-_002eC-and-_002eFortran
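As a rough sketch of what such linking code could look like with the .C interface: every argument arrives as a pointer to a buffer that R has allocated, and the C side fills it in. The array and the function names below are stand-ins invented for illustration (nothing here is existing CWB code). Note that in this form the data are copied into R's buffer, which is the behaviour Stefan describes, rather than shared.

```c
#include <string.h>

/* Stand-in for data owned by a CWB library (hypothetical values). */
static const int cwb_positions[] = {3, 17, 42, 108};
static const int cwb_n_positions = 4;

/* .C-style entry point: report how many elements are available,
   so that R can allocate a vector of the right length. */
void rcwb_n_positions(int *n) {
    *n = cwb_n_positions;
}

/* .C-style entry point: copy at most *n elements into the buffer
   provided by R, and set *n to the number actually copied. */
void rcwb_copy_positions(int *out, int *n) {
    int count = (*n < cwb_n_positions) ? *n : cwb_n_positions;
    if (count < 0) count = 0;
    memcpy(out, cwb_positions, (size_t)count * sizeof(int));
    *n = count;
}
```

From R, once the shared library is loaded, the call would be along the lines of .C("rcwb_copy_positions", out = integer(4), n = as.integer(4))$out. Exposing the C array itself without a copy is not something .C offers as far as I can tell, and would need a different mechanism.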
Best,
Sylvain