[CWB] Sketch Grammars and macros
Stefan Evert
stefanML at collocations.de
Fri Jan 31 08:21:00 CET 2020
[This seems to have been blocked by the mailing list server. Sorry if you get multiple copies.]
> On 29 Jan 2020, at 18:31, Hardie, Andrew <a.hardie at lancaster.ac.uk> wrote:
>
> Yes, and it’s the same way – see http://cwb.sourceforge.net/files/CQP_Tutorial/node25.html
> though you might find it easier to use, e.g. node: coll: than numbers.
Keep in mind, though, that these labels can only be used _within_ the query itself (and must be used, so you can't add them as comments to explain your query).
> Or, alternatively, you can use the target / keyword anchor points. http://cwb.sourceforge.net/files/CQP_Tutorial/node19.html
If you want to extract syntactic collocation data, this is the way to go. If you have a really recent version of CWB (v3.4.19), then you can mark two positions in the query with @0 and @1, which will become the target and keyword anchors in the query result. (Prior versions only allow a single position to be marked with @.)
Here are some useful tricks if this turns out to be too limited:
1) You can use each target marker (@0, @1) multiple times in a query, which makes sense if they are in alternative branches of your query. E.g. to extract both verb-subject and verb-object collocations in a single query:
@0 [pos="N.*"] … @1 [pos="V.*"] | @1 [pos="V.*"] … @0 [pos="N.*"] ;
If a single match encounters multiple instances of the same marker, only the most recent one will be retained.
2) Use zero-width assertions to mark the start and end of an s-attribute region or a parenthesized subquery (esp. with alternatives):
… @0 [::] ( … | … | … ) @1 [::] … ;
But don't put such an empty assertion at the start of your query if you want it to complete in finite time. :-)
3) You don't need target markers at the start or end of the query, because the match and matchend anchors will always be set implicitly. Sometimes queries need to extend a little beyond the token of interest, but if it is at a fixed distance from the start or end, you can adjust for that by using offsets when exporting data with tabulate. E.g.
… [pos = "V.*"] "\?" </s> ;
doesn't need a marker on the verb: just use matchend[-1].
4) If you _really_ need more than two marked positions inside the query, you can use additional markers @2 … @9. However, only _two_ of them can be active at the same time, controlled by "set AnchorNumberTarget …" and "set AnchorNumberKeyword …". By re-running the same query several times with different settings of these options, you can build a table with the positions of all markers.
We use this strategy in our own work, supported by a small Python wrapper script. If you're sure that the matches of your query cannot overlap, you can speed up the process by executing the follow-up runs as anchored subqueries (but you need to know what you're doing).
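To make tricks 3) and 4) a bit more concrete, here is a minimal Python sketch of the bookkeeping such a wrapper might do. It is not the script mentioned above, and all function names are hypothetical. It only builds command strings and merges already-parsed results; it does not talk to CQP. Assumptions: tabulate accepts "anchor[offset] attribute" column specs and prints the bare corpus position when no attribute is given, the two options take a plain marker number as value, the query result is named Q, and an anchor that was not set shows up as -1 in the tabulate output.

```python
def tabulate_command(query_name, columns):
    """Build a tabulate command from (anchor, offset, attribute) triples.

    An attribute of None asks for the bare corpus position (assumed syntax).
    """
    specs = []
    for anchor, offset, attribute in columns:
        position = anchor if offset == 0 else f"{anchor}[{offset}]"
        specs.append(position if attribute is None else f"{position} {attribute}")
    return f"tabulate {query_name} " + ", ".join(specs) + ";"


def marker_pairs(markers):
    """Split marker numbers into (target, keyword) pairs, one pair per run.

    With an odd number of markers, the last one is paired with the first.
    """
    pairs = []
    for i in range(0, len(markers), 2):
        target = markers[i]
        keyword = markers[i + 1] if i + 1 < len(markers) else markers[0]
        pairs.append((target, keyword))
    return pairs


def anchor_settings(target, keyword):
    """Option names come from the text above; the value syntax is an assumption."""
    return f"set AnchorNumberTarget {target}; set AnchorNumberKeyword {keyword};"


def merge_runs(runs):
    """Merge several runs of the same query into one table of marker positions.

    runs: list of ((target_marker, keyword_marker), rows), where each row is
    (match, matchend, target, keyword) as corpus positions; -1 means "not set".
    Returns {(match, matchend): {marker_number: corpus_position}}.
    """
    table = {}
    for (target_marker, keyword_marker), rows in runs:
        for match, matchend, target, keyword in rows:
            entry = table.setdefault((match, matchend), {})
            if target >= 0:
                entry[target_marker] = target
            if keyword >= 0:
                entry[keyword_marker] = keyword
    return table


# Trick 3: export the token just before the end of the match.
print(tabulate_command("Q", [("matchend", -1, "word")]))

# Trick 4: one run per pair of markers, then merge on (match, matchend).
for target, keyword in marker_pairs([0, 1, 2]):
    print(anchor_settings(target, keyword))
    print(tabulate_command("Q", [("match", 0, None), ("matchend", 0, None),
                                 ("target", 0, None), ("keyword", 0, None)]))
```

Merging on (match, matchend) relies on every run returning exactly the same matches, which should hold as long as only the two anchor options change between runs.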
Most of this is explained in Sec. 8.6 of the CQP Tutorial – again, you may need to check out the latest version from the SVN repository.
Best,
Stefan
>
> By the way, syntactic collocates is something I’ve wanted to add to CQPweb for a while. I have put it on hold for the moment because it will be easier with the Ziggurat engine available because then dependency parser output will be indexable.
>