[CWB] empty element
Stefan Evert
stefanML at collocations.de
Mon May 11 17:15:16 CEST 2020
> I have problems with a <pause></pause> xml tag in a spoken corpus.
> If I run a query, for example to look at all the words following or preceding a pause, I get no results, both in CWB and in CQPweb. I guess that the problem is that it is an empty element, without any text inside the xml tags.
Exactly: CWB doesn't support empty XML elements; all s-attribute regions must enclose one or more tokens. And for good reason, as empty elements are a major pain in the corpus.
> How do you suggest to solve this problem?
BNCweb solves this problem by recording the empty tags that occur before the current token in a p-attribute on that token, either in XML notation, e.g.
<pause/><noise/>
or as a feature set
|noise|pause|
so it is easier to query for a specific tag, e.g. with
[tags_before contains "pause"]
In fact, BNCweb stores _all_ XML tags (not just empty ones) before and after the current position in two separate p-attributes, which makes it a lot easier to reconstruct the original XML markup in the context display.
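As a rough illustration of the feature-set variant, here is a minimal Python sketch that folds empty elements into an extra column on the following token before encoding. It is not BNCweb's actual code: the function name fold_empty_tags is made up, and it assumes verticalised input (one token per line, tab-separated annotations, every XML tag on its own line) in which no token starts with "<".

```python
import re

# An empty element on a line of its own: <pause/> or <pause></pause>
EMPTY_TAG = re.compile(r"^<(\w+)\s*/>$|^<(\w+)>\s*</\2>$")


def fold_empty_tags(lines):
    """Move empty XML elements into a feature-set column on the next token.

    Hypothetical helper, assuming verticalised input with tags on their
    own lines. Empty elements that are not followed by any token are dropped.
    """
    pending = set()  # names of empty tags seen since the last token
    out = []
    for raw in lines:
        line = raw.rstrip("\n")
        match = EMPTY_TAG.match(line.strip())
        if match:
            pending.add(match.group(1) or match.group(2))
            continue
        if line.startswith("<") or not line.strip():
            # Ordinary structural tag or blank line: pass through unchanged.
            out.append(line)
            continue
        # Token line: append the sorted feature set; "|" is the empty set.
        if pending:
            feature_set = "|" + "|".join(sorted(pending)) + "|"
        else:
            feature_set = "|"
        out.append(line + "\t" + feature_set)
        pending.clear()
    return out


if __name__ == "__main__":
    sample = ['<u who="A">', "well\tUH", "<pause/>", "<noise></noise>", "yes\tUH", "</u>"]
    print("\n".join(fold_empty_tags(sample)))
```

The new column would then have to be declared as a feature-set p-attribute when encoding (as far as I recall, with a trailing slash in cwb-encode, i.e. -P tags_before/), after which the contains query shown above should find the tokens following a pause. Note that in this sketch an empty tag at the very end of a region is attached to the first token of the next region.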
Best,
Stefan