[CWB] empty element

Stefan Evert stefanML at collocations.de
Mon May 11 17:15:16 CEST 2020


> I have problems with a <pause></pause> xml tag in a spoken corpus.
> If I run a query, for example to look at all the words following or preceding a pause, I get no results, both in CWB and in CQPweb. I guess that the problem is that it is an empty element, without any text inside the xml tags.

Exactly: CWB doesn't support empty XML elements; every s-attribute region must enclose one or more tokens.  And for good reason, as empty elements are a major pain in the corpus.

> How do you suggest to solve this problem?

BNCweb solves this problem by encoding the empty tags that occur immediately before a token in a p-attribute of that token, either in XML notation, e.g.

	<pause/><noise/>

or as a feature set

	|noise|pause|

so it is easier to query for a specific tag, e.g. with

	[tags_before contains "pause"]
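A preprocessing step along these lines could be sketched as follows. This is not BNCweb's actual code, just a minimal illustration under some assumptions: the input is in one-token-per-line (vertical) format with each XML tag on a line of its own, the function name fold_empty_tags is made up for this example, and the new column is appended as the last p-attribute in feature-set notation (sorted, with "|" for the empty set).

```python
import re

# Matches a self-closing tag such as <pause/> or <noise type="cough"/>.
EMPTY_TAG = re.compile(r"^<([A-Za-z_][\w.-]*)(\s[^>]*)?/>$")
# Matches an explicitly empty pair such as <pause></pause>.
EMPTY_PAIR = re.compile(r"^<([A-Za-z_][\w.-]*)(\s[^>]*)?></\1>$")


def fold_empty_tags(lines):
    """Move empty XML elements into a feature-set column on the next token.

    Hypothetical helper: assumes vertical format with tags on their own lines.
    Simplifications: any other line of the form <...> is treated as a
    structural tag, and empty tags with no following token are dropped.
    """
    pending = set()  # names of empty tags seen since the last token
    out = []
    for line in lines:
        stripped = line.strip()
        match = EMPTY_TAG.match(stripped) or EMPTY_PAIR.match(stripped)
        if match:
            pending.add(match.group(1))
        elif stripped.startswith("<") and stripped.endswith(">"):
            out.append(stripped)  # ordinary start or end tag, passed through
        elif stripped:
            # Feature-set notation: |a|b| with sorted members, | if empty.
            features = "|" + "".join(name + "|" for name in sorted(pending))
            out.append(stripped + "\t" + features)
            pending.clear()
    return out


if __name__ == "__main__":
    sample = ['<u who="A">', "well", "<pause/>", "<noise/>", "yes", "</u>"]
    print("\n".join(fold_empty_tags(sample)))
```

The extra column then has to be declared as a feature-set p-attribute when the corpus is encoded (as far as I know, cwb-encode marks this with a trailing slash on the attribute name, e.g. -P tags_before/), so that the contains operator works in queries like the one above.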

In fact, BNCweb stores _all_ XML tags (not just empty ones) before and after the current position in two separate p-attributes, which makes it a lot easier to reconstruct the original XML markup in the context display.

Best,
Stefan


