[CWB] Dealing with "malformed tag" error

Tue Jan 21 18:09:43 CET 2020

Generally: cwb-encode will attempt to parse anything with a < at the start of the line as if it were an XML tag.  So yes, they need escaping.

&lt; is the best way to do so, as Peter says. If you’re using cwb-encode directly, remember to use the -x option so that this will be properly interpreted. If you’re going via CQPweb, then -x is always switched on.
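
A minimal sketch of the escaping step, to be run over the tagged files before encoding. The function name escape_token_line and the TAG_LINE pattern are illustrative assumptions: the pattern is a rough heuristic for "this line is a genuine XML tag", not a copy of cwb-encode's own parser, so check it against the tags that actually occur in your data.

```python
import re

# Heuristic (an assumption, not cwb-encode's actual rule): a line is treated
# as a structural tag if it consists of exactly one element such as
# <name ...>, </name> or <name .../>.
TAG_LINE = re.compile(r'^</?[A-Za-z_][\w.:-]*(\s[^<>]*)?/?>\s*$')


def escape_token_line(line: str) -> str:
    """Replace '<' with '&lt;' in lines that start with '<' but are not tags.

    Lines that do not start with '<', and lines that look like real tags,
    are returned unchanged.
    """
    if line.startswith("<") and not TAG_LINE.match(line):
        return line.replace("<", "&lt;")
    return line
```

Applied to the FreeLing line from the original question, this turns both < tokens into &lt; and leaves the other columns alone, while a line such as <text id="1"> passes through untouched. The output still needs to be encoded with -x so that &lt; is read back as <.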

best

Andrew

From: cwb-bounces at sslmit.unibo.it <cwb-bounces at sslmit.unibo.it> On Behalf Of Uhrig, Peter
Sent: 21 January 2020 16:42
To: Open source development of the Corpus WorkBench <cwb at sslmit.unibo.it>
Subject: Re: [CWB] Dealing with "malformed tag" error

Hi Scott,

I recommend using &lt; here as an XML entity.
See here: http://liste.sslmit.unibo.it/pipermail/cwb/2018-February/003072.html

Best wishes,
Peter

From: cwb-bounces at sslmit.unibo.it <cwb-bounces at sslmit.unibo.it> On Behalf Of Scott Sadowsky
Sent: Tuesday, 21 January 2020 16:54
To: CWBdev Mailing List <cwb at sslmit.unibo.it>
Subject: [CWB] Dealing with "malformed tag" error

I'm trying to encode a very large corpus derived from very heterogeneous text files. I've solved most of the problems (e.g. multiple character encodings and the like), but there's one I'm not sure how to deal with.

After tagging the texts with FreeLing I end up with a certain number of lines that are as follows:

<     <     Fz     Fz     F     oth

When compiling the corpus, CQP throws the following error for each such case:

Malformed tag < <       Fz      Fz      F       oth, inserted literally (file ~/02-Tagged/0128716.xml, line #85)

These cases seem to be from when writers got unduly creative with symbols, rather than from mathematical uses, so they're probably mostly expendable.

What's the best way to handle cases like these? I could in theory eliminate them with a script before CQP tries to compile the corpus, but I'm loath to make destructive changes to text contents. So it would be good to know what effect leaving them in will have on the final corpus -- will they interfere with CQP's corpus compilation process? For example, will they cause it to incorrectly determine where actual tags begin and end? Or are they basically harmless?

Thanks,
Scott
