[CWB] Dealing with "malformed tag" error

Tue Jan 21 17:41:40 CET 2020

Hi Scott,

I recommend replacing the literal < with the XML entity &lt; here.
See here: http://liste.sslmit.unibo.it/pipermail/cwb/2018-February/003072.html
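
A minimal sketch of that replacement as a preprocessing step, under stated assumptions: the input is a tab-separated, one-token-per-line file like the FreeLing output quoted below, the encoder is run in a mode that decodes XML entities back to the original character, and any line starting with < that does not look like a tag is a token line. The TAG_LINE pattern and the function names are hypothetical helpers for illustration, not cwb-encode's own rules.

```python
import re

# Heuristic for a structural line such as <s>, </s> or <text id="x">.
# This pattern is an assumption for illustration, not cwb-encode's own rule.
TAG_LINE = re.compile(r'^</?[A-Za-z_][\w.:-]*(\s[^<>]*)?/?>\s*$')


def escape_token_line(line: str) -> str:
    """Escape '<' as '&lt;' on token lines that would be mistaken for tags."""
    if not line.startswith("<"):
        return line  # ordinary token line, leave untouched
    if TAG_LINE.match(line):
        return line  # looks like a real structural tag, leave untouched
    # Token line whose first column begins with '<': escape every '<' in it
    # so the word and lemma columns stay consistent.
    return line.replace("<", "&lt;")


def escape_stream(lines):
    """Apply escape_token_line to each line of an iterable of lines."""
    for line in lines:
        yield escape_token_line(line)
```

Only the lines that currently trigger the warning are changed; the replacement is reversible, so the text content itself is not lost.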

Best wishes,
Peter

From: cwb-bounces at sslmit.unibo.it <cwb-bounces at sslmit.unibo.it> On Behalf Of Scott Sadowsky
Sent: Tuesday, 21 January 2020 16:54
To: CWBdev Mailing List <cwb at sslmit.unibo.it>
Subject: [CWB] Dealing with "malformed tag" error

I'm trying to encode a very large corpus derived from very heterogeneous text files. I've solved most of the problems (e.g. multiple character encodings and the like), but there's one I'm not sure how to deal with.

After tagging the texts with FreeLing I end up with a certain number of lines that are as follows:

<     <     Fz     Fz     F     oth

When compiling the corpus, CQP throws the following error for each such case:

Malformed tag < <       Fz      Fz      F       oth, inserted literally (file ~/02-Tagged/0128716.xml, line #85)

These cases seem to be from when writers got unduly creative with symbols, rather than from mathematical uses, so they're probably mostly expendable.

What's the best way to handle cases like these? I could in theory eliminate them with a script before CQP tries to compile the corpus, but I'm loath to make destructive changes to text contents. So it would be good to know what effect leaving them in will have on the final corpus -- will they interfere with CQP's corpus compilation process? For example, will they cause it to incorrectly determine where actual tags begin and end? Or are they basically harmless?

Thanks,
Scott
