[CWB] Dealing with "malformed tag" error

Scott Sadowsky ssadowsky at gmail.com
Tue Jan 21 16:53:37 CET 2020


I'm trying to encode a very large corpus derived from very heterogeneous
text files. I've solved most of the problems (e.g. multiple character
encodings and the like), but there's one I'm not sure how to deal with.

After tagging the texts with FreeLing I end up with a certain number of
lines that are as follows:

<     <     Fz     Fz     F     oth

When compiling the corpus, CQP throws the following error for each such
case:

Malformed tag < <       Fz      Fz      F       oth, inserted literally
(file ~/02-Tagged/0128716.xml, line #85)

These cases seem to be from when writers got unduly creative with symbols,
rather than from mathematical uses, so they're probably mostly expendable.

What's the best way to handle cases like these? I could in theory eliminate
them with a script before CQP tries to compile the corpus, but I'm loath
to make destructive changes to text contents. So it would be good to know
what effect leaving them in will have on the final corpus -- will they
interfere with CQP's corpus compilation process? For example, will they
cause it to incorrectly determine where actual tags begin and end? Or are
they basically harmless?
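
For concreteness, here is a rough sketch of the kind of pre-processing
script I have in mind. Rather than deleting the offending lines, it escapes
the "<" characters on them as "&lt;" and leaves everything else untouched.
The function name escape_token_line and the TAG_RE pattern are my own
hypothetical choices, not anything from CWB, and the pattern is only a
heuristic for "this line looks like a real tag". It also assumes that the
encoder is run in an XML-aware mode that turns "&lt;" back into "<", which
I have not verified.

```python
import re

# Heuristic (an assumption, not CWB's own rule): a line counts as a real
# structural tag only if the whole line looks like <name ...>, </name>
# or <name ... />.
TAG_RE = re.compile(r"^</?[A-Za-z_][\w.:-]*(\s[^<>]*)?/?>\s*$")


def escape_token_line(line):
    """Escape "<" on token lines that start with "<" but are not tags.

    Real tags and ordinary token lines are returned unchanged.
    """
    if not line.startswith("<") or TAG_RE.match(line):
        return line
    # Assumes "&lt;" is decoded back to "<" at encoding time.
    return line.replace("<", "&lt;")


if __name__ == "__main__":
    import sys

    # Filter: read a tagged file on stdin, write the escaped version to stdout.
    for raw_line in sys.stdin:
        sys.stdout.write(escape_token_line(raw_line))
```

Used as a filter over each tagged file, this would rewrite only the
problem lines (the Fz example above becomes "&lt;", "&lt;", "Fz", ... in its
columns), so the change is reversible and the original symbol is still
recoverable from the data.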

Thanks,
Scott