[CWB] Dealing with "malformed tag" error

Scott Sadowsky ssadowsky at gmail.com
Tue Jan 21 18:46:37 CET 2020


Thanks very much, Peter and Andrew. I do indeed use the XML encoding
through cwb-encode, and I knew that that processes tags correctly, but I
didn't know how extensively it handles entities. All clear now.
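
For anyone who finds this thread later, here is a minimal sketch of the escaping step described in the quoted replies below. It rests on an assumption about my own files that may not hold for yours: real XML tags never contain a tab, while token lines always have several tab-separated columns. The function names are made up for illustration and are not part of CWB. Ampersands are deliberately left alone, since the input may already contain entities.

```python
def escape_token_line(line):
    """Replace '<' with '&lt;' on token lines that begin with '<'.

    Assumption: a line that starts with '<' and contains a tab is a
    token line (e.g. FreeLing output), not an XML tag.
    """
    if not line.startswith("<") or "\t" not in line:
        return line  # real tags and ordinary tokens pass through unchanged
    body = line.rstrip("\r\n")
    ending = line[len(body):]  # keep the original line ending
    columns = [col.replace("<", "&lt;") for col in body.split("\t")]
    return "\t".join(columns) + ending


def escape_file(src_path, dst_path, encoding="utf-8"):
    """Write an escaped copy of src_path to dst_path; the source is not modified."""
    with open(src_path, encoding=encoding, newline="") as src, \
            open(dst_path, "w", encoding=encoding, newline="") as dst:
        for line in src:
            dst.write(escape_token_line(line))
```

Because the output goes to a separate file, the original tagged texts stay untouched. The escaped copy then needs to be encoded with the -x option of cwb-encode, as Andrew notes, so that the entity is read as a literal "<".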

Best wishes,
Scott

On Tue, Jan 21, 2020, 18:09 Hardie, Andrew <a.hardie at lancaster.ac.uk> wrote:

> Generally: cwb-encode will attempt to parse anything with a < at the start
> of the line as if it were an XML tag.  So yes, they need escaping.
>
>
>
> &lt; is the best way to do so, as Peter says. If you’re using cwb-encode
> directly, remember to use the -x option so that this will be properly
> interpreted. If you’re going via CQPweb, then -x is always switched on.
>
>
>
> best
>
>
>
> Andrew
>
>
>
> *From:* cwb-bounces at sslmit.unibo.it <cwb-bounces at sslmit.unibo.it> *On
> Behalf Of *Uhrig, Peter
> *Sent:* 21 January 2020 16:42
> *To:* Open source development of the Corpus WorkBench <cwb at sslmit.unibo.it>
> *Subject:* Re: [CWB] Dealing with "malformed tag" error
>
>
>
> Hi Scott,
>
>
>
> I recommend using &lt; here as an XML entity.
>
> See here:
> http://liste.sslmit.unibo.it/pipermail/cwb/2018-February/003072.html
>
>
>
> Best wishes,
>
> Peter
>
>
>
> *From:* cwb-bounces at sslmit.unibo.it <cwb-bounces at sslmit.unibo.it> *On
> Behalf Of *Scott Sadowsky
> *Sent:* Tuesday, 21 January 2020 16:54
> *To:* CWBdev Mailing List <cwb at sslmit.unibo.it>
> *Subject:* [CWB] Dealing with "malformed tag" error
>
>
>
> I'm trying to encode a very large corpus derived from very heterogeneous
> text files. I've solved most of the problems (e.g. multiple character
> encodings and the like), but there's one I'm not sure how to deal with.
>
>
>
> After tagging the texts with FreeLing I end up with a certain number of
> lines that are as follows:
>
>
>
> <     <     Fz     Fz     F     oth
>
>
>
> When compiling the corpus, CQP throws the following error for each such
> case:
>
>
>
> Malformed tag < <       Fz      Fz      F       oth, inserted literally
> (file ~/02-Tagged/0128716.xml, line #85)
>
>
>
> These cases seem to be from when writers got unduly creative with symbols,
> rather than from mathematical uses, so they're probably mostly expendable.
>
>
>
> What's the best way to handle cases like these? I could in theory
> eliminate them with a script before CQP tries to compile the corpus, but
> I'm loath to make destructive changes to text contents. So it would be
> good to know what effect leaving them in will have on the final corpus --
> will they interfere with CQP's corpus compilation process? For example,
> will they cause it to incorrectly determine where actual tags begin and
> end? Or are they basically harmless?
>