Thanks a lot. I just made a few small adjustments and it is now fixing the corpora. You saved me a
lot of time, thanks again.<br /><br />On Mon, January 27, 2014 23:25, Stefan Evert wrote:<br
/>> <br />>> Is there any easy way to transform the metadata format for the WaCky
corpora so that they<br />>> can be used with the CQPweb interface? We are trying to
install a few of these corpora but I<br />>> have problems with some of the headings.<br
/>> <br />> This is not a problem of the WaCky corpora in general. Most of them are
provided in a format<br />> that's directly CWB-compatible. Only sdeWaC has this different
and nonstandard format.<br />> <br />> That's also why I happen to have a script named
"fix_sdewac_tagged.perl" on my computer. :-)<br />> <br />> I'm attaching a
ZIP archive with this script as well as the CWB/Perl encoding script (and a<br />> second
script that annotates sentence lengths). What you have to do is extract the<br />>
"web_address_list.txt" from the 7z archive (or download it separately), then run<br
/>> "extract_sdewac_tagged.sh". If you want to keep it in UTF-8 encoding or
process an<br />> uncompressed version of the corpus, you'll have to edit the scripts
accordingly.<br />> <br />> Hope this helps,<br />> Stefan<br />> <br />> <br
/><br /><br /><br />_______________________<br
/> andrés chandía<br /><a target="_blank" href="http://www.chandia.net">chandia.net</a><br />administrator of<br
/><a href="http://parles.upf.edu">parles.upf.edu</a><br /><a
href="http://psicoaching.net">psicoaching.net</a><br /><a
href="http://koyaktumapuche.net">mapuche koyaktu</a><br /><a
href="http://corporacionkoyaktu.net">ong mapuche koyaktu</a>