<html>
  <head>
    <meta content="text/html; charset=ISO-8859-1"
      http-equiv="Content-Type">
  </head>
  <body bgcolor="#FFFFFF" text="#000000">
    Hi Andrew,<br>
    <br>
    It seems that for 10 (out of 310) texts, the word count is wrong.<br>
    I simply looked for all tokens ("[]") and made a frequency
    distribution across texts.<br>
    The result was:<br>
    Your query &#8220;[]&#8221; returned 2,076,963 matches in 310 different texts
    (in 1,961,752 words [310 texts]; frequency: 1058728.63 instances per
    million words).<br>
    <br>
    So all tokens are basically there.<br>
    A frequency distribution showed that in 10 texts the word count
    (second column) is lower than the number of hits (third column,
    which shows the correct word count).<br>
    <br>
    So all tokens are basically assigned to the correct texts, but the
    word count somehow misses some of them.<br>
    <br>
    Hope this helps with debugging.<br>
    <br>
    Best<br>
    Hannah <br>
    <br>
    <table class="concordtable" width="100%">
      <tbody>
        <tr>
          <th class="concordtable">Text</th>
          <th class="concordtable">Word count (CQPweb)</th>
          <th class="concordtable">Hits for "[]"</th>
          <th class="concordtable">Hits per million words</th>
        </tr>
        <tr>
          <td class="concordgeneral" align="center"><a
href="https://fedora.clarin-d.uni-saarland.de/cqpweb/EMEMC/textmeta.php?text=1696_Tryon&amp;uT=y">1696_Tryon
            </a> </td>
          <td class="concordgeneral" align="center"> 4,446 </td>
          <td class="concordgeneral" align="center"> <a
href="https://fedora.clarin-d.uni-saarland.de/cqpweb/EMEMC/concordance.php?qname=edjb6tznqc&amp;newPostP=text&amp;newPostP_textTargetId=1696_Tryon&amp;uT=y">
              15,937 </a> </td>
          <td class="concordgeneral" align="center"> 3584570.4 </td>
        </tr>
        <tr>
          <td class="concordgeneral" align="center"> <a
href="https://fedora.clarin-d.uni-saarland.de/cqpweb/EMEMC/textmeta.php?text=1563_Gale&amp;uT=y">
              1563_Gale </a> </td>
          <td class="concordgeneral" align="center"> 12,082 </td>
          <td class="concordgeneral" align="center"> <a
href="https://fedora.clarin-d.uni-saarland.de/cqpweb/EMEMC/concordance.php?qname=edjb6tznqc&amp;newPostP=text&amp;newPostP_textTargetId=1563_Gale&amp;uT=y">
              36,168 </a> </td>
          <td class="concordgeneral" align="center"> 2993544.12 </td>
        </tr>
        <tr>
          <td class="concordgeneral" align="center"> <a
href="https://fedora.clarin-d.uni-saarland.de/cqpweb/EMEMC/textmeta.php?text=1539_Moulton&amp;uT=y">
              1539_Moulton </a> </td>
          <td class="concordgeneral" align="center"> 4,446 </td>
          <td class="concordgeneral" align="center"> <a
href="https://fedora.clarin-d.uni-saarland.de/cqpweb/EMEMC/concordance.php?qname=edjb6tznqc&amp;newPostP=text&amp;newPostP_textTargetId=1539_Moulton&amp;uT=y">
              9,167 </a> </td>
          <td class="concordgeneral" align="center"> 2061853.35 </td>
        </tr>
        <tr>
          <td class="concordgeneral" align="center"> <a
href="https://fedora.clarin-d.uni-saarland.de/cqpweb/EMEMC/textmeta.php?text=1698_Colbatch&amp;uT=y">
              1698_Colbatch </a> </td>
          <td class="concordgeneral" align="center"> 11,341 </td>
          <td class="concordgeneral" align="center"> <a
href="https://fedora.clarin-d.uni-saarland.de/cqpweb/EMEMC/concordance.php?qname=edjb6tznqc&amp;newPostP=text&amp;newPostP_textTargetId=1698_Colbatch&amp;uT=y">
              23,070 </a> </td>
          <td class="concordgeneral" align="center"> 2034212.15 </td>
        </tr>
        <tr>
          <td class="concordgeneral" align="center"> <a
href="https://fedora.clarin-d.uni-saarland.de/cqpweb/EMEMC/textmeta.php?text=1700_Salmon&amp;uT=y">
              1700_Salmon </a> </td>
          <td class="concordgeneral" align="center"> 12,167 </td>
          <td class="concordgeneral" align="center"> <a
href="https://fedora.clarin-d.uni-saarland.de/cqpweb/EMEMC/concordance.php?qname=edjb6tznqc&amp;newPostP=text&amp;newPostP_textTargetId=1700_Salmon&amp;uT=y">
              24,623 </a> </td>
          <td class="concordgeneral" align="center"> 2023752.77 </td>
        </tr>
        <tr>
          <td class="concordgeneral" align="center"> <a
href="https://fedora.clarin-d.uni-saarland.de/cqpweb/EMEMC/textmeta.php?text=1612_Guillemeau&amp;uT=y">
              1612_Guillemeau </a> </td>
          <td class="concordgeneral" align="center"> 11,928 </td>
          <td class="concordgeneral" align="center"> <a
href="https://fedora.clarin-d.uni-saarland.de/cqpweb/EMEMC/concordance.php?qname=edjb6tznqc&amp;newPostP=text&amp;newPostP_textTargetId=1612_Guillemeau&amp;uT=y">
              24,103 </a> </td>
          <td class="concordgeneral" align="center"> 2020707.58 </td>
        </tr>
        <tr>
          <td class="concordgeneral" align="center"> <a
href="https://fedora.clarin-d.uni-saarland.de/cqpweb/EMEMC/textmeta.php?text=1596_Clowes&amp;uT=y">
              1596_Clowes </a> </td>
          <td class="concordgeneral" align="center"> 12,295 </td>
          <td class="concordgeneral" align="center"> <a
href="https://fedora.clarin-d.uni-saarland.de/cqpweb/EMEMC/concordance.php?qname=edjb6tznqc&amp;newPostP=text&amp;newPostP_textTargetId=1596_Clowes&amp;uT=y">
              24,789 </a> </td>
          <td class="concordgeneral" align="center"> 2016185.44 </td>
        </tr>
        <tr>
          <td class="concordgeneral" align="center"> <a
href="https://fedora.clarin-d.uni-saarland.de/cqpweb/EMEMC/textmeta.php?text=1652_Fioravanti&amp;uT=y">
              1652_Fioravanti </a> </td>
          <td class="concordgeneral" align="center"> 11,754 </td>
          <td class="concordgeneral" align="center"> <a
href="https://fedora.clarin-d.uni-saarland.de/cqpweb/EMEMC/concordance.php?qname=edjb6tznqc&amp;newPostP=text&amp;newPostP_textTargetId=1652_Fioravanti&amp;uT=y">
              23,614 </a> </td>
          <td class="concordgeneral" align="center"> 2009018.21 </td>
        </tr>
        <tr>
          <td class="concordgeneral" align="center"> <a
href="https://fedora.clarin-d.uni-saarland.de/cqpweb/EMEMC/textmeta.php?text=1652_Culpeper&amp;uT=y">
              1652_Culpeper </a> </td>
          <td class="concordgeneral" align="center"> 12,014 </td>
          <td class="concordgeneral" align="center"> <a
href="https://fedora.clarin-d.uni-saarland.de/cqpweb/EMEMC/concordance.php?qname=edjb6tznqc&amp;newPostP=text&amp;newPostP_textTargetId=1652_Culpeper&amp;uT=y">
              23,682 </a> </td>
          <td class="concordgeneral" align="center"> 1971200.27 </td>
        </tr>
        <tr>
          <td class="concordgeneral" align="center"> <a
href="https://fedora.clarin-d.uni-saarland.de/cqpweb/EMEMC/textmeta.php?text=1659_Culpeper&amp;uT=y">
              1659_Culpeper </a> </td>
          <td class="concordgeneral" align="center"> 3,343 </td>
          <td class="concordgeneral" align="center"> <a
href="https://fedora.clarin-d.uni-saarland.de/cqpweb/EMEMC/concordance.php?qname=edjb6tznqc&amp;newPostP=text&amp;newPostP_textTargetId=1659_Culpeper&amp;uT=y">
              5,874 </a> </td>
          <td class="concordgeneral" align="center"> 1757104.4 </td>
        </tr>
      </tbody>
    </table>
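    <br>
    As a cross-check, the figures in the table can be recomputed with a
    short Python sketch. It assumes that the last column is simply hits
    divided by word count, times one million; the names
    (rows, per_million, shortfall) are illustrative and not part of CQP
    or CQPweb. If that assumption holds, the per-text shortfalls of
    these 10 texts should add up to the corpus-level gap
    (2,076,963 - 1,961,752 = 115,211 tokens).<br>
    <pre>
```python
# Figures copied from the table: text id -> (CQPweb word count, hits for "[]").
rows = {
    "1696_Tryon": (4446, 15937),
    "1563_Gale": (12082, 36168),
    "1539_Moulton": (4446, 9167),
    "1698_Colbatch": (11341, 23070),
    "1700_Salmon": (12167, 24623),
    "1612_Guillemeau": (11928, 24103),
    "1596_Clowes": (12295, 24789),
    "1652_Fioravanti": (11754, 23614),
    "1652_Culpeper": (12014, 23682),
    "1659_Culpeper": (3343, 5874),
}

# Corpus-level totals quoted in the query summary.
CORPUS_HITS = 2076963
CORPUS_WORDS = 1961752


def per_million(hits, words):
    """Hits per million words (assumed formula for the last column)."""
    return hits / words * 1_000_000


def shortfall(rows):
    """Total number of tokens missing from the stored word counts."""
    return sum(hits - words for words, hits in rows.values())


for text, (words, hits) in rows.items():
    print(text, words, hits, round(per_million(hits, words), 2))

print("missing in these texts:", shortfall(rows))
print("corpus-level gap:      ", CORPUS_HITS - CORPUS_WORDS)
```
</pre>
    If the two totals agree, the discrepancy is confined to these 10
    texts and the word counts of the remaining 300 match their hit
    counts.<br>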
    <br>
    <br>
    <div class="moz-cite-prefix">On 14.02.2014 15:04, Hardie,
      Andrew wrote:<br>
    </div>
    <blockquote
      cite="mid:28078EC3FBF1B940A3EF3D0D19BE351D2EAD83@EX-0-MB1.lancs.local"
      type="cite">
      <pre wrap="">Hannah &amp; Stefan,

Can you tell me (a) which function you used to get the CQP word count, and (b) where you got the CQPweb word count (corpus metadata, or concordance infobar)?

The most obvious explanation is that there are tokens outside &lt;text&gt; elements, since CQPweb calculates the size of the corpus by summing the tokens in each individual text. This in turn is based on calculating cpos differences.

But I would like to investigate on my own server first.

best

Andrew.

-----Original Message-----
From: <a class="moz-txt-link-abbreviated" href="mailto:cwb-bounces@sslmit.unibo.it">cwb-bounces@sslmit.unibo.it</a> [<a class="moz-txt-link-freetext" href="mailto:cwb-bounces@sslmit.unibo.it">mailto:cwb-bounces@sslmit.unibo.it</a>] On Behalf Of Stefan Evert
Sent: 14 February 2014 12:49
To: Open source development of the Corpus WorkBench
Subject: Re: [CWB] Difference in token number between CQP and CQPweb


On 14 Feb 2014, at 12:07, Hannah Kermes <a class="moz-txt-link-rfc2396E" href="mailto:h.kermes@mx.uni-saarland.de">&lt;h.kermes@mx.uni-saarland.de&gt;</a> wrote:

</pre>
      <blockquote type="cite">
        <pre wrap="">I just realized a difference in the token numbers between CQP and CQPweb.
The encoded corpus in CQPweb is a copy of the CQP corpus. The encoding has been performed with CQP on the command line and has been installed in CQPweb as an encoded corpus.

Token numbers: 1,961,752 (CQPweb); 2,076,963 (CQP)

The difference is also present if you look at subcorpora.
</pre>
      </blockquote>
      <pre wrap="">
Interesting. I see the same discrepancy on my local copy of CQPweb (v3.0.7) for _one_ of the corpora I installed. Everything else is fine.

Andrew, is it possible that this may be caused by some particular corpus settings, e.g. if it's not in UTF-8 encoding?

Otherwise, the only explanation I can think of is that you may have re-encoded the CWB corpus, changing its size, and forgot to re-install it in CQPweb (so CQPweb still has the old frequency information etc. and all subcorpora and distributions will be totally messed up)?

Cheers,
Stefan



_______________________________________________
CWB mailing list
<a class="moz-txt-link-abbreviated" href="mailto:CWB@sslmit.unibo.it">CWB@sslmit.unibo.it</a>
<a class="moz-txt-link-freetext" href="http://devel.sslmit.unibo.it/mailman/listinfo/cwb">http://devel.sslmit.unibo.it/mailman/listinfo/cwb</a>
</pre>
    </blockquote>
    <br>
    <pre class="moz-signature" cols="72">-- 
Dr. Hannah Kermes
Dept. of Applied Linguistics, Interpreting and Translation (FR 4.6)
Universit&auml;t des Saarlandes
Campus, Building A2.2, Room 1.07
D-66123 Saarbr&uuml;cken
phone: +49-(0)681-302-70077
</pre>
  </body>
</html>