<html>

  <head>

    <meta content="text/html; charset=ISO-8859-1"

      http-equiv="Content-Type">

  </head>

  <body bgcolor="#FFFFFF" text="#000000">

    <div class="moz-cite-prefix">Hi again,<br>

      <br>

      There hasn't been any reply to our previous message from anybody

      on the list. Does this mean this problem has no possible solution

      within CQP? Would the method we suggested be too hard or

      impossible to implement? We would really appreciate your input

      because we have to decide at this point how to pre-process the

      texts and, depending on the options we have with CQP, we would

      go one way or another. Thanks for all your help.<br>

      <br>

      Josep M<br>

    </div>

    <blockquote cite="mid:511E744E.9070505@upf.edu" type="cite">


      <div class="moz-cite-prefix">Hi Andrew and Stefan. I work with Eva

        and now it is my turn to write. First thanks for your help. <br>

        Your answers have given us some ideas, which we explain below.

        What we don't really know is what pitfalls the implementation

        we suggest might have when it is processed via CQP. Below we'll

        try to explain why we would want to do it the way we are

        proposing. <br>

        <br>


      </div>

      <blockquote

        cite="mid:28078EC3FBF1B940A3EF3D0D19BE351D131B11@EX-0-MB1.lancs.local"

        type="cite">

        <blockquote type="cite">

          <blockquote type="cite">

            <pre wrap="">But this would break the alignment between the two attributes, if one has two tokens and the other only a single token, wouldn't it?

</pre>

          </blockquote>

        </blockquote>

        <pre wrap="">I was thinking of this kind of arrangement:

apressurada        apressuradamientre

mientre        {some kind of ditto mark or just __NULL__}

.... so that subsequent tokens on the two attributes stay in sync.

OR, going the other way

apressuradamientre        apressurada mientre

I'm quite open to alternatives, though the XML way strikes me as liable to cause trouble.</pre>

      </blockquote>

      <br>

      OK, first: Andrew's suggestion in (a) below is less likely to

      cause problems, but it would be a bit less desirable because with

      a layout like the following we would miss the fact that the two

      words work, for all intents and purposes, as a single unit. To

      give you an idea, it is exactly as if the same texts contained

      both "hurriedly" and "hurried ly". So, by default, we want these

      multi-word expressions to be found as a single unit any time a

      user searches for an adverb or for the lemma

      'apresuradamente'.<br>

      <br>

      (a)<br>

      <pre wrap="">apressurada        apressuradamientre

mientre        {some kind of ditto mark or just __NULL__}</pre>
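      To make the drawback concrete, here is a minimal Python sketch of
      layout (a). It is only an illustration: the function name, the
      tuple layout and the use of __NULL__ as filler are assumptions of
      ours, not CWB conventions, and nothing here calls CQP.<br>
      <pre wrap="">
```python
# Hypothetical helper: lay out a split form on two aligned columns, as in (a).
# Column 1 holds the tokens as written; column 2 holds the joined form on the
# first row and a filler on the rest, so later tokens stay in sync.
NULL = "__NULL__"  # assumed filler value, as suggested in the thread


def layout_a(parts: list[str]) -> list[tuple[str, str]]:
    joined = "".join(parts)
    rows = []
    for i, part in enumerate(parts):
        rows.append((part, joined if i == 0 else NULL))
    return rows


rows = layout_a(["apressurada", "mientre"])
for written, joined in rows:
    print(f"{written}\t{joined}")

# Searching the second column for the joined form matches only the first row,
# so the hit covers "apressurada" but not "mientre".
hits = [i for i, (_, joined) in enumerate(rows) if joined == "apressuradamientre"]
```
</pre>
      The single hit at row 0 is the point: the match does not span
      both tokens, so the expression is not returned as one unit.<br>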

      <br>

      Andrew's suggestion in (b) below would overcome this problem but

      then we don't really know how it could be implemented in CQP. What

      we usually have in our tagged corpora are entries with 3 columns:

      1) the form, 2) the lemma and 3) the POS tag. So (b) would be

      problematic because there is apparently no way to say that the

      lemma is in fact 'apresuradamente' and that "apressurada mientre"

      is a multi-word instance/form of that lemma. Furthermore, for

      reasons that have to do with the kind of research potential users

      of this corpus are likely to do, it would be ideal to consider the

      two parts of the multi-word expression also as two independent

      words, each one with its lemma and its part of speech. This is so

      because, in this particular example of adverbs with -mente, in the

      early stages of the change that resulted in the creation of the

      current manner adverbs, the strings with the two forms could have

      been ambiguous between a single adverb (the interpretation we want

      to be the default interpretation when doing a normal search) and

      two independent words: one an adjective and the other a noun.

      'Apresurada' (which means 'hurried') is not a really good example

      of this development, but in the earlier stages of this change the

      string "fuerte mientre" (lit. "strong mind") could literally have

      meant "with a strong mind" (I think the origin of adverbs with

      -ly in English is similar) as well as "strongly". So we would like

      these expressions to also be searchable as two separate items,

      each one with its lemma and its POS, in case a particular

      researcher is interested in studying this phenomenon. For the

      majority of researchers, though, the fact that the expression is

      written as two separate words would not matter. For this reason,

      we would like the default assumption in CQP to be that there is a

      single word.<br>

      <br>

      (b)<br>

      <pre wrap="">apressuradamientre        apressurada mientre</pre>

      <br>

      Now, what Stefan proposed made us think of the following

      possibility: <br>

      <br>

      &lt;X&gt;<br>

      &nbsp;word="apresurada mientre" &nbsp;&nbsp; lemma="apresuradamente"&nbsp; pos="ADV"<br>

      &nbsp;&lt;wp word="apresurada" lemma="apresurada"

      pos="ADJ"&gt;&lt;/wp&gt;<br>

      &nbsp;&lt;wp word="mientre" lemma="mente" pos="N"&gt;&lt;/wp&gt;<br>

      &lt;/X&gt;<br>

      <br>

      We chose the label &lt;X&gt; for lack of a better one, but the

      idea is that by default CQP would interpret

      &lt;X&gt;....&lt;/X&gt; the way it interprets the entry for any

      single word. Then we would have an extra p-attribute available,

      &lt;wp&gt; (the users would know this), for cases where a user is

      interested in working with the differentiated parts of the

      expression (just finding the specific forms and their POS tags,

      or doing some quantitative analysis with the different parts).

      <br>
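      The search behaviour we have in mind can be sketched in a few
      lines of Python. This is only a model of the intended behaviour
      under our own assumptions, not a description of how CQP works:
      the class names Unit and Part, the search function, its
      use_parts switch and the single-word entry "fabla" are all
      hypothetical.<br>
      <pre wrap="">
```python
from dataclasses import dataclass, field


@dataclass
class Part:
    word: str
    lemma: str
    pos: str


@dataclass
class Unit:
    # One searchable "word" by default; parts are an optional second layer.
    word: str
    lemma: str
    pos: str
    parts: list[Part] = field(default_factory=list)


def search(corpus: list[Unit], pos: str, use_parts: bool = False) -> list[str]:
    """Return matching forms; look inside multi-word units only on request."""
    found = []
    for unit in corpus:
        if use_parts and unit.parts:
            found.extend(p.word for p in unit.parts if p.pos == pos)
        elif unit.pos == pos:
            found.append(unit.word)
    return found


corpus = [
    Unit("apresurada mientre", "apresuradamente", "ADV", [
        Part("apresurada", "apresurada", "ADJ"),
        Part("mientre", "mente", "N"),
    ]),
    Unit("fabla", "fablar", "V"),  # hypothetical single-word entry
]

print(search(corpus, "ADV"))                  # default: the unit as one word
print(search(corpus, "ADJ", use_parts=True))  # opt-in: the separate parts
```
</pre>
      A default search for ADV returns the whole expression, while a
      search for ADJ finds "apresurada" only when the user explicitly
      asks to look at the parts.<br>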

      <br>

      Being able to do this is extremely important for diachronic

      corpora but it would have advantages for all kinds of corpora

      since all of them contain multi-word expressions where you might

      need their components to be processed independently at some point.

      So, in our corpora we have trouble not only with these types of

      expressions but also with many others like the following:<br>

      <br>

      "compte Guifr&eacute; de Montblanc" This is a proper name literally

      composed of the words count + Wilfred + of + Montblanc<br>

      <br>

      In the texts you also find independent instances of 'Guifr&eacute;',

      'compte' or 'Montblanc'. What is most important is to be able to

      tag the whole string as a noun. Doing this is fairly trivial

      because you could artificially create single strings of the type

      'compte_Guifr&eacute;_de_Montblanc' at the pre-processing stage and add

      them to the dictionary as proper nouns. But then imagine that some

      user is interested in studying the variation in the types of

      prepositional phrases that occur within proper nouns, the place

      names used in people's proper names, or some other legitimate

      research goal. <br>
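      As a rough illustration of that pre-processing step, here is a
      minimal Python sketch. The function names and the separator
      constant are hypothetical, and the sketch assumes that the
      individual words never contain an underscore themselves. Note
      that splitting the token again recovers only the written forms,
      not a lemma or POS tag for each part.<br>
      <pre wrap="">
```python
# Hypothetical pre-processing step: join a multi-word name into one token
# with underscores, and recover the parts again when they are needed.
# Assumes the individual words never contain an underscore themselves.
SEP = "_"


def join_name(words: list[str]) -> str:
    return SEP.join(words)


def split_name(token: str) -> list[str]:
    return token.split(SEP)


token = join_name(["compte", "Guifré", "de", "Montblanc"])
print(token)              # compte_Guifré_de_Montblanc
print(split_name(token))
```
</pre>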

      <br>

      Creating a single word obscures all this information, which

      could be valuable for some. There are many more examples. Another

      typical case is subordinating conjunctions formed by more than

      one word (e.g. "Puis que", literally "since that"). If you give

      them to the tagger as independent words, the resulting sentence

      structure is grammatically weird because the two words are really

      working as one (just like 'since'), so it is better to tag them

      as a single subordinating conjunction. Again, though, people

      interested in doing research on how these combinations of

      function words evolved would lose all the information if you

      tagged them only as a single expression. I'm sure modern

      languages have lots of cases like this.<br>

      <br>

      You see what I mean? This is part of a more general problem with

      linguistic annotation of corpora but it poses very specific

      challenges for CWB/CQP which we would like to overcome if

      possible.<br>

      <br>

      JM<br>

    </blockquote>

    <br>

  </body>

</html>