<html>

  <head>

    <meta content="text/html; charset=ISO-8859-1"

      http-equiv="Content-Type">

  </head>

  <body bgcolor="#FFFFFF" text="#000000">

    <div class="moz-cite-prefix">Sorry. I should have considered this

      possibility. I apologize. I don't mean to rush you guys. Some

      confirmation that this is a legitimate concern, and that you think

      the question is worth some thought, would be enough for us.

      Just knowing that this or a similar solution might be

      possible to implement in the future would already be useful for us,

      since it would justify making some particular choices right now. <br>

      <br>

      One of the avenues we are considering, if the solution we suggest

      were at all possible, is to pre-process our tokenized texts so that

      all the possible multi-word expressions are added to the

      dictionary for our tagger with the appropriate labels; something

      like: <br>

      <br>

      Saint_Anselm, Sir_Lancelot_of_the_lake, in_order_to, etc. <br>

      <br>
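      A rough sketch of what such a pre-processing step could look like, in Python. It assumes the texts are already tokenized and that the multi-word expressions come from a hand-built list; the function name join_multiword_units and the sample lexicon are hypothetical, not part of any existing tool. <br>
      <pre>
```python
def join_multiword_units(tokens, lexicon, joiner="_"):
    """Merge every known multi-word expression into a single joined token."""
    longest = max((len(entry) for entry in lexicon), default=1)
    merged = []
    i = 0
    while i != len(tokens):
        # Try the longest candidate span first, so that a long expression
        # wins over any shorter one starting at the same token.
        for size in range(min(longest, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + size])
            if candidate in lexicon:
                merged.append(joiner.join(candidate))
                i += size
                break
        else:
            # No expression starts here: keep the token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged


# Hypothetical lexicon of multi-word expressions, one tuple per expression.
lexicon = {
    ("Saint", "Anselm"),
    ("Sir", "Lancelot", "of", "the", "lake"),
    ("in", "order", "to"),
}
tokens = "Sir Lancelot of the lake wrote to Saint Anselm in order to ask".split()
joined = join_multiword_units(tokens, lexicon)
print(joined)
```
</pre>
      The joined forms can then go into the tagger dictionary like any other word. A plain list lookup like this cannot tell a real multi-word unit from an accidental sequence of the same words, so ambiguous cases would still need manual review. <br>
      <br>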

      This would at least allow the tagger to learn about the existing

      multi-word units and their distribution and tag the texts with

      what is for us the most important information. We want to parse

      the resulting corpus. So making the encoding of the specific

      syntactic information about relationships between different

      expressions as simple and intuitive as possible is our main

      concern at this point. Later, if the solution we suggest or a

      different one that achieved the same goals were available, we

      would reprocess the texts of the corpus to include the information

      about the different components of the multi-word expressions by

      eliminating the '_' and adding the new labels for the individual

      words via some encoding scheme possibly involving XML.<br>

      <br>
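      The later reprocessing step might be sketched as follows, again in Python. This is only an illustration under stated assumptions: the element names X and wp are borrowed from the proposal in the quoted message below, the function expand_unit is hypothetical, and the lemma and POS of each component are assumed to be supplied from some other source. <br>
      <pre>
```python
import xml.etree.ElementTree as ET


def expand_unit(joined_form, lemma, pos, parts, joiner="_"):
    """Turn a joined token into an X element with one wp child per component word."""
    forms = joined_form.split(joiner)
    if len(forms) != len(parts):
        raise ValueError("need one (lemma, pos) pair per component word")
    # The outer element carries the annotation of the unit as a whole.
    unit = ET.Element("X", word=" ".join(forms), lemma=lemma, pos=pos)
    # Each component keeps its own form, lemma and POS.
    for form, (part_lemma, part_pos) in zip(forms, parts):
        ET.SubElement(unit, "wp", word=form, lemma=part_lemma, pos=part_pos)
    return unit


unit = expand_unit(
    "apresurada_mientre", "apresuradamente", "ADV",
    parts=[("apresurada", "ADJ"), ("mente", "N")],
)
print(ET.tostring(unit, encoding="unicode"))
```
</pre>
      Whether CQP could treat such an element as a single word by default is exactly the open question; the sketch only shows that the information can be recovered from the joined tokens. <br>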

      <br>

      JM<br>

    </div>

    <blockquote

      cite="mid:28078EC3FBF1B940A3EF3D0D19BE351D133B6F@EX-0-MB1.lancs.local"

      type="cite">

      <meta http-equiv="Content-Type" content="text/html;

        charset=ISO-8859-1">

      <meta name="Generator" content="Microsoft Word 12 (filtered

        medium)">

      <style><!--

/* Font Definitions */

@font-face

        {font-family:"Cambria Math";

        panose-1:2 4 5 3 5 4 6 3 2 4;}

@font-face

        {font-family:Calibri;

        panose-1:2 15 5 2 2 2 4 3 2 4;}

@font-face

        {font-family:Tahoma;

        panose-1:2 11 6 4 3 5 4 4 2 4;}

@font-face

        {font-family:Verdana;

        panose-1:2 11 6 4 3 5 4 4 2 4;}

@font-face

        {font-family:Consolas;

        panose-1:2 11 6 9 2 2 4 3 2 4;}

/* Style Definitions */

p.MsoNormal, li.MsoNormal, div.MsoNormal

        {margin:0cm;

        margin-bottom:.0001pt;

        font-size:12.0pt;

        font-family:"Times New Roman","serif";

        color:black;}

a:link, span.MsoHyperlink

        {mso-style-priority:99;

        color:blue;

        text-decoration:underline;}

a:visited, span.MsoHyperlinkFollowed

        {mso-style-priority:99;

        color:purple;

        text-decoration:underline;}

pre

        {mso-style-priority:99;

        mso-style-link:"HTML Preformatted Char";

        margin:0cm;

        margin-bottom:.0001pt;

        font-size:10.0pt;

        font-family:"Courier New";

        color:black;}

span.HTMLPreformattedChar

        {mso-style-name:"HTML Preformatted Char";

        mso-style-priority:99;

        mso-style-link:"HTML Preformatted";

        font-family:Consolas;

        color:black;}

span.EmailStyle19

        {mso-style-type:personal-reply;

        font-family:"Verdana","sans-serif";

        color:#1F497D;}

.MsoChpDefault

        {mso-style-type:export-only;

        font-size:10.0pt;}

@page WordSection1

        {size:612.0pt 792.0pt;

        margin:72.0pt 72.0pt 72.0pt 72.0pt;}

div.WordSection1

        {page:WordSection1;}

--></style>

      <div class="WordSection1">

        <p class="MsoNormal"><span

style="font-size:10.0pt;font-family:&quot;Verdana&quot;,&quot;sans-serif&quot;;color:#1F497D">On

            my part it means it is a more complex question than I have

            had time to write an email about yet!<o:p></o:p></span></p>

        <p class="MsoNormal"><span

style="font-size:10.0pt;font-family:&quot;Verdana&quot;,&quot;sans-serif&quot;;color:#1F497D"><o:p>&nbsp;</o:p></span></p>

        <p class="MsoNormal"><span

style="font-size:10.0pt;font-family:&quot;Verdana&quot;,&quot;sans-serif&quot;;color:#1F497D">Andrew.<o:p></o:p></span></p>

        <p class="MsoNormal"><span

style="font-size:10.0pt;font-family:&quot;Verdana&quot;,&quot;sans-serif&quot;;color:#1F497D"><o:p>&nbsp;</o:p></span></p>

        <div>

          <div style="border:none;border-top:solid #B5C4DF

            1.0pt;padding:3.0pt 0cm 0cm 0cm">

            <p class="MsoNormal"><b><span

style="font-size:10.0pt;font-family:&quot;Tahoma&quot;,&quot;sans-serif&quot;;color:windowtext"

                  lang="EN-US">From:</span></b><span

style="font-size:10.0pt;font-family:&quot;Tahoma&quot;,&quot;sans-serif&quot;;color:windowtext"

                lang="EN-US"> <a class="moz-txt-link-abbreviated" href="mailto:cwb-bounces@sslmit.unibo.it">cwb-bounces@sslmit.unibo.it</a>

                [<a class="moz-txt-link-freetext" href="mailto:cwb-bounces@sslmit.unibo.it">mailto:cwb-bounces@sslmit.unibo.it</a>] <b>On Behalf Of </b>Josep

                M. Fontana<br>

                <b>Sent:</b> 19 February 2013 10:39<br>

                <b>To:</b> <a class="moz-txt-link-abbreviated" href="mailto:cwb@sslmit.unibo.it">cwb@sslmit.unibo.it</a><br>

                <b>Subject:</b> Re: [CWB] Multi-word units<o:p></o:p></span></p>

          </div>

        </div>

        <p class="MsoNormal"><o:p>&nbsp;</o:p></p>

        <div>

          <p class="MsoNormal">Hi again,<br>

            <br>

            There hasn't been any reply to our previous message from

            anybody on the list. Does this mean this problem has no

            possible solution within CQP? Would the method we suggested

            be too hard or impossible to implement? We would really

            appreciate your input because we have to decide now

            how to pre-process our texts, and depending on

            the options we have with CQP we would go one way or another.

            Thanks for all your help.<br>

            <br>

            Josep M<o:p></o:p></p>

        </div>

        <blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">

          <div>

            <p class="MsoNormal" style="margin-bottom:12.0pt">Hi Andrew

              and Stefan. I work with Eva and now it is my turn to

              write. First, thanks for your help.

              <br>

              Your answers have given us some ideas, which we explain

              below. What we don't really know is what pitfalls

              the implementation we suggest would have for

              processing via CQP. We'll also try to explain why we

              would want to do it the way we are proposing.

              <o:p></o:p></p>

          </div>

          <blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">

            <blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">

              <blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">

                <pre>But this would break the alignment between the two attributes, if one has two tokens and the other only a single token, wouldn't it?<o:p></o:p></pre>

              </blockquote>

            </blockquote>

            <pre>I was thinking of this kind of arrangement:<o:p></o:p></pre>

            <pre><o:p>&nbsp;</o:p></pre>

            <pre>apressurada&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; apressuradamientre<o:p></o:p></pre>

            <pre>mientre&nbsp;&nbsp;&nbsp; {some kind of ditto mark or just __NULL__}<o:p></o:p></pre>

            <pre><o:p>&nbsp;</o:p></pre>

            <pre>.... so that subsequent tokens on the two attributes stay in sync.<o:p></o:p></pre>

            <pre><o:p>&nbsp;</o:p></pre>

            <pre>OR, going the other way<o:p></o:p></pre>

            <pre><o:p>&nbsp;</o:p></pre>

            <pre>apressuradamientre apressurada mientre<o:p></o:p></pre>

            <pre><o:p>&nbsp;</o:p></pre>

            <pre>I'm quite open to alternatives, though the XML way strikes me as liable to cause trouble.<o:p></o:p></pre>

          </blockquote>

          <p class="MsoNormal"><br>

            OK, first: Andrew's suggestion in (a) below, even

            though it is less likely to cause problems, would be a bit

            less desirable because with something like the

            following we would miss the fact that the two words work, for all

            intents and purposes, as a single unit. To give you an

            idea, it is exactly as if the same texts

            contained both "hurriedly" and "hurried ly". So, by

            default we want these multi-word expressions to be found as

            a single unit any time a user searches for an adverb or for

            the lemma 'apresuradamente'.<br>

            <br>

            (a)<br>

            <br>

            <o:p></o:p></p>

          <pre>apressurada&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; apressuradamientre<o:p></o:p></pre>

          <pre>mientre&nbsp; {some kind of ditto mark or just __NULL__}<o:p></o:p></pre>
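          <p class="MsoNormal">For illustration only, arrangement (a) could be generated along these lines in Python. The names align_unit and count_units are hypothetical, and __NULL__ is just the placeholder mentioned above; nothing here is an existing CWB feature.</p>
          <pre>
```python
DITTO = "__NULL__"


def align_unit(parts, unit_value, ditto=DITTO):
    """Pair each written part with a value on the second attribute.

    The first part carries the value for the whole unit; every later part
    gets the ditto mark, so both attributes keep the same number of tokens.
    """
    if not parts:
        return []
    return [(parts[0], unit_value)] + [(part, ditto) for part in parts[1:]]


def count_units(rows, ditto=DITTO):
    """Count units on the second attribute, skipping ditto-marked rows."""
    return sum(1 for _, value in rows if value != ditto)


rows = align_unit(["apressurada", "mientre"], "apressuradamientre")
for form, value in rows:
    # One token per line, attributes separated by a tab.
    print(form + "\t" + value)
print(count_units(rows))
```
</pre>
          <p class="MsoNormal">The two attributes stay in sync, but any search or count on the second attribute has to remember to skip the placeholder rows, which is part of why this option is less attractive for us.</p>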

          <p class="MsoNormal"><br>

            Andrew's suggestion in (b) below would overcome this problem

            but then we don't really know how it could be implemented in

            CQP. What we usually have in our tagged corpora are entries

            with 3 columns: 1) the form, 2) the lemma and 3) the POS

            tag. So (b) would be problematic because there is apparently

            no way to say that the lemma is in fact 'apresuradamente'

            and that "apressurada mientre" is a multi-word instance/form

            of that lemma. Furthermore, for reasons that have to do with

            the kind of research potential users of this corpus are

            likely to do, it would be ideal to consider the two parts of

            the multi-word expression also as two independent words,

            each one with its lemma and its part of speech. This is so

            because, in this particular example of adverbs with -mente,

            in the early stages of the change that resulted in the

            creation of the current manner adverbs, the strings with the

            two forms could have been ambiguous between a single adverb

            (the interpretation we want to be the default interpretation

            when doing a normal search) and two independent words: one

            an adjective and the other a noun. 'apresurada' (which

            means 'hurried') is not a very good example of this

            development, but in the earlier stages of this change the

            string "fuerte mientre" (lit. "strong mind") could literally

            have meant "with a strong mind" (I think the origin of

            adverbs with -ly in English is similar) as well as

            "strongly". So we would like these expressions to be

            also searchable as two separate items each one with its

            lemma and its POS in case a particular researcher was

            interested in studying this phenomenon. For the majority of

            researchers, though, the fact that the expression is written

            as two separate words would not matter. For this reason, we

            would like the default assumption in CQP to be that there is a

            single word.<br>

            <br>

            (b)<br>

            <br>

            <o:p></o:p></p>

          <pre>apressuradamientre&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; apressurada mientre<o:p></o:p></pre>

          <p class="MsoNormal"><br>

            Now, what Stefan proposed made us think of the following

            possibility: <br>

            <br>

            &lt;X&gt;<br>

            &nbsp;word="apresurada mientre" &nbsp;&nbsp; lemma="apresuradamente"&nbsp;

            pos="ADV"<br>

            &nbsp;&lt;wp word="apresurada" lemma="apresurada"

            pos="ADJ"&gt;&lt;/wp&gt;<br>

            &nbsp;&lt;wp word="mientre" lemma="mente" pos="N"&gt;&lt;/wp&gt;<br>

            &lt;/X&gt;<br>

            <br>

            We chose the label &lt;X&gt; for lack of a better one, but

            the idea is that by default CQP would interpret

            &lt;X&gt;....&lt;/X&gt; as it interprets the entry for any

            single word. Then we would have an extra p-attribute

            available, &lt;wp&gt; (the users would know this), for cases

            where a user is interested in working with the individual

            parts of the expression (just finding the specific forms and

            their POS tags, or doing some quantitative analysis of the

            different parts).

            <br>

            <br>

            Being able to do this is extremely important for diachronic

            corpora but it would have advantages for all kinds of

            corpora since all of them contain multi-word expressions

            where you might need their components to be processed

            independently at some point. So, in our corpora we have

            trouble not only with these types of expressions but also

            with many others like the following:<br>

            <br>

            "compte Guifr&eacute; de Montblanc" This is a proper name literally

            composed of the words count + Wilfred + of + Montblanc<br>

            <br>

            In the texts you find independent instances of 'Guifr&eacute;',

            'compte' or 'Montblanc'. What is most important is to be

            able to tag the whole string as a noun. To do this is kind

            of trivial because you could artificially create single

            strings of the type 'compte_Guifr&eacute;_de_Montblanc' at the

            pre-processing stage and add them to the dictionary as

            proper nouns. But then imagine that some user is interested

            in studying the variation in the types of prepositional

            phrases that occur within proper nouns, the place names used

            in proper nouns of people or some such legitimate research

            goal. <br>

            <br>

            Having created a single word obscures all this information

            that could be valuable for some. There are many more

            examples. Another typical case is that of subordinating conjunctions

            formed by more than one word (e.g. "Puis que", literally

            "since that"). If you give them to the tagger as

            independent words the resulting sentence structure is

            grammatically weird because the two words are really working

            as one (just like 'since') so it is better to tag them as a

            single subordinating conjunction. Again, though, people

            interested in doing research on how these combinations of

            functional words evolved would lose all the information if

            you tag them only as a single expression. I'm sure modern

            languages have lots of cases like this.<br>

            <br>

            You see what I mean? This is part of a more general problem

            with linguistic annotation of corpora but it poses very

            specific challenges for CWB/CQP which we would like to

            overcome if possible.<br>

            <br>

            JM<o:p></o:p></p>

        </blockquote>

        <p class="MsoNormal"><o:p>&nbsp;</o:p></p>

      </div>

      <br>

      <fieldset class="mimeAttachmentHeader"></fieldset>

      <br>

      <pre wrap="">_______________________________________________

CWB mailing list

<a class="moz-txt-link-abbreviated" href="mailto:CWB@sslmit.unibo.it">CWB@sslmit.unibo.it</a>

<a class="moz-txt-link-freetext" href="http://devel.sslmit.unibo.it/mailman/listinfo/cwb">http://devel.sslmit.unibo.it/mailman/listinfo/cwb</a>

</pre>

    </blockquote>

    <br>

  </body>

</html>