<html>

<head>

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

<meta name="Generator" content="Microsoft Word 12 (filtered medium)">

<style><!--

/* Font Definitions */

@font-face

        {font-family:"Cambria Math";

        panose-1:2 4 5 3 5 4 6 3 2 4;}

@font-face

        {font-family:Calibri;

        panose-1:2 15 5 2 2 2 4 3 2 4;}

@font-face

        {font-family:Tahoma;

        panose-1:2 11 6 4 3 5 4 4 2 4;}

@font-face

        {font-family:Verdana;

        panose-1:2 11 6 4 3 5 4 4 2 4;}

/* Style Definitions */

p.MsoNormal, li.MsoNormal, div.MsoNormal

        {margin:0cm;

        margin-bottom:.0001pt;

        font-size:12.0pt;

        font-family:"Times New Roman","serif";}

a:link, span.MsoHyperlink

        {mso-style-priority:99;

        color:blue;

        text-decoration:underline;}

a:visited, span.MsoHyperlinkFollowed

        {mso-style-priority:99;

        color:purple;

        text-decoration:underline;}

span.EmailStyle17

        {mso-style-type:personal-reply;

        font-family:"Verdana","sans-serif";

        color:#1F497D;}

.MsoChpDefault

        {mso-style-type:export-only;}

@page WordSection1

        {size:612.0pt 792.0pt;

        margin:72.0pt 72.0pt 72.0pt 72.0pt;}

div.WordSection1

        {page:WordSection1;}

--></style><!--[if gte mso 9]><xml>

<o:shapedefaults v:ext="edit" spidmax="1026" />

</xml><![endif]--><!--[if gte mso 9]><xml>

<o:shapelayout v:ext="edit">

<o:idmap v:ext="edit" data="1" />

</o:shapelayout></xml><![endif]-->

</head>

<body lang="EN-GB" link="blue" vlink="purple">

<div class="WordSection1">

<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,&quot;sans-serif&quot;;color:#1F497D">Hi Eva,<o:p></o:p></span></p>

<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,&quot;sans-serif&quot;;color:#1F497D"><o:p>&nbsp;</o:p></span></p>

<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,&quot;sans-serif&quot;;color:#1F497D">If I understand the problem correctly, the usual way I would handle this is to encode the original orthography (with token breaks as written) and a normalised orthography (with normalised token breaks) as two separate attributes: either two p-attributes, or one p-attribute holding the normalised version and one s-attribute holding the original version.<o:p></o:p></span></p>
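
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,&quot;sans-serif&quot;;color:#1F497D">To make the two-p-attribute option concrete, here is a minimal Python sketch, not a definitive recipe. It writes tab-separated input with one normalised token per line and the original spelling in a second column, then mimics the kind of test a CQP query such as [orig = &quot;.*mientre&quot;] would apply. The attribute names orig and pos, the sample tokens (including the one-word spelling and the verb), and the helper functions are all assumptions made for illustration. It also assumes that an attribute value may contain a space.<o:p></o:p></span></p>

<pre>
```python
import re

# Hypothetical sample data: (normalised form, original spelling, part of speech).
# The normalised form is the token; the original spelling may contain a space
# where the scribe wrote the adverb as two words.
TOKENS = [
    ("apresuradamente", "apressurada mientre", "ADV"),
    ("apresuradamente", "apressuradamientre", "ADV"),
    ("fablo", "fablo", "VERB"),
]


def to_vrt(tokens):
    """Return one tab-separated line per token: word, orig, pos."""
    return "\n".join("\t".join(columns) for columns in tokens)


def match_orig(tokens, pattern):
    """Mimic a CQP test such as [orig = ".*mientre"].

    CQP regular expressions must match the whole attribute value,
    so re.fullmatch is used here rather than re.search.
    """
    regex = re.compile(pattern)
    return [orig for _word, orig, _pos in tokens if regex.fullmatch(orig)]


def written_apart(tokens):
    """Original spellings that the scribe split into several words."""
    return [orig for _word, orig, _pos in tokens if " " in orig]


if __name__ == "__main__":
    print(to_vrt(TOKENS))
    print(match_orig(TOKENS, r".*mientre"))
    print(written_apart(TOKENS))
```
</pre>

<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,&quot;sans-serif&quot;;color:#1F497D">With this layout every adverb is a single token tagged ADV, a search on the original spelling finds 'mientre' whether or not it is attached to the preceding word, and the spellings written apart can be picked out by the space they contain.<o:p></o:p></span></p>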

<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,&quot;sans-serif&quot;;color:#1F497D"><o:p>&nbsp;</o:p></span></p>

<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,&quot;sans-serif&quot;;color:#1F497D">best<o:p></o:p></span></p>

<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,&quot;sans-serif&quot;;color:#1F497D"><o:p>&nbsp;</o:p></span></p>

<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,&quot;sans-serif&quot;;color:#1F497D">Andrew.<o:p></o:p></span></p>

<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Verdana&quot;,&quot;sans-serif&quot;;color:#1F497D"><o:p>&nbsp;</o:p></span></p>

<div style="border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0cm 0cm 0cm">

<p class="MsoNormal"><b><span lang="EN-US" style="font-size:10.0pt;font-family:&quot;Tahoma&quot;,&quot;sans-serif&quot;">From:</span></b><span lang="EN-US" style="font-size:10.0pt;font-family:&quot;Tahoma&quot;,&quot;sans-serif&quot;"> cwb-bounces@sslmit.unibo.it [mailto:cwb-bounces@sslmit.unibo.it]

<b>On Behalf Of </b>BOFÍAS ALBERCH, EVA<br>

<b>Sent:</b> 14 February 2013 18:53<br>

<b>To:</b> cwb@sslmit.unibo.it<br>

<b>Subject:</b> [CWB] Multi-word units<o:p></o:p></span></p>

</div>

<p class="MsoNormal"><o:p>&nbsp;</o:p></p>

<p class="MsoNormal" style="margin-bottom:12.0pt">Hi,<br>

<br>

I don't know whether this is possible at all, but it doesn't hurt to ask. Here is the problem we have. We are developing a corpus to be exploited via CQP, and we would like future users to be able to access information in different ways. This is a diachronic corpus, and sometimes it is important to know what parts a given multi-word expression has. For instance, in Old Spanish we have expressions such as &quot;apressurada mientre&quot; ('mientre' is the equivalent of the English -ly), which clearly work like their contemporary Spanish equivalents: &quot;apresuradamente&quot;. It is important to encode this as a single word marked as 'adverb', but some potential users might be interested in studying the evolution of these forms and might want to distinguish forms that the scribes wrote as a single word (the same texts also have these adverbs with &quot;mente&quot; as single words) from the ones written as two separate words. The idea would be to find some way of coding the corpus so that multi-word expressions such as these can be tagged as a single word, while a user who wants to find all the instances of 'mientre', whether or not it is attached to the preceding word, can do that as well. Any suggestions? Or are we asking for something that is not possible?<br>

<br>

Eva<br>


<o:p></o:p></p>

</div>

</body>

</html>