[CWB] cqpweb and phonetic transcription

graham.ranger graham.ranger at univ-avignon.fr
Sun Dec 21 19:17:09 CET 2025


Brilliant! Thanks as ever for your full and precise answer, Andrew. I think that, given the material, I'm probably looking at a parallel corpus setup. I'm going to have fun transforming my colleague's fairly anarchic Word files into something palatable, but that's another story!
Best,
Graham.
-------- Original message --------
From: "Hardie, Andrew via CWB" <cwb at sslmit.unibo.it>
Date: 21/12/2025 15:41 (GMT+01:00)
To: Open source development of the Corpus WorkBench <cwb at sslmit.unibo.it>
Cc: "Hardie, Andrew" <a.hardie at lancaster.ac.uk>
Subject: Re: [CWB] cqpweb and phonetic transcription

I’ve indexed various corpora whose primary token stream was an IPA transcription (because the language was one without a written form). It works just as normal. Remember CQPweb as software is totally agnostic as to the script that the data uses, so IPA is just as good as Latin, Greek, Cyrillic, Japanese, or whatever.
 
But that means that, just like data in any other script, you need it to be tokenised, and any word-level annotation needs to be presented alongside the tokens as extra columns in the VRT file.
 
So for instance you can have IPA as an annotation, possibly alongside others, e.g. a POS tag, as here:
 
my          maɪ          POSSPRO
name        ne:m         NOUN
is          ɪz           VERB
Andrew      andɹu:       NOUN
 
Or you can have the primary data be in IPA, and then optionally add the orthographic form as an annotation:
 
maɪ         my
ne:m        name
ɪz          is
andɹu:      Andrew
 
In sum, if your standard French and your IPA transcriptions line up word by word, you can use one of them as an annotation on the other. Then you can search on either in the usual way, using either CQL or simple query. This is the best and most flexible approach.
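
As a rough illustration of the layout described above, here is a minimal Python sketch that writes word-aligned tokens out as tab-separated columns, one token per line. The function name `to_vrt`, the text ID `interview01`, and the use of `<text>` and `<s>` wrappers are assumptions made for the example; the column order (word, IPA, POS) simply mirrors the first sample above.

```python
def to_vrt(text_id, sentences):
    """Render aligned token tuples as vertical text:
    one token per line, columns separated by tabs."""
    lines = [f'<text id="{text_id}">']  # text_id is assumed to be a plain identifier
    for sentence in sentences:
        lines.append("<s>")
        for token in sentence:
            # token is a tuple such as (word, ipa, pos)
            lines.append("\t".join(token))
        lines.append("</s>")
    lines.append("</text>")
    return "\n".join(lines) + "\n"


# Sample tokens taken from the example in this message.
sentence = [
    ("my", "maɪ", "POSSPRO"),
    ("name", "ne:m", "NOUN"),
    ("is", "ɪz", "VERB"),
    ("Andrew", "andɹu:", "NOUN"),
]

vrt = to_vrt("interview01", [sentence])
print(vrt)
```

To make IPA the primary token stream instead, it should be enough to reorder each tuple so that the IPA form comes first.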
 
If the word lineup doesn’t match, so that you can’t do it as above, then either of the techniques you mention, i.e. giving the standard French as a sentence-level translation, or using two “parallel” corpora, would work. Neither is the ideal way to handle this kind of data, but if the tokenisation doesn’t line up, you might have to go with one of these.
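
One way to find out which case applies is to compare token counts sentence by sentence before deciding on a design. The sketch below is only a first-pass check under stated assumptions: both versions are already split into matching sentences, and tokens are separated by whitespace. The function name `find_misaligned` and the second sample sentence are invented for illustration.

```python
def find_misaligned(ipa_sentences, ortho_sentences):
    """Return the indices of sentence pairs whose whitespace-token
    counts differ, i.e. where word-by-word annotation would not work."""
    if len(ipa_sentences) != len(ortho_sentences):
        raise ValueError("the two versions have different numbers of sentences")
    mismatches = []
    for index, (ipa, ortho) in enumerate(zip(ipa_sentences, ortho_sentences)):
        if len(ipa.split()) != len(ortho.split()):
            mismatches.append(index)
    return mismatches


# First pair lines up (4 tokens each); second pair is a made-up mismatch.
ipa = ["maɪ ne:m ɪz andɹu:", "maɪ ne:m"]
ortho = ["my name is Andrew", "my name is"]

bad = find_misaligned(ipa, ortho)
print(bad)
```

Equal counts do not prove that each token corresponds to the right word, so a manual spot check is still worthwhile. An empty result suggests the single-corpus approach is feasible; a long list points towards the sentence-level or parallel-corpus options.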
 
 
>> Would the first type allow for searches that start with the IPA transcription?
 
So long as your IPA data is either the “word” (first column of the input) or an annotation (second column), you can search it.

 
(Your users would need an IPA soft keyboard of course. I am working on adding soft keyboards, but it’s not complete yet.)
 
 
>> One last question: I think that the audio could be linked to the files as metadata. Is this right?
 
Yes. See admin manual section 7.5.1. Provide the address of the files with the audio: prefix.
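
A small sketch of preparing such a value in Python follows. Only the `audio:` prefix comes from the advice above. The tab-separated row layout, the helper name `metadata_row`, the extra field, and the example URL are all assumptions for illustration, so check the admin manual section cited above for the format CQPweb actually expects.

```python
def metadata_row(text_id, audio_url, extra_fields=()):
    """Build one tab-separated metadata row for a text, with the
    audio file address carrying the 'audio:' prefix."""
    return "\t".join([text_id, *extra_fields, "audio:" + audio_url])


# Hypothetical text ID, recording year, and URL.
row = metadata_row(
    "interview01",
    "https://example.org/audio/interview01.mp3",
    extra_fields=["1988"],
)
print(row)
```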
 
best
 
Andrew.
 
 
 


From: CWB <cwb-bounces at sslmit.unibo.it>
On Behalf Of Graham Ranger -- UAPV via CWB
Sent: 21 December 2025 13:25
To: cwb at sslmit.unibo.it
Cc: Graham Ranger -- UAPV <graham.ranger at univ-avignon.fr>
Subject: [CWB] cqpweb and phonetic transcription


 

Hello again,
A second question, on a different thread for clarity: does anybody have experience with text and phonetic transcription? Specifically, I have transcriptions of interviews made 30-40 years ago in a form of regional French that only had 40 speakers at the time. I have 1) IPA transcriptions, with one or two local conventions for pauses, etc. and 2) reformulations in standard French. The variety being exclusively oral, this is all I have. Now, I would imagine that I could do this either as a corpus and its "translation" or as a single corpus with the transcriptions as sentence-level attributes, <s trans="...">, or something like that. Would the first type allow for searches that start with the IPA transcription? The second type appears of rather limited interest, since searches would need to start with the reformulation. One last question: I think that the audio could be linked to the files as metadata. Is this right?
In short, any accounts of user experiences with similar corpora would be very helpful!
Best,
Graham. 



