[CWB] cqpweb and phonetic transcription

Hardie, Andrew a.hardie at lancaster.ac.uk
Sun Dec 21 15:39:23 CET 2025


I’ve indexed various corpora whose primary token stream was an IPA transcription (because the language was one without a written form). It works just as normal. Remember CQPweb as software is totally agnostic as to the script that the data uses, so IPA is just as good as Latin, Greek, Cyrillic, Japanese, or whatever.

But that means that, just like data in any other script, you need it to be tokenised, and any word-level annotation needs to be presented alongside the tokens as extra columns in the VRT file.

So, for instance, you can have IPA as an annotation, possibly alongside others such as a POS tag, as here:

my          maɪ          POSSPRO
name        ne:m         NOUN
is          ɪz           VERB
Andrew      andɹu:       NOUN

Or you can have the primary data be in IPA, and then optionally add the orthographic form as an annotation:

maɪ         my
ne:m        name
ɪz          is
andɹu:      Andrew

In sum, if your standard French and your IPA transcriptions line up word by word, you can use one of them as an annotation on the other. Then you can search on either in the usual way, using either CQL or simple query. This is the best and most flexible approach.
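
To make the two layouts concrete, here is a minimal Python sketch that writes aligned (orthographic, IPA) token pairs out as tab-separated vertical text, with either column first. The function name to_vrt, the "primary" parameter and the <s> sentence tags are assumptions made for the illustration; they are not part of CWB or CQPweb.

```python
def to_vrt(sentences, primary="word"):
    """Render aligned (word, ipa) token pairs as tab-separated vertical text.

    sentences: list of sentences, each a list of (word, ipa) tuples.
    primary:   "word" puts the orthographic form in the first column,
               "ipa" puts the IPA transcription there instead.
    The <s> ... </s> wrapper is an assumed sentence-level region.
    """
    if primary not in ("word", "ipa"):
        raise ValueError("primary must be 'word' or 'ipa'")
    lines = []
    for sentence in sentences:
        lines.append("<s>")
        for word, ipa in sentence:
            # One token per line; columns are separated by a single tab.
            columns = (word, ipa) if primary == "word" else (ipa, word)
            lines.append("\t".join(columns))
        lines.append("</s>")
    return "\n".join(lines) + "\n"


# The example sentence from this message, with IPA as the primary column.
sample = [[("my", "maɪ"), ("name", "ne:m"), ("is", "ɪz"), ("Andrew", "andɹu:")]]
print(to_vrt(sample, primary="ipa"))
```

Whichever column comes first becomes the "word"; the other is declared as an annotation when the corpus is indexed.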

If the words don’t line up, so you can’t do it as per above, then either of the techniques you mention, i.e. giving the standard French as a sentence-level translation, or using two “parallel” corpora, would work. Neither is the ideal way to handle this kind of data. But if your tokenisation doesn’t line up, then you might have to go with one of these.
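
A quick way to see which situation you are in is to compare token counts sentence by sentence. The sketch below assumes whitespace tokenisation and a hypothetical helper name, find_misaligned; note that equal counts are only a necessary condition, and do not prove that the words really correspond one to one.

```python
def find_misaligned(sentence_pairs):
    """Return the indices of sentence pairs whose token counts differ.

    sentence_pairs: list of (ipa_sentence, orthographic_sentence) strings.
    Tokens are assumed to be separated by whitespace.
    """
    return [
        index
        for index, (ipa, ortho) in enumerate(sentence_pairs)
        if len(ipa.split()) != len(ortho.split())
    ]


# Invented test data based on the example above; the second pair has
# four IPA tokens but only three orthographic ones.
pairs = [
    ("maɪ ne:m ɪz andɹu:", "my name is Andrew"),
    ("maɪ ne:m ɪz andɹu:", "my name's Andrew"),
]
print(find_misaligned(pairs))
```

If the list comes back empty for the whole corpus, the word-by-word annotation approach is at least possible; otherwise the listed sentences need re-tokenising, or one of the fallback designs.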


>> Would the first type allow for searches that start with the IPA transcription?

So long as your IPA data is either the “word” (first column of the input) or an annotation (second column), you can search it.
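
As an illustration, a CQL query for a sequence of IPA forms is just a series of [attribute="value"] token expressions. The Python sketch below builds such a query string; the annotation handle "ipa" is an assumption (use "word" if the IPA is your first column), and cqp_sequence is a hypothetical helper, not part of CQPweb.

```python
# Characters with special meaning in a regular expression; CQP treats
# attribute values as regexes, so these are backslash-escaped.
REGEX_SPECIALS = set(".^$*+?()[]{}|\\")


def cqp_sequence(tokens, attribute="ipa"):
    """Build a CQL query matching the given tokens in sequence.

    attribute: handle of the annotation to search (assumed "ipa" here).
    """
    parts = []
    for token in tokens:
        if '"' in token:
            # Quoting of double quotes is left out of this sketch.
            raise ValueError("double quotes are not handled in this sketch")
        escaped = "".join(
            "\\" + ch if ch in REGEX_SPECIALS else ch for ch in token
        )
        parts.append('[%s="%s"]' % (attribute, escaped))
    return " ".join(parts)


print(cqp_sequence(["maɪ", "ne:m"]))
```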

(Your users would need an IPA soft keyboard of course. I am working on adding soft keyboards, but it’s not complete yet.)


>> One last question: I think that the audio could be linked to the files as metadata. Is this right?

Yes. See admin manual section 7.5.1. Provide the address of the files with the audio: prefix.

best

Andrew.



From: CWB <cwb-bounces at sslmit.unibo.it> On Behalf Of Graham Ranger -- UAPV via CWB
Sent: 21 December 2025 13:25
To: cwb at sslmit.unibo.it
Cc: Graham Ranger -- UAPV <graham.ranger at univ-avignon.fr>
Subject: [CWB] cqpweb and phonetic transcription

Hello again,
A second question, on a different thread for clarity: does anybody have experience with text and phonetic transcription? Specifically, I have transcriptions of interviews made 30-40 years ago in a form of regional French that only had 40 speakers at the time. I have 1) IPA transcriptions, with one or two local conventions for pauses, etc. and 2) reformulations in standard French. The variety being exclusively oral, this is all I have. Now, I would imagine that I could do this either as a corpus and its "translation" or as a single corpus with the transcriptions as sentence-level attributes <s trans="..."> or something like that. Would the first type allow for searches that start with the IPA transcription? The second type appears of rather limited interest, since searches would need to start with the reformulation. One last question: I think that the audio could be linked to the files as metadata. Is this right?
In short, any accounts of user experiences with similar corpora would be very helpful!
Best,
Graham.