Lr0708/c
Members of the group:
- Zulema Cabrera
- Nuria Ferreiro
- Ainize Aguillo
OUTLINE FOR THE PROJECT
What are we going to do?
Our topic is Multilingual Corpora. We will take the information from this link: http://del.icio.us/joseba_abaitua/corpus+translationaitua. This is roughly what we are going to do with the topic. We still have to choose the pages, because there are quite a lot of them and some are quite interesting. Let's see what we can examine on the pages:
- all the possible searches they offer
- which results they offer
- how they can be improved
- their history, when they were created, for what goal, who were the creators
- how we can compare them with similar sites: if it is better, worse, what they offer, if they offer more or less, the tools, etc.
In order to do our project, we decided to analyse two different sites:
CLUVI
Bwana Net
Multilingual Corpora
Their history, when they were created, for what goal, who were the creators
The Corpus project is the priority research project of IULA, in which all its members take part. It gathers texts written in five languages (Catalan, Spanish, English, French and German) in the domains of economy, law, environment, medicine and computer science. Through the corpus, they try to infer the laws that govern the behavior of each language in each area. This corpus is the main support for the research and teaching activities of the institute. The research planned on the corpus includes: detection of neologisms and terms, studies on linguistic variation, partial parsing, alignment of texts, extraction of information for second-language teaching, extraction of information for the construction of electronic dictionaries, production of thesauri, etc.
The texts are selected by specialists in each area and grouped according to a thematic classification and to the uses proposed by those same specialists (Law, Economy, Environment, Medicine and Computer Science). The texts are then marked up according to the SGML standard, following the guidelines set for the Corpus.
The processing of the texts of the corpus involves the following steps:
- Structural markup
- Preprocessing (detection of dates, numbers, phrases, proper names ...)
- Morphological analysis and tagging according to the morphosyntactic tagsets designed at IULA
- Storage in a textual database
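The preprocessing step can be illustrated with a minimal sketch in Python. The two patterns below are hypothetical and only cover simple numeric dates and plain numbers; the actual rules used at IULA are not described here.

```python
import re

# Hypothetical patterns for illustration only; not the IULA rules.
DATE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")
NUMBER = re.compile(r"\b\d+(?:[.,]\d+)?\b")

def preprocess(text):
    """Detect dates and numbers in a text and return them in a dictionary."""
    dates = DATE.findall(text)
    # Blank out the dates first so their digits are not counted as numbers.
    remainder = DATE.sub(" ", text)
    numbers = NUMBER.findall(remainder)
    return {"dates": dates, "numbers": numbers}

result = preprocess("On 12/03/2004 the index rose 3.5 points to 1200.")
print(result)
```

Detecting phrases and proper names would need lexical resources rather than patterns alone, so they are left out of this sketch.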
The CLUVI (Linguistic Corpus of the University of Vigo) is an open set of parallel textual corpora of specialized registers of contemporary Galician language developed by the SLI (Computational Linguistics Group of the University of Vigo) and publicly available in its website since September 2003.
The CLUVI Corpus contains over 22 million words, and its main components are the TECTRA Corpus of English-Galician literary texts, the FEGA Corpus of French-Galician literary texts, the LEGA Corpus of Galician-Spanish legal texts, the UNESCO Corpus of English-Galician-French-Spanish scientific-technical divulgation texts, the LOGALIZA Corpus of English-Galician software localization, and the CONSUMER Corpus of Spanish-Galician-Catalan-Basque consumer information.
The public searching and browsing tool designed by the SLI is available at http://sli.uvigo.es/CLUVI/. This web application permits both simple and very complex searches of isolated words or sequences of words, and shows the multilingual equivalences of the terms in context, as found in real and referenced translations.
The terms searched can correspond to either of the languages of the translation, but it is also possible to carry out true multilingual searches, that is, to simultaneously search one term from each of the languages of translation. The number of aligned works and language pairs available on the website increases regularly, since the CLUVI is an academic research project in progress and with great vitality.
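To make the idea of a true multilingual search concrete, here is a minimal sketch in Python. It assumes that aligned units are available as simple (English, Galician) pairs; the sample sentences and the function name `bilingual_search` are invented for illustration and do not reflect how the CLUVI tool is actually implemented.

```python
import re

# Hypothetical aligned translation units: (English, Galician) pairs.
# These sentences are invented for illustration, not taken from CLUVI.
UNITS = [
    ("The house is white.", "A casa é branca."),
    ("She bought a house.", "Mercou un piso."),
    ("The dog sleeps.", "O can dorme."),
]

def bilingual_search(units, pattern_en, pattern_gl):
    """Return the units where both patterns match, one per language."""
    rx_en = re.compile(pattern_en, re.IGNORECASE)
    rx_gl = re.compile(pattern_gl, re.IGNORECASE)
    return [(en, gl) for en, gl in units
            if rx_en.search(en) and rx_gl.search(gl)]

# Keep only the units where "house" is translated as "casa".
hits = bilingual_search(UNITS, r"\bhouse\b", r"\bcasa\b")
print(hits)
```

Searching one term per language at the same time filters out the units where the source word was translated differently, which is what makes this kind of search useful for studying translation equivalences.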
At the moment, the CLUVI Parallel Corpus webpage permits searching five major corpora (TECTRA, FEGA, LEGA, UNESCO and LOGALIZA), as well as other minor parallel corpora now in progress. It should be pointed out that the CLUVI interface also permits browsing the TURIGAL Corpus of Portuguese-English tourism texts, and the Legebiduna Corpus of Basque-Spanish administrative texts developed by the DELi group at the U. of Deusto.
What pages are we going to use?
What are they used for?
CLUVI
CLUVI is the Linguistic Corpus of the University of Vigo. Besides, it is an open textual corpus of specialized registers of contemporary oral and written Galician language.
As the official site states, it “is an open set of parallel textual corpora of specialized registers of contemporary Galician language developed by the SLI (Computational Linguistics Group of the University of Vigo) and publicly available in its website since September 2003”.
This web application permits both simple and very complex searches of isolated words or sequences of words, and shows the multilingual equivalences of the terms in context, as found in real and referenced translations.
The corpus is divided into four subcorpora:
- the TECTRA parallel corpus of English-Galician literary texts
- the LEGA parallel corpus of Galician-Spanish legal-administrative texts
- the XIGA monolingual corpus of texts about computing in Galician
- the MEGA monolingual Galician corpus of language from the media
Besides, there are six more sections in progress. They are:
- EGAL Corpus of Galician-Spanish economy texts
- Corpus of English-Portuguese literary texts.
- Corpus of English-Spanish literary texts.
- DEGA Corpus of German-Galician literary texts.
- Corpus VEIGA of English-Galician film subtitling.
- PALOP Corpus of Portuguese-Spanish postcolonial literature.
For instance, the following alignment would be encoded in this way:
‘Hello.’
-Ola - dixen.
<tu>
<tuv xml:lang="en">
<seg>‘Hello.’</seg>
</tuv>
<tuv xml:lang="gl">
<seg>-Ola <hi type="incl">- dixen.</hi></seg>
</tuv>
</tu>
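As a rough illustration of how such a translation unit can be read by a program, the following Python sketch parses the example with the standard library and recovers the plain text of each language version. The function name `read_unit` is our own, and the sketch assumes well-formed XML with straight quotes around attribute values.

```python
import xml.etree.ElementTree as ET

# The translation unit from the example, as well-formed XML.
TU = """<tu>
  <tuv xml:lang="en"><seg>‘Hello.’</seg></tuv>
  <tuv xml:lang="gl"><seg>-Ola <hi type="incl">- dixen.</hi></seg></tuv>
</tu>"""

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def read_unit(xml_text):
    """Map each language code to the plain text of its segment."""
    tu = ET.fromstring(xml_text)
    return {
        tuv.get(XML_LANG): "".join(tuv.find("seg").itertext())
        for tuv in tu.findall("tuv")
    }

unit = read_unit(TU)
print(unit)
```

Note that `itertext()` also collects the text inside the `hi` element, so the Galician segment is returned whole, including the part marked as an inclusion.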
This is in essence the document type definition for the CLUVI parallel corpora:
<!ELEMENT cluvi_tmx (header, body) >
<!ATTLIST cluvi_tmx
    version CDATA #REQUIRED >
<!ELEMENT header (#PCDATA)>
<!ATTLIST header
    creationtool CDATA #REQUIRED
    creationtoolversion CDATA #REQUIRED
    segtype (block|paragraph|sentence|phrase) #REQUIRED
    o-tmf CDATA #REQUIRED
    adminlang CDATA #REQUIRED
    srclang CDATA #REQUIRED
    datatype CDATA #REQUIRED >
<!ELEMENT body (tu*) >
<!ELEMENT tu (tuv+) >
<!ELEMENT tuv (seg) >
<!ATTLIST tuv
    xml:lang CDATA #REQUIRED>
<!ELEMENT seg (#PCDATA | ph | hi | ling)*>
<!ELEMENT hi (#PCDATA | ling)*>
<!ATTLIST hi
    type CDATA #IMPLIED
    x CDATA #IMPLIED>
<!ELEMENT ph EMPTY>
<!ATTLIST ph
    x CDATA #IMPLIED>
<!ELEMENT ling (mor, ort)>
<!ELEMENT mor EMPTY>
<!ATTLIST mor
    cat (ARDFP|ARDFS|...) #REQUIRED
    lema CDATA #REQUIRED
    lema2 CDATA #IMPLIED>
<!ELEMENT ort (#PCDATA)>
The sections that are in progress can also be used, and you can look up words and expressions in them in the same way as in the rest of the components. As we have said before, this web application permits both simple and very complex searches. The site offers a ‘Search help’ that is quite useful for complex searches.
In the ‘Search help’ there are four different groups:
- Complex searches match regular expressions following the syntax and semantics of the regular expressions supported by PCRE (Perl Compatible Regular Expressions). [\b[xg]igab[yi]tes?\b]
- Symbols for characters. [(abc|xyz)], [\w]
- Quantifiers. [x+] [x{m, n}]
- Escaping characters. [\+ (literally "+")] [\. (literally ".")]
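The patterns from the ‘Search help’ can be tried out in a short Python sketch. Python's `re` module is not PCRE itself, but we assume here that it behaves the same way for these simple patterns; the sample sentences are invented for the test.

```python
import re

# The Search help example: spelling variants of "gigabyte(s)".
pattern = re.compile(r"\b[xg]igab[yi]tes?\b")
text = "Un xigabyte equivale a 1024 megabytes; two gigabytes is more."
variants = pattern.findall(text)

# Symbols for characters: alternation between two sequences.
alternatives = re.findall(r"(abc|xyz)", "abc-def-xyz")

# Quantifiers: between two and three repetitions of "x".
repeated = re.fullmatch(r"x{2,3}", "xxx") is not None

# Escaping: "\+" matches a literal plus sign, not "one or more".
sums = re.findall(r"\d\+\d", "1+2 and 3.4")

print(variants, alternatives, repeated, sums)
```

The first pattern finds "xigabyte" and "gigabytes" but not "megabytes", because the word boundary `\b` and the fixed letters "igab" rule it out.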
BwanaNet
- What is BwanaNet?
BwanaNet is an interface developed at the IULA that allows users to query the Technical Corpus (CT) of the Institut via the Internet. The CT is indexed using the Corpus Workbench, a set of tools developed at the Institut für Maschinelle Sprachverarbeitung of the University of Stuttgart.
With BwanaNet people can consult the CT-IULA documents. These are the steps to follow:
- 1. Select the language of the documents.
- 2. Select whether you want to run a monolingual or a multilingual query.
- 3. Select the documents.
- 4. Define the kind of query.
- 5. Define the query.
- 6. Visualize the results.
One of the good points of this site is that you can choose the domain and the subdomains, in order to run a more specific and narrower search.
The problem with this corpus, and one of the main differences between it and CLUVI, is that once you run the search you have defined, the results look quite different.
- CLUVI: http://webs.uvigo.es/sli/arquivos/lrec2004.pdf
- BWANANET: http://bwananet.iula.upf.edu/bwananet1a.es.htm
All the possible searches they offer
CLUVI
On entering the page we can see a great number of options across different languages. The options start with the LEGA corpus, which allows us to look up words from Galician-Spanish legal texts.
Apart from LEGA, there are more corpora, all of which include Galician:
- UNESCO Corpus of English-Galician-French-Spanish scientific-technical divulgation
- LOGALIZA Corpus of English-Galician software localization
- TECTRA Corpus of English-Galician literary texts
- FEGA Corpus of French-Galician literary texts
- CONSUMER Corpus of Spanish-Galician-Catalan-Basque consumer information
But there are also two corpora that cover languages other than Galician:
- LEGE-BI Corpus of Basque-Spanish legal texts
- TURIGAL Corpus of Portuguese-English tourism texts
In order to see how it works, we are going to show, with screenshots and explanations, the tools, the corpora and the results that a search returns.
LEGA
LEGA Corpus of Galician-Spanish legal texts:
http://ukey0708.files.wordpress.com/2008/06/lega.jpg
To look up a word or an expression we just have to type it in:
http://ukey0708.files.wordpress.com/2008/06/cluvi3.jpg
This is what we obtain, among many other results:
http://ukey0708.files.wordpress.com/2008/06/lega3.jpg
The results show the different contexts and meanings of a legal word like ‘derecho’.
If we click on the pink image, we obtain the following:
http://ukey0708.files.wordpress.com/2008/06/lega4.jpg
What we obtain is the translation context.
On the left there is this picture:
http://ukey0708.files.wordpress.com/2008/06/lega5.jpg
If we click on the letters CIV, we can see where the information is taken from:
http://ukey0708.files.wordpress.com/2008/06/lega6.jpg
UNESCO
UNESCO Corpus of English-Galician-French-Spanish scientific-technical divulgation.
- This component allows searching for any word or expression in four different languages. Let’s enter a word or expression in English and see the results:
http://ukey0708.files.wordpress.com/2008/06/unesco1.jpg
As a result we obtain:
http://ukey0708.files.wordpress.com/2008/06/unesco2.jpg
If we focus on, for example, number 13, we will have:
http://ukey0708.files.wordpress.com/2008/06/unesco3.jpg
And the place from which all this information is taken is:
http://ukey0708.files.wordpress.com/2008/06/unesco4.jpg
BWANA NET
Following the six steps that Bwana Net offers for running a search, we are going to see, with screenshots and some explanations, how Bwana Net works. Let's begin:
1: Documents and language: at the beginning, this is what you find:
http://ukey0708.files.wordpress.com/2008/06/v1.jpg
If we are interested in parallel documents, we just have to click ‘yes’ and then choose the language of the documents.
We have chosen English and parallel documents in Spanish. Now we have to continue deciding:
http://ukey0708.files.wordpress.com/2008/06/v2.jpg
One of the good points of this site is that you can choose the domain and the subdomains, in order to run a more specific and narrower search.
In addition, you can also choose between original or translated documents:
http://ukey0708.files.wordpress.com/2008/06/v3.jpg
And the type of documents:
http://ukey0708.files.wordpress.com/2008/06/v4.jpg
Finally, our selection is:
http://ukey0708.files.wordpress.com/2008/06/v5.jpg
The problem with this corpus, and one of the main differences between it and CLUVI, is that once you run the search you have defined, the results look quite different:
http://ukey0708.files.wordpress.com/2008/06/v6.jpg
And if we click on ‘Standard concordance’…:
http://ukey0708.files.wordpress.com/2008/06/v8.jpg
If we click on ‘isolated tokens’ this is what we see:
http://ukey0708.files.wordpress.com/2008/06/v7.jpg
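To give an idea of what a concordance view does, here is a minimal keyword-in-context sketch in Python. We assume that a ‘Standard concordance’ lists each occurrence of the search word with a few words of context on either side; the function `concordance` and the sample sentence are our own and do not reproduce how Bwana Net or the Corpus Workbench work internally.

```python
import re

def concordance(text, keyword, width=3):
    """Return keyword-in-context lines with `width` tokens on each side."""
    tokens = re.findall(r"\w+", text)
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            # Mark the keyword with brackets so it stands out in the line.
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

sample = "The right to vote is a civil right protected by law."
kwic = concordance(sample, "right", width=2)
for line in kwic:
    print(line)
```

Each occurrence of the keyword produces one line, which makes it easy to compare the contexts in which a term appears.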
As we can see CLUVI and Bwana Net are quite different. CLUVI is a simpler tool while Bwana Net is more specialised and accurate.
How they can be improved
- Analysing the site, at first sight we realised that the first page of Bwananet does not have clear instructions for searching. The main page gives a brief description of what it is based on and of the history of its creation, and it offers a link to the IULA page that describes the corpus project. At first sight of the CLUVI site, by contrast, we can see a clearer presentation of what it offers and a description of all the kinds of translations that the user can try.
- Bwananet can be visited in three languages: English, Spanish and Catalan. CLUVI offers only two: Galician and English. Both corpora offer very few interface languages, so relatively few people are able to use them. Both of them offer a description of the project and of the searches you can run. CLUVI has a better and clearer description of the information and of the way to use it.
- CLUVI offers a clear, itemised description of all the languages and translations that can be used, while the Bwananet site does not describe in much depth the possibilities that it offers.
- Both CLUVI and Bwana Net offer good results; they offer different languages, different kinds of texts (original, translated, legal, literary, etc.). While CLUVI is easier, Bwana Net is more specific.
How we can compare them with similar sites: if it is better, worse, what they offer, if they offer more or less, the tools, etc.
At the end of our research, what we find is that the two sites are quite different from each other. Let’s first see some common aspects:
- They both offer multilingual searching.
- They both offer specific research in very specific areas and contexts.
Let’s see now some differences:
- Bwana Net requires more time and more specific knowledge to run a search.
- Cluvi offers more information about the context and the place the information is taken from.
- Bwana Net is much more specific than Cluvi. It offers narrower searches and more options than Cluvi.
Final Conclusions
Both CLUVI and Bwana Net are quite interesting and useful sites with really good results; they offer different languages and different kinds of texts (original, translated, legal, literary, etc.). While CLUVI is easier, Bwana Net is more specific. The positive side of this conclusion is that each one is made for one specific purpose, and you are the one deciding which one to use.
List of sites
Multilingual corpora
- BwanaNet, interface developed by Institut Universitari de Lingüística Aplicada from Universitat Pompeu Fabra to access the multilingual Corpus Tècnic (CT): http://bwananet.iula.upf.edu/ (last visited, April 15th 2008).
- OPUS, compiled and published by Jörg Tiedemann, Rijksuniversiteit Groningen: http://urd.let.rug.nl/tiedeman/OPUS/ (last visited, April 15th 2008).

