Lr0708/a
CORPUS ESPAÑOL
PROJECT SUMMARY
This semester we had to do a group project about corpora. At first we did not know what a corpus was; the first classes introduced the concept, and this project has then let us learn about its usefulness and how it is used.
Our project is divided into two parts: an informative part and a practical part.
- In the informative part, we explain what a corpus is, the purposes of the Spanish corpora, who creates them, the different Spanish corpora, and a comparison between the corpora we use in the practical part.
- In the practical part, we explain how the corpus is used with Mark Davies's corpus application. For this practice we use some Spanish expressions and sayings.
DEFINITION OF "CORPUS" AND "CORPUS LINGUISTICS"
- The Corpus is: The concept of carrying out research on written or spoken texts is not restricted to corpus linguistics. Indeed, individual texts are often used for many kinds of literary and linguistic analysis - the stylistic analysis of a poem, or a conversation analysis of a TV talk show. However, the notion of a corpus as the basis for a form of empirical linguistics is different from the examination of single texts in several fundamental ways.
- In principle, any collection of more than one text can be called a corpus (corpus being Latin for "body", hence a corpus is any body of text). But the term "corpus", when used in the context of modern linguistics, tends most frequently to have more specific connotations than this simple definition.
- Corpus linguistics is simply the study of language through corpus-based research, but it differs from traditional linguistics in its insistence on the systematic study of authentic examples of language in use.
1. Text linguistics vs corpus linguistics
2. Illustration vs evidence
3. Introspection and informant testing vs observation of text
(i.e. corpus = evidence)
- Corpus linguistics is the study of language as expressed in samples (corpora) or "real world" text. This method represents a digestive approach to deriving a set of abstract rules by which a natural language is governed or else relates to another language. Originally compiled by hand, corpora are now largely derived by an automated process, whose results are then corrected.
PURPOSES OF THE SPANISH CORPORA
Some of the general purposes of corpora, in all languages, are the following:
- To emphasize the structure of the language.
- To identify the structural units and the types of language.
- To describe the different possibilities of combination that exist among the units of the language.
- To highlight how the language is actually used.
- To study the characteristics of the language, bearing in mind "association patterns".
- To facilitate the search for any word that is needed.
- To increase the speed of such searches.
- To know the latest innovations in the language, since languages change.
- To allow interaction and participation among the different users of the language, if the search is online.
- To resolve doubts about the language.
- To explain concepts.
All these uses apply without any problem to the corpora of Spanish, which also add some more specific ones, for example:
- To know the different changes of meaning that words undergo depending on the context in which they are placed. This characteristic is related to pragmatics, the intention of the speaker.
- To be aware of the different ways of expressing a term, that is, its synonyms.
- To know the differences between the Spanish of America and that of Spain.
We have chosen two different corpora of the Spanish language. One is a corpus created by the Royal Spanish Academy and the other is a corpus application created by Mark Davies.
- Corpus del Español:
This is a website that allows you to quickly and easily search more than 100 million words in more than 20,000 Spanish texts from the 1200s to the 1900s. The interface allows you to search for exact words or phrases, wildcards, lemmas, parts of speech, or any combination of these.
You can search for surrounding words (collocates) within a ten-word window (e.g. all nouns somewhere near cadena, all adjectives near mujer, or all nouns near girar).
The corpus also allows you to easily limit searches by frequency and compare the frequency of words, phrases, and grammatical constructions, in at least two main ways:
• By register: comparisons between spoken, fiction, newspaper, and academic
• By historical period: compare different centuries from the 1200s to the 1900s
You can also easily carry out semantically-based queries of the corpus. For example, you can compare and contrast the collocates of two related words, to determine the difference in meaning between these words. You can find the frequency and distribution of synonyms for nearly 30,000 words and also compare their frequency in different registers and historical periods, and use these word lists as part of other queries. Finally, you can easily create your own lists of semantically-related words, and then use them directly as part of the query.
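The collocate search described above (words found within a fixed window around a search word) can be illustrated with a minimal sketch. This is not how the Corpus del Español is implemented; the function name, the sample sentence, and the choice to count the window on both sides of the word are assumptions made only for illustration, and no part-of-speech filtering is attempted.

```python
from collections import Counter

def collocates(tokens, node, window=10):
    """Count the words found within `window` tokens on either side of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # skip the node word itself
                counts[tokens[j]] += 1
    return counts

# Hypothetical sample sentence, already split into tokens.
tokens = "la mujer alta saludó a la otra mujer joven".split()
near_mujer = collocates(tokens, "mujer", window=1)
print(near_mujer.most_common())
```

With a real corpus the same counting would be restricted to one part of speech (for example, only adjectives near mujer), which requires tagged text.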
- And its latest version:
It brings together a collection of more than 10,000 texts from the 13th to the 20th century. The Corpus del Español allows new types of searches that until now were not possible with other online corpora of Spanish, for example:
- Searches for synonyms of more than 30,000 words;
- Searches for collocations, that is, co-occurrences between words ranked by frequency, e.g. which are the most common adjectives with 'face', the nouns that occur most frequently after 'softly', or the most common verbs with 'jokes';
- Frequency searches, e.g. which new verbs have appeared since the 19th century, or which synonyms of 'broken' are more common in written Spanish than in spoken Spanish;
- Frequency searches depending on:
- The grammatical category, e.g. the most common infinitives after 'impossible of', or the most common adjectives after 'night'
- The lemma, e.g. the frequency of all the verb forms associated with the lemma 'to say' in the 12th, 16th, or 20th century.
- Searches for words by suffix, e.g. the words that end in '-azo', or by embedded character strings, e.g. the words that contain the string '-camin-';
- Creation of customized word lists, e.g. lists of words related to emotions, clothes, etc., which can later be used in other searches;
- Searches built from combinations of simpler searches, e.g. all the forms of all the synonyms of 'say', followed by all the forms of all the synonyms of 'joke'.
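The suffix and character-string searches in the list above can be sketched with regular expressions. This is only an illustration: the `*` wildcard syntax, the function names, and the word list are assumptions, not the actual query language of the Corpus del Español.

```python
import re

def wildcard_to_regex(pattern):
    """Turn a pattern such as '*azo' into a regex; '*' stands for any run of letters."""
    escaped = re.escape(pattern)  # escapes '*' as '\*'
    return re.compile(escaped.replace(r"\*", r"\w*"), re.IGNORECASE)

def search(words, pattern):
    """Return the words that match the whole wildcard pattern."""
    regex = wildcard_to_regex(pattern)
    return [w for w in words if regex.fullmatch(w)]

# Hypothetical word list standing in for the corpus vocabulary.
words = ["portazo", "caminar", "encaminado", "camino", "casa", "golpazo"]
ending_in_azo = search(words, "*azo")
containing_camin = search(words, "*camin*")
print(ending_in_azo, containing_camin)
```

Because `*` may also match nothing, the pattern `*camin*` in this sketch also returns words that begin with "camin".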
In our opinion, corpora are a very useful tool that introduces anybody to a world of countless possibilities as far as searching for words is concerned, and they also allow us to know our own language better.
All these corpora carry out a difficult task, as they have to analyze and contain a huge variety of language characteristics and a great amount of language collected from many speakers.
WHO CREATES IT?
For our work on Spanish corpora, we have chosen two different corpora on the net: the RAE Spanish Corpus and the Spanish Corpus by Mark Davies, sponsored by the NEH (National Endowment for the Humanities). In this section, we look at the history and biography of the creators of the two corpora we are using and analysing. We start with the corpus of the Real Academia Española, which was created by this Spanish institution. Then, we look at the biography of Mark Davies and at the NEH.
Real Academia Española
The Royal Spanish Academy was founded in 1713 on the initiative of the Marquis of Villena, Juan Manuel Fernández Pacheco. King Felipe V approved it in 1714. Its first purpose was to fix the words and expressions of the Spanish language in their greatest propriety, purity, and elegance. The institution has adapted its functions over time. Nowadays, the Academy's mission is to watch over the changes that the Spanish language undergoes. The Royal Spanish Academy makes its decisions about possible changes to its dictionaries or corpora by holding meetings and discussing the proposals of its members.
The Royal Spanish Academy has two main corpora: CORDE and CREA. The first is a diachronic corpus of Spanish and the second is a corpus of current Spanish. We are going to use the first one, because we are more interested in the historical use of Spanish. We want to focus on the changes in, and the different uses of, the expressions that the Spanish language has had.
Mark Davies
Mark Davies is a professor at Brigham Young University in Utah. He teaches Corpus Linguistics in the Department of Linguistics and English Language of the university. He has also been a professor of Spanish Linguistics at Illinois State University.
He is very interested in everything that has to do with corpora and linguistics. He has developed several of his ideas and currently takes part in different initiatives on the net. He has created different web search interfaces and corpora, such as the BYU Corpus of American English and an interface to the British National Corpus. He has also worked on other languages, such as Spanish, with the Corpus del Español, and Portuguese, with the Corpus do Português. Furthermore, he has built a collection of TIME magazine articles, the TIME Magazine corpus, where we can find more than 275,000 articles on different topics.
As we can see, Davies is a very productive creator of search tools on the net. He has received several awards for his work as a professor and corpus creator. The Davies corpus that we are going to use is the Corpus del Español, which was created in 2001. Quoting Mark Davies, this corpus differs from the other Spanish corpora because it «allows users to perform advanced searches based on part of speech, lemma, synonyms, and word and clause frequency».
NEH
The National Endowment for the Humanities is an independent grant-making agency of the United States government dedicated to supporting research, education, preservation, and public programs in the humanities. It was created in 1965 and its main goal is to promote research by American institutions and individuals in fields like linguistics or history.
BYU
Brigham Young University is the university where Mark Davies works. It has helped in the creation of the Corpus del Español and of some of Davies's other projects.
DIFFERENT SPANISH CORPUSES
Corpus del Español. Mark Davies
"In April 2001 Mark Davies was awarded a 16 month grant from the National Endowment for the Humanities to develop a 100 million word searchable corpus of historical and modern Spanish texts on the web. Unlike other large corpora of Spanish, my Corpus del Español allows users to perform advanced searches based on part of speech, lemma, synonyms, and word and clause frequency."
- This Corpus del Español contains 100 million words.
- This corpus is based on an architecture which Davies created. This architecture has also been used in other corpora.
New version of the Corpus del Español
This new version, which was made in 2007, allows us to do more things than the previous one. Some of the new features are:
- With one simple query, we can compare two words and see the differences between related words like:
'pelo'/'cabello'
'comenzar'/'iniciar'
'gozar'/'disfrutar'
- Compare the use of words in two historical periods, as in the next example:
a comparison of the collocates of 'woman' in the 1800s and the 1900s
- See the texts in which a word appears. It offers the overall frequency of a word or phrase in the 1200s-1900s and in the four registers from the 1900s.
- It can save the results of a search and retrieve them later.
- "The corpus has been completely re-lemmatized and re-tagged for part of speech, and it is much more accurate than before. With the new architecture, it will be possible to do searches using fuzzy matching for part of speech" (e.g. [v*] for all verbs) or for more specific parts of speech (e.g. [*n*ms*] for all singular masculine nouns).
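The idea of fuzzy matching on part-of-speech tags can be sketched with shell-style wildcards. The tag strings below ("nms", "vip3s", and so on) and the sample words are hypothetical; the real tagset of the Corpus del Español is not given on this page, so this only illustrates the principle behind patterns like [v*] and [*n*ms*].

```python
from fnmatch import fnmatchcase

# Hypothetical (word, tag) pairs; the tags are invented for this example.
tagged = [
    ("gato", "nms"),     # noun, masculine, singular
    ("casa", "nfs"),     # noun, feminine, singular
    ("canta", "vip3s"),  # verb, indicative present, 3rd person singular
    ("cantar", "vr"),    # verb, infinitive
    ("rojo", "jms"),     # adjective, masculine, singular
]

def by_tag(tagged_words, pattern):
    """Return the words whose tag matches a wildcard pattern such as 'v*'."""
    return [word for word, tag in tagged_words if fnmatchcase(tag, pattern)]

all_verbs = by_tag(tagged, "v*")
masc_sing_nouns = by_tag(tagged, "*n*ms*")
print(all_verbs, masc_sing_nouns)
```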
- "The search interface and the query syntax have been completely changed, to make the searches more intuitive and easy to carry out."
- There are some differences between the old version and the new one.
CREA
CREA (Corpus de Referencia del Español Actual) is a reference corpus of current Spanish. It covers all the varieties that Spanish has nowadays. The corpus is a mixture of written and oral texts, dating from 1975 to the present.
Its working procedure is the following:
1. Determining the size of the corpus
2. Classifying texts chronologically, geographically, etc.
3. Acquisition of texts
4. Classification of texts:
- Why are these texts in the corpus?
- How can these texts be prepared for later use?
- How are they annotated and exploited?
Work on it began in 1996. Its first phase, with texts published between 1975 and 1999, was concluded in December 2000. That phase is finished but under continuous review. It contains 130 million forms, which fulfilled the aims set at the beginning of the project. At the end of 2004, the corpus contained 170 million forms.
Its materials are selected according to a series of parameters:
1. Medium: 90% corresponds to written language and 10% to oral language.
Of this 90%, 49% is books, another 49% is press, and the remaining 2% gathers the texts that we call miscellany: leaflets, brochures.
2. Chronological:
In periods of five years: 1975-79; 1980-84; 1985-89; 1990-94; 1995-99.
3. Origin:
50% of the texts come from Spain and the other 50% from Spanish America. The Spanish American 50% is distributed across linguistic zones according to the number of speakers. The zones are: Andean, Caribbean, Central, Chilean, Mexican.
4. Type of text:
It is formed by three blocks of materials: books and press; miscellany; transcriptions of spoken language.
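The percentages given for the medium can be combined into overall shares of the corpus. The short sketch below does that arithmetic; applying the design percentages to the 170 million forms reported for the end of 2004 is our own assumption for illustration, and the real distribution may differ.

```python
from fractions import Fraction

# Design percentages stated above.
WRITTEN = Fraction(90, 100)
ORAL = Fraction(10, 100)
WRITTEN_SPLIT = {
    "books": Fraction(49, 100),
    "press": Fraction(49, 100),
    "miscellany": Fraction(2, 100),
}

def overall_shares():
    """Share of the whole corpus for each medium."""
    shares = {name: WRITTEN * part for name, part in WRITTEN_SPLIT.items()}
    shares["oral"] = ORAL
    return shares

shares = overall_shares()
TOTAL_FORMS = 170_000_000  # assumption: the end-of-2004 figure follows the same split
forms = {name: int(share * TOTAL_FORMS) for name, share in shares.items()}
for name, share in shares.items():
    print(f"{name}: {float(share):.1%} of the corpus, about {forms[name]:,} forms")
```

Under these assumptions, books and press each account for 44.1% of the corpus, miscellany for 1.8%, and oral texts for 10%.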
Books and press
They are divided into two blocks:
fiction and non-fiction
These two blocks are further divided into seven areas, which differ in their size and number of forms. They are:
1. Sciences and technology
2. Social sciences
3. Politics, economy, and finance
4. Arts
5. Leisure
6. Health
7. Fiction: novels, short stories, theatre
Miscellany
It is divided into two blocks:
1. Printed
2. Not printed
It includes web pages, e-mails...
Oral corpus
It takes its material from television and radio programs.
It consists of one large block with the following subtypes:
1. News
2. Reports
3. Interviews
4. Debates
5. Round-table discussions
6. Documentaries
7. Sport news
8. Magazines
9. Sport magazines
10. Variety shows
11. Prize draws and contests
CORDE
What led the RAE to create this corpus, CORDE, was the good results obtained during the first months of the project.
CORDE consists of somewhat more than 300 million forms, drawn from texts dating from the origins of the language until 1974.
PRACTICE PART
As a brief practice for our project, we decided to choose and analyze eight different expressions in the Spanish language. After analysing them both in CORDE, the corpus of the Royal Spanish Academy, and in the Spanish corpus application of Mark Davies/NEH/BYU, we have written a short conclusion about what we found.
Expressions
We chose different Spanish expressions about whose correctness we are sometimes unsure. We looked these expressions up in the RAE Spanish Corpus and in the Spanish Corpus by Mark Davies.
1."Hoy en día":
- In the case of the RAE Spanish Corpus, we typed the expression "Hoy en día" in the "consulta" field and pressed the "search" button, and 43 results in 26 documents appeared. Then we clicked "ver estadísticas", which shows statistics for the examples found, broken down by year, country, and topic.
Prosa histórica 20
Prosa jurídica 7
Prosa científica 6
Prosa religiosa 4
Verso narrativo 3
Verso lírico 2
Prosa narrativa 1
1.1 Hoy en día
- In the Spanish Corpus application of Davies/NEH/BYU we found these results:
News 27
Fiction 17
Academic 2
Oral 50
2."Hoy día":
- We sometimes think that this expression is not correct, but it is attested in the RAE Spanish Corpus, because in the past people used it in everyday speech. When we typed "Hoy día" in the "consulta" field, the search showed us 379 cases in 141 documents. Then, clicking "ver estadísticas", we saw the search results under the following topics:
Prosa científica 478
Prosa histórica 342
Prosa narrativa 202
Prosa de sociedad 147
Prosa didáctica 103
Prosa religiosa 100
Prosa jurídica 78
Verso dramático 43
Verso narrativo 33
2.1. Hoy día
- In the Spanish Corpus application of Davies/NEH/BYU we found these results:
News 12
Fiction 5
Academic 20
Oral 62
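To compare how the two expressions are distributed across registers, the Davies/NEH/BYU figures listed above for "Hoy en día" and "Hoy día" can be turned into percentages. The page does not say whether these figures are raw counts or normalized frequencies, so the sketch below only compares their relative shares; the function and variable names are our own.

```python
# Figures copied from the Davies/NEH/BYU results above (units not stated).
hoy_en_dia = {"News": 27, "Fiction": 17, "Academic": 2, "Oral": 50}
hoy_dia = {"News": 12, "Fiction": 5, "Academic": 20, "Oral": 62}

def register_shares(counts):
    """Percentage of the total that falls in each register, to one decimal."""
    total = sum(counts.values())
    return {register: round(100 * n / total, 1) for register, n in counts.items()}

shares_hoy_en_dia = register_shares(hoy_en_dia)
shares_hoy_dia = register_shares(hoy_dia)
for register in hoy_en_dia:
    print(register, shares_hoy_en_dia[register], shares_hoy_dia[register])
```

In both cases the oral register has the largest share, while the academic share is much higher for "Hoy día" than for "Hoy en día".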
3."Querer es poder"
- In the RAE Spanish corpus, 4 cases in 4 documents are found.
Prosa narrativa 1 case
Prosa didáctica 1 case
Prosa científica 1 case
Prosa histórica 1 case
3.1. Querer es poder
- In the Spanish Corpus application of Davies/NEH/BYU we found these results:
News 0
Fiction 0.2
Academic 0
Oral 0
4."Mucho ruido y pocas nueces"
- In the RAE Spanish Corpus we found these cases:
Prosa narrativa 5
Prosa de sociedad 4
Prosa histórica 3
4.1. "Mucho ruido y pocas nueces"
- In the Spanish Corpus application of Davies/NEH/BYU we found these cases:
News 0.2
Fiction 0.7
Academic 0.2
Oral 0.2
5."De tal palo tal astilla"
- In the RAE Spanish corpus we found these cases:
Prosa narrativa 6 cases
Prosa dramática 1 case
Prosa didáctica 1 case
5.1. "De tal palo tal astilla"
Academic 0
News 0
Fiction 0.5
Oral 0
6."Con la miel en los labios"
- In the RAE Spanish corpus we have found these cases:
Prosa narrativa 8
Prosa didáctica 2
Prosa histórica 2
Prosa dramática 1
Prosa religiosa 1
6.1. "Con la miel en los labios"
- In Davies's corpus we found these cases:
Academic 0
News 0.7
Fiction 0.2
Oral 0
7."Más vale tarde que nunca"
- In the RAE Spanish corpus we have found these cases:
Prosa narrativa 4
Prosa periodística 3
Prosa religiosa 2
Prosa didáctica 1
Prosa científica 1
Prosa histórica 1
7.1. "Más vale tarde que nunca"
- In Davies's corpus we found these cases:
Academic 0
News 0.7
Fiction 0.2
Oral 0
8."Faltaría más"
- In the RAE Spanish corpus we have found these cases:
Prosa narrativa 34
Prosa histórica 5
Prosa dramática 3
Prosa didáctica 2
Prosa de sociedad 1
Prosa periodística 1
8.1."Faltaría más"
- In Davies's corpus we found these cases:
Academic 0
News 0
Fiction 0.3
Oral 0
Conclusions of the practice
After our search we can say that corpora are a useful tool for any linguist, and even for the general public, because they make it possible to learn about the history and use of different expressions, in our case in the Spanish language. With corpora we can look up any expression we want and see how it is used nowadays; if the expression is not used in today's language, the corpora give us dates and historical information about it.
Moreover, corpora show us the normal use of an expression and the contexts in which it is used, through the texts in which it appears. In addition, they tell us whether the expression is normally used in colloquial language or in a more formal register.
FINAL CONCLUSION
After working with corpora throughout the semester, we decided to do our final project on the Spanish corpus.
Having researched corpora on the Internet (their usefulness, their history, and how they are put into practice), we have come to the conclusion that they are a new, essential, and very interesting tool for many of our future jobs.
As students, we believe it is a good system for getting to know better the meaning of expressions, words, etc. through the examples found in Mark Davies's application. Thanks to the RAE Spanish Corpus, we have learned about the different expressions used and in which types of writing they are most used.
Pages that we are using
- Corpus del Español Davies/NEH/BYU: [1]
- Real Academia Española CORDE: [2]
- Article about the new version of the Spanish Corpus by Joseba Abaitua: [3]
- Mark Davies: [4]
- National Endowment for the Humanities: [5]
- Brigham Young University: [6]
- Book: "Corpus Linguistics: Investigating Language Structure and Use", Douglas Biber, Susan Conrad and Randi Reppen.

