Awesome Linguistics Resources for Spanish
A curated list of linguistic resources for Spanish NLP and computational linguistics (CL).
Clustering
Speech
Part of Speech Taggers (POS Taggers)
Named Entity Recognition (NER)
Corpora
Shared tasks
Corpora
- Multilingual Aligned Annotated Corpus (CRATER)
- UAM Treebank - 1,500 syntactically annotated sentences extracted from newspapers (El País Digital and Compra Maestra)
- POSTagged/syntactic dependencies - European Corpus Initiative Multilingual Corpus I
- The Corpus of Contemporary Spanish (POS tags, lemmas)
- Lemmas Dictionary
- esTenTen Spanish (POS-tagged)
- Europarl Corpus (Parallel Corpus English-Spanish)
- Colombian Political Speeches
- South American Slang Expressions/MTWE
- Syntax and Semantic Annotations (Subset Ancora Corpus)
- Plurilingual Specific Corpus on Economics, Medicine, Computer Science
- Copenhagen Treebank (Dependency Parsing)
- Reuters Corpora RCV2 - News Corpora
- MolinoLabs Corpus - News Corpora from Spain, Argentina and Mexico
- PANACEA - Legislation Corpus
- PANACEA - Legislation Ngram Corpus
- PANACEA - Dependency Parsed Corpus
- PANACEA - Monolingual Lexica (MWE, Frames, Semantic Classes)
- Opinion Mining - User reviews on Cars, Hotels, Washing machines, Books, Cell phones, Music, etc.
- Cross Lingual Textual Entailment (CLTE) Corpus (English-Spanish)
- Ngram Frequencies out of Colombia News Corpora
- Sagan Textual Entailment Test Suite
- Garcia, Marcos and Pablo Gamallo, 2013 - Portuguese and Spanish biographical relation extraction corpora (Garcia, Marcos and Pablo Gamallo, 2013. Exploring the Effectiveness of Linguistic Knowledge for Biographical Relation Extraction. Natural Language Engineering, CJO2013. doi:10.1017/S1351324913000314.)
- Garcia, Marcos and Pablo Gamallo, 2014 - Portuguese, Spanish and Galician coreference corpora (Garcia, Marcos and Pablo Gamallo, 2014. Multilingual corpora with coreferential annotation of person entities. In Proceedings of the 9th edition of the Language Resources and Evaluation Conference (LREC 2014), Reykjavik: 3229-3233.)
- COW (Corpora from the Web) Ngram/Annotated People's Name Corpora
- Wikicorpus - Portion of the 2006 Wikipedia annotated with WordNet synsets and POS tags
- Spanish Billion Words Corpus with word2vec Embeddings
- OSCAR (Open Super-large Crawled ALMAnaCH coRpus) - Spanish subset
Misc
Contribute
Contributions welcome! Read the contribution guidelines first.
License
To the extent possible under law, David Przybilla has waived all copyright and related or neighboring rights to this work.