The Peruvian Amazon forestry dataset: A leaf image classification corpus
Gerson Vizcarra,
Danitza Bermejo,
Antoni Mauricio,
Ricardo Zarate-Gomez,
Erwin Dianderas
Ecological Informatics, 2021
[bib]
[abstract]
@article{vizcarra2021peruvian,
title={The Peruvian Amazon forestry dataset: A leaf image classification corpus},
author={Vizcarra, Gerson and Bermejo, Danitza and Mauricio, Antoni and Gomez, Ricardo Zarate and Dianderas, Erwin},
journal={Ecological Informatics},
volume={62},
pages={101268},
year={2021},
publisher={Elsevier}
}
A forest census provides precise data for logging planning and the elaboration of the forest management plan. Species identification errors lead to inadequate forest management plans and high risks inside forest concessions. Hence, an identification protocol prevents the exploitation of non-commercial or endangered timber species. Current Peruvian legislation allows the participation of non-technical experts, called “materos”, in the identification. Materos use common names drawn from the folklore and traditions of their communities instead of formal ones, which generally leads to misclassifications. In practice, logging companies hire materos instead of botanists due to cost and time limitations. Given this motivation, we explore an end-to-end software solution to automate species identification. This paper introduces the Peruvian Amazon Forestry Dataset, which includes 59,441 leaf samples from ten of the most profitable and endangered timber-tree species. The proposal includes a background removal algorithm that feeds a CNN pre-trained on the ImageNet dataset. We evaluate the quantitative (accuracy) and qualitative (visual interpretation) impact of each stage through ablation experiments. The results show 96.64% training accuracy and 96.52% testing accuracy with the VGG-19 model. Furthermore, the visual interpretation of the model shows that leaf venation has the highest correlation with the plant recognition task.
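As a concrete starting point, here is a minimal sketch of the transfer-learning step described in the abstract, assuming PyTorch/torchvision; the leaves/train folder layout is hypothetical, and the background removal stage is assumed to have been applied beforehand. This is an illustration, not the paper's exact pipeline.

# Minimal sketch (not the paper's exact pipeline): fine-tune an
# ImageNet-pretrained VGG-19 for ten timber-tree species, assuming
# torchvision and a folder of leaf photos with backgrounds already removed.
import torch
import torch.nn as nn
from torchvision import models, transforms, datasets

NUM_SPECIES = 10  # ten timber-tree species in the dataset

# Standard ImageNet preprocessing so the pre-trained weights stay meaningful.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Hypothetical directory layout: one sub-folder per species.
train_set = datasets.ImageFolder("leaves/train", transform=preprocess)
loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

# Load VGG-19 pre-trained on ImageNet and replace the classifier head.
model = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)
model.classifier[6] = nn.Linear(4096, NUM_SPECIES)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

model.train()
for images, labels in loader:          # one pass shown for brevity
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()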
Paraphrase Generation via Adversarial Penalizations
Gerson Vizcarra,
Jose Ochoa-Luna
Workshop on Noisy User-generated Text at EMNLP 2020
[bib]
[abstract]
@inproceedings{vizcarra2020paraphrase,
title={Paraphrase Generation via Adversarial Penalizations},
author={Vizcarra, Gerson and Ochoa-Luna, Jose},
booktitle={Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020)},
pages={249--259},
year={2020}
}
Paraphrase generation is an important problem in Natural Language Processing that has recently been addressed with neural network-based approaches. This paper presents an adversarial framework to address the paraphrase generation problem in English. Unlike previous methods, we employ the discriminator output as a penalization instead of using policy gradients, and we propose a global discriminator to avoid Monte-Carlo search. In addition, this work uses and compares different settings of input representation. We compare our methods against several baselines on the Quora Question Pairs dataset. The results show that our framework is competitive with previous benchmarks.
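As a rough illustration of the penalization idea, the sketch below (assumptions only, not the paper's code) adds a term derived from a global discriminator's score to the usual cross-entropy generator loss, in place of a policy-gradient reward. The tensor names and the penalty_weight parameter are hypothetical.

# Minimal sketch: combine the standard cross-entropy paraphrase loss with a
# penalty taken directly from a global discriminator's output, instead of a
# policy-gradient reward.
import torch
import torch.nn as nn

def generator_loss(logits, target_ids, disc_score_fake, penalty_weight=1.0):
    """logits: (batch, seq_len, vocab) decoder outputs.
    target_ids: (batch, seq_len) reference paraphrase tokens.
    disc_score_fake: (batch,) discriminator probability that the generated
    paraphrase is human-written (higher = more realistic)."""
    ce = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), target_ids.reshape(-1))
    # Penalize sequences the discriminator finds unrealistic; if the score is
    # differentiable w.r.t. the generator (e.g. via soft outputs), no
    # Monte-Carlo rollout is needed.
    penalty = -torch.log(disc_score_fake + 1e-8).mean()
    return ce + penalty_weight * penalty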
A Deep Learning Approach for Sentiment Analysis in Spanish Tweets
Gerson Vizcarra,
Antoni Mauricio,
Leonidas Mauricio
International Conference on Artificial Neural Networks 2018
[bib]
[abstract]
@inproceedings{vizcarra2018deep,
title={A Deep Learning Approach for Sentiment Analysis in Spanish Tweets},
author={Vizcarra, Gerson and Mauricio, Antoni and Mauricio, Leonidas},
booktitle={International Conference on Artificial Neural Networks},
pages={622--629},
year={2018},
organization={Springer}
}
Document-level sentiment analysis is a well-known problem in Natural Language Processing (NLP) and serves as a reference task on which new architectures and models are tested, with metrics that also serve as references for other problems. The problem has been solved reasonably well for English, but metrics are still quite low in other languages. In addition, architectures that are successful in one language do not necessarily work in another. In the case of Spanish, data quantity and quality become a problem during data preparation and architecture design, since the few labeled datasets available include non-textual elements (such as emoticons or colloquial expressions). This work presents an approach to sentiment analysis in Spanish tweets and compares it with the state of the art. To do so, a preprocessing algorithm interprets colloquial expressions and emoticons and removes trivial words. Processed sentences are converted into matrices using the three most successful word-embedding methods (GloVe, FastText, and Word2Vec); the three matrices are then merged into a three-channel matrix that feeds our CNN-based model. The proposed architecture uses parallel convolution layers acting as k-gram detectors, so that each word and its context are weighted when predicting the sentiment polarity among four possible classes. After several tests, the kernel-size tuple that gave the best accuracy was <1, 2>. Finally, our model achieves 61.58% and 71.14% accuracy on the InterTASS and General Corpus datasets, respectively.
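The following is a minimal sketch of the three-channel, parallel-convolution idea, assuming PyTorch; the layer sizes, the ThreeChannelKGramCNN name, and the input shape convention are assumptions rather than the published configuration.

# Minimal sketch: stack GloVe, FastText, and Word2Vec lookups into a
# 3-channel "image" and run parallel convolution branches whose kernel
# heights act as k-gram detectors over four polarity classes.
import torch
import torch.nn as nn

class ThreeChannelKGramCNN(nn.Module):
    def __init__(self, embed_dim=300, kernel_sizes=(1, 2), n_filters=100,
                 n_classes=4):
        super().__init__()
        # One branch per k-gram size, each sliding over the full embedding.
        self.branches = nn.ModuleList([
            nn.Conv2d(in_channels=3, out_channels=n_filters,
                      kernel_size=(k, embed_dim))
            for k in kernel_sizes
        ])
        self.fc = nn.Linear(n_filters * len(kernel_sizes), n_classes)

    def forward(self, x):
        # x: (batch, 3, seq_len, embed_dim); channels = GloVe/FastText/Word2Vec
        feats = []
        for conv in self.branches:
            h = torch.relu(conv(x)).squeeze(3)        # (batch, filters, seq_len-k+1)
            feats.append(torch.max(h, dim=2).values)  # max-pool over time
        return self.fc(torch.cat(feats, dim=1))       # logits over 4 polarities

The default kernel_sizes=(1, 2) mirrors the <1, 2> tuple reported as optimal in the abstract.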