-
Securing Hybrid Wireless Body Area Networks (HyWBAN): Advancements in Semantic Communications and Jamming Techniques
Authors:
Simone Soderi,
Mariella Särestöniemi,
Syifaul Fuada,
Matti Hämäläinen,
Marcos Katz,
Jari Iinatti
Abstract:
This paper explores novel strategies to strengthen the security of Hybrid Wireless Body Area Networks (HyWBANs), essential in smart healthcare and Internet of Things (IoT) applications. Recognizing the vulnerability of HyWBAN to sophisticated cyber-attacks, we propose an innovative combination of semantic communications and jamming receivers. This dual-layered security mechanism protects against u…
▽ More
This paper explores novel strategies to strengthen the security of Hybrid Wireless Body Area Networks (HyWBANs), essential in smart healthcare and Internet of Things (IoT) applications. Recognizing the vulnerability of HyWBAN to sophisticated cyber-attacks, we propose an innovative combination of semantic communications and jamming receivers. This dual-layered security mechanism protects against unauthorized access and data breaches, particularly in scenarios involving in-body to on-body communication channels. We conduct comprehensive laboratory measurements to understand hybrid (radio and optical) communication propagation through biological tissues and utilize these insights to refine a dataset for training a Deep Learning (DL) model. These models, in turn, generate semantic concepts linked to cryptographic keys for enhanced data confidentiality and integrity using a jamming receiver. The proposed model demonstrates a significant reduction in energy consumption compared to traditional cryptographic methods, like Elliptic Curve Diffie-Hellman (ECDH), especially when supplemented with jamming. Our approach addresses the primary security concerns and sets the baseline for future secure biomedical communication systems advancements.
△ Less
Submitted 24 April, 2024;
originally announced April 2024.
-
Predicting Sustainable Development Goals Using Course Descriptions -- from LLMs to Conventional Foundation Models
Authors:
Lev Kharlashkin,
Melany Macias,
Leo Huovinen,
Mika Hämäläinen
Abstract:
We present our work on predicting United Nations sustainable development goals (SDG) for university courses. We use an LLM named PaLM 2 to generate training data given a noisy human-authored course description input as input. We use this data to train several different smaller language models to predict SDGs for university courses. This work contributes to better university level adaptation of SDG…
▽ More
We present our work on predicting United Nations sustainable development goals (SDG) for university courses. We use an LLM named PaLM 2 to generate training data given a noisy human-authored course description input as input. We use this data to train several different smaller language models to predict SDGs for university courses. This work contributes to better university level adaptation of SDGs. The best performing model in our experiments was BART with an F1-score of 0.786.
△ Less
Submitted 23 April, 2024; v1 submitted 26 February, 2024;
originally announced February 2024.
-
Noise Morphing for Audio Time Stretching
Authors:
Eloi Moliner,
Leonardo Fierro,
Alec Wright,
Matti Hämäläinen,
Vesa Välimäki
Abstract:
This letter introduces an innovative method to enhance the quality of audio time stretching by precisely decomposing a sound into sines, transients, and noise and by improving the processing of the latter component. While there are established methods for time-stretching sines and transients with high quality, the manipulation of noise or residual components has lacked robust solutions in prior re…
▽ More
This letter introduces an innovative method to enhance the quality of audio time stretching by precisely decomposing a sound into sines, transients, and noise and by improving the processing of the latter component. While there are established methods for time-stretching sines and transients with high quality, the manipulation of noise or residual components has lacked robust solutions in prior research. The proposed method combines sound decomposition with previous techniques for audio spectral resynthesis. The time-stretched noise component is achieved by morphing its time-interpolated spectral magnitude with a white-noise excitation signal. This method stands out for its simplicity, efficiency, and audio quality. The results of a subjective experiment affirm the superiority of this approach over current state-of-the-art methods across all evaluated stretch factors. The proposed technique notably excels in extreme stretching scenarios, signifying a substantial elevation in performance. The proposed method holds promise for a wide range of applications in slow-motion media content, such as music or sports video production.
△ Less
Submitted 22 December, 2023;
originally announced December 2023.
-
Sentiment Analysis Using Aligned Word Embeddings for Uralic Languages
Authors:
Khalid Alnajjar,
Mika Hämäläinen,
Jack Rueter
Abstract:
In this paper, we present an approach for translating word embeddings from a majority language into 4 minority languages: Erzya, Moksha, Udmurt and Komi-Zyrian. Furthermore, we align these word embeddings and present a novel neural network model that is trained on English data to conduct sentiment analysis and then applied on endangered language data through the aligned word embeddings. To test ou…
▽ More
In this paper, we present an approach for translating word embeddings from a majority language into 4 minority languages: Erzya, Moksha, Udmurt and Komi-Zyrian. Furthermore, we align these word embeddings and present a novel neural network model that is trained on English data to conduct sentiment analysis and then applied on endangered language data through the aligned word embeddings. To test our model, we annotated a small sentiment analysis corpus for the 4 endangered languages and Finnish. Our method reached at least 56\% accuracy for each endangered language. The models and the sentiment corpus will be released together with this paper. Our research shows that state-of-the-art neural models can be used with endangered languages with the only requirement being a dictionary between the endangered language and a majority language.
△ Less
Submitted 24 May, 2023;
originally announced May 2023.
-
Ring That Bell: A Corpus and Method for Multimodal Metaphor Detection in Videos
Authors:
Khalid Alnajjar,
Mika Hämäläinen,
Shuo Zhang
Abstract:
We present the first openly available multimodal metaphor annotated corpus. The corpus consists of videos including audio and subtitles that have been annotated by experts. Furthermore, we present a method for detecting metaphors in the new dataset based on the textual content of the videos. The method achieves a high F1-score (62\%) for metaphorical labels. We also experiment with other modalitie…
▽ More
We present the first openly available multimodal metaphor annotated corpus. The corpus consists of videos including audio and subtitles that have been annotated by experts. Furthermore, we present a method for detecting metaphors in the new dataset based on the textual content of the videos. The method achieves a high F1-score (62\%) for metaphorical labels. We also experiment with other modalities and multimodal methods; however, these methods did not out-perform the text-based model. In our error analysis, we do identify that there are cases where video could help in disambiguating metaphors, however, the visual cues are too subtle for our model to capture. The data is available on Zenodo.
△ Less
Submitted 15 December, 2022;
originally announced January 2023.
-
Modern French Poetry Generation with RoBERTa and GPT-2
Authors:
Mika Hämäläinen,
Khalid Alnajjar,
Thierry Poibeau
Abstract:
We present a novel neural model for modern poetry generation in French. The model consists of two pretrained neural models that are fine-tuned for the poem generation task. The encoder of the model is a RoBERTa based one while the decoder is based on GPT-2. This way the model can benefit from the superior natural language understanding performance of RoBERTa and the good natural language generatio…
▽ More
We present a novel neural model for modern poetry generation in French. The model consists of two pretrained neural models that are fine-tuned for the poem generation task. The encoder of the model is a RoBERTa based one while the decoder is based on GPT-2. This way the model can benefit from the superior natural language understanding performance of RoBERTa and the good natural language generation performance of GPT-2. Our evaluation shows that the model can create French poetry successfully. On a 5 point scale, the lowest score of 3.57 was given by human judges to typicality and emotionality of the output poetry while the best score of 3.79 was given to understandability.
△ Less
Submitted 6 December, 2022;
originally announced December 2022.
-
Emotion Conditioned Creative Dialog Generation
Authors:
Khalid Alnajjar,
Mika Hämäläinen
Abstract:
We present a DialGPT based model for generating creative dialog responses that are conditioned based on one of the following emotions: anger, disgust, fear, happiness, pain, sadness and surprise. Our model is capable of producing a contextually apt response given an input sentence and a desired emotion label. Our model is capable of expressing the desired emotion with an accuracy of 0.6. The best…
▽ More
We present a DialGPT based model for generating creative dialog responses that are conditioned based on one of the following emotions: anger, disgust, fear, happiness, pain, sadness and surprise. Our model is capable of producing a contextually apt response given an input sentence and a desired emotion label. Our model is capable of expressing the desired emotion with an accuracy of 0.6. The best performing emotions are neutral, fear and disgust. When measuring the strength of the expressed emotion, we find that anger, fear and disgust are expressed in the most strong fashion by the model.
△ Less
Submitted 6 December, 2022;
originally announced December 2022.
-
Automatic Generation of Factual News Headlines in Finnish
Authors:
Maximilian Koppatz,
Khalid Alnajjar,
Mika Hämäläinen,
Thierry Poibeau
Abstract:
We present a novel approach to generating news headlines in Finnish for a given news story. We model this as a summarization task where a model is given a news article, and its task is to produce a concise headline describing the main topic of the article. Because there are no openly available GPT-2 models for Finnish, we will first build such a model using several corpora. The model is then fine-…
▽ More
We present a novel approach to generating news headlines in Finnish for a given news story. We model this as a summarization task where a model is given a news article, and its task is to produce a concise headline describing the main topic of the article. Because there are no openly available GPT-2 models for Finnish, we will first build such a model using several corpora. The model is then fine-tuned for the headline generation task using a massive news corpus. The system is evaluated by 3 expert journalists working in a Finnish media house. The results showcase the usability of the presented approach as a headline suggestion tool to facilitate the news production process.
△ Less
Submitted 5 December, 2022;
originally announced December 2022.
-
Video Games as a Corpus: Sentiment Analysis using Fallout New Vegas Dialog
Authors:
Mika Hämäläinen,
Khalid Alnajjar,
Thierry Poibeau
Abstract:
We present a method for extracting a multilingual sentiment annotated dialog data set from Fallout New Vegas. The game developers have preannotated every line of dialog in the game in one of the 8 different sentiments: \textit{anger, disgust, fear, happy, neutral, pained, sad } and \textit{surprised}. The game has been translated into English, Spanish, German, French and Italian. We conduct experi…
▽ More
We present a method for extracting a multilingual sentiment annotated dialog data set from Fallout New Vegas. The game developers have preannotated every line of dialog in the game in one of the 8 different sentiments: \textit{anger, disgust, fear, happy, neutral, pained, sad } and \textit{surprised}. The game has been translated into English, Spanish, German, French and Italian. We conduct experiments on multilingual, multilabel sentiment analysis on the extracted data set using multilingual BERT, XLMRoBERTa and language specific BERT models. In our experiments, multilingual BERT outperformed XLMRoBERTa for most of the languages, also language specific models were slightly better than multilingual BERT for most of the languages. The best overall accuracy was 54\% and it was achieved by using multilingual BERT on Spanish data. The extracted data set presents a challenging task for sentiment analysis. We have released the data, including the testing and training splits, openly on Zenodo. The data set has been shuffled for copyright reasons.
△ Less
Submitted 5 December, 2022;
originally announced December 2022.
-
Extreme Audio Time Stretching Using Neural Synthesis
Authors:
Leonardo Fierro,
Alec Wright,
Vesa Välimäki,
Matti Hämäläinen
Abstract:
A deep neural network solution for time-scale modification (TSM) focused on large stretching factors is proposed, targeting environmental sounds. Traditional TSM artifacts such as transient smearing, loss of presence, and phasiness are heavily accentuated and cause poor audio quality when the TSM factor is four or larger. The weakness of established TSM methods, often based on a phase vocoder stru…
▽ More
A deep neural network solution for time-scale modification (TSM) focused on large stretching factors is proposed, targeting environmental sounds. Traditional TSM artifacts such as transient smearing, loss of presence, and phasiness are heavily accentuated and cause poor audio quality when the TSM factor is four or larger. The weakness of established TSM methods, often based on a phase vocoder structure, lies in the poor description and scaling of the transient and noise components, or nuances, of a sound. Our novel solution combines a sines-transients-noise decomposition with an independent WaveNet synthesizer to provide a better description of the noise component and an improve sound quality for large stretching factors. Results of a subjective listening test against four other TSM algorithms are reported, showing the proposed method to be often superior. The proposed method is stereo compatible and has a wide range of applications related to the slow motion of media content.
△ Less
Submitted 30 November, 2022;
originally announced November 2022.
-
When to Laugh and How Hard? A Multimodal Approach to Detecting Humor and its Intensity
Authors:
Khalid Alnajjar,
Mika Hämäläinen,
Jörg Tiedemann,
Jorma Laaksonen,
Mikko Kurimo
Abstract:
Prerecorded laughter accompanying dialog in comedy TV shows encourages the audience to laugh by clearly marking humorous moments in the show. We present an approach for automatically detecting humor in the Friends TV show using multimodal data. Our model is capable of recognizing whether an utterance is humorous or not and assess the intensity of it. We use the prerecorded laughter in the show as…
▽ More
Prerecorded laughter accompanying dialog in comedy TV shows encourages the audience to laugh by clearly marking humorous moments in the show. We present an approach for automatically detecting humor in the Friends TV show using multimodal data. Our model is capable of recognizing whether an utterance is humorous or not and assess the intensity of it. We use the prerecorded laughter in the show as annotation as it marks humor and the length of the audience's laughter tells us how funny a given joke is. We evaluate the model on episodes the model has not been exposed to during the training phase. Our results show that the model is capable of correctly detecting whether an utterance is humorous 78% of the time and how long the audience's laughter reaction should last with a mean absolute error of 600 milliseconds.
△ Less
Submitted 3 November, 2022;
originally announced November 2022.
-
Multilingual Persuasion Detection: Video Games as an Invaluable Data Source for NLP
Authors:
Teemu Pöyhönen,
Mika Hämäläinen,
Khalid Alnajjar
Abstract:
Role-playing games (RPGs) have a considerable amount of text in video game dialogues. Quite often this text is semi-annotated by the game developers. In this paper, we extract a multilingual dataset of persuasive dialogue from several RPGs. We show the viability of this data in building a persuasion detection system using a natural language processing (NLP) model called BERT. We believe that video…
▽ More
Role-playing games (RPGs) have a considerable amount of text in video game dialogues. Quite often this text is semi-annotated by the game developers. In this paper, we extract a multilingual dataset of persuasive dialogue from several RPGs. We show the viability of this data in building a persuasion detection system using a natural language processing (NLP) model called BERT. We believe that video games have a lot of unused potential as a datasource for a variety of NLP tasks. The code and data described in this paper are available on Zenodo.
△ Less
Submitted 10 July, 2022;
originally announced July 2022.
-
Harnessing Multilingual Resources to Question Answering in Arabic
Authors:
Khalid Alnajjar,
Mika Hämäläinen
Abstract:
The goal of the paper is to predict answers to questions given a passage of Qur'an. The answers are always found in the passage, so the task of the model is to predict where an answer starts and where it ends. As the initial data set is rather small for training, we make use of multilingual BERT so that we can augment the training data by using data available for languages other than Arabic. Furth…
▽ More
The goal of the paper is to predict answers to questions given a passage of Qur'an. The answers are always found in the passage, so the task of the model is to predict where an answer starts and where it ends. As the initial data set is rather small for training, we make use of multilingual BERT so that we can augment the training data by using data available for languages other than Arabic. Furthermore, we crawl a large Arabic corpus that is domain specific to religious discourse. Our approach consists of two steps, first we train a BERT model to predict a set of possible answers in a passage. Finally, we use another BERT based model to rank the candidate answers produced by the first BERT model.
△ Less
Submitted 16 May, 2022;
originally announced May 2022.
-
Processing M.A. Castrén's Materials: Multilingual Typed and Handwritten Manuscripts
Authors:
Niko Partanen,
Jack Rueter,
Mika Hämäläinen,
Khalid Alnajjar
Abstract:
The study forms a technical report of various tasks that have been performed on the materials collected and published by Finnish ethnographer and linguist, Matthias Alexander Castrén (1813-1852). The Finno-Ugrian Society is publishing Castrén's manuscripts as new critical and digital editions, and at the same time different research groups have also paid attention to these materials. We discuss th…
▽ More
The study forms a technical report of various tasks that have been performed on the materials collected and published by Finnish ethnographer and linguist, Matthias Alexander Castrén (1813-1852). The Finno-Ugrian Society is publishing Castrén's manuscripts as new critical and digital editions, and at the same time different research groups have also paid attention to these materials. We discuss the workflows and technical infrastructure used, and consider how datasets that benefit different computational tasks could be created to further improve the usability of these materials, and also to aid the further processing of similar archived collections. We specifically focus on the parts of the collections that are processed in a way that improves their usability in more technical applications, complementing the earlier work on the cultural and linguistic aspects of these materials. Most of these datasets are openly available in Zenodo. The study points to specific areas where further research is needed, and provides benchmarks for text recognition tasks.
△ Less
Submitted 28 December, 2021;
originally announced December 2021.
-
TFW2V: An Enhanced Document Similarity Method for the Morphologically Rich Finnish Language
Authors:
Quan Duong,
Mika Hämäläinen,
Khalid Alnajjar
Abstract:
Measuring the semantic similarity of different texts has many important applications in Digital Humanities research such as information retrieval, document clustering and text summarization. The performance of different methods depends on the length of the text, the domain and the language. This study focuses on experimenting with some of the current approaches to Finnish, which is a morphological…
▽ More
Measuring the semantic similarity of different texts has many important applications in Digital Humanities research such as information retrieval, document clustering and text summarization. The performance of different methods depends on the length of the text, the domain and the language. This study focuses on experimenting with some of the current approaches to Finnish, which is a morphologically rich language. At the same time, we propose a simple method, TFW2V, which shows high efficiency in handling both long text documents and limited amounts of data. Furthermore, we design an objective evaluation method which can be used as a framework for benchmarking text similarity approaches.
△ Less
Submitted 23 December, 2021;
originally announced December 2021.
-
Detecting Depression in Thai Blog Posts: a Dataset and a Baseline
Authors:
Mika Hämäläinen,
Pattama Patpong,
Khalid Alnajjar,
Niko Partanen,
Jack Rueter
Abstract:
We present the first openly available corpus for detecting depression in Thai. Our corpus is compiled by expert verified cases of depression in several online blogs. We experiment with two different LSTM based models and two different BERT based models. We achieve a 77.53\% accuracy with a Thai BERT model in detecting depression. This establishes a good baseline for future researcher on the same c…
▽ More
We present the first openly available corpus for detecting depression in Thai. Our corpus is compiled by expert verified cases of depression in several online blogs. We experiment with two different LSTM based models and two different BERT based models. We achieve a 77.53\% accuracy with a Thai BERT model in detecting depression. This establishes a good baseline for future researcher on the same corpus. Furthermore, we identify a need for Thai embeddings that have been trained on a more varied corpus than Wikipedia. Our corpus, code and trained models have been released openly on Zenodo.
△ Less
Submitted 8 November, 2021;
originally announced November 2021.
-
Finnish Dialect Identification: The Effect of Audio and Text
Authors:
Mika Hämäläinen,
Khalid Alnajjar,
Niko Partanen,
Jack Rueter
Abstract:
Finnish is a language with multiple dialects that not only differ from each other in terms of accent (pronunciation) but also in terms of morphological forms and lexical choice. We present the first approach to automatically detect the dialect of a speaker based on a dialect transcript and transcript with audio recording in a dataset consisting of 23 different dialects. Our results show that the b…
▽ More
Finnish is a language with multiple dialects that not only differ from each other in terms of accent (pronunciation) but also in terms of morphological forms and lexical choice. We present the first approach to automatically detect the dialect of a speaker based on a dialect transcript and transcript with audio recording in a dataset consisting of 23 different dialects. Our results show that the best accuracy is received by combining both of the modalities, as text only reaches to an overall accuracy of 57\%, where as text and audio reach to 85\%. Our code, models and data have been released openly on Github and Zenodo.
△ Less
Submitted 6 November, 2021;
originally announced November 2021.
-
The Current State of Finnish NLP
Authors:
Mika Hämäläinen,
Khalid Alnajjar
Abstract:
There are a lot of tools and resources available for processing Finnish. In this paper, we survey recent papers focusing on Finnish NLP related to many different subcategories of NLP such as parsing, generation, semantics and speech. NLP research is conducted in many different research groups in Finland, and it is frequently the case that NLP tools and models resulting from academic research are m…
▽ More
There are a lot of tools and resources available for processing Finnish. In this paper, we survey recent papers focusing on Finnish NLP related to many different subcategories of NLP such as parsing, generation, semantics and speech. NLP research is conducted in many different research groups in Finland, and it is frequently the case that NLP tools and models resulting from academic research are made available for others to use on platforms such as Github.
△ Less
Submitted 23 September, 2021;
originally announced September 2021.
-
When a Computer Cracks a Joke: Automated Generation of Humorous Headlines
Authors:
Khalid Alnajjar,
Mika Hämäläinen
Abstract:
Automated news generation has become a major interest for new agencies in the past. Oftentimes headlines for such automatically generated news articles are unimaginative as they have been generated with ready-made templates. We present a computationally creative approach for headline generation that can generate humorous versions of existing headlines. We evaluate our system with human judges and…
▽ More
Automated news generation has become a major interest for new agencies in the past. Oftentimes headlines for such automatically generated news articles are unimaginative as they have been generated with ready-made templates. We present a computationally creative approach for headline generation that can generate humorous versions of existing headlines. We evaluate our system with human judges and compare the results to human authored humorous titles. The headlines produced by the system are considered funny 36\% of the time by human evaluators.
△ Less
Submitted 17 September, 2021;
originally announced September 2021.
-
How Cute is Pikachu? Gathering and Ranking Pokémon Properties from Data with Pokémon Word Embeddings
Authors:
Mika Hämäläinen,
Khalid Alnajjar,
Niko Partanen
Abstract:
We present different methods for obtaining descriptive properties automatically for the 151 original Pokémon. We train several different word embeddings models on a crawled Pokémon corpus, and use them to rank automatically English adjectives based on how characteristic they are to a given Pokémon. Based on our experiments, it is better to train a model with domain specific data than to use a pret…
▽ More
We present different methods for obtaining descriptive properties automatically for the 151 original Pokémon. We train several different word embeddings models on a crawled Pokémon corpus, and use them to rank automatically English adjectives based on how characteristic they are to a given Pokémon. Based on our experiments, it is better to train a model with domain specific data than to use a pretrained model. Word2Vec produces less noise in the results than fastText model. Furthermore, we expand the list of properties for each Pokémon automatically. However, none of the methods is spot on and there is a considerable amount of noise in the different semantic models. Our models have been released on Zenodo.
△ Less
Submitted 21 August, 2021;
originally announced August 2021.
-
Human Evaluation of Creative NLG Systems: An Interdisciplinary Survey on Recent Papers
Authors:
Mika Hämäläinen,
Khalid Alnajjar
Abstract:
We survey human evaluation in papers presenting work on creative natural language generation that have been published in INLG 2020 and ICCC 2020. The most typical human evaluation method is a scaled survey, typically on a 5 point scale, while many other less common methods exist. The most commonly evaluated parameters are meaning, syntactic correctness, novelty, relevance and emotional value, amon…
▽ More
We survey human evaluation in papers presenting work on creative natural language generation that have been published in INLG 2020 and ICCC 2020. The most typical human evaluation method is a scaled survey, typically on a 5 point scale, while many other less common methods exist. The most commonly evaluated parameters are meaning, syntactic correctness, novelty, relevance and emotional value, among many others. Our guidelines for future evaluation include clearly defining the goal of the generative system, asking questions as concrete as possible, testing the evaluation setup, using multiple different evaluation setups, reporting the entire evaluation process and potential biases clearly, and finally analyzing the evaluation results in a more profound way than merely reporting the most typical statistics.
△ Less
Submitted 31 July, 2021;
originally announced August 2021.
-
Lemmatization of Historical Old Literary Finnish Texts in Modern Orthography
Authors:
Mika Hämäläinen,
Niko Partanen,
Khalid Alnajjar
Abstract:
Texts written in Old Literary Finnish represent the first literary work ever written in Finnish starting from the 16th century. There have been several projects in Finland that have digitized old publications and made them available for research use. However, using modern NLP methods in such data poses great challenges. In this paper we propose an approach for simultaneously normalizing and lemmat…
▽ More
Texts written in Old Literary Finnish represent the first literary work ever written in Finnish starting from the 16th century. There have been several projects in Finland that have digitized old publications and made them available for research use. However, using modern NLP methods in such data poses great challenges. In this paper we propose an approach for simultaneously normalizing and lemmatizing Old Literary Finnish into modern spelling. Our best model reaches to 96.3\% accuracy in texts written by Agricola and 87.7\% accuracy in other contemporary out-of-domain text. Our method has been made freely available on Zenodo and Github.
△ Less
Submitted 7 July, 2021;
originally announced July 2021.
-
Apurinã Universal Dependencies Treebank
Authors:
Jack Rueter,
Marília Fernanda Pereira de Freitas,
Sidney da Silva Facundes,
Mika Hämäläinen,
Niko Partanen
Abstract:
This paper presents and discusses the first Universal Dependencies treebank for the Apurinã language. The treebank contains 76 fully annotated sentences, applies 14 parts-of-speech, as well as seven augmented or new features - some of which are unique to Apurinã. The construction of the treebank has also served as an opportunity to develop finite-state description of the language and facilitate th…
▽ More
This paper presents and discusses the first Universal Dependencies treebank for the Apurinã language. The treebank contains 76 fully annotated sentences, applies 14 parts-of-speech, as well as seven augmented or new features - some of which are unique to Apurinã. The construction of the treebank has also served as an opportunity to develop finite-state description of the language and facilitate the transfer of open-source infrastructure possibilities to an endangered language of the Amazon. The source materials used in the initial treebank represent fieldwork practices where not all tokens of all sentences are equally annotated. For this reason, establishing regular annotation practices for the entire Apurinã treebank is an ongoing project.
△ Less
Submitted 7 June, 2021;
originally announced June 2021.
-
Never guess what I heard... Rumor Detection in Finnish News: a Dataset and a Baseline
Authors:
Mika Hämäläinen,
Khalid Alnajjar,
Niko Partanen,
Jack Rueter
Abstract:
This study presents a new dataset on rumor detection in Finnish language news headlines. We have evaluated two different LSTM based models and two different BERT models, and have found very significant differences in the results. A fine-tuned FinBERT reaches the best overall accuracy of 94.3% and rumor label accuracy of 96.0% of the time. However, a model fine-tuned on Multilingual BERT reaches th…
▽ More
This study presents a new dataset on rumor detection in Finnish language news headlines. We have evaluated two different LSTM based models and two different BERT models, and have found very significant differences in the results. A fine-tuned FinBERT reaches the best overall accuracy of 94.3% and rumor label accuracy of 96.0% of the time. However, a model fine-tuned on Multilingual BERT reaches the best factual label accuracy of 97.2%. Our results suggest that the performance difference is due to a difference in the original training data. Furthermore, we find that a regular LSTM model works better than one trained with a pretrained word2vec model. These findings suggest that more work needs to be done for pretrained models in Finnish language as they have been trained on small and biased corpora.
△ Less
Submitted 7 June, 2021;
originally announced June 2021.
-
Neural Morphology Dataset and Models for Multiple Languages, from the Large to the Endangered
Authors:
Mika Hämäläinen,
Niko Partanen,
Jack Rueter,
Khalid Alnajjar
Abstract:
We train neural models for morphological analysis, generation and lemmatization for morphologically rich languages. We present a method for automatically extracting substantially large amount of training data from FSTs for 22 languages, out of which 17 are endangered. The neural models follow the same tagset as the FSTs in order to make it possible to use them as fallback systems together with the…
▽ More
We train neural models for morphological analysis, generation and lemmatization for morphologically rich languages. We present a method for automatically extracting substantially large amount of training data from FSTs for 22 languages, out of which 17 are endangered. The neural models follow the same tagset as the FSTs in order to make it possible to use them as fallback systems together with the FSTs. The source code, models and datasets have been released on Zenodo.
△ Less
Submitted 26 May, 2021;
originally announced May 2021.
-
!Qué maravilla! Multimodal Sarcasm Detection in Spanish: a Dataset and a Baseline
Authors:
Khalid Alnajjar,
Mika Hämäläinen
Abstract:
We construct the first ever multimodal sarcasm dataset for Spanish. The audiovisual dataset consists of sarcasm annotated text that is aligned with video and audio. The dataset represents two varieties of Spanish, a Latin American variety and a Peninsular Spanish variety, which ensures a wider dialectal coverage for this global language. We present several models for sarcasm detection that will se…
▽ More
We construct the first ever multimodal sarcasm dataset for Spanish. The audiovisual dataset consists of sarcasm annotated text that is aligned with video and audio. The dataset represents two varieties of Spanish, a Latin American variety and a Peninsular Spanish variety, which ensures a wider dialectal coverage for this global language. We present several models for sarcasm detection that will serve as baselines in the future research. Our results show that results with text only (89%) are worse than when combining text with audio (91.9%). Finally, the best results are obtained when combining all the modalities: text, audio and video (93.1%).
△ Less
Submitted 12 May, 2021;
originally announced May 2021.
-
The Great Misalignment Problem in Human Evaluation of NLP Methods
Authors:
Mika Hämäläinen,
Khalid Alnajjar
Abstract:
We outline the Great Misalignment Problem in natural language processing research, this means simply that the problem definition is not in line with the method proposed and the human evaluation is not in line with the definition nor the method. We study this misalignment problem by surveying 10 randomly sampled papers published in ACL 2020 that report results with human evaluation. Our results sho…
▽ More
We outline the Great Misalignment Problem in natural language processing research, this means simply that the problem definition is not in line with the method proposed and the human evaluation is not in line with the definition nor the method. We study this misalignment problem by surveying 10 randomly sampled papers published in ACL 2020 that report results with human evaluation. Our results show that only one paper was fully in line in terms of problem definition, method and evaluation. Only two papers presented a human evaluation that was in line with what was modeled in the method. These results highlight that the Great Misalignment Problem is a major one and it affects the validity and reproducibility of results obtained by a human evaluation.
△ Less
Submitted 12 April, 2021;
originally announced April 2021.
-
From Plenipotentiary to Puddingless: Users and Uses of New Words in Early English Letters
Authors:
Tanja Säily,
Eetu Mäkelä,
Mika Hämäläinen
Abstract:
We study neologism use in two samples of early English correspondence, from 1640--1660 and 1760--1780. Of especial interest are the early adopters of new vocabulary, the social groups they represent, and the types and functions of their neologisms. We describe our computer-assisted approach and note the difficulties associated with massive variation in the corpus. Our findings include that while m…
▽ More
We study neologism use in two samples of early English correspondence, from 1640--1660 and 1760--1780. Of especial interest are the early adopters of new vocabulary, the social groups they represent, and the types and functions of their neologisms. We describe our computer-assisted approach and note the difficulties associated with massive variation in the corpus. Our findings include that while male letter-writers tend to use neologisms more frequently than women, the eighteenth century seems to have provided more opportunities for women and the lower ranks to participate in neologism use as well. In both samples, neologisms most frequently occur in letters written between close friends, which could be due to this less stable relationship triggering more creative language use. In the seventeenth-century sample, we observe the influence of the English Civil War, while the eighteenth-century sample appears to reflect the changing functions of letter-writing, as correspondence is increasingly being used as a tool for building and maintaining social relationships in addition to exchanging information.
△ Less
Submitted 17 March, 2021;
originally announced March 2021.
-
Endangered Languages are not Low-Resourced!
Authors:
Mika Hämäläinen
Abstract:
The term low-resourced has been tossed around in the field of natural language processing to a degree that almost any language that is not English can be called "low-resourced"; sometimes even just for the sake of making a mundane or mediocre paper appear more interesting and insightful. In a field where English is a synonym for language and low-resourced is a synonym for anything not English, cal…
▽ More
The term low-resourced has been tossed around in the field of natural language processing to a degree that almost any language that is not English can be called "low-resourced"; sometimes even just for the sake of making a mundane or mediocre paper appear more interesting and insightful. In a field where English is a synonym for language and low-resourced is a synonym for anything not English, calling endangered languages low-resourced is a bit of an overstatement. In this paper, I inspect the relation of the endangered with the low-resourced from my own experiences.
△ Less
Submitted 17 March, 2021;
originally announced March 2021.
-
Speech Recognition for Endangered and Extinct Samoyedic languages
Authors:
Niko Partanen,
Mika Hämäläinen,
Tiina Klooster
Abstract:
Our study presents a series of experiments on speech recognition with endangered and extinct Samoyedic languages, spoken in Northern and Southern Siberia. To best of our knowledge, this is the first time a functional ASR system is built for an extinct language. We achieve with Kamas language a Label Error Rate of 15\%, and conclude through careful error analysis that this quality is already very u…
▽ More
Our study presents a series of experiments on speech recognition with endangered and extinct Samoyedic languages, spoken in Northern and Southern Siberia. To best of our knowledge, this is the first time a functional ASR system is built for an extinct language. We achieve with Kamas language a Label Error Rate of 15\%, and conclude through careful error analysis that this quality is already very useful as a starting point for refined human transcriptions. Our results with related Nganasan language are more modest, with best model having the error rate of 33\%. We show, however, through experiments where Kamas training data is enlarged incrementally, that Nganasan results are in line with what is expected under low-resource circumstances of the language. Based on this, we provide recommendations for scenarios in which further language documentation or archive processing activities could benefit from modern ASR technology. All training data and processing scripts haven been published on Zenodo with clear licences to ensure further work in this important topic.
△ Less
Submitted 9 December, 2020;
originally announced December 2020.
-
Normalization of Different Swedish Dialects Spoken in Finland
Authors:
Mika Hämäläinen,
Niko Partanen,
Khalid Alnajjar
Abstract:
Our study presents a dialect normalization method for different Finland Swedish dialects covering six regions. We tested 5 different models, and the best model improved the word error rate from 76.45 to 28.58. Contrary to results reported in earlier research on Finnish dialects, we found that training the model with one word at a time gave best results. We believe this is due to the size of the tr…
▽ More
Our study presents a dialect normalization method for different Finland Swedish dialects covering six regions. We tested 5 different models, and the best model improved the word error rate from 76.45 to 28.58. Contrary to results reported in earlier research on Finnish dialects, we found that training the model with one word at a time gave best results. We believe this is due to the size of the training data available for the model. Our models are accessible as a Python package. The study provides important information about the adaptability of these methods in different contexts, and gives important baselines for further study.
△ Less
Submitted 9 December, 2020;
originally announced December 2020.
-
Ve'rdd. Narrowing the Gap between Paper Dictionaries, Low-Resource NLP and Community Involvement
Authors:
Khalid Alnajjar,
Mika Hämäläinen,
Jack Rueter,
Niko Partanen
Abstract:
We present an open-source online dictionary editing system, Ve'rdd, that offers a chance to re-evaluate and edit grassroots dictionaries that have been exposed to multiple amateur editors. The idea is to incorporate community activities into a state-of-the-art finite-state language description of a seriously endangered minority language, Skolt Sami. Problems involve getting the community to take p…
▽ More
We present an open-source online dictionary editing system, Ve'rdd, that offers a chance to re-evaluate and edit grassroots dictionaries that have been exposed to multiple amateur editors. The idea is to incorporate community activities into a state-of-the-art finite-state language description of a seriously endangered minority language, Skolt Sami. Problems involve getting the community to take part in things above the pencil-and-paper level. At times, it seems that the native speakers and the dictionary oriented are lacking technical understanding to utilize the infrastructures which might make their work more meaningful in the future, i.e. multiple reuse of all of their input. Therefore, our system integrates with the existing tools and infrastructures for Uralic language masking the technical complexities behind a user-friendly UI.
△ Less
Submitted 4 December, 2020;
originally announced December 2020.
-
An Unsupervised method for OCR Post-Correction and Spelling Normalisation for Finnish
Authors:
Quan Duong,
Mika Hämäläinen,
Simon Hengchen
Abstract:
Historical corpora are known to contain errors introduced by OCR (optical character recognition) methods used in the digitization process, often said to be degrading the performance of NLP systems. Correcting these errors manually is a time-consuming process and a great part of the automatic approaches have been relying on rules or supervised machine learning. We build on previous work on fully au…
▽ More
Historical corpora are known to contain errors introduced by OCR (optical character recognition) methods used in the digitization process, often said to be degrading the performance of NLP systems. Correcting these errors manually is a time-consuming process and a great part of the automatic approaches have been relying on rules or supervised machine learning. We build on previous work on fully automatic unsupervised extraction of parallel data to train a character-based sequence-to-sequence NMT (neural machine translation) model to conduct OCR error correction designed for English, and adapt it to Finnish by proposing solutions that take the rich morphology of the language into account. Our new method shows increased performance while remaining fully unsupervised, with the added benefit of spelling normalisation. The source code and models are available on GitHub and Zenodo.
△ Less
Submitted 6 November, 2020;
originally announced November 2020.
-
Automated Prediction of Medieval Arabic Diacritics
Authors:
Khalid Alnajjar,
Mika Hämäläinen,
Niko Partanen,
Jack Rueter
Abstract:
This study uses a character level neural machine translation approach trained on a long short-term memory-based bi-directional recurrent neural network architecture for diacritization of Medieval Arabic. The results improve from the online tool used as a baseline. A diacritization model have been published openly through an easy to use Python package available on PyPi and Zenodo. We have found tha…
▽ More
This study uses a character level neural machine translation approach trained on a long short-term memory-based bi-directional recurrent neural network architecture for diacritization of Medieval Arabic. The results improve from the online tool used as a baseline. A diacritization model have been published openly through an easy to use Python package available on PyPi and Zenodo. We have found that context size should be considered when optimizing a feasible prediction model.
△ Less
Submitted 11 October, 2020;
originally announced October 2020.
-
Automatic Dialect Adaptation in Finnish and its Effect on Perceived Creativity
Authors:
Mika Hämäläinen,
Niko Partanen,
Khalid Alnajjar,
Jack Rueter,
Thierry Poibeau
Abstract:
We present a novel approach for adapting text written in standard Finnish to different dialects. We experiment with character level NMT models both by using a multi-dialectal and transfer learning approaches. The models are tested with over 20 different dialects. The results seem to favor transfer learning, although not strongly over the multi-dialectal approach. We study the influence dialectal a…
▽ More
We present a novel approach for adapting text written in standard Finnish to different dialects. We experiment with character level NMT models both by using a multi-dialectal and transfer learning approaches. The models are tested with over 20 different dialects. The results seem to favor transfer learning, although not strongly over the multi-dialectal approach. We study the influence dialectal adaptation has on perceived creativity of computer generated poetry. Our results suggest that the more the dialect deviates from the standard Finnish, the lower scores people tend to give on an existing evaluation metric. However, on a word association test, people associate creativity and originality more with dialect and fluency more with standard Finnish.
△ Less
Submitted 6 September, 2020;
originally announced September 2020.
-
Morphological Disambiguation of South Sámi with FSTs and Neural Networks
Authors:
Mika Hämäläinen,
Linda Wiechetek
Abstract:
We present a method for conducting morphological disambiguation for South Sámi, which is an endangered language. Our method uses an FST-based morphological analyzer to produce an ambiguous set of morphological readings for each word in a sentence. These readings are disambiguated with a Bi-RNN model trained on the related North Sámi UD Treebank and some synthetically generated South Sámi data. The…
▽ More
We present a method for conducting morphological disambiguation for South Sámi, which is an endangered language. Our method uses an FST-based morphological analyzer to produce an ambiguous set of morphological readings for each word in a sentence. These readings are disambiguated with a Bi-RNN model trained on the related North Sámi UD Treebank and some synthetically generated South Sámi data. The disambiguation is done on the level of morphological tags ignoring word forms and lemmas; this makes it possible to use North Sámi training data for South Sámi without the need for a bilingual dictionary or aligned word embeddings. Our approach requires only minimal resources for South Sámi, which makes it usable and applicable in the contexts of any other endangered language as well.
△ Less
Submitted 29 April, 2020;
originally announced April 2020.
-
FST Morphology for the Endangered Skolt Sami Language
Authors:
Jack Rueter,
Mika Hämäläinen
Abstract:
We present advances in the development of a FST-based morphological analyzer and generator for Skolt Sami. Like other minority Uralic languages, Skolt Sami exhibits a rich morphology, on the one hand, and there is little golden standard material for it, on the other. This makes NLP approaches for its study difficult without a solid morphological analysis. The language is severely endangered and th…
▽ More
We present advances in the development of a FST-based morphological analyzer and generator for Skolt Sami. Like other minority Uralic languages, Skolt Sami exhibits a rich morphology, on the one hand, and there is little golden standard material for it, on the other. This makes NLP approaches for its study difficult without a solid morphological analysis. The language is severely endangered and the work presented in this paper forms a part of a greater whole in its revitalization efforts. Furthermore, we intersperse our description with facilitation and description practices not well documented in the infrastructure. Currently, the analyzer covers over 30,000 Skolt Sami words in 148 inflectional paradigms and over 12 derivational forms.
△ Less
Submitted 9 April, 2020;
originally announced April 2020.
-
Let's FACE it. Finnish Poetry Generation with Aesthetics and Framing
Authors:
Mika Hämäläinen,
Khalid Alnajjar
Abstract:
We present a creative poem generator for the morphologically rich Finnish language. Our method falls into the master-apprentice paradigm, where a computationally creative genetic algorithm teaches a BRNN model to generate poetry. We model several parts of poetic aesthetics in the fitness function of the genetic algorithm, such as sonic features, semantic coherence, imagery and metaphor. Furthermor…
▽ More
We present a creative poem generator for the morphologically rich Finnish language. Our method falls into the master-apprentice paradigm, where a computationally creative genetic algorithm teaches a BRNN model to generate poetry. We model several parts of poetic aesthetics in the fitness function of the genetic algorithm, such as sonic features, semantic coherence, imagery and metaphor. Furthermore, we justify the creativity of our method based on the FACE theory on computational creativity and take additional care in evaluating our system by automatic metrics for concepts together with human evaluation for aesthetics, framing and expressions.
△ Less
Submitted 30 October, 2019;
originally announced October 2019.
-
From the Paft to the Fiiture: a Fully Automatic NMT and Word Embeddings Method for OCR Post-Correction
Authors:
Mika Hämäläinen,
Simon Hengchen
Abstract:
A great deal of historical corpora suffer from errors introduced by the OCR (optical character recognition) methods used in the digitization process. Correcting these errors manually is a time-consuming process and a great part of the automatic approaches have been relying on rules or supervised machine learning. We present a fully automatic unsupervised way of extracting parallel data for trainin…
▽ More
A great deal of historical corpora suffer from errors introduced by the OCR (optical character recognition) methods used in the digitization process. Correcting these errors manually is a time-consuming process and a great part of the automatic approaches have been relying on rules or supervised machine learning. We present a fully automatic unsupervised way of extracting parallel data for training a character-based sequence-to-sequence NMT (neural machine translation) model to conduct OCR error correction.
△ Less
Submitted 12 October, 2019;
originally announced October 2019.
-
Contextual Minimum-Norm Estimates (CMNE): A Deep Learning Method for Source Estimation in Neuronal Networks
Authors:
Christoph Dinh,
John GW Samuelsson,
Alexander Hunold,
Matti S Hämäläinen,
Sheraz Khan
Abstract:
Magnetoencephalography (MEG) and Electroencephalography (EEG) source estimates have thus far mostly been derived sample by sample, i.e., independent of each other in time. However, neuronal assemblies are heavily interconnected, constraining the temporal evolution of neural activity in space as detected by MEG and EEG. The observed neural currents are thus highly context dependent. Here, a new met…
▽ More
Magnetoencephalography (MEG) and Electroencephalography (EEG) source estimates have thus far mostly been derived sample by sample, i.e., independent of each other in time. However, neuronal assemblies are heavily interconnected, constraining the temporal evolution of neural activity in space as detected by MEG and EEG. The observed neural currents are thus highly context dependent. Here, a new method is presented which integrates predictive deep learning networks with the Minimum-Norm Estimates (MNE) approach. Specifically, we employ Long Short-Term Memory (LSTM) networks, a type of recurrent neural network, for predicting brain activity. Because we use past activity (context) in the estimation, we call our method Contextual MNE (CMNE). We demonstrate that these contextual algorithms can be used for predicting activity based on previous brain states and when used in conjunction with MNE, they lead to more accurate source estimation. To evaluate the performance of CMNE, it was tested on simulated and experimental data from human auditory evoked response experiments.
△ Less
Submitted 5 September, 2019;
originally announced September 2019.
-
Modelling the Socialization of Creative Agents in a Master-Apprentice Setting: The Case of Movie Title Puns
Authors:
Mika Hämäläinen,
Khalid Alnajjar
Abstract:
This paper presents work on modelling the social psychological aspect of socialization in the case of a computationally creative master-apprentice system. In each master-apprentice pair, the master, a genetic algorithm, is seen as a parent for its apprentice, which is an NMT based sequence-to-sequence model. The effect of different parenting styles on the creative output of each pair is in the foc…
▽ More
This paper presents work on modelling the social psychological aspect of socialization in the case of a computationally creative master-apprentice system. In each master-apprentice pair, the master, a genetic algorithm, is seen as a parent for its apprentice, which is an NMT based sequence-to-sequence model. The effect of different parenting styles on the creative output of each pair is in the focus of this study. This approach brings a novel view point to computational social creativity, which has mainly focused in the past on computationally creative agents being on a socially equal level, whereas our approach studies the phenomenon in the context of a social hierarchy.
△ Less
Submitted 10 July, 2019;
originally announced July 2019.
-
Maturation Trajectories of Cortical Resting-State Networks Depend on the Mediating Frequency Band
Authors:
Sheraz Khan,
Javeria Hashmi,
Fahimeh Mamashli,
Konstantinos Michmizos,
Manfred Kitzbichler,
Hari Bharadwaj,
Yousra Bekhti,
Santosh Ganesan,
Keri A Garel,
Susan Whitfield-Gabrieli,
Randy Gollub,
Jian Kong,
Lucia M Vaina,
Kunjan Rana,
Steven Stufflebeam,
Matti Hamalainen,
Tal Kenet
Abstract:
The functional significance of resting state networks and their abnormal manifestations in psychiatric disorders are firmly established, as is the importance of the cortical rhythms in mediating these networks. Resting state networks are known to undergo substantial reorganization from childhood to adulthood, but whether distinct cortical rhythms, which are generated by separable neural mechanisms…
▽ More
The functional significance of resting state networks and their abnormal manifestations in psychiatric disorders are firmly established, as is the importance of the cortical rhythms in mediating these networks. Resting state networks are known to undergo substantial reorganization from childhood to adulthood, but whether distinct cortical rhythms, which are generated by separable neural mechanisms and are often manifested abnormally in psychiatric conditions, mediate maturation differentially, remains unknown. Using magnetoencephalography (MEG) to map frequency band specific maturation of resting state networks from age 7 to 29 in 162 participants (31 independent), we found significant changes with age in networks mediated by the beta (13-30Hz) and gamma (31-80Hz) bands. More specifically, gamma band mediated networks followed an expected asymptotic trajectory, but beta band mediated networks followed a linear trajectory. Network integration increased with age in gamma band mediated networks, while local segregation increased with age in beta band mediated networks. Spatially, the hubs that changed in importance with age in the beta band mediated networks had relatively little overlap with those that showed the greatest changes in the gamma band mediated networks. These findings are relevant for our understanding of the neural mechanisms of cortical maturation, in both typical and atypical development.
△ Less
Submitted 12 February, 2018;
originally announced March 2018.