A New and Better Way to Create Word Lists

Word lists are the basis of so much research in so many fields. Researchers at the Complexity Science Hub have now developed an algorithm that can be applied to different languages and can expand word lists significantly better than others.

Many projects start with the creation of word lists, not only in companies, where mind maps are drawn up, but in all areas of research. Imagine you want to find out on which days people are in a particularly good mood by analyzing Twitter postings. Just looking for the word “happy” wouldn’t be enough.

Instead, you would have to use an algorithm that detects all tweets that indicate that someone is happy. “So the first step is to create a list of all the words that indicate just that. The whole research stands or falls on doing so,” explains Anna Di Natale, a researcher at the Complexity Science Hub in Vienna. But how do you come up with the most accurate and complete word lists possible?

A PROBLEM THAT CONCERNS MANY

This widespread problem concerns not only opinion researchers who want to find out how politicians’ statements are received by the public. Companies, too, use sentiment analyses to find out how their products are perceived.

To improve things, Di Natale has developed a new method called LEXpander that outperforms previous algorithms, and does so in two different languages, German and English. She has also developed, for the first time, a way to compare different word list expansion tools with one another.

IMPROVED PERFORMANCE

Compared with four other algorithms for word list expansion (WordNet, Empath 2.0, FastText and GloVe), LEXpander performed significantly better, especially in German. For example, the researchers found that LEXpander guesses 43% of words right when expanding an English word list for positive meaning. By comparison, FastText, a very popular model, is right only 28% of the time.
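To make a figure like "43% of words right" concrete, here is a minimal sketch of one way such a score could be computed. It assumes the score is the share of newly suggested words that also appear in a hand-made reference list; the article does not spell out the exact metric, and the function name `expansion_precision` and the toy word sets are hypothetical.

```python
def expansion_precision(seed, expanded, reference):
    """Share of newly suggested words that appear in the reference list.

    Assumption: "guessing a word right" means the suggested word is part
    of a hand-made reference list. The seed words do not count as guesses.
    """
    suggested = set(expanded) - set(seed)
    if not suggested:
        return 0.0
    return len(suggested & set(reference)) / len(suggested)


# Hypothetical toy data, not taken from the study.
seed = {"happy", "joy"}
expanded = {"happy", "joy", "glad", "cheerful", "table", "bank"}
reference = {"happy", "joy", "glad", "cheerful", "delighted"}

score = expansion_precision(seed, expanded, reference)
print(f"{score:.0%}")  # 2 of the 4 new suggestions are in the reference list
```

Running the same function over the output of several tools, with the same seed and reference lists, puts their scores on a common footing.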

INDEPENDENCE FROM THE LANGUAGE ITSELF

The reason is that the tool works independently of any single language. It is based not on one language but on a so-called colexification network. This established linguistic concept rests on homonymy and polysemy, that is, on single words that have two or more distinct meanings. For example, the ancient Greek word φάρμακον (pharmakon) can mean medicine or poison: two different things, but thematically close. In other cases the meanings suggest no kinship, such as “bank” as a financial institution or as the land alongside a river.

“If you collect them across many languages – and here we analyzed about 19 different languages – you can see connections between them,” Di Natale says. The network is formed when these colexifications occur in several languages across different language families, creating connections.
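The idea of such a network can be illustrated with a small sketch. This is not LEXpander itself: the data, the names `build_network` and `expand`, and the `min_languages` threshold are all hypothetical, and counting languages is a simplification of the article's criterion of several languages across different language families. Two meanings are linked when one word covers both of them, and a link is kept only if that happens in enough languages.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy data: (language, word, meanings that the word expresses).
LEXICON = [
    ("lang_a", "w1", {"MEDICINE", "POISON"}),
    ("lang_b", "w2", {"MEDICINE", "POISON"}),
    ("lang_b", "w3", {"HAPPY", "LUCKY"}),
    ("lang_c", "w4", {"HAPPY", "LUCKY"}),
    ("lang_c", "w5", {"BANK", "RIVERBANK"}),  # colexified in one language only
]


def build_network(lexicon, min_languages=2):
    """Link two meanings if at least `min_languages` languages colexify them."""
    languages_per_pair = defaultdict(set)
    for language, _word, meanings in lexicon:
        for a, b in combinations(sorted(meanings), 2):
            languages_per_pair[(a, b)].add(language)

    network = defaultdict(set)
    for (a, b), languages in languages_per_pair.items():
        if len(languages) >= min_languages:
            network[a].add(b)
            network[b].add(a)
    return network


def expand(seed, network):
    """Add every meaning directly linked to a seed meaning."""
    expanded = set(seed)
    for meaning in seed:
        expanded |= network.get(meaning, set())
    return expanded


network = build_network(LEXICON)
print(sorted(expand({"HAPPY"}, network)))  # ['HAPPY', 'LUCKY']
```

In this toy example the link between the two senses of "bank" is dropped because only one language supports it, while the thematically close pairs survive the threshold.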

This independence from the language itself allows LEXpander to achieve better results in different languages. “There are many methods developed for English. They work very well and quickly and everyone uses them. Trying to apply them to other languages works, but not as well as it might work if you had started developing a method for German or Italian,” Di Natale explains.

COMPARING TOOLS FOR THE FIRST TIME

For many topics, good word lists already exist. But for new topics, such as COVID when it first emerged, new lists have to be created. So far, they have usually been compiled by hand, in brainstorming sessions with colleagues, with several tools used to help. What was missing was a way to compare those tools. Anna Di Natale and her team have now created this possibility and have also developed a new tool that performs better than the others. This can be an important cornerstone for many future research projects in various fields.

13.03.2023

Collective Minds

A new and better way to create word lists

Researchers

Anna Di Natale

David Garcia

LEXpander: applying colexification networks to automated lexicon expansion

Emotional Well-Being

Emotional Well-Being in the Digital Society

David Garcia
