The philosopher Ludwig Wittgenstein said of language: “The limits of my language mean the limits of my world.” Artificial intelligence (AI) is created with artificial (programming) languages; at its core, machine learning can therefore be reduced to binary code. Natural language, by contrast, evolved through organic usage by humans and has an indelibly human quality. Studying language can offer a glimpse into the limitations of an artificial language in mimicking the world of natural language.
Recently, I shared a thought piece discussing linguistic theory in relation to AI and natural language processing (NLP). Specifically, I attempted…
Recently, while working on an article comparing multi-language models to an Arabic-specific language model, I questioned why the multilingual models I had tested tended to perform poorly compared to the unilingual model. The comparison between multi-language and language-specific tools reminded me of a recent science fiction read: Snow Crash by Neal Stephenson. In a particularly memorable conversation, the hacker hero, aptly named “Hiro Protagonist,” discusses the conflicting theories of linguistic relativity and linguistic universality with an artificial intelligence, all in an effort to discover the origins of “neurolinguistic hacking.”
I encountered linguistic relativity over a decade ago, in the form…
Natural language processing (NLP) is a diverse field; the approaches and techniques are as varied as the textual samples available for analysis (e.g., blogs, tweets, reviews, policy documents, news articles, journal publications). Choosing a good approach requires an understanding of the questions being asked of the data and the suitability of the available data. This tutorial, which includes a code walkthrough, aims to highlight how sentence embeddings can be leveraged to derive useful information from text data as an important part of exploratory data analysis (EDA).
I make use of the CoronaNet Research Project to conduct a…
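To make the idea concrete before the full walkthrough, here is a minimal sketch of sentence embeddings for EDA, assuming the sentence-transformers and scikit-learn packages; the model name and toy sentences are illustrative stand-ins, not the actual CoronaNet data.

```python
# A minimal sketch: embed short policy descriptions, then cluster them
# to surface candidate themes during exploratory data analysis.
# The sentences and model choice below are illustrative assumptions.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

sentences = [
    "Schools are closed until further notice.",
    "A nationwide curfew begins at 8 pm.",
    "Masks are mandatory on public transport.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # a generic pretrained encoder
embeddings = model.encode(sentences)             # one dense vector per sentence

# Group semantically similar sentences; nearby vectors land in the same cluster.
labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(embeddings)
for sentence, label in zip(sentences, labels):
    print(label, sentence)
```

Even on a handful of sentences, inspecting which texts land in the same cluster is a quick way to sanity-check whether the embeddings capture the distinctions you care about.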
Topic modeling is a form of unsupervised machine learning that allows for efficient processing of large collections of data while preserving the statistical relationships that are useful for tasks such as classification or summarization. The goal of topic modeling is to uncover the latent variables that govern the semantics of a document; these latent variables represent abstract topics. Currently, the most popular technique for topic modeling is Latent Dirichlet Allocation (LDA), a model that can be used effectively on a variety of document types, such as collections of news articles, policy documents, social media posts, or tweets.
This article will necessarily…
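As a quick illustration of how LDA works in practice, here is a minimal sketch using gensim; the toy documents and parameter choices are illustrative assumptions, not the pipeline from the article itself.

```python
# A minimal LDA sketch with gensim: each learned topic is a distribution
# over words, and each document a mixture of topics.
# The toy documents below are invented for illustration.
from gensim import corpora
from gensim.models import LdaModel

docs = [
    ["economy", "stimulus", "jobs", "recovery"],
    ["vaccine", "trial", "health", "doses"],
    ["jobs", "economy", "market", "trade"],
]

dictionary = corpora.Dictionary(docs)               # map tokens to integer ids
corpus = [dictionary.doc2bow(doc) for doc in docs]  # bag-of-words counts

# Fit a two-topic model and print the top words per topic.
lda = LdaModel(corpus=corpus, id2word=dictionary,
               num_topics=2, passes=10, random_state=0)
for topic_id, words in lda.print_topics(num_words=4):
    print(topic_id, words)
```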
When it comes to analyzing social networks, my previous articles have primarily been about natural language processing (NLP), or more specifically Arabic NLP. Tweets, however, are more than just text data; they represent network connections between Twitter users. Adding network analysis allows for a synthesis between the content and actions of social media data; combining network and text data therefore creates a far more nuanced understanding of a social media network.
My Python-learning journey began out of necessity: my goal was to animate a Twitter network graph, and coding appeared to be the solution. Hence, my first-ever script was…
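For a flavour of that first step, here is a minimal sketch of building a Twitter interaction graph with networkx; the (retweeter, original author) edge list is hypothetical.

```python
# A minimal sketch: turn retweet interactions into a directed graph.
# The edge list is a made-up set of (retweeter, original_author) pairs.
import networkx as nx

retweets = [
    ("alice", "bob"),
    ("carol", "bob"),
    ("dave", "alice"),
    ("carol", "alice"),
]

G = nx.DiGraph()
G.add_edges_from(retweets)  # directed edge: user -> account they retweeted

# In-degree centrality hints at which accounts anchor the conversation.
for user, score in sorted(nx.in_degree_centrality(G).items(),
                          key=lambda kv: -kv[1]):
    print(f"{user}: {score:.2f}")
```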
It can be difficult to conceptualize how large “big data” actually is and what it means for data scientists seeking to leverage the law of large numbers for social research. At the beginning of the dystopian 2020, the World Economic Forum estimated that the amount of data in the world was 44 zettabytes (a zettabyte is a 1 followed by 21 zeros); this is about 40 times larger than the number of stars in the observable universe. On a global scale, by the time we reach 2025, an estimated 463 exabytes of data will be created daily (an exabyte is a 1 followed by 18 zeros). …
Natural language processing (NLP) is not a new discipline; its roots date back to the 1600s, when philosophers such as Descartes and Leibniz proposed theoretical codes for language. In the past decade, the results of this long history have led to the integration of NLP into our own homes, in the form of digital assistants like Siri and Alexa. Although machine learning has remarkably accelerated the improvement of English NLP techniques, the study of NLP for other languages has long lagged behind.
As the official language of 22 countries spread across the Middle East and North Africa (MENA) region, Arabic is the…
Recently, after a three-year delay, Facebook finally banned QAnon, the far-right conspiracy theory, from its platform; Twitter and YouTube have followed suit. Grown from the roots of white supremacy, what was once a theory has morphed into an aggressive movement seeking to wage information warfare as a means of influencing the upcoming American presidential election in favour of Donald Trump. The FBI has identified QAnon as a domestic terrorism threat, and earlier this year Twitter acted to ban thousands of QAnon-affiliated accounts. However, many pro-QAnon accounts continue to tweet, and more insidiously, Trump has a…
In ancient Rome, public discourse happened at the Forum at the heart of the city. People gathered to exchange ideas and debate topics of social relevance. Today, that public discourse has moved online to the digital forums of sites like Reddit, the microblogging arena of Twitter, and other social media outlets. Perhaps as a researcher you are curious what people’s opinions are about a specific topic, or perhaps as an analyst you wish to study the effect of your company’s recent marketing campaign. Monitoring social media with sentiment analysis is a good way to gauge public opinion. …
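As a small taste of what such monitoring involves, here is a minimal sentiment-analysis sketch using NLTK’s VADER analyzer; the example tweets are invented stand-ins for whatever topic or campaign is being tracked.

```python
# A minimal sketch: score short texts with VADER, a lexicon-based
# sentiment analyzer tuned for social media language.
# The example tweets below are invented for illustration.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # one-time lexicon download

tweets = [
    "Loving the new update, great work!",
    "This rollout has been a complete mess.",
    "The event starts at noon tomorrow.",
]

sia = SentimentIntensityAnalyzer()
for tweet in tweets:
    scores = sia.polarity_scores(tweet)  # compound score ranges from -1 to 1
    print(f"{scores['compound']:+.2f}  {tweet}")
```

Aggregating compound scores over time, or across hashtags, is one simple way to turn individual judgments like these into a gauge of overall opinion.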
Research Consultant and Data Scientist. Enthusiastic about machine learning, social justice, video games and philosophy.