Thoughts and Theory

A data science introduction to econometrics with the Python library DoWhy, including a detailed code walkthrough of a case-study causality paper

Satellite image of the West Bank, Palestine from “Hard traveling: unemployment and road infrastructure in the shadow of political conflict” (Abrahams, 2021)

Data scientists tend to focus on descriptive and predictive analysis while neglecting causal analysis. Decision making, however, requires causal analysis, a fact well recognized by public health epidemiologists during this Covid-19 pandemic. Due to my background in biology, I had internalized the adage “correlation does not equal causation” to such an extent that I studiously avoided all causal claims. Fortunately, my insatiable curiosity led me to the field of econometrics, which embraces causality and sets down a body of rigorous mathematics to facilitate causal analysis.

Recently, my interest in econometrics has been fueled by my regionally-focused consulting work…


Getting Started

A linguistic approach to understanding the impact and risks of attempting to understand natural language with artificial languages.

Photo by Raphael Schaller on Unsplash

Why study language?

The philosopher Ludwig Wittgenstein said of language: “The limits of my language mean the limits of my world.” Artificial intelligence (AI) is created with artificial (programming) languages; at its core, therefore, machine learning can be reduced to binary code. Conversely, natural language evolved through organic usage by humans and has an indelibly human quality. Studying language can offer a glimpse into the limitations of an artificial language in mimicking the world of natural language.

Recently, I shared a thought piece discussing linguistic theory in relation to AI and natural language processing (NLP). Specifically, I attempted…


An introduction to linguistic relativity and universality with respect to the development of AI language models.

Photo by Franki Chamaki on Unsplash

Recently, while working on an article comparing multi-language models to an Arabic-specific language model, I questioned why the multilingual models I had tested tended to perform poorly compared to the unilingual model. The comparison between multi-language and language-specific tools reminded me of a recent science fiction read: Snow Crash by Neal Stephenson. In a particularly memorable conversation, the hacker hero named “Hiro Protagonist” discusses the conflicting theories of linguistic relativity and linguistic universality with an AI (artificial intelligence), all in an effort to discover the origins of “neurolinguistic hacking.”

I encountered linguistic relativity over a decade ago, in the form…


Deep learning NLP tutorial on analyzing collections of documents with Extractive Text Summarization, utilizing Transformer-based sentence embeddings derived from SOTA language models

Photo by Brett Jordan on Unsplash

Natural language processing (NLP) is a diverse field; the approaches and techniques are as varied as the textual samples available for analysis (e.g. blogs, tweets, reviews, policy documents, news articles, journal publications, etc.). Choosing a good approach requires an understanding of the questions being asked of the data, and the suitability of the available data. This tutorial, which includes a code walkthrough, aims to highlight how sentence embeddings can be leveraged to derive useful information from text data as an important part of exploratory data analysis (EDA).

Case study: CoronaNet Research Project

I make use of the CoronaNet Research Project to conduct a…


A practical exploration of the Natural Language Processing technique of Latent Dirichlet Allocation and its application to the task of topic modeling.

The LDA model graphically represented with plate notation. Image by Author.

Topic modeling is a form of unsupervised machine learning that allows for efficient processing of large collections of data, while preserving the statistical relationships that are useful for tasks such as classification or summarization. The goal of topic modeling is to uncover the latent variables that govern the semantics of a document; these latent variables represent abstract topics. Currently, the most popular technique for topic modeling is Latent Dirichlet Allocation (LDA), and this model can be used effectively on a variety of document types such as collections of news articles, policy documents, social media posts or tweets.

This article will necessarily…


Hands-on Tutorials

Arabic NLP tutorial on creating Arabic Sentence Embeddings with Multi-Task Learning for fast and efficient Semantic Textual Similarity tasks.

Photo by TOMOKO UJI on Unsplash

In the first article of this Arabic natural language processing (NLP) series, I introduced a transformer language model named AraBERT (Arabic Bidirectional Encoder Representations from Transformers) released by Antoun et al. (2020), which performs exceptionally well on a variety of Arabic NLP benchmarks. As is typical of state-of-the-art language models, AraBERT is quite large: the base model has 110 million parameters, and the large model has 340 million parameters. When one considers the size of these language models, it becomes evident that an accessibility gap exists between the pragmatic researcher and the usage of state-of-the-art NLP tools.

As determined by…


Data visualization tutorial on animating time-dynamic behaviour in social network graphs.

Gephi visualization of retweeting behaviour of Twitter influencers over time. Image by Author.

When it comes to analyzing social networks, my previous articles have primarily been about natural language processing (NLP), or more specifically Arabic NLP. Tweets, however, are more than just text data; they represent network connections between Twitter users. Adding network analysis allows for a synthesis between the content and actions of social media data; combining network and text data therefore creates a far more nuanced understanding of a social media network.

My Python-learning journey began out of necessity: my goal was to animate a Twitter network graph, and coding appeared to be the solution. Hence, my first-ever script was…


Tutorial with code on combining big data with Arabic natural language processing, using Apache Spark and Spark NLP for distributed computing

Photo by Jeremy Thomas on Unsplash

It can be difficult to conceptualize how large “big data” actually is and what it means for data scientists seeking to leverage the law of large numbers for social research. At the beginning of the dystopic 2020, the World Economic Forum estimated that the amount of data in the world was 44 zettabytes (one zettabyte has 21 zeros); this number is about 40 times larger than the number of stars in the observable universe. On a global scale, by the time we reach 2025, an estimated 463 exabytes will be created daily (one exabyte has 18 zeros). …


A discussion of Arabic natural language processing (NLP) for social media text, with code examples and in-depth analysis of the cutting-edge technology driving the most recent advancements.

Multi-lingual word cloud from tweets about the Beirut explosion (August 2020). Image by Author.

Natural language processing (NLP) is not a new discipline; its roots date back to the 1600s when philosophers such as Descartes and Leibniz proposed theoretical codes for language. In the past decade, the results of this long history have led to the integration of NLP into our own homes, in the form of digital assistants like Siri and Alexa. Although machine learning has remarkably accelerated the improvement of English NLP techniques, the study of NLP for other languages has always lagged behind.

Why study Arabic social media?

As the official language of 22 countries spread across the Middle East and North Africa (MENA) region, Arabic is the…


A tutorial on labeling QAnon tweets into topic categories using the revolutionary, unsupervised machine learning technique of zero-shot text classification

Photo by Jon Tyson on Unsplash

Recently Facebook, after a three-year delay, finally banned QAnon, the far-right conspiracy theory, from its platform. Twitter and YouTube have since followed suit. Grown from the roots of white supremacy, what was once a theory has morphed into an aggressive movement seeking to wage information warfare as a means of influencing the upcoming American presidential election in favour of Donald Trump. The FBI has identified QAnon as a domestic terrorism threat, and earlier this year Twitter acted to ban thousands of QAnon-affiliated accounts. However, many pro-QAnon accounts continue to tweet, and more insidiously Trump has a…

Haaya Naushan

Research Consultant and Data Scientist. Enthusiastic about machine learning, social justice, video games and philosophy.
