Thoughts and Theory

A data science introduction to econometrics with the Python library DoWhy, including a detailed code walkthrough of a case-study causality paper

Satellite image of the West Bank, Palestine from “Hard traveling: unemployment and road infrastructure in the shadow of political conflict” (Abrahams, 2021)

Data scientists have a tendency to focus on descriptive and predictive analysis while neglecting causal analysis. Decision making, however, requires causal analysis, a fact well recognized by public health epidemiologists during the Covid-19 pandemic. Due to my background in biology, I had internalized the adage “correlation does not equal causation” to such an extent that I studiously avoided all causal claims. Fortunately, my insatiable curiosity led me to the field of econometrics, which embraces causality and sets down a body of rigorous mathematics to facilitate causal analysis.

Recently, my interest in econometrics has been fueled by my regionally-focused consulting work…
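To give a flavour of the kind of causal analysis the article walks through: a common first step is adjusting for a confounder so that a naive treatment–control comparison becomes a causal estimate. Below is a minimal pure-Python sketch of backdoor adjustment by stratification — the sort of identification step that DoWhy automates — using entirely synthetic data (the variable names and numbers are illustrative, not from the paper).

```python
from collections import defaultdict

def adjusted_effect(records):
    """Estimate E[Y | do(T=1)] - E[Y | do(T=0)] by stratifying on a
    discrete confounder Z (backdoor adjustment).

    records: iterable of (z, t, y) tuples with binary treatment t.
    """
    sums = defaultdict(float)    # (z, t) -> sum of outcomes
    counts = defaultdict(int)    # (z, t) -> number of observations
    z_counts = defaultdict(int)  # z -> stratum size
    for z, t, y in records:
        sums[(z, t)] += y
        counts[(z, t)] += 1
        z_counts[z] += 1

    n = len(records)
    effect = 0.0
    for z, nz in z_counts.items():
        mean_treated = sums[(z, 1)] / counts[(z, 1)]
        mean_control = sums[(z, 0)] / counts[(z, 0)]
        # weight each stratum's effect by its share of the population
        effect += (nz / n) * (mean_treated - mean_control)
    return effect

# Synthetic data: true effect of T on Y is 2, but Z confounds the
# naive comparison because treated units cluster in the high-Z stratum.
records = [(0, 0, 0.0)] * 4 + [(0, 1, 2.0)] + [(1, 0, 3.0)] + [(1, 1, 5.0)] * 4
```

On this toy data the naive treated-minus-control difference is 3.8, while the stratified estimate recovers the true effect of 2.0 — precisely the gap between correlation and causation the article is about.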


Getting Started

A linguistic perspective on the impact and risks of attempting to understand natural language with artificial languages.

Photo by Raphael Schaller on Unsplash

Why study language?

The philosopher Ludwig Wittgenstein said of language: “The limits of my language mean the limits of my world.” Artificial intelligence (AI) is created with artificial (programming) languages; at its core, therefore, machine learning can be reduced to binary code. Natural language, conversely, can be defined as having evolved through organic usage by humans, giving it an indelibly human quality. Studying language can offer a glimpse into the limitations of an artificial language in mimicking the world of natural language.

Recently, I shared a thought piece discussing linguistic theory in relation to AI and natural language processing (NLP). Specifically, I attempted…


An introduction to linguistic relativity and universality with respect to the development of AI language models.

Photo by Franki Chamaki on Unsplash

Recently, while working on an article comparing multi-language models to an Arabic-specific language model, I questioned why the multilingual models I had tested tended to perform poorly compared to the unilingual model. The comparison between multi-language and language-specific tools reminded me of a recent science fiction read: Snow Crash by Neal Stephenson. In a particularly memorable conversation, the hacker hero named “Hiro Protagonist” discusses the conflicting theories of linguistic relativity and linguistic universality with an AI (artificial intelligence), all in an effort to discover the origins of “neurolinguistic hacking.”

I encountered linguistic relativity over a decade ago, in the form…


Deep learning NLP tutorial on analyzing collections of documents with Extractive Text Summarization, utilizing Transformer-based sentence embeddings derived from SOTA language models

Photo by Brett Jordan on Unsplash

Natural language processing (NLP) is a diverse field; the approaches and techniques are as varied as the diversity of textual samples available for analysis (e.g. blogs, tweets, reviews, policy documents, news articles, journal publications, etc.). Choosing a good approach requires an understanding of the questions being asked of the data, and the suitability of the available data. This tutorial, which includes a code walkthrough, aims to highlight how sentence embeddings can be leveraged to derive useful information from text data as an important part of exploratory data analysis (EDA).
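The selection step at the heart of embedding-based extractive summarization can be sketched in a few lines: embed each sentence, compute the document centroid, and keep the sentences closest to it. The sketch below substitutes toy bag-of-words vectors for the transformer sentence embeddings the tutorial actually uses, purely to keep it self-contained.

```python
import math
from collections import Counter

def embed(sentence, vocab):
    """Toy bag-of-words vector; the tutorial uses transformer
    sentence embeddings in place of this."""
    counts = Counter(sentence.lower().split())
    return [counts[w] for w in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def extractive_summary(sentences, k=1):
    """Rank sentences by cosine similarity to the document centroid
    and return the top k as the extractive summary."""
    vocab = sorted({w for s in sentences for w in s.lower().split()})
    vecs = [embed(s, vocab) for s in sentences]
    centroid = [sum(col) / len(vecs) for col in zip(*vecs)]
    ranked = sorted(sentences,
                    key=lambda s: cosine(embed(s, vocab), centroid),
                    reverse=True)
    return ranked[:k]
```

Swapping `embed` for a sentence-embedding model gives the transformer-based version; the ranking logic stays the same.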

Case study: CoronaNet Research Project

I make use of the CoronaNet Research Project to conduct a…


A practical exploration of the Natural Language Processing technique of Latent Dirichlet Allocation and its application to the task of topic modeling.

The LDA model graphically represented with plate notation. Image by Author.

Topic modeling is a form of unsupervised machine learning that allows for efficient processing of large collections of data, while preserving the statistical relationships that are useful for tasks such as classification or summarization. The goal of topic modeling is to uncover latent variables that govern the semantics of a document; these latent variables represent abstract topics. Currently, the most popular technique for topic modeling is Latent Dirichlet Allocation (LDA), and this model can be used effectively on a variety of document types such as collections of news articles, policy documents, social media posts or tweets.
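To make the plate notation above concrete, here is a minimal collapsed Gibbs sampler for LDA in pure Python — a pedagogical sketch, not the optimized variational inference that libraries like gensim implement. Each token's topic is resampled from its full conditional, which mixes how prevalent a topic is in the document with how strongly the topic favours that word.

```python
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics, iters=50, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenized documents.
    Returns per-document topic counts (rows sum to document length)."""
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})           # vocabulary size
    ndk = [[0] * n_topics for _ in docs]            # doc-topic counts
    nkw = [defaultdict(int) for _ in range(n_topics)]  # topic-word counts
    nk = [0] * n_topics                             # topic totals
    z = []                                          # topic per token

    # random initialization of topic assignments
    for d, doc in enumerate(docs):
        zd = []
        for w in doc:
            k = rng.randrange(n_topics)
            zd.append(k)
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
        z.append(zd)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # full conditional p(z_i = j | everything else)
                weights = [(ndk[d][j] + alpha) * (nkw[j][w] + beta)
                           / (nk[j] + V * beta) for j in range(n_topics)]
                r = rng.random() * sum(weights)
                for j, wt in enumerate(weights):
                    r -= wt
                    if r <= 0:
                        k = j
                        break
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return ndk
```

Normalizing a row of the returned counts (plus `alpha`) gives that document's topic distribution, the per-document latent variable in the plate diagram.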

This article will necessarily…


Combining data science and econometrics for an introduction to the DeepIV framework, including a full Python code tutorial.

Map of the West Bank, Palestine, showing small peripheral neighbourhoods in red and larger more central neighbourhoods in blue. Image by author.

Historically, both economists and philosophers have been preoccupied with extracting an understanding of cause and effect from empirical evidence. David Hume, an economist and philosopher, is renowned for exploring causality, both as an epistemological puzzle and as a matter of practical concern in applied economics. In an article titled “Causality in Economics and Econometrics”, economics professor Kevin D. Hoover states, “economists inherited from Hume the sense that practical economics was essentially a causal science.” (Hoover, 2006). As a capital “E” Empiricist, Hume was a major influence on the development of causality in economics; his skepticism created a tension between the…


Thoughts and Theory

Introduction to causal machine learning for econometrics, including a Python tutorial on estimating the CATE with a causal forest using EconML

Photo by Lukasz Szmigiel on Unsplash

Equity is not the same principle as equality. Within the social context they both relate to fairness; equality means treating everyone the same regardless of need, while equity means treating people differently depending on their needs. Consider vaccinations: if we based public health policy on equality, perhaps there would be a lottery system to decide who gets vaccinated first, giving everyone an equal chance. In practice, however, vaccinations are prioritized based on equity; those with the greatest risk, frontline healthcare workers and the elderly, are understandably first in line.

Assuming we understand the causal relationship between treatment and outcome, the…
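The CATE — the Conditional Average Treatment Effect — is exactly what makes equity-style targeting possible: it asks how the treatment effect varies with a unit's characteristics. As a toy illustration of the concept (a simple group-wise estimator, not the causal forest algorithm that EconML implements), the sketch below estimates a separate effect per value of a discrete covariate, using invented numbers:

```python
from collections import defaultdict

def cate_by_group(records):
    """Estimate E[Y | T=1, x] - E[Y | T=0, x] for each value of a
    discrete covariate x: a group-wise CATE estimate.

    records: iterable of (x, t, y) tuples with binary treatment t.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for x, t, y in records:
        sums[(x, t)] += y
        counts[(x, t)] += 1
    groups = {x for x, _ in counts}
    return {x: sums[(x, 1)] / counts[(x, 1)] - sums[(x, 0)] / counts[(x, 0)]
            for x in groups}

# Hypothetical data: the treatment helps the high-risk group far more.
effects = cate_by_group([("high", 1, 10.0), ("high", 0, 2.0),
                         ("low", 1, 3.0), ("low", 0, 2.0)])
```

A causal forest generalizes this idea to continuous, high-dimensional covariates by learning the grouping (the tree splits) from the data itself.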


Hands-on Tutorials

Arabic NLP tutorial on creating Arabic Sentence Embeddings with Multi-Task Learning for fast and efficient Semantic Textual Similarity tasks.

Photo by TOMOKO UJI on Unsplash

In the first article of this Arabic natural language processing (NLP) series, I introduced a transformer language model named AraBERT (Arabic Bidirectional Encoder Representations from Transformers) released by Antoun et al. (2020), which performs exceptionally well on a variety of Arabic NLP benchmarks. As is typical of state-of-the-art language models, AraBERT is quite large: the base model has 110 million parameters, and the large model has 340 million parameters. When one considers the size of these language models, it becomes evident that an accessibility gap exists between the pragmatic researcher and the usage of state-of-the-art NLP tools.
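For context on how a sentence embedding is typically derived from a transformer like AraBERT: the model's token-level output vectors are pooled into a single fixed-size vector, most commonly by mean pooling. A minimal sketch of that pooling step, with toy vectors standing in for actual model outputs:

```python
def mean_pool(token_vectors):
    """Mean-pool a list of equal-length token vectors into one
    fixed-size sentence embedding (the pooling layer that
    sentence-embedding models place on top of a transformer)."""
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]
```

Because the result has a fixed dimension regardless of sentence length, pairs of sentences can be compared directly with cosine similarity — the basis of fast Semantic Textual Similarity.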

As determined by…


Data visualization tutorial on animating time-dynamic behaviour in social network graphs.

Gephi visualization of retweeting behaviour of Twitter influencers over time. Image by Author.

When it comes to analyzing social networks, my previous articles have primarily been about natural language processing (NLP), or more specifically Arabic NLP. Tweets, however, are more than just text data; they represent network connections between Twitter users. Adding network analysis allows for a synthesis between the content and actions of social media data; therefore, combining network and text data creates a far more nuanced understanding of a social media network.

My Python-learning journey began out of necessity: my goal was to animate a Twitter network graph, and coding appeared to be the solution. Hence, my first-ever script was…
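Animating a network's time-dynamic behaviour usually comes down to bucketing timestamped edges into consecutive windows and rendering one frame per window (in Gephi, via a time-interval column). A minimal sketch of the bucketing step, with hypothetical field names and made-up retweet edges:

```python
from collections import defaultdict
from datetime import datetime

def time_sliced_edges(retweets, window_days=7):
    """Bucket (source, target, iso_timestamp) retweet edges into
    consecutive windows of window_days; one bucket = one frame."""
    parsed = [(datetime.fromisoformat(ts), src, dst)
              for src, dst, ts in retweets]
    parsed.sort()                      # chronological order
    start = parsed[0][0]
    frames = defaultdict(list)
    for ts, src, dst in parsed:
        idx = (ts - start).days // window_days
        frames[idx].append((src, dst))
    return [frames[i] for i in sorted(frames)]

# Hypothetical retweet edges: who retweeted whom, and when.
frames = time_sliced_edges([("a", "b", "2021-01-01"),
                            ("c", "b", "2021-01-02"),
                            ("a", "c", "2021-01-10")])
```

Each returned frame is an edge list for one time window, ready to be drawn as a graph snapshot and stitched into an animation.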


Tutorial with code, on combining big data with Arabic natural language processing using Apache Spark and Spark NLP for distributed computing

Photo by Jeremy Thomas on Unsplash

It can be difficult to conceptualize how large “big data” actually is and what it means for data scientists seeking to leverage the law of large numbers for social research. At the beginning of the dystopic 2020, the World Economic Forum estimated that the amount of data in the world was 44 zettabytes (one zettabyte has 21 zeros); this number is about 40 times larger than the number of stars in the observable universe. On a global scale, by the time we reach 2025, an estimated 463 exabytes will be created daily (one exabyte has 18 zeros). …

Haaya Naushan

Research Consultant and Data Scientist. Enthusiastic about machine learning, social justice, video games and philosophy.
