If you’re not scraping data, how are you going to remain competitive?


Data is king. There is no way around it. 80% of the value you get from building a machine learning model can come from a simple off-the-shelf solution. With this modern simplicity, how do you differentiate yourself or your company? The answer is the data. Your best model is only as good as the data you provide it. You can get serious gains by finding a unique data source or managing to acquire more data than your competitors.

In this series of blog posts, I’m going to talk about how to use my new favorite tool for web scraping and…

Eight lines of Python to memoize ANY Dynamic Programming problem. Only one if you’re feeling snarky

Don’t you want to be a cool Dynamic Programming Wizard like him? Photo by Nikolay Ivanov from Pexels


The technical interview is a cultural phenomenon that has penetrated to the core of almost every tech company these days. It seems almost impossible to get a software job without having to answer some arcane question from a second-year algorithms and data structures course.

When I first started interviewing I remember having the most trouble with Dynamic Programming problems. You know, how to find a maximum sublist, or figure out if you can make change given a set of coins and an angry customer. Those really useful problems that I totally use all the time in my day to day…
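The "one line" the title alludes to is presumably Python's built-in functools.lru_cache decorator. Here's a minimal sketch on the coin-change problem mentioned above (the function name and denominations are my own stand-ins, not necessarily the article's):

```python
from functools import lru_cache

@lru_cache(maxsize=None)  # the "one line": results are cached by argument
def can_make_change(amount, coins):
    """Return True if `amount` is reachable using denominations in `coins` (a tuple)."""
    if amount == 0:
        return True
    if amount < 0:
        return False
    # Try subtracting each coin; the cache prevents re-solving subproblems
    return any(can_make_change(amount - c, coins) for c in coins)

print(can_make_change(30, (5, 10, 25)))  # True
print(can_make_change(3, (5, 10, 25)))   # False
```

Note that the arguments must be hashable (hence a tuple of coins), since lru_cache keys its cache on them.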

Getting Started

Pipelines can be hard to navigate; here’s some code that works in general.

Photo by Quinten de Graaf on Unsplash


Pipelines are amazing! I use them in basically every data science project I work on. But easily getting feature importance out of one is more difficult than it needs to be. In this tutorial, I’ll walk through how to access individual feature names and their coefficients from a Pipeline. After that, I’ll show a generalized solution for getting feature importance for just about any pipeline.


Let’s start with a super simple pipeline that applies a single featurization step followed by a classifier.

from datasets import list_datasets, load_dataset, list_metrics  # HuggingFace Datasets
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import svm

As an NLP engineer, I’m going to be out of a job soon 😅

Photo by C. Cagnin from Pexels


I remember when I built my first seq2seq translation system back in 2015. It was a ton of work, from processing the data to designing and implementing the model architecture. All that was to translate one language into one other language. Now the models are so much better, and the tooling around them is leagues better as well. HuggingFace recently incorporated over 1,000 translation models from the University of Helsinki into their transformer model zoo, and they are good. …

How to train a chatbot to sound like anyone with a phone, including your deceased relatives.

Photo by Manea Catalin from Pexels


I grew up reading science fiction where people would try and embed their consciousness into machines. I always found these stories fascinating. What does it mean to be conscious? If I put a perfect copy of myself into a machine, which one is me? If biological me dies but the mechanical copy survives, did I die? I still love stories like this and have been devouring Greg Egan lately; I’d highly recommend his book Diaspora if you think these are interesting questions (it’s only $3).

But I digress. With today’s technology, it’s possible to make a rough approximation of a…

Label Smarter Not More.

Photo by RUN 4 FFWPU from Pexels


Think back to your school days, studying for an exam. Did you randomly read sections of your notes, or randomly do problems in the back of the book? No! Well, at least I hope you didn’t approach your schooling with the same level of rigor as choosing what to eat for breakfast. What you probably did was figure out which topics were difficult for you to master and work diligently at those, only lightly refreshing the ideas you felt you understood. So why do we treat our machine students differently?

We need more data! It is a clarion call…
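The study-what-you-find-hard strategy above is the core of active learning. As a rough sketch under my own assumptions (synthetic data, a seed of 20 labels, uncertainty sampling rather than whatever query strategy the post actually uses): train on the small labeled set, then send the examples the model is least sure about to a human for labeling.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X, y = make_classification(n_samples=200, random_state=0)
labeled = rng.choice(len(X), size=20, replace=False)   # small labeled seed set
unlabeled = np.setdiff1d(np.arange(len(X)), labeled)   # everything else is "unlabeled"

model = LogisticRegression().fit(X[labeled], y[labeled])

# Uncertainty sampling: query the points the model is least confident about
proba = model.predict_proba(X[unlabeled])
uncertainty = 1 - proba.max(axis=1)                    # high = near the decision boundary
query = unlabeled[np.argsort(uncertainty)[-5:]]        # 5 examples to label next
print("ask a human to label:", query)
```

After those five labels come back, you add them to the labeled pool, retrain, and repeat; that loop, not random labeling, is "labeling smarter."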

There is a dearth of good psychotherapy data. Let’s work on that.

Photo by Polina Zimmerman from Pexels


This past year I was applying NLP to improve the quality of mental health care. One thing I found particularly difficult in this domain is the lack of high-quality data. Sure, you can go scrape Reddit and get some interesting therapeutic interactions between individuals, but in my opinion, this is a poor substitute for an actual interaction between a client and a therapist. Don’t get me wrong, there are datasets. They are just, more often than not, proprietary or pay-to-play.

My hope with this post is to introduce a data set of reasonably high-quality…

Figuring out what words are predictive for your problem is easy!

Photo Credit: Kendrick Mills


Nowadays NLP feels like it’s just about applying BERT and getting state-of-the-art results on your problem. Oftentimes, I find that grabbing a few good informative words can help too. Usually, I’ll have an expert come to me and say these five words are really predictive for this class. Then I’ll use those words as features, and voila! You get some performance improvements or a little bit more interpretability. But what do you do if you don’t have a domain expert? …

It’s only ~100 lines of code but the tweets are infinite.

The Author’s rendition of a Twitter bot


I love generative models. There is something magical about showing a machine a bunch of data and having it draw a picture or write a little story in the same vein as the original material. What good are these silly little models if we can’t share them with others, right? This is the information age, after all. In this post we’ll:

  1. Walk through setting up a Twitter bot from scratch
  2. Train one of the most cutting edge language models to generate text for us
  3. Use the Twitter API to make your bot tweet!

When you’re done with the tutorial you’ll…

BERT can get you state-of-the-art results on many NLP tasks and it only takes a few lines of code.

BERT as a Transformer (Image by Author)


Getting state-of-the-art results in NLP used to be a harrowing task. You’d have to design all kinds of pipelines, do part-of-speech tagging, link entities to knowledge bases, lemmatize your words, and build crazy parsers. Now just throw your task at BERT and you’ll probably do pretty well. The purpose of this tutorial is to set up a minimal example of sentence-level classification with BERT and scikit-learn. I’m not going to talk about what BERT is or how it works in any detail. I just want to show you in the smallest amount of…

Nicolas Bertagnolli

Sr. Machine Learning Engineer interested in NLP. Let’s connect on LinkedIn! https://www.linkedin.com/in/nicolas-bertagnolli-058aba81/
