Tim Burns

Data Dictionary and Kaggle

A Dictionary that was a Birthday Present in 1940

Today I discovered Kaggle. It's a fascinating site with Data Science competitions for real money. A number of the competitions were closed, but I took a look.

The Data Dictionary is an underestimated tool for gathering data quickly. It can form a solid foundation for data and extend across data sets. For much of the daily-grind data work I do in my job, Data Dictionaries are crucial.

The dictionary is such a simple concept - ordered words with pronunciation and definitions. If the word is interesting, then perhaps an illustration.

Being an ML newbie, I chose a newbie project: NLP Getting Started.

Quick and dirty data dictionary code will tell me high-level facts about the project.
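A minimal sketch of what such quick and dirty data dictionary code might look like, using only the standard library. The function name `build_data_dictionary` and the sample rows are made up for illustration; the column names follow the shape of the competition's train.csv (id, keyword, location, text, target), and the real file would be read from disk instead of from a string.

```python
import csv
import io


def build_data_dictionary(rows):
    """Summarize each column of an iterable of dict rows (e.g. csv.DictReader)."""
    dictionary = {}
    for row in rows:
        for column, value in row.items():
            entry = dictionary.setdefault(
                column,
                {"non_empty": 0, "empty": 0, "distinct": set(), "example": None},
            )
            if value is None or value == "":
                entry["empty"] += 1
                continue
            entry["non_empty"] += 1
            entry["distinct"].add(value)
            if entry["example"] is None:
                entry["example"] = value
    # Replace each set of distinct values with its size so the result prints cleanly.
    for entry in dictionary.values():
        entry["distinct"] = len(entry["distinct"])
    return dictionary


# Hypothetical sample rows, not taken from the real data set.
SAMPLE_CSV = """id,keyword,location,text,target
1,,,Forest fire near the town,1
2,,Boston,I love fruits,0
3,flood,,Flood warning for the river,1
"""

data_dictionary = build_data_dictionary(csv.DictReader(io.StringIO(SAMPLE_CSV)))
for column, facts in data_dictionary.items():
    print(column, facts)
```

Even a summary this small shows which columns are mostly empty and how many distinct labels the target column holds, which is the kind of high-level fact a data dictionary is for.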

It's an interesting question: we have a set of tweets that are labeled as emergency or not emergency, and the task is to predict that label for new tweets.

Tweets with the label that matches:

Tweets that match:

I'm probably not going to spend a lot of time on this, but my thought is that we can build a sparse matrix of the double-metaphones for each word, dump out high-frequency metaphones like T_ and AN_, then build a sparse matrix classifier and let Spark do its thing.

The code to run the metaphones

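import fuzzymatch  # presumably the local module holding the Metaphone code below
from pandasql import sqldf  # assumption: sqldf is pandasql's SQL-on-DataFrame helper
# Encode the first five tweets labeled target=1 (df_train is the training data frame)
metaphones = fuzzymatch.get_df_metaphones(sqldf("select text from df_train where target=1").head())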

The Metaphone code

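from metaphone import doublemetaphone  # from the "Metaphone" package on PyPI

def get_df_metaphones(df_column):
    """
    Takes a single-column data frame and returns the metaphones for each row
    :param df_column: data frame with a "text" column
    :return: a list with one list of double-metaphone codes per row
    """
    metaphones = []
    for index, row in df_column.iterrows():
        row_words = []
        for word in row.text.split(" "):
            # Assumed loop body: encode each word and collect the codes per tweet
            row_words.append(double_metaphone(word))
        metaphones.append(row_words)
    return metaphones

def double_metaphone(full_name):
    # doublemetaphone returns a (primary, secondary) tuple; join it as PRIMARY_SECONDARY
    primary, secondary = doublemetaphone(full_name)
    return f"{primary}_{secondary}"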
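The sparse-matrix idea described earlier could be sketched roughly as follows. This is only an illustration under stated assumptions: it takes per-tweet code lists in the shape returned by get_df_metaphones, the function name `build_sparse_matrix` and the `max_doc_fraction` threshold are hypothetical, and the sample codes are hand-written rather than real double-metaphone output. It stops at the matrix and does not include a classifier or any Spark code.

```python
from collections import Counter


def build_sparse_matrix(tweet_codes, max_doc_fraction=0.5):
    """Turn per-tweet code lists into sparse rows of {column index: count}.

    Codes that appear in more than max_doc_fraction of the tweets are treated
    as stop codes and dropped (the threshold is an arbitrary assumption).
    """
    n_docs = len(tweet_codes)
    # Document frequency: in how many tweets does each code appear at least once?
    doc_freq = Counter(code for codes in tweet_codes for code in set(codes))
    stop_codes = {c for c, n in doc_freq.items() if n / n_docs > max_doc_fraction}

    vocabulary = {}  # code -> column index, assigned in order of first appearance
    rows = []
    for codes in tweet_codes:
        row = {}
        for code in codes:
            if code in stop_codes:
                continue
            column = vocabulary.setdefault(code, len(vocabulary))
            row[column] = row.get(column, 0) + 1
        rows.append(row)
    return rows, vocabulary, stop_codes


# Hand-written stand-ins for metaphone codes, one list per tweet.
tweet_codes = [
    ["T_", "FRST_", "FR_"],
    ["T_", "FLT_", "FLT_"],
    ["T_", "AN_", "FR_"],
    ["AN_", "T_", "KT_"],
]

rows, vocabulary, stop_codes = build_sparse_matrix(tweet_codes)
print(stop_codes)
print(vocabulary)
print(rows)
```

Storing each row as a dictionary keeps only the non-zero counts, which matters because most tweets use a tiny fraction of the vocabulary. The rows and the tweet labels would then need converting to whatever sparse format the chosen classifier expects.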