  • Tim Burns

Data Dictionary and Kaggle


A Dictionary that was a Birthday Present in 1940


Today I discovered Kaggle. It's a fascinating site with Data Science competitions for real money. A number of the competitions were already closed, but I took a look anyway.


The Data Dictionary is an underrated tool for getting to know data quickly. It can form a solid foundation for a data set and extend across data sets. For much of the daily-grind data work I do in my job, Data Dictionaries are crucial.


The dictionary is such a simple concept - ordered words with pronunciation and definitions. If the word is interesting, then perhaps an illustration.

Being an ML newbie, I chose a newbie project: NLP Getting Started.


Quick and dirty data dictionary code will tell me high-level facts about the project.
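As a rough idea of what such quick and dirty code could look like, here is a minimal sketch that uses only the standard library. It is not the code from my notebook: the function name build_data_dictionary is made up for illustration, the sample tweets are invented, and the id, keyword and location columns are an assumption about the training file (only text and target appear in the code later in this post).

```python
import csv
import io


def build_data_dictionary(rows):
    """Summarize each column: non-empty count, distinct count, and a sample value."""
    dictionary = {}
    for row in rows:
        for column, value in row.items():
            entry = dictionary.setdefault(
                column, {"non_null": 0, "distinct": set(), "sample": None}
            )
            # Treat missing and empty cells as nulls.
            if value not in (None, ""):
                entry["non_null"] += 1
                entry["distinct"].add(value)
                if entry["sample"] is None:
                    entry["sample"] = value
    # Replace the set of distinct values with its size for reporting.
    return {
        column: {
            "non_null": entry["non_null"],
            "distinct": len(entry["distinct"]),
            "sample": entry["sample"],
        }
        for column, entry in dictionary.items()
    }


# Hypothetical rows shaped like the training file (made-up tweets).
SAMPLE_CSV = """id,keyword,location,text,target
1,,,Forest fire near the lake,1
2,fire,Boston,This mixtape is fire,0
3,flood,,Flood warning issued downtown,1
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
data_dictionary = build_data_dictionary(rows)
for column, facts in data_dictionary.items():
    print(column, facts)
```

Even on three rows this shows the kind of high-level facts I am after: which columns are mostly empty, how many distinct values each one holds, and what a typical value looks like.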






It's an interesting question. We have a set of tweets that are labeled as emergency or not emergency.


Tweets with the label that matches:

Tweets that match:



I'm probably not going to spend a lot of time on this, but my thought is that we can build a sparse matrix of the double-metaphones for each word, dump out high-frequency metaphones like T_ and AN_, then build a sparse matrix classifier and let Spark do its thing.
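To make the idea a bit more concrete, here is a small sketch of the "build a sparse matrix and dump the high-frequency metaphones" step in plain Python. It is an illustration under assumptions, not a finished pipeline: build_sparse_rows and max_doc_fraction are names I made up, the tokens are invented stand-ins for metaphone output, and the classifier and Spark parts are left out entirely.

```python
from collections import Counter


def build_sparse_rows(docs, max_doc_fraction=0.5):
    """Turn token lists into sparse count rows, dropping very common tokens.

    docs: list of token lists, e.g. one list of metaphone codes per tweet.
    Returns (vocabulary, rows) where each row maps column index -> count.
    """
    # Document frequency: in how many documents does each token appear?
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))

    # Keep tokens that appear in at most max_doc_fraction of the documents.
    limit = max_doc_fraction * len(docs)
    kept = sorted(tok for tok, count in doc_freq.items() if count <= limit)
    vocabulary = {tok: idx for idx, tok in enumerate(kept)}

    # Store only the non-zero cells, which is what makes the matrix sparse.
    rows = []
    for tokens in docs:
        counts = Counter(tok for tok in tokens if tok in vocabulary)
        rows.append({vocabulary[tok]: n for tok, n in counts.items()})
    return vocabulary, rows


# Made-up metaphone-style tokens, for illustration only.
docs = [
    ["T_", "FRST_", "FR_", "AN_"],
    ["T_", "FLT_", "AN_", "FLT_"],
    ["T_", "MKSTP_", "FR_", "AN_"],
    ["FLT_", "RNNK_"],
]
vocabulary, rows = build_sparse_rows(docs, max_doc_fraction=0.5)
print(vocabulary)
print(rows)
```

Here T_ and AN_ show up in three of the four rows, so they fall above the cutoff and never get a column. Each remaining row is a dict of column index to count, which could be handed to a sparse matrix type later.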


The code to run the metaphones

#%%
import fuzzymatch
metaphones = fuzzymatch.get_df_metaphones(sqldf("select text from df_train where target=1").head())



The Metaphone code

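from metaphone import doublemetaphone  # assumed source: the Metaphone package on PyPI


def get_df_metaphones(df_column):
    """
    Takes a data frame with a `text` column and returns, for each row,
    a list of double-metaphone codes, one per space-separated word.
    :param df_column: data frame with a `text` column
    :return: list of lists of strings shaped like "PRIMARY_SECONDARY"
    """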
    metaphones = []
    for index, row in df_column.iterrows():
        row_words = []
        for word in row.text.split(" "):
            row_words.append(double_metaphone(word))
        metaphones.append(row_words)

    return metaphones


def double_metaphone(full_name):
    # Encode once, then join the primary and secondary codes with an underscore.
    codes = doublemetaphone(full_name)
    return f"{codes[0]}_{codes[1]}"