A Dictionary that was a Birthday Present in 1940
Today I discovered Kaggle. It's a fascinating site with Data Science competitions for real money. A number of the competitions were closed, but I took a look.
The Data Dictionary is an underestimated tool for getting to grips with data quickly. It can form a solid foundation for one data set and extend across several. For much of the daily-grind data work I do in my job, Data Dictionaries are crucial.
The dictionary is such a simple concept - ordered words with pronunciation and definitions. If the word is interesting, then perhaps an illustration.
Being an ML newbie, I chose a newbie project: NLP Getting Started.
Quick and dirty data dictionary code will tell me high-level facts about the project.
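A rough sketch of what quick and dirty data dictionary code can look like is below. It is not the code I actually ran. It assumes the training file is a CSV with `id`, `keyword`, `location`, `text` and `target` columns (only `text` and `target` appear elsewhere in this post), and it runs against three made-up rows rather than real competition data. The `data_dictionary` helper is a name I invented for the example.

```python
import csv
import io
from collections import Counter


def data_dictionary(rows):
    """Summarize each column: non-empty count, missing count, distinct values, most common value."""
    columns = list(rows[0].keys()) if rows else []
    summary = {}
    for col in columns:
        # Treat empty strings and None as missing
        values = [r[col] for r in rows if r[col] not in ("", None)]
        counts = Counter(values)
        summary[col] = {
            "non_empty": len(values),
            "missing": len(rows) - len(values),
            "distinct": len(counts),
            "top": counts.most_common(1)[0] if counts else None,
        }
    return summary


# Made-up sample in the assumed shape of the training file (not real data)
SAMPLE_CSV = """id,keyword,location,text,target
1,fire,,Forest fire near the ridge,1
2,,Ohio,I love this sunny day,0
3,fire,Texas,This mixtape is fire,0
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
dictionary = data_dictionary(rows)
for column, facts in dictionary.items():
    print(column, facts)
```

Even on a toy sample this shows the kind of high-level facts I am after: which columns have gaps, how many distinct values each one holds, and what the dominant value is.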
It's an interesting question. We have a set of tweets that are labeled as emergency or not emergency.
Tweets with the label that matches:
Tweets that match:
I'm probably not going to spend a lot of time on this, but my thought is that we can build a sparse matrix of the double metaphones for each word, throw out high-frequency metaphones like T_ and AN_, then train a classifier on that sparse matrix and let Spark do its thing.
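To make the idea concrete, here is a minimal sketch of the feature-building step only: sparse rows of metaphone counts, with very common metaphones removed by a document-frequency cutoff. It uses plain Python dictionaries rather than Spark, the `build_sparse_rows` helper and its `max_doc_fraction` parameter are hypothetical, and the token strings are hand-written stand-ins in the `primary_secondary` format, not real double-metaphone output. Training the classifier is not shown.

```python
from collections import Counter


def build_sparse_rows(token_rows, max_doc_fraction=0.5):
    """Turn rows of metaphone tokens into sparse {column_index: count} dicts.

    Tokens that appear in more than max_doc_fraction of the rows are
    dropped as too common to be informative.
    """
    n_rows = len(token_rows)
    # Document frequency: in how many rows does each token appear at least once?
    doc_freq = Counter()
    for tokens in token_rows:
        doc_freq.update(set(tokens))

    kept = sorted(t for t, df in doc_freq.items() if df / n_rows <= max_doc_fraction)
    vocabulary = {token: i for i, token in enumerate(kept)}

    sparse_rows = []
    for tokens in token_rows:
        counts = Counter(t for t in tokens if t in vocabulary)
        sparse_rows.append({vocabulary[t]: c for t, c in counts.items()})
    return vocabulary, sparse_rows


# Hand-written stand-in tokens (assumption: not actual metaphone output)
token_rows = [
    ["T_", "FR_", "AN_", "FRST_"],
    ["T_", "SN_", "AN_", "T_"],
    ["T_", "FR_", "MKSTP_", "AN_"],
    ["T_", "FLT_"],
]
vocabulary, sparse_rows = build_sparse_rows(token_rows, max_doc_fraction=0.5)
print(vocabulary)
print(sparse_rows)
```

Each row ends up as a small dictionary of column index to count, which is the same information a sparse matrix row holds. The cutoff is a judgment call: too low and useful signals disappear, too high and the T_ and AN_ noise stays in.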
The code to run the metaphones
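#%%
import fuzzymatch
from pandasql import sqldf  # assumed source of sqldf; the import is not shown in my original cell

# Double-metaphone keys for the first five tweets labeled as emergencies
metaphones = fuzzymatch.get_df_metaphones(sqldf("select text from df_train where target=1").head())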
The Metaphone code
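from metaphone import doublemetaphone  # assumed: the Metaphone package from PyPI


def get_df_metaphones(df_column):
    """
    Takes a data frame with a text column and returns, for each row,
    a list of double-metaphone keys (one per space-separated word)
    :param df_column: data frame with a text column
    :return: list of lists of "primary_secondary" metaphone strings
    """
    metaphones = []
    for index, row in df_column.iterrows():
        row_words = []
        for word in row.text.split(" "):
            row_words.append(double_metaphone(word))
        metaphones.append(row_words)
    return metaphones


def double_metaphone(full_name):
    # Compute the (primary, secondary) keys once and join them with an underscore
    keys = doublemetaphone(full_name)
    return f"{keys[0]}_{keys[1]}"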