Homework 2

You can type this homework or handwrite and scan. Upload your solution to ELMS. Due on Tuesday Oct 15.

Tf-idf by hand

Make up 5 short documents and construct the tf-idf matrix for them by hand. Keep each document to 1-2 sentences and about 10 words or fewer.

Steps:

1. Make up your data, as described above.
2. Make the bag-of-words (document-term) matrix by counting the words in each document.
3. Calculate the inverse document frequency (IDF) of each word.
4. Multiply each column by the IDF for that word.

Use the version without log-scaling:

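\[TF[d,t] = \text{count}(d,t) \qquad IDF[t] = \frac{N}{DF[t]}\]

Here count(d,t) is the number of times word t appears in document d, N is the total number of documents, and DF[t] is the number of documents that contain word t.

Show your work at each step!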
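If you want to check your hand calculation afterwards, the steps above can be sketched in a few lines of Python. This is only an illustration under assumptions: the three documents are made up for the example (they are not the ones you should submit), words are split on whitespace, and names such as `docs`, `tf`, and `tfidf` are hypothetical.

```python
from collections import Counter

# Hypothetical example documents, invented for this sketch.
docs = [
    "the cat sat",
    "the dog sat",
    "the cat ran",
]

# Step 2: count the words in each document (bag of words).
counts = [Counter(doc.split()) for doc in docs]
vocab = sorted(set(word for c in counts for word in c))

# Document-term matrix: one row per document, one column per word.
tf = [[c[word] for word in vocab] for c in counts]

# Step 3: IDF[t] = N / DF[t], the version without log scaling.
n_docs = len(docs)
df = [sum(1 for row in tf if row[j] > 0) for j in range(len(vocab))]
idf = [n_docs / d for d in df]

# Step 4: multiply each column by the IDF of its word.
tfidf = [[row[j] * idf[j] for j in range(len(vocab))] for row in tf]

for word, d, w in zip(vocab, df, idf):
    print(f"{word}: DF={d}, IDF={w:.2f}")
```

In this toy data "the" appears in all three documents, so its IDF is 1, while "dog" and "ran" each appear in only one document and get the largest IDF of 3.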

Which words are the most discriminative words? (Meaning, which words appear in the fewest documents, and therefore provide the most information?)

Word-level feature representations

Write down the transpose of the document-term matrix above to get the term-document matrix. In this matrix, each row is the feature representation for a word.
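As a small illustration of the transpose, the sketch below flips a made-up document-term matrix so that each word gets its own row. The matrix, the vocabulary, and the variable names are assumptions for the example, not your homework data.

```python
# Hypothetical document-term matrix: 3 documents (rows) x 4 words (columns).
vocab = ["cat", "dog", "sat", "the"]
doc_term = [
    [1, 0, 1, 1],
    [0, 1, 1, 1],
    [1, 0, 0, 1],
]

# zip(*rows) regroups the entries by column, which is the transpose.
term_doc = [list(col) for col in zip(*doc_term)]

# Each word now maps to its row: one entry per document.
word_vectors = dict(zip(vocab, term_doc))
print(word_vectors["cat"])
```

The row for "cat" has one entry per document, so it records which documents the word occurs in and how often.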

Choose two pairs of words. Calculate the cosine similarity for each of the two pairs. Show every step in the arithmetic.
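For reference, the cosine similarity of two vectors u and v is their dot product divided by the product of their lengths:

\[\cos(u,v) = \frac{u \cdot v}{\|u\| \, \|v\|}\]

A minimal Python sketch of that arithmetic follows, which you could use to check your hand-computed values. The word vectors are made up for the example, the function name `cosine_similarity` is hypothetical, and the sketch assumes neither vector is all zeros.

```python
import math

def cosine_similarity(u, v):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical word vectors (rows of a term-document matrix).
cat = [1, 0, 1]
the = [1, 1, 1]
dog = [0, 1, 0]

# cat . the = 2, |cat| = sqrt(2), |the| = sqrt(3)
sim_cat_the = cosine_similarity(cat, the)
# cat and dog never share a document, so the dot product is 0.
sim_cat_dog = cosine_similarity(cat, dog)

print(round(sim_cat_the, 3), sim_cat_dog)
```

Words that never occur in the same document have a dot product of 0 and therefore a cosine similarity of 0, as with "cat" and "dog" here.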