Exercise 1.3.2 from the Mining of Massive Datasets book

1.1k Views Asked by Bumbble Comm At 10 May 2026 - 7:28

Hello, there is a question in the Mining of Massive Datasets book (http://infolab.stanford.edu/~ullman/mmds/ch1.pdf): exercise 1.3.2 on page 15.

My solution is as follows: there are $10$ million documents and the word occurs in $320$ of them, so Inverse Document Frequency $= \log(10*10^{6}/320)$.

Now, as per the question:

case a) the word appears once, so $TF=1/15$ (it is given that $15$ is the maximum number of occurrences of any word in the document)

case b) the word appears $5$ times, so $TF = 5/15$ (same maximum of $15$ occurrences)

So for case a) the $TF.IDF$ score $= \log(10^{7}/320)*(1/15)$

and for case b) the $TF.IDF$ score $= \log(10^{7}/320)*(5/15)$

Is this solution correct? I just want to understand if I have understood the concept correctly or not.


Bumbble Comm On 15 Sep 2019 - 12:35 BEST ANSWER

You're on the right path. According to the book's definition, $IDF_i=\log_2 (N/n_i)$, so the logarithm is base 2. Here $N/n_i = 10^{7}/320 = 31250$ and $\log_2 31250 \approx 14.93$, so your answers should be

Case A: $TF.IDF \text { score} = \log_2 (10^{7}/320) * (1/15) \approx 0.995$

Case B: $TF.IDF \text { score} = \log_2 (10^{7}/320) * (5/15) \approx 4.98$
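If you want to check the arithmetic numerically, here is a minimal Python sketch. It assumes the definitions used above: $TF$ is the word's count divided by the maximum count of any word in the document, and $IDF$ is $\log_2(N/n_i)$. The function name `tf_idf` and the variable names are just illustrative choices, not something from the book.

```python
import math

def tf_idf(count, max_count, num_docs, docs_with_term):
    """TF.IDF with TF = count / max_count and IDF = log2(num_docs / docs_with_term)."""
    tf = count / max_count
    idf = math.log2(num_docs / docs_with_term)
    return tf * idf

N = 10_000_000   # documents in the repository
n_w = 320        # documents containing the word
max_occ = 15     # maximum occurrences of any word in the document

score_a = tf_idf(1, max_occ, N, n_w)  # word appears once
score_b = tf_idf(5, max_occ, N, n_w)  # word appears five times

print(f"(a) {score_a:.3f}")  # (a) 0.995
print(f"(b) {score_b:.3f}")  # (b) 4.977
```

Because $TF$ is a plain multiplier on the logarithm, case b) is exactly five times case a).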