What does it mean to 'find the frequency vector' of a text file?

OK, this may seem a little odd, but I have a NumPy problem sheet for my maths course. I have been asked to write a function that computes the frequency vector of a text file, but I don't know what the variables in the given equation refer to.

Below is some context provided in my problem sheet, and below that are two python functions that have also been provided:

If you are given a piece of text, then it is possible to compute the frequency of occurrence of each of the lower case letters in that text. It turns out that the frequency of these letters is relatively fixed from language to language (although this should not be seen as an absolutely fixed set of frequencies as every text can represent a different style). It turns out that these frequencies are surprisingly constant between different texts from the same language. We can even use it to decrypt texts.

Put more mathematically, for each text $\mathcal T$ we can compute the frequency of each letter $f_a^{\mathcal T}, f_b^{\mathcal T}, \cdots$ where $f_a^{\mathcal T}$ is the frequency of lower case "a"s in $\mathcal T$ and so on. For the English alphabet, there are $26$ letters, so we can construct a vector $\underline{f^{\mathcal T}}$ of these frequencies. So $\underline{f^{\mathcal T}}$ is a vector with $26$ entries.

We can then compute how similar two texts $A$ and $B$ are by comparing their frequencies. Let's suppose that their corresponding frequency vectors are $\underline{f^A}$ and $\underline{f^B}$. We can easily compute the cosine of the angle between these two vectors by computing $$\cos\theta = \dfrac{\underline{f^A}^{\top}\cdot\underline{f^B}}{\left\vert\underline{f^A}\right\vert\left\vert\underline{f^B}\right\vert}.$$

Remember - 1. if $\cos\theta = 1$ then $\underline{f^A}$ and $\underline{f^B}$ are parallel to each other, which indicates their frequencies are very similar. 2. it doesn't matter about the number of dimensions, the above relationship still holds (not just for two or three dimensions).

We can test this in Python. To do this you will write Python code to compute the frequency of letters for a text and then to compute the dot product between different frequency vectors. Without looking at the files one can check if texts are from the same language or not.

In the cell below are two functions: one that will read in a text and return a string, and another which will return a vector with the incidence of lower case letters (i.e., how many times the letter "a" occurs and so on) in a string.

import numpy as np
import re

def getString(fn):
  f = open(fn,'r')
  myString = f.read()
  return myString.lower()

def getIncidence(string):
  letters = "abcdefghijklmnopqrstuvwxyz"
  Nc = len(letters)
  n = np.zeros(Nc,dtype=np.float64)
  i = 0
  for c in letters:
    x = re.findall(c,string)
    n[i] = len(x) * 1.0
    i += 1
  return n

CHECKPOINT Write a function that will compute the frequency vector of the text found in a text file. The input is the name of the text file and the output is the frequency vector for that file. If $\underline{n}$ is the vector of incidences then $$f_i = \dfrac{n_i}{\sum_j N_j}.$$

My issue is understanding what the variables are referring to in the equation (2nd image). From what they are asking, does the function I am supposed to write use this equation (using a vector made from the getIncidence method)?

I can imagine this might not be the place to ask seeing as it is to do with programming, but it's the maths that is the core of my problem.

Best answer

The vector $n$ contains how many times each letter occurred in the text. For example, if your text is

"qwertyuiopasdfghjklzxcvbnm"

then $n=[1,1,...,1]$

If your text is "abbcccdddd"

then $n=[1,2,3,4,0,0,...,0]$

The vector $n$ for each text is calculated by the function getIncidence.

The frequency of each letter $f_i$ is how many times it was observed in the text divided by the total number of letters counted in the text, i.e. the sum of all entries of $n$.

This means that for the first example text $f_i = 1/26$ for all $i$'s. And in the second text $f_1 = 1/10, f_2=2/10,f_3=3/10,f_4=4/10$ (or you can use $f_a,f_b,f_c,f_d$ instead of $f_1,f_2,f_3,f_4$).
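As a quick numerical check of this arithmetic, here is a minimal sketch. The helper names `incidence` and `frequency` are my own stand-ins (counting with `str.count` instead of `re.findall`) and are not part of your problem sheet; the first text is simply one in which every letter appears exactly once.

```python
import numpy as np

LETTERS = "abcdefghijklmnopqrstuvwxyz"

def incidence(text):
    # Hypothetical stand-in for getIncidence: count each lower case letter.
    return np.array([text.count(c) for c in LETTERS], dtype=np.float64)

def frequency(n):
    # f_i = n_i / sum_j n_j (assumes the text contains at least one letter).
    return n / n.sum()

# A text in which every letter appears exactly once: all entries are 1/26.
f1 = frequency(incidence(LETTERS))

# The second example: a once, b twice, c three times, d four times.
f2 = frequency(incidence("abbcccdddd"))

print(f1[:4])
print(f2[:4])  # [0.1 0.2 0.3 0.4]
```

Both frequency vectors sum to 1, which is what dividing by the total count guarantees.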

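Dividing by the total makes texts of different lengths comparable. For example, the texts "abbcdd" and "aabbbbccdddd" have different incidence vectors ($[1,2,1,2,0,...,0]$ and $[2,4,2,4,0,...,0]$) but the same frequency vector ($[1/6,2/6,1/6,2/6,0,...,0]$), so they are considered similar. So yes, the function you are supposed to write should use this equation, and yes, you should use the vector returned by the getIncidence function.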
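Here is a minimal sketch of what the checkpoint seems to be asking for, assuming the two functions from your problem sheet are used essentially as given. The names `getFrequency` and `cosine` are my own choices, I am reading the $N_j$ in the denominator as $n_j$, and the guard for a file containing no letters is my addition. The `cosine` helper is just the formula for $\cos\theta$ from your sheet.

```python
import re
import numpy as np

def getString(fn):
    # As on the problem sheet, but the file is closed via a context manager.
    with open(fn, 'r') as f:
        return f.read().lower()

def getIncidence(string):
    # As on the problem sheet: count occurrences of each lower case letter.
    letters = "abcdefghijklmnopqrstuvwxyz"
    n = np.zeros(len(letters), dtype=np.float64)
    for i, c in enumerate(letters):
        n[i] = len(re.findall(c, string))
    return n

def getFrequency(fn):
    # Hypothetical name: returns f with f_i = n_i / sum_j n_j for file fn.
    n = getIncidence(getString(fn))
    total = n.sum()
    if total == 0:
        # Assumption: treat a file with no letters as an error.
        raise ValueError(f"no letters found in {fn}")
    return n / total

def cosine(fA, fB):
    # Cosine of the angle between two frequency vectors.
    return np.dot(fA, fB) / (np.linalg.norm(fA) * np.linalg.norm(fB))
```

If you save "abbcdd" in one file and "aabbbbccdddd" in another, `getFrequency` should return the same vector for both, so `cosine` of the two results should be 1 up to rounding.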

I am not sure however why in the denominator in the second photo, a capital $N$ is used instead of a small one ($n$). If there is any other use of the capital letter $N$ in the exercise, maybe I could help more or change something in my answer.