Curve Fitting a Cyclical Pattern of Data


I'm analyzing phonological characteristics of the 22 letters of the Hebrew alphabet, and I've assigned each letter a numeric code to see whether the alphabet is ordered by place of articulation:

  • guttural = 1
  • labial = 2
  • palatal = 3
  • coronal = 4

The following is the result (alphabetic order matched with articulation code):

1,1
2,2
3,3
4,4
5,1
6,2
7,4
8,1
9,4
10,3
11,3
12,4
13,2
14,4
15,4
16,1
17,2
18,4
19,3
20,4
21,4
22,4

When I take the difference between the articulation codes of each pair of successive letters (simple subtraction), it seems subjectively apparent that there is a non-random cyclical pattern (oddly like a wave) in which the increments and decrements net out to 3.

Here's the table of "step directions" (a 1 -> 4 step in the table above is a positive number here, and a 4 -> 1 step is a negative number; the first letter has no predecessor, so its step is 0):

1,0
2,1
3,1
4,1
5,-3
6,1
7,2
8,-3
9,3
10,-1
11,0
12,1
13,-2
14,2
15,0
16,-3
17,1
18,2
19,-1
20,1
21,0
22,0
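
For reference, the step table can be reproduced from the first table with a few lines of Python. This is only a minimal sketch; the variable names are illustrative, and the 0 for the first letter is the same convention used in the table above.

```python
# Articulation codes in alphabetical order (copied from the first table).
codes = [1, 2, 3, 4, 1, 2, 4, 1, 4, 3, 3, 4, 2, 4, 4, 1, 2, 4, 3, 4, 4, 4]

# Step for each letter = its code minus the previous letter's code.
# The first letter has no predecessor, so it gets 0 by convention.
steps = [0] + [b - a for a, b in zip(codes, codes[1:])]

for position, step in enumerate(steps, start=1):
    print(f"{position},{step}")

# The steps telescope: their sum is always (last code - first code).
print(sum(steps), codes[-1] - codes[0])  # 3 3
```

Because the differences telescope, the net total of 3 is just the last code (4) minus the first code (1), whatever the order of the letters in between.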

Bottom Line: Please help me understand how to characterize this pattern, how to fit it to the most relevant function model, and how to test it (given a sample size of 22) for confidence.

Here's a screenshot of what the graphs look like (red line shows the second table, and the area chart shows the first table).


Thanks in advance!

Best answer:

I actually do not believe Fourier analysis is relevant here. You can't get reasonable likelihood measures in that direction.

Your data is discrete and your probabilities can be calculated directly. What you are really looking at (as hinted by Rahul in the comments) is the combinatorics of runs of rises and falls in a bounded sequence. To measure how "atypical" this sequence is, you will want to look at its combinatorial properties.

First, you will want to understand the constraints you need to place. You have a string of 22 numbers, each from 1 to 4. You have at least one occurrence of each number (otherwise it wouldn't be a category that you would have classified the speech characteristics under), which gives a fairly large but manageable space to work with. However, you probably want to start with the narrower space obtained by explicitly specifying that there are:

  • 4 ones
  • 4 twos
  • 4 threes
  • 10 fours

This is because you are analyzing the arrangements of an existing language with existing speech characteristics, not finding out how unusual the particular choice of language is in itself. In other words, you just want to see if they arranged the alphabet in a way that is "meaningful" or has information content due to the layout, but the letters already exist and the layout is the only thing being checked.

This narrows the possible ways to arrange the phonological classification to $\binom{22}{10,4,4,4} = \frac{22!}{10!\,4!\,4!\,4!}$, or 22,406,283,900.

Now your data has 7 runs of rises and 6 runs of falls. Is this typical of those 22,406,283,900 arrangements? Or does it indicate this was not likely random? You can calculate that!
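
As a quick check of those run counts, here is a minimal Python sketch. It assumes a "run" means a maximal stretch of consecutive strictly rising (or strictly falling) steps, and that a level step (a repeated code) ends a run; the function name `count_runs` is just illustrative.

```python
from itertools import groupby

# Articulation codes in alphabetical order (from the question).
codes = [1, 2, 3, 4, 1, 2, 4, 1, 4, 3, 3, 4, 2, 4, 4, 1, 2, 4, 3, 4, 4, 4]

def count_runs(seq):
    """Return (rise_runs, fall_runs) for a sequence of numbers.

    Assumption: a run is a maximal block of consecutive steps with the
    same sign; level steps (sign 0) break runs and are not counted.
    """
    # Sign of each step: +1 for a rise, -1 for a fall, 0 for a level.
    signs = [(b > a) - (b < a) for a, b in zip(seq, seq[1:])]
    rises = falls = 0
    for sign, _ in groupby(signs):  # groupby merges equal neighbours
        if sign > 0:
            rises += 1
        elif sign < 0:
            falls += 1
    return rises, falls

print(count_runs(codes))  # (7, 6)
```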

L. Carlitz has a classical set of papers on this enumeration problem, starting with "Enumeration of sequences by rises and falls: a refinement of the Simon Newcomb problem". Goulden and Jackson also cover this problem in "Combinatorial Enumeration".

You want to calculate the possibilities for all of the different possible rise/fall combinations and then you can understand how special or typical this particular case is. The p-value can be calculated in the standard way by using the distribution given by probability over the rise/fall space.

I suspect you will find that seeing a number of rises and falls between 5 and 9 is pretty likely. When you have 10 fours placed randomly, any gaps around them will produce rises and falls. Only the very rare cases where the fours are bunched together will not have a sinusoidal appearance. But the details of the rise and fall enumeration with a specified set of characters in the string are rather involved, so I don't have numbers. This may also be approachable by brute-force enumeration, or by Monte Carlo sampling if the possibility space is too large to enumerate.

Another answer:

I would try Fourier analysis or a Fast Fourier Transform. These will decompose your cyclic function into a series of weighted sine functions. You can then choose the level of accuracy by truncating the number of sines you include in the series. As far as inferential confidence goes, you only have 22 samples, so you will likely need to get more data and see how well your fitted curve explains a new set of measurements. Absent that, you should try cross-validation, where you re-fit your Fourier series model after leaving out one or more samples, then rotate through all such combinations to see the variability of your sine wave weights.
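
A minimal sketch of the decomposition and truncation steps, using only the Python standard library. It uses a naive discrete Fourier transform rather than an FFT library (fine for 22 points), it does not implement the cross-validation step, and the function names are illustrative rather than taken from any package.

```python
import cmath
import math

# Articulation codes in alphabetical order (from the question).
codes = [1, 2, 3, 4, 1, 2, 4, 1, 4, 3, 3, 4, 2, 4, 4, 1, 2, 4, 3, 4, 4, 4]

def dft(x):
    """Naive O(n^2) discrete Fourier transform of a real sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def truncated_reconstruction(x, keep):
    """Rebuild x from its mean plus the `keep` strongest harmonics."""
    n = len(x)
    spectrum = dft(x)
    # Rank harmonics 1..n//2 by magnitude; each one is kept together
    # with its conjugate partner at index n - k so the result stays real.
    ranked = sorted(range(1, n // 2 + 1), key=lambda k: -abs(spectrum[k]))
    kept = {0}  # index 0 carries the mean
    for k in ranked[:keep]:
        kept.update({k, (n - k) % n})
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in kept) / n).real
        for t in range(n)
    ]

def rms_error(x, y):
    """Root-mean-square difference between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))

for keep in (1, 2, 3, 5, 11):
    approx = truncated_reconstruction(codes, keep)
    print(f"{keep:2d} harmonics kept: RMS error {rms_error(codes, approx):.4f}")
```

With all 11 harmonics kept the reconstruction matches the data up to floating-point error, so the interesting question is how quickly the error falls as harmonics are added, and whether the dominant harmonics stay stable when samples are left out.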