### Big Data for choosing bride/groom

### Want to know how many of your female friends are still unmarried?

This blog post is about the logistic regression algorithm, which was developed primarily for classification. I find this algorithm very useful for cloud-based big data analysis.

Let us understand this algorithm with a simple example.

**Problem statement:** I want to calculate how many of my friends are married, divorced or single.

**Solution:** Here I have two groups of states.

The Marital Status group has 3 states: Single, Married and Divorced. The Gender group has 2 states: Male and Female. Each state has a score associated with it (a z-score), which represents the normalized weight of the state.

A z-score maps the raw distribution into a space where the mean is 0 and the standard deviation is 1.
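To make the z-score idea concrete, here is a minimal sketch in Python. The sample values and the function name `z_scores` are made up for illustration, and I am assuming the population standard deviation is the one intended.

```python
import statistics

def z_scores(values):
    """Standardize values so the result has mean 0 and standard deviation 1."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population standard deviation (assumption)
    return [(x - mu) / sigma for x in values]

# Hypothetical raw values, chosen only for illustration.
raw = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
scores = z_scores(raw)
```

After the transformation, the mean of `scores` is 0 and its standard deviation is 1, whatever the scale of the raw values was.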

Graph 1 shows a straight line across the sets of friends. Here the data has been poorly modeled using linear regression. Graph 2 shows a comparatively better fit, with an S-shaped curve.

**Objective:**

How to achieve the S-shaped graph

To achieve the graph mentioned above, I need the exact probability points, ranging from 0 to 1, for my set of friends. The set has been given to me by Facebook in a 4-page document. Each page lists 100 friends.

Let us assume that we have a continuous set of values. In this set, out of 100 friends, maybe 20 are missing or are not my friends. So now I have 80 friends and 20 people who are not my friends but have some indirect connection to me.

Now consider a positive score when a state is present and a negative score when it is not, according to my result. After that I will do a mathematical calculation on the positive values (the actual friend list). This will be:

Friends = (x - mu) / sigma

where x is the actual value, mu is the calculated mean and sigma is the standard deviation. This calculation may differ according to the requirement. It gives me the coefficient for calculating the probability points.

For the negative set (people who are not my friends), this will be:

Not my friends = -(mu / sigma)

After applying the formulae I get the following result:
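The two formulas can be sketched in Python as below. The function names and the values of mu and sigma are hypothetical, picked only to show the arithmetic. Note that the second formula is simply the first one evaluated at x = 0.

```python
def friend_score(x, mu, sigma):
    """Score for the positive set: (x - mu) / sigma."""
    return (x - mu) / sigma

def non_friend_score(mu, sigma):
    """Score for the negative set: -(mu / sigma)."""
    return -(mu / sigma)

# Hypothetical mean and standard deviation, for illustration only.
mu, sigma = 0.4, 0.5
present = friend_score(1.0, mu, sigma)  # state is present, x = 1
absent = non_friend_score(mu, sigma)    # state is absent
```

With these made-up numbers the present state gets a positive score and the absent state gets a negative one, which matches the sign convention described above.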

**Marital Status:**

1. X1 = -0.03 (Score for Not Single)

2. X2 = 0.92 (Score for Married)

3. X3 = -0.05 (Score for Not Divorced)

**Gender:**

1. X4 = 0.89 (Score for Male)

2. X5 = -0.11 (Score for Not Male)

Note: whenever we get a negative result, we add a Not to that state.
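The naming rule in the note can be written as a small helper. This is only a sketch: `label_state` is a hypothetical name, and the scores are the Marital Status values listed above.

```python
def label_state(state, score):
    """Prefix the state with 'Not' when its score is negative."""
    return f"Not {state}" if score < 0 else state

# Marital Status scores from the list above.
marital_scores = {"Single": -0.03, "Married": 0.92, "Divorced": -0.05}
labels = [label_state(state, score) for state, score in marital_scores.items()]
```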

Now we have calculated our coefficients (c0, c1, ...).

We need two more steps in order to calculate p (the probability points).

The next step is to compute Z, a weighted sum of the scores. The formula for doing that is

Z = c0 + c1*x1 + c2*x2 + … This is a typical linear expression.

Logistic regression now transforms the linear Z expression into an S-shaped curve on the graph. The Y axis ranges from 0 to 1, where 0 is approached as z moves towards negative infinity and 1 is approached as z moves towards plus infinity.
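As a rough sketch of the linear expression, the snippet below plugs in the five scores listed earlier as x1 to x5. The coefficients c0 to c5 are invented for illustration, since the post does not give their values, and `linear_z` is a hypothetical helper name.

```python
def linear_z(c0, coefficients, xs):
    """Compute z = c0 + c1*x1 + c2*x2 + ... for paired coefficients and inputs."""
    if len(coefficients) != len(xs):
        raise ValueError("need exactly one coefficient per input")
    return c0 + sum(c * x for c, x in zip(coefficients, xs))

# x1..x5: the scores listed earlier in the post.
xs = [-0.03, 0.92, -0.05, 0.89, -0.11]

# c0..c5: hypothetical coefficients, NOT taken from the post.
c0 = 0.5
coefficients = [1.0, 2.0, 1.0, 0.5, 0.5]

z = linear_z(c0, coefficients, xs)
```

With these assumed coefficients, z comes out at about 2.65. On its own this value is unbounded, which is why one more step is needed to turn it into a probability.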

Now comes the final part: how to calculate the probability points (p).

P = Exp(z) / (1 + Exp(z))

This whole computation done by logistic regression helps me identify the pattern between two states. Such an algorithm is widely used in demographic analysis.
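Here is a minimal Python sketch of this last step. The function name `logistic` is my own choice. For z >= 0 it uses the algebraically equivalent form 1 / (1 + Exp(-z)), so that very large values of z do not overflow.

```python
import math

def logistic(z):
    """Return exp(z) / (1 + exp(z)) without overflowing for large |z|."""
    if z >= 0:
        # Same value as exp(z) / (1 + exp(z)), but exp(-z) cannot overflow here.
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Sample a few points to see the S shape: values near 0 on the far left,
# 0.5 at z = 0, values near 1 on the far right.
curve = [(z, logistic(z)) for z in (-6, -3, 0, 3, 6)]
```

At z = 0 the result is exactly 0.5, and the curve is symmetric around that point: `logistic(-z)` equals `1 - logistic(z)`.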

This actually reminds me of Microsoft's release of Azure traffic. Currently Azure deploys your application in one or at most two datacenters, and in each datacenter you have multiple instances of your application. With Azure traffic coming, Microsoft intends to copy your application to all 6 of its datacenters, and while you are travelling from Asia to North America the Azure traffic feature does an intelligent mapping and gives you faster access by routing you to the nearest datacenter. In this kind of analysis the algorithm could be of real value.

The reference for this blog is http://msdn.microsoft.com/en-us/library/cc645904.aspx. Some of the analysis, which you will not find on the MSDN site, has been done by me and comes from my own understanding, so it would be great if you point out anything that seems wrong. I will try to give a proper justification for it.
