
Sunday, July 6, 2014

The Great Indian Wedding



Arranged marriages have long been part of Indian tradition. With the advent of the internet in India and a growing number of online users, matrimonial services too went online and are doing exceptionally well. The industry is estimated to be worth more than 10 billion dollars in India and to be growing at more than 50% year-on-year.

Along with my friend Sharan, I did some analysis on the public data posted on a matrimonial website.

The analysis of the Indian Wedding was done on Bride and Groom data from simplymarriage.com.

The following questions capture the patterns in the Grooms' data.





What is the expected age of a Bride for the Groom?





From this you can clearly make out that as Men grow older, their preference for a younger Bride increases. There could be various reasons for this, but I'm not getting into them. It's debatable.

What is the Groom's Education?




Clearly, the online wedding portals are dominated by Engineers and MBA grads. 


Which are the major sectors where the Grooms work?




The graph is pretty much self-explanatory, but there is a significant chunk of people who are Not Working. This could be either because they got confused while entering their details, or because most of them have just passed out of college and are looking for a job as well as a girl to marry at the same time.


What is the proportion of the Men's Body Type?




Indian men seem to range from Average to Slim or Athletic. Pretty impressive! Only 1% of them are Heavy. According to the national statistics on overweight and obesity, the figure is around 12% - 16%, and this is for the whole of India. It will definitely be higher for the urban population.

What is the expected body type of the Bride?



So it turns out that the expectation is also quite optimistic.

What is the complexion of the Indian Men?



This shows that the majority of Indian Men have Wheatish to Fair skin. Probably, most of the Dark-skinned Men find their true love before marriage itself.

(Pun Intended!)

What is the expected complexion of the Bride?



It's good that there is a significant chunk of Men for whom the Bride's complexion doesn't matter, but one can't ignore the fact that most of them prefer the Bride to be fair.

How many of the Grooms drink?


Most of them don't drink. Hmm...can't comment on this. It is not clear what the men who selected "Doesn't Matter" meant.

Do the Grooms expect the Bride to drink?



Most of the Men don't want their future partner to drink.

What do the Grooms have to say about themselves?



What do they expect in their Bride?




They want them to be good looking, educated and a family person who understands them.




Suggestion for matrimonial providers

Matrimonial websites should provide innovative services, such as recommending the most suitable partner. They could also suggest to users which pieces of information the other gender considers most important.

If they also have information about who actually ended up marrying each other, this can be the basis for recommending partners to future users.

Wednesday, May 28, 2014

Mining through running stats

Recently the TCS 10k run was held. It was organized quite well, except for the part where people had to go commando-style over a puddle of water.

Anyway, coming to the point: the data for the run is publicly available, which made me think that some interesting charting could be done on it, and maybe some insights found too.

The following charting has been done on the Open 10k only.


1. Average time across age groups  



The youngest age group clearly has the minimum time, and the reason is obvious, so there is no need to dwell on it. The 20 to 30 years group has all kinds of people running, mostly first-time runners, because of which it has a higher average time. In the subsequent age groups, people are more serious about running.


2. Average time across gender



3. Maximum time across age groups



If you look at the maximum time across the age groups, it's evident that the last person crossing the finish line is not only getting wiser but also faster.


4. Minimum time across age groups

In the end, the fastest always have the age advantage.

5. Distribution of people across age groups


6. Overall gender distribution

7. Gender distribution across age groups


You can click on the buttons to see the distribution for a specific category. The younger groups have more female participation, which is a good sign. Hopefully, in the coming years, the number of participants will be more even.

8. Distribution of people across age groups and different time intervals


This chart is quite an informative one which tells how many people finished at a given time interval across different age groups.

I'll leave the inference for this up to you.

Oh yeah, the chart is interactive, so you can go ahead and click on those small circles.

Tuesday, May 13, 2014

Real-time Analysis of Payment Gateway Failures in E-Commerce


Introduction
The 21st century has seen the growth of numerous E-commerce companies whose business model involves buying and selling products and services over electronic systems such as the internet and other computer networks. A major function of E-commerce is handling business transactions and processing payments. Payment processing is a crucial task, as it involves tens of thousands of transaction requests every second. It is also crucial for revenue that any hardware or software issues that crop up are identified and handled immediately to avoid customer churn.
Real-time analytics of payment patterns in E-commerce uses a real-time processing engine, a temporary persistence layer and a permanent persistence layer to give business users insights that were never before possible. The technique gives the business user information about the transactions and payments that have happened up to that second. He can perform various complex analyses on this data, see how well he is doing compared to another point in time, and predict what his revenue will be in the coming days. He can identify his major clients/customers and make sure that all offers and discounts are conveyed to them. He can address payment gateway failures as soon as they crop up, identifying the cause and fixing it in real time before it drastically affects his business or revenue. He can also segregate his clients/customers into various categories and provide category-specific services and discounts. Payment gateway trend analysis will be of much use in the coming years, as transactions and banking methods shift more and more into the virtual environment. The growth of PaaS (Payment as a Service) is a very good example of this trend: E-commerce companies are outsourcing their entire payments to Payment as a Service providers.

Various Stages of Online Payments
There are various issues with respect to processing of online transactions. The entire payment processing takes around 2-3 seconds in general and involves various stages:
  1. Customer places the order on the website by pressing 'Submit Order'.
  2. The customer's web browser encrypts the information sent between the browser and the merchant's web server. The merchant then forwards the data to the payment gateway. These intermediate steps may be done over SSL (Secure Sockets Layer) encryption. Data could also be sent directly from the customer's browser to the gateway.
  3. The payment processor forwards the transaction information to the card association (e.g., Visa/MasterCard/American Express).
  4. The credit card issuing bank receives the authorization request and does fraud and credit or debit checks and then sends a response back to the processor.
  5. Processor forwards the authorization response to the payment gateway.
  6. The payment gateway receives the response, and forwards it on to the website.

Characteristics of Payment Gateways
1) Compatibility: The payment gateway's technology platform should be compatible with the shopping cart's technology.
2) Security: Ability to detect fraud over a moving time window.
3) Price: The fees charged for using the service.
4) Simplicity: Ease of use for the end user.

Many problems are faced during the payment stage of an E-commerce transaction. They arise from factors like:
  1. One-time passwords sent by SMS.
  2. Poor network.
  3. Device and browser combinations.
  4. Eight server hops for a successful payment: anything could go wrong in between, and each stage adds its own complication.

Problems Faced by E-Commerce Merchants
The various problems that lead to revenue loss and lost transactions for an E-commerce company are the following:
  1. Drop Offs: From the time the shopper clicks the 'Pay' button on the merchant's site until the entire transaction is processed, the request goes through various servers. Some payment gateways have 8 hops in between, during which the connection could drop off. There are also some providers who have fewer hops and enable direct payment; success rates are much higher in these cases, but drop offs are still inevitable.
  2. Multiple Payment Failure: This is another crucial scenario, wherein a loyal customer tries to make a payment more than once and fails more than once. It is a very serious issue, since it could lead to churn, especially of loyal customers, who will opt for competitor sites.
  3. Payment Success but the confirmation message is not shown: In some cases the payment is successful but the confirmation message isn't shown, which could lead to confusion, and customers might misinterpret their payment status.
Present Approach
In most cases the merchant sites are unaware of why their purchase-to-search ratio for a product/service is dropping. They know that the user hasn't completed the purchase, but the reason is unknown and they are thus unable to rectify the issue.
Payments are so crucial to the functioning of an E-commerce site that any approach that reduces payment drop offs is welcome; a decrease of even 1% is considered enough to improve revenue by millions of dollars.
  1. Scheduled Batch Reports: The reports produced by batch jobs indicate when payment failures were the reason for low sales in a given time period. While this method might help fix similar issues in the future, it cannot win back the drop offs or churned customers from the analysis interval.
  2. PaaS: Many E-commerce companies are now outsourcing their payments to PaaS (Payment as a Service) providers. This helps prevent drop offs through a reduced number of hops, one-click payments, etc. These services come with huge monetary costs.

Real-time Analysis of Payment Patterns in E-commerce
Architecture
This approach aims at giving E-commerce companies a 360-degree view of the payments happening on their site in real time. The proposed architecture consists of:
  1. Real-time Processing Engine: A real-time engine like Storm ingests the traffic across the website in real time, processes the data using the required algorithms, and forwards the processed results to either a storage layer or a reporting layer.
  2. Temporary Persistence Layer: A key-value store like Redis serves for the temporary storage and look-ups needed to identify customers and unique Apache cookies in real time. It is preferred over a time-consuming permanent persistence layer since storage and retrieval are much more efficient.
  3. Permanent Persistence Layer: This layer stores the processed data that comes from the real-time processing engine. It can be any distributed database like HBase or Cassandra, which can scale to handle the huge volumes of E-commerce traffic.
  4. User Interface: This gives the personnel at the E-commerce company real-time insights into their payments data and traffic.



Solutions
Drop-Offs
With this method, payment drop offs can be tackled in real time. Real-time counts can be obtained for each stage of the funnel, and the drop offs at each stage can be identified. The number of people in the first few stages will always be many times more than in the later stages; it reduces stage by stage to a very low number in the last stage. The figure shows a typical e-commerce book/buy funnel.
The algorithm for the Storm code:
  1. Store the message in a map, keyed by its "payment id" (a unique id for each payment), when the "payment id" exists and the "payment status" is INITIALIZE (making connection). The value will be the details of the message.
  2. If a message comes in with a "payment status" of SUCCESS or FAILURE, clear the map entry for that particular "payment id". It means that the individual with that unique payment id has either succeeded or failed in making the transaction, and the connection has not dropped off at any stage in between. We can therefore clear this information from the map.
  3. The elements left in the map are now the unique "payment id" entries whose "payment status" is still INITIALIZE. This final map gives us the details of those payment messages where the payment dropped off while trying to contact the bank or gateway, i.e. those payment ids which have not generated a SUCCESS/FAILURE status.
  4. This data is now moved to the permanent persistence layer, where it is stored and the various details of the message are separated out. For example, the reason for the drop off can be obtained from the "Payment gateway response" field, and other details leading to drop offs can also be identified, as we can save the browser name, version, device type, etc.
  5. The data stored in the database can be used for analysis.
  6. For immediate action, we can send customer details to the call centre for immediate call backs. This data could also be pushed to an e-mail or SMS queue to prevent customer churn.
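
The map handling in steps 1 to 3 can be sketched in plain Python. This is only an in-memory illustration, not the Storm topology itself: the field names `payment_id` and `payment_status`, the exact status spellings and the `DropOffTracker` class are all assumptions made for the example.

```python
from typing import Dict, List

# Assumed status values; the real gateway messages may spell them differently.
PENDING = "INITIALIZE"
FINAL = {"SUCCESS", "FAILURE"}


class DropOffTracker:
    """Tracks payments that started but never reported a final status."""

    def __init__(self) -> None:
        # payment_id -> full details of the message that opened the payment
        self.open_payments: Dict[str, dict] = {}

    def on_message(self, message: dict) -> None:
        payment_id = message.get("payment_id")
        if payment_id is None:
            return  # not a payment message, nothing to track
        status = message.get("payment_status")
        if status == PENDING:
            self.open_payments[payment_id] = message
        elif status in FINAL:
            # The payment finished (either way), so it did not drop off.
            self.open_payments.pop(payment_id, None)

    def drop_offs(self) -> List[dict]:
        """Messages still open, i.e. drop-off candidates."""
        return list(self.open_payments.values())
```

In a real deployment the entries would only be treated as drop offs after some waiting period, and would then be written to the permanent store or pushed to the call-back queue.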
Multiple Payment Failures
There are scenarios where customers face payment failure multiple times. This leads to customer churn and loss of loyalty, as customers who face such issues will be dissatisfied with the website. It is a difficult scenario that has to be tackled in real time, before the customers choose alternative sites. This use case can be leveraged to provide customer-specific services and call backs, based on the value of that customer to the organisation.

The proposed approach :
  • We use a temporary persistence layer (Redis) which stores look-up data for all high value customers: their user id and other details.
  • As and when a payment message with a "Fail" status is received, a look-up is done against the High Value Customers list in Redis.
  • If the user is high value, his details are pushed to the call centre channel for immediate action.
  • If he is a non-high value user, his details are pushed into a temporary map in Redis, with the payment id as the key and the count of failures as the value.
  • If the number of failures is greater than 2 in the given interval of time, his details are published onto an E-mail/SMS channel.
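
The routing rules above can be sketched as follows. Plain Python containers stand in for Redis here, and the sketch keys the failure counter by user id so that repeated attempts by the same customer add up (the text keys it by payment id). The `FailureRouter` class, the channel names and the omission of the time window are assumptions of this example.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

FAILURE_THRESHOLD = 2  # more than 2 failures triggers an e-mail/SMS


class FailureRouter:
    """Routes failed payments to a call centre or an e-mail/SMS channel."""

    def __init__(self, high_value_users: Set[str]) -> None:
        self.high_value_users = high_value_users  # stand-in for the Redis look-up
        self.failure_counts: Dict[str, int] = defaultdict(int)
        self.outbox: List[Tuple[str, str]] = []   # (channel, user_id) pairs

    def on_failure(self, user_id: str) -> None:
        if user_id in self.high_value_users:
            # High value customers get an immediate call back.
            self.outbox.append(("call_centre", user_id))
            return
        self.failure_counts[user_id] += 1
        # Publish once, at the moment the threshold is first exceeded.
        if self.failure_counts[user_id] == FAILURE_THRESHOLD + 1:
            self.outbox.append(("email_sms", user_id))
```

With Redis itself, the counter could be a key with an expiry, so that counts reset after the chosen interval.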


No Confirmation Message
There is another case where customers who have successfully completed a transaction do not receive a success message. This leads to a lot of confusion and might lead to customers attempting to pay again.
The proposed approach:
  • Whenever a payment message comes in with status "Success", it is pushed into a map in Redis (with a fixed lifetime). "Payment id" is the key and the message is the value.
  • Whenever the funnel data reports completion of the transaction, i.e. the "confirm order id" is generated, Storm checks whether the Redis map holds an entry with the same payment id. If it exists, that map entry is removed.
  • At the end of the lifetime of the map elements in Redis, they are pushed to an SMS/E-mail queue. This is the list of users who have not received a confirmation message.
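
A minimal sketch of this expiring map, again with an in-memory dictionary instead of Redis. The `ConfirmationWatch` class and its method names are hypothetical, and the current time is passed in explicitly to keep the example deterministic.

```python
from typing import Dict, List, Tuple


class ConfirmationWatch:
    """Finds successful payments whose order confirmation never arrived."""

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl_seconds = ttl_seconds
        # payment_id -> (arrival time of the success message, message)
        self.pending: Dict[str, Tuple[float, dict]] = {}

    def on_payment_success(self, payment_id: str, message: dict, now: float) -> None:
        self.pending[payment_id] = (now, message)

    def on_order_confirmed(self, payment_id: str) -> None:
        # Confirmation was generated, so this payment needs no follow-up.
        self.pending.pop(payment_id, None)

    def expire(self, now: float) -> List[dict]:
        """Remove and return messages that outlived the TTL unconfirmed."""
        expired = [
            payment_id
            for payment_id, (arrived, _) in self.pending.items()
            if now - arrived >= self.ttl_seconds
        ]
        return [self.pending.pop(payment_id)[1] for payment_id in expired]
```

Whatever `expire` returns is what would be pushed to the SMS/E-mail queue.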


PERFORMANCE EVALUATION
The traditional approach followed in most E-commerce companies, which uses batch processing to record and analyze payment gateway failures, was compared with the real-time approach this paper proposes, in terms of both efficiency and time saved:
  • Drop Off:
The traditional approach of using scheduled batch jobs took around 300ms to query the payment gateway drop offs for a given period; the job was scheduled every 5 minutes.
The real-time approach took around 20 seconds for the same query; data was recorded every 30 seconds.
  • Multiple Payment Failures & No Confirmation Message Shown:
These issues are hardly handled in the traditional approach. Even if they were, it was only much later, after the customer had left the site. These kinds of actions are usually taken at intervals of 12 or 24 hours.
The real-time approach was tried and payment failures were tackled in real time. It was also possible to contact a customer facing problems as soon as a failure was reported. The response time was a matter of a few seconds, or minutes at the most.
ADVANTAGES
  • Scalability: The size of the data doesn't matter.
  • Real-time Action: Real-time pattern capture, analysis and actions.
  • Increased reporting performance: The absence of any pre-computation in the backend of reports and dashboards leads to extremely fast and efficient reporting.
  • Ease of implementing custom rules: Not limited by the offerings of proprietary software. Custom rules can be created and deployed in no time.

CONCLUSION

Addressing the various issues with payment gateways is one of the major concerns of all e-commerce companies nowadays. The traditional approach to identifying drop offs and multiple payment gateway failures was batch processing. This methodology involves scheduling a job which runs at regular intervals, collecting past payment data, analyzing this data and understanding the reason for payment failures. Since online payment failures lead to real-time customer churn and loss of loyalty, they should be tackled in real time. The proposed approach uses Storm, Redis and a permanent storage layer to tackle payment failures.
The major advantage of this approach is being able to communicate with the customer and provide customer-specific services. This could translate directly into a huge increase in revenue, as payment gateway failures are a major cause of revenue loss in e-commerce, and there is huge potential for improvement with the proposed approach. This real-time customer service could lead to increased customer loyalty and, as a result, increased traffic on the merchant's site.



Saturday, February 8, 2014

Rahul Gandhi's interview by Arnab Goswami



The day after the famous interview was aired, the complete transcript of the interview was available online to read through. On Facebook, I saw that one of my college juniors had made a simple shell script to count the number of occurrences of certain words. This by itself was quite insightful, and I knew I could take it to the next level without much effort by using R and a few of its packages.

I've uploaded the R script to GitHub. The following is the basic flow of the script:

  1. Separate Rahul's and Arnab's conversation into 2 buckets
  2. Remove extra spaces
  3. Remove punctuation
  4. Convert the text to lower case
  5. Remove the stop words
  6. Convert the text to a term document matrix
  7. Rank the words based on their occurrences
  8. Generate the word cloud and also the top 5 words for each speaker

Here is the word cloud.


Thursday, October 10, 2013

Data Hacking through Hollywood


I've been a movie buff for as long as I can remember. Whenever I have some time to kill, I go on various movie websites and check out details or trivia about movies. Sometimes, questions like "which two popular actors have acted together the most?" pop into my head.

Answering such a question would be a cumbersome task if I had to go through the web pages manually. But these questions can be answered quite easily, and many more interesting facts can be found, just by automating the process and performing some elementary analysis.

Tools Used

  1. Python with Beautiful Soup for data scraping
  2. R with Google Visualization for charting
  3. Gephi to perform Social Network Analysis

Data Collection Approach






EDA on the Data

The next step was to ask some basic questions using this data.

Which is the movie with the maximum number of popular actors?
Turns out, "The Thin Red Line" (1998) had the maximum. Second is "The Grand Budapest Hotel", which hasn't been released yet. This movie has quite a diverse cast, ranging from Edward Norton and Jude Law to Owen Wilson and Bill Murray.




Which are the titles with the maximum number of popular actors?
There are some movies which have been made multiple times. Hamlet leads the pack here. There were mainly 3 versions of it. The one that is worth a watch is the 1996 version. Second is The Three Musketeers. Both Charlie Sheen's and Orlando Bloom's versions were a disaster.




Which popular actors have the maximum number of movies?
Christopher Lee, aka Saruman, aka Count Dooku, has done the maximum number of movies. Most of the people who have done a lot of movies are quite elderly, which seems obvious. Michael Madsen seems to be the odd one out.




Which popular actors have the minimum number of movies?
Haing S. Ngor, who is an Oscar winner, has the least number of movies. He was also the first Asian to win the Oscar for Best Supporting Actor. Heath Ledger also comes in this list, which is quite unfortunate.




Which are the two actors who have acted together the most?
The Estevez family is in this list with Martin Sheen, Charlie Sheen and Emilio Estevez, with Martin Sheen and Charlie Sheen topping the list. Matt Damon and Ben Affleck would have been an easy prediction for this list even without the data analysis. I always thought Robert De Niro must have appeared with Al Pacino the most, but it turns out he's acted more with Harvey Keitel.
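
A question like this boils down to counting, for every pair of actors, how many casts contain both of them. A minimal sketch of that counting, assuming the scraped data has been arranged as a hypothetical dictionary from movie title to cast list:

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List


def count_costar_pairs(cast_by_movie: Dict[str, List[str]]) -> Counter:
    """Count how many movies each pair of actors shares."""
    pairs: Counter = Counter()
    for cast in cast_by_movie.values():
        # Sort so that (A, B) and (B, A) are counted as the same pair.
        for a, b in combinations(sorted(set(cast)), 2):
            pairs[(a, b)] += 1
    return pairs
```

`count_costar_pairs(casts).most_common(10)` then lists the ten pairs that have acted together the most.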






Social Network Analysis on Actors

Since I had the list of actors and their respective movies, I could perform SNA on it. What this means in simple terms is that Will Smith and Tommy Lee Jones are connected as they acted together in MIB, and Tommy Lee Jones and Chris Evans are connected as they acted together in Captain America. If you keep repeating this and visualize it, you'll form some sort of ugly spider web.

Let's start again by asking some questions.


Which are the Top 5 actors who have acted with the largest number of popular actors?

The following list will seem quite obvious

  1. Robert De Niro
  2. Samuel L. Jackson
  3. Anthony Hopkins
  4. Donald Sutherland
  5. Harry Dean Stanton
The actors shown in dark purple are the five listed above.
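
The ranking above came out of Gephi, but the idea is simple enough to sketch in Python: connect every two actors who share a movie, then count each actor's distinct co-stars. The function names and the movie-to-cast dictionary are assumptions for this illustration.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def build_graph(cast_by_movie: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Map each actor to the set of actors they have shared a movie with."""
    graph: Dict[str, Set[str]] = {}
    for cast in cast_by_movie.values():
        for actor in cast:
            graph.setdefault(actor, set())
        for a, b in combinations(set(cast), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def most_connected(graph: Dict[str, Set[str]], n: int = 5) -> List[Tuple[str, int]]:
    """Actors ranked by number of distinct co-stars (ties broken by name)."""
    ranked = sorted(graph.items(), key=lambda item: (-len(item[1]), item[0]))
    return [(actor, len(costars)) for actor, costars in ranked[:n]]


# The two connections mentioned earlier in the post.
graph = build_graph({
    "MIB": ["Will Smith", "Tommy Lee Jones"],
    "Captain America": ["Tommy Lee Jones", "Chris Evans"],
})
print(most_connected(graph, 1))
```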



How are a few of the actors connected?
I took a few actors and tried to see how they are connected to each other.

Eddie Murphy and George Clooney
Turns out that they have Arnold Schwarzenegger in common. Eddie Murphy and Arnold Schwarzenegger are acting together in Triplets, which is a sequel to Twins; Eddie Murphy plays the third brother of Arnold and Danny DeVito in the movie. Arnold Schwarzenegger and George Clooney acted together in Batman & Robin.





Charlie Sheen and Ben Affleck
Charlie Sheen acted with Cuba Gooding Jr in Machete Kills, and Cuba Gooding Jr acted with Ben Affleck in Pearl Harbor.





Nicolas Cage and Will Smith

Christian Slater acted with Nicolas Cage in Windtalkers, and Christian Slater acted with Will Smith in "Where the Day Takes You", which was Will Smith's first movie.
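
Chains like these are shortest paths in the actor graph. The post found them visually in Gephi; a breadth-first search is one standard way to compute them, sketched here under the assumption that the graph is stored as a dictionary from each actor to a set of co-stars.

```python
from collections import deque
from typing import Dict, List, Optional, Set


def shortest_chain(graph: Dict[str, Set[str]], start: str, goal: str) -> Optional[List[str]]:
    """Shortest list of actors linking start to goal, or None if unconnected."""
    if start not in graph or goal not in graph:
        return None
    previous: Dict[str, Optional[str]] = {start: None}
    queue = deque([start])
    while queue:
        actor = queue.popleft()
        if actor == goal:
            # Walk back through the predecessors to rebuild the chain.
            chain: List[str] = []
            step: Optional[str] = actor
            while step is not None:
                chain.append(step)
                step = previous[step]
            return chain[::-1]
        for costar in graph[actor]:
            if costar not in previous:
                previous[costar] = actor
                queue.append(costar)
    return None


# One of the chains described above, as a tiny graph.
graph = {
    "Charlie Sheen": {"Cuba Gooding Jr"},
    "Cuba Gooding Jr": {"Charlie Sheen", "Ben Affleck"},
    "Ben Affleck": {"Cuba Gooding Jr"},
}
print(shortest_chain(graph, "Charlie Sheen", "Ben Affleck"))
```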



Which is the actor through whom you can get connected to most of the actors?

This one is a tricky question to understand. Let's say there are 4 main cities in the North, South, West and East. Each of the main cities has a few small cities around it. All 4 main cities are connected at a common point in the middle, which is a town. If someone from a small city in the south wants to go to a small city in the north, they have to pass through the town. The town acts like a main junction for all the cities.

The cities are actors and the town is the actor through whom you can reach most of the other actors.

The town turns out to be Eli Wallach, the Ugly from The Good, the Bad and the Ugly.
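
The "town" idea corresponds to what graph tools call betweenness centrality: how many shortest paths between other nodes pass through a given node. The post doesn't name the measure, so treating it as betweenness is an assumption. The sketch below follows Brandes' algorithm for unweighted, undirected graphs and expects every neighbour to also appear as a key of the dictionary.

```python
from collections import deque
from typing import Dict, List, Set


def betweenness(graph: Dict[str, Set[str]]) -> Dict[str, float]:
    """Unnormalized betweenness centrality of every node."""
    score = {node: 0.0 for node in graph}
    for source in graph:
        order: List[str] = []                        # nodes by increasing distance
        parents: Dict[str, List[str]] = {node: [] for node in graph}
        paths = {node: 0 for node in graph}          # shortest paths from source
        dist = {node: -1 for node in graph}
        paths[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:                      # first time w is reached
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:           # v lies on a shortest path to w
                    paths[w] += paths[v]
                    parents[w].append(v)
        dependency = {node: 0.0 for node in graph}
        for w in reversed(order):                    # farthest nodes first
            for v in parents[w]:
                dependency[v] += paths[v] / paths[w] * (1 + dependency[w])
            if w != source:
                score[w] += dependency[w]
    # Every undirected path was counted once from each end.
    return {node: value / 2 for node, value in score.items()}
```

On the cities example (a town joined to 4 main cities, each with one small city attached), the town gets the highest score, since every trip between different regions passes through it.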



How are the actors clustered?

I used the Modularity measure to group the actors together. In simple words, the actors with the same color are closely connected to each other based on the movies they have acted in. The actors with different colors are not closely connected.

Hiccup

During data gathering, there was a slight hiccup in the approach. Once I had the list of popular actors, I had to get their respective URLs. To get the URLs, I decided to use Google's Search API. Turns out, I could only get URLs for a limited number of search strings, as Google had deprecated their API and enabled some sort of dynamic throttling.

Since the generic approach wouldn't work, I had to use the search URL of a particular movie website, adding the name of the actor via the '?q' parameter, to get their respective filmography pages.
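
Building such a search URL is a one-liner with Python's standard library. The base URL below is a placeholder, since the post does not name the movie website; only the `q` parameter comes from the text.

```python
from urllib.parse import urlencode

# Hypothetical search endpoint; the actual movie website is not named in the post.
SEARCH_URL = "https://movies.example.com/find"


def actor_search_url(actor_name: str, base_url: str = SEARCH_URL) -> str:
    """Return the search URL for an actor, with the name safely encoded."""
    return f"{base_url}?{urlencode({'q': actor_name})}"


print(actor_search_url("Eli Wallach"))
```

The returned page would still have to be parsed (for example with Beautiful Soup) to pick out the link to the actor's filmography.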