
Text analysis demonstration in Python


Today I am going to do some interesting text sentiment analysis in Python; the tools I need are provided by the GraphLab Create package.

What I am going to do is analyze baby-product review data from Amazon.com. The dataset is fairly large (well over 160,000 review rows, judging by the training log further down), but it is still easy to handle in Python.

#1 load data

***********************************************

import graphlab as gr

products = gr.SFrame("amazon_baby.gl")

products.show()

By using the .show() operation, we can take a quick look at the dataset.

There are well over 160K rows of reviews covering many different products; each row contains the text of a review and an integer rating from 1 to 5. What I want to explore is the relationship between "review" and "rating", i.e. I want to build a model that predicts a customer's sentiment (positive or negative) from the words in their review.
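Before the GraphLab code, here is a rough plain-Python sketch of what a per-review word count looks like. The review string is made up, and GraphLab's text_analytics.count_words builds essentially this kind of dictionary for every review (punctuation handling details are glossed over here).

from collections import Counter

# toy example, not a review from the real dataset
review = "love this toy my baby loves it great quality"
word_count = dict(Counter(review.lower().split()))
print(word_count)  # e.g. {'love': 1, 'great': 1, 'baby': 1, 'quality': 1, ...}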

********************************************************************************************

products["word_count"]=gr.text_analytics.count_words(products["review"])

giraffe_products=products[products["name"]=='Vulli Sophie the Giraffe Teether']

giraffe_products["rating"].show(view="Categorical")

products=products[products["rating"]!=3]

products["sentiment"]=products["rating"]>3

products["sentiment"].show(view="Categorical")

train_data,test_data=products.random_split(.8,seed=1)

sentiment_model=gr.logistic_classifier.create(train_data, target="sentiment", features=["word_count"],validation_set=test_data)

 

PROGRESS: Logistic regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 133448
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 1
PROGRESS: Number of unpacked features : 219217
PROGRESS: Number of coefficients      : 219218
PROGRESS: Starting L-BFGS
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: | Iteration | Passes   | Step size | Elapsed Time | Training-accuracy | Validation-accuracy |
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: | 1         | 5        | 0.000002  | 2.660477     | 0.841481          | 0.839989            |
PROGRESS: | 2         | 9        | 3.000000  | 4.087989     | 0.947425          | 0.894877            |
PROGRESS: | 3         | 10       | 3.000000  | 4.605857     | 0.923768          | 0.866232            |
PROGRESS: | 4         | 11       | 3.000000  | 5.093203     | 0.971779          | 0.912743            |
PROGRESS: | 5         | 12       | 3.000000  | 5.575046     | 0.975511          | 0.908900            |
PROGRESS: | 6         | 13       | 3.000000  | 6.090912     | 0.899991          | 0.825967            |
PROGRESS: | 10        | 18       | 1.000000  | 8.355521     | 0.988715          | 0.916256            |
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: TERMINATED: Iteration limit reached.
PROGRESS: This model may not be optimal. To improve it, consider increasing `max_iterations`.

 

The log shows a training accuracy of 0.9887 against a validation accuracy of 0.9163, so the model overfits somewhat, but roughly 92% held-out accuracy is a reasonable starting point. Next, evaluate the model and use it to rank the Sophie the Giraffe reviews.

# evaluate on the held-out test set
sentiment_model.evaluate(test_data, metric="roc_curve")

sentiment_model.show(view="Evaluation")

# score every Sophie the Giraffe review with the predicted probability of positive sentiment
giraffe_products["predicted_sentiment"] = sentiment_model.predict(giraffe_products, output_type="probability")

# sort from most to least positive; the top row is the most glowing review
giraffe_products = giraffe_products.sort("predicted_sentiment", ascending=False)

giraffe_products[0]["review"]
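One extra thing worth a look: which words drive the predictions. This is just a sketch, assuming the model's coefficients table stores each word in its index column and its learned weight in a value column; that matches the coefficients SFrame printed in the second part below, but the exact column names are an assumption on my part.

# sketch: rank words by their learned weight (largest = strongest positive signal)
coefficients = sentiment_model["coefficients"]
coefficients.sort("value", ascending=False).print_rows(num_rows=10)  # most positive words
coefficients.sort("value", ascending=True).print_rows(num_rows=10)   # most negative words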

 

******************************************************************************************************

In the second part, I build a much simpler model: instead of the full bag of words, the features are the counts of a handful of hand-picked positive and negative words.

import graphlab as gr

products = gr.SFrame("amazon_baby.gl")

# same preprocessing as before: drop 3-star reviews, define sentiment, count words
products = products[products["rating"] != 3]
products["sentiment"] = products["rating"] > 3
products["word_count"] = gr.text_analytics.count_words(products["review"])

# hand-picked positive and negative words to use as features
words = ["awesome", "great", "fantastic", "amazing", "love",
         "horrible", "bad", "terrible", "awful", "wow", "hate"]

# for each word, add a column holding its count in every review
for w in words:
    products[w + "_count"] = products["word_count"].apply(lambda counts, w=w: counts.get(w, 0))

selected_words = [w + "_count" for w in words]

train_data, test_data = products.random_split(0.8, seed=0)

sentiment_model = gr.logistic_classifier.create(train_data, target="sentiment", features=selected_words, validation_set=test_data)

# inspect the learned weight of each selected word
sentiment_model["coefficients"].print_rows(num_rows=12)

# overall accuracy (and other metrics) on the held-out test data
print(sentiment_model.evaluate(test_data))


Who Am I?

Hello, my name is Tongda (Carl). I am currently a Math Finance graduate student at BU. I am a fan of innovation and fun ideas! You can reach me on LinkedIn :)
