
Text analysis demonstration in Python


Today I am going to do some interesting text sentiment analysis in Python; the tools I need are provided by the GraphLab Create package.

What I am going to do is analyze baby-product review data from Amazon.com. The dataset is fairly large (well over 160,000 review rows, judging by the training log further down), but it is still easy to handle in Python.

#1 load data

***********************************************

import graphlab as gr

products = gr.SFrame("amazon_baby.gl")

products.show()

By using the .show() operation, we can take a quick look at the dataset.

There are well over 160K rows of reviews covering many different products; each row contains the text of a review and an integer rating from 1 to 5. What I want to explore is the relationship between "review" and "rating", i.e. I want to build a model that predicts a customer's sentiment (positive or negative) from the words in their review.
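Before the GraphLab code, here is a rough plain-Python sketch of what a per-review word count looks like. The review string is made up, and GraphLab's text_analytics.count_words builds essentially this kind of dictionary for every review (punctuation handling details are glossed over here).

from collections import Counter

# toy example, not a review from the real dataset
review = "love this toy my baby loves it great quality"
word_count = dict(Counter(review.lower().split()))
print(word_count)  # e.g. {'love': 1, 'great': 1, 'baby': 1, 'quality': 1, ...}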

********************************************************************************************

products["word_count"]=gr.text_analytics.count_words(products["review"])

giraffe_products=products[products["name"]=='Vulli Sophie the Giraffe Teether']

giraffe_products["rating"].show(view="Categorical")

products=products[products["rating"]!=3]

products["sentiment"]=products["rating"]>3

products["sentiment"].show(view="Categorical")

train_data,test_data=products.random_split(.8,seed=1)

sentiment_model=gr.logistic_classifier.create(train_data, target="sentiment", features=["word_count"],validation_set=test_data)

 

PROGRESS: Logistic regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 133448
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 1
PROGRESS: Number of unpacked features : 219217
PROGRESS: Number of coefficients      : 219218
PROGRESS: Starting L-BFGS
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: | Iteration | Passes   | Step size | Elapsed Time | Training-accuracy | Validation-accuracy |
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: | 1         | 5        | 0.000002  | 2.660477     | 0.841481          | 0.839989            |
PROGRESS: | 2         | 9        | 3.000000  | 4.087989     | 0.947425          | 0.894877            |
PROGRESS: | 3         | 10       | 3.000000  | 4.605857     | 0.923768          | 0.866232            |
PROGRESS: | 4         | 11       | 3.000000  | 5.093203     | 0.971779          | 0.912743            |
PROGRESS: | 5         | 12       | 3.000000  | 5.575046     | 0.975511          | 0.908900            |
PROGRESS: | 6         | 13       | 3.000000  | 6.090912     | 0.899991          | 0.825967            |
PROGRESS: | 10        | 18       | 1.000000  | 8.355521     | 0.988715          | 0.916256            |
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: TERMINATED: Iteration limit reached.
PROGRESS: This model may not be optimal. To improve it, consider increasing `max_iterations`.

 

The log shows a training accuracy of 0.9887 against a validation accuracy of 0.9163, so the model overfits somewhat, but roughly 92% held-out accuracy is a reasonable starting point. Next, evaluate the model and use it to rank the Sophie the Giraffe reviews.

# evaluate on the held-out test set
sentiment_model.evaluate(test_data, metric="roc_curve")

sentiment_model.show(view="Evaluation")

# score every Sophie the Giraffe review with the predicted probability of positive sentiment
giraffe_products["predicted_sentiment"] = sentiment_model.predict(giraffe_products, output_type="probability")

# sort from most to least positive; the top row is the most glowing review
giraffe_products = giraffe_products.sort("predicted_sentiment", ascending=False)

giraffe_products[0]["review"]
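One extra thing worth a look: which words drive the predictions. This is just a sketch, assuming the model's coefficients table stores each word in its index column and its learned weight in a value column; that matches the coefficients SFrame printed in the second part below, but the exact column names are an assumption on my part.

# sketch: rank words by their learned weight (largest = strongest positive signal)
coefficients = sentiment_model["coefficients"]
coefficients.sort("value", ascending=False).print_rows(num_rows=10)  # most positive words
coefficients.sort("value", ascending=True).print_rows(num_rows=10)   # most negative words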

 

******************************************************************************************************

In the second part, I build a much simpler model: instead of the full bag of words, the features are the counts of a handful of hand-picked positive and negative words.

import graphlab as gr

products = gr.SFrame("amazon_baby.gl")

# same preprocessing as before: drop 3-star reviews, define sentiment, count words
products = products[products["rating"] != 3]
products["sentiment"] = products["rating"] > 3
products["word_count"] = gr.text_analytics.count_words(products["review"])

# hand-picked positive and negative words to use as features
words = ["awesome", "great", "fantastic", "amazing", "love",
         "horrible", "bad", "terrible", "awful", "wow", "hate"]

# for each word, add a column holding its count in every review
for w in words:
    products[w + "_count"] = products["word_count"].apply(lambda counts, w=w: counts.get(w, 0))

selected_words = [w + "_count" for w in words]

train_data, test_data = products.random_split(0.8, seed=0)

sentiment_model = gr.logistic_classifier.create(train_data, target="sentiment", features=selected_words, validation_set=test_data)

# inspect the learned weight of each selected word
sentiment_model["coefficients"].print_rows(num_rows=12)

# overall accuracy (and other metrics) on the held-out test data
print(sentiment_model.evaluate(test_data))


Who Am I?

Hello, my name is Tongda (Carl). I am currently a Math Finance graduate student at BU. I am a fan of innovation and fun ideas! You can reach me on LinkedIn :)
