By David Smith
Astronomer and budding data scientist Julia Silge has been using R for less than a year but, judging by the posts on her blog, has already become very proficient at using it to analyze some interesting data sets. She has posted detailed analyses of water consumption data and health care indicators from the Utah Open Data Catalog, religious affiliation data from the Association of Statisticians of American Religious Bodies, and demographic data from the American Community Survey (that’s the same dataset we mentioned on Monday).
In a two-part series, Julia analyzed another interesting dataset: her own archive of 10,000 tweets. (Julia provides all the R code for her analyses, so you can download your own Twitter archive and follow along.) In part one, Julia imports her Twitter archive into R with a single line of code:
tweets <- read.csv("./tweets.csv", stringsAsFactors = FALSE)
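To see what the `stringsAsFactors = FALSE` argument does, here is a minimal sketch that reads a tiny stand-in file instead of a real archive. The column names `tweet_id` and `text` and the two rows are made up for illustration; a real Twitter archive has more columns. In R versions before 4.0.0 the default for this argument was `TRUE`, which would have turned the tweet text into a factor rather than plain character strings.

```r
# Write a tiny stand-in archive to a temporary file (toy data, not a real archive).
path <- tempfile(fileext = ".csv")
writeLines(c("tweet_id,text",
             "1,Hello world",
             "2,Learning R is fun"), path)

# Same call as in the post, pointed at the temporary file.
tweets <- read.csv(path, stringsAsFactors = FALSE)

# The text column is kept as character, ready for text processing.
str(tweets)
```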
She then uses the lubridate package to clean up the timestamps, and the ggplot2 package to create some simple charts of her Twitter activity. One such chart takes just a few lines of R code and shows her Twitter activity over time, categorized by type of tweet (direct tweets, replies, and retweets).
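The steps just described might look roughly like the sketch below. It is not Julia's actual code: the toy data frame stands in for the archive, and the column names `timestamp`, `in_reply_to_status_id`, and `retweeted_status_id` are assumptions about the layout of tweets.csv, as is the rule used to label each tweet's type.

```r
library(lubridate)
library(ggplot2)

# Toy stand-in for the archive; column names are assumptions about tweets.csv.
tweets <- data.frame(
  timestamp = c("2015-01-05 14:03:00 +0000",
                "2015-06-17 02:45:10 +0000",
                "2016-02-01 19:20:30 +0000"),
  in_reply_to_status_id = c(NA, 123, NA),
  retweeted_status_id = c(NA, NA, 456),
  stringsAsFactors = FALSE
)

# Parse the timestamp strings into date-time values.
tweets$timestamp <- ymd_hms(tweets$timestamp)

# Label each tweet by type (assumed rule: a non-missing ID marks a reply or retweet).
tweets$type <- "tweet"
tweets$type[!is.na(tweets$in_reply_to_status_id)] <- "reply"
tweets$type[!is.na(tweets$retweeted_status_id)] <- "retweet"

# Histogram of activity over time, stacked by tweet type.
p <- ggplot(tweets, aes(x = timestamp, fill = type)) +
  geom_histogram(bins = 30) +
  labs(x = "Time", y = "Number of tweets", fill = "Type")
print(p)
```

With a real archive, the `data.frame(...)` call would be replaced by the `read.csv` import shown earlier.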
The really interesting part of the analysis comes in part two, where Julia uses the tm package (which provides a number of text mining functions for R) and the syuzhet package (which implements sentiment scoring based on the NRC Word-Emotion Association Lexicon) to analyze the sentiment of her tweets. Categorizing all 10,000 tweets as representing “anger”, “fear”, “surprise” and other sentiments, and generating a positive and negative sentiment score for each, is as simple as this one line of R code:
mySentiment <- get_nrc_sentiment(tweets$text)
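As a rough illustration of what that call returns, the sketch below runs `get_nrc_sentiment` on two made-up sentences in place of `tweets$text`. The result is a data frame with one row per input string and ten columns: eight emotions plus `negative` and `positive`. Totalling the columns with `colSums` is just one simple way to summarize the scores, not necessarily the approach taken in the original analysis.

```r
library(syuzhet)

# Toy text standing in for tweets$text (made-up sentences).
text <- c("I love this wonderful sunny day",
          "That was a terrible, frightening storm")

# One row per input string; eight emotion columns plus negative and positive.
mySentiment <- get_nrc_sentiment(text)
print(names(mySentiment))

# Total number of lexicon matches per category across all texts.
sentimentTotals <- colSums(mySentiment)
print(sentimentTotals)
```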
Using those sentiment scores, Julia was easily able to summarize the sentiments expressed in her tweet …
Source: r-bloggers.com