## Twitter Sentiment Analysis in Frequency Domain

### Twitter Harvesting and Analysis

#### Harvester

Two types of APIs are available for collecting data from Twitter: the Streaming API and the Search API. The Streaming API is mainly used to pull tweets as they are created; only tweets satisfying custom criteria are returned. The Search API, by contrast, queries past tweets that satisfy given search criteria. Developers can search tweets by username, keyword and location.

The Tweepy library, which supports both the Streaming and the Search API, is used in this project. To improve overall efficiency, tweets are analyzed before they are stored in the database. The filter used in the Streaming API is set to the geolocation of Melbourne, Australia. Tweets without coordinates are discarded immediately.
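The coordinate check described above can be sketched in a few lines. This is a minimal illustration, not the project's harvester: the bounding-box values and the function names `point_of` and `keep_tweet` are assumptions, and the tweet is assumed to be the raw JSON dictionary, whose `coordinates` field is a GeoJSON point with longitude first.

```python
# Hypothetical bounding box for Greater Melbourne: (west, south, east, north).
# The exact values are an assumption, not taken from the project.
MELBOURNE_BOX = (144.59, -38.43, 145.51, -37.51)


def point_of(tweet):
    """Return (longitude, latitude) of a tweet dict, or None if it has no point."""
    geo = tweet.get("coordinates")
    if not geo or geo.get("type") != "Point":
        return None
    lon, lat = geo["coordinates"]  # GeoJSON order: longitude first
    return lon, lat


def keep_tweet(tweet, box=MELBOURNE_BOX):
    """True if the tweet carries coordinates that fall inside the box."""
    point = point_of(tweet)
    if point is None:
        return False  # tweets without coordinates are discarded
    lon, lat = point
    west, south, east, north = box
    return west <= lon <= east and south <= lat <= north
```

Running the check before any further analysis means that discarded tweets cost no sentiment computation and no database write.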

One primary challenge in data collection is rate limiting. Each authentication key has a window of 15 minutes, within which at most 180 Search API calls are allowed. Both the Streaming API and the Search API require a set of four credentials (API key, API secret, access token and access token secret) for authentication. Twitter also restricts the number of connections that can be opened with a single set of credentials. In order to retrieve more tweets, 6 sets of authentication keys were created for the harvester in this project.
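One possible way to spread calls over several credential sets is a small rotator that tracks the budget of each key. The sketch below is an assumption about how this could be done, not the project's code: the class name `KeyRotator` is hypothetical, and the window is approximated as starting when the rotator is created or last reset.

```python
import time


class KeyRotator:
    """Hand out credential sets while respecting a per-key call budget."""

    def __init__(self, keys, max_calls=180, window=15 * 60, clock=time.monotonic):
        self.keys = list(keys)
        self.max_calls = max_calls
        self.window = window
        self.clock = clock
        # For each key: [window start time, calls used in that window]
        self.usage = {key: [clock(), 0] for key in self.keys}

    def acquire(self):
        """Return a key with budget left, or None if all keys are exhausted."""
        now = self.clock()
        for key in self.keys:
            start, used = self.usage[key]
            if now - start >= self.window:
                start, used = now, 0  # window expired: reset the budget
            if used < self.max_calls:
                self.usage[key] = [start, used + 1]
                return key
            self.usage[key] = [start, used]
        return None
```

With 6 keys and the limits quoted above, such a rotator would allow up to 6 × 180 = 1080 Search API calls per 15-minute window before `acquire()` returns `None` and the caller has to wait.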

#### Historical Tweets

Another challenge is to collect tweets posted in the past. The Streaming API only returns real-time tweets, and a keyword search through the Search API only returns tweets posted within the last 7 days. Thus, the Streaming API is used together with the Search API. First, the Streaming API collects real-time tweets. Then, for each real-time tweet, the harvester takes the username of the author and retrieves the 50 most recent posts from that user's timeline via the Search API.
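The two-step collection can be sketched as follows. The code assumes that `api` behaves like a `tweepy.API` instance, whose `user_timeline` method accepts `screen_name` and `count`, and that real-time tweets arrive as raw JSON dictionaries. The function name `harvest_history` and the choice to query each author only once are assumptions made for illustration.

```python
def harvest_history(api, realtime_tweets, count=50):
    """Map each author seen in the stream to a list of their recent posts.

    `api` is assumed to expose `user_timeline(screen_name=..., count=...)`,
    as `tweepy.API` does.
    """
    history = {}
    for tweet in realtime_tweets:
        author = tweet["user"]["screen_name"]
        if author in history:
            continue  # do not spend another API call on the same timeline
        history[author] = list(api.user_timeline(screen_name=author, count=count))
    return history
```

Skipping authors that were already queried saves calls against the rate limit, at the cost of missing posts made after the first query.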

#### Duplicate Tweets

The tweets returned by the Twitter API contain many duplicates, so it is essential to discard them before they are put into CouchDB. CouchDB allows the user to set the document ID _id. The tweet ID is copied to the document ID as a unique identifier.
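A minimal sketch of this idea, assuming each tweet is the raw JSON dictionary with its `id_str` field; the function names are hypothetical. CouchDB itself refuses to create a second document with an existing `_id` (it answers with a 409 conflict), so the in-memory set below only avoids sending such requests in the first place.

```python
def to_document(tweet):
    """Build a CouchDB document whose _id is the tweet ID."""
    doc = dict(tweet)
    doc["_id"] = tweet["id_str"]
    return doc


def drop_duplicates(tweets):
    """Yield one document per distinct tweet ID, keeping the first occurrence."""
    seen = set()
    for tweet in tweets:
        doc = to_document(tweet)
        if doc["_id"] in seen:
            continue
        seen.add(doc["_id"])
        yield doc
```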

#### Retweets

Twitter users not only post their own tweets but also share other users' tweets, i.e. retweet. In this project, all retweets are ignored. Specifically, any tweet with retweeted = true is discarded.

#### Sentiment Analysis

The TextBlob package is used to evaluate sentiment polarity. The polarity score is a float within the range [-1.0, 1.0]. The sentiment score is computed and then stored in CouchDB together with the tweet ID, text, timestamp and coordinates.
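The scoring step might look like the sketch below. `TextBlob(text).sentiment.polarity` is the TextBlob call that yields the polarity score; the field names of the stored record and the function names are assumptions, since the document schema is not given here.

```python
def textblob_polarity(text):
    """Polarity in [-1.0, 1.0] as reported by TextBlob."""
    from textblob import TextBlob  # imported lazily so the module loads without it

    return TextBlob(text).sentiment.polarity


def build_record(tweet, scorer=textblob_polarity):
    """Keep only the stored fields: ID, text, timestamp, coordinates, sentiment."""
    return {
        "_id": tweet["id_str"],
        "text": tweet["text"],
        "created_at": tweet["created_at"],
        "coordinates": tweet["coordinates"]["coordinates"],
        "sentiment": scorer(tweet["text"]),  # hypothetical field name
    }
```

Passing the scorer as a parameter keeps the record builder independent of TextBlob, which makes it easy to swap in another sentiment model.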

### Visualization

#### CouchDB View

The built-in `_stats` reduce function is applied, which is implemented in Erlang instead of JavaScript. This function returns the sum of sentiment scores, the count of tweets, the minimum and maximum scores, and the sum of squares of sentiment scores. When this view is queried with the parameter group=true, CouchDB aggregates all documents by the key pair (day of week, hour).

    {
      "rows": [
        {
          "value": {
            "count": 10000,
            "min": -1.0,
            "sumsqr": 200,
            "max": 1.0,
            "sum": 1000
          },
          "key": ["Mon", "0"]
        }
      ]
    }
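To make the aggregation concrete, the following Python sketch reproduces what a grouped `_stats` reduce computes. It assumes that the view's map function emits the key pair (day, hour) with the sentiment score as the value, which is inferred from the sample output rather than stated in the text; `stats_by_key` is a hypothetical name.

```python
from collections import defaultdict


def stats_by_key(rows):
    """Group (key, value) pairs and compute sum, count, min, max and sumsqr per key."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[tuple(key)].append(value)
    return {
        key: {
            "sum": sum(values),
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "sumsqr": sum(v * v for v in values),
        }
        for key, values in groups.items()
    }
```

In production the reduce runs inside CouchDB, so this function is only useful for checking results offline.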


The average value of sentiment scores is used as an indicator of the overall sentiment of an hour $h$ on a day $d$. $$\operatorname{E}\left[ sentiment_{d,\ h} \right] = \frac{sum_{d,\ h}}{count_{d,\ h}}$$

The standard deviation is derived to describe how far a set of sentiment scores are spread out from their average value. $$\sigma \left[ sentiment_{d,\ h} \right] = \sqrt{\frac{sumsqr_{d,\ h}}{count_{d,\ h}} - \left( \operatorname{E}\left[ sentiment_{d,\ h} \right] \right)^2}$$
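As a worked example, both quantities can be derived directly from one `value` object of the view. The sketch below uses the figures from the sample row above; the function name `mean_and_std` is hypothetical.

```python
import math


def mean_and_std(stats):
    """Derive the mean and population standard deviation from a _stats value."""
    mean = stats["sum"] / stats["count"]
    variance = stats["sumsqr"] / stats["count"] - mean ** 2
    # Rounding can push a tiny variance slightly below zero; clamp it.
    return mean, math.sqrt(max(variance, 0.0))


mean, std = mean_and_std({"sum": 1000, "count": 10000, "sumsqr": 200})
```

For the sample row, the mean is 1000 / 10000 = 0.1 and the variance is 200 / 10000 − 0.1² = 0.01, so the standard deviation is about 0.1.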

#### Heatmap

A heatmap is used to visualize the differences in average sentiment scores between time slots, since the color spectrum makes discrepancies in sentiment easy to spot.

The heatmap roughly shows that tweets seem happier on Saturdays. If sentiment scores are grouped by day, it can be clearly seen that Saturday, together with Sunday and Friday, constitutes a sentiment peak. This result is reasonable because people feel more relaxed during the weekend. On Monday, the beginning of five working days, people are relatively unhappy. $$\operatorname{E}\left[ sentiment_d \right] = \frac{\sum_{h=0}^{23} sum_{d,\ h}}{\sum_{h=0}^{23} count_{d,\ h}}$$

Similarly, the heatmap suggests that people are down late at night. If sentiment scores are grouped by hour, it can be clearly seen that tweets posted at 5 am are the most negative ones and that people are happiest around 9 am. The result is understandable: for example, people who suffer from insomnia, work the night shift or have to get up very early tend to express negative feelings. $$\operatorname{E}\left[ sentiment_h \right] = \frac{\sum_{d=0}^{6} sum_{d,\ h}}{\sum_{d=0}^{6} count_{d,\ h}}$$ $$\operatorname{Var}\left[ sentiment_h \right] = \frac{\sum_{d=0}^{6} sumsqr_{d,\ h}}{\sum_{d=0}^{6} count_{d,\ h}} - \left( \operatorname{E}\left[ sentiment_h \right] \right)^2$$
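The grouped averages pool the sums and counts of several (day, hour) cells rather than averaging the per-cell means, so busy cells weigh more. A minimal sketch, assuming the view result has been loaded into a dictionary keyed by (day, hour); all names are hypothetical.

```python
def pooled_mean(cells):
    """Mean over several (day, hour) cells, weighted by tweet count."""
    total = sum(cell["sum"] for cell in cells)
    count = sum(cell["count"] for cell in cells)
    return total / count


def mean_by_hour(stats, hour):
    """Pool one hour of the day over all days of the week."""
    return pooled_mean([v for (d, h), v in stats.items() if h == hour])


def mean_by_day(stats, day):
    """Pool one day of the week over all 24 hours."""
    return pooled_mean([v for (d, h), v in stats.items() if d == day])
```

The same pooling applies to `sumsqr` when the variance per hour is needed.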

##### Limitations

One limitation is that the difference between the peak and the valley (≈ 0.1) is much smaller than the standard deviation (≈ 0.3), although the large standard deviation can be explained by the fact that people experience different situations. In addition, the standard deviation is smallest at the lowest point (5 am), which may indicate that the samples are insufficient.

The average sentiment is correlated with the number of tweets. For example, Saturday has the most tweets and Monday has the fewest. There are also some discrepancies. Specifically, Wednesday is a valley of sentiment but a peak in volume. Also, Sunday has fewer tweets than Friday but is happier than Friday.

$$\operatorname{E}\left[ count_h \right] = \sum_{d=0}^{6} \frac{count_{d,\ h}}{7}$$ $$\operatorname{Var}\left[ count_h \right] = \frac{\sum_{d=0}^{6} (count_{d,\ h})^2}{7} - \left( \operatorname{E}\left[ count_h \right] \right)^2$$

#### Spectral Analysis

As shown in the heatmap, the sentiment trend repeats a similar pattern every day. From dawn to morning, the sentiment score rises from the bottom to the peak and remains high for several hours. In the afternoon, the sentiment score drops slightly and then reaches a smaller peak during TV prime time. Afterwards, the sentiment score drops again until it reaches the valley, and another cycle begins.

It can be easily concluded that the sentiment trend has a period of 24 hours. However, more details remain to be discovered. Just as a prism disperses white light into ROYGBIV, spectral analysis can decompose the Twitter sentiment trend into its frequency components.

To construct a time sequence $s[n]$, sentiment scores are concatenated first from 00 to 23 and then from Mon to Sun, i.e. Mon 00 → Mon 01 → … → Mon 23 → Tue 00 → Tue 01 → … → Sun 22 → Sun 23. As a result, a time series of $168 = 24 \times 7$ points is obtained. $$s[n] = \frac{sum_{d,\ h}}{count_{d,\ h}} \qquad n = 24d + h = 0, 1, \ldots, 167$$ Then, the Discrete Fourier Transform is applied to $s[n]$ and the magnitudes of the spectrum $\hat{S}[k]$ are obtained. $$S[k] = \sum_{n=0}^{167} e^{-i\frac{2\pi}{168} nk} s[n] = \sum _{n=0}^{167} s[n] \left( \cos \frac{2\pi kn}{168} - i \sin \frac{2\pi kn}{168} \right)$$ $$\hat{S}[k] = |S[k]| = \sqrt{ \left( \sum _{n=0}^{167} s[n] \cos \frac{2\pi kn}{168} \right)^2 + \left( \sum _{n=0}^{167} s[n] \sin \frac{2\pi kn}{168} \right)^2 }$$ $S[k]$ is inherently conjugate symmetric because $s[n]$ is a real-valued signal. Therefore, only the first $24 \times 7 / 2 + 1$ points are processed, i.e. $k = 0, 1, \ldots, 84$. Specifically, $\hat{S}[0]$ stands for the Direct Current component and $\hat{S}[84]$ corresponds to the Nyquist frequency (a period of 2 hours).
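The transform can be reproduced with a direct summation, as sketched below. The input here is a synthetic stand-in (a constant plus a 24-hour cosine), not the measured sentiment series, and the function name `dft_magnitudes` is hypothetical. For real data, `numpy.fft.rfft` returns the same one-sided spectrum far more efficiently.

```python
import cmath
import math


def dft_magnitudes(s):
    """Magnitudes |S[k]| for k = 0 .. N/2 of a real sequence s, by direct summation."""
    n_points = len(s)
    magnitudes = []
    for k in range(n_points // 2 + 1):
        total = sum(
            s[n] * cmath.exp(-2j * math.pi * k * n / n_points)
            for n in range(n_points)
        )
        magnitudes.append(abs(total))
    return magnitudes


# Synthetic stand-in for the sentiment series: a constant plus a 24-hour cosine.
series = [0.1 + 0.05 * math.cos(2 * math.pi * n / 24) for n in range(168)]
spectrum = dft_magnitudes(series)

# Strongest non-DC bin; a 24-hour cycle in 168 hourly samples lands on k = 168 / 24 = 7.
peak = max(range(1, len(spectrum)), key=spectrum.__getitem__)
```

For this synthetic input the DC bin holds 168 × 0.1 = 16.8 and the peak bin holds 0.05 × 168 / 2 = 4.2, while all other bins are numerically zero.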

A dominant peak can be clearly seen at 11.57 µHz, which corresponds to a period of 24 hours. Further peaks appear at 23.15 µHz (12 hours) and 34.72 µHz (8 hours).
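The conversion between bin index, period and frequency follows from the sampling setup: one sample per hour and 168 samples in total, so bin $k$ has frequency $k / (168 \times 3600\ \text{s})$. A short sketch with hypothetical function names:

```python
SAMPLE_PERIOD_S = 3600  # one sample per hour
N_POINTS = 168          # 24 hours x 7 days


def bin_frequency_uhz(k):
    """Frequency of DFT bin k in microhertz."""
    return k / (N_POINTS * SAMPLE_PERIOD_S) * 1e6


def bin_period_hours(k):
    """Period of DFT bin k in hours (k must be non-zero)."""
    return N_POINTS / k


# Bins 7, 14 and 21 correspond to periods of 24, 12 and 8 hours.
for k in (7, 14, 21):
    print(k, bin_period_hours(k), round(bin_frequency_uhz(k), 2))
```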

The 24h peak stands for the variations associated with the diurnal cycle. The 12h peak is attributed to the division of day and night, and the 8h period is associated with the shift between work, life and sleep.