How to Create a Word Cloud Using Twitter Data?
What are word clouds? A word cloud is a visualization technique for presenting the words associated with a given topic or hashtag. Below is what a word cloud looks like if, say, you want to visualize the words associated with the hashtag “data science” on Twitter:
Word clouds can be generated easily with free online generators. However, those tools typically require a list of words and their frequencies (e.g., how often each word has been mentioned on Twitter), and manually counting the words associated with a hashtag on Twitter would be laborious and impractical. This tutorial shows how to use Python to collect the words associated with your hashtag of interest on Twitter and create a word cloud from them.
What you need before getting started:
- Python libraries: NumPy, Matplotlib, Tweepy, and WordCloud.
Tweepy is a wrapper around the Twitter Application Programming Interface (API); it lets users compose tweets, access their followers’ data, collect large numbers of tweets on a given topic, and read Twitter users’ profiles.
WordCloud generates the word cloud itself. The size of each word in the cloud represents how frequently that word has been mentioned on Twitter. WordCloud depends on Matplotlib, a Python plotting package, to render the figure.
NumPy is used to store and operate on matrices and multi-dimensional arrays. It is a core package of the Python data ecosystem because data is typically stored in vectors and matrices. I highly recommend reviewing linear algebra to get a feel for matrix and vector operations, since that knowledge is fundamental to data science. A good quick review is the series “The Essence of Linear Algebra” by 3Blue1Brown on YouTube.
To install the required Python libraries, you can use Anaconda Navigator, your operating system’s command prompt, Jupyter Notebook, or PyCharm.
If you use Anaconda Navigator, go to the Environments tab to install a library.
Search for the package you want to install in the search box.
Here, I am searching for the package ‘twitter.’ Check the box next to the package and then click Apply.
If you use the command prompt (for Windows users: press Windows + R, type cmd, and press Enter; for Mac users: press Fn + F4 to open Launchpad, open the Other folder, and select Terminal), type the following command.
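For example, assuming pip is available on your PATH, the following single command installs all four packages at once:

pip install numpy matplotlib tweepy wordcloud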
If you use Jupyter Notebook, run the code below:
import sys
!{sys.executable} -m pip install numpy
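The same pattern works for the remaining packages; for instance (an illustrative extension of the command above, not a separate requirement):

!{sys.executable} -m pip install matplotlib tweepy wordcloud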
If you use PyCharm, go to File > Settings, open Project > Python Interpreter, and click the + button.
Type the name of the package you want to install in the search box and select Install Package.
Once you have installed the required Python libraries, you will also need…
- Twitter API credentials: consumer key, consumer secret, access token, and access token secret. Visit https://dev.twitter.com/apps/new to create a Twitter developer account (if you don’t already have one) and obtain your credentials.
- PyCharm Community, Jupyter Notebook, or plain Python to run the commands that generate the word cloud. In this tutorial, I use Jupyter Notebook because it makes the content easy to follow visually.
Now that you have all of the required elements, let’s get started.
Step 1: Import the required packages.
import tweepy
from wordcloud import WordCloud
import matplotlib.pyplot as plt
%matplotlib inline
Step 2: Create variables for your Twitter developer information that you retrieved from https://dev.twitter.com/apps/new.
consumer_key = "insert your consumer key"
consumer_secret = "insert your consumer secret"
access_token = "insert your access token"
access_token_secret = "insert your access token secret"
Next, request authorization to use Twitter data; this authenticates your requests.
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)
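Optionally, you can confirm that the credentials work before continuing; Tweepy provides verify_credentials() for this:

# Optional: verify_credentials() returns your user object when the credentials are valid
user = api.verify_credentials()
if user:
    print("Authenticated as", user.screen_name)
else:
    print("Authentication failed; check your keys and tokens")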
Step 3: Define a search function, tweetSearch. By default, the function collects 1,000 English-language tweets associated with a given hashtag. Create an empty text holder, then use Tweepy to retrieve the tweets matching your query and append their text to the holder.
Because Twitter data often contains URL links, you may want to exclude irrelevant words, such as “https” and “co,” from your word cloud by replacing them with an empty string in the text holder. Finally, have the function return the cleaned text.
# Define a search-for-tweets function
def tweetSearch(query, limit = 1000, language = "en", remove = []):
    # Create an empty text holder
    text = ""
    # Retrieve the tweets matching the query and append their text
    # (note: in Tweepy v4 and later, api.search was renamed api.search_tweets)
    for tweet in tweepy.Cursor(api.search, q=query, lang=language).items(limit):
        text += tweet.text.lower()
    # Create a list of words to be removed
    removeWords = ["https", "co"]
    removeWords += remove
    # Replace the words you would like to remove with an empty string
    for word in removeWords:
        text = text.replace(word, "")
    # Return the cleaned text
    return text
Step 4: Call tweetSearch with the hashtag you are interested in to collect the text of 1,000 tweets associated with it. Here, I collect tweets related to the hashtag “datascience” and store the result in the variable search. Then create the word cloud from that text using WordCloud().generate().
search = tweetSearch("datascience")
wordcloud = WordCloud().generate(search)
Use Matplotlib (imported as plt) to display the word cloud. The figure size can be adjusted with plt.figure(figsize = (width, height)); in this example, I use (12, 6).
plt.figure(figsize=(12, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
The word cloud generated from my search of #datascience on Twitter looks like this:
The full code used in this tutorial is listed below.
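For convenience, here is the complete script assembled from the steps above (with placeholder credentials that you need to replace with your own):

import tweepy
from wordcloud import WordCloud
import matplotlib.pyplot as plt
%matplotlib inline

# Twitter API credentials
consumer_key = "insert your consumer key"
consumer_secret = "insert your consumer secret"
access_token = "insert your access token"
access_token_secret = "insert your access token secret"

# Authenticate with Twitter
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

# Collect and clean tweet text for a given query
def tweetSearch(query, limit = 1000, language = "en", remove = []):
    text = ""
    for tweet in tweepy.Cursor(api.search, q=query, lang=language).items(limit):
        text += tweet.text.lower()
    removeWords = ["https", "co"]
    removeWords += remove
    for word in removeWords:
        text = text.replace(word, "")
    return text

# Generate and display the word cloud
search = tweetSearch("datascience")
wordcloud = WordCloud().generate(search)
plt.figure(figsize=(12, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()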
This tutorial is just a quick, simple way to generate a word cloud from Twitter data. Please stay tuned for more complex data visualization techniques.