Scrape Data From a Twitter Account and a Search Term

Kay Chansiri
4 min read · Dec 10, 2020
Image credit: Pixabay

Have you ever been interested in what a science organization has recently discussed on Twitter? Have you ever been curious about how Twitter users have recently talked about a given topic? If you aim to answer those questions, this tutorial is for you.

Before answering the questions, you will need a few Python libraries and the credentials for the Twitter Application Programming Interface (API).

Pandas is used for data analysis and manipulation in Python, especially for numeric and time-series data.

Tweepy is a Python library built on the Twitter Application Programming Interface (API) that allows users to compose tweets, access their followers’ data, examine a large number of tweets on a given topic, and read Twitter users’ profiles.

Time is a Python standard-library module for working with time values, such as converting between string, numeric, and object representations of time, and pausing execution.

Import the required libraries.

import tweepy
import pandas as pd
import time
You will also need the required Twitter API credentials: consumer key, consumer secret, access token, and access token secret. If you don’t have them yet, visit https://dev.twitter.com/apps/new to create a Twitter developer account and obtain them.

Now that you have obtained your Twitter API credentials and imported the required Python libraries, let’s get started.

Step 1: Create your Twitter API variables:

You will need to request authorization to use data from Twitter, which will authenticate your requests. To do so, use tweepy.OAuthHandler and auth.set_access_token to create an authorization variable. You will then use tweepy.API to create the client that requests and returns data from Twitter. Here, I set wait_on_rate_limit=True so that Tweepy automatically waits for rate limits to replenish.

consumer_key = "enter your consumer key"
consumer_key_secret = "enter your consumer key secret"
access_token = "enter your access token"
access_token_secret = "enter your access token secret"
auth = tweepy.OAuthHandler(consumer_key, consumer_key_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth,wait_on_rate_limit=True)
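
Before moving on, you may want to confirm that authentication actually succeeded. The snippet below is a minimal sanity check, assuming a Tweepy 3.x setup; verify_credentials() simply returns your own account object if the keys are valid.

# Optional sanity check: confirm the credentials work before scraping
try:
    me = api.verify_credentials()
    print('Authenticated as @{}'.format(me.screen_name))
except tweepy.TweepError as e:
    print('Authentication failed:', e)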

Step 2: Create an empty list ‘tweets’ to store the data you are about to scrape.

tweets = []

Step 3: Define a function username_tweets_to_csv that scrapes data from a specific Twitter account and converts the data into a CSV file. Inside the function, use tweepy.Cursor to generate the query, and use tweet.created_at, tweet.id, and tweet.text to tell Python that you want the time, Tweet ID, and text of each tweet from the given account.

Store the retrieved time, ID, and text in an object named ‘tweets_info.’ Once the object is created, use pd.DataFrame to generate a data frame of the scraped tweets, and convert the data frame to a CSV file with tweets_df.to_csv.

def username_tweets_to_csv(username, recent):
    try:
        # Query the user's timeline and keep the most recent `recent` tweets
        tweets = tweepy.Cursor(api.user_timeline, id=username).items(recent)
        # Keep the creation time, tweet ID, and text of each tweet
        tweets_info = [[tweet.created_at, tweet.id, tweet.text] for tweet in tweets]
        # Build a data frame and write it to a CSV file named '<username>-tweets.csv'
        tweets_df = pd.DataFrame(tweets_info, columns=['Datetime', 'Tweet Id', 'Text'])
        tweets_df.to_csv('{}-tweets.csv'.format(username), sep=',', index=False)
    except BaseException as e:
        print('failed on_status,', str(e))
        time.sleep(3)
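
Note that the standard API truncates tweet text longer than 140 characters. If you need the full, untruncated text, one option (a small variation on the lines above, assuming the standard v1.1 API) is to request extended mode and read full_text instead of text:

# Optional variation: request untruncated tweets with tweet_mode='extended'
tweets = tweepy.Cursor(api.user_timeline, id=username, tweet_mode='extended').items(recent)
tweets_info = [[tweet.created_at, tweet.id, tweet.full_text] for tweet in tweets]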

Step 4: Identify the Twitter username from which you would like to scrape data and the number of most recent tweets you would like to retrieve. Here, I would like the 150 most recent tweets from the National Science Foundation, whose username is NSF. Call username_tweets_to_csv(username, recent) to run the query and convert the scraped data into a CSV file.

username = 'NSF'
recent = 150
username_tweets_to_csv(username, recent)

The NSF-tweets.csv file should be saved in your working directory.
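
To double-check the output, you could read the file back with pandas; this is just a quick sketch, and the filename follows the '{}-tweets.csv' pattern used in the function above.

# Quick check: load the saved file and preview the first rows
nsf_df = pd.read_csv('NSF-tweets.csv')
print(nsf_df.shape)
print(nsf_df.head())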

Now that you have learned how to scrape data from a Twitter account, let’s see how to scrape Twitter data associated with a given topic.

Step 1: Create an empty list ‘tweets’ to store data scraped from Twitter.

tweets = []

Step 2: Define a function text_query_to_csv that obtains tweets associated with the topic you are interested in and converts the data into a CSV file. As before, use tweepy.Cursor to generate the query, and use tweet.created_at, tweet.id, and tweet.text to tell Python that you want the time, Tweet ID, and text of each tweet matching the search term.

Store the retrieved time, ID, and text in an object named ‘tweets_list.’ Once the object is created, use pd.DataFrame to generate a data frame of the scraped tweets, and convert the data frame to a CSV file with tweets_df.to_csv.

def text_query_to_csv(text_query, recent):
    try:
        # Search for tweets matching the query and keep the most recent `recent` tweets
        tweets = tweepy.Cursor(api.search, q=text_query).items(recent)
        # Keep the creation time, tweet ID, and text of each tweet
        tweets_list = [[tweet.created_at, tweet.id, tweet.text] for tweet in tweets]
        # Build a data frame and write it to a CSV file named '<query>-tweets.csv'
        tweets_df = pd.DataFrame(tweets_list, columns=['Datetime', 'Tweet Id', 'Text'])
        tweets_df.to_csv('{}-tweets.csv'.format(text_query), sep=',', index=False)
    except BaseException as e:
        print('failed on_status,', str(e))
        time.sleep(3)

Step 3: Identify the topic you are interested in as the variable ‘text_query’ and the number of recent tweets associated with the topic as the variable ‘recent.’ Here, I am interested in the 150 most recent tweets discussing the term ‘climate change.’

text_query = 'climate change'
recent = 150
text_query_to_csv(text_query, recent)

The ‘climate change-tweets.csv’ file should be saved in your working directory.
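
If the results contain too many retweets, Twitter’s standard search syntax lets you filter them out. The lines below are one optional refinement, assuming the standard search API; the ‘-filter:retweets’ operator is part of Twitter’s search syntax, not something specific to Tweepy. Note that the output filename will then include the operator text as well.

# Optional refinement: exclude retweets from the search results
text_query = 'climate change -filter:retweets'
text_query_to_csv(text_query, recent)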

It should be noted that the standard API only lets users scrape tweets from the previous 7 days, and only up to 18,000 tweets per 15-minute window. An upcoming tutorial will explain how to extend this limited API access.
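
If you want to see how much of your search quota remains at any point, Tweepy exposes the standard v1.1 rate-limit endpoint directly. This is a minimal sketch of that check.

# Check remaining calls for the search endpoint in the current 15-minute window
limits = api.rate_limit_status()
search_limit = limits['resources']['search']['/search/tweets']
print('Remaining search calls:', search_limit['remaining'])
print('Window resets at (epoch seconds):', search_limit['reset'])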

See the full code used in this tutorial below:
