To get a handle on this, I decided to look at political speeches using VoteSmart.org (I got the idea from a research paper), and look at partisan phrases. The article will proceed as follows:

Step 1: Assembling the corpus
Step 2: Extracting common phrases
Step 3: Measuring politically polarized phrases
Step 1: Assembling the Corpus
I started with a well-maintained dataset of everyone in Congress, which includes an ID that links to their Vote Smart page. To make things human-readable, my goal was to make a folder for each legislator containing text files of all their speeches. The research paper I mentioned used a few types of speeches, but I included every kind of public statement made beginning in January 2020.
This tutorial will focus on the text analysis, but you can find the code for scraping the corpus here. I used it to scrape over 10,000 unique statements and speeches from VoteSmart (I removed tweets and duplicate statements), though you could scrape more if you fine-tuned it and included more than just senators, for example. Here is a sample text from a Bernie Sanders interview with CNN back in March:
COOPER: Now to our interview with Senator Bernie Sanders. He is campaigning in the Super Tuesday state of Minnesota. Senator Sanders, thanks for being with us. You just heard Mayor Buttigieg endorsing Vice President Biden. Klobuchar is expected to do the same tonight, as is Beto O'Rourke. How does the consolidation of the moderate vote affect your plans moving forward, especially since Senator Warren shows no signs of getting out?

SEN. BERNIE SANDERS (D-VT), PRESIDENTIAL CANDIDATE: Well, Anderson, I think, as you know, from day one we have been taking on the establishment, whether it is the corporate establishment, you know, Wall Street, the drug companies, the insurance companies, the fossil fuel industry, or the political establishment.

And let me be very clear, it is no surprise they do not want me to become president because our administration will transform this country to create an economy and a government that works for all of the people, not just the 1 percent. It will not be the same old same old....

And here is what the organized corpus looks like:
Step 2: Extracting Common Phrases
The next step is to extract phrases by splitting up the texts into tokens. I used scikit-learn's CountVectorizer since it has some good features built in. Here is how simple it looks to set everything up:
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords

nltk_stop_words = stopwords.words('english')
tf_vectorizer = CountVectorizer(max_df=0.8, min_df=50,
                                ngram_range=(1, 2), binary=True,
                                stop_words=nltk_stop_words)

Stopwords (i.e. nltk_stop_words) help get rid of non-informative words; common examples are "of," "to," and "and." I used NLTK's list since scikit's built-in list has some known issues. Then, the tf_vectorizer (tf stands for "term frequency") gets initialized with a few settings:

- max_df=0.8 means exclude phrases that are in 80% of the documents or more (similar to stop words, these are unlikely to be informative since they are so common)
- min_df=50 means that a phrase must occur at least 50 times in the corpus to be included in the analysis (I used 50 because the research paper I mentioned did too, though you may experiment with different cutoffs)
- ngram_range=(1,2) means include one-word and two-word phrases (you could easily set it to (1,3) to also include trigrams / three-word phrases)
- binary=True means to only record whether a word occurs at all in a given document (i.e. 0 or 1), rather than counting exactly how many times it occurs (i.e. 0 or 1 or 2 or 3 or...)
- stop_words=nltk_stop_words simply plugs in the NLTK stop word list set up in the previous line, so that words such as "of" and "to" are not included
After putting the texts into a list with file I/O, the tf_vectorizer can turn the texts into a phrase matrix with just one line:
term_frequencies = tf_vectorizer.fit_transform(texts_list)

Now, term_frequencies is a matrix with the counts of each term from the vectorizer. We can turn it into a DataFrame to make things more intuitive and see the most common phrases:
phrases_df = pd.DataFrame(data=tf_vectorizer.get_feature_names(), columns=['phrase'])  # in scikit-learn 1.0+, use get_feature_names_out()
phrases_df['total_occurrences'] = term_frequencies.sum(axis=0).T
phrases_df.sort_values(by='total_occurrences', ascending=False).head(20).to_csv('top_20_overall.csv', index=False)

The resulting csv file looks like this:
Step 3: Measuring Politically Polarized Phrases
First things first, we will need to split up the Democrat-authored and Republican-authored texts, and then get their term frequency matrices. The Pandas DataFrame makes this quite easy:
# These lines were truncated during extraction; this reconstruction assumes
# texts_df has 'party' and 'text' columns:
dem_tfs = tf_vectorizer.transform(texts_df[texts_df['party'] == 'Democrat']['text'])
rep_tfs = tf_vectorizer.transform(texts_df[texts_df['party'] == 'Republican']['text'])
You may notice there are only two variables involved in the formula: phrase probability for Democrat-authored texts, and phrase probability for Republican-authored texts. So, to calculate the partisan bias score, we will just need to compute these two probabilities, which I called p_dem and p_rep for short. Once we have those, the bias score for each phrase is simply:
bias = (p_rep - p_dem) / (p_rep + p_dem)

I used a simple probability metric: the number of documents that contained a phrase divided by the total number of documents. There are some more sophisticated ways to measure probability, but based on my reading of the paper, this is probably what the authors did.
Almost there! Now, we just need to put the bias scores into the phrases_df DataFrame from earlier, and then we can easily look at some common partisan phrases. When I first ran this, some name phrases such as "senator harris" and "senator patty" were among the most partisan, an artifact of the transcripts. To put a band-aid on this, I made a filter to make sure partisan phrases were used by at least three senators:
top_rep = phrases_df.sort_values(by='bias_score', ascending=False).head(100)
# The rest of this line was truncated during extraction; the lambda counted how
# many distinct senators used each phrase, along these lines (assuming texts_df
# has 'speaker' and 'text' columns):
top_rep['n_senators'] = top_rep['phrase'].apply(
    lambda p: texts_df[texts_df['text'].str.contains(p, regex=False)]['speaker'].nunique())
top_rep = top_rep[top_rep['n_senators'] >= 3]
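The filter itself can be sketched with a toy usage table (the `phrase`/`speaker` layout here is hypothetical, not the article's actual data structure):

```python
import pandas as pd

# Each row records one senator using one phrase
usage = pd.DataFrame({
    'phrase':  ['green new deal', 'green new deal', 'green new deal',
                'senator patty', 'senator patty'],
    'speaker': ['Sanders', 'Warren', 'Markey', 'Murray', 'Cantwell'],
})

# Count distinct senators per phrase and keep phrases used by three or more
n_senators = usage.groupby('phrase')['speaker'].nunique()
keep = n_senators[n_senators >= 3].index.tolist()
print(keep)  # the name-artifact phrase "senator patty" is filtered out
```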
Here are the most Democratic-leaning phrases, along with the scores and probabilities we calculated.