William and Mary Senior, Data Science BS, Philosophy Major
About
I’m interested in scholarly works within Reformed theology, and I have benefited greatly from reading the works of St. Augustine, St. Thomas Aquinas, René Descartes, and George Berkeley, among others. I also love the works of Dr. R.C. Sproul and Adrian Rogers. My all-time favorite book is the Bible; I recommend you give the Gospel of John a try.
Some of my hobbies include learning Koine Greek and playing guitar.
In my leisure time I enjoy listening to a variety of genres of music and playing first-person shooters.
My script takes either the path to a .txt file or a single-column pandas DataFrame stored in a variable; it isolates, lowercases, and stems the unique non-stopwords and places them into a corpus returned as a pandas DataFrame.
#!/usr/bin/env python3
#
# create_corpus.py
#
# VERSION 0.3.2
#
# Last edit: 2021-09-18
#
# This preliminary script will take a given input text...
# (single column dataframe or single string)
# ...stem the unique non-stopwords, and create a corpus
#
##############################################################################
# REQUIRED MODULES
##############################################################################
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import nltk
import heapq
import re
import pandas as pd
nltk.download('stopwords')
nltk.download('punkt')

pt = PorterStemmer()
##############################################################################
# FUNCTIONS
##############################################################################
def df_corpus(df):
    """
    Name: df_corpus
    Inputs: pandas DataFrame (df)
    Outputs: pandas DataFrame (corpus)
    Features: Isolates, lowercases, and stems all unique non-stopwords.
    """
    # Clean each entry: strip punctuation, lowercase, drop stopwords, stem
    unique_df = []
    for i in range(len(df)):
        ent = re.sub('[^a-zA-Z0-9 ]', '', df[i])
        ent = ent.lower()
        ent = ent.split()
        ent = [word for word in ent if word not in set(stopwords.words('english'))]
        ent = [pt.stem(word) for word in ent]
        ent = ' '.join(ent)
        unique_df.append(ent)

    # Collect the unique stemmed words across all entries
    corpus_df = []
    for i in range(len(unique_df)):
        corpus_df = set(corpus_df).union(set(unique_df[i].split(' ')))

    return pd.DataFrame(list(corpus_df), columns=['Unique Words'])
def txt_corpus(path):
    """
    Name: txt_corpus
    Inputs: str, path to a .txt file (path)
    Outputs: pandas DataFrame (corpus)
    Features: Isolates, lowercases, and stems all unique non-stopwords.
    """
    with open(path, encoding='utf-8') as f:
        file = f.read()

    # Tokenize, lowercase, and strip non-word characters
    sent = nltk.word_tokenize(file)
    for i in range(len(sent)):
        sent[i] = sent[i].lower()
        sent[i] = re.sub(r'\W', ' ', sent[i])
        sent[i] = re.sub(r'\s+', ' ', sent[i])

    # Clean each token: strip punctuation, drop stopwords, stem
    unique_txt = []
    for i in range(len(sent)):
        txt = re.sub('[^a-zA-Z0-9 ]', '', sent[i])
        txt = txt.lower()
        txt = txt.split()
        txt = [word for word in txt if word not in set(stopwords.words('english'))]
        txt = [pt.stem(word) for word in txt]
        txt = ' '.join(txt)
        unique_txt.append(txt)

    # Drop any tokens that were reduced to empty strings
    for i in range(unique_txt.count('')):
        unique_txt.remove('')

    # Collect the unique stemmed words across all tokens
    corpus_txt = []
    for i in range(len(unique_txt)):
        corpus_txt = set(corpus_txt).union(set(unique_txt[i].split(' ')))

    return pd.DataFrame(list(corpus_txt), columns=['Unique Words'])
##############################################################################
# MAIN
##############################################################################
x = input('Are you passing a .txt file? (Y or N): ')

if x == 'Y':
    y = input('Paste the path to your .txt file: ')
    var_Y = txt_corpus(y)
    print(var_Y)
else:
    z = input('Paste the name of the variable holding your single column dataframe: ')
    # Look the DataFrame up by name in the current session's globals
    var_N = df_corpus(globals()[z])
    print(var_N)