My name is Vince, a Senior in the College of William and Mary studying Economics and Data Science
Things I enjoy doing
volleyball: right back, DS, libero
chess: I play online (lichess.org), participated in two US Opens just for fun
backpacks: unsure how I got into it, but I like backpacks. I sink a lot of hours watching and reading reviews of cool backpacks.
travel: traveled extensively in every continent except Africa, and dreaming of the day I can go there as well as to Antarctica. Synergizing with my backpack hobby, I travel with only one carryon backpack!
coding: similar to how one can “enjoy” exercise - it hurts in the moment, and you wonder why you do it, but give it a couple days’ time, and you realize the experience was fun (sorta)!!
{
#works with packhacker.com, e.g. https://packhacker.com/travel-gear/heimplanet/neck-pouch-a5/
import urllib.request, json, re, argparse #re needed to replace whitespace down the line
from bs4 import BeautifulSoup
def parsehtml(url):
with urllib.request.urlopen(url) as req:
contents = req.read() #this will be all the text
soup = BeautifulSoup(contents, features='html.parser')
text = soup.find_all(['p','h2']) #I want all text in paragraphs (p) and headings (h2)
review = []
for i in text:
if i.name == 'h2': #I want a colon after every heading to make sentences clearer.
review.append(i.get_text(strip=True)+':') #get_text() gives a str, which allows concat with ":"
else:
review.append(i.get_text(strip=True)) #if not h2, do nothing
review_cleaned = [re.sub("\s+"," ",rev) for rev in review] #changes all whitespace to single space
review_txt = " ".join(review_cleaned) #make list into one long str to be written in
with open("Output.txt", "w") as txtfile:
txtfile.write(review_txt)
return "done"
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='scrape a packhacker.com page by adding the url after py file name. You MUST have bs4 installed!')
parser.add_argument('website',type=str, metavar='',help='input your website here')
args = parser.parse_args()
print(parsehtml(args.website))
}