Monday, July 22, 2013

Quora Scraper

I love Quora. It's a great site and community. But I started getting a bit concerned how much writing I was doing there which was (potentially) disappearing inside their garden and not part of the body of thinking I'm building up on ThoughtStorms (or even my blogs).

Fortunately, I discovered Quora has an RSS feed of my answers, so I can save them to my local machine. (At some point I'll think about how to integrate them into ThoughtStorms; should I just make a page for each one?)

Anyway here's the script (powered by a lot of Python batteries.)

import feedparser
import hashlib
import json
from bs4 import BeautifulSoup
d = feedparser.parse("http://www.quora.com/YOUR-QUORA-NAME/answers/rss")
for e in d["entries"] :
title = e["title"]
summary = e["summary"]
h = hashlib.sha224(title).hexdigest()
summary = summary.replace("<br />","__LINEBREAK__")
soup = BeautifulSoup(summary)
answer = soup.get_text()[11:]
answer = answer[:-21]
answer = answer.replace("__LINEBREAK__","\n")
answer = answer.strip()
print title
print answer
j = {"question":title,"answer":answer,"link":e["link"]}
f=open(h+".quora.txt",'w')
f.write(json.dumps(j))
f.close()
view raw quorascraper.py hosted with ❤ by GitHub



And this turns the files back into a handy HTML page.
from os import listdir
from os.path import isfile, join
onlyfiles = [ f for f in listdir(".") if isfile(join("",f)) ]
import json
print """
<html>
<style>
body {
font-family:sans;
color:#333333;
padding:3px;
}
dt {
top-margin:4px;
}
</style>
<body>
<h2>Quora Answers by MY NAME</h2>
<dl>
"""
for fn in onlyfiles :
if fn[-3:] == "txt":
f = open(fn)
entry = json.loads(f.read())
print "<dt>%s <a href='%s'>Link</a></dt>" % (entry["question"].encode("utf-8"), entry["link"].encode("utf-8"))
answer = entry["answer"].encode("utf-8")
answer = answer.replace("\n\n","<br/>")
print "<dd>%s</dd>" % answer
print """
</dl>
</body>
</html>"""
view raw gistfile1.py hosted with ❤ by GitHub

No comments: