The Simplest Spider

In these days of training models and analysing random things off the internet, sometimes you just need to spider some content off a web site.

If you are going to be doing this for real, some good tools I recommend are:

However, sometimes you just need something quick, dirty and scrappy.

I was doing a spike, and I really just needed to grab a bunch of stuff off a web site. I didn't need it to hide itself, spoof its user agent, stagger its requests, render JavaScript, or do any of the other fancy bits you can get from those libraries above. I just needed it to grab the content off a few pages, and feed the results into a machine learning monster that lives under my bed.

Hopefully this will be of some use to someone out there - a simple ~15-line bash web spider:

#!/usr/bin/env bash
set -e
# allows xpath queries against html pages - 1995 called and wants its xslt back.
command -v xmllint >/dev/null 2>&1 || { echo >&2 "xmllint is not installed. try 'sudo apt -y install libxml2-utils' or whatever."; exit 1; }

# dump everything here
mkdir -p working

# Once you download the html, this is where the content is located
XPATH="normalize-space(//div[@id='main']//.)"
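# no div#main on your target site? swap in something like
# "normalize-space(//article//.)" - there's a one-liner for testing
# expressions after the script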

# Build a list of all the URLs we're going to fetch. This one pulls them from
# an RSS feed, but you could just use a text file or something
curl -fsSL https://robrohan.com/index.xml > working/index.xml
# this grabs all the link looking things out of that index.xml
# This sed bit here was written by ChatGPT :-o - we're all doomed.
sed -nE 's/.*(https?|ftp):\/\/([^ "<>()]*)(<\/link>|<\/guid>)?.*/\1:\/\/\2/p' working/index.xml \
	| sort -u \
	| grep "robrohan.com" \
	| grep "html" \
	> working/urls.txt

fetchurl() {
	curl -fsSL "$1" > "./working/$2"
}
# boomer back ticks
for url in `cat ./working/urls.txt`; do
	# this replaces some basic chars the filesystem won't like
	clean_url=$(echo "$url" | tr "/:." "___")
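	# e.g. https://robrohan.com/index.html becomes https___robrohan_com_index_html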
	# download the file
	fetchurl "$url" "$clean_url"
	body=$(xmllint --html --noout --xpath "$XPATH" "./working/$clean_url" 2>/dev/null | tr "\"\n\t" " ")
	# do stuff with the text
	echo "$body"
done
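
If the site you're pulling from keeps its content somewhere other than div#main, it helps to test the XPath against a single downloaded page before wiring it into the loop. A quick sketch - the file name is just whichever page ended up in working/, yours will differ:

xmllint --html --xpath "normalize-space(//div[@id='main']//.)" ./working/https___robrohan_com_index_html 2>/dev/null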

(Ok, so it's not technically a spider. It's more of a content downloader, but whatever.)
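
To run it, drop the whole thing in a file - spider.sh is just a name I made up, call it whatever, and point the output wherever your monster eats from:

chmod +x spider.sh
# each page comes out as one line of flattened text
./spider.sh > corpus.txt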