The Simplest Spider
In these days of training models and analysing random things off the internet, sometimes you just need to spider some content off a web site.
If you are going to be doing this for real, some good tools I recommend are:
- Colly (golang)
- Scrapy (python)
- Beautiful Soup (not a spider but useful for doing html magic)
- Use a specialised service (there are several)
However, sometimes you just need something quick, dirty and scrappy.
I was doing a spike, and I really just needed to grab a bunch of stuff off a web site. I didn't need it to hide itself, spoof its user agent, stagger its requests, render JavaScript or any of the other fancy bits you can get from those libraries above. I just needed it to grab the content off a few pages, and feed the results into a machine learning monster that lives under my bed.
Hopefully this will be of some use to someone out there - a simple, ~15-line bash web spider script:
#!/usr/bin/env bash
set -e
# allows xpath queries against html pages - 1995 called and wants its xslt back.
command -v xmllint >/dev/null 2>&1 || { echo >&2 "xmllint is not installed. try 'sudo apt -y install libxml2-utils' or whatever."; exit 1; }
# dump everything here
mkdir -p working
# Once you download the html, this is where the content is located
XPATH="normalize-space(//div[@id='main']//.)"
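# (optional sanity check, made-up filename) once you have a page downloaded,
# you can test the XPath by hand with something like:
#   xmllint --html --xpath "$XPATH" ./working/some_saved_page 2>/dev/null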
# Grab a list of all the URLs we're going to grab, this is getting them from an
# RSS feed, but you could just use a text file or something
curl -X GET https://robrohan.com/index.xml > working/index.xml
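# No RSS feed? You could skip the curl above and the sed below and just write
# working/urls.txt by hand instead, e.g. (made-up URL):
#   printf '%s\n' "https://robrohan.com/some-page.html" > working/urls.txt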
# this grabs all the link looking things out of that index.xml
# This sed bit here was written by ChatGPT :-o - we're all doomed.
sed -nE 's/.*(https?|ftp):\/\/([^ "<>()]*)(<\/link>|<\/guid>)?.*/\1:\/\/\2/p' working/index.xml \
| sort \
| uniq \
| grep "robrohan.com" \
| grep "html" \
> working/urls.txt
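# e.g. a feed line like <link>https://robrohan.com/some-post.html</link> (made-up URL)
# should come out of the pipeline above as https://robrohan.com/some-post.html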
fetchurl() {
  curl -X GET "$1" > "./working/$2"
}
# boomer back ticks
for url in `cat ./working/urls.txt`; do
  # this replaces some basic chars the filesystem won't like
  clean_url=$(echo "$url" | tr "/:." "___")
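  # e.g. "https://robrohan.com/index.html" becomes "https___robrohan_com_index_html"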
  # download the file
  fetchurl "$url" "$clean_url"
  body=$(xmllint --html --noout --xpath "$XPATH" "./working/$clean_url" 2>/dev/null | tr "\"\n\t" " ")
  # do stuff with the text
  echo "$body"
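  # e.g. (hypothetical) pile everything into one corpus file for the ML step:
  # echo "$body" >> ./working/corpus.txt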
done
(Ok, so it's not technically a spider. It's more of a content downloader, but whatever.)
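If you want to give it a go, save it as something like spider.sh (the filename and output file here are just examples) and redirect the extracted text wherever your monster eats:

chmod +x spider.sh
./spider.sh > corpus.txt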