We frequently get questions at Beetle about the infrastructure that processes the thousands of emails we receive every day. In the next few posts, we’ll be walking through the different steps each of our emails goes through before it is made visible in your feed, and the application that makes sure you keep receiving relevant emails.
You might be wondering by now what the Crawler is.
The Crawler is the first of our 7 automated steps. It’s important to understand that the emails we receive aren’t the result of any special deals with the senders. On the contrary, we want to remain under the radar and receive the emails that any Ordinary Joe would receive. That way, we know that we’re showing you the emails that real people are receiving, at the times they’re receiving them.
However, without any special deals with the vendors, the only way we’re going to receive any emails at all is by actually signing up to email campaigns. At the time of writing this post, the number of signups to email campaigns, marketing cycles and newsletters we’ve made stands at 383,942, of which 977 were made today. As much as I’d love to tell you we have a warehouse of tiny monkeys sat at computers signing up to email campaigns all over the web, paid in bananas and red bouncy balls, I’m afraid it’s not the case.
Our MySQL database contains a table of thousands of sites to crawl. Each of these sites is crawled occasionally by a Node.js application: the Crawler. Each page is scoured for forms that look like they might take an email address. If a form is found, the Crawler creates a new signup record in our Signup table, fills in the form and submits it. Each signup record consists of a set of unique demographic data (name, address, phone number, password etc.) and its own email address. By giving each signup its own email address, we can tie each email we receive back to its source - which page we signed up on, when we signed up and what details we used - based on the address the email was sent to.
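To make the idea concrete, here is a minimal Python sketch of the two halves of that paragraph: spotting forms that look like they take an email address, and giving every signup its own address so an incoming email can be traced back to its source. This is an illustration only, not the actual Node.js Crawler code. The names `EmailFormFinder`, `Signup`, `new_signup`, `find_signup` and the `inbox.example.com` domain are all hypothetical, and the "contains the word email" heuristic is an assumption about what "looks like" might mean.

```python
import uuid
from dataclasses import dataclass
from html.parser import HTMLParser


class EmailFormFinder(HTMLParser):
    """Collects the action URL of every form that has an email-like input."""

    def __init__(self):
        super().__init__()
        self.email_form_actions = []
        self._in_form = False
        self._current_action = ""
        self._has_email_input = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._in_form = True
            self._current_action = attrs.get("action") or ""
            self._has_email_input = False
        elif tag == "input" and self._in_form:
            # Assumed heuristic: any of these attributes mentioning "email".
            hints = " ".join(
                (attrs.get(key) or "")
                for key in ("type", "name", "id", "placeholder")
            )
            if "email" in hints.lower():
                self._has_email_input = True

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            if self._has_email_input:
                self.email_form_actions.append(self._current_action)
            self._in_form = False


@dataclass
class Signup:
    page_url: str  # the page the form was found on
    address: str   # the email address used for this signup only


def new_signup(page_url, signups, domain="inbox.example.com"):
    """Create a signup with its own address and index it by that address."""
    address = f"{uuid.uuid4().hex}@{domain}"
    signup = Signup(page_url=page_url, address=address)
    signups[address] = signup
    return signup


def find_signup(recipient, signups):
    """Trace an incoming email back to its signup via the recipient address."""
    return signups.get(recipient.strip().lower())
```

In this sketch a plain dictionary stands in for the Signup table. The point is only that the recipient address alone is enough to recover where and how the signup was made.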
We take form spamming and crawler abuse very seriously, so we limit the number of requests we make to every site we crawl, and we’re very conscientious about how often we crawl each site. We’ve found that by crawling each site only every now and again, we stay in the current email campaign cycles and receive the most up-to-date follow-up and marketing emails, without upsetting any company sysadmins!
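One simple way to enforce "only every now and again" is to record when each site was last crawled and skip anything crawled too recently. The Python sketch below shows that check. The function name `sites_due` and the 30-day interval are assumptions made for the example, not the values the Crawler actually uses.

```python
from datetime import datetime, timedelta

# Assumed value for illustration; the real interval is not stated in this post.
MIN_CRAWL_INTERVAL = timedelta(days=30)


def sites_due(last_crawled, now, min_interval=MIN_CRAWL_INTERVAL):
    """Return sites never crawled, or last crawled at least min_interval ago.

    last_crawled maps a site URL to a datetime, or to None if never crawled.
    """
    return [
        site
        for site, when in last_crawled.items()
        if when is None or now - when >= min_interval
    ]
```

Passing `now` in explicitly, rather than calling `datetime.now()` inside the function, keeps the check easy to test.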
Every so often, our crawler will come across a form it doesn’t know how to handle. In this case, it’s down to one of the team to sign up to the site manually. In order to do this properly, with proper demographic data, we each have a copy of a specially made Chrome extension installed in our browsers that returns a new set of data for us to enter into the form. This way, even manual form submissions are relatively painless.
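For a feel of what "a new set of data" could look like, here is a small Python sketch of a generator for one set of signup details. It is not the extension's code. The function `new_identity`, the word lists and the placeholder phone format are all made up for the example.

```python
import random
import secrets
import string

# Hypothetical sample data; a real generator would draw on far larger lists.
FIRST_NAMES = ["Alex", "Sam", "Jordan", "Taylor"]
LAST_NAMES = ["Smith", "Jones", "Brown", "Patel"]
STREETS = ["High Street", "Station Road", "Church Lane"]


def new_identity(rng=None):
    """Return one fresh set of demographic details for a form submission."""
    rng = rng or random.Random()
    alphabet = string.ascii_letters + string.digits
    return {
        "name": f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}",
        "address": f"{rng.randint(1, 200)} {rng.choice(STREETS)}",
        # Placeholder number format, assumed for this example.
        "phone": f"07700 900{rng.randrange(1000):03d}",
        # Passwords come from the secrets module rather than the seeded rng.
        "password": "".join(secrets.choice(alphabet) for _ in range(16)),
    }
```

Accepting an optional seeded `random.Random` makes the non-secret fields reproducible in tests while the password stays unpredictable.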
The Crawler has a companion, which we call the Binger. Remember our table of URLs to crawl? This table also contains search terms. It’s the Binger’s job to periodically perform searches, using the API of a well-known search engine (no prizes for guessing which one!), to find new sites related to these search terms. For example, we recently wanted to add a load of university sites to the Crawler. Instead of inserting the university website URLs manually, we can simply enter related keywords into the Binger and let it find the most relevant sites. These searches are performed regularly to make sure we’re always signing up to the most popular sites relating to our search terms.
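The discovery loop described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Binger itself: `discover_sites` is a hypothetical name, and `search` is a stand-in callable that takes a term and returns a ranked list of result URLs, so no real search engine API is shown or implied.

```python
from urllib.parse import urlsplit


def discover_sites(search_terms, known_hosts, search, per_term=10):
    """Return result URLs whose host is not already known, one per host.

    search is any callable mapping a search term to a ranked list of URLs.
    """
    seen = {host.lower() for host in known_hosts}
    new_sites = []
    for term in search_terms:
        # Only the top results are considered, to favour popular sites.
        for url in search(term)[:per_term]:
            host = urlsplit(url).hostname or ""
            if host and host not in seen:
                seen.add(host)
                new_sites.append(url)
    return new_sites
```

Deduplicating by hostname means a site that ranks for several search terms, or that is already in the table, is only queued for crawling once.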
That’s about it for our high-level overview of the first step in getting you the emails that you need.
In the next post, we’ll be covering the first stage of what happens when we actually receive an email from all those sexy email campaigns we’ve signed up to! Get ready for some **mega sick acronyms. Like IMAP, S3, AWS and MD5 yo!**
Until next time, feel free to drop us an email with more specific questions at email@example.com. We’ve built an architecture we’re proud of here at Beetle, and we’re always willing to share some of the knowledge we’ve accrued along the way.