Creating a Private Database of Proxies – Part 2: Scraping IP Addresses

by blueshellgroup

In this section of our tutorial on creating a database of proxies, we’ll be walking through how we’re going to write our program.

What do I need to do this?

For this part of the tutorial, we’ll assume that you’ve read the tutorial introduction, and that you know what we’re trying to do and why. This section does not require any programming knowledge; we’re only going to walk through the steps we’ll take to get the IP address of each proxy server. You should have a VERY basic knowledge of HTML and CSS, or at least know what they are. If you don’t, hopefully you will by the end of this section. Everything else that you’ll need for later sections will be explained then.

This should be easy! Why do I need a tutorial?

Initially, when we set out to do this ourselves, we assumed the same thing. We thought we could set up the whole system in an hour or two, and have it running the same evening. It turns out the folks over at HideMyAss.com anticipated people doing what we were going to do, and made it a bit harder than we initially thought. Their primary method of stopping people from collecting the information from their site is obfuscating the code that displays the IP address of each server. The browser still shows the correct address on the page, but in the HTML it’s much harder to pick out, and thus harder for a computer to find automatically.

How does HideMyAss obfuscate the addresses?

To see how the IP addresses are hidden, we’re going to need to look at the HTML for the page. We can do this in a browser like Chrome or Firefox. In this tutorial, we’re going to use Chrome. Start by going to HideMyAss.com’s list of proxy servers, and open up the HTML for the page. Once there, isolate the <span> element that contains the first listed IP address, and let’s look at what it contains. For us, the first <span> looks like this:

<span>
    <style>.r6cp{display:none}.Nz73{display:inline}</style>
    <span style="display:none">36</span>
    <div style="display:none">71</div>
    <span style="display:none">94</span>
    <span class="r6cp">94</span>
    <div style="display:none">94</div>
    <span style="display:none">118</span>
    <span></span>
    <span></span>
    <span style="display:none">185</span>
    <div style="display:none">185</div>
    <span style="display:none">194</span>
    <span class="r6cp">194</span>
    <span style="display:none">202</span>
    <span></span>
    203
    <div style="display:none">205</div>
    <span style="display:none">246</span>
    <span></span>
    <span style="display: inline">.</span>
    <span class="r6cp">57</span>
    <div style="display:none">57</div>
    <span class="110">156</span>
    <span class="Nz73">.</span>
    <span style="display:none">107</span>
    <span></span>
    <span class="208">250</span>
    <span class="123">.</span>
    <span class="r6cp">56</span>
    <div style="display:none">56</div>
    <span style="display: inline">101</span>
</span>

Clearly, that’s a lot more than the four numbers that make up an IP address. This is what all those numbers end up looking like in the browser to the user:

[Image: the server information as displayed in the browser, showing the IP address 203.156.250.101]

So how do we get from all the numbers in the HTML above to what we see on the screen? There are two main methods being used here. The first is creating <span> and <div> elements in the code, but not displaying them. We see that used quite a bit in the code, in this line for example:

<span style="display:none">36</span>

We can see that a <span> element is created that says “36”, with its display property set to display:none. This tells the browser not to display the element when the web page is rendered. Let’s take a look at the code for our single IP address again, but with all the elements with the display:none property removed. We’ll also remove all the empty <span> elements.

<span>
    <style>.r6cp{display:none}.Nz73{display:inline}</style>
    <span class="r6cp">94</span>
    <span class="r6cp">194</span>
    203
    <span style="display: inline">.</span>
    <span class="r6cp">57</span>
    <span class="110">156</span>
    <span class="Nz73">.</span>
    <span class="208">250</span>
    <span class="123">.</span>
    <span class="r6cp">56</span>
    <span style="display: inline">101</span>
</span>
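The filtering we just did by hand can be sketched in a few lines of Python. This is only a rough illustration using regular expressions from the standard library; the function name strip_inline_hidden and the shortened SAMPLE fragment are our own inventions for this example, and the patterns assume the hidden elements contain plain text with no nested tags, as in the listing above.

```python
import re

# A shortened version of the obfuscated markup shown earlier (illustrative only).
SAMPLE = """<span>
    <style>.r6cp{display:none}.Nz73{display:inline}</style>
    <span style="display:none">36</span>
    <div style="display:none">71</div>
    <span class="r6cp">94</span>
    <span></span>
    203
    <span style="display: inline">.</span>
</span>"""

# A <span> or <div> whose inline style attribute contains display:none,
# matched together with its text content and closing tag.
HIDDEN_INLINE = re.compile(
    r'<(span|div)\s+style="[^"]*display:\s*none[^"]*">[^<]*</\1>',
    re.IGNORECASE,
)

# A <span> with no attributes and nothing but whitespace inside it.
EMPTY_SPAN = re.compile(r"<span>\s*</span>")


def strip_inline_hidden(html):
    """Remove inline-hidden elements and empty spans, then drop blank lines."""
    html = HIDDEN_INLINE.sub("", html)
    html = EMPTY_SPAN.sub("", html)
    return "\n".join(line for line in html.splitlines() if line.strip())


print(strip_inline_hidden(SAMPLE))
```

Elements hidden through a class, such as the one with class r6cp, are deliberately left alone at this stage, just as in the listing above.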

That looks better, but there are still more numbers than we need. So what else isn’t being displayed? The answer lies in the very first line, between the <style> tags. Two classes are defined there, r6cp and Nz73, each setting the same display property we saw earlier. This time, however, only the r6cp class has display:none; the Nz73 class has display:inline, meaning an element in that class WILL be displayed. Likewise, any element with display:inline set directly in its style attribute will be displayed. Elements whose class has no rule in the <style> block at all, such as 110, 208 and 123, are left at their default and are displayed too. Let’s see what the code looks like without the elements in class r6cp:

<span>
    <style>.r6cp{display:none}.Nz73{display:inline}</style>
    203
    <span style="display: inline">.</span>
    <span class="110">156</span>
    <span class="Nz73">.</span>
    <span class="208">250</span>
    <span class="123">.</span>
    <span style="display: inline">101</span>
</span>

That looks much more like our displayed page than what we started with. In fact, those elements are exactly the ones that are displayed: read in order, they spell out 203.156.250.101. This technique works for all of the IP addresses listed, and is what we will be writing our program to do in the next section.
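To know which classes to throw away, a program first has to read the <style> block. Here is a minimal sketch of that step in Python, assuming the rules always have the simple .name{...} shape seen in our example; the function name hidden_classes is hypothetical, and real CSS can be far messier than this pattern allows for.

```python
import re


def hidden_classes(css):
    """Return the set of class names whose rule contains display:none."""
    hidden = set()
    # Each rule is assumed to look like .name{declarations}
    for name, body in re.findall(r"\.([\w-]+)\s*\{([^}]*)\}", css):
        if re.search(r"display\s*:\s*none", body):
            hidden.add(name)
    return hidden


css = ".r6cp{display:none}.Nz73{display:inline}"
print(hidden_classes(css))  # {'r6cp'}
```

Any class that does not show up in the returned set, whether it is declared display:inline or not declared at all, is treated as visible.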

How exactly are we going to write this program?

To put it simply, we’re going to write our program to do exactly what we just did by hand. First, identify and remove all elements with display:none set in the tag’s style attribute. Then, read the <style> block, and remove all elements that belong to a class with display:none. Finally, take what’s left and join it into one line, and we should have our address! Luckily, the rest of the information about each server is not obfuscated, so we can just read that normally. Check out Part 3 for instructions on where to go from here!
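As a preview, the three steps might fit together roughly like this. This is a sketch under our own assumptions, not the program from Part 3: it uses Python’s built-in html.parser module, the names VisibleText and decode_ip are made up for this example, and it assumes every tag in the fragment is properly closed.

```python
import re
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collects only the text that a browser would actually display."""

    def __init__(self, hidden):
        super().__init__()
        self.hidden = hidden  # class names styled with display:none
        self.stack = []       # one flag per open element: is it hidden?
        self.parts = []       # visible text fragments, in document order

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        style = (attrs.get("style") or "").replace(" ", "")
        classes = (attrs.get("class") or "").split()
        self.stack.append(
            tag == "style"  # the CSS itself is never shown on the page
            or "display:none" in style
            or any(c in self.hidden for c in classes)
        )

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        # Keep the text only if no enclosing element is hidden.
        if not any(self.stack):
            self.parts.append(data.strip())


def decode_ip(html):
    """Return the text left over once hidden elements are ignored."""
    css = "".join(re.findall(r"<style>(.*?)</style>", html, re.S))
    hidden = {
        name
        for name, body in re.findall(r"\.([\w-]+)\s*\{([^}]*)\}", css)
        if re.search(r"display\s*:\s*none", body)
    }
    parser = VisibleText(hidden)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


# A condensed version of the example from this section.
sample = (
    '<span><style>.r6cp{display:none}.Nz73{display:inline}</style>'
    '<span style="display:none">36</span><span class="r6cp">94</span>'
    '<span></span>203<div style="display:none">205</div>'
    '<span style="display: inline">.</span><span class="110">156</span>'
    '<span class="Nz73">.</span><span class="208">250</span>'
    '<span class="123">.</span><span class="r6cp">56</span>'
    '<span style="display: inline">101</span></span>'
)
print(decode_ip(sample))  # 203.156.250.101
```

Rather than deleting elements from the HTML, this version walks through the markup once and simply skips any text that sits inside a hidden element, which gives the same result as the by-hand steps above.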

Where are the other parts?

Part 1 – Introduction

Part 2 – Scraping IP Addresses

Part 3 – TBA

Part 4 – TBA

Part 5 – TBA

Part 6 – TBA
