
Data scraping using website crawlers/programming


Right then,

 

A mate of mine wants to obtain a set of information from a website, specifically email addresses. These are on different pages, so a surfer would in theory have to click through each page to collect them.

 

I tried my best with a "Basteln" (tinkering) approach, downloading Ruby on Rails and then editing the code, but it failed. How might one do this?


You write a script in PHP, Perl, JavaScript, Python, Ruby or Groovy that fetches the pages in question and then parses them, probably using regular expressions, to extract what you want.

 

This little PHP script will print Jeremy's location:

 

 

<?php
// Fetch the forum profile page.
$url = 'http://www.toytowngermany.com/forum/index.php?showuser=32';
$page = file_get_contents($url);

// Capture the contents of the <dd> that follows the <dt>Location:</dt> label.
// Using ~ as the regex delimiter avoids having to escape the slashes.
preg_match('~<dt>Location:</dt>[\s\S]+?<dd>([\s\S]+?)</dd>~', $page, $loc);

echo "His location is " . $loc[1] . "\n";
?>

 


 

parses them, probably using regular expressions

 

Regular expressions to parse HTML? Only if you're a masochist.

 

There are libraries specifically for parsing HTML, like this one for Python.
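BeautifulSoup is one common choice. Here is a minimal sketch of the same Location lookup as the PHP script above, assuming the beautifulsoup4 package is installed (pip install beautifulsoup4):

# Same Location lookup as the PHP script above, but with a real HTML
# parser instead of a regular expression.
import urllib.request

from bs4 import BeautifulSoup

url = 'http://www.toytowngermany.com/forum/index.php?showuser=32'
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')

# Find the <dt>Location:</dt> label, then read the <dd> that follows it.
dt = soup.find('dt', string='Location:')
if dt is not None:
    dd = dt.find_next_sibling('dd')
    print('His location is', dd.get_text(strip=True))

The parser tolerates the kind of sloppy markup that trips up hand-rolled regular expressions, which is the whole point.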

 

 

A mate of mine wants to obtain a set of information from a website, specifically email addresses.

 

Is he a spammer?


 

 

A mate of mine wants to obtain a set of information from a website, specifically email addresses.

How might one do this?

 

Ask The Local.


 

A mate of mine wants to obtain a set of information from a website, specifically email addresses.

 

Which is why many of us who create web pages (for Vereine, i.e. clubs, and the like) have to resort to using graphics for email addresses (or similar tricks) rather than plain text, just to avoid ******s like these.


There comes a time in a programmer's life when he decides that a problem calls for regular expressions.

 

Now the programmer has two problems.


Regular expressions are the greatest thing since garbage collection.

You just have to respect them as a rather cryptic programming language.

 

It's a good question, Jeremy. Why does your mate want to collect email addresses from third-party pages? It's hard to envision any kind of honorable reason for him doing so.


I believe there's a regular expression for describing people who try to harvest email addresses from webpages.


MAM..

 

I can think of an example...

 

I would like to email the CEO or a manager at BAXI in the UK..

 

The only email address I can find is info@....

 

The thing is, I don't want to talk or communicate with a spam-filtering monkey... I want to communicate with the boss!


 

I would like to email the CEO [..] at BAXI in the UK..

david.pinder@bdrthermea.com

 

@OP: http://scrapy.org/
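If the goal really is to walk a whole site, a minimal Scrapy spider sketch would look something like this; the domain, start URL and selectors are placeholders to adapt:

# Minimal Scrapy spider sketch. The domain and start URL are
# placeholders; point them at the real site and adjust the selectors.
import scrapy


class EmailSpider(scrapy.Spider):
    name = 'emails'
    allowed_domains = ['example.com']      # placeholder: keeps the crawl on one site
    start_urls = ['http://example.com/']   # placeholder

    def parse(self, response):
        # Collect anything exposed as a mailto: link on this page.
        for href in response.css('a[href^="mailto:"]::attr(href)').getall():
            yield {'email': href.removeprefix('mailto:')}

        # Queue up every link on the page; Scrapy filters out duplicate requests.
        for link in response.css('a::attr(href)').getall():
            yield response.follow(link, callback=self.parse)

Saved as, say, emails_spider.py, it runs with scrapy runspider emails_spider.py -o emails.json, and the DOWNLOAD_DELAY setting throttles the crawl if politeness is a concern.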


wget -q -r -O - http://yoursite.com/ | grep -E -o "\b[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+\.[a-zA-Z0-9.-]+\b"

 

If you need more than one line for this, you suck.


 

If you need more than one line for this, you suck.

 

Reminds me of the time I didn't want to do a particular cryptography exercise for class, so I instead devised a brute-force attack that returned the same result. Then I optimized it into a single line of Perl, and then down to fewer than 10 characters...


 

wget -q -r -O - http://yoursite.com/ | grep -E -o "\b[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+\.[a-zA-Z0-9.-]+\b"

 

If you need more than one line for this, you suck.

If you crawl servers without something like "--limit-rate=50k" as a wget parameter, you suck.
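The script equivalent of that courtesy is a pause between requests; a minimal sketch in Python, where the URL list and the one-second delay are arbitrary placeholders:

# Politeness sketch: fetch a list of pages with a fixed pause between
# requests. The URLs and the one-second delay are placeholders.
import time
import urllib.request

urls = [
    'http://example.com/',       # placeholder pages
    'http://example.com/about',
]

for url in urls:
    page = urllib.request.urlopen(url).read()
    print(url, len(page), 'bytes')
    time.sleep(1.0)  # wait before hitting the server again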


Isn't wget a Unix command? Does it work for our "Windows comrades" without Cygwin?

--limit-rate is going to be fairly useless on a single-page wget.

 

@Spiderpig: If you want to extract one email address from a page, you might as well search by hand in a browser. Jeremy's friend is talking about harvesting.


 

--limit-rate is going to be fairly useless on a single-page wget.

 

What I posted was not a single-page wget but a recursive one, so this would actually throttle the connection over time.

 

It's somewhat unnecessary, though, because any decent web server should be able to handle the number of requests this command will throw at it. It's single-threaded; it's not doing anything in parallel.


 

It's somewhat unnecessary, though, because any decent web server should

Any decent web server will throw shit at you if you behave like an unwanted crawler.

