Web Scraping Best Practices for Lead Generation
Tags: scraping, best-practices, ethics


How to scrape websites ethically and effectively while respecting robots.txt, rate limits, and data privacy regulations.

Scrappy Team
January 4, 2025
4 min read

Web scraping is powerful, but with great power comes great responsibility. Here's how to do it right.

Why Ethics Matter

Unethical scraping can lead to:

  • IP bans and blocked access
  • Legal exposure under laws like the CFAA (US) or the GDPR (EU)
  • Damaged reputation in your industry
  • Overloaded servers that hurt other users

Scrappy is built with ethics in mind. We help you get the data you need while respecting websites and their users.

The Golden Rules

1. Respect robots.txt

Most websites publish a robots.txt file that specifies which paths crawlers may and may not access.

User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /public/

Scrappy automatically checks and respects these rules. Don't try to bypass them.
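You can run the same check yourself with Python's standard library. This sketch parses the example rules above and asks whether two paths may be fetched:

```python
from urllib import robotparser

# Parse the example robots.txt rules shown above.
rp = robotparser.RobotFileParser()
rp.parse("""
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /public/
""".splitlines())

print(rp.can_fetch("*", "/public/team"))   # True: explicitly allowed
print(rp.can_fetch("*", "/admin/users"))   # False: disallowed
```

In a real crawler you would call `rp.set_url(".../robots.txt")` and `rp.read()` to fetch the live file before every new domain.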

2. Rate Limit Your Requests

Hammering a website with requests will:

  • Get you blocked
  • Slow down their site for real users
  • Potentially crash smaller servers

Scrappy's approach:

  • Automatic rate limiting between requests
  • Randomized delays to appear human
  • Concurrent request limits per domain
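The randomized-delay idea looks roughly like this in Python. The timings here are illustrative, not Scrappy's actual values:

```python
import random
import time

def polite_delay(base: float = 1.0, jitter: float = 0.5) -> float:
    """Sleep a randomized interval between requests to avoid a
    machine-regular request pattern. Returns the delay used."""
    delay = base + random.uniform(0, jitter)
    time.sleep(delay)
    return delay

# Example loop: at most 1-2 requests per second.
for url in ["https://example.com/team", "https://example.com/contact"]:
    # fetch(url) would go here
    polite_delay()
```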

3. Identify Yourself

Some sites appreciate knowing who's scraping them. Scrappy uses a clear user agent that identifies the request as a bot, making it easy for site owners to contact us if needed.
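The general pattern for a self-identifying user agent is a bot name, a version, and a contact address. The string below is hypothetical (Scrappy's actual user agent isn't shown in this post):

```python
import urllib.request

# Hypothetical bot user agent: name, version, and a way to reach you.
headers = {
    "User-Agent": "ExampleScraperBot/1.0 (+https://example.com/bot; contact@example.com)"
}

req = urllib.request.Request("https://example.com/", headers=headers)
print(req.get_header("User-agent"))
```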

4. Cache When Possible

Don't re-scrape data you already have. Scrappy caches:

  • Page content for 24 hours
  • Email validation results for 30 days
  • Company information indefinitely

This reduces load on target sites and saves you credits.
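A minimal version of that caching idea is an in-memory store with a per-entry time-to-live. This is a sketch of the concept, not Scrappy's actual cache:

```python
import time

class TTLCache:
    """Tiny in-memory cache with per-entry time-to-live."""

    def __init__(self):
        self._store = {}

    def set(self, key, value, ttl_seconds):
        # ttl_seconds=None means "cache indefinitely".
        expires = None if ttl_seconds is None else time.time() + ttl_seconds
        self._store[key] = (value, expires)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires = entry
        if expires is not None and time.time() > expires:
            del self._store[key]  # evict expired entries lazily
            return None
        return value

cache = TTLCache()
cache.set("page:/team", "<html>...</html>", ttl_seconds=24 * 3600)  # 24 hours
cache.set("company:acme", {"name": "Acme"}, ttl_seconds=None)       # indefinitely
```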

Legal Considerations

GDPR Compliance

If you're collecting data on EU citizens:

  • Only collect data that's publicly available
  • Document your legitimate interest
  • Provide opt-out mechanisms
  • Honor data deletion requests

Scrappy is fully GDPR compliant and helps you stay compliant too.

CAN-SPAM and CCPA

When using scraped emails:

  • Include your physical address
  • Provide clear unsubscribe options
  • Honor opt-out requests within 10 business days
  • Don't sell personal data without consent

Terms of Service

Some websites explicitly prohibit scraping in their ToS. Enforceability varies by jurisdiction and by how the terms are presented, but it's good to be aware of them before you scrape.

Technical Best Practices

Target the Right Pages

Don't scrape entire websites. Target specific pages:

  • /team or /about - Team member info
  • /contact - Contact details
  • /leadership - Executive information

Handle Dynamic Content

Modern sites use JavaScript. Scrappy uses Steel Browser, a headless browser that:

  • Executes JavaScript
  • Waits for content to load
  • Handles infinite scroll
  • Bypasses basic bot detection

Extract Clean Data

Raw HTML is messy. Scrappy extracts:

  • Valid email addresses only
  • Phone numbers in consistent formats
  • Social media links
  • Company metadata

Validate Everything

Never trust scraped data blindly:

  • Validate email syntax and deliverability
  • Verify phone number formats
  • Check for spam traps and honeypots
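A first-pass syntax check is straightforward. Note this only covers the syntax step above; deliverability verification requires MX lookups and mailbox checks on top of it:

```python
import re

# Pragmatic email syntax check (not a full RFC 5322 validator).
EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")

def looks_like_email(candidate: str) -> bool:
    """Return True if the string is plausibly an email address."""
    return EMAIL_RE.match(candidate) is not None

print(looks_like_email("jane.doe@example.com"))  # True
print(looks_like_email("not-an-email"))          # False
```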

Common Mistakes to Avoid

Mistake 1: Scraping Too Fast

Wrong: 100 requests per second
Right: 1-2 requests per second, with delays

Mistake 2: Ignoring Errors

Wrong: Keep retrying failed requests indefinitely
Right: Exponential backoff with max retries
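The backoff pattern can be sketched in a few lines. `fetch` here is any callable that raises on failure; this illustrates the pattern, not Scrappy's internal retry logic:

```python
import random
import time

def fetch_with_backoff(fetch, url, max_retries=4, base_delay=1.0):
    """Call fetch(url), retrying with exponential backoff and jitter.

    Gives up (re-raises) after max_retries failed attempts.
    """
    for attempt in range(max_retries + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_retries:
                raise
            # 1s, 2s, 4s, 8s... plus a little jitter.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            time.sleep(delay)
```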

Mistake 3: Not Handling Edge Cases

Watch for:

  • 404 pages
  • Redirects
  • CAPTCHAs
  • Login walls
  • Rate limit responses (429)
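One way to handle those edge cases is a simple dispatch on the response. The status codes are standard HTTP, but the handling policy here is illustrative:

```python
def classify_response(status: int, body: str = "") -> str:
    """Map a response to a handling action for the crawler."""
    if status == 404:
        return "skip"             # page gone: record it and move on
    if status in (301, 302, 307, 308):
        return "follow-redirect"  # follow, but cap the redirect chain
    if status == 429:
        return "back-off"         # rate limited: honor Retry-After
    if status in (401, 403):
        return "stop"             # login wall or block: don't bypass
    if "captcha" in body.lower():
        return "stop"             # CAPTCHA is a clear "no bots" signal
    return "process"
```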

Mistake 4: Storing Sensitive Data

Never scrape or store:

  • Passwords or credentials
  • Payment information
  • Health records
  • Private communications

Scrappy's Built-in Protections

We've built these best practices into Scrappy:

| Feature | What It Does |
|---------|--------------|
| Rate limiting | Automatic delays between requests |
| robots.txt respect | Checks and follows site rules |
| Error handling | Graceful retry with backoff |
| Data validation | Cleans and validates all output |
| GDPR compliance | Built-in consent and deletion |
| Proxy rotation | Distributes requests across IPs |

When to Use Scraping

Good use cases:

  • Building B2B lead lists from public directories
  • Competitive research from public sources
  • Aggregating publicly available contact info

Bad use cases:

  • Scraping private member areas
  • Bypassing paywalls
  • Collecting personal data without consent
  • Overloading small business websites

The Bottom Line

Web scraping is legal and ethical when done right. Respect websites, follow the rules, and use data responsibly.

Scrappy makes it easy to scrape ethically while still getting the data you need to grow your business.