Web Scraping Best Practices for Lead Generation
How to scrape websites ethically and effectively while respecting robots.txt, rate limits, and data privacy regulations.
Web scraping is powerful, but with great power comes great responsibility. Here's how to do it right.
Why Ethics Matter
Unethical scraping can lead to:
- IP bans and blocked access
- Legal issues under CFAA or GDPR
- Damaged reputation in your industry
- Overloaded servers that hurt other users
Scrappy is built with ethics in mind. We help you get the data you need while respecting websites and their users.
The Golden Rules
1. Respect robots.txt
Most websites publish a robots.txt file that specifies what can and can't be crawled.
```
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /public/
```
Scrappy automatically checks and respects these rules. Don't try to bypass them.
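Scrappy handles this check for you, but if you're building your own tooling, Python's standard library can evaluate robots.txt rules directly. A minimal sketch (the rules and bot name below are illustrative, not a real site's policy):

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt: str, user_agent: str, path: str) -> bool:
    """Parse robots.txt content and check whether a path may be fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)

rules = """User-agent: *
Disallow: /admin/
Disallow: /private/
"""
```

In production you would fetch the live file from `https://example.com/robots.txt` rather than hard-coding it; `can_fetch` then tells you, per path, whether your bot is welcome.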
2. Rate Limit Your Requests
Hammering a website with requests will:
- Get you blocked
- Slow down their site for real users
- Potentially crash smaller servers
Scrappy's approach:
- Automatic rate limiting between requests
- Randomized delays to appear human
- Concurrent request limits per domain
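Scrappy applies these limits automatically. If you're rolling your own scraper, the idea behind the first two points can be sketched as a small per-domain limiter (the delay values are examples, not recommendations for any particular site):

```python
import random
import time

class RateLimiter:
    """Enforce a minimum, slightly randomized gap between requests to one domain."""

    def __init__(self, base_delay: float = 1.0, jitter: float = 0.5):
        self.base_delay = base_delay  # minimum seconds between requests
        self.jitter = jitter          # extra random delay to look less mechanical
        self.last_request = None

    def wait(self) -> float:
        """Sleep until the randomized delay has elapsed; return the pause used."""
        delay = self.base_delay + random.uniform(0, self.jitter)
        pause = 0.0
        if self.last_request is not None:
            elapsed = time.monotonic() - self.last_request
            pause = max(0.0, delay - elapsed)
            if pause:
                time.sleep(pause)
        self.last_request = time.monotonic()
        return pause
```

Call `limiter.wait()` before each request to a given domain; keep one limiter per domain so slow sites don't throttle fast ones.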
3. Identify Yourself
Some sites appreciate knowing who's scraping them. Scrappy uses a clear user agent that identifies the request as a bot, making it easy for site owners to contact us if needed.
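If you're making requests yourself, a descriptive User-Agent header is a one-line courtesy. A sketch using the standard library (the bot name and contact URL are placeholders; substitute your own):

```python
from urllib.request import Request

# A clear User-Agent tells site owners who you are and how to reach you.
headers = {"User-Agent": "MyLeadBot/1.0 (+https://example.com/bot-info)"}
req = Request("https://example.com/contact", headers=headers)
```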
4. Cache When Possible
Don't re-scrape data you already have. Scrappy caches:
- Page content for 24 hours
- Email validation results for 30 days
- Company information indefinitely
This reduces load on target sites and saves you credits.
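The caching idea is simple to replicate in your own code: store each result with an expiry time and refuse to serve it afterwards. A minimal in-memory sketch (Scrappy's actual cache is persistent; this is just the concept):

```python
import time

class TTLCache:
    """In-memory cache with per-entry time-to-live, so data isn't re-scraped needlessly."""

    def __init__(self):
        self._store = {}

    def set(self, key, value, ttl_seconds: float):
        self._store[key] = (value, time.monotonic() + ttl_seconds)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires = entry
        if time.monotonic() > expires:
            del self._store[key]  # expired: drop it and re-fetch next time
            return None
        return value
```

With this pattern, page content might get a 24-hour TTL and validation results a 30-day TTL, mirroring the schedule above.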
Legal Considerations
GDPR Compliance
If you're collecting data on people in the EU:
- Only collect data that's publicly available
- Document your legitimate interest
- Provide opt-out mechanisms
- Honor data deletion requests
Scrappy is fully GDPR compliant and helps you stay compliant too.
CAN-SPAM and CCPA
When using scraped emails:
- Include your physical address
- Provide clear unsubscribe options
- Honor opt-out requests within 10 days
- Don't sell personal data without consent
Terms of Service
Some websites explicitly prohibit scraping in their terms of service. Enforceability varies by jurisdiction and circumstance, so it's worth knowing what a site's ToS says before you scrape it.
Technical Best Practices
Target the Right Pages
Don't scrape entire websites. Target specific pages:
- /team or /about - Team member info
- /contact - Contact details
- /leadership - Executive information
Handle Dynamic Content
Modern sites use JavaScript. Scrappy uses Steel Browser, a headless browser that:
- Executes JavaScript
- Waits for content to load
- Handles infinite scroll
- Bypasses basic bot detection
Extract Clean Data
Raw HTML is messy. Scrappy extracts:
- Valid email addresses only
- Phone numbers in consistent formats
- Social media links
- Company metadata
Validate Everything
Never trust scraped data blindly:
- Validate email syntax and deliverability
- Verify phone number formats
- Check for spam traps and honeypots
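A syntax check is the cheap first filter before any deliverability test. A loose sketch (full deliverability needs an MX lookup or a verification service, which is beyond a regex; the pattern below is deliberately permissive, not RFC-complete):

```python
import re

# Loose structural check only: local part, "@", domain with at least one dot.
EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")

def looks_like_email(candidate: str) -> bool:
    """Cheap pre-filter: does this string even have the shape of an email?"""
    return bool(EMAIL_RE.match(candidate))
```

Anything that fails this test can be discarded immediately; anything that passes still needs a real deliverability check before it goes into a campaign.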
Common Mistakes to Avoid
Mistake 1: Scraping Too Fast
Wrong: 100 requests per second
Right: 1-2 requests per second, with delays
Mistake 2: Ignoring Errors
Wrong: Keep retrying failed requests indefinitely
Right: Exponential backoff with max retries
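Exponential backoff with a retry cap can be sketched in a few lines (the delay values and retry count are illustrative defaults, and `fetch` stands in for whatever request function you use):

```python
import random
import time

def fetch_with_backoff(fetch, max_retries: int = 4, base: float = 1.0):
    """Retry a fetch callable with exponential backoff plus jitter, then give up."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error instead of looping forever
            # Waits base, 2*base, 4*base, ... plus jitter so many clients
            # retrying at once don't all hit the server in the same instant.
            time.sleep(base * (2 ** attempt) + random.uniform(0, 0.1))
```

The jitter matters more than it looks: without it, a fleet of scrapers that failed together will retry together, recreating the spike that caused the failure.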
Mistake 3: Not Handling Edge Cases
Watch for:
- 404 pages
- Redirects
- CAPTCHAs
- Login walls
- Rate limit responses (429)
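One way to make sure none of these cases slips through is to route every response through a single status-code policy. A simplified sketch (the category names and the exact policy are illustrative choices, not Scrappy's internals):

```python
def classify_response(status: int) -> str:
    """Decide how to handle a response by HTTP status code."""
    if status == 200:
        return "parse"
    if status in (301, 302, 307, 308):
        return "follow_redirect"
    if status == 404:
        return "skip"        # page is gone; retrying won't help
    if status == 429:
        return "back_off"    # rate limited; slow down before retrying
    if status in (401, 403):
        return "skip"        # login wall or forbidden; respect it
    return "retry_later"     # transient server errors and everything else
```

CAPTCHAs typically arrive as a 200 with challenge markup, so they need a content check on top of this status dispatch.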
Mistake 4: Storing Sensitive Data
Never scrape or store:
- Passwords or credentials
- Payment information
- Health records
- Private communications
Scrappy's Built-in Protections
We've built these best practices into Scrappy:
| Feature | What It Does |
|---------|--------------|
| Rate limiting | Automatic delays between requests |
| robots.txt respect | Checks and follows site rules |
| Error handling | Graceful retry with backoff |
| Data validation | Cleans and validates all output |
| GDPR compliance | Built-in consent and deletion |
| Proxy rotation | Distributes requests across IPs |
When to Use Scraping
Good use cases:
- Building B2B lead lists from public directories
- Competitive research from public sources
- Aggregating publicly available contact info
Bad use cases:
- Scraping private member areas
- Bypassing paywalls
- Collecting personal data without consent
- Overloading small business websites
The Bottom Line
Web scraping can be legal and ethical when done right. Respect websites, follow the rules, and use data responsibly.
Scrappy makes it easy to scrape ethically while still getting the data you need to grow your business.