Overview
This guide provides a step-by-step setup for a Scrapy project that uses TrustedProxies for rotating proxy support. You will set up a virtual environment, install dependencies, configure middleware, write a spider, and test the integration.
What is Scrapy?
Scrapy is a Python framework for fast, scalable web scraping. It handles asynchronous requests, data parsing, and export, making it ideal for large-scale web data extraction tasks.
Step-by-Step Setup:
Step 1: Create Virtual Environment
python3 -m venv scrapy_env
source scrapy_env/bin/activate
For Windows: scrapy_env\Scripts\activate
Step 2: Install Scrapy & Rotating Proxy Middleware
pip install scrapy scrapy-rotating-proxies
Step 3: Create Scrapy Project
scrapy startproject chair_scraper
cd chair_scraper
Project structure:
chair_scraper/
├── chair_scraper/
│   ├── middlewares.py
│   ├── settings.py
│   └── spiders/
│       └── chair_products_rotating.py
└── scrapy.cfg
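If you prefer scaffolding over creating the spider file by hand, scrapy genspider can generate the stub that Step 7 then fills in (run it from inside the project directory):

scrapy genspider chair_products_rotating customers.trustedproxies.com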
Step 4: Trusted Proxy List in settings.py
TrustedProxies provides a list of authenticated proxies. Configure them like so:
ROTATING_PROXY_LIST = [
    "http://testuser:password@shp-testuser-us-v00001.tp-ns.com:27281",
    "http://testuser:password@shp-testuser-us-v00002.tp-ns.com:27281",
    "http://testuser:password@shp-testuser-us-v00003.tp-ns.com:27281",
]
Scrapy will rotate through these proxies automatically; if one fails or is detected as banned, the request is retried through another.
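If you would rather not hardcode credentials in settings.py, scrapy-rotating-proxies also accepts a plain-text file of proxies via ROTATING_PROXY_LIST_PATH (the file name below is an assumption; the file holds one proxy URL per line):

# settings.py -- load the proxy list from a file instead of ROTATING_PROXY_LIST
ROTATING_PROXY_LIST_PATH = "proxies.txt"  # hypothetical file, one proxy URL per line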
Step 5: Random User-Agent Middleware
Some websites block scrapers by detecting default or repeated browser headers. To randomize headers:
Add to middlewares.py:
import random

class RandomUserAgentMiddleware:
    def process_request(self, request, spider):
        ua = random.choice(spider.settings.getlist("USER_AGENTS"))
        # Assign directly (not setdefault) so the random UA wins even if
        # Scrapy's built-in UserAgentMiddleware has already set a default.
        request.headers["User-Agent"] = ua
Add to settings.py:
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/137.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/138.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/137.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/138.0.0.0 Safari/537.36",
]
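To verify the randomization is actually taking effect, one option is a small throwaway spider pointed at an echo endpoint such as httpbin.org, which reflects request headers back as JSON (a minimal sketch; httpbin.org is an external service used here for illustration, and response.json() requires Scrapy 2.2+):

import scrapy

class UACheckSpider(scrapy.Spider):
    name = "ua_check"

    def start_requests(self):
        # Hit the same echo endpoint several times; dont_filter bypasses
        # the duplicate-request filter so every request actually goes out.
        for _ in range(5):
            yield scrapy.Request("https://httpbin.org/headers", dont_filter=True)

    def parse(self, response):
        # httpbin echoes the headers it received, including the User-Agent
        self.logger.info(response.json()["headers"].get("User-Agent"))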
Step 6: Enable All Middlewares in settings.py
Ensure these middleware settings are included:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'chair_scraper.middlewares.RandomUserAgentMiddleware': 400,
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}
RETRY_TIMES = 5
DOWNLOAD_TIMEOUT = 30
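Ban detection can also be tuned: scrapy-rotating-proxies lets you point ROTATING_PROXY_BAN_POLICY at a custom policy class. The sketch below (class name and module path are assumptions, not part of the project generated above) treats explicit HTTP 403 responses as bans on top of the library's default heuristics:

# chair_scraper/policy.py (hypothetical module)
from rotating_proxies.policy import BanDetectionPolicy

class StrictBanPolicy(BanDetectionPolicy):
    def response_is_ban(self, request, response):
        # Flag 403s as bans in addition to the default checks
        return response.status == 403 or super().response_is_ban(request, response)

Then reference it in settings.py:

ROTATING_PROXY_BAN_POLICY = "chair_scraper.policy.StrictBanPolicy"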
Step 7: Create the Spider
File: chair_scraper/spiders/chair_products_rotating.py
import scrapy

class ChairProductsRotatingSpider(scrapy.Spider):
    name = "chair_products_rotating"
    start_urls = ["https://customers.trustedproxies.com/downloads/demo_products.html"]

    def parse(self, response):
        for product in response.css(".product"):
            yield {
                "title": product.css(".title::text").get(),
                "price": product.css(".price::text").get(),
                "description": product.css(".description::text").get(),
            }
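Before running a full crawl, the CSS selectors can be sanity-checked interactively in Scrapy's shell:

scrapy shell "https://customers.trustedproxies.com/downloads/demo_products.html"
>>> response.css(".product .title::text").getall()
>>> response.css(".product .price::text").getall()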
Step 8: Run the Spider
After creating and configuring your spider, you’re ready to run it and start scraping data.
In this example, the spider targets a demo page hosted by TrustedProxies at:
https://customers.trustedproxies.com/downloads/demo_products.html
This page contains sample product listings, which your spider will crawl to extract product details like title, price, and description.
How to run the spider:
- Activate your virtual environment (if not already active):
  source scrapy_env/bin/activate
  (For Windows: scrapy_env\Scripts\activate)
- Navigate to your Scrapy project root directory, where the scrapy.cfg file is located.
- Run the spider with the following command:
  scrapy crawl chair_products_rotating -o chairs.csv

Here, scrapy crawl runs the spider, chair_products_rotating is the spider’s name you defined, and -o chairs.csv exports the scraped data into a CSV file named chairs.csv.
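Scrapy infers the export format from the output file's extension, so the same command can write JSON instead of CSV:

scrapy crawl chair_products_rotating -o chairs.json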
What happens during execution:
- Scrapy visits the demo products page.
- It parses the HTML and extracts each product’s title, price, and description.
- The scraped data is saved into chairs.csv in a structured tabular format.
Sample Output:
The resulting chairs.csv file will look like:
title,price,description
"Stylish Wooden Chair","$129.99","Beautiful and sturdy chair crafted from oak wood."
"Ergonomic Office Chair","$199.00","Comfortable office chair with adjustable height."
...
Running the spider on this demo page is a simple way to validate your Scrapy setup, proxy rotation, and data extraction logic before scaling up to production scraping tasks.
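As a final sanity check, a few lines of standard-library Python can confirm the CSV loaded cleanly (a minimal sketch; the column names match the fields yielded by the spider):

import csv

with open("chairs.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

print(f"Scraped {len(rows)} products")
for row in rows[:3]:
    print(row["title"], "-", row["price"])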