Scrape Amazon With Python

Amazon.com, the king of the e-commerce industry, has a market cap of over $1.8 trillion, placing it among the top five most valuable companies in the world. Its sheer scale shows in its logistics: it delivers more than 1.6 million packages per day worldwide.

For data scrapers, one of the most fascinating statistics is that Amazon lists more than 353 million products on its website, making it the largest product inventory in the world.

Moreover, Amazon holds the largest share of the e-commerce industry, and Amazon price scraping reportedly accounts for nearly 50% of all web scraping activity.

In this article, we will scrape Amazon product data using Python and explore how EcommerceAPI’s Amazon Product API can streamline the process.

Setting Up the Scraper

Our first step in setting up the scraper is to decide which details we are going to scrape from the Amazon product page:

  • Product name
  • Product Features
  • Product Pricing
  • Product Rating
  • Product Images

The next step is to install two third-party Python libraries: one to fetch the HTML and one to parse quality data out of it.

  • Requests — A popular third-party Python library for making HTTP requests. It fetches the raw HTML for us, but extracting refined data from it is the job of the second library.
  • Beautiful Soup — A lightweight HTML parsing library, similar to Cheerio in JavaScript.
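
Both libraries can be installed from your terminal with pip:

pip install requests beautifulsoup4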

Now, you can create a Python file to try out the following code:

import requests
from bs4 import BeautifulSoup

scraping_url = "https://www.amazon.com/SAMSUNG-32-Inch-Tracking-Xcelerator-QN32Q60D/dp/B0CV9MGX22/"

resp = requests.get(scraping_url)

print(resp.text)

Save it and run it from the project terminal.

You might get this message from Amazon.

Sorry, Something went wrong on our end!

So, Amazon's anti-bot system served us a CAPTCHA and blocked the request, because it detected that the request came from a script rather than a real browser.
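
Before fixing this, it helps to detect when you have been blocked. Here is a minimal sketch, assuming the block page contains Amazon's usual automated-access notice (treat the marker string as an assumption; it may vary by region and over time):

import requests

resp = requests.get("https://www.amazon.com/SAMSUNG-32-Inch-Tracking-Xcelerator-QN32Q60D/dp/B0CV9MGX22/")

# A non-200 status or the robot-check notice both indicate a block;
# "api-services-support@amazon.com" commonly appears on Amazon's CAPTCHA
# page, but verify this marker against the pages you actually receive.
if resp.status_code != 200 or "api-services-support@amazon.com" in resp.text:
    print("Blocked: Amazon served an error or CAPTCHA page")
else:
    print("Got the product page HTML")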

Getting past this wall requires passing browser-like headers along with the HTTP request.

Let’s open the Amazon Product page in the browser and check the Network Tab for the headers sent to amazon.com. 

You will see several headers, the most important being the User-Agent. We will also pass other headers, such as Accept and Accept-Language, to be safe.

import requests

url = "https://www.amazon.com/SAMSUNG-32-Inch-Tracking-Xcelerator-QN32Q60D/dp/B0CV9MGX22/"

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
    "Accept-Language": "en-US,en;q=0.9"
}

resp = requests.get(url, headers=headers)

print(resp.text)

Running the request with these headers helps your scraper get past Amazon's bot protection and retrieve the precious data behind the wall.

Web Scraping Tip: If you are scraping for mass data collection, try to collect as many user agents as possible and rotate them with each request.

Scraping Amazon Product Data

We have already learned how to fetch the HTML. In this section, we will analyze the HTML structure of the Amazon page, since it is important to examine the structure before writing any selectors.

We will use the browser's inspector tool to locate each data point in the HTML. This is common practice; you may already know it if you are familiar with web scraping.

Open this product URL, https://www.amazon.com/SAMSUNG-32-Inch-Tracking-Xcelerator-QN32Q60D/dp/B0CV9MGX22 in your browser, right-click on the product title, and choose Inspect.

Finding the Title

After inspecting the title, you will see that it sits in a span tag with the ID productTitle, inside the page's h1 heading.

Web Scraping Tip: Always search for a unique ID or a class representing a particular entity.

In this case, we got the unique ID for the title, which makes our work easy. Let us get back to our project file and add this to our code.

import requests
from bs4 import BeautifulSoup

url = "https://www.amazon.com/SAMSUNG-32-Inch-Tracking-Xcelerator-QN32Q60D/dp/B0CV9MGX22/"

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
    "Accept-Language": "en-US,en;q=0.9"
}

resp = requests.get(url, headers=headers)

soup = BeautifulSoup(resp.text, 'html.parser')

product_title = soup.select_one('#productTitle').text

print(product_title)

We also imported the second library we talked about, i.e., BeautifulSoup. 

Next, in the line soup = BeautifulSoup(resp.text, 'html.parser'), we created a BS4 object from the response received from the HTTP request.

After that, using the BS4 object, soup, we selected the product title using its unique ID productTitle.

Running this code will get you the product title, but it will be padded with whitespace. To clean it up, we will use Python's strip() method.

product_title = soup.select_one('#productTitle').text.strip()

This will give you the following response.

SAMSUNG 32-Inch Class QLED 4K Q60D Series Quantum HDR Smart TV w/Object Tracking Sound Lite, Motion Xcelerator, Slim Design, Gaming Hub, Alexa Built-in (QN32Q60D, 2024 Model)

Finding the Pricing

If you examine the HTML, the pricing appears in several places. We will only extract the price displayed below the rating.

product_pricing = soup.select_one('span.a-price span').text.strip()
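
Note that select_one() returns None when nothing matches, so chaining .text directly will raise an AttributeError if the price is missing. As a hedged alternative, Amazon pages usually also carry the full price string in a visually hidden span with the class a-offscreen (worth verifying against the page you are scraping):

# Guarded variant; a-offscreen typically holds the complete price text, e.g. "$297.99"
price_el = soup.select_one('span.a-price span.a-offscreen')
product_pricing = price_el.text.strip() if price_el else None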

Finding the Rating 

The rating sits just below the title. Inspecting the HTML shows it inside the span tag with the ID acrPopover.

Let us write the code to get the product rating.

product_rating = soup.select_one('#acrPopover .a-color-base').text.strip()
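
The same None guard applies here, and converting the text to a float makes the rating usable for sorting or analysis. A small sketch:

rating_el = soup.select_one('#acrPopover .a-color-base')
# e.g. "4.6" -> 4.6; None if the element is missing
product_rating = float(rating_el.text.strip()) if rating_el else None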

Finding the Features

Similar to the title, the product features can be found inside the div tag with the unique ID feature-bullets.

The bullet points are inside the unordered list tag, i.e., <ul> tag.

feature_bullets = [li.get_text(strip=True) for li in soup.select('#feature-bullets li')]

Finding the Images

Finding the images is trickier than the other data points. Why? Let's proceed and find out.

The image thumbnails sit inside list items within the div container with the ID altImages.

all_images = []
image_elements = soup.select("#altImages li")

for el in image_elements:
    img_tag = el.find("img")
    if img_tag:
        img_src = img_tag.get("src")
        all_images.append(img_src)

However, when you run this code, you may get fewer images than the six we saw on the page. This is because Amazon loads the remaining images dynamically through an AJAX call, so a simple GET request never returns all of them in the initial HTML.

Fortunately, there is an alternative:

  1. Copy any of the product image URLs.
  2. Then open the page source of the web page (right-click and choose View Page Source).
  3. Search for the image URL in the page source; you will find that each of these image URLs appears as the value of a hiRes key.

So, now we only have to target the values of the hiRes key, i.e., "hiRes": "url".

import re

images = re.findall('"hiRes":"(.+?)"', resp.text)
print(images)

After importing the built-in re module, we call re.findall() on the response body, using the non-greedy capturing group (.+?) to match each URL between the quotes.

When you run this code, you will get high-resolution images related to the product.
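
One caveat: hiRes values can repeat across the color or style variants embedded in the same page, so deduplicating while preserving order is a sensible final step:

# dict.fromkeys() keeps first occurrences and preserves insertion order
unique_images = list(dict.fromkeys(images))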

This is how we can scrape each of these data entities. We haven't attempted to retrieve every piece of data on the page, but that doesn't make the remaining entities any less important: companies can use customer reviews for natural language processing and sentiment analysis, and the country of origin tells customers where a product comes from. Every entity has meaning and a potential commercial use.

Complete Code

Feel free to customize the code to add new features and retrieve more data, as there is still a ton of useful information on the page.

import requests
from bs4 import BeautifulSoup
import re

url = "https://www.amazon.com/SAMSUNG-32-Inch-Tracking-Xcelerator-QN32Q60D/dp/B0CV9MGX22/"

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
    "Accept-Language": "en-US,en;q=0.9"
}

try:
    # Pass the headers defined above; without them Amazon blocks the request
    resp = requests.get(url, headers=headers)
    resp.raise_for_status()

    # Check if the request was successful
    print(f"Request status code: {resp.status_code}")

    soup = BeautifulSoup(resp.text, 'html.parser')

    # Extract title
    try:
        title = soup.find('span', {'id': 'productTitle'}).text.strip()
    except AttributeError:
        title = None
        print("Error: Title not found")

    # Extract pricing
    try:
        pricing = soup.find("span", {"class": "a-price"}).find("span").text.strip()
    except AttributeError:
        pricing = None
        print("Error: Pricing not found")

    # Extract rating
    try:
        rating = soup.select_one('#acrPopover .a-color-base').text.strip()
        # Convert the text to a float
        average_rating = float(rating) if rating else None
    except (AttributeError, ValueError):
        rating = None
        average_rating = None
        print("Error: Rating not found")

    # Extract specifications
    specifications = {}
    try:
        product_attrs = []
        product_values = []
        for el in soup.find_all("tr", class_="a-spacing-small"):
            product_attrs.append(el.select_one("span.a-text-bold").text.strip())
            product_values.append(el.select_one(".po-break-word").text.strip())
        specifications = dict(zip(product_attrs, product_values))
    except AttributeError:
        print("Error: Specifications not found")

    # Extract product description
    try:
        product_description = soup.select_one('#productDescription').text.strip()
    except AttributeError:
        product_description = None
        print("Error: Product description not found")

    # Extract images from the hiRes keys embedded in the page source
    images = re.findall('"hiRes":"(.+?)"', resp.text)

    # Print extracted data
    print(f"Title: {title}")
    print(f"Pricing: {pricing}")
    print(f"Rating: {rating}")
    print(f"Average Rating: {average_rating}")
    print(f"Specifications: {specifications}")
    print(f"Product Description: {product_description}")
    print(f"Images: {images}")

except requests.exceptions.RequestException as e:
    print(f"Error: {e}")

However, after a certain number of requests, Amazon will start blocking your IP and you won't be able to get the required information. To avoid this, let's go over some best practices for scraping Amazon.

Best Practices

Scraping Amazon has never been easy, and extracting data at scale requires significant infrastructure. Different marketplaces and product page structures may each demand an entirely different setup to achieve high precision and accuracy.

Taking these factors into account, here are some recommendations to consider while scraping Amazon:

Utilize Multiple Headers — Amazon's anti-bot protection can easily spot a repeated request pattern. To achieve a higher success rate, use multiple headers and rotate them on each request.

import random

user_agents = [
'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.4844.443 Safari/537.36',
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.4187.505 Safari/537.36',
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.5280.694 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.361681239886',
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.6335.644 Safari/537.36',
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.4789.627 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.7386.727 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.361681239913',
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.4833.299 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.4920.655 Safari/537.36',
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.1216.392 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.8953.494 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.1221.499 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.361681239909',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 12_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.1186.635 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.4288.265 Safari/537.36',
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.7732.275 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.7423.805 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.4.288.10 Safari/537.36',
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.2764.721 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.5624.209 Safari/537.36',
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.4837.210 Safari/537.36',
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.4666.186 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.3207.378 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.361681239843',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.361681239895',
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.1413.941 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.4324.430 Safari/537.36',
]

headers = {
    "User-Agent": random.choice(user_agents),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
    "Accept-Language": "en-US,en;q=0.9"
}

Here, we loaded a pool of recent user agents into the code and used the random library to pick one for each request. This increases the longevity of your scraper, letting you scrape more pages from a single IP.

Rotate IPs — A single IP is not sufficient and will quickly become an obstacle in your data-scraping journey. A simple solution is to collect a pool of proxy IPs and rotate them on each request to make your crawler far more resilient, as in the sketch below.
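
Here is a minimal sketch of IP rotation with the requests library. The proxy endpoints below are placeholders; substitute your own pool from a proxy provider:

import random
import requests

# Placeholder proxy endpoints -- replace with real ones from your provider
proxy_pool = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
    "http://user:pass@proxy3.example.com:8000",
]

proxy = random.choice(proxy_pool)
resp = requests.get(
    "https://www.amazon.com/SAMSUNG-32-Inch-Tracking-Xcelerator-QN32Q60D/dp/B0CV9MGX22/",
    headers=headers,  # reuse the rotating headers from above
    # route both HTTP and HTTPS traffic through the chosen proxy
    proxies={"http": proxy, "https": proxy},
    timeout=15,
)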

Still, these solutions alone won't let you scrape Amazon at scale. For thousands or millions of pages, you need a reliable and efficient scraping API.

Using EcommerceAPI for scraping Amazon

An alternative is to use EcommerceAPI's Amazon Scraper API as your data provider. It offers several advantages:

  1. The API manages header rotation for you.
  2. Localized results from every country on Earth.
  3. Refined results for every Amazon page, including Search, Product, Reviews, etc.
  4. Rotating residential and data center proxy IPs to bypass CAPTCHA verification.

Fair enough?

OK, so let's try it before drawing any conclusions.

First, let us register on the website to get our API Key.

After successful registration, get your API Key from the dashboard.

Then, copy the API key into the code below and run it from your project terminal.

import requests

payload = {'api_key': 'APIKEY', 'domain': 'com', 'asin': 'B0CV9MGX22'}
resp = requests.get('https://api.ecommerceapi.io/amazon_product', params=payload)
print(resp.text)

This returns a JSON response containing the structured product data.
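
Since the response is JSON, you can parse it directly with resp.json(). The field names below are illustrative assumptions; check the provider's documentation for the exact schema:

data = resp.json()

# Hypothetical keys -- consult the API docs for the real field names
print(data.get("title"))
print(data.get("pricing"))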

Pretty straightforward, right?

Yeah, it is, because we handle all the difficulties of scraping Amazon on our backend.

Conclusion

In this tutorial, we learned to scrape Amazon using Python. We built a custom scraper that rotates headers and bypasses Amazon's protection wall. However, if you plan to use your own scraper to extract data at scale, expect to invest considerably more effort.

Alternatively, you can choose the simplest solution, EcommerceAPI's Amazon Scraper API. Read the documentation to get started with scraping Amazon at scale.

I hope you liked this blog. Feel free to message me with anything that needs clarification. Follow me on Twitter. Thanks for reading!

Additional Resources

  1. Create An Application For Live Price Tracking
  2. What is Amazon Data Scraping
  3. Best Amazon Scraper APIs
  4. Scraping E-commerce Website With Python
  5. Top 5 Tech Trends in E-commerce Industry