How to Build an Advanced BrightData Web Scraper for AI-Driven Data Extraction with Google Gemini

In this tutorial, we walk through building an enhanced web scraping tool that pairs BrightData's powerful proxy network with Google's Gemini API for intelligent data extraction. You will learn how to set up the Python project, install and import the required libraries, and encapsulate the scraping logic inside a clean, reusable BrightDataScraper class. Whether you are targeting Amazon product pages, bestseller lists, or LinkedIn profiles, the scraper's modular methods show how to configure scraping parameters, handle errors gracefully, and return structured JSON results. An optional ReAct-style AI agent integration also demonstrates how to combine LLM-driven reasoning with real-time scraping, letting you pose natural language queries for on-the-fly data analysis.
!pip install langchain-brightdata langchain-google-genai langgraph langchain-core google-generativeai
We install all the key libraries needed for the tutorial in a single step: langchain-brightdata for BrightData web scraping, langchain-google-genai and google-generativeai for Google Gemini integration, langgraph for agent orchestration, and langchain-core for the foundational LangChain framework.
import os
import json
from typing import Dict, Any, Optional
from langchain_brightdata import BrightDataWebScraperAPI
from langchain_google_genai import ChatGoogleGenerativeAI
from langgraph.prebuilt import create_react_agent
These imports prepare your environment and core functionality: os and json handle system operations and data serialization, while typing provides structured type hints. You then bring in BrightDataWebScraperAPI for BrightData scraping, ChatGoogleGenerativeAI to interact with Google's Gemini LLM, and create_react_agent to orchestrate these components in a ReAct-style agent.
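If you want to confirm the Gemini side is wired up before building the scraper, a single round trip through the chat model is enough. The snippet below is a minimal sketch that reuses the imports above; it assumes you have exported a valid GOOGLE_API_KEY in your environment.

# Quick sanity check that the Gemini key works (reuses the imports above)
llm = ChatGoogleGenerativeAI(
    model="gemini-2.0-flash",
    google_api_key=os.environ["GOOGLE_API_KEY"],  # assumed to be set in your shell
)
print(llm.invoke("Reply with OK if you can read this.").content)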
class BrightDataScraper:
    """Enhanced web scraper using BrightData API"""

    def __init__(self, api_key: str, google_api_key: Optional[str] = None):
        """Initialize scraper with API keys"""
        self.api_key = api_key
        self.scraper = BrightDataWebScraperAPI(bright_data_api_key=api_key)

        if google_api_key:
            self.llm = ChatGoogleGenerativeAI(
                model="gemini-2.0-flash",
                google_api_key=google_api_key
            )
            self.agent = create_react_agent(self.llm, [self.scraper])

    def scrape_amazon_product(self, url: str, zipcode: str = "10001") -> Dict[str, Any]:
        """Scrape Amazon product data"""
        try:
            results = self.scraper.invoke({
                "url": url,
                "dataset_type": "amazon_product",
                "zipcode": zipcode
            })
            return {"success": True, "data": results}
        except Exception as e:
            return {"success": False, "error": str(e)}
    def scrape_amazon_bestsellers(self, region: str = "in") -> Dict[str, Any]:
        """Scrape Amazon bestsellers"""
        try:
            # The original article's link was stripped; this bestsellers URL pattern is an assumption
            url = f"https://www.amazon.{region}/gp/bestsellers/"
            results = self.scraper.invoke({
                "url": url,
                "dataset_type": "amazon_product"
            })
            return {"success": True, "data": results}
        except Exception as e:
            return {"success": False, "error": str(e)}
    def scrape_linkedin_profile(self, url: str) -> Dict[str, Any]:
        """Scrape LinkedIn profile data"""
        try:
            results = self.scraper.invoke({
                "url": url,
                "dataset_type": "linkedin_person_profile"
            })
            return {"success": True, "data": results}
        except Exception as e:
            return {"success": False, "error": str(e)}

    def run_agent_query(self, query: str) -> None:
        """Run AI agent with natural language query"""
        if not hasattr(self, 'agent'):
            print("Error: Google API key required for agent functionality")
            return

        try:
            for step in self.agent.stream(
                {"messages": query},
                stream_mode="values"
            ):
                step["messages"][-1].pretty_print()
        except Exception as e:
            print(f"Agent error: {e}")
    def print_results(self, results: Dict[str, Any], title: str = "Results") -> None:
        """Pretty print results"""
        print(f"\n{'='*50}")
        print(f"{title}")
        print(f"{'='*50}")

        if results["success"]:
            print(json.dumps(results["data"], indent=2, ensure_ascii=False))
        else:
            print(f"Error: {results['error']}")
        print()
The BrightDataScraper class encapsulates all of the BrightData web scraping logic, plus optional Gemini-powered intelligence, behind a single reusable interface. Its methods let you easily fetch Amazon product details, bestseller lists, and LinkedIn profiles, taking care of API calls, error handling, and JSON formatting, and they can even stream natural language "agent" queries when a Google API key is provided. A convenient print_results helper ensures your output is always cleanly formatted for inspection.
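Before running the full demo, a quick smoke test of the class can be helpful. The snippet below is a minimal sketch, assuming the class above is already defined and that your keys are exported as the BRIGHT_DATA_API_KEY and GOOGLE_API_KEY environment variables; the product URL is a placeholder for illustration, not one from the original article.

# Minimal smoke test of the BrightDataScraper class defined above
scraper = BrightDataScraper(
    api_key=os.environ["BRIGHT_DATA_API_KEY"],
    google_api_key=os.environ.get("GOOGLE_API_KEY"),  # optional; enables the agent
)

# Hypothetical product URL used purely for illustration
result = scraper.scrape_amazon_product("https://www.amazon.com/dp/EXAMPLE_ASIN", zipcode="10001")
scraper.print_results(result, "Smoke Test: Amazon Product")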
def main():
    """Main execution function"""
    BRIGHT_DATA_API_KEY = "Use Your Own API Key"
    GOOGLE_API_KEY = "Use Your Own API Key"

    scraper = BrightDataScraper(BRIGHT_DATA_API_KEY, GOOGLE_API_KEY)

    print("🛍️ Scraping Amazon India Bestsellers...")
    bestsellers = scraper.scrape_amazon_bestsellers("in")
    scraper.print_results(bestsellers, "Amazon India Bestsellers")

    print("📦 Scraping Amazon Product...")
    # The original article's product link was stripped; substitute a real product URL here
    product_url = "https://www.amazon.com/dp/EXAMPLE_ASIN"
    product_data = scraper.scrape_amazon_product(product_url, "10001")
    scraper.print_results(product_data, "Amazon Product Data")

    print("👤 Scraping LinkedIn Profile...")
    # The original article's profile link was stripped; substitute a real profile URL here
    linkedin_url = "https://www.linkedin.com/in/example-profile/"
    linkedin_data = scraper.scrape_linkedin_profile(linkedin_url)
    scraper.print_results(linkedin_data, "LinkedIn Profile Data")

    print("🤖 Running AI Agent Query...")
    # Reuse the placeholder product URL above, since the original link was stripped
    agent_query = f"""
    Scrape Amazon product data for {product_url}
    in New York (zipcode 10001) and summarize the key product details.
    """
    scraper.run_agent_query(agent_query)
The main() function ties everything together by setting the BrightData and Google API keys, instantiating BrightDataScraper, and then demonstrating each feature: it scrapes the Amazon India bestsellers, fetches details for a specific product, retrieves a LinkedIn profile, and finally runs a natural language agent query, printing neatly formatted results after each step.
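If you want to keep the scraped data rather than only printing it, the returned dictionaries are easy to persist for downstream use. This is a small optional sketch, not part of the original script, that writes a successful result to disk with the json module imported earlier; the save_results helper and file names are hypothetical.

def save_results(results: Dict[str, Any], path: str = "scraped_data.json") -> None:
    """Persist a successful scrape to a JSON file; skip failed scrapes."""
    if results.get("success"):
        with open(path, "w", encoding="utf-8") as f:
            json.dump(results["data"], f, indent=2, ensure_ascii=False)
    else:
        print(f"Not saving, scrape failed: {results.get('error')}")

# Example usage inside main(): save_results(bestsellers, "amazon_in_bestsellers.json")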
if __name__ == "__main__":
    print("Installing required packages...")
    os.system("pip install -q langchain-brightdata langchain-google-genai langgraph")

    os.environ["BRIGHT_DATA_API_KEY"] = "Use Your Own API Key"

    main()
Finally, this entry-point block ensures that, when the file is run as a standalone script, the required scraping libraries are quietly installed and the BrightData API key is set in the environment before the main function is executed to kick off all scraping and agent workflows.
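Hard-coding keys is fine for a quick demo, but outside a notebook it is safer to read them from the environment instead. A minimal sketch, assuming you export BRIGHT_DATA_API_KEY and (optionally) GOOGLE_API_KEY before running the script:

# Read API keys from the environment rather than hard-coding them
bright_data_key = os.environ.get("BRIGHT_DATA_API_KEY")
google_key = os.environ.get("GOOGLE_API_KEY")  # optional; only needed for agent queries

if not bright_data_key:
    raise RuntimeError("Set BRIGHT_DATA_API_KEY before running the scraper")

scraper = BrightDataScraper(bright_data_key, google_key)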
In short, by the end of this tutorial you will have a ready-to-use Python script that automates tedious data collection tasks, abstracts away low-level API details, and optionally taps into generative AI for advanced query handling. You can extend this foundation by adding support for other dataset types, integrating additional LLMs, or deploying the scraper as part of a larger data pipeline or web service. With these building blocks in place, you are equipped to gather, analyze, and present web data, whether for market research, competitive intelligence, or custom AI-powered applications.
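As one example of extending the foundation, you could expose a generic method so that new BrightData dataset types do not each need their own wrapper. The sketch below follows the same invoke pattern used throughout the class; the scrape_dataset name is hypothetical, and any extra parameters or dataset_type values you pass should be checked against the BrightData documentation.

def scrape_dataset(self, url: str, dataset_type: str, **extra: Any) -> Dict[str, Any]:
    """Generic wrapper around the BrightData tool for any supported dataset type."""
    try:
        payload = {"url": url, "dataset_type": dataset_type, **extra}
        results = self.scraper.invoke(payload)
        return {"success": True, "data": results}
    except Exception as e:
        return {"success": False, "error": str(e)}

# Attach it to the existing class without editing it, for example:
# BrightDataScraper.scrape_dataset = scrape_dataset
# profile = scraper.scrape_dataset(url, "linkedin_person_profile")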
