A coding guide to implementing ScrapeGraph and Gemini AI for automated, scalable, insight-driven competitive intelligence and market analytics workflows

In this tutorial, we demonstrate how to combine ScrapeGraph's scraping tools with Gemini AI to automate the collection, parsing, and analysis of competitor information. Using ScrapeGraph's SmartScraperTool and MarkdownifyTool, users can extract detailed insights on product offerings, pricing strategies, technology stacks, and market presence directly from competitor websites. The tutorial then synthesizes these data points into structured, actionable intelligence using Gemini's language model. Throughout the process, ScrapeGraph keeps the raw extraction accurate and scalable, allowing analysts to focus on strategic interpretation rather than manual data collection.
%pip install --quiet -U langchain-scrapegraph langchain-google-genai pandas matplotlib seaborn
We quietly install or upgrade the latest versions of the required libraries: langchain-scrapegraph for advanced web scraping, langchain-google-genai for the Gemini AI integration, and the data-analysis stack of pandas, matplotlib, and seaborn, ensuring the environment is ready for a seamless competitive-intelligence workflow.
import getpass
import os
import json
import pandas as pd
from typing import List, Dict, Any
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns
We import the core Python libraries to set up a secure, data-driven pipeline: getpass and os manage credentials and environment variables, json handles serialized data, and pandas provides powerful DataFrame operations. The typing module supplies type hints for better code clarity, while datetime records analysis timestamps. Finally, matplotlib.pyplot and seaborn give us tools to create insightful visualizations.
if not os.environ.get("SGAI_API_KEY"):
    os.environ["SGAI_API_KEY"] = getpass.getpass("ScrapeGraph AI API key:\n")
if not os.environ.get("GOOGLE_API_KEY"):
    os.environ["GOOGLE_API_KEY"] = getpass.getpass("Google API key for Gemini:\n")
We check whether the SGAI_API_KEY and GOOGLE_API_KEY environment variables are set; if not, the script securely prompts for the ScrapeGraph and Google (Gemini) API keys via getpass and stores them in the environment for subsequent authenticated requests.
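The guard-then-set pattern above can be factored into a small helper. The following sketch (with a hypothetical `ensure_key` helper and a lambda standing in for `getpass.getpass`) shows the key property: re-running the cell never overwrites a value that is already stored.

```python
import os

def ensure_key(name, prompt_fn):
    # Set the variable only when absent, so re-running a notebook cell
    # never clobbers a key that was already entered.
    if not os.environ.get(name):
        os.environ[name] = prompt_fn()
    return os.environ[name]

# Offline stand-in for getpass.getpass, for illustration only:
first = ensure_key("SGAI_API_KEY_DEMO", lambda: "sk-demo-123")
second = ensure_key("SGAI_API_KEY_DEMO", lambda: "never-called")
print(first, second == first)  # the stored value is reused, not replaced
```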
from langchain_scrapegraph.tools import (
    SmartScraperTool,
    SearchScraperTool,
    MarkdownifyTool,
    GetCreditsTool,
)
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableConfig, chain
from langchain_core.output_parsers import JsonOutputParser

smartscraper = SmartScraperTool()
searchscraper = SearchScraperTool()
markdownify = MarkdownifyTool()
credits = GetCreditsTool()

llm = ChatGoogleGenerativeAI(
    model="gemini-1.5-flash",
    temperature=0.1,
    convert_system_message_to_human=True
)
Here we import and instantiate the ScrapeGraph tools, SmartScraperTool, SearchScraperTool, MarkdownifyTool, and GetCreditsTool, for extracting and processing web data, and then configure ChatGoogleGenerativeAI with the "gemini-1.5-flash" model at a low temperature (converting system messages to human messages for compatibility) to drive our analysis. We also bring in ChatPromptTemplate, RunnableConfig, chain, and JsonOutputParser from langchain_core to structure prompts and parse model output.
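Roughly speaking, JsonOutputParser strips an optional markdown fence from the model's reply and decodes the JSON inside. A minimal stdlib stand-in (the `parse_json_reply` helper below is ours, not part of LangChain) illustrates the idea:

```python
import json
import re

def parse_json_reply(text: str):
    # Strip an optional ```json ... ``` fence, then decode the body.
    match = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
    body = match.group(1) if match else text
    return json.loads(body)

reply = '```json\n{"positioning": "enterprise", "threats": ["pricing"]}\n```'
parsed = parse_json_reply(reply)
print(parsed["positioning"], parsed["threats"])
```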
class CompetitiveAnalyzer:
    def __init__(self):
        self.results = []
        self.analysis_timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")

    def scrape_competitor_data(self, url: str, company_name: str = None) -> Dict[str, Any]:
        """Scrape comprehensive data from a competitor website"""
        extraction_prompt = """
        Extract the following information from this website:
        1. Company name and tagline
        2. Main products/services offered
        3. Pricing information (if available)
        4. Target audience/market
        5. Key features and benefits highlighted
        6. Technology stack mentioned
        7. Contact information
        8. Social media presence
        9. Recent news or announcements
        10. Team size indicators
        11. Funding information (if mentioned)
        12. Customer testimonials or case studies
        13. Partnership information
        14. Geographic presence/markets served

        Return the information in a structured JSON format with clear categorization.
        If information is not available, mark as 'Not Available'.
        """
        try:
            result = smartscraper.invoke({
                "user_prompt": extraction_prompt,
                "website_url": url,
            })
            markdown_content = markdownify.invoke({"website_url": url})
            competitor_data = {
                "company_name": company_name or "Unknown",
                "url": url,
                "scraped_data": result,
                "markdown_length": len(markdown_content),
                "analysis_date": self.analysis_timestamp,
                "success": True,
                "error": None
            }
            return competitor_data
        except Exception as e:
            return {
                "company_name": company_name or "Unknown",
                "url": url,
                "scraped_data": None,
                "error": str(e),
                "success": False,
                "analysis_date": self.analysis_timestamp
            }

    def analyze_competitor_landscape(self, competitors: List[Dict[str, str]]) -> Dict[str, Any]:
        """Analyze multiple competitors and generate insights"""
        print(f"🔍 Starting competitive analysis for {len(competitors)} companies...")
        for i, competitor in enumerate(competitors, 1):
            print(f"📊 Analyzing {competitor['name']} ({i}/{len(competitors)})...")
            data = self.scrape_competitor_data(
                competitor['url'],
                competitor['name']
            )
            self.results.append(data)

        analysis_prompt = ChatPromptTemplate.from_messages([
            ("system", """
            You are a senior business analyst specializing in competitive intelligence.
            Analyze the scraped competitor data and provide comprehensive insights including:
            1. Market positioning analysis
            2. Pricing strategy comparison
            3. Feature gap analysis
            4. Target audience overlap
            5. Technology differentiation
            6. Market opportunities
            7. Competitive threats
            8. Strategic recommendations

            Provide actionable insights in JSON format with clear categories and recommendations.
            """),
            ("human", "Analyze this competitive data: {competitor_data}")
        ])

        clean_data = []
        for result in self.results:
            if result['success']:
                clean_data.append({
                    'company': result['company_name'],
                    'url': result['url'],
                    'data': result['scraped_data']
                })

        analysis_chain = analysis_prompt | llm | JsonOutputParser()
        try:
            competitive_analysis = analysis_chain.invoke({
                "competitor_data": json.dumps(clean_data, indent=2)
            })
        except Exception:
            # Fall back to the plain-text reply when the model's output is not valid JSON.
            analysis_chain_text = analysis_prompt | llm
            competitive_analysis = analysis_chain_text.invoke({
                "competitor_data": json.dumps(clean_data, indent=2)
            })

        return {
            "analysis": competitive_analysis,
            "raw_data": self.results,
            "summary_stats": self.generate_summary_stats()
        }

    def generate_summary_stats(self) -> Dict[str, Any]:
        """Generate summary statistics from the analysis"""
        successful_scrapes = sum(1 for r in self.results if r['success'])
        failed_scrapes = len(self.results) - successful_scrapes
        return {
            "total_companies_analyzed": len(self.results),
            "successful_scrapes": successful_scrapes,
            "failed_scrapes": failed_scrapes,
            "success_rate": f"{(successful_scrapes / len(self.results) * 100):.1f}%" if self.results else "0%",
            "analysis_timestamp": self.analysis_timestamp
        }

    def export_results(self, filename: str = None):
        """Export results to JSON and CSV files"""
        if not filename:
            filename = f"competitive_analysis_{datetime.now().strftime('%Y%m%d_%H%M%S')}"
        with open(f"{filename}.json", 'w') as f:
            json.dump({
                "results": self.results,
                "summary": self.generate_summary_stats()
            }, f, indent=2)

        df_data = []
        for result in self.results:
            if result['success']:
                df_data.append({
                    'Company': result['company_name'],
                    'URL': result['url'],
                    'Success': result['success'],
                    'Data_Length': len(str(result['scraped_data'])) if result['scraped_data'] else 0,
                    'Analysis_Date': result['analysis_date']
                })
        if df_data:
            df = pd.DataFrame(df_data)
            df.to_csv(f"{filename}.csv", index=False)
        print(f"✅ Results exported to {filename}.json and {filename}.csv")
The CompetitiveAnalyzer class coordinates end-to-end competitor research: it uses the ScrapeGraph tools to scrape detailed company information, compiles and cleans the results, and then leverages Gemini AI to generate structured competitive insights. It also tracks success rates and timestamps, and provides convenient methods to export both raw and aggregated data to JSON and CSV for easy downstream reporting and analysis.
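The success-rate bookkeeping in generate_summary_stats can be exercised offline with mock results; this small sketch mirrors that logic without touching the network:

```python
def summarize(results):
    # Mirrors generate_summary_stats: totals, successes, failures, and a rate.
    ok = sum(1 for r in results if r["success"])
    total = len(results)
    return {
        "total_companies_analyzed": total,
        "successful_scrapes": ok,
        "failed_scrapes": total - ok,
        "success_rate": f"{ok / total * 100:.1f}%" if total else "0%",
    }

mock = [{"success": True}, {"success": True}, {"success": False}]
stats = summarize(mock)
print(stats["success_rate"])  # 66.7%
```

Guarding the division with `if total` matters: calling the method before any scrape would otherwise raise a ZeroDivisionError.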
def run_ai_saas_analysis():
    """Run a comprehensive analysis of AI/SaaS competitors"""
    analyzer = CompetitiveAnalyzer()
    ai_saas_competitors = [
        {"name": "OpenAI", "url": "https://openai.com"},
        {"name": "Anthropic", "url": "https://www.anthropic.com"},
        {"name": "Hugging Face", "url": "https://huggingface.co"},
        {"name": "Cohere", "url": "https://cohere.com"},
        {"name": "Scale AI", "url": "https://scale.com"},
    ]
    results = analyzer.analyze_competitor_landscape(ai_saas_competitors)

    print("\n" + "="*80)
    print("🎯 COMPETITIVE ANALYSIS RESULTS")
    print("="*80)
    print("\n📊 Summary Statistics:")
    stats = results['summary_stats']
    for key, value in stats.items():
        print(f"   {key.replace('_', ' ').title()}: {value}")

    print("\n🔍 Strategic Analysis:")
    if isinstance(results['analysis'], dict):
        for section, content in results['analysis'].items():
            print(f"\n   {section.replace('_', ' ').title()}:")
            if isinstance(content, list):
                for item in content:
                    print(f"   • {item}")
            else:
                print(f"   {content}")
    else:
        print(results['analysis'])

    analyzer.export_results("ai_saas_competitive_analysis")
    return results
The function above kicks off the competitive analysis by instantiating CompetitiveAnalyzer and defining the key AI/SaaS players to evaluate. It then runs the full scraping-and-insights workflow, prints formatted summary statistics and the strategic findings, and finally exports the detailed results to JSON and CSV for further use.
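The dual JSON/CSV export can be sketched with only the standard library, using mock data and in-memory buffers instead of files, to mirror what export_results writes:

```python
import csv
import io
import json

results = [
    {"company_name": "ExampleCo", "url": "https://example.com",
     "success": True, "scraped_data": {"tagline": "demo"},
     "analysis_date": "2024-01-01 00:00:00"},
]

# JSON side: the raw results, pretty-printed as export_results does.
json_payload = json.dumps({"results": results}, indent=2)

# CSV side: one flat row per successful scrape.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["Company", "URL", "Data_Length"])
writer.writeheader()
for r in results:
    if r["success"]:
        writer.writerow({"Company": r["company_name"], "URL": r["url"],
                         "Data_Length": len(str(r["scraped_data"]))})
print(buf.getvalue())
```

Keeping the CSV flat (lengths and metadata rather than nested JSON) is what makes it convenient to load back into pandas for charting.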
def run_ecommerce_analysis():
    """Analyze e-commerce platform competitors"""
    analyzer = CompetitiveAnalyzer()
    ecommerce_competitors = [
        {"name": "Shopify", "url": "https://www.shopify.com"},
        {"name": "WooCommerce", "url": "https://woocommerce.com"},
        {"name": "BigCommerce", "url": "https://www.bigcommerce.com"},
        {"name": "Magento", "url": "https://magento.com"},
    ]
    results = analyzer.analyze_competitor_landscape(ecommerce_competitors)
    analyzer.export_results("ecommerce_competitive_analysis")
    return results
The function above sets up a competitive benchmark for the major e-commerce platforms: it scrapes details from each site to generate strategic insights, then exports the results to JSON and CSV files under the name "ecommerce_competitive_analysis".
@chain
def social_media_monitoring_chain(company_urls: List[str], config: RunnableConfig):
    """Monitor social media presence and engagement strategies of competitors"""
    social_media_prompt = ChatPromptTemplate.from_messages([
        ("system", """
        You are a social media strategist. Analyze the social media presence and strategies
        of these companies. Focus on:
        1. Platform presence (LinkedIn, Twitter, Instagram, etc.)
        2. Content strategy patterns
        3. Engagement tactics
        4. Community building approaches
        5. Brand voice and messaging
        6. Posting frequency and timing

        Provide actionable insights for improving social media strategy.
        """),
        ("human", "Analyze social media data for: {urls}")
    ])

    social_data = []
    for url in company_urls:
        try:
            result = smartscraper.invoke({
                "user_prompt": "Extract all social media links, community engagement features, and social proof elements",
                "website_url": url,
            })
            social_data.append({"url": url, "social_data": result})
        except Exception as e:
            social_data.append({"url": url, "error": str(e)})

    analysis_chain = social_media_prompt | llm
    analysis = analysis_chain.invoke({"urls": json.dumps(social_data, indent=2)}, config=config)

    return {
        "social_analysis": analysis,
        "raw_social_data": social_data
    }
Here, the chained function defines a pipeline for collecting and analyzing competitors' social media footprints: it uses ScrapeGraph's smart scraper to extract social media links and engagement elements, then feeds that data to Gemini with a prompt focused on platform presence, content strategy, and community building. Finally, it returns both the raw scraped information and the AI-generated, actionable social media insights in a single structured output.
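The per-URL try/except accumulation is worth noting: one failing site never aborts the batch. A self-contained sketch with a stubbed scraper (`fake_scraper` is a stand-in for smartscraper.invoke, not a real API) shows the pattern:

```python
import json

def fake_scraper(url):
    # Stand-in for smartscraper.invoke; fails for one URL to show error capture.
    if "bad" in url:
        raise RuntimeError("fetch failed")
    return {"social_links": [url + "/twitter"]}

social_data = []
for url in ["https://example.com", "https://bad.example"]:
    try:
        social_data.append({"url": url, "social_data": fake_scraper(url)})
    except Exception as e:
        social_data.append({"url": url, "error": str(e)})

print(json.dumps(social_data, indent=2))  # both entries survive, one with an error
```

Because every URL yields an entry, successful or not, the downstream prompt sees the full picture, including which sites could not be reached.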
def check_credits():
    """Check available credits"""
    try:
        credits_info = credits.invoke({})
        print(f"💳 Available Credits: {credits_info}")
        return credits_info
    except Exception as e:
        print(f"⚠️ Could not check credits: {e}")
        return None
The function above calls GetCreditsTool to retrieve and display your available ScrapeGraph API credits, printing the result (or a warning if the check fails), and returns the credit information, or None on error.
if __name__ == "__main__":
    print("🚀 Advanced Competitive Analysis Tool with Gemini AI")
    print("="*60)
    check_credits()

    print("\n🤖 Running AI/SaaS Competitive Analysis...")
    ai_results = run_ai_saas_analysis()

    run_additional = input("\n❓ Run e-commerce analysis as well? (y/n): ").lower().strip()
    if run_additional == 'y':
        print("\n🛒 Running E-commerce Platform Analysis...")
        ecom_results = run_ecommerce_analysis()

    print("\n✨ Analysis complete! Check the exported files for detailed results.")
Finally, this last block serves as the script's entry point: it prints a title, checks the available API credits, and then kicks off the AI/SaaS competitor analysis (and optionally the e-commerce analysis) before signaling that all results have been exported.
In short, integrating ScrapeGraph's scraping capabilities with Gemini AI transforms traditionally time-consuming competitive-intelligence workflows into efficient, repeatable pipelines. ScrapeGraph handles the heavy lifting of acquiring and structuring web information, while Gemini's language understanding turns the raw data into high-level strategic advice. As a result, companies can quickly assess market positioning, identify feature gaps, and spot emerging opportunities with minimal manual intervention. By automating these steps, users gain speed and consistency and can extend the analysis to new competitors or markets as needed.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of Marktechpost, an artificial intelligence media platform that stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among audiences.