Build a Gemini-Powered DataFrame Agent with Pandas and LangChain for Natural Language Data Analysis

In this tutorial, we will learn how to combine the power of Google's Gemini model with the flexibility of pandas. We will perform both simple and sophisticated analyses of the classic Titanic dataset. By pairing the ChatGoogleGenerativeAI client with LangChain's experimental pandas DataFrame agent, we will set up an interactive "agent" that can interpret natural language queries. It will inspect the data, compute statistics, discover correlations, and generate visual insights, without us writing manual code for each task. We will walk through basic exploration steps (such as counting rows or computing the survival rate). We will dig into advanced analyses such as survival broken down by demographics and fare correlations. We will then compare modified copies of the data across multiple DataFrames. Finally, we will build custom scoring and pattern-mining routines to extract novel insights.
!pip install langchain_experimental langchain_google_genai pandas
import os
import pandas as pd
import numpy as np
from langchain.agents.agent_types import AgentType
from langchain_experimental.agents.agent_toolkits import create_pandas_dataframe_agent
from langchain_google_genai import ChatGoogleGenerativeAI
os.environ["GOOGLE_API_KEY"] = "Use Your Own API Key"
First, we install the required libraries, langchain_experimental, langchain_google_genai, and pandas, using pip, enabling the DataFrame agent and Google Gemini integration. Then we import the core modules. Next, we set the GOOGLE_API_KEY environment variable, and we are ready to instantiate the Gemini-powered pandas agent for conversational data analysis.
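Hardcoding the key is fine for a quick demo, but in a shared notebook it is safer to prompt for it at runtime. A minimal sketch using Python's standard getpass module (GOOGLE_API_KEY is the environment variable that langchain_google_genai reads by default):

import os
from getpass import getpass

# Prompt for the key instead of pasting it into the notebook
if not os.getenv("GOOGLE_API_KEY"):
    os.environ["GOOGLE_API_KEY"] = getpass("Enter your Google API key: ")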
def setup_gemini_agent(df, temperature=0, model="gemini-1.5-flash"):
    llm = ChatGoogleGenerativeAI(
        model=model,
        temperature=temperature,
        convert_system_message_to_human=True
    )
    agent = create_pandas_dataframe_agent(
        llm=llm,
        df=df,
        verbose=True,
        agent_type=AgentType.OPENAI_FUNCTIONS,
        allow_dangerous_code=True
    )
    return agent
This helper function initializes the Gemini LLM client with our chosen model and temperature, then wraps it in a LangChain pandas DataFrame agent that can execute natural language queries (including "dangerous" generated code) against our DataFrame. We just pass in a DataFrame and get back an interactive agent, ready for conversational analysis.
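As a quick illustration of how the helper is used (the file path and query string here are just placeholders), a typical interaction looks like this:

df = pd.read_csv("titanic.csv")  # any pandas DataFrame works
agent = setup_gemini_agent(df, temperature=0)
response = agent.invoke("What is the average fare paid by survivors?")
print(response["output"])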
def load_and_explore_data():
    print("Loading Titanic Dataset...")
    df = pd.read_csv(
        "https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv"
    )
    print(f"Dataset shape: {df.shape}")
    print(f"Columns: {list(df.columns)}")
    return df
This function fetches the Titanic CSV directly from the pandas GitHub repository, prints its shape and column names for a quick sanity check, and returns the loaded DataFrame so we can start exploratory analysis right away.
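Before handing the DataFrame to the agent, it can help to eyeball it ourselves. A small sketch of the kind of manual checks the agent will later automate (column names such as Age and Cabin follow the pandas-hosted Titanic CSV):

df = load_and_explore_data()
print(df.head())        # first few rows
print(df.dtypes)        # column types
print(df.isna().sum())  # missing values per column, e.g. Age and Cabin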
def basic_analysis_demo(agent):
    print("\nBASIC ANALYSIS DEMO")
    print("=" * 50)
    queries = [
        "How many rows and columns are in the dataset?",
        "What's the survival rate (percentage of people who survived)?",
        "How many people have more than 3 siblings?",
        "What's the square root of the average age?",
        "Show me the distribution of passenger classes"
    ]
    for query in queries:
        print(f"\nQuery: {query}")
        try:
            result = agent.invoke(query)
            print(f"Result: {result['output']}")
        except Exception as e:
            print(f"Error: {e}")
This function kicks off the basic analysis session by printing a header, then iterates through a set of common exploratory queries: dataset dimensions, survival rate, family counts, and class distribution. For each natural language prompt, it invokes the agent, captures its output, and prints the result or any error.
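Under the hood, the agent answers these prompts by generating and executing pandas code. For reference, hand-written equivalents of a few of the queries might look roughly like this (note that SibSp actually counts siblings and spouses combined):

survival_rate = df["Survived"].mean() * 100   # percentage of passengers who survived
many_siblings = (df["SibSp"] > 3).sum()       # passengers with more than 3 siblings/spouses
sqrt_avg_age = df["Age"].mean() ** 0.5        # square root of the mean age (NaNs skipped)
class_dist = df["Pclass"].value_counts()      # distribution of passenger classes
print(survival_rate, many_siblings, sqrt_avg_age, class_dist, sep="\n")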
def advanced_analysis_demo(agent):
    print("\nADVANCED ANALYSIS DEMO")
    print("=" * 50)
    advanced_queries = [
        "What's the correlation between age and fare?",
        "Create a survival analysis by gender and class",
        "What's the median age for each passenger class?",
        "Find passengers with the highest fares and their details",
        "Calculate the survival rate for different age groups (0-18, 18-65, 65+)"
    ]
    for query in advanced_queries:
        print(f"\nQuery: {query}")
        try:
            result = agent.invoke(query)
            print(f"Result: {result['output']}")
        except Exception as e:
            print(f"Error: {e}")
This advanced analysis function prints a header and then runs a series of more complex queries against our Gemini-powered DataFrame agent: computing correlations, performing stratified survival analysis, calculating median statistics, and doing detailed filtering. It loops through each natural language prompt, captures the agent's response, and prints the result (or error), showing how easily conversational AI can deliver deeper, segmented insights into our dataset.
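Again, each of these prompts maps to a short pandas expression that the agent writes for us. For instance, the correlation, the stratified survival analysis, and the age-group breakdown could be reproduced manually like this (the bin edges mirror the prompt):

age_fare_corr = df["Age"].corr(df["Fare"])  # pairwise-complete correlation
by_gender_class = df.groupby(["Sex", "Pclass"])["Survived"].mean()
age_bins = pd.cut(df["Age"], bins=[0, 18, 65, 120], labels=["0-18", "18-65", "65+"])
by_age_group = df.groupby(age_bins, observed=False)["Survived"].mean()
print(age_fare_corr, by_gender_class, by_age_group, sep="\n\n")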
def multi_dataframe_demo():
    print("\nMULTI-DATAFRAME DEMO")
    print("=" * 50)
    df = pd.read_csv(
        "https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv"
    )
    df_filled = df.copy()
    df_filled["Age"] = df_filled["Age"].fillna(df_filled["Age"].mean())
    agent = setup_gemini_agent([df, df_filled])
    queries = [
        "How many rows in the age column are different between the two datasets?",
        "Compare the average age in both datasets",
        "What percentage of age values were missing in the original dataset?",
        "Show summary statistics for age in both datasets"
    ]
    for query in queries:
        print(f"\nQuery: {query}")
        try:
            result = agent.invoke(query)
            print(f"Result: {result['output']}")
        except Exception as e:
            print(f"Error: {e}")
This demo shows how to point a Gemini-powered agent at multiple DataFrames at once, in this case the original Titanic data and a copy in which missing ages have been imputed with the mean. We can then ask cross-DataFrame comparison questions (differences in row values, mean-age comparisons, missing-value percentages, and side-by-side summary statistics) using simple natural language prompts.
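If we want to verify the agent's answers, the same comparisons are easy to compute directly. A sketch, assuming the df and df_filled variables defined inside multi_dataframe_demo above:

changed = df["Age"].ne(df_filled["Age"]).sum()  # NaN != imputed mean, so this counts the filled rows
pct_missing = df["Age"].isna().mean() * 100     # share of ages missing in the original
print(f"Changed Age rows: {changed}")
print(f"Missing in original: {pct_missing:.1f}%")
print(df["Age"].describe(), df_filled["Age"].describe(), sep="\n\n")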
def custom_analysis_demo(agent):
    print("\nCUSTOM ANALYSIS DEMO")
    print("=" * 50)
    custom_queries = [
        "Create a risk score for each passenger based on: Age (higher age = higher risk), Gender (male = higher risk), Class (3rd class = higher risk), Family size (alone or large family = higher risk). Then show the top 10 highest risk passengers who survived",
        "Analyze the 'deck' information from the cabin data: Extract deck letter from cabin numbers, Show survival rates by deck, Which deck had the highest survival rate?",
        "Find interesting patterns: Did people with similar names (same surname) tend to survive together? What's the relationship between ticket price and survival? Were there any age groups that had 100% survival rate?"
    ]
    for i, query in enumerate(custom_queries, 1):
        print(f"\nCustom Analysis {i}:")
        print(f"Query: {query[:100]}...")
        try:
            result = agent.invoke(query)
            print(f"Result: {result['output']}")
        except Exception as e:
            print(f"Error: {e}")
This routine kicks off a custom analysis session by running three complex, multi-step prompts: building a passenger risk-scoring model, extracting deck letters from cabin numbers to compare deck-level survival rates, and mining for patterns such as shared-surname survival, the fare/survival relationship, and age groups with perfect survival. It shows how smoothly our Gemini-powered agent handles customized, domain-specific investigations expressed as natural language queries.
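The deck query, for example, boils down to a one-line string extraction that the agent works out on its own. A manual sketch (the first character of Cabin is the deck letter; rows without a cabin stay NaN and are excluded from the groupby):

deck = df["Cabin"].str[0]  # 'C85' -> 'C'; NaN stays NaN
deck_survival = df.assign(Deck=deck).groupby("Deck")["Survived"].mean()
print(deck_survival.sort_values(ascending=False))  # highest-survival deck first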
def main():
    print("Advanced Pandas Agent with Gemini Tutorial")
    print("=" * 60)
    if not os.getenv("GOOGLE_API_KEY"):
        print("Warning: GOOGLE_API_KEY not set!")
        print("Please set your Gemini API key as an environment variable.")
        return
    try:
        df = load_and_explore_data()
        print("\nSetting up Gemini Agent...")
        agent = setup_gemini_agent(df)
        basic_analysis_demo(agent)
        advanced_analysis_demo(agent)
        multi_dataframe_demo()
        custom_analysis_demo(agent)
        print("\nTutorial completed successfully!")
    except Exception as e:
        print(f"Error: {e}")
        print("Make sure you have installed all required packages and set your API key.")

if __name__ == "__main__":
    main()
The main() function is the entry point of the tutorial. It verifies that the Gemini API key is set, loads and explores the Titanic dataset, and initializes the conversational pandas agent. It then runs the basic, advanced, multi-DataFrame, and custom analysis demos in turn. Finally, it wraps the entire workflow in a try/except block to catch and report any errors before signaling successful completion.
df = pd.read_csv(
    "https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv"
)
agent = setup_gemini_agent(df)
agent.invoke("What factors most strongly predicted survival?")
agent.invoke("Create a detailed survival analysis by port of embarkation")
agent.invoke("Find any interesting anomalies or outliers in the data")
Finally, we load the Titanic data directly, instantiate our Gemini-powered pandas agent, and fire off three ad-hoc queries: identifying the strongest survival predictors, breaking down survival by port of embarkation, and hunting for anomalies or outliers, all without touching any of our demo functions.
In short, combining pandas with Gemini through LangChain's DataFrame agent turns data exploration from writing boilerplate code into posing clear natural language questions. Whether we are computing summary statistics, building custom risk scores, comparing multiple DataFrames, or drilling into nuanced survival analyses, the pattern stays the same: with just a few lines of setup, we get an interactive analysis assistant that can adapt to new questions on the fly, surface hidden patterns, and speed up our workflow.
