Build custom AI tools for your AI agents that combine machine learning and statistical analysis

The ability to build custom tools is crucial for building customizable AI agents. In this tutorial, we demonstrate how to use Python to create a powerful and intelligent data analysis tool that can be integrated into an AI agent powered by LangChain. By defining a structured schema for user input and implementing key features such as correlation analysis, clustering, outlier detection, and target variable analysis, the tool converts raw tabular data into actionable insights. Leveraging the modularity of LangChain's tool abstraction, this implementation illustrates how developers can encapsulate domain-specific logic and build reusable components, thereby improving the analytical capabilities of autonomous AI systems.

!pip install langchain langchain-core pandas numpy matplotlib seaborn scikit-learn




import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from typing import Dict, List, Tuple, Optional, Any
from langchain_core.tools import BaseTool
from langchain_core.tools.base import ToolException
from pydantic import BaseModel, Field
import json

We first install the core Python packages for data analysis, visualization, machine learning, and LangChain tool development. We then import the key libraries, including pandas, NumPy, scikit-learn, and langchain_core, setting up the environment to build custom intelligent tools for AI agents. These libraries provide the foundation for preprocessing, clustering, evaluation, and tool integration.

class DataAnalysisInput(BaseModel):
   data: List[Dict[str, Any]] = Field(description="List of data records as dictionaries")
   analysis_type: str = Field(default="comprehensive", description="Type of analysis: 'comprehensive', 'clustering', 'correlation', 'outlier'")
   target_column: Optional[str] = Field(default=None, description="Target column for focused analysis")
   max_clusters: int = Field(default=5, description="Maximum clusters for clustering analysis")

Above, we use Pydantic's BaseModel to define the input schema for the custom analysis tool. The DataAnalysisInput class ensures that incoming data follows a structured format, allowing users to specify the dataset, the type of analysis, an optional target column, and the maximum number of clusters for clustering tasks. It serves as a clean interface for validating inputs before any analysis begins.
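To see this validation in action, below is a minimal sketch (using only the DataAnalysisInput class defined above, plus Pydantic's ValidationError) of how malformed input is rejected before it ever reaches the analysis logic:

from pydantic import ValidationError

# Well-formed input passes validation and fills in the declared defaults
valid = DataAnalysisInput(data=[{"age": 30, "income": 60000}], analysis_type="clustering")
print(valid.max_clusters)  # 5 (default)

# A wrong type for `data` raises a ValidationError before any analysis runs
try:
    DataAnalysisInput(data="not a list of records")
except ValidationError as e:
    print(e)  # reports the offending field and the expected type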

class IntelligentDataAnalyzer(BaseTool):
   name: str = "intelligent_data_analyzer"
   description: str = "Advanced data analysis tool that performs statistical analysis, machine learning clustering, outlier detection, correlation analysis, and generates visualizations with actionable insights."
   args_schema: type[BaseModel] = DataAnalysisInput
   response_format: str = "content_and_artifact"
  
   def _run(self, data: List[Dict], analysis_type: str = "comprehensive", target_column: Optional[str] = None, max_clusters: int = 5) -> Tuple[str, Dict]:
       try:
           df = pd.DataFrame(data)
           if df.empty:
               raise ToolException("Dataset is empty")
          
           insights = {"dataset_info": self._get_dataset_info(df)}
          
           if analysis_type in ["comprehensive", "correlation"]:
               insights["correlation_analysis"] = self._correlation_analysis(df)
           if analysis_type in ["comprehensive", "clustering"]:
               insights["clustering_analysis"] = self._clustering_analysis(df, max_clusters)
           if analysis_type in ["comprehensive", "outlier"]:
               insights["outlier_detection"] = self._outlier_detection(df)
          
           if target_column and target_column in df.columns:
               insights["target_analysis"] = self._target_analysis(df, target_column)
          
           recommendations = self._generate_recommendations(df, insights)
           summary = self._create_analysis_summary(insights, recommendations)
          
           artifact = {
               "insights": insights,
               "recommendations": recommendations,
               "data_shape": df.shape,
               "analysis_type": analysis_type,
               "numeric_columns": df.select_dtypes(include=[np.number]).columns.tolist(),
               "categorical_columns": df.select_dtypes(include=['object']).columns.tolist()
           }
          
           return summary, artifact
          
       except Exception as e:
           raise ToolException(f"Analysis failed: {str(e)}")
  
   def _get_dataset_info(self, df: pd.DataFrame) -> Dict:
       return {
           "shape": df.shape,
           "columns": df.columns.tolist(),
           "dtypes": df.dtypes.astype(str).to_dict(),
           "missing_values": df.isnull().sum().to_dict(),
           "memory_usage": df.memory_usage(deep=True).sum()
       }
  
   def _correlation_analysis(self, df: pd.DataFrame) -> Dict:
       numeric_df = df.select_dtypes(include=[np.number])
       if numeric_df.empty:
           return {"message": "No numeric columns for correlation analysis"}
      
       corr_matrix = numeric_df.corr()
       strong_corr = []
       for i in range(len(corr_matrix.columns)):
           for j in range(i+1, len(corr_matrix.columns)):
               corr_val = corr_matrix.iloc[i, j]
               if abs(corr_val) > 0.7:
                   strong_corr.append({"var1": corr_matrix.columns[i], "var2": corr_matrix.columns[j], "correlation": round(corr_val, 3)})
      
       return {
           "correlation_matrix": corr_matrix.round(3).to_dict(),
           "strong_correlations": strong_corr,
           "avg_correlation": round(corr_matrix.values[np.triu_indices_from(corr_matrix.values, k=1)].mean(), 3)
       }
  
   def _clustering_analysis(self, df: pd.DataFrame, max_clusters: int) -> Dict:
       numeric_df = df.select_dtypes(include=[np.number]).dropna()
       if numeric_df.shape[0] < 4 or numeric_df.shape[1] < 1:
           return {"message": "Insufficient numeric data for clustering"}
      
       # Standardize features so every column contributes equally to distances
       scaler = StandardScaler()
       scaled_data = scaler.fit_transform(numeric_df)
      
       # Sweep candidate cluster counts and record inertia for the elbow method
       inertias = []
       k_range = range(2, min(max_clusters + 1, len(numeric_df)))
       for k in k_range:
           kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
           kmeans.fit(scaled_data)
           inertias.append(kmeans.inertia_)
      
       optimal_k = self._find_elbow_point(inertias, k_range)
       kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
       cluster_labels = kmeans.fit_predict(scaled_data)
      
       return {
           "optimal_clusters": optimal_k,
           "cluster_sizes": pd.Series(cluster_labels).value_counts().to_dict(),
           "silhouette_score": round(float(silhouette_score(scaled_data, cluster_labels)), 3) if len(set(cluster_labels)) > 1 else 0.0,
           "inertias": inertias
       }
  
   def _outlier_detection(self, df: pd.DataFrame) -> Dict:
       numeric_df = df.select_dtypes(include=[np.number])
       if numeric_df.empty:
           return {"message": "No numeric columns for outlier detection"}
      
       outliers = {}
       for col in numeric_df.columns:
           data = numeric_df[col].dropna()
           Q1, Q3 = data.quantile(0.25), data.quantile(0.75)
           IQR = Q3 - Q1
           iqr_outliers = data[(data < Q1 - 1.5 * IQR) | (data > Q3 + 1.5 * IQR)]
           z_scores = np.abs((data - data.mean()) / data.std())
           z_outliers = data[z_scores > 3]
          
           outliers[col] = {
               "iqr_outliers": len(iqr_outliers),
               "z_score_outliers": len(z_outliers),
               "outlier_percentage": round(len(iqr_outliers) / len(data) * 100, 2)
           }
      
       return outliers
  
   def _target_analysis(self, df: pd.DataFrame, target_col: str) -> Dict:
       if target_col not in df.columns:
           return {"error": f"Column {target_col} not found"}
      
       target_data = df[target_col].dropna()
      
       if pd.api.types.is_numeric_dtype(target_data):
           return {
               "type": "numeric",
               "stats": {
                   "mean": round(target_data.mean(), 3),
                   "median": round(target_data.median(), 3),
                   "std": round(target_data.std(), 3),
                   "skewness": round(target_data.skew(), 3),
                   "kurtosis": round(target_data.kurtosis(), 3)
               },
               "distribution": "normal" if abs(target_data.skew())  List[str]:
       recommendations = []
      
       missing_pct = sum(insights["dataset_info"]["missing_values"].values()) / (df.shape[0] * df.shape[1]) * 100
       if missing_pct > 10:
           recommendations.append(f"Consider data imputation - {missing_pct:.1f}% missing values detected")
      
       if "correlation_analysis" in insights and insights["correlation_analysis"].get("strong_correlations"):
           recommendations.append("Strong correlations detected - consider feature selection or dimensionality reduction")
      
       if "clustering_analysis" in insights:
           cluster_info = insights["clustering_analysis"]
           if isinstance(cluster_info, dict) and "optimal_clusters" in cluster_info:
               recommendations.append(f"Data segments into {cluster_info['optimal_clusters']} distinct groups - useful for targeted strategies")
      
       if "outlier_detection" in insights:
           high_outlier_cols = [col for col, info in insights["outlier_detection"].items() if isinstance(info, dict) and info.get("outlier_percentage", 0) > 5]
           if high_outlier_cols:
               recommendations.append(f"High outlier percentage in: {', '.join(high_outlier_cols)} - investigate data quality")
      
       return recommendations if recommendations else ["Data appears well-structured with no immediate concerns"]
  
   def _create_analysis_summary(self, insights: Dict, recommendations: List[str]) -> str:
       dataset_info = insights["dataset_info"]
       summary = f"""📊 INTELLIGENT DATA ANALYSIS COMPLETE


Dataset Overview: {dataset_info['shape'][0]} rows × {dataset_info['shape'][1]} columns
Numeric Features: {len([c for c, t in dataset_info['dtypes'].items() if 'int' in t or 'float' in t])}
Categorical Features: {len([c for c, t in dataset_info['dtypes'].items() if 'object' in t])}


Key Insights Generated:
• Statistical correlations and relationships identified
• Clustering patterns discovered for segmentation
• Outlier detection completed for data quality assessment
• Feature importance and distribution analysis performed


Top Recommendations:
{chr(10).join('• ' + rec for rec in recommendations[:3])}


Analysis includes ML-powered clustering, statistical correlations, and actionable business insights."""
      
       return summary
  
   def _find_elbow_point(self, inertias: List[float], k_range: range) -> int:
       if len(inertias) < 2:
           return list(k_range)[0] if len(k_range) > 0 else 2
       # Elbow heuristic: pick the k that follows the largest drop in inertia
       drops = [inertias[i] - inertias[i + 1] for i in range(len(inertias) - 1)]
       return list(k_range)[drops.index(max(drops)) + 1]

The IntelligentDataAnalyzer class is a custom tool built on LangChain's BaseTool interface that conducts comprehensive analysis of structured datasets. It integrates a variety of analytical methods, including correlation matrix generation, k-means clustering with silhouette scoring, outlier detection using IQR and z-score methods, and descriptive statistics on a target column, into a unified pipeline. The tool not only extracts valuable insights but also automatically generates recommendations and a summary report, making it useful for building AI agents that require data-driven decision-support capabilities.
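Because the class subclasses BaseTool, it can be bound to a chat model like any other LangChain tool. The sketch below is illustrative only: it assumes the optional langchain-openai package and an OPENAI_API_KEY, neither of which this tutorial sets up:

from langchain_openai import ChatOpenAI  # assumed extra dependency, not installed above

llm = ChatOpenAI(model="gpt-4o-mini")
llm_with_tools = llm.bind_tools([IntelligentDataAnalyzer()])

# The model can now emit tool calls whose arguments are validated
# against DataAnalysisInput before _run executes
response = llm_with_tools.invoke("Find outliers in: [{'age': 25, 'income': 50000}]")
print(response.tool_calls)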

data_analyzer = IntelligentDataAnalyzer()


sample_data = [
   {"age": 25, "income": 50000, "education": "Bachelor", "satisfaction": 7},
   {"age": 35, "income": 75000, "education": "Master", "satisfaction": 8},
   {"age": 45, "income": 90000, "education": "PhD", "satisfaction": 6},
   {"age": 28, "income": 45000, "education": "Bachelor", "satisfaction": 7},
   {"age": 52, "income": 120000, "education": "Master", "satisfaction": 9},
]


result = data_analyzer.invoke({
   "data": sample_data,
   "analysis_type": "comprehensive",
   "target_column": "satisfaction"
})


print("Analysis Summary:")
print(result)

Finally, we initialize the IntelligentDataAnalyzer tool and feed it a sample dataset containing demographic and satisfaction data. By specifying the analysis type as "comprehensive" and setting satisfaction as the target column, the tool performs a full suite of analyses, including statistical profiling, correlation checks, clustering, outlier detection, and target distribution analysis. The final output is a human-readable summary together with structured insights, demonstrating how an AI agent can automatically process and interpret real-world tabular data.
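Because the tool declares response_format="content_and_artifact", an agent can also recover the structured artifact rather than only the text summary. Here is a minimal sketch, assuming a recent langchain-core that supports invoking tools with tool-call dictionaries:

# Invoking with a tool-call-style payload returns a ToolMessage whose
# .artifact field carries the structured insights alongside the summary
tool_msg = data_analyzer.invoke({
    "name": "intelligent_data_analyzer",
    "args": {"data": sample_data, "analysis_type": "outlier"},
    "id": "call_1",  # hypothetical tool-call id
    "type": "tool_call",
})
print(tool_msg.content)                     # human-readable summary
print(list(tool_msg.artifact["insights"]))  # structured payload keys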

In short, we created an advanced custom tool that integrates cleanly with AI agents. The IntelligentDataAnalyzer class handles a range of analytical tasks, from statistical profiling to machine-learning-based clustering, and returns explicit recommendations alongside structured outputs. This approach highlights how custom LangChain tools bridge the gap between data science and interactive AI, allowing agents to reason over rich data and deliver data-driven decisions.

