
A Coding Guide to Build a Functional Data Analysis Workflow Using Lilac for Transforming, Filtering, and Exporting Structured Insights

In this tutorial, we demonstrate a fully functional, modular data analysis pipeline using the Lilac library, without relying on its signal-processing features. It combines Lilac's dataset management capabilities with Python's functional programming paradigm to create a clean, extensible workflow. From setting up a project and generating realistic sample data to extracting insights and exporting filtered outputs, the tutorial emphasizes a reusable, composable code structure. Core functional utilities such as pipe, map_over, and filter_by are used to build declarative transformation chains, while Pandas facilitates detailed data transformation and quality analysis.

!pip install lilac[all] pandas numpy

To get started, we install the required libraries with the command pip install lilac[all] pandas numpy. This ensures we have the complete Lilac toolkit along with Pandas and NumPy for smooth data processing and analysis. Run this command in your notebook before proceeding.

import json
import uuid
import pandas as pd
from pathlib import Path
from typing import List, Dict, Any, Tuple, Optional
from functools import reduce, partial
import lilac as ll

We import all the essential libraries: json and uuid for serializing data and generating unique project names, Pandas for working with tabular data, and Path from pathlib for managing directories. We also bring in type hints for clearer function signatures and functools helpers (reduce, partial) that support functional composition. Finally, we import the core Lilac library as ll to manage our datasets.

def pipe(*functions):
   """Compose functions left to right (pipe operator)"""
   return lambda x: reduce(lambda acc, f: f(acc), functions, x)


def map_over(func, iterable):
   """Functional map wrapper"""
   return list(map(func, iterable))


def filter_by(predicate, iterable):
   """Functional filter wrapper"""
   return list(filter(predicate, iterable))


def create_sample_data() -> List[Dict[str, Any]]:
   """Generate realistic sample data for analysis"""
   return [
       {"id": 1, "text": "What is machine learning?", "category": "tech", "score": 0.9, "tokens": 5},
       {"id": 2, "text": "Machine learning is AI subset", "category": "tech", "score": 0.8, "tokens": 6},
       {"id": 3, "text": "Contact support for help", "category": "support", "score": 0.7, "tokens": 4},
       {"id": 4, "text": "What is machine learning?", "category": "tech", "score": 0.9, "tokens": 5}, 
       {"id": 5, "text": "Deep learning neural networks", "category": "tech", "score": 0.85, "tokens": 4},
       {"id": 6, "text": "How to optimize models?", "category": "tech", "score": 0.75, "tokens": 5},
       {"id": 7, "text": "Performance tuning guide", "category": "guide", "score": 0.6, "tokens": 3},
       {"id": 8, "text": "Advanced optimization techniques", "category": "tech", "score": 0.95, "tokens": 3},
       {"id": 9, "text": "Gradient descent algorithm", "category": "tech", "score": 0.88, "tokens": 3},
       {"id": 10, "text": "Model evaluation metrics", "category": "tech", "score": 0.82, "tokens": 3},
   ]

In this section, we define reusable functional utilities. The pipe function lets us chain transformations clearly from left to right, while map_over and filter_by allow us to transform or filter iterables functionally. We then create a sample dataset that mimics real-world records, with fields such as text, category, score, and tokens, which we will use later to demonstrate Lilac's data curation capabilities.
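To see how these helpers compose, here is a minimal, hypothetical sanity check (not part of the pipeline itself) that chains them over the sample records; the variable names are illustrative only.

records = create_sample_data()

# Filter and map individually with the wrappers.
tech_only = filter_by(lambda r: r["category"] == "tech", records)
tech_texts = map_over(lambda r: r["text"], tech_only)

# pipe() composes the same steps left to right into a single callable.
high_score_texts = pipe(
    lambda rows: filter_by(lambda r: r["score"] >= 0.8, rows),
    lambda rows: map_over(lambda r: r["text"], rows),
)(records)

print(len(tech_only), tech_texts[:2])
print(high_score_texts)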

def setup_lilac_project(project_name: str) -> str:
   """Initialize Lilac project directory"""
   project_dir = f"./{project_name}-{uuid.uuid4().hex[:6]}"
   Path(project_dir).mkdir(exist_ok=True)
   ll.set_project_dir(project_dir)
   return project_dir


def create_dataset_from_data(name: str, data: List[Dict]) -> ll.Dataset:
   """Create Lilac dataset from data"""
   data_file = f"{name}.jsonl"
   with open(data_file, 'w') as f:
       for item in data:
            f.write(json.dumps(item) + '\n')
  
   config = ll.DatasetConfig(
       namespace="tutorial",
       name=name,
       source=ll.sources.JSONSource(filepaths=[data_file])
   )
  
   return ll.create_dataset(config)

Using the setup_lilac_project function, we initialize a unique working directory for our Lilac project and register it with the Lilac API. With create_dataset_from_data, we convert the raw list of dictionaries into a .jsonl file and create a Lilac dataset by defining its configuration. This prepares the data for clean, structured analysis.
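As a quick illustration, here is a hedged usage sketch of these two helpers; the project and dataset names (lilac_demo, demo_data) are hypothetical and not part of the original walkthrough.

project_dir = setup_lilac_project("lilac_demo")          # e.g. ./lilac_demo-1a2b3c
dataset = create_dataset_from_data("demo_data", create_sample_data())
print(f"Project directory: {project_dir}")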

def extract_dataframe(dataset: ll.Dataset, fields: List[str]) -> pd.DataFrame:
   """Extract data as pandas DataFrame"""
   return dataset.to_pandas(fields)


def apply_functional_filters(df: pd.DataFrame) -> Dict[str, pd.DataFrame]:
   """Apply various filters and return multiple filtered versions"""
  
   filters = {
       'high_score': lambda df: df[df['score'] >= 0.8],
       'tech_category': lambda df: df[df['category'] == 'tech'],
       'min_tokens': lambda df: df[df['tokens'] >= 4],
       'no_duplicates': lambda df: df.drop_duplicates(subset=['text'], keep='first'),
       'combined_quality': lambda df: df[(df['score'] >= 0.8) & (df['tokens'] >= 3) & (df['category'] == 'tech')]
   }
  
   return {name: filter_func(df.copy()) for name, filter_func in filters.items()}

We use extract_dataframe to pull the dataset into a Pandas DataFrame, which lets us work with the selected fields in a familiar format. Then, using apply_functional_filters, we define and apply a set of logical filters, such as high-score selection, category-based filtering, token-count constraints, duplicate removal, and a combined quality condition, to generate multiple filtered views of the data.
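The short snippet below is an illustrative sketch of how these two functions fit together, assuming the dataset object created earlier; the view names come from the filters dictionary above.

df = extract_dataframe(dataset, ["id", "text", "category", "score", "tokens"])
views = apply_functional_filters(df)

for name, view in views.items():
    print(f"{name}: {len(view)} records")
# 'no_duplicates' should drop the repeated "What is machine learning?" row,
# while 'combined_quality' keeps only high-scoring tech entries.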

def analyze_data_quality(df: pd.DataFrame) -> Dict[str, Any]:
   """Analyze data quality metrics"""
   return {
       'total_records': len(df),
       'unique_texts': df['text'].nunique(),
       'duplicate_rate': 1 - (df['text'].nunique() / len(df)),
       'avg_score': df['score'].mean(),
       'category_distribution': df['category'].value_counts().to_dict(),
       'score_distribution': {
           'high': len(df[df['score'] >= 0.8]),
            'medium': len(df[(df['score'] >= 0.6) & (df['score'] < 0.8)]),
            'low': len(df[df['score'] < 0.6])
        }
    }


def create_data_transformations() -> Dict[str, callable]:
   """Create various data transformation functions"""
   return {
       'normalize_scores': lambda df: df.assign(norm_score=df['score'] / df['score'].max()),
       'add_length_category': lambda df: df.assign(
           length_cat=pd.cut(df['tokens'], bins=[0, 3, 5, float('inf')], labels=['short', 'medium', 'long'])
       ),
       'add_quality_tier': lambda df: df.assign(
           quality_tier=pd.cut(df['score'], bins=[0, 0.6, 0.8, 1.0], labels=['low', 'medium', 'high'])
       ),
       'add_category_rank': lambda df: df.assign(
           category_rank=df.groupby('category')['score'].rank(ascending=False)
       )
   }

To evaluate dataset quality, we use analyze_data_quality, which measures key metrics such as total and unique record counts, duplicate rate, category distribution, and the score distribution across high, medium, and low bands. This gives us a clear view of the dataset's readiness and reliability. We also use create_data_transformations to define enrichment functions, such as score normalization, token-length categorization, quality-tier assignment, and within-category ranking.
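Here is a hedged example of using both helpers directly on the extracted DataFrame df from earlier; the formatting and column selection are illustrative.

report = analyze_data_quality(df)
print(f"{report['total_records']} records, "
      f"{report['duplicate_rate']:.0%} duplicates, "
      f"avg score {report['avg_score']:.2f}")

# Pull a single named transformation out of the registry and apply it.
add_tier = create_data_transformations()["add_quality_tier"]
print(add_tier(df)[["text", "score", "quality_tier"]].head(3))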

def apply_transformations(df: pd.DataFrame, transform_names: List[str]) -> pd.DataFrame:
   """Apply selected transformations"""
   transformations = create_data_transformations()
   selected_transforms = [transformations[name] for name in transform_names if name in transformations]
  
   return pipe(*selected_transforms)(df.copy()) if selected_transforms else df


def export_filtered_data(filtered_datasets: Dict[str, pd.DataFrame], output_dir: str) -> None:
   """Export filtered datasets to files"""
   Path(output_dir).mkdir(exist_ok=True)
  
   for name, df in filtered_datasets.items():
       output_file = Path(output_dir) / f"{name}_filtered.jsonl"
       with open(output_file, 'w') as f:
           for _, row in df.iterrows():
                f.write(json.dumps(row.to_dict()) + '\n')
       print(f"Exported {len(df)} records to {output_file}")

Then, through apply_transformations, we selectively apply the desired transformations as a function chain, ensuring our data is enriched and well structured. Once filtered, we use export_filtered_data to write each dataset variant to a separate .jsonl file. This lets us store subsets, such as high-quality entries or deduplicated records, in an organized format for downstream use.
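A brief end-to-end sketch, assuming the DataFrame df from earlier and a writable output directory (the path ./demo_exports is hypothetical):

enriched = apply_transformations(df, ["normalize_scores", "add_quality_tier"])
views = apply_functional_filters(enriched)
export_filtered_data(views, "./demo_exports")
# Each filtered view lands in ./demo_exports/<name>_filtered.jsonl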

def main_analysis_pipeline():
   """Main analysis pipeline demonstrating functional approach"""
  
   print("🚀 Setting up Lilac project...")
   project_dir = setup_lilac_project("advanced_tutorial")
  
   print("📊 Creating sample dataset...")
   sample_data = create_sample_data()
   dataset = create_dataset_from_data("sample_data", sample_data)
  
   print("📋 Extracting data...")
   df = extract_dataframe(dataset, ['id', 'text', 'category', 'score', 'tokens'])
  
   print("🔍 Analyzing data quality...")
   quality_report = analyze_data_quality(df)
   print(f"Original data: {quality_report['total_records']} records")
   print(f"Duplicates: {quality_report['duplicate_rate']:.1%}")
   print(f"Average score: {quality_report['avg_score']:.2f}")
  
   print("🔄 Applying transformations...")
   transformed_df = apply_transformations(df, ['normalize_scores', 'add_length_category', 'add_quality_tier'])
  
   print("🎯 Applying filters...")
   filtered_datasets = apply_functional_filters(transformed_df)
  
   print("n📈 Filter Results:")
   for name, filtered_df in filtered_datasets.items():
       print(f"  {name}: {len(filtered_df)} records")
  
   print("💾 Exporting filtered datasets...")
   export_filtered_data(filtered_datasets, f"{project_dir}/exports")
  
   print("n🏆 Top Quality Records:")
   best_quality = filtered_datasets['combined_quality'].head(3)
   for _, row in best_quality.iterrows():
       print(f"  • {row['text']} (score: {row['score']}, category: {row['category']})")
  
   return {
       'original_data': df,
       'transformed_data': transformed_df,
       'filtered_data': filtered_datasets,
       'quality_report': quality_report
   }


if __name__ == "__main__":
   results = main_analysis_pipeline()
   print("n✅ Analysis complete! Check the exports folder for filtered datasets.")

Finally, in main_analysis_pipeline, we execute the complete workflow, from project setup to data export, demonstrating how Lilac combined with functional programming lets us build modular, scalable, and expressive pipelines. We even print the top-quality entries as a quick snapshot. This function represents our full Lilac-driven data curation loop.

In short, readers gain hands-on insight into building reproducible data pipelines that leverage Lilac's dataset abstractions and functional programming patterns for scalable, clean analysis. The pipeline covers all key stages, including dataset creation, transformation, filtering, quality analysis, and export, offering flexibility for both experimentation and deployment. It also shows how meaningful metadata, such as normalized scores, quality tiers, and length categories, can be embedded and put to work in downstream tasks such as modeling or human review.


Check out the Code. All credit for this research goes to the researchers of this project.


Nikhil is a consulting intern at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in materials science, he is exploring new advancements and creating opportunities to contribute.
