Step-by-step guide to creating synthetic data using the Synthetic Data Vault (SDV)

Real-world data is often expensive, messy, and restricted by privacy rules. Synthetic data offers a way around these limits and is already widely used:

LLMs are trained on AI-generated text

Fraud detection systems simulate edge cases

Vision models are pretrained on synthetic images

SDV (Synthetic Data Vault) is an open-source Python library that generates realistic tabular data using machine learning. It learns patterns from real data and creates high-quality synthetic data for safe sharing, testing, and model training.

In this tutorial, we will use SDV to generate synthetic data step by step.

We will first install the SDV library:

pip install sdv

from sdv.io.local import CSVHandler

connector = CSVHandler()
FOLDER_NAME = '.' # If the data is in the same directory

data = connector.read(folder_name=FOLDER_NAME)
salesDf = data['data']

Next, we import the necessary module and connect to the local folder containing the dataset files. This reads the CSV files in the specified folder and stores them as pandas DataFrames, keyed by table name. In this case, we access the main dataset using data['data'].
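To make the data['data'] lookup less mysterious, here is a minimal standard-library sketch of the idea: every CSV file in a folder becomes one table, named after the file without its extension. This is not SDV's implementation; read_csv_folder and the sample data.csv contents are hypothetical, and rows are plain dictionaries rather than DataFrames.

```python
import csv
import tempfile
from pathlib import Path

def read_csv_folder(folder_name):
    """Return {file stem: list of row dicts} for every .csv file in the folder."""
    tables = {}
    for path in sorted(Path(folder_name).glob("*.csv")):
        with path.open(newline="") as handle:
            tables[path.stem] = list(csv.DictReader(handle))
    return tables

# Demo with a throwaway folder holding a hypothetical data.csv
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "data.csv").write_text("Date,Sales\n01-01-2024,100\n02-01-2024,120\n")
    tables = read_csv_folder(folder)

print(list(tables))       # table names come from the file names
print(tables["data"][0])  # first row of data.csv
```

A file named data.csv therefore ends up under the key 'data', which is why the tutorial reads the sales table from that key.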

from sdv.metadata import Metadata
metadata = Metadata.load_from_json('metadata.json')

Now, we load the metadata for the dataset. This metadata is stored in a JSON file and tells SDV how to interpret your data. It includes:

The table name
The primary key
The data type of each column (e.g., categorical, numerical, datetime, etc.)
Optional column formats, such as datetime formats or ID patterns
Table relationships (for multi-table setups)

Here is a sample metadata.json format:

{
  "METADATA_SPEC_VERSION": "V1",
  "tables": {
    "your_table_name": {
      "primary_key": "your_primary_key_column",
      "columns": {
        "your_primary_key_column": { "sdtype": "id", "regex_format": "T[0-9]{6}" },
        "date_column": { "sdtype": "datetime", "datetime_format": "%d-%m-%Y" },
        "category_column": { "sdtype": "categorical" },
        "numeric_column": { "sdtype": "numerical" }
      },
      "column_relationships": []
    }
  }
}
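If you prefer to build this file from Python, the sketch below assembles the same structure as a dictionary, runs a few basic consistency checks, and serializes it to JSON. The table and column names ("sales", "transaction_id", "Region") are hypothetical, and check_metadata is an illustrative helper, not part of SDV.

```python
import json

metadata_dict = {
    "METADATA_SPEC_VERSION": "V1",
    "tables": {
        "sales": {  # hypothetical table name
            "primary_key": "transaction_id",
            "columns": {
                "transaction_id": {"sdtype": "id", "regex_format": "T[0-9]{6}"},
                "Date": {"sdtype": "datetime", "datetime_format": "%d-%m-%Y"},
                "Region": {"sdtype": "categorical"},
                "Sales": {"sdtype": "numerical"},
            },
            "column_relationships": [],
        }
    },
}

def check_metadata(spec):
    """Return a list of problems found by a few basic structural checks."""
    problems = []
    for table_name, table in spec.get("tables", {}).items():
        columns = table.get("columns", {})
        primary_key = table.get("primary_key")
        if primary_key is not None and primary_key not in columns:
            problems.append(f"{table_name}: primary key '{primary_key}' is not a column")
        for column_name, column in columns.items():
            if "sdtype" not in column:
                problems.append(f"{table_name}.{column_name}: missing sdtype")
    return problems

problems = check_metadata(metadata_dict)
metadata_json = json.dumps(metadata_dict, indent=2)  # text to save as metadata.json
print(problems or "no structural problems found")
```

Catching a misspelled primary key or a column without an sdtype at this stage is cheaper than debugging a failed model fit later.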

from sdv.metadata import Metadata

metadata = Metadata.detect_from_dataframes(data)

Alternatively, we can use the SDV library to infer the metadata automatically. However, the results may not always be accurate or complete, so you should review the detected metadata and correct any discrepancies.
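To see why automatically detected metadata deserves a review, consider the toy type-guessing heuristic below. It is not SDV's detection logic; infer_sdtype, its thresholds, and the sample columns are all assumptions made for illustration.

```python
from datetime import datetime

def infer_sdtype(values, datetime_format="%d-%m-%Y", max_categories=10):
    """Guess an sdtype-like label for a column of strings (illustrative heuristic only)."""
    def all_parse(parser):
        try:
            for value in values:
                parser(value)
            return True
        except ValueError:
            return False

    if all_parse(float):
        return "numerical"
    if all_parse(lambda v: datetime.strptime(v, datetime_format)):
        return "datetime"
    if len(set(values)) <= max_categories:
        return "categorical"
    return "unknown"

columns = {
    "Sales": ["100.5", "98", "120.25"],
    "Date": ["01-01-2024", "15-02-2024", "03-03-2024"],
    "Region": ["North", "South", "North"],
    "Store ID": ["1001", "1002", "1003"],  # an identifier that only looks numeric
}
guesses = {name: infer_sdtype(values) for name, values in columns.items()}
print(guesses)
```

The hypothetical "Store ID" column is guessed as numerical even though it is an identifier. Mistakes of this kind are what a manual pass over the detected metadata is meant to catch.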

from sdv.single_table import GaussianCopulaSynthesizer

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(data=salesDf)
synthetic_data = synthesizer.sample(num_rows=10000)

Once the metadata and the original dataset are ready, we can train an SDV model and generate synthetic data. The model learns the structure and patterns of the real dataset and uses that knowledge to create synthetic records.

You can control how many rows are generated using the num_rows argument.
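For intuition about what "fit, then sample" means for a Gaussian copula, here is a heavily simplified two-column sketch using only the standard library. It is not SDV's implementation: it models each column by its observed values and captures the dependence between columns with a single correlation on rank-based normal scores. The units and sales values and all function names are hypothetical.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def normal_scores(values):
    """Replace each value by the standard-normal quantile of its rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, index in enumerate(order, start=1):
        scores[index] = STD_NORMAL.inv_cdf(rank / (n + 1))
    return scores

def pearson(a, b):
    """Pearson correlation of two equally long sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

def empirical_quantile(sorted_values, u):
    """Pick the observed value at cumulative probability u."""
    index = min(int(u * len(sorted_values)), len(sorted_values) - 1)
    return sorted_values[index]

def fit_and_sample(x, y, num_rows, seed=0):
    rng = random.Random(seed)
    rho = pearson(normal_scores(x), normal_scores(y))  # "fit": dependence
    sorted_x, sorted_y = sorted(x), sorted(y)          # "fit": marginals
    noise_scale = max(0.0, 1.0 - rho * rho) ** 0.5
    rows = []
    for _ in range(num_rows):
        # Draw two correlated standard normals, then map each back to its column
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + noise_scale * rng.gauss(0.0, 1.0)
        rows.append((empirical_quantile(sorted_x, STD_NORMAL.cdf(z1)),
                     empirical_quantile(sorted_y, STD_NORMAL.cdf(z2))))
    return rows

units = [1, 2, 3, 4, 5, 6, 7, 8]
sales = [12.0, 33.0, 19.5, 38.0, 58.0, 52.5, 74.0, 79.5]
synthetic_rows = fit_and_sample(units, sales, num_rows=1000)
print(len(synthetic_rows), synthetic_rows[:3])
```

As with num_rows in SDV, the number of sampled rows is independent of the size of the training data: eight real rows are enough to draw a thousand synthetic ones, although the quality of the fit still depends on how much real data the model has seen.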

from sdv.evaluation.single_table import evaluate_quality

quality_report = evaluate_quality(
    salesDf,
    synthetic_data,
    metadata)

The SDV library also provides tools to evaluate the quality of synthetic data by comparing it with the original dataset. A good starting point is to generate a quality report.

You can also use SDV's built-in plotting tools to visualize how the synthetic data compares with the real data. For example, import get_column_plot from sdv.evaluation.single_table to create a comparison chart for a specific column:
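To give a feel for the kind of score such a report aggregates, the sketch below implements two common column-level similarity measures from scratch: a Kolmogorov-Smirnov complement for numerical columns and a total variation complement for categorical ones. To my understanding, SDV's report relies on similar metrics, but this is an independent illustration with hypothetical function names, not the library's code.

```python
from bisect import bisect_right
from collections import Counter

def ks_complement(real, synthetic):
    """1 minus the two-sample Kolmogorov-Smirnov statistic (1.0 = identical shapes)."""
    real_sorted, synthetic_sorted = sorted(real), sorted(synthetic)
    largest_gap = 0.0
    for point in set(real) | set(synthetic):
        # Empirical CDF of each sample evaluated at this point
        cdf_real = bisect_right(real_sorted, point) / len(real_sorted)
        cdf_synthetic = bisect_right(synthetic_sorted, point) / len(synthetic_sorted)
        largest_gap = max(largest_gap, abs(cdf_real - cdf_synthetic))
    return 1.0 - largest_gap

def tv_complement(real, synthetic):
    """1 minus the total variation distance between category frequencies."""
    freq_real, freq_synthetic = Counter(real), Counter(synthetic)
    distance = 0.5 * sum(
        abs(freq_real[c] / len(real) - freq_synthetic[c] / len(synthetic))
        for c in set(freq_real) | set(freq_synthetic)
    )
    return 1.0 - distance

numeric_score = ks_complement([1, 2, 3, 4], [1, 2, 3, 4])
category_score = tv_complement(["a", "a", "b", "b"], ["a", "b", "b", "b"])
print(numeric_score, category_score)
```

Both scores range from 0 to 1, with 1 meaning the real and synthetic columns have the same distribution, so they can be averaged across columns into a single summary figure.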

from sdv.evaluation.single_table import get_column_plot

fig = get_column_plot(
    real_data=salesDf,
    synthetic_data=synthetic_data,
    column_name="Sales",
    metadata=metadata
)
   
fig.show()

We can observe that the distribution of the 'Sales' column in the real and synthetic data is very similar. To explore further, we can create more detailed comparisons using matplotlib, such as visualizing the average monthly sales trends for both datasets.

import pandas as pd
import matplotlib.pyplot as plt

# Ensure 'Date' columns are datetime
salesDf['Date'] = pd.to_datetime(salesDf['Date'], format="%d-%m-%Y")
synthetic_data['Date'] = pd.to_datetime(synthetic_data['Date'], format="%d-%m-%Y")

# Extract 'Month' as year-month string
salesDf['Month'] = salesDf['Date'].dt.to_period('M').astype(str)
synthetic_data['Month'] = synthetic_data['Date'].dt.to_period('M').astype(str)

# Group by 'Month' and calculate average sales
actual_avg_monthly = salesDf.groupby('Month')['Sales'].mean().rename('Actual Average Sales')
synthetic_avg_monthly = synthetic_data.groupby('Month')['Sales'].mean().rename('Synthetic Average Sales')

# Merge the two series into a DataFrame
avg_monthly_comparison = pd.concat([actual_avg_monthly, synthetic_avg_monthly], axis=1).fillna(0)

# Plot
plt.figure(figsize=(10, 6))
plt.plot(avg_monthly_comparison.index, avg_monthly_comparison['Actual Average Sales'], label="Actual Average Sales", marker="o")
plt.plot(avg_monthly_comparison.index, avg_monthly_comparison['Synthetic Average Sales'], label="Synthetic Average Sales", marker="o")

plt.title('Average Monthly Sales Comparison: Actual vs Synthetic')
plt.xlabel('Month')
plt.ylabel('Average Sales')
plt.xticks(rotation=45)
plt.grid(True)
plt.legend()
plt.ylim(bottom=0)  # y-axis starts at 0
plt.tight_layout()
plt.show()

The chart also shows that average monthly sales in both datasets are very similar, with only a small difference.
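To put a number on "a small difference", one option is the absolute percentage gap per month. The sketch below works on plain dictionaries of month to average sales; the figures are made up for illustration, and monthly_gap is a hypothetical helper. With pandas, such dictionaries could be obtained by calling to_dict() on each monthly average series.

```python
def monthly_gap(actual, synthetic):
    """Absolute percentage gap per month, for months present in both series."""
    return {
        month: abs(synthetic[month] - actual[month]) * 100 / actual[month]
        for month in actual
        if month in synthetic and actual[month] != 0
    }

# Made-up monthly averages, for illustration only
actual_avg = {"2024-01": 200.0, "2024-02": 250.0, "2024-03": 180.0}
synthetic_avg = {"2024-01": 210.0, "2024-02": 240.0}

gaps = monthly_gap(actual_avg, synthetic_avg)
worst_month = max(gaps, key=gaps.get)
print(gaps, worst_month)
```

Months missing from either dataset are skipped rather than treated as zero, so a gap in coverage is not mistaken for a large difference in sales.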

In this tutorial, we demonstrated how to use the SDV library to prepare data and metadata and to generate synthetic data. By training a model on your original dataset, SDV can create high-quality synthetic data that closely reflects the patterns and distributions of the real data. We also explored how to evaluate and visualize the synthetic data to confirm that key measures, such as the sales distribution and monthly trends, remain consistent. Synthetic data offers a practical way to overcome privacy and availability challenges while still supporting robust data analytics and machine learning workflows.

View the notebook on GitHub.

I am a civil engineering graduate (2022) from Jamia Millia Islamia, New Delhi, and I am very interested in data science, especially neural networks and their applications in various fields.
