Unlocking the Power of Synthetic Data using AI
ECLIPTICA® is an AI Agent for synthetic data generation, built to help researchers and analysts unlock insights without the barriers of restricted or unavailable data. Using state-of-the-art generative AI methods, ECLIPTICA® can produce high-quality synthetic datasets at the click of a button.
Whether you’re working in orphan disease research, real-world evidence, market access, or other areas where data scarcity is a challenge, ECLIPTICA® redefines what’s possible by safely simulating realistic, statistically valid data.
Because ECLIPTICA® runs entirely within your local environment, your original data never leaves your system, ensuring complete privacy and compliance.
What Users Are Saying
“The synthetic data is almost as good as the real data”
“We used ECLIPTICA® for our project and found the decisions we made from the synthetic data were corroborated by the results from the actual real data”
Why use ECLIPTICA®?
- No data restrictions: Generate and explore datasets where access to real data is limited or impossible.
- AI-driven accuracy: Advanced GenAI models produce highly realistic, reliable synthetic data.
- Privacy by design: 100% local execution, with no data leaving your secure environment.
- Flexible applications: Ideal for registries, internal repositories, expanding sample sizes, or rare disease studies.
- Accelerated decision-making: Make confident decisions, even before real-world data becomes available.
Frequently Asked Questions
What file types can be uploaded to ECLIPTICA®?
ECLIPTICA® supports three tabular file types for upload: CSV (.csv), Excel (.xlsx, .xls), and JSON (.json). Excel files are read from a single sheet: the first sheet is used by default, but you can also select a specific sheet name from a dropdown. JSON files are expected to follow a table-like structure, typically an array of records (objects).
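For reference, these three formats correspond to what standard tabular tools such as pandas can read. The sketch below is illustrative only (file and sheet names are hypothetical, and this is not ECLIPTICA®'s internal loader):

```python
# Illustrative only: equivalent pandas reads, not ECLIPTICA's internal loader.
# File and sheet names below are hypothetical.
import pandas as pd

df_csv = pd.read_csv("cohort.csv")                            # CSV (.csv)
df_xlsx = pd.read_excel("cohort.xlsx")                        # Excel: first sheet by default
df_sheet = pd.read_excel("cohort.xlsx", sheet_name="Visits")  # Excel: a named sheet
df_json = pd.read_json("cohort.json", orient="records")       # JSON: array of record objects
```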
How do I need to structure my data before uploading, will ECLIPTICA® accept negative values, date and text values?
Yes, ECLIPTICA® accepts negative values, dates, and text fields. Currently, it recognizes missing data only when cells are blank or empty, so avoid uploading placeholder values such as -999 or NaN: they distort the data distribution and lead to inaccurate statistical similarity metrics. Support for treating -999 and NaN as missing values is planned for a future version. For more reliable and accurate synthetic data, we encourage you to interpolate missing values before uploading your dataset to ECLIPTICA®; if you prefer not to interpolate, leave missing entries as blank cells rather than using -999 or NaN.

Long text fields (text values longer than 50 characters) are treated as unstructured notes: they are randomly generated rather than synthesized from the data distribution, so their original patterns are not preserved.

Date, time, and datetime columns are also not synthesized directly. If they are included, ECLIPTICA® converts them into numeric representations so that models such as CTGAN, TVAE, or GCP can learn general temporal patterns, then converts the generated values back into standard date or time formats. Note that these date-derived values do not preserve joint distributions across records in the same way as standard synthesized variables.

In summary, ECLIPTICA® accepts negative, date, time, and text values. Dates and long text are not synthesized directly; instead, they are handled in a way that allows the models to operate without compromising the integrity of the rest of the data.
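As a rough illustration of the date handling described above, the sketch below converts a datetime column to a numeric epoch representation before modelling and maps generated values back to dates afterwards. This is a simplified assumption about a common approach, not ECLIPTICA®'s exact internals, and the column name is hypothetical:

```python
# Simplified sketch of date handling: datetime -> numeric for modelling, numeric -> datetime after.
# The column "visit_date" is hypothetical; this is not ECLIPTICA's exact pipeline.
import pandas as pd

df = pd.DataFrame({"visit_date": ["2021-01-05", "2021-03-20", None]})

# Convert to seconds since the Unix epoch; missing dates (NaT) simply stay missing (NaN).
dt = pd.to_datetime(df["visit_date"])
df["visit_date_num"] = (dt - pd.Timestamp("1970-01-01")).dt.total_seconds()

# ... a model such as CTGAN, TVAE, or GCP would be trained on the numeric column here ...

# Convert generated numeric values back to a standard date format.
synthetic_num = df["visit_date_num"]                     # stand-in for model output
synthetic_dates = pd.to_datetime(synthetic_num, unit="s").dt.normalize()
```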
How is missing data handled by ECLIPTICA®?
Empty cells or blank strings are normalized to missing values. ECLIPTICA® computes the missing ratio for each column. Based on these ratios, columns are split into two categories: those with low missingness (less than or equal to 20%) and those with high missingness (more than 20%). High-missing columns are handled separately using a simpler strategy and later re-integrated into the dataset. Categorical columns are temporarily label encoded with a reserved placeholder, and a missing mask is retained to ensure that true missing values can be restored accurately. Numerical columns have their missing values imputed using the mean or mode to maintain stability. These preprocessing steps ensure that machine learning models receive consistent and learnable inputs, while preserving information about the original missingness.
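The sketch below illustrates these preprocessing steps in simplified form. It is an assumption about typical tooling, not ECLIPTICA®'s exact code: blank values are normalized, columns are split at the 20% missingness threshold, a missing mask is retained, and low-missing columns are imputed or label-encoded.

```python
# Simplified sketch of the preprocessing described above; not ECLIPTICA's exact implementation.
import pandas as pd

def preprocess(df: pd.DataFrame, threshold: float = 0.20):
    df = df.replace("", pd.NA)                                   # blank strings -> missing
    missing_ratio = df.isna().mean()                             # per-column missing ratio
    low_cols = missing_ratio[missing_ratio <= threshold].index   # <= 20% missing: modelled directly
    high_cols = missing_ratio[missing_ratio > threshold].index   # > 20% missing: handled separately

    model_df = df[low_cols].copy()
    missing_mask = model_df.isna()                               # kept so true missingness can be restored later
    for col in model_df.columns:
        if pd.api.types.is_numeric_dtype(model_df[col]):
            model_df[col] = model_df[col].fillna(model_df[col].mean())    # numeric: mean impute
        else:
            codes, _ = pd.factorize(model_df[col], use_na_sentinel=True)  # label-encode categories
            model_df[col] = codes                                # missing values get the reserved code -1
    return model_df, df[high_cols], missing_mask, missing_ratio
```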
The pipeline then reintroduces missing values per column to match the original missing ratios, either exactly or within a small acceptable tolerance. For paired columns (such as a code and its corresponding label), missingness is synchronized: if the base column is missing, the associated label column is also forced to be missing, and when the base is present, the label is deterministically regenerated from the inferred one-to-one mapping. This approach lets the model train on cleaner data without placeholder values like -999 while ensuring that the synthetic data file mirrors the original dataset's missingness patterns. For improved display in spreadsheets, text and label columns have their missing values replaced with empty strings before saving.
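A simplified post-processing sketch follows. The column names, the code/label pair, and the mapping are hypothetical, and this is not ECLIPTICA®'s exact code; it only shows how missingness can be re-injected at the original ratios and how a paired label column follows its code:

```python
# Sketch of re-injecting missingness into synthetic output to match original ratios.
# Illustrative only; the code/label pair and mapping are hypothetical.
import numpy as np
import pandas as pd

def reinject_missing(synth: pd.DataFrame, missing_ratio: pd.Series, seed: int = 0) -> pd.DataFrame:
    rng = np.random.default_rng(seed)
    out = synth.copy()
    for col, ratio in missing_ratio.items():
        if col in out.columns and ratio > 0:
            mask = rng.random(len(out)) < ratio      # drop values at the original per-column rate
            out.loc[mask, col] = pd.NA
    return out

def sync_pair(out: pd.DataFrame, code_col: str, label_col: str, mapping: dict) -> pd.DataFrame:
    # Missing code -> missing label; present code -> its deterministically mapped label.
    out[label_col] = out[code_col].map(mapping)
    return out
```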
Which model should I be using?
With ECLIPTICA®, you can use any of the five models below for your synthetic data generation (a brief illustrative fitting sketch follows the model descriptions):
CTGAN (Conditional Tabular Generative Adversarial Network): CTGAN is a machine learning model designed to generate synthetic tabular data. The CTGAN Synthesizer uses GAN-based, deep learning methods to train a model and generate synthetic data.
Gaussian Copula: The Gaussian Copula Synthesizer is a method for generating synthetic tabular data that preserves the statistical properties and dependencies of the original dataset. It uses a Gaussian copula model, a statistical tool that models the multivariate distribution of data by separating marginal distributions from their dependence structure. This allows it to capture complex relationships between columns while generating realistic synthetic data. It is particularly effective for datasets with numerical and categorical variables.
Sequential Decision Tree: The Sequential Decision Tree (SDT) Synthesizer is a method for generating synthetic tabular data by modelling each column of a dataset sequentially using decision trees. It captures dependencies between columns by using previously synthesized columns as features for predicting subsequent ones, ensuring the synthetic data preserves statistical properties and relationships of the original dataset.
Copula GAN Synthesizer (CGS): The Copula GAN Synthesizer uses a mix of classic statistical methods and GAN-based deep learning methods to train a model and generate synthetic data. It combines the strengths of a Gaussian Copula and a Generative Adversarial Network (GAN) to create high-quality synthetic tabular data. It is designed to model complex, non-linear dependencies and statistical distributions in datasets with mixed data types (numerical, categorical, etc.), while preserving privacy by generating data that mimics the original without replicating exact records.
Tabular Variational Autoencoder (TVAE): In TVAE, each row is encoded into a latent Gaussian vector defined by its mean and variance. A decoder then reconstructs the original data columns from samples drawn from this latent space. Mixed data types, such as numeric and categorical, are managed through column transformers like normalization and encoding. This approach is useful when the dataset has nonlinear relationships or interactions across features, especially with mixed numerical and categorical columns where simpler models like copulas fall short.
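Several of these model families (CTGAN, Gaussian Copula, Copula GAN, TVAE) are also implemented in the open-source SDV library, so the sketch below shows what fitting and sampling such a synthesizer typically looks like. It is illustrative only, not ECLIPTICA®'s interface; the input file name is hypothetical, and the Sequential Decision Tree model is not part of SDV:

```python
# Illustrative sketch using the open-source SDV library, which implements the same model
# families (CTGAN, Gaussian Copula, Copula GAN, TVAE); not ECLIPTICA's own interface.
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import (
    CTGANSynthesizer,
    CopulaGANSynthesizer,
    GaussianCopulaSynthesizer,
    TVAESynthesizer,
)

data = pd.read_csv("training_data.csv")             # hypothetical input file

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data)                # infer column types automatically

synthesizer = GaussianCopulaSynthesizer(metadata)   # swap in CTGANSynthesizer, TVAESynthesizer, ...
synthesizer.fit(data)
synthetic = synthesizer.sample(num_rows=len(data))  # same number of rows as the original
```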
How long does it take to generate synthetic data and how much computer power will I need to generate the synthetic data?
The processing time depends on the size of the dataset and the number of variables. It takes longer when the dataset has a high percentage of missing values, a large number of variables, or millions of records. In such cases, especially with deep learning models, training can sometimes take several days. For large-scale datasets, we recommend the GCP (Gaussian Copula) or SDT (Sequential Decision Tree) models, which are optimized for speed and scalability. We are currently running experiments to better understand the time requirements across different dataset sizes and characteristics.
Can ECLIPTICA® be used for data augmentation?
ECLIPTICA® supports data augmentation, allowing you to generate synthetic datasets from 0.5× to 2× the size of your original training data. Although larger outputs are technically possible, it's recommended to keep the augmentation within this range to maintain data quality and integrity. After augmentation, always review the distributions of the generated data to confirm they are consistent with the original dataset.
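As a rough illustration of that checking step, the sketch below uses the same assumed SDV-style setup as the model sketch above (a hypothetical file name, not ECLIPTICA®'s UI), samples 2× the original row count, and compares simple numeric summaries:

```python
# Sketch: 2x augmentation plus a quick distribution spot-check.
# Assumes the same SDV-style setup as the earlier sketch; illustrative only.
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

data = pd.read_csv("training_data.csv")             # hypothetical input file
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data)
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(data)

synthetic_2x = synthesizer.sample(num_rows=2 * len(data))   # 2x augmentation

# Spot-check: compare numeric summaries between the original and the synthetic data.
numeric_cols = data.select_dtypes("number").columns
print(data[numeric_cols].describe().loc[["mean", "std"]])
print(synthetic_2x[numeric_cols].describe().loc[["mean", "std"]])
```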
Is the data I upload secure?
ECLIPTICA® is designed to run entirely locally, ensuring full data privacy. Your files are read directly from your machine, models are trained locally, and all outputs are saved to a folder of your choice. No part of your data is uploaded or sent elsewhere. ECLIPTICA® is built to keep your data secure and under your control at all times.
How can I access ECLIPTICA®?
The next version of ECLIPTICA® will be available in January 2026. To enquire about access, please complete the form below or contact us at ecliptica@r-s-s.com.