Welcome to DBML 2025

3rd International Conference on Data Mining, Big Data and Machine Learning (DBML 2025)

March 28 ~ 29, 2025, Virtual conference



Accepted Papers
Elliptical Mixture Models Improve the Accuracy of Gaussian Mixture Models With Expectation-Maximization Algorithm

Xiaoying Zeng and Eugene Pinsky, Department of Computer Science, Metropolitan College, Boston University, 1010 Commonwealth Avenue, Boston, MA 02215, USA

ABSTRACT

This paper presents a comparative analysis of Gaussian Mixture Models (GMMs) and Elliptical Mixture Models (EMMs) for clustering multi-dimensional datasets using the Expectation-Maximization (EM) algorithm. EMMs, which accommodate elliptical distributions’ covariance structures, exhibit a superior ability to handle complex data patterns, particularly datasets characterized by irregular shapes and heavy tails. By integrating R’s statistical tools into Python, this study enhances computational flexibility, making it easier to fit elliptical distributions. Empirical results using metrics like Weighted Average Purity, Dunn Index, Rand Index, and Silhouette Score show that EMMs substantially improve clustering accuracy under certain conditions, outperforming GMMs in handling data complexities common in real-world scenarios. This research emphasizes the potential of EMMs as an alternative to traditional GMMs, offering a robust yet equally accessible approach for clustering in machine learning applications.

Keywords

Gaussian Mixture Models, Elliptical Distribution Mixture Models, Expectation-Maximization Algorithm, Clustering, Multidimensional Data.
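As a point of reference for this abstract, the minimal sketch below fits the baseline GMM by EM with scikit-learn on synthetic data and reports two of the cited metrics (silhouette score and an adjusted Rand index). It illustrates only the GMM comparison point; the paper's EMM fitting via R integration is not reproduced here, and the dataset is a stand-in.

```python
# Minimal sketch: baseline GMM clustering via the EM algorithm, with two of
# the evaluation metrics named in the abstract. Synthetic data stands in for
# the paper's datasets; the elliptical-mixture (EMM) fitting is not shown.
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score, adjusted_rand_score

# Three clusters with unequal spreads, mimicking irregularly shaped data.
X, y_true = make_blobs(n_samples=600, centers=3,
                       cluster_std=[1.0, 2.5, 0.5], random_state=42)

# Fit a GMM with full covariance matrices; fit_predict runs EM internally.
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=42)
labels = gmm.fit_predict(X)

print(f"Silhouette score: {silhouette_score(X, labels):.3f}")
print(f"Adjusted Rand index: {adjusted_rand_score(y_true, labels):.3f}")
```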


Enhancing Naive Bayes Algorithm with Stable Distributions for Classification

Nahush Bhamre, Pranjal Prasanna Ekhande, and Eugene Pinsky, Department of Computer Science, Metropolitan College, Boston University, 1010 Commonwealth Avenue, Boston, MA 02215, USA

ABSTRACT

The Naive Bayes (NB) algorithm is widely recognized for its efficiency and simplicity in classification tasks, particularly in domains with high-dimensional data. While the Gaussian Naive Bayes (GNB) model assumes a Gaussian distribution for continuous features, this assumption often limits its applicability to real-world datasets with non-Gaussian characteristics. To address this limitation, we introduce an enhanced Naive Bayes framework that incorporates stable distributions to model feature distributions. Stable distributions, with their flexibility in handling skewness and heavy tails, provide a more realistic representation of diverse data characteristics. This paper details the theoretical integration of stable distributions into the NB algorithm, the implementation process utilizing R and Python, and an experimental evaluation across multiple datasets. Results indicate that the proposed approach offers competitive or superior classification accuracy, particularly when the Gaussian assumption is violated, underscoring its potential for practical applications in diverse fields.

Keywords

Machine Learning, Naive Bayes Classification, Stable Distributions.
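To make the general idea in this abstract concrete, the sketch below builds a Naive Bayes classifier whose per-feature class-conditional densities are stable distributions fitted with SciPy's levy_stable, in place of Gaussians. This is an illustration of the approach under that assumption, not the authors' R/Python implementation; the helper names are invented, and levy_stable.fit can be slow on large samples.

```python
# Illustrative sketch (not the paper's implementation): Naive Bayes with
# stable-distribution likelihoods via scipy.stats.levy_stable.
import numpy as np
from scipy.stats import levy_stable

def fit_stable_nb(X, y):
    """Fit (alpha, beta, loc, scale) per class and per feature, plus priors."""
    params, priors = {}, {}
    for c in np.unique(y):
        Xc = X[y == c]
        priors[c] = len(Xc) / len(X)
        params[c] = [levy_stable.fit(Xc[:, j]) for j in range(X.shape[1])]
    return params, priors

def predict_stable_nb(X, params, priors):
    """Pick the class maximizing log prior + sum of per-feature log densities."""
    classes = list(params)
    scores = np.empty((len(X), len(classes)))
    for k, c in enumerate(classes):
        # Naive (conditional independence) assumption: sum log-pdfs over features.
        scores[:, k] = np.log(priors[c]) + sum(
            levy_stable.logpdf(X[:, j], *params[c][j])
            for j in range(X.shape[1]))
    return np.array(classes)[scores.argmax(axis=1)]
```

When the fitted alpha is close to 2 and beta close to 0, the stable density reduces to a Gaussian, so this classifier should roughly match GNB on well-behaved features and diverge on skewed or heavy-tailed ones.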


Workload Characterization for Resource Optimization of Big Data Analytics: Best Practices, Trends, and Opportunities

Dominik Scheinert and Alexander Guttenberger, Technische Universität Berlin, Berlin, Germany

ABSTRACT

As distributed processing environments grow in complexity, accurate performance prediction models are essential to optimize system efficiency and resource allocation. However, modern computing workloads typically exhibit a wide variety of characteristics, which hinders optimized resource configurations. Diverse approaches have been suggested to tackle the challenge of workload characterization, employing various parameters for performance modeling in the process. To expand on this objective, this paper introduces a 5+1 layer classification model designed to enhance the accuracy of predictive models by classifying and reflecting on relevant modeling parameters. We conducted a systematic literature review to identify and analyze the role of six key layers: Big Data Framework, Performance, Hardware, Data, User Application, and Virtualization. Our findings reveal that while the Big Data Framework and Performance Layers are foundational, predictive accuracy improves when combined with complementary layers, especially the Data Layer, which highlights the impact of data characteristics such as size and distribution. The Hardware Layer provides critical insights into system limitations, while the emerging Virtualization Layer reflects the increasing importance of virtualized, potentially cloud-based environments. The proposed 5+1 layer classification model offers a structured approach to capture and explain the complexity of distributed analytical workflows, providing a nuanced framework for performance modeling. This layered classification model aims to support the development of more robust, adaptable, and generalizable prediction models for use in cloud-based systems.

Keywords

Big Data Analytics, Performance Modeling, Resource Management, Cloud Computing.
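As a purely hypothetical illustration of how the paper's 5+1 layers might organize the inputs to a performance model, the sketch below tags candidate modeling parameters with their layer. Only the six layer names come from the abstract; every class and parameter name here is invented for the example.

```python
# Hypothetical sketch: grouping performance-modeling parameters by the
# 5+1 layer classification. Layer names follow the abstract; the parameter
# names and units are illustrative assumptions, not the paper's taxonomy.
from dataclasses import dataclass
from enum import Enum

class Layer(Enum):
    BIG_DATA_FRAMEWORK = "Big Data Framework"
    PERFORMANCE = "Performance"
    HARDWARE = "Hardware"
    DATA = "Data"
    USER_APPLICATION = "User Application"
    VIRTUALIZATION = "Virtualization"

@dataclass
class ModelingParameter:
    name: str
    layer: Layer
    unit: str = ""

# Example feature set for a runtime-prediction model (illustrative only).
parameters = [
    ModelingParameter("executor_count", Layer.BIG_DATA_FRAMEWORK),
    ModelingParameter("observed_runtime", Layer.PERFORMANCE, "s"),
    ModelingParameter("cpu_cores", Layer.HARDWARE),
    ModelingParameter("input_size", Layer.DATA, "GB"),
    ModelingParameter("job_type", Layer.USER_APPLICATION),
    ModelingParameter("vm_flavor", Layer.VIRTUALIZATION),
]

# Group parameter names by layer, e.g. for ablation studies per layer.
by_layer: dict[Layer, list[str]] = {}
for p in parameters:
    by_layer.setdefault(p.layer, []).append(p.name)
```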