March 28 ~ 29, 2025, Virtual conference
Xiaoying Zeng and Eugene Pinsky, Department of Computer Science, Metropolitan College, Boston University, 1010 Commonwealth Avenue, Boston, MA 02215, USA
This paper presents a comparative analysis of Gaussian Mixture Models (GMMs) and Elliptical Mixture Models (EMMs) for clustering multi-dimensional datasets using the Expectation-Maximization (EM) algorithm. EMMs, which accommodate elliptical distributions’ covariance structures, exhibit a superior ability to handle complex data patterns, particularly datasets characterized by irregular shapes and heavy tails. By integrating R’s statistical tools into Python, this study enhances computational flexibility, making it easier to fit elliptical distributions. Empirical results using metrics like Weighted Average Purity, Dunn Index, Rand Index, and silhouette score show that EMMs substantially improve clustering accuracy under certain conditions, outperforming GMMs in handling data complexities common in real-world scenarios. This research emphasizes the potential of EMMs as an alternative to traditional GMMs, offering a robust yet equally accessible approach for clustering in machine learning applications.
Gaussian Mixture Models, Elliptical Distribution Mixture Models, Expectation-Maximization algorithm, Clustering, Multidimensional Data.
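As a rough illustration of two of the evaluation metrics named in the abstract above, the sketch below computes a weighted average purity and the Rand index from ground-truth labels and cluster assignments. The abstract does not define either metric, so the common textbook definitions are assumed here; the function names and signatures are hypothetical and not taken from the paper.

```python
from collections import Counter
from itertools import combinations


def weighted_average_purity(true_labels, cluster_ids):
    """Size-weighted mean of per-cluster purity (assumed textbook definition)."""
    members = {}
    for truth, cluster in zip(true_labels, cluster_ids):
        members.setdefault(cluster, []).append(truth)
    total = len(true_labels)
    # Weighting each cluster's purity by its share of the points reduces to
    # summing the majority-class counts and dividing by the total count.
    return sum(max(Counter(m).values()) for m in members.values()) / total


def rand_index(true_labels, cluster_ids):
    """Fraction of point pairs on which the two labelings agree."""
    pairs = list(combinations(range(len(true_labels)), 2))
    agree = sum(
        (true_labels[i] == true_labels[j]) == (cluster_ids[i] == cluster_ids[j])
        for i, j in pairs
    )
    return agree / len(pairs)
```

Both functions take two equal-length sequences and return a value in [0, 1], with 1 indicating perfect agreement with the ground truth. The pairwise loop in `rand_index` is quadratic in the number of points, which is acceptable for small illustrative datasets only.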
Nahush Bhamre, Pranjal Prasanna Ekhande, and Eugene Pinsky, Department of Computer Science, Metropolitan College, Boston University, 1010 Commonwealth Avenue, Boston, MA 02215, USA
The Naive Bayes (NB) algorithm is widely recognized for its efficiency and simplicity in classification tasks, particularly in domains with high-dimensional data. While the Gaussian Naive Bayes (GNB) model assumes a Gaussian distribution for continuous features, this assumption often limits its applicability to real-world datasets with non-Gaussian characteristics. To address this limitation, we introduce an enhanced Naive Bayes framework that incorporates stable distributions to model feature distributions. Stable distributions, with their flexibility in handling skewness and heavy tails, provide a more realistic representation of diverse data characteristics. This paper details the theoretical integration of stable distributions into the NB algorithm, the implementation process utilizing R and Python, and an experimental evaluation across multiple datasets. Results indicate that the proposed approach offers competitive or superior classification accuracy, particularly when the Gaussian assumption is violated, underscoring its potential for practical applications in diverse fields.
Machine Learning, Naive Bayes Classification, Stable Distributions.
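To make the idea in the abstract above concrete, here is a minimal sketch of a Naive Bayes classifier whose per-feature class-conditional density is a Cauchy distribution. The Cauchy is one special case of the stable family with a closed-form density, chosen here only so the example stays self-contained; it is not the authors' method, which fits general stable distributions using R and Python. The class name `CauchyNaiveBayes` and the median and half-interquartile-range estimators are assumptions made for illustration.

```python
import math
from collections import defaultdict
from statistics import median, quantiles


def cauchy_logpdf(x, loc, scale):
    """Log-density of a Cauchy distribution with location loc and scale scale."""
    return -math.log(math.pi * scale) - math.log1p(((x - loc) / scale) ** 2)


class CauchyNaiveBayes:
    """Hypothetical sketch: Naive Bayes with Cauchy feature densities."""

    def fit(self, X, y):
        by_class = defaultdict(list)
        for row, label in zip(X, y):
            by_class[label].append(row)
        n = len(y)
        self.log_prior_ = {c: math.log(len(rows) / n) for c, rows in by_class.items()}
        self.params_ = {}
        for c, rows in by_class.items():
            params = []
            for column in zip(*rows):  # iterate over features
                q1, _, q3 = quantiles(column, n=4)
                # For a Cauchy, the quartiles sit at loc +/- scale, so the
                # scale is half the interquartile range (floored to avoid 0).
                scale = max((q3 - q1) / 2.0, 1e-9)
                params.append((median(column), scale))
            self.params_[c] = params
        return self

    def predict(self, X):
        predictions = []
        for row in X:
            # Naive Bayes score: log prior plus the sum of per-feature log-densities.
            scores = {
                c: self.log_prior_[c]
                + sum(cauchy_logpdf(v, loc, s) for v, (loc, s) in zip(row, params))
                for c, params in self.params_.items()
            }
            predictions.append(max(scores, key=scores.get))
        return predictions
```

Because the location and scale are estimated from the median and quartiles, a single extreme value in a class barely moves the fitted density, which is the kind of heavy-tail robustness the abstract attributes to stable distributions. Each class needs at least two training rows for the quartile estimate to be defined.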
Dominik Scheinert and Alexander Guttenberger, Technische Universität Berlin, Berlin, Germany
As distributed processing environments grow in complexity, accurate performance prediction models are essential to optimize system efficiency and resource allocation. However, modern computing workloads typically exhibit a wide variety of characteristics, which hinders optimized resource configurations. Diverse approaches have been suggested to tackle the challenge of workload characterization, employing various parameters for performance modeling in the process. To expand on this objective, this paper introduces a 5+1 layer classification model designed to enhance the accuracy of predictive models by classifying and reflecting on relevant modeling parameters. We conducted a systematic literature review to identify and analyze the role of six key layers: Big Data Framework, Performance, Hardware, Data, User Application, and Virtualization. Our findings reveal that while the Big Data Framework and Performance Layers are foundational, predictive accuracy improves when combined with complementary layers, especially the Data Layer, which highlights the impact of data characteristics such as size and distribution. The Hardware Layer provides critical insights into system limitations, while the emerging Virtualization Layer reflects the increasing importance of virtualized, potentially cloud-based environments. The proposed 5+1 layer classification model offers a structured approach to capture and explain the complexity of distributed analytical workflows, providing a nuanced framework for performance modeling. This layered classification model aims to support the development of more robust, adaptable, and generalizable prediction models for use in cloud-based systems.
Big Data Analytics, Performance Modeling, Resource Management, Cloud Computing.