Welcome to DBML 2025

3rd International Conference on Data Mining, Big Data and Machine Learning (DBML 2025)

March 28 ~ 29, 2025, Virtual conference



Accepted Papers
Elliptical Mixture Models Improve the Accuracy of Gaussian Mixture Models With Expectation-Maximization Algorithm

Xiaoying Zeng and Eugene Pinsky, Department of Computer Science, Metropolitan College, Boston University, 1010 Commonwealth Avenue, Boston, MA 02215, USA

ABSTRACT

This paper presents a comparative analysis of Gaussian Mixture Models (GMMs) and Elliptical Mixture Models (EMMs) for clustering multi-dimensional datasets using the Expectation-Maximization (EM) algorithm. EMMs, which accommodate elliptical distributions’ covariance structures, exhibit a superior ability to handle complex data patterns, particularly datasets characterized by irregular shapes and heavy tails. By integrating R’s statistical tools into Python, this study enhances computational flexibility, making it easier to fit elliptical distributions. Empirical results using metrics like Weighted Average Purity, Dunn Index, Rand Index, and Silhouette Score show that EMMs substantially improve clustering accuracy under certain conditions, outperforming GMMs in handling data complexities common in real-world scenarios. This research emphasizes the potential of EMMs as an alternative to traditional GMMs, offering a robust yet equally accessible approach for clustering in machine learning applications.

Keywords

Gaussian Mixture Models, Elliptical Distribution Mixture Models, Expectation-Maximization Algorithm, Clustering, Multidimensional Data.
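As a point of reference for this abstract, the minimal sketch below fits the baseline GMM by EM with scikit-learn on synthetic data and reports two of the cited metrics (silhouette score and an adjusted Rand index). It illustrates only the GMM comparison point; the paper's EMM fitting via R integration is not reproduced here, and the dataset is a stand-in.

```python
# Minimal sketch: baseline GMM clustering via the EM algorithm, with two of
# the evaluation metrics named in the abstract. Synthetic data stands in for
# the paper's datasets; the elliptical-mixture (EMM) fitting is not shown.
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score, adjusted_rand_score

# Three clusters with unequal spreads, mimicking irregularly shaped data.
X, y_true = make_blobs(n_samples=600, centers=3,
                       cluster_std=[1.0, 2.5, 0.5], random_state=42)

# Fit a GMM with full covariance matrices; fit_predict runs EM internally.
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=42)
labels = gmm.fit_predict(X)

print(f"Silhouette score: {silhouette_score(X, labels):.3f}")
print(f"Adjusted Rand index: {adjusted_rand_score(y_true, labels):.3f}")
```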


Enhancing Naive Bayes Algorithm with Stable Distributions for Classification

Nahush Bhamre, Pranjal Prasanna Ekhande, and Eugene Pinsky, Department of Computer Science, Metropolitan College, Boston University, 1010 Commonwealth Avenue, Boston, MA 02215, USA

ABSTRACT

The Naive Bayes (NB) algorithm is widely recognized for its efficiency and simplicity in classification tasks, particularly in domains with high-dimensional data. While the Gaussian Naive Bayes (GNB) model assumes a Gaussian distribution for continuous features, this assumption often limits its applicability to real-world datasets with non-Gaussian characteristics. To address this limitation, we introduce an enhanced Naive Bayes framework that incorporates stable distributions to model feature distributions. Stable distributions, with their flexibility in handling skewness and heavy tails, provide a more realistic representation of diverse data characteristics. This paper details the theoretical integration of stable distributions into the NB algorithm, the implementation process utilizing R and Python, and an experimental evaluation across multiple datasets. Results indicate that the proposed approach offers competitive or superior classification accuracy, particularly when the Gaussian assumption is violated, underscoring its potential for practical applications in diverse fields.

Keywords

Machine Learning, Naive Bayes Classification, Stable Distributions.
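To make the general idea in this abstract concrete, the sketch below builds a Naive Bayes classifier whose per-feature class-conditional densities are stable distributions fitted with SciPy's levy_stable, in place of Gaussians. This is an illustration of the approach under that assumption, not the authors' R/Python implementation; the helper names are invented, and levy_stable.fit can be slow on large samples.

```python
# Illustrative sketch (not the paper's implementation): Naive Bayes with
# stable-distribution likelihoods via scipy.stats.levy_stable.
import numpy as np
from scipy.stats import levy_stable

def fit_stable_nb(X, y):
    """Fit (alpha, beta, loc, scale) per class and per feature, plus priors."""
    params, priors = {}, {}
    for c in np.unique(y):
        Xc = X[y == c]
        priors[c] = len(Xc) / len(X)
        params[c] = [levy_stable.fit(Xc[:, j]) for j in range(X.shape[1])]
    return params, priors

def predict_stable_nb(X, params, priors):
    """Pick the class maximizing log prior + sum of per-feature log densities."""
    classes = list(params)
    scores = np.empty((len(X), len(classes)))
    for k, c in enumerate(classes):
        # Naive (conditional independence) assumption: sum log-pdfs over features.
        scores[:, k] = np.log(priors[c]) + sum(
            levy_stable.logpdf(X[:, j], *params[c][j])
            for j in range(X.shape[1]))
    return np.array(classes)[scores.argmax(axis=1)]
```

When the fitted alpha is close to 2 and beta close to 0, the stable density reduces to a Gaussian, so this classifier should roughly match GNB on well-behaved features and diverge on skewed or heavy-tailed ones.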


Workload Characterization for Resource Optimization of Big Data Analytics: Best Practices, Trends, and Opportunities

Dominik Scheinert and Alexander Guttenberger, Technische Universität Berlin, Berlin, Germany

ABSTRACT

As distributed processing environments grow in complexity, accurate performance prediction models are essential to optimize system efficiency and resource allocation. However, modern computing workloads typically exhibit a wide variety of characteristics, which hinders optimized resource configurations. Diverse approaches have been suggested to tackle the challenge of workload characterization, employing various parameters for performance modeling in the process. To expand on this objective, this paper introduces a 5+1 layer classification model designed to enhance the accuracy of predictive models by classifying and reflecting on relevant modeling parameters. We conducted a systematic literature review to identify and analyze the role of six key layers: Big Data Framework, Performance, Hardware, Data, User Application, and Virtualization. Our findings reveal that while the Big Data Framework and Performance Layers are foundational, predictive accuracy improves when combined with complementary layers, especially the Data Layer, which highlights the impact of data characteristics such as size and distribution. The Hardware Layer provides critical insights into system limitations, while the emerging Virtualization Layer reflects the increasing importance of virtualized, potentially cloud-based environments. The proposed 5+1 layer classification model offers a structured approach to capture and explain the complexity of distributed analytical workflows, providing a nuanced framework for performance modeling. This layered classification model aims to support the development of more robust, adaptable, and generalizable prediction models for use in cloud-based systems.

Keywords

Big Data Analytics, Performance Modeling, Resource Management, Cloud Computing.
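As a purely hypothetical illustration of how the paper's 5+1 layers might organize the inputs to a performance model, the sketch below tags candidate modeling parameters with their layer. Only the six layer names come from the abstract; every class and parameter name here is invented for the example.

```python
# Hypothetical sketch: grouping performance-modeling parameters by the
# 5+1 layer classification. Layer names follow the abstract; the parameter
# names and units are illustrative assumptions, not the paper's taxonomy.
from dataclasses import dataclass
from enum import Enum

class Layer(Enum):
    BIG_DATA_FRAMEWORK = "Big Data Framework"
    PERFORMANCE = "Performance"
    HARDWARE = "Hardware"
    DATA = "Data"
    USER_APPLICATION = "User Application"
    VIRTUALIZATION = "Virtualization"

@dataclass
class ModelingParameter:
    name: str
    layer: Layer
    unit: str = ""

# Example feature set for a runtime-prediction model (illustrative only).
parameters = [
    ModelingParameter("executor_count", Layer.BIG_DATA_FRAMEWORK),
    ModelingParameter("observed_runtime", Layer.PERFORMANCE, "s"),
    ModelingParameter("cpu_cores", Layer.HARDWARE),
    ModelingParameter("input_size", Layer.DATA, "GB"),
    ModelingParameter("job_type", Layer.USER_APPLICATION),
    ModelingParameter("vm_flavor", Layer.VIRTUALIZATION),
]

# Group parameter names by layer, e.g. for ablation studies per layer.
by_layer: dict[Layer, list[str]] = {}
for p in parameters:
    by_layer.setdefault(p.layer, []).append(p.name)
```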