Logistic regression papers

Here you can find a selection of high-quality arXiv papers on logistic regression.

[1] arXiv:2006.09913 [pdf]

Introduction to Machine Learning for Accelerator Physics

Daniel Ratner

This pair of CAS lectures gives an introduction for accelerator physics students to the framework and terminology of machine learning (ML). We start by introducing the language of ML through a simple example of linear regression, including a probabilistic perspective to introduce the concepts of maximum likelihood estimation (MLE) and maximum a posteriori (MAP) estimation. We then apply the concepts to examples of neural networks and logistic regression. Next, we introduce non-parametric models and the kernel method and give a brief introduction to two other machine learning paradigms, unsupervised and reinforcement learning. Finally, we close with example applications of ML at a free-electron laser.
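As a rough illustration of the MLE/MAP distinction mentioned in this abstract (not code from the lectures), the sketch below fits a one-dimensional logistic regression by gradient descent. The toy data, the function name `fit_logistic`, and the penalty strength `l2` are all assumptions made for this example; `l2 = 0` gives the MLE, while `l2 > 0` corresponds to MAP estimation under a zero-mean Gaussian prior on the slope.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(xs, ys, l2=0.0, lr=0.1, steps=5000):
    """Gradient descent on the mean negative log-likelihood plus (l2/2)*w**2.

    l2 = 0 gives the MLE; l2 > 0 is MAP with a zero-mean Gaussian prior
    on the slope w (the intercept b is left unpenalized).
    """
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y  # derivative of the loss w.r.t. the logit
            gw += err * x
            gb += err
        w -= lr * (gw / n + l2 * w)
        b -= lr * (gb / n)
    return w, b

# Hypothetical toy data; the classes overlap, so the MLE is finite.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0, -0.2, 0.3]
ys = [0, 0, 0, 1, 1, 1, 1, 0]

w_mle, b_mle = fit_logistic(xs, ys, l2=0.0)
w_map, b_map = fit_logistic(xs, ys, l2=0.5)
print(f"MLE slope: {w_mle:.3f}, MAP slope: {w_map:.3f}")
```

The prior pulls the MAP slope toward zero relative to the MLE, which is the usual reading of L2 regularisation as a Gaussian prior.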

[2] arXiv:2006.09606 [pdf]

Structured Stochastic Quasi-Newton Methods for Large-Scale Optimization Problems

Minghan Yang, Dong Xu, Yongfeng Li, Zaiwen Wen, Mengyun Chen

In this paper, we consider large-scale finite-sum nonconvex problems arising from machine learning. Since the Hessian is often a summation of a relatively cheap and accessible part and an expensive or even inaccessible part, a stochastic quasi-Newton matrix is constructed using partial Hessian information as much as possible. By further exploiting the low-rank structures based on the Nyström approximation, the computation of the quasi-Newton direction is affordable. To make full use of the gradient estimation, we also develop an extra-step strategy for this framework. Global convergence to a stationary point in expectation and a local superlinear convergence rate are established under some mild assumptions. Numerical experiments on logistic regression, deep autoencoder networks and deep learning problems show that the efficiency of our proposed method is at least comparable with the state-of-the-art methods.

[3] arXiv:2006.09091 [pdf]

Flatness is a False Friend

Diego Granziol

Hessian based measures of flatness, such as the trace, Frobenius and spectral norms, have been argued, used and shown to relate to generalisation. In this paper we demonstrate that for feed forward neural networks under the cross entropy loss, we would expect low loss solutions with large weights to have small Hessian based measures of flatness. This implies that solutions obtained using $L2$ regularisation should in principle be sharper than those without, despite generalising better. We show this to be true for logistic regression, multi-layer perceptrons, simple convolutional, pre-activated and wide residual networks on the MNIST and CIFAR-$100$ datasets. Furthermore, we show that adaptive optimisation algorithms using iterate averaging, on the VGG-$16$ network and CIFAR-$100$ dataset, achieve superior generalisation to SGD but are $30 \times$ sharper. This theoretical finding, along with experimental results, raises serious questions about the validity of Hessian based sharpness measures in the discussion of generalisation. We further show that the Hessian rank can be bounded by a constant times the number of neurons multiplied by the number of classes, which in practice is often a small fraction of the network parameters. This explains the curious observation that many Hessian eigenvalues are either zero or very near zero, which has been reported in the literature.
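The claim that low-loss, large-weight solutions look "flat" under Hessian-based measures can be checked by hand for binary logistic regression, where the Hessian of the mean cross-entropy loss is $\frac{1}{n}\sum_i p_i(1-p_i)\,x_i x_i^\top$. The sketch below is an illustration under assumed toy data, not the paper's experiment: it compares the loss and the Hessian trace at a separating weight vector and at the same vector scaled by 5.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def loss_and_hessian_trace(w, xs, ys):
    """Mean cross-entropy loss and trace of its Hessian for weights w (no intercept)."""
    n = len(xs)
    loss = 0.0
    trace = 0.0
    for x, y in zip(xs, ys):
        z = sum(wi * xi for wi, xi in zip(w, x))
        s = z if y == 1 else -z                      # signed margin of this sample
        loss += math.log1p(math.exp(-s))             # -log p(correct label)
        p = sigmoid(z)
        trace += p * (1.0 - p) * sum(xi * xi for xi in x)
    return loss / n, trace / n

# Hypothetical linearly separable toy data.
xs = [(1.0, 2.0), (2.0, 1.0), (-1.0, -2.0), (-2.0, -1.0)]
ys = [1, 1, 0, 0]

w_small = (1.0, 1.0)
w_large = (5.0, 5.0)  # same direction, larger norm

loss_small, trace_small = loss_and_hessian_trace(w_small, xs, ys)
loss_large, trace_large = loss_and_hessian_trace(w_large, xs, ys)
print(f"small weights: loss={loss_small:.2e}, Hessian trace={trace_small:.2e}")
print(f"large weights: loss={loss_large:.2e}, Hessian trace={trace_large:.2e}")
```

Scaling the weights drives every $p_i(1-p_i)$ toward zero, so both the loss and the trace shrink together, independently of how well the model would generalise.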

[4] arXiv:2006.08771 [pdf]

Multilayer network analysis of C. elegans: Looking into the locomotory circuitry

Thomas Maertens, Eckehard Schöll, Jorge Ruiz, Philipp Hövel

We investigate how locomotory behavior is generated in the brain focusing on the paradigmatic connectome of nematode Caenorhabditis elegans (C. elegans) and on neuronal activity patterns that control forward locomotion. We map the neuronal network of the worm as a multilayer network that takes into account various neurotransmitters and neuropeptides. Using logistic regression analysis, we predict the neurons of the locomotory subnetwork. Combining Hindmarsh-Rose equations for neuronal activity with a leaky integrator model for muscular activity, we study the dynamics within this subnetwork and predict the forward locomotion of the worm using a harmonic wave model. The application of time-delayed feedback control reveals synchronization effects that contribute to a coordinated locomotion of C. elegans. Analyzing the synchronicity when the activity of certain neurons is silenced informs us about their significance for a coordinated locomotory behavior. Since the information processing is the same in humans and C. elegans, the study of the locomotory circuitry provides new insights for understanding how the brain generates motion behavior.

[5] arXiv:2006.08210 [pdf]

Hyperbolic Neural Networks++

Ryohei Shimizu, Yusuke Mukuta, Tatsuya Harada

Hyperbolic spaces, which have the capacity to embed tree structures without distortion owing to their exponential volume growth, have recently been applied to machine learning to better capture the hierarchical nature of data. In this study, we reconsider a way to generalize the fundamental components of neural networks in a single hyperbolic geometry model, and propose novel methodologies to construct a multinomial logistic regression, fully-connected layers, convolutional layers, and attention mechanisms under a unified mathematical interpretation, without increasing the parameters. A series of experiments show the parameter efficiency of our methods compared to a conventional hyperbolic component, and stability and outperformance over their Euclidean counterparts.

[6] arXiv:2006.07980 [pdf]

Application of Data Science to Discover Violence-Related Issues in Iraq

Merari González, Germán H. Alférez

Data science has been satisfactorily used to discover social issues in several parts of the world. However, there is a lack of governmental open data to discover those issues in countries such as Iraq. This situation raises the following questions: how can data science principles be applied to discover social issues despite the lack of open data in Iraq? How can the available data be used to make predictions in places without data? Our contribution is the application of data science to open non-governmental big data from the Global Database of Events, Language, and Tone (GDELT) to discover particular violence-related social issues in Iraq. Specifically, we applied the K-Nearest Neighbors, Naïve Bayes, Decision Trees, and Logistic Regression classification algorithms to discover the following issues: refugees, humanitarian aid, violent protests, fights with artillery and tanks, and mass killings. The best results were obtained with the Decision Trees algorithm to discover areas with refugee crises and artillery fights. The accuracy for these two events is 0.7629. The precision to discover the locations of refugee crises is 0.76, the recall is 0.76, and the F1-score is 0.76. Also, our approach discovers the locations of artillery fights with a precision of 0.74, a recall of 0.75, and an F1-score of 0.75.

[7] arXiv:2006.07100 [pdf]

Reinforced Data Sampling for Model Diversification

Hoang D. Nguyen, Xuan-Son Vu, Quoc-Tuan Truong, Duc-Trong Le

With the rising number of machine learning competitions, the world has witnessed an exciting race for the best algorithms. However, the involved data selection process may fundamentally suffer from evidence ambiguity and concept drift issues, thereby possibly leading to deleterious effects on the performance of various models. This paper proposes a new Reinforced Data Sampling (RDS) method to learn how to sample data adequately in the search for useful models and insights. We formulate the optimisation problem of model diversification $\delta{-div}$ in data sampling to maximise learning potentials and optimum allocation by injecting model diversity. This work advocates the employment of diverse base learners as value functions such as neural networks, decision trees, or logistic regressions to reinforce the selection process of data subsets with multi-modal belief. We introduce different ensemble reward mechanisms, including soft voting and stochastic choice, to approximate the optimal sampling policy. The evaluation conducted on four datasets highlights the benefits of using the RDS method over traditional sampling approaches. Our experimental results suggest that trainable sampling for model diversification is useful for competition organisers, researchers, or even starters to pursue the full potential of various machine learning tasks such as classification and regression. The source code is available at this https URL.

[8] arXiv:2006.06581 [pdf]

Asymptotic Errors for Teacher-Student Convex Generalized Linear Models (or How to Prove Kabashima's Replica Formula)

Cedric Gerbelot, Alia Abbara, Florent Krzakala

There has been a recent surge of interest in the study of asymptotic reconstruction performance in various cases of generalized linear estimation problems in the teacher-student setting, especially for the case of i.i.d. standard normal matrices. In this work, we prove a general analytical formula for the reconstruction performance of convex generalized linear models, and go beyond such matrices by considering all rotationally-invariant data matrices with arbitrary bounded spectrum, proving a decade-old conjecture originally derived using the replica method from statistical physics. This is achieved by leveraging state-of-the-art advances in message passing algorithms and the statistical properties of their iterates. Our proof is crucially based on the construction of converging sequences of an oracle multi-layer vector approximate message passing algorithm, where the convergence analysis is done by checking the stability of an equivalent dynamical system. Beyond its generality, our result also provides further insight into overparametrized non-linear models, a fundamental building block of modern machine learning. We illustrate our claim with numerical examples on mainstream learning methods such as logistic regression and linear support vector classifiers, showing excellent agreement between moderate size simulation and the asymptotic prediction.

[9] arXiv:2006.06136 [pdf]

Weighted Lasso Estimates for Sparse Logistic Regression: Non-asymptotic Properties with Measurement Error

Huamei Huang, Yujing Gao, Huiming Zhang, Bo Li

When we are interested in a high-dimensional system and focus on classification performance, the $\ell_{1}$-penalized logistic regression is becoming important and popular. However, the Lasso estimates could be problematic when penalties of different coefficients are all the same and not related to the data. We propose two types of weighted Lasso estimates depending on covariates by the McDiarmid inequality. Given sample size $n$ and dimension of covariates $p$, the finite sample behavior of our proposed methods with a diverging number of predictors is illustrated by non-asymptotic oracle inequalities such as $\ell_{1}$-estimation error and squared prediction error of the unknown parameters. We compare the performance of our methods with former weighted estimates on simulated data, then apply these methods to real data analysis.
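To make the idea of coefficient-specific penalties concrete, here is a minimal sketch of weighted $\ell_{1}$-penalized logistic regression solved by proximal gradient descent (ISTA) with soft-thresholding. It is not the authors' estimator: the penalty weights below are arbitrary placeholders rather than the data-dependent weights the paper derives from the McDiarmid inequality, and the toy data and the name `weighted_lasso_logistic` are assumptions for illustration.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def soft_threshold(v, t):
    # Proximal operator of t * |.|: shrink v toward zero by t.
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def weighted_lasso_logistic(X, y, lam, weights, lr=0.1, steps=2000):
    """Minimise mean logistic loss + lam * sum_j weights[j] * |beta_j| via ISTA."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(steps):
        grad = [0.0] * p
        for xi, yi in zip(X, y):
            z = sum(b * x for b, x in zip(beta, xi))
            err = sigmoid(z) - yi
            for j in range(p):
                grad[j] += err * xi[j] / n
        # Gradient step on the smooth loss, then per-coordinate shrinkage.
        beta = [soft_threshold(beta[j] - lr * grad[j], lr * lam * weights[j])
                for j in range(p)]
    return beta

# Hypothetical data: feature 0 is informative, feature 1 is small-scale noise.
X = [(2.0, 0.1), (1.0, -0.2), (1.5, 0.3), (-1.0, 0.2),
     (-2.0, -0.1), (-1.5, -0.3), (0.5, 0.1), (-0.5, -0.1)]
y = [1, 1, 1, 0, 0, 0, 0, 1]

# Placeholder weights: the noise feature is penalised twice as heavily.
beta = weighted_lasso_logistic(X, y, lam=0.1, weights=[1.0, 2.0])
print(beta)
```

With these settings the heavier penalty keeps the second coefficient at exactly zero while the informative coefficient stays positive, which is the sparsity behaviour the weights are meant to control.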

[10] arXiv:2006.06090 [pdf]

Robustified Multivariate Regression and Classification Using Distributionally Robust Optimization under the Wasserstein Metric

Ruidi Chen, Ioannis Ch. Paschalidis

We develop Distributionally Robust Optimization (DRO) formulations for Multivariate Linear Regression (MLR) and Multiclass Logistic Regression (MLG) when both the covariates and responses/labels may be contaminated by outliers. The DRO framework uses a probabilistic ambiguity set defined as a ball of distributions that are close to the empirical distribution of the training set in the sense of the Wasserstein metric. We relax the DRO formulation into a regularized learning problem whose regularizer is a norm of the coefficient matrix. We establish out-of-sample performance guarantees for the solutions to our model, offering insights on the role of the regularizer in controlling the prediction error. Experimental results show that our approach improves the predictive error by 7% to 37% for MLR, and a metric of robustness by 100% for MLG.

[11] arXiv:2006.05482 [pdf]

Coresets for Near-Convex Functions

Murad Tukan, Alaa Maalouf, Dan Feldman

A coreset is usually a small weighted subset of $n$ input points in $\mathbb{R}^d$ that provably approximates their loss function for a given set of queries (models, classifiers, etc.). Coresets are becoming increasingly common in machine learning, since existing heuristics or inefficient algorithms may be improved by running them, possibly many times, on the small coreset, which can be maintained for streaming distributed data. Coresets can be obtained by sensitivity (importance) sampling, where the coreset size is proportional to the total sum of sensitivities. Unfortunately, computing the sensitivity of each point is problem dependent and may be harder than the original optimization problem at hand. We suggest a generic framework for computing sensitivities (and thus coresets) for a wide family of loss functions which we call near-convex functions. This is done by suggesting the $f$-SVD factorization, which generalizes the SVD factorization of matrices to functions. Example applications include coresets that are either new or significantly improve previous results, such as SVM, logistic regression, M-estimators, and $\ell_z$-regression. Experimental results and open source code are also provided.

[12] arXiv:2006.05095 [pdf]

Towards an Intrinsic Definition of Robustness for a Classifier

Théo Giraudon, Vincent Gripon, Matthias Löwe, Franck Vermet

The robustness of classifiers has become a question of paramount importance in the past few years. Indeed, it has been shown that state-of-the-art deep learning architectures can easily be fooled with imperceptible changes to their inputs. Therefore, finding good measures of robustness of a trained classifier is a key issue in the field. In this paper, we point out that averaging the radius of robustness of samples in a validation set is a statistically weak measure. We propose instead to weight the importance of samples depending on their difficulty. We motivate the proposed score by a theoretical case study using logistic regression, where we show that the proposed score is independent of the choice of the samples it is evaluated upon. We also empirically demonstrate the ability of the proposed score to measure robustness of classifiers with little dependence on the choice of samples in more complex settings, including deep convolutional neural networks and real datasets.

[13] arXiv:2006.04998 [pdf]

Machine Learning Automatically Detects COVID-19 using Chest CTs in a Large Multicenter Cohort

Bogdan Georgescu, Shikha Chaganti, Gorka Bastarrika Aleman, Eduardo Jose Mortani Barbosa Jr., Jordi Broncano Cabrero, Guillaume Chabin, Thomas Flohr, Philippe Grenier, Sasa Grbic, Nakul Gupta, François Mellot, Savvas Nicolaou, Thomas Re, Pina Sanelli, Alexander W. Sauter, Youngjin Yoo, Valentin Ziebandt, Dorin Comaniciu

Purpose: To investigate if AI-based classifiers can distinguish COVID-19 from other pulmonary diseases and normal groups, using chest CT images. To study the interpretability of discriminative features for COVID-19 detection. Materials and Methods: Our database consists of 2096 CT exams that include CTs from 1150 COVID-19 patients. Training was performed on 1000 COVID-19, 131 ILD, 113 other pneumonias, 559 normal CTs, and testing on 100 COVID-19, 30 ILD, 30 other pneumonias, and 34 normal CTs. A metric-based approach for classification of COVID-19 used interpretable features, relying on logistic regression and random forests. A deep learning-based classifier differentiated COVID-19 based on 3D features extracted directly from CT intensities and from the probability distribution of airspace opacities. Results: The most discriminative features of COVID-19 are percentage of airspace opacity, ground glass opacities, consolidations, and peripheral and basal opacities, which coincide with the typical characterization of COVID-19 in the literature. Unsupervised hierarchical clustering compares the distribution of these features across COVID-19 and control cohorts. The metrics-based classifier achieved AUC, sensitivity, and specificity of 0.85, 0.81, and 0.77, respectively. The DL-based classifier achieved AUC, sensitivity, and specificity of 0.90, 0.86, and 0.81, respectively. Most of the ambiguity comes from non-COVID-19 pneumonia with manifestations that overlap with COVID-19, as well as COVID-19 cases in early stages. Conclusion: A new method discriminates COVID-19 from other types of pneumonia, ILD, and normal, using quantitative patterns from chest CT. Our models balance interpretability of results and classification performance, and therefore may be useful to expedite and improve diagnosis of COVID-19.

[14] arXiv:2006.04937 [pdf]

Interpretable Signal Analysis with Knockoffs Enhances Classification of Bacterial Raman Spectra

Charmaine Chia, Matteo Sesia, Chi-Sing Ho, Stefanie S. Jeffrey, Jennifer Dionne, Emmanuel J. Candès, Roger T. Howe

Interpretability is important for many applications of machine learning to signal data, covering aspects such as how well a model fits the data, how accurately explanations are drawn from it, and how well these can be understood by people. Feature extraction and selection can improve model interpretability by identifying structures in the data that are both informative and intuitively meaningful. To this end, we propose a signal classification framework that combines feature extraction with feature selection using the knockoff filter, a method which provides guarantees on the false discovery rate (FDR) amongst selected features. We apply this to a dataset of Raman spectroscopy measurements from bacterial samples. Using a wavelet-based feature representation of the data and a logistic regression classifier, our framework achieves significantly higher predictive accuracy compared to using the original features as input. Benchmarking was also done with features obtained through principal components analysis, as well as the original features input into a neural network-based classifier. Our proposed framework achieved better predictive performance at the former task and comparable performance at the latter task, while offering the advantage of a more compact and human-interpretable set of features.

[15] arXiv:2006.04787 [pdf]

Classification Under Misspecification: Halfspaces, Generalized Linear Models, and Connections to Evolvability

Sitan Chen, Frederic Koehler, Ankur Moitra, Morris Yau

In this paper we revisit some classic problems on classification under misspecification. In particular, we study the problem of learning halfspaces under Massart noise with rate $\eta$. In a recent work, Diakonikolas, Gouleakis, and Tzamos resolved a long-standing problem by giving the first efficient algorithm for learning to accuracy $\eta + \epsilon$ for any $\epsilon > 0$. However, their algorithm outputs a complicated hypothesis, which partitions space into $\text{poly}(d,1/\epsilon)$ regions. Here we give a much simpler algorithm and in the process resolve a number of outstanding open questions: (1) We give the first proper learner for Massart halfspaces that achieves $\eta + \epsilon$. We also give improved bounds on the sample complexity achievable by polynomial time algorithms. (2) Based on (1), we develop a blackbox knowledge distillation procedure to convert an arbitrarily complex classifier to an equally good proper classifier. (3) By leveraging a simple but overlooked connection to evolvability, we show any SQ algorithm requires super-polynomially many queries to achieve $\mathsf{OPT} + \epsilon$. Moreover, we study generalized linear models where $\mathbb{E}[Y|\mathbf{X}] = \sigma(\langle \mathbf{w}^*, \mathbf{X}\rangle)$ for any odd, monotone, and Lipschitz function $\sigma$. This family includes the previously mentioned halfspace models as a special case, but is much richer and includes other fundamental models like logistic regression. We introduce a challenging new corruption model that generalizes Massart noise, and give a general algorithm for learning in this setting. Our algorithms are based on a small set of core recipes for learning to classify in the presence of misspecification. Finally, we study our algorithm for learning halfspaces under Massart noise empirically and find that it exhibits some appealing fairness properties.

[16] arXiv:2006.04532 [pdf]

Detecting Problem Statements in Peer Assessments

Yunkai Xiao, Gabriel Zingle, Qinjin Jia, Harsh R. Shah, Yi Zhang, Tianyi Li, Mohsin Karovaliya, Weixiang Zhao, Yang Song, Jie Ji, Ashwin Balasubramaniam, Harshit Patel, Priyankha Bhalasubbramanian, Vikram Patel, Edward F. Gehringer

Effective peer assessment requires students to be attentive to the deficiencies in the work they rate. Thus, their reviews should identify problems. But what ways are there to check that they do? We attempt to automate the process of deciding whether a review comment detects a problem. We use over 18,000 review comments that were labeled by the reviewees as either detecting or not detecting a problem with the work. We deploy several traditional machine-learning models, as well as neural-network models using GloVe and BERT embeddings. We find that the best performer is the Hierarchical Attention Network classifier, followed by the Bidirectional Gated Recurrent Units (GRU) Attention and Capsule model with scores of 93.1% and 90.5% respectively. The best non-neural network model was the support vector machine with a score of 89.71%. This is followed by the Stochastic Gradient Descent model and the Logistic Regression model with 89.70% and 88.98%.

[17] arXiv:2006.04248 [pdf]

Learning Convex Optimization Models

Akshay Agrawal, Shane Barratt, Stephen Boyd

A convex optimization model predicts an output from an input by solving a convex optimization problem. The class of convex optimization models is large, and includes as special cases many well-known models like linear and logistic regression. We propose a heuristic for learning the parameters in a convex optimization model given a dataset of input-output pairs, using recently developed methods for differentiating the solution of a convex optimization problem with respect to its parameters. We describe three general classes of convex optimization models, maximum a posteriori (MAP) models, utility maximization models, and agent models, and present a numerical experiment for each.

[18] arXiv:2006.03875 [pdf]

Coresets via Bilevel Optimization for Continual Learning and Streaming

Zalán Borsos, Mojmír Mutný, Andreas Krause

Coresets are small data summaries that are sufficient for model training. They can be maintained online, enabling efficient handling of large data streams under resource constraints. However, existing constructions are limited to simple models such as k-means and logistic regression. In this work, we propose a novel coreset construction via cardinality-constrained bilevel optimization. We show how our framework can efficiently generate coresets for deep neural networks, and demonstrate its empirical benefits in continual learning and in streaming settings.

[19] arXiv:2006.03146 [pdf]

COVID-19 Real-Time Tracker and Analytical Report

Jiawei Long

While the COVID-19 outbreak was reported to first originate from Wuhan, China, it was declared a Public Health Emergency of International Concern (PHEIC) on 30 January 2020 by WHO, and it had spread to over 180 countries by the time this paper was being composed. As the disease spreads around the globe, it has evolved into a world-wide pandemic, endangering the state of global public health and becoming a serious threat to the global community. To combat and prevent the spread of the disease, all individuals should be well-informed of the rapidly changing state of COVID-19. To accomplish this objective, a COVID-19 real-time analytical tracker has been built to provide the latest status of the disease and relevant analytical insights. The real-time tracker is designed to cater to a general audience without advanced statistical aptitude. It aims to communicate insights through various straightforward and concise data visualizations that are supported by sound statistical foundations and reliable data sources. This paper discusses the major methodologies used to generate the insights displayed on the real-time tracker, which include real-time data retrieval, normalization techniques, ARIMA time-series forecasting, and logistic regression models. In addition to introducing the details and motivations of these methodologies, the paper features some key discoveries about COVID-19 that have been derived using them.

[20] arXiv:2006.03051 [pdf]

NewB: 200,000+ Sentences for Political Bias Detection

Jerry Wei

We present the Newspaper Bias Dataset (NewB), a text corpus of more than 200,000 sentences from eleven news sources regarding Donald Trump. While previous datasets have labeled sentences as either liberal or conservative, NewB covers the political views of eleven popular media sources, capturing more nuanced political viewpoints than a traditional binary classification system does. We train two state-of-the-art deep learning models to predict the news source of a given sentence from eleven newspapers and find that a recurrent neural network achieved top-1, top-3, and top-5 accuracies of 33.3%, 61.4%, and 77.6%, respectively, significantly outperforming a baseline logistic regression model's accuracies of 18.3%, 42.6%, and 60.8%. Using the news source label of sentences, we analyze the top n-grams with our model to gain meaningful insight into the portrayal of Trump by media sources. We hope that the public release of our dataset will encourage further research in using natural language processing to analyze more complex political biases. Our dataset is posted at this https URL.

[21] arXiv:2006.02537 [pdf]

CAPPA: Continuous-time Accelerated Proximal Point Algorithm for Sparse Recovery

Kunal Garg, Mayank Baranwal

This paper develops a novel Continuous-time Accelerated Proximal Point Algorithm (CAPPA) for $\ell_1$-minimization problems with provable fixed-time convergence guarantees. The problem of $\ell_1$-minimization appears in several contexts, such as sparse recovery (SR) in Compressed Sensing (CS) theory, and sparse linear and logistic regressions in machine learning, to name a few. Most existing algorithms for solving $\ell_1$-minimization problems are discrete-time, inefficient and require exhaustive computer-guided iterations. CAPPA alleviates this problem on two fronts: (a) it encompasses a continuous-time algorithm that can be implemented using analog circuits; (b) it betters LCA and finite-time LCA (recently developed continuous-time dynamical systems for solving SR problems) by exhibiting provable fixed-time convergence to the optimal solution. Consequently, CAPPA is better suited for fast and efficient handling of SR problems. Simulation studies are presented that corroborate the computational advantages of CAPPA.

[22] arXiv:2006.01974 [pdf]

Countering hate on social media: Large scale classification of hate and counter speech

Joshua Garland, Keyan Ghazi-Zahedi, Jean-Gabriel Young, Laurent Hébert-Dufresne, Mirta Galesic

Hateful rhetoric is plaguing online discourse, fostering extreme societal movements and possibly giving rise to real-world violence. A potential solution to this growing global problem is citizen-generated counter speech where citizens actively engage in hate-filled conversations to attempt to restore civil non-polarized discourse. However, its actual effectiveness in curbing the spread of hatred is unknown and hard to quantify. One major obstacle to researching this question is a lack of large labeled data sets for training automated classifiers to identify counter speech. Here we made use of a unique situation in Germany where self-labeling groups engaged in organized online hate and counter speech. We used an ensemble learning algorithm which pairs a variety of paragraph embeddings with regularized logistic regression functions to classify both hate and counter speech in a corpus of millions of relevant tweets from these two groups. Our pipeline achieved macro F1 scores on out-of-sample balanced test sets ranging from 0.76 to 0.97, accuracy in line with and even exceeding the state of the art. On thousands of tweets, we used crowdsourcing to verify that the judgments made by the classifier are in close alignment with human judgment. We then used the classifier to discover hate and counter speech in more than 135,000 fully-resolved Twitter conversations occurring from 2013 to 2018 and study their frequency and interaction. Altogether, our results highlight the potential of automated methods to evaluate the impact of coordinated counter speech in stabilizing conversations on social media.

[23] arXiv:2006.00767 [pdf]

Scalable Uncertainty Quantification via Generative Bootstrap Sampler

Minsuk Shin, Lu Wang, Jun S Liu

It has been believed that the virtue of using statistical procedures lies in uncertainty quantification in statistical decisions, and the bootstrap method has been commonly used for this purpose. However, as the size of data massively increases and statistical models become more complicated, the implementation of bootstrapping turns out to be practically challenging due to its repetitive nature in computation. To overcome this issue, we propose a novel computational procedure called Generative Bootstrap Sampler (GBS), which constructs a generator function of bootstrap evaluations, and this function transforms the weights on the observed data points to the bootstrap distribution. The GBS is implemented by one single optimization, without repeatedly evaluating the optimizer of the bootstrapped loss function as in standard bootstrapping procedures. As a result, the GBS is capable of reducing the computational time of bootstrapping by a factor of hundreds when the data size is massive. We show that the bootstrapped distribution evaluated by the GBS is asymptotically equivalent to the conventional counterpart and that empirically they are indistinguishable. We apply the proposed idea to bootstrap various models such as linear regression, logistic regression, the Cox proportional hazards model, Gaussian process regression, and quantile regression. The results show that the GBS procedure not only accelerates computation but also attains a high level of accuracy with respect to the target bootstrap distribution. Additionally, we apply this idea to accelerate the computation of other repetitive procedures such as bootstrapped cross-validation, tuning parameter selection, and permutation tests.

[24] arXiv:2006.00683 [pdf]

Logistic Regression for Massive Data with Rare Events

HaiYing Wang

This paper studies binary logistic regression for rare events data, or imbalanced data, where the number of events (observations in one class, often called cases) is significantly smaller than the number of nonevents (observations in the other class, often called controls). We first derive the asymptotic distribution of the maximum likelihood estimator (MLE) of the unknown parameter, which shows that the asymptotic variance converges to zero at a rate of the inverse of the number of events instead of the inverse of the full data sample size. This indicates that the available information in rare events data is at the scale of the number of events instead of the full data sample size. Furthermore, we prove that when under-sampling a small proportion of the nonevents, the resulting under-sampled estimator may have an identical asymptotic distribution to the full data MLE. This demonstrates the advantage of under-sampling nonevents for rare events data, because this procedure may significantly reduce the computation and/or data collection costs. Another common practice in analyzing rare events data is to over-sample (replicate) the events, which has a higher computational cost. We show that this procedure may even result in efficiency loss in terms of parameter estimation.
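One practical consequence of under-sampling nonevents is that the fitted intercept is shifted. The sketch below illustrates the standard case-control argument (it is not taken from the paper, and the parameter values are made up): if every event is kept and each nonevent is kept independently with probability `rho`, the log-odds in the subsample exceed the full-data log-odds by $-\log\rho$ at every covariate value, so adding $\log\rho$ to the subsample intercept recovers the original one while the slope is unchanged.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def subsampled_event_prob(p, rho):
    """P(y = 1 | x, kept) when all events are kept and each nonevent
    is kept with probability rho (Bayes' rule on the keep indicator)."""
    return p / (p + rho * (1.0 - p))

def corrected_intercept(intercept_sub, rho):
    # Undo the -log(rho) shift that under-sampling induces in the intercept.
    return intercept_sub + math.log(rho)

# Hypothetical rare-event model: logit P(y=1|x) = alpha + beta * x.
alpha, beta, rho = -6.0, 1.0, 0.01

shifts = []
for x in (-1.0, 0.0, 2.0):
    p_full = 1.0 / (1.0 + math.exp(-(alpha + beta * x)))
    p_sub = subsampled_event_prob(p_full, rho)
    shifts.append(logit(p_sub) - logit(p_full))

print("log-odds shift at each x:", [round(s, 6) for s in shifts])
print("-log(rho):", round(-math.log(rho), 6))
print("recovered intercept:", corrected_intercept(alpha - math.log(rho), rho))
```

Because the shift is the same for every `x`, only the intercept needs adjusting after fitting on the under-sampled data.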

[25] arXiv:2006.00593 [pdf]

BPGC at SemEval-2020 Task 11: Propaganda Detection in News Articles with Multi-Granularity Knowledge Sharing and Linguistic Features based Ensemble Learning

Rajaswa Patil, Somesh Singh, Swati Agarwal

Propaganda spreads the ideology and beliefs of like-minded people, brainwashing their audiences, and sometimes leading to violence. SemEval 2020 Task-11 aims to design automated systems for news propaganda detection. Task-11 consists of two sub-tasks, namely, Span Identification - given any news article, the system tags those specific fragments which contain at least one propaganda technique; and Technique Classification - correctly classify a given propagandist statement amongst 14 propaganda techniques. For sub-task 1, we use contextual embeddings extracted from pre-trained transformer models to represent the text data at various granularities and propose a multi-granularity knowledge sharing approach. For sub-task 2, we use an ensemble of BERT and logistic regression classifiers with linguistic features. Our results reveal that linguistic features are strong indicators for covering minority classes in a highly imbalanced dataset.

[26] arXiv:2005.14236 [pdf]

Fuzziness-based Spatial-Spectral Class Discriminant Information Preserving Active Learning for Hyperspectral Image Classification

Muhammad Ahmad

Traditional Active/Self/Interactive Learning for Hyperspectral Image Classification (HSIC) increases the size of the training set without considering the class scatters and randomness among the existing and new samples. Second, very limited research has been carried out on joint spectral-spatial information, and finally, a minor but still noteworthy issue is the stopping criterion, which has not received much consideration from the community. Therefore, this work proposes a novel fuzziness-based spatial-spectral method that preserves both local and global within-class and between-class discriminant information (FLG). We first investigate spatial prior fuzziness-based misclassified sample information. We then compute the total local and global information for both the within-class and between-class cases and formulate it in a fine-grained manner. Later this information is fed to a discriminative objective function to query the heterogeneous samples, which eliminates the randomness among the training samples. Experimental results on benchmark HSI datasets demonstrate the effectiveness of the FLG method on Generative, Extreme Learning Machine and Sparse Multinomial Logistic Regression (SMLR)-LORSAL classifiers.

[27] arXiv:2005.13995 [pdf]

Using Machine Learning to Forecast Future Earnings

Xinyue Cui, Zhaoyu Xu, Yue Zhou

In this essay, we have comprehensively evaluated the feasibility and suitability of adopting the Machine Learning Models on the forecast of corporation fundamentals (i.e. the earnings), where the prediction results of our method have been thoroughly compared with both analysts' consensus estimation and traditional statistical models. As a result, our model has already been proved to be capable of serving as a favorable auxiliary tool for analysts to conduct better predictions on company fundamentals. Compared with previous traditional statistical models being widely adopted in the industry like Logistic Regression, our method has already achieved satisfactory advancement on both the prediction accuracy and speed. Meanwhile, we are also confident enough that there are still vast potentialities for this model to evolve, where we do hope that in the near future, the machine learning model could generate even better performances compared with professional analysts.

[28] arXiv:2005.13199 [pdf]

Bayesian model selection in the $\mathcal{M}$-open setting -- Approximate posterior inference and probability-proportional-to-size subsampling for efficient large-scale leave-one-out cross-validation

Riko Kelter

Comparison of competing statistical models is an essential part of psychological research. From a Bayesian perspective, various approaches to model comparison and selection have been proposed in the literature. However, the applicability of these approaches strongly depends on the assumptions about the model space $\mathcal{M}$, the so-called model view. Furthermore, traditional methods like leave-one-out cross-validation (LOO-CV) estimate the expected log predictive density (ELPD) of a model to investigate how the model generalises out-of-sample, which quickly becomes computationally inefficient when sample size becomes large. Here, we provide a tutorial on approximate Pareto-smoothed importance sampling leave-one-out cross-validation (PSIS-LOO), a computationally efficient method for Bayesian model comparison. First, we discuss several model views and the available Bayesian model comparison methods in each. We then use Bayesian logistic regression as a running example of how to apply the method in practice, and show that it outperforms other methods like LOO-CV or information criteria in terms of computational effort while providing similarly accurate ELPD estimates. In a second step, we show how even large-scale models can be compared efficiently by using posterior approximations in combination with probability-proportional-to-size subsampling. We show how to compare competing models based on the ELPD estimates provided, and how to conduct posterior predictive checks to safeguard against overconfidence in one of the models under consideration. We conclude that the method is attractive for mathematical psychologists who aim at comparing several competing statistical models, which are possibly high-dimensional and in the big-data regime.

[29] arXiv:2005.11007 [pdf]

Secure and Differentially Private Bayesian Learning on Distributed Data

Yeongjae Gil, Xiaoqian Jiang, Miran Kim, Junghye Lee

Data integration and sharing maximally enhance the potential for novel and meaningful discoveries. However, it is a non-trivial task as integrating data from multiple sources can put sensitive information of study participants at risk. To address the privacy concern, we present a distributed Bayesian learning approach via Preconditioned Stochastic Gradient Langevin Dynamics with RMSprop, which combines differential privacy and homomorphic encryption in a harmonious manner while protecting private information. We applied the proposed secure and privacy-preserving distributed Bayesian learning approach to logistic regression and survival analysis on distributed data, and demonstrated its feasibility in terms of prediction accuracy and time complexity, compared to the centralized approach.

[30] arXiv:2005.10951 [pdf]

A machine learning approach to using Quality-of-Life patient scores in guiding prostate radiation therapy dosing

Zhijian Yang, Daniel Olszewski, Chujun He, Giulia Pintea, Jun Lian, Tom Chou, Ronald Chen, Blerta Shtylla

Thanks to advancements in diagnosis and treatment, prostate cancer patients have high long-term survival rates. Currently, an important goal is to preserve quality-of-life during and after treatment. The relationship between the radiation a patient receives and the subsequent side effects he experiences is complex and difficult to model or predict. Here, we use machine learning algorithms and statistical models to explore the connection between radiation treatment and post-treatment gastro-urinary function. Since only a limited number of patient datasets are currently available, we used image flipping and curvature-based interpolation methods to generate more data in order to leverage transfer learning. Using interpolated and augmented data, we trained a convolutional autoencoder network to obtain near-optimal starting points for the weights. A convolutional neural network then analyzed the relationship between patient-reported quality-of-life and radiation. We also used analysis of variance and logistic regression to explore organ sensitivity to radiation and develop dosage thresholds for each organ region. Our findings show no connection between the bladder and quality-of-life scores. However, we found a connection between radiation applied to posterior and anterior rectal regions and changes in quality-of-life. Finally, we estimated radiation therapy dosage thresholds for each organ. Our analysis connects machine learning methods with organ sensitivity, thus providing a framework for informing cancer patient care using patient-reported quality-of-life metrics.

[31] arXiv:2005.10898 [pdf]

COVID-19 Public Sentiment Insights and Machine Learning for Tweets Classification

Jim Samuel, G. G. Md. Nawaz Ali, Md. Mokhlesur Rahman, Ek Esawi, Yana Samuel

Along with the Coronavirus pandemic, another crisis has manifested itself in the form of mass fear and panic phenomena, fueled by incomplete and often inaccurate information. There is therefore a tremendous need to address and better understand COVID-19's informational crisis and gauge public sentiment, so that appropriate messaging and policy decisions can be implemented. In this research article, we identify public sentiment associated with the pandemic using Coronavirus specific Tweets and R statistical software, along with its sentiment analysis packages. We demonstrate insights into the progress of fear-sentiment over time as COVID-19 approached peak levels in the United States, using descriptive textual analytics supported by necessary textual data visualizations. Furthermore, we provide a methodological overview of two essential machine learning (ML) classification methods, in the context of textual analytics, and compare their effectiveness in classifying Coronavirus Tweets of varying lengths. We observe a strong classification accuracy of 91% for short Tweets, with the Naive Bayes method. We also observe that the logistic regression classification method provides a reasonable accuracy of 74% with shorter Tweets, and both methods showed relatively weaker performance for longer Tweets. This research provides insights into Coronavirus fear sentiment progression, and outlines associated methods, implications, limitations and opportunities.

[32] arXiv:2005.10296 [pdf]

SWIFT: Super-fast and Robust Privacy-Preserving Machine Learning

Nishat Koti, Mahak Pancholi, Arpita Patra, Ajith Suresh

Performing ML computation on private data while maintaining data privacy, also known as Privacy-preserving Machine Learning (PPML), is an emergent field of research. Recently, PPML has seen a visible shift towards the adoption of the Secure Outsourced Computation (SOC) paradigm, due to the heavy computation that it entails. In the SOC paradigm, computation is outsourced to a set of powerful and specially equipped servers that provide service on a pay-per-use basis. In this work, we propose SWIFT, a robust PPML framework for a range of ML algorithms in the SOC setting, that guarantees output delivery to the users irrespective of any adversarial behaviour. Robustness, a highly desirable feature, evokes user participation without the fear of denial of service. At the heart of our framework lies a highly-efficient, maliciously-secure, three-party computation (3PC) over rings that provides guaranteed output delivery (GOD) in the honest-majority setting. To the best of our knowledge, SWIFT is the first robust and efficient PPML framework in the 3PC setting. SWIFT is as fast as the best-known 3PC framework BLAZE (Patra et al. NDSS'20), which only achieves fairness. Fairness ensures either all or none receive the output, whereas GOD ensures guaranteed output delivery no matter what. We extend our 3PC framework for four parties (4PC). In this regime, SWIFT is as fast as the best-known fair 4PC framework Trident (Chaudhari et al. NDSS'20) and twice as fast as the best-known robust 4PC framework FLASH (Byali et al. PETS'20). We demonstrate the practical relevance of our framework by benchmarking two important applications: (i) the ML algorithms Logistic Regression and Neural Network, and (ii) Biometric matching, both over a 64-bit ring in the WAN setting. Our readings reflect our claims as above.

[33] arXiv:2005.09532 [pdf]

Scalable Privacy-Preserving Distributed Learning

David Froelicher, Juan R. Troncoso-Pastoriza, Apostolos Pyrgelis, Sinem Sav, Joao Sa Sousa, Jean-Philippe Bossuat, Jean-Pierre Hubaux

In this paper, we address the problem of privacy-preserving distributed learning and evaluation of machine learning models by analyzing it in the widespread MapReduce abstraction that we extend with privacy constraints. Following this abstraction, we instantiate SPINDLE (Scalable Privacy-preservINg Distributed LEarning), an operational distributed system that supports the privacy-preserving training and evaluation of generalized linear models on distributed datasets. SPINDLE enables the efficient execution of distributed gradient descent while ensuring data and model confidentiality, as long as at least one of the data providers is honest-but-curious. The trained model is then used for oblivious predictions on confidential data. SPINDLE is able to efficiently perform demanding training tasks that require a high number of iterations on large input data with thousands of features, distributed among hundreds of data providers. It relies on a multiparty homomorphic encryption scheme to execute high-depth computations on encrypted data without significant overhead. It further leverages its distributed construction and the packing capabilities of the cryptographic scheme to efficiently parallelize the computations at multiple levels. In our evaluation, SPINDLE performs the training of a logistic-regression model on a dataset of one million samples with 32 features distributed among 160 data providers in less than 176 seconds, yielding similar accuracy to non-secure centralized models.

[34] arXiv:2005.09488 [pdf]

Simultaneous and Temporal Autoregressive Network Models

Daniel K. Sewell

While logistic regression models are easily accessible to researchers, when applied to network data there are unrealistic assumptions made about the dependence structure of the data. For temporal networks measured in discrete time, recent work has made good advances (Almquist and Butts, 2014), but there is still the assumption that the dyads are conditionally independent given the edge histories. This assumption can be quite strong and is sometimes difficult to justify. If time steps are rather large, one would typically expect not only the existence of temporal dependencies among the dyads across observed time points but also the existence of simultaneous dependencies affecting how the dyads of the network co-evolve. We propose a general observation driven model for dynamic networks which overcomes this problem by modeling both the mean and the covariance structures as functions of the edge histories using a flexible autoregressive approach. This approach can be shown to fit into a generalized linear mixed model framework. We propose a visualization method which provides evidence concerning the existence of simultaneous dependence. We describe a simulation study to determine the method's performance in the presence and absence of simultaneous dependence, and we analyze both a proximity network from conference attendees and a world trade network. We also use this last data set to illustrate how simultaneous dependencies become more prominent as the time intervals become coarser.

[35] arXiv:2005.09042 [pdf]

BLAZE: Blazing Fast Privacy-Preserving Machine Learning

Arpita Patra, Ajith Suresh

Machine learning tools have illustrated their potential in many significant sectors such as healthcare and finance, to aid in deriving useful inferences. The sensitive and confidential nature of the data in such sectors raises natural concerns for the privacy of data. This motivated the area of Privacy-preserving Machine Learning (PPML) where privacy of the data is guaranteed. Typically, ML techniques require large computing power, which leads clients with limited infrastructure to rely on the method of Secure Outsourced Computation (SOC). In the SOC setting, the computation is outsourced to a set of specialized and powerful cloud servers and the service is availed on a pay-per-use basis. In this work, we explore PPML techniques in the SOC setting for widely used ML algorithms: Linear Regression, Logistic Regression, and Neural Networks. We propose BLAZE, a blazing fast PPML framework in the three server setting tolerating one malicious corruption over a ring ($\mathbb{Z}_{2^{\ell}}$). BLAZE achieves the stronger security guarantee of fairness (all honest servers get the output whenever the corrupt server obtains the same). Leveraging an input-independent preprocessing phase, BLAZE has a fast input-dependent online phase relying on efficient PPML primitives such as (i) a dot product protocol for which the communication in the online phase is independent of the vector size, the first of its kind in the three server setting; (ii) a method for truncation that shuns evaluating an expensive circuit for Ripple Carry Adders (RCA) and achieves a constant round complexity. This improves over the truncation method of ABY3 (Mohassel et al., CCS 2018) that uses RCA and consumes a round complexity that is of the order of the depth of RCA. An extensive benchmarking of BLAZE for the aforementioned ML algorithms over a 64-bit ring in both WAN and LAN settings shows massive improvements over ABY3.

[36] arXiv:2005.08961 [pdf]

Patterns in demand side financial inclusion in India -- An inquiry using IHDS Panel Data

Vinay Reddy Venumuddala

In the following study, we inquire into the financial inclusion from a demand side perspective. Utilizing IHDS round-1 (2004-05) and round-2 (2011-12), starting from a broad picture of demand side access to finance at the country level, we venture into analysing the patterns at state level, and then lastly at district level. Particularly at district level, we focus on agriculture households in rural areas to identify if there is a shift in the demand side financial access towards non-agriculture households in certain parts of the country. In order to do this, we use District level 'Basic Statistical Returns of Scheduled Commercial Banks' for the years 2004 and 2011, made available by RBI, to first construct supply side financial inclusion indices, and then infer about a relative shift in access to formal finance away from agriculture households, using a logistic regression framework.

[37] arXiv:2005.06943 [pdf]

NIT-Agartala-NLP-Team at SemEval-2020 Task 8: Building Multimodal Classifiers to tackle Internet Humor

Steve Durairaj Swamy, Shubham Laddha, Basil Abdussalam, Debayan Datta, Anupam Jamatia

The paper describes the systems submitted to SemEval-2020 Task 8 Memotion by the 'NIT-Agartala-NLP-Team'. A dataset of 8879 memes was made available by the task organizers to train and test our models. Our systems include a Logistic Regression baseline, a BiLSTM + Attention-based learner and a transfer learning approach with BERT. For the three sub-tasks A, B and C, we attained ranks 24/33, 11/29 and 15/26, respectively. We highlight our difficulties in harnessing image information as well as some techniques and handcrafted features we employ to overcome these issues. We also discuss various modelling issues and theorize possible solutions and reasons as to why these problems persist.

[38] arXiv:2005.06839 [pdf]

Street Marketing: How Proximity and Context drive Coupon Redemption

Sarah Spiekermann, Matthias Rothensee, Michael Klafft

Purpose. In 2009, US coupons set a new record of 367 billion coupons distributed. Yet, while coupon distribution is on the rise, redemption rates remain below 1 percent. This paper aims to show how context variables, such as proximity, weather, part of town and financial incentives, interplay to determine a coupon campaign's success. Design/methodology/approach. The paper reports an empirical study conducted in co-operation with a restaurant chain: 9,880 Subway coupons were distributed under different experimental context conditions. Redemption behavior was analyzed with the help of logistic regressions. Findings. It was found that even though proximity drives coupon redemption, city center campaigns seem to be much more sensitive to distance than suburban areas. The further away the distribution place is from the restaurant, the less the amount of monetary incentive determines the motivation to redeem. Practical implications. When designing a coupon campaign for a company, coupon distribution should not follow a "one is good for all" strategy, even for one marketer within one product category. Instead each coupon strategy should carefully consider contextual influence. Originality/value. This paper is the first to the authors' knowledge that systematically investigates the impact of context variables on coupon redemption. It focuses on context variables that electronic marketing channels will be able to easily incorporate into personalized mobile marketing campaigns.

[39] arXiv:2005.06386 [pdf]

Which bills are lobbied? Predicting and interpreting lobbying activity in the US

Ivan Slobozhan, Peter Ormosi, Rajesh Sharma

Using lobbying data from this http URL, we offer several experiments applying machine learning techniques to predict if a piece of legislation (US bill) has been subjected to lobbying activities or not. We also investigate the influence of the intensity of the lobbying activity on how discernible a lobbied bill is from one that was not subject to lobbying. We compare the performance of a number of different models (logistic regression, random forest, CNN and LSTM) and text embedding representations (BOW, TF-IDF, GloVe, Law2Vec). We report ROC AUC scores above 85% and 78% accuracy. Model performance significantly improves (95% ROC AUC, and 88% accuracy) when bills with higher lobbying intensity are looked at. We also propose a method that could be used for unlabelled data. Through this we show that there is a considerably large number of previously unlabelled US bills where our predictions suggest that some lobbying activity took place. We believe our method could potentially contribute to the enforcement of the US Lobbying Disclosure Act (LDA) by indicating the bills that were likely to have been affected by lobbying but were not filed as such.

[40] arXiv:2005.06158 [pdf]

An Asymptotic Result of Conditional Logistic Regression Estimator

Zhulin He, Yuyuan Ouyang

In cluster-specific studies, ordinary logistic regression and conditional logistic regression for binary outcomes provide the maximum likelihood estimator (MLE) and the conditional maximum likelihood estimator (CMLE), respectively. In this paper, we show that the CMLE approaches the MLE asymptotically when each individual data point is replicated infinitely many times. Our theoretical derivation is based on the observation that a term appearing in the conditional average log-likelihood function is the coefficient of a polynomial, and hence can be transformed to a complex integral by Cauchy's differentiation formula. The asymptotic analysis of the complex integral can then be performed using the classical method of steepest descent. Our result implies that the CMLE can be biased if individual weights are multiplied by a constant, and that we should be cautious when assigning weights to cluster-specific studies.

[41] arXiv:2005.05823 [pdf]

Perturbing Inputs to Prevent Model Stealing

Justin Grana

We show how perturbing inputs to machine learning services (ML-service) deployed in the cloud can protect against model stealing attacks. In our formulation, there is an ML-service that receives inputs from users and returns the output of the model. There is an attacker that is interested in learning the parameters of the ML-service. We use the linear and logistic regression models to illustrate how strategically adding noise to the inputs fundamentally alters the attacker's estimation problem. We show that even with infinite samples, the attacker would not be able to recover the true model parameters. We focus on characterizing the trade-off between the error in the attacker's estimate of the parameters and the error in the ML-service's output.

[42] arXiv:2005.03226 [pdf]

Detecting Latent Communities in Network Formation Models

Shujie Ma, Liangjun Su, Yichong Zhang

This paper proposes a logistic undirected network formation model which allows for assortative matching on observed individual characteristics and the presence of edge-wise fixed effects. We model the coefficients of observed characteristics to have a latent community structure and the edge-wise fixed effects to be of low rank. We propose a multi-step estimation procedure involving nuclear norm regularization, sample splitting, iterative logistic regression and spectral clustering to detect the latent communities. We show that the latent communities can be exactly recovered when the expected degree of the network is of order log n or higher, where n is the number of nodes in the network. The finite sample performance of the new estimation and inference methods is illustrated through both simulated and real datasets.

[43] arXiv:2005.02905 [pdf]

Automatic Detection and Recognition of Individuals in Patterned Species

Gullal Singh Cheema, Saket Anand

Visual animal biometrics is rapidly gaining popularity as it enables a non-invasive and cost-effective approach for wildlife monitoring applications. Widespread usage of camera traps has led to large volumes of collected images, making manual processing of visual content hard to manage. In this work, we develop a framework for automatic detection and recognition of individuals in different patterned species like tigers, zebras and jaguars. Most existing systems primarily rely on manual input for localizing the animal, which does not scale well to large datasets. In order to automate the detection process while retaining robustness to blur, partial occlusion, illumination and pose variations, we use the recently proposed Faster-RCNN object detection framework to efficiently detect animals in images. We further extract features from AlexNet of the animal's flank and train a logistic regression (or Linear SVM) classifier to recognize the individuals. We primarily test and evaluate our framework on a camera trap tiger image dataset that contains images that vary in overall image quality, animal pose, scale and lighting. We also evaluate our recognition system on zebra and jaguar images to show generalization to other patterned species. Our framework gives perfect detection results in camera trapped tiger images and a similar or better individual recognition performance when compared with state-of-the-art recognition techniques.

[44] arXiv:2005.02488 [pdf]

Instance-Dependent Cost-Sensitive Learning for Detecting Transfer Fraud

Sebastiaan Höppner, Bart Baesens, Wouter Verbeke, Tim Verdonck

Card transaction fraud is a growing problem affecting card holders worldwide. Financial institutions increasingly rely upon data-driven methods for developing fraud detection systems, which are able to automatically detect and block fraudulent transactions. From a machine learning perspective, the task of detecting fraudulent transactions is a binary classification problem. Classification models are commonly trained and evaluated in terms of statistical performance measures, such as likelihood and AUC, respectively. These measures, however, do not take into account the actual business objective, which is to minimize the financial losses due to fraud. Fraud detection is to be acknowledged as an instance-dependent cost-sensitive classification problem, where the costs due to misclassification vary between instances, and which therefore requires adapted approaches for learning a classification model. In this article, an instance-dependent threshold is derived, based on the instance-dependent cost matrix for transfer fraud detection, that allows for making the optimal cost-based decision for each transaction. Two novel classifiers are presented, based on lasso-regularized logistic regression and gradient tree boosting, which directly minimize the proposed instance-dependent cost measure when learning a classification model. The proposed methods are implemented in the R packages cslogit and csboost, and compared against state-of-the-art methods on a publicly available data set from the machine learning competition website Kaggle and a proprietary card transaction data set. The results of the experiments highlight the potential of reducing fraud losses by adopting the proposed methods.
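
To make the idea of an instance-dependent threshold concrete, here is a minimal sketch of the generic minimum-expected-cost decision rule. It is not taken from the paper or from the cslogit/csboost packages: the function `cost_threshold`, the cost values, and the assumption that a missed fraud costs the transaction amount while any investigation costs a fixed fee are hypothetical choices for this example.

```python
import numpy as np

def cost_threshold(c_tp, c_fp, c_fn, c_tn):
    """Probability threshold above which flagging an instance has lower expected cost.

    Flag when p*c_tp + (1-p)*c_fp < p*c_fn + (1-p)*c_tn, which rearranges to
    p > (c_fp - c_tn) / ((c_fp - c_tn) + (c_fn - c_tp)).
    """
    num = c_fp - c_tn
    den = (c_fp - c_tn) + (c_fn - c_tp)
    return num / den

# Hypothetical cost matrix: investigating a transaction costs a fixed fee,
# a missed fraud costs the full transaction amount, a correct pass costs nothing.
amounts = np.array([50.0, 200.0, 5000.0])
admin_fee = 10.0
thresholds = cost_threshold(c_tp=admin_fee, c_fp=admin_fee, c_fn=amounts, c_tn=0.0)

# The same predicted fraud probability leads to different decisions per transaction.
scores = np.array([0.10, 0.10, 0.10])
flags = scores > thresholds

print(thresholds)  # one threshold per transaction
print(flags)       # which transactions to flag
```

Under these assumed costs the threshold reduces to `admin_fee / amount`, so larger transfers are flagged at lower predicted probabilities.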

[45] arXiv:2005.01988 [pdf]

One-step regression and classification with crosspoint resistive memory arrays

Zhong Sun, Giacomo Pedretti, Alessandro Bricalli, Daniele Ielmini

Machine learning has attracted considerable attention in recent years as a tool to process big data generated by ubiquitous sensors in our daily life. High speed, low energy computing machines are in demand to enable real-time artificial intelligence at the edge, i.e., without the support of a remote frame server in the cloud. Such requirements challenge the complementary metal-oxide-semiconductor (CMOS) technology, which is limited by Moore's law approaching its end and the communication bottleneck in conventional computing architecture. Novel computing concepts, architectures and devices are thus strongly needed to accelerate data-intensive applications. Here we show that a crosspoint resistive memory circuit with feedback configuration can execute linear regression and logistic regression in just one step by computing the pseudoinverse matrix of the data within the memory. The most elementary learning operations, that is, the regression of a sequence of data and the classification of a set of data, can thus be executed in one single computational step by the novel technology. One-step learning is further supported by simulations of the prediction of the cost of a house in Boston and the training of a 2-layer neural network for MNIST digit recognition. The results are all obtained in one computational step, thanks to the physical, parallel, and analog computing within the crosspoint array.
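
The pseudoinverse computation mentioned in the abstract above can be sketched in software. This is only a numerical analogue using NumPy, not a model of the analog circuit: the toy data are made up, and the classification part uses a plain least-squares fit to ±1 labels, which is a simplifying assumption rather than the paper's procedure.

```python
import numpy as np

# Toy design matrix with an intercept column and one feature.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])

# Regression: targets generated by y = 1 + 2*x, so the exact weights are [1, 2].
y = np.array([1.0, 3.0, 5.0, 7.0])
w_reg = np.linalg.pinv(X) @ y          # least-squares weights in a single step

# Classification (simplified): fit +/-1 labels by least squares, then take the sign.
labels = np.array([-1.0, -1.0, 1.0, 1.0])
w_clf = np.linalg.pinv(X) @ labels
predicted = np.sign(X @ w_clf)

print(w_reg)      # regression weights
print(predicted)  # predicted class labels
```

The single matrix product with the Moore-Penrose pseudoinverse replaces the iterative optimization that gradient-based training would use.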

[46] arXiv:2005.01365 [pdf]

Ensemble Forecasting for Intraday Electricity Prices: Simulating Trajectories

Michał Narajewski, Florian Ziel

Recent studies concerning the point electricity price forecasting have shown evidence that the hourly German Intraday Continuous Market is weak-form efficient. Therefore, we take a novel, advanced approach to the problem. A probabilistic forecasting of the hourly intraday electricity prices is performed by simulating trajectories in every trading window to receive a realistic ensemble to allow for more efficient intraday trading and redispatch. A generalized additive model is fitted to the price differences with the assumption that they follow a mixture of the Dirac and the Student's t-distributions. Moreover, the mixing term is estimated using a high-dimensional logistic regression with lasso penalty. We model the expected value and volatility of the series using i.a. autoregressive and no-trade effects or load, wind and solar generation forecasts and accounting for the non-linearities in e.g. time to maturity. Both the in-sample characteristics and forecasting performance are analysed using a rolling window forecasting study. Multiple versions of the model are compared to several benchmark models. The study aims to forecast the price distribution in the German Intraday Continuous Market in the last 3 hours of trading, but the approach allows for application to other continuous markets. The results prove superiority of the mixture model over the benchmarks gaining the most from the modelling of the volatility.

[47] arXiv:2005.00869 [pdf]

Generalized Knowledge Tracing: A Constrained Framework for Learner Modeling

Philip I. Pavlik Jr., Luke G. Eglington, Leigh M. Harrell-Williams

Adaptive learning technology solutions often use a learner model to trace learning and make pedagogical decisions. The present research introduces a formalized methodology for specifying learner models, Generalized Knowledge Tracing, GKT, that consolidates many extant learner modeling methods. The strength of GKT is the specification of a symbolic notation system for alternative logistic regression models that is powerful enough to specify many extant models in the literature, as well as many new models. To demonstrate the generality of GKT, it was used to fit 12 models, some variants of well-known models and some newly devised, to 6 learning technology datasets. The results indicated that no single learner model was best in all cases, further justifying a broad approach that considers multiple learner model features and the learning context. To strengthen the applicability to learning technology, the models presented here avoid student-level fixed parameters, since these are difficult to acquire in practice. We argue that to be maximally applicable a learner model needs to adapt to student differences, rather than needing to be pre-parameterized with the level of each student's ability.

[48] arXiv:2005.00180 [pdf]

Generalization Error of Generalized Linear Models in High Dimensions

Melikasadat Emami, Mojtaba Sahraee-Ardakan, Parthe Pandit, Sundeep Rangan, Alyson K. Fletcher

At the heart of machine learning lies the question of generalizability of learned rules over previously unseen data. While over-parameterized models based on neural networks are now ubiquitous in machine learning applications, our understanding of their generalization capabilities is incomplete. This task is made harder by the non-convexity of the underlying learning problems. We provide a general framework to characterize the asymptotic generalization error for single-layer neural networks (i.e., generalized linear models) with arbitrary non-linearities, making it applicable to regression as well as classification problems. This framework enables analyzing the effect of (i) over-parameterization and non-linearity during modeling; and (ii) choices of loss function, initialization, and regularizer during learning. Our model also captures mismatch between training and test distributions. As examples, we analyze a few special cases, namely linear regression and logistic regression. We are also able to rigorously and analytically explain the double descent phenomenon in generalized linear models.

[49] arXiv:2004.14952 [pdf]

Pain and Physical Activity Association in Critically Ill Patients

Anis Davoudi, Tezcan Ozrazgat-Baslanti, Patrick J. Tighe, Azra Bihorac, Parisa Rashidi

Critical care patients experience varying levels of pain during their stay in the intensive care unit, often requiring administration of analgesics and sedation. Such medications generally exacerbate the already sedentary physical activity profiles of critical care patients, contributing to delayed recovery. Thus, it is important not only to minimize pain levels, but also to optimize analgesic strategies in order to maximize mobility and activity of ICU patients. Currently, we lack an understanding of the relation between pain and physical activity on a granular level. In this study, we examined the relationship between nurse assessed pain scores and physical activity as measured using a wearable accelerometer device. We found that average, standard deviation, and maximum physical activity counts are significantly higher before high pain reports compared to before low pain reports during both daytime and nighttime, while percentage of time spent immobile was not significantly different between the two pain report groups. Clusters detected among patients using extracted physical activity features were significant in adjusted logistic regression analysis for prediction of pain report group.

[50] arXiv:2004.14165 [pdf]

Classification of Cuisines from Sequentially Structured Recipes

Tript Sharma, Utkarsh Upadhyay, Ganesh Bagler

Cultures across the world are distinguished by the idiosyncratic patterns in their cuisines. These cuisines are characterized in terms of their substructures such as ingredients, cooking processes and utensils. A complex fusion of these substructures intrinsic to a region defines the identity of a cuisine. Accurate classification of cuisines based on their culinary features is an outstanding problem and has hitherto been attempted to solve by accounting for ingredients of a recipe as features. Previous studies have attempted cuisine classification by using unstructured recipes without accounting for details of cooking techniques. In reality, the cooking processes/techniques and their order are highly significant for the recipe's structure and hence for its classification. In this article, we have implemented a range of classification techniques by accounting for this information on the RecipeDB dataset containing sequential data on recipes. The state-of-the-art RoBERTa model presented the highest accuracy of 73.30% among a range of classification models from Logistic Regression and Naive Bayes to LSTMs and Transformers.

[51] arXiv:2004.13912 [pdf]

Neural Additive Models: Interpretable Machine Learning with Neural Nets

Rishabh Agarwal, Nicholas Frosst, Xuezhou Zhang, Rich Caruana, Geoffrey E. Hinton

Deep neural networks (DNNs) are powerful black-box predictors that have achieved impressive performance on a wide variety of tasks. However, their accuracy comes at the cost of intelligibility: it is usually unclear how they make their decisions. This hinders their applicability to high stakes decision-making domains such as healthcare. We propose Neural Additive Models (NAMs) which combine some of the expressivity of DNNs with the inherent intelligibility of generalized additive models. NAMs learn a linear combination of neural networks that each attend to a single input feature. These networks are trained jointly and can learn arbitrarily complex relationships between their input feature and the output. Our experiments on regression and classification datasets show that NAMs are more accurate than widely used intelligible models such as logistic regression and shallow decision trees. They perform similarly to existing state-of-the-art generalized additive models in accuracy, but can be more easily applied to real-world problems.

[52] arXiv:2004.13909 [pdf]

Improving Vertical Positioning Accuracy with the Weighted Multinomial Logistic Regression Classifier

Yiyan Yao, Xin-long Luo

In this paper, a method of improving vertical positioning accuracy with the Global Positioning System (GPS) information and barometric pressure values is proposed. Firstly, we clear null values for the raw data collected in various environments, and use the 3$\sigma$-rule to identify outliers. Secondly, the Weighted Multinomial Logistic Regression (WMLR) classifier is trained to obtain the predicted altitude of outliers. Finally, in order to verify its effect, we compare the MLR method, the WMLR method, and the Support Vector Machine (SVM) method for the cleaned dataset which is regarded as the test baseline. The numerical results show that the vertical positioning accuracy is improved from 5.9 meters (the MLR method) and 5.4 meters (the SVM method) to 5 meters (the WMLR method) for 67% of test points.
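
The 3$\sigma$-rule mentioned in the abstract above can be illustrated with a minimal sketch. It is not taken from the paper: the function name `three_sigma_outliers` and the pressure readings are hypothetical, and the sketch assumes the rule is applied to a single series using its own mean and population standard deviation.

```python
import statistics

def three_sigma_outliers(values):
    """Return the indices of values farther than 3 standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation of the series
    return [i for i, v in enumerate(values) if abs(v - mean) > 3 * std]

# Hypothetical barometric readings in hPa: 19 plausible values and one faulty one.
readings = [1013.0 + 0.1 * (i % 5) for i in range(19)] + [990.0]
outlier_indices = three_sigma_outliers(readings)
print(outlier_indices)
```

Note that the outlier itself inflates the standard deviation, so in very short series a single bad value may never cross the threshold; the rule is more reliable on longer records.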

[53] arXiv:2004.13851 [pdf]

Sentiment Analysis of Yelp Reviews: A Comparison of Techniques and Models

Siqi Liu

We use over 350,000 Yelp reviews on 5,000 restaurants to perform an ablation study on text preprocessing techniques. We also compare the effectiveness of several machine learning and deep learning models on predicting user sentiment (negative, neutral, or positive). For machine learning models, we find that using binary bag-of-word representation, adding bi-grams, imposing minimum frequency constraints and normalizing texts have positive effects on model performance. For deep learning models, we find that using pre-trained word embeddings and capping maximum length often boost model performance. Finally, using macro F1 score as our comparison metric, we find simpler models such as Logistic Regression and Support Vector Machine to be more effective at predicting sentiments than more complex models such as Gradient Boosting, LSTM and BERT.

[54] arXiv:2004.12551 [pdf]

Interpretable Multi-Task Deep Neural Networks for Dynamic Predictions of Postoperative Complications

Benjamin Shickel, Tyler J. Loftus, Shounak Datta, Tezcan Ozrazgat-Baslanti, Azra Bihorac, Parisa Rashidi

Accurate prediction of postoperative complications can inform shared decisions between patients and surgeons regarding the appropriateness of surgery, preoperative risk-reduction strategies, and postoperative resource use. Traditional predictive analytic tools are hindered by suboptimal performance and usability. We hypothesized that novel deep learning techniques would outperform logistic regression models in predicting postoperative complications. In a single-center longitudinal cohort of 43,943 adult patients undergoing 52,529 major inpatient surgeries, deep learning yielded greater discrimination than logistic regression for all nine complications. Predictive performance was strongest when leveraging the full spectrum of preoperative and intraoperative physiologic time-series electronic health record data. A single multi-task deep learning model yielded greater performance than separate models trained on individual complications. Integrated gradients interpretability mechanisms demonstrated the substantial importance of missing data. Interpretable, multi-task deep neural networks made accurate, patient-level predictions that harbor the potential to augment surgical decision-making.

[55] arXiv:2004.12504 [pdf]

How Much Should I Pay? An Empirical Analysis on Monetary Prize in TopCoder

Mostaan Lotfalian Saremi, Razieh Saremi, Denisse Martinez-Mejorado

It is reported that task monetary prize is one of the most important motivating factors to attract crowd workers. While using expert-based methods to price Crowdsourcing tasks is a common practice, the challenge of validating the associated prices across different tasks is a constant issue. To address this issue, three different methods, multiple linear regression, logistic regression, and K-nearest neighbor, were compared to find the most accurate predicted price, using a dataset from the TopCoder website. The result of comparing the chosen algorithms showed that the logistic regression model provided the highest accuracy of 90% in predicting the price associated with tasks, and KNN ranked second with an accuracy of 64% for K = 7. Also, applying PCA did not lead to any better prediction accuracy as data components are not correlated.

[56] arXiv:2004.11122 [pdf]

Optimizing the reliability of a bank with Logistic Regression and Particle Swarm Optimization

Vadlamani Ravi, Vadlamani Madhav

It is well-known that disciplines such as mechanical engineering, electrical engineering, civil engineering, aerospace engineering, chemical engineering and software engineering witnessed successful applications of reliability engineering concepts. However, the concept of reliability in its strict sense is missing in financial services. Therefore, in order to fill this gap, in a first-of-its-kind study, we define the reliability of a bank/firm in terms of the financial ratios connoting the financial health of the bank to withstand the likelihood of insolvency or bankruptcy. For the purpose of estimating the reliability of a bank, we invoke a statistical and machine learning algorithm, namely logistic regression (LR). Once the parameters are estimated in the 1st stage, we fix them and treat the financial ratios as decision variables. Thus, in the 1st stage, we accomplish the hitherto unknown way of estimating the reliability of a bank. Subsequently, in the 2nd stage, in order to maximize the reliability of the bank, we formulate an unconstrained optimization problem in a single-objective environment and solve it using the well-known particle swarm optimization (PSO) algorithm. Thus, in essence, these two stages correspond to predictive and prescriptive analytics respectively. The proposed 2-stage strategy of using them in tandem is beneficial to the decision-makers within a bank who can try to achieve the optimal or near-optimal values of the financial ratios in order to maximize the reliability, which is tantamount to safeguarding their bank against insolvency or bankruptcy.

[57] arXiv:2004.10075 [pdf]

Propensity Score Weighting for Covariate Adjustment in Randomized Clinical Trials

Shuxi Zeng, Fan Li, Rui Wang, Fan Li

Imbalance in baseline characteristics due to chance is common in randomized clinical trials. Regression adjustment such as the analysis of covariance (ANCOVA) is often used to account for imbalance and increase precision of the treatment effect estimate. An objective alternative is through inverse probability weighting (IPW) of the propensity scores. Although IPW and ANCOVA are asymptotically equivalent, the former may demonstrate inferior performance in finite samples. In this article, we point out that IPW is a special case of the general class of balancing weights, and propose the overlap weighting (OW) method for covariate adjustment. The OW approach has a unique advantage of completely removing chance imbalance when the propensity score is estimated by logistic regression. We show that the OW estimator attains the same semiparametric variance lower bound as the most efficient ANCOVA estimator and the IPW estimator with a continuous outcome, and derive closed-form variance estimators for OW when estimating additive and ratio estimands. Through extensive simulations, we demonstrate OW consistently outperforms IPW in finite samples and improves the efficiency over ANCOVA and augmented IPW when the degree of treatment effect heterogeneity is moderate or when the outcome model is incorrectly specified. We apply the proposed OW estimator to the Best Apnea Interventions for Research (BestAIR) randomized trial to evaluate the effect of continuous positive airway pressure on patient health outcomes.
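
The exact-balance property of overlap weighting stated in the abstract above can be checked numerically. The following is a minimal sketch, not the authors' code: the data are simulated, all function and variable names are hypothetical, and it assumes one covariate plus an intercept, with the propensity score fitted by unpenalized maximum likelihood (Newton's method). Treated units get weight 1 - e(x) and controls get weight e(x); the score equations of logistic regression then force the weighted covariate means to coincide.

```python
import math
import random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def fit_logistic(xs, zs, iters=25):
    """Maximum-likelihood logistic regression of z on (1, x) via Newton's method."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, z in zip(xs, zs):
            e = sigmoid(b0 + b1 * x)
            w = e * (1.0 - e)
            g0 += z - e            # score for the intercept
            g1 += (z - e) * x      # score for the slope
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det   # Newton step: H^{-1} g for a 2x2 H
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

def weighted_mean_difference(xs, zs, ws):
    """Weighted mean of x among the treated minus weighted mean among controls."""
    t_num = sum(w * x for x, z, w in zip(xs, zs, ws) if z == 1)
    t_den = sum(w for z, w in zip(zs, ws) if z == 1)
    c_num = sum(w * x for x, z, w in zip(xs, zs, ws) if z == 0)
    c_den = sum(w for z, w in zip(zs, ws) if z == 0)
    return t_num / t_den - c_num / c_den

# Simulated randomized trial: assignment is a fair coin flip, independent of x.
rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(200)]
zs = [1 if rng.random() < 0.5 else 0 for _ in xs]

b0, b1 = fit_logistic(xs, zs)
es = [sigmoid(b0 + b1 * x) for x in xs]

ow = [1.0 - e if z == 1 else e for e, z in zip(es, zs)]                   # overlap weights
ipw = [1.0 / e if z == 1 else 1.0 / (1.0 - e) for e, z in zip(es, zs)]    # inverse probability weights

raw_diff = weighted_mean_difference(xs, zs, [1.0] * len(xs))
ipw_diff = weighted_mean_difference(xs, zs, ipw)
ow_diff = weighted_mean_difference(xs, zs, ow)
print(raw_diff, ipw_diff, ow_diff)
```

In this sketch `ow_diff` is zero up to floating-point error, whereas `raw_diff` reflects the chance imbalance in the simulated sample. The property relies on the intercept being included and the fit being unpenalized.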

[58] arXiv:2004.09575 [pdf]

Predicting nucleation near the spinodal in the Ising model using machine learning

Shan Huang, William Klein, Harvey Gould

We predict the occurrence of nucleation in the two-dimensional Ising model using the Convolutional Neural Network (CNN) and two logistic regression models. CNN outperforms the latter in systems with different interaction ranges and sizes, especially when the size of the system becomes large. We find that the CNN decreases its prediction power as the system gets closer to the spinodal. We give an explanation using the ramified droplet structure predicted by spinodal nucleation theory.

[59] arXiv:2004.09466 [pdf]

Causality-aware counterfactual confounding adjustment for feature representations learned by deep models with an application to image classification tasks

Elias Chaibub Neto

Causal modeling has been recognized as a potential solution to many challenging problems in machine learning (ML). Here, we propose a counterfactual approach to remove/reduce the influence of confounders from the predictions generated by a deep neural network (DNN). Rather than attempting to prevent DNNs from directly learning the confounding signal, we propose a counterfactual approach to remove confounding from the feature representations learned by DNNs in anticausal prediction tasks. By training an accurate DNN using softmax activation at the classification layer, and then adopting the representation learned by the last layer prior to the output layer as our features, we have that, by construction, the learned features will fit well a logistic regression model, and will be linearly associated with the labels. Then, in order to generate classifiers that are free from the influence of the observed confounders we (i) use linear models to regress each learned feature on the labels and on the confounders and estimate the respective regression coefficients and model residuals; (ii) generate new counterfactual features by adding back the estimated residuals to a linear predictor which no longer includes the confounder variables; and (iii) train and evaluate a logistic classifier using the counterfactual features as inputs. We validate the proposed methodology using colored versions of the MNIST and fashion-MNIST datasets, and show how the approach can effectively combat confounding and improve generalization in the context of dataset shift. Comparison against a variation of the SMOTE (Chawla et al., 2002) approach showed that the causality-aware approach compared favorably against SMOTE balancing in our experiments. Finally, we also describe how to use conditional independence tests to evaluate if the counterfactual approach has effectively removed the confounder signals from the predictions.

[60] arXiv:2004.09071 [pdf]

Stochastic primal dual fixed point method for composite optimization

YaNan Zhu, Xiaoqun Zhang

In this paper we propose a stochastic primal dual fixed point method (SPDFP) for solving the sum of two proper lower semi-continuous convex functions, one of which is composite. The method is based on the primal dual fixed point method (PDFP) proposed in [7] that does not require subproblem solving. Under some mild conditions, the convergence is established based on two sets of assumptions, bounded and unbounded gradients, and the convergence rate of the expected error of the iterate is of the order O(k^{\alpha}), where k is the iteration number and \alpha \in (0, 1]. Finally, numerical examples on graphic Lasso and logistic regressions are given to demonstrate the effectiveness of the proposed algorithm.

[61] arXiv:2004.07427 [pdf]

Asymmetrical Vertical Federated Learning

Yang Liu, Xiong Zhang, Libin Wang

Federated learning is a distributed machine learning method that aims to preserve the privacy of sample features and labels. In a federated learning system, ID-based sample alignment approaches are usually applied with few efforts made on the protection of ID privacy. In real-life applications, however, the confidentiality of sample IDs, which are the strongest row identifiers, is also drawing much attention from many participants. To relax their privacy concerns about ID privacy, this paper formally proposes the notion of asymmetrical vertical federated learning and illustrates the way to protect sample IDs. The standard private set intersection protocol is adapted to achieve the asymmetrical ID alignment phase in an asymmetrical vertical federated learning system. Correspondingly, a Pohlig-Hellman realization of the adapted protocol is provided. This paper also presents a genuine with dummy approach to achieving asymmetrical federated model training. To illustrate its application, a federated logistic regression algorithm is provided as an example. Experiments are also made for validating the feasibility of this approach.

[62] arXiv:2004.06672 [pdf]

Fidelity of Statistical Reporting in 10 Years of Cyber Security User Studies

Thomas Groß

Studies in socio-technical aspects of security often rely on user studies and statistical inferences on investigated relations to make their case. They, thereby, enable practitioners and scientists alike to judge on the validity and reliability of the research undertaken. To ascertain this capacity, we investigated the reporting fidelity of security user studies. Based on a systematic literature review of 114 user studies in cyber security from selected venues in the 10 years 2006-2016, we evaluated fidelity of the reporting of 1775 statistical inferences using the R package statcheck. We conducted a systematic classification of incomplete reporting, reporting inconsistencies and decision errors, leading to multinomial logistic regression (MLR) on the impact of publication venue/year as well as a comparison to a compatible field of psychology. We found that half the cyber security user studies considered reported incomplete results, in stark difference to comparable results in a field of psychology. Our MLR on analysis outcomes yielded a slight increase of likelihood of incomplete tests over time, while SOUPS yielded a few percent greater likelihood to report statistics correctly than other venues. In this study, we offer the first fully quantitative analysis of the state-of-play of socio-technical studies in security. While we highlight the impact and prevalence of incomplete reporting, we also offer fine-grained diagnostics and recommendations on how to respond to the situation.

[63] arXiv:2004.06465 [pdf]

Deep Learning Models for Multilingual Hate Speech Detection

Sai Saketh Aluru, Binny Mathew, Punyajoy Saha, Animesh Mukherjee

Hate speech detection is a challenging problem with most of the datasets available in only one language, English. In this paper, we conduct a large scale analysis of multilingual hate speech in 9 languages from 16 different sources. We observe that in the low resource setting, simple models such as LASER embedding with logistic regression perform the best, while in the high resource setting BERT based models perform better. In the case of zero-shot classification, languages such as Italian and Portuguese achieve good results. Our proposed framework could be used as an efficient solution for low-resource languages. These models could also act as good baselines for future multilingual hate speech detection tasks. We have made our code and experimental settings public for other researchers.

[64] arXiv:2004.04898 [pdf]

Secret Sharing based Secure Regressions with Applications

Chaochao Chen, Liang Li, Wenjing Fang, Jun Zhou, Li Wang, Lei Wang, Shuang Yang, Alex Liu, Hao Wang

Nowadays, the utilization of the ever expanding amount of data has made a huge impact on web technologies while also causing various types of security concerns. On one hand, potential gains are highly anticipated if different organizations could somehow collaboratively share their data for technological improvements. On the other hand, data security concerns may arise for both data holders and data providers due to commercial or sociological concerns. To strike a balance between technical improvements and security limitations, we implement secure and scalable protocols for multiple data holders to train linear regression and logistic regression models. We build our protocols based on the secret sharing scheme, which is scalable and efficient in applications. Moreover, our proposed paradigm can be generalized to any secure multiparty training scenarios where only matrix summation and matrix multiplications are used. We demonstrate our approach by experiments which show the scalability and efficiency of our proposed protocols, and finally present its real-world applications.
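
The summation primitive that the abstract above builds on can be illustrated with additive secret sharing. This is a minimal sketch under stated assumptions, not the paper's protocol: the modulus `PRIME`, the function names, and the two example vectors are hypothetical, values are non-negative integers, and secure multiplication (which needs extra machinery) is not shown.

```python
import secrets

PRIME = 2**61 - 1  # illustrative modulus; all share arithmetic is done modulo this value

def share(vector, n_parties):
    """Split each entry into n_parties random additive shares that sum to it modulo PRIME."""
    shares = [[secrets.randbelow(PRIME) for _ in vector] for _ in range(n_parties - 1)]
    last = [(v - sum(s[i] for s in shares)) % PRIME for i, v in enumerate(vector)]
    return shares + [last]

def add_shares(a, b):
    """Entry-wise addition of two share vectors held by the same party."""
    return [(x + y) % PRIME for x, y in zip(a, b)]

def reconstruct(shares):
    """Recombine one share vector per party into the underlying values."""
    return [sum(column) % PRIME for column in zip(*shares)]

# Two data holders each split a private vector between two computing parties.
x_a = [3, 1, 4]
x_b = [10, 20, 30]
shares_a = share(x_a, 2)
shares_b = share(x_b, 2)

# Each party adds the two shares it holds without seeing either input in the clear.
local_sums = [add_shares(sa, sb) for sa, sb in zip(shares_a, shares_b)]
total = reconstruct(local_sums)
print(total)  # [13, 21, 34]
```

Each individual share is uniformly random, so a single party learns nothing about `x_a` or `x_b`; only the combination of all parties' results reveals the sum.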

[65] arXiv:2004.03953 [pdf]

File Classification Based on Spiking Neural Networks

Ana Stanojevic, Giovanni Cherubini, Timoleon Moraitis, Abu Sebastian

In this paper, we propose a system for file classification in large data sets based on spiking neural networks (SNNs). File information contained in key-value metadata pairs is mapped by a novel correlative temporal encoding scheme to spike patterns that are input to an SNN. The correlation between input spike patterns is determined by a file similarity measure. Unsupervised training of such networks using spike-timing-dependent plasticity (STDP) is addressed first. Then, supervised SNN training is considered by backpropagation of an error signal that is obtained by comparing the spike pattern at the output neurons with a target pattern representing the desired class. The classification accuracy is measured for various publicly available data sets with tens of thousands of elements, and compared with other learning algorithms, including logistic regression and support vector machines. Simulation results indicate that the proposed SNN-based system using memristive synapses may represent a valid alternative to classical machine learning algorithms for inference tasks, especially in environments with asynchronous ingest of input data and limited resources.

[66] arXiv:2004.03336 [pdf]

Predict the model of a camera

Ciro Javier Diaz Penedo

In this work we address the problem of predicting the model of a camera based on the content of its photographs. We use two sets of features: one set consists of properties extracted from a Discrete Wavelet Domain (DWD) obtained by applying a 4 level Fast Wavelet Decomposition of the images, and a second set consists of Local Binary Patterns (LBP) features from the after filter noise of images. The algorithms used for classification were Logistic regression, K-NN and Artificial Neural Networks.

[67] arXiv:2004.03188 [pdf]

Increasing the Inference and Learning Speed of Tsetlin Machines with Clause Indexing

Saeed Rahimi Gorji, Ole-Christoffer Granmo, Sondre Glimsdal, Jonathan Edwards, Morten Goodwin

The Tsetlin Machine (TM) is a machine learning algorithm founded on the classical Tsetlin Automaton (TA) and game theory. It further leverages frequent pattern mining and resource allocation principles to extract common patterns in the data, rather than relying on minimizing output error, which is prone to overfitting. Unlike the intertwined nature of pattern representation in neural networks, a TM decomposes problems into self-contained patterns, represented as conjunctive clauses. The clause outputs, in turn, are combined into a classification decision through summation and thresholding, akin to a logistic regression function, however, with binary weights and a unit step output function. In this paper, we exploit this hierarchical structure by introducing a novel algorithm that avoids evaluating the clauses exhaustively. Instead we use a simple look-up table that indexes the clauses on the features that falsify them. In this manner, we can quickly evaluate a large number of clauses through falsification, simply by iterating through the features and using the look-up table to eliminate those clauses that are falsified. The look-up table is further structured so that it facilitates constant time updating, thus supporting use also during learning. We report up to 15 times faster classification and three times faster learning on MNIST and Fashion-MNIST image classification, and IMDb sentiment analysis.
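
The falsification look-up idea described in the abstract above can be sketched in simplified form. This is an illustration only, not the authors' data structure: it omits the constant-time update scheme they report, and the clause representation (a dict mapping a feature index to the bit value it requires) and all names are assumptions made for this example.

```python
from collections import defaultdict

def build_index(clauses):
    """Map (feature, observed_value) to the set of clause ids that this observation falsifies."""
    index = defaultdict(set)
    for clause_id, clause in enumerate(clauses):
        for feature, required in clause.items():
            # A clause requiring `required` is falsified when the opposite bit is observed.
            index[(feature, 1 - required)].add(clause_id)
    return index

def evaluate_indexed(clauses, index, x):
    """Start with every clause true, then strike out the ones falsified by each feature of x."""
    alive = set(range(len(clauses)))
    for feature, value in enumerate(x):
        alive -= index.get((feature, value), set())
    return alive

def evaluate_exhaustive(clauses, x):
    """Reference implementation: check every literal of every clause."""
    return {cid for cid, clause in enumerate(clauses)
            if all(x[f] == required for f, required in clause.items())}

# Hypothetical conjunctive clauses over three binary features.
clauses = [
    {0: 1, 2: 0},   # x0 AND NOT x2
    {1: 1},         # x1
    {0: 0, 1: 0},   # NOT x0 AND NOT x1
    {},             # empty clause, always true
]
index = build_index(clauses)
true_clauses = evaluate_indexed(clauses, index, [1, 0, 0])
print(sorted(true_clauses))  # [0, 3]
```

The indexed evaluation touches only the clauses that a given input falsifies, rather than every literal of every clause, which is the source of the speed-up the abstract describes.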

[68] arXiv:2004.02406 [pdf]

Using generalized logistics regression to forecast population infected by Covid-19

Mario Villalobos-Arias

In this work, a proposal to forecast the populations using generalized logistics regression curve fitting is presented. This type of curve is used to study population growth, in this case population of people infected with the Covid-19 virus; and it can also be used to approximate the survival curve used in actuarial and similar studies.

[69] arXiv:2004.02264 [pdf]

PrivFL: Practical Privacy-preserving Federated Regressions on High-dimensional Data over Mobile Networks

Kalikinkar Mandal, Guang Gong

Federated Learning (FL) enables a large number of users to jointly learn a shared machine learning (ML) model, coordinated by a centralized server, where the data is distributed across multiple devices. This approach enables the server or users to train and learn an ML model using gradient descent, while keeping all the training data on users' devices. We consider training an ML model over a mobile network where user dropout is a common phenomenon. Although federated learning was aimed at reducing data privacy risks, the ML model privacy has not received much attention. In this work, we present PrivFL, a privacy-preserving system for training (predictive) linear and logistic regression models and oblivious predictions in the federated setting, while guaranteeing data and model privacy as well as ensuring robustness to users dropping out in the network. We design two privacy-preserving protocols for training linear and logistic regression models based on an additive homomorphic encryption (HE) scheme and an aggregation protocol. Exploiting the training algorithm of federated learning, at the core of our training protocols is a secure multiparty global gradient computation on alive users' data. We analyze the security of our training protocols against semi-honest adversaries. As long as the aggregation protocol is secure under the aggregation privacy game and the additive HE scheme is semantically secure, PrivFL guarantees the users' data privacy against the server, and the server's regression model privacy against the users. We demonstrate the performance of PrivFL on real-world datasets and show its applicability in the federated learning system.

[70] arXiv:2004.00281 [pdf]

A generalised OMP algorithm for feature selection with application to gene expression data

Michail Tsagris, Zacharias Papadovasilakis, Kleanthi Lakiotaki, Ioannis Tsamardinos

Feature selection for predictive analytics is the problem of identifying a minimal-size subset of features that is maximally predictive of an outcome of interest. To apply to molecular data, feature selection algorithms need to be scalable to tens of thousands of available features. In this paper, we propose gOMP, a highly-scalable generalisation of the Orthogonal Matching Pursuit feature selection algorithm in several directions: (a) different types of outcomes, such as continuous, binary, nominal, and time-to-event, (b) different types of predictive models (e.g., linear least squares, logistic regression), (c) different types of predictive features (continuous, categorical), and (d) different, statistical-based stopping criteria. We compare the proposed algorithm against LASSO, a prototypical, widely used algorithm for high-dimensional data. On dozens of simulated datasets, as well as real gene expression datasets, gOMP is on par with or outperforms LASSO for case-control binary classification, quantified outcomes (regression), and (censored) survival times (time-to-event) analysis. gOMP also has several theoretical advantages that are discussed. While gOMP is based on quite simple and basic statistical ideas, easy to implement and to generalize, we also show in an extensive evaluation that it is quite effective in bioinformatics analysis settings.

[71] arXiv:2004.00026 [pdf]

Small quantum computers and large classical data sets

Aram W. Harrow

We introduce hybrid classical-quantum algorithms for problems involving a large classical data set X and a space of models Y such that a quantum computer has superposition access to Y but not X. These algorithms use data reduction techniques to construct a weighted subset of X called a coreset that yields approximately the same loss for each model. The coreset can be constructed by the classical computer alone, or via an interactive protocol in which the outputs of the quantum computer are used to help decide which elements of X to use. By using the quantum computer to perform Grover search or rejection sampling, this yields quantum speedups for maximum likelihood estimation, Bayesian inference and saddle-point optimization. Concrete applications include k-means clustering, logistic regression, zero-sum games and boosting.

[72] arXiv:2003.14257 [pdf]

Detection of FLOSS version release events from Stack Overflow message data

A. Sokolovsky, T. Gross, J. Bacardit

Topic Detection and Tracking (TDT) is a very active research question within the area of text mining, generally applied to news feeds and Twitter datasets, where topics and events are detected. The notion of "event" is broad, but typically it applies to occurrences that can be detected from a single post or a message. Little attention has been drawn to what we call "micro-events", which, due to their nature, cannot be detected from a single piece of textual information. The study investigates micro-event detection on textual data using a sample of messages from the Stack Overflow Q&A platform in order to detect Free/Libre Open Source Software (FLOSS) version releases. Micro-events are detected using logistic regression models with step-wise forward regression feature selection from a set of LDA topics and sentiment analysis features. We perform a detailed statistical analysis of the models, including influential cases, variance inflation factors, validation of the linearity assumption, pseudo R sq. measures and no-information rate. Finally, in order to understand the detection limits and improve the performance of the estimators, we suggest a method for generating micro-event synthetic datasets and use them to identify the micro-event detectability thresholds.

[73] arXiv:2003.12012 [pdf]

TRACER: A Framework for Facilitating Accurate and Interpretable Analytics for High Stakes Applications

Kaiping Zheng, Shaofeng Cai, Horng Ruey Chua, Wei Wang, Kee Yuan Ngiam, Beng Chin Ooi

In high stakes applications such as healthcare and finance analytics, the interpretability of predictive models is required and necessary for domain practitioners to trust the predictions. Traditional machine learning models, e.g., logistic regression (LR), are easy to interpret in nature. However, many of these models aggregate time-series data without considering the temporal correlations and variations. Therefore, their performance cannot match up to recurrent neural network (RNN) based models, which are nonetheless difficult to interpret. In this paper, we propose a general framework TRACER to facilitate accurate and interpretable predictions, with a novel model TITV devised for healthcare analytics and other high stakes applications such as financial investment and risk management. Different from LR and other existing RNN-based models, TITV is designed to capture both the time-invariant and the time-variant feature importance using a feature-wise transformation subnetwork and a self-attention subnetwork, for the feature influence shared over the entire time series and the time-related importance respectively. Healthcare analytics is adopted as a driving use case, and we note that the proposed TRACER is also applicable to other domains, e.g., fintech. We evaluate the accuracy of TRACER extensively in two real-world hospital datasets, and our doctors/clinicians further validate the interpretability of TRACER in both the patient level and the feature level. Besides, TRACER is also validated in a high stakes financial application and a critical temperature forecasting application. The experimental results confirm that TRACER facilitates both accurate and interpretable analytics for high stakes applications.

[74] arXiv:2003.11196 [pdf]

Dimension Independent Generalization Error with Regularized Online Optimization

Xi Chen, Qiang Liu, Xin T. Tong

One classical canon of statistics is that large models are prone to overfitting and model selection procedures are necessary for high-dimensional data. However, many overparameterized models such as neural networks, which are often trained with simple online methods and regularization, perform very well in practice. The empirical success of overparameterized models, which is often known as benign overfitting, motivates us to take a new look at the statistical generalization theory for online optimization. In particular, we present a general theory on the generalization error of stochastic gradient descent (SGD) for both convex and non-convex loss functions. We further provide the definition of "low effective dimension" so that the generalization error either does not depend on the ambient dimension $p$ or depends on $p$ via a poly-logarithmic factor. We also demonstrate on several widely used statistical models that the "low effective dimension" arises naturally in overparameterized settings. The studied statistical applications include both convex models such as linear regression and logistic regression, and non-convex models such as $M$-estimator and two-layer neural networks.

[75] arXiv:2003.10113 [pdf]

Algorithms for Non-Stationary Generalized Linear Bandits

Yoan Russac (DI-ENS), Olivier Cappé (DI-ENS), Aurélien Garivier (UMPA-ENSL)

The statistical framework of Generalized Linear Models (GLM) can be applied to sequential problems involving categorical or ordinal rewards associated, for instance, with clicks, likes or ratings. In the example of binary rewards, logistic regression is well-known to be preferable to the use of standard linear modeling. Previous works have shown how to deal with GLMs in contextual online learning with bandit feedback when the environment is assumed to be stationary. In this paper, we relax this latter assumption and propose two upper confidence bound based algorithms that make use of either a sliding window or a discounted maximum-likelihood estimator. We provide theoretical guarantees on the behavior of these algorithms for general context sequences and in the presence of abrupt changes. These results take the form of high probability upper bounds for the dynamic regret that are of order $d^{2/3} G^{1/3} T^{2/3}$, where $d$, $T$ and $G$ are respectively the dimension of the unknown parameter, the number of rounds and the number of breakpoints up to time $T$. The empirical performance of the algorithms is illustrated in simulated environments.
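
The two forgetting schemes named in this abstract can both be read as weighted maximum likelihood, where older observations receive smaller (or zero) weight. The following is a minimal sketch under stated assumptions, not the authors' algorithm: it assumes binary rewards, an L2-regularised logistic model fitted by plain gradient ascent, and it omits the confidence-bound part entirely. All function names, the step size and the regularisation strength are illustrative choices.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def weighted_logistic_mle(xs, ys, weights, l2=1.0, lr=0.1, steps=500):
    """Gradient ascent on sum_s w_s * loglik(y_s | x_s, theta) - (l2 / 2) * |theta|^2."""
    d = len(xs[0])
    theta = [0.0] * d
    for _ in range(steps):
        grad = [-l2 * t for t in theta]
        for x, y, w in zip(xs, ys, weights):
            if w == 0.0:
                continue  # observation has been forgotten
            p = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
            for j in range(d):
                grad[j] += w * (y - p) * x[j]
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return theta


def discounted_weights(n, gamma):
    """Weight gamma^(age): the most recent observation has weight 1."""
    return [gamma ** (n - 1 - s) for s in range(n)]


def sliding_window_weights(n, window):
    """Weight 1 for the last `window` observations, 0 for older ones."""
    return [1.0 if s >= n - window else 0.0 for s in range(n)]


# Hypothetical 1-d environment with one breakpoint at round 20:
# before it, positive contexts are rewarded; after it, negative ones are.
xs = [[1.0] if s % 2 == 0 else [-1.0] for s in range(40)]
ys = [
    (1 if x[0] > 0 else 0) if s < 20 else (1 if x[0] < 0 else 0)
    for s, x in enumerate(xs)
]

theta_all = weighted_logistic_mle(xs, ys, [1.0] * 40)
theta_disc = weighted_logistic_mle(xs, ys, discounted_weights(40, 0.9))
theta_win = weighted_logistic_mle(xs, ys, sliding_window_weights(40, 10))
```

In this toy setting the unweighted estimate averages the two regimes and stays at zero, while both forgetting schemes recover the sign of the post-change parameter.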

[76] arXiv:2003.09015 [pdf]

Multilayer Dense Connections for Hierarchical Concept Classification

Toufiq Parag, Hongcheng Wang

Classification is a pivotal function for many computer vision tasks such as object classification, detection, and scene segmentation. Multinomial logistic regression with a single final layer of dense connections has become the ubiquitous technique for CNN-based classification. While these classifiers learn a mapping between the input and a set of output category classes, they do not typically learn comprehensive knowledge about the category. In particular, when a CNN-based image classifier correctly identifies the image of a Chimpanzee, it does not know that it is a member of the Primate, Mammal, and Chordate families and a living thing. We propose a multilayer dense connectivity for a CNN to simultaneously predict the category and its conceptual superclasses in hierarchical order. We experimentally demonstrate that our proposed dense connections, in conjunction with popular convolutional feature layers, can learn to predict the conceptual classes with minimal increase in network size while maintaining the categorical classification accuracy.
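
The baseline this abstract refers to, multinomial logistic regression on top of a single dense layer, amounts to a softmax over linear scores. The sketch below illustrates that baseline only, not the proposed multilayer connectivity; the feature vector, weights and function names are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dense_softmax_classifier(features, weights, biases):
    """One dense layer (one weight row per class) followed by softmax."""
    logits = [
        sum(w * f for w, f in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(logits)


# Hypothetical 2-d feature vector (e.g. pooled CNN features) and 3 classes.
features = [1.0, 2.0]
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
biases = [0.0, 0.0, 0.0]

probs = dense_softmax_classifier(features, weights, biases)
predicted = max(range(len(probs)), key=probs.__getitem__)
```

Because the output is a single distribution over mutually exclusive categories, nothing in this layer encodes that one category is a special case of another, which is the gap the abstract describes.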

[77] arXiv:2003.08573 [pdf]

Uncertainty Estimation in Cancer Survival Prediction

Hrushikesh Loya, Pranav Poduval, Deepak Anand, Neeraj Kumar, Amit Sethi

Survival models are used in various fields, such as the development of cancer treatment protocols. Although many statistical and machine learning models have been proposed to achieve accurate survival predictions, little attention has been paid to obtaining well-calibrated uncertainty estimates associated with each prediction. The currently popular models are opaque and untrustworthy in that they often express high confidence even on those test cases that are not similar to the training samples, and even when their predictions are wrong. We propose a Bayesian framework for survival models that not only gives more accurate survival predictions but also quantifies the survival uncertainty better. Our approach is a novel combination of variational inference for uncertainty estimation, neural multi-task logistic regression for estimating nonlinear and time-varying risk models, and an additional sparsity-inducing prior to work with high dimensional data.

[78] arXiv:2003.08259 [pdf]

Logistic-Regression with peer-group effects via inference in higher order Ising models

Constantinos Daskalakis, Nishanth Dikkala, Ioannis Panageas

Spin glass models, such as the Sherrington-Kirkpatrick, Hopfield and Ising models, are all well-studied members of the exponential family of discrete distributions, and have been influential in a number of application domains where they are used to model correlation phenomena on networks. Conventionally these models have quadratic sufficient statistics and consequently capture correlations arising from pairwise interactions. In this work we study extensions of these to models with higher-order sufficient statistics, modeling behavior on a social network with peer-group effects. In particular, we model binary outcomes on a network as a higher-order spin glass, where the behavior of an individual depends on a linear function of their own vector of covariates and some polynomial function of the behavior of others, capturing peer-group effects. Using a single, high-dimensional sample from such a model, our goal is to recover the coefficients of the linear function as well as the strength of the peer-group effects. The heart of our result is a novel approach for showing strong concavity of the log pseudo-likelihood of the model, implying a statistical error rate of $\sqrt{d/n}$ for the Maximum Pseudo-Likelihood Estimator (MPLE), where $d$ is the dimensionality of the covariate vectors and $n$ is the size of the network (number of nodes). Our model generalizes vanilla logistic regression as well as the peer-effect models studied in recent works, and our results extend these results to accommodate higher-order interactions.

[79] arXiv:2003.08239 [pdf]

Patient-centric HetNets Powered by Machine Learning and Big Data Analytics for 6G Networks

Mohammed S. Hadi, Ahmed Q. Lawey, Taisir E. H. El-Gorashi, Jaafar M. H. Elmirghani

Having a cognitive and self-optimizing network that proactively adapts not only to channel conditions but also to its users' needs can be one of the highest forthcoming priorities of future 6G Heterogeneous Networks (HetNets). In this paper, we introduce an interdisciplinary approach linking the concepts of e-healthcare, priority, big data analytics (BDA) and radio resource optimization in a multi-tier 5G network. We employ three machine learning (ML) algorithms, namely, naive Bayesian (NB) classifier, logistic regression (LR), and decision tree (DT), working as an ensemble system to analyze historical medical records of stroke out-patients (OPs) and readings from body-attached internet-of-things (IoT) sensors to predict the likelihood of an imminent stroke. We convert the stroke likelihood into a risk factor functioning as a priority in a mixed integer linear programming (MILP) optimization model. Hence, the task is to optimally allocate physical resource blocks (PRBs) to HetNet users while prioritizing OPs by granting them high-gain PRBs according to the severity of their medical state, thus empowering the OPs to send their critical data to their healthcare provider with minimized delay. To that end, two optimization approaches are proposed, a weighted sum rate maximization (WSRMax) approach and a proportional fairness (PF) approach. The proposed approaches increased the OPs' average signal to interference plus noise ratio (SINR) by 57% and 95%, respectively. The WSRMax approach increased the system total SINR to a level higher than that of the PF approach; nevertheless, the PF approach yielded higher SINRs for the OPs, better fairness and a lower margin of error.

[80] arXiv:2003.08137 [pdf]

A Novel Twitter Sentiment Analysis Model with Baseline Correlation for Financial Market Prediction with Improved Efficiency

Xinyi Guo, Jinfeng Li

A novel social network sentiment analysis model is proposed based on Twitter sentiment score (TSS) for real-time prediction of the future stock market price FTSE 100, as compared with conventional econometric models of investor sentiment based on closed-end fund discount (CEFD). The proposed TSS model features a new baseline correlation approach, which not only exhibits decent prediction accuracy, but also reduces the computation burden and enables fast decision making without knowledge of historical data. Polynomial regression, classification modelling and lexicon-based sentiment analysis are performed using R. The obtained TSS predicts the future stock market trend in advance by 15 time samples (30 working hours) with an accuracy of 67.22% using the proposed baseline criterion without referring to historical TSS or market data. Specifically, TSS's prediction performance for an upward market is found to be far better than that for a downward market. Under logistic regression and linear discriminant analysis, the accuracy of TSS in predicting the upward trend of the future market reaches 97.87%.
