Data science papers

Here you can find a selection of high quality arXiv papers involving data science.

[1] arXiv:2006.09912 [pdf]

Bute A Bottom-Up Exact Solver for Treedepth (Submitted to PACE 2020 under username peaty)

James Trimble

This note introduces the exact solver Bute for the exact treedepth problem, along with two variants of the solver. Each of these solvers computes an elimination tree in a bottom-up fashion by finding sets of vertices that induce subgraphs of small treedepth, then combining sets of vertices together with a root vertex to produce larger sets. The algorithms make use of a trie data structure to reduce the effort required to determine acceptable combinations of subtrees. Bute-Plus and Bute-Plus-Plus add a heuristic presolve step, which can quickly find a treedepth decomposition of optimal depth for many instances.

[2] arXiv:2006.09877 [pdf]

Twin-width II small classes

Édouard Bonnet, Colin Geniet, Eun Jung Kim, Stéphan Thomassé, Rémi Watrigant

The twin-width of a graph $G$ is the minimum integer $d$ such that $G$ has a $d$-contraction sequence, that is, a sequence of $|V(G)|-1$ iterated vertex identifications for which the overall maximum number of red edges incident to a single vertex is at most $d$, where a red edge appears between two sets of identified vertices if they are not homogeneous in $G$. We show that if a graph admits a $d$-contraction sequence, then it also has a linear-arity tree of $f(d)$-contractions, for some function $f$. First this permits to show that every bounded twin-width class is small, i.e., has at most $n!c^n$ graphs labeled by $[n]$, for some constant $c$. This unifies and extends the same result for bounded treewidth graphs [Beineke and Pippert, JCT '69], proper subclasses of permutations graphs [Marcus and Tardos, JCTA '04], and proper minor-free classes [Norine et al., JCTB '06]. The second consequence is an $O(\log n)$-adjacency labeling scheme for bounded twin-width graphs, confirming several cases of the implicit graph conjecture. We then explore the "small conjecture" that, conversely, every small hereditary class has bounded twin-width. Inspired by sorting networks of logarithmic depth, we show that $\log_{\Theta(\log \log d)}n$-subdivisions of $K_n$ (a small class when $d$ is constant) have twin-width at most $d$. We obtain a rather sharp converse with a surprisingly direct proof the $\log_{d+1}n$-subdivision of $K_n$ has twin-width at least $d$. Secondly graphs with bounded stack or queue number (also small classes) have bounded twin-width. Thirdly we show that cubic expanders obtained by iterated random 2-lifts from $K_4$~[Bilu and Linial, Combinatorica '06] have bounded twin-width, too. We suggest a promising connection between the small conjecture and group theory. Finally we define a robust notion of sparse twin-width and discuss how it compares with other sparse classes.

[3] arXiv:2006.09872 [pdf]

Scheduling a Proportionate Flow Shop of Batching Machines

Christoph Hertrich, Christian Weiß, Heiner Ackermann, Sandy Heydrich, Sven O. Krumke

In this paper we study a proportionate flow shop of batching machines with release dates and a fixed number $m \geq 2$ of machines. The scheduling problem has so far barely received any attention in the literature, but recently its importance has increased significantly, due to applications in the industrial scaling of modern bio-medicine production processes. We show that for any fixed number of machines, the makespan and the sum of completion times can be minimized in polynomial time. Furthermore, we show that the obtained algorithm can also be used to minimize the weighted total completion time, maximum lateness, total tardiness and (weighted) number of late jobs in polynomial time if all release dates are $0$. Previously, polynomial time algorithms have only been known for two machines.

[4] arXiv:2006.09735 [pdf]

Efficient Statistics for Sparse Graphical Models from Truncated Samples

Arnab Bhattacharyya, Rathin Desai, Sai Ganesh Nagarajan, Ioannis Panageas

In this paper, we study high-dimensional estimation from truncated samples. We focus on two fundamental and classical problems (i) inference of sparse Gaussian graphical models and (ii) support recovery of sparse linear models. (i) For Gaussian graphical models, suppose $d$-dimensional samples ${\bf x}$ are generated from a Gaussian $N(\mu,\Sigma)$ and observed only if they belong to a subset $S \subseteq \mathbb{R}^d$. We show that ${\mu}$ and ${\Sigma}$ can be estimated with error $\epsilon$ in the Frobenius norm, using $\tilde{O}\left(\frac{\textrm{nz}({\Sigma}^{-1})}{\epsilon^2}\right)$ samples from a truncated $\mathcal{N}({\mu},{\Sigma})$ and having access to a membership oracle for $S$. The set $S$ is assumed to have non-trivial measure under the unknown distribution but is otherwise arbitrary. (ii) For sparse linear regression, suppose samples $({\bf x},y)$ are generated where $y = {\bf x}^\top{{\Omega}^*} + \mathcal{N}(0,1)$ and $({\bf x}, y)$ is seen only if $y$ belongs to a truncation set $S \subseteq \mathbb{R}$. We consider the case that ${\Omega}^*$ is sparse with a support set of size $k$. Our main result is to establish precise conditions on the problem dimension $d$, the support size $k$, the number of observations $n$, and properties of the samples and the truncation that are sufficient to recover the support of ${\Omega}^*$. Specifically, we show that under some mild assumptions, only $O(k^2 \log d)$ samples are needed to estimate ${\Omega}^*$ in the $\ell_\infty$-norm up to a bounded error. For both problems, our estimator minimizes the sum of the finite population negative log-likelihood function and an $\ell_1$-regularization term.

[5] arXiv:2006.09665 [pdf]

Caching with Time Windows and Delays

Anupam Gupta, Amit Kumar, Debmalya Panigrahi

We consider two generalizations of the classical weighted paging problem that incorporate the notion of delayed service of page requests. The first is the (weighted) Paging with Time Windows (PageTW) problem, which is like the classical weighted paging problem except that each page request only needs to be served before a given deadline. This problem arises in many practical applications of online caching, such as the "deadline" I/O scheduler in the Linux kernel and video-on-demand streaming. The second, and more general, problem is the (weighted) Paging with Delay (PageD) problem, where the delay in serving a page request results in a penalty being assessed to the objective. This problem generalizes the caching problem to allow delayed service, a line of work that has recently gained traction in online algorithms (e.g., Emek et al. STOC '16, Azar et al. STOC '17, Azar and Touitou FOCS '19). We give $O(\log k\log n)$-competitive algorithms for both the PageTW and PageD problems on $n$ pages with a cache of size $k$. This significantly improves on the previous best bounds of $O(k)$ for both problems (Azar et al. STOC '17). We also consider the offline PageTW and PageD problems, for which we give $O(1)$ approximation algorithms and prove APX-hardness. These are the first results for the offline problems; even NP-hardness was not known before our work. At the heart of our algorithms is a novel "hitting-set" LP relaxation of the PageTW problem that overcomes the $\Omega(k)$ integrality gap of the natural LP for the problem. To the best of our knowledge, this is the first example of an LP-based algorithm for an online algorithm with delays/deadlines.

[6] arXiv:2006.09509 [pdf]

Online Algorithms for Weighted Paging with Predictions

Zhihao Jiang, Debmalya Panigrahi, Kevin Sun

In this paper, we initiate the study of the weighted paging problem with predictions. This continues the recent line of work in online algorithms with predictions, particularly that of Lykouris and Vassilvitski (ICML 2018) and Rohatgi (SODA 2020) on unweighted paging with predictions. We show that unlike unweighted paging, neither a fixed lookahead nor knowledge of the next request for every page is sufficient information for an algorithm to overcome existing lower bounds in weighted paging. However, a combination of the two, which we call the strong per request prediction (SPRP) model, suffices to give a 2-competitive algorithm. We also explore the question of gracefully degrading algorithms with increasing prediction error, and give both upper and lower bounds for a set of natural measures of prediction error.

[7] arXiv:2006.09352 [pdf]

A One-Pass Private Sketch for Most Machine Learning Tasks

Benjamin Coleman, Anshumali Shrivastava

Differential privacy (DP) is a compelling privacy definition that explains the privacy-utility tradeoff via formal, provable guarantees. Inspired by recent progress toward general-purpose data release algorithms, we propose a private sketch, or small summary of the dataset, that supports a multitude of machine learning tasks including regression, classification, density estimation, near-neighbor search, and more. Our sketch consists of randomized contingency tables that are indexed with locality-sensitive hashing and constructed with an efficient one-pass algorithm. We prove competitive error bounds for DP kernel density estimation. Existing methods for DP kernel density estimation scale poorly, often exponentially slower with an increase in dimensions. In contrast, our sketch can quickly run on large, high-dimensional datasets in a single pass. Exhaustive experiments show that our generic sketch delivers a similar privacy-utility tradeoff when compared to existing DP methods at a fraction of the computation cost. We expect that our sketch will enable differential privacy in distributed, large-scale machine learning settings.

[8] arXiv:2006.09327 [pdf]

A Note on Monotone Submodular Maximization with Cardinality Constraint

Wenxin Li

We show that for the cardinality constrained monotone submodular maximization problem, there exists a $(1-1/e-\varepsilon)$-approximate deterministic algorithm with linear query complexity, which performs $O(n/\varepsilon)$ queries in total.

[9] arXiv:2006.09221 [pdf]

Testing systems of real quadratic equations for approximate solutions

Alexander Barvinok

Consider systems of equations $q_i(x)=0$, where $q_i {\Bbb R}^n \longrightarrow {\Bbb R}$, $i=1, \ldots, m$, are quadratic forms. We want to be able to tell efficiently systems with many non-trivial solutions or near solutions $x \ne 0$ from systems that are far from having a solution. For that, we pick a penalty function $F {\Bbb R} \longrightarrow [0, 1]$ with $F(0)=1$ and $F(y) 0$ is an absolute constant.

[10] arXiv:2006.09167 [pdf]

Heterogeneous Parallelization and Acceleration of Molecular Dynamics Simulations in GROMACS

Szilárd Páll, Artem Zhmurov, Paul Bauer, Mark Abraham, Magnus Lundborg, Alan Gray, Berk Hess, Erik Lindahl

The introduction of accelerator devices such as graphics processing units (GPUs) has had profound impact on molecular dynamics simulations and has enabled order-of-magnitude performance advances using commodity hardware. To fully reap these benefits, it has been necessary to reformulate some of the most fundamental algorithms, including the Verlet list, pair searching and cut-offs. Here, we present the heterogeneous parallelization and acceleration design of molecular dynamics implemented in the GROMACS codebase over the last decade. The setup involves a general cluster-based approach to pair lists and non-bonded pair interactions that utilizes both GPUs and CPU SIMD acceleration efficiently, including the ability to load-balance tasks between CPUs and GPUs. The algorithm work efficiency is tuned for each type of hardware, and to use accelerators more efficiently we introduce dual pair lists with rolling pruning updates. Combined with new direct GPU-GPU communication as well as GPU integration, this enables excellent performance from single GPU simulations through strong scaling across multiple GPUs and efficient multi-node parallelization.

[11] arXiv:2006.09123 [pdf]

Algorithms with Predictions

Michael Mitzenmacher, Sergei Vassilvitskii

We introduce algorithms that use predictions from machine learning applied to the input to circumvent worst-case analysis. We aim for algorithms that have near optimal performance when these predictions are good, but recover the prediction-less worst case behavior when the predictions have large errors.

[12] arXiv:2006.09085 [pdf]

MCRapper Monte-Carlo Rademacher Averages for Poset Families and Approximate Pattern Mining

Leonardo Pellegrina, Cyrus Cousins, Fabio Vandin, Matteo Riondato

We present MCRapper, an algorithm for efficient computation of Monte-Carlo Empirical Rademacher Averages (MCERA) for families of functions exhibiting poset (e.g., lattice) structure, such as those that arise in many pattern mining tasks. The MCERA allows us to compute upper bounds to the maximum deviation of sample means from their expectations, thus it can be used to find both statistically-significant functions (i.e., patterns) when the available data is seen as a sample from an unknown distribution, and approximations of collections of high-expectation functions (e.g., frequent patterns) when the available data is a small sample from a large dataset. This feature is a strong improvement over previously proposed solutions that could only achieve one of the two. MCRapper uses upper bounds to the discrepancy of the functions to efficiently explore and prune the search space, a technique borrowed from pattern mining itself. To show the practical use of MCRapper, we employ it to develop an algorithm TFP-R for the task of True Frequent Pattern (TFP) mining. TFP-R gives guarantees on the probability of including any false positives (precision) and exhibits higher statistical power (recall) than existing methods offering the same guarantees. We evaluate MCRapper and TFP-R and show that they outperform the state-of-the-art for their respective tasks.

[13] arXiv:2006.08949 [pdf]

Utility-Based Graph Summarization New and Improved

Mahdi Hajiabadi, Jasbir Singh, Venkatesh Srinivasan, Alex Thomo

A fundamental challenge in graph mining is the ever-increasing size of datasets. Graph summarization aims to find a compact representation resulting in faster algorithms and reduced storage needs. The flip side of graph summarization is the loss of utility which diminishes its usability. The key questions we address in this paper are (1)How to summarize a graph without any loss of utility? (2)How to summarize a graph with some loss of utility but above a user-specified threshold? (3)How to query graph summaries without graph reconstruction?} We also aim at making graph summarization available for the masses by efficiently handling web-scale graphs using only a consumer-grade machine. Previous works suffer from conceptual limitations and lack of scalability. In this work, we make three key contributions. First, we present a utility-driven graph summarization method, based on a clique and independent set decomposition, that produces significant compression with zero loss of utility. The compression provided is significantly better than state-of-the-art in lossless graph summarization, while the runtime is two orders of magnitude lower. Second, we present a highly scalable algorithm for the lossy case, which foregoes the expensive iterative process that hampers previous work. Our algorithm achieves this by combining a memory reduction technique and a novel binary-search approach. In contrast to the competition, we are able to handle web-scale graphs in a single machine without a performance impediment as the utility threshold (and size of summary) decreases. Third, we show that our graph summaries can be used as-is to answer several important classes of queries, such as triangle enumeration, Pagerank, and shortest paths. This is in contrast to other works that incrementally reconstruct the original graph for answering queries, thus incurring additional time costs.

[14] arXiv:2006.08945 [pdf]

The algebra and machine representation of statistical models

Evan Patterson

As the twin movements of open science and open source bring an ever greater share of the scientific process into the digital realm, new opportunities arise for the meta-scientific study of science itself, including of data science and statistics. Future science will likely see machines play an active role in processing, organizing, and perhaps even creating scientific knowledge. To make this possible, large engineering efforts must be undertaken to transform scientific artifacts into useful computational resources, and conceptual advances must be made in the organization of scientific theories, models, experiments, and data. This dissertation takes steps toward digitizing and systematizing two major artifacts of data science, statistical models and data analyses. Using tools from algebra, particularly categorical logic, a precise analogy is drawn between models in statistics and logic, enabling statistical models to be seen as models of theories, in the logical sense. Statistical theories, being algebraic structures, are amenable to machine representation and are equipped with morphisms that formalize the relations between different statistical methods. Turning from mathematics to engineering, a software system for creating machine representations of data analyses, in the form of Python or R programs, is designed and implemented. The representations aim to capture the semantics of data analyses, independent of the programming language and libraries in which they are implemented.

[15] arXiv:2006.08926 [pdf]

Computing Igusa's local zeta function of univariates in deterministic polynomial-time

Ashish Dwivedi, Nitin Saxena

Igusa's local zeta function $Z_{f,p}(s)$ is the generating function that counts the number of integral roots, $N_{k}(f)$, of $f(\mathbf x) \bmod p^k$, for all $k$. It is a famous result, in analytic number theory, that $Z_{f,p}$ is a rational function in $\mathbb{Q}(p^s)$. We give an elementary proof of this fact for a univariate polynomial $f$. Our proof is constructive as it gives a closed-form expression for the number of roots $N_{k}(f)$. Our proof, when combined with the recent root-counting algorithm of (Dwivedi, Mittal, Saxena, CCC, 2019), yields the first deterministic poly($|f|, \log p$) time algorithm to compute $Z_{f,p}(s)$. Previously, an algorithm was known only in the case when $f$ completely splits over $\mathbb{Q}_p$; it required the rational roots to use the concept of generating function of a tree (Zúñiga-Galindo, J.Int.Seq., 2003).

[16] arXiv:2006.08668 [pdf]

Algorithmic Aspects of Temporal Betweenness

Sebastian Buß, Hendrik Molter, Rolf Niedermeier, Maciej Rymar

The betweenness centrality of a graph vertex measures how often this vertex is visited on shortest paths between other vertices of the graph. In the analysis of many real-world graphs or networks, betweenness centrality of a vertex is used as an indicator for its relative importance in the network. In particular, it is among the most popular tools in social network analysis. In recent years, a growing number of real-world networks is modeled as temporal graphs, where we have a fixed set of vertices and there is a finite discrete set of time steps and every edge might be present only at some time steps. While shortest paths are straightforward to define in static graphs, temporal paths can be considered "optimal" with respect to many different criteria, including length, arrival time, and overall travel time (shortest, foremost, and fastest paths). This leads to different concepts of temporal betweenness centrality and we provide a systematic study of temporal betweenness variants based on various concepts of optimal temporal paths. Computing the betweenness centrality for vertices in a graph is closely related to counting the number of optimal paths between vertex pairs. We show that counting foremost and fastest paths is computationally intractable (#P-hard) and hence the computation of the corresponding temporal betweenness values is intractable as well. For shortest paths and two selected special cases of foremost paths, we devise polynomial-time algorithms for temporal betweenness computation. Moreover, we also explore the distinction between strict (ascending time labels) and non-strict (non-descending time labels) time labels in temporal paths. In our experiments with established real-world temporal networks, we demonstrate the practical effectiveness of our algorithms, compare the various betweenness concepts, and derive recommendations on their practical use.

[17] arXiv:2006.08473 [pdf]

Improved algorithm for permutation testing

Xiaojin Zhang

We study the problem of testing forbidden patterns. The patterns that are of significant interest include monotone pattern and $(1,3,2)$-pattern. For the problem of testing monotone patterns, \cite{newman2019testing} propose a non-adaptive algorithm with query complexity $(\log n)^{O(k^2)}$. \cite{ben2019finding} then improve the query complexity of non-adaptive algorithm to $\Omega((\log n)^{\lfloor\log k\rfloor})$. Further, \cite{ben2019optimal} propose an adaptive algorithm for testing monotone pattern with optimal sample complexity $O(\log n)$. However, the adaptive algorithm and the analysis are rather complicated. In this paper, we provide a simple adaptive algorithm with one-sided error for testing monotone permutation. We also present an algorithm with improved query complexity for testing $(1,3,2)$ pattern.

[18] arXiv:2006.08420 [pdf]

Combinatorial Group Testing and Sparse Recovery Schemes with Near-Optimal Decoding Time

Mahdi Cheraghchi, Vasileios Nakos

In the long-studied problem of combinatorial group testing, one is asked to detect a set of $k$ defective items out of a population of size $n$, using $m \ll n$ disjunctive measurements. In the non-adaptive setting, the most widely used combinatorial objects are disjunct and list-disjunct matrices, which define incidence matrices of test schemes. Disjunct matrices allow the identification of the exact set of defectives, whereas list disjunct matrices identify a small superset of the defectives. Apart from the combinatorial guarantees, it is often of key interest to equip measurement designs with efficient decoding algorithms. The most efficient decoders should run in sublinear time in $n$, and ideally near-linear in the number of measurements $m$. In this work, we give several constructions with an optimal number of measurements and near-optimal decoding time for the most fundamental group testing tasks, as well as for central tasks in the compressed sensing and heavy hitters literature. For many of those tasks, the previous measurement-optimal constructions needed time either quadratic in the number of measurements or linear in the universe size. Most of our results are obtained via a clean and novel approach which avoids list-recoverable codes or related complex techniques which were present in almost every state-of-the-art work on efficiently decodable constructions of such objects.

[19] arXiv:2006.08397 [pdf]

Nearly Linear Row Sampling Algorithm for Quantile Regression

Yi Li, Ruosong Wang, Lin Yang, Hanrui Zhang

We give a row sampling algorithm for the quantile loss function with sample complexity nearly linear in the dimensionality of the data, improving upon the previous best algorithm whose sampling complexity has at least cubic dependence on the dimensionality. Based upon our row sampling algorithm, we give the fastest known algorithm for quantile regression and a graph sparsification algorithm for balanced directed graphs. Our main technical contribution is to show that Lewis weights sampling, which has been used in row sampling algorithms for $\ell_p$ norms, can also be applied in row sampling algorithms for a variety of loss functions. We complement our theoretical results by experiments to demonstrate the practicality of our approach.

[20] arXiv:2006.08334 [pdf]

StackOverflow vs Kaggle A Study of Developer Discussions About Data Science

David Hin

Software developers are increasingly required to understand fundamental Data science (DS) concepts. Recently, the presence of machine learning (ML) and deep learning (DL) has dramatically increased in the development of user applications, whether they are leveraged through frameworks or implemented from scratch. These topics attract much discussion on online platforms. This paper conducts large-scale qualitative and quantitative experiments to study the characteristics of 197836 posts from StackOverflow and Kaggle. Latent Dirichlet Allocation topic modelling is used to extract twenty-four DS discussion topics. The main findings include that TensorFlow-related topics were most prevalent in StackOverflow, while meta discussion topics were the prevalent ones on Kaggle. StackOverflow tends to include lower-level troubleshooting, while Kaggle focuses on practicality and optimising leaderboard performance. In addition, across both communities, DS discussion is increasing at a dramatic rate. While TensorFlow discussion on StackOverflow is slowing, interest in Keras is rising. Finally, ensemble algorithms are the most mentioned ML/DL algorithms in Kaggle but are rarely discussed on StackOverflow. These findings can help educators and researchers to more effectively tailor and prioritise efforts in researching and communicating DS concepts towards different developer communities.

[21] arXiv:2006.08302 [pdf]

Hypergraph Clustering Based on PageRank

Yuuki Takai, Atsushi Miyauchi, Masahiro Ikeda, Yuichi Yoshida

A hypergraph is a useful combinatorial object to model ternary or higher-order relations among entities. Clustering hypergraphs is a fundamental task in network analysis. In this study, we develop two clustering algorithms based on personalized PageRank on hypergraphs. The first one is local in the sense that its goal is to find a tightly connected vertex set with a bounded volume including a specified vertex. The second one is global in the sense that its goal is to find a tightly connected vertex set. For both algorithms, we discuss theoretical guarantees on the conductance of the output vertex set. Also, we experimentally demonstrate that our clustering algorithms outperform existing methods in terms of both the solution quality and running time. To the best of our knowledge, ours are the first practical algorithms for hypergraphs with theoretical guarantees on the conductance of the output set.

[22] arXiv:2006.08012 [pdf]

High-precision Wasserstein barycenters in polynomial time

Jason M. Altschuler, Enric Boix-Adsera

Computing Wasserstein barycenters is a fundamental geometric problem with widespread applications in machine learning, statistics, and computer graphics. However, it is unknown whether Wasserstein barycenters can be computed in polynomial time, either exactly or to high precision (i.e., with $\textrm{polylog}(1/\varepsilon)$ runtime dependence). This paper answers these questions in the affirmative for any fixed dimension. Our approach is to solve an exponential-size linear programming formulation by efficiently implementing the corresponding separation oracle using techniques from computational geometry.

[23] arXiv:2006.07999 [pdf]

Support Estimation with Sampling Artifacts and Errors

Eli Chien, Olgica Milenkovic, Angelia Nedich

The problem of estimating the support of a distribution is of great importance in many areas of machine learning, computer science, physics and biology. Most of the existing work in this domain has focused on settings that assume perfectly accurate sampling approaches, which is seldom true in practical data science. Here we introduce the first known approach to support estimation in the presence of sampling artifacts and errors where each sample is assumed to arise from a Poisson repeat channel which simultaneously captures repetitions and deletions of samples. The proposed estimator is based on regularized weighted Chebyshev approximations, with weights governed by evaluations of so-called Touchard (Bell) polynomials. The supports in the presence of sampling artifacts are calculated using discretized semi-infite programming methods. The estimation approach is tested on synthetic and textual data, as well as on GISAID data collected to address a new problem in computational biology mutational support estimation in genes of the SARS-Cov-2 virus. In the later setting, the Poisson channel captures the fact that many individuals are tested multiple times for the presence of viral RNA, thereby leading to repeated samples, while other individual's results are not recorded due to test errors. For all experiments performed, we observed significant improvements of our integrated methods compared to those obtained through adequate modifications of state-of-the-art noiseless support estimation methods.

[24] arXiv:2006.07998 [pdf]

Optimal Transport for Stationary Markov Chains via Policy Iteration

Kevin O'Connor, Kevin McGoff, Andrew Nobel

We study an extension of optimal transport techniques to stationary Markov chains from a computational perspective. In this context, naively applying optimal transport to the stationary distributions of the Markov chains of interest would not capture the Markovian dynamics. Instead, we study a new problem, called the optimal transition coupling problem, in which the optimal transport problem is constrained to the set of stationary Markovian couplings satisfying a certain transition matrix condition. After drawing a connection between this problem and Markov decision processes, we prove that solutions can be obtained via the policy iteration algorithm. For settings with large state spaces, we also define a regularized problem, propose a faster, approximate algorithm, and provide bounds on the computational complexity of each iteration. Finally, we validate our theoretical results empirically, demonstrating that the approximate algorithm exhibits faster overall runtime with low error in a simulation study.

[25] arXiv:2006.07980 [pdf]

Application of Data Science to Discover Violence-Related Issues in Iraq

Merari González, Germán H. Alférez

Data science has been satisfactorily used to discover social issues in several parts of the world. However, there is a lack of governmental open data to discover those issues in countries such as Iraq. This situation arises the following questions how to apply data science principles to discover social issues despite the lack of open data in Iraq? How to use the available data to make predictions in places without data? Our contribution is the application of data science to open non-governmental big data from the Global Database of Events, Language, and Tone (GDELT) to discover particular violence-related social issues in Iraq. Specifically we applied the K-Nearest Neighbors, Näive Bayes, Decision Trees, and Logistic Regression classification algorithms to discover the following issues refugees, humanitarian aid, violent protests, fights with artillery and tanks, and mass killings. The best results were obtained with the Decision Trees algorithm to discover areas with refugee crises and artillery fights. The accuracy for these two events is 0.7629. The precision to discover the locations of refugee crises is 0.76, the recall is 0.76, and the F1-score is 0.76. Also, our approach discovers the locations of artillery fights with a precision of 0.74, a recall of 0.75, and a F1-score of 0.75.

[26] arXiv:2006.07928 [pdf]

Global Convergence of Sobolev Training for Overparametrized Neural Networks

Jorio Cocola, Paul Hand

Sobolev loss is used when training a network to approximate the values and derivatives of a target function at a prescribed set of input points. Recent works have demonstrated its successful applications in various tasks such as distillation or synthetic gradient prediction. In this work we prove that an overparametrized two-layer relu neural network trained on the Sobolev loss with gradient flow from random initialization can fit any given function values and any given directional derivatives, under a separation condition on the input data.

[27] arXiv:2006.07919 [pdf]

Vehicle Redistribution in Ride-Sourcing Markets using Convex Minimum Cost Flows

Renos Karamanis, Eleftherios Anastasiadis, Marc Stettler, Panagiotis Angeloudis

Ride-sourcing platforms often face imbalances in the demand and supply of rides across areas in their operating road-networks. As such, dynamic pricing methods have been used to mediate these demand asymmetries through surge price multipliers, thus incentivising higher driver participation in the market. However, the anticipated commercialisation of autonomous vehicles could transform the current ride-sourcing platforms to fleet operators. The absence of human drivers fosters the need for empty vehicle management to address any vehicle supply deficiencies. Proactive redistribution using integer programming and demand predictive models have been proposed in research to address this problem. A shortcoming of existing models, however, is that they ignore the market structure and underlying customer choice behaviour. As such, current models do not capture the real value of redistribution. To resolve this, we formulate the vehicle redistribution problem as a non-linear minimum cost flow problem which accounts for the relationship of supply and demand of rides, by assuming a customer discrete choice model and a market structure. We demonstrate that this model can have a convex domain, and we introduce an edge splitting algorithm to solve a transformed convex minimum cost flow problem for vehicle redistribution. By testing our model using simulation, we show that our redistribution algorithm can decrease wait times up to 50% and increase vehicle utilization up to 8%. Our findings outline that the value of redistribution is contingent on localised market structure and customer behaviour.

[28] arXiv:2006.07846 [pdf]

From Graph Low-Rank Global Attention to 2-FWL Approximation

Omri Puny, Heli Ben-Hamu, Yaron Lipman

Graph Neural Networks (GNNs) are known to have an expressive power bounded by that of the vertex coloring algorithm (Xu et al., 2019a; Morris et al., 2018). However, for rich node features, such a bound does not exist and GNNs can be shown to be universal, namely, have the theoretical ability to approximate arbitrary graph functions. It is well known, however, that expressive power alone does not imply good generalization. In an effort to improve generalization of GNNs we suggest the Low-Rank Global Attention (LRGA) module, taking advantage of the efficiency of low rank matrix-vector multiplication, that improves the algorithmic alignment (Xu et al., 2019b) of GNNs with the 2-folklore Weisfeiler-Lehman (FWL) algorithm; 2-FWL is a graph isomorphism algorithm that is strictly more powerful than vertex coloring. Concretely, we (i) formulate 2-FWL using polynomial kernels; (ii) show LRGA aligns with this 2-FWL formulation; and (iii) bound the sample complexity of the kernel's feature map when learned with a randomly initialized two-layer MLP. The latter means the generalization error can be made arbitrarily small when training LRGA to learn the 2-FWL algorithm. From a practical point of view, augmenting existing GNN layers with LRGA produces state of the art results on most datasets in a GNN standard benchmark.

[29] arXiv:2006.07832 [pdf]

Group Fairness for Knapsack Problems

Deval Patel (1), Arindam Khan (1), Anand Louis (1) ((1) Indian Institute of Science, Bengaluru, India)

We study the knapsack problem with group fairness constraints. The input of the problem consists of a knapsack of bounded capacity and a set of items, each item belongs to a particular category and has and associated weight and value. The goal of this problem is to select a subset of items such that all categories are fairly represented, the total weight of the selected items does not exceed the capacity of the knapsack,and the total value is maximized. We study the fairness parameters such as the bounds on the total value of items from each category, the total weight of items from each category, and the total number of items from each category. We give approximation algorithms for these problems. These fairness notions could also be extended to the min-knapsack problem. The fair knapsack problems encompass various important problems, such as participatory budgeting, fair budget allocation, advertising.

[30] arXiv:2006.07677 [pdf]

Total Coloring for some classes of Circulant graphs

Prajnanaswaroopa S, Geetha J, Somasundaram K

The Total coloring conjecture states that any simple graph G with maximum degree D can be totally colored with at most D+2 colors. In this paper, we have obtained the total chromatic number for some classes of Cayley graphs.

[31] arXiv:2006.07628 [pdf]

When Algorithms for Maximal Independent Set and Maximal Matching Run in Sublinear-Time

Sepehr Assadi, Shay Solomon

Maximal independent set (MIS), maximal matching (MM), and $(\Delta+1)$-coloring in graphs of maximum degree $\Delta$ are among the most prominent algorithmic graph theory problems. They are all solvable by a simple linear-time greedy algorithm and up until very recently this constituted the state-of-the-art. In SODA 2019, Assadi, Chen, and Khanna gave a randomized algorithm for $(\Delta+1)$-coloring that runs in $\widetilde{O}(n\sqrt{n})$ time, which even for moderately dense graphs is sublinear in the input size. The work of Assadi et al. however contained a spoiler for MIS and MM neither problems provably admits a sublinear-time algorithm in general graphs. In this work, we dig deeper into the possibility of achieving sublinear-time algorithms for MIS and MM. The neighborhood independence number of a graph $G$, denoted by $\beta(G)$, is the size of the largest independent set in the neighborhood of any vertex. We identify $\beta(G)$ as the ``right'' parameter to measure the runtime of MIS and MM algorithms Although graphs of bounded neighborhood independence may be very dense (clique is one example), we prove that carefully chosen variants of greedy algorithms for MIS and MM run in $O(n\beta(G))$ and $O(n\log{n}\cdot\beta(G))$ time respectively on any $n$-vertex graph $G$. We complement this positive result by observing that a simple extension of the lower bound of Assadi this http URL. implies that $\Omega(n\beta(G))$ time is also necessary for any algorithm to either problem for all values of $\beta(G)$ from $1$ to $\Theta(n)$. We note that our algorithm for MIS is deterministic while for MM we use randomization which we prove is unavoidable any deterministic algorithm for MM requires $\Omega(n^2)$ time even for $\beta(G) = 2$.

[32] arXiv:2006.07588 [pdf]

Balanced Allocation on Dynamic Hypergraphs

Catherine Greenhill, Bernard Mans, Ali Pourmiri

The {balls-into-bins model} randomly allocates $n$ sequential balls into $n$ bins, as follows each ball selects a set $D$ of $d\ge 2$ bins, independently and uniformly at random, then the ball is allocated to a least-loaded bin from $D$ (ties broken randomly). The \emph{maximum load} is the maximum number of balls in any bin. {In 1999, Azar et al.}\ showed that provided ties are broken randomly, after $n$ balls have been placed the \emph{maximum load}, is ${\log_d\log n}+O(1)$, with high probability. We consider this popular paradigm in a dynamic environment where the bins are structured as a \emph{dynamic hypergraph}. A dynamic hypergraph is a sequence of hypergraphs, say $\mathcal{H}^{(t)}$, arriving over discrete times $t=1,2,\ldots$, such that the vertex set of $\mathcal{H}^{(t)}$'s is the set of $n$ bins, but (hyper)edges may change over time. In our model, the $t$-th ball chooses an edge from $\mathcal{H}^{(t)}$ uniformly at random, and then chooses a set $D$ of $d\ge 2$ random bins from the selected edge. The ball is allocated to a least-loaded bin from $D$, with ties broken randomly. We quantify the dynamicity of the model by introducing the notion of \emph{pair visibility}, which measures the number of rounds in which a pair of bins appears within a (hyper)edge. We prove that if, for some $\varepsilon>0$, a dynamic hypergraph has pair visibility at most $n^{1-\varepsilon}$, and some mild additional conditions hold, then with high probability the process has maximum load $O(\log_d\log n)$. Our proof is based on a variation of the witness tree technique, which is of independent interest. The model can also be seen as an adversarial model where an adversary decides the structure of the possible sets of $d$ bins available to each ball.

[33] arXiv:2006.07486 [pdf]

On Packing Low-Diameter Spanning Trees

Julia Chuzhoy, Merav Parter, Zihan Tan

Edge connectivity of a graph is one of the most fundamental graph-theoretic concepts. The celebrated tree packing theorem of Tutte and Nash-Williams from 1961 states that every $k$-edge connected graph $G$ contains a collection $\cal{T}$ of $\lfloor k/2 \rfloor$ edge-disjoint spanning trees, that we refer to as a tree packing; the diameter of the tree packing $\cal{T}$ is the largest diameter of any tree in $\cal{T}$. A desirable property of a tree packing, that is both sufficient and necessary for leveraging the high connectivity of a graph in distributed communication, is that its diameter is low. Yet, despite extensive research in this area, it is still unclear how to compute a tree packing, whose diameter is sublinear in $|V(G)|$, in a low-diameter graph $G$, or alternatively how to show that such a packing does not exist. In this paper we provide first non-trivial upper and lower bounds on the diameter of tree packing. First, we show that, for every $k$-edge connected $n$-vertex graph $G$ of diameter $D$, there is a tree packing $\cal{T}$ of size $\Omega(k)$, diameter $O((101k\log n)^D)$, that causes edge-congestion at most $2$. Second, we show that for every $k$-edge connected $n$-vertex graph $G$ of diameter $D$, the diameter of $G[p]$ is $O(k^{D(D+1)/2})$ with high probability, where $G[p]$ is obtained by sampling each edge of $G$ independently with probability $p=\Theta(\log n/k)$. This provides a packing of $\Omega(k/\log n)$ edge-disjoint trees of diameter at most $O(k^{(D(D+1)/2)})$ each. We then prove that these two results are nearly tight. Lastly, we show that if every pair of vertices in a graph has $k$ edge-disjoint paths of length at most $D$ connecting them, then there is a tree packing of size $k$, diameter $O(D\log n)$, causing edge-congestion $O(\log n)$. We also provide several applications of low-diameter tree packing in distributed computation.

[34] arXiv:2006.07340 [pdf]

Fourier Sparse Leverage Scores and Approximate Kernel Learning

Tamás Erdélyi, Cameron Musco, Christopher Musco

We prove new explicit upper bounds on the leverage scores of Fourier sparse functions under both the Gaussian and Laplace measures. In particular, we study $s$-sparse functions of the form $f(x) = \sum_{j=1}^s a_j e^{i \lambda_j x}$ for coefficients $a_j \in \mathbb{C}$ and frequencies $\lambda_j \in \mathbb{R}$. Bounding Fourier sparse leverage scores under various measures is of pure mathematical interest in approximation theory, and our work extends existing results for the uniform measure [Erd17,CP19a]. Practically, our bounds are motivated by two important applications in machine learning 1. Kernel Approximation. They yield a new random Fourier features algorithm for approximating Gaussian and Cauchy (rational quadratic) kernel matrices. For low-dimensional data, our method uses a near optimal number of features, and its runtime is polynomial in the $statistical\ dimension$ of the approximated kernel matrix. It is the first "oblivious sketching method" with this property for any kernel besides the polynomial kernel, resolving an open question of [AKM+17,AKK+20b]. 2. Active Learning. They can be used as non-uniform sampling distributions for robust active learning when data follows a Gaussian or Laplace distribution. Using the framework of [AKM+19], we provide essentially optimal results for bandlimited and multiband interpolation, and Gaussian process regression. These results generalize existing work that only applies to uniformly distributed data.

[35] arXiv:2006.07333 [pdf]

Targeting Learning Robust Statistics for Reproducible Research

Jeremy R. Coyle, Nima S. Hejazi, Ivana Malenica, Rachael V. Phillips, Benjamin F. Arnold, Andrew Mertens, Jade Benjamin-Chung, Weixin Cai, Sonali Dayal, John M. Colford Jr., Alan E. Hubbard, Mark J. van der Laan

Targeted Learning is a subfield of statistics that unifies advances in causal inference, machine learning and statistical theory to help answer scientifically impactful questions with statistical confidence. Targeted Learning is driven by complex problems in data science and has been implemented in a diversity of real-world scenarios observational studies with missing treatments and outcomes, personalized interventions, longitudinal settings with time-varying treatment regimes, survival analysis, adaptive randomized trials, mediation analysis, and networks of connected subjects. In contrast to the (mis)application of restrictive modeling strategies that dominate the current practice of statistics, Targeted Learning establishes a principled standard for statistical estimation and inference (i.e., confidence intervals and p-values). This multiply robust approach is accompanied by a guiding roadmap and a burgeoning software ecosystem, both of which provide guidance on the construction of estimators optimized to best answer the motivating question. The roadmap of Targeted Learning emphasizes tailoring statistical procedures so as to minimize their assumptions, carefully grounding them only in the scientific knowledge available. The end result is a framework that honestly reflects the uncertainty in both the background knowledge and the available data in order to draw reliable conclusions from statistical analyses - ultimately enhancing the reproducibility and rigor of scientific findings.

[36] arXiv:2006.07305 [pdf]

Reflection on modern methods Good practices for applied statistical learning in epidemiology

Yanelli Nunez, Elizabeth A. Gibson, Eva M. Tanner, Chris Gennings, Brent A.Coull, Jeff A. Goldsmith, Marianthi-Anna Kioumourtzoglou

Statistical learning (SL) includes methods that extract knowledge from complex data. SL methods beyond generalized linear models are being increasingly implemented in public health research and epidemiology because they can perform better in instances with complex or high-dimensional data---settings when traditional statistical methods fail. These novel methods, however, often include random sampling which may induce variability in results. Best practices in data science can help to ensure robustness. As a case study, we included four SL models that have been applied previously to analyze the relationship between environmental mixtures and health outcomes. We ran each model across 100 initializing values for random number generation, or "seeds," and assessed variability in resulting estimation and inference. All methods exhibited some seed-dependent variability in results. The degree of variability differed across methods and exposure of interest. Any SL method reliant on a random seed will exhibit some degree of seed sensitivity. We recommend that researchers repeat their analysis with various seeds as a sensitivity analysis when implementing these methods to enhance interpretability and robustness of results.

[37] arXiv:2006.07302 [pdf]

SMS in PACE 2020

Tuukka Korhonen

We describe SMS, our submission to the exact treedepth track of PACE 2020. SMS computes the treedepth of a graph by branching on the small minimal separators of the graph.

[38] arXiv:2006.07086 [pdf]

An Adaptive Approach to Recoverable Mutual Exlcusion

Sahil Dhoked, Neeraj Mittal

Mutual exclusion (ME) is one of the most commonly used techniques to handle conflicts in concurrent systems. Traditionally, mutual exclusion algorithms have been designed under the assumption that a process does not fail while acquiring/releasing a lock or while executing its critical section. However, failures do occur in real life, potentially leaving the lock in an inconsistent state. This gives rise to the problem of \emph{recoverable mutual exclusion (RME)} that involves designing a mutual exclusion algorithm that can tolerate failures, while maintaining safety and liveness properties. One of the important measures of performance of any ME algorithm, including an RME algorithm, is the number of \emph{remote memory references (RMRs)} made by a process (for acquiring and releasing a lock as well as recovering the lock structure after a failure). The best known RME algorithm solves the problem for $n$ processes in sub-logarithmic number of RMRs, given by $\mathcal{O}(\frac{\log n}{\log \log n})$, irrespective of the number of failures in the system. In this work, we present a new algorithm for solving the RME problem whose RMR complexity gradually \emph{adapts} to the number of failures that have occurred in the system "recently". In the absence of failures, our algorithm generates only $\mathcal{O}(1)$ RMRs. Furthermore, its RMR complexity is given by $\mathcal{O}(\min\{ \sqrt{F}, \frac{\log n}{\log \log n} \})$ where $F$ is the total number of failures in the "recent" past. In addition to read and write instructions, our algorithm uses compare-and-swap (\CAS{}) and fetch-and-store (\FAS{}) hardware instructions, both of which are commonly available in most modern processors.

[39] arXiv:2006.07057 [pdf]

Linear Time Sinkhorn Divergences using Positive Features

Meyer Scetbon, Marco Cuturi

Although Sinkhorn divergences are now routinely used in data sciences to compare probability distributions, the computational effort required to compute them remains expensive, growing in general quadratically in the size $n$ of the support of these distributions. Indeed, solving optimal transport (OT) with an entropic regularization requires computing a $n\times n$ kernel matrix (the neg-exponential of a $n\times n$ pairwise ground cost matrix) that is repeatedly applied to a vector. We propose to use instead ground costs of the form $c(x,y)=-\log\dotp{\varphi(x)}{\varphi(y)}$ where $\varphi$ is a map from the ground space onto the positive orthant $\RR^r_+$, with $r\ll n$. This choice yields, equivalently, a kernel $k(x,y)=\dotp{\varphi(x)}{\varphi(y)}$, and ensures that the cost of Sinkhorn iterations scales as $O(nr)$. We show that usual cost functions can be approximated using this form. Additionaly, we take advantage of the fact that our approach yields approximation that remain fully differentiable with respect to input distributions, as opposed to previously proposed adaptive low-rank approximations of the kernel matrix, to train a faster variant of OT-GAN \cite{salimans2018improving}.

[40] arXiv:2006.07050 [pdf]

Sallow -- a heuristic algorithm for treedepth decompositions

Marcin Wrochna

We describe a heuristic algorithm for computing treedepth decompositions, submitted for the PACE 2020 challenge. It relies on a variety of greedy algorithms computing elimination orderings, as well as a Divide & Conquer approach on balanced cuts obtained using a from-scratch reimplementation of the 2016 FlowCutter algorithm by Hamann & Strasser [ACM JEA 2018].

[41] arXiv:2006.07016 [pdf]

Distance-based phylogenetic inference from typing data a unifying view

Cátia Vaz, Marta Nascimento, João A. Carriço, Tatiana Rocher, Alexandre P. Francisco

Typing methods are widely used in the surveillance of infectious diseases, outbreaks investigation and studies of the natural history of an infection. And their use is becoming standard, in particular with the introduction of High Throughput Sequencing (HTS). On the other hand, the data being generated is massive and many algorithms have been proposed for phylogenetic analysis of typing data, addressing both correctness and scalability issues. Most of the distance-based algorithms for inferring phylogenetic trees follow the closest-pair joining scheme. This is one of the approaches used in hierarchical clustering. And although phylogenetic inference algorithms may seem rather different, the main difference among them resides on how one defines cluster proximity and on which optimization criterion is used. Both cluster proximity and optimization criteria rely often on a model of evolution. In this work we review, and we provide an unified view of these algorithms. This is an important step not only to better understand such algorithms, but also to identify possible computational bottlenecks and improvements, important to deal with large data sets.

[42] arXiv:2006.07013 [pdf]

A Unified Analysis of Stochastic Gradient Methods for Nonconvex Federated Optimization

Zhize Li, Peter Richtárik

In this paper, we study the performance of a large family of SGD variants in the smooth nonconvex regime. To this end, we propose a generic and flexible assumption capable of accurate modeling of the second moment of the stochastic gradient. Our assumption is satisfied by a large number of specific variants of SGD in the literature, including SGD with arbitrary sampling, SGD with compressed gradients, and a wide variety of variance-reduced SGD methods such as SVRG and SAGA. We provide a single convergence analysis for all methods that satisfy the proposed unified assumption, thereby offering a unified understanding of SGD variants in the nonconvex regime instead of relying on dedicated analyses of each variant. Moreover, our unified analysis is accurate enough to recover or improve upon the best-known convergence results of several classical methods, and also gives new convergence results for many new methods which arise as special cases. In the more general distributed/federated nonconvex optimization setup, we propose two new general algorithmic frameworks differing in whether direct gradient compression (DC) or compression of gradient differences (DIANA) is used. We show that all methods captured by these two frameworks also satisfy our unified assumption. Thus, our unified convergence analysis also captures a large variety of distributed methods utilizing compressed communication. Finally, we also provide a unified analysis for obtaining faster linear convergence rates in this nonconvex regime under the PL condition.

[43] arXiv:2006.06980 [pdf]

Robust Sub-Gaussian Principal Component Analysis and Width-Independent Schatten Packing

Arun Jambulapati, Jerry Li, Kevin Tian

We develop two methods for the following fundamental statistical task given an $\epsilon$-corrupted set of $n$ samples from a $d$-dimensional sub-Gaussian distribution, return an approximate top eigenvector of the covariance matrix. Our first robust PCA algorithm runs in polynomial time, returns a $1 - O(\epsilon\log\epsilon^{-1})$-approximate top eigenvector, and is based on a simple iterative filtering approach. Our second, which attains a slightly worse approximation factor, runs in nearly-linear time and sample complexity under a mild spectral gap assumption. These are the first polynomial-time algorithms yielding non-trivial information about the covariance of a corrupted sub-Gaussian distribution without requiring additional algebraic structure of moments. As a key technical tool, we develop the first width-independent solvers for Schatten-$p$ norm packing semidefinite programs, giving a $(1 + \epsilon)$-approximate solution in $O(p\log(\tfrac{nd}{\epsilon})\epsilon^{-1})$ input-sparsity time iterations (where $n$, $d$ are problem dimensions).

[44] arXiv:2006.06951 [pdf]

Planar Rectilinear Drawings of Outerplanar Graphs in Linear Time

Fabrizio Frati

We show how to test in linear time whether an outerplanar graph admits a planar rectilinear drawing, both if the graph has a prescribed plane embedding that the drawing has to respect and if it does not. Our algorithm returns a planar rectilinear drawing if the graph admits one.

[45] arXiv:2006.06856 [pdf]

Bandit-PAM Almost Linear Time $k$-Medoids Clustering via Multi-Armed Bandits

Mo Tiwari, Martin Jinye Zhang, James Mayclin, Sebastian Thrun, Chris Piech, Ilan Shomorony

Clustering is a ubiquitous task in data science. Compared to the commonly used $k$-means clustering algorithm, $k$-medoids clustering algorithms require the cluster centers to be actual data points and support arbitrary distance metrics, allowing for greater interpretability and the clustering of structured objects. Current state-of-the-art $k$-medoids clustering algorithms, such as Partitioning Around Medoids (PAM), are iterative and are quadratic in the dataset size $n$ for each iteration, being prohibitively expensive for large datasets. We propose Bandit-PAM, a randomized algorithm inspired by techniques from multi-armed bandits, that significantly improves the computational efficiency of PAM. We theoretically prove that Bandit-PAM reduces the complexity of each PAM iteration from $O(n^2)$ to $O(n \log n)$ and returns the same results with high probability, under assumptions on the data that often hold in practice. We empirically validate our results on several large-scale real-world datasets, including a coding exercise submissions dataset from this http URL, the 10x Genomics 68k PBMC single-cell RNA sequencing dataset, and the MNIST handwritten digits dataset. We observe that Bandit-PAM returns the same results as PAM while performing up to 200x fewer distance computations. The improvements demonstrated by Bandit-PAM enable $k$-medoids clustering on a wide range of applications, including identifying cell types in large-scale single-cell data and providing scalable feedback for students learning computer science online. We also release Python and C++ implementations of our algorithm.

[46] arXiv:2006.06850 [pdf]

List Learning with Attribute Noise

Mahdi Cheraghchi, Elena Grigorescu, Brendan Juba, Karl Wimmer, Ning Xie

We introduce and study the model of list learning with attribute noise. Learning with attribute noise was introduced by Shackelford and Volper (COLT 1988) as a variant of PAC learning, in which the algorithm has access to noisy examples and uncorrupted labels, and the goal is to recover an accurate hypothesis. Sloan (COLT 1988) and Goldman and Sloan (Algorithmica 1995) discovered information-theoretic limits to learning in this model, which have impeded further progress. In this article we extend the model to that of list learning, drawing inspiration from the list-decoding model in coding theory, and its recent variant studied in the context of learning. On the positive side, we show that sparse conjunctions can be efficiently list learned under some assumptions on the underlying ground-truth distribution. On the negative side, our results show that even in the list-learning model, efficient learning of parities and majorities is not possible regardless of the representation used.

[47] arXiv:2006.06843 [pdf]

Robust Optimization and Inference on Manifolds

Lizhen Lin, Drew Lazar, Bayan Sarpabayeva, David B. Dunson

We propose a robust and scalable procedure for general optimization and inference problems on manifolds leveraging the classical idea of `median-of-means' estimation. This is motivated by ubiquitous examples and applications in modern data science in which a statistical learning problem can be cast as an optimization problem over manifolds. Being able to incorporate the underlying geometry for inference while addressing the need for robustness and scalability presents great challenges. We address these challenges by first proving a key lemma that characterizes some crucial properties of geometric medians on manifolds. In turn, this allows us to prove robustness and tighter concentration of our proposed final estimator in a subsequent theorem. This estimator aggregates a collection of subset estimators by taking their geometric median over the manifold. We illustrate bounds on this estimator via calculations in explicit examples. The robustness and scalability of the procedure is illustrated in numerical examples on both simulated and real data sets.

[48] arXiv:2006.06771 [pdf]

Randomized Consensus with Regular Registers

Vassos Hadzilacos, Xing Hu, Sam Toueg

The well-known randomized consensus algorithm by Aspnes and Herlihy for asynchronous shared-memory systems was proved to work, even against a strong adversary, under the assumption that the registers that it uses are atomic registers. With atomic registers, every read or write operation is instantaneous (and thus indivisible). As pointed out by Golab et al. (2011), however, a randomized algorithm that works with atomic registers does not necessarily work if we replace the atomic registers that it uses with linearizable implementations of registers. This raises the following question does the randomized consensus algorithm by Aspnes and Herlihy still work against a strong adversary if we replace its atomic registers with linearizable registers? We show that the answer is affirmative, in fact, we show that even linearizable registers are not necessary. More precisely, we prove that the algorithm by Aspnes and Herlihy works against a strong adversary even if the algorithm uses only regular registers.

[49] arXiv:2006.06618 [pdf]

CoinPress Practical Private Mean and Covariance Estimation

Sourav Biswas, Yihe Dong, Gautam Kamath, Jonathan Ullman

We present simple differentially private estimators for the mean and covariance of multivariate sub-Gaussian data that are accurate at small sample sizes. We demonstrate the effectiveness of our algorithms both theoretically and empirically using synthetic and real-world datasets---showing that their asymptotic error rates match the state-of-the-art theoretical bounds, and that they concretely outperform all previous methods. Specifically, previous estimators either have weak empirical accuracy at small sample sizes, perform poorly for multivariate data, or require the user to provide strong a priori estimates for the parameters.

[50] arXiv:2006.06566 [pdf]

Optimally Deceiving a Learning Leader in Stackelberg Games

Georgios Birmpas, Jiarui Gan, Alexandros Hollender, Francisco J. Marmolejo-Cossío, Ninad Rajgopal, Alexandros A. Voudouris

Recent results in the ML community have revealed that learning algorithms used to compute the optimal strategy for the leader to commit to in a Stackelberg game, are susceptible to manipulation by the follower. Such a learning algorithm operates by querying the best responses or the payoffs of the follower, who consequently can deceive the algorithm by responding as if his payoffs were much different than what they actually are. For this strategic behavior to be successful, the main challenge faced by the follower is to pinpoint the payoffs that would make the learning algorithm compute a commitment so that best responding to it maximizes the follower's utility, according to his true payoffs. While this problem has been considered before, the related literature only focused on the simplified scenario in which the payoff space is finite, thus leaving the general version of the problem unanswered. In this paper, we fill in this gap, by showing that it is always possible for the follower to compute (near-)optimal payoffs for various scenarios about the learning interaction between leader and follower.

[51] arXiv:2006.06467 [pdf]

Learning Halfspaces with Tsybakov Noise

Ilias Diakonikolas, Vasilis Kontonis, Christos Tzamos, Nikos Zarifis

We study the efficient PAC learnability of halfspaces in the presence of Tsybakov noise. In the Tsybakov noise model, each label is independently flipped with some probability which is controlled by an adversary. This noise model significantly generalizes the Massart noise model, by allowing the flipping probabilities to be arbitrarily close to $1/2$ for a fraction of the samples. Our main result is the first non-trivial PAC learning algorithm for this problem under a broad family of structured distributions -- satisfying certain concentration and (anti-)anti-concentration properties -- including log-concave distributions. Specifically, we given an algorithm that achieves misclassification error $\epsilon$ with respect to the true halfspace, with quasi-polynomial runtime dependence in $1/\epsilin$. The only previous upper bound for this problem -- even for the special case of log-concave distributions -- was doubly exponential in $1/\epsilon$ (and follows via the naive reduction to agnostic learning). Our approach relies on a novel computationally efficient procedure to certify whether a candidate solution is near-optimal, based on semi-definite programming. We use this certificate procedure as a black-box and turn it into an efficient learning algorithm by searching over the space of halfspaces via online convex optimization.

[52] arXiv:2006.06427 [pdf]

Knowing your FATE Friendship, Action and Temporal Explanations for User Engagement Prediction on Social Apps

Xianfeng Tang, Yozen Liu, Neil Shah, Xiaolin Shi, Prasenjit Mitra, Suhang Wang

With the rapid growth and prevalence of social network applications (Apps) in recent years, understanding user engagement has become increasingly important, to provide useful insights for future App design and development. While several promising neural modeling approaches were recently pioneered for accurate user engagement prediction, their black-box designs are unfortunately limited in model explainability. In this paper, we study a novel problem of explainable user engagement prediction for social network Apps. First, we propose a flexible definition of user engagement for various business scenarios, based on future metric expectations. Next, we design an end-to-end neural framework, FATE, which incorporates three key factors that we identify to influence user engagement, namely friendships, user actions, and temporal dynamics to achieve explainable engagement predictions. FATE is based on a tensor-based graph neural network (GNN), LSTM and a mixture attention mechanism, which allows for (a) predictive explanations based on learned weights across different feature categories, (b) reduced network complexity, and (c) improved performance in both prediction accuracy and training/inference time. We conduct extensive experiments on two large-scale datasets from Snapchat, where FATE outperforms state-of-the-art approaches by ${\approx}10\%$ error and ${\approx}20\%$ runtime reduction. We also evaluate explanations from FATE, showing strong quantitative and qualitative performance.

[53] arXiv:2006.06380 [pdf]

Pointer Graph Networks

Petar Veličković, Lars Buesing, Matthew C. Overlan, Razvan Pascanu, Oriol Vinyals, Charles Blundell

Graph neural networks (GNNs) are typically applied to static graphs that are assumed to be known upfront. This static input structure is often informed purely by insight of the machine learning practitioner, and might not be optimal for the actual task the GNN is solving. In absence of reliable domain expertise, one might resort to inferring the latent graph structure, which is often difficult due to the vast search space of possible graphs. Here we introduce Pointer Graph Networks (PGNs) which augment sets or graphs with additional inferred edges for improved model expressivity. PGNs allow each node to dynamically point to another node, followed by message passing over these pointers. The sparsity of this adaptable graph structure makes learning tractable while still being sufficiently expressive to simulate complex algorithms. Critically, the pointing mechanism is directly supervised to model long-term sequences of operations on classical data structures, incorporating useful structural inductive biases from theoretical computer science. Qualitatively, we demonstrate that PGNs can learn parallelisable variants of pointer-based data structures, namely disjoint set unions and link/cut trees. PGNs generalise out-of-distribution to 5x larger test inputs on dynamic graph connectivity tasks, outperforming unrestricted GNNs and Deep Sets.

[54] arXiv:2006.06298 [pdf]

New Interpretable Statistics for Large Scale Structure Analysis and Generation

E. Allys, T. Marchand, J.-F. Cardoso, F. Villaescusa-Navarro, S. Ho, S. Mallat

This paper introduces the Wavelet Phase Harmonics (WPH) statistics. They are interpretable low-dimensional statistics which describe 2D non-Gaussian density fields. These statistics are built from WPH moments, which have been recently introduced in data science and machine learning community. In this paper, we applied WPH statistics to projected matter density fields from the Quijote N-body simulations. We find by computing the Fisher information matrix, that the WPH statistics can place more stringent constraints on 5 cosmological parameters when compared to the combination of power-spectrum and bi-spectrum. We also use the WPH statistics to successfully generate from a maximum entropy model new 2D density fields that reproduce the PDF, mean, power-spectrum, bispectrum and the Minkowski functionals of the input density fields. While separate methods have been proven very efficient for parameter estimations and statistical syntheses for for large scale structure, WPH statistics are the first statistics that can both achieve a more stringent cosmological parameter constraint, and produce a sufficiently accurate simulation of the Universe while being interpretable.

[55] arXiv:2006.06171 [pdf]

A General Framework for Analyzing Stochastic Dynamics in Learning Algorithms

Chi-Ning Chou, Mien Brabeeba Wang, Tiancheng Yu

We present a general framework for analyzing high-probability bounds for stochastic dynamics in learning algorithms. Our framework composes standard techniques such as a stopping time, a martingale concentration and a closed-from solution to give a streamlined three-step recipe with a general and flexible principle to implement it. To demonstrate the power and the flexibility of our framework, we apply the framework on three very different learning problems stochastic gradient descent for strongly convex functions, streaming principal component analysis and linear bandit with stochastic gradient descent updates. We improve the state of the art bounds on all three dynamics.

[56] arXiv:2006.06067 [pdf]

Treewidth versus clique number in graph classes with a forbidden structure

Clément Dallard, Martin Milanič, Kenny Štorgel

Treewidth is an important graph invariant, relevant for both structural and algorithmic reasons. A necessary condition for a graph class to have bounded treewidth is the absence of large cliques. We study graph classes in which this condition is also sufficient, which we call $(tw,\omega)$-bounded. Such graph classes are known to have useful algorithmic applications related to variants of the clique and $k$-coloring problems. We consider six well-known graph containment relations the minor, topological minor, subgraph, induced minor, induced topological minor, and induced subgraph relations. For each of them, we give a complete characterization of the graphs $H$ for which the class of graphs excluding $H$ is $(tw,\omega)$-bounded. Our results imply that the class of $1$-perfectly orientable graphs is $(tw,\omega)$-bounded, answering a question of Brešar, Hartinger, Kos and Milanič from 2018. We also reveal some further algorithmic implications of $(tw,\omega)$-boundedness related to list $k$-coloring and clique problems.

[57] arXiv:2006.06052 [pdf]

Accelerating linear solvers for large-scale Stokes problems with C++ metaprogramming

Denis Demidov, Lin Mu, Bin Wang

Ability to solve large sparse linear systems of equations is very important in modern numerical methods. Creating a solver with a user-friendly interface that can work in many specific scenarios is a challenging task. We describe the C ++ programming techniques that can help in creating flexible and extensible programming interfaces for linear solvers. The approach is based on policy-based design and partial template specialization, and is implemented in the open source AMGCL library. Convenience for the user and efficiency is demonstrated on the example of accelerating a large-scale Stokes problem solution with a Schur pressure correction preconditioner. The user may select algorithmic components of the solver by adjusting template parameters without any change to the codebase. It is also possible to switch to block values, or use mixed precision solution, which results in up to 4 times speedup, and reduces the memory footprint of the algorithm by about 50%.

[58] arXiv:2006.06048 [pdf]

Efficient Partial Snapshot Implementations

Nikolaos D. Kallimanis, Eleni Kanellou, Charidimos Kiosterakis

In this work, we propose the $\lambda$-scanner snapshot, a variation of the snapshot object, which supports any fixed amount of $0 In this work, we first provide a simple single-scanner version of $\lambda-Snap$, which is called $1-Snap$. We provide $1-Snap$ just for presentation purposes, since it is simpler than $\lambda-Snap$. The $UPDATE$ in $1-Snap$ has a step complexity of $O(1)$, while the $SCAN$ has a step complexity of $O(m)$. This implementation uses $O(m)$ $CAS$ registers.

[59] arXiv:2006.05976 [pdf]

Composite Logconcave Sampling with a Restricted Gaussian Oracle

Ruoqi Shen, Kevin Tian, Yin Tat Lee

We consider sampling from composite densities on $\mathbb{R}^d$ of the form $d\pi(x) \propto \exp(-f(x) - g(x))dx$ for well-conditioned $f$ and convex (but possibly non-smooth) $g$, a family generalizing restrictions to a convex set, through the abstraction of a restricted Gaussian oracle. For $f$ with condition number $\kappa$, our algorithm runs in $O \left(\kappa^2 d \log^2\tfrac{\kappa d}{\epsilon}\right)$ iterations, each querying a gradient of $f$ and a restricted Gaussian oracle, to achieve total variation distance $\epsilon$. The restricted Gaussian oracle, which draws samples from a distribution whose negative log-likelihood sums a quadratic and $g$, has been previously studied and is a natural extension of the proximal oracle used in composite optimization. Our algorithm is conceptually simple and obtains stronger provable guarantees and greater generality than existing methods for composite sampling. We conduct experiments showing our algorithm vastly improves upon the hit-and-run algorithm for sampling the restriction of a (non-diagonal) Gaussian to the positive orthant.

[60] arXiv:2006.05871 [pdf]

Tailoring r-index for metagenomics

Dustin Cobas, Veli Mäkinen, Massimiliano Rossi

A basic problem in metagenomics is to assign a sequenced read to the correct species in the reference collection. In typical applications in genomic epidemiology and viral metagenomics the reference collection consists of set of species with each species represented by its highly similar strains. It has been recently shown that accurate read assignment can be achieved with $k$-mer hashing-based pseudoalignment A read is assigned to species A if each of its $k$-mer hits to reference collection is located only on strains of A. We study the underlying primitives required in pseudoalignment and related tasks. We propose three space-efficient solutions building upon the document listing with frequencies problem. All the solutions use an $r$-index (Gagie et al., SODA 2018) as an underlying index structure for the text obtained as concatenation of the set of species, as well as for each species. Given $t$ species whose concatenation length is $n$, and whose Burrows-Wheeler transform contains $r$ runs, our first solution, based on a grammar-compressed document array with precomputed queries at non terminal symbols, reports the frequencies for the ${\tt ndoc}$ distinct documents in which the pattern of length $m$ occurs in ${\cal O}(m + \log(n){\tt ndoc}) $ time. Our second solution is also based on a grammar-compressed document array, but enhanced with bitvectors and reports the frequencies in ${\cal O}(m + ((t/w)\log n + \log(n/r)){\tt ndoc})$ time, over a machine with wordsize $w$. Our third solution, based on the interleaved LCP array, answers the same query in ${\cal O}(m + \log(n/r){\tt ndoc})$. We implemented our solutions and tested them on real-world and synthetic datasets. The results show that all the solutions are fast on highly-repetitive data, and the size overhead introduced by the indexes are comparable with the size of the $r$-index.

[61] arXiv:2006.05850 [pdf]

Sliding Window Algorithms for k-Clustering Problems

Michele Borassi, Alessandro Epasto, Silvio Lattanzi, Sergei Vassilvitskii, Morteza Zadimoghaddam

The sliding window model of computation captures scenarios in which data is arriving continuously, but only the latest $w$ elements should be used for analysis. The goal is to design algorithms that update the solution efficiently with each arrival rather than recomputing it from scratch. In this work, we focus on $k$-clustering problems such as $k$-means and $k$-median. In this setting, we provide simple and practical algorithms that offer stronger performance guarantees than previous results. Empirically, we show that our methods store only a small fraction of the data, are orders of magnitude faster, and find solutions with costs only slightly higher than those returned by algorithms with access to the full dataset.

[62] arXiv:2006.05828 [pdf]

Introducing Structure to Expedite Quantum Search

Marcin Briański, Jan Gwinner, Vladyslav Hlembotskyi, Witold Jarnicki, Szymon Pliś, Adam Szady

We present a novel quantum algorithm for solving the unstructured search problem with one marked element. Our algorithm allows generating quantum circuits that use asymptotically fewer additional quantum gates than the famous Grover's algorithm and may be successfully executed on NISQ devices. We prove that our algorithm is optimal in the total number of elementary gates up to a multiplicative constant. As many NP-hard problems are not in fact unstructured, we also describe the \emph{partial uncompute} technique which exploits the oracle structure and allows a significant reduction in the number of elementary gates required to find the solution. Combining these results allows us to use asymptotically smaller number of elementary gates than the Grover's algorithm in various applications, keeping the number of queries to the oracle essentially the same. We show how the results can be applied to solve hard combinatorial problems, for example Unique k-SAT. Additionally, we show how to asymptotically reduce the number of elementary gates required to solve the unstructured search problem with multiple marked elements.

[63] arXiv:2006.05757 [pdf]

Data science on industrial data -- Today's challenges in brown field applications

Tilman Klaeger, Sebastian Gottschall, Lukas Oehm

Much research is done on data analytics and machine learning. In industrial processes large amounts of data are available and many researchers are trying to work with this data. In practical approaches one finds many pitfalls restraining the application of modern technologies especially in brown field applications. With this paper we want to show state of the art and what to expect when working with stock machines in the field. A major focus in this paper is on data collection which can be more cumbersome than most people might expect. Also data quality for machine learning applications is a challenge once leaving the laboratory. In this area one has to expect the lack of semantic description of the data as well as very little ground truth being available for training and verification of machine learning models. A last challenge is IT security and passing data through firewalls.

[64] arXiv:2006.05740 [pdf]

An Asymptotically Optimal Algorithm for Online Stacking

Martin Olsen, Allan Gross

Consider a storage area where arriving items are stored temporarily in bounded capacity stacks until their departure. We look into the problem of deciding where to put an arriving item with the objective of minimizing the maximum number of stacks used over time. The decision has to be made as soon as an item arrives, and we assume that we only have information on the departure times for the arriving item and the items currently at the storage area. We are only allowed to put an item on top of another item if the item below departs at a later time. We refer to this problem as online stacking. We assume that the storage time intervals are picked i.i.d. from $[0, 1] \times [0, 1]$ using an unknown distribution with a bounded probability density function. Under this mild condition, we present a simple polynomial time online algorithm and show that the competitive ratio converges to $1$ in probability. The result holds if the stack capacity is $o(\sqrt{n})$, where $n$ is the number of items, including the realistic case where the capacity is a constant. Our experiments show that our results also have practical relevance. To the best of our knowledge, we are the first to present an asymptotically optimal algorithm for online stacking, which is an important problem with many real-world applications within computational logistics.

[65] arXiv:2006.05685 [pdf]

Noisy polynomial interpolation modulo prime powers

Marek Karpinski, Igor Shparlinski

We consider the {\it noisy polynomial interpolation problem\/} of recovering an unknown $s$-sparse polynomial $f(X)$ over the ring $\mathbb Z_{p^k}$ of residues modulo $p^k$, where $p$ is a small prime and $k$ is a large integer parameter, from approximate values of the residues of $f(t) \in \mathbb Z_{p^k}$. Similar results are known for residues modulo a large prime $p$, however the case of prime power modulus $p^k$, with small $p$ and large $k$, is new and requires different techniques. We give a deterministic polynomials time algorithm, which for almost given more than a half bits of $f(t)$ for sufficiently many randomly chosen points $t \in \mathbb Z_{p^k}^*$, recovers $f(X)$.

[66] arXiv:2006.05660 [pdf]

The nearest-colattice algorithm

Thomas Espitau, Paul Kirchner

In this work, we exhibit a hierarchy of polynomial time algorithms solving approximate variants of the Closest Vector Problem (CVP). Our first contribution is a heuristic algorithm achieving the same distance tradeoff as HSVP algorithms, namely $\approx \beta^{\frac{n}{2\beta}}\textrm{covol}(\Lambda)^{\frac{1}{n}}$ for a random lattice $\Lambda$ of rank $n$. Compared to the so-called Kannan's embedding technique, our algorithm allows using precomputations and can be used for efficient batch CVP instances. This implies that some attacks on lattice-based signatures lead to very cheap forgeries, after a precomputation. Our second contribution is a proven reduction from approximating the closest vector with a factor $\approx n^{\frac32}\beta^{\frac{3n}{2\beta}}$ to the Shortest Vector Problem (SVP) in dimension $\beta$.

[67] arXiv:2006.05592 [pdf]

Node Embeddings and Exact Low-Rank Representations of Complex Networks

Sudhanshu Chanpuriya, Cameron Musco, Konstantinos Sotiropoulos, Charalampos E. Tsourakakis

Low-dimensional embeddings, from classical spectral embeddings to modern neural-net-inspired methods, are a cornerstone in the modeling and analysis of complex networks. Recent work by Seshadhri et al. (PNAS 2020) suggests that such embeddings cannot capture local structure arising in complex networks. In particular, they show that any network generated from a natural low-dimensional model cannot be both sparse and have high triangle density (high clustering coefficient), two hallmark properties of many real-world networks. In this work we show that the results of Seshadhri et al. are intimately connected to the model they use rather than the low-dimensional structure of complex networks. Specifically, we prove that a minor relaxation of their model can generate sparse graphs with high triangle density. Surprisingly, we show that this same model leads to exact low-dimensional factorizations of many real-world networks. We give a simple algorithm based on logistic principal component analysis (LPCA) that succeeds in finding such exact embeddings. Finally, we perform a large number of experiments that verify the ability of very low-dimensional embeddings to capture local structure in real-world networks.

[68] arXiv:2006.05573 [pdf]

Global Data Science Project for COVID-19 Summary Report

Dario Garcia-Gasulla, Sergio Alvarez Napagao, Irene Li, Hiroshi Maruyama, Hiroki Kanezashi, Raquel P'erez-Arnal, Kunihiko Miyoshi, Euma Ishii, Keita Suzuki, Sayaka Shiba, Mariko Kurokawa, Yuta Kanzawa, Naomi Nakagawa, Masatoshi Hanai, Yixin Li, Tianxiao Li

This paper aims at providing the summary of the Global Data Science Project (GDSC) for COVID-19. as on May 31 2020. COVID-19 has largely impacted on our societies through both direct and indirect effects transmitted by the policy measures to counter the spread of viruses. We quantitatively analysed the multifaceted impacts of the COVID-19 pandemic on our societies including people's mobility, health, and social behaviour changes. People's mobility has changed significantly due to the implementation of travel restriction and quarantine measurements. Indeed, the physical distance has widened at international (cross-border), national and regional level. At international level, due to the travel restrictions, the number of international flights has plunged overall at around 88 percent during March. In particular, the number of flights connecting Europe dropped drastically in mid of March after the United States announced travel restrictions to Europe and the EU and participating countries agreed to close borders, at 84 percent decline compared to March 10th. Similarly, we examined the impacts of quarantine measures in the major city Tokyo (Japan), New York City (the United States), and Barcelona (Spain). Within all three cities, we found the significant decline in traffic volume. We also identified the increased concern for mental health through the analysis of posts on social networking services such as Twitter and Instagram. Notably, in the beginning of April 2020, the number of post with #depression on Instagram doubled, which might reflect the rise in mental health awareness among Instagram users. Besides, we identified the changes in a wide range of people's social behaviors, as well as economic impacts through the analysis of Instagram data and primary survey data.

[69] arXiv:2006.05490 [pdf]

Sublinear Algorithms and Lower Bounds for Metric TSP Cost Estimation

Yu Chen, Sampath Kannan, Sanjeev Khanna

We consider the problem of designing sublinear time algorithms for estimating the cost of a minimum metric traveling salesman (TSP) tour. Specifically, given access to a $n \times n$ distance matrix $D$ that specifies pairwise distances between $n$ points, the goal is to estimate the TSP cost by performing only sublinear (in the size of $D$) queries. For the closely related problem of estimating the weight of a metric minimum spanning tree (MST), it is known that for any $\varepsilon > 0$, there exists an $\tilde{O}(n/\varepsilon^{O(1)})$ time algorithm that returns a $(1 + \varepsilon)$-approximate estimate of the MST cost. This result immediately implies an $\tilde{O}(n/\varepsilon^{O(1)})$ time algorithm to estimate the TSP cost to within a $(2 + \varepsilon)$ factor for any $\varepsilon > 0$. However, no $o(n^2)$ time algorithms are known to approximate metric TSP to a factor that is strictly better than $2$. On the other hand, there were also no known barriers that rule out the existence of $(1 + \varepsilon)$-approximate estimation algorithms for metric TSP with $\tilde{O}(n)$ time for any fixed $\varepsilon > 0$. In this paper, we make progress on both algorithms and lower bounds for estimating metric TSP cost. We also show that the problem of estimating metric TSP cost is closely connected to the problem of estimating the size of a maximum matching in a graph.

[70] arXiv:2006.05440 [pdf]

On Coresets For Regularized Regression

Rachit Chhaya, Anirban Dasgupta, Supratim Shit

We study the effect of norm based regularization on the size of coresets for regression problems. Specifically, given a matrix $ \mathbf{A} \in {\mathbb{R}}^{n \times d}$ with $n\gg d$ and a vector $\mathbf{b} \in \mathbb{R} ^ n $ and $\lambda > 0$, we analyze the size of coresets for regularized versions of regression of the form $\|\mathbf{Ax}-\mathbf{b}\|_p^r + \lambda\|{\mathbf{x}}\|_q^s$ . Prior work has shown that for ridge regression (where $p,q,r,s=2$) we can obtain a coreset that is smaller than the coreset for the unregularized counterpart i.e. least squares regression (Avron et al.). We show that when $r \neq s$, no coreset for regularized regression can have size smaller than the optimal coreset of the unregularized version. The well known lasso problem falls under this category and hence does not allow a coreset smaller than the one for least squares regression. We propose a modified version of the lasso problem and obtain for it a coreset of size smaller than the least square regression. We empirically show that the modified version of lasso also induces sparsity in solution, similar to the original lasso. We also obtain smaller coresets for $\ell_p$ regression with $\ell_p$ regularization. We extend our methods to multi response regularized regression. Finally, we empirically demonstrate the coreset performance for the modified lasso and the $\ell_1$ regression with $\ell_1$ regularization.

[71] arXiv:2006.05104 [pdf]

Faster Queries on BWT-runs Compressed Indexes

Takaaki Nishimoto, Yasuo Tabei

Although a significant number of compressed indexes for highly repetitive strings have been proposed thus far, developing compressed indexes that support faster queries remains a challenge. Run-length Burrows-Wheeler transform (RLBWT) is a lossless data compression by a reversible permutation of an input string and run-length encoding, and it has become a popular research topic in string processing. Recently, Gagie et al. presented r-index, an efficient compressed index on RLBWT whose space usage does not depend on text length. In this paper, we present a new compressed index on RLBWT, which we call r-index-f, in which r-index is improved for faster locate queries. We introduce a novel division of RLBWT into blocks, which we call balanced BWT-sequence as follows the RLBWT of a string is divided into several blocks, and a parent-child relationship between each pair of blocks is defined. In addition, we present a novel backward search algorithm on the balanced BWT-sequences, resulting in faster locate queries of r-index-f. We also present new algorithms for solving the queries of count query, extract query, decompression and prefix search on r-index-f.

[72] arXiv:2006.05051 [pdf]

Constrained episodic reinforcement learning in concave-convex and knapsack settings

Kianté Brantley, Miroslav Dudik, Thodoris Lykouris, Sobhan Miryoosefi, Max Simchowitz, Aleksandrs Slivkins, Wen Sun

We propose an algorithm for tabular episodic reinforcement learning with constraints. We provide a modular analysis with strong theoretical guarantees for settings with concave rewards and convex constraints, and for settings with hard constraints (knapsacks). Most of the previous work in constrained reinforcement learning is limited to linear constraints, and the remaining work focuses on either the feasibility question or settings with a single episode. Our experiments demonstrate that the proposed algorithm significantly outperforms these approaches in existing constrained episodic environments.

[73] arXiv:2006.05028 [pdf]

Online Page Migration with ML Advice

Piotr Indyk, Frederik Mallmann-Trenn, Slobodan Mitrović, Ronitt Rubinfeld

We consider online algorithms for the {\em page migration problem} that use predictions, potentially imperfect, to improve their performance. The best known online algorithms for this problem, due to Westbrook'94 and Bienkowski et al'17, have competitive ratios strictly bounded away from 1. In contrast, we show that if the algorithm is given a prediction of the input sequence, then it can achieve a competitive ratio that tends to $1$ as the prediction error rate tends to $0$. Specifically, the competitive ratio is equal to $1+O(q)$, where $q$ is the prediction error rate. We also design a ``fallback option'' that ensures that the competitive ratio of the algorithm for {\em any} input sequence is at most $O(1/q)$. Our result adds to the recent body of work that uses machine learning to improve the performance of ``classic'' algorithms.

[74] arXiv:2006.05003 [pdf]

Universal Vector Neural Machine Translation With Effective Attention

Satish Mylapore, Ryan Quincy Paul, Joshua Yi, Robert D. Slater

Neural Machine Translation (NMT) leverages one or more trained neural networks for the translation of phrases. Sutskever introduced a sequence to sequence based encoder-decoder model which became the standard for NMT based systems. Attention mechanisms were later introduced to address the issues with the translation of long sentences and improving overall accuracy. In this paper, we propose a singular model for Neural Machine Translation based on encoder-decoder models. Most translation models are trained as one model for one translation. We introduce a neutral/universal model representation that can be used to predict more than one language depending on the source and a provided target. Secondly, we introduce an attention model by adding an overall learning vector to the multiplicative model. With these two changes, by using the novel universal model the number of models needed for multiple language translation applications are reduced.

[75] arXiv:2006.04940 [pdf]

An Improved and Parallel Version of a Scalable Algorithm for Analyzing Time Series Data

Andreas Vitalis

Today, very large amounts of data are produced and stored in all branches of society including science. Mining these data meaningfully has become a considerable challenge and is of the broadest possible interest. The size, both in numbers of observations and dimensionality thereof, requires data mining algorithms to possess time complexities with both variables that are linear or nearly linear. One such algorithm, see Comput. Phys. Commun. 184, 2446-2453 (2013), arranges observations into a sequence called the progress index. The progress index steps through distinct regions of high sampling density sequentially. By means of suitable annotations, it allows a compact representation of the behavior of complex systems, which is encoded in the original data set. The only essential parameter is a notion of distance between observations. Here, we present the shared memory parallelization of the key step in constructing the progress index, which is the calculation of an approximation of the minimum spanning tree of the complete graph of observations. We demonstrate that excellent parallel efficiencies are obtained for up to 72 logical (CPU) cores. In addition, we introduce three conceptual advances to the algorithm that improve its controllability and the interpretability of the progress index itself.

[76] arXiv:2006.04933 [pdf]

A New Integer Programming Formulation of the Graphical Traveling Salesman Problem

Robert D. Carr, Neil Simonetti

In the Traveling Salesman Problem (TSP), a salesman wants to visit a set of cities and return home. There is a cost $c_{ij}$ of traveling from city $i$ to city $j$, which is the same in either direction for the Symmetric TSP. The objective is to visit each city exactly once, minimizing total travel costs. In the Graphical TSP, a city may be visited more than once, which may be necessary on a sparse graph. We present a new integer programming formulation for the Graphical TSP requiring only two classes of constraints that are either polynomial in number or polynomially separable, while addressing an open question proposed by Denis Naddef.

[77] arXiv:2006.04868 [pdf]

DoubleU-Net A Deep Convolutional Neural Network for Medical Image Segmentation

Debesh Jha, Michael A. Riegler, Dag Johansen, Pål Halvorsen, Håvard D. Johansen

Semantic image segmentation is the process of labeling each pixel of an image with its corresponding class. An encoder-decoder based approach, like U-Net and its variants, is a popular strategy for solving medical image segmentation tasks. To improve the performance of U-Net on various segmentation tasks, we propose a novel architecture called DoubleU-Net, which is a combination of two U-Net architectures stacked on top of each other. The first U-Net uses a pre-trained VGG-19 as the encoder, which has already learned features from ImageNet and can be transferred to another task easily. To capture more semantic information efficiently, we added another U-Net at the bottom. We also adopt Atrous Spatial Pyramid Pooling (ASPP) to capture contextual information within the network. We have evaluated DoubleU-Net using four medical segmentation datasets, covering various imaging modalities such as colonoscopy, dermoscopy, and microscopy. Experiments on the MICCAI 2015 segmentation challenge, the CVC-ClinicDB, the 2018 Data Science Bowl challenge, and the Lesion boundary segmentation datasets demonstrate that the DoubleU-Net outperforms U-Net and the baseline models. Moreover, DoubleU-Net produces more accurate segmentation masks, especially in the case of the CVC-ClinicDB and MICCAI 2015 segmentation challenge datasets, which have challenging images such as smaller and flat polyps. These results show the improvement over the existing U-Net model. The encouraging results, produced on various medical image segmentation datasets, show that DoubleU-Net can be used as a strong baseline for both medical image segmentation and cross-dataset evaluation testing to measure the generalizability of Deep Learning (DL) models.

[78] arXiv:2006.04787 [pdf]

Classification Under Misspecification Halfspaces, Generalized Linear Models, and Connections to Evolvability

Sitan Chen, Frederic Koehler, Ankur Moitra, Morris Yau

In this paper we revisit some classic problems on classification under misspecification. In particular, we study the problem of learning halfspaces under Massart noise with rate $\eta$. In a recent work, Diakonikolas, Goulekakis, and Tzamos resolved a long-standing problem by giving the first efficient algorithm for learning to accuracy $\eta + \epsilon$ for any $\epsilon > 0$. However, their algorithm outputs a complicated hypothesis, which partitions space into $\text{poly}(d,1/\epsilon)$ regions. Here we give a much simpler algorithm and in the process resolve a number of outstanding open questions (1) We give the first proper learner for Massart halfspaces that achieves $\eta + \epsilon$. We also give improved bounds on the sample complexity achievable by polynomial time algorithms. (2) Based on (1), we develop a blackbox knowledge distillation procedure to convert an arbitrarily complex classifier to an equally good proper classifier. (3) By leveraging a simple but overlooked connection to evolvability, we show any SQ algorithm requires super-polynomially many queries to achieve $\mathsf{OPT} + \epsilon$. Moreover we study generalized linear models where $\mathbb{E}[Y|\mathbf{X}] = \sigma(\langle \mathbf{w}^*, \mathbf{X}\rangle)$ for any odd, monotone, and Lipschitz function $\sigma$. This family includes the previously mentioned halfspace models as a special case, but is much richer and includes other fundamental models like logistic regression. We introduce a challenging new corruption model that generalizes Massart noise, and give a general algorithm for learning in this setting. Our algorithms are based on a small set of core recipes for learning to classify in the presence of misspecification. Finally we study our algorithm for learning halfspaces under Massart noise empirically and find that it exhibits some appealing fairness properties.

[79] arXiv:2006.04778 [pdf]

Fair Classification with Noisy Protected Attributes

L. Elisa Celis, Lingxiao Huang, Nisheeth K. Vishnoi

Due to the growing deployment of classification algorithms in various social contexts, developing methods that are fair with respect to protected attributes such as gender or race is an important problem. However, the information about protected attributes in datasets may be inaccurate due to either issues with data collection or when the protected attributes used are themselves predicted by algorithms. Such inaccuracies can prevent existing fair classification algorithms from achieving desired fairness guarantees. Motivated by this, we study fair classification problems when the protected attributes in the data may be ``noisy''. In particular, we consider a noise model where any protected type may be flipped to another with some fixed probability. We propose a ``denoised'' fair optimization formulation that can incorporate very general fairness goals via a set of constraints, mitigates the effects of such noise perturbations, and comes with provable guarantees. Empirically, we show that our framework can lead to near-perfect statistical parity with only a slight loss in accuracy for significant noise levels.

[80] arXiv:2006.04704 [pdf]

Fully Dynamic Algorithm for Constrained Submodular Optimization

Silvio Lattanzi, Slobodan Mitrović, Ashkan Norouzi-Fard, Jakub Tarnawski, Morteza Zadimoghaddam

The task of maximizing a monotone submodular function under a cardinality constraint is at the core of many machine learning and data mining applications, including data summarization, sparse regression and coverage problems. We study this classic problem in the fully dynamic setting, where elements can be both inserted and removed. Our main result is a randomized algorithm that maintains an efficient data structure with a poly-logarithmic amortized update time and yields a $(1/2-\epsilon)$-approximate solution. We complement our theoretical analysis with an empirical study of the performance of our algorithm.

You can also browse papers in other categories.