[1] arXiv:2006.07364 [pdf][2] arXiv:2006.07331 [pdf][3] arXiv:2006.07327 [pdf][4] arXiv:2006.07301 [pdf][5] arXiv:2006.07262 [pdf][6] arXiv:2006.07185 [pdf][7] arXiv:2006.07152 [pdf][8] arXiv:2006.07137 [pdf][9] arXiv:2006.07115 [pdf][10] arXiv:2006.07046 [pdf][11] arXiv:2006.07043 [pdf][12] arXiv:2006.07042 [pdf][13] arXiv:2006.07026 [pdf][14] arXiv:2006.07024 [pdf][15] arXiv:2006.07021 [pdf][16] arXiv:2006.07013 [pdf][17] arXiv:2006.06997 [pdf][18] arXiv:2006.06972 [pdf][19] arXiv:2006.06969 [pdf][20] arXiv:2006.06954 [pdf][21] arXiv:2006.06945 [pdf][22] arXiv:2006.06850 [pdf][23] arXiv:2006.06821 [pdf][24] arXiv:2006.06780 [pdf][25] arXiv:2006.06743 [pdf][26] arXiv:2006.06733 [pdf][27] arXiv:2006.06728 [pdf][28] arXiv:2006.06676 [pdf][29] arXiv:2006.06657 [pdf][30] arXiv:2006.06629 [pdf][31] arXiv:2006.06587 [pdf][32] arXiv:2006.06562 [pdf][33] arXiv:2006.06500 [pdf][34] arXiv:2006.06462 [pdf][35] arXiv:2006.06457 [pdf][36] arXiv:2006.06455 [pdf][37] arXiv:2006.06443 [pdf][38] arXiv:2006.06438 [pdf][39] arXiv:2006.06365 [pdf][40] arXiv:2006.06359 [pdf][41] arXiv:2006.06357 [pdf][42] arXiv:2006.06355 [pdf][43] arXiv:2006.06186 [pdf][44] arXiv:2006.06183 [pdf][45] arXiv:2006.06164 [pdf][46] arXiv:2006.06135 [pdf][47] arXiv:2006.06134 [pdf][48] arXiv:2006.06082 [pdf][49] arXiv:2006.05978 [pdf][50] arXiv:2006.05961 [pdf][51] arXiv:2006.05720 [pdf][52] arXiv:2006.05697 [pdf][53] arXiv:2006.05609 [pdf][54] arXiv:2006.05604 [pdf][55] arXiv:2006.05576 [pdf][56] arXiv:2006.05509 [pdf][57] arXiv:2006.05493 [pdf][58] arXiv:2006.05467 [pdf][59] arXiv:2006.05440 [pdf][60] arXiv:2006.05419 [pdf][61] arXiv:2006.05389 [pdf][62] arXiv:2006.05252 [pdf][63] arXiv:2006.05163 [pdf][64] arXiv:2006.05145 [pdf][65] arXiv:2006.05132 [pdf][66] arXiv:2006.05097 [pdf][67] arXiv:2006.05078 [pdf][68] arXiv:2006.05057 [pdf][69] arXiv:2006.05022 [pdf][70] arXiv:2006.04976 [pdf][71] arXiv:2006.04893 [pdf][72] arXiv:2006.04873 [pdf][73] arXiv:2006.04863 [pdf][74] arXiv:2006.04858 [pdf][75] arXiv:2006.04821 [pdf][76] arXiv:2006.04803 [pdf][77] arXiv:2006.04778 [pdf][78] arXiv:2006.04769 [pdf][79] arXiv:2006.04766 [pdf][80] arXiv:2006.04748 [pdf]

Residual Force Control for Agile Human Behavior Imitation and Extended Motion Synthesis

Reinforcement learning has shown great promise for synthesizing realistic human behaviors by learning humanoid control policies from motion capture data. However, it is still very challenging to reproduce sophisticated human skills like ballet dance, or to stably imitate long-term human behaviors with complex transitions. The main difficulty lies in the dynamics mismatch between the humanoid model and real humans. That is, motions of real humans may not be physically possible for the humanoid model. To overcome the dynamics mismatch, we propose a novel approach, residual force control (RFC), that augments a humanoid control policy by adding external residual forces into the action space. During training, the RFC-based policy learns to apply residual forces to the humanoid to compensate for the dynamics mismatch and better imitate the reference motion. Experiments on a wide range of dynamic motions demonstrate that our approach outperforms state-of-the-art methods in terms of convergence speed and the quality of learned motions. For the first time, we show a physics-based virtual character performing highly agile ballet dance moves such as pirouette, arabesque and jeté. Furthermore, we propose a dual-policy control framework, where a kinematic policy and an RFC-based policy work in tandem to synthesize multi-modal infinite-horizon human motions without any task guidance or user input. Our approach is the first humanoid control method that successfully learns from a large-scale human motion dataset (Human3.6M) and generates diverse long-term motions.

Generalized Multi-Relational Graph Convolution Network

Graph Convolutional Networks (GCNs) have received increasing attention in recent machine learning. How to effectively leverage the rich structural information in complex graphs, such as knowledge graphs with heterogeneous types of entities and relations, is a primary open challenge in the field. Most GCN methods are either restricted to graphs with a homogeneous type of edges (e.g., citation links only), or focusing on representation learning for nodes only instead of jointly optimizing the embeddings of both nodes and edges for target-driven objectives. This paper addresses these limitations by proposing a novel framework, namely the GEneralized Multi-relational Graph Convolutional Networks (GEM-GCN), which combines the power of GCNs in graph-based belief propagation and the strengths of advanced knowledge-base embedding methods, and goes beyond. Our theoretical analysis shows that GEM-GCN offers an elegant unification of several well-known GCN methods as specific cases, with a new perspective of graph convolution. Experimental results on benchmark datasets show the advantageous performance of GEM-GCN over strong baseline methods in the tasks of knowledge graph alignment and entity classification.

GNN3DMOT Graph Neural Network for 3D Multi-Object Tracking with Multi-Feature Learning

3D Multi-object tracking (MOT) is crucial to autonomous systems. Recent work uses a standard tracking-by-detection pipeline, where feature extraction is first performed independently for each object in order to compute an affinity matrix. Then the affinity matrix is passed to the Hungarian algorithm for data association. A key process of this standard pipeline is to learn discriminative features for different objects in order to reduce confusion during data association. In this work, we propose two techniques to improve the discriminative feature learning for MOT (1) instead of obtaining features for each object independently, we propose a novel feature interaction mechanism by introducing the Graph Neural Network. As a result, the feature of one object is informed of the features of other objects so that the object feature can lean towards the object with similar feature (i.e., object probably with a same ID) and deviate from objects with dissimilar features (i.e., object probably with different IDs), leading to a more discriminative feature for each object; (2) instead of obtaining the feature from either 2D or 3D space in prior work, we propose a novel joint feature extractor to learn appearance and motion features from 2D and 3D space simultaneously. As features from different modalities often have complementary information, the joint feature can be more discriminate than feature from each individual modality. To ensure that the joint feature extractor does not heavily rely on one modality, we also propose an ensemble training paradigm. Through extensive evaluation, our proposed method achieves state-of-the-art performance on KITTI and nuScenes 3D MOT benchmarks. Our code will be made available at this https URL

Human and Multi-Agent collaboration in a human-MARL teaming framework

Collaborative multi-agent reinforcement learning (MARL) as a specific category of reinforcement learning provides effective results with agents learning from their observations, received rewards, and internal interactions between agents. However, centralized learning methods with a joint global policy in a highly dynamic environment present unique challenges in dealing with large amounts of information. This study proposes two innovative solutions to address the complexities of a collaboration between a human and multiple reinforcement learning (RL)-based agents (referred to thereafter as Human-MARL teaming) where the goals pursued cannot be achieved by a human alone or agents alone. The first innovation is the introduction of a new open-source MARL framework, called COGMENT, to unite humans and agents in real-time complex dynamic systems and efficiently leverage their interactions as a source of learning. The second innovation is our proposal of a new hybrid MARL method, named Dueling Double Deep Q learning MADDPG (D3-MADDPG) to allow agents to train decentralized policies parallelly in a joint centralized policy. This method can solve the overestimation problem in Q-learning methods of value-based MARL. We demonstrate these innovations by using a designed real-time environment with unmanned aerial vehicles driven by RL agents, collaborating with a human to fight fires. The team of RL agent drones autonomously look for fire seats and the human pilot douses the fires. The results of this study show that the proposed collaborative paradigm and the open-source framework leads to significant reductions in both human effort and exploration costs. Also, the results of the proposed hybrid MARL method shows that it effectively improves the learning process to achieve more reliable Q-values for each action, by decoupling the estimation between state value and advantage value.

A Brief Look at Generalization in Visual Meta-Reinforcement Learning

Due to the realization that deep reinforcement learning algorithms trained on high-dimensional tasks can strongly overfit to their training environments, there have been several studies that investigated the generalization performance of these algorithms. However, there has been no similar study that evaluated the generalization performance of algorithms that were specifically designed for generalization, i.e. meta-reinforcement learning algorithms. In this paper, we assess the generalization performance of these algorithms by leveraging high-dimensional, procedurally generated environments. We find that these algorithms can display strong overfitting when they are evaluated on challenging tasks. We also observe that scalability to high-dimensional tasks with sparse rewards remains a significant problem among many of the current meta-reinforcement learning algorithms. With these results, we highlight the need for developing meta-reinforcement learning algorithms that can both generalize and scale.

DECSTR Learning Goal-Directed Abstract Behaviors using Pre-Verbal Spatial Predicates in Intrinsically Motivated Agents

Intrinsically motivated agents freely explore their environment and set their own goals. Such goals are traditionally represented as specific states, but recent works introduced the use of language to facilitate abstraction. Language can, for example, represent goals as sets of general properties that surrounding objects should verify. However, language-conditioned agents are trained simultaneously to understand language and to act, which seems to contrast with how children learn infants demonstrate goal-oriented behaviors and abstract spatial concepts very early in their development, before language mastery. Guided by these findings from developmental psychology, we introduce a high-level state representation based on natural semantic predicates that describe spatial relations between objects and that are known to be present early in infants. In a robotic manipulation environment, our DECSTR system explores this representation space by manipulating objects, and efficiently learns to achieve any reachable configuration within it. It does so by leveraging an object-centered modular architecture, a symmetry inductive bias, and a new form of automatic curriculum learning for goal selection and policy learning. As with children, language acquisition takes place in a second phase, independently from goal-oriented sensorimotor learning. This is done via a new goal generation module, conditioned on instructions describing expected transformations in object relations. We present ablations studies for each component and highlight several advantages of targeting abstract goals over specific ones. We further show that using this intermediate representation enables efficient language grounding by evaluating agents on sequences of language instructions and their logical combinations.

Move-to-Data A new Continual Learning approach with Deep CNNs, Application for image-class recognition

In many real-life tasks of application of supervised learning approaches, all the training data are not available at the same time. The examples are lifelong image classification or recognition of environmental objects during interaction of instrumented persons with their environment, enrichment of an online-database with more images. It is necessary to pre-train the model at a "training recording phase" and then adjust it to the new coming data. This is the task of incremental/continual learning approaches. Amongst different problems to be solved by these approaches such as introduction of new categories in the model, refining existing categories to sub-categories and extending trained classifiers over them, ... we focus on the problem of adjusting pre-trained model with new additional training data for existing categories. We propose a fast continual learning layer at the end of the neuronal network. Obtained results are illustrated on the opensource CIFAR benchmark dataset. The proposed scheme yields similar performances as retraining but with drastically lower computational cost.

STONNE A Detailed Architectural Simulator for Flexible Neural Network Accelerators

The design of specialized architectures for accelerating the inference procedure of Deep Neural Networks (DNNs) is a booming area of research nowadays. First-generation rigid proposals have been rapidly replaced by more advanced flexible accelerator architectures able to efficiently support a variety of layer types and dimensions. As the complexity of the designs grows, it is more and more appealing for researchers to have cycle-accurate simulation tools at their disposal to allow for fast and accurate design-space exploration, and rapid quantification of the efficacy of architectural enhancements during the early stages of a design. To this end, we present STONNE (Simulation TOol of Neural Network Engines), a cycle-accurate, highly-modular and highly-extensible simulation framework that enables end-to-end evaluation of flexible accelerator architectures running complete contemporary DNN models. We use STONNE to model the recently proposed MAERI architecture and show how it can closely approach the performance results of the publicly available BSV-coded MAERI implementation. Then, we conduct a comprehensive evaluation and demonstrate that the folding strategy implemented for MAERI results in very low compute unit utilization (25% on average across 5 DNN models) which in the end translates into poor performance.

Simulating Tariff Impact in Electrical Energy Consumption Profiles with Conditional Variational Autoencoders

The implementation of efficient demand response (DR) programs for household electricity consumption would benefit from data-driven methods capable of simulating the impact of different tariffs schemes. This paper proposes a novel method based on conditional variational autoencoders (CVAE) to generate, from an electricity tariff profile combined with exogenous weather and calendar variables, daily consumption profiles of consumers segmented in different clusters. First, a large set of consumers is gathered into clusters according to their consumption behavior and price-responsiveness. The clustering method is based on a causality model that measures the effect of a specific tariff on the consumption level. Then, daily electrical energy consumption profiles are generated for each cluster with CVAE. This non-parametric approach is compared to a semi-parametric data generator based on generalized additive models and that uses prior knowledge of energy consumption. Experiments in a publicly available data set show that, the proposed method presents comparable performance to the semi-parametric one when it comes to generating the average value of the original data. The main contribution from this new method is the capacity to reproduce rebound and side effects in the generated consumption profiles. Indeed, the application of a special electricity tariff over a time window may also affect consumption outside this time window. Another contribution is that the clustering approach segments consumers according to their daily consumption profile and elasticity to tariff changes. These two results combined are very relevant for an ex-ante testing of future DR policies by system operators, retailers and energy regulators.

Disentangled Representation Learning and Generation with Manifold Optimization

Disentanglement is an enjoyable property in representation learning which increases the interpretability of generative models such as Variational Auto-Encoders (VAE), Generative Adversarial Models and their many variants. In the context of latent space models, this work presents a representation learning framework that explicitly promotes disentanglement thanks to the combination of an auto-encoder with Principal Component Analysis (PCA) in latent space. The proposed objective is the sum of an auto-encoder error term along with a PCA reconstruction error in the feature space. This has an interpretation of a Restricted Kernel Machine with an interconnection matrix on the Stiefel manifold. The construction encourages a matching between the principal directions in latent space and the directions of orthogonal variation in data space. The training algorithm involves a stochastic optimization method on the Stiefel manifold, which increases only marginally the computing time compared to an analogous VAE. Our theoretical discussion and various experiments show that the proposed model improves over many VAE variants along with special emphasis on disentanglement learning.

Language-Conditioned Goal Generation a New Approach to Language Grounding for RL

In the real world, linguistic agents are also embodied agents they perceive and act in the physical world. The notion of Language Grounding questions the interactions between language and embodiment how do learning agents connect or ground linguistic representations to the physical world ? This question has recently been approached by the Reinforcement Learning community under the framework of instruction-following agents. In these agents, behavioral policies or reward functions are conditioned on the embedding of an instruction expressed in natural language. This paper proposes another approach using language to condition goal generators. Given any goal-conditioned policy, one could train a language-conditioned goal generator to generate language-agnostic goals for the agent. This method allows to decouple sensorimotor learning from language acquisition and enable agents to demonstrate a diversity of behaviors for any given instruction. We propose a particular instantiation of this approach and demonstrate its benefits.

Recurrent Neural Networks for Stochastic Control in Real-Time Bidding

Bidding in real-time auctions can be a difficult stochastic control task; especially if underdelivery incurs strong penalties and the market is very uncertain. Most current works and implementations focus on optimally delivering a campaign given a reasonable forecast of the market. Practical implementations have a feedback loop to adjust and be robust to forecasting errors, but no implementation, to the best of our knowledge, uses a model of market risk and actively anticipates market shifts. Solving such stochastic control problems in practice is actually very challenging. This paper proposes an approximate solution based on a Recurrent Neural Network (RNN) architecture that is both effective and practical for implementation in a production environment. The RNN bidder provisions everything it needs to avoid missing its goal. It also deliberately falls short of its goal when buying the missing impressions would cost more than the penalty for not reaching it.

Backdoor Attacks on Federated Meta-Learning

Federated learning allows multiple users to collaboratively train a shared classification model while preserving data privacy. This approach, where model updates are aggregated by a central server, was shown to be vulnerable to backdoor attacks a malicious user can alter the shared model to arbitrarily classify specific inputs from a given class. In this paper, we analyze the effects of backdoor attacks in federated meta-learning, where users train a model that can be adapted to different sets of output classes using only a few training examples. While the ability to adapt could, in principle, make federated learning more robust to backdoor attacks when new training examples are benign, we find that even 1-shot poisoning attacks can be very successful and persist after additional training. To address these vulnerabilities, we propose a defense mechanism inspired by matching networks, where the class of an input is predicted from the cosine similarity of its features with a support set of labeled examples. By removing the decision logic from the model shared with the federation, success and persistence of backdoor attacks are greatly reduced.

Provably Robust Metric Learning

Metric learning is an important family of algorithms for classification and similarity search, but the robustness of learned metrics against small adversarial perturbations is less studied. In this paper, we show that existing metric learning algorithms, which focus on boosting the clean accuracy, can result in metrics that are less robust than the Euclidean distance. To overcome this problem, we propose a novel metric learning algorithm to find a Mahalanobis distance that is robust against adversarial perturbations, and the robustness of the resulting model is certifiable. Experimental results show that the proposed metric learning algorithm improves both certified robust errors and empirical robust errors (errors under adversarial attacks). Furthermore, unlike neural network defenses which usually encounter a trade-off between clean and robust errors, our method does not sacrifice clean errors compared with previous metric learning methods. Our code is available at this https URL.

A benchmark study on reliable molecular supervised learning via Bayesian learning

Virtual screening aims to find desirable compounds from chemical library by using computational methods. For this purpose with machine learning, model outputs that can be interpreted as predictive probability will be beneficial, in that a high prediction score corresponds to high probability of correctness. In this work, we present a study on the prediction performance and reliability of graph neural networks trained with the recently proposed Bayesian learning algorithms. Our work shows that Bayesian learning algorithms allow well-calibrated predictions for various GNN architectures and classification tasks. Also, we show the implications of reliable predictions on virtual screening, where Bayesian learning may lead to higher success in finding hit compounds.

A Unified Analysis of Stochastic Gradient Methods for Nonconvex Federated Optimization

In this paper, we study the performance of a large family of SGD variants in the smooth nonconvex regime. To this end, we propose a generic and flexible assumption capable of accurate modeling of the second moment of the stochastic gradient. Our assumption is satisfied by a large number of specific variants of SGD in the literature, including SGD with arbitrary sampling, SGD with compressed gradients, and a wide variety of variance-reduced SGD methods such as SVRG and SAGA. We provide a single convergence analysis for all methods that satisfy the proposed unified assumption, thereby offering a unified understanding of SGD variants in the nonconvex regime instead of relying on dedicated analyses of each variant. Moreover, our unified analysis is accurate enough to recover or improve upon the best-known convergence results of several classical methods, and also gives new convergence results for many new methods which arise as special cases. In the more general distributed/federated nonconvex optimization setup, we propose two new general algorithmic frameworks differing in whether direct gradient compression (DC) or compression of gradient differences (DIANA) is used. We show that all methods captured by these two frameworks also satisfy our unified assumption. Thus, our unified convergence analysis also captures a large variety of distributed methods utilizing compressed communication. Finally, we also provide a unified analysis for obtaining faster linear convergence rates in this nonconvex regime under the PL condition.

Complex Dynamics in Simple Neural Networks Understanding Gradient Flow in Phase Retrieval

Despite the widespread use of gradient-based algorithms for optimizing high-dimensional non-convex functions, understanding their ability of finding good minima instead of being trapped in spurious ones remains to a large extent an open problem. Here we focus on gradient flow dynamics for phase retrieval from random measurements. When the ratio of the number of measurements over the input dimension is small the dynamics remains trapped in spurious minima with large basins of attraction. We find analytically that above a critical ratio those critical points become unstable developing a negative direction toward the signal. By numerical experiments we show that in this regime the gradient flow algorithm is not trapped; it drifts away from the spurious critical points along the unstable direction and succeeds in finding the global minimum. Using tools from statistical physics we characterize this phenomenon, which is related to a BBP-type transition in the Hessian of the spurious minima.

Towards Deeper Graph Neural Networks with Differentiable Group Normalization

Graph neural networks (GNNs), which learn the representation of a node by aggregating its neighbors, have become an effective computational tool in downstream applications. Over-smoothing is one of the key issues which limit the performance of GNNs as the number of layers increases. It is because the stacked aggregators would make node representations converge to indistinguishable vectors. Several attempts have been made to tackle the issue by bringing linked node pairs close and unlinked pairs distinct. However, they often ignore the intrinsic community structures and would result in sub-optimal performance. The representations of nodes within the same community/class need be similar to facilitate the classification, while different classes are expected to be separated in embedding space. To bridge the gap, we introduce two over-smoothing metrics and a novel technique, i.e., differentiable group normalization (DGN). It normalizes nodes within the same group independently to increase their smoothness, and separates node distributions among different groups to significantly alleviate the over-smoothing issue. Experiments on real-world datasets demonstrate that DGN makes GNN models more robust to over-smoothing and achieves better performance with deeper GNNs.

Multi Layer Neural Networks as Replacement for Pooling Operations

Pooling operations are a layer found in almost every modern neural network, which can be calculated at low cost and serves as a linear or nonlinear transfer function for data reduction. Many modern approaches have already dealt with replacing the common maximum value selection and mean value operations by others or even to provide a function that includes different functions which can be selected through changing parameters. Additional neural networks are used to estimate the parameters of these pooling functions. Therefore, these pooling layers need many additional parameters and increase the complexity of the whole model. In this work, we show that already one perceptron can be used very effectively as a pooling operation without increasing the complexity of the model. This kind of pooling allows to integrate multi-layer neural networks directly into a model as a pooling operation by restructuring the data and thus learning complex pooling operations. We compare our approach to tensor convolution with strides as a pooling operation and show that our approach is effective and reduces complexity. The restructuring of the data in combination with multiple perceptrons allows also to use our approach for upscaling, which is used for transposed convolutions in semantic segmentation.

Towards Flexible Device Participation in Federated Learning for Non-IID Data

Traditional federated learning algorithms impose strict requirements on the participation rates of devices, which limit the potential reach of federated learning. In this paper, we extend the current learning paradigm and consider devices that may become inactive, compute incomplete updates, and leave or join in the middle of training. We derive analytical results to illustrate how the flexible participation of devices could affect the convergence when data is not independently and identically distributed (IID), and when devices are heterogeneous. This paper proposes a new federated aggregation scheme that converges even when devices may be inactive or return incomplete updates. We finally discuss practical research questions an operator would encounter during the training, and provide answers based on our convergence analysis.

Smartphone Transportation Mode Recognition Using a Hierarchical Machine Learning Classifier and Pooled Features From Time and Frequency Domains

This paper develops a novel two-layer hierarchical classifier that increases the accuracy of traditional transportation mode classification algorithms. This paper also enhances classification accuracy by extracting new frequency domain features. Many researchers have obtained these features from global positioning system data; however, this data was excluded in this paper, as the system use might deplete the smartphone's battery and signals may be lost in some areas. Our proposed two-layer framework differs from previous classification attempts in three distinct ways 1) the outputs of the two layers are combined using Bayes' rule to choose the transportation mode with the largest posterior probability; 2) the proposed framework combines the new extracted features with traditionally used time domain features to create a pool of features; and 3) a different subset of extracted features is used in each layer based on the classified modes. Several machine learning techniques were used, including k-nearest neighbor, classification and regression tree, support vector machine, random forest, and a heterogeneous framework of random forest and support vector machine. Results show that the classification accuracy of the proposed framework outperforms traditional approaches. Transforming the time domain features to the frequency domain also adds new features in a new space and provides more control on the loss of information. Consequently, combining the time domain and the frequency domain features in a large pool and then choosing the best subset results in higher accuracy than using either domain alone. The proposed two-layer classifier obtained a maximum classification accuracy of 97.02%.

List Learning with Attribute Noise

We introduce and study the model of list learning with attribute noise. Learning with attribute noise was introduced by Shackelford and Volper (COLT 1988) as a variant of PAC learning, in which the algorithm has access to noisy examples and uncorrupted labels, and the goal is to recover an accurate hypothesis. Sloan (COLT 1988) and Goldman and Sloan (Algorithmica 1995) discovered information-theoretic limits to learning in this model, which have impeded further progress. In this article we extend the model to that of list learning, drawing inspiration from the list-decoding model in coding theory, and its recent variant studied in the context of learning. On the positive side, we show that sparse conjunctions can be efficiently list learned under some assumptions on the underlying ground-truth distribution. On the negative side, our results show that even in the list-learning model, efficient learning of parities and majorities is not possible regardless of the representation used.

To Each Optimizer a Norm, To Each Norm its Generalization

We study the implicit regularization of optimization methods for linear models interpolating the training data in the under-parametrized and over-parametrized regimes. Since it is difficult to determine whether an optimizer converges to solutions that minimize a known norm, we flip the problem and investigate what is the corresponding norm minimized by an interpolating solution. Using this reasoning, we prove that for over-parameterized linear regression, projections onto linear spans can be used to move between different interpolating solutions. For under-parameterized linear classification, we prove that for any linear classifier separating the data, there exists a family of quadratic norms ||.||_P such that the classifier's direction is the same as that of the maximum P-margin solution. For linear classification, we argue that analyzing convergence to the standard maximum l2-margin is arbitrary and show that minimizing the norm induced by the data results in better generalization. Furthermore, for over-parameterized linear classification, projections onto the data-span enable us to use techniques from the under-parameterized setting. On the empirical side, we propose techniques to bias optimizers towards better generalizing solutions, improving their test performance. We validate our theoretical results via synthetic experiments, and use the neural tangent kernel to handle non-linear models.

Tangent Space Sensitivity and Distribution of Linear Regions in ReLU Networks

Recent articles indicate that deep neural networks are efficient models for various learning problems. However they are often highly sensitive to various changes that cannot be detected by an independent observer. As our understanding of deep neural networks with traditional generalization bounds still remains incomplete, there are several measures which capture the behaviour of the model in case of small changes at a specific state. In this paper we consider adversarial stability in the tangent space and suggest tangent sensitivity in order to characterize stability. We focus on a particular kind of stability with respect to changes in parameters that are induced by individual examples without known labels. We derive several easily computable bounds and empirical measures for feed-forward fully connected ReLU (Rectified Linear Unit) networks and connect tangent sensitivity to the distribution of the activation regions in the input space realized by the network. Our experiments suggest that even simple bounds and measures are associated with the empirical generalization gap.

Faster DBSCAN via subsampled similarity queries

DBSCAN is a popular density-based clustering algorithm. It computes the $\epsilon$-neighborhood graph of a dataset and uses the connected components of the high-degree nodes to decide the clusters. However, the full neighborhood graph may be too costly to compute with a worst-case complexity of $O(n^2)$. In this paper, we propose a simple variant called SNG-DBSCAN, which clusters based on a subsampled $\epsilon$-neighborhood graph, only requires access to similarity queries for pairs of points and in particular avoids any complex data structures which need the embeddings of the data points themselves. The runtime of the procedure is $O(sn^2)$, where $s$ is the sampling rate. We show under some natural theoretical assumptions that $s \approx \log n/n$ is sufficient for statistical cluster recovery guarantees leading to an $O(n\log n)$ complexity. We provide an extensive experimental analysis showing that on large datasets, one can subsample as little as $0.1\%$ of the neighborhood graph, leading to as much as over 200x speedup and 250x reduction in RAM consumption compared to scikit-learn's implementation of DBSCAN, while still maintaining competitive clustering performance.

IDEAL Inexact DEcentralized Accelerated Augmented Lagrangian Method

We introduce a framework for designing primal methods under the decentralized optimization setting where local functions are smooth and strongly convex. Our approach consists of approximately solving a sequence of sub-problems induced by the accelerated augmented Lagrangian method, thereby providing a systematic way for deriving several well-known decentralized algorithms including EXTRA arXiv1404.6264 and SSDA arXiv1702.08704. When coupled with accelerated gradient descent, our framework yields a novel primal algorithm whose convergence rate is optimal and matched by recently derived lower bounds. We provide experimental results that demonstrate the effectiveness of the proposed algorithm on highly ill-conditioned problems.

Deep Reinforcement Learning for Electric Transmission Voltage Control

Today, human operators primarily perform voltage control of the electric transmission system. As the complexity of the grid increases, so does its operation, suggesting additional automation could be beneficial. A subset of machine learning known as deep reinforcement learning (DRL) has recently shown promise in performing tasks typically performed by humans. This paper applies DRL to the transmission voltage control problem, presents open-source DRL environments for voltage control, proposes a novel modification to the "deep Q network" (DQN) algorithm, and performs experiments at scale with systems up to 500 buses. The promise of applying DRL to voltage control is demonstrated, though more research is needed to enable DRL-based techniques to consistently outperform conventional methods.

Training Generative Adversarial Networks with Limited Data

Training generative adversarial networks (GAN) using too little data typically leads to discriminator overfitting, causing training to diverge. We propose an adaptive discriminator augmentation mechanism that significantly stabilizes training in limited data regimes. The approach does not require changes to loss functions or network architectures, and is applicable both when training from scratch and when fine-tuning an existing GAN on another dataset. We demonstrate, on several datasets, that good results are now possible using only a few thousand training images, often matching StyleGAN2 results with an order of magnitude fewer images. We expect this to open up new application domains for GANs. We also find that the widely used CIFAR-10 is, in fact, a limited data benchmark, and improve the record FID from 5.59 to 2.67.

Directional convergence and alignment in deep learning

In this paper, we show that although the minimizers of cross-entropy and related classification losses are off at infinity, network weights learned by gradient flow converge in direction, with an immediate corollary that network predictions, training errors, and the margin distribution also converge. This proof holds for deep homogeneous networks -- a broad class of networks allowing for ReLU, max pooling, linear, and convolutional layers -- and we additionally provide empirical support not just close to the theory (e.g., the AlexNet), but also on non-homogeneous networks (e.g., the ResNet). If the network further has locally Lipschitz gradients, we show that these gradients converge in direction, and asymptotically align with the gradient flow path, with consequences on margin maximization. Our analysis complements and is distinct from the well-known neural tangent and mean-field theories, and in particular makes no requirements on network width and initialization, instead merely requiring perfect classification accuracy. The proof proceeds by developing a theory of unbounded nonsmooth Kurdyka-Lojasiewicz inequalities for functions definable in an o-minimal structure, and is also applicable outside deep learning.

Growing Artificial Neural Networks

Pruning is a legitimate method for reducing the size of a neural network to fit in low SWaP hardware, but the networks must be trained and pruned offline. We propose an algorithm, Artificial Neurogenesis (ANG), that grows rather than prunes the network and enables neural networks to be trained and executed in low SWaP embedded hardware. ANG accomplishes this by using the training data to determine critical connections between layers before the actual training takes place. Our experiments use a modified LeNet-5 as a baseline neural network that achieves a test accuracy of 98.74% using a total of 61,160 weights. An ANG grown network achieves a test accuracy of 98.80% with only 21,211 weights.

AdaS Adaptive Scheduling of Stochastic Gradients

The choice of step-size used in Stochastic Gradient Descent (SGD) optimization is empirically selected in most training procedures. Moreover, the use of scheduled learning techniques such as Step-Decaying, Cyclical-Learning, and Warmup to tune the step-size requires extensive practical experience--offering limited insight into how the parameters update--and is not consistent across applications. This work attempts to answer a question of interest to both researchers and practitioners, namely \textit{"how much knowledge is gained in iterative training of deep neural networks?"} Answering this question introduces two useful metrics derived from the singular values of the low-rank factorization of convolution layers in deep neural networks. We introduce the notions of \textit{"knowledge gain"} and \textit{"mapping condition"} and propose a new algorithm called Adaptive Scheduling (AdaS) that utilizes these derived metrics to adapt the SGD learning rate proportionally to the rate of change in knowledge gain over successive iterations. Experimentation reveals that, using the derived metrics, AdaS exhibits (a) faster convergence and superior generalization over existing adaptive learning methods; and (b) lack of dependence on a validation set to determine when to stop training. Code is available at \url{this https URL}.

Superconducting radio-frequency cavity fault classification using machine learning at Jefferson Laboratory

We report on the development of machine learning models for classifying C100 superconducting radio-frequency (SRF) cavity faults in the Continuous Electron Beam Accelerator Facility (CEBAF) at Jefferson Lab. CEBAF is a continuous-wave recirculating linac utilizing 418 SRF cavities to accelerate electrons up to 12 GeV through 5-passes. Of these, 96 cavities (12 cryomodules) are designed with a digital low-level RF system configured such that a cavity fault triggers waveform recordings of 17 RF signals for each of the 8 cavities in the cryomodule. Subject matter experts (SME) are able to analyze the collected time-series data and identify which of the eight cavities faulted first and classify the type of fault. This information is used to find trends and strategically deploy mitigations to problematic cryomodules. However manually labeling the data is laborious and time-consuming. By leveraging machine learning, near real-time (rather than post-mortem) identification of the offending cavity and classification of the fault type has been implemented. We discuss performance of the ML models during a recent physics run. Results show the cavity identification and fault classification models have accuracies of 84.9% and 78.2%, respectively.

Rethinking the Truly Unsupervised Image-to-Image Translation

Every recent image-to-image translation model uses either image-level (i.e. input-output pairs) or set-level (i.e. domain labels) supervision at minimum. However, even the set-level supervision can be a serious bottleneck for data collection in practice. In this paper, we tackle image-to-image translation in a fully unsupervised setting, i.e., neither paired images nor domain labels. To this end, we propose the truly unsupervised image-to-image translation method (TUNIT) that simultaneously learns to separate image domains via an information-theoretic approach and generate corresponding images using the estimated domain labels. Experimental results on various datasets show that the proposed method successfully separates domains and translates images across those domains. In addition, our model outperforms existing set-level supervised methods under a semi-supervised setting, where a subset of domain labels is provided. The source code is available at this https URL

Deep Differential System Stability -- Learning advanced computations from examples

Can advanced mathematical computations be learned from examples? Using transformers over large generated datasets, we train models to learn properties of differential systems, such as local stability, behavior at infinity and controllability. We achieve near perfect estimates of qualitative characteristics of the systems, and good approximations of numerical quantities, demonstrating that neural networks can learn advanced theorems and complex computations without built-in mathematical knowledge.

Text as data a machine learning-based approach to measuring uncertainty

The Economic Policy Uncertainty index had gained considerable traction with both academics and policy practitioners. Here, we analyse news feed data to construct a simple, general measure of uncertainty in the United States using a highly cited machine learning methodology. Over the period January 1996 through May 2020, we show that the series unequivocally Granger-causes the EPU and there is no Granger-causality in the reverse direction

Learning Individually Inferred Communication for Multi-Agent Cooperation

Communication lays the foundation for human cooperation. It is also crucial for multi-agent cooperation. However, existing work focuses on broadcast communication, which is not only impractical but also leads to information redundancy that could even impair the learning process. To tackle these difficulties, we propose \textit{Individually Inferred Communication} (I2C), a simple yet effective model to enable agents to learn a prior for agent-agent communication. The prior knowledge is learned via causal inference and realized by a feed-forward neural network that maps the agent's local observation to a belief about who to communicate with. The influence of one agent on another is inferred via the joint action-value function in multi-agent reinforcement learning and quantified to label the necessity of agent-agent communication. Furthermore, the agent policy is regularized to better exploit communicated messages. Empirically, we show that I2C can not only reduce communication overhead but also improve the performance in a variety of multi-agent cooperative scenarios, comparing to existing methods.

Convolutional neural networks compression with low rank and sparse tensor decompositions

Convolutional neural networks show outstanding results in a variety of computer vision tasks. However, a neural network architecture design usually faces a trade-off between model performance and computational/memory complexity. For some real-world applications, it is crucial to develop models, which can be fast and light enough to run on edge systems and mobile devices. However, many modern architectures that demonstrate good performance don't satisfy inference time and storage limitation requirements. Thus, arises a problem of neural network compression to obtain a smaller and faster model, which is on par with the initial one. In this work, we consider a neural network compression method based on tensor decompositions. Namely, we propose to approximate the convolutional layer weight with a tensor, which can be represented as a sum of low-rank and sparse components. The motivation for such approximation is based on the assumption that low-rank and sparse terms allow eliminating two different types of redundancy and thus yield a better compression rate. An efficient CPU implementation for the proposed method has been developed. Our algorithm has demonstrated up to 3.5x CPU layer speedup and 11x layer size reduction when compressing Resnet50 architecture for the image classification task.

GAIT-prop A biologically plausible learning rule derived from backpropagation of error

Traditional backpropagation of error, though a highly successful algorithm for learning in artificial neural network models, includes features which are biologically implausible for learning in real neural circuits. An alternative called target propagation proposes to solve this implausibility by using a top-down model of neural activity to convert an error at the output of a neural network into layer-wise and plausible 'targets' for every unit. These targets can then be used to produce weight updates for network training. However, thus far, target propagation has been heuristically proposed without demonstrable equivalence to backpropagation. Here, we derive an exact correspondence between backpropagation and a modified form of target propagation (GAIT-prop) where the target is a small perturbation of the forward pass. Specifically, backpropagation and GAIT-prop give identical updates when synaptic weight matrices are orthogonal. In a series of simple computer vision experiments, we show near-identical performance between backpropagation and GAIT-prop with a soft orthogonality-inducing regularizer.

Sparse recovery by reduced variance stochastic approximation

In this paper, we discuss application of iterative Stochastic Optimization routines to the problem of sparse signal recovery from noisy observation. Using Stochastic Mirror Descent algorithm as a building block, we develop a multistage procedure for recovery of sparse solutions to Stochastic Optimization problem under assumption of smoothness and quadratic minoration on the expected objective. An interesting feature of the proposed algorithm is its linear convergence of the approximate solution during the preliminary phase of the routine when the component of stochastic error in the gradient observation which is due to bad initial approximation of the optimal solution is larger than the "ideal" asymptotic error component owing to observation noise "at the optimal solution." We also show how one can straightforwardly enhance reliability of the corresponding solution by using Median-of-Means like techniques. We illustrate the performance of the proposed algorithms in application to classical problems of recovery of sparse and low rank signals in linear regression framework. We show, under rather weak assumption on the regressor and noise distributions, how they lead to parameter estimates which obey (up to factors which are logarithmic in problem dimension and confidence level) the best known to us accuracy bounds.

Improved Algorithms for Convex-Concave Minimax Optimization

This paper studies minimax optimization problems $\min_x \max_y f(x,y)$, where $f(x,y)$ is $m_x$-strongly convex with respect to $x$, $m_y$-strongly concave with respect to $y$ and $(L_x,L_{xy},L_y)$-smooth. \citet{zhang2019lower} provided the following lower bound of the gradient complexity for any first-order method $\Omega\Bigl(\sqrt{\frac{L_x}{m_x}+\frac{L_{xy}^2}{m_x m_y}+\frac{L_y}{m_y}}\ln(1/\epsilon)\Bigr).$ This paper proposes a new algorithm with gradient complexity upper bound $\tilde{O}\Bigl(\sqrt{\frac{L_x}{m_x}+\frac{L\cdot L_{xy}}{m_x m_y}+\frac{L_y}{m_y}}\ln\left(1/\epsilon\right)\Bigr),$ where $L=\max\{L_x,L_{xy},L_y\}$. This improves over the best known upper bound $\tilde{O}\left(\sqrt{\frac{L^2}{m_x m_y}} \ln^3\left(1/\epsilon\right)\right)$ by \citet{lin2020near}. Our bound achieves linear convergence rate and tighter dependency on condition numbers, especially when $L_{xy}\ll L$ (i.e., when the interaction between $x$ and $y$ is weak). Via reduction, our new bound also implies improved bounds for strongly convex-concave and convex-concave minimax optimization problems. When $f$ is quadratic, we can further improve the upper bound, which matches the lower bound up to a small sub-polynomial factor.

Understanding Regularisation Methods for Continual Learning

The problem of Catastrophic Forgetting has received a lot of attention in the past years. An important class of proposed solutions are so-called regularisation approaches, which protect weights from large changes according to their importances. Various ways to measure this importance have been put forward, all stemming from different theoretical or intuitive motivations. We present mathematical and empirical evidence that two of these methods -- Synaptic Intelligence and Memory Aware Synapses -- approximate a rescaled version of the Fisher Information, a theoretically justified importance measure also used in the literature. As part of our methods, we show that the importance approximation of Synaptic Intelligence is biased and that, in fact, this bias explains its performance best. Altogether, our results offer a theoretical account for the effectiveness of different regularisation approaches and uncover similarities between the methods proposed so far.

Improved Design of Quadratic Discriminant Analysis Classifier in Unbalanced Settings

The use of quadratic discriminant analysis (QDA) or its regularized version (R-QDA) for classification is often not recommended, due to its well-acknowledged high sensitivity to the estimation noise of the covariance matrix. This becomes all the more the case in unbalanced data settings for which it has been found that R-QDA becomes equivalent to the classifier that assigns all observations to the same class. In this paper, we propose an improved R-QDA that is based on the use of two regularization parameters and a modified bias, properly chosen to avoid inappropriate behaviors of R-QDA in unbalanced settings and to ensure the best possible classification performance. The design of the proposed classifier builds on a refined asymptotic analysis of its performance when the number of samples and that of features grow large simultaneously, which allows to cope efficiently with the high-dimensionality frequently met within the big data paradigm. The performance of the proposed classifier is assessed on both real and synthetic data sets and was shown to be much better than what one would expect from a traditional R-QDA.

Investigating Robustness of Adversarial Samples Detection for Automatic Speaker Verification

Recently adversarial attacks on automatic speaker verification (ASV) systems attracted widespread attention as they pose severe threats to ASV systems. However, methods to defend against such attacks are limited. Existing approaches mainly focus on retraining ASV systems with adversarial data augmentation. Also, countermeasure robustness against different attack settings are insufficiently investigated. Orthogonal to prior approaches, this work proposes to defend ASV systems against adversarial attacks with a separate detection network, rather than augmenting adversarial data into ASV training. A VGG-like binary classification detector is introduced and demonstrated to be effective on detecting adversarial samples. To investigate detector robustness in a realistic defense scenario where unseen attack settings exist, we analyze various attack settings and observe that the detector is robust (6.27\% EER_{det} degradation in the worst case) against unseen substitute ASV systems, but it has weak robustness (50.37\% EER_{det} degradation in the worst case) against unseen perturbation methods. The weak robustness against unseen perturbation methods shows a direction for developing stronger countermeasures.

G5 A Universal GRAPH-BERT for Graph-to-Graph Transfer and Apocalypse Learning

The recent GRAPH-BERT model introduces a new approach to learning graph representations merely based on the attention mechanism. GRAPH-BERT provides an opportunity for transferring pre-trained models and learned graph representations across different tasks within the same graph dataset. In this paper, we will further investigate the graph-to-graph transfer of a universal GRAPH-BERT for graph representation learning across different graph datasets, and our proposed model is also referred to as the G5 for simplicity. Many challenges exist in learning G5 to adapt the distinct input and output configurations for each graph data source, as well as the information distributions differences. G5 introduces a pluggable model architecture (a) each data source will be pre-processed with a unique input representation learning component; (b) each output application task will also have a specific functional component; and (c) all such diverse input and output components will all be conjuncted with a universal GRAPH-BERT core component via an input size unification layer and an output representation fusion layer, respectively. The G5 model removes the last obstacle for cross-graph representation learning and transfer. For the graph sources with very sparse training data, the G5 model pre-trained on other graphs can still be utilized for representation learning with necessary fine-tuning. What's more, the architecture of G5 also allows us to learn a supervised functional classifier for data sources without any training data at all. Such a problem is also named as the Apocalypse Learning task in this paper. Two different label reasoning strategies, i.e., Cross-Source Classification Consistency Maximization (CCCM) and Cross-Source Dynamic Routing (CDR), are introduced in this paper to address the problem.

Quantum inspired k-means algorithm using matrix product states

Matrix product state has become the algorithm of choice when studying one-dimensional interacting quantum many-body systems, which demonstrates to be able to explore the most relevant portion of the exponentially large quantum Hilbert space and find accurate solutions. Here we propose a quantum inspired K-means clustering algorithm which first maps the classical data into quantum states represented as matrix product states, and then minimize the loss function using the variational matrix product states method in the enlarged space. We demonstrate the performance of this algorithm by applying it to several commonly used machine learning datasets and show that this algorithm could reach higher prediction accuracies and that it is less likely to be trapped in local minima compared to the classical K-means algorithm.

Sample Efficient Reinforcement Learning via Low-Rank Matrix Estimation

We consider the question of learning $Q$-function in a sample efficient manner for reinforcement learning with continuous state and action spaces under a generative model. If $Q$-function is Lipschitz continuous, then the minimal sample complexity for estimating $\epsilon$-optimal $Q$-function is known to scale as ${\Omega}(\frac{1}{\epsilon^{d_1+d_2 +2}})$ per classical non-parametric learning theory, where $d_1$ and $d_2$ denote the dimensions of the state and action spaces respectively. The $Q$-function, when viewed as a kernel, induces a Hilbert-Schmidt operator and hence possesses square-summable spectrum. This motivates us to consider a parametric class of $Q$-functions parameterized by its "rank" $r$, which contains all Lipschitz $Q$-functions as $r \to \infty$. As our key contribution, we develop a simple, iterative learning algorithm that finds $\epsilon$-optimal $Q$-function with sample complexity of $\widetilde{O}(\frac{1}{\epsilon^{\max(d_1, d_2)+2}})$ when the optimal $Q$-function has low rank $r$ and the discounting factor $\gamma$ is below a certain threshold. Thus, this provides an exponential improvement in sample complexity. To enable our result, we develop a novel Matrix Estimation algorithm that faithfully estimates an unknown low-rank matrix in the $\ell_\infty$ sense even in the presence of arbitrary bounded noise, which might be of interest in its own right. Empirical results on several stochastic control tasks confirm the efficacy of our "low-rank" algorithms.

Kalman Filter Based Multiple Person Head Tracking

For multi-target tracking, target representation plays a crucial rule in performance. State-of-the-art approaches rely on the deep learning-based visual representation that gives an optimal performance at the cost of high computational complexity. In this paper, we come up with a simple yet effective target representation for human tracking. Our inspiration comes from the fact that the human body goes through severe deformation and inter/intra occlusion over the passage of time. So, instead of tracking the whole body part, a relative rigid organ tracking is selected for tracking the human over an extended period of time. Hence, we followed the tracking-by-detection paradigm and generated the target hypothesis of only the spatial locations of heads in every frame. After the localization of head location, a Kalman filter with a constant velocity motion model is instantiated for each target that follows the temporal evolution of the targets in the scene. For associating the targets in the consecutive frames, combinatorial optimization is used that associates the corresponding targets in a greedy fashion. Qualitative results are evaluated on four challenging video surveillance dataset and promising results has been achieved.

System to Integrate Fairness Transparently An Industry Approach

There have been significant research efforts to address the issue of unintentional bias in Machine Learning (ML). Many well-known companies have dealt with the fallout after the deployment of their products due to this issue. In an industrial context, enterprises have large-scale ML solutions for a broad class of use cases deployed for different swaths of customers. Trading off the cost of detecting and mitigating bias across this landscape over the lifetime of each use case against the risk of impact to the brand image is a key consideration. We propose a framework for industrial uses that addresses their methodological and mechanization needs. Our approach benefits from prior experience handling security and privacy concerns as well as past internal ML projects. Through significant reuse of bias handling ability at every stage in the ML development lifecycle to guide users we can lower overall costs of reducing bias.

Structure Learning for Cyclic Linear Causal Models

We consider the problem of structure learning for linear causal models based on observational data. We treat models given by possibly cyclic mixed graphs, which allow for feedback loops and effects of latent confounders. Generalizing related work on bow-free acyclic graphs, we assume that the underlying graph is simple. This entails that any two observed variables can be related through at most one direct causal effect and that (confounding-induced) correlation between error terms in structural equations occurs only in absence of direct causal effects. We show that, despite new subtleties in the cyclic case, the considered simple cyclic models are of expected dimension and that a previously considered criterion for distributional equivalence of bow-free acyclic graphs has an analogue in the cyclic case. Our result on model dimension justifies in particular score-based methods for structure learning of linear Gaussian mixed graph models, which we implement via greedy search.

Model-Free Algorithm and Regret Analysis for MDPs with Long-Term Constraints

In the optimization of dynamical systems, the variables typically have constraints. Such problems can be modeled as a constrained Markov Decision Process (CMDP). This paper considers a model-free approach to the problem, where the transition probabilities are not known. In the presence of long-term (or average) constraints, the agent has to choose a policy that maximizes the long-term average reward as well as satisfy the average constraints in each episode. The key challenge with the long-term constraints is that the optimal policy is not deterministic in general, and thus standard Q-learning approaches cannot be directly used. This paper uses concepts from constrained optimization and Q-learning to propose an algorithm for CMDP with long-term constraints. For any $\gamma\in(0,\frac{1}{2})$, the proposed algorithm is shown to achieve $O(T^{1/2+\gamma})$ regret bound for the obtained reward and $O(T^{1-\gamma/2})$ regret bound for the constraint violation, where $T$ is the total number of steps. We note that these are the first results on regret analysis for MDP with long-term constraints, where the transition probabilities are not known apriori.

Extrapolation for Large-batch Training in Deep Learning

Deep learning networks are typically trained by Stochastic Gradient Descent (SGD) methods that iteratively improve the model parameters by estimating a gradient on a very small fraction of the training data. A major roadblock faced when increasing the batch size to a substantial fraction of the training data for improving training time is the persistent degradation in performance (generalization gap). To address this issue, recent work propose to add small perturbations to the model parameters when computing the stochastic gradients and report improved generalization performance due to smoothing effects. However, this approach is poorly understood; it requires often model-specific noise and fine-tuning. To alleviate these drawbacks, we propose to use instead computationally efficient extrapolation (extragradient) to stabilize the optimization trajectory while still benefiting from smoothing to avoid sharp minima. This principled approach is well grounded from an optimization perspective and we show that a host of variations can be covered in a unified framework that we propose. We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer. We demonstrate that in a variety of experiments the scheme allows scaling to much larger batch sizes than before whilst reaching or surpassing SOTA accuracy.

Meta Transition Adaptation for Robust Deep Learning with Noisy Labels

To discover intrinsic inter-class transition probabilities underlying data, learning with noise transition has become an important approach for robust deep learning on corrupted labels. Prior methods attempt to achieve such transition knowledge by pre-assuming strongly confident anchor points with 1-probability belonging to a specific class, generally infeasible in practice, or directly jointly estimating the transition matrix and learning the classifier from the noisy samples, always leading to inaccurate estimation misguided by wrong annotation information especially in large noise cases. To alleviate these issues, this study proposes a new meta-transition-learning strategy for the task. Specifically, through the sound guidance of a small set of meta data with clean labels, the noise transition matrix and the classifier parameters can be mutually ameliorated to avoid being trapped by noisy training samples, and without need of any anchor point assumptions. Besides, we prove our method is with statistical consistency guarantee on correctly estimating the desired transition matrix. Extensive synthetic and real experiments validate that our method can more accurately extract the transition matrix, naturally following its more robust performance than prior arts. Its essential relationship with label distribution learning is also discussed, which explains its fine performance even under no-noise scenarios.

Learning With Differential Privacy

The leakage of data might have been an extreme effect on the personal level if it contains sensitive information. Common prevention methods like encryption-decryption, endpoint protection, intrusion detection system are prone to leakage. Differential privacy comes to the rescue with a proper promise of protection against leakage, as it uses a randomized response technique at the time of collection of the data which promises strong privacy with better utility. Differential privacy allows one to access the forest of data by describing their pattern of groups without disclosing any individual trees. The current adaption of differential privacy by leading tech companies and academia encourages authors to explore the topic in detail. The different aspects of differential privacy, it's application in privacy protection and leakage of information, a comparative discussion, on the current research approaches in this field, its utility in the real world as well as the trade-offs - will be discussed.

Machine Learning and Control Theory

We survey in this article the connections between Machine Learning and Control Theory. Control Theory provide useful concepts and tools for Machine Learning. Conversely Machine Learning can be used to solve large control problems. In the first part of the paper, we develop the connections between reinforcement learning and Markov Decision Processes, which are discrete time control problems. In the second part, we review the concept of supervised learning and the relation with static optimization. Deep learning which extends supervised learning, can be viewed as a control problem. In the third part, we present the links between stochastic gradient descent and mean-field theory. Conversely, in the fourth and fifth parts, we review machine learning approaches to stochastic control problems, and focus on the deterministic case, to explain, more easily, the numerical algorithms.

Demystifying Self-Supervised Learning An Information-Theoretical Framework

Self-supervised representation learning adopts self-defined signals as supervision and uses the learned representation for downstream tasks, such as masked language modeling (e.g., BERT) for natural language processing and contrastive visual representation learning (e.g., SimCLR) for computer vision applications. In this paper, we present a theoretical framework explaining that self-supervised learning is likely to work under the assumption that only the shared information (e.g., contextual information or content) between the input (e.g., non-masked words or original images) and self-supervised signals (e.g., masked-words or augmented images) contributes to downstream tasks. Under this assumption, we demonstrate that self-supervisedly learned representation can extract task-relevant and discard task-irrelevant information. We further connect our theoretical analysis to popular contrastive and predictive (self-supervised) learning objectives. In the experimental section, we provide controlled experiments on two popular tasks 1) visual representation learning with various self-supervised learning objectives to empirically support our analysis; and 2) visual-textual representation learning to challenge that input and self-supervised signal lie in different modalities.

Can artificial intelligence (AI) be used to accurately detect tuberculosis (TB) from chest x-ray? A multiplatform evaluation of five AI products used for TB screening in a high TB-burden setting

Powered by artificial intelligence (AI), particularly deep neural networks, computer aided detection (CAD) tools can be trained to recognize TB-related abnormalities on chest radiographs, thereby screening large numbers of people and reducing the pressure on healthcare professionals. Addressing the lack of studies comparing the performance of different products, we evaluated five AI software platforms specific to TB CAD4TB (v6), InferReadDR (v2), Lunit INSIGHT for Chest Radiography (v4.9.0) , JF CXR-1 (v2) by and qXR (v3) by on an unseen dataset of chest X-rays collected in three TB screening center in Dhaka, Bangladesh. The 23,566 individuals included in the study all received a CXR read by a group of three Bangladeshi board-certified radiologists. A sample of CXRs were re-read by US board-certified radiologists. Xpert was used as the reference standard. All five AI platforms significantly outperformed the human readers. The areas under the receiver operating characteristic curves are qXR 0.91 (95% CI0.90-0.91), Lunit INSIGHT CXR 0.89 (95% CI0.88-0.89), InferReadDR 0.85 (95% CI0.84-0.86), JF CXR-1 0.85 (95% CI0.84-0.85), CAD4TB 0.82 (95% CI0.81-0.83). We also proposed a new analytical framework that evaluates a screening and triage test and informs threshold selection through tradeoff between cost efficiency and ability to triage. Further, we assessed the performance of the five AI algorithms across the subgroups of age, use cases, and prior TB history, and found that the threshold scores performed differently across different subgroups. The positive results of our evaluation indicate that these AI products can be useful screening and triage tools for active case finding in high TB-burden regions.

Predicting and Analyzing Law-Making in Kenya

Modelling and analyzing parliamentary legislation, roll-call votes and order of proceedings in developed countries has received significant attention in recent years. In this paper, we focused on understanding the bills introduced in a developing democracy, the Kenyan bicameral parliament. We developed and trained machine learning models on a combination of features extracted from the bills to predict the outcome - if a bill will be enacted or not. We observed that the texts in a bill are not as relevant as the year and month the bill was introduced and the category the bill belongs to.

Pruning neural networks without any data by iteratively conserving synaptic flow

Pruning the parameters of deep neural networks has generated intense interest due to potential savings in time, memory and energy both during training and at test time. Recent works have identified, through an expensive sequence of training and pruning cycles, the existence of winning lottery tickets or sparse trainable subnetworks at initialization. This raises a foundational question can we identify highly sparse trainable subnetworks at initialization, without ever training, or indeed without ever looking at the data? We provide an affirmative answer to this question through theory driven algorithm design. We first mathematically formulate and experimentally verify a conservation law that explains why existing gradient-based pruning algorithms at initialization suffer from layer-collapse, the premature pruning of an entire layer rendering a network untrainable. This theory also elucidates how layer-collapse can be entirely avoided, motivating a novel pruning algorithm Iterative Synaptic Flow Pruning (SynFlow). This algorithm can be interpreted as preserving the total flow of synaptic strengths through the network at initialization subject to a sparsity constraint. Notably, this algorithm makes no reference to the training data and consistently outperforms existing state-of-the-art pruning algorithms at initialization over a range of models (VGG and ResNet), datasets (CIFAR-10/100 and Tiny ImageNet), and sparsity constraints (up to 99.9 percent). Thus our data-agnostic pruning algorithm challenges the existing paradigm that data must be used to quantify which synapses are important.

On Coresets For Regularized Regression

We study the effect of norm based regularization on the size of coresets for regression problems. Specifically, given a matrix $ \mathbf{A} \in {\mathbb{R}}^{n \times d}$ with $n\gg d$ and a vector $\mathbf{b} \in \mathbb{R} ^ n $ and $\lambda > 0$, we analyze the size of coresets for regularized versions of regression of the form $\|\mathbf{Ax}-\mathbf{b}\|_p^r + \lambda\|{\mathbf{x}}\|_q^s$ . Prior work has shown that for ridge regression (where $p,q,r,s=2$) we can obtain a coreset that is smaller than the coreset for the unregularized counterpart i.e. least squares regression (Avron et al.). We show that when $r \neq s$, no coreset for regularized regression can have size smaller than the optimal coreset of the unregularized version. The well known lasso problem falls under this category and hence does not allow a coreset smaller than the one for least squares regression. We propose a modified version of the lasso problem and obtain for it a coreset of size smaller than the least square regression. We empirically show that the modified version of lasso also induces sparsity in solution, similar to the original lasso. We also obtain smaller coresets for $\ell_p$ regression with $\ell_p$ regularization. We extend our methods to multi response regularized regression. Finally, we empirically demonstrate the coreset performance for the modified lasso and the $\ell_1$ regression with $\ell_1$ regularization.

Cost-effective Interactive Attention Learning with Neural Attention Processes

We propose a novel interactive learning framework which we refer to as Interactive Attention Learning (IAL), in which the human supervisors interactively manipulate the allocated attentions, to correct the model's behavior by updating the attention-generating network. However, such a model is prone to overfitting due to scarcity of human annotations, and requires costly retraining. Moreover, it is almost infeasible for the human annotators to examine attentions on tons of instances and features. We tackle these challenges by proposing a sample-efficient attention mechanism and a cost-effective reranking algorithm for instances and features. First, we propose Neural Attention Process (NAP), which is an attention generator that can update its behavior by incorporating new attention-level supervisions without any retraining. Secondly, we propose an algorithm which prioritizes the instances and the features by their negative impacts, such that the model can yield large improvements with minimal human feedback. We validate IAL on various time-series datasets from multiple domains (healthcare, real-estate, and computer vision) on which it significantly outperforms baselines with conventional attention mechanisms, or without cost-effective reranking, with substantially less retraining and human-model interaction cost.

A t-distribution based operator for enhancing out of distribution robustness of neural network classifiers

Neural Network (NN) classifiers can assign extreme probabilities to samples that have not appeared during training (out-of-distribution samples) resulting in erroneous and unreliable predictions. One of the causes for this unwanted behaviour lies in the use of the standard softmax operator which pushes the posterior probabilities to be either zero or unity hence failing to model uncertainty. The statistical derivation of the softmax operator relies on the assumption that the distributions of the latent variables for a given class are Gaussian with known variance. However, it is possible to use different assumptions in the same derivation and attain from other families of distributions as well. This allows derivation of novel operators with more favourable properties. Here, a novel operator is proposed that is derived using $t$-distributions which are capable of providing a better description of uncertainty. It is shown that classifiers that adopt this novel operator can be more robust to out of distribution samples, often outperforming NNs that use the standard softmax operator. These enhancements can be reached with minimal changes to the NN architecture.

A bio-inspired bistable recurrent cell allows for long-lasting memory

Recurrent neural networks (RNNs) provide state-of-the-art performances in a wide variety of tasks that require memory. These performances can often be achieved thanks to gated recurrent cells such as gated recurrent units (GRU) and long short-term memory (LSTM). Standard gated cells share a layer internal state to store information at the network level, and long term memory is shaped by network-wide recurrent connection weights. Biological neurons on the other hand are capable of holding information at the cellular level for an arbitrary long amount of time through a process called bistability. Through bistability, cells can stabilize to different stable states depending on their own past state and inputs, which permits the durable storing of past information in neuron state. In this work, we take inspiration from biological neuron bistability to embed RNNs with long-lasting memory at the cellular level. This leads to the introduction of a new bistable biologically-inspired recurrent cell that is shown to strongly improves RNN performance on time-series which require very long memory, despite using only cellular connections (all recurrent connections are from neurons to themselves, i.e. a neuron state is not influenced by the state of other neurons). Furthermore, equipping this cell with recurrent neuromodulation permits to link them to standard GRU cells, taking a step towards the biological plausibility of GRU.

ConfNet2Seq Full Length Answer Generation from Spoken Questions

Conversational and task-oriented dialogue systems aim to interact with the user using natural responses through multi-modal interfaces, such as text or speech. These desired responses are in the form of full-length natural answers generated over facts retrieved from a knowledge source. While the task of generating natural answers to questions from an answer span has been widely studied, there has been little research on natural sentence generation over spoken content. We propose a novel system to generate full length natural language answers from spoken questions and factoid answers. The spoken sequence is compactly represented as a confusion network extracted from a pre-trained Automatic Speech Recognizer. This is the first attempt towards generating full-length natural answers from a graph input(confusion network) to the best of our knowledge. We release a large-scale dataset of 259,788 samples of spoken questions, their factoid answers and corresponding full-length textual answers. Following our proposed approach, we achieve comparable performance with best ASR hypothesis.

Stochastic matrix games with bandit feedback

We study a version of the classical zero-sum matrix game with unknown payoff matrix and bandit feedback, where the players only observe each others actions and a noisy payoff. This generalizes the usual matrix game, where the payoff matrix is known to the players. Despite numerous applications, this problem has received relatively little attention. Although adversarial bandit algorithms achieve low regret, they do not exploit the matrix structure and perform poorly relative to the new algorithms. The main contributions are regret analyses of variants of UCB and K-learning that hold for any opponent, e.g., even when the opponent adversarially plays the best response to the learner's mixed strategy. Along the way, we show that Thompson fails catastrophically in this setting and provide empirical comparison to existing algorithms.

A Survey on Generative Adversarial Networks Variants, Applications, and Training

The Generative Models have gained considerable attention in the field of unsupervised learning via a new and practical framework called Generative Adversarial Networks (GAN) due to its outstanding data generation capability. Many models of GAN have proposed, and several practical applications emerged in various domains of computer vision and machine learning. Despite GAN's excellent success, there are still obstacles to stable training. The problems are due to Nash-equilibrium, internal covariate shift, mode collapse, vanishing gradient, and lack of proper evaluation metrics. Therefore, stable training is a crucial issue in different applications for the success of GAN. Herein, we survey several training solutions proposed by different researchers to stabilize GAN training. We survey, (I) the original GAN model and its modified classical versions, (II) detail analysis of various GAN applications in different domains, (III) detail study about the various GAN training obstacles as well as training solutions. Finally, we discuss several new issues as well as research outlines to the topic.

GAP++ Learning to generate target-conditioned adversarial examples

Adversarial examples are perturbed inputs which can cause a serious threat for machine learning models. Finding these perturbations is such a hard task that we can only use the iterative methods to traverse. For computational efficiency, recent works use adversarial generative networks to model the distribution of both the universal or image-dependent perturbations directly. However, these methods generate perturbations only rely on input images. In this work, we propose a more general-purpose framework which infers target-conditioned perturbations dependent on both input image and target label. Different from previous single-target attack models, our model can conduct target-conditioned attacks by learning the relations of attack target and the semantics in image. Using extensive experiments on the datasets of MNIST and CIFAR10, we show that our method achieves superior performance with single target attack models and obtains high fooling rates with small perturbation norms.

Differentiable Expected Hypervolume Improvement for Parallel Multi-Objective Bayesian Optimization

In many real-world scenarios, decision makers seek to efficiently optimize multiple competing objectives in a sample-efficient fashion. Multi-objective Bayesian optimization (BO) is a common approach, but many existing acquisition functions do not have known analytic gradients and suffer from high computational overhead. We leverage recent advances in programming models and hardware acceleration for multi-objective BO using Expected Hypervolume Improvement (EHVI)---an algorithm notorious for its high computational complexity. We derive a novel formulation of $q$-Expected Hypervolume Improvement ($q$EHVI), an acquisition function that extends EHVI to the parallel, constrained evaluation setting. $q$EHVI is an exact computation of the joint EHVI of $q$ new candidate points (up to Monte-Carlo (MC) integration error). Whereas previous EHVI formulations rely on gradient-free acquisition optimization or approximated gradients, we compute exact gradients of the MC estimator via auto-differentiation, thereby enabling efficient and effective optimization using first-order and quasi-second-order methods. Lastly, our empirical evaluation demonstrates that $q$EHVI is computationally tractable in many practical scenarios and outperforms state-of-the-art multi-objective BO algorithms at a fraction of their wall time.

Black-Box Adversarial Attacks on Graph Neural Networks with Limited Node Access

We study the black-box attacks on graph neural networks (GNNs) under a novel and realistic constraint attackers have access to only a subset of nodes in the network, and they can only attack a small number of them. A node selection step is essential under this setup. We demonstrate that the structural inductive biases of GNN models can be an effective source for this type of attacks. Specifically, by exploiting the connection between the backward propagation of GNNs and random walks, we show that the common gradient-based white-box attacks can be generalized to the black-box setting via the connection between the gradient and an importance score similar to PageRank. In practice, we find attacks based on this importance score indeed increase the classification loss by a large margin, but they fail to significantly increase the mis-classification rate. Our theoretical and empirical analyses suggest that there is a discrepancy between the loss and mis-classification rate, as the latter presents a diminishing-return pattern when the number of attacked nodes increases. Therefore, we propose a greedy procedure to correct the importance score that takes into account of the diminishing-return pattern. Experimental results show that the proposed procedure can significantly increase the mis-classification rate of common GNNs on real-world data without access to model parameters nor predictions.

Near-Optimal Confidence Sequences for Bounded Random Variables

Many inference problems, such as sequential decision problems like A/B testing, adaptive sampling schemes like bandit selection, are often online in nature. The fundamental problem for online inference is to provide a sequence of confidence intervals that are valid uniformly over the growing-into-infinity sample sizes. To address this question, we provide a near-optimal confidence sequence for bounded random variables by utilizing Bentkus' concentration results. We show that it improves on the existing approaches that use the Cram{é}r-Chernoff technique such as the Hoeffding, Bernstein, and Bennett inequalities. The resulting confidence sequence is confirmed to be favorable in both synthetic coverage problems and an application to adaptive stopping algorithms.

Physics Regularized Gaussian Processes

We consider incorporating incomplete physics knowledge, expressed as differential equations with latent functions, into Gaussian processes (GPs) to improve their performance, especially for limited data and extrapolation. While existing works have successfully encoded such knowledge via kernel convolution, they only apply to linear equations with analytical Green's functions. The convolution can further restrict us from fusing physics with highly expressive kernels, e.g., deep kernels. To overcome these limitations, we propose Physics Regularized Gaussian Process (PRGP) that can incorporate both linear and nonlinear equations, does not rely on Green's functions, and is free to use arbitrary kernels. Specifically, we integrate the standard GP with a generative model to encode the differential equation in a principled Bayesian hybrid framework. For efficient and effective inference, we marginalize out the latent variables and derive a simplified model evidence lower bound (ELBO), based on which we develop a stochastic collapsed inference algorithm. Our ELBO can be viewed as a posterior regularization objective. We show the advantage of our approach in both simulation and real-world applications.

Neural ODEs for Multi-State Survival Analysis

Survival models are a popular tool for the analysis of time to event data with applications in medicine, engineering, economics and many more. Advances like the Cox proportional hazard model have enabled researchers to better describe hazard rates for the occurrence of single fatal events, but are limited by modeling assumptions, like proportionality of hazard rates and linear effects. Moreover, common phenomena are often better described through multiple states, for example, the progress of a disease might be modeled as healthy, sick and dead instead of healthy and dead, where the competing nature of death and disease has to be taken into account. Also, individual characteristics can vary significantly between observational units, like patients, resulting in idiosyncratic hazard rates and different disease trajectories. These considerations require flexible modeling assumptions. Current standard models, however, are often ill-suited for such an analysis. To overcome these issues, we propose the use of neural ordinary differential equations as a flexible and general method for estimating multi-state survival models by directly solving the Kolmogorov forward equations. To quantify the uncertainty in the resulting individual cause-specific hazard rates, we further introduce a variational latent variable model. We show that our model exhibits state-of-the-art performance on popular survival data sets and demonstrate its efficacy in a multi-state setting.

A Stochastic Subgradient Method for Distributionally Robust Non-Convex Learning

We consider a distributionally robust formulation of stochastic optimization problems arising in statistical learning, where robustness is with respect to uncertainty in the underlying data distribution. Our formulation builds on risk-averse optimization techniques and the theory of coherent risk measures. It uses semi-deviation risk for quantifying uncertainty, allowing us to compute solutions that are robust against perturbations in the population data distribution. We consider a large family of loss functions that can be non-convex and non-smooth and develop an efficient stochastic subgradient method. We prove that it converges to a point satisfying the optimality conditions. To our knowledge, this is the first method with rigorous convergence guarantees in the context of non-convex non-smooth distributionally robust stochastic optimization. Our method can achieve any desired level of robustness with little extra computational cost compared to population risk minimization. We also illustrate the performance of our algorithm on real datasets arising in convex and non-convex supervised learning problems.

Learning to Utilize Correlated Auxiliary Classical or Quantum Noise

This paper has two messages. First, we demonstrate that neural networks can learn to exploit correlations between noisy data and suitable auxiliary noise. In effect, the network learns to use the correlated auxiliary noise as an approximate key to decipher its noisy input data. Second, we show that the scaling behavior with increasing noise is such that future quantum machines should possess an advantage. For a concrete example, we reduce the image classification performance of convolutional neural networks (CNNs) by adding noise of different amounts and quality to the input images. We then demonstrate that the CNNs are able to partly recover their performance if, along with each noisy image, they are given auxiliary noise that is correlated with the image noise. We analyze the scaling of a CNN ability to learn and utilize these noise correlations as the level, dimensionality, or complexity of the noise is increased. We thereby find numerical and theoretical indications that quantum machines, due to their efficiency in representing complex correlations, could possess a significant advantage over classical machines.

Learning the Truth From Only One Side of the Story

Learning under one-sided feedback (i.e., where examples arrive in an online fashion and the learner only sees the labels for examples it predicted positively on) is a fundamental problem in machine learning -- applications include lending and recommendation systems. Despite this, there has been surprisingly little progress made in ways to mitigate the effects of the sampling bias that arises. We focus on generalized linear models and show that without adjusting for this sampling bias, the model may converge sub-optimally or even fail to converge to the optimal solution. We propose an adaptive Upper Confidence Bound approach that comes with rigorous regret guarantees and we show that it outperforms several existing methods experimentally. Our method leverages uncertainty estimation techniques for generalized linear models to more efficiently explore uncertain areas than existing approaches which explore randomly.

Gaussian states provide universal and versatile quantum reservoir computing

We establish the potential of continuous-variable Gaussian states in performing reservoir computing with linear dynamical systems in classical and quantum regimes. Reservoir computing is a machine learning approach to time series processing. It exploits the computational power, high-dimensional state space and memory of generic complex systems to achieve its goal, giving it considerable engineering freedom compared to conventional computing or recurrent neural networks. We prove that universal reservoir computing can be achieved without nonlinear terms in the Hamiltonian or non-Gaussian resources. We find that encoding the input time series into Gaussian states is both a source and a means to tune the nonlinearity of the overall input-output map. We further show that reservoir computing can in principle be powered by quantum fluctuations, such as squeezed vacuum, instead of classical intense fields. Our results introduce a new research paradigm for quantum reservoir computing and the engineering of Gaussian quantum states, pushing both fields into a new direction.

A two-level solution to fight against dishonest opinions in recommendation-based trust systems

In this paper, we propose a mechanism to deal with dishonest opinions in recommendation-based trust models, at both the collection and processing levels. We consider a scenario in which an agent requests recommendations from multiple parties to build trust toward another agent. At the collection level, we propose to allow agents to self-assess the accuracy of their recommendations and autonomously decide on whether they would participate in the recommendation process or not. At the processing level, we propose a recommendations aggregation technique that is resilient to collusion attacks, followed by a credibility update mechanism for the participating agents. The originality of our work stems from its consideration of dishonest opinions at both the collection and processing levels, which allows for better and more persistent protection against dishonest recommenders. Experiments conducted on the Epinions dataset show that our solution yields better performance in protecting the recommendation process against Sybil attacks, in comparison with a competing model that derives the optimal network of advisors based on the agents' trust values.

Fair Classification with Noisy Protected Attributes

Due to the growing deployment of classification algorithms in various social contexts, developing methods that are fair with respect to protected attributes such as gender or race is an important problem. However, the information about protected attributes in datasets may be inaccurate due to either issues with data collection or when the protected attributes used are themselves predicted by algorithms. Such inaccuracies can prevent existing fair classification algorithms from achieving desired fairness guarantees. Motivated by this, we study fair classification problems when the protected attributes in the data may be ``noisy''. In particular, we consider a noise model where any protected type may be flipped to another with some fixed probability. We propose a ``denoised'' fair optimization formulation that can incorporate very general fairness goals via a set of constraints, mitigates the effects of such noise perturbations, and comes with provable guarantees. Empirically, we show that our framework can lead to near-perfect statistical parity with only a slight loss in accuracy for significant noise levels.

The Penalty Imposed by Ablated Data Augmentation

There is a set of data augmentation techniques that ablate parts of the input at random. These include input dropout, cutout, and random erasing. We term these techniques ablated data augmentation. Though these techniques seems similar in spirit and have shown success in improving model performance in a variety of domains, we do not yet have a mathematical understanding of the differences between these techniques like we do for other regularization techniques like L1 or L2. First, we study a formal model of mean ablated data augmentation and inverted dropout for linear regression. We prove that ablated data augmentation is equivalent to optimizing the ordinary least squares objective along with a penalty that we call the Contribution Covariance Penalty and inverted dropout, a more common implementation than dropout in popular frameworks, is equivalent to optimizing the ordinary least squares objective along with Modified L2. For deep networks, we demonstrate an empirical version of the result if we replace contributions with attributions and coefficients with average gradients, i.e., the Contribution Covariance Penalty and Modified L2 Penalty drop with the increase of the corresponding ablated data augmentation across a variety of networks.

A Heuristically Self-Organised Linguistic Attribute Deep Learning in Edge Computing For IoT Intelligence

With the development of Internet of Things (IoT), IoT intelligence becomes emerging technology. "Curse of Dimensionality" is the barrier of data fusion in edge devices for the success of IoT intelligence. A Linguistic Attribute Hierarchy (LAH), embedded with Linguistic Decision Trees (LDTs), can represent a new attribute deep learning. In contrast to the conventional deep learning, an LAH could overcome the shortcoming of missing interpretation by providing transparent information propagation through the rules, produced by LDTs in the LAH. Similar to the conventional deep learning, the computing complexity of optimising LAHs blocks the applications of LAHs. In this paper, we propose a heuristic approach to constructing an LAH, embedded with LDTs for decision making or classification by utilising the distance correlations between attributes and between attributes and the goal variable. The set of attributes is divided to some attribute clusters, and then they are heuristically organised to form a linguistic attribute hierarchy. The proposed approach was validated with some benchmark decision making or classification problems from the UCI machine learning repository. The experimental results show that the proposed self-organisation algorithm can construct an effective and efficient linguistic attribute hierarchy. Such a self-organised linguistic attribute hierarchy embedded with LDTs can not only efficiently tackle "curse of dimensionality" in a single LDT for data fusion with massive attributes, but also achieve better or comparable performance on decision making or classification, compared to the single LDT for the problem to be solved. The self-organisation algorithm is much efficient than the Genetic Algorithm in Wrapper for the optimisation of LAHs. This makes it feasible to embed the self-organisation algorithm in edge devices for IoT intelligence.

Serverless on FHIR Deploying machine learning models for healthcare on the cloud

Machine Learning (ML) plays a vital role in implementing digital health. The advances in hardware and the democratization of software tools have revolutionized machine learning. However, the deployment of ML models -- the mathematical representation of the task to be performed -- for effective and efficient clinical decision support at the point of care is still a challenge. ML models undergo constant improvement of their accuracy and predictive power with a high turnover rate. Updating models consumed by downstream health information systems is essential for patient safety. We introduce a functional taxonomy and a four-tier architecture for cloud-based model deployment for digital health. The four tiers are containerized microservices for maintainability, serverless architecture for scalability, function as a service for portability and FHIR schema for discoverability. We call this architecture Serverless on FHIR and propose this as a standard to deploy digital health applications that can be consumed by downstream systems such as EMRs and visualization tools.

You can also browse papers in other categories.