[1] arXiv:2006.09899 [pdf]
HiFLEx -- a highly flexible package to reduce cross-dispersed echelle spectra
We describe a flexible data reduction package for high resolution cross-dispersed echelle data. This open-source package is developed in Python and includes optional GUIs for most of the steps. It does not require any pre-knowledge about the form or position of the echelle orders. It has been tested on cross-dispersed echelle spectrographs between 13k and 115k resolution (bifurcated fiber-fed spectrograph ESO-HARPS and single fiber-fed spectrograph TNT-MRES). HiFLEx can be used to determine radial velocities and is designed to use the TERRA package, but it can also control other radial velocity packages such as CERES and SERVAL to perform the radial velocity analysis. Tests on HARPS data indicate radial velocity results within 3 m/s of the literature pipelines without any fine tuning of extraction parameters.
[2] arXiv:2006.09416 [pdf]
Empirical completeness assessment of the Gaia DR2, Pan-STARRS 1 and ASAS-SN-II RR Lyrae catalogues
RR Lyrae stars are an important and widely used tracer of the most ancient populations of our Galaxy, mainly due to their standard candle nature. The availability of large scale surveys of variable stars is allowing us to trace the structure of our entire Galaxy, even in previously inaccessible areas like the Galactic disc. In this work we aim to provide an empirical assessment of the completeness of the three largest RR Lyrae catalogues available: Gaia DR2, PanSTARRS-1 and ASAS-SN-II. Using a joint probabilistic analysis of the three surveys we compute 2D and 3D completeness maps in each survey's full magnitude range. At the bright end (G20deg); ASAS-SN-II has the best completeness at low latitude for RRab and at all latitudes for RRc. At the faint end (G>13), Gaia DR2 is the most complete catalogue for both RR Lyrae types, at any latitude, with median completeness rates of 95% (RRab) and >85% (RRc) outside the ecliptic plane (|\beta|>25deg). We confirm a high and uniform completeness of PanSTARRS-1 RR Lyrae at 91% (RRab) and 82% (RRc) down to G~18, and provide the first estimate of its completeness at low galactic latitude (|b|<20deg) at an estimated median 65% (RRab) and 50-60% (RRc). Our results are publicly available as 2D and 3D completeness maps, and as functions to evaluate each survey's completeness versus distance or per line of sight.
[3] arXiv:2006.09387 [pdf]
Simplified fast detector simulation in MadAnalysis 5
We introduce a new simplified fast detector simulator in the MadAnalysis 5 platform. The Python-like interpreter of the programme has been augmented by new commands allowing for a detector parametrisation through smearing and efficiency functions. At run time, an associated C++ code is automatically generated and executed to produce reconstructed-level events. In addition, we have extended the MadAnalysis 5 recasting infrastructure to support our detector emulator and provide predefined LHC detector configurations. We have compared predictions obtained with our approach to those from the Delphes 3 software, both for Standard Model processes and a few new physics signals. Results generally agree to a level of about 10% or better, although Delphes 3 is 30% to 50% slower and requires 100 times more disk space.
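As a rough illustration of what a smearing-and-efficiency parametrisation does, the sketch below applies a flat reconstruction efficiency and a Gaussian energy smearing to a list of truth-level energies. It is plain Python, not MadAnalysis 5 syntax; the function names, the 10% relative resolution and the 90% efficiency are assumptions chosen only for this example.

```python
import random


def smear_energy(energy, rel_resolution, rng):
    """Return the energy after Gaussian smearing (hypothetical resolution model)."""
    smeared = rng.gauss(energy, rel_resolution * energy)
    return max(smeared, 0.0)  # a reconstructed energy cannot be negative


def reconstruct(truth_energies, rel_resolution=0.10, efficiency=0.90, seed=1):
    """Apply an efficiency cut (random rejection), then smear the survivors."""
    rng = random.Random(seed)
    reco = []
    for energy in truth_energies:
        if rng.random() > efficiency:
            continue  # object is lost and never reconstructed
        reco.append(smear_energy(energy, rel_resolution, rng))
    return reco


truth = [50.0] * 10000           # toy sample: 10000 objects at 50 GeV
reco = reconstruct(truth)
kept_fraction = len(reco) / len(truth)
mean_energy = sum(reco) / len(reco)
```

In this toy setup the kept fraction should sit close to the assumed efficiency and the mean reconstructed energy close to the truth value, since the smearing is unbiased.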
[4] arXiv:2006.09379 [pdf]
CosTuuM: polarized thermal dust emission by magnetically oriented spheroidal grains
We present the new open source C++-based Python library CosTuuM that can be used to generate infrared absorption and emission coefficients for arbitrary mixtures of spheroidal dust grains that are (partially) aligned with a magnetic field. We outline the algorithms underlying the software, demonstrate the accuracy of our results using benchmarks from literature, and use our tool to investigate some commonly used approximative recipes. We find that the linear polarization fraction for a partially aligned dust grain mixture can be accurately represented by an appropriate linear combination of perfectly aligned grains and grains that are randomly oriented, but that the commonly used picket fence alignment breaks down for short wavelengths. We also find that for a fixed dust grain size, the absorption coefficients and linear polarization fraction for a realistic mixture of grains with various shapes cannot both be accurately represented by a single representative grain with a fixed shape, but that instead an average over an appropriate shape distribution should be used. Insufficient knowledge of an appropriate shape distribution is the main obstacle in obtaining accurate optical properties. CosTuuM is available as a standalone Python library and can be used to generate optical properties to be used in radiative transfer applications.
[5] arXiv:2006.08976 [pdf]
Analysing the resilience of the European commodity production system with PyResPro, the Python Production Resilience package
This paper presents an object-oriented Python package to compute the annual production resilience indicator. The annual production resilience indicator can be applied to different anthropic and natural systems such as agricultural production, natural vegetation and water resources. Here, we show an example of resilience analysis of the economic values of the agricultural production in Europe. The analysis is conducted for individual time-series in order to estimate the resilience of a single commodity and for groups of time-series in order to estimate the overall resilience of diversified production systems composed of different crops and/or different countries. The proposed software is powerful and easy to use with publicly available datasets such as the one used in this study.
[6] arXiv:2006.08963 [pdf]
Enhanced force-field calibration via machine learning
The influence of microscopic force fields on the motion of Brownian particles plays a fundamental role in a broad range of fields, including soft matter, biophysics, and active matter. Often, the experimental calibration of these force fields relies on the analysis of the trajectories of these Brownian particles. However, such an analysis is not always straightforward, especially if the underlying force fields are non-conservative or time-varying, driving the system out of thermodynamic equilibrium. Here, we introduce a toolbox to calibrate microscopic force fields by analyzing the trajectories of a Brownian particle using machine learning, namely recurrent neural networks. We demonstrate that this machine-learning approach outperforms standard methods when characterizing the force fields generated by harmonic potentials if the available data are limited. More importantly, it provides a tool to calibrate force fields in situations for which there are no standard methods, such as non-conservative and time-varying force fields. In order to make this method readily available for other users, we provide a Python software package named DeepCalib, which can be easily personalized and optimized for specific applications.
[7] arXiv:2006.08945 [pdf]
The algebra and machine representation of statistical models
As the twin movements of open science and open source bring an ever greater share of the scientific process into the digital realm, new opportunities arise for the meta-scientific study of science itself, including of data science and statistics. Future science will likely see machines play an active role in processing, organizing, and perhaps even creating scientific knowledge. To make this possible, large engineering efforts must be undertaken to transform scientific artifacts into useful computational resources, and conceptual advances must be made in the organization of scientific theories, models, experiments, and data. This dissertation takes steps toward digitizing and systematizing two major artifacts of data science, statistical models and data analyses. Using tools from algebra, particularly categorical logic, a precise analogy is drawn between models in statistics and logic, enabling statistical models to be seen as models of theories, in the logical sense. Statistical theories, being algebraic structures, are amenable to machine representation and are equipped with morphisms that formalize the relations between different statistical methods. Turning from mathematics to engineering, a software system for creating machine representations of data analyses, in the form of Python or R programs, is designed and implemented. The representations aim to capture the semantics of data analyses, independent of the programming language and libraries in which they are implemented.
[8] arXiv:2006.08640 [pdf]
Forecasting Chemical Abundance Precision for Extragalactic Stellar Archaeology
Increasingly powerful and multiplexed spectroscopic facilities promise detailed chemical abundance patterns for millions of resolved stars in galaxies beyond the Milky Way (MW). Here, we employ the Cramér-Rao Lower Bound (CRLB) to forecast the precision to which stellar abundances for metal-poor, low-mass stars outside the MW can be measured for 41 current (e.g., Keck, MMT, VLT, DESI) and planned (e.g., MSE, JWST, ELTs) spectrograph configurations. We show that moderate resolution ($R\lesssim5000$) spectroscopy at blue-optical wavelengths ($\lambda\lesssim4500$ Å) (i) enables the recovery of 2-4 times as many elements as red-optical spectroscopy ($5000\lesssim\lambda\lesssim10000$ Å) at similar or higher resolutions ($R\sim 10000$) and (ii) can constrain the abundances of several neutron capture elements to $\lesssim$0.3 dex. We further show that high-resolution ($R\gtrsim 20000$), low S/N ($\sim$10 pixel$^{-1}$) spectra contain rich abundance information when modeled with full spectral fitting techniques. We demonstrate that JWST/NIRSpec and ELTs can recover (i) $\sim$10 and 30 elements, respectively, for metal-poor red giants throughout the Local Group and (ii) [Fe/H] and [$\alpha$/Fe] for resolved stars in galaxies out to several Mpc with modest integration times. We show that select literature abundances are within a factor of $\sim$2 (or better) of our CRLBs. We suggest that, like ETCs, CRLBs should be used when planning stellar spectroscopic observations. We include an open source Python package, \texttt{Chem-I-Calc}, that allows users to compute CRLBs for spectrographs of their choosing.
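For readers unfamiliar with the CRLB, a minimal sketch may help. Assuming Gaussian, uncorrelated pixel noise, the Fisher matrix is $F_{ij} = \sum_k (\partial f_k/\partial\theta_i)(\partial f_k/\partial\theta_j)/\sigma_k^2$ and the bound on parameter $\theta_i$ is $\sqrt{(F^{-1})_{ii}}$. The code below is not the Chem-I-Calc API; the function names and the four-pixel toy gradients are hypothetical, and the 2x2 inverse is written out by hand to keep the example dependency-free.

```python
import math


def fisher_matrix(gradients, sigma):
    """Fisher matrix for uncorrelated Gaussian noise.

    gradients[i][k] is d(flux in pixel k) / d(parameter i);
    sigma[k] is the 1-sigma noise in pixel k.
    """
    n = len(gradients)
    return [
        [
            sum(gi * gj / s ** 2
                for gi, gj, s in zip(gradients[i], gradients[j], sigma))
            for j in range(n)
        ]
        for i in range(n)
    ]


def crlb_two_params(fisher):
    """Return sqrt(diag(F^-1)) for a 2x2 Fisher matrix, inverted analytically."""
    (a, b), (c, d) = fisher
    det = a * d - b * c
    return math.sqrt(d / det), math.sqrt(a / det)


# Toy example (assumed values): two parameters, four pixels, noise 0.1 per pixel.
toy_gradients = [[1.0, 1.0, 1.0, 1.0],
                 [0.0, 1.0, 2.0, 3.0]]
toy_sigma = [0.1] * 4
bound_0, bound_1 = crlb_two_params(fisher_matrix(toy_gradients, toy_sigma))
```

The bounds scale linearly with the noise level, which is why such forecasts are natural companions to exposure time calculators.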
[9] arXiv:2006.08444 [pdf]
Taxonomy and Practical Evaluation of Primality Testing Algorithms
Modern cryptography algorithms are commonly used to ensure information security. Prime numbers are needed in many asymmetric cryptography algorithms. For example, the RSA algorithm selects two large prime numbers and multiplies them together to obtain a large composite number whose factorization is very difficult. Producing a prime number is not an easy task, as primes are not distributed regularly among the integers. Primality testing algorithms are used to determine whether a particular number is prime or composite. In this paper, we survey several primality testing algorithms, presenting the pros and cons, the time complexity, and a brief summary of each algorithm. In addition, these algorithms are implemented in Java and Python to evaluate the efficiency of both the algorithms and the programming languages.
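To make the idea of a primality test concrete, here is a minimal sketch of the Miller-Rabin test, one widely used algorithm of this kind. Whether and how the paper implements it is not stated in the abstract, so this is an illustration rather than the authors' code. The fixed base set (the first twelve primes) is an assumption that makes the test deterministic for 64-bit inputs; for larger numbers the result should be read as "probably prime".

```python
def is_probable_prime(n):
    """Miller-Rabin test with fixed bases; deterministic for n < 2**64."""
    if n < 2:
        return False
    small_primes = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37)
    # Trial division handles small inputs and cheap composites.
    for p in small_primes:
        if n % p == 0:
            return n == p
    # Write n - 1 as d * 2**s with d odd.
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for a in small_primes:
        x = pow(a, d, n)  # modular exponentiation
        if x == 1 or x == n - 1:
            continue  # base a does not witness compositeness
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False  # base a proves n is composite
    return True
```

Usage is a single call, for example `is_probable_prime(2**61 - 1)`, which tests a known Mersenne prime.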
[10] arXiv:2006.08296 [pdf]
[11] arXiv:2006.08031 [pdf]
From conception to clinical trial: IViST -- the first multi-sensor-based platform for real-time In Vivo dosimetry and Source Tracking in HDR brachytherapy
This study aims to introduce IViST (In Vivo Source Tracking), a novel multi-sensor dosimetry platform for real-time treatment monitoring in HDR brachytherapy. IViST is a platform that comprises 3 parts: 1) an optimized and characterized multi-point plastic scintillator dosimeter (3 points mPSD; using BCF-60, BCF-12, and BCF-10 scintillators), 2) a compact assembly of photomultiplier tubes (PMTs) coupled to dichroic mirrors and filters for high-sensitivity scintillation light collection, and 3) a Python-based graphical user interface used for system management and signal processing. IViST can simultaneously measure dose, triangulate source position, and measure dwell time. By making 100 000 measurements/s, IViST samples enough data to quickly perform key QA/QC tasks such as identifying a wrong individual dwell time or interchanged transfer tubes. By using 3 co-linear sensors and planned information for an implant geometry (from DICOM RT), the platform can also triangulate source position in real time. A clinical trial using the IViST system is presently ongoing.
[12] arXiv:2006.07484 [pdf]
dagger: A Python Framework for Reproducible Machine Learning Experiment Orchestration
Many research directions in machine learning, particularly in deep learning, involve complex, multi-stage experiments, commonly involving state-mutating operations acting on models along multiple paths of execution. Although machine learning frameworks provide clean interfaces for defining model architectures and unbranched flows, burden is often placed on the researcher to track experimental provenance, that is, the state tree that leads to a final model configuration and result in a multi-stage experiment. Originally motivated by analysis reproducibility in the context of neural network pruning research, where multi-stage experiment pipelines are common, we present dagger, a framework to facilitate reproducible and reusable experiment orchestration. We describe the design principles of the framework and example usage.
[13] arXiv:2006.07456 [pdf]
Evidence of Crowding on Russell 3000 Reconstitution Events
We develop a methodology which replicates with great accuracy the FTSE Russell index reconstitutions, including the quarterly rebalancings due to new initial public offerings (IPOs). While using only data available in the CRSP US Stock database for our index reconstruction, we demonstrate the accuracy of this methodology by comparing it to the original Russell US indexes for the period from 1989 to 2019. A Python package that generates the replicated indexes is also provided. As an application, we use our index reconstruction protocol to compute the permanent and temporary price impact on the Russell 3000 annual additions and deletions, and on the quarterly additions of new IPOs. We find that the index portfolios following the Russell 3000 index and rebalanced on an annual basis are overall more crowded than those following the index on a quarterly basis. This phenomenon implies that transaction costs of indexing strategies could be significantly reduced by buying new IPO additions in proximity to quarterly rebalance dates.
[14] arXiv:2006.07357 [pdf]
Hindsight Logging for Model Training
Due to the long time-lapse between the triggering and detection of a bug in the machine learning lifecycle, model developers favor data-centric logfile analysis over traditional interactive debugging techniques. But when useful execution data is missing from the logs after training, developers have little recourse beyond re-executing training with more logging statements, or guessing. In this paper, we present hindsight logging, a novel technique for efficiently querying ad-hoc execution data, long after model training. The goal of hindsight logging is to enable analysis of past executions as if the logs had been exhaustive. Rather than materialize logs up front, we draw on the idea of physiological database recovery, and adapt it to arbitrary programs. Developers can query the state in past runs of a program by adding arbitrary log statements to their code; a combination of physical and logical recovery is used to quickly produce the output of the new log statements. We implement these ideas in Flor, a record-replay system for hindsight logging in Python. We evaluate Flor's performance on eight different model training workloads from current computer vision and NLP benchmarks. We find that Flor replay achieves near-ideal scale-out and order-of-magnitude speedups in replay, with just 1.47% average runtime overhead from record.
[15] arXiv:2006.07236 [pdf]
A novel fusion Python application of data mining techniques to evaluate airborne magnetic datasets
A novel fusion Python application of data mining techniques (DMT) was designed and implemented to locate, identify, and delineate the subsurface structural pattern (SSP) of source rocks for the features of interest underlying the study area. The techniques of machine learning tools (MLT) helped to define magnetic anomaly source (MAS) rock and the various depths of these subsurface source rock features. The principal objective is to use straightforward DMT to locate magnetic anomaly features of interest that host mineralization. The required geo-referenced radiometric data, which facilitated the delineation of SSP, were sufficiently covered by combining the application of the Oasis Montaj© 2014 source parameter imaging functions. Relevant basic filtering techniques of data reduction were used to improve the signal-to-noise (S/N) ratio and hence automatically determine depths to the various engrossed features from gridded geo-referenced airborne magnetic datasets before the DMT application was performed. Geological source rock models (GSRM) (i.e., rock contacts, dykes) served as the delineated features based on their structural index (SI) values. The anomalies were perpendicularly oriented, with few inconsequential nonvertical features, and all were generally aligned in NNE-SSW and NE-SW directions. The DMT approach showed that magnetic anomaly patterns (MAP) control the SSP and the ground surface stratigraphy (GSS) on a geological time-scale (GTS) by fusing the subsurface gravitational structural features (SGSF) in the area. The DMT facilitated the determination of depths to these subsurface geological source rock features with a maximum depth of approximately 1.277 km using a 3x3 window size to map the concealed features of interest.
[16] arXiv:2006.07162 [pdf]
An Optimized Ly$\alpha$ Forest Inversion Tool Based on a Quantitative Comparison of Existing Reconstruction Methods
We present a same-level comparison of the most prominent inversion methods for the reconstruction of the matter density field in the quasi-linear regime from the Ly$\alpha$ forest flux. Moreover, we present a pathway for refining the reconstruction in the framework of numerical optimization. We apply this approach to construct a novel hybrid method. The methods which are used so far for matter reconstructions are the Richardson-Lucy algorithm, an iterative Gauss-Newton method and a statistical approach assuming a one-to-one correspondence between matter and flux. We study these methods for high spectral resolutions such that thermal broadening becomes relevant. The inversion methods are compared on synthetic data (generated with the lognormal approach) with respect to their performance, accuracy, their stability against noise, and their robustness against systematic uncertainties. We conclude that the iterative Gauss-Newton method offers the most accurate reconstruction, in particular at small S/N, but has also the largest numerical complexity and requires the strongest assumptions. The other two algorithms are faster, comparably precise at small noise-levels, and, in the case of the statistical approach, more robust against inaccurate assumptions on the thermal history of the intergalactic medium (IGM). We use these results to refine the statistical approach using regularization. Our new approach has low numerical complexity and makes few assumptions about the history of the IGM, and is shown to be the most accurate reconstruction at small S/N, even if the thermal history of the IGM is not known. Our code will be made publicly available under this https URL.
[17] arXiv:2006.06856 [pdf]
Bandit-PAM: Almost Linear Time $k$-Medoids Clustering via Multi-Armed Bandits
Clustering is a ubiquitous task in data science. Compared to the commonly used $k$-means clustering algorithm, $k$-medoids clustering algorithms require the cluster centers to be actual data points and support arbitrary distance metrics, allowing for greater interpretability and the clustering of structured objects. Current state-of-the-art $k$-medoids clustering algorithms, such as Partitioning Around Medoids (PAM), are iterative and are quadratic in the dataset size $n$ for each iteration, being prohibitively expensive for large datasets. We propose Bandit-PAM, a randomized algorithm inspired by techniques from multi-armed bandits, that significantly improves the computational efficiency of PAM. We theoretically prove that Bandit-PAM reduces the complexity of each PAM iteration from $O(n^2)$ to $O(n \log n)$ and returns the same results with high probability, under assumptions on the data that often hold in practice. We empirically validate our results on several large-scale real-world datasets, including a coding exercise submissions dataset from this http URL, the 10x Genomics 68k PBMC single-cell RNA sequencing dataset, and the MNIST handwritten digits dataset. We observe that Bandit-PAM returns the same results as PAM while performing up to 200x fewer distance computations. The improvements demonstrated by Bandit-PAM enable $k$-medoids clustering on a wide range of applications, including identifying cell types in large-scale single-cell data and providing scalable feedback for students learning computer science online. We also release Python and C++ implementations of our algorithm.
[18] arXiv:2006.06691 [pdf]
PyPopStar: A Python-Based Simple Stellar Population Synthesis Code for Star Clusters
We present PyPopStar, an open-source Python package that simulates simple stellar populations. The strength of PyPopStar is its modular interface which offers the user control of 13 input properties including (but not limited to) the Initial Mass Function, stellar multiplicity, extinction law, and the metallicity-dependent stellar evolution and atmosphere model grids used. The user also has control over the Initial-Final Mass Relation in order to produce compact stellar remnants (black holes, neutron stars, and white dwarfs). We demonstrate several outputs produced by the code, including color-magnitude diagrams, HR-diagrams, luminosity functions, and mass functions. PyPopStar is object-oriented and extensible, and we welcome contributions from the community. The code and documentation are available on GitHub and ReadtheDocs, respectively.
[19] arXiv:2006.06639 [pdf]
PESummary: the code agnostic Parameter Estimation Summary page builder
PESummary is a Python software package for processing and visualising data from any parameter estimation code. The easy-to-use Python executable scripts and extensive online documentation have resulted in PESummary becoming a key component in the international gravitational-wave analysis toolkit. PESummary has been developed to be more than just a post-processing tool, with all outputs fully self-contained. PESummary has become central to making gravitational-wave inference analysis open and easily reproducible.
[20] arXiv:2006.06062 [pdf]
Empirical Time Complexity of Generic Dijkstra Algorithm
Generic Dijkstra is a novel algorithm for finding the optimal shortest path in both wavelength-division multiplexed networks (WDM) and elastic optical networks (EON), claimed to outperform known algorithms considerably. Because of its novelty, it has not been independently implemented and verified. Its time complexity also remains unknown. In this paper we provide an independent open source implementation of Generic Dijkstra in the Python language. We confirm the correctness of the algorithm and its superior performance. In comparison to the Filtered Graphs algorithm, Generic Dijkstra is approximately 3.5 times faster. In 95% of calls Generic Dijkstra is faster than Filtered Graphs. Moreover, we perform run-time analysis and show that Generic Dijkstra running time grows quadratically with the number of graph vertices and logarithmically with the number of edge units. We also discover that the running time of the Generic Dijkstra algorithm as a function of network utilization is not monotonic: peak running time is at approximately 0.25 network utilization. This is the first complexity analysis of the Generic Dijkstra algorithm.
[21] arXiv:2006.05648 [pdf]
Evaluating Graph Vulnerability and Robustness using TIGER
The study of network robustness is a critical tool in the characterization and understanding of complex interconnected systems such as transportation, infrastructure, communication, and computer networks. Through analyzing and understanding the robustness of these networks we can (1) quantify network vulnerability and robustness, (2) augment a network's structure to resist attacks and recover from failure, and (3) control the dissemination of entities on the network (e.g., viruses, propaganda). While significant research has been conducted on all of these tasks, no comprehensive open-source toolbox currently exists to assist researchers and practitioners in this important topic. This lack of available tools hinders reproducibility and examination of existing work, development of new research, and dissemination of new ideas. We contribute TIGER, an open-sourced Python toolbox to address these challenges. TIGER contains 22 graph robustness measures with both original and fast approximate versions; 17 failure and attack strategies; 15 heuristic and optimization based defense techniques; and 4 simulation tools. By democratizing the tools required to study network robustness, our goal is to assist researchers and practitioners in analyzing their own networks; and facilitate the development of new research in the field. TIGER is open-sourced at this https URL
[22] arXiv:2006.05528 [pdf]
Environmental effects with Frozen Density Embedding in Real-Time Time-Dependent Density Functional Theory using localized basis functions
Frozen Density Embedding (FDE) represents a versatile embedding scheme to describe the environmental effect on the electron dynamics in molecular systems. The extension of the general theory of FDE to the real-time time-dependent Kohn-Sham method has previously been presented and implemented in plane-waves and periodic boundary conditions (Pavanello et al. J. Chem. Phys. 142, 154116, 2015). In the current paper, we extend our recent formulation of the real-time time-dependent Kohn-Sham method based on localized basis set functions and developed within the Psi4NumPy framework (De Santis et al. J. Chem. Theory Comput. 2020, 16, 2410) to the FDE scheme. The latter has been implemented in its "uncoupled" flavor (in which the time evolution is only carried out for the active subsystem, while the environment subsystems remain at their ground state), using and adapting the FDE implementation already available in the PyEmbed module of the scripting framework PyADF. The implementation was facilitated by the fact that both Psi4NumPy and PyADF, being native Python APIs, provided an ideal development framework that exploits Python's advantages in terms of code readability and reusability. We demonstrate that the inclusion of the FDE potential does not introduce any numerical instability in the time propagation of the density matrix of the active subsystem and that, in the limit of a weak external field, the numerical results for low-lying transition energies are consistent with those obtained using the reference FDE calculations based on linear response TDDFT. The method is also found to give stable numerical results in the presence of a strong external field inducing non-linear effects.
[23] arXiv:2006.05380 [pdf]
Tropes in films: an initial analysis
TVTropes is a wiki that describes tropes and which ones are used in which artistic work. We are mostly interested in films, so after releasing the TropeScraper Python module that extracts data from this site, in this report we use scraped information to describe statistically how tropes and films are related to each other and how these relations evolve in time. In order to do so, we generated a dataset through the tool TropeScraper in April 2020. We have compared it to the latest snapshot of DB Tropes, a dataset covering the same site and published in July 2016, providing descriptive analysis, studying the fundamental differences and addressing the evolution of the wiki in terms of the number of tropes, the number of films and connections. The results show that the number of tropes and films doubled and their relations quadrupled, and films are, by and large, better described in terms of tropes. However, while the types of films with the most tropes have not changed significantly in years, the list of most popular tropes has. This outcome can help shed some light on how popular tropes evolve, which ones become more popular or fade away, and in general how a set of tropes represents a film and might be a key to its success. The dataset generated, the information extracted, and the summaries provided are useful resources for any research involving films and tropes. They can provide proper context and explanations about the behaviour of models built on top of the dataset, including the generation of new content or its use in machine learning.
[24] arXiv:2006.05241 [pdf]
New Fusion Algorithm provides an alternative approach to Robotic Path planning
With rapid growth in technology and automation, human tasks are being taken over by robots, as robots have proven to be better in both speed and precision. One of the major and widespread uses of these robots is in industry, where they are employed to carry massive loads in and around work areas. As these working environments might not be completely localized and could be dynamically changing, new approaches must be evaluated to guarantee a crash-free way of performing duties. This paper presents a new and efficient fusion algorithm for solving the path planning problem in a custom 2D environment. This fusion algorithm integrates improved and optimized versions of both the A* algorithm and the artificial potential field method. Firstly, an initial or preliminary path is planned in the environmental model by adopting the A* algorithm. The heuristic function of this A* algorithm is optimized and improved according to the environmental model. This is followed by selecting and saving the key nodes in the initial path. Lastly, on the basis of these saved key nodes, path smoothing is done by the artificial potential field method. Our simulation results, carried out using Python visualization libraries, indicate that the new fusion algorithm is feasible and superior in smoothness performance and can serve as a time-efficient and cheaper alternative to conventional A* strategies of path planning.
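The first stage described above, planning a preliminary path with A*, can be sketched as follows. This is a textbook A* on a 4-connected grid with a Manhattan-distance heuristic, not the authors' optimized heuristic, and it omits the key-node selection and potential-field smoothing stages. The grid encoding ('#' for obstacles) and the function name are assumptions made for the example.

```python
import heapq


def astar(grid, start, goal):
    """Shortest 4-connected path on a grid of strings; '#' marks an obstacle.

    Returns the list of (row, col) cells from start to goal, or None.
    """
    rows, cols = len(grid), len(grid[0])

    def heuristic(cell):
        # Manhattan distance: admissible for unit-cost 4-connected moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(heuristic(start), 0, start)]  # (f = g + h, g, cell)
    came_from = {start: None}
    best_g = {start: 0}
    while open_heap:
        _, g, cell = heapq.heappop(open_heap)
        if cell == goal:
            path = []
            while cell is not None:  # walk parents back to the start
                path.append(cell)
                cell = came_from[cell]
            return path[::-1]
        if g > best_g[cell]:
            continue  # stale heap entry, a cheaper route was found already
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] != "#":
                new_g = g + 1
                if new_g < best_g.get(nxt, float("inf")):
                    best_g[nxt] = new_g
                    came_from[nxt] = cell
                    heapq.heappush(open_heap, (new_g + heuristic(nxt), new_g, nxt))
    return None


demo_grid = ["....",
             ".##.",
             "...."]
demo_path = astar(demo_grid, (0, 0), (2, 3))
```

A path produced this way is made of axis-aligned steps, which is the kind of jagged route a later smoothing stage would be expected to improve.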
[25] arXiv:2006.04951 [pdf]
Network visualizations with Pyvis and VisJS
Pyvis is a Python module that enables visualizing and interactively manipulating network graphs in the Jupyter notebook, or as a standalone web application. Pyvis is built on top of the powerful and mature VisJS JavaScript library, which allows for fast and responsive interactions while also abstracting away the low-level JavaScript and HTML. This means that elements of the rendered graph visualization, such as node/edge attributes can be specified within Python and shipped to the JavaScript layer for VisJS to render. This declarative approach makes it easy to quickly explore graph visualizations and investigate data relationships. In addition, Pyvis is highly customizable so that colors, sizes, and hover tooltips can be assigned to the rendered graph. The network graph layout is controlled by a front-end physics engine that is configurable from a Python interface, allowing for the detailed placement of the graph elements. In this paper, we outline use cases for Pyvis with specific examples to highlight key features for any analysis workflow. A brief overview of Pyvis' implementation describes how the Python front-end binding uses simple Pyvis calls.
[26] arXiv:2006.04942 [pdf]
CRISP: A Probabilistic Model for Individual-Level COVID-19 Infection Risk Estimation Based on Contact Data
We present CRISP (COVID-19 Risk Score Prediction), a probabilistic graphical model for COVID-19 infection spread through a population based on the SEIR model where we assume access to (1) mutual contacts between pairs of individuals across time across various channels (e.g., Bluetooth contact traces), as well as (2) test outcomes at given times for infection, exposure and immunity tests. Our micro-level model keeps track of the infection state for each individual at every point in time, ranging from susceptible, exposed, infectious to recovered. We develop a Monte Carlo EM algorithm to infer contact-channel specific infection transmission probabilities. Our algorithm uses Gibbs sampling to draw samples of the latent infection status of each individual over the entire time period of analysis, given the latent infection status of all contacts and test outcome data. Experimental results with simulated data demonstrate our CRISP model can be parametrized by the reproduction factor $R_0$ and exhibits population-level infectiousness and recovery time series similar to those of the classical SEIR model. However, due to the individual contact data, this model allows fine grained control and inference for a wide range of COVID-19 mitigation and suppression policy measures. Moreover, the algorithm is able to support efficient testing in a test-trace-isolate approach to contain COVID-19 infection spread. To the best of our knowledge, this is the first model with efficient inference for COVID-19 infection spread based on individual-level contact data; most epidemic models are macro-level models that reason over entire populations. The implementation of CRISP is available in Python and C++ at this https URL.
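For orientation, the classical population-level SEIR model that this abstract uses as its point of comparison can be sketched in a few lines. This is a deterministic forward-Euler sketch of the textbook compartment model, not the CRISP individual-level model or its inference algorithm; the function name and the parameters `beta` (transmission rate), `sigma` (incubation rate) and `gamma` (recovery rate) are illustrative assumptions. In this textbook form the basic reproduction number is `beta / gamma`.

```python
def simulate_seir(s0, e0, i0, r0, beta, sigma, gamma, steps, dt=1.0):
    """Forward-Euler integration of the classical SEIR compartment model.

    Returns a list of (S, E, I, R) tuples, one per time step including t = 0.
    """
    n = float(s0 + e0 + i0 + r0)  # total population, conserved by the model
    s, e, i, r = float(s0), float(e0), float(i0), float(r0)
    history = [(s, e, i, r)]
    for _ in range(steps):
        new_exposed = beta * s * i / n * dt   # S -> E
        new_infectious = sigma * e * dt       # E -> I
        new_recovered = gamma * i * dt        # I -> R
        s -= new_exposed
        e += new_exposed - new_infectious
        i += new_infectious - new_recovered
        r += new_recovered
        history.append((s, e, i, r))
    return history
```

A macro-level model of this kind tracks only four population totals, whereas the model described above keeps a state per individual per time step.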
[27] arXiv:2006.04845 [pdf]
This paper focuses on mitigating the impact of stragglers in distributed learning systems. Unlike existing results designed for a fixed number of stragglers, we develop a new scheme called Adaptive Gradient Coding (AGC) that flexibly tolerates a varying number of stragglers. Our scheme gives an optimal tradeoff between computation load, straggler tolerance and communication cost. In particular, it allows the communication cost to be minimized according to the real-time number of stragglers in practical environments. Implementations on Amazon EC2 clusters using Python with the mpi4py package verify the flexibility in several situations.
[28] arXiv:2006.04836 [pdf]
A Modified AUC for Training Convolutional Neural Networks Taking Confidence into Account
The receiver operating characteristic (ROC) curve is an informative tool in binary classification, and the Area Under the ROC Curve (AUC) is a popular metric for reporting the performance of binary classifiers. In this paper, we first present a comprehensive review of the ROC curve and the AUC metric. Next, we propose a modified version of AUC that takes the confidence of the model into account and, at the same time, incorporates AUC into the Binary Cross Entropy (BCE) loss used for training a convolutional neural network for classification tasks. We demonstrate this on two datasets: MNIST and prostate MRI. Furthermore, we have published GenuineAI, a new Python library, which provides the functions for conventional AUC and the proposed modified AUC along with metrics including sensitivity, specificity, recall, precision, and F1 for each point of the ROC curve.
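As a reminder of what the conventional AUC measures, the sketch below computes it from its rank interpretation: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. This is the standard metric only; it does not implement the modified, confidence-aware AUC proposed in the paper, and the function name `auc` is not taken from the GenuineAI library.

```python
def auc(labels, scores):
    """Conventional AUC via pairwise comparison of positive and negative scores."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))
```

Because only the ordering of scores matters here, two classifiers with the same ranking but very different confidence levels receive the same value, which is the limitation the abstract's modified metric is meant to address.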
[29] arXiv:2006.04311 [pdf]
Little Ball of Fur: A Python Library for Graph Sampling
Sampling graphs is an important task in data mining. In this paper, we describe Little Ball of Fur, a Python library that includes more than twenty graph sampling algorithms. Our goal is to make node, edge, and exploration-based network sampling techniques accessible to a large number of professionals, researchers, and students in a single streamlined framework. We created this framework with a focus on a coherent application public interface which has a convenient design, generic input data requirements, and reasonable baseline settings of algorithms. Here we overview these design foundations of the framework in detail with illustrative code snippets. We show the practical usability of the library by estimating various global statistics of social networks and web graphs. Experiments demonstrate that Little Ball of Fur can speed up node and whole graph embedding techniques considerably while only mildly deteriorating the predictive value of distilled features.
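To make "exploration-based sampling" concrete, here is a minimal random-walk sampler written against a plain adjacency dictionary. It is a standard-library sketch, not the Little Ball of Fur API: the function name, the dictionary input format, and the assumption of a connected undirected graph in which every node has at least one neighbor are all illustrative.

```python
import random


def random_walk_sample(adjacency, n_nodes, seed=None):
    """Sample n_nodes by a simple random walk and return the induced subgraph.

    adjacency maps each node to a list of its neighbors. The graph is assumed
    to be connected and undirected, so the walk can reach enough nodes.
    """
    if n_nodes > len(adjacency):
        raise ValueError("cannot sample more nodes than the graph contains")
    rng = random.Random(seed)
    current = rng.choice(sorted(adjacency))
    sampled = {current}
    while len(sampled) < n_nodes:
        # Move to a uniformly chosen neighbor of the current node.
        current = rng.choice(adjacency[current])
        sampled.add(current)
    # Keep only edges whose endpoints were both visited.
    return {u: [v for v in adjacency[u] if v in sampled] for u in sampled}
```

Global statistics such as the degree distribution can then be estimated on the returned subgraph instead of the full graph.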
[30] arXiv:2006.03879 [pdf]
Scalene: Scripting-Language Aware Profiling for Python
Existing profilers for scripting languages (a.k.a. "glue" languages) like Python suffer from numerous problems that drastically limit their usefulness. They impose order-of-magnitude overheads, report information at too coarse a granularity, or fail in the face of threads. Worse, past profilers, essentially variants of their counterparts for C, are oblivious to the fact that optimizing code in scripting languages requires information about code spanning the divide between the scripting language and libraries written in compiled languages. This paper introduces scripting-language aware profiling, and presents Scalene, an implementation of scripting-language aware profiling for Python. Scalene employs a combination of sampling, inference, and disassembly of byte-codes to efficiently and precisely attribute execution time and memory usage to either Python, which developers can optimize, or library code, which they cannot. It includes a novel sampling memory allocator that reports line-level memory consumption and trends with low overhead, helping developers reduce footprints and identify leaks. Finally, it introduces a new metric, copy volume, to help developers root out insidious copying costs across the Python/library boundary, which can drastically degrade performance. Scalene works for single or multi-threaded Python code, is precise, reporting detailed information at the line granularity, while imposing modest overheads (26% to 53%).
[31] arXiv:2006.03562 [pdf]
Blind De-Blurring of Microscopy Images for Cornea Cell Counting
Cornea cell count is an important diagnostic tool commonly used by practitioners to assess the health of a patient's cornea. Unfortunately, clinical specular microscopy requires the acquisition of a large number of images at different focus depths because the curved shape of the cornea makes it impossible to acquire a single all-in-focus image. This paper describes two methods and their implementations to reduce the number of images required to run a cell-counting algorithm, thus shortening the duration of the examination and increasing the patient's comfort. The basic idea is to apply de-blurring techniques to the raw images to reconstruct the out-of-focus areas and expand the sharp regions of the image. Our approach is based on blind-deconvolution reconstruction that performs a depth-from-deblur so as to either model a Gaussian kernel or fit kernels from an ad hoc lookup table.
[32] arXiv:2006.03511 [pdf]
Unsupervised Translation of Programming Languages
A transcompiler, also known as source-to-source translator, is a system that converts source code from a high-level programming language (such as C++ or Python) to another. Transcompilers are primarily used for interoperability, and to port codebases written in an obsolete or deprecated language (e.g. COBOL, Python 2) to a modern one. They typically rely on handcrafted rewrite rules, applied to the source code abstract syntax tree. Unfortunately, the resulting translations often lack readability, fail to respect the target language conventions, and require manual modifications in order to work properly. The overall translation process is time-consuming and requires expertise in both the source and target languages, making code-translation projects expensive. Although neural models significantly outperform their rule-based counterparts in the context of natural language translation, their applications to transcompilation have been limited due to the scarcity of parallel data in this domain. In this paper, we propose to leverage recent approaches in unsupervised machine translation to train a fully unsupervised neural transcompiler. We train our model on source code from open source GitHub projects, and show that it can translate functions between C++, Java, and Python with high accuracy. Our method relies exclusively on monolingual source code, requires no expertise in the source or target languages, and can easily be generalized to other programming languages. We also build and release a test set composed of 852 parallel functions, along with unit tests to check the correctness of translations. We show that our model outperforms rule-based commercial baselines by a significant margin.
[33] arXiv:2006.03294 [pdf]
Autonomous Gaussian decomposition of the Galactic Ring Survey. II. The Galactic distribution of 13CO
Knowledge about the distribution of CO emission in the Milky Way is essential to understand the impact of Galactic environment on the formation and evolution of structures in the interstellar medium. However, currently our insight about the fraction of CO in spiral arm and interarm regions is still limited by large uncertainties in assumed rotation curve models or distance determination techniques. In this work we use a Bayesian approach to obtain the current best assessment of the distribution of 13CO from the Galactic Ring Survey. We performed two different distance estimates that either included or excluded a model for Galactic features. We also include a prior for the solution of the kinematic distance ambiguity that was determined from a compilation of literature distances and an assumed size-linewidth relationship. We find that the fraction of 13CO emission associated with spiral arm features varies from 76% to 84% between the two distance runs. The vertical distribution of the gas is concentrated around the Galactic midplane showing FWHM values of ~75 pc. We do not find any significant difference between gas emission properties associated with spiral arm and interarm features. In particular the distribution of velocity dispersion values of gas emission in spurs and spiral arms is very similar. We detect a trend of higher velocity dispersion values with increasing heliocentric distance, which we attribute to beam averaging effects caused by differences in spatial resolution. We argue that the true distribution of the gas emission is likely more similar to a combination of our two distance results, and highlight the importance of using complementary distance estimations to safeguard against the pitfalls of any single approach. We conclude that the methodology presented in this work is a promising way to determine distances to gas emission features in Galactic plane surveys.
[34] arXiv:2006.02932 [pdf]
Vulnerability Analysis of 2500 Docker Hub Images
The use of container technology has skyrocketed during the last few years, with Docker as the leading container platform. Docker's online repository for publicly available container images, called Docker Hub, hosts over 3.5 million images at the time of writing, making it the world's largest community of container images. We perform an extensive vulnerability analysis of 2500 Docker images. It is of particular interest to perform this type of analysis because the vulnerability landscape is a rapidly changing category, the vulnerability scanners are constantly developed and updated, new vulnerabilities are discovered, and the volume of images on Docker Hub is increasing every day. Our main findings reveal that (1) the number of newly introduced vulnerabilities on Docker Hub is rapidly increasing; (2) certified images are the most vulnerable; (3) official images are the least vulnerable; (4) there is no correlation between the number of vulnerabilities and image features (i.e., number of pulls, number of stars, and days since the last update); (5) the most severe vulnerabilities originate from two of the most popular scripting languages, JavaScript and Python; and (6) Python 2.x packages and jackson-databind packages contain the highest number of severe vulnerabilities. We perceive our study as the most extensive vulnerability analysis published in the open literature in the last couple of years.
[35] arXiv:2006.02837 [pdf]
Extending XACC for Quantum Optimal Control
Quantum computing vendors are beginning to open up application programming interfaces for direct pulse-level quantum control. With this, programmers can begin to describe quantum kernels of execution via sequences of arbitrary pulse shapes. This opens new avenues of research and development with regards to smart quantum compilation routines that enable direct translation of higher-level digital assembly representations to these native pulse instructions. In this work, we present an extension to the XACC system-level quantum-classical software framework that directly enables this compilation lowering phase via user-specified quantum optimal control techniques. This extension enables the translation of digital quantum circuit representations to equivalent pulse sequences that are optimal with respect to the backend system dynamics. Our work is modular and extensible, enabling third party optimal control techniques and strategies in both C++ and Python. We demonstrate this extension with familiar gradient-based methods like gradient ascent pulse engineering (GRAPE), gradient optimization of analytic controls (GOAT), and Krotov's method. Our work serves as a foundational component of future quantum-classical compiler designs that lower high-level programmatic representations to low-level machine instructions.
[36] arXiv:2006.02267 [pdf]
FastONN -- Python based open-source GPU implementation for Operational Neural Networks
Operational Neural Networks (ONNs) have recently been proposed as a special class of artificial neural networks for grid structured data. They enable heterogeneous non-linear operations to generalize the widely adopted convolution-based neuron model. This work introduces a fast GPU-enabled library for training operational neural networks, FastONN, which is based on a novel vectorized formulation of the operational neurons. Leveraging automatic reverse-mode differentiation for backpropagation, FastONN enables increased flexibility with the incorporation of new operator sets and customized gradient flows. Additionally, bundled auxiliary modules offer interfaces for performance tracking and checkpointing across different data partitions and customized metrics.
[37] arXiv:2006.01969 [pdf]
REL: An Entity Linker Standing on the Shoulders of Giants
Entity linking is a standard component in modern retrieval systems that is often performed by third-party toolkits. Despite the plethora of open source options, it is difficult to find a single system that has a modular architecture where certain components may be replaced, does not depend on external sources, can easily be updated to newer Wikipedia versions, and, most important of all, has state-of-the-art performance. The REL system presented in this paper aims to fill that gap. Building on state-of-the-art neural components from natural language processing research, it is provided as a Python package as well as a web API. We also report on an experimental comparison against both well-established systems and the current state-of-the-art on standard entity linking benchmarks.
[38] arXiv:2006.01818 [pdf]
Securing Your Collaborative Jupyter Notebooks in the Cloud using Container and Load Balancing Services
Jupyter has become the go-to platform for developing data applications, but data and security concerns, especially when dealing with healthcare, have become paramount for many institutions and applications dealing with sensitive information. How then can we continue to enjoy the data analysis and machine learning opportunities provided by Jupyter and the Python ecosystem while guaranteeing auditable compliance with security and privacy concerns? We describe the architecture and implementation of a cloud-based platform built on Jupyter that integrates with Amazon Web Services (AWS) and uses containerized services without exposing the platform to the vulnerabilities present in Kubernetes and JupyterHub. This architecture addresses the HIPAA requirements to ensure both security and privacy of data. The architecture uses an AWS service to provide JSON Web Tokens (JWT) for authentication as well as network control. Furthermore, our architecture enables secure collaboration and sharing of Jupyter notebooks. Even though our platform is focused on Jupyter notebooks and JupyterLab, it also supports R-Studio and bespoke applications that share the same authentication mechanisms. Further, the platform can be extended to cloud services other than AWS.
[39] arXiv:2006.01804 [pdf]
Fast and accurate aberration estimation from 3D bead images using convolutional neural networks
Estimating optical aberrations from volumetric intensity images is a key step in sensorless adaptive optics for microscopy. Here we describe a method (PHASENET) for fast and accurate aberration measurement from experimentally acquired 3D bead images using convolutional neural networks. Importantly, we show that networks trained only on synthetically generated data can successfully predict aberrations from experimental images. We demonstrate our approach on two data sets acquired with different microscopy modalities and find that PHASENET yields results better than or comparable to classical methods while being orders of magnitude faster. We furthermore show that the number of focal planes required for satisfactory prediction is related to different symmetry groups of Zernike modes. PHASENET is freely available as open-source software in Python.
[40] arXiv:2006.01635 [pdf]
direpack: A Python 3 package for state-of-the-art statistical dimension reduction methods
The direpack package aims to bring a set of modern statistical dimension reduction techniques into the Python universe as a single, consistent package. The dimension reduction methods included fall into three categories: projection pursuit based dimension reduction, sufficient dimension reduction, and robust M estimators for dimension reduction. As a corollary, regularized regression estimators based on these reduced dimension spaces are provided as well, ranging from classical principal component regression up to sparse partial robust M regression. The package also contains a set of classical and robust pre-processing utilities, including generalized spatial signs, as well as dedicated plotting functionality and cross-validation utilities. Finally, direpack has been written consistently with the scikit-learn API, such that the estimators can flawlessly be included into (statistical and/or machine) learning pipelines in that framework.
[41] arXiv:2006.01171 [pdf]
Regression Enrichment Surfaces: a Simple Analysis Technique for Virtual Drug Screening Models
We present a new method for understanding the performance of a model in virtual drug screening tasks. While most virtual screening problems present as a mix between ranking and classification, the models are typically trained as regression models, presenting a problem that requires either a choice of a cutoff or a ranking measure. Our method, regression enrichment surfaces (RES), is based on the goal of virtual screening to detect as many of the top-performing treatments as possible. We outline the history of virtual screening performance measures and the idea behind RES. We offer a Python package and details on how to implement and interpret the results.
[42] arXiv:2006.01082 [pdf]
Transport of dust grain particles in the accretion disk
Entrainment of dust particles in the flow inside and outside of the proto-planetary disk has implications for the disk evolution and composition of planets. Using quasi-stationary solutions in our star-disk simulations as a background, we add dust particles of different radii in post-processing of the results, using our Python tool DUSTER. The distribution and motion of particles in the disk is followed in the cases with and without the backflow in the disk. We also compare the results with and without the radiation pressure included in the computation.
[43] arXiv:2006.01015 [pdf]
PlenoptiSign: an optical design tool for plenoptic imaging
Plenoptic imaging enables a light-field to be captured by a single monocular objective lens and an array of micro lenses attached to an image sensor. Metric distances of the light-field's depth planes remain unapparent prior to acquisition. Recent research showed that sampled depth locations rely on the parameters of the system's optical components. This paper presents PlenoptiSign, which implements these findings as a Python software package to assist in the experimental or prototyping stage of a plenoptic system.
[44] arXiv:2006.00909 [pdf]
Using Cosmic Rays detected by HST as Geophysical Markers I: Detection and Characterization of Cosmic Rays
The Hubble Space Telescope (HST) has been operational for almost 30 years and throughout that time it has been bombarded by high energy charged particles colloquially referred to as cosmic rays. In this paper, we present a comprehensive study of more than 1.2 billion cosmic rays observed with HST using a custom-written Python package, \texttt{HSTcosmicrays}, that is available to the astronomical community. We analyzed $75,908$ dark calibration files taken as part of routine calibration programs for five different CCD imagers with operational coverage of Solar Cycle 23 and 24. We observe the expected modulation of galactic cosmic rays by solar activity. For the three imagers with the largest non-uniformity in thickness, we independently confirm the overall structure produced by fringing analyses by analyzing cosmic ray strikes across the detector field of view. We analyze STIS/CCD observations taken as HST crosses over the South Atlantic Anomaly and find a peak cosmic ray flux of $\sim1100$ $CR/s/cm^2$. We find strong evidence for two spatially confined regions over North America and Australia that exhibit increased cosmic ray fluxes at the $5\sigma$ level.
[45] arXiv:2006.00233 [pdf]
JoXSZ: Joint X-SZ fitter for galaxy clusters
High-resolution observations of the thermal Sunyaev-Zeldovich (SZ) effect and of the X-ray emission of galaxy clusters are becoming more and more widespread, offering us a unique asset for the study of the thermodynamic properties of the intracluster medium. We present JoXSZ, a Bayesian forward-modelling Python code designed to jointly fit the SZ data and the three-dimensional X-ray data cube. JoXSZ is able to derive the thermodynamic profiles of galaxy clusters, for the first time making full and consistent use of all the information contained in the observations. JoXSZ will be publicly available on GitHub in the near future.
[46] arXiv:2006.00038 [pdf]
Quasi-orthonormal Encoding for Machine Learning Applications
Most machine learning models, especially artificial neural networks, require numerical, not categorical data. We briefly describe the advantages and disadvantages of common encoding schemes. For example, one-hot encoding is commonly used for attributes with a few unrelated categories and word embeddings for attributes with many related categories (e.g., words). Neither is suitable for encoding attributes with many unrelated categories, such as diagnosis codes in healthcare applications. Application of one-hot encoding for diagnosis codes, for example, can result in extremely high dimensionality with low sample size problems or artificially induce machine learning artifacts, not to mention the explosion of computing resources needed. Quasi-orthonormal encoding (QOE) fills the gap. We briefly show how QOE compares to one-hot encoding. We provide example code showing how to implement QOE using popular ML libraries such as TensorFlow and PyTorch, and a demonstration of QOE applied to MNIST handwriting samples.
[47] arXiv:2005.14344 [pdf]
Chook -- A comprehensive suite for generating binary optimization problems with planted solutions
We present Chook, an open-source Python-based tool to generate discrete optimization problems of tunable complexity with a priori known solutions. Chook provides a cross-platform unified environment for solution planting using a number of techniques, such as tile planting, Wishart planting, equation planting, and deceptive cluster loop planting. Chook also incorporates planted solutions for higher-order (beyond quadratic) binary optimization problems. The support for various planting schemes and the tunable hardness allows the user to generate problems with a wide range of complexity on different graph topologies ranging from hypercubic lattices to fully-connected graphs.
[48] arXiv:2005.14143 [pdf]
DQM Tools and Techniques of the SND Detector
The SND detector operates at the VEPP-2000 collider (BINP, Novosibirsk). To improve event selection for physics analysis and facilitate online detector control, we developed a new data quality monitoring (DQM) system. The system includes online and reprocessing control modules, automatic decision-making scripts, and interactive (web-based) and programmatic (Python) access to various quality estimates. This access is implemented with a Node.js server, with data stored in a MySQL RDBMS. We describe here the general system logic, its components and some implementation details.
[49] arXiv:2005.13663 [pdf]
A Comparative Study of Long and Short GRBs. II. A Multi-wavelength Method to distinguish Type II (massive star) and Type I (compact star) GRBs
Gamma-ray bursts (GRBs) are empirically classified as long-duration GRBs (LGRBs, $>$ 2s) and short-duration GRBs (SGRBs, $0$). The only confirmed Type I GRB, GRB 170817A, has log $O({\rm III})=-10$. According to this criterion, the supernova-less long GRBs 060614 and 060505 belong to Type I, and two controversial short GRBs 090426 and 060121 belong to Type II.
[50] arXiv:2005.13483 [pdf]
Kernel methods library for pattern analysis and machine learning in python
Kernel methods have proven to be powerful techniques for pattern analysis and machine learning (ML) in a variety of domains. However, many of their original or advanced implementations remain in Matlab. With the incredible rise and adoption of Python in the ML and data science world, there is a clear need for a well-defined library that not only enables the use of popular kernels, but also allows easy definition of customized kernels to fine-tune them for diverse applications. The kernelmethods library fills that important void in the Python ML ecosystem in a domain-agnostic fashion, allowing the sample data type to be anything from numerical, categorical, graphs or a combination of them. In addition, this library provides a number of well-defined classes to make various kernel-based operations efficient (for large scale datasets), modular (for ease of domain adaptation), and inter-operable (across different ecosystems). The library is available at this https URL.
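The central object in any kernel method is the kernel (Gram) matrix of pairwise similarities between samples. The sketch below builds one with a Gaussian (RBF) kernel using only the standard library; it illustrates the general idea of plugging a user-defined kernel function into a matrix builder and does not use or reproduce the kernelmethods library's classes. The names `rbf_kernel` and `kernel_matrix` and the `gamma` parameter are illustrative assumptions.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def kernel_matrix(samples, kernel):
    """Evaluate kernel(a, b) for every pair of samples (the Gram matrix)."""
    return [[kernel(a, b) for b in samples] for a in samples]
```

Any other symmetric function of two samples, for example one defined on categorical values or graphs, could be passed as `kernel` in the same way.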
[51] arXiv:2005.12315 [pdf]
JoXSZ: Joint X-SZ fitting code for galaxy clusters
The thermal Sunyaev-Zeldovich (SZ) effect and the X-ray emission offer separate and highly complementary probes of the thermodynamics of the intracluster medium. We present JoXSZ, the first publicly available code designed to jointly fit SZ and X-ray data coming from various instruments to derive the thermodynamic profiles of galaxy clusters. JoXSZ follows a fully Bayesian forward-modelling approach, accounts for the SZ calibration uncertainty and X-ray background level systematic. It improves upon most state-of-the-art, and not publicly available, analyses because it adopts the correct Poisson-Gauss expression for the joint likelihood, makes full use of the information contained in the observations, even in the case of missing values within the datasets, has a more inclusive error budget, and adopts a consistent temperature across the various parts of the code, allowing for differences between X-ray and SZ gas mass weighted temperatures when required by the user. JoXSZ accounts for beam smearing and data analysis transfer function, accounts for the temperature and metallicity dependencies of the SZ and X-ray conversion factors, adopts flexible parametrization for the thermodynamic profiles, and on user request allows either adopting or relaxing the assumption of hydrostatic equilibrium (HE). When HE holds, JoXSZ uses a physical (positive) prior on the radial derivative of the enclosed mass and derives the mass profile and overdensity radii $r_\Delta$. For these reasons, JoXSZ goes beyond simple SZ and electron density fits. We illustrate the use of JoXSZ by combining Chandra and NIKA data on the high-redshift cluster CL J1226.9+3332. The code is written in Python, it is fully documented and the users are free to customize their analysis in accordance with their needs and requirements. JoXSZ is publicly available on GitHub.
[52] arXiv:2005.12131 [pdf]
MAISE: Construction of neural network interatomic models and evolutionary structure optimization
MAISE is an open-source package for materials modeling and prediction. The code's main feature is an automated generation of neural network (NN) interatomic potentials for use in global structure searches. The systematic construction of Behler-Parrinello-type NN models approximating ab initio energy and forces relies on two approaches introduced in our recent studies. An evolutionary sampling scheme for generating reference structures improves the NNs' mapping of regions visited in unconstrained searches, while a stratified training approach enables the creation of standardized NN models for multiple elements. A more flexible NN architecture proposed here expands the applicability of the stratified scheme for an arbitrary number of elements. The full workflow in the NN development is managed with a customizable 'MAISE-NET' wrapper written in Python. The global structure optimization capability in MAISE is based on an evolutionary algorithm applicable for nanoparticles, films, and bulk crystals. A multitribe extension of the algorithm allows for an efficient simultaneous optimization of nanoparticles in a given size range. Implemented structure analysis functions include fingerprinting with radial distribution functions and finding space groups with the SPGLIB tool. This work overviews MAISE's available features, constructed models, and confirmed predictions.
[53] arXiv:2005.11890 [pdf]
mvlearn: Multiview Machine Learning in Python
As data are generated more and more from multiple disparate sources, multiview datasets, where each sample has features in distinct views, have ballooned in recent years. However, no comprehensive package exists that enables non-specialists to use these methods easily. mvlearn is a Python library that implements the leading multiview machine learning methods. Its simple API closely follows that of scikit-learn for increased ease-of-use. The package can be installed from the Python Package Index (PyPI) or the conda package manager and is released under the Apache 2.0 open-source license. The documentation, detailed tutorials, and all releases are available at this https URL.
[54] arXiv:2005.11841 [pdf]
scadnano: A browser-based, easily scriptable tool for designing DNA nanostructures
We introduce $\textit{scadnano}$ (this https URL) (short for "scriptable cadnano"), a computational tool for designing synthetic DNA structures. Its design is based heavily on cadnano, the most widely-used software for designing DNA origami, with three main differences: 1. scadnano runs entirely in the browser, with $\textit{no software installation}$ required. 2. scadnano designs, while they can be edited manually, can also be created and edited by a $\textit{well-documented Python scripting library}$, to help automate tedious tasks. 3. The scadnano file format is $\textit{easily human-readable}$. This goal is closely aligned with the scripting library, intended to be helpful when debugging scripts or interfacing with other software. The format is also somewhat more expressive than that of cadnano, able to describe a broader range of DNA structures than just DNA origami.
[55] arXiv:2005.11820 [pdf]
SWinvert: A workflow for performing rigorous surface wave inversions
SWinvert is a workflow developed at The University of Texas at Austin for the inversion of surface wave dispersion data. SWinvert encourages analysts to investigate inversion uncertainty and non-uniqueness in shear wave velocity (Vs) by providing a systematic procedure and open-source tools for surface wave inversion. In particular, the workflow enables the use of multiple layering parameterizations to address the inversion's non-uniqueness, multiple global searches for each parameterization to address the inverse problem's non-linearity, and quantification of Vs uncertainty in the resulting profiles. To encourage its adoption, the SWinvert workflow is supported by an open-source Python package, SWprepost, for surface wave inversion pre- and post-processing and an application on the DesignSafe-CyberInfrastructure, SWbatch, that enlists high-performance computing for performing batch-style surface wave inversion through an intuitive and easy-to-use web interface. While the workflow uses the Dinver module of the popular open-source Geopsy software as its inversion engine, the principles presented can be readily extended to other inversion programs. To illustrate the effectiveness of the SWinvert workflow and to develop a set of benchmarks for use in future surface wave inversion studies, synthetic experimental dispersion data for 12 subsurface models of varying complexity are inverted. While the effects of inversion uncertainty and non-uniqueness are shown to be minimal for simple subsurface models characterized by broadband dispersion data, these effects cannot be ignored in the Vs profiles derived for more complex models with band-limited dispersion data. The SWinvert workflow is shown to provide a methodical procedure and a powerful set of tools for performing rigorous surface wave inversions and quantifying the uncertainty in the resulting Vs profiles.
[56] arXiv:2005.11644 [pdf]
miniKanren as a Tool for Symbolic Computation in Python
In this article, we give a brief overview of the current state and future potential of symbolic computation within the Python statistical modeling and machine learning community. We detail the use of miniKanren as an underlying framework for term rewriting and symbolic mathematics, as well as its ability to orchestrate the use of existing Python libraries. We also discuss the relevance and potential of relational programming for implementing more robust, portable, domain-specific "math-level" optimizations--with a slight focus on Bayesian modeling. Finally, we describe the work going forward and raise some questions regarding potential cross-overs between statistical modeling and programming language theory.
[57] arXiv:2005.11577 [pdf]
PhyAAt: Physiology of Auditory Attention to Speech Dataset
Auditory attention to natural speech is a complex brain process. Its quantification from physiological signals can be valuable for improving and widening the range of applications of current brain-computer-interface systems; however, it remains a challenging task. In this article, we present a dataset of physiological signals collected from an experiment on auditory attention to natural speech. In this experiment, auditory stimuli consisting of reproductions of English sentences in different auditory conditions were presented to 25 non-native participants, who were asked to transcribe the sentences. During the experiment, 14-channel electroencephalogram, galvanic skin response, and photoplethysmogram signals were collected from each participant. Based on the number of correctly transcribed words, an attention score was obtained for each auditory stimulus presented to subjects. A strong correlation ($pthis https URL.
[58] arXiv:2005.11394 [pdf]
MANGO: A Python Library for Parallel Hyperparameter Tuning
Tuning hyperparameters for machine learning algorithms is a tedious task, one that is typically done manually. To enable automated hyperparameter tuning, recent works have started to use techniques based on Bayesian optimization. However, to practically enable automated tuning for large scale machine learning training pipelines, significant gaps remain in existing libraries, including lack of abstractions, fault tolerance, and flexibility to support scheduling on any distributed computing framework. To address these challenges, we present Mango, a Python library for parallel hyperparameter tuning. Mango enables the use of any distributed scheduling framework, implements intelligent parallel search strategies, and provides rich abstractions for defining complex hyperparameter search spaces that are compatible with scikit-learn. Mango is comparable in performance to Hyperopt, another widely used library. Mango is available open-source and is currently used in production at Arm Research to provide state-of-the-art hyperparameter tuning capabilities.
[59] arXiv:2005.11288 [pdf]
EinsteinPy: A Community Python Package for General Relativity
This paper presents EinsteinPy (version 0.3), a community-developed Python package for gravitational and relativistic astrophysics. Python is a free, easy-to-use, high-level programming language which has seen a huge expansion in the number of its users and developers in recent years. In particular, many recent studies show that the use of Python in astrophysics and general physics has increased exponentially. We aim to provide a very high level of abstraction, an easy-to-use interface and a pleasing user experience. EinsteinPy is developed with a particular user in mind: a theoretical gravitational physicist with little or no background in computer programming who is trying to work in the field of numerical relativity or to use simulations in their research. Currently, EinsteinPy supports simulation of time-like and null geodesics and calculates trajectories in different background geometries, including Schwarzschild, Kerr, and Kerr-Newman, along with a coordinate inter-conversion pipeline. It has a partially developed pipeline for plotting and visualization with dependencies on libraries like Plotly, matplotlib, etc. One of the unique features of EinsteinPy is its well-developed set of symbolic tensor manipulation utilities, which are a useful tool in themselves for learning tensor algebra, a subject that many beginning students find overwhelmingly tricky. EinsteinPy also provides a few utility functions for hypersurface embedding of Schwarzschild spacetime, which will be extended further to model gravitational lensing simulations.
[60] arXiv:2005.11251 [pdf]
A machine learning based software pipeline to pick the variable ordering for algorithms with polynomial inputs
We are interested in the application of Machine Learning (ML) technology to improve mathematical software. It may seem that the probabilistic nature of ML tools would invalidate the exact results prized by such software; however, the algorithms which underpin the software often come with a range of choices which are good candidates for ML application. We refer to choices which have no effect on the mathematical correctness of the software, but do impact its performance. In the past we experimented with one such choice: the variable ordering to use when building a Cylindrical Algebraic Decomposition (CAD). We used the Python library Scikit-Learn (sklearn) to experiment with different ML models, and developed new techniques for feature generation and hyper-parameter selection. These techniques could easily be adapted for making decisions other than our immediate application of CAD variable ordering. Hence in this paper we present a software pipeline to use sklearn to pick the variable ordering for an algorithm that acts on a polynomial system. The code described is freely available online.
[61] arXiv:2005.11233 [pdf]
Scanner data in inflation measurement: from raw data to price indices
Scanner data offer new opportunities for CPI or HICP calculation. They can be obtained from a wide variety of retailers (supermarkets, home electronics, Internet shops, etc.) and provide information at the level of the barcode. One of the advantages of using scanner data is the fact that they contain complete transaction information, i.e. prices and quantities for every sold item. Before scanner data can be used, they must be carefully processed. After cleaning the data and unifying product names, products should be carefully classified (e.g. into COICOP 5 or below), matched, filtered and aggregated. These procedures often require creating new IT or writing custom scripts (R, Python, Mathematica, SAS, others). One of the new challenges connected with scanner data is the appropriate choice of the index formula. In this article we present a proposal for the implementation of individual stages of handling scanner data. We also point out potential problems during scanner data processing and their solutions. Finally, we compare a large number of price index methods based on real scanner datasets and we verify their sensitivity to the adopted data filtering and aggregating methods.
[62] arXiv:2005.11225 [pdf]
BDAQ53, a versatile Readout and Test System for Pixel Detector Systems for the ATLAS and CMS HL-LHC Upgrades
BDAQ53 is a readout system and verification framework for hybrid pixel detector readout chips of the RD53 family. These chips are designed for the upgrade of the inner tracking detectors of the ATLAS and CMS experiments. BDAQ53 is used in applications where versatility and rapid customization are required, such as in lab testing environments, test beam campaigns, and permanent setups for quality control measurements. It consists of custom and commercial hardware, a Python-based software framework, and FPGA firmware. BDAQ53 is developed as open source software with both software and firmware being hosted in a public repository.
[63] arXiv:2005.10862 [pdf]
SudoQ -- a quantum variant of the popular game
We introduce SudoQ, a quantum version of the classical game Sudoku. Allowing the entries of the grid to be (non-commutative) projections instead of integers, the solution set of SudoQ puzzles can be much larger than in the classical (commutative) setting. We introduce and analyze a randomized algorithm for computing solutions of SudoQ puzzles. Finally, we state two important conjectures relating the quantum and the classical solutions of SudoQ puzzles, corroborated by analytical and numerical evidence.
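To make the idea of projection-valued grid entries concrete, here is a minimal sketch of a validity check. It assumes (this is not stated in the abstract) that each cell holds a rank-one projection |v><v| given by a real vector v, and that a grid counts as solved when the projections in every row, column, and block sum to the identity. All function names are hypothetical, and the randomized algorithm from the paper is not reproduced here.

```python
def outer(v):
    """Rank-one operator |v><v| for a real vector v, as a nested list."""
    return [[a * b for b in v] for a in v]


def sums_to_identity(vectors, tol=1e-9):
    """True if the rank-one operators built from `vectors` sum to the identity."""
    dim = len(vectors[0])
    total = [[0.0] * dim for _ in range(dim)]
    for v in vectors:
        p = outer(v)
        for r in range(dim):
            for c in range(dim):
                total[r][c] += p[r][c]
    return all(
        abs(total[r][c] - (1.0 if r == c else 0.0)) <= tol
        for r in range(dim)
        for c in range(dim)
    )


def is_sudoq_solution(grid, n, tol=1e-9):
    """Check every row, column, and n-by-n block of an n^2-by-n^2 grid of vectors."""
    size = n * n
    rows = [grid[r] for r in range(size)]
    cols = [[grid[r][c] for r in range(size)] for c in range(size)]
    blocks = [
        [grid[br * n + dr][bc * n + dc] for dr in range(n) for dc in range(n)]
        for br in range(n)
        for bc in range(n)
    ]
    return all(sums_to_identity(group, tol) for group in rows + cols + blocks)


def lift_classical(int_grid, basis):
    """Replace each integer k in a classical grid by the k-th vector of `basis`."""
    return [[basis[k - 1] for k in row] for row in int_grid]
```

Under these assumptions, any solved classical Sudoku lifted through an orthonormal basis passes the check, while grids whose rows use different, mutually incompatible bases are where the non-commutative setting admits solutions with no classical counterpart.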
[64] arXiv:2005.10219 [pdf]
BlaBla: Linguistic Feature Extraction for Clinical Analysis in Multiple Languages
We introduce BlaBla, an open-source Python library for extracting linguistic features with proven clinical relevance to neurological and psychiatric diseases across many languages. BlaBla is a unifying framework for accelerating and simplifying clinical linguistic research. The library is built on state-of-the-art NLP frameworks and supports multithreaded/GPU-enabled feature extraction via both native Python calls and a command line interface. We describe BlaBla's architecture and clinical validation of its features across 12 diseases. We further demonstrate the application of BlaBla to a task visualizing and classifying language disorders in three languages on real clinical data from the AphasiaBank dataset. We make the codebase freely available to researchers with the hope of providing a consistent, well-validated foundation for the next generation of clinical linguistic research.
[65] arXiv:2005.10157 [pdf]
Generating Question Titles for Stack Overflow from Mined Code Snippets
Stack Overflow has been heavily used by software developers as a popular way to seek programming-related information from peers via the internet. The Stack Overflow community recommends that users provide the related code snippet when they are creating a question, to help others better understand it and offer their help. Previous studies have shown that a significant number of these questions are of low quality and not attractive to other potential experts in Stack Overflow. These poorly asked questions are less likely to receive useful answers and hinder the overall knowledge generation and sharing process. One reason low-quality questions are introduced in SO is that many developers may not be able to clarify and summarize the key problems behind their presented code snippets, due to their lack of knowledge and terminology related to the problem and/or their poor writing skills. In this study we therefore propose an approach to assist developers in writing high-quality questions by automatically generating question titles for a code snippet using a deep sequence-to-sequence learning approach. Our approach is fully data-driven and uses an attention mechanism to perform better content selection, a copy mechanism to handle the rare-words problem and a coverage mechanism to eliminate the word-repetition problem. We evaluate our approach on Stack Overflow datasets over a variety of programming languages (e.g., Python, Java, JavaScript, C# and SQL) and our experimental results show that our approach significantly outperforms several state-of-the-art baselines in both automatic and human evaluation. We have released our code and datasets to help other researchers verify their ideas and to inspire follow-up work.
[66] arXiv:2005.10060 [pdf]
Magnetic-field modeling with surface currents: Physical and computational principles of bfieldtools
Surface currents provide a general way to model static magnetic fields in source-free volumes. To facilitate the use of surface currents in magneto-quasistatic problems, we have implemented a set of computational tools in a Python package named bfieldtools. In this work, we describe the physical and computational principles of this toolset. To be able to work with surface currents of arbitrary shape, we discretize the currents on triangle meshes using piecewise-linear stream functions. We apply analytical discretizations of integral equations to obtain the magnetic field and potentials associated with the discrete stream function. In addition, we describe the computation of the spherical multipole expansion and a novel surface-harmonic expansion for surface currents, both of which are useful for representing the magnetic field in source-free volumes with a small number of parameters. Last, we share examples related to magnetic shielding and surface-coil design using the presented tools.
[67] arXiv:2005.10056 [pdf]
Magnetic-field modeling with surface currents: Implementation and usage of bfieldtools
We present a novel open-source Python software package, bfieldtools, for magneto-quasistatic calculations with current densities on surfaces of arbitrary shape. The core functionality of the software relies on a stream-function representation of surface-current density and its discretization on a triangle mesh. Although this stream-function technique is well-known in certain fields, to date the related software implementations have not been published or have been limited to specific applications. With bfieldtools, we aimed to produce a general, easy-to-use and well-documented open-source software package. The software package is written purely in Python; instead of explicitly using lower-level languages, we address computational bottlenecks through extensive vectorization and use of the NumPy library. The package enables easy deployment and rapid code development, and facilitates application of the software to practical problems. In this paper, we describe the software package and give an extensive demonstration of its use with an emphasis on one of its main applications -- coil design.
[68] arXiv:2005.09941 [pdf]
Non-Uniform Gaussian Blur of Hexagonal Bins in Cartesian Coordinates
In a recent application of the Bokeh Python library for visualizing physico-chemical properties of chemical entities text-mined from the scientific literature, we found ourselves facing the task of smoothing hexagonally binned data in Cartesian coordinates. To the best of our knowledge, no documentation for how to do this exists in the public domain. This short paper shows how to accomplish this in general and for Bokeh in particular. We illustrate the method with a real-world example and discuss some potential advantages of using hexagonal bins in these and similar applications.
[69] arXiv:2005.09890 [pdf]
Interactive exploration of population scale pharmacoepidemiology datasets
Population-scale drug prescription data linked with adverse drug reaction (ADR) data supports the fitting of models large enough to detect drug use and ADR patterns that are not detectable using traditional methods on smaller datasets. However, detecting ADR patterns in large datasets requires tools for scalable data processing, machine learning for data analysis, and interactive visualization. To our knowledge no existing pharmacoepidemiology tool supports all three requirements. We have therefore created a tool for interactive exploration of patterns in prescription datasets with millions of samples. We use Spark to preprocess the data for machine learning and for analyses using SQL queries. We have implemented models in Keras and the scikit-learn framework. The model results are visualized and interpreted using live Python coding in Jupyter. We apply our tool to explore a data set of 384 million prescriptions from the Norwegian Prescription Database combined with 62 million prescriptions for elders who were hospitalized. We preprocess the data in two minutes, train models in seconds, and plot the results in milliseconds. Our results show the power of combining computational power, short computation times, and ease of use for analysis of population scale pharmacoepidemiology datasets. The code is open source and available at this https URL
[70] arXiv:2005.09625 [pdf]
Inference, prediction and optimization of non-pharmaceutical interventions using compartment models: the PyRoss library
PyRoss is an open-source Python library that offers an integrated platform for inference, prediction and optimisation of NPIs in age- and contact-structured epidemiological compartment models. This report outlines the rationale and functionality of the PyRoss library, with various illustrations and examples focusing on well-mixed, age-structured populations. The PyRoss library supports arbitrary structured models formulated stochastically (as master equations) or deterministically (as ODEs) and allows mid-run transitioning from one to the other. By supporting additional compartmental subdivision ad libitum, PyRoss can emulate time-since-infection models and allows medical stages such as hospitalization or quarantine to be modelled and forecast. The PyRoss library enables fitting to epidemiological data, as available, using Bayesian parameter inference, so that competing models can be weighed by their evidence. PyRoss allows fully Bayesian forecasts of the impact of idealized NPIs by convolving uncertainties arising from epidemiological data, model choice, parameters, and intrinsic stochasticity. Algorithms to optimize time-dependent NPI scenarios against user-defined cost functions are included. PyRoss's current age-structured compartment framework for well-mixed populations will in future reports be extended to include compartments structured by location, occupation, use of travel networks and other attributes relevant to assessing disease spread and the impact of NPIs. We argue that such compartment models, by allowing social data of arbitrary granularity to be combined with Bayesian parameter estimation for poorly-known disease variables, could enable more powerful and robust prediction than other approaches to detailed epidemic modelling. We invite others to use the PyRoss library for research to address today's COVID-19 crisis, and to plan for future pandemics.
[71] arXiv:2005.08983 [pdf]
Completeness of the Gaia-verse II: what are the odds that a star is missing from Gaia DR2?
The second data release of the Gaia mission contained astrometry and photometry for an incredible 1,692,919,135 sources, but how many sources did Gaia miss and where do they lie on the sky? The answer to this question will be crucial for any astronomer attempting to map the Milky Way with Gaia DR2. We infer the completeness of Gaia DR2 by exploiting the fact that it only contains sources with at least five astrometric detections. The odds that a source achieves those five detections depend on both the number of observations and the probability that an observation of that source results in a detection. We predict the number of times that each source was observed by Gaia and assume that the probability of detection is either a function of magnitude or a distribution as a function of magnitude. We fit both these models to the 1.7 billion stars of Gaia DR2, and thus are able to robustly predict the completeness of Gaia across the sky as a function of magnitude. We extend our selection function to account for crowding in dense regions of the sky, and show that this is vitally important, particularly in the Galactic bulge and the Large and Small Magellanic Clouds. We find that the magnitude limit at which Gaia is still 99% complete varies over the sky from $G=18.9$ to $21.3$. We have created a new Python package selectionfunctions (this https URL) which provides easy access to our selection functions.
[72] arXiv:2005.08848 [pdf]
Surfboard: Audio Feature Extraction for Modern Machine Learning
We introduce Surfboard, an open-source Python library for extracting audio features with application to the medical domain. Surfboard is written with the aim of addressing pain points of existing libraries and facilitating joint use with modern machine learning frameworks. The package can be accessed both programmatically in Python and via its command line interface, allowing it to be easily integrated within machine learning workflows. It builds on state-of-the-art audio analysis packages and offers multiprocessing support for processing large workloads. We review similar frameworks and describe Surfboard's architecture, including the clinical motivation for its features. Using the mPower dataset, we illustrate Surfboard's application to a Parkinson's disease classification task, highlighting common pitfalls in existing research. The source code is opened up to the research community to facilitate future audio research in the clinical domain.
[73] arXiv:2005.08803 [pdf]
SciANN: A Keras wrapper for scientific computations and physics-informed deep learning using artificial neural networks
In this paper, we introduce SciANN, a Python package for scientific computing and physics-informed deep learning using artificial neural networks. SciANN uses the widely used deep-learning packages Tensorflow and Keras to build deep neural networks and optimization models, thus inheriting many of Keras's functionalities, such as batch optimization and model reuse for transfer learning. SciANN is designed to abstract neural network construction for scientific computations and solution and discovery of partial differential equations (PDE) using the physics-informed neural networks (PINN) architecture, therefore providing the flexibility to set up complex functional forms. We illustrate, in a series of examples, how the framework can be used for curve fitting on discrete data, and for solution and discovery of PDEs in strong and weak forms. We summarize the features currently available in SciANN, and also outline ongoing and future developments.
[74] arXiv:2005.08700 [pdf]
Mask CTC: Non-Autoregressive End-to-End ASR with CTC and Mask Predict
We present Mask CTC, a novel non-autoregressive end-to-end automatic speech recognition (ASR) framework, which generates a sequence by refining outputs of the connectionist temporal classification (CTC). Neural sequence-to-sequence models are usually \textit{autoregressive}: each output token is generated by conditioning on previously generated tokens, at the cost of requiring as many iterations as the output length. On the other hand, non-autoregressive models can simultaneously generate tokens within a constant number of iterations, which results in a significant reduction in inference time and better suits end-to-end ASR models for real-world scenarios. In this work, the Mask CTC model is trained using a Transformer encoder-decoder with joint training of mask prediction and CTC. During inference, the target sequence is initialized with the greedy CTC outputs and low-confidence tokens are masked based on the CTC probabilities. Based on the conditional dependence between output tokens, these masked low-confidence tokens are then predicted conditioning on the high-confidence tokens. Experimental results on different speech recognition tasks show that Mask CTC outperforms the standard CTC model (e.g., 17.9% -> 12.1% WER on WSJ) and approaches the autoregressive model, requiring much less inference time using CPUs (0.07 RTF in Python implementation). All of our code will be publicly available.
[75] arXiv:2005.08373 [pdf]
A Tutorial on Multivariate $k$-Statistics and their Computation
This document aims to provide an accessible tutorial on the unbiased estimation of multivariate cumulants, using $k$-statistics. We offer an explicit and general formula for multivariate $k$-statistics of arbitrary order. We also prove that the $k$-statistics are unbiased, using Möbius inversion and rudimentary combinatorics. Many detailed examples are considered throughout the paper. We conclude with a discussion of $k$-statistics computation, including the challenge of time complexity, and we examine a couple of possible avenues to improve the efficiency of this computation. The purpose of this document is threefold: to provide a clear introduction to $k$-statistics without relying on specialized tools like the umbral calculus; to construct an explicit formula for $k$-statistics that might facilitate future approximations and faster algorithms; and to serve as a companion paper to our Python library PyMoments, which implements this formula.
[76] arXiv:2005.08067 [pdf]
Forecasting with sktime: Designing sktime's New Forecasting API and Applying It to Replicate and Extend the M4 Study
We present a new open-source framework for forecasting in Python. Our framework forms part of sktime, a more general machine learning toolbox for time series with scikit-learn compatible interfaces for different learning tasks. Our new framework provides dedicated forecasting algorithms and tools to build, tune and evaluate composite models. We use sktime to both replicate and extend key results from the M4 forecasting study. In particular, we further investigate the potential of simple off-the-shelf machine learning approaches for univariate forecasting. Our main results are that simple hybrid approaches can boost the performance of statistical models, and that simple pure approaches can achieve competitive performance on the hourly data set, outperforming the statistical algorithms and coming close to the M4 winner.
[77] arXiv:2005.08025 [pdf]
IntelliCode Compose: Code Generation Using Transformer
In software development through integrated development environments (IDEs), code completion is one of the most widely used features. Nevertheless, the majority of integrated development environments only support completion of methods and APIs, or arguments. In this paper, we introduce IntelliCode Compose, a general-purpose multilingual code completion tool which is capable of predicting sequences of code tokens of arbitrary types, generating up to entire lines of syntactically correct code. It leverages a state-of-the-art generative transformer model trained on 1.2 billion lines of source code in the Python, C#, JavaScript and TypeScript programming languages. IntelliCode Compose is deployed as a cloud-based web service. It makes use of client-side tree-based caching, an efficient parallel implementation of the beam search decoder, and compute graph optimizations to meet edit-time completion suggestion requirements in the Visual Studio Code IDE and Azure Notebook. Our best model yields an average edit similarity of 86.7% and a perplexity of 1.82 for the Python programming language.
[78] arXiv:2005.08021 [pdf]
PyCDFT: A Python package for constrained density functional theory
We present PyCDFT, a Python package to compute diabatic states using constrained density functional theory (CDFT). PyCDFT provides an object-oriented, customizable implementation of CDFT, and allows for both single-point self-consistent-field calculations and geometry optimizations. PyCDFT is designed to interface with existing density functional theory (DFT) codes to perform CDFT calculations where constraint potentials are added to the Kohn-Sham Hamiltonian. Here we demonstrate the use of PyCDFT by performing calculations with a massively parallel first-principles molecular dynamics code, Qbox, and we benchmark its accuracy by computing the electronic coupling between diabatic states for a set of organic molecules. We show that PyCDFT yields results in agreement with existing implementations and is a robust and flexible package for performing CDFT calculations. The program is available at this https URL.
[79] arXiv:2005.07786 [pdf]
A flexible, extensible software framework for model compression based on the LC algorithm
We propose a software framework based on the ideas of the Learning-Compression (LC) algorithm, that allows a user to compress a neural network or other machine learning model using different compression schemes with minimal effort. Currently, the supported compressions include pruning, quantization, low-rank methods (including automatically learning the layer ranks), and combinations of those, and the user can choose different compression types for different parts of a neural network. The LC algorithm alternates two types of steps until convergence: a learning (L) step, which trains a model on a dataset (using an algorithm such as SGD); and a compression (C) step, which compresses the model parameters (using a compression scheme such as low-rank or quantization). This decoupling of the "machine learning" aspect from the "signal compression" aspect means that changing the model or the compression type amounts to calling the corresponding subroutine in the L or C step, respectively. The library fully supports this by design, which makes it flexible and extensible. This does not come at the expense of performance: the runtime needed to compress a model is comparable to that of training the model in the first place; and the compressed model is competitive in terms of prediction accuracy and compression ratio with other algorithms (which are often specialized for specific models or compression schemes). The library is written in Python and PyTorch and is available on GitHub.
[80] arXiv:2005.07710 [pdf]
Flare Statistics for Young Stars from a Convolutional Neural Network Analysis of $\textit{TESS}$ Data
All-sky photometric time-series missions have allowed for the monitoring of thousands of young ($t_{\rm age} 50$ Myr across all temperatures $T_{\rm eff} \geq 4000$ K, while stars from $2300 \leq T_{\rm eff} < 4000$ K show no evolution across 800 Myr. Stars of $T_{\rm eff} \leq 4000$ K also show higher flare rates and amplitudes across all ages. We investigate the effects of high flare rates on photoevaporative atmospheric mass loss for young planets. In the presence of flares, planets lose 4-7% more atmosphere over the first 1 Gyr. $\texttt{stella}$ is an open-source Python tool-kit hosted on GitHub and PyPI.
