The 2023 conference on Machine Learning in astronomical surveys intends to critically review new Machine Learning techniques and methods for astronomy.
In order to bring together the widest possible community, while limiting the carbon impact, this conference will be organized in hybrid mode and on two physical sites simultaneously, at the IAP in Paris and at the CCA/Flatiron Institute in New York. For the same reason, we encourage the participants to travel by train if possible. The workshop will take place from Monday to Friday in the afternoon only during hours compatible with both time zones.
Our final list of invited scholars is still evolving beyond the confirmed list below.
Reviewers (confirmed):
Invited debaters (confirmed):
Image credit: Jean Mouette (IAP)
We acknowledge the financial support of the following agencies, institutes and national initiatives: CNES, Simons Foundation, PNCG, IAP, DIM ORIGINES, Région Ile-de-France, SCAI, Learning the Universe.
Extracting optimal information from upcoming cosmological surveys is a pressing task, for which a promising path to success is performing field-level inference with differentiable forward modeling. A key computational challenge in this approach is that it requires sampling a high-dimensional parameter space. In this talk I will present a new promising method to sample such large parameter spaces, which improves upon the traditional Hamiltonian Monte Carlo, to both reconstruct the initial conditions of the Universe and obtain cosmological constraints.
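For orientation, here is a minimal sketch of the traditional Hamiltonian Monte Carlo baseline mentioned above, applied to a toy standard-normal target in ten dimensions. The function names (`leapfrog`, `hmc_sample`), the step size, and the trajectory length are illustrative assumptions; the sketch does not represent the new sampling method presented in the talk.

```python
import numpy as np

def leapfrog(q, p, grad_log_prob, step_size, n_steps):
    """Integrate Hamiltonian dynamics with the leapfrog scheme."""
    p = p + 0.5 * step_size * grad_log_prob(q)      # initial half step in momentum
    for _ in range(n_steps - 1):
        q = q + step_size * p                       # full step in position
        p = p + step_size * grad_log_prob(q)        # full step in momentum
    q = q + step_size * p                           # last full step in position
    p = p + 0.5 * step_size * grad_log_prob(q)      # final half step in momentum
    return q, p

def hmc_sample(log_prob, grad_log_prob, q0, n_samples,
               step_size=0.1, n_steps=20, seed=0):
    """Plain HMC with a unit mass matrix and a Metropolis accept/reject step."""
    rng = np.random.default_rng(seed)
    q = np.asarray(q0, dtype=float)
    samples = np.empty((n_samples, q.size))
    for i in range(n_samples):
        p = rng.standard_normal(q.size)             # resample the momentum
        h_old = -log_prob(q) + 0.5 * p @ p
        q_new, p_new = leapfrog(q, p, grad_log_prob, step_size, n_steps)
        h_new = -log_prob(q_new) + 0.5 * p_new @ p_new
        if np.log(rng.uniform()) < h_old - h_new:   # accept with prob min(1, exp(-dH))
            q = q_new
        samples[i] = q
    return samples

# Toy target (an assumption for illustration): a standard normal in 10 dimensions.
log_prob = lambda q: -0.5 * q @ q
grad_log_prob = lambda q: -q
samples = hmc_sample(log_prob, grad_log_prob, np.zeros(10), n_samples=2000)
```

In a field-level setting, `grad_log_prob` would come from differentiating the forward model, which is why differentiable forward modeling matters for this family of samplers.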
Large-volume cosmological hydrodynamic simulations have become a primary tool to understand supermassive black holes (SMBHs), galaxies, and the large-scale structure of the Universe. However, current uncertainties in sub-grid models for core physical processes such as feedback from massive stars and SMBHs limit their predictive power and plausible use to extract information from extragalactic surveys. In this talk, I will present an overview of the Cosmology and Astrophysics with MachinE Learning Simulations (CAMELS) project, containing thousands of simulations implementing different cosmological and astrophysical parameters, sub-grid galaxy formation implementations, and hydrodynamics solvers, and designed to train machine learning algorithms to maximize the extraction of information from cosmological surveys while marginalizing over uncertainties in sub-grid physics. I will show illustrative examples of the broad range of possible applications of CAMELS, discuss recent progress and challenges in building robust simulation-based inference models for cosmology, and advertise the latest additions to the ever-growing CAMELS public data repository.
The influx of massive amounts of data from current and upcoming cosmological surveys necessitates compression schemes that can efficiently summarize the data with minimal loss of information. We introduce a method that leverages the paradigm of self-supervised machine learning in a novel manner to construct representative summaries of massive datasets using simulation-based augmentations. Deploying the method on hydrodynamical cosmological simulations, we show that it can deliver highly informative summaries, which can be used for a variety of downstream tasks, including precise and accurate parameter inference. We demonstrate how this paradigm can be used to construct summary representations that are insensitive to prescribed systematic effects, such as the influence of baryonic physics. Our results indicate that self-supervised machine learning techniques offer a promising new approach for the compression of cosmological data as well as its analysis.
Simulations of galaxy clusters that are well-matched to upcoming data sets are a key tool for addressing systematics (e.g., cluster mass inference) that limit current and future cluster-based cosmology constraints. However, most state-of-the-art simulations are too computationally intensive to produce multiple versions of relevant physics systematics. We present DeepSZSim, a lightweight framework for generating simulations of Sunyaev–Zel’dovich (SZ) effect clusters based on average thermal pressure profile models. These simulations offer a fast and flexible method for generating large datasets for testing mass inference methods like machine learning and simulation-based inference. We present these simulations and their place within the larger Deep Skies nexus of versatile, multi-wavelength galaxy cluster and cosmic microwave background simulators. We discuss progress and prospects for using these SZ simulations for machine learning, including simulation-based inference of cluster mass.
Inflation remains one of the enigmas in fundamental physics. While it is difficult to distinguish different inflation models, information contained in primordial non-Gaussianity (PNG) offers a route to break the degeneracy. In galaxy surveys, the local type PNG is usually probed by measuring the scale-dependent bias in the power spectrum. We introduce a new approach to measure the local type PNG by computing a three-point estimator using a reconstructed density field, that is, a late-time density field reversed to the initial conditions. This approach offers an alternative to the existing method with different systematics and also organically follows the procedure of BAO analysis in large galaxy surveys. We introduce a reconstruction method using convolutional neural networks that significantly improves on the performance of traditional reconstruction algorithms for the matter density field, which is crucial for probing PNG more effectively. This pipeline can be applied to the ongoing Dark Energy Spectroscopic Instrument (DESI) and Euclid surveys, as well as upcoming projects, such as the Nancy Grace Roman Space Telescope.
In recent years, non-Gaussian statistics have been growing in popularity as powerful tools for efficiently extracting cosmological information from current weak lensing data. Their use can improve constraints on cosmological parameters over standard two-point statistics, can additionally help discriminate between general relativity and modified gravity theories, and can help to self-calibrate astrophysical and observational nuisance parameters. During this talk, I will present an end-to-end simulation-based inference (SBI) framework that allows us to use common non-Gaussian statistics (e.g., higher order moments, peaks, scattering transform, phase wavelet harmonics) to constrain cosmological parameters. The pipeline relies on a neural network compression of the summary statistics and estimates the parameter posteriors using a mixture of Neural Density Estimators (NDEs). I will use the pipeline to compare the performance of different summary statistics in terms of their constraining power on cosmological parameters. I will then show constraints on data using the Dark Energy Survey year 3 weak lensing data. I will also discuss the impact of observational systematics, and the main challenges ahead in view of stage IV surveys.
In the era of wide-field surveys and big data in astronomy, the SNAD team (https://snad.space) is exploiting the potential of modern datasets to discover new, unforeseen, or rare astrophysical phenomena. The SNAD pipeline was built under the hypothesis that, although automatic learning algorithms have a crucial role to play in this task, scientific discovery is only completely realized when such systems are designed to boost the impact of domain knowledge experts. Our key contributions include the development of the Coniferest Python library, which offers implementations of two adaptive learning algorithms with an “expert in the loop”, and the creation of the SNAD Transient Miner, facilitating the search for specific types of transients. We have also developed the SNAD Viewer, a web portal that provides a centralized view of individual objects from the Zwicky Transient Facility’s (ZTF) data releases, making the analysis of anomaly candidates more efficient. Finally, when applied to ZTF data, our approach has yielded over a hundred new supernova candidates, along with a few other non-catalogued objects, such as red dwarf flares, active galactic nuclei, RS CVn type variables, and young stellar objects.
We present DE-VAE, a variational autoencoder (VAE) architecture to search for a compressed representation of beyond-ΛCDM models. We train DE-VAE on matter power spectra boosts generated at wavenumbers k ∈ (0.01 − 2.5) h/Mpc and at four redshift values z ∈ (0.1, 0.48, 0.78, 1.5) for a dynamic dark energy (DE) model with two extra parameters describing an evolving DE equation of state. The boosts are compressed to a lower-dimensional representation, which is concatenated with standard CDM parameters and then mapped back to reconstructed boosts; both the compression (“encoder”) and the reconstruction (“decoder”) components are parametrized as neural networks. We demonstrate that a single latent parameter can be used to predict DE power spectra at all k and z within 2σ, where the Gaussian error includes cosmic variance, shot noise and systematic effects for a Stage IV-like survey. This single parameter shows a high mutual information (MI) with the two DE parameters, and we obtain an explicit equation linking these variables through symbolic regression. We further show that considering a model with two latent variables only marginally improves the accuracy predictions, and that a third latent variable has no significant impact on the model’s performance. We discuss how the DE-VAE framework could be extended to search for a common lower-dimensional parametrization of different beyond-ΛCDM models, including modified gravity and braneworld models. Such a framework could then both potentially serve as an indicator of the existence of new physics in cosmological datasets, and provide theoretical insight into the common aspects of beyond-ΛCDM models.
Observations of the Cosmic Microwave Background (CMB) radiation have made significant contributions to our understanding of cosmology. While temperature observations of the CMB have greatly advanced our knowledge, the next frontier lies in detecting the elusive B-modes and obtaining precise reconstructions of the CMB's polarized signal in general. In anticipation of proposed and upcoming CMB polarization missions, this study introduces a novel method for accurately determining the angular power spectrum of CMB E-modes and B-modes. We have developed a Bayesian Neural Network (BNN)-based approach to enhance the performance of the Internal Linear Combination (ILC) technique. Our method is applied separately to the frequency channels of both the LiteBIRD and ECHO (also known as CMB-Bharat) missions and its performance is rigorously assessed for both missions. Our findings demonstrate the method's efficiency in achieving precise reconstructions of both the CMB E-mode and B-mode angular power spectra, with errors constrained primarily by cosmic variance.
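As background, the standard ILC technique referred to above combines frequency maps with weights $w = C^{-1}a / (a^T C^{-1} a)$, where $C$ is the channel covariance and $a$ the CMB response of each channel. The sketch below applies it to toy data; the foreground scaling, noise level, and map size are made-up assumptions, and the BNN enhancement described in the abstract is not reproduced here.

```python
import numpy as np

def ilc_weights(maps, response):
    """Minimum-variance weights with unit response to the CMB.

    maps: array of shape (n_channels, n_pix); response: CMB scaling per channel.
    """
    cov = np.cov(maps)                            # empirical channel-channel covariance
    cov_inv_a = np.linalg.solve(cov, response)    # C^{-1} a without forming the inverse
    return cov_inv_a / (response @ cov_inv_a)     # w = C^{-1} a / (a^T C^{-1} a)

# Toy sky (all numbers are illustrative assumptions).
rng = np.random.default_rng(0)
n_pix = 20_000
cmb = rng.normal(size=n_pix)
foreground = rng.normal(size=n_pix)
fg_scaling = np.array([4.0, 2.0, 1.0, 0.5, 0.25])   # foreground amplitude per channel
noise = 0.1 * rng.normal(size=(5, n_pix))
maps = cmb + fg_scaling[:, None] * foreground + noise

response = np.ones(5)                 # CMB has unit response in every channel
weights = ilc_weights(maps, response)
cmb_ilc = weights @ maps              # cleaned CMB estimate

ilc_rms = np.std(cmb_ilc - cmb)               # residual of the combined map
channel_rms = np.std(maps - cmb, axis=1)      # residual of each single channel
```

The weights sum to one, so the CMB is preserved, while the variance contributed by foregrounds and noise is minimized.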
Symbolic Regression is a data-driven method that searches the space of mathematical equations with the goal of finding the best analytical representation of a given dataset. It is a very powerful tool, which enables the emergence of underlying behavior governing the data generation process. Furthermore, in the case of physical equations, obtaining an analytical form adds a layer of interpretability to the answer which might highlight interesting physical properties.
However, equations built with traditional symbolic regression approaches are limited to describing one particular event at a time. That is, if a given parametric equation was at the origin of two datasets produced using two sets of parameters, the method would output two particular solutions, with specific parameter values for each event, instead of finding a common parametric equation. In fact, there are many real-world applications, in particular in astrophysics, where we want to propose a formula for a family of events which may share the same functional shape, but with different numerical parameters.
In this work we propose an adaptation of the Symbolic Regression method that is capable of recovering a common parametric equation hidden behind multiple examples generated using different parameter values. We call this approach Multiview Symbolic Regression and we demonstrate how it can reconstruct well-known physical equations. Additionally, we explore possible applications in the domain of astronomy for light-curve modeling. Building equations to describe the behavior of astrophysical objects can lead to better flux prediction as well as new feature extraction for future machine learning applications.
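The idea of a shared functional form with per-dataset parameters can be illustrated with a toy scoring step: each candidate form is refitted on every dataset ("view"), and the per-view errors are aggregated into one score. The two candidate forms, the helper names, and the choice of the worst-case error as aggregate are assumptions made for this sketch, not the procedure used in the talk.

```python
import numpy as np

def fit_exponential(x, y):
    """Fit y = a * exp(-b * x) by linear least squares on log(y); requires y > 0."""
    slope, intercept = np.polyfit(x, np.log(y), 1)
    return np.exp(intercept) * np.exp(slope * x)

def fit_linear(x, y):
    """Fit y = a + b * x by linear least squares."""
    slope, intercept = np.polyfit(x, y, 1)
    return intercept + slope * x

# Hypothetical candidate forms; a real search would generate these automatically.
CANDIDATES = {"exponential": fit_exponential, "linear": fit_linear}

def multiview_score(fit, views):
    """Worst per-view mean squared error after refitting the parameters on each view."""
    return max(np.mean((fit(x, y) - y) ** 2) for x, y in views)

# Two views generated by the same form with different parameter values.
x = np.linspace(0.0, 3.0, 50)
views = [(x, 2.0 * np.exp(-0.5 * x)), (x, 5.0 * np.exp(-1.5 * x))]

scores = {name: multiview_score(fit, views) for name, fit in CANDIDATES.items()}
best = min(scores, key=scores.get)   # form that explains every view at once
```

Aggregating with the maximum penalizes a form that fits one view well but fails on another, which is the behavior a common parametric equation should avoid.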
While the benefits of machine learning for data analysis are widely discussed, I will argue that machine learning also has great potential to inform us about interesting directions in new physics. Indeed, the current approach to solving the big questions of cosmology today is to constrain a wide range of cosmological models (such as cosmic inflation or modified gravity models), which is costly. In our recently published approach (https://arxiv.org/abs/2110.13171), we propose to use unsupervised learning to map models according to their impact on cosmological observables. We can thus visualize which models have a different impact and therefore are worth investigating further, using this map as a guide to unlock information about new physics from the new generation of cosmological surveys. In this talk, I will explain the approach, its use case and its application to the space of modified gravity probed by cosmic shear.
Symbolic Regression is the study of algorithms that automate the search for analytic expressions that fit data. With new advances in deep learning there has been much renewed interest in such approaches, yet efforts have not been focused on physics, where we have important additional constraints due to the units associated with our data.
I will present Φ-SO, a Physical Symbolic Optimization framework for recovering analytical symbolic expressions from physical data using deep reinforcement learning techniques. Our system is built, from the ground up, to propose solutions where the physical units are consistent by construction, resulting in compact, physical, interpretable and intelligible analytical models. This is useful not only in eliminating physically impossible solutions, but also because it enormously restricts the freedom of the equation generator, thus vastly improving performance.
The algorithm can be used to fit noiseless data, which can be useful for instance when attempting to derive an analytical property of a physical model, and it can also be used to obtain analytical approximations to noisy data or even open up the black boxes that are neural networks. I will showcase our machinery on a panel of astrophysical cases ranging from high energy astrophysics to galactic dynamics, all the way to cosmology. I will then touch on our preliminary results in applying this type of approach to physical differential equations.
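To make the notion of unit consistency concrete, the sketch below tracks units as exponent vectors over (length, mass, time) and rejects sums of quantities with different units. All names here are hypothetical helpers written for illustration; this is not Φ-SO's implementation, which enforces the constraint while expressions are being generated.

```python
import numpy as np

# Units as exponent vectors over (length, mass, time).
LENGTH = np.array([1, 0, 0])
MASS = np.array([0, 1, 0])
TIME = np.array([0, 0, 1])

def multiply(u, v):
    """Units of a product: exponents add."""
    return u + v

def divide(u, v):
    """Units of a quotient: exponents subtract."""
    return u - v

def power(u, n):
    """Units of a quantity raised to the n-th power."""
    return u * n

def add(u, v):
    """Units of a sum: both terms must carry identical units."""
    if not np.array_equal(u, v):
        raise ValueError("cannot add quantities with different units")
    return u

def is_consistent(build):
    """Return True if building the expression raises no unit error."""
    try:
        build()
        return True
    except ValueError:
        return False

VELOCITY = divide(LENGTH, TIME)
ENERGY = multiply(MASS, power(VELOCITY, 2))      # m * v^2

valid = is_consistent(lambda: add(ENERGY, multiply(MASS, power(VELOCITY, 2))))
invalid = is_consistent(lambda: add(MASS, VELOCITY))
```

Any candidate expression failing such a check can be discarded before it is ever fitted to data, which is one way to see why the constraint shrinks the search space.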
Upcoming photometric surveys such as the Legacy Survey of Space and Time (LSST) will image billions of galaxies, a number required for extracting the faint weak lensing signal over a large range of cosmological distances. The combination of depth and area coverage of the imagery will be unprecedented ($r \sim 27.5$, $\sim20\,000\,\text{deg}^2$), and processing it will be fraught with many challenges. One of the most pressing issues is the fact that roughly 50% of the galaxies will be “blended”, meaning that their projections on our detectors will overlap with other astronomical objects along the same line of sight. Without appropriate “deblending” algorithms, the blends introduce an unacceptable error on the weak lensing signal.
Several deblending algorithms have emerged over the past years, of which the most promising are based on deep neural networks (DNNs). DNNs are known to be highly sensitive to a difference in the distributions of the training and validation datasets. As the true deblended image of a blend, needed for supervised learning, is in most cases unobtainable due to the line of sight projection, training data has to be generated algorithmically. This training data will by its nature have a limited coverage of the high dimensional space that spans all galaxies that will be observed with the LSST. In other words, many galaxies and blends observed by the LSST will be out of distribution (OOD) and the DNNs will perform poorly on them. We have developed a method to classify blends as OOD or in-distribution (IID) based on the distribution of an input blend sample in the latent space of a $\beta$-VAE, compared to the latent space distribution of the training sample. We will present the results of the OOD flagging, demonstrating that the latent space is indeed a useful tool for identifying OOD samples. Furthermore, we will discuss the ensuing reduction in the error of shear and photometry measurements when rejecting OOD samples for the weak lensing analysis.
Though core components of our method build on an existing deblending algorithm by Arcelin et al. (2021), the addition of this successful OOD detection technique is essential for its proper functioning on future LSST imagery. The blends flagged as OOD can, in future pipelines, be separated from the IID blends to prevent contamination of the weak lensing signal or be deblended with a method specifically tuned to OOD blends.
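One simple way to compare a sample's latent position with the training distribution is a Mahalanobis distance with a threshold calibrated on the training set. The sketch below uses random vectors as stand-ins for encoder outputs; the distance measure, the 99% threshold, and all function names are assumptions for illustration, since the abstract does not specify the criterion actually used.

```python
import numpy as np

def fit_latent_gaussian(train_latents):
    """Summarize the training latents by their mean and inverse covariance."""
    mean = train_latents.mean(axis=0)
    cov = np.cov(train_latents, rowvar=False)
    return mean, np.linalg.inv(cov)

def mahalanobis_sq(latents, mean, cov_inv):
    """Squared Mahalanobis distance of each latent vector to the training set."""
    diff = latents - mean
    return np.einsum("ij,jk,ik->i", diff, cov_inv, diff)

def flag_ood(latents, mean, cov_inv, threshold):
    """Flag samples whose distance exceeds the calibrated threshold."""
    return mahalanobis_sq(latents, mean, cov_inv) > threshold

rng = np.random.default_rng(1)
train_latents = rng.normal(size=(5000, 8))          # stand-in for encoder outputs
mean, cov_inv = fit_latent_gaussian(train_latents)

# Calibrate so that about 1% of the training samples would be flagged.
threshold = np.quantile(mahalanobis_sq(train_latents, mean, cov_inv), 0.99)

iid_latents = rng.normal(size=(1000, 8))            # same distribution as training
ood_latents = rng.normal(loc=3.0, size=(1000, 8))   # shifted, hence out of distribution
iid_flag_rate = flag_ood(iid_latents, mean, cov_inv, threshold).mean()
ood_flag_rate = flag_ood(ood_latents, mean, cov_inv, threshold).mean()
```

The threshold trades completeness against purity: a lower quantile rejects more genuine OOD blends but also discards more usable IID ones.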
Data compression to informative summaries is essential for modern data analysis. Neural regression is a popular simulation-based technique for mapping data to parameters as summaries over a prior, but it is usually agnostic to how uncertainties in the information geometry, or the data-summary relationship, change over parameter space. We present Fishnets, a general simulation-based, neural compression approach to calculating the Fisher information and score for arbitrary data structures as functions of parameters. These compression networks can be scaled information-optimally to arbitrary data structures, and are robust to changes in data distribution, making them ideal tools for cosmological and graph dataset analyses.
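The role of the score and Fisher information as additive summaries can be seen in a case where both are known analytically: independent Gaussian data with a common mean and per-datum noise levels. The sketch below is only this analytic toy, with assumed parameter values; in Fishnets the per-datum contributions are instead learned by neural networks.

```python
import numpy as np

def score_and_fisher(data, sigma, theta_fid):
    """Per-datum score and Fisher information for d_i ~ N(theta, sigma_i^2)."""
    score = (data - theta_fid) / sigma**2    # d log L / d theta at theta_fid
    fisher = 1.0 / sigma**2                  # expected curvature per datum
    return score, fisher

rng = np.random.default_rng(2)
theta_true, theta_fid = 1.3, 1.0             # illustrative values
sigma = rng.uniform(0.5, 2.0, size=500)      # heteroscedastic noise levels
data = theta_true + sigma * rng.normal(size=500)

score, fisher = score_and_fisher(data, sigma, theta_fid)

# Independent data: scores and Fisher information simply add up.
total_score, total_fisher = score.sum(), fisher.sum()

# One Newton step from the fiducial point gives the parameter estimate.
theta_hat = theta_fid + total_score / total_fisher
theta_err = total_fisher ** -0.5
```

Because the two sums run over data points, the same compression applies to a set of any size, which is the property that lets this kind of summary scale to arbitrary data structures.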
Studying Active Galactic Nuclei (AGN) is crucial to understand processes regarding the birth and evolution of Super-Massive Black Holes and their connection with star formation and galaxy evolution. However, few AGN have been identified in the Epoch of Reionization (EoR, z > 6), making it difficult to study their properties. In particular, a very small fraction of these AGN have been radio detected. Simulations and models predict that future observatories might increase these numbers drastically.
It becomes fundamental, then, to establish connections between radio emission and other multi-wavelength properties at high z. Recent wide-area multi-survey data have opened a window into obtaining these connections and rules.
At the same time, the development and operation of large-scale radio observatories render the use of regular AGN detection and redshift determination techniques inefficient. Machine Learning (ML) methods can help to predict the detection of AGN and some of their properties. We have therefore developed a series of ML models that, using multi-wavelength photometry, can produce a list of Radio Galaxy candidates, with their predicted redshift values.
More importantly, we have also applied some state-of-the-art feature importance techniques to understand which physical properties drive the predictions made by our models. From these techniques, it is possible to derive indicators for the selection of studied sources.
We will present the results of applying these models and techniques on near-infrared (NIR)-selected sources from the HETDEX Spring Field and the Stripe 82 Field. Furthermore, using feature importances, we will describe which properties hold the highest predictive power and the derivation of an efficient colour-colour criterion for the identification of AGN candidates. Moreover, we will introduce our efforts to apply said models and procedures to data in the area of the Evolutionary Map of the Universe (EMU, a precursor of the SKA Observatory) Pilot Survey.
Machine learning (ML) is having a transformative impact on astrophysics. The field is starting to mature, as we are moving beyond the naive application of off-the-shelf, black-box ML models towards approaches where ML is an integral component in a larger, principled analysis methodology. Furthermore, not only are astrophysical analyses benefiting from the use of ML, but ML models themselves can be greatly enhanced by integrating knowledge of relevant physics. I will review three maturing areas where ML and astrophysics have already demonstrated some success, while still providing many further opportunities and challenges. (1) Physics-enhanced learning integrates knowledge of relevant physics into ML models, either through augmentation, encoding symmetries and invariances, encoding dynamics, or directly through physical models that are integrated into the ML model. (2) In statistical learning, ML and statistical models are tightly coupled to provide probabilistic frameworks, often in a Bayesian setting, that offer uncertainty quantification, generative models, accelerated inference, and data-driven priors. (3) For scientific analyses in particular, it is important that ML models are not opaque black boxes but are intelligible, ensuring truthfulness, explainability and interpretability. Throughout I will provide numerous examples of astrophysical studies where such approaches have been or are being developed and applied, in the context of upcoming observations from the Euclid satellite, the Rubin Observatory Legacy Survey of Space and Time (LSST), and the Square Kilometre Array (SKA). Finally, I will highlight outstanding challenges and some thoughts on how these may be overcome.
The matter power spectrum of cosmology, P(k), is of fundamental importance in cosmological analyses, yet solving the Boltzmann equations can be computationally prohibitive if required several thousand times, e.g. in an MCMC analysis. Emulators for P(k) as a function of cosmology have therefore become popular, whether they be neural network or Gaussian process based. Yet one of the oldest emulators we have is an analytic, physics-informed fit proposed by Eisenstein and Hu (E&H). Given this is already accurate to within a few percent, does one really need a large, black-box, numerical method for calculating P(k), or can one simply add a few terms to E&H? In this talk I demonstrate that Symbolic Regression can obtain such a correction, yielding sub-percent level predictions for P(k).
IAP Cocktail
Deep generative models parametrize very flexible families of distributions able to fit complicated datasets of images or text. These models provide independent samples from complex high-dimensional distributions at negligible cost. On the other hand, sampling exactly from a target distribution, such as a Bayesian posterior or the Boltzmann distribution of a physical system, is typically challenging, because of dimensionality, multi-modality, ill-conditioning or a combination of these. In this talk, I will discuss opportunities and challenges in enhancing traditional inference and sampling algorithms with learning.
A significant statement regarding the existence of primordial non-Gaussianity stands as one of the key objectives of next-generation galaxy surveys. However, traditional methods are burdened by a variety of issues, such as the handling of unknown systematic effects, the combination of multiple probes of primordial non-Gaussianity, and the capturing of information beyond the largest scales in the data. In my presentation, I will introduce my pioneering work of applying field-level inference to constrain primordial non-Gaussianity in galaxy surveys. I will discuss how my method can resolve the challenges faced by other approaches and how I can capture more information from the data compared to traditional methods. Additionally, I will explore the additional data products that my method enables and delve into other potential applications. Finally, I will briefly touch upon the future use of field-level inference to study the primordial universe, along with the promises and challenges inherent in this approach.
Simulation-based inference (SBI) building on machine-learnt density estimation and massive data compression has the potential to become the method of choice for analysing large, complex datasets in survey cosmology. I will present recent work that implements every ingredient of the current Kilo-Degree Survey weak lensing analysis into an SBI framework which runs on similar timescales as a traditional analysis relying on analytic models and a Gaussian likelihood. We show how the SBI analysis recovers and, in several key aspects, goes beyond the traditional approach. I will also discuss challenges and their solutions to SBI-related data compression and goodness-of-fit in several real-world cosmology applications.
Normalizing Flows (NF) are generative models which transform a simple prior distribution into the desired target. They however require the design of an invertible mapping whose Jacobian determinant has to be computable. Recently introduced, Neural Hamiltonian Flows (NHF) are Hamiltonian dynamics-based flows, which are continuous, volume-preserving and invertible and thus make for natural candidates for robust NF architectures. In particular, their similarity to classical mechanics could lead to easier interpretability of the learned mapping. In this presentation, I will detail the NHF architecture and show that they may still pose a challenge to interpretability. For this reason, I will introduce a fixed kinetic energy version of the model. Inspired by physics, this approach improves interpretability and requires fewer parameters than the original model. I will talk about the robustness of the NHF architecture, especially its fixed-kinetic version, on a simple 2D problem and present first results in higher dimensions. Finally, I will show how to adapt NHF to the context of Bayesian inference and illustrate the method on an example from cosmology.
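The invertibility claimed for Hamiltonian flows can be checked numerically with a leapfrog discretization of $H(q, p) = |p|^2/2 + V(q)$, where the quadratic kinetic term is fixed. In the sketch below, the toy potential stands in for what would be a learned network, and the step size and number of steps are arbitrary assumptions; it is not the NHF architecture itself.

```python
import numpy as np

def grad_potential(q):
    """Gradient of a toy potential V(q) = sum(q^4)/4 + sum(q^2)/2.

    In a Neural Hamiltonian Flow this role would be played by a trained network.
    """
    return q**3 + q

def hamiltonian_flow(q, p, step_size, n_steps):
    """Leapfrog flow for H = |p|^2 / 2 + V(q) with a fixed quadratic kinetic energy."""
    for _ in range(n_steps):
        p = p - 0.5 * step_size * grad_potential(q)   # half kick
        q = q + step_size * p                         # drift
        p = p - 0.5 * step_size * grad_potential(q)   # half kick
    return q, p

rng = np.random.default_rng(3)
q0, p0 = rng.normal(size=(2, 100, 2))   # 100 samples of 2D positions and momenta

# Forward map: push the samples through the flow.
q1, p1 = hamiltonian_flow(q0, p0, step_size=0.05, n_steps=40)

# Inverse map: the same integrator with the sign of the step reversed.
q_back, p_back = hamiltonian_flow(q1, p1, step_size=-0.05, n_steps=40)
```

Each kick and drift is a shear in phase space with unit Jacobian determinant, so the composed map is volume-preserving and no determinant has to be computed when evaluating densities.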
The cosmic web, or Large-Scale Structure (LSS) is the massive spiderweb-like arrangement of galaxy clusters and the dark matter holding them together under gravity. The lumpy, spindly universe we see today evolved from a much smoother, infant universe. How this structure formed and the information embedded within is considered one of the “Holy Grails” of modern cosmology, and might hold the key to resolving existing “tensions” in cosmological theory. But how do we go about linking this data to theory? Cosmological surveys are comprised of millions of pixels, which can be difficult for samplers and analytic likelihood analysis. This also poses a problem for simulation-based inference: how can we best compare simulations to observed data? Information Maximising Neural Networks (IMNNs) offer a way to compress massive datasets down to (asymptotically) lossless summaries that contain the same cosmological information as a full sky survey, as well as quantify the information content of an unknown distribution. We will look at LSS assembled as a graph (or network) from discrete catalogue data, and use graph neural networks in the IMNN framework to optimally extract information about cosmological parameters (theory) from this representation. We will make use of the modular graph structure as a way to open the “black box” of simulation-based inference and neural network compression to show where cosmological information is stored.
Knowledge of the primordial matter density field from which the present non-linear observations formed is of fundamental importance for cosmology, as it contains an immense wealth of information about the physics, evolution, and initial conditions of the universe. Reconstructing this density field from the galaxy survey data is a notoriously difficult task, requiring sophisticated statistical methods, advanced cosmological simulators, and exploration of a multi-million-dimensional parameter space. In this talk, I will discuss how Gaussian Autoregressive Neural Ratio Estimation (a recent approach in simulation-based inference) allows us to tackle this problem and sequentially obtain data-constrained realisations of the primordial dark matter density field in a simulation-efficient way for general non-differentiable simulators. In addition, I will describe how graph neural networks can be used to get optimal data summaries for galaxy maps, and how our results compare to those obtained with classical likelihood-based methods such as Hamiltonian Monte Carlo.
Modern cosmological inference typically relies on likelihood expressions and covariance estimations, which can become inaccurate and cumbersome depending on the scales and summary statistics under consideration. Simulation-based inference, in contrast, does not require an analytical form for the likelihood but only a prior distribution and a simulator, thereby naturally circumventing these issues. In this talk, we will explore how this technique can be used to infer $\sigma_8$ from a forward model based on Lagrangian Perturbation Theory and the bias expansion. The power spectrum and the bispectrum are used as summary statistics to obtain the posterior of the cosmological, bias and noise parameters via neural density estimation.
Unlocking the full potential of next-generation cosmological data requires navigating the balance between sophisticated physics models and computational demands. We propose a solution by introducing a machine learning-based field-level emulator within the HMC-based Bayesian Origin Reconstruction from Galaxies (BORG) inference algorithm. The emulator, an extension of the first-order Lagrangian Perturbation Theory (LPT), achieves remarkable accuracy compared to N-body simulations while significantly reducing evaluation time. Leveraging its differentiable neural network architecture, the emulator enables efficient sampling of the high-dimensional space of initial conditions. To demonstrate its efficacy, we use the inferred posterior samples of initial conditions to run constrained N-body simulations, yielding highly accurate present-day non-linear dark matter fields compared to the underlying truth used during inference.
Model misspecification is a long-standing problem for Bayesian inference: when the model differs from the actual data-generating process, posteriors tend to be biased and/or overly concentrated. This issue is particularly critical for cosmological data analysis in the presence of systematic effects. I will briefly review state-of-the-art approaches based on an explicit field-level likelihood, which sample known foregrounds and automatically report unknown data contaminations. I will then present recent methodological advances in the implicit likelihood approach, with arbitrarily complex forward models of galaxy surveys where all relevant statistics can be determined from numerical simulations. The method (Simulator Expansion for Likelihood-Free Inference, SELFI) makes it possible to push analyses further into the non-linear regime than state-of-the-art backward modelling techniques. Importantly, it allows a check for model misspecification at the level of the initial matter power spectrum before final inference of cosmological parameters. I will present an application to a Euclid-like configuration.
State-of-the-art astronomical simulations have provided datasets which have enabled the training of novel deep learning techniques for constraining cosmological parameters. However, differences in subgrid physics implementation and numerical approximations among simulation suites lead to differences in simulated datasets, which pose a hard challenge when trying to generalize across diverse data domains and ultimately when applying models to observational data.
Recent work reveals that deep learning algorithms are able to extract more information from complex cosmological simulations than summary statistics like power spectra. We introduce Domain Adaptive Graph Neural Networks (DA-GNNs), trained on CAMELS data, inspired by CosmoGraphNet (Villanueva-Domingo et al. 2023). By utilizing GNNs, we can capitalize on their capacity to capture both astrophysical and topological features of galaxy distributions. Mixing these capabilities with domain adaptation techniques such as Maximum Mean Discrepancy (MMD), which enable extraction of domain-invariant features, our framework demonstrates enhanced accuracy and robustness. We present experimental results, including the alignment of distributions across domains through data visualization.
These findings suggest that DA-GNNs are an efficient way of extracting domain independent cosmological information, a vital step toward robust deep learning for real cosmic survey data.
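As a rough illustration of the Maximum Mean Discrepancy mentioned above, the sketch below computes a biased estimate of the squared MMD between two sets of feature vectors with a Gaussian kernel. It is only a minimal NumPy example under stated assumptions: the function names, the kernel bandwidth, and the random stand-in features are hypothetical and are not taken from the DA-GNN work.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth):
    """Gaussian (RBF) kernel matrix between the rows of a and the rows of b."""
    sq_dists = (np.sum(a**2, axis=1)[:, None]
                + np.sum(b**2, axis=1)[None, :]
                - 2.0 * a @ b.T)
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))

def mmd2(source, target, bandwidth=3.0):
    """Biased estimate of the squared MMD between two samples of features."""
    k_ss = rbf_kernel(source, source, bandwidth).mean()
    k_tt = rbf_kernel(target, target, bandwidth).mean()
    k_st = rbf_kernel(source, target, bandwidth).mean()
    return k_ss + k_tt - 2.0 * k_st

# Hypothetical stand-ins for latent features from two simulation suites.
rng = np.random.default_rng(0)
feats_a = rng.normal(0.0, 1.0, size=(200, 8))  # "source" domain
feats_b = rng.normal(0.0, 1.0, size=(200, 8))  # same distribution as feats_a
feats_c = rng.normal(1.0, 1.0, size=(200, 8))  # shifted "target" domain

mmd_same = mmd2(feats_a, feats_b)
mmd_shifted = mmd2(feats_a, feats_c)
```

In a domain-adaptation setting, a term of this kind would typically be added to the training loss so that features from different simulation suites are pushed toward the same distribution; here `mmd_shifted` comes out larger than `mmd_same`, as expected for the shifted sample.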
Detection, deblending, and parameter inference for large galaxy surveys have been and still are performed with simplified parametric models, such as bulge-disk or single Sersic profiles. The complex structure of galaxies, revealed by higher resolution imaging data, such as those gathered by HST or, in the future, by Euclid and Roman, makes these simplifying assumptions problematic. Biases arise in photometry and shape measurements, and I will discuss examples for both.
On the other hand, non-parametric modeling also has a long history in many fields of image processing, but it is limited to signal-to-noise regimes that are high by the standards of most astrophysical surveys. This weakness can be overcome by specifying priors over the space of galaxy images. I will present a new codebase, scarlet2, written entirely in JAX, for modeling complex extragalactic scenes. I will also discuss how to integrate data-driven priors in the form of score models, and show examples of sampling from posteriors to assess the uncertainties in heavily blended configurations. I will conclude with an outlook on how these tools need to be extended to fully exploit the data from the combination of optical surveys that will shape astrophysics in the 2020s.
Cosmic voids identified in the spatial distribution of galaxies provide complementary information to two-point statistics. In particular, constraints on the neutrino mass sum, $\sum m_\nu$, promise to benefit from the inclusion of void statistics. We perform inference on the CMASS NGC sample of SDSS-III/BOSS with the aim of constraining $\sum m_\nu$. We utilize the void size function, the void-galaxy cross power spectrum, and the galaxy auto power spectrum. To extract constraints from these summary statistics we use a simulation-based approach, specifically implicit likelihood inference. We populate approximate gravity-only, particle neutrino cosmological simulations with an expressive halo occupation distribution model. With a conservative scale cut of $k_\text{max}=0.15\,h\text{Mpc}^{-1}$ and a Planck-inspired $\Lambda$CDM prior, we find upper bounds on $\sum m_\nu$ of $0.43$ and $0.35\,\text{eV}$ from the galaxy auto power spectrum and the full data vector, respectively ($95\,\%$ credible interval). We observe hints that the void statistics may be most effective at constraining $\sum m_\nu$ from below. We also substantiate the usual assumption that the void size function is Poisson distributed.
Cosmic voids are the largest and most underdense structures in the Universe. Their properties have been shown to encode precious information about the laws and constituents of the Universe. We show that machine learning techniques can unlock the information in void features for cosmological parameter inference. Using thousands of void catalogs from the GIGANTES dataset, we explore three properties of voids: ellipticity, density contrast, and radius. Specifically, we train 1) fully connected neural networks on histograms from void properties and 2) deep sets from void catalogs, to perform likelihood-free inference on the value of cosmological parameters. Our results provide an illustration of how machine learning can be a powerful tool for constraining cosmology with voids.
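The deep-set idea mentioned above can be sketched as follows: each void is embedded independently, the embeddings are pooled with a permutation-invariant operation, and the pooled vector is mapped to the outputs. This is a minimal NumPy illustration with randomly initialised, untrained weights; the layer sizes, the mean pooling, and the two-component output are assumptions made for the example, not details of the work described.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical, untrained weights: 3 void features -> 16 hidden -> 2 outputs.
W_phi = rng.normal(size=(3, 16))
W_rho = rng.normal(size=(16, 2))

def deep_set(voids):
    """Map a void catalog of shape (n_voids, 3) to a fixed-size output.

    Columns are assumed to be [ellipticity, density contrast, radius].
    """
    per_void = np.tanh(voids @ W_phi)  # phi: embed each void independently
    pooled = per_void.mean(axis=0)     # permutation-invariant pooling
    return pooled @ W_rho              # rho: map pooled embedding to outputs

catalog = rng.uniform(size=(500, 3))
shuffled = catalog[rng.permutation(len(catalog))]

out = deep_set(catalog)
out_shuffled = deep_set(shuffled)
```

Because of the pooling step, `out` and `out_shuffled` agree up to floating-point round-off, and catalogs with different numbers of voids can be processed by the same network.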
I present a novel, general-purpose Python-based framework for scalable and efficient statistical inference by means of hierarchical modelling and simulation-based inference.
The framework is built combining the JAX and NumPyro libraries. The combination of differentiable and probabilistic programming offers the benefits of automatic differentiation, XLA optimization, and the ability to further improve the computational performance by running on GPUs and TPUs as well. These properties allow for efficient sampling through gradient-based methods, and for significantly enhanced performance of neural density estimation for simulation-based inference, augmented by the simulator gradients.
The framework seamlessly integrates with the recently developed COSMOPOWER-JAX and JAX-COSMO libraries, making it an ideal platform to solve Bayesian inverse problems in cosmology. Beyond cosmology, the framework is designed to be a versatile, robust tool for cutting-edge analysis of astronomical surveys. I demonstrate its practical utility through applications to various domains, including but not limited to weak lensing, supernovae, and galaxy clusters.
Optimal extraction of the non-Gaussian information encoded in the Large-Scale Structure (LSS) of the universe lies at the forefront of modern precision cosmology. We propose achieving this task through the use of the Wavelet Scattering Transform (WST), which subjects an input field to a layer of non-linear transformations that are sensitive to non-Gaussianities through a generated set of WST coefficients. In order to assess its applicability in the context of LSS surveys, we perform the first WST application on actual galaxy observations, through a WST analysis of the BOSS DR12 CMASS dataset. We lay out the detailed procedure on how to capture all necessary layers of realism for an application on data obtained from a spectroscopic survey, including the effects of redshift-space anisotropy, non-trivial survey geometry, the shortcomings of the dataset through a set of systematic weights and the Alcock-Paczynski distortion effect. Using the suite of AbacusSummit simulations, we construct an emulator for the cosmological dependence of the WST coefficients and perform a likelihood analysis of the CMASS data to obtain the marginalized errors on cosmological parameters. The WST is found to deliver a substantial improvement in the values of the predicted 1σ errors compared to the regular galaxy power spectrum. Lastly, we discuss recent progress towards applying these techniques in order to fully harness the constraining power of upcoming spectroscopic observations by Stage-IV surveys such as DESI and Euclid.
The Lyman-$\alpha$ forest presents a unique opportunity to study the distribution of matter in the high-redshift universe and extract precise constraints on the nature of dark matter, neutrino masses, and other extensions to the ΛCDM model. However, accurately interpreting this observable requires precise modeling of the thermal and ionization state of the intergalactic medium, which often relies on computationally intensive hydrodynamical simulations. In this study, we introduce the first neural-network emulator capable of rapidly predicting the one-dimensional Lyman-$\alpha$ flux power spectrum ($P_{1D}$) as a function of cosmological and IGM parameters.
Traditionally, Gaussian processes have been the preferred choice for emulators due to their ability to make robust predictions with fewer training data points. However, this advantage comes at the cost of runtimes that scale cubically with the number of data points. With the continuous growth of training data sets, the need to transition to algorithms such as neural networks becomes increasingly crucial. Unlike other methods, neural networks provide a linear scaling between time and the number of training points. This scalability is particularly advantageous as it allows for efficient processing even with large datasets. Additionally, the use of GPUs further accelerates neural-network computations, enhancing the speed and efficiency of the training process.
Our emulator has been specifically designed to analyze medium-resolution spectra from the Dark Energy Spectroscopic Instrument (DESI) survey, considering scales ranging from $k_{\parallel}$ = 0.1 to 4 Mpc$^{-1}$ and redshifts from $z$ = 2 to $z$ = 4.5. DESI employs a sophisticated instrument equipped with thousands of optical fibers that simultaneously collect spectra from millions of galaxies and quasars. DESI started 2 years ago, and it has already doubled the number of quasar spectra previously obtained.
Our approach involves modeling $P_{1D}$ as a function of the slope and amplitude of the linear matter power spectrum, rather than directly as a function of cosmological parameters. We demonstrate that our emulator achieves sub-percent precision across the entire range of scales. Additionally, the emulator maintains this level of accuracy for three ΛCDM extensions: massive neutrinos, running of the spectral index, and curvature. It also performs at the percent level for thermal histories not present in the training set.
To emulate the probability distribution of $P_{1D}$ at any given $k$ scale, we employ a mixture density network. This allows us to estimate the emulator's uncertainty for each prediction, enabling the rejection of measurements associated with high uncertainty. We have observed that the neural network assigns higher uncertainties to inaccurate emulated $P_{1D}$ values and to training points that lie close to the limits of the convex hull. Furthermore, by emulating the probability distribution of $P_{1D}$, we can estimate the covariance of the emulated values, providing insights into the correlation at different scales. While further investigations are required to enhance our understanding of $P_{1D}$ measurement covariances, we are pleased to note that, to the best of our knowledge, this study represents the first instance in which a complete emulator covariance is provided for $P_{1D}$ emulators, rather than solely focusing on the diagonal elements.
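To make the mixture density idea concrete, the sketch below shows the two quantities such a network typically works with at a single $k$ scale: the negative log-likelihood of a target value under a Gaussian mixture (the training loss), and the mixture mean and standard deviation (the prediction and its uncertainty). This is a minimal NumPy illustration; the number of components and the numerical values are made up, and the actual emulator architecture and loss may differ.

```python
import numpy as np

def mixture_nll(y, weights, means, sigmas):
    """Negative log-likelihood of a scalar y under a 1D Gaussian mixture."""
    log_comp = (np.log(weights)
                - 0.5 * np.log(2.0 * np.pi * sigmas**2)
                - 0.5 * ((y - means) / sigmas) ** 2)
    # Log-sum-exp over components for numerical stability.
    m = log_comp.max()
    return -(m + np.log(np.sum(np.exp(log_comp - m))))

def mixture_mean_std(weights, means, sigmas):
    """Mean and standard deviation of a 1D Gaussian mixture."""
    mean = np.sum(weights * means)
    var = np.sum(weights * (sigmas**2 + means**2)) - mean**2
    return mean, np.sqrt(var)

# Hypothetical network outputs for one k bin (two mixture components).
weights = np.array([0.7, 0.3])
means = np.array([1.0, 1.2])
sigmas = np.array([0.05, 0.1])

loss = mixture_nll(1.05, weights, means, sigmas)
pred_mean, pred_std = mixture_mean_std(weights, means, sigmas)
```

A large `pred_std` relative to `pred_mean` would flag a prediction as uncertain, which is the kind of criterion that allows high-uncertainty predictions to be rejected.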
Given the demonstrated sub-percent precision, robustness to ΛCDM extensions, and the ability to estimate uncertainties, we expect that the developed neural network emulator will play a crucial role in the cosmological analysis of the DESI survey.
Rapid strides are currently being made in the field of artificial intelligence using Transformer-based models like Large Language Models (LLMs). The potential of these methods for creating a single, large, versatile model in astronomy has not yet been explored except for some uses of the basic component of the Transformer, the attention mechanism. In this talk, we will present a framework for data-driven astronomy that uses the same core techniques and architecture as LLMs but operates directly on floating-point data rather than natural language. Using a variety of observations and labels of stars as an example, we have built a Transformer-based model and trained it in a self-supervised manner with cross-survey data sets to perform a variety of inference tasks. In particular, we have demonstrated that a single model can perform both discriminative and generative tasks even if the model was not trained or fine-tuned to do any specific task. For example, on the discriminative task of deriving stellar parameters from Gaia XP spectra, our model slightly outperforms an expertly trained XGBoost model in the same setting of inputs and outputs combination. But the same model can also generate Gaia XP spectra from stellar parameters, inpaint unobserved spectral regions, extract empirical stellar loci, and even determine the interstellar extinction curve. The framework allows us to train such foundation models on large cross-survey, multidomain astronomical data sets with a huge amount of missing data due to the different footprints of the surveys. This demonstrates that building and training a single foundation model without fine-tuning using data and parameters from multiple surveys to predict unmeasured observations and parameters is well within reach. Such 'Large Astronomy Models' trained on large quantities of observational data will play a large role in the analysis of current and future large surveys.
Field-level likelihood-free inference is a recent method for extracting cosmological information that surpasses the usual, time-consuming traditional methods. In this work we train different machine learning models, without any cut on scale, considering a sequence of distinct selections on galaxy catalogs from the CAMELS suite in order to reproduce the main challenges of real data observations. We consider mask effects, peculiar velocity uncertainties, and galaxy selection effects. We also show that we obtain a robust model across different sub-grid physical models such as Astrid, SIMBA, IllustrisTNG, Magneticum, and SWIFT-EAGLE using only galaxy phase-space information ($3$D positions and $1$D velocity).
Moreover, we are able to show that the model can still track the matter content of the simulations keeping only the $2$D positions and $1$D velocity. The main purpose is to provide a proof of concept that graph neural networks, together with moment neural networks, can be used as a useful and powerful machinery to constrain cosmology for the next generation of surveys.
Upcoming cosmological weak lensing surveys are expected to constrain cosmological parameters with unprecedented precision. In preparation for these surveys, large simulations with realistic galaxy populations are required to test and validate analysis pipelines. However, these simulations are computationally very costly -- and at the volumes and resolutions demanded by upcoming cosmological surveys, they are computationally infeasible.
Here, we propose a Deep Generative Modeling approach to address the specific problem of emulating realistic 3D galaxy orientations in synthetic catalogs. For this purpose, we develop a novel Score-Based Diffusion Model specifically for the SO(3) manifold. The model accurately learns and reproduces correlated orientations of galaxies and dark matter halos that are statistically consistent with those of a reference high-resolution hydrodynamical simulation.
Deep generative models parametrize very flexible families of distributions able to fit complicated datasets of images or text. These models provide independent samples from complex high-dimensional distributions at negligible cost. On the other hand, sampling exactly from a target distribution, such as a Bayesian posterior or the Boltzmann distribution of a physical system, is typically challenging because of dimensionality, multi-modality, ill-conditioning, or a combination of these. In this talk, I will discuss opportunities and challenges in enhancing traditional inference and sampling algorithms with learning.
New large-scale astronomical surveys such as the Vera Rubin Observatory's Legacy Survey of Space and Time (LSST) have the potential to revolutionize transient astronomy, providing opportunities to discover entirely new classes of transients while also enabling a deeper understanding of known supernovae. LSST is expected to observe over 10 million transient alerts every night, over an order of magnitude more than any preceding survey. In this talk, I'll discuss the issue that with such large data volumes, the astronomical community will struggle to prioritize which transients - rare, interesting, or young - should be followed up. I address three major challenges: (1) automating real-time classification of transients, (2) automating serendipity by identifying the likelihood of a transient being interesting and anomalous, and (3) identifying the epoch time in order to observe transients early to understand their central engine and progenitor systems. I present machine learning and Bayesian methods of automating real-time classification, anomaly detection, and predicting epoch times of transients. Our ability to classify events and identify anomalies improves over the lifetime of the light curves.
Amidst the era of astronomical surveys that collect massive datasets, neural networks have emerged as powerful tools to address the challenge of exploring and mining these enormous volumes of information from our sky. Among the obstacles in the study of these surveys is the identification of exoplanetary signatures in the photometric light curves. In this presentation, we will discuss how convolutional neural networks can significantly facilitate the detection of exoplanets, focusing on two exoplanetary detection methods: (1) planetary transits and (2) gravitational microlensing. We will elaborate on (1) their proven success in detecting planetary transit signals within the Transiting Exoplanet Survey Satellite data and (2) our ongoing project to identify gravitational microlensing events using the nine-year Microlensing Observations in Astrophysics dataset. Our strategy proposes using only raw photometric light curves as input for our neural network pipeline, which, after training, can detect the desired signal in a light curve in milliseconds. Looking towards future space missions, we will discuss the role of neural networks as an alternative pipeline to accelerate the identification of potential exoplanet candidates in the Nancy Grace Roman Space Telescope data.
Weak gravitational lensing is an excellent quantifier of the growth of structure in our universe, as the distortion of galaxy ellipticities measures the spatial fluctuations in the matter field density along a line of sight. Traditional two-point statistical analyses of weak lensing only capture Gaussian features of the observable field, hence missing information from smaller scales where non-linear gravitational interactions yield non-Gaussian features in the matter distribution. Higher-order statistics such as peak counts, Minkowski functionals, and three-point correlation functions, as well as convolutional neural networks, have been introduced to capture this additional non-Gaussian information and improve constraints on key cosmological parameters such as $\Omega_m$ and $\sigma_8$.
We demonstrate the potential of applying a self-attention-based deep learning method, specifically a Vision Transformer, to predict cosmological parameters from weak lensing observables, particularly convergence $\kappa$ maps. Transformers, which were first developed for natural language processing and are now at the core of generative large language models, can be used for computer vision tasks with patches from an input image serving as sequential tokens analogous to words in a sentence. In the context of weak lensing, Vision Transformers are worth exploring for their different approach to capturing long-scale and inter-channel information, improved parallelization, and lack of strong inductive bias and locality of operations.
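The tokenization step described above, in which image patches play the role of words, can be sketched in a few lines of NumPy. The map size, patch size, and function name below are hypothetical choices for illustration and are not taken from the analysis itself.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W) map into flattened, non-overlapping patch x patch tokens."""
    h, w = image.shape
    if h % patch != 0 or w % patch != 0:
        raise ValueError("image dimensions must be multiples of the patch size")
    return (image.reshape(h // patch, patch, w // patch, patch)
                 .transpose(0, 2, 1, 3)        # group pixels by patch row/column
                 .reshape(-1, patch * patch))  # one flattened token per patch

# Toy stand-in for a convergence map: an 8x8 grid split into four 4x4 patches.
kappa = np.arange(64, dtype=float).reshape(8, 8)
tokens = patchify(kappa, 4)
```

Each row of `tokens` is one patch in row-major order; in a Vision Transformer these rows would then be linearly projected and given positional embeddings before entering the self-attention layers.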
Using transfer learning, we compare the performance of Vision Transformers to that of benchmark residual convolutional networks (ResNets) on simulated $w$CDM theory predictions for $\kappa$, with noise properties and sky coverage similar to DES Y3, LSST Y1, and LSST Y10. We further use neural density estimators to investigate the differences in the cosmological parameters' posteriors recovered by either deep learning method. These results showcase a potential astronomical application derived from the advent of powerful large language models, as well as machine learning tools relevant to the next generation of large-scale surveys.
Most applications of ML in astronomy pertain to classification, regression, or emulation. However, ML has the potential to address whole new categories of problems in astronomical big data. This presentation uses ML in a statistically principled approach to observing strategy selection, which encompasses the frequency and duration of visits to each portion of the sky and impacts the degree to which the resulting data can be employed toward any scientific objective, let alone the net effect on many diverse science goals. Aiming to homogenize the units of observing strategy metrics across different science cases and minimize analysis model-dependence, we introduce TheLastMetric, a variational approximation to the lower bound of mutual information between a physical parameter of interest and anticipated data, a measure of the potentially recoverable information, under a given observing strategy. We demonstrate TheLastMetric in the context of photometric redshifts (photo-zs) from the upcoming Legacy Survey of Space and Time (LSST) on the Vera C. Rubin Observatory, showing qualitative agreement with traditional photo-z metrics and improved discriminatory power without assuming a photo-z estimation model. In combination with evaluations on other physical parameters of interest, TheLastMetric isolates the subjective assessment of relative priority of a science goal from the units-dependent sensitivity of its metric, enhancing the transparency and objectivity of the decision-making process. We thus recommend the broad adoption of TheLastMetric as an appropriate and effective paradigm for community-wide observing strategy optimization.
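For orientation, a standard variational lower bound on mutual information (the Barber-Agakov form) is written out below. Whether TheLastMetric uses exactly this form is an assumption; the symbols $\theta$ (physical parameter of interest), $X$ (anticipated data), and $q_\phi$ (a learned conditional density with parameters $\phi$) are introduced here for illustration only.

$$
I(\theta; X) = H(\theta) - H(\theta \mid X) \;\geq\; H(\theta) + \mathbb{E}_{p(\theta, X)}\left[\log q_\phi(\theta \mid X)\right]
$$

The inequality follows from the non-negativity of the Kullback-Leibler divergence between the true posterior $p(\theta \mid X)$ and $q_\phi(\theta \mid X)$, and it becomes an equality when the two coincide. Maximizing the right-hand side over $\phi$ therefore tightens the bound, and because the result is expressed in units of information, values for different science cases can be compared on the same footing.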
Galaxy mergers are unique in their ability to transform the morphological, kinematic, and intrinsic characteristics of galaxies on short timescales. The redistribution of angular momentum brought on by a merger can revive, enhance, or truncate star formation, trigger or boost the accretion rate of an AGN, and fundamentally alter the evolutionary trajectory of a galaxy.
These effects are well studied in spectroscopically distinct galaxy pairs, but less so in pre- and post-coalescence merger systems on account of their rarity, and complications surrounding their identification by traditional morphological metrics.
To overcome this obstacle, we use bespoke machine learning morphological classifications to search for merging and merged galaxies in two imaging surveys: the latest data release from the deep and high-resolution Canada France Imaging Survey (CFIS/UNIONS), and the Dark Energy Camera Legacy Survey (DECaLS). I will present the details of our machine learning methodology, and offer our work as a case study on the flexibility and utility of machine vision as a bridge between observations and simulations.
Thanks to new large datasets and methodological advantages ushered in by the popularization of machine learning in astronomy, I will present for the first time an updated, abundant, and pure sample of pre- and post-mergers, and show the results of a temporal study following the star formation and multi-wavelength AGN demographics of galaxy mergers all the way through to coalescence.
Deep learning is data-hungry; we typically need thousands to millions of labelled examples to train effective supervised models. Gathering these labels in citizen science projects like Galaxy Zoo can take years, delaying the science return of new surveys. In this talk, I’ll describe how we’re combining simple techniques to build better galaxy morphology models with fewer labels.
First [1], we’re using large-scale pretraining with supervised and self-supervised learning to reduce the number of labelled galaxy images needed to train effective models. For example, using self-supervised learning to pretrain on unlabelled Radio Galaxy Zoo images halves our error rate at distinguishing FRI and FRII radio galaxies in a separate dataset.
Second [2], we’re continually retraining our models to prioritise the most helpful galaxies for volunteers to label. Our probabilistic models filter out galaxies they can confidently classify, leaving volunteers able to focus on challenging and interesting galaxies. We used this to measure the morphology of every bright extended galaxy in HSC-Wide in weeks rather than years.
Third [3], we’re using natural language processing to capture radio astronomy classes (like “FRI” or “NAT”) through plain English words (like “hourglass”) that volunteers use to discuss galaxies. These words reveal which visual features are shared between astronomical classes, and, when presented as classification options, let volunteers classify complex astronomical classes in an intuitive way.
We are now preparing to apply these three techniques - pretraining, active learning, and natural language labels - to provide day-one galaxy morphology measurements for Euclid DR1.
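The active-learning step described above, in which confidently classified galaxies are filtered out so that volunteers see the challenging ones, can be sketched with a simple uncertainty score. The example below ranks galaxies by the entropy of the predicted class probabilities; this acquisition rule, the function names, and the probabilities are assumptions for illustration, and the rule actually used in the project may differ.

```python
import numpy as np

def predictive_entropy(probs):
    """Entropy of each row of class probabilities (higher = less confident)."""
    p = np.clip(probs, 1e-12, 1.0)  # avoid log(0)
    return -np.sum(p * np.log(p), axis=1)

def select_for_labelling(probs, n_select):
    """Indices of the n_select galaxies the model is least confident about."""
    entropy = predictive_entropy(probs)
    return np.argsort(entropy)[::-1][:n_select]

# Made-up model outputs for four galaxies and three morphology classes.
probs = np.array([
    [0.98, 0.01, 0.01],  # confidently classified: no need to ask volunteers
    [0.40, 0.35, 0.25],
    [0.70, 0.20, 0.10],
    [0.34, 0.33, 0.33],  # nearly uniform: most uncertain
])

chosen = select_for_labelling(probs, 2)
```

In a retraining loop, the galaxies in `chosen` would be sent to volunteers, the new labels added to the training set, and the model updated before the next selection round.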
The CDM model is in remarkable agreement with large-scale observations, but small-scale evidence remains scarce. Studying substructure through strong gravitational lensing can fill in the gap on small scales. In the upcoming years, we expect the number of observed strong lenses to increase by several orders of magnitude from ongoing and future surveys. Machine learning has the potential to optimally analyze these images, but its application to real observations remains limited. I will present the first application of machine learning to the analysis of subhalo properties in real strong lensing observations. Our work leverages a neural simulation-based inference technique in order to infer the density slopes of subhalos. I will compare our method’s predictions on HST images to the expected CDM measurements and discuss the implications of our work.
In our previous works, e.g., arXiv:2209.10333, deep learning techniques have succeeded in estimating galaxy cluster masses from observations of Sunyaev-Zel'dovich (SZ) maps, e.g. in the Planck PSZ2 catalog, and mass radial profiles from SZ mock maps. In the next step, we explore inferring 2D mass density maps from mock observations of SZ, X-ray and stars using THE THREE HUNDRED (The300) cosmological simulation. To do so, we investigate state-of-the-art deep learning models that have proven successful for image generation in multiple areas of research, including astrophysics and medical imaging. These models are conditioned on observations, e.g. SZ maps, to generate the most likely 2D matter distribution given our dataset, composed of around 140 thousand mock maps from The300. We show that these models can successfully infer the 2D matter distribution with a scatter of $\sim 14\%$ in their pixel distribution and reproduce the matter power spectrum when comparing the generated maps with the ground truth from the simulations. One of the main advantages of these generative models is that they can effectively combine several input views and extract the useful features of each of them to infer mass density maps. By combining SZ, X-ray and stars in a multichannel approach, the scatter is reduced by a factor of $\sim 2$ in comparison with the scatter computed for the single-view models.
The next natural step of this project is to apply deep learning models to high-resolution SZ observations, such as those from NIKA2, SPT and ACT. However, the mock images needed for training deep learning models must fully take into account the observational impact of the telescopes in order to mimic real observations.
The halo mass function describes the abundance of dark matter halos as a function of halo mass and depends sensitively on the cosmological model. Accurately modelling the halo mass function for a range of cosmological models will enable forthcoming surveys such as Vera C. Rubin Observatory’s Legacy Survey of Space and Time (LSST) and Euclid to place tight constraints on cosmological parameters. Due to the highly non-linear nature of halo formation, understanding which quantities determine the halo mass function for different cosmological models is difficult. We present an interpretable deep learning framework that allows us to find, with minimal prior assumptions, a compressed representation of the information required to accurately predict the halo mass function. We use neural network models that consist of an encoder-decoder architecture: the encoder compresses the input linear matter power spectrum and growth function into a low-dimensional representation, and the decoder uses this representation to predict halo abundance given a halo mass. We train the network to predict the halo mass function at redshift z=0 to better than 1% precision for a range of cosmological parameters. We then interpret the representation found by the network via measuring mutual information between the representation and quantities such as the ground truth halo number densities, the power spectrum, and cosmological parameters. This can enable us to gain new insights on what physics is involved in the process of halo formation, and a better understanding of how to accurately model the halo mass function for different cosmological models. The framework can also be extended to model the halo mass function over a range of redshifts.
The new generation of cosmological surveys, such as DESI (DESI Collaboration et al. 2022), has already started taking data, and even more will arrive with Euclid (Laureijs et al. 2011) and the LSST of the Vera Rubin Observatory (Ivezić et al. 2019) starting soon. At the same time, the classical methods of analysing RSD and BAO with 2-point statistics provide less stringent constraints than, for example, a full-modelling analysis (Ivanov et al. 2020). Such an analysis does, however, require much more computational power.
We present an emulator based on a feedforward neural network which allows us to significantly speed up analytical computations of the 2-point statistics in both Fourier and configuration space (Trusov et al. in prep). Our approach is based on emulating the perturbation theory (PT) quantities, which are later combined with bias terms to produce the non-linear prediction of the 2-point statistics for any galaxy sample. We compare the performance of our approach against publicly available PT codes using mocks based on the AbacusSummit simulations (Maksimova et al. 2021, Garrison et al. 2021), where our tool performs significantly faster without any noticeable loss of precision.
Conventional cosmic shear analyses, relying on two-point functions, do not have access to the non-Gaussian information present at the full field level, thus limiting our ability to constrain cosmological parameters precisely. Performing full-field inference is, in contrast, an optimal way to extract all available cosmological information, and it can be achieved with two widely different methodologies:
Explicit high-dimensional inference through the use of Bayesian Hierarchical Model (BHM)
Implicit Inference (also known as Simulation-Based Inference or Likelihood-Free Inference)
It is evident that differentiability of the forward model is essential for explicit inference, as this approach requires exploring a very high-dimensional space, which is only practical with gradient-based inference techniques (HMC, Variational Inference, etc.). In this work, we consider the question of whether implicit inference approaches can similarly benefit from having access to a differentiable simulator in a cosmological full-field inference scenario. Indeed, several methods (including ours) have been developed in recent years to leverage the gradients of the simulator to help constrain the inference problem, but the benefits of these gradients are problem dependent, raising the question of their benefit for cosmological inference. To answer this question, we consider a simplified full-field weak lensing analysis, emulating an LSST Y10 setting, and benchmark state-of-the-art implicit inference methods with and without the use of gradients.
This setting allows us to ask a first question: “What is the best method to optimally recover cosmological parameters for an LSST full-field weak lensing analysis with the minimum number of forward model evaluations?” There, our results suggest that gradient-free SBI methods are the most effective for this particular problem, and we develop some insights explaining why.
Accurately describing the relation between the dark matter over-density and the observable galaxy field is one of the significant challenges to analyzing cosmic structures with next-generation galaxy surveys. Current galaxy bias models are either inaccurate or computationally too expensive to be used for efficient inference of small-scale information.
In this talk, I will present a hybrid machine learning approach called the Neural Physical Engine (NPE) that addresses this problem. The network architecture, first developed and tested by Charnock et al. (2020), exploits physical information of the galaxy bias problem and is suitable for zero-shot learning within field-level inference approaches.
Furthermore, the model can efficiently generate mock halo catalogues on the scales of wide-field surveys such as Euclid. Finally, I will also show that those generated mocks are consistent with full phase-space halo finders, including the 2-point correlation function.
Whether it's calibrating our analytical predictions on small scales, or devising all new probes beyond standard two-point functions, the road to precision cosmology is paved with numerical simulations. The breadth of the parameter space we must simulate, and the associated computational cost, however, present a serious challenge. Fortunately, emulators based on Gaussian processes and neural networks provide a way forward, allowing for numerical models to be constructed through training machine learning algorithms on a tractable number of mocks. In this talk, I will present cosmological constraints derived from new statistics made possible by a simulation-based emulator model, and argue that this new approach presents a practical and environmentally-conscious path towards accurate cosmological inference.
We created an ML pipeline able to efficiently detect craters in a large dataset of georeferenced images. We used it to create a detailed database of craters on rocky bodies in the solar system including Mars. The Mars crater database was of sufficient detail to enable us to determine the likely origin of a number of meteorites that we have collected on Earth. As a consequence, it is possible to get a better picture of the early formation processes of Mars using a sample from Mars, before the first sample-return mission has been organized. In this presentation, we will see how we have structured our pipeline and the technologies used to produce that data product.
Photometric redshifts and strong lensing are both integral to stellar physics and cosmological studies with the Rubin Observatory Legacy Survey of Space and Time (LSST), which will provide billions of galaxy images in six filters, including on the order of 100,000 galaxy-scale lenses. To efficiently exploit this huge amount of data, machine learning is a promising technique that leads to an extreme reduction of the computational time per object.
Since accurate redshifts are a necessity for nearly any astrophysical study, precise and efficient techniques to predict photometric redshifts are crucial to allow for the full exploitation of the LSST data. To this end, I will highlight in the first part of my talk the novel ability of using convolutional neural networks (CNNs) to estimate the photometric redshifts of galaxies. Since the image quality from LSST is expected to be very similar to that of the Hyper Suprime-Cam (HSC), and training a network on realistic data is crucial to achieve a good performance on real data, the network is trained on real HSC cutouts in five different filters. The good performance will be highlighted with a detailed comparison to the Direct Empirical Photometric (DEmP) method, a hybrid technique with one of the best performances on HSC images.
To address further challenges in efficiently analyzing the huge amount of data provided by LSST, I will present in the second part of my talk some recent machine learning techniques developed within the HOLISMOKES collaboration, which focus on the exploitation of strongly lensed supernovae (SNe). These very rare events offer promising avenues to probe stellar physics and cosmology. For instance, the time-delays between the multiple images of a lensed SN allow for a direct measurement of the Hubble constant (H0) independently from other probes. This allows one to assess the current tension on the H0 value, and the possible need for new physics. Furthermore, these lensed SNe also help constrain the SN progenitor scenarios by facilitating follow-up observations in the first hours after the explosion. In particular, I will summarize our deep learning methods to search for lensed SNe in current and future wide-field time-domain surveys, and focus on our new achievements in the automation of strong-lens modeling with a residual neural network. To train, validate, and test these networks, we mock up images based on real observed galaxies from HSC and the Hubble Ultra Deep Field. These networks are further tested on known real systems to estimate the true performance on real data.
For all the networks, the main advantage is that they can be applied easily and in a fully automated way to millions of galaxies with a huge gain in speed. Both regression networks are able to estimate the parameter values in fractions of a second on a single CPU, while lens modeling with traditional techniques typically takes weeks. With these networks, we will be able to efficiently process the huge number of detections expected from LSST in the near future.
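For context, the route from lensed-SN time delays to H0 mentioned above can be summarised with the standard time-delay relation. This is textbook lensing theory rather than a result specific to this talk, and the symbols ($\theta_i$ for the image positions, $\beta$ for the source position, $\psi$ for the lens potential, $z_d$ for the lens redshift, and $D_d$, $D_s$, $D_{ds}$ for the angular diameter distances to the lens, to the source, and between them) are introduced here only for illustration:

$$\Delta t_{ij} = \frac{D_{\Delta t}}{c}\left[\phi(\theta_i,\beta) - \phi(\theta_j,\beta)\right], \qquad \phi(\theta,\beta) = \frac{(\theta-\beta)^2}{2} - \psi(\theta), \qquad D_{\Delta t} = (1+z_d)\,\frac{D_d D_s}{D_{ds}} \propto \frac{1}{H_0}$$

A measured delay $\Delta t_{ij}$, combined with a lens mass model that fixes the Fermat potential difference, therefore yields $D_{\Delta t}$ and hence H0, which is why fast automated lens modeling is relevant to this measurement.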
In this review talk, I will show how artificial intelligence can bring tangible benefits to cosmological analysis of large-scale structure.
I will focus on how AI can be used in the framework of Simulation-Based Inference to achieve scientific objectives that would not be attainable with classical 2-pt function analyses. I will show three avenues where, in my opinion, AI can bring the most benefits: reaching the information floor of limited survey data via SBI analysis, accelerating simulations for SBI, and breaking degeneracies between cosmological probes and astrophysical nuisance fields. I will discuss new challenges that come with AI-based analyses. Finally, I will present an outlook on exciting future applications of AI analysis of LSS.
A fundamental task of data analysis in many scientific fields is to determine the underlying causal relations between physical properties as well as the quantitative nature of these relations/laws. These laws are the fundamental building blocks of scientific models describing observable phenomena. Historically, causal methods were applied in the field of social sciences and economics (Pearl, 2000), where causal relations were investigated by means of interventions (manipulating and varying features of systems to see how systems react). However, since we can observe one single world and one single Universe we cannot use interventions for recovering causal models describing our data in disciplines such as astrophysics or climate sciences. It is therefore necessary to discover causal relations by analyzing statistical properties of purely observational data, a task known as causal discovery or causal structure learning.
In S. Di Gioia et al., 2023 (in preparation), in collaboration with R. Trotta, V. Acquaviva, F. Bucinca and A. Maller, we perform causal model discovery on simulated galaxy data, to better understand which galaxy and halo properties are the drivers of galaxy size, initially at redshift z = 0. In particular, we used a constraint-based structure learning algorithm, called kernel-PC, based on a Python parallel code developed by the author, and, as input data, the simulated galaxy catalog generated with the Santa Cruz semi-analytic model (SC-SAM). The SC-SAM was built on the merger trees extracted from the dark matter-only version of the TNG-100 hydro-dynamical simulation, and has been shown to successfully describe the full spectrum of observed galaxy properties, from z=0 to z=3-4 (Gabrielpillai et al., 2022).
In my talk I will present the main results of this work, together with an overview of the most common algorithms to perform causal discovery, in the framework of Causal Graphical Models, focusing on their potential applicability to upcoming astronomical surveys. Future applications of this method include dimensionality reduction and Bayesian model discovery.
The interstellar medium (ISM) is an important actor in the evolution of galaxies and provides key diagnostics of their activity, masses and evolutionary state. However, surveys of the atomic and molecular gas, both in the Milky Way and in external galaxies, produce huge position-position-velocity data cubes over wide fields of view with varying signal-to-noise ratios. In addition, inferring the physical conditions of the ISM from these data requires complex and often slow astrophysical codes.
The overall challenge is to reduce the amount of human supervision required to analyze and interpret these data. I will describe two applications of deep learning to tackle this challenge.
1/ I will first introduce a self-supervised denoising method adapted to molecular line data cubes (Einig et al. 2023). The proposed autoencoder architecture compensates for the lack of redundancy between channels in line data cubes compared to hyperspectral Earth remote sensing data. When applied to a typical data cube of about 10^7 voxels, this method recovers the low-SNR emission without affecting the signals with high SNR. The proposed method surpasses current state-of-the-art denoising tools, such as ROHSA and GaussPY+, which are based on multiple Gaussian fitting of line profiles.
2/ Numerical simulations are usually too slow to be used in a Bayesian inference framework, which requires numerous model evaluations. Here, I will present a supervised method to derive fast and light neural-network-based emulations of a model from a grid of precomputed outputs (Palud et al. 2023). This emulator is compared with four standard classes of interpolation methods used to emulate the Meudon PDR code, a characteristic ISM numerical model. The proposed strategies yield networks that outperform all interpolation methods in terms of accuracy on outputs that have not been used during training. Moreover, these networks are 1,000 times faster than accurate interpolation methods, and require at most 10 times less memory. This paves the way to efficient inferences using wide-field multi-line observations of the ISM. The proposed strategies can easily be adapted to other astrophysical models.
References:
Einig et al. 2023, A&A, in press
Palud et al. 2023, subm. to A&A
Strong gravitational lensing has become one of the most important tools for investigating the nature of dark matter (DM). This is because it can be used to detect dark matter subhaloes in the environments of galaxies. The existence of a large number of these subhaloes is a key prediction of the most popular DM model, cold dark matter (CDM). With a technique called gravitational imaging, the number and mass of these subhaloes can be measured in strong lenses, constraining the underlying DM model.
Gravitational imaging however is an expensive method. This is mostly due to the final stage of the analysis: so-called sensitivity mapping. Here, the observation is analysed to find the smallest detectable subhalo in each pixel. This information can be used to turn a set of subhalo detections and non-detections into an inference on the dark matter model. We have previously introduced a machine learning technique that uses a set of large convolutional neural networks (CNNs) to replace the expensive sensitivity mapping stage [1]. We exploited this new technique to test the sensitivity of Euclid strong lenses to dark matter subhaloes. Analysing 16,000 simulated Euclid strong lens observations we found that subhaloes with mass larger than $M>10^{8.8\pm0.2}M_\odot$ could be detected at $3\sigma$ in that data, and that the entire survey should yield $\sim2500$ new detections.
In the current work, we take our method much further to understand a crucial systematic uncertainty in subhalo detection: the angular structure of the lens mass model. We train an ensemble of CNNs to detect subhaloes in highly realistic HST images. The models use an increasing amount of angular complexity in the lensing galaxy mass model, parametrised as an elliptical power-law plus multipole perturbations up to order 4, and external shear. Multipole perturbations allow for boxy/discy structure in the lens galaxy. This is commonly found in the light profiles of elliptical galaxies but is almost always missing from the mass profile in strong lensing studies.
We find that multipole perturbations up to 1 per cent are large enough to cause false positive subhalo detections at a rate of 20 per cent, with order 3 perturbations having the strongest effect. We find that the area in an observation where a subhalo can be detected drops by a factor of 10 when multipoles up to an amplitude of 3 per cent are allowed in the mass model. However, the mass of the smallest subhalo that can be detected does not change, with a detection limit of $M>10^{8.2}M_\odot$ found at $5\sigma$ regardless of model choice. Assuming CDM, we find that HST observations modelled without multipoles should yield a detectable subhalo in 4.8 per cent of cases. This drops to 0.47 per cent when the lenses are modelled with multipoles up to 3 per cent amplitude. The loss of expected detections is due to the effect of the previously detectable objects being consistent with multipoles of that strength. To remain reliable, strong lensing analyses for dark matter subhaloes must therefore include angular complexity beyond the elliptical power-law.
[1] O'Riordan C. M., Despali, G., Vegetti, S., Moliné, Á., Lovell, M., MNRAS 521, 2342 (2023)
Convolutional neural networks (CNNs) are now the standard tool for finding strong gravitational lenses in imaging surveys. Upcoming surveys like Euclid will rely completely on CNNs for strong lens finding but the selection function of these CNNs has not yet been studied. This is representative of the large gap in the literature in the field of machine learning applied to astronomy. Biases in CNN lens finders have the potential to influence the next generation of strong lens science unless properly accounted for. In our work we have quantified, for the first time, this selection function. We also explore the implications of these biases for various strong lens science goals.
We find that CNNs with similar architecture and training data as is commonly found in the lens finding literature are biased classifiers. We use three training datasets, representative of those used to train galaxy-galaxy and galaxy-quasar lens finding neural networks. The networks preferentially select systems with larger Einstein radii, as in this case the source and lens light is most easily disentangled. Similarly, the CNNs prefer large sources with more concentrated source-light distributions, as they are more distinct from the extended lens light.
The model trained to find lensed quasars shows a stronger preference for higher lens ellipticities than those trained to find lensed galaxies. The selection function is independent of the slope of the power-law of the mass profiles, hence measurements of this quantity will be unaffected. We find that the lens finder selection function reinforces the lensing cross-section. In general, we expect our findings to be a universal result for all galaxy-galaxy and galaxy-quasar lens finding neural networks.
Based on work in Herle A., O’Riordan C. M., Vegetti, S., arXiv:2307.10355, submitted to MNRAS.
The Gaia Collaboration's 3rd data release (DR3) provides comprehensive information including photometry and kinematics on more than a billion stars across the entire sky up to $G\approx21$, encompassing approximately 220 million stars with supplementary low-resolution spectra ($G<17.6$). These spectra offer valuable derived stellar properties such as [Fe/H], $\log g$, and $T_{eff}$, serving as proxies to identify and characterize significant stellar structures, such as stellar streams formed from past minor galaxy mergers with the Milky Way.
In pursuit of constraining the chemo-dynamical history of the Galaxy with data-driven algorithms, we propose a novel self-supervised approach implementing masked stellar modelling (MSM) exploiting multiple spectroscopic and photometric surveys to extend beyond the limitations of DR3’s low-resolution spectra. We incorporate diverse imaging surveys that span ultraviolet to near-infrared wavelengths across the celestial sphere. The MSM employs a powerful encoder to generate informative embeddings, containing crucial information for downstream tasks, facilitated by an extensive training sample. By leveraging these embeddings, similarity searches on the complete database of embeddings can be conducted instantly. Moreover, spectroscopic surveys often exhibit inconsistencies due to varying assumptions in their respective derivations of stellar characteristics. The MSM method offers the ability to fine-tune any survey on specific stellar astrophysics tasks with much fewer labels, and thanks to its extensive training set, is more robust to misrepresentativity. The stellar embeddings result in a self-consistent dataset, effectively establishing a comprehensive stellar model.
Overall, this research showcases an innovative data-driven approach to utilize various surveys and spectral products, empowering researchers to make significant strides in understanding the Milky Way's history and dynamics. The methodology's effectiveness in regression tasks and its scalability will be highlighted, shedding light on its broader applicability.
We present a novel methodology for hybrid simulation-based inference (HySBI) for large scale structure analysis in cosmology. Our approach combines perturbative analysis on the large scales which can be modeled analytically from first principles, with simulation based implicit inference (SBI) on small, non-linear scales that cannot be modeled analytically. As a proof-of-principle, we apply our method to dark matter density fields to constrain cosmology parameters using power spectrum on the large scales, and power spectrum and wavelet coefficients on small scales. We highlight how this hybrid approach can mitigate the computational challenges in applying SBI to the future cosmological surveys, and discuss the roadmap to extend this approach for analyzing survey data.
The main goal of cosmology is to perform parameter inference and model selection from astronomical observations. Uniquely, however, it is a field that must do this with a single experiment: the Universe we live in. With compelling existing and upcoming cosmological surveys, we need to leverage state-of-the-art inference techniques to extract as much information as possible from our data.
In this talk, I will begin by presenting Machine Learning based methods to perform inference in cosmology, such as simulation-based inference and stochastic control sampling approaches. I will show how we can use Machine Learning to perform parameter inference of multimodal posterior distributions on high-dimensional spaces. I will finish by showing how these methods are being used to improve our knowledge of the Universe, by presenting the results from the SimBIG analysis on simulation-based inference from large-scale structure data.
In this talk I will present the first cosmological constraints from only the observed photometry of galaxies. Villaescusa-Navarro et al. (2022) recently demonstrated that the internal physical properties of a single galaxy contain a significant amount of cosmological information. These physical properties, however, cannot be directly measured from observations. I will present how we can go beyond theoretical demonstrations to infer cosmological constraints from actual galaxy observables (e.g. optical photometry) using neural density estimation and the CAMELS suite of hydrodynamical simulations. We find that the cosmological information in the photometry of a single galaxy is limited. However, we can combine the constraining power of photometry from many galaxies using hierarchical population inference and place significant cosmological constraints. With the observed photometry of $\sim$15,000 NASA-Sloan Atlas galaxies, we constrain $\Omega_m = 0.310^{+0.080}_{-0.098}$ and $\sigma_8 = 0.792^{+0.099}_{-0.090}$.
We present cosmological constraints from the Subaru Hyper Suprime-Cam (HSC) first-year weak lensing shear catalogue using convolutional neural networks (CNNs) and conventional summary statistics. We crop 19 $3\times3$deg$^2$ sub-fields from the first-year area, divide the galaxies with redshift $0.3< z< 1.5$ into four equally-spaced redshift bins, and perform tomographic analyses. We develop a pipeline to generate simulated convergence maps from cosmological $N$-body simulations, where we account for effects such as intrinsic alignments (IAs), baryons, photometric redshift errors, and point spread function errors, to match characteristics of the real catalogue. We train CNNs that can predict the underlying parameters from the simulated maps, and we use them to construct likelihood functions for Bayesian analyses. In the $\Lambda$ cold dark matter model with two free cosmological parameters $\Omega$ and $\sigma_8$, we find $\Omega=0.278_{-0.035}^{+0.037}$, $S_8\equiv(\Omega/0.3)^{0.5}\sigma_8=0.793_{-0.018}^{+0.017}$, and the IA amplitude $A_\mathrm{IA}=0.20_{-0.58}^{+0.55}$. In a model with four additional free baryonic parameters, we find $\Omega=0.268_{-0.036}^{+0.040}$, $S_8=0.819_{-0.024}^{+0.034}$, and $A_\mathrm{IA}=-0.16_{-0.58}^{+0.59}$, with the baryonic parameters not being well-constrained. We also find that statistical uncertainties of the parameters by the CNNs are smaller than those from the power spectrum (5-24 percent smaller for $S_8$ and a factor of 2.5-3.0 smaller for $\Omega$), showing the effectiveness of CNNs for uncovering additional cosmological information from the HSC data. With baryons, the $S_8$ discrepancy between HSC first-year data and Planck 2018 is reduced from $\sim2.2\,\sigma$ to $0.3-0.5\,\sigma$.
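As a small worked illustration of the $S_8\equiv(\Omega/0.3)^{0.5}\sigma_8$ definition quoted above, the sketch below converts between $S_8$ and $\sigma_8$ using the central values of the two-parameter fit. The function names are hypothetical, and only central values are used: the conversion ignores the posterior correlation between $\Omega$ and $\sigma_8$, so it does not reproduce the quoted uncertainties.

```python
def s8_from_sigma8(omega: float, sigma_8: float, alpha: float = 0.5) -> float:
    """S_8 = (Omega / 0.3)^alpha * sigma_8, with alpha = 0.5 as in the abstract."""
    return (omega / 0.3) ** alpha * sigma_8


def sigma8_from_s8(omega: float, s8: float, alpha: float = 0.5) -> float:
    """Invert the S_8 definition to recover sigma_8 for a given Omega."""
    return s8 / (omega / 0.3) ** alpha


# Central values quoted for the two-parameter Lambda-CDM fit.
omega_central = 0.278
s8_central = 0.793

# Implied central sigma_8 (point estimate only, no error propagation).
sigma8_central = sigma8_from_s8(omega_central, s8_central)
print(f"implied sigma_8 = {sigma8_central:.3f}")
```

Because $\Omega$ is below 0.3 here, the implied $\sigma_8$ is slightly larger than $S_8$.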
In this era of large and complex astronomical survey data, interpreting, validating, and comparing inference techniques becomes increasingly difficult. This is particularly critical for emerging inference methods like Simulation-Based Inference (SBI), which offer significant speedup potential and posterior modeling flexibility, especially when deep learning is incorporated. We present a study to assess and compare the performance and uncertainty prediction capability of Bayesian inference algorithms – from traditional MCMC sampling of analytic functions to deep learning-enabled SBI. We focus on testing the capacity of hierarchical inference modeling in those scenarios. Before we extend this study to cosmology, we first use astrophysical simulation data to ensure interpretability. We demonstrate a probabilistic programming implementation of hierarchical and non-hierarchical Bayesian inference using simulations derived from the DeepBench software library, a benchmarking tool developed by our group that generates simple and controllable astrophysical objects from first principles. This study will enable astronomers and physicists to harness the inference potential of these methods with confidence.
Galaxies exhibit a wide variety of morphologies which are strongly related to their star formation histories and formation channels. Having large samples of morphologically classified galaxies is fundamental to understanding their evolution. In this talk, I will review my research related to the application of deep learning algorithms for morphological classification of galaxies. This technique is extremely successful and has resulted in the release of morphological catalogues for important surveys such as SDSS, MaNGA or Dark Energy Survey. I will describe the methodology, based on supervised learning and convolutional neural networks (CNN). The main disadvantage of such an approach is the need for large labelled training samples, which we overcome by applying transfer learning or by ‘emulating’ the faint galaxy population. I will also show current challenges for the classification of galaxy images with CNNs, such as the detection, classification and segmentation of low surface brightness features, which will be of great relevance for surveys such as HSC-SSP or ARRAKIHS, and our current plans for addressing them properly.
Visual inspections of the first optical rest-frame images from JWST have indicated a surprisingly high fraction of disk galaxies at high redshifts. Here, we alternatively apply self-supervised machine learning to explore the morphological diversity at $z \geq 3$.
Our proposed data-driven representation scheme of galaxy morphologies, calibrated on mock images from the TNG50 simulation, is shown to be robust to noise and to correlate well with physical properties of the simulated galaxies, including their 3D structure. We apply the method simultaneously to F200W and F356W galaxy images of a mass-complete sample ($M_*/M_\odot>10^9$) at $ 3 \leq z \leq 6$ from the first JWST/NIRCam CEERS data release. We find that the simulated and observed galaxies do not exactly populate the same manifold in the representation space from contrastive learning. We also find that half the galaxies classified as disks (either CNN-based or visually) populate a similar region of the representation space as TNG50 galaxies with low stellar specific angular momentum and non-oblate structure.
Although our data-driven study does not allow us to firmly conclude on the true nature of these galaxies, it suggests that the disk fraction at $z \geq 3$ remains uncertain and possibly overestimated by traditional supervised classifications.
Likelihood-free inference provides a rigorous way to perform Bayesian analysis using forward simulations only. It allows us to account for complex physical processes and observational effects in forward simulations. In this work, we use Density-Estimation Likelihood-Free Inference (DELFI) to perform likelihood-free forward modelling for Bayesian cosmological inference, which uses the redshift evolution of the cluster abundance together with weak-lensing mass calibration. The analysis framework developed in this study will be powerful for cosmological inference in relation to ongoing cluster cosmology programs, such as the XMM-XXL survey and the eROSITA all-sky survey, combined with wide-field weak-lensing surveys.
In this talk, I will first present the convergent solutions for the posterior distribution which employ the synthetic cluster catalogue generated from our forward model, and then I will show some preliminary results by applying this method to the HSC data.
Weak-lensing galaxy cluster masses are an important observable for testing the cosmological standard model and modified gravity models. However, cluster cosmology in optical surveys is challenged by sources of systematics such as photometric redshift errors. We use combinatorial optimization schemes and fast machine-learning-assisted model evaluation to select galaxy source samples that minimize the expected systematic error budget while maintaining sufficient signal-to-noise in the measurement to meet the stringent science requirements of surveys like LSST.
The ΛCDM cosmological model has been very successful, but cosmological data indicate that extensions are still highly motivated. Past explorations of extensions have largely been restricted to adding a small number of parameters to models of fixed mathematical form. Neural networks can account for more flexible model extensions and can capture unknown physics at the level of differential equation models. I will present evidence that it is possible to learn missing physics in this way at the level of linear cosmological perturbation theory as well as quantify uncertainty on these neural network predictions. This is accomplished through Bolt, the first differentiable Boltzmann solver code - the gradients provided by Bolt allow for efficient inference of neural network and cosmological parameters. Time permitting, I will also present other aspects of Bolt, such as the use of iterative methods of solution, choice of automatic differentiation algorithm, and stiff ODE solver performance.
Simulations have revealed correlations between the properties of dark matter halos and their environment, made visible by the galaxies which inherit these connections through their host halos. We define a measure of the environment based on the location and observable properties of a galaxy’s nearest neighbors in order to capture the broad information content available in the environment. We then use a neural network to learn the connection between the multi-dimensional space defined by the observable properties of galaxies and the properties of their host halos using mock galaxy-catalogs from UNIVERSEMACHINE. The trained networks will: 1) reveal new connections between galaxy, halo, and environment; 2) serve as a powerful tool for placing galaxies into halos in future cosmological simulations; and 3) be a framework for inferring the properties of real halos from next-generation survey data, allowing for direct comparison between observational statistics and theory. We will first show the results of estimating the masses of halos and sub-halos. This will be followed by preliminary results on halo properties beyond mass, including satellite membership and concentration.
Machine learning is becoming an essential component of the science operations processing pipelines of modern astronomical surveys. Space missions such as NASA’s Transiting Exoplanet Survey Satellite (TESS) are observing millions of stars each month. In order to select the relevant targets for our science cases from these large numbers of observations, we need highly automated and efficient classification methods. Only afterwards can more detailed astrophysical studies be done to derive the physical parameters of the selected stars. Given the increasing data volumes, machine learning techniques, and in particular physically interpretable machine learning models, prove to be the ideal instruments to achieve this. In this talk, I will draw from our experiences in developing the TESS Data for Asteroseismology (T’DA) machine learning classification pipeline to (i) discuss the challenges and opportunities associated with the development of such pipelines, (ii) share our insights with regard to the machine learning techniques used and identify where they could be improved, and (iii) give an outlook of how machine learning could be incorporated into the science processing pipelines of ground- and space-based surveys.
Joint observations in electromagnetic and gravitational waves shed light on the physics of objects and surrounding environments with extreme gravity that are otherwise unreachable via siloed observations in each messenger. However, such detections remain challenging due to the rapid and faint nature of counterparts. Protocols for discovery and inference still rely on human experts manually inspecting survey alert streams and intuiting optimal usage of limited follow-up resources. Strategizing an optimal follow-up program requires adaptive sequential decision-making given evolving light curve data that (i) maximizes a global objective despite incomplete information and (ii) is robust to stochasticity introduced by detectors/observing conditions. Reinforcement learning (RL) approaches allow agents to implicitly learn the physics/detector dynamics and the behavior policy that maximize a designated objective through experience.
To demonstrate the utility of such an approach for the kilonova follow-up problem, we train a toy RL agent with the goal of maximizing follow-up photometry for the true kilonova among several contaminant transient light curves. In a simulated environment where the agent learns online, it achieves 3x higher accuracy compared to a random strategy. However, it is surpassed by human agents by up to a factor of 2. This is likely because our hypothesis function (Q that is linear in state-action features) is an insufficient representation of the optimal behavior policy. More complex agents could perform on par with or surpass human experts. Agents like these could pave the way for machine-directed software infrastructure to efficiently respond to next-generation detectors, for conducting science inference and optimally planning expensive follow-up observations, scalably and with demonstrable performance guarantees.
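To make "Q that is linear in state-action features" concrete, here is a minimal sketch of an online learner with a linear action-value function. It is not the authors' agent or environment: the toy simulator, the choice of features (a bias term plus a noisy fade rate per candidate), the assumption that the true kilonova fades fastest, and all function names are hypothetical. Each episode is a single decision, so the update has no bootstrap term.

```python
import random


def q_value(weights, features):
    """Linear action value: Q(s, a) = w . phi(s, a)."""
    return sum(w * f for w, f in zip(weights, features))


def phi(fade_rates, action):
    """Hypothetical state-action features: bias + fade rate of the chosen candidate."""
    return [1.0, fade_rates[action]]


def simulate_candidates(rng, n_candidates=3):
    """Toy environment: one true kilonova (fast fader) among slower contaminants."""
    true_index = rng.randrange(n_candidates)
    rates = [rng.gauss(1.0 if i == true_index else 0.3, 0.2)
             for i in range(n_candidates)]
    return rates, true_index


def greedy_action(weights, fade_rates):
    """Pick the candidate with the highest estimated value."""
    return max(range(len(fade_rates)),
               key=lambda a: q_value(weights, phi(fade_rates, a)))


def train(n_episodes=5000, alpha=0.05, epsilon=0.2, seed=0):
    """Online epsilon-greedy learning with a gradient update on the linear Q."""
    rng = random.Random(seed)
    weights = [0.0, 0.0]
    for _ in range(n_episodes):
        rates, true_index = simulate_candidates(rng)
        if rng.random() < epsilon:
            action = rng.randrange(len(rates))       # explore
        else:
            action = greedy_action(weights, rates)   # exploit
        reward = 1.0 if action == true_index else 0.0
        features = phi(rates, action)
        # One-step episode: the target is just the reward (no bootstrap term).
        td_error = reward - q_value(weights, features)
        weights = [w + alpha * td_error * f for w, f in zip(weights, features)]
    return weights


def evaluate(weights, n_trials=2000, seed=1):
    """Fraction of trials in which the greedy policy follows up the true kilonova."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        rates, true_index = simulate_candidates(rng)
        hits += greedy_action(weights, rates) == true_index
    return hits / n_trials


weights = train()
accuracy = evaluate(weights)
print(f"learned weights: {weights}, greedy accuracy: {accuracy:.2f}")
```

In this deliberately easy toy problem a linear Q is sufficient. The abstract's point is that for realistic, evolving light curves a linear form may be too restrictive.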
When measuring photon counts from incoming sky fluxes, observatories imprint nuisance effects on the data that must be accurately removed. Some detector effects can be easily inverted, while others, such as the point spread function and shot noise, are not trivially invertible. Using information field theory and Bayes' theorem, we infer the posterior mean and uncertainty for the sky flux. This involves the use of prior knowledge encoded in a generative model and a precise and differentiable model of the instrument.
The spatial variability of the point spread functions as part of the instrument description degrades the resolution of the data as the off-axis angle increases. The approximation of the true instrument point spread function by an interpolated and patched convolution provides a fast and accurate representation as part of a numerical instrument model. By incorporating the spatial variability of the point spread function, far off-axis events can be reliably accounted for, thereby increasing the signal-to-noise ratio.
The developed reconstruction method is demonstrated on a series of Chandra X-ray observations of the Perseus galaxy cluster.
Keywords: spatially variant point spread functions; deconvolution; deblurring; X-ray imaging; information field theory; Perseus galaxy cluster; Bayesian imaging
In this talk, we explore the use of Generative Topographic Mapping (GTM) as an alternative to self-organizing maps (SOM) for deriving accurate mean redshift estimates for cosmic shear surveys. We delve into the advantages of the GTM probabilistic modeling of the complex relationships within the data, enabling robust estimation of redshifts. Through comparative analysis, we showcase the effectiveness of GTM in producing tomographic redshift estimates, thereby contributing to the advancement of cosmological studies. Our findings underscore GTM’s potential as a powerful tool for redshift inference in large-scale astronomical surveys.
During the Epoch of reionisation, the intergalactic medium is reionised by the UV radiation from the first generation of stars and galaxies. One tracer of the process is the 21 cm line of hydrogen that will be observed by the Square Kilometre Array (SKA) at low frequencies, thus imaging the distribution of ionised and neutral regions and their evolution.
To prepare for these upcoming observations, we investigate a deep learning method to predict from 21 cm maps the reionisation time field $t_\mathrm{reion}$, i.e. the time at which each location has been reionised. $t_\mathrm{reion}$ encodes the propagation of ionisation fronts in a single field and gives access to times of local reionisation or to the extent of the radiative reach of early sources. Moreover, it gives access to the time evolution of ionisation on the plane of the sky, whereas such evolution is usually probed along the line-of-sight direction.
We trained a convolutional neural network (CNN) using simulated 21 cm maps and reionisation time fields produced by the simulation code 21cmFAST. We also investigate the performance of the CNN when adding instrumental effects.
Globally, we find that without instrumental effects the 21 cm maps can be used to reconstruct the associated reionisation time field in a satisfactory manner: the quality of the reconstruction depends on the redshift at which the 21 cm observation is made, and in general small-scale features are smoothed in the reconstructed field, while larger-scale features are well recovered. When instrumental effects are included, the scale dependence of the reconstruction is even more pronounced, with significant smoothing on small and intermediate scales.
The reionisation time field can be reconstructed, at least partially, from 21 cm maps of the IGM during the Epoch of reionisation. This quantity can thus be derived in principle from observations and should then provide a means to investigate the effect of local histories of reionisation on the first structures that appear in a given region.
We present an innovative clustering method, Significance Mode Analysis (SigMA), to extract co-spatial and co-moving stellar populations from large-scale surveys such as ESA Gaia. The method studies the topological properties of the density field in the multidimensional phase space. The set of critical points in the density field gives rise to the cluster tree, a hierarchical structure in which leaves correspond to modes of the density function. Typically, however, non-parametric density estimation methods lead to an over-clustering of the input data. We propose an interpretable cluster tree pruning strategy by determining minimum energy paths between pairs of neighboring modes directly in the input space. We present a statistical hypothesis test that examines deviations from unimodality along these paths, which provides a measure of significance for each pair of clusters.
We apply SigMA to Gaia EDR3 data of the closest OB association to Earth, Scorpio-Centaurus (Sco-Cen), and find 37 co-moving clusters in Sco-Cen. These clusters are independently validated using astrophysical knowledge and, to a certain extent, by their association with massive stars too bright for Gaia, both unknown to SigMA. Our findings suggest that the OB association is more actively star-forming and dynamically richer than previously thought. This application demonstrates that SigMA allows us to take an accurate census of young populations, quantify their dynamics, and reconstruct the recent star formation history of the local Milky Way.
As upcoming SKA-scale surveys open new regimes of observation, it is expected that some of the objects they detect will be "unknown unknowns": entirely novel classes of sources which are outside of our current understanding of astrophysics. The discovery of these sources has the potential to introduce new fields of study, as it did for e.g. pulsars, but relies upon us being able to identify them within peta- or exascale data volumes. Automated anomaly detection using machine learning offers a promising method to ensure that atypical sources are not lost within the data, but these methods are typically incapable of simultaneously classifying non-anomalous sources and identifying anomalies, resulting in separate models being needed to complete both necessary tasks. In this talk, we discuss the possibility of using uncertainty metrics derived from an image classification model to provide anomaly detection within a classification pipeline.
Fanaroff-Riley (FR) galaxies, a type of radio-loud AGN, are among the sources that are expected to see a drastic increase in known population with upcoming large-scale radio surveys. They provide a useful test population for outlier detection because, in addition to two "standard" morphologies (FRI and FRII), there are numerous rare morphological subclasses which are particularly useful for the study of AGN environments and engines. Using the MiraBest dataset of Fanaroff-Riley galaxies, we trained a supervised deep learning model on binary Fanaroff-Riley sources, reserving hybrid FR galaxies to serve as a sample of "anomalous" objects that might be mistaken for in-distribution sources. Our model architecture used dropout at test time to approximate a Bayesian posterior on predictions, allowing for uncertainty in a class label to be expressed by calculating predictive entropy.
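For readers unfamiliar with the quantity, predictive entropy under test-time dropout can be sketched as follows. This is an illustrative example rather than the authors' pipeline: it assumes the softmax outputs of repeated stochastic forward passes have already been collected into an array, and the function name and array layout are hypothetical.

```python
import numpy as np

def predictive_entropy(mc_probs):
    """Entropy of the class probabilities averaged over dropout passes.

    `mc_probs` has shape (n_passes, n_sources, n_classes) and holds softmax
    outputs from forward passes with dropout left active at test time.
    """
    # Averaging over passes approximates the posterior predictive distribution.
    mean_probs = mc_probs.mean(axis=0)
    eps = 1e-12  # avoids log(0) for fully confident predictions
    return -np.sum(mean_probs * np.log(mean_probs + eps), axis=1)
```

For a binary FRI/FRII classifier the entropy ranges from 0 for a fully confident label to ln 2 when both classes are equally likely.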
Highly anomalous out-of-distribution sources were found to be located in sparse regions of latent space and hence were easily identifiable, but hybrid sources could not easily be isolated from binary FR galaxies in either latent space or by entropy value alone. Instead, we created a measure of typical local entropy by calculating the average entropy of the nearest ten training set sources to any given point in latent space; this allowed for objects with atypically high or low entropy relative to nearby sources to be identified regardless of the absolute value of their entropy.
Using a test set of both in-distribution binary FRs and "anomalous" hybrid sources, we find that the in-distribution sources show no significant departure from the training set entropy, but hybrid sources have significantly higher entropy than their surroundings in all regions of latent space except where the local entropy is itself maximal. All sources more than 3$\sigma$ from the local entropy were found to be hybrids, and flagging using this method alone detected 30% of the hybrid sample; the majority of the remaining hybrid sources were found to have near-maximal entropy, meaning that additionally flagging high-entropy sources would allow for both these and the most uncertainly-labelled in-distribution FR galaxies to be inspected while avoiding unnecessary flagging of low-uncertainty sources.
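The local-entropy comparison described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes Euclidean distances in latent space and takes $\sigma$ to be the standard deviation of the neighbours' entropies, which the abstract does not specify. All function and argument names are hypothetical.

```python
import numpy as np

def local_entropy_stats(train_latent, train_entropy, query_latent, k=10):
    """Mean and scatter of entropy over the k nearest training sources."""
    # Pairwise Euclidean distances, shape (n_query, n_train).
    diffs = query_latent[:, None, :] - train_latent[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    neighbour_entropy = train_entropy[nearest]
    return neighbour_entropy.mean(axis=1), neighbour_entropy.std(axis=1)

def flag_atypical(query_entropy, local_mean, local_std, n_sigma=3.0):
    """Flag sources whose entropy departs from the local mean by > n_sigma."""
    # Assumption: sigma is the scatter of the neighbours' entropies. A small
    # floor keeps the threshold non-zero when all neighbours agree exactly.
    scatter = np.maximum(local_std, 1e-6)
    return np.abs(query_entropy - local_mean) > n_sigma * scatter
```

Comparing against the local mean rather than a global threshold is what allows a source to be flagged regardless of the absolute value of its entropy.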
How can we gain physical intuition in real-world datasets using "black-box" machine learning? In this talk, I will discuss how ordered component analyses can be used to separate, identify, and understand physical signals in astronomical datasets. We introduce Information Ordered Bottlenecks (IOBs), a neural layer designed to adaptively compress data into latent variables organized by likelihood maximization. As a nonlinear extension of Principal Component Analysis, IOB autoencoders are designed to be truncated at any bottleneck width, controlling information flow through only the most crucial latent variables. With this architecture, we show how classical neural networks can be easily extended to dynamically order latent information, revealing learned structure in multi-signal datasets. We demonstrate how this methodology can be extended to structure and classify physical phenomena, discover low-dimensional symbolic expressions in high-dimensional data, and regularize implicit inference. Along the way, we present several astronomical applications including emulation of the CMB power spectrum, analysis of binary black hole systems, and dimensionality reduction of galaxy properties in large cosmological simulations.
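The idea of truncating an ordered latent representation at any width can be illustrated with the linear case the abstract refers to, Principal Component Analysis. The sketch below is not the IOB layer itself: it only shows a truncation mask applied to PCA latents, where the ordering comes from the singular value decomposition rather than from training. Function names are hypothetical.

```python
import numpy as np

def truncate_bottleneck(latent, width):
    """Keep the first `width` latent variables and zero out the rest."""
    mask = np.arange(latent.shape[-1]) < width
    return latent * mask

def pca_reconstruction_errors(data):
    """Mean squared reconstruction error of PCA at every truncation width."""
    centered = data - data.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    latent = centered @ vt.T
    errors = []
    for width in range(latent.shape[1] + 1):
        recon = truncate_bottleneck(latent, width) @ vt
        errors.append(float(np.mean((centered - recon) ** 2)))
    return errors
```

Because the latents are ordered by importance, the reconstruction error never increases as the bottleneck widens. An IOB aims for the same ordering property in a nonlinear network.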