Get trending papers in your email inbox once a day!
Get trending papers in your email inbox!
SubscribeAST: Audio Spectrogram Transformer
In the past decade, convolutional neural networks (CNNs) have been widely adopted as the main building block for end-to-end audio classification models, which aim to learn a direct mapping from audio spectrograms to corresponding labels. To better capture long-range global context, a recent trend is to add a self-attention mechanism on top of the CNN, forming a CNN-attention hybrid model. However, it is unclear whether the reliance on a CNN is necessary, and if neural networks purely based on attention are sufficient to obtain good performance in audio classification. In this paper, we answer the question by introducing the Audio Spectrogram Transformer (AST), the first convolution-free, purely attention-based model for audio classification. We evaluate AST on various audio classification benchmarks, where it achieves new state-of-the-art results of 0.485 mAP on AudioSet, 95.6% accuracy on ESC-50, and 98.1% accuracy on Speech Commands V2.
AST-Probe: Recovering abstract syntax trees from hidden representations of pre-trained language models
The objective of pre-trained language models is to learn contextual representations of textual data. Pre-trained language models have become mainstream in natural language processing and code modeling. Using probes, a technique to study the linguistic properties of hidden vector spaces, previous works have shown that these pre-trained language models encode simple linguistic properties in their hidden representations. However, none of the previous work assessed whether these models encode the whole grammatical structure of a programming language. In this paper, we prove the existence of a syntactic subspace, lying in the hidden representations of pre-trained language models, which contain the syntactic information of the programming language. We show that this subspace can be extracted from the models' representations and define a novel probing method, the AST-Probe, that enables recovering the whole abstract syntax tree (AST) of an input code snippet. In our experimentations, we show that this syntactic subspace exists in five state-of-the-art pre-trained language models. In addition, we highlight that the middle layers of the models are the ones that encode most of the AST information. Finally, we estimate the optimal size of this syntactic subspace and show that its dimension is substantially lower than those of the models' representation spaces. This suggests that pre-trained language models use a small part of their representation spaces to encode syntactic information of the programming languages.
Astraios: Parameter-Efficient Instruction Tuning Code Large Language Models
The high cost of full-parameter fine-tuning (FFT) of Large Language Models (LLMs) has led to a series of parameter-efficient fine-tuning (PEFT) methods. However, it remains unclear which methods provide the best cost-performance trade-off at different model scales. We introduce Astraios, a suite of 28 instruction-tuned OctoCoder models using 7 tuning methods and 4 model sizes up to 16 billion parameters. Through investigations across 5 tasks and 8 different datasets encompassing both code comprehension and code generation tasks, we find that FFT generally leads to the best downstream performance across all scales, and PEFT methods differ significantly in their efficacy based on the model scale. LoRA usually offers the most favorable trade-off between cost and performance. Further investigation into the effects of these methods on both model robustness and code security reveals that larger models tend to demonstrate reduced robustness and less security. At last, we explore the relationships among updated parameters, cross-entropy loss, and task performance. We find that the tuning effectiveness observed in small models generalizes well to larger models, and the validation loss in instruction tuning can be a reliable indicator of overall downstream performance.
AstroLLaMA: Towards Specialized Foundation Models in Astronomy
Large language models excel in many human-language tasks but often falter in highly specialized domains like scholarly astronomy. To bridge this gap, we introduce AstroLLaMA, a 7-billion-parameter model fine-tuned from LLaMA-2 using over 300,000 astronomy abstracts from arXiv. Optimized for traditional causal language modeling, AstroLLaMA achieves a 30% lower perplexity than Llama-2, showing marked domain adaptation. Our model generates more insightful and scientifically relevant text completions and embedding extraction than state-of-the-arts foundation models despite having significantly fewer parameters. AstroLLaMA serves as a robust, domain-specific model with broad fine-tuning potential. Its public release aims to spur astronomy-focused research, including automatic paper summarization and conversational agent development.
ASTRAL: Automated Safety Testing of Large Language Models
Large Language Models (LLMs) have recently gained attention due to their ability to understand and generate sophisticated human-like content. However, ensuring their safety is paramount as they might provide harmful and unsafe responses. Existing LLM testing frameworks address various safety-related concerns (e.g., drugs, terrorism, animal abuse) but often face challenges due to unbalanced and obsolete datasets. In this paper, we present ASTRAL, a tool that automates the generation and execution of test cases (i.e., prompts) for testing the safety of LLMs. First, we introduce a novel black-box coverage criterion to generate balanced and diverse unsafe test inputs across a diverse set of safety categories as well as linguistic writing characteristics (i.e., different style and persuasive writing techniques). Second, we propose an LLM-based approach that leverages Retrieval Augmented Generation (RAG), few-shot prompting strategies and web browsing to generate up-to-date test inputs. Lastly, similar to current LLM test automation techniques, we leverage LLMs as test oracles to distinguish between safe and unsafe test outputs, allowing a fully automated testing approach. We conduct an extensive evaluation on well-known LLMs, revealing the following key findings: i) GPT3.5 outperforms other LLMs when acting as the test oracle, accurately detecting unsafe responses, and even surpassing more recent LLMs (e.g., GPT-4), as well as LLMs that are specifically tailored to detect unsafe LLM outputs (e.g., LlamaGuard); ii) the results confirm that our approach can uncover nearly twice as many unsafe LLM behaviors with the same number of test inputs compared to currently used static datasets; and iii) our black-box coverage criterion combined with web browsing can effectively guide the LLM on generating up-to-date unsafe test inputs, significantly increasing the number of unsafe LLM behaviors.
Astute RAG: Overcoming Imperfect Retrieval Augmentation and Knowledge Conflicts for Large Language Models
Retrieval-Augmented Generation (RAG), while effective in integrating external knowledge to address the limitations of large language models (LLMs), can be undermined by imperfect retrieval, which may introduce irrelevant, misleading, or even malicious information. Despite its importance, previous studies have rarely explored the behavior of RAG through joint analysis on how errors from imperfect retrieval attribute and propagate, and how potential conflicts arise between the LLMs' internal knowledge and external sources. We find that imperfect retrieval augmentation might be inevitable and quite harmful, through controlled analysis under realistic conditions. We identify the knowledge conflicts between LLM-internal and external knowledge from retrieval as a bottleneck to overcome in the post-retrieval stage of RAG. To render LLMs resilient to imperfect retrieval, we propose Astute RAG, a novel RAG approach that adaptively elicits essential information from LLMs' internal knowledge, iteratively consolidates internal and external knowledge with source-awareness, and finalizes the answer according to information reliability. Our experiments using Gemini and Claude demonstrate that Astute RAG significantly outperforms previous robustness-enhanced RAG methods. Notably, Astute RAG is the only approach that matches or exceeds the performance of LLMs without RAG under worst-case scenarios. Further analysis reveals that Astute RAG effectively resolves knowledge conflicts, improving the reliability and trustworthiness of RAG systems.
AstroMLab 3: Achieving GPT-4o Level Performance in Astronomy with a Specialized 8B-Parameter Large Language Model
AstroSage-Llama-3.1-8B is a domain-specialized natural-language AI assistant tailored for research in astronomy, astrophysics, and cosmology. Trained on the complete collection of astronomy-related arXiv papers from 2007-2024 along with millions of synthetically-generated question-answer pairs and other astronomical literature, AstroSage-Llama-3.1-8B demonstrates remarkable proficiency on a wide range of questions. AstroSage-Llama-3.1-8B scores 80.9% on the AstroMLab-1 benchmark, greatly outperforming all models -- proprietary and open-weight -- in the 8-billion parameter class, and performing on par with GPT-4o. This achievement demonstrates the potential of domain specialization in AI, suggesting that focused training can yield capabilities exceeding those of much larger, general-purpose models. AstroSage-Llama-3.1-8B is freely available, enabling widespread access to advanced AI capabilities for astronomical education and research.
AstroLLaMA-Chat: Scaling AstroLLaMA with Conversational and Diverse Datasets
We explore the potential of enhancing LLM performance in astronomy-focused question-answering through targeted, continual pre-training. By employing a compact 7B-parameter LLaMA-2 model and focusing exclusively on a curated set of astronomy corpora -- comprising abstracts, introductions, and conclusions -- we achieve notable improvements in specialized topic comprehension. While general LLMs like GPT-4 excel in broader question-answering scenarios due to superior reasoning capabilities, our findings suggest that continual pre-training with limited resources can still enhance model performance on specialized topics. Additionally, we present an extension of AstroLLaMA: the fine-tuning of the 7B LLaMA model on a domain-specific conversational dataset, culminating in the release of the chat-enabled AstroLLaMA for community use. Comprehensive quantitative benchmarking is currently in progress and will be detailed in an upcoming full paper. The model, AstroLLaMA-Chat, is now available at https://huggingface.co/universeTBD, providing the first open-source conversational AI tool tailored for the astronomy community.
Astroformer: More Data Might not be all you need for Classification
Recent advancements in areas such as natural language processing and computer vision rely on intricate and massive models that have been trained using vast amounts of unlabelled or partly labeled data and training or deploying these state-of-the-art methods to resource constraint environments has been a challenge. Galaxy morphologies are crucial to understanding the processes by which galaxies form and evolve. Efficient methods to classify galaxy morphologies are required to extract physical information from modern-day astronomy surveys. In this paper, we introduce Astroformer, a method to learn from less amount of data. We propose using a hybrid transformer-convolutional architecture drawing much inspiration from the success of CoAtNet and MaxViT. Concretely, we use the transformer-convolutional hybrid with a new stack design for the network, a different way of creating a relative self-attention layer, and pair it with a careful selection of data augmentation and regularization techniques. Our approach sets a new state-of-the-art on predicting galaxy morphologies from images on the Galaxy10 DECals dataset, a science objective, which consists of 17736 labeled images achieving 94.86% top-1 accuracy, beating the current state-of-the-art for this task by 4.62%. Furthermore, this approach also sets a new state-of-the-art on CIFAR-100 and Tiny ImageNet. We also find that models and training methods used for larger datasets would often not work very well in the low-data regime.
AstroVision: Towards Autonomous Feature Detection and Description for Missions to Small Bodies Using Deep Learning
Missions to small celestial bodies rely heavily on optical feature tracking for characterization of and relative navigation around the target body. While deep learning has led to great advancements in feature detection and description, training and validating data-driven models for space applications is challenging due to the limited availability of large-scale, annotated datasets. This paper introduces AstroVision, a large-scale dataset comprised of 115,970 densely annotated, real images of 16 different small bodies captured during past and ongoing missions. We leverage AstroVision to develop a set of standardized benchmarks and conduct an exhaustive evaluation of both handcrafted and data-driven feature detection and description methods. Next, we employ AstroVision for end-to-end training of a state-of-the-art, deep feature detection and description network and demonstrate improved performance on multiple benchmarks. The full benchmarking pipeline and the dataset will be made publicly available to facilitate the advancement of computer vision algorithms for space applications.
CNN Features off-the-shelf: an Astounding Baseline for Recognition
Recent results indicate that the generic descriptors extracted from the convolutional neural networks are very powerful. This paper adds to the mounting evidence that this is indeed the case. We report on a series of experiments conducted for different recognition tasks using the publicly available code and model of the \overfeat network which was trained to perform object classification on ILSVRC13. We use features extracted from the \overfeat network as a generic image representation to tackle the diverse range of recognition tasks of object image classification, scene recognition, fine grained recognition, attribute detection and image retrieval applied to a diverse set of datasets. We selected these tasks and datasets as they gradually move further away from the original task and data the \overfeat network was trained to solve. Astonishingly, we report consistent superior results compared to the highly tuned state-of-the-art systems in all the visual classification tasks on various datasets. For instance retrieval it consistently outperforms low memory footprint methods except for sculptures dataset. The results are achieved using a linear SVM classifier (or L2 distance in case of retrieval) applied to a feature representation of size 4096 extracted from a layer in the net. The representations are further modified using simple augmentation techniques e.g. jittering. The results strongly suggest that features obtained from deep learning with convolutional nets should be the primary candidate in most visual recognition tasks.
Cl+ and HCl+ in Reaction with H2 and Isotopologues: A Glance into H Abstraction and Indirect Exchange at Astrophysical Conditions
Astrochemical models of interstellar clouds, the sites of stars, and planet formation require information about spin-state chemistry to allow quantitative comparison with spectroscopic observations. In particular, it is important to know if full scrambling or H abstraction (also known as proton hopping) takes place in ion-neutral reactions. The reaction of Cl+ and HCl+ with H2 and isotopologues has been studied at cryogenic temperatures between 20 and 180 K using a 22 pole radio frequency ion trap. Isotopic exchange processes are used to probe the reaction mechanism of the HCl+ + H2 reaction. The results are compared with previous measurements and theoretical predictions. The rate coefficients for the Cl+ + H2 and HCl+ + H2 reactions are found to be constant in the range of temperatures studied, except for the DCl+ + D2 reaction, where a weak negative temperature dependence is observed, and reactions with D2 are found to be significantly slower than the Langevin rate. No isotopic exchange reactions are observed to occur for the H2Cl+ ion. The analysis of the products of the HCl+ + H2 isotopic system clearly indicates that the reaction proceeds via simple hydrogen atom abstraction.
AstroLoc: Robust Space to Ground Image Localizer
Astronauts take thousands of photos of Earth per day from the International Space Station, which, once localized on Earth's surface, are used for a multitude of tasks, ranging from climate change research to disaster management. The localization process, which has been performed manually for decades, has recently been approached through image retrieval solutions: given an astronaut photo, find its most similar match among a large database of geo-tagged satellite images, in a task called Astronaut Photography Localization (APL). Yet, existing APL approaches are trained only using satellite images, without taking advantage of the millions open-source astronaut photos. In this work we present the first APL pipeline capable of leveraging astronaut photos for training. We first produce full localization information for 300,000 manually weakly labeled astronaut photos through an automated pipeline, and then use these images to train a model, called AstroLoc. AstroLoc learns a robust representation of Earth's surface features through two losses: astronaut photos paired with their matching satellite counterparts in a pairwise loss, and a second loss on clusters of satellite imagery weighted by their relevance to astronaut photography via unsupervised mining. We find that AstroLoc achieves a staggering 35% average improvement in recall@1 over previous SOTA, pushing the limits of existing datasets with a recall@100 consistently over 99%. Finally, we note that AstroLoc, without any fine-tuning, provides excellent results for related tasks like the lost-in-space satellite problem and historical space imagery localization.
Astro-HEP-BERT: A bidirectional language model for studying the meanings of concepts in astrophysics and high energy physics
I present Astro-HEP-BERT, a transformer-based language model specifically designed for generating contextualized word embeddings (CWEs) to study the meanings of concepts in astrophysics and high-energy physics. Built on a general pretrained BERT model, Astro-HEP-BERT underwent further training over three epochs using the Astro-HEP Corpus, a dataset I curated from 21.84 million paragraphs extracted from more than 600,000 scholarly articles on arXiv, all belonging to at least one of these two scientific domains. The project demonstrates both the effectiveness and feasibility of adapting a bidirectional transformer for applications in the history, philosophy, and sociology of science (HPSS). The entire training process was conducted using freely available code, pretrained weights, and text inputs, completed on a single MacBook Pro Laptop (M2/96GB). Preliminary evaluations indicate that Astro-HEP-BERT's CWEs perform comparably to domain-adapted BERT models trained from scratch on larger datasets for domain-specific word sense disambiguation and induction and related semantic change analyses. This suggests that retraining general language models for specific scientific domains can be a cost-effective and efficient strategy for HPSS researchers, enabling high performance without the need for extensive training from scratch.
AstroM$^3$: A self-supervised multimodal model for astronomy
While machine-learned models are now routinely employed to facilitate astronomical inquiry, model inputs tend to be limited to a primary data source (namely images or time series) and, in the more advanced approaches, some metadata. Yet with the growing use of wide-field, multiplexed observational resources, individual sources of interest often have a broad range of observational modes available. Here we construct an astronomical multimodal dataset and propose AstroM^3, a self-supervised pre-training approach that enables a model to learn from multiple modalities simultaneously. Specifically, we extend the CLIP (Contrastive Language-Image Pretraining) model to a trimodal setting, allowing the integration of time-series photometry data, spectra, and astrophysical metadata. In a fine-tuning supervised setting, our results demonstrate that CLIP pre-training improves classification performance for time-series photometry, where accuracy increases from 84.6% to 91.5%. Furthermore, CLIP boosts classification accuracy by up to 12.6% when the availability of labeled data is limited, showing the effectiveness of leveraging larger corpora of unlabeled data. In addition to fine-tuned classification, we can use the trained model in other downstream tasks that are not explicitly contemplated during the construction of the self-supervised model. In particular we show the efficacy of using the learned embeddings for misclassifications identification, similarity search, and anomaly detection. One surprising highlight is the "rediscovery" of Mira subtypes and two Rotational variable subclasses using manifold learning and dimension reduction algorithm. To our knowledge this is the first construction of an n>2 mode model in astronomy. Extensions to n>3 modes is naturally anticipated with this approach.
AstroMLab 2: AstroLLaMA-2-70B Model and Benchmarking Specialised LLMs for Astronomy
Continual pretraining of large language models on domain-specific data has been proposed to enhance performance on downstream tasks. In astronomy, the previous absence of astronomy-focused benchmarks has hindered objective evaluation of these specialized LLM models. Leveraging a recent initiative to curate high-quality astronomical MCQs, this study aims to quantitatively assess specialized LLMs in astronomy. We find that the previously released AstroLLaMA series, based on LLaMA-2-7B, underperforms compared to the base model. We demonstrate that this performance degradation can be partially mitigated by utilizing high-quality data for continual pretraining, such as summarized text from arXiv. Despite the observed catastrophic forgetting in smaller models, our results indicate that continual pretraining on the 70B model can yield significant improvements. However, the current supervised fine-tuning dataset still constrains the performance of instruct models. In conjunction with this study, we introduce a new set of models, AstroLLaMA-3-8B and AstroLLaMA-2-70B, building upon the previous AstroLLaMA series.
ASTER: Natural and Multi-language Unit Test Generation with LLMs
Implementing automated unit tests is an important but time-consuming activity in software development. To assist developers in this task, many techniques for automating unit test generation have been developed. However, despite this effort, usable tools exist for very few programming languages. Moreover, studies have found that automatically generated tests suffer poor readability and do not resemble developer-written tests. In this work, we present a rigorous investigation of how large language models (LLMs) can help bridge the gap. We describe a generic pipeline that incorporates static analysis to guide LLMs in generating compilable and high-coverage test cases. We illustrate how the pipeline can be applied to different programming languages, specifically Java and Python, and to complex software requiring environment mocking. We conducted an empirical study to assess the quality of the generated tests in terms of code coverage and test naturalness -- evaluating them on standard as well as enterprise Java applications and a large Python benchmark. Our results demonstrate that LLM-based test generation, when guided by static analysis, can be competitive with, and even outperform, state-of-the-art test-generation techniques in coverage achieved while also producing considerably more natural test cases that developers find easy to understand. We also present the results of a user study, conducted with 161 professional developers, that highlights the naturalness characteristics of the tests generated by our approach.
AstroMLab 1: Who Wins Astronomy Jeopardy!?
We present a comprehensive evaluation of proprietary and open-weights large language models using the first astronomy-specific benchmarking dataset. This dataset comprises 4,425 multiple-choice questions curated from the Annual Review of Astronomy and Astrophysics, covering a broad range of astrophysical topics. Our analysis examines model performance across various astronomical subfields and assesses response calibration, crucial for potential deployment in research environments. Claude-3.5-Sonnet outperforms competitors by up to 4.6 percentage points, achieving 85.0% accuracy. For proprietary models, we observed a universal reduction in cost every 3-to-12 months to achieve similar score in this particular astronomy benchmark. Open-source models have rapidly improved, with LLaMA-3-70b (80.6%) and Qwen-2-72b (77.7%) now competing with some of the best proprietary models. We identify performance variations across topics, with non-English-focused models generally struggling more in exoplanet-related fields, stellar astrophysics, and instrumentation related questions. These challenges likely stem from less abundant training data, limited historical context, and rapid recent developments in these areas. This pattern is observed across both open-weights and proprietary models, with regional dependencies evident, highlighting the impact of training data diversity on model performance in specialized scientific domains. Top-performing models demonstrate well-calibrated confidence, with correlations above 0.9 between confidence and correctness, though they tend to be slightly underconfident. The development for fast, low-cost inference of open-weights models presents new opportunities for affordable deployment in astronomy. The rapid progress observed suggests that LLM-driven research in astronomy may become feasible in the near future.
AstroPT: Scaling Large Observation Models for Astronomy
This work presents AstroPT, an autoregressive pretrained transformer developed with astronomical use-cases in mind. The AstroPT models presented here have been pretrained on 8.6 million 512 times 512 pixel grz-band galaxy postage stamp observations from the DESI Legacy Survey DR8. We train a selection of foundation models of increasing size from 1 million to 2.1 billion parameters, and find that AstroPT follows a similar saturating log-log scaling law to textual models. We also find that the models' performances on downstream tasks as measured by linear probing improves with model size up to the model parameter saturation point. We believe that collaborative community development paves the best route towards realising an open source `Large Observation Model' -- a model trained on data taken from the observational sciences at the scale seen in natural language processing. To this end, we release the source code, weights, and dataset for AstroPT under the MIT license, and invite potential collaborators to join us in collectively building and researching these models.
Information divergences to parametrize astrophysical uncertainties in dark matter direct detection
Astrophysical uncertainties in dark matter direct detection experiments are typically addressed by parametrizing the velocity distribution in terms of a few uncertain parameters that vary around some central values. Here we propose a method to optimize over all velocity distributions lying within a given distance measure from a central distribution. We discretize the dark matter velocity distribution as a superposition of streams, and use a variety of information divergences to parametrize its uncertainties. With this, we bracket the limits on the dark matter-nucleon and dark matter-electron scattering cross sections, when the true dark matter velocity distribution deviates from the commonly assumed Maxwell-Boltzmann form. The methodology pursued is general and could be applied to other physics scenarios where a given physical observable depends on a function that is uncertain.
Astrocyte-Enabled Advancements in Spiking Neural Networks for Large Language Modeling
Within the complex neuroarchitecture of the brain, astrocytes play crucial roles in development, structure, and metabolism. These cells regulate neural activity through tripartite synapses, directly impacting cognitive processes such as learning and memory. Despite the growing recognition of astrocytes' significance, traditional Spiking Neural Network (SNN) models remain predominantly neuron-centric, overlooking the profound influence of astrocytes on neural dynamics. Inspired by these biological insights, we have developed an Astrocyte-Modulated Spiking Unit (AM-SU), an innovative framework that integrates neuron-astrocyte interactions into the computational paradigm, demonstrating wide applicability across various hardware platforms. Our Astrocyte-Modulated Spiking Neural Network (AstroSNN) exhibits exceptional performance in tasks involving memory retention and natural language generation, particularly in handling long-term dependencies and complex linguistic structures. The design of AstroSNN not only enhances its biological authenticity but also introduces novel computational dynamics, enabling more effective processing of complex temporal dependencies. Furthermore, AstroSNN shows low latency, high throughput, and reduced memory usage in practical applications, making it highly suitable for resource-constrained environments. By successfully integrating astrocytic dynamics into intelligent neural networks, our work narrows the gap between biological plausibility and neural modeling, laying the groundwork for future biologically-inspired neural computing research that includes both neurons and astrocytes.
AstroCLIP: Cross-Modal Pre-Training for Astronomical Foundation Models
We present AstroCLIP, a strategy to facilitate the construction of astronomical foundation models that bridge the gap between diverse observational modalities. We demonstrate that a cross-modal contrastive learning approach between images and optical spectra of galaxies yields highly informative embeddings of both modalities. In particular, we apply our method on multi-band images and optical spectra from the Dark Energy Spectroscopic Instrument (DESI), and show that: (1) these embeddings are well-aligned between modalities and can be used for accurate cross-modal searches, and (2) these embeddings encode valuable physical information about the galaxies -- in particular redshift and stellar mass -- that can be used to achieve competitive zero- and few- shot predictions without further finetuning. Additionally, in the process of developing our approach, we also construct a novel, transformer-based model and pretraining approach for processing galaxy spectra.
Astronomaly at scale: searching for anomalies amongst 4 million galaxies
Modern astronomical surveys are producing datasets of unprecedented size and richness, increasing the potential for high-impact scientific discovery. This possibility, coupled with the challenge of exploring a large number of sources, has led to the development of novel machine-learning-based anomaly detection approaches, such as Astronomaly. For the first time, we test the scalability of Astronomaly by applying it to almost 4 million images of galaxies from the Dark Energy Camera Legacy Survey. We use a trained deep learning algorithm to learn useful representations of the images and pass these to the anomaly detection algorithm isolation forest, coupled with Astronomaly's active learning method, to discover interesting sources. We find that data selection criteria have a significant impact on the trade-off between finding rare sources such as strong lenses and introducing artefacts into the dataset. We demonstrate that active learning is required to identify the most interesting sources and reduce artefacts, while anomaly detection methods alone are insufficient. Using Astronomaly, we find 1635 anomalies among the top 2000 sources in the dataset after applying active learning, including eight strong gravitational lens candidates, 1609 galaxy merger candidates, and 18 previously unidentified sources exhibiting highly unusual morphology. Our results show that by leveraging the human-machine interface, Astronomaly is able to rapidly identify sources of scientific interest even in large datasets.
Asteroid: the PyTorch-based audio source separation toolkit for researchers
This paper describes Asteroid, the PyTorch-based audio source separation toolkit for researchers. Inspired by the most successful neural source separation systems, it provides all neural building blocks required to build such a system. To improve reproducibility, Kaldi-style recipes on common audio source separation datasets are also provided. This paper describes the software architecture of Asteroid and its most important features. By showing experimental results obtained with Asteroid's recipes, we show that our implementations are at least on par with most results reported in reference papers. The toolkit is publicly available at https://github.com/mpariente/asteroid .
CoSTA$\ast$: Cost-Sensitive Toolpath Agent for Multi-turn Image Editing
Text-to-image models like stable diffusion and DALLE-3 still struggle with multi-turn image editing. We decompose such a task as an agentic workflow (path) of tool use that addresses a sequence of subtasks by AI tools of varying costs. Conventional search algorithms require expensive exploration to find tool paths. While large language models (LLMs) possess prior knowledge of subtask planning, they may lack accurate estimations of capabilities and costs of tools to determine which to apply in each subtask. Can we combine the strengths of both LLMs and graph search to find cost-efficient tool paths? We propose a three-stage approach "CoSTA*" that leverages LLMs to create a subtask tree, which helps prune a graph of AI tools for the given task, and then conducts A* search on the small subgraph to find a tool path. To better balance the total cost and quality, CoSTA* combines both metrics of each tool on every subtask to guide the A* search. Each subtask's output is then evaluated by a vision-language model (VLM), where a failure will trigger an update of the tool's cost and quality on the subtask. Hence, the A* search can recover from failures quickly to explore other paths. Moreover, CoSTA* can automatically switch between modalities across subtasks for a better cost-quality trade-off. We build a novel benchmark of challenging multi-turn image editing, on which CoSTA* outperforms state-of-the-art image-editing models or agents in terms of both cost and quality, and performs versatile trade-offs upon user preference.
Delving into the Utilisation of ChatGPT in Scientific Publications in Astronomy
Rapid progress in the capabilities of machine learning approaches in natural language processing has culminated in the rise of large language models over the last two years. Recent works have shown unprecedented adoption of these for academic writing, especially in some fields, but their pervasiveness in astronomy has not been studied sufficiently. To remedy this, we extract words that ChatGPT uses more often than humans when generating academic text and search a total of 1 million articles for them. This way, we assess the frequency of word occurrence in published works in astronomy tracked by the NASA Astrophysics Data System since 2000. We then perform a statistical analysis of the occurrences. We identify a list of words favoured by ChatGPT and find a statistically significant increase for these words against a control group in 2024, which matches the trend in other disciplines. These results suggest a widespread adoption of these models in the writing of astronomy papers. We encourage organisations, publishers, and researchers to work together to identify ethical and pragmatic guidelines to maximise the benefits of these systems while maintaining scientific rigour.
Building astroBERT, a language model for Astronomy & Astrophysics
The existing search tools for exploring the NASA Astrophysics Data System (ADS) can be quite rich and empowering (e.g., similar and trending operators), but researchers are not yet allowed to fully leverage semantic search. For example, a query for "results from the Planck mission" should be able to distinguish between all the various meanings of Planck (person, mission, constant, institutions and more) without further clarification from the user. At ADS, we are applying modern machine learning and natural language processing techniques to our dataset of recent astronomy publications to train astroBERT, a deeply contextual language model based on research at Google. Using astroBERT, we aim to enrich the ADS dataset and improve its discoverability, and in particular we are developing our own named entity recognition tool. We present here our preliminary results and lessons learned.
HackerRank-ASTRA: Evaluating Correctness & Consistency of Large Language Models on cross-domain multi-file project problems
Evaluating the real-world applicability of large language models (LLMs) provides valuable insights for their development and use in software development tasks. Existing benchmarks often focus on standalone coding problems or specific libraries, overlooking multi-file, project-based scenarios and lacking a rigorous evaluation of consistency. The HackerRank-ASTRA Benchmark introduces project-based coding problems that mirror real-world scenarios. It evaluates model consistency through 32 runs (k = 32) and median standard deviation while incorporating taxonomy-level analysis to assess sub-skill capabilities. Initial evaluations on 65 problems show that the top three models -- o1, o1-preview, and Claude-3.5-Sonnet-1022 -- achieved comparable average scores of 75%, with no statistically significant differences in performance. Notably, Claude-3.5-Sonnet-1022 demonstrated the highest consistency across problems, with low variability (SD = 0.0497), which was statistically significant compared to other models, highlighting its reliability for real-world software development tasks.
Historical Astronomical Diagrams Decomposition in Geometric Primitives
Automatically extracting the geometric content from the hundreds of thousands of diagrams drawn in historical manuscripts would enable historians to study the diffusion of astronomical knowledge on a global scale. However, state-of-the-art vectorization methods, often designed to tackle modern data, are not adapted to the complexity and diversity of historical astronomical diagrams. Our contribution is thus twofold. First, we introduce a unique dataset of 303 astronomical diagrams from diverse traditions, ranging from the XIIth to the XVIIIth century, annotated with more than 3000 line segments, circles and arcs. Second, we develop a model that builds on DINO-DETR to enable the prediction of multiple geometric primitives. We show that it can be trained solely on synthetic data and accurately predict primitives on our challenging dataset. Our approach widely improves over the LETR baseline, which is restricted to lines, by introducing a meaningful parametrization for multiple primitives, jointly training for detection and parameter refinement, using deformable attention and training on rich synthetic data. Our dataset and code are available on our webpage.
The Tiny Time-series Transformer: Low-latency High-throughput Classification of Astronomical Transients using Deep Model Compression
A new golden age in astronomy is upon us, dominated by data. Large astronomical surveys are broadcasting unprecedented rates of information, demanding machine learning as a critical component in modern scientific pipelines to handle the deluge of data. The upcoming Legacy Survey of Space and Time (LSST) of the Vera C. Rubin Observatory will raise the big-data bar for time-domain astronomy, with an expected 10 million alerts per-night, and generating many petabytes of data over the lifetime of the survey. Fast and efficient classification algorithms that can operate in real-time, yet robustly and accurately, are needed for time-critical events where additional resources can be sought for follow-up analyses. In order to handle such data, state-of-the-art deep learning architectures coupled with tools that leverage modern hardware accelerators are essential. We showcase how the use of modern deep compression methods can achieve a 18times reduction in model size, whilst preserving classification performance. We also show that in addition to the deep compression techniques, careful choice of file formats can improve inference latency, and thereby throughput of alerts, on the order of 8times for local processing, and 5times in a live production setting. To test this in a live setting, we deploy this optimised version of the original time-series transformer, t2, into the community alert broking system of FINK on real Zwicky Transient Facility (ZTF) alert data, and compare throughput performance with other science modules that exist in FINK. The results shown herein emphasise the time-series transformer's suitability for real-time classification at LSST scale, and beyond, and introduce deep model compression as a fundamental tool for improving deploy-ability and scalable inference of deep learning models for transient classification.
pathfinder: A Semantic Framework for Literature Review and Knowledge Discovery in Astronomy
The exponential growth of astronomical literature poses significant challenges for researchers navigating and synthesizing general insights or even domain-specific knowledge. We present Pathfinder, a machine learning framework designed to enable literature review and knowledge discovery in astronomy, focusing on semantic searching with natural language instead of syntactic searches with keywords. Utilizing state-of-the-art large language models (LLMs) and a corpus of 350,000 peer-reviewed papers from the Astrophysics Data System (ADS), Pathfinder offers an innovative approach to scientific inquiry and literature exploration. Our framework couples advanced retrieval techniques with LLM-based synthesis to search astronomical literature by semantic context as a complement to currently existing methods that use keywords or citation graphs. It addresses complexities of jargon, named entities, and temporal aspects through time-based and citation-based weighting schemes. We demonstrate the tool's versatility through case studies, showcasing its application in various research scenarios. The system's performance is evaluated using custom benchmarks, including single-paper and multi-paper tasks. Beyond literature review, Pathfinder offers unique capabilities for reformatting answers in ways that are accessible to various audiences (e.g. in a different language or as simplified text), visualizing research landscapes, and tracking the impact of observatories and methodologies. This tool represents a significant advancement in applying AI to astronomical research, aiding researchers at all career stages in navigating modern astronomy literature.
A Dataset for Exploring Stellar Activity in Astrometric Measurements from SDO Images of the Sun
We present a dataset for investigating the impact of stellar activity on astrometric measurements using NASA's Solar Dynamics Observatory (SDO) images of the Sun. The sensitivity of astrometry for detecting exoplanets is limited by stellar activity (e.g. starspots), which causes the measured "center of flux" of the star to deviate from the true, geometric, center, producing false positive detections. We analyze Helioseismic and Magnetic Imager continuum image data obtained from SDO between July 2015 and December 2022 to examine this "astrometric jitter" phenomenon for the Sun. We employ data processing procedures to clean the images and compute the time series of the sunspot-induced shift between the center of flux and the geometric center. The resulting time series show quasiperiodic variations up to 0.05% of the Sun's radius at its rotation period.
Testing the Cosmological Principle: Astrometric Limits on Systemic Motion of Quasars at Different Cosmological Epochs
A sample of 60,410 bona fide optical quasars with astrometric proper motions in Gaia EDR3 and spectroscopic redshifts above 0.5 in an oval 8400 square degree area of the sky is constructed. Using orthogonal Zernike functions of polar coordinates, the proper motion fields are fitted in a weighted least-squares adjustment of the entire sample and of six equal bins of sorted redshifts. The overall fit with 37 Zernike functions reveals a statistically significant pattern, which is likely to be of instrumental origin. The main feature of this pattern is a chain of peaks and dips mostly in the R.A. component with an amplitude of 25~muas yr^{-1}. This field is subtracted from each of the six analogous fits for quasars grouped by redshifts covering the range 0.5 through 7.03, with median values 0.72, 1.00, 1.25, 1.52, 1.83, 2.34. The resulting residual patterns are noisier, with formal uncertainties up to 8~muas yr^{-1} in the central part of the area. We detect a single high-confidence Zernike term for R.A. proper motion components of quasars with redshifts around 1.52 representing a general gradient of 30 muas yr^{-1} over 150degr on the sky. We do not find any small- or medium-scale systemic variations of the residual proper motion field as functions of redshift above the 2.5,sigma significance level.
Neural Posterior Estimation for Cataloging Astronomical Images with Spatially Varying Backgrounds and Point Spread Functions
Neural posterior estimation (NPE), a type of amortized variational inference, is a computationally efficient means of constructing probabilistic catalogs of light sources from astronomical images. To date, NPE has not been used to perform inference in models with spatially varying covariates. However, ground-based astronomical images have spatially varying sky backgrounds and point spread functions (PSFs), and accounting for this variation is essential for constructing accurate catalogs of imaged light sources. In this work, we introduce a method of performing NPE with spatially varying backgrounds and PSFs. In this method, we generate synthetic catalogs and semi-synthetic images for these catalogs using randomly sampled PSF and background estimates from existing surveys. Using this data, we train a neural network, which takes an astronomical image and representations of its background and PSF as input, to output a probabilistic catalog. Our experiments with Sloan Digital Sky Survey data demonstrate the effectiveness of NPE in the presence of spatially varying backgrounds and PSFs for light source detection, star/galaxy separation, and flux measurement.
Stochastic acceleration in arbitrary astrophysical environments
Turbulent magnetic fields are to some extent a universal feature in astrophysical phenomena. Charged particles that encounter these turbulence get on average accelerated according to the so-called second-order Fermi process. However, in most astrophysical environments there are additional competing processes, such as different kinds of first-order energy changes and particle escape, that effect the resulting momentum distribution of the particles. In this work we provide to our knowledge the first semi-analytical solution of the isotropic steady-state momentum diffusion equation including continuous and catastrophic momentum changes that can be applied to any arbitrary astrophysical system of interest. Here, we adopt that the assigned magnetic turbulence is constrained on a finite range and the particle flux vanishes beyond these boundaries. Consequently, we show that the so-called pile-up bump -- that has for some special cases long been established -- is a universal feature of stochastic acceleration that emerges around the momentum chi_{rm eq} where acceleration and continuous loss are in equilibrium if the particle's residence time in the system is sufficient at chi_{rm eq}. In general, the impact of continuous and catastrophic momentum changes plays a crucial role in the shape of the steady-state momentum distribution of the accelerated particles, where simplified unbroken power-law approximations are often not adequate.
Knowledge Graph in Astronomical Research with Large Language Models: Quantifying Driving Forces in Interdisciplinary Scientific Discovery
Identifying and predicting the factors that contribute to the success of interdisciplinary research is crucial for advancing scientific discovery. However, there is a lack of methods to quantify the integration of new ideas and technological advancements in astronomical research and how these new technologies drive further scientific breakthroughs. Large language models, with their ability to extract key concepts from vast literature beyond keyword searches, provide a new tool to quantify such processes. In this study, we extracted concepts in astronomical research from 297,807 publications between 1993 and 2024 using large language models, resulting in a set of 24,939 concepts. These concepts were then used to form a knowledge graph, where the link strength between any two concepts was determined by their relevance through the citation-reference relationships. By calculating this relevance across different time periods, we quantified the impact of numerical simulations and machine learning on astronomical research. The knowledge graph demonstrates two phases of development: a phase where the technology was integrated and another where the technology was explored in scientific discovery. The knowledge graph reveals that despite machine learning has made much inroad in astronomy, there is currently a lack of new concept development at the intersection of AI and Astronomy, which may be the current bottleneck preventing machine learning from further transforming the field of astronomy.
Paying Attention to Astronomical Transients: Introducing the Time-series Transformer for Photometric Classification
Future surveys such as the Legacy Survey of Space and Time (LSST) of the Vera C. Rubin Observatory will observe an order of magnitude more astrophysical transient events than any previous survey before. With this deluge of photometric data, it will be impossible for all such events to be classified by humans alone. Recent efforts have sought to leverage machine learning methods to tackle the challenge of astronomical transient classification, with ever improving success. Transformers are a recently developed deep learning architecture, first proposed for natural language processing, that have shown a great deal of recent success. In this work we develop a new transformer architecture, which uses multi-head self attention at its core, for general multi-variate time-series data. Furthermore, the proposed time-series transformer architecture supports the inclusion of an arbitrary number of additional features, while also offering interpretability. We apply the time-series transformer to the task of photometric classification, minimising the reliance of expert domain knowledge for feature selection, while achieving results comparable to state-of-the-art photometric classification methods. We achieve a logarithmic-loss of 0.507 on imbalanced data in a representative setting using data from the Photometric LSST Astronomical Time-Series Classification Challenge (PLAsTiCC). Moreover, we achieve a micro-averaged receiver operating characteristic area under curve of 0.98 and micro-averaged precision-recall area under curve of 0.87.
The Photometric LSST Astronomical Time-series Classification Challenge (PLAsTiCC): Data set
The Photometric LSST Astronomical Time Series Classification Challenge (PLAsTiCC) is an open data challenge to classify simulated astronomical time-series data in preparation for observations from the Large Synoptic Survey Telescope (LSST), which will achieve first light in 2019 and commence its 10-year main survey in 2022. LSST will revolutionize our understanding of the changing sky, discovering and measuring millions of time-varying objects. In this challenge, we pose the question: how well can we classify objects in the sky that vary in brightness from simulated LSST time-series data, with all its challenges of non-representativity? In this note we explain the need for a data challenge to help classify such astronomical sources and describe the PLAsTiCC data set and Kaggle data challenge, noting that while the references are provided for context, they are not needed to participate in the challenge.
CosmoCLIP: Generalizing Large Vision-Language Models for Astronomical Imaging
Existing vision-text contrastive learning models enhance representation transferability and support zero-shot prediction by matching paired image and caption embeddings while pushing unrelated pairs apart. However, astronomical image-label datasets are significantly smaller compared to general image and label datasets available from the internet. We introduce CosmoCLIP, an astronomical image-text contrastive learning framework precisely fine-tuned on the pre-trained CLIP model using SpaceNet and BLIP-based captions. SpaceNet, attained via FLARE, constitutes ~13k optimally distributed images, while BLIP acts as a rich knowledge extractor. The rich semantics derived from this SpaceNet and BLIP descriptions, when learned contrastively, enable CosmoCLIP to achieve superior generalization across various in-domain and out-of-domain tasks. Our results demonstrate that CosmoCLIP is a straightforward yet powerful framework, significantly outperforming CLIP in zero-shot classification and image-text retrieval tasks.
Increasing Liquid State Machine Performance with Edge-of-Chaos Dynamics Organized by Astrocyte-modulated Plasticity
The liquid state machine (LSM) combines low training complexity and biological plausibility, which has made it an attractive machine learning framework for edge and neuromorphic computing paradigms. Originally proposed as a model of brain computation, the LSM tunes its internal weights without backpropagation of gradients, which results in lower performance compared to multi-layer neural networks. Recent findings in neuroscience suggest that astrocytes, a long-neglected non-neuronal brain cell, modulate synaptic plasticity and brain dynamics, tuning brain networks to the vicinity of the computationally optimal critical phase transition between order and chaos. Inspired by this disruptive understanding of how brain networks self-tune, we propose the neuron-astrocyte liquid state machine (NALSM) that addresses under-performance through self-organized near-critical dynamics. Similar to its biological counterpart, the astrocyte model integrates neuronal activity and provides global feedback to spike-timing-dependent plasticity (STDP), which self-organizes NALSM dynamics around a critical branching factor that is associated with the edge-of-chaos. We demonstrate that NALSM achieves state-of-the-art accuracy versus comparable LSM methods, without the need for data-specific hand-tuning. With a top accuracy of 97.61% on MNIST, 97.51% on N-MNIST, and 85.84% on Fashion-MNIST, NALSM achieved comparable performance to current fully-connected multi-layer spiking neural networks trained via backpropagation. Our findings suggest that the further development of brain-inspired machine learning methods has the potential to reach the performance of deep learning, with the added benefits of supporting robust and energy-efficient neuromorphic computing on the edge.
Models and Simulations for the Photometric LSST Astronomical Time Series Classification Challenge (PLAsTiCC)
We describe the simulated data sample for the "Photometric LSST Astronomical Time Series Classification Challenge" (PLAsTiCC), a publicly available challenge to classify transient and variable events that will be observed by the Large Synoptic Survey Telescope (LSST), a new facility expected to start in the early 2020s. The challenge was hosted by Kaggle, ran from 2018 September 28 to 2018 December 17, and included 1,094 teams competing for prizes. Here we provide details of the 18 transient and variable source models, which were not revealed until after the challenge, and release the model libraries at https://doi.org/10.5281/zenodo.2612896. We describe the LSST Operations Simulator used to predict realistic observing conditions, and we describe the publicly available SNANA simulation code used to transform the models into observed fluxes and uncertainties in the LSST passbands (ugrizy). Although PLAsTiCC has finished, the publicly available models and simulation tools are being used within the astronomy community to further improve classification, and to study contamination in photometrically identified samples of type Ia supernova used to measure properties of dark energy. Our simulation framework will continue serving as a platform to improve the PLAsTiCC models, and to develop new models.
The Multimodal Universe: Enabling Large-Scale Machine Learning with 100TB of Astronomical Scientific Data
We present the MULTIMODAL UNIVERSE, a large-scale multimodal dataset of scientific astronomical data, compiled specifically to facilitate machine learning research. Overall, the MULTIMODAL UNIVERSE contains hundreds of millions of astronomical observations, constituting 100\,TB of multi-channel and hyper-spectral images, spectra, multivariate time series, as well as a wide variety of associated scientific measurements and "metadata". In addition, we include a range of benchmark tasks representative of standard practices for machine learning methods in astrophysics. This massive dataset will enable the development of large multi-modal models specifically targeted towards scientific applications. All codes used to compile the MULTIMODAL UNIVERSE and a description of how to access the data is available at https://github.com/MultimodalUniverse/MultimodalUniverse
Combined CNN and ViT features off-the-shelf: Another astounding baseline for recognition
We apply pre-trained architectures, originally developed for the ImageNet Large Scale Visual Recognition Challenge, for periocular recognition. These architectures have demonstrated significant success in various computer vision tasks beyond the ones for which they were designed. This work builds on our previous study using off-the-shelf Convolutional Neural Network (CNN) and extends it to include the more recently proposed Vision Transformers (ViT). Despite being trained for generic object classification, middle-layer features from CNNs and ViTs are a suitable way to recognize individuals based on periocular images. We also demonstrate that CNNs and ViTs are highly complementary since their combination results in boosted accuracy. In addition, we show that a small portion of these pre-trained models can achieve good accuracy, resulting in thinner models with fewer parameters, suitable for resource-limited environments such as mobiles. This efficiency improves if traditional handcrafted features are added as well.
Widen the Resonance: Probing a New Regime of Neutrino Self-Interactions with Astrophysical Neutrinos
Neutrino self-interactions beyond the standard model have profound implications in astrophysics and cosmology. In this work, we study an uncharted scenario in which one of the three neutrino species has a mass much smaller than the temperature of the cosmic neutrino background. This results in a relativistic component that significantly broadens the absorption feature on the astrophysical neutrino spectra, in contrast to the sharply peaked absorption expected in the extensively studied scenarios assuming a fully nonrelativistic cosmic neutrino background. By solving the Boltzmann equations for neutrino absorption and regeneration, we demonstrate that this mechanism provides novel sensitivity to sub-keV mediator masses, well below the traditional sim 1--100 MeV range. Future observations of the diffuse supernova neutrino background with Hyper-Kamiokande could probe coupling strengths down to g sim 10^{-8}, surpassing existing constraints by orders of magnitude. These findings open new directions for discoveries and offer crucial insights into the interplay between neutrinos and the dark sector.
KIC 4150611: A quadruply eclipsing heptuple star system with a g-mode period-spacing pattern Asteroseismic modelling of the g-mode period-spacing pattern
In this work, we aim to estimate the stellar parameters of the primary (Aa) by performing asteroseismic analysis on its period-spacing pattern. We use the C-3PO neural network to perform asteroseismic modelling of the g-mode period-spacing pattern of Aa, discussing the interplay of this information with external constraints from spectroscopy (T_{rm eff} and log(g)) and eclipse modelling (R). To estimate the level of uncertainty due to different frequency extraction and pattern identification processes, we consider four different variations on the period-spacing patterns. To better understand the correlations between and the uncertainty structure of our parameter estimates, we also employed a classical, parameter-based MCMC grid search on four different stellar grids. The best-fitting, externally constrained model to the period-spacing pattern arrives at estimates of the stellar properties for Aa of: M=1.51 pm 0.05 M_odot, X_c =0.43 pm 0.04, R=1.66 pm 0.1 R_odot, f_{rm ov}=0.010, Omega_c=1.58 pm 0.01 d^{-1} with rigid rotation to within the measurement errors, log(T_{rm eff})=3.856 pm 0.008 dex, log(g)=4.18 pm 0.04 dex, and log(L)=0.809 pm 0.005 dex, which agree well with previous measurements from eclipse modelling, spectroscopy, and the Gaia DR3 luminosity. We find that the near-core properties of the best-fitting asteroseismic models are consistent with external constraints from eclipse modelling and spectroscopy. Aa appears to be a typical example of a gamma Dor star, fitting well within existing populations. We find that Aa is quasi-rigidly rotating to within the uncertainties, and note that the asteroseismic age estimate for Aa (1100 pm 100 Myr) is considerably older than the young (35 Myr) age implied by previous isochrone fits to the B binary in the literature. Our MCMC parameter-based grid-search agrees well with our pattern-modelling approach.
Uncovering delayed patterns in noisy and irregularly sampled time series: an astronomy application
We study the problem of estimating the time delay between two signals representing delayed, irregularly sampled and noisy versions of the same underlying pattern. We propose and demonstrate an evolutionary algorithm for the (hyper)parameter estimation of a kernel-based technique in the context of an astronomical problem, namely estimating the time delay between two gravitationally lensed signals from a distant quasar. Mixed types (integer and real) are used to represent variables within the evolutionary algorithm. We test the algorithm on several artificial data sets, and also on real astronomical observations of quasar Q0957+561. By carrying out a statistical analysis of the results we present a detailed comparison of our method with the most popular methods for time delay estimation in astrophysics. Our method yields more accurate and more stable time delay estimates: for Q0957+561, we obtain 419.6 days for the time delay between images A and B. Our methodology can be readily applied to current state-of-the-art optical monitoring data in astronomy, but can also be applied in other disciplines involving similar time series data.
Pattern and Origin for the Extreme $γ$-ray Flares of 3C 454.3 and 3C 279: An Astrophysical Critical Damper?
We apply a Gaussian process method to the extreme gamma-ray flares of 3C 454.3 and 3C 279 to discover the variable patterns and then to investigate the physical origins of the giant flares. The kernels of stochastically driven damped simple harmonic oscillator (SHO), the damped random-walk (DRW), and Matrm ern-3/2 are respectively used to describe the adaptive-binning gamma-ray light curves of the two flares. Our findings show that both the extreme gamma-ray flares of 3C 454.3 and 3C 279 clearly prefer the SHO kernel in the over-damped mode and the Matrm ern-3/2 kernel over the DRW kernel. The resulted SHO and Matrm ern-3/2 power spectral densities (PSDs) are the same for each object, with the index changing from -4 at high frequencies to 0 at low frequencies. The patterns of the two flares are both approaching the critical damping mode with the quality factor Q approx 0.4 (i.e., the damping ratio eta approx 1.25), but with slightly different damping timescales. The characteristic timescale (corresponding to the broken frequency in the PSD) for 3C 454.3 is 2-3 days and 3-5 days for 3C 279. The variable patterns found here suggest that once the system responds to the energy injection disturbance, the release of the energy in the system is finished abruptly. The obtained timescale provides a constraint on the size of energy dissipation region for each source.
WordRobe: Text-Guided Generation of Textured 3D Garments
In this paper, we tackle a new and challenging problem of text-driven generation of 3D garments with high-quality textures. We propose "WordRobe", a novel framework for the generation of unposed & textured 3D garment meshes from user-friendly text prompts. We achieve this by first learning a latent representation of 3D garments using a novel coarse-to-fine training strategy and a loss for latent disentanglement, promoting better latent interpolation. Subsequently, we align the garment latent space to the CLIP embedding space in a weakly supervised manner, enabling text-driven 3D garment generation and editing. For appearance modeling, we leverage the zero-shot generation capability of ControlNet to synthesize view-consistent texture maps in a single feed-forward inference step, thereby drastically decreasing the generation time as compared to existing methods. We demonstrate superior performance over current SOTAs for learning 3D garment latent space, garment interpolation, and text-driven texture synthesis, supported by quantitative evaluation and qualitative user study. The unposed 3D garment meshes generated using WordRobe can be directly fed to standard cloth simulation & animation pipelines without any post-processing.
Beyond Fully-Connected Layers with Quaternions: Parameterization of Hypercomplex Multiplications with $1/n$ Parameters
Recent works have demonstrated reasonable success of representation learning in hypercomplex space. Specifically, "fully-connected layers with Quaternions" (4D hypercomplex numbers), which replace real-valued matrix multiplications in fully-connected layers with Hamilton products of Quaternions, both enjoy parameter savings with only 1/4 learnable parameters and achieve comparable performance in various applications. However, one key caveat is that hypercomplex space only exists at very few predefined dimensions (4D, 8D, and 16D). This restricts the flexibility of models that leverage hypercomplex multiplications. To this end, we propose parameterizing hypercomplex multiplications, allowing models to learn multiplication rules from data regardless of whether such rules are predefined. As a result, our method not only subsumes the Hamilton product, but also learns to operate on any arbitrary nD hypercomplex space, providing more architectural flexibility using arbitrarily 1/n learnable parameters compared with the fully-connected layer counterpart. Experiments of applications to the LSTM and Transformer models on natural language inference, machine translation, text style transfer, and subject verb agreement demonstrate architectural flexibility and effectiveness of the proposed approach.
AnySat: An Earth Observation Model for Any Resolutions, Scales, and Modalities
Geospatial models must adapt to the diversity of Earth observation data in terms of resolutions, scales, and modalities. However, existing approaches expect fixed input configurations, which limits their practical applicability. We propose AnySat, a multimodal model based on joint embedding predictive architecture (JEPA) and resolution-adaptive spatial encoders, allowing us to train a single model on highly heterogeneous data in a self-supervised manner. To demonstrate the advantages of this unified approach, we compile GeoPlex, a collection of 5 multimodal datasets with varying characteristics and 11 distinct sensors. We then train a single powerful model on these diverse datasets simultaneously. Once fine-tuned, we achieve better or near state-of-the-art results on the datasets of GeoPlex and 4 additional ones for 5 environment monitoring tasks: land cover mapping, tree species identification, crop type classification, change detection, and flood segmentation. The code and models are available at https://github.com/gastruc/AnySat.
Understanding the World's Museums through Vision-Language Reasoning
Museums serve as vital repositories of cultural heritage and historical artifacts spanning diverse epochs, civilizations, and regions, preserving well-documented collections. Data reveal key attributes such as age, origin, material, and cultural significance. Understanding museum exhibits from their images requires reasoning beyond visual features. In this work, we facilitate such reasoning by (a) collecting and curating a large-scale dataset of 65M images and 200M question-answer pairs in the standard museum catalog format for exhibits from all around the world; (b) training large vision-language models on the collected dataset; (c) benchmarking their ability on five visual question answering tasks. The complete dataset is labeled by museum experts, ensuring the quality as well as the practical significance of the labels. We train two VLMs from different categories: the BLIP model, with vision-language aligned embeddings, but lacking the expressive power of large language models, and the LLaVA model, a powerful instruction-tuned LLM enriched with vision-language reasoning capabilities. Through exhaustive experiments, we provide several insights on the complex and fine-grained understanding of museum exhibits. In particular, we show that some questions whose answers can often be derived directly from visual features are well answered by both types of models. On the other hand, questions that require the grounding of the visual features in repositories of human knowledge are better answered by the large vision-language models, thus demonstrating their superior capacity to perform the desired reasoning. Find our dataset, benchmarks, and source code at: https://github.com/insait-institute/Museum-65
Consistency-diversity-realism Pareto fronts of conditional image generative models
Building world models that accurately and comprehensively represent the real world is the utmost aspiration for conditional image generative models as it would enable their use as world simulators. For these models to be successful world models, they should not only excel at image quality and prompt-image consistency but also ensure high representation diversity. However, current research in generative models mostly focuses on creative applications that are predominantly concerned with human preferences of image quality and aesthetics. We note that generative models have inference time mechanisms - or knobs - that allow the control of generation consistency, quality, and diversity. In this paper, we use state-of-the-art text-to-image and image-and-text-to-image models and their knobs to draw consistency-diversity-realism Pareto fronts that provide a holistic view on consistency-diversity-realism multi-objective. Our experiments suggest that realism and consistency can both be improved simultaneously; however there exists a clear tradeoff between realism/consistency and diversity. By looking at Pareto optimal points, we note that earlier models are better at representation diversity and worse in consistency/realism, and more recent models excel in consistency/realism while decreasing significantly the representation diversity. By computing Pareto fronts on a geodiverse dataset, we find that the first version of latent diffusion models tends to perform better than more recent models in all axes of evaluation, and there exist pronounced consistency-diversity-realism disparities between geographical regions. Overall, our analysis clearly shows that there is no best model and the choice of model should be determined by the downstream application. With this analysis, we invite the research community to consider Pareto fronts as an analytical tool to measure progress towards world models.
OpenStreetView-5M: The Many Roads to Global Visual Geolocation
Determining the location of an image anywhere on Earth is a complex visual task, which makes it particularly relevant for evaluating computer vision algorithms. Yet, the absence of standard, large-scale, open-access datasets with reliably localizable images has limited its potential. To address this issue, we introduce OpenStreetView-5M, a large-scale, open-access dataset comprising over 5.1 million geo-referenced street view images, covering 225 countries and territories. In contrast to existing benchmarks, we enforce a strict train/test separation, allowing us to evaluate the relevance of learned geographical features beyond mere memorization. To demonstrate the utility of our dataset, we conduct an extensive benchmark of various state-of-the-art image encoders, spatial representations, and training strategies. All associated codes and models can be found at https://github.com/gastruc/osv5m.
OmniSat: Self-Supervised Modality Fusion for Earth Observation
The field of Earth Observations (EO) offers a wealth of data from diverse sensors, presenting a great opportunity for advancing self-supervised multimodal learning. However, current multimodal EO datasets and models focus on a single data type, either mono-date images or time series, which limits their expressivity. We introduce OmniSat, a novel architecture that exploits the spatial alignment between multiple EO modalities to learn expressive multimodal representations without labels. To demonstrate the advantages of combining modalities of different natures, we augment two existing datasets with new modalities. As demonstrated on three downstream tasks: forestry, land cover classification, and crop mapping. OmniSat can learn rich representations in an unsupervised manner, leading to improved performance in the semi- and fully-supervised settings, even when only one modality is available for inference. The code and dataset are available at github.com/gastruc/OmniSat.
AutoPaint: A Self-Inpainting Method for Unsupervised Anomaly Detection
Robust and accurate detection and segmentation of heterogenous tumors appearing in different anatomical organs with supervised methods require large-scale labeled datasets covering all possible types of diseases. Due to the unavailability of such rich datasets and the high cost of annotations, unsupervised anomaly detection (UAD) methods have been developed aiming to detect the pathologies as deviation from the normality by utilizing the unlabeled healthy image data. However, developed UAD models are often trained with an incomplete distribution of healthy anatomies and have difficulties in preserving anatomical constraints. This work intends to, first, propose a robust inpainting model to learn the details of healthy anatomies and reconstruct high-resolution images by preserving anatomical constraints. Second, we propose an autoinpainting pipeline to automatically detect tumors, replace their appearance with the learned healthy anatomies, and based on that segment the tumoral volumes in a purely unsupervised fashion. Three imaging datasets, including PET, CT, and PET-CT scans of lung tumors and head and neck tumors, are studied as benchmarks for evaluation. Experimental results demonstrate the significant superiority of the proposed method over a wide range of state-of-the-art UAD methods. Moreover, the unsupervised method we propose produces comparable results to a robust supervised segmentation method when applied to multimodal images.
Law of the Weakest Link: Cross Capabilities of Large Language Models
The development and evaluation of Large Language Models (LLMs) have largely focused on individual capabilities. However, this overlooks the intersection of multiple abilities across different types of expertise that are often required for real-world tasks, which we term cross capabilities. To systematically explore this concept, we first define seven core individual capabilities and then pair them to form seven common cross capabilities, each supported by a manually constructed taxonomy. Building on these definitions, we introduce CrossEval, a benchmark comprising 1,400 human-annotated prompts, with 100 prompts for each individual and cross capability. To ensure reliable evaluation, we involve expert annotators to assess 4,200 model responses, gathering 8,400 human ratings with detailed explanations to serve as reference examples. Our findings reveal that, in both static evaluations and attempts to enhance specific abilities, current LLMs consistently exhibit the "Law of the Weakest Link," where cross-capability performance is significantly constrained by the weakest component. Specifically, across 58 cross-capability scores from 17 models, 38 scores are lower than all individual capabilities, while 20 fall between strong and weak, but closer to the weaker ability. These results highlight the under-performance of LLMs in cross-capability tasks, making the identification and improvement of the weakest capabilities a critical priority for future research to optimize performance in complex, multi-dimensional scenarios.
Improving Text-to-Image Consistency via Automatic Prompt Optimization
Impressive advances in text-to-image (T2I) generative models have yielded a plethora of high performing models which are able to generate aesthetically appealing, photorealistic images. Despite the progress, these models still struggle to produce images that are consistent with the input prompt, oftentimes failing to capture object quantities, relations and attributes properly. Existing solutions to improve prompt-image consistency suffer from the following challenges: (1) they oftentimes require model fine-tuning, (2) they only focus on nearby prompt samples, and (3) they are affected by unfavorable trade-offs among image quality, representation diversity, and prompt-image consistency. In this paper, we address these challenges and introduce a T2I optimization-by-prompting framework, OPT2I, which leverages a large language model (LLM) to improve prompt-image consistency in T2I models. Our framework starts from a user prompt and iteratively generates revised prompts with the goal of maximizing a consistency score. Our extensive validation on two datasets, MSCOCO and PartiPrompts, shows that OPT2I can boost the initial consistency score by up to 24.9% in terms of DSG score while preserving the FID and increasing the recall between generated and real data. Our work paves the way toward building more reliable and robust T2I systems by harnessing the power of LLMs.
Improved baselines for vision-language pre-training
Contrastive learning has emerged as an efficient framework to learn multimodal representations. CLIP, a seminal work in this area, achieved impressive results by training on paired image-text data using the contrastive loss. Recent work claims improvements over CLIP using additional non-contrastive losses inspired from self-supervised learning. However, it is sometimes hard to disentangle the contribution of these additional losses from other implementation details, e.g., data augmentation or regularization techniques, used to train the model. To shed light on this matter, in this paper, we first propose, implement and evaluate several baselines obtained by combining contrastive learning with recent advances in self-supervised learning. In particular, we use the loss functions that were proven successful for visual self-supervised learning to align image and text modalities. We find that these baselines outperform a basic implementation of CLIP. However, when a stronger training recipe is employed, the advantage disappears. Indeed, we find that a simple CLIP baseline can also be improved substantially, up to a 25% relative improvement on downstream zero-shot tasks, by using well-known training techniques that are popular in other subfields. Moreover, we discover that it is enough to apply image and text augmentations to make up for most of the improvement attained by prior works. With our improved training recipe for CLIP, we obtain state-of-the-art performance on four standard datasets, and consistently outperform prior work (up to +4% on the largest dataset), while being substantially simpler.
Automatic Chain of Thought Prompting in Large Language Models
Large language models (LLMs) can perform complex reasoning by generating intermediate reasoning steps. Providing these steps for prompting demonstrations is called chain-of-thought (CoT) prompting. CoT prompting has two major paradigms. One leverages a simple prompt like "Let's think step by step" to facilitate step-by-step thinking before answering a question. The other uses a few manual demonstrations one by one, each composed of a question and a reasoning chain that leads to an answer. The superior performance of the second paradigm hinges on the hand-crafting of task-specific demonstrations one by one. We show that such manual efforts may be eliminated by leveraging LLMs with the "Let's think step by step" prompt to generate reasoning chains for demonstrations one by one, i.e., let's think not just step by step, but also one by one. However, these generated chains often come with mistakes. To mitigate the effect of such mistakes, we find that diversity matters for automatically constructing demonstrations. We propose an automatic CoT prompting method: Auto-CoT. It samples questions with diversity and generates reasoning chains to construct demonstrations. On ten public benchmark reasoning tasks with GPT-3, Auto-CoT consistently matches or exceeds the performance of the CoT paradigm that requires manual designs of demonstrations. Code is available at https://github.com/amazon-research/auto-cot
You Only Look at Screens: Multimodal Chain-of-Action Agents
Autonomous user interface (UI) agents aim to facilitate task automation by interacting with the user interface without manual intervention. Recent studies have investigated eliciting the capabilities of large language models (LLMs) for effective engagement in diverse environments. To align with the input-output requirement of LLMs, existing approaches are developed under a sandbox setting where they rely on external tools and application-specific APIs to parse the environment into textual elements and interpret the predicted actions. Consequently, those approaches often grapple with inference inefficiency and error propagation risks. To mitigate the challenges, we introduce Auto-UI, a multimodal solution that directly interacts with the interface, bypassing the need for environment parsing or reliance on application-dependent APIs. Moreover, we propose a chain-of-action technique -- leveraging a series of intermediate previous action histories and future action plans -- to help the agent decide what action to execute. We evaluate our approach on a new device-control benchmark AITW with 30K unique instructions, spanning multi-step tasks such as application operation, web searching, and web shopping. Experimental results show that Auto-UI achieves state-of-the-art performance with an action type prediction accuracy of 90% and an overall action success rate of 74%. Code is publicly available at https://github.com/cooelf/Auto-UI.
Learning Prescriptive ReLU Networks
We study the problem of learning optimal policy from a set of discrete treatment options using observational data. We propose a piecewise linear neural network model that can balance strong prescriptive performance and interpretability, which we refer to as the prescriptive ReLU network, or P-ReLU. We show analytically that this model (i) partitions the input space into disjoint polyhedra, where all instances that belong to the same partition receive the same treatment, and (ii) can be converted into an equivalent prescriptive tree with hyperplane splits for interpretability. We demonstrate the flexibility of the P-ReLU network as constraints can be easily incorporated with minor modifications to the architecture. Through experiments, we validate the superior prescriptive accuracy of P-ReLU against competing benchmarks. Lastly, we present examples of interpretable prescriptive trees extracted from trained P-ReLUs using a real-world dataset, for both the unconstrained and constrained scenarios.
PHNNs: Lightweight Neural Networks via Parameterized Hypercomplex Convolutions
Hypercomplex neural networks have proven to reduce the overall number of parameters while ensuring valuable performance by leveraging the properties of Clifford algebras. Recently, hypercomplex linear layers have been further improved by involving efficient parameterized Kronecker products. In this paper, we define the parameterization of hypercomplex convolutional layers and introduce the family of parameterized hypercomplex neural networks (PHNNs) that are lightweight and efficient large-scale models. Our method grasps the convolution rules and the filter organization directly from data without requiring a rigidly predefined domain structure to follow. PHNNs are flexible to operate in any user-defined or tuned domain, from 1D to nD regardless of whether the algebra rules are preset. Such a malleability allows processing multidimensional inputs in their natural domain without annexing further dimensions, as done, instead, in quaternion neural networks for 3D inputs like color images. As a result, the proposed family of PHNNs operates with 1/n free parameters as regards its analog in the real domain. We demonstrate the versatility of this approach to multiple domains of application by performing experiments on various image datasets as well as audio datasets in which our method outperforms real and quaternion-valued counterparts. Full code is available at: https://github.com/eleGAN23/HyperNets.
IRepair: An Intent-Aware Approach to Repair Data-Driven Errors in Large Language Models
Not a day goes by without hearing about the impressive feats of large language models (LLMs), and equally, not a day passes without hearing about their challenges. LLMs are notoriously vulnerable to biases in their dataset, leading to issues such as toxicity. While domain-adaptive training has been employed to mitigate these issues, these techniques often address all model parameters indiscriminately during the repair process, resulting in poor repair quality and reduced model versatility. In this paper, we introduce a novel dynamic slicing-based intent-aware LLM repair strategy, IRepair. This approach selectively targets the most error-prone sections of the model for repair. Specifically, we propose dynamically slicing the model's most sensitive layers that require immediate attention, concentrating repair efforts on those areas. This method enables more effective repairs with potentially less impact on the model's overall performance by altering a smaller portion of the model. We evaluated our technique on three models from the GPT2 and GPT-Neo families, with parameters ranging from 800M to 1.6B, in a toxicity mitigation setup. Our results show that IRepair repairs errors 43.6% more effectively while causing 46% less disruption to general performance compared to the closest baseline, direct preference optimization. Our empirical analysis also reveals that errors are more concentrated in a smaller section of the model, with the top 20% of layers exhibiting 773% more error density than the remaining 80\%. This highlights the need for selective repair. Additionally, we demonstrate that a dynamic selection approach is essential for addressing errors dispersed throughout the model, ensuring a robust and efficient repair.
Boosting Latent Diffusion with Perceptual Objectives
Latent diffusion models (LDMs) power state-of-the-art high-resolution generative image models. LDMs learn the data distribution in the latent space of an autoencoder (AE) and produce images by mapping the generated latents into RGB image space using the AE decoder. While this approach allows for efficient model training and sampling, it induces a disconnect between the training of the diffusion model and the decoder, resulting in a loss of detail in the generated images. To remediate this disconnect, we propose to leverage the internal features of the decoder to define a latent perceptual loss (LPL). This loss encourages the models to create sharper and more realistic images. Our loss can be seamlessly integrated with common autoencoders used in latent diffusion models, and can be applied to different generative modeling paradigms such as DDPM with epsilon and velocity prediction, as well as flow matching. Extensive experiments with models trained on three datasets at 256 and 512 resolution show improved quantitative -- with boosts between 6% and 20% in FID -- and qualitative results when using our perceptual loss.
The effect of dynamical states on galaxy clusters populations. I. Classification of dynamical states
While the influence of galaxy clusters on galaxy evolution is relatively well-understood, the impact of the dynamical states of these clusters is less clear. This paper series explores how the dynamical state of galaxy clusters affects their galaxy populations' physical and morphological properties. The primary aim of this first paper is to evaluate the dynamical state of 87 massive (M_{500} geq 1.5 times 10^{14} M_{odot}) galaxy clusters at low redshifts (0.10 leq z leq 0.35). This will allow us to have a well-characterized sample for analyzing physical and morphological properties in our next work. We employ six dynamical state proxies utilizing optical and X-ray imaging data. Principal Component Analysis (PCA) is applied to integrate these proxies effectively, allowing for robust classification of galaxy clusters into relaxed, intermediate, and disturbed states based on their dynamical characteristics. The methodology successfully segregates the galaxy clusters into the three dynamical states. Examination of the galaxy distributions in optical wavelengths and gas distributions in X-ray further confirms the consistency of these classifications. The clusters' dynamical states are statistically distinguishable, providing a clear categorization for further analysis.
Weighted Grouped Query Attention in Transformers
The attention mechanism forms the foundational blocks for transformer language models. Recent approaches show that scaling the model achieves human-level performance. However, with increasing demands for scaling and constraints on hardware memory, the inference costs of these models remain high. To reduce the inference time, Multi-Query Attention (MQA) and Grouped-Query Attention (GQA) were proposed in (Shazeer, 2019) and (Ainslieet al., 2023) respectively. In this paper, we propose a variation of Grouped-Query Attention, termed Weighted Grouped-Query Attention (WGQA). We introduced new learnable parameters for each key and value head in the T5 decoder attention blocks, enabling the model to take a weighted average during finetuning. Our model achieves an average of 0.53% improvement over GQA, and the performance converges to traditional Multi-head attention (MHA) with no additional overhead during inference. We evaluated the introduction of these parameters and subsequent finetuning informs the model about the grouping mechanism during training, thereby enhancing performance. Additionally, we demonstrate the scaling laws in our analysis by comparing the results between T5-small and T5-base architecture.
Large Language Models to Enhance Bayesian Optimization
Bayesian optimization (BO) is a powerful approach for optimizing complex and expensive-to-evaluate black-box functions. Its importance is underscored in many applications, notably including hyperparameter tuning, but its efficacy depends on efficiently balancing exploration and exploitation. While there has been substantial progress in BO methods, striking this balance remains a delicate process. In this light, we present LLAMBO, a novel approach that integrates the capabilities of Large Language Models (LLM) within BO. At a high level, we frame the BO problem in natural language, enabling LLMs to iteratively propose and evaluate promising solutions conditioned on historical evaluations. More specifically, we explore how combining contextual understanding, few-shot learning proficiency, and domain knowledge of LLMs can improve model-based BO. Our findings illustrate that LLAMBO is effective at zero-shot warmstarting, and enhances surrogate modeling and candidate sampling, especially in the early stages of search when observations are sparse. Our approach is performed in context and does not require LLM finetuning. Additionally, it is modular by design, allowing individual components to be integrated into existing BO frameworks, or function cohesively as an end-to-end method. We empirically validate LLAMBO's efficacy on the problem of hyperparameter tuning, highlighting strong empirical performance across a range of diverse benchmarks, proprietary, and synthetic tasks.
LLamol: A Dynamic Multi-Conditional Generative Transformer for De Novo Molecular Design
Generative models have demonstrated substantial promise in Natural Language Processing (NLP) and have found application in designing molecules, as seen in General Pretrained Transformer (GPT) models. In our efforts to develop such a tool for exploring the organic chemical space in search of potentially electro-active compounds, we present "LLamol", a single novel generative transformer model based on the LLama 2 architecture, which was trained on a 13M superset of organic compounds drawn from diverse public sources. To allow for a maximum flexibility in usage and robustness in view of potentially incomplete data, we introduce "Stochastic Context Learning" as a new training procedure. We demonstrate that the resulting model adeptly handles single- and multi-conditional organic molecule generation with up to four conditions, yet more are possible. The model generates valid molecular structures in SMILES notation while flexibly incorporating three numerical and/or one token sequence into the generative process, just as requested. The generated compounds are very satisfactory in all scenarios tested. In detail, we showcase the model's capability to utilize token sequences for conditioning, either individually or in combination with numerical properties, making LLamol a potent tool for de novo molecule design, easily expandable with new properties.
A Cheaper and Better Diffusion Language Model with Soft-Masked Noise
Diffusion models that are based on iterative denoising have been recently proposed and leveraged in various generation tasks like image generation. Whereas, as a way inherently built for continuous data, existing diffusion models still have some limitations in modeling discrete data, e.g., languages. For example, the generally used Gaussian noise can not handle the discrete corruption well, and the objectives in continuous spaces fail to be stable for textual data in the diffusion process especially when the dimension is high. To alleviate these issues, we introduce a novel diffusion model for language modeling, Masked-Diffuse LM, with lower training cost and better performances, inspired by linguistic features in languages. Specifically, we design a linguistic-informed forward process which adds corruptions to the text through strategically soft-masking to better noise the textual data. Also, we directly predict the categorical distribution with cross-entropy loss function in every diffusion step to connect the continuous space and discrete space in a more efficient and straightforward way. Through experiments on 5 controlled generation tasks, we demonstrate that our Masked-Diffuse LM can achieve better generation quality than the state-of-the-art diffusion models with better efficiency.
Prompt Pre-Training with Twenty-Thousand Classes for Open-Vocabulary Visual Recognition
This work proposes POMP, a prompt pre-training method for vision-language models. Being memory and computation efficient, POMP enables the learned prompt to condense semantic information for a rich set of visual concepts with over twenty-thousand classes. Once pre-trained, the prompt with a strong transferable ability can be directly plugged into a variety of visual recognition tasks including image classification, semantic segmentation, and object detection, to boost recognition performances in a zero-shot manner. Empirical evaluation shows that POMP achieves state-of-the-art performances on 21 downstream datasets, e.g., 67.0% average accuracy on 10 classification dataset (+3.1% compared to CoOp) and 84.4 hIoU on open-vocabulary Pascal VOC segmentation (+6.9 compared to ZSSeg).
Is ChatGPT a General-Purpose Natural Language Processing Task Solver?
Spurred by advancements in scale, large language models (LLMs) have demonstrated the ability to perform a variety of natural language processing (NLP) tasks zero-shot -- i.e., without adaptation on downstream data. Recently, the debut of ChatGPT has drawn a great deal of attention from the natural language processing (NLP) community due to the fact that it can generate high-quality responses to human input and self-correct previous mistakes based on subsequent conversations. However, it is not yet known whether ChatGPT can serve as a generalist model that can perform many NLP tasks zero-shot. In this work, we empirically analyze the zero-shot learning ability of ChatGPT by evaluating it on 20 popular NLP datasets covering 7 representative task categories. With extensive empirical studies, we demonstrate both the effectiveness and limitations of the current version of ChatGPT. We find that ChatGPT performs well on many tasks favoring reasoning capabilities (e.g., arithmetic reasoning) while it still faces challenges when solving specific tasks such as sequence tagging. We additionally provide in-depth analysis through qualitative case studies.
Multimodal Chain-of-Thought Reasoning in Language Models
Large language models (LLMs) have shown impressive performance on complex reasoning by leveraging chain-of-thought (CoT) prompting to generate intermediate reasoning chains as the rationale to infer the answer. However, existing CoT studies have focused on the language modality. We propose Multimodal-CoT that incorporates language (text) and vision (images) modalities into a two-stage framework that separates rationale generation and answer inference. In this way, answer inference can leverage better generated rationales that are based on multimodal information. With Multimodal-CoT, our model under 1 billion parameters outperforms the previous state-of-the-art LLM (GPT-3.5) by 16 percentage points (75.17%->91.68% accuracy) on the ScienceQA benchmark and even surpasses human performance. Code is publicly available available at https://github.com/amazon-science/mm-cot.
SPT: Semi-Parametric Prompt Tuning for Multitask Prompted Learning
Pre-trained large language models can efficiently interpolate human-written prompts in a natural way. Multitask prompted learning can help generalization through a diverse set of tasks at once, thus enhancing the potential for more effective downstream fine-tuning. To perform efficient multitask-inference in the same batch, parameter-efficient fine-tuning methods such as prompt tuning have been proposed. However, the existing prompt tuning methods may lack generalization. We propose SPT, a semi-parametric prompt tuning method for multitask prompted learning. The novel component of SPT is a memory bank from where memory prompts are retrieved based on discrete prompts. Extensive experiments, such as (i) fine-tuning a full language model with SPT on 31 different tasks from 8 different domains and evaluating zero-shot generalization on 9 heldout datasets under 5 NLP task categories and (ii) pretraining SPT on the GLUE datasets and evaluating fine-tuning on the SuperGLUE datasets, demonstrate effectiveness of SPT.
Named Entity Recognition in Indian court judgments
Identification of named entities from legal texts is an essential building block for developing other legal Artificial Intelligence applications. Named Entities in legal texts are slightly different and more fine-grained than commonly used named entities like Person, Organization, Location etc. In this paper, we introduce a new corpus of 46545 annotated legal named entities mapped to 14 legal entity types. The Baseline model for extracting legal named entities from judgment text is also developed.
A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions
Curation methods for massive vision-language datasets trade off between dataset size and quality. However, even the highest quality of available curated captions are far too short to capture the rich visual detail in an image. To show the value of dense and highly-aligned image-text pairs, we collect the Densely Captioned Images (DCI) dataset, containing 8012 natural images human-annotated with mask-aligned descriptions averaging above 1000 words each. With precise and reliable captions associated with specific parts of an image, we can evaluate vision-language models' (VLMs) understanding of image content with a novel task that matches each caption with its corresponding subcrop. As current models are often limited to 77 text tokens, we also introduce a summarized version (sDCI) in which each caption length is limited. We show that modern techniques that make progress on standard benchmarks do not correspond with significant improvement on our sDCI based benchmark. Lastly, we finetune CLIP using sDCI and show significant improvements over the baseline despite a small training set. By releasing the first human annotated dense image captioning dataset, we hope to enable the development of new benchmarks or fine-tuning recipes for the next generation of VLMs to come.
Igniting Language Intelligence: The Hitchhiker's Guide From Chain-of-Thought Reasoning to Language Agents
Large language models (LLMs) have dramatically enhanced the field of language intelligence, as demonstrably evidenced by their formidable empirical performance across a spectrum of complex reasoning tasks. Additionally, theoretical proofs have illuminated their emergent reasoning capabilities, providing a compelling showcase of their advanced cognitive abilities in linguistic contexts. Critical to their remarkable efficacy in handling complex reasoning tasks, LLMs leverage the intriguing chain-of-thought (CoT) reasoning techniques, obliging them to formulate intermediate steps en route to deriving an answer. The CoT reasoning approach has not only exhibited proficiency in amplifying reasoning performance but also in enhancing interpretability, controllability, and flexibility. In light of these merits, recent research endeavors have extended CoT reasoning methodologies to nurture the development of autonomous language agents, which adeptly adhere to language instructions and execute actions within varied environments. This survey paper orchestrates a thorough discourse, penetrating vital research dimensions, encompassing: (i) the foundational mechanics of CoT techniques, with a focus on elucidating the circumstances and justification behind its efficacy; (ii) the paradigm shift in CoT; and (iii) the burgeoning of language agents fortified by CoT approaches. Prospective research avenues envelop explorations into generalization, efficiency, customization, scaling, and safety. This paper caters to a wide audience, including beginners seeking comprehensive knowledge of CoT reasoning and language agents, as well as experienced researchers interested in foundational mechanics and engaging in cutting-edge discussions on these topics. A repository for the related papers is available at https://github.com/Zoeyyao27/CoT-Igniting-Agent.
SMILE: Scaling Mixture-of-Experts with Efficient Bi-level Routing
The mixture of Expert (MoE) parallelism is a recent advancement that scales up the model size with constant computational cost. MoE selects different sets of parameters (i.e., experts) for each incoming token, resulting in a sparsely-activated model. Despite several successful applications of MoE, its training efficiency degrades significantly as the number of experts increases. The routing stage in MoE relies on the efficiency of the All2All communication collective, which suffers from network congestion and has poor scalability. To mitigate these issues, we introduce SMILE, which exploits heterogeneous network bandwidth and splits a single-step routing into bi-level routing. Our experimental results show that the proposed method obtains a 2.5x speedup over Switch Transformer in terms of pretraining throughput on the Colossal Clean Crawled Corpus without losing any convergence speed.
Bootstrapping Multilingual AMR with Contextual Word Alignments
We develop high performance multilingualAbstract Meaning Representation (AMR) sys-tems by projecting English AMR annotationsto other languages with weak supervision. Weachieve this goal by bootstrapping transformer-based multilingual word embeddings, in partic-ular those from cross-lingual RoBERTa (XLM-R large). We develop a novel technique forforeign-text-to-English AMR alignment, usingthe contextual word alignment between En-glish and foreign language tokens. This wordalignment is weakly supervised and relies onthe contextualized XLM-R word embeddings.We achieve a highly competitive performancethat surpasses the best published results forGerman, Italian, Spanish and Chinese.