Sessions by Day


Day 01

Session 1.1: Practice + Animal & Plant Breeding

The descriptions of experimental designs are often idiosyncratic and verbose, interwoven with details that are geared towards domain experts (e.g. the preparation of the experimental materials). Extracting the statistical elements of the experimental design from these descriptions can be tedious at best, challenging at worst. The emergence of Large Language Models (LLMs) has revolutionized various applications, notably in natural language processing. This talk explores the use of LLMs to streamline the extraction of statistical elements from the descriptions of experiments. This can expedite the distilling of complex experimental design descriptions and aid in formulating an appropriate analysis of experimental data.

Phytosanitary, or quarantine, treatments for fresh horticultural produce require a level of treatment efficacy to be determined. This provides confidence that no target pest species are moved with the product. Treatment efficacy is often based on the mortality of target pests exposed to the treatment. In naturally infested products, the samples are exposed to mature pests, which deposit an unknown number of eggs into each sample. In such cases, the number of pests exposed to the treatment is often unknown and needs to be estimated. The International Plant Protection Convention (IPPC) provides formulae for estimating the infestation rate under natural infestation. We had concerns over the statistical validity of these formulae, which have been an international standard since at least 2014. This talk discusses the deficiencies of the IPPC formulae, our proposed improved method, and the road to acceptance of our concerns and proposed improvements.
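
As a rough illustration of why such an estimate is needed, here is a minimal Python sketch of a naive control-based estimate of the number of pests exposed; this is illustrative only and is not the IPPC formula under discussion (the data and function name are hypothetical):

```python
import numpy as np

# Assumption: an untreated control batch is dissected to count pests,
# and the per-sample infestation rate is extrapolated to the treated batch.
def estimate_pests_exposed(control_counts, n_treated):
    rate = np.mean(control_counts)   # mean pests per control sample
    return rate * n_treated          # estimated pests in the treated batch

# e.g. 500 dissected control fruit, 30,000 treated fruit
rng = np.random.default_rng(1)
control = rng.poisson(2.5, size=500)
print(estimate_pests_exposed(control, 30_000))
```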

Statistics competes with other quantitative fields not only for funding but also for relevance. Interpretable models drive knowledge creation in science and in an increasingly diverse range of other domains. Unfortunately, there is little understanding that AI and machine learning models are rarely interpretable, while statisticians excel at building interpretable models.

We explore the common claim of accounting for confounders by asking “What do researchers mean by this?”, “Are they doing it correctly, or is it just dodgy bookkeeping?”, “How can it be done?”, and “Why is it needed for interpretable models, and why are they so important?”

We then use this exposition to discuss our strengths and where we need to improve, so that researchers want us as trusted advisors on the complex issues they face when building interpretable models. Achieving this means a greater share of scarce research dollars and involvement in ground-breaking, impactful research, boosting our profile and ensuring our relevance in a fast-moving quantitative research world.

Advances in "omics" technologies have enabled unprecedented progress in agricultural and biological sciences. The synergy of high-performance computing, high-throughput omics approaches, and high-dimensional phenotyping has demonstrated the capacity to enhance our understanding of biological mechanisms and provide powerful insights into dissecting the genetic basis of complex traits. The genome-wide association study (GWAS) has become a useful approach for identifying mutations that underlie diseases and complex traits. However, it is less suitable for quantitative traits influenced by a large number of genes with small effects. Genomic selection holds promise to overcome these limitations by using whole-genome information to predict genetic merit. We present the analytical methods and results of GWAS and genomic prediction using multi-environment and multi-trait barley data to identify genomic regions associated with agronomically important traits in barley grown under heat-stressed environments.

Genomic selection can be a useful tool in modern plant breeding programs as it allows for the genomic prediction of unphenotyped (unobserved) varieties via the association of a phenotyped trait with high-density genetic marker scores. However, the implementation of models to form this association can be complicated by several issues, including the presence of variety by environment interaction (VEI), complex trial designs, and linear dependencies in the matrix of marker scores. The latter can arise in several situations, including when genetic clones are present, when the matrix of marker scores is centred, or when there are more varieties than markers. This talk addresses several of these issues, motivated by an Australian chickpea multi-environment trial (MET) dataset comprising both field trials and controlled environments. The analysis was conducted in DWReml by fitting a single-step factor analytic linear mixed model (FALMM) to assess disease resistance.

Candidate two-dimensional spatial models, implemented in asremlPlus (Brien, 2024) for fitting with ASReml-R, are (i) separable variance models (e.g. assuming ar1 by ar1), (ii) tensor-product smoothing splines, and (iii) tensor-product P-splines, for which the degree and the order of differencing the penalty can be specified and for which Piepho et al.'s (2022) P-spline modifications have been incorporated. The models that are fitted and the three asremlPlus functions for fitting and comparing these models will be described. Their application will be illustrated using a two-greenhouse, 1100-pot, high-throughput phenotyping experiment that involved 215 barley lines.
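
For model (i), the separable residual structure has the standard Kronecker form (a sketch in our own notation, not taken from the talk):

```latex
% Separable AR1 x AR1 residual model over columns (c) and rows (r)
\operatorname{var}(\mathbf{e}) \;=\; \sigma^2 \, \Sigma_c(\rho_c) \otimes \Sigma_r(\rho_r),
\qquad \{\Sigma(\rho)\}_{ij} \;=\; \rho^{\,|i-j|}
```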

Session 1.2: Health + Design

To assess causality in behavioral addictions, a longitudinal modelling framework is generally required. Utilizing an Ecological Momentary Assessment (EMA) design, we measured the affective dynamics of mental wellbeing variables before, during and after pornography use among individuals recruited from an online forum. Participants completed a four-week EMA, capturing data on sexual activities and mental wellbeing variables. Bayesian hierarchical mixed-effects modelling was employed to analyze affective dynamics. We found that participants experienced a complex interplay between pornography use and other negative emotional states, which were exacerbated by feelings of guilt and shame. Using temporal markers provided by EMA data, we were able to propose causal relationships between pornography use and its effects on mental wellbeing. The EMA approach produced several unique findings and may be the ideal method for examining the effects of abstinence from pornography use.

Maternal mortality presents a significant global public health challenge in low- and middle-income countries (LMICs). The effective utilization of maternal healthcare services (MHS), including antenatal care (ANC), skilled birth attendance (SBA), institutional delivery (ID), and postnatal care (PNC), is crucial for achieving improved maternal health outcomes. We investigated the utilization of MHS among women in 33 LMICs around the globe using Demographic and Health Surveys. We fitted complex survey-adjusted logistic regression models for each outcome separately in a combined data set of all 33 surveys. We observed high heterogeneity in the utilization of MHS across countries; for example, women in Indonesia were 10.12, 8.14, 3.65 and 4.19 times more likely to utilize ANC (95% CI = 9.00-11.39), SBA (7.01-9.46), ID (3.23-4.13) and PNC (3.65-4.81), respectively, compared to women in Bangladesh. Given the heterogeneity in MHS uptake, country-specific adaptation of successful interventions might be a way forward to achieving the relevant Sustainable Development Goals by 2030.

Urban-rural inequality in the utilization of quality antenatal care (ANC) is a well-documented challenge in low- and middle-income countries, such as Bangladesh and Pakistan. This study investigates urban-rural inequality in the utilization of quality ANC in Bangladesh and Pakistan. We decomposed inequalities in the utilization of quality ANC among urban and rural women in Bangladesh and Pakistan using Blinder-Oaxaca and related decompositions for nonlinear models. To quantify covariate contributions to the urban-rural inequality, we employed the Blinder-Oaxaca multivariate decomposition analysis for nonlinear response models. Using data from the latest Demographic and Health Surveys (2017-2018), the study reveals significant urban-rural inequality in Bangladesh and Pakistan, which is more pronounced in Pakistan. Wealth differences make the largest percentage contribution among the common significant predictors for both countries. In Pakistan, women's education is the second largest contributor to inequality, while in Bangladesh, it is media exposure. Tailored strategies are required to mitigate these inequalities in ANC.
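
For reference, the nonlinear Blinder-Oaxaca decomposition typically splits the urban-rural gap as follows (a standard form in our notation, with F the inverse link and U/R denoting urban/rural; the choice of reference coefficients can vary):

```latex
% Nonlinear Blinder-Oaxaca decomposition of the urban (U) vs rural (R) gap
\bar{Y}_U - \bar{Y}_R
= \underbrace{\Big[\tfrac{1}{n_U}\textstyle\sum_{i \in U} F(x_i^{\top}\hat\beta_U)
             - \tfrac{1}{n_R}\textstyle\sum_{i \in R} F(x_i^{\top}\hat\beta_U)\Big]}_{\text{characteristics (explained)}}
+ \underbrace{\Big[\tfrac{1}{n_R}\textstyle\sum_{i \in R} F(x_i^{\top}\hat\beta_U)
             - \tfrac{1}{n_R}\textstyle\sum_{i \in R} F(x_i^{\top}\hat\beta_R)\Big]}_{\text{coefficients (unexplained)}}
```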

Multiplicative noise masking is a well-known method for perturbing data for privacy protection purposes. The twin uniform distribution has been introduced in the literature as a distribution for multiplicative noise, given its simple mathematical form and its ability to provide good value protection without sacrificing statistical utility. We explore the impact of various distribution parameters on privacy protection and utility loss when multiplying by twin uniform noise for data masking, and propose an approach to optimise the multiplicative noise scheme with a twin uniform noise distribution. We applied the optimisation algorithm to a real accounts payable dataset and conclude that it yields good results for both privacy and utility.
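
A minimal Python sketch of multiplicative masking with twin uniform noise, assuming a common parameterisation in which the noise factor is uniform on [1-b, 1-a] or [1+a, 1+b] with equal probability (the exact parameterisation used by the authors may differ):

```python
import numpy as np

# Every record is perturbed by between a and b in relative terms,
# so no masked value sits too close to its original.
def twin_uniform_mask(x, a=0.05, b=0.15, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    sign = rng.choice([-1.0, 1.0], size=len(x))
    factor = 1.0 + sign * rng.uniform(a, b, size=len(x))
    return x * factor

# Hypothetical accounts payable amounts
payables = np.array([1200.00, 87.50, 45000.00])
print(twin_uniform_mask(payables, rng=np.random.default_rng(0)))
```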

Animal models are essential in the pre-clinical development of vaccines. This work is motivated by preliminary studies to determine whether a candidate vaccine induces a protective immune response suitable for further evaluation. Such a preliminary study needs adequate statistical power to detect a large "all-or-nothing" difference in the probability of disease (e.g. P1 = 0.975 vs P2 = 0.3) with small sample sizes (fewer than 10 per group) to minimise the economic and welfare costs of using animals in research. Most "off-the-shelf" power calculators for two-proportion comparisons use approximate methods (e.g. the power.prop.test function in R), despite these being inappropriate for small sample sizes. Alternatively, since the sample space of all possible outcomes is small, the true power can readily be calculated for a more "exact" test such as Fisher's or Barnard's. We propose that this power calculation method be used for preliminary studies in the interests of improving animal welfare.
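
A minimal Python sketch of such an exact power calculation, enumerating the full sample space and using Fisher's exact test (the group sizes here are illustrative):

```python
from scipy.stats import binom, fisher_exact

# True power = sum of outcome probabilities over all outcomes that reject H0.
def exact_power(p1, p2, n1, n2, alpha=0.05):
    power = 0.0
    for x1 in range(n1 + 1):
        for x2 in range(n2 + 1):
            _, pval = fisher_exact([[x1, n1 - x1], [x2, n2 - x2]])
            if pval <= alpha:
                # probability of observing this (x1, x2) under (p1, p2)
                power += binom.pmf(x1, n1, p1) * binom.pmf(x2, n2, p2)
    return power

print(exact_power(p1=0.975, p2=0.3, n1=8, n2=8))  # e.g. 8 animals per group
```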

Bayesian experimental design is a well-established methodology for planning data collections. With such an approach, designs are typically found by maximising the expectation of a utility function with respect to the joint distribution of the parameters and the response, conditional on an assumed statistical model. In practice, specifying such a model can be difficult due to incomplete knowledge of the data generating process. This can be rather problematic as a misspecified model can lead to inefficiencies in data collection and/or conclusions that are misleading. To address this, we present an approach to find Bayesian designs that are robust to the assumed model. To do so, we propose to determine designs based on flexible modelling structures such as those based on spline models. This approach is motivated by real-world sampling problems in environmental monitoring and agriculture where we assess the performance of our methodology against more standard practices.
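
For reference, the standard expected-utility formulation of Bayesian design is:

```latex
% Optimal design maximises expected utility over the design space D
d^{*} \;=\; \arg\max_{d \in \mathcal{D}} \;
\mathbb{E}_{\theta,\, y \mid d}\!\left[\, u(d, \theta, y) \,\right]
\;=\; \arg\max_{d \in \mathcal{D}} \int\!\!\int u(d, \theta, y)\,
p(y \mid \theta, d)\, p(\theta)\, \mathrm{d}y\, \mathrm{d}\theta
```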

Session 1.3: Statistics in Practice

Outlier detection is one of the most critical procedures in statistical modelling and should be considered before any formal testing is performed (Cook and Weisberg, 1982). The alternative outlier model (AOM) approach of Cook et al. (1982) was developed in the context of ordinary linear models. The most general approach for outlier detection in a linear mixed model (LMM) framework is presented in Haslett and Hayes (1998) and Haslett and Haslett (2007). Their focus was on the definition and roles of residuals with a general covariance structure.

Gumedze et al. (2010) and Gumedze (2018) considered simple variance component models and proposed likelihood ratio and score test statistics to determine whether individual observations have inflated variance. Under the assumption of independent random effects and residuals, this approach appeared to provide a reasonable Type I error rate for multiple testing. The full parametric bootstrap procedure, however, was reported to be computationally demanding.

In this talk, we extend the use of the AOM in an LMM framework and derive residual maximum likelihood (REML) score tests applicable to the residuals in the model. Our approach also accommodates correlated effects and uses an efficient resampling scheme that does not require re-fitting the null model in each iteration. A simulation study shows that our approach provides accurate Type I error rates.
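
For context, a common variance-shift formulation of the AOM for a suspect observation i is (a sketch in our notation, following the Gumedze-style setup):

```latex
% Variance-shift form of the AOM: an extra random effect for observation i
% inflates that observation's variance
y \;=\; X\beta + Zu + \delta_i \omega_i + e, \qquad \omega_i \sim N(0, \sigma_{\omega}^2)
% the score (or likelihood ratio) statistic tests H0: sigma_omega^2 = 0
```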

This paper introduces a semi-supervised learning technique for model-based clustering. Our focus is on applying it to matrices of ordered categorical response data, such as those obtained from surveys with Likert-scale responses. We use the proportional odds model, which is popular and widely used for analysing such data, as the model structure. Our proposed technique is designed for analysing datasets that contain both labeled and unlabeled observations from multiple clusters. To evaluate the performance of our proposed model, we conducted a simulation study in which we tested the model under six different scenarios, each with varying combinations and proportions of known and unknown cluster memberships. The fitted models accurately estimate the parameters in most of the designed scenarios, indicating that our technique is effective in clustering partially-labeled data with ordered categorical response variables. To illustrate our approach, we use a real-world dataset from the aquaculture sector.
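
For reference, the proportional odds model for an ordered response with J categories takes the following form (with component-specific parameters as one natural extension for model-based clustering; notation ours):

```latex
% Proportional odds model within latent cluster k (J ordered categories)
\log \frac{P(Y_i \le j \mid z_i = k)}{1 - P(Y_i \le j \mid z_i = k)}
\;=\; \theta_{jk} - x_i^{\top}\beta_k, \qquad j = 1, \dots, J-1
```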

Spatially misaligned data, where the response and covariates are observed at different spatial locations, commonly arise in many environmental studies. Motivated by spatially misaligned data collected on air pollution and weather in China, we propose a cokrig-and-regress (CNR) method to estimate spatial regression models involving multiple covariates and potentially non-linear associations. The CNR estimator is constructed by replacing the unobserved covariates (at the response locations) by their cokriging predictor derived from the observed but misaligned covariates under a multivariate Gaussian assumption, where a generalized Kronecker product covariance is used to account for spatial correlations within and between covariates. Simulation studies demonstrate that CNR outperforms several existing methods for handling spatially misaligned data, such as nearest-neighbour interpolation. Applying CNR to the spatially misaligned air pollution and weather data in China reveals a number of non-linear relationships between PM2.5 concentration and several meteorological covariates.
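
Under the multivariate Gaussian assumption, the cokriging predictor used in the first stage has the standard conditional-mean form (notation ours):

```latex
% Cokriging (conditional mean) predictor of covariates at response location s0,
% given misaligned covariate observations x_o with mean mu_o
\hat{x}(s_0) \;=\; \mu_x(s_0) + \Sigma_{0o}\,\Sigma_{oo}^{-1}\,(x_o - \mu_o)
```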

Pooled testing (or group testing) arises when units are pooled together and tested as a group for the presence of an attribute, such as a disease. It originated in blood testing, but has been applied in many fields, including prevalence estimation of mosquito-borne viruses and plant disease assessment – the two fields in which we have encountered the technique.

Confidence intervals for proportions estimated by pooled testing have involved both exact and asymptotic methods. Hepworth and Biggerstaff (2017, 2021) showed that Firth's correction (Firth 1993) to maximum likelihood estimation effectively reduces bias in pooled testing. Considering the Firth-corrected score as the first derivative of a penalised likelihood, we develop confidence intervals wholly within this framework, evaluate their performance, and compare them to the existing, recommended asymptotic method.
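
For background, the uncorrected maximum likelihood estimate of the prevalence p from n pools of common size m, of which X test positive, is the standard result (this is the point estimate the Firth correction adjusts, not the talk's new interval method):

```latex
% MLE of prevalence p from n pools of common size m, X of which test positive
\hat{p} \;=\; 1 - \left(1 - \frac{X}{n}\right)^{1/m}
```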

The methods are illustrated using data on yellow fever virus and West Nile virus infection in mosquitoes.

The big data era demands new statistical analysis paradigms. For structured big data, such as multicentre data, there are challenges to analysing complex data, including privacy protection, large-scale computational resource requirements, heterogeneity, and correlated observations. Random effects models (REMs) are among the most popular statistical methods for handling complexity such as nested data structures, but they are often infeasible for big and complicated data due to the memory and storage limitations of standard computers. To address this, we use a "Divide and Combine" (D&C or DAC) method, in which results from sub-analyses performed on separate data subsets are combined using two approaches: summary-statistics DAC (federated DAC) and horizontal DAC (centralised DAC). To assess their efficacy, we apply both approaches to real data and compare them with the gold-standard method of directly fitting REMs to the pooled dataset.
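
One common combining rule for the summary-statistics (federated) approach is inverse-variance weighting of the subset estimates (a generic sketch, not necessarily the exact rule used here):

```latex
% Inverse-variance weighted combination of K subset estimates
\hat{\beta}_{\mathrm{DAC}}
\;=\; \Big(\sum_{k=1}^{K} \widehat{V}_k^{-1}\Big)^{-1}
\sum_{k=1}^{K} \widehat{V}_k^{-1} \hat{\beta}_k,
\qquad \widehat{V}_k = \widehat{\operatorname{var}}(\hat{\beta}_k)
```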

The integrated analysis of multi-omics datasets is a challenging problem, owing to their complexity and high dimensionality. Numerous integration tools have been developed, but they are heterogeneous in terms of their underlying statistical methodology, input data requirements, and visual representation of the results generated. I present the moiraine R package, which provides a framework to consistently integrate and visualise multi-omics datasets. In particular, moiraine enables the construction of insightful and context-rich visualisations that facilitate the interpretation of integration results by domain experts. With moiraine, it is also possible to compare the results obtained with different integration tools, providing confidence in the biological relevance of the results obtained.

Day 02

Session 2.1: Bayesian + Animal & Plant Breeding

A key problem in astronomy is the identification of novel celestial objects in the sky. Of particular interest is studying so-called "transient" sources whose changing brightness over time reveals interesting properties about our universe. However, these novel sources must be found among the thousands of other sources that a wide field-of-view astronomical survey might observe. A source's brightness as a function of time is known as its light curve, and my work entails the analysis of light curves as observed by MeerKAT, a Square Kilometre Array precursor telescope in South Africa. I have applied Gaussian process (GP) regression to these light curves and found that the distribution of the fitted GP hyperparameters revealed patterns useful for distinguishing between different types of celestial objects. I have compared my results with the variability metrics more commonly used in radio astronomy and found that my approach grants improved discriminatory power and interpretability.
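
A minimal Python sketch of GP regression on a light curve, using synthetic data in place of MeerKAT observations (the kernel choice and data are illustrative assumptions, not the talk's exact setup):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel, WhiteKernel

# Toy light curve: flux sampled at irregular times (days)
rng = np.random.default_rng(42)
t = np.sort(rng.uniform(0.0, 100.0, 40))[:, None]
flux = np.sin(t.ravel() / 10.0) + 0.1 * rng.standard_normal(40)

# Amplitude * RBF captures smooth variability; WhiteKernel absorbs noise.
kernel = ConstantKernel(1.0) * RBF(length_scale=10.0) + WhiteKernel(noise_level=0.1)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(t, flux)

# The fitted hyperparameters (amplitude, length-scale, noise level) are the
# kind of features whose distribution can separate source classes.
print(gp.kernel_)
```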

Understanding crustacean moulting is vital in fisheries, as this phase increases vulnerability, resulting in higher mortality, product loss, and reduced value for certain species. Fisheries often implement closures during moulting, but timing is challenging due to geographic variations or sensitivity to environmental conditions. We introduce a Bayesian model for estimating moult timing using routinely collected datasets in crustacean fisheries, applied to Southern Rock Lobster around Tasmania. Our method, utilizing growth increments, appendage damage, and pleopod regeneration data, offers a robust analysis of spatial variation. It highlights a minor misalignment between the commercial fishing season closure and the moulting period in Tasmania. This flexible approach can be widely applied, integrating diverse data types available in different fisheries, providing a comprehensive understanding of moulting characteristics for effective management.

The increase in computational power, coupled with the availability of location data and advanced statistical methods, has fuelled interest in disease mapping and spatial epidemiology. However, the prevalence of geocoding errors and misallocated data introduces significant challenges to the accuracy of these analyses, and these anomalies are present in most Australian administrative health datasets, including cancer registry and hospital datasets. Misallocated cases can have a substantial impact on analyses conducted, including misidentifying relationships between covariates and outcomes [2, 3]. This project aims to develop practical and efficient methods to detect spatial anomalies in health datasets, starting with Bayesian spatial models to identify anomalies, followed by non-Bayesian smoothing methods such as Gaussian processes and kriging, and exploring the use of variograms. In the presentation, I will share findings from the ongoing project, followed by future directions in spatial anomaly detection with geospatial datasets in Australia.

Plant and animal breeders face challenges like productivity fluctuations due to climate change and global competition. To sustain long-term genetic gains, we propose optimal design strategies using multi-objective Bayesian optimization. This involves meticulously selecting optimal parental combinations each generation or crop cycle, guiding breeding endeavors.

We propose a novel statistical method using dynamic programming to optimize parental selection in breeding programs. Our approach focuses on maximizing Genomic Estimated Breeding Values (GEBV) and genetic diversity. This involves selecting optimal parental combinations from a population pool, considering both additive and non-additive genomic random effects, and using pedigree information to maintain breeding values.

We compare two acquisition functions, probability of improvement and expected improvement, as design criteria in the optimization. GEBV are predicted using Genomic Best Linear Unbiased Prediction (GBLUP). Our results show promising long-term genetic gains compared to traditional methods. Additionally, the method is adaptable for plant breeding, incorporating genomic and environmental interactions.
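
For reference, with posterior mean μ(x), standard deviation σ(x) and incumbent best f*, the two acquisition functions have the standard closed forms:

```latex
% Probability of improvement (PI) and expected improvement (EI)
% for maximisation under a Gaussian posterior
z(x) = \frac{\mu(x) - f^{*}}{\sigma(x)}, \qquad
\mathrm{PI}(x) = \Phi\!\big(z(x)\big), \qquad
\mathrm{EI}(x) = \big(\mu(x) - f^{*}\big)\,\Phi\!\big(z(x)\big) + \sigma(x)\,\phi\!\big(z(x)\big)
```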

Genomic prediction (GP) is an emerging data-driven technology for plant breeding, which provides an opportunity to predict phenotypes or breeding values of crops based on both genomic and environmental information. From a statistical computation perspective, adding the interactions between environmental covariates and genome-wide markers may significantly increase the model dimension and create a computational burden. We developed a dimension reduction method based on the collinearity among the DNA markers and among the environmental covariates, which can represent the large number of genotype and environment interactions using a small number of summary statistics. We then propose to use Bayesian spike-and-slab regression, as well as deep convolutional neural networks, as the predictive models. The predictive performance of these methods was evaluated on a large-scale cotton GP dataset collected from the CSIRO cotton breeding program, with more than 3000 genotypes, 50 environmental covariates and over 10,000 phenotype records, and showed promising prediction accuracies for economically important cotton traits such as lint yield and fibre quality.

On-farm experiments (OFE) are gaining attention among farmers and agronomists for testing various research questions on real farms. The Analytics for the Australian Grains Industry (AAGI) has developed several techniques for analysing OFE data. Geographically Weighted Regression (GWR) and the multi-environment trial (MET) technique, which partitions paddocks into pseudo-environments (PEs), have proven effective. Additionally, we have explored the potential of the Generalised Additive Model (GAM) for handling temporal and spatial variability, given its flexibility in accommodating non-linear variables. In this presentation, we will demonstrate case studies using these techniques to analyse OFE data and compare the outcomes of different approaches.

Day 03

Session 3.1: Geology / Indigenous rock art

The CIELAB colour space index converts a colour spectrum into perceptual colour coordinates (L*, a*, b*) defined by CIE international standards. L* represents lightness, a* the red-green balance, and b* the blue-yellow balance. Unlike RGB, CIELAB is perceptually uniform, meaning a given numerical change in coordinates reflects the same perceptible change in colour. This index converts colour spectrum data for wavelengths 380 to 780 nm (n = 950) into three numbers (L*, a*, b*). Traditionally, the Euclidean distance (ΔE) between L*, a*, b* coordinates has been used to detect perceptual colour differences, with ΔE > 2 indicating a difference. The Murujuga Rock Art Monitoring Program examines the impact of industrial air emissions on rock art near Karratha in Western Australia. One component uses spectral colour measurements to monitor long-term trends, but extreme weather has caused instrument faults. These faults are visible in the spectrum plots but disappear when converted to CIELAB, appearing as valid measurements.
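
A minimal Python sketch of the classic CIE76 colour difference, which is simply the Euclidean distance between two (L*, a*, b*) coordinates (the example values are hypothetical):

```python
import numpy as np

def delta_e_cie76(lab1, lab2):
    """Euclidean distance (CIE76 delta-E) between two Lab colours."""
    return float(np.linalg.norm(np.asarray(lab1) - np.asarray(lab2)))

# Hypothetical repeat measurements of a rock surface:
# delta E > 2 would be read as a perceptible colour change.
print(delta_e_cie76((52.0, 3.1, -2.4), (53.6, 3.0, -2.0)))
```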

Spatial kernel smoothing is a technique to visualise spatial point patterns. However, extreme outliers in the data can distort the smoothed surface and mislead interpretation. These outliers often arise in data collected on rocks using portable X-ray fluorescence (pXRF). Geoscientists often classify rocks based on their proportions of elements. For example, the division of igneous rocks into 'mafic' and 'felsic' rocks is based on the proportions of silica and titanium measured by pXRF: silica typically ranges between 45,000 and 300,000 ppm, but there may be a few extreme values around 700,000 ppm or higher. To address these extreme outliers, we present a winsorization technique for spatial smoothing, which reduces the impact of spurious outliers while preserving genuine extreme spatial information. The proposed technique is applied to silica and titanium data collected on rocks in Murujuga, Western Australia. Performance is assessed in terms of root mean square error (RMSE) before and after winsorization.
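
A minimal Python sketch of quantile-based winsorization as it might be applied before smoothing (the data and cut-offs are illustrative; scipy.stats.mstats.winsorize offers similar functionality via trimming proportions):

```python
import numpy as np

# Clamp values outside chosen quantiles before kernel smoothing.
def winsorize(x, lower_q=0.005, upper_q=0.98):
    lo, hi = np.quantile(x, [lower_q, upper_q])
    return np.clip(x, lo, hi)

# Hypothetical pXRF silica readings (ppm) with a few spurious extremes
rng = np.random.default_rng(7)
silica = rng.uniform(45_000, 300_000, 200)
silica[:3] = 700_000
print(winsorize(silica).max())  # extremes clamped toward the bulk of the data
```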

Methods to establish the porosity of homogeneous materials are well established and are used in geology to determine the size of aquifers and hydrocarbon reservoirs. However, these methods are not suited to determining porosity over a changing gradient. To study rock weathering and surface durability, an understanding of how porosity behaves over a vertical section of changing structure is needed. Spatial smoothing can be used to determine a porosity curve that changes with distance from the surface. The technique is based on Scanning Electron Microscope (SEM) images. It uses morphological openings and connected components to define the rock sample surface, and then applies Nadaraya-Watson smoothing with small bandwidths to estimate porosity at different surface depths. The technique was applied to multiple rock samples, and differing bandwidths and opening morphologies were compared.
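
For reference, the Nadaraya-Watson estimate of porosity m(d) at depth d, from pixel-level porosity indicators y_i observed at depths d_i, has the standard form (notation ours):

```latex
% Nadaraya-Watson kernel estimate of porosity at depth d, bandwidth h
\hat{m}_h(d) \;=\;
\frac{\sum_{i} K\!\big((d - d_i)/h\big)\, y_i}
     {\sum_{i} K\!\big((d - d_i)/h\big)}
```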

Session 3.2: Applied Statistics

Field crop agronomy projects involve multiple experiments exploring varied agronomic treatment factors, often with inconsistent levels of these treatments investigated across environments. Addressing key questions is challenging because it requires analysing data across a selection of these experiments with consistent treatments and treatment levels. This complexity is exacerbated when needing to address key questions for differing audiences; for example, growers seeking insights into the benefits of changing agronomic practices, and researchers aiming to understand the underlying drivers.

Motivated by a weed suppression project, this presentation explores how experiments and treatments were selected based on the key questions from diverse audiences. The proposed approach maximises the available data, and thus information, by using a tailored definition of “environment” to explore environment x treatment interactions for consistent treatments and/or treatment levels. The approach is performed using linear mixed models, implemented via the Restricted Maximum Likelihood (REML) procedure in Genstat.

Following an outbreak of pasture dieback in Queensland in 2015, considerable research effort has been expended in seeking the cause of dieback and developing management strategies. Pasture dieback is a condition that causes death of high yielding tropical and sub-tropical grasses in eastern areas of Queensland and north-eastern New South Wales. Symptoms include leaf discolouration (reddening and/or yellowing) then unthrifty growth, which can be difficult to differentiate from water and temperature stress, nutrient deficiency, fungal infections, and herbicide damage. Symptoms progress until the whole plant dies, becoming grey and brittle. Affected areas spread from roughly circular patches to paddock scale areas of dead pasture. The vague definition of dieback, difficulty in correct diagnosis and the fact that it always results in plant death posed considerable challenges relating to trial design. Come with us on our journey of discovery as we attempted to conquer these obstacles.

An African Swine Fever outbreak was simulated among wild boar (WB) and pig farm (PF) populations on a fictional territory (Picault et al. 2022). Three successive periods were considered after the detection of the initial outbreak, with alternative control measures. Participating teams had to predict epidemiological trajectories. We tailored a stochastic compartmental model to predict spatial densities of newly infected WB and estimate the risk of PF becoming infected. It accounted for age, sex and movement behaviour in WB, as well as the biosecurity level of PF. We estimated parameters by evaluating the adequacy of predictions against the epidemiological trajectories provided at the start of the challenge scenarios. Our final predictions were satisfactory: our model performed similarly to those of other participating teams and provided a complementary point of view. Decision-makers can assess the gain from implementing particular measures to contain and eliminate the outbreak. We also discuss our modelling limitations.

Session 3.3: Applied Statistics

Mammal-resistant fences have enabled the eradication of exotic mammals from ecosanctuaries in New Zealand. However, preventing the re-invasion of mice has proven problematic! Indeed, in many fenced ecosanctuaries mice remain present and can reach high numbers.

Scientists at Manaaki Whenua – Landcare Research have been studying the impacts of mice on biodiversity at Sanctuary Mountain Maungatautari. Two independently fenced sites within the sanctuary were managed to achieve high mouse numbers at the first site and undetectable mouse numbers at the second site. After 2 years, management protocols were switched, with mice eradicated from the first site and their numbers allowed to increase at the second. Data on abundance of invertebrates, seedlings and fungi were collected throughout the duration of the study.

Linear mixed models with smoothing splines were used to model the temporal trends in abundance, leading to the conclusion that mice may be catastrophic in ecosanctuaries that focus on the recovery of invertebrates.

Annual variations in weather patterns drive many decisions in agriculture. Readily available seasonal summaries, such as those from the Bureau of Meteorology (BOM), are often not focused on the specific details that are of interest to growers. For example, 'how wet was it today' and 'how hot was it' are relatively easy to determine from weather reports and maps on the BOM website, but these point measures of climate are less important to growers than seasonal summaries and comparisons with previous seasons.

The Climate and Weather team at DPIRD provides maps and graphical summaries of weather relevant to the agricultural industry. These include break of season, available soil water, thermal time and frost durations. Maps are also produced quickly in response to notable events, such as heat stress, erosion-strength winds, and storms. Maps from the 2024 season will be presented.

Neutron stars (NSs) are the densest form of matter in the universe, formed from the death of massive stars. A significant fraction of NSs are found in binary systems with stars like our Sun, where they "accrete" matter from this companion star. Characterizing the population of NSs is an important component of understanding the evolution and death of stars in our Galaxy. From an observational perspective, astronomers only detect these NSs when an episode of accretion "outburst" occurs (lasting days) between long episodes of quiescence (months to decades). These periods of quiescence, combined with the comparatively short baseline of observations (a few decades), impose heavy left censoring (absence of data about when a quiescence period began), which impacts our ability to make inferences about the population of NSs in the Galaxy. In this talk, we present the unique complexities of astronomical observations in this context and possible statistical strategies to mitigate these issues.