Research
- Research areas
- PhDs
- Publications
- Previous work
- Old but still accessible datasets
Research areas
I am a senior lecturer in the Bioinformatics and Health Informatics Group of the Department of Computer Science, Aberystwyth University. My interests include machine learning and data science, text analysis, genome and metagenome analysis and microbial bioinformatics. I am interested in sequence analysis for genomics, time series and text processing, and in particular in what we can do with computers (algorithms, data structures and artificial intelligence) to help with this analysis. DNA/RNA sequencing allows us to inspect the genetic composition of microbes, animals, plants and viruses, but once we have obtained the sequences, what can we learn? How are communities changing over time? How are enzymes within a community specialised for different roles? How can we detect genes in communities of organisms that have never been cultured? I'm also interested in the ethical and moral implications of genomic analysis.
PhD students
PhD students I've supervised or co-supervised, past and present:
Current: Emma Liu, Lily Major
Completed: James Ravenscroft, Nick Dimonaco, Sam Nicholls, Emmanuel Isibor, Elizabeth Donkin, Michael Riley
Publications
- Major, L., Clare, A., Daykin, J. W., Mora, B., Zarges, C. (2025) Heuristics for the Run-length Encoded Burrows-Wheeler Transform Alphabet Ordering Problem. Journal of Heuristics 31, article 11, and arXiv preprint
- Clare, A., Aubrey, W., Surette, M.G., and Dimonaco, N. (2024) Predicting coding regions on unassembled reads, how hard can it be? Poster at Genome Informatics 2024
- Clare, A., Aubrey, W., Surette, M. and Dimonaco, N. (2024) Partial gene predictions on unassembled reads: evaluating the Good, the Bad and the slightly ORF. Poster at RECOMB-Seq
- Major, L., Clare, A., Daykin, J. W., Mora, B., Zarges, C. (2024) A visualization tool to explore alphabet orderings for the Burrows-Wheeler Transform. arXiv preprint
- Dimonaco, N. J., Clare, A., Kenobi, K., Aubrey, W. and Creevey, C. (2023) StORF-Reporter: Finding genes between genes. Nucleic Acids Research 51(21) p11504-11517, or bioRxiv preprint
- Dimonaco, N. J., Aubrey, W., Kenobi, K., Clare, A. and Creevey, C. (2022) No one tool to rule them all: Prokaryotic gene prediction tool performance is highly dependent on the organism of study. Bioinformatics 38(5):1198-1207, or bioRxiv preprint
- Ravenscroft, J., Cattan, A., Clare, A., Dagan, I., Liakata, M. (2021) CD2CR: Co-reference Resolution Across Documents and Domains EACL 2021. (also see pre-submission arXiv preprint)
- Nicholls, S. M., Aubrey, W., de Grave, K., Schietgat, L., Creevey, C. and Clare, A. (2021) On the complexity of haplotyping a microbial community. Bioinformatics 37(10):1360-1366, preprint
- Ravenscroft, J., Clare, A. and Liakata, M. (preprint, 2020) Measuring prominence of scientific work in online news as a proxy for impact. arXiv preprint
- Major, L., Clare, A., Daykin, J. W., Pena Gamboa, L., Mora, B., Zarges, C. (2020) Evaluation of a Permutation-Based Evolutionary Framework for Lyndon Factorizations. PPSN 2020. Accepted version of the paper.
- Jozwik, A., Aubrey, W., Clare, A., Smallbone, W., Martin, K. (2019) Intelligent Decision Making Support for Water Quality Monitoring. Presentation OR61A156 at OR 61.
- Nicholls, S. M., Aubrey, W., Edwards, A., de Grave, K., Schietgat, L., Huws, S., Soares, A., Creevey, C. and Clare, A. (preprint, 2019) Recovery of gene haplotypes from a metagenome bioRxiv preprint.
- Clare, A., Daykin, J. W., Mills, T. and Zarges, C. (2019) Evolutionary search techniques for the Lyndon factorization of biosequences. Workshop on Evolutionary Computation for Permutation Problems, GECCO 2019. Accepted version of the paper.
- Clare, A. and Daykin, J. W. (2019) Enhanced string factoring from alphabet orderings. Information Processing Letters 143:4-7. Also arXiv preprint. See also the poster for Genome Science 2018.
- Ravenscroft, J., Clare, A. and Liakata, M. (2018) HarriGT: A Tool for Linking News to Science. Proceedings of ACL 2018, System Demonstrations P18-4004, p19-24. Try out HarriGT.
- Garland, O., Clare, A. and Aubrey, W. (preprint, 2018) GiraFFe Browse: A lightweight web based tool for inspecting GFF and FASTA data bioRxiv preprint.
- Nicholls, S. M., Aubrey, W., Edwards, A., de Grave, K., Schietgat, L., Huws, S., Soares, A., Creevey, C. and Clare, A. (preprint, 2018) Computational haplotype recovery and long-read validation identifies novel isoforms of industrially relevant enzymes from natural microbial communities bioRxiv preprint.
- Ravenscroft, J., Liakata, M., Clare, A. and Duma, D. (2017) Measuring Scientific Impact Beyond Academia: An assessment of existing impact metrics and proposed improvements. PLoS One doi:10.1371/journal.pone.0173152, blog post about measuring scientific impact, download the data
- Donkin, E., Dennis, P., Ustalkov, A., Warren, J. and Clare, A. (2017) Replicating complex agent based models, a formidable task. Environmental Modelling and Software, 92:142-151
- Veneman, J.B., Saetnan, E., Clare, A., Newbold, C. (2016). MitiGate; an online meta-analysis database for quantification of mitigation strategies for enteric methane emissions. Science of the Total Environment 572 pp. 1166-1174 doi: 10.1016/j.scitotenv.2016.08.029
- Nicholls, S. M., Clare, A. and Randall J. C. (2016) Goldilocks: a tool for identifying genomic regions that are 'just right'. Bioinformatics 32 (13): 2047-2049, doi: 10.1093/bioinformatics/btw116, blog post about Goldilocks.
- Duma, D., Liakata, M., Clare, A., Ravenscroft, J., Klein, E. (2016) Rhetorical Classification of Anchor Text for Citation Recommendation. WOSP Workshop (5th Intl Workshop on Mining Scientific Publications). Full text of article in D-Lib Magazine.
- Aubrey, W., Riley, M. C., Young, M., King, R. D., Oliver, S. G. and Clare, A. (2015) A Tool for Multiple Targeted Genome Deletions that Is Precise, Scar-Free, and Suitable for Automation. PLOS One 10(12): e0142494 doi: 10.1371/journal.pone.0142494, blog post about seamless gene deletion.
- Sapstead, S., Daniel, I. and Clare, A. (2015) Automatically Geotagging Articles in the Welsh Newspapers Online Collection. In proceedings of AI 2015. doi: 10.1007/978-3-319-25032-8_28
- Runciman, C., Clare, A. and Harkness, R. (2014) Laboratory automation in a functional programming language. Journal of Laboratory Automation 2014 Dec; 19(6):569-76. doi: 10.1177/2211068214543373. Blog post describing this article, github code and preprint pdf.
- Riley, M. C., Aubrey, W., Young, M. and Clare, A. (2013) PD5: a general purpose library for primer design software. PLoS One, DOI: 10.1371/journal.pone.0080156. Get the code at the PD5 web site.
- Ravenscroft, J., Liakata, M. and Clare, A. (2013) Partridge: An effective system for the automatic classification of the types of academic papers. In proceedings of AI 2013, Dec 2013. Try out the Partridge system!
- Sparkes, A. and Clare, A. (2012) AutoLabDB: a substantial open source database schema to support a high-throughput automated laboratory. Bioinformatics 28(10) 1390-1397. doi: 10.1093/bioinformatics/bts140 (abstract, pdf).
- Clare, A., Croset, A., Grabmueller, C., Kafkas, S., Liakata, M., Oellrich, A., Rebholz-Schuhmann, D. (2011) Exploring the Generation and Integration of Publishable Scientific Facts Using the Concept of Nano-publications. 1st International Workshop on Semantic Publication (SePublica 2011). pdf.
- Alsberg, B. and Clare, A. (2010) Wiki based management of chemometric research projects. Journal of Chemometrics 24(7-8) p408-417
- Sparkes, A., Aubrey, W., Byrne, E., Clare, A., Khan, M. N., Liakata, M., Markham, M., Rowland, J., Soldatova, L. N., Whelan, K. E., Young, M. and King, R. D. (2010) Towards Robot Scientists for autonomous scientific discovery. Automated Experimentation 2010, 2:1 doi:10.1186/1759-4499-2-1
- Sparkes, A., King, R. D., Aubrey, W., Benway, M., Byrne, E., Clare, A., Liakata, M., Markham, M., Whelan, K. E., Young, M., Rowland, J. (2010) An Integrated Laboratory Robotic System for Autonomous Discovery of Gene Function JALA 15(1) pages 33-40.
- King, R. D., Rowland, J., Aubrey, W., Liakata, M., Markham, M., Soldatova, L. N., Whelan, K. E., Clare, A., Young, M., Sparkes, A., Oliver, S. G., Pir, P. (2009) The Robot Scientist Adam, IEEE Computer, vol. 42, no. 8, pp. 46-54, August, doi:10.1109/MC.2009.270
- King, R. D., Rowland, J., Oliver, S. G., Young, M., Aubrey, W., Byrne, E., Liakata, M., Markham, M., Pir, P., Soldatova, L. N., Sparkes, A., Whelan, K. E., Clare, A. (2009) The Automation of Science. Science 324(5923):85-89, 3rd April 2009. (preprint pdf, before final corrections)
- Soldatova, L., Aubrey, W., King, R. D. and Clare, A. (2008) The EXACT description of biomedical protocols. Bioinformatics 2008 24: i295-i303. Special issue for ISMB 2008. See also EXACT webpage.
- Riley, M.C., Clare, A. and King, R. D. (2007) Locational distribution of gene functional classes in Arabidopsis thaliana. BMC Bioinformatics 8:112
- Blockeel, H., Schietgat, L., Struyf, J., Dzeroski, S., Clare, A. (2006) Decision Trees for Hierarchical Multilabel Classification: A Case Study in Functional Genomics. In proceedings of PKDD 2006.
- Soldatova, L., Clare, A., Sparkes, A. and King, R. D. (2006) An ontology for a robot scientist. Bioinformatics 2006 22: 464-471. Also in ISMB 2006. Archived in CADAIR here.
- Clare, A., Karwath, A., Ougham, H. and King, R. D. (2006) Functional Bioinformatics for Arabidopsis thaliana. Bioinformatics 2006 22: 1130-1136
- Struyf, J., Dzeroski, S., Blockeel, H. and Clare, A. (2005) Hierarchical Multi-classification with Predictive Clustering Trees in Functional Genomics. In proceedings of the EPIA 2005 CMB Workshop. Springer link
- Clare, A. (2005) Integration of genomic and phenotypic data. Data Analysis and Visualization in Genomics and Proteomics, Eds. Francisco Azuaje and Joaquin Dopazo, Wiley, London. ISBN: 0-470-09439-7
- Clare, A., Williams, H. E. and Lester, N. (2004) Scalable multi-relational association mining. In proceedings of the 4th IEEE International Conference on Data Mining (ICDM '04). p355-358. abstract, software
- King, R. D., Wise, P. H. and Clare, A. (2004) Confirmation of Data Mining Based Predictions of Protein Function. Bioinformatics 20(7) 1110-1118, abstract, genepredictions.org
- Clare, A. and King, R. D. (2003) Predicting gene function in Saccharomyces cerevisiae. ECCB 2003 (published as a journal supplement in Bioinformatics 19: ii42-ii49), abstract
- Clare, A. (2003) Machine learning and data mining for yeast functional genomics. PhD thesis. University of Wales Aberystwyth. pdf (1Mb). This was a runner-up in the 2004 BCS Distinguished Dissertations Award.
- Clare, A. and King, R. D. (2003) Data mining the yeast genome in a lazy functional language. In Practical Aspects of Declarative Languages (PADL'03) (won Best/Most Practical Paper award), abstract, pdf
- Clare, A. and King, R. D. (2002) How well do we understand the clusters found in microarray data? In Silico Biol. 2, 0046, abstract, html, further data
- Clare, A. and King, R. D. (2002) Machine learning of functional class from phenotype data. Bioinformatics 18(1) 160-166. abstract, pdf, further data
- Clare, A. and King, R. D. (2001) Knowledge Discovery in Multi-Label Phenotype Data. In proceedings of ECML/PKDD 2001. abstract, pdf, further data, code
- King, R.D., Karwath, A., Clare, A., & Dehaspe, L. (2001) The Utility of Different Representations of Protein Sequence for Predicting Functional Class. Bioinformatics 17(5) 445-454. abstract, pdf, further data
- King, R.D., Karwath, A., Clare, A., & Dehaspe, L. (2000) Accurate prediction of protein functional class in the M. tuberculosis and E. coli genomes using data mining. Comparative and Functional Genomics 17 283-293 (nb: volume 1 of CFG was volume 17 of Yeast). actual article, preprint postscript, further data
- King, R.D., Karwath, A., Clare, A., & Dehaspe, L. (2000) Genome scale prediction of protein functional class from sequence using data mining. In: The Sixth International Conference on Knowledge Discovery and Data Mining (KDD 2000). pdf, further data
- Rose, T., Elworthy, D., Kotcheff, A., Clare, A., Tsonis, P. (2000) ANVIL: a system for the retrieval of captioned images using NLP techniques. In Challenge of Image Retrieval, Brighton, 2000. gzipped doc
Previous work
I held an RAEng/EPSRC Industrial Fellowship with Dŵr Cymru Welsh Water (2021-2022) on Advanced statistical process control for water treatment. This project investigated the automated detection of anomalous sensor readings in drinking water treatment, for early detection and risk management.
I held an RAEng/EPSRC Research Fellowship to "Engineer the Intelligent Scientific Laboratory" (2006-2011). This project involved work on the Robot Scientist project, where intelligent software created scientific hypotheses, designed experiments to distinguish between these hypotheses, controlled a lab robot to conduct these experiments, and then used the results to design the next round of experiments. There were many aspects to the work on this project, including data formalism, experimental protocols, data collection, inference and querying, planning and scheduling, and the practicalities of working in a real lab with real automation equipment.
Before this I held an 1851 Research Fellowship to investigate Grid-enabling lab robots for the Robot Scientist. This was a two-year project, Oct 2004 to Sep 2006.
Previously, as a postdoc on a BBSRC-funded grant, and as a PhD student, I've used machine learning (including ILP) and data mining (particularly multi-relational associations) for functional genomics - elucidating the biological functions of the parts of a genome. When a genome is sequenced, and we have the predicted locations of the genes within the genome, the next stage is to work out the possible functions of these genes. We've been looking at genes in Saccharomyces cerevisiae and Arabidopsis thaliana, the first plant genome to be sequenced.
Detailed results for yeast and Arabidopsis are available.
This has involved looking at ways to make use of different kinds of data: microarray data, sequence statistics, homology data, predicted secondary structure, QTLs, and phenotypic data. It has also involved finding ways to make use of background information and hierarchical information, and to take into account that proteins have more than one function, which makes this a classification problem where each item fits into more than one class.
I've also spent 3 months working with RMIT's Search Engine Group making a multi-relational data mining tool (Radar) based on inverted indexing.
Old but still accessible datasets
Back to Amanda Clare