Final year project suggestions - 2014/15

This page describes some of the final year project ideas I'm interested in supervising. Please contact me (Amanda Clare, afc@aber.ac.uk) if interested in one of these areas (or if you have similar ideas).

1. Analysis of the digitised newspapers held at the National Library of Wales (multiple projects)

The National Library recently digitised its collection of newspapers. A hackathon in January 2013 had a go at making use of the data for various projects. Browse examples of the articles and see what's available. The API documentation is also available.

Natural language processing libraries: http://www.nltk.org/ (Python), http://nlp.stanford.edu/software/ (Java), https://opennlp.apache.org/ (Java).

Examples of what could be done:

Biological species browser: in May a year ago we held a Bioblitz and counted the number of species that members of the public could find on campus. How many of these are also found in the historical newspapers? How could their stories be automatically summarised and presented in an accessible manner? Which species are not found?
Scientists in the news: who were the scientists in these newspapers from 1850-1910? Can we build a database and website about them, based on the information that can be automatically extracted from these newspapers? What subjects did they practise and where did they work? Why were they in the news? What words were most frequently used to describe them? "distinguished"? "Welsh"? "chemist"? "summoned"? "clever"? "saved"? "fined"?
Topic classification: the news articles are not labelled by kind of news (politics, sport, local topics, business, etc). Could you assign these labels automatically?

Problems include dealing with (and suggesting) spelling variations and coping with OCR misrecognition issues.
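As a starting point for the spelling variation problem, here is a minimal sketch using Python's standard difflib module. The function name suggest_variants, the 0.8 cutoff and the token list are assumptions made up for illustration; they are not taken from the newspaper data or its API.

```python
import difflib


def suggest_variants(term, vocabulary, cutoff=0.8, limit=5):
    """Return words from `vocabulary` that look like spelling or OCR variants of `term`."""
    candidates = {word.lower() for word in vocabulary}
    candidates.discard(term.lower())  # the term itself is not a variant
    return difflib.get_close_matches(term.lower(), sorted(candidates), n=limit, cutoff=cutoff)


# Hypothetical tokens, standing in for words pulled from OCR text.
tokens = ["Aberystwyth", "Aberystwith", "Abervstwyth", "Cardiff", "chemist"]
variants = suggest_variants("Aberystwyth", tokens)
print(variants)
```

A character-level similarity like this treats a historical spelling and an OCR error in the same way. Telling the two apart is part of the project.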

Keywords: text processing, machine learning, statistics, algorithms

Degree scheme: AI, CS, SE, IC

2. Extending Abermol (supervised by Edel or Amanda)

The project would be to extend some code that has been written by Edel in Haskell ("Abermol"). This code can be used to analyse the chemical structure of small molecules involved in reactions. The aim of this software is to determine which new molecules can be formed when bonds are broken. In order to do this, the code represents the chemical properties of elements in the periodic table (such as valencies), models the molecules as graphs containing nodes (atoms) and edges (bonds), calculates the new graphs that can be formed, and ensures that these new molecules are unique. This project would extend and develop this code so that it can be applied to a new line of research. Specifically, we would like the student to work on two areas.
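To give a feel for the graph representation described above, here is a small sketch. It is written in Python for brevity and is not how Abermol itself is structured; the class, the method names and the short valency table are assumptions for illustration only.

```python
# Usual valencies for a few elements (an illustrative subset, not Abermol's table).
VALENCY = {"H": 1, "C": 4, "N": 3, "O": 2}


class Molecule:
    """A molecule as a graph: atoms are nodes, bonds are edges with a bond order."""

    def __init__(self):
        self.elements = {}  # atom id -> element symbol
        self.bonds = {}     # frozenset({atom id, atom id}) -> bond order

    def add_atom(self, atom_id, element):
        self.elements[atom_id] = element

    def add_bond(self, a, b, order=1):
        self.bonds[frozenset((a, b))] = order

    def bond_order_sum(self, atom_id):
        return sum(order for pair, order in self.bonds.items() if atom_id in pair)

    def unsatisfied(self):
        """Atoms whose bond orders do not add up to their usual valency."""
        return [a for a, el in self.elements.items()
                if self.bond_order_sum(a) != VALENCY[el]]

    def without_bond(self, a, b):
        """Return a copy of the molecule with the bond between a and b removed."""
        result = Molecule()
        result.elements = dict(self.elements)
        result.bonds = {pair: order for pair, order in self.bonds.items()
                        if pair != frozenset((a, b))}
        return result


# Water: one oxygen (atom 0) bonded to two hydrogens (atoms 1 and 2).
water = Molecule()
water.add_atom(0, "O")
water.add_atom(1, "H")
water.add_atom(2, "H")
water.add_bond(0, 1)
water.add_bond(0, 2)

broken = water.without_bond(0, 1)
print(water.unsatisfied(), broken.unsatisfied())
```

After a bond is broken, the atoms reported by unsatisfied() are the places where new bonds could form. Deciding which of the resulting graphs are chemically sensible and unique is the harder part.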

1) The first is to integrate the results of the code with the ability to automatically query online chemical databases, such as ChEBI and ChemSpider, in order to determine if the chemicals produced by the software actually exist or can be purchased or manufactured, and hence would be available for experiments. This part of the project would introduce you to the existing code-base and to the SMILES notation for the structural description of chemicals.

2) The second part of this project is to add functionality to the software in order that, given a molecule, and a selection of alternative atoms and small groups, we could calculate all the "similar" molecules that could be produced by exchanging one or more atoms/groups for alternatives. In this way, we can start with an existing molecule and ask the software to enumerate all close variants of this molecule. This question is particularly important to us in our study of the use of markers for genome engineering. We wish to know about all the alternative molecules that a cell can be deceived into taking up, which would then disrupt a vital pathway.
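The enumeration in part 2 can be sketched in a very simplified form. The sketch below assumes the molecule has been reduced to a tuple of group labels at its substitutable positions, and that any listed alternative is chemically acceptable at that position. Both are assumptions for illustration: the real work would operate on the molecular graph and check valencies. The function name and the labels are hypothetical, and the sketch is in Python rather than Haskell.

```python
from itertools import product


def enumerate_variants(groups, alternatives):
    """Yield every variant that differs from `groups` in at least one position.

    groups: tuple of labels, one per substitutable position.
    alternatives: dict mapping a label to the labels that may replace it.
    """
    original = tuple(groups)
    options = []
    for label in original:
        choices = [label]
        for alt in alternatives.get(label, []):
            if alt not in choices:  # skip duplicates so each variant appears once
                choices.append(alt)
        options.append(choices)
    for candidate in product(*options):
        if candidate != original:
            yield candidate


# Hypothetical labels for three substitutable positions.
start = ("H", "OH", "CH3")
swaps = {"H": ["F", "Cl"], "OH": ["NH2"]}
variants = list(enumerate_variants(start, swaps))
print(len(variants))
```

The number of variants is the product of the number of choices at each position, minus one for the original, so it grows quickly. Limiting the number of simultaneous substitutions is one way to keep "close variants" manageable.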

Keywords: Haskell, bioinformatics

Degree scheme: AI, CS, SE

3. Version control of DNA sequences

This project would explore the use of various version control systems (or create a new one) for recording small changes to long DNA sequences. Labs need to record what changes they make to DNA, and this needs to be archived for a long time, while postdocs and technicians come and go. Software version control likes to use lines of code as a default unit. What is the equivalent for DNA sequences? At which point does the system break? How can we incorporate compression of the long sequences and still ask what changed, in a reasonable amount of time and memory?
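One way to start thinking about "what changed" without lines is to treat each character as the unit. The sketch below uses Python's standard difflib to record a change as a list of edits and to reapply it. The function names and the two short sequences are made up for illustration. difflib is not designed for genome-sized input, so finding where this approach stops being practical is part of the question.

```python
import difflib


def make_patch(old, new):
    """Describe how to turn `old` into `new` as (start, end, replacement) edits on `old`."""
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    return [(i1, i2, new[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]


def apply_patch(old, patch):
    """Rebuild the new sequence from `old` and a patch made by make_patch."""
    pieces = []
    cursor = 0
    for start, end, replacement in patch:
        pieces.append(old[cursor:start])  # unchanged stretch before the edit
        pieces.append(replacement)        # inserted or substituted bases
        cursor = end                      # skip the bases that were replaced
    pieces.append(old[cursor:])
    return "".join(pieces)


# Two short made-up sequences differing by a substitution and an insertion.
reference = "ATGGCGTACGTTAGCTAA"
edited = "ATGGCGTTCGTTAGGCTAA"
patch = make_patch(reference, edited)
print(patch)
```

Storing only the patch alongside a reference sequence is the same idea as reference-based compression, which is why the two questions in this project are closely linked.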

Some reading:

Theory of patches: http://darcs.net/Theory, http://www.staff.science.uu.nl/~swier004/Publications/versionControl.pdf

A paper about storing genomes as self-indexed structures (don't worry about the details, just get a feel for what they're saying in the introduction) http://www.dcc.uchile.cl/~gnavarro/ps/recomb09.pdf. A paper about reference-based compression of genomes http://genome.cshlp.org/content/21/5/734.long. This Masters thesis talks about using the type system of Haskell to look at how to apply patches. It's not what I want to do, but chapter 3 is useful background about how patches could work: https://ir.library.oregonstate.edu/xmlui/bitstream/handle/1957/11180/thesis.pdf

Yeast genome data, just as an example: http://www.yeastgenome.org/download-data/sequence, http://wiki.yeastgenome.org/index.php/Commonly_used_strains

Commonly used E. coli strains (just described by how they are different to each other). http://www.addgene.org/mol-bio-reference/strain-information/. Also see plasmids, which are a bit like the biological equivalent of a patch or diff. You can apply them to your organism to create a difference. https://www.addgene.org/mol-bio-reference/plasmid-background/. Addgene is a site that contains loads of plasmids (and often the plasmids themselves are just variants of each other).

Keywords: version control, sequence, compression

Degree scheme: AI, CS, SE

4. Audio and visual representation of DNA sequence differences

This project would create a tool that takes in two DNA sequences that are similar but have differences, and automatically demonstrates those differences in sound, images and/or animation in a way that entertains the public. It should give the public some idea of how different the sequences are (very different, hardly different) and represent the differences at various scales (DNA, protein, gene, genome; I'll explain these). The input sequences may be very short (a few hundred characters) or very long (megabytes).

Keywords: sound, animation, sequences

Degree scheme: AI, CS, SE, IC

5. Simulation of Babbage's Difference Engine or Analytical Engine

The Difference Engine and Analytical Engine were mechanical computers designed by Charles Babbage between 1822 and 1837. I'd like an HTML5 simulation of what Babbage's Difference Engine can do. A much harder task would be to simulate what the Analytical Engine could do. I'd like to be able to use the simulations to describe the capabilities of these machines, and what kinds of programming concepts were being considered for them. https://www.fourmilab.ch/babbage/applet.html, http://www.fourmilab.ch/babbage/cmdline.html, http://jfinkels.github.io/analyticalengine/. The description of the Analytical Engine.
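The Difference Engine tabulates polynomials by the method of finite differences, which needs only addition. Here is a minimal sketch of that idea in Python (the project itself asks for HTML5). The function names are my own, and the sketch ignores the order and timing in which the real machine's columns are updated.

```python
def initial_differences(values):
    """From the first few values of a polynomial, build [value, 1st difference, 2nd difference, ...]."""
    column = []
    row = list(values)
    while row:
        column.append(row[0])
        row = [b - a for a, b in zip(row, row[1:])]
    return column


def tabulate(column, count):
    """Produce `count` successive values of the polynomial using additions only."""
    column = list(column)
    results = []
    for _ in range(count):
        results.append(column[0])
        # Add each difference into the entry before it, lowest order first,
        # so every addition uses the not-yet-updated higher difference.
        for i in range(len(column) - 1):
            column[i] += column[i + 1]
    return results


# f(x) = x*x + x + 41 at x = 0, 1, 2 gives 41, 43, 47.
start = initial_differences([41, 43, 47])
table = tabulate(start, 6)
print(start, table)
```

For a polynomial of degree n, n + 1 starting values are enough, because the nth difference is constant.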

Keywords: HTML5, compiler, programming concepts, history

Degree scheme: AI, CS, SE, IC

6. Construct and analyse a database of Miscanthus genetic data

We have a dataset that describes just over 100 Miscanthus plants by their properties (height, diameter, flowering time etc). It also contains genetic information about the differences in the plants' genomes. We want to find out how the genomes correspond to the plant properties that have been measured. In order to do this, we have to collect a lot more data, which tells us about the corresponding genes in other plant species (rice, wheat, sorghum, Arabidopsis). This project would be to find and download this information from a variety of websites, to extract the information from the variety of file formats, to construct a local database to structure this information, and then to begin to look for correlations and do data analysis.
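For the final analysis step, a first pass might simply rank genetic markers by how strongly they correlate with a measured property. The sketch below uses a plain Pearson correlation. The marker names, genotype codes and heights are made-up toy values, not taken from the Miscanthus dataset, and a real analysis would need to account for many markers being tested at once.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists of numbers.

    Assumes neither list is constant (otherwise the denominator is zero).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def rank_markers(genotypes, trait):
    """Sort (marker, correlation) pairs by the strength of correlation with the trait."""
    scores = {name: pearson(calls, trait) for name, calls in genotypes.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Toy values for four plants (hypothetical, for illustration only).
heights = [1.0, 2.0, 3.0, 4.0]
genotypes = {
    "marker_a": [0, 1, 1, 2],
    "marker_b": [1, 0, 1, 0],
}
ranked = rank_markers(genotypes, heights)
print(ranked)
```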

Keywords: database, data analysis, data extraction and validation

Degree scheme: CS, SE, IC, AI
