AWS genomics event: Distributed Bio, Cycle Computing talks: practical scaling

The afternoon of the AWS Genomics Event features longer detailed tutorials on using Amazon resources for biological tasks.

Next Generation Sequencing Data Management and Analysis: Chris Smith & Giles Day, Distributed Bio

Chris and Giles work at Distributed Bio, which provides consulting services on building, well, distributed biology pipelines. Chris starts by talking about some recommended tools:

  • Data transfer: rsync/scp; Aspera, bbcp, Tsunami, iRODS. UDP transport, like in iRODS, provides a substantial speed improvement over standard TCP. Simple demo: 3 minutes versus 18 seconds for ~1Gb file (with iput/iget).

  • iRODS: catalog on top of the filesystem, which is ideal for massive data collections. Lets you easily manage organizing, sharing and storing data between machines. It's a database of files with metadata: definitely worth investigating. Projects to interact with S3 and HDFS are in the works.

  • GlusterFS: easy to set up and use; performance equivalent to or better than NFS

  • Queuing and scheduling: OpenLava is basically open-source LSF, with an EC2-friendly scheduler
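To put the transfer demo timings in perspective, here is a quick back-of-the-envelope calculation. It assumes the "~1Gb" file is roughly 1000 MB; the exact size was not given, so the absolute rates are only indicative, while the speedup ratio comes straight from the two timings.

```python
# Timings from the demo; FILE_MB is an assumption (about 1 GB expressed in MB).
FILE_MB = 1000
BASELINE_SECONDS = 3 * 60  # "3 minutes" for the slower transfer
IRODS_SECONDS = 18         # "18 seconds" with iput/iget


def throughput_mb_per_s(size_mb, seconds):
    """Average transfer rate in MB/s."""
    return size_mb / seconds


speedup = BASELINE_SECONDS / IRODS_SECONDS
print(f"Baseline: {throughput_mb_per_s(FILE_MB, BASELINE_SECONDS):.1f} MB/s")
print(f"iRODS:    {throughput_mb_per_s(FILE_MB, IRODS_SECONDS):.1f} MB/s")
print(f"Speedup:  {speedup:.0f}x")
```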

Giles continues with 3 different bioinformatics problems: antibody/epitope docking, sequence annotation and NGS library analysis. For docking, architected a S3 EC2 solution with 1000 cores over 3.5 hours for $350. 10 times faster than local cluster work, enabling new science.
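The docking numbers work out to a simple per-core-hour price. A small sketch of the arithmetic, using only the figures quoted in the talk (1000 cores, 3.5 hours, $350) and assuming all cores were billed for the full duration:

```python
CORES = 1000
HOURS = 3.5
TOTAL_COST_USD = 350

core_hours = CORES * HOURS
cost_per_core_hour = TOTAL_COST_USD / core_hours
print(f"{core_hours:.0f} core-hours at ${cost_per_core_hour:.2f} per core-hour")
```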

Sequence annotation work was to annotate large set of genes in batch. Was changing so fast cluster could not keep up with updates. Moved to hybrid AWS architecture that processed on AWS and submitted results back to private databases. Costs were $100k in house versus $20k at AWS.

NGS antibody library analysis looks at variability of specific regions in the heavy-chain V region. Uses Blast and HMM models to find unique H3 clones. VDJFasta software available; paper. On the practical side, iRODS is used to synchronize local and AWS file systems.

Building elastic High Performance clusters on EC2: Deepak Singh and Matt Wood, AWS

Deepak starts off with a plug for AWS Education grants, which are a great way to get grants for compute time. Matt follows with a demo to build an 8 node, 64 core cluster with monitoring. Create a cc1.4xlarge cluster instance and 100Gb EBS store attached. Allow SSH and HPC connections between nodes in the security group. Put the instance within a placement group so you can replicate 7 more of them later by creating an AMI from the first.
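A rough sketch of how the demo steps could map onto API requests. This is not what Matt ran: the requests are built as plain dictionaries so the example runs without AWS credentials, the key names mirror keyword arguments of the boto3 EC2 client (`create_placement_group`, `run_instances`), a library that postdates the talk, and the AMI ids, device name and group name are hypothetical placeholders.

```python
NODES = 8
TOTAL_CORES = 64
CORES_PER_NODE = TOTAL_CORES // NODES  # 8, implied by the figures in the notes


def placement_group_request(name):
    """Parameters for creating a cluster placement group."""
    return {"GroupName": name, "Strategy": "cluster"}


def launch_request(ami_id, count, group):
    """Parameters for launching `count` identical nodes into the placement group."""
    return {
        "ImageId": ami_id,
        "InstanceType": "cc1.4xlarge",
        "MinCount": count,
        "MaxCount": count,
        "SecurityGroups": [group],          # group allowing SSH and inter-node traffic
        "Placement": {"GroupName": group},  # same name reused here for brevity
        "BlockDeviceMappings": [
            {"DeviceName": "/dev/sdf", "Ebs": {"VolumeSize": 100}}  # size in GiB
        ],
    }


group = placement_group_request("hpc-demo")
first_node = launch_request("ami-base0000", 1, "hpc-demo")
# After configuring the first node and saving it as an AMI, launch the other seven.
remaining = launch_request("ami-custom00", NODES - 1, "hpc-demo")
```

Launching the first node alone and cloning it afterwards keeps the configuration work in one place; the placement group is what keeps all eight nodes close together on the network.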

CycleCloud demo: Andrew Kaczorek

Andrew from Cycle Computing wraps up the day by talking about their 30,000 node cluster created using CycleCloud. Monitoring done using Grill, a CycleServer plugin that monitors Chef jobs.

The live demo shows utilizing their interface to spin up and down nodes. Simple way to start clusters and scale up and down; nice.

Also demos a SMRT interface to PacBio data and analyses. Provides the back end elastic compute to a PacBio instrument on Amazon.

AWS genomics event: Deepak Singh, Allen Day; practical pipeline scaling

High Performance Computing in the Cloud: Deepak Singh, AWS

After the coffee break, Deepak starts off the late morning talks at the AWS genomics event. He begins by discussing the types of jobs in biology: batch jobs and data intensive computing. 4 things go into it:

  • infrastructure -- instances on AWS that you access using an API. On EC2, this infrastructure is elastic and programmable. Amazon has cluster compute instances that make it easier to connect this infrastructure into something you can do distributed work on. Cluster compute instances scale well for MPI jobs, which are common in computational chemistry and physics. Another choice is GPU instances, if your code distributes there.

  • provision and manage -- Lots of choices here: ruby scripts, Amazon CloudFormation, chef, puppet. Also standard cluster options: Condor, SGE, LSF, Torque, Rocks+. MIT Starcluster examples with awesome demos of making clusters less tricky. Cycle Computing leveraging these tools to make massive clusters and monitoring tools.

  • applications: Galaxy CloudMan, CloudBioLinux, Map Reduce for Genomics: Contrail, DNANexus, SeqCentral, Nimbus Bioinformatics

  • people: Most valuable resource that you can maximize by removing constraints. Big advantage to have access to unlimited instances to leverage when needed.

Elastic Analysis Pipelines: Allen Day, Ion Flux

Allen from IonFlux talks about a system for processing Ion Torrent data; production oriented system to move from initial data to final results behind a well-packaged front end. Decided to work on the Cloud to better serve smaller labs without existing infrastructure, plus all the benefits of scale. End to end solution from torrent machine to results.

The pipeline pulls data into S3, aligns data, does realignments, produces variant calls. Workflow involves using Cascading to describe Hadoop jobs; this talks to HBase to store results. On the LIMS side, uses messaging queues to pass analysis needs to workflow side. Jenkins used for continuous integration.

What does the Hadoop part do? Distributed sorting of BAM files, co-grouping, and detection of outliers. Idea is to distribute filtering work to prioritize variants or regions of interest. Data is self-describing, which allows restarting or recovering at arbitrary points. Cascading allows serialization of these by defining schemes for BAM, VCF and fastq files.
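The distributed sorting idea can be illustrated without Hadoop. A minimal sketch, assuming alignment records reduced to hypothetical (chromosome, position, name) tuples: partition by chromosome, sort each partition independently (the part that could run on separate workers), then concatenate partitions in chromosome order. This is not the Ion Flux implementation, only the general partition-then-sort pattern.

```python
from collections import defaultdict


def partition_by_chrom(records):
    """Map step: route each record to the partition for its chromosome."""
    parts = defaultdict(list)
    for record in records:
        parts[record[0]].append(record)
    return parts


def distributed_sort(records, chrom_order):
    """Sort partitions independently, then concatenate in chromosome order."""
    parts = partition_by_chrom(records)
    # Each partition sort is independent, so it could run on its own worker.
    sorted_parts = {c: sorted(rs, key=lambda r: r[1]) for c, rs in parts.items()}
    result = []
    for chrom in chrom_order:
        result.extend(sorted_parts.get(chrom, []))
    return result


reads = [("chr2", 500, "r1"), ("chr1", 300, "r2"), ("chr1", 100, "r3"), ("chr2", 50, "r4")]
for read in distributed_sort(reads, ["chr1", "chr2"]):
    print(read)
```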

How did they work on scaling problems? Index server that pre-computes results and feeds them into the MapReduce cluster. Index server became a bottleneck. Can improve this by moving index server into a bittorrent swarm that serves them to the MapReduce nodes. The continuous integration system does the work of creating index files as EBS snapshots; bittorrent swarm uses these snapshots.

AWS Genomics Event: Matt Wood; Chris Dagdigian on cloud biology automation

I'm in Seattle at the AWS Genomics Event, excited for a fun day of talking about genomics in the cloud.

Introduction to Research in the Cloud: Matt Wood, AWS

Matt Wood starts off the day with an introduction to Amazon Web Services and details about Amazon's interest in Genomics. Idea is to move from data to materials, and from compute to methods; focusing better on the science. Areas where Amazon interacts with science:

  • Reproducibility: 1000 genomes a great example. Improves the impact of science by easing reuse. Can package the environment as machine images, which is awesome since you can give collaborators exactly what you did. Allows us to work in new ways since you can share complex environments. CloudFormation allows you to define in JSON all of the items in a cluster. Tools like Puppet and Chef provision software and configuration. Taverna can model the actual science workflow. Amazon provides SimpleDB as a key/attribute store to help model and store metadata associated with experiments or data. Galaxy fully invested in reproducibility and community involvement within their infrastructure.

  • Constraint removal: avoid constraints that limit innovation and research. Expand your problem space by introducing an easy approach to scaling.

  • Algorithm development: Infrastructure enables algorithms. Nice examples are: a. GPU instances; b. Crossbow utilizing Hadoop.

  • Collaboration and sharing: data, data uses and multiple users over lots of locations. General idea: moving the compute to the data. Amazon has free inbound transfer; if that's too slow, also have Import/Export via FedExed hard drives. Can do parallel upload to S3.

  • Funding options: On-demand is the easiest approach, but most costly. Can use reserved capacity to reduce the hourly rates. The spot market lets you bid on capacity and save money; need to architect for interruption.

  • Compliance: shared responsibility -- Amazon secures the infrastructure; users secure the instances and data. ISO 27001 and HIPAA compliant. Data mirrored across availability zones, but local data stays local. GovCloud: US only usage.
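For the CloudFormation point in the reproducibility item above, here is a minimal sketch of what "define the cluster in JSON" can look like, generated from Python. The top-level keys and the `AWS::EC2::Instance` resource type follow the CloudFormation template format; the AMI id, instance type, resource names and description are placeholders, and a real cluster template would need considerably more (security groups, storage, outputs).

```python
import json


def minimal_template(ami_id, instance_type, count):
    """Build a bare-bones CloudFormation template with `count` EC2 instances."""
    resources = {}
    for i in range(count):
        resources[f"Node{i}"] = {
            "Type": "AWS::EC2::Instance",
            "Properties": {"ImageId": ami_id, "InstanceType": instance_type},
        }
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Illustrative analysis cluster (placeholder values)",
        "Resources": resources,
    }


template = minimal_template("ami-00000000", "m1.large", 2)
print(json.dumps(template, indent=2))
```

Because the whole environment is one text file, it can be versioned and handed to collaborators alongside the machine image.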

Some exciting things that are coming soon in genomics. Getting closer to health and patient data: going to require security and data availability, scaling to large numbers of users with elastic pipelines. Important to put patients in charge of their own data.

Practical Cloud & Workflow Orchestration: Chris Dagdigian, The BioTeam

Chris Dagdigian discusses working on the hardware geek side of science with AWS. Three topics: time, laziness and beauty. Getting to the point where automated provisioning changes lag time between wanting to do science and getting the hardware ready to do it. Research infrastructure is 100% scriptable and automatable; be lazy and automate what you do. The beautiful bits are what you can build on top of Amazon infrastructure.

Demo time:

  • CloudInit gives you a hook into freshly booted systems. Don't need to maintain tons of AMIs; easy way to configure a new system with a YAML configuration file.

  • Amazon CloudFormation allows you to turn on/off a large number of instances. Create an elastic database cluster, webserver cluster and monitoring: all in a JSON input file. The example JSON template is a good place to get started.

  • Opscode Chef enables infrastructure as code. Important that everything is idempotent so you can run multiple times. Demo with knife, Chef's commandline tool. Can run ssh code on each node in a cluster, but also do searches with this. With the searches can find certain nodes with properties of interest and run those.

  • MIT StarCluster builds ready to use cluster compute farm on AWS. Especially useful for handling legacy use cases. Slideshare example of running this.
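Idempotency, mentioned in the Chef item, is easiest to see in a tiny example. This is a plain Python sketch of the idea, not Chef code: a hypothetical `ensure_line` helper that makes sure a configuration line exists, and does nothing on repeated runs.

```python
import os
import tempfile


def ensure_line(path, line):
    """Append `line` to the file unless it is already there. Returns True if changed."""
    content = ""
    if os.path.exists(path):
        with open(path) as handle:
            content = handle.read()
    if line in content.splitlines():
        return False  # already in the desired state: do nothing
    with open(path, "a") as handle:
        if content and not content.endswith("\n"):
            handle.write("\n")
        handle.write(line + "\n")
    return True


with tempfile.TemporaryDirectory() as workdir:
    config = os.path.join(workdir, "exports")
    print(ensure_line(config, "/data *(rw)"))  # first run changes the file
    print(ensure_line(config, "/data *(rw)"))  # second run is a no-op
```

Describing the desired end state rather than the steps is what makes it safe to rerun provisioning on every node.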

Bioinformatics Open Source Conference 2011 -- Day 2 afternoon talks

Ben Vandervalk -- SADI for GMOD: Bringing Model Organism Data onto the Semantic Web

SADI is a Semantic Web framework developed to get RDF documents, described with OWL. Ben wrote several SADI services to access sequence feature data from GMOD. Takes in input RDF, does a query, and returns back feature descriptions in RDF. SADI service works off GFF files and is a CGI script. Can get from SADI GMOD google website.

Stian Soiland-Reyes -- Scufl2: because a workflow is more than its definition

Scufl2 is a workflow format for Taverna. Wanted to handle workflows and be compatible with Semantic Web technologies. Capture information to re-run and re-use any part of workflow. Idea is to get fully reproducible workflow from a paper.

Workflow distributed as a bundled zip file. It contains a workflow bundle RDF file along with workflow RDF, profile RDF, annotation RDF, input and output RDFs. Main development page is at workflow4ever and code on Github. Currently in alpha stage, aiming for releases later in the year.

Tomasz Adamusiak -- OntoCAT - an integrated programming toolkit for common ontology application tasks

Tomasz starts off by calling everyone wolves. Then mentions that the problem with the 3 little pigs is that they did not provide a consistent API. Relevance of RDF is that we have multiple ontology repositories: EBI, BioPortal, plus local ontologies. OntoCAT is a database and browser, REST service and GoogleApp. It provides an abstraction layer over the different resources, giving a set of common methods to access ontologies. Has a web tool and R package.

Steffen Moeller -- Debian Med: individuals' expertize and their sharing of package build instructions

Bioinformatics has an enormous number of tools that are specialized and can be challenging to install. Debian Med is a community package repository: publication of packages, collaboration and public engagement. Packaging is all volunteer work and shared with Bio-Linux and other communities. Current work is incorporating additional Java packages, establishing complete workflows, data management with BioMaj.

Andreas Hildebrandt -- The Biochemical Algorithms Library for Rapid Application Development in Structural Bioinformatics

Motivation: difficult to design new drugs, despite improvements in data generation, knowledge and techniques. Want to improve drug design. Approach is to generate structures: start with PDB, parse them, add missing atoms, infer bonds and optimize atom positions. Biochemical Algorithms Library helps handle these steps. It has Python language bindings so you can work with it directly from scripting languages. The viewer has some beautiful pictures of 3D structures.

Simon Mercer -- A Framework for Bioinformatics on the Microsoft Platform

Simon talks about the work Microsoft is doing to build a reusable Bioinformatics toolkit built on the .NET framework. The Microsoft Bioinformatics Framework is at version 2.0; cross platform and works on Mono. The name of the project is going to change as they are moving to the OuterCurve foundation and will not be guided or owned by Microsoft.

Bioinformatics Open Source Conference 2011 -- Day 2 morning talks

The second day of the Bioinformatics Open Source Conference (BOSC) started off with a session on Cloud Computing.

Matt Wood -- Into the Wonderful

Matt worked at Ensembl/Sanger on sequencing pipelines, now technology evangelist at Amazon dealing with Cloud Computing.

Start off by talking about data, well, lots of data. Challenges are distributing and making data available with lots of constraints: throughput, data management, software, availability, reproducibility, cost. So far, we've managed to move from Gb-scale to Tb-scale work. Open source software has played a role in making software easily available, along with active development communities.

Work so far is foundational blocks for next steps; how can we optimize with existing tools and infrastructure? Optimize for developer productivity; also wider development community because lower barriers to entry. Goals are to abstract difficult and tedious parts, maximizing time you can spend on the fun parts.

What are the building blocks that are available to do this? Cloud provides a collection of foundations: compute, storage, databases, automation; couple with workflows, analytics, warehouses and visualization. Move data/compute to materials/methods. Usability is the most important metric for tools: available, flexible, and reliable. Cloud is awesome example of this: quick to get new image and API to flexibly access them.

Machine images are really key to sharing code, data, configuration and services with others. Also a reproducible representation of work that was done. You have a ton of moving parts and want to be able to capture these: tools like Puppet and Chef allow you to reproduce this as well.

Amazon data is replicated across multiple availability zones for redundancy. Each availability zone is separate. However, data stays local: if it is not allowed to move from the US to Europe, it does not. There are a ton of options for building different infrastructure and managing costs: standard, reserved and spot instances. Spot instances are a great way to get access to cheap compute; need to architect for interruption.
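"Architect for interruption" usually comes down to checkpointing. A minimal sketch, assuming work items identified by strings and a JSON checkpoint file; the `SpotInterruption` exception and `interrupt_after` parameter are hypothetical and exist only to simulate the instance being reclaimed.

```python
import json
import os


class SpotInterruption(Exception):
    """Stands in for the instance being reclaimed mid-run."""


def run_with_checkpoint(items, process, checkpoint_path, interrupt_after=None):
    """Process string items, saving results after each one so a rerun can resume."""
    done = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as handle:
            done = json.load(handle)
    processed_this_run = 0
    for item in items:
        if item in done:
            continue  # finished before the last interruption
        if interrupt_after is not None and processed_this_run >= interrupt_after:
            raise SpotInterruption(item)
        done[item] = process(item)
        processed_this_run += 1
        # Write to a temporary file, then rename, so a crash never leaves
        # a half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as handle:
            json.dump(done, handle)
        os.replace(tmp_path, checkpoint_path)
    return done
```

On a real spot instance the checkpoint would need to live somewhere that outlasts the instance, such as S3 or an EBS volume, rather than on local disk.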

Matt talks about some of his favorite projects: Galaxy, Cloud BioLinux, Taverna, StarCluster, CloudCrowd. Also companies doing interesting stuff here: Cycle Computing, ionflux, DNANexus.

For Hadoop, Amazon's ElasticMapReduce takes away a lot of the pain of setting up a Hadoop server. Another example is Amazon's Relational Database Service for MySQL/Oracle. General idea is lowering the barrier to utilization.

The free tier and research grants are some no-cost ways to get started.

Richard Holland -- Securing and sharing bioinformatics in the cloud

Talking about commercial deployment with open source software: PlasMapper and Ensembl. Proof of concept cloud architecture with Ensembl and custom databases and open source applications on top, Ensembl, PlasMapper and GeneAtlas.

For security, used OpenAM to authenticate, then encrypted data on disk, SSL encryption of communication, hide Apache information, and firewalled.

Some potential issues. PlasMapper writes to a tmp directory, where your original data is available; if you don't secure this directory in Apache, others can grab your data. With Ensembl, can do HTML injection by appending nasty javascript tricks to GET parameters. Also has global identifiers for things like BLAST results; security by obscurity since if someone had or guessed the id, they could look at your results. Need to tie these identifiers to login.

Recommendations: firewall externally and internally, validate file uploads, don't store uploaded files in accessible location, avoid GET parameters where possible.
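Two of these fixes are small enough to sketch with the Python standard library: escaping user-supplied parameters before echoing them into HTML, and tying result identifiers to the logged-in user instead of relying on ids being hard to guess. The `ResultStore` class and function names are hypothetical, not taken from Ensembl or PlasMapper.

```python
import html
import secrets


def render_search_echo(user_value):
    """Escape a request parameter before placing it in a page."""
    return "<p>You searched for: {}</p>".format(html.escape(user_value))


class ResultStore:
    """Keeps results under random ids and checks ownership on every read."""

    def __init__(self):
        self._results = {}

    def save(self, owner, data):
        result_id = secrets.token_urlsafe(16)  # unguessable, but not relied upon alone
        self._results[result_id] = (owner, data)
        return result_id

    def load(self, result_id, user):
        owner, data = self._results.get(result_id, (None, None))
        if owner is None or owner != user:
            raise PermissionError("result not available for this user")
        return data


print(render_search_echo("<script>alert(1)</script>"))
```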

Chunlei Wu -- Mygene.info: Gene Annotation as a Service - GAaaS

Chunlei starts off with a migration story for BioGPS, a Gene-centric annotation data representation. They started with a relational database solution, then switched into a document based solution: json style objects with CouchDB. Infrastructure uses Tornado on top of that, then nginx. Web based API for query, alongside web application: mygene.info.

Ntino Krampis -- Cloud BioLinux: open source, fully-customizable bioinformatics computing on the cloud for the genomics community and beyond

Paradigm shift associated with next gen sequencing data and small sequencing machines. Now small labs can handle their own sequencing; second step is how do you analyze it. CloudBioLinux is community project with JCVI, NEBC Bio-Linux associated. Ntino demos using CloudBioLinux, connecting with graphical client and making data available with collaborators through sharing AMIs.

Olivier Sallou -- OBIWEE : an open source bioinformatics cloud environment

OBIWEE is a bioinformatics framework based on the Torque job scheduler. It combines 3 pieces of software: a workflow authoring tool, a virtual Torque cluster and a set of deployment scripts for private or public cloud. SLICEE is the workflow authoring tool with front end to submit jobs. Has API, commandline and GUI interfaces for running.

Brian O'Connor -- SeqWare: Analyzing Whole Human Genome Sequence Data on Amazon's Cloud

SeqWare is an open source toolset for large-scale sequence analysis. Project ported it to EC2 for scaling out. Uses the Pegasus workflow engine to define workflows and run on clusters. SeqWare has multiple levels to interact with: workflow description language, java class interface. Can also provision and bundle dependencies.

To port to EC2, used StarCluster with custom AMI containing dependencies. 9 human genomes analyzed, cost $1000 per genome, and $100 per exome.

Lars Jorgensen -- Sequencescape - a cloud enabled Laboratory Information Management Systems (LIMS) for second and third generation sequencing

Sequencescape is the LIMS at the Sanger Institute, so can definitely scale. Supports all sequencing technologies. Development is open on GitHub and what's there matches what is running at Sanger currently. Sanger data needs to be publicly released 60 days after it is quality controlled. Really impressed.

LIMS handles pretty much everything: from freezer tracking, study management to automation, workflows to data release and reporting. Live demo is sweet and covers every use case you could imagine; runs on a laptop.

Enis Afgan -- Enabling NGS Analysis with(out) the Infrastructure

Enis talks about CloudMan, Galaxy on the cloud with reusable backend for scaling analyses. Lets you do NGS analyses on Amazon without needing any computational resources. Has even more tools and reference datasets than the Galaxy main site. It offers a wizard-guided setup directly in the browser, is customizable and can be shared with other users. Contains tons of NGS tools built on top of CloudBioLinux.

Aleksi Kallio -- Hadoop-BAM: A Library for Genomic Data Processing

Chipster is the main project and Hadoop-BAM was abstracted from that. Designed for dealing with large numbers of BAM files coming out of NGS analyses. Detect BAM record chunks based on compression and data for splitting up. Has a Picard compatible API. Data import/export is slow but otherwise scales based on parallelization well; used for batch pre-processing.
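The splitting idea relies on BAM files being a series of independently compressed BGZF blocks, so a worker handed an arbitrary byte offset can scan forward to the next block boundary. A self-contained sketch of that scan follows; it is not Hadoop-BAM's code, and it assumes the `BC` subfield is the first gzip extra field, as the BGZF layout in the SAM/BAM specification describes.

```python
import struct
import zlib

BGZF_MAGIC = b"\x1f\x8b\x08\x04"  # gzip magic, deflate, FEXTRA flag set


def make_bgzf_block(payload):
    """Wrap payload in one BGZF block (a gzip member with a 'BC' extra field)."""
    compressor = zlib.compressobj(6, zlib.DEFLATED, -15)  # raw deflate stream
    cdata = compressor.compress(payload) + compressor.flush()
    total_size = 18 + len(cdata) + 8  # header + data + CRC32/ISIZE trailer
    header = BGZF_MAGIC + struct.pack("<IBBH", 0, 0, 255, 6)  # MTIME, XFL, OS, XLEN
    extra = b"BC" + struct.pack("<HH", 2, total_size - 1)     # BSIZE = size - 1
    trailer = struct.pack("<II", zlib.crc32(payload), len(payload))
    return header + extra + cdata + trailer


def find_block_starts(data, offset=0):
    """Offsets of BGZF blocks at or after `offset`, validated by chaining."""
    starts = []
    pos = data.find(BGZF_MAGIC, offset)
    while pos != -1 and pos + 18 <= len(data):
        if data[pos + 12:pos + 14] == b"BC":
            block_size = struct.unpack_from("<H", data, pos + 16)[0] + 1
            nxt = pos + block_size
            # A real block is followed by another block or the end of the data.
            if nxt == len(data) or data[nxt:nxt + 4] == BGZF_MAGIC:
                starts.append(pos)
                pos = nxt if nxt < len(data) else -1
                continue
        pos = data.find(BGZF_MAGIC, pos + 1)
    return starts


blocks = [make_bgzf_block(p) for p in (b"ACGT" * 100, b"TTGA" * 50, b"read data")]
data = b"".join(blocks)
print(find_block_starts(data, offset=5))  # skips the block that starts at 0
```

Checking that each candidate block is followed by another block guards against the magic bytes turning up by chance inside compressed data.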

Bioinformatics Open Source Conference 2011 -- Day 1 afternoon talks

The afternoon session at the 2011 Bioinformatics Open Source Conference is focused on 2 areas: Visualization and next-gen sequencing.

Michael Smoot -- Cytoscape 3.0: Architecture for Extension

Cytoscape is a visualization framework for complex network analyses. It has a plugin architecture which allows customization by users; developed a strong community of contributions. Some issues are that Cytoscape architecture is very complicated, which makes it difficult to change. Changes often break plugins which aren't updated regularly.

Challenge to Cytoscape is to improve this architecture: hence the new 3.0 version of Cytoscape. The new technologies used: OSGi defines boundaries of modules and Spring-DM to help manage OSGi based on XML configuration. Semantic Versioning (major.minor.patch) used to make version numbers meaningful; this allows you to specify ranges of working packages. Maven used for dependency management.
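What "ranges of working packages" means in practice can be shown with a short sketch. The helper names are hypothetical; the half-open range mirrors the common "at least 1.2.0, below 2.0.0" convention and ignores pre-release or build suffixes.

```python
def parse_version(text):
    """Turn 'major.minor.patch' into a tuple of integers for comparison."""
    major, minor, patch = (int(part) for part in text.split("."))
    return (major, minor, patch)


def in_range(version, minimum, maximum):
    """True if minimum <= version < maximum."""
    return parse_version(minimum) <= parse_version(version) < parse_version(maximum)


# A plugin built against API 1.2.0 should work with any later 1.x release,
# but not with 2.0.0, where the major bump signals breaking changes.
for candidate in ("1.2.0", "1.10.3", "2.0.0"):
    print(candidate, in_range(candidate, "1.2.0", "2.0.0"))
```

Comparing integer tuples rather than strings matters: as text, "1.10.3" would sort before "1.2.0".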

Jeremy Goecks -- Applying Visual Analytics to Extend the Genome Browser from Visualization Tool to Analysis Tool

Trackster is a Genome Browser integrated with Galaxy. 3 unique features:

  • Dynamic visualization of NGS data -- Jeremy bravely does live demo and everything works. Whew.
  • Support visual analytics -- use interactive visualization to reason about and solve problems. Sliders in trackster allow you to visually explore parameters, and then apply filters to entire dataset. Awesome part is that you can work only in a small region and re-run results in Galaxy. Running on a whole dataset would take a long time, but can quickly re-run on a small region. Demo shows doing this with Cufflinks.
  • Sharing working visualization -- Can make a visualization link that others can pull up directly. Allow you to show exactly what you want to share but can also be dynamically manipulated.

Can integrate tools with trackster by specifying how it can be run on a local region, or how you can re-use a global model on a local region.

Nomi Harris -- WebApollo: A web-based sequence annotation editor for community annotation

WebApollo succeeds the "old" Apollo, which was designed for community annotation but made collaborative annotation very difficult. WebApollo is, well, web-based instead of Java and does real-time annotation updating for collaborative work.

Use JBrowse for genome browsing with extensions for annotation work. Accesses public data at UCSC and uses custom DAS servers as well. Demo server is impressive and changes to annotated transcripts are pushed immediately to other users working on different servers.

Florian Breitwieser -- The isobar R package: Analysis of quantitative proteomics data

isobar works with mass spectrometry data to visualize protein expression changes; generates PDF and LaTeX reports. Overview of techniques: fragment peptides to get spectrum and use isobaric peptide tags for quantitation; multiplex up to 8 samples. isobar extracts identification from databases and quantitative details from mass spec.

Handles normalization issues and correcting for technical variability, plots for sample variability and visualization. Analysis can be automated with Sweave to produce PDF reports with fully reproducible approach along with outputs.

Julian Catchen -- Stacks: building and genotyping loci de novo from short-read sequences

Stacks motivated by work on Zebrafish, which underwent a genome duplication relatively recently in their evolutionary history. Use an outgroup fish (spotted Gar) that did not undergo a duplication. Use RAD-seq (restriction-site associated DNA) technique to sample the genome at SbfI common cut sites. Use stacks to do comparative analyses between non-duplicated and duplicated fish.

Algorithm in Stacks: reads are combined into regions called stacks, then broken down in kmers that are loaded into a dictionary. The kmers between stacks are used to establish similar regions in duplications. Can look at SNP variation within similar blocks.
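A toy version of the kmer dictionary step, to make the idea concrete. This is a simplified sketch rather than the Stacks implementation: each stack is represented by a single hypothetical consensus sequence, and two stacks count as similar when they share at least `min_shared` kmers.

```python
from collections import defaultdict


def kmers(seq, k):
    """All distinct substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_kmer_index(stacks, k):
    """Dictionary from kmer to the set of stack ids containing it."""
    index = defaultdict(set)
    for stack_id, seq in stacks.items():
        for kmer in kmers(seq, k):
            index[kmer].add(stack_id)
    return index


def similar_stacks(stacks, k, min_shared):
    """Pairs of stack ids sharing at least `min_shared` kmers."""
    shared = defaultdict(int)
    for ids in build_kmer_index(stacks, k).values():
        ordered = sorted(ids)
        for i, first in enumerate(ordered):
            for second in ordered[i + 1:]:
                shared[(first, second)] += 1
    return {pair for pair, count in shared.items() if count >= min_shared}


stacks = {"s1": "ACGTACGTAA", "s2": "ACGTACGTTA", "s3": "GGGCCCGGGC"}
print(similar_stacks(stacks, k=4, min_shared=3))
```

The dictionary lookup avoids comparing every stack against every other stack directly, which is what makes the approach workable for large numbers of loci.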

Morris Swertz -- Large scale NGS pipelines using the MOLGENIS platform: processing the Genome of the Netherlands

MOLGENIS provides an XML interface to define tools, and wanted to use this to analyze genome of the Netherlands. Sequencing done at BGI and 75% aligned and analyzed so far. Used technology approaches from 1000 genomes, alignment with BWA and GATK SNP calling. Big challenges were in tracking samples and results. Built a custom system to handle this. Can send MOLGENIS results to Galaxy.

Raoul Bonnal -- Bio-NGS: BioRuby plugin to conduct programmable workflows for Next Generation Sequencing data

Based on Bio-Gem, which is a general framework for extending BioRuby. Bio-NGS uses this modular framework to combine together multiple tools for next-gen sequencing. Currently runs locally but next steps are distributed jobs over multiple machines. Tasks and programs defined with Ruby classes. Approaches to distributed tasks including messaging: Bio-Hub. Percolate is a related Ruby project which might be worth looking at for parallelization.

Kevin Dorff -- Goby framework: native support in GSNAP, BWA and IGV 2.0

Goby provides file formats for next-gen sequencing that are more compact than BAM. They provide several algorithms and bridges to GSNAP, BWA and IGV.

Frank Drews -- A Scalable Multicore Implementation of the TEIRESIAS Algorithm

TEIRESIAS is a motif discovery algorithm from IBM. It's available and has binaries but they are not useful for large datasets. Used to discover common patterns within the human genome. Algorithm has two phases: initial scan step and then alignment/convolve step that resolves patterns.

To parallelize scan, split up word space by initial letters: with 4 cores use 1 letter, with 16 cores use 2, and so on. For parallel convolve, need to combine initial seeds into similar groups and then apply separately on each. 4-10X speedup with 16 core machines depending on kmer size.
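The prefix-based split of the scan can be sketched as follows, assuming a 4-letter DNA alphabet (which matches the 4 cores / 1 letter and 16 cores / 2 letters figures). The function names are hypothetical and the sketch only shows the partitioning, not the scan itself.

```python
from itertools import product

ALPHABET = "ACGT"


def prefix_length(cores):
    """Shortest prefix whose number of combinations covers the cores: 4 -> 1, 16 -> 2."""
    length, buckets = 1, len(ALPHABET)
    while buckets < cores:
        length += 1
        buckets *= len(ALPHABET)
    return length


def partition_words(sequence, word_size, cores):
    """Group every word of `word_size` by its leading letters, one bucket per core.

    Assumes the sequence contains only A, C, G, T and word_size >= prefix length.
    """
    n = prefix_length(cores)
    buckets = {"".join(p): [] for p in product(ALPHABET, repeat=n)}
    for i in range(len(sequence) - word_size + 1):
        word = sequence[i:i + word_size]
        buckets[word[:n]].append(word)
    return buckets


for prefix, words in partition_words("ACGTAC", word_size=3, cores=4).items():
    print(prefix, words)
```

Each bucket can be scanned independently because words with different prefixes can never match the same seed pattern.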

Future work to include regular expression patterns and distributed computing.

Jean-Frédéric Berthelot -- Biomanycores, open-source parallel code for many-core bioinformatics

Biomanycores is a repository of parallel code that works on GPUs and multicore CPUs. OpenCL is used to generalize CUDA code over multiple machines. Implementations available of Smith-Waterman, TFM-Cuda, Unafold. Want to bridge gap to biologists by building a pool of applications. Interfaces available for Biopython, BioJava and BioPerl. Impressive speed ups.

Kerensa McElroy -- GemSIM: General, Error-Model Based Simulator of next-generation sequencing

Interested in trying to find low frequency variations within bacterial populations. GemSIM handles error models, populations, makes reads, and then produces stats. Want to generate an error model specific to the data you have; Illumina error rates can be quite variable between runs. Shows graphs of 454 versus Illumina results and importance of quality cutoffs.
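To illustrate what an error-model based read simulator does, here is a deliberately simplified sketch: substitution errors only, with one error probability per read position. GemSIM's actual models are derived from real data and are richer than this; the function and parameter names are hypothetical.

```python
import random

BASES = "ACGT"


def simulate_read(reference, start, error_rates, rng):
    """Copy len(error_rates) bases from the reference, substituting errors per position."""
    read = []
    for offset, rate in enumerate(error_rates):
        base = reference[start + offset]
        if rng.random() < rate:
            # Substitute with one of the three other bases.
            base = rng.choice([b for b in BASES if b != base])
        read.append(base)
    return "".join(read)


reference = "ACGTACGTACGTACGT"
rng = random.Random(42)  # seeded so runs are repeatable
# Error rate rising towards the end of the read, a typical pattern for short reads.
rates = [0.001, 0.001, 0.005, 0.01, 0.05, 0.1]
print(simulate_read(reference, 2, rates, rng))
```

Fitting the per-position rates to your own run, rather than using a generic profile, is the point made in the talk about run-to-run variability.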

Bioinformatics Open Source Conference 2011 -- Day 1 morning talks

I'm in beautiful Vienna, Austria at the Bioinformatics Open Source Conference on Friday and Saturday, July 15-16th. The conference emphasizes freely available biological software and the communities that contribute to them. These are my notes from the morning talks.

Larry Hunter -- The role of openness in knowledge-based systems for biomedicine

Difficulty in Artificial Intelligence is capturing common sense information that you expect an intelligent agent to know. There is a ton of this information; but thankfully in molecular biology this common sense is less of a barrier -- you can capture everything known about molecular biology from textbooks, papers, databases. Can we write programs that get all this information?

Difficulty is that the interesting questions we want to answer are complex. The one gene, one disease model is extremely rare; in general we are looking at perturbations of complex models that change over time. Now that we are so good at sequencing, the hard problems in bioinformatics are understanding the data. This is not only about facts, but rather putting facts together to answer "why" questions. Judging if an explanation is plausible in AI requires a knowledge base and a way to score results.

Some knowledge based computational biology solutions are: BioCyc, AskHermes, Watson Medicine, GO over-representation analysis, HyQue; anything that uses ontologies like GO. 3 reasons that openness matters in these areas:

  • Productivity: very hard problem so need to build off results; can't do it alone
  • Equity: allow anyone to contribute by lowering barriers and costs
  • Ethics: AI is a social concern; need to earn the trust of society

How do we get the process going?

  • Build on current open ontologies: OBO, Semantic Web, Open Access Publishing, Linked Life Data (wow)
  • Social infrastructure to work together to solve hard problems: cooperation and competition combined
  • Conform to shared infrastructure and avoid fears of losing ideas and credit to the community

Idea is to organize competitions that require open source code and using shared infrastructure, specifically with goals of combining existing communities (BioCreative, BioNLP). For this you need software to work off of, computational power, training data to work with, and significant prizes.

Larry's group has made CRAFT available, an open source set of semantic annotations that uses existing community ontologies. This can be the basis of these competitions. Key is leveraging these existing standards to serve as a basis for future work so we are actually building off each other's work.

Remaining challenges are ensuring openness of papers in a way that they can be bulk downloaded for AI text mining, improving ontology and connections to existing text. The technical aspects are things that are excellent targets for competitions.

Konstantin Okonechnikov -- Unipro UGENE: an open source toolkit for complex genome analysis

UGENE integrates bioinformatics tools. Written in C++/Qt with a plugin system. It provides a large library of bioinformatics algorithms: Smith-Waterman, Muscle, Blast, HMM, Bowtie and more. It contains a visualization toolkit for sequence viewing, alignment. Algorithms are parallelized for multi-core CPUs, GPUs and support launching on clusters.

Contains a visual environment for constructing workflows. The workflow can be turned into a shell command to run from the commandline. Future plans are to develop a web environment, and support next-gen sequencing analysis.

Thomas Down -- Exploring the genome with Dalliance

Existing Genome Browsers fall into two classes: heavy-weight clients that require installation like IGV, or light-weight browser-based clients like UCSC. Can we have a browser client that acts more like heavy-weight clients but without installation? Now a lot of web technologies to drive this: javascript, SVG/Canvas views, browsers focused on performance for games, HTML5.

Interactive demo of the Dalliance Genome Browser shows nice scrolling and interaction fully within the web. For getting data, uses DAS: XML based annotations from the web. Used to be limited here by javascript same-origin policies but now can set the server to allow cross origin requests in the headers.
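On the server side, allowing cross-origin requests is a one-header change. A minimal WSGI sketch using only the standard interface; the XML body is a placeholder rather than a real DAS response, and a production server might restrict the allowed origin instead of using `*`.

```python
def annotation_app(environ, start_response):
    """Minimal WSGI app that serves XML and allows cross-origin requests."""
    body = b"<features/>"  # placeholder payload
    headers = [
        ("Content-Type", "text/xml"),
        ("Content-Length", str(len(body))),
        ("Access-Control-Allow-Origin", "*"),  # lets browser clients on other sites read it
    ]
    start_response("200 OK", headers)
    return [body]


# To try it locally:
#   from wsgiref.simple_server import make_server
#   make_server("", 8000, annotation_app).serve_forever()
```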

Some alternatives to DAS include dense binary formats like BAM, BigBed and BigWig with indexes for random access. Dalliance can support these directly. Nice interactive demo: quick, easy, and can drill all the way down to reads with BAM display.

Alex Kalderimis -- InterMine - Using RESTful Webservices for Interoperability

InterMine is a data warehouse framework for biological experiments and raw data: FlyMine, ModMine and more. The database is heavily de-normalised so is loaded and served as a read-only database: very performant. This is coupled with a read-write User database that references items in the data repository. Provide a web application interface to the repository, with custom query templates for biologists.

With the increase in number of InterMine instances, need ways to communicate between them. Use a REST API with clients for Java, Perl, JavaScript, Python and, soon, Ruby. This has a low threshold to usage, and can return data formats people are used to like tab delimited, but also structured formats for programs. Can use this to build automated workflows that query one mine, grab identifiers, then get data from another one. API for clients is improving to reduce boilerplate code required.

Some lessons learned: JSON is awesome; use GET/PUT to make it more browser friendly; fail loudly with HTTP or JSON error codes so you actually know if you have a problem.
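A rough sketch of what "fail loudly" can look like on both sides of a JSON web service. This is not the InterMine API: the field names (`results`, `error`, `statusCode`) and the function names are assumptions chosen for the example.

```python
import json


class ServiceError(Exception):
    """Raised by the client when the service reports a failure."""


def make_response(results=None, error=None, status=200):
    """Server side: build (status, body) so that errors are never silent."""
    if error is not None and status < 400:
        status = 500  # never pair an error message with a success status
    body = {
        "results": results if results is not None else [],
        "error": error,
        "statusCode": status,
    }
    return status, json.dumps(body)


def parse_response(status, body):
    """Client side: raise instead of quietly returning an empty result."""
    payload = json.loads(body)
    if status >= 400 or payload.get("error"):
        raise ServiceError("%s: %s" % (status, payload.get("error")))
    return payload["results"]
```

The idea is that a script chaining several services stops at the first failure rather than passing an empty list downstream.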

Bernat Gel -- easyDAS: Automatic creation of DAS servers

easyDAS is a small web server to make it easy to create a DAS server. DAS servers are meant to be easy, with smart clients using the simple servers. Most DAS servers provided by larger institutes; how to handle it if you are a small place without lots of resources?

easyDAS removes all the server configuration details, and you only need to upload a data file; it is hosted at EBI with a web interface. The maximum size is a million rows; not suitable for full genome base level information but for lots of other information.

Kostas Karasavvas -- Enacting Taverna Workflows through Galaxy

Taverna is a workflow management system; goal is to integrate with the Galaxy web framework. Taverna has a graphical interface to connect tools into a larger workflow. It provides a server that can run these workflows.

Implemented as a Ruby gem that makes a Galaxy tool from a workflow in MyExperiment, along with a connection to a Taverna server. Install the tool XML into Galaxy and then run.

Hervé Ménager -- Mobyle 1.0: new features, new types of services

Mobyle is a web user interface for running commandline tools. Also allows chaining of jobs into workflows. XML definitions are used for both. Also implemented viewers that visualize RNA structure, multiple alignments, phylogenetic trees. Can edit alignments directly in the web interface. For workflows, can run on LSF clusters to parallelize.

Junjun Zhang -- BioMart 0.8 offers new tools, more interfaces, and increased flexibility through plug-ins

BioMart is an open source federated data management system meant to make in-house data available online. It is built as a Java system with lots of good software engineering. The new version provides additional ways to query the data backend. Used in several large scale collaborations.

Bioinformatics Open Source Conference (BOSC) 2010: Day 2 afternoon

BOSC 2010 sadly wrapped up on Saturday afternoon after a great two days of talks, discussion and planning. Here are my notes from the afternoon sessions.

Simon Mercer -- Microsoft Biology Foundation

Simon will be presenting information about the Microsoft Biology Foundation and their new 1.0 release. Microsoft External Research brokers relationships between academic communities and Microsoft researchers. This collaboration process involves the development of reusable software that is often made available. Examples include the Ontology Add-in for Word, NodeXL that visualizes networks, 3D molecular viewer for PDB, Trident scientific workflow workbench that provides an interactive and commandline environment for developing workflows.

The goal was to bring these and other collaborative tools within Microsoft together into a framework: the Microsoft Biology Foundation. It is a set of reusable tools designed for the .NET platform. Looks like lots of useful stuff: standard representations, file parsing IO, algorithms and web services.

Clickframes -- Clickframes: rapid, validated development for clinical informatics

William is from Children's Hospital in Boston and Beacon 16 software. Clickframes provides a robust software modeling schema for MVC display, database access, user authentication: all of the nasty bits. Written in Java. Idea is to avoid large product requirement documents and take care of both modeling data and generating code for some of the nasty details. XML based language that folks can write their actual specifications in. Specs turn into interactive web based previews. XML also generates a flow diagram of the application. Tests are automatically generated in Selenium. Really saves a lot of the have-to-do development work so you can focus on the interesting parts.

Morris Swertz -- molgenis: database at the push of a button

Molgenis provides models of the biology and tries to autogenerate the background bits. Models are specified in a domain specific language that produces code and magic. It's Java based and has an XML language to specify what you want and are doing. Plugins can be used to add in Java code to handle specific tasks. Generates Java classes, tests, SQL and everything for web development on Tomcat. It has a nice interface to R, which uses a REST interface and allows retrieving data directly from the web form. Provides an RDF SPARQL query interface. Reuses models and tools from Galaxy under the covers for sharing.

Alexandros Kanterakis -- MOLGENIS and MAGE-TAB for microarrays

Idea is to use MOLGENIS to build a database for microarray and GWAS analysis: want to combine genotypic and phenotypic information for eQTL analysis. Data is stored in MAGE-TAB which provides a tab oriented form of microarray information. MAGE was translated into the MOLGENIS XML data model. Used MOLGENIS to produce a web based system for managing the database. Lots of endorsements for using MAGE-ML to model complicated experiment metadata.

Sebastian Schultheiss -- Persistence of bioinformatics web services

Looked at 927 web services to see how many are still available. 17% of the original published services are no longer active. Problematic since your scripts are no longer reproducible and comparable. Over time the publishing policies have become stricter and things do seem to be improving. On average 45% of original services are available and still seem to work with test data. 58% of the services are developed by students who are graduating and moving on, and 24% of the folks admitted that they are not planning to maintain the service.

Lincoln Stein -- Gbrowse2

GMOD provides the infrastructure and tools for model organism databases. Contains standard ontologies, schema, file formats, browsers and editors.

Gbrowse is the web-based genome browser part of GMOD. Image glyphs are configurable in the display, which allows users to provide organism specific things like pictures of worms, haplotype displays, time course RNA data.

Version 2.0 contains a lot of AJAX and javascript: dragging, zooming, support for SAM/BAM, BED, GFF, WIG, BigWig. Subtracks allow items to be organized into groups of tracks related to interesting top level items.

Behind the scenes, you can render tracks independently. JBrowse is the next generation Gbrowse.

Gary Bader -- Cytoscape web

Web based component which provides a scaled down version of Cytoscape. Made up of Flash + Javascript and is client-side only. Full customization is possible, generally it looks like an awesome version of cytoscape functionality on the web. It is more suitable for medium sized networks (less than 2000 elements).

Being used for several different clients: GeneMania, iRefWeb, Pathguide. Website features online demos. Uses jQuery for interaction.

Nobuaki Kono -- Pathway projector

A genome browser for pathway data in the style of google maps. Lots of google features: browsing, marking points, drawing graphs. This allows manual annotation with the Quikmaps javascript library. Info windows pop up while browsing with links to external resources.

James Morris -- Evoker: a visualization tool for genotype intensity data

Genome wide association studies: associate SNPs or other data with specific phenotypes, build up p-values based on allele differences, hopefully identifying signals that are significantly different. Need good quality control in GWAS to avoid false positives from poor quality DNA, population structure or hidden confounding artifacts.
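As an illustration of a p-value built from allele differences, here is a sketch of a simple allelic chi-square test on a 2x2 table of allele counts in cases and controls. The talk did not say which test is used, so treat this as one common choice rather than what Evoker or any specific GWAS pipeline does; the function name is made up for the example.

```python
import math


def allelic_chi_square(case_a, case_b, control_a, control_b):
    """Chi-square test (1 degree of freedom) on allele counts.

    Arguments are counts of allele A and allele B in cases and controls.
    Returns (chi2, p_value).
    """
    n = case_a + case_b + control_a + control_b
    margins = (
        (case_a + case_b)
        * (control_a + control_b)
        * (case_a + control_a)
        * (case_b + control_b)
    )
    if margins == 0:
        return 0.0, 1.0  # degenerate table: no evidence either way
    chi2 = n * (case_a * control_b - case_b * control_a) ** 2 / margins
    # For 1 degree of freedom the chi-square tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value
```

A badly clustered genotype call can shift these counts and produce a small p-value on its own, which is the kind of false positive the quality control step is meant to catch.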

Evoker provides the visualization components to assess these issues, integrating with large data stores. It's written in Java with Perl helper scripts. Fully interactive for zooming in and out and what not. Provides statistical plots to confirm good genotype calls and identify false positives.

Pavel Tomancak -- Fiji is just ImageJ

Fiji provides visualization of biological images and is a distribution of ImageJ. Two reasons for the project: first is that it's needed in the community and has had big uptake, second is that it's built around biological projects and provides community aspects. Fiji is targeted at biologists, bioinformaticians, software developers and vision researchers. It's batteries included to target it at biologists, and includes documentation and tutorials. Includes an API accessible from any JVM language. Code is developed under Git, with an emphasis on communication between developers and users. Developed an image library that allows researchers to write algorithms in a DSL and autogenerate Fiji code. An auto push updater was developed last summer during GSoC.

Iddo Friedberg -- IPRStats for visualization of InterProScan results

Use case for IPRScan: deal with the diversity of microorganisms and their health effects. Microbes live in complex communities which is what metagenomics studies. DNA isolated directly from environmental samples and annotating the samples is a problem. One approach is to use InterProScan, and then IPRStats provides visualization of InterProScan results.

Bioinformatics Open Source Conference (BOSC) 2010: Day 2 morning talks

Day 2 of the bioinformatics open source conference (BOSC) kicked off bright and early on Saturday with a very nice discussion from Ross Gardler about building open development communities.

Ross Gardler -- Community Development at the Apache Software Foundation

Ross is going to discuss how the community development system at Apache could be useful to open source communities in biology. Apache has 70 active projects and 30+ in development; there are 2500 regular contributors with commit access. The Apache foundation started in 1995 to fix up the UIUC server and became an official organization in 1999 to provide legal protection for members. The mission statement was broad and general: more about a way of doing things in an open manner than about specific projects. Foundation exists to get the legal nastiness and what not out of the way so folks can write code and documentation with minimal resistance. Apache provides indirect financial support. They don't pay for code but many developers are paid by third-parties to do work on Apache projects.

Apache is a meritocracy, and everyone has a voice and vote. Contribution of value produces merit within projects. If you earn merit in multiple projects then you can earn membership at the foundation level. Consensus is made via debate and code, although occasionally a vote is required via the mailing list. The rule of lazy consensus is that trusted folks can just code away: once you have code you can evaluate it more easily and move forward if everyone agrees.

Growing the foundation from the original Apache server to the 70+ projects has been a challenge. Jakarta was developed as an umbrella sub-project underneath Apache, which had some failures; the organization was modified to keep a flat structure without any umbrella projects. This allowed projects to be reviewed by the Apache folks who have lots of experience evaluating development communities. The Apache foundation doesn't consider technical issues, but rather things like stagnating communities, undue commercial influence and other potential problems.

What are the characteristics of a good Apache project? Diversity -- at least 3 committers unrelated to each other outside the project. Fully audited code for IP issues, which makes the work more palatable to companies who are contributing. Projects should be generic and reusable, so the component parts are available. Idea is that the components can be used outside of your field so you can build a wider community.

How do you scale the community? More projects brings in additional volunteers and doesn't stress the overhead too much, but creates the potential for dilution of the Apache foundation values and brand. The flat structure gives power to new members since there are low barriers to entry, but this can result in the blind leading the blind. However, hierarchy is inefficient. Peer review is one of the answers to helping the community self-regulate.

Mentoring helps bring new folks into Apache. In the incubator, mentors guide new project teams and teach them the Apache way. Google Summer of Code brings in some community members, and the Apache mentoring project goes beyond this to provide mentoring on a year round basis.

Summary of lessons: the foundation should handle the brand, infrastructure, and legal aspects of projects. This also allows for cross project community discussions. The project handles technical issues and contributors. Lazy consensus is used to avoid management by committee and keep the power in the hands of the people who do things. Need to think how to generalise your project components and get outside of your niche. Excellent things to think about for the biology community where we are used to trying to specialize.

Chris Fields -- BioPerl

Chris will talk about current things happening in BioPerl, and then focus on some changes that are happening in the community: making things easier for new users, using modern Perl features and dealing with BioPerl being monolithic. BioPerl has been around since 1996 and has an impressive number of current and past contributors. Lincoln Stein next-gen tools: Bio-SamTools, Bio-BigFile which are separate CPAN distributions. Gbrowse talk later.

Summer of code happening for the 3rd year. The alignment subsystem is being cleaned up to include the capability to deal with large datasets via indexing and reduced in memory representations.

Moving forward, how can the current code be improved and modularized? To lower the barrier to entry, the BioPerl repository was migrated to GitHub. The monolithic nature of BioPerl makes things very hard to maintain and release. One idea is to make BioPerl a front end installer that adds specific individual packages based on interests and needs. Have an initial prototype using Moose for BioPerl objects, and for BioPerl on Perl 6.

Raoul Bonnal -- BioRuby

Overview of Ruby itself: a nice language with object orientation, functional aspects and reflection. BioRuby works with both standard C Ruby and JRuby. Last BioRuby update presentation was 2008 BOSC, and have tons of development including 3 Hackathons and 1 Codefest. New features include support for BioSQL which allows interoperable storage of sequences, PhyloXML support from a GSOC project, Fastq parsing support, NCBI REST access, and TogoWS support.

BioRuby has frequent meetings via mail, skype and IRC. Very strict requirement for tests as they continue to move to an agile programming style. BioRuby has a plugin system with standard naming scheme: bioruby-plugin-NAME. Provide a script interface to download and install plugins.

Peter Rice -- EMBOSS

EMBOSS received continued funding last year which allowed new development as opposed to bug fix and maintenance releases over the previous two years. EMBOSS aims at both developers and end users, and is targeted at the commandline. There are over 100 interfaces including Galaxy. New release supports BAM and other new next gen features. 3 open source books are coming out soon, which will lock down much of the library functionality.

Fastq and other parsing was improved by thinking about truncated failure cases and building up a standard set of problem cases. New EMBOSS accesses BioMart and ENSEMBL. New planned are DAS, GMOD and BioSQL. Provide a standard definition format for defining databases; awesome way to avoid re-doing all of the specific process. Other new planned features include improved Ontology support.

Tiago Antao -- population genetics in Python

HapMap project develops a haplotype map of the human genome: 11 populations, 90-180 individuals in each. It contains SNPs, CNVs, genotypes, pedigree info. UCSC known genes are most useful for overlapping with data from HapMap. Python library accesses both HapMap and UCSC with Biopython, matplotlib, GenePop and Entrez data. Ensembl Variation API covers a similar area in Perl.

Structure is SQLite based: remote data is downloaded once and stored and indexed. Interface examples look straightforward for retrieval and querying. Very nice demonstration plots of data with matplotlib.
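The download-once, store-and-index pattern can be sketched with the standard library `sqlite3` module. This is not the library's actual schema or interface: the table layout, the function names and the `fetch` callback (standing in for the remote download) are all assumptions for the example.

```python
import sqlite3


def open_cache(path=":memory:"):
    """Open (or create) the local cache with an index for region queries."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS snps "
        "(rsid TEXT PRIMARY KEY, chrom TEXT, pos INTEGER)"
    )
    conn.execute("CREATE INDEX IF NOT EXISTS snps_by_pos ON snps (chrom, pos)")
    conn.execute("CREATE TABLE IF NOT EXISTS loaded (source TEXT PRIMARY KEY)")
    return conn


def ensure_loaded(conn, source, fetch):
    """Call fetch() only the first time a source is requested.

    fetch is a hypothetical callable returning (rsid, chrom, pos) tuples.
    Returns True if data was downloaded, False if the cache was used.
    """
    row = conn.execute(
        "SELECT 1 FROM loaded WHERE source = ?", (source,)
    ).fetchone()
    if row:
        return False
    with conn:  # commit both inserts together
        conn.executemany("INSERT OR REPLACE INTO snps VALUES (?, ?, ?)", fetch())
        conn.execute("INSERT INTO loaded VALUES (?)", (source,))
    return True


def snps_in_region(conn, chrom, start, end):
    """Query the local copy; the (chrom, pos) index keeps this fast."""
    rows = conn.execute(
        "SELECT rsid, pos FROM snps "
        "WHERE chrom = ? AND pos BETWEEN ? AND ? ORDER BY pos",
        (chrom, start, end),
    )
    return rows.fetchall()
```

Passing a file path instead of `:memory:` keeps the cache between sessions, which is where the download-once behaviour pays off.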

Laurent Gautier -- Bioconductor and Python

Provides a way to natively access libraries implemented in R. Bioconductor is one really useful target for biologists: tons of open source packages in R. Laurent shows an awesome diagram of the biological data landscape: what Python handles well and what R/Bioconductor handles well. R is heavily statistical while Python is more focused on data processing.

Idea of rpy2 is to bridge the Python and R communities. Community wise, this lets interpreters develop that can share the usefulness of each separate community. Nice example of using edgeR from python to look at differential expression of RNA-seq data.

Eric Talevich -- Bio.Phylo package in python

Eric developed a phylogenetics library for Biopython that makes it easy to explore tree data. There are a bunch of phylogenetics formats: Newick, Nexus, PhyloXML and NeXML.

Eric provides a demo of using PhyloXML to parse a Newick tree, visualize it in multiple ways: text tree, networkx style graphical trees. With PhyloXML you can specify attributes of a tree and annotate it, and then store all this in the XML format. Easy to promote standard Newick to the more representative PhyloXML.

Bioinformatics Open Source Conference (BOSC) 2010: Day 1 afternoon talks

These are my notes on the afternoon talks from BOSC 2010.

Steffen Moeller -- Community-driven computational biology with Debian and Taverna

Steffen describes the DebMed initiative to provide Debian/Ubuntu packages for biological programs. How can this be generalized to cloud instances? Taverna provides the ability to generalize tools as web services and avoid some of the burden of installing packages.

Final idea is shared public data which can be made available on cloud images that would work on Eucalyptus. Really good idea to have generalized data but not sure about technical aspects of providing images across providers.

Darin London -- Dealing with the data deluge: what can the robotics community teach us?

Dealing with 50+ cell lines sequenced with multiple ChIP-seq antibodies. How to best manage this? Next gen data is very heterogeneous across time and types of data.

Can we think up any good ideas for dealing with this type of data by looking at things the robotic community has done? Behavior-based robots act via independent modules modeled after biological activity. Systems are fault tolerant since different modules can pick up when others fail to act. Can parallelize this since individual modules act autonomously instead of needing to be serialized.

One useful idea is to predict when problems might happen with running out of disk space or memory based on the system parameters.
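A small sketch of that kind of check: estimate how much output a step will produce and compare it with the free space before launching. The function names, the expansion ratio and the safety factor are assumptions for illustration, not part of the pipeline described in the talk.

```python
import shutil


def estimate_output_bytes(input_bytes, expansion_ratio):
    """Guess output size from input size, using a ratio measured on past runs."""
    return int(input_bytes * expansion_ratio)


def has_room_for(path, expected_bytes, safety_factor=1.2):
    """Return True if the filesystem holding path can take the expected output.

    safety_factor leaves some headroom for temporary files.
    """
    free = shutil.disk_usage(path).free
    return free >= expected_bytes * safety_factor
```

An agent could run this before starting a processing step and hold the task back, or alert a human, when the check fails.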

Developed a pipeline to generate data for ENCODE. Three types of agents: runner agents, processing agents launched by runner agents, and human agents. The task list is developed in a Google spreadsheet. By adding tasks to the spreadsheet, can control the agents. Available as a Perl module on CPAN.

Nyasha Chambwe -- Goby framework

One issue with scaling and dealing with data is the proliferation of biological file formats. What are the desirable characteristics of file formats? Well specified, easy to parse, compression and streaming. Developed new file formats for next gen data, along with a framework to analyze them.

Goby uses protocol buffers to provide a flexible and efficient mechanism for serializing. The data is defined as a message in a proto file. File is chunked and each region can be gzipped for random access to each region.
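The chunk-and-compress idea can be sketched without protocol buffers. The example below is not the Goby format: it stores newline-joined byte records instead of serialized messages, and the in-memory index of `(offset, length)` pairs is an assumption about how random access could be arranged.

```python
import gzip


def write_chunked(records, chunk_size, out):
    """Write records in independently gzipped chunks.

    records: list of bytes objects without newlines.
    out: a seekable binary file object.
    Returns an index: one (offset, compressed_length) pair per chunk.
    """
    index = []
    for start in range(0, len(records), chunk_size):
        payload = b"\n".join(records[start:start + chunk_size])
        compressed = gzip.compress(payload)
        index.append((out.tell(), len(compressed)))
        out.write(compressed)
    return index


def read_chunk(inp, index, chunk_number):
    """Decompress a single chunk without touching the rest of the file."""
    offset, length = index[chunk_number]
    inp.seek(offset)
    return gzip.decompress(inp.read(length)).split(b"\n")
```

Because each chunk is a complete gzip stream, a reader interested in one region only pays for decompressing that region.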

Demonstrate a full pipeline for RNA-seq analysis using Goby file formats.

Dana Robinson -- BioHDF

Goals are to create a data model to describe data, a store to allow for efficient retrieval, and a toolkit for development.

BioHDF is a database schema in HDF for storing biological data, and a library and C API which are coming, and commandline tools similar to samtools. Reads are stored in a hierarchical manner by reads and alignments. Information stored is: reads, alignments, annotations, clusters of aligned reads, reference sequences and indexes. Additional user specific data can be stored.

One exciting development that is being discussed on the samtools mailing list is switching the underlying representation of BAM to HDF and abstracting it out with a higher C API.

Jens Lichtenberg -- Concurrent bioinformatics software for discovering genome-wide patterns

WordSeeker -- a tool that does motif discovery: enumerate the word space using suffix and radix trees, score the motifs, cluster them based on word sizes, evaluate conservation analysis using phastCons scores from UCSC, look for biased distribution of motif locations.
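The first step, enumerating the word space, can be sketched with a plain counter. This is a simplification, not WordSeeker's implementation: it uses a dictionary rather than suffix or radix trees, and the count threshold stands in for real motif scoring.

```python
from collections import Counter


def count_words(sequence, word_length):
    """Count every overlapping word of the given length in a sequence."""
    sequence = sequence.upper()
    last_start = len(sequence) - word_length + 1
    return Counter(sequence[i:i + word_length] for i in range(last_start))


def overrepresented(counts, min_count):
    """Return the words seen at least min_count times, sorted alphabetically."""
    return sorted(word for word, n in counts.items() if n >= min_count)
```

For example, `count_words("ACGACGT", 3)` counts `ACG` twice and `CGA`, `GAC` and `CGT` once each. The dictionary approach becomes memory hungry for long words on whole genomes, which is one reason tree structures and parallel enumeration matter.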

Scalable approach is necessary to parallelize the enumeration of all words. Similarly for scoring need to do frequent lookups. Can be scaled via MPI for distributed memory processing or OpenMP for shared memory machines. Presented timing data for analysis on Arabidopsis genome.

Chris Hemmerich -- Automated Annotation of NGS Transcriptome Data using ISGA and Ergatis

Ergatis is a workflow management tool for running pipelines. Integrative Services for Genomic Analysis (ISGA) is a biologist's tool for running and customizing Ergatis pipelines. It provides a graphical interface for setting up a pipeline and customizing input parameters. A specific transcriptome pipeline example is presented.

Mark Wilkinson -- SADI

Mark discusses his semantic web solution for pulling together web services to make it easy to ask complex questions. Idea is to support scientific method and discussion where we have opinions and debate: not necessarily 100% about what something means. General notion is to create OWL ontologies that help define expressed hypotheses.

Aravind Venkatesan -- Bio-Ontologies in Galaxy

ONTO-Toolkit is a collection of tools to manage ontologies represented in the OBO file format. Wraps ONTO-PERL which provides a high level API for querying ontologies. Two use cases:

  • Investigate the similarities between two different molecular functions. Look upstream of both and see how many of their ancestor terms are shared. Most specific common term can be used to assess this.

  • Identify overlapping annotations for a given pair of distinct biological process terms. Look for overlap between two distinct biological processes.
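Both use cases reduce to set operations over the ontology graph. The sketch below does not use ONTO-PERL or the OBO format: the term names, the child-to-parents dictionary and the annotation table are toy data invented for the example.

```python
# Toy is_a graph: each term maps to its direct parents (made-up term names).
PARENTS = {
    "protein_kinase": ["kinase"],
    "lipid_kinase": ["kinase"],
    "kinase": ["catalysis"],
    "catalysis": ["function"],
    "binding": ["function"],
}


def ancestors(term, parents):
    """All terms reachable by following parent links upward."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def shared_ancestors(term_a, term_b, parents):
    """Use case 1: ancestor terms common to both terms."""
    return ancestors(term_a, parents) & ancestors(term_b, parents)


def most_specific_common(term_a, term_b, parents):
    """Shared ancestors that are not an ancestor of another shared ancestor."""
    shared = shared_ancestors(term_a, term_b, parents)
    return {
        term
        for term in shared
        if not any(term in ancestors(other, parents) for other in shared if other != term)
    }


def overlapping_annotations(annotations, term_a, term_b):
    """Use case 2: genes annotated to both terms.

    annotations maps a term to the set of gene identifiers annotated with it.
    """
    return annotations.get(term_a, set()) & annotations.get(term_b, set())
```

With the toy graph, the two kinase terms share `kinase`, `catalysis` and `function`, and `kinase` is the most specific of the three.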

Christian Zmasek -- Connecting TOPSAN to computational analysis

The Open Protein Structure Annotation Network (TOPSAN). Structures are available in PDB but very little annotation about them beyond the PDB titles. So TOPSAN provides a database for community annotation of proteins.

Most annotations are entered by humans, but structured data can also be provided in a simple format, TOPSAN Protein Syntax (TPS). This is an RDF triple of protein, predicate (homologous, encodedby, citation, memberof) and the value.

Jianjiong Gao -- Musite: Global Prediction of General and Kinase-Specific Phosphorylation Sites

Musite is an open source tool for protein phosphorylation prediction. Disordered regions typically have phosphorylation regions, so may also be useful for evaluating protein disorder.