<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Datablend &#187; mutation data</title>
	<atom:link href="http://datablend.be/?cat=19&#038;feed=rss2" rel="self" type="application/rss+xml" />
	<link>http://datablend.be</link>
	<description>Big Data Simplified</description>
	<lastBuildDate>Mon, 07 Sep 2015 09:04:17 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.6.1</generator>
		<item>
		<title>Big Data Genomics &#8211; How to efficiently store and retrieve mutation data</title>
		<link>http://datablend.be/?p=246</link>
		<comments>http://datablend.be/?p=246#comments</comments>
		<pubDate>Fri, 24 Jun 2011 09:31:19 +0000</pubDate>
		<dc:creator>Davy Suvee</dc:creator>
				<category><![CDATA[cassandra]]></category>
		<category><![CDATA[genomics]]></category>
		<category><![CDATA[mutation data]]></category>
		<category><![CDATA[NoSQL]]></category>

		<guid isPermaLink="false">http://datablend22.lin3.nucleus.be/?p=246</guid>
		<description><![CDATA[[information] This blog post is the first one in a series of articles that describe the use of NoSQL databases to efficiently store and retrieve mutation data. 1. Part one introduces the notion of mutation data and describes the conceptual use of the Cassandra NoSQL datastore. [/information] The only way to learn a new technology<p><a href="http://datablend.be/?p=246">Continue Reading →</a></p>]]></description>
				<content:encoded><![CDATA[[information]
<p style="text-align: justify;">This blog post is the first one in a series of articles that describe the use of NoSQL databases to efficiently store and retrieve mutation data. </p>
<ol>
<li><a target='_blank' href="http://datablend.be/?p=202">Part one</a> introduces the notion of mutation data and describes the conceptual use of the Cassandra NoSQL datastore.</li>
</ol>
[/information]
<p style="text-align: justify;">The only way to learn a new technology is by putting it into practice. Just try to find a suitable use case in your immediate working environment and give it a go. In my case, it was trying to efficiently store and retrieve mutation data through a variety of NoSQL data stores, including <a target='_blank' href="http://cassandra.apache.org">Cassandra</a>, <a target='_blank' href="http://www.mongodb.org">MongoDB</a> and <a target='_blank' href="http://www.neo4j.org">Neo4j</a>.</p>
<p>&nbsp;</p>
<h3>1. What is mutation data?</h3>
<p style="text-align: justify;"><span class="highlight">DNA</span>, or <i>deoxyribonucleic acid</i>, is the hereditary material that defines an organism. DNA information is stored as a code made up of four chemical bases: <span class="highlight">adenine</span> (A), <span class="highlight">guanine</span> (G), <span class="highlight">cytosine</span> (C), and <span class="highlight">thymine</span> (T). Human DNA, for instance, contains about 3 billion bases. The order, or sequence, of these bases defines the information available for building and maintaining an organism, similar to the way in which letters of the alphabet appear in a certain order to form words and sentences.</p>
<p style="text-align: justify;">A <span class="highlight">mutation</span> is a change in the DNA sequence of an organism. Mutations can happen for various reasons, the most common one being an error when DNA material is being copied. If the <span class="highlight">reference</span> DNA sequence is available for a particular organism, one can try to identify the mutations between this reference and the DNA material that is extracted from a similar organism. Different types of mutations are possible. A <span class="highlight">point mutation</span> replaces one base with another. In the example below, two <span class="highlight">point mutations</span> can be identified: the reference T base at position 2 mutated into a G base and the reference C base at position 10 mutated into a T base. At position 5, a <span class="highlight">deletion</span> is found: the reference sequence has a T base which cannot be found in the sample sequence. The inverse is also possible: the sample sequence <span class="highlight">inserts</span> a G base at position 6.</p>
<p>&nbsp;</p>
<p align="center"><img src="http://datablend.be/wp-content/uploads/mutations.jpg" alt="Mutation example" /></p>
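<p style="text-align: justify;">The comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two sequences are hypothetical and already aligned, a dash marks a gap, positions are counted as 1-based alignment columns, and the labels <i>MUT</i>, <i>DEL</i> and <i>INS</i> are names chosen for this sketch.</p>

```python
GAP = "-"  # assumed gap symbol in a pair of pre-aligned sequences


def find_mutations(reference, sample):
    """List (position, kind, base) for every difference between two aligned sequences."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    found = []
    for position, (ref_base, sample_base) in enumerate(zip(reference, sample), start=1):
        if ref_base == sample_base:
            continue  # no mutation at this position
        if sample_base == GAP:
            found.append((position, "DEL", ref_base))  # base missing from the sample
        elif ref_base == GAP:
            found.append((position, "INS", sample_base))  # extra base in the sample
        else:
            found.append((position, "MUT", sample_base))  # point mutation
    return found


# Hypothetical aligned sequences that reproduce the four mutations described above.
reference = "ATCAT-AGAC"
sample = "AGCA-GAGAT"
print(find_mutations(reference, sample))
# [(2, 'MUT', 'G'), (5, 'DEL', 'T'), (6, 'INS', 'G'), (10, 'MUT', 'T')]
```

<p style="text-align: justify;">Real sequencing data first needs an alignment step to produce such gapped sequences; that step is outside the scope of this sketch.</p>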
<p>&nbsp;</p>
<p style="text-align: justify;">Scientific insights can be gained by observing how organisms mutate over time. <span class="highlight">Antiviral drug research</span>, for instance, tries to identify which mutations can be candidates for applying fake DNA building blocks. If a virus can be tricked into mistakenly incorporating these fake building blocks, the effects of the virus can be reduced. A question antiviral drug researchers often want answered is the <span class="highlight">mutation frequency</span>: the number of times a given mutation occurs in a large population over a certain period of time. As antiviral drug researchers are typically dealing with millions of mutations, it is the ideal use case for playing around with Big Data and NoSQL.</p>
<p>&nbsp;</p>
<h3>2. Cassandra as a mutation datastore?</h3>
<p style="text-align: justify;">When working with a <span class="highlight">relational database</span>, the first thing you do is model your data. A well-defined database model allows you to query its data through SQL queries. Unfortunately, a fully <i>normalized</i> model degrades your performance when joins need to be executed on tables that contain millions of rows. To improve performance, Cassandra advocates a <span class="highlight">query-first</span> approach, where you first identify your queries and then model your data accordingly. In the next couple of paragraphs, we will gradually explore the Cassandra data structures by developing the mutation data model. Remember, what we are trying to achieve is to be able to quickly calculate <span class="highlight">mutation frequencies</span>!</p>
<h5>2.1 Columns</h5>
<p style="text-align: justify;">A <span class="highlight">column</span> is Cassandra&#8217;s smallest data container. In essence, it is just a <span class="highlight">key-value</span> pair tagged with a <span class="highlight">timestamp</span>. (Don&#8217;t worry about the timestamp. It&#8217;s not really relevant for this discussion.)</p>
<p align="center"><img src="http://datablend.be/wp-content/uploads/column1.jpg" alt="column"/></p>
<p style="text-align: justify;">Our mutation data contains thousands of sample sequences. For each individual sequence, we would like to save several properties, including the DNA sequence, the origin and the sequencing method.</p>
<p align="center"><img src="http://datablend.be/wp-content/uploads/sequence1.jpg" alt="sequence" /></p>
<p style="text-align: justify;">Creating columns is only the first step. You will probably also want to group the columns that specify the properties of a single sequence. This grouping of columns is provided through the use of <span class="highlight">column families</span>.</p>
<h5>2.2 Column families</h5>
<p style="text-align: justify;">A <span class="highlight">column family</span> is a container holding a number of <span class="highlight">rows</span>, each row referring to a <span class="highlight">collection of columns</span>.</p>
<p align="center"><img src="http://datablend.be/wp-content/uploads/columnfamily.jpg" alt="columnfamily" /></p>
<p style="text-align: justify;">Contrary to relational database tables, column families are <span class="highlight">schema-less</span>. When a new row is created (or updated), you can add as many columns (i.e. key-value pairs) as you see fit. This allows applications to work with data in a very flexible way and enables your schema to evolve organically as new requirements pop up. Think of a column family as a <span class="highlight">HashMap of HashMaps</span> where a unique row key refers to a set of columns, each defined by a specific column key. For each column family, you need to define how the columns contained within a row are <span class="highlight">sorted</span>. Out of the box, Cassandra provides support for <i>BytesType</i>, <i>UTF8Type</i>, <i>LexicalUUIDType</i>, <i>TimeUUIDType</i>, <i>AsciiType</i>, and <i>LongType</i>. Hence, whenever you retrieve the columns associated with a particular row, you can expect your results to be sorted as specified. If the built-in sorting types are not sufficient, you can always <span class="highlight">plug in</span> your own implementation.</p>
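<p style="text-align: justify;">To make the HashMap-of-HashMaps picture concrete, here is a toy in-memory model in Python. It is not the Cassandra API: the class name, its methods, the row keys and the column values are all made up for this sketch, and the <i>sort_key</i> argument merely stands in for the configured sorting type.</p>

```python
class ColumnFamily:
    """Toy model of a column family: each row key maps to a dict of columns."""

    def __init__(self, sort_key=None):
        self.rows = {}
        self.sort_key = sort_key  # stands in for the sorting type; None sorts names naturally

    def insert(self, row_key, column_name, value, timestamp=0):
        # Rows are created on first use and may hold any set of columns.
        self.rows.setdefault(row_key, {})[column_name] = (value, timestamp)

    def get_row(self, row_key):
        # Columns always come back in the order defined by the sort key.
        columns = self.rows.get(row_key, {})
        return [(name, columns[name][0]) for name in sorted(columns, key=self.sort_key)]


sequences = ColumnFamily()
sequences.insert("seq-1", "sequence", "AGCAGAGAT")
sequences.insert("seq-1", "origin", "sample A")
sequences.insert("seq-1", "method", "method X")
sequences.insert("seq-2", "sequence", "ATCATAGAC")  # no origin or method: no schema to violate

print(sequences.get_row("seq-1"))
# [('method', 'method X'), ('origin', 'sample A'), ('sequence', 'AGCAGAGAT')]
print(sequences.get_row("seq-2"))
# [('sequence', 'ATCATAGAC')]
```

<p style="text-align: justify;">The two rows hold different sets of columns, which is exactly the flexibility a schema-less column family offers.</p>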
<p style="text-align: justify;">Our mutation datastore features several column families. Let&#8217;s first have a look at our <i>sequence column family</i>.</p>
<p align="center"><img src="http://datablend.be/wp-content/uploads/squences.jpg" alt="sequences" /></p>
<p style="text-align: justify;">Each row needs to have a unique key. In case of the sequence column family, we use <i>TimeUUID</i>s as unique keys. For each sequence, this unique <i>TimeUUID</i> is generated on the basis of the sequence date. (Why we do this will become clearer when we talk about our <i>mutation column family</i>.) Sorting-wise, we don&#8217;t really care how the associated columns are ordered. Hence, we just use <i>UTF8Type</i> sorting. As can be observed from the picture, not all properties are available for each sequence. As no schema needs to be defined, Cassandra allows us to deal with this very easily.</p>
<p style="text-align: justify;">The <i>mutation column family</i> features two Cassandra design patterns, namely <span class="highlight">Aggregate Key</span> and <span class="highlight">Valueless Column</span>. Remember that we want to be able to quickly retrieve all sequences that contain a particular mutation. Hence, let&#8217;s make things easy: we define the key of the <i>mutation column family</i> as the aggregation of all relevant mutation details (being <i>position</i>, <i>type</i> and <i>base</i>). If we want to retrieve all <span class="highlight">point mutations to base G at position 2</span>, we just need to fetch the row with aggregate key <span class="highlight">2-MUT-G</span>. Each row refers to the <i>list</i> of sequences (i.e. <i>TimeUUID</i>s) that contain the specific mutation. As we are basically using the row as a list, no <span class="highlight">meaningful column values</span> are associated.</p>
<p align="center"><img src="http://datablend.be/wp-content/uploads/mutationsfamily.jpg" alt="mutationcolumns" /></p>
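<p style="text-align: justify;">The two patterns can be sketched in plain Python as follows. The <i>MutationIndex</i> class and its method names are hypothetical, and <i>uuid.uuid1()</i> is used only as a stand-in: it takes its timestamp from the current clock, whereas the model above derives each <i>TimeUUID</i> from the sequence date.</p>

```python
import uuid


def aggregate_key(position, mutation_type, base):
    """Join the mutation details into one row key, for example 2-MUT-G."""
    return f"{position}-{mutation_type}-{base}"


class MutationIndex:
    """Toy mutation column family: one row per mutation, one valueless column per sequence."""

    def __init__(self):
        self.rows = {}

    def add(self, position, mutation_type, base, sequence_id):
        row = self.rows.setdefault(aggregate_key(position, mutation_type, base), {})
        row[sequence_id] = b""  # valueless column: the column name itself is the data

    def sequences_with(self, position, mutation_type, base):
        # A single row lookup returns every sequence that carries the mutation.
        return list(self.rows.get(aggregate_key(position, mutation_type, base), {}))


# Two hypothetical sequences; real keys would be derived from the sequence date.
seq_a, seq_b = uuid.uuid1(), uuid.uuid1()

index = MutationIndex()
index.add(2, "MUT", "G", seq_a)
index.add(2, "MUT", "G", seq_b)
index.add(10, "MUT", "T", seq_a)

print(index.sequences_with(2, "MUT", "G") == [seq_a, seq_b])  # True
```

<p style="text-align: justify;">Because all mutation details are folded into the row key, finding the sequences for one mutation never requires a join or a scan over other rows.</p>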
<p style="text-align: justify;">We still need to specify an ordering for the <i>mutation column family</i>. As the column keys are <i>TimeUUID</i>s, <i>TimeUUIDType</i> sorting is applied. Consequently, all sequences containing the particular mutation are sorted time-wise. This design rationale allows us to fully leverage the Cassandra platform. In order to calculate the frequency of a particular mutation during a certain time period, we just need to fetch the relevant mutation row and perform a <span class="highlight">range slice query</span>. This range slice query takes as input a lower and upper bound value (in our case the UUID representations of the start and end date of the time period) and is able to quickly retrieve the sequences within that range.</p>
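<p style="text-align: justify;">The sketch below imitates such a range slice in memory, again in Python and without touching Cassandra itself. The helper names <i>time_uuid</i>, <i>range_slice</i> and <i>day</i>, as well as the three sample dates, are assumptions made for illustration. The helper builds version 1 UUIDs from a date with the standard <i>uuid</i> module, and the slice relies on the columns already being sorted by timestamp.</p>

```python
import bisect
import uuid
from datetime import datetime, timezone

# Version 1 UUIDs count 100-nanosecond ticks since the start of the Gregorian calendar.
GREGORIAN_EPOCH = datetime(1582, 10, 15, tzinfo=timezone.utc)


def time_uuid(moment, clock_seq=0, node=0):
    """Build a version 1 UUID whose timestamp encodes the given datetime."""
    delta = moment - GREGORIAN_EPOCH
    ticks = (delta.days * 86400 + delta.seconds) * 10_000_000 + delta.microseconds * 10
    time_low = ticks % 2**32
    time_mid = (ticks // 2**32) % 2**16
    time_hi_version = 0x1000 + (ticks // 2**48) % 2**12  # 0x1000 marks version 1
    clock_seq_hi_variant = 0x80 + (clock_seq // 256) % 64  # 0x80 marks the RFC 4122 variant
    clock_seq_low = clock_seq % 256
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version,
                             clock_seq_hi_variant, clock_seq_low, node))


def range_slice(row, start, end):
    """Return the columns of a time-sorted row that fall between start and end (inclusive)."""
    times = [column.time for column in row]
    first = bisect.bisect_left(times, time_uuid(start).time)
    last = bisect.bisect_right(times, time_uuid(end).time)
    return row[first:last]


def day(year, month, date):
    return datetime(year, month, date, tzinfo=timezone.utc)


# Hypothetical row 2-MUT-G: three sequences, column names sorted by timestamp.
row = sorted(
    [time_uuid(day(2011, 6, 20), node=3), time_uuid(day(2011, 1, 10), node=1),
     time_uuid(day(2011, 3, 5), node=2)],
    key=lambda column: column.time,
)

in_range = range_slice(row, day(2011, 2, 1), day(2011, 6, 30))
print(len(in_range))  # 2
```

<p style="text-align: justify;">The length of the slice is the mutation frequency for the chosen period. Since the columns are kept in time order, the lookup only has to locate the two bounds instead of inspecting every sequence.</p>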
<h5>2.3 Key spaces</h5>
<p style="text-align: justify;">A <span class="highlight">key space</span> is Cassandra&#8217;s outermost data container. It basically combines several <i>column families</i> in one logical space. This is similar to a relational database schema containing multiple tables.</p>
<p>&nbsp;</p>
<h3>3. Technical implementation</h3>
<p style="text-align: justify;">Using the concepts explained in this article, a <span class="highlight">mutation exploration tool</span> was developed that allows scientists to query the frequency of specific mutations and compare individual results. Although the database contains millions of mutations, queries are executed blazingly fast.</p>
<p align="center"><a target='_blank' href="http://datablend.be/wp-content/uploads/muttool_large.jpg"><img width="550" src="http://datablend.be/wp-content/uploads/muttool.jpg" alt="mutation tool" /></a></p>
<p>&nbsp;</p>
<h3>4. Conclusion</h3>
<p style="text-align: justify;">This concludes the first article on saving mutation data using NoSQL datastores. The next article will provide deeper insights into how the Cassandra data model explained above is technically implemented. Looking forward to your remarks and comments!</p>
<p></p>]]></content:encoded>
			<wfw:commentRss>http://datablend.be/?feed=rss2&#038;p=246</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
