Tutorial with SQL

The sql file is essentially an excel document with the following tables, and those tables with it's own tables

Table Name Description
users Contains and id, sample, height, weight, blood, gender, and race.
ancestry_supop Contains and id, sample and the super population codes from 1000 genomes project.
ancestry_pop Contains and id, sample, and populaiton codes from 1000 genomes project
ancestry_jkp Contains and id, sample and ancestry codes from joe pickrells ancestry program.

users Table

Field Description
id The id of the record.
sample The sample of the user
height The height of the user
weight The weight of the user
blood The blood type of the user
gender The gender of the user
race The reported race of the user

ancestry_supop Table

Field Description
id The id of the record.
sample The sample of the file
amr The Americas population from the 1000 genomes.
afr The Africa population from the 1000 genomes.
eur The European population from the 1000 genomes.
eas The East Asia population from the 1000 genomes.
sas The South Asia population from the 1000 genomes.

ancestry_pop Table

These are the population categories from 1000 genomes project.

Field Description
id The id of the record.
sample The sample of the file
CHB The Han Chinese in Bejing, China population
JPT The Japanese in Tokyo, Japan population
CHS Southern Han Chinese
CDX Chinese Dai in Xishuangbanna, China
KHV Kinh in Ho Chi Minh City, Vietnam
CEU Utah Residents (CEPH) with Northern and Western Ancestry
TSI Toscani in Italia
FIN Finnish in Finland
GBR British in England and Scotland
IBS Iberian Population in Spain
YRI Yoruba in Ibadan, Nigeria
LWK Luhya in Webuye, Kenya
GWD Gambian in Western Divisions in the Gambia
MSL Mende in Sierra Leone
ESN Esan in Nigeria
ASW Americans of African Ancestry in SW USA
ACB African Caribbeans in Barbados
MXL Mexican Ancestry from Los Angeles USA
PUR Puerto Ricans from Puerto Rico
CLM Colombians from Medellin, Colombia
PEl Peruvians from Lima, Peru
GIH Gujarati Indian from Houston, Texas
BEB Bengali from Bangladesh
STU Sri Lankan Tamil from the UK
ITU Indian Telugu from the UK

ancestry_jkp Table

Field Description
id The id of the record.
sample The sample of the file

About the sample ID:

Some users from the personal genome project had multiple files so the file associated was written in the following format:

alt text

The human_id can be use to view the profile at the personal genome project.
By using the following link:

Downloading the sqlite database

Here is the command to download the database:

$ wget

Here it is using curl:

$ curl -O

Viewing the database

To use the the sqlite database, sqlite3 must be installed.

$ sqlite3 repgp-data.sqlite3
SQLite version 3.8.2 2013-12-06 14:53:30 Enter ".help" for instructions Enter SQL statements terminated with a ";" sqlite>

The following query will show the samples in the users table:

sqlite> SELECT sample FROM users;

To quit from the program enter:

sqlite> .quit

Use the following command to change the mode by columns

sqlite> .mode column

Executing the sql query will then display as this:

sqlite> SELECT sample,height FROM users;
pgp_hu43860C_41  195
pgp_hu2FEC01_29  175
pgp_hu34D5B9_67  180
.                .
.                .
.                .

Use the following command to display the header:

sqlite> .header on

Executing the sql query will then display as this:

sqlite> SELECT sample,height,blood FROM users;
sample           height      blood
---------------  ----------  ----------
pgp_hu43860C_41  195         O+
pgp_hu2FEC01_29  175         A+
pgp_hu34D5B9_67  180         *
pgp_hu34D5B9_67  180         *
pgp_hu34D5B9_73  180         *
.                .           .
.                .           .

The aseterisk (*) means that that the data is unavailable for that record

Alternative ways to execute an sql query

Using a shell script:

$ sqlite3 repgp-data.sqlite3 'SELECT * from ancestry_supop'
1|pgp_hu45AE46_1123|0.0|0.0142651|0.0|0.98573539|0.0 2|pgp_hu67B84E_278|0.0|0.0|0.0|0.9918381|0.0 3|pgp_hu89910D_1019|0.0362677131983|0.0|0.0|0.963732286802|0.0 4|pgp_huDDEC1D_92|0.054813|0.0552869|0.01620862|0.8642559|0.0 5|pgp_hu1EE386_594|0.0|0.0|0.0|0.9974648|0.0 . . .

Using pipe and a shell variable.

$ sql="SELECT sample,MSL,ACB,CEU from ancestry_pop;"
$ echo $sql | sqlite3 repgp-data.sqlite3 -header -column sample MSL ACB CEU ----------------- ---------- ---------- ---------- pgp_hu45AE46_1123 0.0 0.0 0.0 pgp_hu67B84E_278 0.0 0.0 0.0 pgp_huDDEC1D_92 0.0154875 0.0 0.0581599 . . . . . . . .

-header option displays with the header
-column option displays it in a column

Joining Tables

SELECT users.sample,ancestry_pop.CEU,users.gender FROM users JOIN ancestry_pop WHERE ancestry_pop.sample=users.sample;
This will output:
sample           CEU         gender
---------------  ----------  ----------
pgp_hu43860C_41  0.100628    Male
pgp_hu43860C_41  0.100628    Male
pgp_huD37D14_85  0.0         Female
pgp_huD37D14_85  0.0         Female
pgp_huAEADC0_11  0.0         Male
.                .           .
.                .           .
.                .           .

This joins the users table and ancestry_pop table when the samples match and displays the those specific columns.

Notice the sample width gets cut off. To change the width use the following code:

sqlite3>.width 17

17 being the number of characters for the sample record.

Other SQL keywords:

Using other SQL keywords can generate other results. For instance, the following code will display all samples who are male.

sqlite3> SELECT sample,gender FROM users WHERE gender LIKE 'Male' LIMIT 10;
sample             gender
-----------------  ----------
pgp_hu43860C_41    Male
pgp_hu2FEC01_291   Male
pgp_hu2FEC01_653   Male
pgp_hu2FEC01_650   Male
pgp_hu2FEC01_407   Male
pgp_hu3CAB43_304   Male
pgp_huAEADC0_1121  Male
.                  .
.                  .
.                  .

The LIMIT keyword limits the output to the first 10 records.