pysam - An interface for reading and writing SAM files

Pysam is a python module that makes it easy to read and manipulate mapped short read sequence data stored in SAM/BAM files. It is a lightweight wrapper of the samtools C-API.

This page provides a quick introduction in using pysam followed by the API. See Working with BAM/SAM-formatted files for more detailed usage instructions.

To use the module to read a file in BAM format, create a Samfile object:

import pysam
samfile = pysam.Samfile( "ex1.bam", "rb" )

Once a file is opened you can iterate over all of the read mapping to a specified region using fetch(). Each iteration returns a AlignedRead object which represents a single read along with its fields and optional tags:

for alignedread in samfile.fetch('chr1', 100, 120):
     print alignedread

samfile.close()

To give:

EAS56_57:6:190:289:82       0       99      <<<7<<<;<<<<<<<<8;;<7;4<;<;;;;;94<;     69      CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA     0       192     1
EAS56_57:6:190:289:82       0       99      <<<<<<;<<<<<<<<<<;<<;<<<<;8<6;9;;2;     137     AGGGGTGCAGAGCCGAGTCACGGGGTTGCCAGCAC     73      64      1
EAS51_64:3:190:727:308      0       102     <<<<<<<<<<<<<<<<<<<<<<<<<<<::<<<844     99      GGTGCAGAGCCGAGTCACGGGGTTGCCAGCACAGG     99      18      1
...

You can also write to a Samfile:

import pysam
samfile = pysam.Samfile("ex1.bam", "rb")
pairedreads = pysam.Samfile("allpaired.bam", "wb", template=samfile)
for read in samfile.fetch():
     if read.is_paired:
             pairedreads.write(read)

pairedreads.close()
samfile.close()

An alternative way of accessing the data in a SAM file is by iterating over each base of a specified region using the pileup() method. Each iteration returns a PileupColumn which represents all the reads in the SAM file that map to a single base in the reference sequence. The list of reads are represented as PileupRead objects in the PileupColumn.pileups property:

import pysam
samfile = pysam.Samfile("ex1.bam", "rb" )
for pileupcolumn in samfile.pileup( 'chr1', 100, 120):
    print
    print 'coverage at base %s = %s' % (pileupcolumn.pos , pileupcolumn.n)
    for pileupread in pileupcolumn.pileups:
        print '\tbase in read %s = %s' % (pileupread.alignment.qname, pileupread.alignment.seq[pileupread.qpos])

samfile.close()

The above code outputs:

coverage at base 99 = 1
    base in read EAS56_57:6:190:289:82 = A

coverage at base 100 = 1
    base in read EAS56_57:6:190:289:82 = G

coverage at base 101 = 1
    base in read EAS56_57:6:190:289:82 = G

coverage at base 102 = 2
    base in read EAS56_57:6:190:289:82 = G
    base in read EAS51_64:3:190:727:308 = G
...

Commands available in csamtools are available as simple function calls. For example:

pysam.sort( "ex1.bam", "output" )

corresponds to the command line:

samtools sort ex1.bam output

Analogous to Samfile, a Tabixfile allows fast random access to compressed and tabix indexed tab-separated file formats with genomic data:

import pysam
tabixfile = pysam.Tabixfile( "example.gtf.gz" )

for gtf in tabixfile.fetch('chr1', 1000, 2000):
    print gtf.contig, gtf.start, gtf.end, gtf.gene_id

Tabixfile implements lazy parsing in order to iterate over large tables efficiently.

More detailed usage instructions is at Working with BAM/SAM-formatted files.

Note

Coordinates in pysam are always 0-based (following the python convention). SAM text files use 1-based coordinates.

Note

The above examples can be run in the tests directory of the installation directory. Type ‘make’ before running them.

See also

http://code.google.com/p/pysam/
The pysam Google code page, containing source code and download instructions
http://wwwfgu.anat.ox.ac.uk/~andreas/documentation/samtools/contents.html
The pysam website containing documentation

API

class pysam.Samfile
*(filename, mode=None, template = None, referencenames = None, referencelengths = None, text = NULL, header = None,
add_sq_text = False, check_header = True, check_sq = True )*

A SAM/BAM formatted file. The file is automatically opened.

mode should be r for reading or w for writing. The default is text mode (SAM). For binary (BAM) I/O you should append b for compressed or u for uncompressed BAM output. Use h to output header information in text (TAM) mode.

If b is present, it must immediately follow r or w. Valid modes are r, w, wh, rb, wb and wbu. For instance, to open a BAM formatted file for reading, type:

f = pysam.Samfile('ex1.bam','rb')

If mode is not specified, we will try to auto-detect in the order ‘rb’, ‘r’, thus both the following should work:

f1 = pysam.Samfile('ex1.bam' )
f2 = pysam.Samfile('ex1.sam' )

If an index for a BAM file exists (.bai), it will be opened automatically. Without an index random access to reads via fetch() and pileup() is disabled.

For writing, the header of a SAM file/BAM file can be constituted from several sources (see also the samtools format specification):

  1. If template is given, the header is copied from a another Samfile (template must be of type Samfile).
  2. If header is given, the header is built from a multi-level dictionary. The first level are the four types (‘HD’, ‘SQ’, ...). The second level are a list of lines, with each line being a list of tag-value pairs. The header is constructed first from all the defined fields, followed by user tags in alphabetical order.
  3. If text is given, new header text is copied from raw text.
  4. The names (referencenames) and lengths (referencelengths) are supplied directly as lists. By default, ‘SQ’ and ‘LN’ tags will be added to the header text. This option can be changed by unsetting the flag add_sq_text.

By default, if file a file is opened in mode ‘r’, it is checked for a valid header (check_header = True) and a definition of chromosome names (check_sq = True).

close

Samfile.close(self)

closes the pysam.Samfile.

count

Samfile.count(self, reference=None, start=None, end=None, region=None, until_eof=False) (reference = None, start = None, end = None, region = None, callback = None, until_eof = False)

count reads region using 0-based indexing. The region is specified by reference, start and end. Alternatively, a samtools region string can be supplied.

Note that a TAM file does not allow random access. If region or reference are given, an exception is raised.

fetch

Samfile.fetch(self, reference=None, start=None, end=None, region=None, callback=None, until_eof=False)

fetch aligned reads in a region using 0-based indexing. The region is specified by reference, start and end. Alternatively, a samtools region string can be supplied.

Without reference or region all mapped reads will be fetched. The reads will be returned ordered by reference sequence, which will not necessarily be the order within the file.

If until_eof is given, all reads from the current file position will be returned in order as they are within the file. Using this option will also fetch unmapped reads.

If only reference is set, all reads aligned to reference will be fetched.

The method returns an iterator of type pysam.IteratorRow unless a callback is provided. If *callback is given, the callback will be executed for each position within the region. Note that callbacks currently work only, if region or reference is given.

Note that a SAM file does not allow random access. If region or reference are given, an exception is raised.

filename

number of filename associated with this object.

getrname

Samfile.getrname(self, tid)

convert numerical tid into reference name.

gettid

Samfile.gettid(self, reference)

convert reference name into numerical tid

returns -1 if reference is not known.

header

header information within the sam file. The records and fields are returned as a two-level dictionary.

lengths

tuple of the lengths of the reference sequences. The lengths are in the same order as pysam.Samfile.references

mapped

total number of mapped reads in file.

mate

Samfile.mate(self, AlignedRead read) return the mate of AlignedRead read.

Throws a ValueError if read is unpaired or the mate is unmapped.

Note

Calling this method will change the file position. This might interfere with any iterators that have not re-opened the file.

next

x.next() -> the next value, or raise StopIteration

nreferences

number of reference sequences in the file.

pileup

Samfile.pileup(self, reference=None, start=None, end=None, region=None, callback=None, **kwargs)

perform a pileup within a region. The region is specified by reference, start and end (using 0-based indexing). Alternatively, a samtools region string can be supplied.

Without reference or region all reads will be used for the pileup. The reads will be returned ordered by reference sequence, which will not necessarily be the order within the file.

The method returns an iterator of type pysam.IteratorColumn unless a callback is provided. If a *callback is given, the callback will be executed for each column within the region.

Note that SAM formatted files do not allow random access. In these files, if a region or reference are given an exception is raised.

Optional kwargs to the iterator:

stepper

The stepper controlls how the iterator advances. Possible options for the stepper are

all
use all reads for pileup.
samtools
same filter and read processing as in csamtools pileup
fastafile
A FastaFile object
mask
Skip all reads with bits set in mask if mask=True.
max_depth
Maximum read depth permitted. The default limit is 8000.
truncate
By default, the samtools pileup engine outputs all reads overlapping a region (see note below). If truncate is True and a region is given, only output columns in the exact region specificied.

Note

all reads which overlap the region are returned. The first base returned will be the first base of the first read not necessarily the first base of the region used in the query.

references

tuple with the names of reference sequences.

reset

Samfile.reset(self) reset file position to beginning of read section.

seek

Samfile.seek(self, uint64_t offset, int where=0)

move file pointer to position offset, see pysam.Samfile.tell().

tell

Samfile.tell(self)

return current file position

text

full contents of the sam file header as a string.

unmapped

total number of unmapped reads in file.

write

Samfile.write(self, AlignedRead read) -> int

write a single pysam.AlignedRead to disk.

returns the number of bytes written.

class pysam.Fastafile

(filename)

A FASTA file. The file is automatically opened.

The file expects an indexed fasta file.

TODO:
add automatic indexing. add function to get sequence names.
close

Fastafile.close(self)

fetch

Fastafile.fetch(self, reference=None, start=None, end=None, region=None) (reference = None, start = None, end = None, region = None)

fetch AlignedRead() objects in a region using 0-based indexing.

The region is specified by reference, start and end.

fetch returns an empty string if the region is out of range or addresses an unknown reference.

If reference is given and start is None, the sequence from the first base is returned. Similarly, if end is None, the sequence until the last base is returned.

Alternatively, a samtools region string can be supplied.

filename

number of filename associated with this object.

class pysam.IteratorRow

abstract base class for iterators over mapped reads.

Various iterators implement different behaviours for wrapping around contig boundaries. Examples include:

pysam.IteratorRowRegion
iterate within a single contig and a defined region.
pysam.IteratorRowAll
iterate until EOF. This iterator will also include unmapped reads.
pysam.IteratorRowAllRefs
iterate over all reads in all reference sequences.

The method Samfile.fetch() returns an IteratorRow.

class pysam.IteratorColumn

abstract base class for iterators over columns.

IteratorColumn objects wrap the pileup functionality of samtools.

For reasons of efficiency, the iterator points to the current pileup buffer. The pileup buffer is updated at every iteration. This might cause some unexpected behavious. For example, consider the conversion to a list:

f = Samfile("file.bam", "rb")
result = list( f.pileup() )

Here, result will contain n objects of type PileupProxy for n columns, but each object in result will contain the same information.

The desired behaviour can be achieved by list comprehension:

result = [ x.pileups() for x in f.pileup() ]

result will be a list of n lists of objects of type PileupRead.

If the iterator is associated with a Fastafile using the addReference() method, then the iterator will export the current sequence via the methods getSequence() and seq_len().

Optional kwargs to the iterator

stepper

The stepper controls how the iterator advances. Possible options for the stepper are

all
use all reads for pileup.
samtools
same filter and read processing as in csamtools pileup

The default is to use “all” if no stepper is given.

fastafile
A FastaFile object
mask
Skip all reads with bits set in mask.
max_depth
maximum read depth. The default is 8000.
addReference

IteratorColumn.addReference(self, Fastafile fastafile)

add reference sequences in fastafile to iterator.

hasReference

IteratorColumn.hasReference(self)

return true if iterator is associated with a reference

seq_len

current sequence length.

class pysam.AlignedRead

AlignedRead()

Class representing an aligned read. see SAM format specification for the meaning of fields (http://samtools.sourceforge.net/).

This class stores a handle to the samtools C-structure representing an aligned read. Member read access is forwarded to the C-structure and converted into python objects. This implementation should be fast, as only the data needed is converted.

For write access, the C-structure is updated in-place. This is not the most efficient way to build BAM entries, as the variable length data is concatenated and thus needs to resized if a field is updated. Furthermore, the BAM entry might be in an inconsistent state. The validate() method can be used to check if an entry is consistent.

One issue to look out for is that the sequence should always be set before the quality scores. Setting the sequence will also erase any quality scores that were set previously.

In Python 3, the fields containing sequence and quality (seq, query, qual and qqual) data are of type bytes. Other string data, such as the qname field and strings in the tags tuple, is represented as unicode strings. On assignment, both bytes and unicode objects are allowed, but unicode strings must contain only ASCII characters.

aend

aligned reference position of the read on the reference genome.

aend points to one past the last aligned residue. Returns None if not available.

alen

aligned length of the read on the reference genome. Returns None if not available.

aligned_pairs

a list of aligned read and reference positions.

Unaligned position are marked by None.

bin

properties bin

cigar

the cigar alignment (None if not present). The alignment is returned as a list of operations. The operations are:

M BAM_CMATCH 0
I BAM_CINS 1
D BAM_CDEL 2
N BAM_CREF_SKIP 3
S BAM_CSOFT_CLIP 4
H BAM_CHARD_CLIP 5
P BAM_CPAD 6
= BAM_CEQUAL 7
X BAM_CDIFF 8
cigarstring

the cigar alignment as a string.

Returns the empty string if not present.

compare

AlignedRead.compare(self, AlignedRead other) return -1,0,1, if contents in this are binary <,=,> to other

fancy_str

AlignedRead.fancy_str(self) returns list of fieldnames/values in pretty format for debugging

flag

properties flag

is_duplicate

true if optical or PCR duplicate

is_paired

true if read is paired in sequencing

is_proper_pair

true if read is mapped in a proper pair

is_qcfail

true if QC failure

is_read1

true if this is read1

is_read2

true if this is read2

is_reverse

true if read is mapped to reverse strand

is_secondary

true if not primary alignment

is_unmapped

true if read itself is unmapped

isize

the insert size deprecated: use tlen instead

mapq

mapping quality

mate_is_reverse

true is read is mapped to reverse strand

mate_is_unmapped

true if the mate is unmapped

mpos

the position of the mate deprecated, use PNEXT instead.

mrnm

the reference id of the mate deprecated, use RNEXT instead.

opt

AlignedRead.opt(self, tag) retrieves optional data given a two-letter tag

overlap

AlignedRead.overlap(self, uint32_t start, uint32_t end) return number of aligned bases of read overlapping the interval start and end

on the reference sequence.
pnext

the position of the mate

pos

0-based leftmost coordinate

positions

a list of reference positions that this read aligns to.

qend

end index of the aligned query portion of the sequence (0-based, exclusive)

qlen

Length of the aligned query sequence

qname

the query name (None if not present)

qqual

aligned query sequence quality values (None if not present). This property is read-only.

In Python 3, this property is of type bytes.

qstart

start index of the aligned query portion of the sequence (0-based, inclusive)

qual

read sequence base qualities, including soft clipped bases (None if not present).

In Python 3, this property is of type bytes and assigning a unicode string to it consisting of ASCII characters only will work, but is inefficient.

query

aligned portion of the read and excludes any flanking bases that were soft clipped (None if not present).

In Python 3, this property is of type bytes. Assigning a unicode string to it consisting of ASCII characters only will work, but is inefficient.

SAM/BAM files may included extra flanking bases sequences that were not part of the alignment. These bases may be the result of the Smith-Waterman or other algorithms, which may not require alignments that begin at the first residue or end at the last. In addition, extra sequencing adapters, multiplex identifiers, and low-quality bases that were not considered for alignment may have been retained.

rlen

length of the read (read only). Returns 0 if not given.

rname

target ID

DEPRECATED from pysam-0.4 - use tid in the future. The rname field caused a lot of confusion as it returns the target ID instead of the reference sequence name.

Note

This field contains the index of the reference sequence in the sequence dictionary. To obtain the name of the reference sequence, use pysam.Samfile.getrname()

rnext

the reference id of the mate

seq

read sequence bases, including soft clipped bases (None if not present).

In Python 3, this property is of type bytes and assigning a unicode string to it consisting of ASCII characters only will work, but is inefficient.

tags

the tags in the AUX field.

This property permits convenience access to the tags. Changes it the returned list will not update the tags automatically. Instead, the following is required for adding a new tag:

read.tags = read.tags + [("RG",0)]

This method will happily write the same tag multiple times.

tid

target ID

Note

This field contains the index of the reference sequence in the sequence dictionary. To obtain the name of the reference sequence, use pysam.Samfile.getrname()

tlen

the insert size

class pysam.PileupColumn

A pileup column. A pileup column contains all the reads that map to a certain target base.

tid
chromosome ID as is defined in the header
pos
the target base coordinate (0-based)
n
number of reads mapping to this column
pileups
list of reads (pysam.PileupRead) aligned to this column
class pysam.PileupProxy

PileupProxy() A pileup column. A pileup column contains

all the reads that map to a certain target base.

tid
chromosome ID as is defined in the header
pos
the target base coordinate (0-based)
n
number of reads mapping to this column
pileups
list of reads (pysam.PileupRead) aligned to this column

This class is a proxy for results returned by the samtools pileup engine. If the underlying engine iterator advances, the results of this column will change.

n

number of reads mapping to this column.

pileups

list of reads (pysam.PileupRead) aligned to this column

pos
tid

the chromosome ID as is defined in the header

class pysam.PileupRead

PileupRead() A read aligned to a column.

alignment

a pysam.AlignedRead object of the aligned read

indel

indel length; 0 for no indel, positive for ins and negative for del

is_del

1 iff the base on the padded read is a deletion

is_head
is_tail
level
qpos

position of the read base at the pileup site, 0-based

class pysam.IndexedReads

IndexedReads(Samfile samfile, int reopen=True) index a bamfile by read.

The index is kept in memory.

By default, the file is re-openend to avoid conflicts if multiple operators work on the same file. Set reopen = False to not re-open samfile.

build

IndexedReads.build(self) build index.

find

IndexedReads.find(self, qname)

pysam.tabix_index()

tabix_index(filename, force=False, seq_col=None, start_col=None, end_col=None, preset=None, meta_char=’#’, zerobased=False)

index tab-separated filename using tabix.

An existing index will not be overwritten unless force is set.

The index will be built from coordinates in columns seq_col, start_col and end_col.

The contents of filename have to be sorted by contig and position - the method does not check if the file is sorted.

Column indices are 0-based. Coordinates in the file are assumed to be 1-based.

If preset is provided, the column coordinates are taken from a preset. Valid values for preset are “gff”, “bed”, “sam”, “vcf”, psltbl”, “pileup”.

Lines beginning with meta_char and the first line_skip lines will be skipped.

If filename does not end in ”.gz”, it will be automatically compressed. The original file will be removed and only the compressed file will be retained.

If filename ends in gz, the file is assumed to be already compressed with bgzf.

returns the filename of the compressed data

pysam.tabix_compress()

tabix_compress(filename_in, filename_out, force=False)

compress filename_in writing the output to filename_out.

Raise an IOError if filename_out already exists, unless force is set.

class pysam.Tabixfile

(filename, mode=’r’)

opens a tabix file for reading. A missing index (filename + ”.tbi”) will raise an exception.

close

Tabixfile.close(self)

closes the pysam.Tabixfile.

contigs

chromosome names

fetch

Tabixfile.fetch(self, reference=None, start=None, end=None, region=None, parser=None)

fetch one or more rows in a region using 0-based indexing. The region is specified by reference, start and end. Alternatively, a samtools region string can be supplied.

Without reference or region all entries will be fetched.

If only reference is set, all reads matching on reference will be fetched.

If parser is None, the results are returned as an unparsed string. Otherwise, parser is assumed to be a functor that will return parsed data (see for example asTuple() and asGTF()).

filename

filename associated with this object.

header

the file header.

Note

The header is returned as an iterator over lines without the newline character.

class pysam.asTuple

converts a tabix row into a python tuple.

Access is by numeric index.

class pysam.asGTF

converts a tabix row into a GTF record with the following fields:

contig
contig
feature
feature
source
source
start
genomic start coordinate (0-based)
end
genomic end coordinate plus one (0-based)
score
feature score
strand
strand
frame
frame
attributes
attribute string.

GTF formatted entries also defined the attributes:

gene_id
the gene identifier
transcript_ind
the transcript identifier
class pysam.asVCF

converts a tabix row into a VCF record with the following fields:

contig
contig
pos
chromosomal position, zero-based
id
id
ref
reference
alt
alt
qual
qual
filter
filter
info
info
format
format specifier.

Access to genotypes is via index:

contig = vcf.contig
first_sample_genotype = vcf[0]
second_sample_genotype = vcf[1]
class pysam.asBed

converts a tabix row into a bed record with the following fields:

contig
contig
start
genomic start coordinate (zero-based)
end
genomic end coordinate plus one (zero-based)
name
name of feature.
score
score of feature
strand
strand of feature
thickStart
thickStart
thickEnd
thickEnd
itemRGB
itemRGB
blockCount
number of bocks
blockSizes
‘,’ separated string of block sizes
blockStarts
‘,’ separated string of block genomic start positions

Only the first three fields are required. Additional fields are optional, but if one is defined, all the preceeding need to be defined as well.

pysam.tabix_iterator()

tabix_iterator(infile, parser) return an iterator over all entries in a file.

class pysam.tabix_inplace_iterator

iterate over infile.

This iterator is not safe. If the __next__() method is called after infile is closed, the result is undefined (see fclose()).

The iterator might either raise a StopIteration or segfault.

next

x.next() -> the next value, or raise StopIteration

class pysam.VCF
compare_calls()

Utility function: compares two calls for equality

connect()

connect to tabix file.

convertGT()
convertGTback()
enter_default_format()
error()
fetch()

Parse a stream of VCF-formatted lines. Initializes class instance and return generator

format_format()
format_formatdata()
get_expected()
getfilter()

Dictionary of ##FILTER tags, as VCF.FORMAT values

getformat()

Dictionary of ##FORMAT tags, as VCF.FORMAT values

getheader()

List of header key-value pairs (strings)

getinfo()

Dictionary of ##INFO tags, as VCF.FORMAT values

getsamples()

List of samples in VCF file

ignoreerror()
inregion()
parse()

Parse a stream of VCF-formatted lines. Initializes class instance and return generator

parse_data()
parse_format()
parse_formatdata()
parse_header()
parse_heading()
setfilter()

Dictionary of ##FILTER tags, as VCF.FORMAT values

setformat()

Dictionary of ##FORMAT tags, as VCF.FORMAT values

setheader()

List of header key-value pairs (strings)

setinfo()

Dictionary of ##INFO tags, as VCF.FORMAT values

setreference()

Provide a reference sequence; a Python class supporting a fetch(chromosome, start, end) method, e.g. PySam.FastaFile

setregions()
setsamples()

List of samples in VCF file

setversion()
validate()

validate vcf record.

returns a validated record.

warnerror()
write()

Writes a VCF file to a stream, using a data generator (or list)

write_data()
write_header()
write_heading()
writeheader()

Writes a VCF header

class pysam.VCFRecord

vcf record.

initialized from data and vcf meta

alt
contig
error

raise error.

filter
format
id
info
pos
qual
ref
samples
exception pysam.SamtoolsError(value)

exception raised in case of an error incurred in the samtools library.

class pysam.SamtoolsDispatcher(dispatch, parsers)

samtools dispatcher.

Emulates the samtools command line as module calls.

Captures stdout and stderr.

Raises a pysam.SamtoolsError exception in case samtools exits with an error code other than 0.

Some command line options are associated with parsers. For example, the samtools command “pileup -c” creates a tab-separated table on standard output. In order to associate parsers with options, an optional list of parsers can be supplied. The list will be processed in order checking for the presence of each option.

If no parser is given or no appropriate parser is found, the stdout output of samtools commands will be returned.

getMessages()
usage()

return the samtools usage information for this command

Table Of Contents

Previous topic

pysam: samtools interface for python

Next topic

Working with BAM/SAM-formatted files

This Page