coalispr.bedgraph_analyze.store

Module for dealing with file storage and retrieval to and from BNY.

To do: change the saving format to TSV using a command parameter.

Attributes

Functions

config_from_newdirs(exp, path)

Create storage folders during the initialization step coalispr init.

get_unselected_folderpath()

Return path to folder with written bam files for unselected reads that have been retrieved during counting of UNSPECIFIC reads.

check_done(listofkeys, name, tag, notag)

Assess whether keys are present in merged data.

store_chromosome_data(name, plusdata, minusdata, tag)

Binarize binned bedgraph dataframes for easy access.

store_specified_indexes(args[, suffix])

Keep specified indexes for easy access.

store_segments_table(name, df, tag, suffix, folders[, ...])

Save table with given keywords in folders/filename.

save_average_table(df, name, kind, samples[, suffix])

Save averaged count tables with given keywords in the folder/filename.

retrieve_all_specified_segments(chrnam, tag)

Read all specified regions for a chromosome from tsv files.

retrieve_specified_segments(chrnam, tag, kind[, usecols])

Read specified regions for a chromosome using configured thresholds.

retrieve_merged(chrnam, tag)

Retrieve the merged experimental data.

retrieve_merged_reference(chrnam)

Return merged reference data, organized per binned chromosome.

retrieve_merged_unselected([chrnam])

Return merged unselected data, organized per binned chromosome.

retrieve_processed_files(name, chrnam, tag[, notag, ...])

Retrieve merged experimental data from binary files.

retrieve_index_frames(kind, chrnam, tag)

Retrieve frames with column 'index' for extracted index from merged frames for the indicated kind of reads.

retrieve_indexes(kind, chrnam, tag)

Retrieve indexes that have been extracted from merged frames for the indicated kind of reads.

has_been_run(options[, backup])

Check folders for possibility to continue with options.

print_memory_usage_merged()

Show pandas memory usage of merged data frames.

Module Contents

coalispr.bedgraph_analyze.store.NOT_YET = ('Not', 'Yet')
coalispr.bedgraph_analyze.store.logger
coalispr.bedgraph_analyze.store.config_from_newdirs(exp, path)

Create storage folders during the initialization step coalispr init.

Returns:

Paths to storage folders linked to new experiment EXP.

Return type:

Path
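As a rough illustration of what creating storage folders for a new experiment can involve, the sketch below uses only pathlib. The helper name make_storage_dirs and the subfolder names are hypothetical placeholders and are not taken from coalispr.

```python
from pathlib import Path
import tempfile


def make_storage_dirs(exp, path, subfolders=("data", "figures", "outputs")):
    """Create <path>/<exp>/<subfolder> for each subfolder; return the paths.

    Hypothetical helper: the subfolder names are placeholders, not the
    layout coalispr actually uses.
    """
    root = Path(path) / exp
    created = []
    for sub in subfolders:
        folder = root / sub
        # parents=True builds intermediate folders; exist_ok=True makes
        # the call safe to repeat.
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created


if __name__ == "__main__":
    # Demonstrate in a throwaway directory so nothing is left behind.
    with tempfile.TemporaryDirectory() as tmp:
        for folder in make_storage_dirs("EXP", tmp):
            print(folder)
```

Because the folders are created with exist_ok=True, calling the helper twice for the same experiment does not fail.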

coalispr.bedgraph_analyze.store.get_unselected_folderpath()

Return path to folder with written bam files for unselected reads that have been retrieved during counting of UNSPECIFIC reads.

coalispr.bedgraph_analyze.store.check_done(listofkeys, name, tag, notag)

Assess whether keys are present in merged data.

Parameters:
  • listofkeys (list) – Keys to look for in the merged data.

  • name (str) – Name of the stored file.

  • tag (str) – Flag to indicate kind of aligned-reads, TAGUNCOLL or TAGCOLL.

  • notag (bool) – Flag to indicate whether ‘tag’ needs an argument.

Returns:

List of keys not in merged dataframe if available.

Return type:

List
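The check described above can be pictured as a membership test. The following is a minimal sketch, assuming the keys are column labels of the merged data; the helper name missing_keys and the sample names are made up for illustration.

```python
def missing_keys(listofkeys, merged_columns):
    """Return the keys that are absent from the merged data's columns."""
    present = set(merged_columns)
    # Preserve the order of the requested keys in the result.
    return [key for key in listofkeys if key not in present]


# Stand-in for the column labels of a merged dataframe.
merged_columns = ["sample_a", "sample_b"]
print(missing_keys(["sample_a", "sample_c"], merged_columns))  # ['sample_c']
```

An empty list then means that all requested keys are already present.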

coalispr.bedgraph_analyze.store.store_chromosome_data(name, plusdata, minusdata, tag, notag=False, suffix=None, folder=None)

Binarize binned bedgraph dataframes for easy access.

Parameters:
  • name (str) – Name for file to be stored.

  • plusdata (dict) – Data for plus strand as chromosome: list of pd.DataFrames.

  • minusdata (dict) – Data for minus strand as chromosome: list of pd.DataFrames.

  • tag (str) – Flag to indicate kind of aligned-reads, TAGUNCOLL or TAGCOLL.

  • notag (bool) – Flag to indicate whether ‘tag’ needs an argument.

  • suffix (str) – Suffix indicating file format and used to select function name for storing dataframes by pandas.

  • folder (str) – Folder within get_suffix_store_path(), in which data gets stored.

Returns:

Prints message upon completion of function.

Return type:

None
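The suffix parameter is described as selecting the pandas function used for storing. One way such a lookup could work is sketched below; whether coalispr does it this way is an assumption. pandas dataframes do provide methods such as to_pickle, to_parquet and to_csv, but to keep the sketch dependency-free a small stand-in class (FakeFrame, hypothetical) is used instead of a real dataframe.

```python
def writer_for(frame, suffix):
    """Look up the frame's ``to_<suffix>`` method, e.g. ``to_parquet``."""
    method = getattr(frame, f"to_{suffix}", None)
    if method is None:
        raise ValueError(f"No writer 'to_{suffix}' for suffix {suffix!r}")
    return method


class FakeFrame:
    """Dependency-free stand-in for pandas.DataFrame in this sketch."""

    def to_pickle(self, path):
        # A real dataframe would write to disk; here we only report.
        return f"pickled to {path}"


frame = FakeFrame()
print(writer_for(frame, "pickle")("merged_chr1.pickle"))
```

An unknown suffix raises ValueError instead of silently writing nothing.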

coalispr.bedgraph_analyze.store.store_specified_indexes(args, suffix=BNY)

Keep specified indexes for easy access.

Parameters:
  • args (list) – Arguments packed as [name, chrnam, plusdata, minusdata, tag]; the plus- and minus-strand items are described below as plus_idx and minus_idx.

  • name (str) – Name for file to be stored.

  • chrnam (str) – Name of chromosome for which data are stored.

  • plus_idx (object) – Data for plus strand of chromosome chrnam.

  • minus_idx (object) – Data for minus strand of chromosome chrnam.

  • tag (str) – Flag to indicate kind of aligned-reads, TAGUNCOLL or TAGCOLL.

  • suffix (str) – Suffix indicating file format. Default binary BNY.

Returns:

Prints message upon completion of function.

Return type:

None

coalispr.bedgraph_analyze.store.store_segments_table(name, df, tag, suffix, folders, backup=False)

Save table with given keywords in folders/filename.

Parameters:
  • name (str) – Name of filename for output table to save; is equal to figure name.

  • df (pandas.DataFrame) – Table to write out.

  • tag (str) – Flag to indicate kind of aligned-reads, TAGUNCOLL or TAGCOLL.

  • suffix (str) – Suffix indicating file format and used to select function writing dataframe by pandas.

  • folders (list) – Folders to store file to.

  • backup (bool) – Create backup if file exists.

Notes

The ‘tag’, if needed, should be in one of the folder names.
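The backup option can be pictured as copying an existing file aside before it is overwritten. This is a sketch under assumptions: the helper name backup_if_exists and the '.bak' naming are hypothetical and may differ from what coalispr does.

```python
from pathlib import Path
import shutil


def backup_if_exists(path):
    """Copy an existing file to '<name>.bak' and return the backup path.

    Returns None when there is nothing to back up.
    """
    path = Path(path)
    if not path.is_file():
        return None
    backup_path = path.with_name(path.name + ".bak")
    shutil.copy2(path, backup_path)  # copy2 also preserves timestamps
    return backup_path
```

The original file stays in place, so a later write can replace it while the '.bak' copy keeps the previous contents.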

coalispr.bedgraph_analyze.store.save_average_table(df, name, kind, samples, suffix=BNY)

Save averaged count tables with given keywords in the folder/filename.

Parameters:
  • name (str) – Name of filename for output table to save; is equal to figure name.

  • df (pandas.DataFrame) – Table to write out.

  • kind (str) – What type of counted reads to use, i.e. SPECIFIC or UNSPECIFIC.

  • samples (list) – List of library samples used for averaging dataframe.

  • suffix (str) – Suffix indicating file format and used to select function name for storing dataframes by pandas.

coalispr.bedgraph_analyze.store.retrieve_all_specified_segments(chrnam, tag)

Read all specified regions for a chromosome from tsv files.

Get stored information on ‘upper’ and ‘lower’ boundaries for all regions of contiguous reads (obtained by assessing bedgraphs and saved by the gather_regions functions of coalispr.bedgraph_analyze.compare).

Parameters:
  • chrnam (str) – Name of the chromosome for which the information has been stored.

  • tag (str) – Tag to indicate kind of mapped reads analysed: collapsed or uncollapsed.

Returns:

A tuple of pandas dataframes, one for each strand of chromosome chrnam.

Return type:

pandas.DataFrame, pandas.DataFrame

coalispr.bedgraph_analyze.store.retrieve_specified_segments(chrnam, tag, kind, usecols=[LOWR, UPPR, SPAN])

Read specified regions for a chromosome using configured thresholds.

Get stored information on ‘upper’ and ‘lower’ boundaries for regions of contiguous reads. These have previously been obtained by assessing bedgraph data and saved as BNY files by the gather_regions functions in module coalispr.bedgraph_analyze.compare.

Parameters:
  • chrnam (str) – Name of the chromosome for which the information has been stored.

  • tag (str) – Tag to indicate kind of mapped reads analysed: TAGCOLL or TAGUNCOLL.

  • kind (str) – The kind of specified reads stored: specific, unspecific or both.

  • usecols (list) – Columns to keep in returned dataframes.

Returns:

A tuple of pandas dataframes, one for each strand of chromosome chrnam.

Return type:

pandas.DataFrame, pandas.DataFrame

coalispr.bedgraph_analyze.store.retrieve_merged(chrnam, tag)

Retrieve the merged experimental data.

Parameters:
  • tag (str) – Flag to indicate kind of aligned-reads, TAGUNCOLL or TAGCOLL.

  • chrnam (str) – Chromosome for which merged file/data to return.

Returns:

A tuple of dicts, one for each strand, with pandas dataframes, one for each chromosome, with columns of bedgraph values summed per BINSET for each sample.

Return type:

pandas.DataFrame, pandas.DataFrame

coalispr.bedgraph_analyze.store.retrieve_merged_reference(chrnam)

Return merged reference data, organized per binned chromosome.

coalispr.bedgraph_analyze.store.retrieve_merged_unselected(chrnam=None)

Return merged unselected data, organized per binned chromosome.

coalispr.bedgraph_analyze.store.retrieve_processed_files(name, chrnam, tag, notag=False, suffix=BNY, folder=None)

Retrieve merged experimental data from binary files.

Defines internal class FileTooShortWarning(Exception)

Parameters:
  • name (str) – Name for file.

  • chrnam (str) – Chromosome for which merged file/data to return.

  • tag (str) – Flag to indicate kind of aligned-reads, TAGUNCOLL or TAGCOLL.

  • notag (bool) – Flag to indicate whether ‘tag’ needs an argument.

  • suffix (str) – Suffix indicating file format and used to select function name for storing dataframes by pandas. Default binary BNY.

  • folder (str) – Folder in which data gets stored.

Raises:

PickleWarning – Raised when stored file has ‘.pkl’ extension; deprecated.

Returns:

A tuple of dataframes, one for each strand, for requested chromosome.

Return type:

dataframe, dataframe
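The Raises entry above says that files with a ‘.pkl’ extension are treated as deprecated. A check of that kind could look roughly like the sketch below. The PickleWarning class defined here is only a stand-in with the same name, and the helper check_not_pickle is hypothetical; neither is taken from the coalispr source.

```python
from pathlib import Path


class PickleWarning(Exception):
    """Stand-in for the exception named in the Raises section."""


def check_not_pickle(path):
    """Raise PickleWarning for files stored with the deprecated '.pkl' suffix."""
    path = Path(path)
    if path.suffix == ".pkl":
        raise PickleWarning(f"{path.name}: '.pkl' storage is deprecated")
    return path
```

A caller could catch the exception and tell the user to regenerate the stored files in the current format.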

coalispr.bedgraph_analyze.store.retrieve_index_frames(kind, chrnam, tag)

Retrieve frames with column ‘index’ for extracted index from merged frames for the indicated kind of reads.

Parameters:
  • kind (str) – Kind of specified reads, SPECIFIC or UNSPECIFIC, to retrieve the index for.

  • chrnam (str) – Chromosome for which dataframe to return.

  • tag (str) – Flag to indicate kind of aligned-reads, TAGUNCOLL or TAGCOLL.

Returns:

A tuple of dataframes, one for each strand, for requested chromosome.

Return type:

dataframe, dataframe

coalispr.bedgraph_analyze.store.retrieve_indexes(kind, chrnam, tag)

Retrieve indexes that have been extracted from merged frames for the indicated kind of reads.

Parameters:
  • kind (str) – Kind of specified reads, SPECIFIC or UNSPECIFIC, to retrieve the index for.

  • chrnam (str) – Chromosome for which index to return.

  • tag (str) – Flag to indicate kind of aligned-reads, TAGUNCOLL or TAGCOLL.

Returns:

A tuple of dataframes, one for each strand, for requested chromosome.

Return type:

dataframe, dataframe

coalispr.bedgraph_analyze.store.has_been_run(options, backup=False)

Check folders for possibility to continue with options.

Parameters:
  • options (list) – List of keys to find folders to check.
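A folder check of this kind can be sketched with pathlib. The mapping folder_for (option key to folder) and the rule that a folder must exist and be non-empty are assumptions made for illustration; the actual criteria used by has_been_run are not documented here.

```python
from pathlib import Path


def folders_ready(options, folder_for):
    """Return True if every option maps to an existing, non-empty folder.

    ``folder_for`` is a hypothetical dict of option key -> folder path.
    """
    for key in options:
        folder = Path(folder_for[key])
        # A missing or empty folder means the earlier step has not run.
        if not folder.is_dir() or not any(folder.iterdir()):
            return False
    return True
```

Returning a plain boolean keeps the decision on how to proceed with the caller.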

coalispr.bedgraph_analyze.store.print_memory_usage_merged()

Show pandas memory usage of merged data frames. Memory usage is comparable to the disk space taken up when using Python's shelve (pickle); with compressed parquet, much less disk space (~10-15 fold) is used for storage.

Returns:

Floats describing (in MBs) memory usage of reference, TAGCOLL or TAGUNCOLL datasets in Pandas.

Return type:

floats