Skip to content
6 changes: 6 additions & 0 deletions prjcfg.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
metadata:
sample_annotation: samples_peppy.tsv
sample_subannotation: units_peppy.tsv

snake_config: "config.yaml"

36 changes: 32 additions & 4 deletions rules/common.smk
Original file line number Diff line number Diff line change
@@ -1,17 +1,45 @@
import pandas as pd
from peppy import Project, SAMPLE_NAME_COLNAME as PEP_SAMPLE_COL
from snakemake.utils import validate

report: "../report/workflow.rst"

def peppy_rename(df):
""" Rename peppy's column for sample name identification to snakemake's. """
if df is None:
return None
if "sample" in df.columns and PEP_SAMPLE_COL in df.columns:
raise Exception("Multiple sample identifier columns present: {}".format(", ".join(["sample", PEP_SAMPLE_COL])))
return df.rename({PEP_SAMPLE_COL: "sample"}, axis=1)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the name that peppy uses here?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it sample_name or may it be something else? If it is sample_name, I'd rather change the access to this df in the rules to use sample_name.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, sample_name

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is exactly the same issue as the units vs subsample_name and can be solved in the same way.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@johanneskoester so is this something you will change?



def peppy_units(df):
""" Add unit/subsample indices to peppy a data frame. """
if "unit" in df.columns:
return df
def count_names(names):
def go(rem, n, curr, acc):
if rem == []:
return acc + [n]
h, t = rem[0], rem[1:]
return go(t, n + 1, curr, acc) if h == curr else go(t, 1, h, acc + [n])
return go(names[1:], 1, names[0], []) if names else []
df.insert(1, "unit", [i for n in count_names(list(df[PEP_SAMPLE_COL])) for i in range(1, n + 1)])
return df
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, the peppy dataframe does contain a column subsample_name, right? So, we can change from unit to subsample_name in the workflow.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does; OK, sounds good

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we were trying to make it so that existing workflows could work without any updates, but yeah, it would be easier to change the name. my suggestion below of a snakePEP class could also hide this away.



###### Config file and sample sheets #####
configfile: "config.yaml"
p = Project("prjcfg.yaml")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do I get a dictionary with the config contents from the Project object? Then, I would introduce a new Snakemake directive

peppyconfig: "project.yaml"

that loads the project and makes it available as a global object peppy. From this, I would like to access samples and subannotation directly (getting rid of the boilerplate here).
However, I have doubts regarding the naming of the attributes. Why is the sample dataframe called sheet and the subannotation called sample_subannotation. Seems to be asymmetric. Why not calling the first samples and the second sample_subannotation?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes you do; in fact that is basically what we're already doing... p is an attmap to be precise... so p.attribute (or p["attribute"] should give you whatever configuration options are in your project.yaml file, already... you just say p = Project('project.yaml') as we've done here.

as far as getting rid of the boilerplate... the only thing the boilerplate is doing is converting the name "unit" to "sample_subannotation" and "sample" to "sample_name". one idea I had is to create a new class, say, snakePEP, that extends peppy.Project and includes this boilerplate. You would just import this object and use it as a PEP but it would handle those name conversions.

I've raised the naming issue here: pepkit/peppy#280 -- we can probably accommodate.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or one could add keyword args to Project for renaming the two? Like Project("project.yaml", sample_col="sample", sample_subannotation_col="unit")?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or we just adapt the naming in this workflow. Although it feels weird to change the wildcards to sample_subannotation instead of unit. It somehow does not sound like the right name in this context.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is just because of what you're used to... or can you elaborate?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another thing. Shouldn't the project.yaml read like this for consistency:

metadata:
    sample_table: path/to/samples.tsv
    subsample_table: path/to/subsamples.tsv

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In your example here, I cannot find subsample_name in the subannotation table at all.

True -- it's optional. if you don't provide it they will be indexed numerically, which is I believe what you were doing... but you can include it and then pull out subsamples by name (with get_subsample), or index them in the table with the subsample_name column.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But subsample_name alone is not necessarily unique, right? In many experimental setups, it will only be unique in combination with sample_name (e.g., when encoding the lane or replicate in subsample_name).

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_subsample only exists on a Sample object rather than on an entire Project, and the subsample naming should be unique within local scope/context of a single sample (i.e., units are uniquely identified by name alone when considering just one sample.) In the table, though, the unique identification would definitely be problematic unless combined with sample like you say. Here's an example

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, exactly. So, I would suggest to always do set_index(("sample_name", "subsample_name"), drop=False) on the subsample_table.

configfile: p.snake_config
validate(config, schema="../schemas/config.schema.yaml")

samples = pd.read_table(config["samples"]).set_index("sample", drop=False)
samples = p.sheet
samples = peppy_rename(samples).set_index("sample", drop=False)

validate(samples, schema="../schemas/samples.schema.yaml")

units = pd.read_table(config["units"], dtype=str).set_index(["sample", "unit"], drop=False)
units = peppy_units(peppy_rename(p.sample_subannotation)).set_index(["sample", "unit"], drop=False)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer if peppy would set the dataframe index like that itself. Is there a reason to not do it?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That should already be fine, or if not we can probably accommodate it

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does peppy know about the unit column at all? I thought it just requires the sample_name column in the subannotation.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we added it above to accommodate snakemake... but in our typical use case, the unit column is called subsample_name -- but it is optional in generic peppy Projects...

units.index = units.index.set_levels([i.astype(str) for i in units.index.levels]) # enforce str in index

validate(units, schema="../schemas/units.schema.yaml")

# contigs in reference genome
Expand Down
4 changes: 4 additions & 0 deletions samples_peppy.tsv
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
sample_name
A
B

4 changes: 4 additions & 0 deletions units_peppy.tsv
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
sample_name unit platform fq1 fq2
A 1 ILLUMINA data/reads/a.chr21.1.fq data/reads/a.chr21.2.fq
B 1 ILLUMINA data/reads/b.chr21.1.fq data/reads/b.chr21.2.fq
B 2 ILLUMINA data/reads/b.chr21.1.fq