-
Notifications
You must be signed in to change notification settings - Fork 149
Using peppy #6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Using peppy #6
Changes from all commits
d34726f
59eb4d4
f92ae7c
484976b
b3661dc
ba4debb
e1af9a1
d5b6d46
5964ecf
6b54e14
37ded42
ca76544
e85a876
6711da2
6fde15f
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,6 @@ | ||
| metadata: | ||
| sample_annotation: samples_peppy.tsv | ||
| sample_subannotation: units_peppy.tsv | ||
|
|
||
| snake_config: "config.yaml" | ||
|
|
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,17 +1,45 @@ | ||
| import pandas as pd | ||
| from peppy import Project, SAMPLE_NAME_COLNAME as PEP_SAMPLE_COL | ||
| from snakemake.utils import validate | ||
|
|
||
| report: "../report/workflow.rst" | ||
|
|
||
| def peppy_rename(df): | ||
| """ Rename peppy's column for sample name identification to snakemake's. """ | ||
| if df is None: | ||
| return None | ||
| if "sample" in df.columns and PEP_SAMPLE_COL in df.columns: | ||
| raise Exception("Multiple sample identifier columns present: {}".format(", ".join(["sample", PEP_SAMPLE_COL]))) | ||
| return df.rename({PEP_SAMPLE_COL: "sample"}, axis=1) | ||
|
|
||
|
|
||
| def peppy_units(df): | ||
| """ Add unit/subsample indices to peppy a data frame. """ | ||
| if "unit" in df.columns: | ||
| return df | ||
| def count_names(names): | ||
| def go(rem, n, curr, acc): | ||
| if rem == []: | ||
| return acc + [n] | ||
| h, t = rem[0], rem[1:] | ||
| return go(t, n + 1, curr, acc) if h == curr else go(t, 1, h, acc + [n]) | ||
| return go(names[1:], 1, names[0], []) if names else [] | ||
| df.insert(1, "unit", [i for n in count_names(list(df[PEP_SAMPLE_COL])) for i in range(1, n + 1)]) | ||
| return df | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. So, the peppy dataframe does contain a column subsample_name, right? So, we can change from unit to subsample_name in the workflow.
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It does; OK, sounds good There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. we were trying to make it so that existing workflows could work without any updates, but yeah, it would be easier to change the name. my suggestion below of a snakePEP class could also hide this away. |
||
|
|
||
|
|
||
| ###### Config file and sample sheets ##### | ||
| configfile: "config.yaml" | ||
| p = Project("prjcfg.yaml") | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do I get a dictionary with the config contents from the Project object? Then, I would introduce a new Snakemake directive that loads the project and makes it available as a global object There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes you do; in fact that is basically what we're already doing... as far as getting rid of the boilerplate... the only thing the boilerplate is doing is converting the name "unit" to "sample_subannotation" and "sample" to "sample_name". one idea I had is to create a new class, say, I've raised the naming issue here: pepkit/peppy#280 -- we can probably accommodate.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Or one could add keyword args to
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Or we just adapt the naming in this workflow. Although it feels weird to change the wildcards to sample_subannotation instead of unit. It somehow does not sound like the right name in this context. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think this is just because of what you're used to... or can you elaborate?
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Another thing. Shouldn't the project.yaml read like this for consistency: There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
True -- it's optional. if you don't provide it they will be indexed numerically, which is I believe what you were doing... but you can include it and then pull out subsamples by name (with
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. But subsample_name alone is not necessarily unique, right? In many experimental setups, it will only be unique in combination with sample_name (e.g., when encoding the lane or replicate in subsample_name).
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, exactly. So, I would suggest to always do |
||
| configfile: p.snake_config | ||
| validate(config, schema="../schemas/config.schema.yaml") | ||
|
|
||
| samples = pd.read_table(config["samples"]).set_index("sample", drop=False) | ||
| samples = p.sheet | ||
| samples = peppy_rename(samples).set_index("sample", drop=False) | ||
|
|
||
| validate(samples, schema="../schemas/samples.schema.yaml") | ||
|
|
||
| units = pd.read_table(config["units"], dtype=str).set_index(["sample", "unit"], drop=False) | ||
| units = peppy_units(peppy_rename(p.sample_subannotation)).set_index(["sample", "unit"], drop=False) | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I would prefer if peppy would set the dataframe index like that itself. Is there a reason to not do it?
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. That should already be fine, or if not we can probably accommodate it
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Does peppy know about the unit column at all? I thought it just requires the sample_name column in the subannotation. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. we added it above to accommodate snakemake... but in our typical use case, the |
||
| units.index = units.index.set_levels([i.astype(str) for i in units.index.levels]) # enforce str in index | ||
|
|
||
| validate(units, schema="../schemas/units.schema.yaml") | ||
|
|
||
| # contigs in reference genome | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,4 @@ | ||
| sample_name | ||
| A | ||
| B | ||
|
|
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,4 @@ | ||
| sample_name unit platform fq1 fq2 | ||
| A 1 ILLUMINA data/reads/a.chr21.1.fq data/reads/a.chr21.2.fq | ||
| B 1 ILLUMINA data/reads/b.chr21.1.fq data/reads/b.chr21.2.fq | ||
| B 2 ILLUMINA data/reads/b.chr21.1.fq |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the name that peppy uses here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it sample_name or may it be something else? If it is sample_name, I'd rather change the access to this df in the rules to use sample_name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep,
sample_nameThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is exactly the same issue as the units vs subsample_name and can be solved in the same way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@johanneskoester so is this something you will change?