Using peppy#6
Conversation
johanneskoester
left a comment
There was a problem hiding this comment.
Awesome start! Thanks a lot. I have some comments below.
| return None | ||
| if "sample" in df.columns and PEP_SAMPLE_COL in df.columns: | ||
| raise Exception("Multiple sample identifier columns present: {}".format(", ".join(["sample", PEP_SAMPLE_COL]))) | ||
| return df.rename({PEP_SAMPLE_COL: "sample"}, axis=1) |
There was a problem hiding this comment.
What is the name that peppy uses here?
There was a problem hiding this comment.
Is it sample_name or may it be something else? If it is sample_name, I'd rather change the access to this df in the rules to use sample_name.
There was a problem hiding this comment.
this is exactly the same issue as the units vs subsample_name and can be solved in the same way.
There was a problem hiding this comment.
@johanneskoester so is this something you will change?
| return go(t, n + 1, curr, acc) if h == curr else go(t, 1, h, acc + [n]) | ||
| return go(names[1:], 1, names[0], []) if names else [] | ||
| df.insert(1, "unit", [i for n in count_names(list(df[PEP_SAMPLE_COL])) for i in range(1, n + 1)]) | ||
| return df |
There was a problem hiding this comment.
So, the peppy dataframe does contain a column subsample_name, right? So, we can change from unit to subsample_name in the workflow.
There was a problem hiding this comment.
we were trying to make it so that existing workflows could work without any updates, but yeah, it would be easier to change the name. my suggestion below of a snakePEP class could also hide this away.
| validate(samples, schema="../schemas/samples.schema.yaml") | ||
|
|
||
| units = pd.read_table(config["units"], dtype=str).set_index(["sample", "unit"], drop=False) | ||
| units = peppy_units(peppy_rename(p.sample_subannotation)).set_index(["sample", "unit"], drop=False) |
There was a problem hiding this comment.
I would prefer if peppy would set the dataframe index like that itself. Is there a reason to not do it?
There was a problem hiding this comment.
That should already be fine, or if not we can probably accommodate it
There was a problem hiding this comment.
Does peppy know about the unit column at all? I thought it just requires the sample_name column in the subannotation.
There was a problem hiding this comment.
we added it above to accommodate snakemake... but in our typical use case, the unit column is called subsample_name -- but it is optional in generic peppy Projects...
|
|
||
| ###### Config file and sample sheets ##### | ||
| configfile: "config.yaml" | ||
| p = Project("prjcfg.yaml") |
There was a problem hiding this comment.
Do I get a dictionary with the config contents from the Project object? Then, I would introduce a new Snakemake directive
peppyconfig: "project.yaml"
that loads the project and makes it available as a global object peppy. From this, I would like to access samples and subannotation directly (getting rid of the boilerplate here).
However, I have doubts regarding the naming of the attributes. Why is the sample dataframe called sheet and the subannotation called sample_subannotation. Seems to be asymmetric. Why not calling the first samples and the second sample_subannotation?
There was a problem hiding this comment.
Yes you do; in fact that is basically what we're already doing... p is an attmap to be precise... so p.attribute (or p["attribute"] should give you whatever configuration options are in your project.yaml file, already... you just say p = Project('project.yaml') as we've done here.
as far as getting rid of the boilerplate... the only thing the boilerplate is doing is converting the name "unit" to "sample_subannotation" and "sample" to "sample_name". one idea I had is to create a new class, say, snakePEP, that extends peppy.Project and includes this boilerplate. You would just import this object and use it as a PEP but it would handle those name conversions.
I've raised the naming issue here: pepkit/peppy#280 -- we can probably accommodate.
There was a problem hiding this comment.
Or one could add keyword args to Project for renaming the two? Like Project("project.yaml", sample_col="sample", sample_subannotation_col="unit")?
There was a problem hiding this comment.
Or we just adapt the naming in this workflow. Although it feels weird to change the wildcards to sample_subannotation instead of unit. It somehow does not sound like the right name in this context.
There was a problem hiding this comment.
I think this is just because of what you're used to... or can you elaborate?
There was a problem hiding this comment.
Another thing. Shouldn't the project.yaml read like this for consistency:
metadata:
sample_table: path/to/samples.tsv
subsample_table: path/to/subsamples.tsv
There was a problem hiding this comment.
In your example here, I cannot find subsample_name in the subannotation table at all.
True -- it's optional. if you don't provide it they will be indexed numerically, which is I believe what you were doing... but you can include it and then pull out subsamples by name (with get_subsample), or index them in the table with the subsample_name column.
There was a problem hiding this comment.
But subsample_name alone is not necessarily unique, right? In many experimental setups, it will only be unique in combination with sample_name (e.g., when encoding the lane or replicate in subsample_name).
There was a problem hiding this comment.
get_subsample only exists on a Sample object rather than on an entire Project, and the subsample naming should be unique within local scope/context of a single sample (i.e., units are uniquely identified by name alone when considering just one sample.) In the table, though, the unique identification would definitely be problematic unless combined with sample like you say. Here's an example
There was a problem hiding this comment.
Yes, exactly. So, I would suggest to always do set_index(("sample_name", "subsample_name"), drop=False) on the subsample_table.
|
Closing in favor of #8 |
@johanneskoester @nsheff