Using peppy by vreuter · Pull Request #6 · snakemake-workflows/dna-seq-gatk-variant-calling

vreuter · 2019-03-14T22:25:07Z

johanneskoester

Awesome start! Thanks a lot. I have some comments below.

johanneskoester · 2019-03-20T09:14:26Z

+        return None
+    if "sample" in df.columns and PEP_SAMPLE_COL in df.columns:
+        raise Exception("Multiple sample identifier columns present: {}".format(", ".join(["sample", PEP_SAMPLE_COL])))
+    return df.rename({PEP_SAMPLE_COL: "sample"}, axis=1)


What is the name that peppy uses here?

Is it sample_name or may it be something else? If it is sample_name, I'd rather change the access to this df in the rules to use sample_name.

Yep, sample_name

this is exactly the same issue as the units vs subsample_name and can be solved in the same way.

@johanneskoester so is this something you will change?

johanneskoester · 2019-03-20T09:19:28Z

+            return go(t, n + 1, curr, acc) if h == curr else go(t, 1, h, acc + [n])
+        return go(names[1:], 1, names[0], []) if names else []
+    df.insert(1, "unit", [i for n in count_names(list(df[PEP_SAMPLE_COL])) for i in range(1, n + 1)])
+    return df


So, the peppy dataframe does contain a column subsample_name, right? So, we can change from unit to subsample_name in the workflow.

It does; OK, sounds good

we were trying to make it so that existing workflows could work without any updates, but yeah, it would be easier to change the name. my suggestion below of a snakePEP class could also hide this away.
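For reference, the unit numbering done by the recursive helper in the diff above can be sketched without recursion. This is only an illustration under stated assumptions, not the PR's code: `number_units` is a hypothetical name, and rows belonging to the same sample are assumed to be adjacent in the subannotation table, as the recursive version also assumes.

```python
from itertools import groupby


def number_units(sample_names):
    """Number rows 1..n within each run of identical, adjacent sample names.

    Hypothetical helper for illustration; not part of peppy or the workflow.
    """
    units = []
    for _, run in groupby(sample_names):
        # Every new run of a sample name restarts the counter at 1.
        units.extend(range(1, len(list(run)) + 1))
    return units


print(number_units(["A", "A", "B", "C", "C", "C"]))  # [1, 2, 1, 1, 2, 3]
```

Because runs are matched by adjacency, a sample whose rows are not contiguous would restart at 1 for each run; sorting the table by sample name first avoids that.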

johanneskoester · 2019-03-20T09:20:51Z

 validate(samples, schema="../schemas/samples.schema.yaml")

-units = pd.read_table(config["units"], dtype=str).set_index(["sample", "unit"], drop=False)
+units = peppy_units(peppy_rename(p.sample_subannotation)).set_index(["sample", "unit"], drop=False)


I would prefer if peppy would set the dataframe index like that itself. Is there a reason to not do it?

That should already be fine, or if not we can probably accommodate it

Does peppy know about the unit column at all? I thought it just requires the sample_name column in the subannotation.

we added it above to accommodate snakemake... but in our typical use case, the unit column is called subsample_name -- but it is optional in generic peppy Projects...

johanneskoester · 2019-03-20T09:27:10Z


 ###### Config file and sample sheets #####
-configfile: "config.yaml"
+p = Project("prjcfg.yaml")


Do I get a dictionary with the config contents from the Project object? Then, I would introduce a new Snakemake directive

peppyconfig: "project.yaml"

that loads the project and makes it available as a global object peppy. From this, I would like to access samples and subannotation directly (getting rid of the boilerplate here).
However, I have doubts about the attribute names. Why is the sample dataframe called sheet while the subannotation is called sample_subannotation? That seems asymmetric. Why not call the first samples and the second sample_subannotation?

Yes, you do; in fact, that is basically what we're already doing. p is an attmap, to be precise, so p.attribute (or p["attribute"]) should already give you whatever configuration options are in your project.yaml file. You just say p = Project('project.yaml'), as we've done here.
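To illustrate what "p.attribute or p["attribute"]" means in practice, here is a minimal stand-in. It is not attmap's implementation and does not use peppy at all; `AttrDict` is a hypothetical class, and the `metadata` content is made up for the example.

```python
class AttrDict(dict):
    """Dictionary whose keys can also be read as attributes (illustrative only)."""

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, so dict methods still work.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None


# Hypothetical config content, standing in for a loaded project.yaml.
p = AttrDict({"metadata": {"sample_table": "samples.tsv"}})
print(p.metadata == p["metadata"])  # True
```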

as far as getting rid of the boilerplate... the only thing the boilerplate is doing is converting the name "unit" to "sample_subannotation" and "sample" to "sample_name". one idea I had is to create a new class, say, snakePEP, that extends peppy.Project and includes this boilerplate. You would just import this object and use it as a PEP but it would handle those name conversions.

I've raised the naming issue here: pepkit/peppy#280 -- we can probably accommodate.

Or one could add keyword args to Project for renaming the two? Like Project("project.yaml", sample_col="sample", sample_subannotation_col="unit")?

Or we just adapt the naming in this workflow, although it feels odd to change the wildcards to sample_subannotation instead of unit. It somehow does not sound like the right name in this context.

I think this is just because of what you're used to... or can you elaborate?
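The renaming options discussed here (a wrapper class or keyword arguments) both boil down to a configurable column mapping. The following is a rough sketch of that idea in plain pandas, under loud assumptions: `rename_for_workflow` and `PEP_SUBSAMPLE_COL` are hypothetical names, nothing here is part of peppy's API, and the example table is invented.

```python
import pandas as pd

# Column names as discussed in this thread.
PEP_SAMPLE_COL = "sample_name"
PEP_SUBSAMPLE_COL = "subsample_name"  # assumed constant name, not from the PR


def rename_for_workflow(df, sample_col="sample", subsample_col="unit"):
    """Return a copy of df with PEP column names mapped to workflow names."""
    mapping = {PEP_SAMPLE_COL: sample_col, PEP_SUBSAMPLE_COL: subsample_col}
    # Only rename columns that are present; subsample_name is optional.
    mapping = {old: new for old, new in mapping.items() if old in df.columns}
    # Refuse to create duplicate columns, similar in spirit to the guard in the diff.
    clashes = [new for old, new in mapping.items() if new != old and new in df.columns]
    if clashes:
        raise ValueError("Target column(s) already present: {}".format(", ".join(clashes)))
    return df.rename(columns=mapping)


# Invented example table.
subsamples = pd.DataFrame({
    "sample_name": ["A", "A", "B"],
    "subsample_name": ["lane1", "lane2", "lane1"],
    "fq1": ["a1.fq", "a2.fq", "b1.fq"],
})
units = rename_for_workflow(subsamples)
print(list(units.columns))  # ['sample', 'unit', 'fq1']
```

Keeping the mapping in one function means the rules can stay on either naming scheme, whichever direction the workflow ends up taking.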

Another thing. Shouldn't the project.yaml read like this for consistency:

metadata:
  sample_table: path/to/samples.tsv
  subsample_table: path/to/subsamples.tsv

In your example here, I cannot find subsample_name in the subannotation table at all.

True -- it's optional. if you don't provide it they will be indexed numerically, which is I believe what you were doing... but you can include it and then pull out subsamples by name (with get_subsample), or index them in the table with the subsample_name column.

But subsample_name alone is not necessarily unique, right? In many experimental setups, it will only be unique in combination with sample_name (e.g., when encoding the lane or replicate in subsample_name).

get_subsample only exists on a Sample object rather than on an entire Project, and the subsample naming should be unique within local scope/context of a single sample (i.e., units are uniquely identified by name alone when considering just one sample.) In the table, though, the unique identification would definitely be problematic unless combined with sample like you say. Here's an example

Yes, exactly. So, I would suggest always calling set_index(["sample_name", "subsample_name"], drop=False) on the subsample_table (with the keys as a list, since pandas treats a tuple as a single column label).
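A small sketch of the uniqueness point, using an invented subsample table (the column values and the `fq1` column are assumptions for illustration only):

```python
import pandas as pd

# Invented example: lane names repeat across samples.
subsample_table = pd.DataFrame({
    "sample_name": ["A", "A", "B"],
    "subsample_name": ["lane1", "lane2", "lane1"],
    "fq1": ["a1.fq", "a2.fq", "b1.fq"],
})

# subsample_name alone repeats across samples ...
print(subsample_table["subsample_name"].is_unique)  # False

# ... but the (sample_name, subsample_name) pair is unique.
# The keys are given as a list of column labels.
indexed = subsample_table.set_index(["sample_name", "subsample_name"], drop=False)
print(indexed.index.is_unique)  # True

# Look up one unit by its compound key.
print(indexed.loc[("B", "lane1"), "fq1"])  # b1.fq
```

With `drop=False`, both key columns remain available as ordinary columns, so rules can still read them from a row after the lookup.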

vreuter · 2019-04-29T20:41:52Z

Closing in favor of #8

vreuter added 15 commits March 13, 2019 12:27

initial peppy imports working

d34726f

more peppy interop

59eb4d4

set the index; use master config

f92ae7c

cleanup

484976b

remove additional print

b3661dc

more cleanup

ba4debb

peppy files

e1af9a1

minimize changes, shorten names

d5b6d46

remove unused import

5964ecf

get back validate

6b54e14

need to check files entry

37ded42

guards and cleanup

ca76544

clear unused KV in project config

e85a876

condense and explain

6711da2

peppy-compatible subannotation / units sheet

6fde15f

johanneskoester reviewed Mar 20, 2019

This was referenced Mar 21, 2019

Another thing. Shouldn't the project.yaml read like this for consistency: #7

Closed

metadata naming pepkit/peppy#281

Closed

nsheff mentioned this pull request Apr 19, 2019

Snakemake object pepkit/peppy#286

Closed

vreuter closed this Apr 29, 2019

3 participants
