Skip to content

Using peppy#6

Closed
vreuter wants to merge 15 commits into
snakemake-workflows:masterfrom
vreuter:peppy_units
Closed

Using peppy#6
vreuter wants to merge 15 commits into
snakemake-workflows:masterfrom
vreuter:peppy_units

Conversation

@vreuter
Copy link
Copy Markdown

@vreuter vreuter commented Mar 14, 2019

Copy link
Copy Markdown
Contributor

@johanneskoester johanneskoester left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome start! Thanks a lot. I have some comments below.

Comment thread rules/common.smk
return None
if "sample" in df.columns and PEP_SAMPLE_COL in df.columns:
raise Exception("Multiple sample identifier columns present: {}".format(", ".join(["sample", PEP_SAMPLE_COL])))
return df.rename({PEP_SAMPLE_COL: "sample"}, axis=1)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the name that peppy uses here?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it sample_name or may it be something else? If it is sample_name, I'd rather change the access to this df in the rules to use sample_name.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, sample_name

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is exactly the same issue as the units vs subsample_name and can be solved in the same way.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@johanneskoester so is this something you will change?

Comment thread rules/common.smk
return go(t, n + 1, curr, acc) if h == curr else go(t, 1, h, acc + [n])
return go(names[1:], 1, names[0], []) if names else []
df.insert(1, "unit", [i for n in count_names(list(df[PEP_SAMPLE_COL])) for i in range(1, n + 1)])
return df
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, the peppy dataframe does contain a column subsample_name, right? So, we can change from unit to subsample_name in the workflow.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does; OK, sounds good

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we were trying to make it so that existing workflows could work without any updates, but yeah, it would be easier to change the name. my suggestion below of a snakePEP class could also hide this away.

Comment thread rules/common.smk
validate(samples, schema="../schemas/samples.schema.yaml")

units = pd.read_table(config["units"], dtype=str).set_index(["sample", "unit"], drop=False)
units = peppy_units(peppy_rename(p.sample_subannotation)).set_index(["sample", "unit"], drop=False)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer if peppy would set the dataframe index like that itself. Is there a reason to not do it?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That should already be fine, or if not we can probably accommodate it

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does peppy know about the unit column at all? I thought it just requires the sample_name column in the subannotation.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we added it above to accommodate snakemake... but in our typical use case, the unit column is called subsample_name -- but it is optional in generic peppy Projects...

Comment thread rules/common.smk

###### Config file and sample sheets #####
configfile: "config.yaml"
p = Project("prjcfg.yaml")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do I get a dictionary with the config contents from the Project object? Then, I would introduce a new Snakemake directive

peppyconfig: "project.yaml"

that loads the project and makes it available as a global object peppy. From this, I would like to access samples and subannotation directly (getting rid of the boilerplate here).
However, I have doubts regarding the naming of the attributes. Why is the sample dataframe called sheet and the subannotation called sample_subannotation. Seems to be asymmetric. Why not calling the first samples and the second sample_subannotation?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes you do; in fact that is basically what we're already doing... p is an attmap to be precise... so p.attribute (or p["attribute"] should give you whatever configuration options are in your project.yaml file, already... you just say p = Project('project.yaml') as we've done here.

as far as getting rid of the boilerplate... the only thing the boilerplate is doing is converting the name "unit" to "sample_subannotation" and "sample" to "sample_name". one idea I had is to create a new class, say, snakePEP, that extends peppy.Project and includes this boilerplate. You would just import this object and use it as a PEP but it would handle those name conversions.

I've raised the naming issue here: pepkit/peppy#280 -- we can probably accommodate.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or one could add keyword args to Project for renaming the two? Like Project("project.yaml", sample_col="sample", sample_subannotation_col="unit")?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or we just adapt the naming in this workflow. Although it feels weird to change the wildcards to sample_subannotation instead of unit. It somehow does not sound like the right name in this context.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is just because of what you're used to... or can you elaborate?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another thing. Shouldn't the project.yaml read like this for consistency:

metadata:
    sample_table: path/to/samples.tsv
    subsample_table: path/to/subsamples.tsv

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In your example here, I cannot find subsample_name in the subannotation table at all.

True -- it's optional. if you don't provide it they will be indexed numerically, which is I believe what you were doing... but you can include it and then pull out subsamples by name (with get_subsample), or index them in the table with the subsample_name column.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But subsample_name alone is not necessarily unique, right? In many experimental setups, it will only be unique in combination with sample_name (e.g., when encoding the lane or replicate in subsample_name).

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_subsample only exists on a Sample object rather than on an entire Project, and the subsample naming should be unique within local scope/context of a single sample (i.e., units are uniquely identified by name alone when considering just one sample.) In the table, though, the unique identification would definitely be problematic unless combined with sample like you say. Here's an example

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, exactly. So, I would suggest to always do set_index(("sample_name", "subsample_name"), drop=False) on the subsample_table.

@vreuter
Copy link
Copy Markdown
Author

vreuter commented Apr 29, 2019

Closing in favor of #8

@vreuter vreuter closed this Apr 29, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants