Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add predicate to GFF3Codec to give a chance to filter out some unused attributes #1575

Merged
merged 4 commits into from
Dec 8, 2021

Conversation

lindenb
Copy link
Contributor

@lindenb lindenb commented Oct 21, 2021

Description

When using a GFF3Reader with DecodeDepth==DEEP, it may use a large amount of memory with attributes that will never be used ("version" ,"tag", etc...). This PR gives the GFF3Codec a chance to set a Predicate<String> to only keep a defined set of attributes.

the private attribute of ID_ATTRIBUTE_KEY and NAME_ATTRIBUTE_KEY Gff3BaseData was removed to check if the predicate does not remove them.

a new method setFilterOutAttribute was added to GFF3Codec

the static attribute of GFF3Codec.parseLine was removed

I added a test codecFilterOutFieldsTest

Things to think about before submitting:

  • Make sure your changes compile and new tests pass locally.
  • Add new tests or update existing ones:
    • A bug fix should include a test that previously would have failed and passes now.
    • New features should come with new tests that exercise and validate the new functionality.
  • Extended the README / documentation, if necessary
  • Check your code style.
  • Write a clear commit title and message
    • The commit message should describe what changed and is targeted at htsjdk developers
    • Breaking changes should be mentioned in the commit message.

Sorry, something went wrong.

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lindenb Thank you, this seems like a good idea. I have a few nitpicky comments, but look good overall.

*/
public Gff3Codec setFilterOutAttribute(final Predicate<String> filterOutAttribute) {
/* check required keys are always kept */
for(final String key : new String[] {Gff3Constants.PARENT_ATTRIBUTE_KEY,Gff3BaseData.ID_ATTRIBUTE_KEY,Gff3BaseData.NAME_ATTRIBUTE_KEY}) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpicky, but can you add spaces between these?

@@ -200,7 +220,7 @@ private Gff3Feature decode(final LineIterator lineIterator, final DecodeDepth de
return attributes;
}

static private Gff3BaseData parseLine(final String line, final int currentLine) {
private Gff3BaseData parseLine(final String line, final int currentLine) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd probably keep this static and explicitly pass in the predicate

@@ -217,6 +237,8 @@ static private Gff3BaseData parseLine(final String line, final int currentLine)
final int phase = splitLine.get(GENOMIC_PHASE_INDEX).equals(Gff3Constants.UNDEFINED_FIELD_VALUE) ? -1 : Integer.parseInt(splitLine.get(GENOMIC_PHASE_INDEX));
final Strand strand = Strand.decode(splitLine.get(GENOMIC_STRAND_INDEX));
final Map<String, List<String>> attributes = parseAttributes(splitLine.get(EXTRA_FIELDS_INDEX));
/* remove attibutes if they fail 'acceptExtraFieldKey' */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment no longer matches the variable name.

* @param filterOutAttribute the predicate
* @return this codec
*/
public Gff3Codec setFilterOutAttribute(final Predicate<String> filterOutAttribute) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit torn if I like this as a fluent setter or instead if we should add this as an extra constructor parameter. That way we could make the predicate field final and reduce possible strange state changes, but it's not really that important either way.

@@ -217,6 +237,8 @@ static private Gff3BaseData parseLine(final String line, final int currentLine)
final int phase = splitLine.get(GENOMIC_PHASE_INDEX).equals(Gff3Constants.UNDEFINED_FIELD_VALUE) ? -1 : Integer.parseInt(splitLine.get(GENOMIC_PHASE_INDEX));
final Strand strand = Strand.decode(splitLine.get(GENOMIC_STRAND_INDEX));
final Map<String, List<String>> attributes = parseAttributes(splitLine.get(EXTRA_FIELDS_INDEX));
/* remove attibutes if they fail 'acceptExtraFieldKey' */
attributes.keySet().removeIf(KEY->this.filterOutAttribute.test(KEY));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can be the less verbose:

attributes.keySet().removeIf(this.filterOutAttribute);

@@ -9,8 +9,8 @@
import java.util.Map;

public class Gff3BaseData {
private static final String ID_ATTRIBUTE_KEY = "ID";
private static final String NAME_ATTRIBUTE_KEY = "Name";
static final String ID_ATTRIBUTE_KEY = "ID";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should these move to the constants class?

lindenb and others added 2 commits November 4, 2021 09:41

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.
Co-authored-by: Louis Bergelson <louisb@broadinstitute.org>
@codecov-commenter
Copy link

Codecov Report

Merging #1575 (646da19) into master (57c3f03) will decrease coverage by 0.037%.
The diff coverage is 81.818%.

@@               Coverage Diff               @@
##              master     #1575       +/-   ##
===============================================
- Coverage     69.841%   69.804%   -0.037%     
- Complexity      9633      9639        +6     
===============================================
  Files            702       702               
  Lines          37611     37618        +7     
  Branches        6108      6088       -20     
===============================================
- Hits           26268     26259        -9     
- Misses          8897      8907       +10     
- Partials        2446      2452        +6     
Impacted Files Coverage Δ
...rc/main/java/htsjdk/tribble/gff/Gff3Constants.java 0.000% <ø> (ø)
src/main/java/htsjdk/tribble/gff/Gff3Codec.java 77.083% <75.000%> (-0.170%) ⬇️
.../htsjdk/variant/variantcontext/VariantContext.java 78.456% <81.818%> (+0.132%) ⬆️
src/main/java/htsjdk/tribble/gff/Gff3BaseData.java 80.556% <100.000%> (ø)
...sjdk/samtools/util/htsget/HtsgetErrorResponse.java 73.333% <0.000%> (-6.667%) ⬇️
...tools/cram/encoding/core/GammaIntegerEncoding.java 86.667% <0.000%> (-6.667%) ⬇️
...mtools/cram/encoding/core/BetaIntegerEncoding.java 91.304% <0.000%> (-4.348%) ⬇️
...am/encoding/core/CanonicalHuffmanByteEncoding.java 52.941% <0.000%> (-2.941%) ⬇️
...va/htsjdk/beta/codecs/variants/vcf/VCFDecoder.java 62.651% <0.000%> (-2.410%) ⬇️
...va/htsjdk/samtools/util/htsget/HtsgetResponse.java 89.706% <0.000%> (-1.471%) ⬇️
... and 17 more

@lindenb
Copy link
Contributor Author

lindenb commented Nov 4, 2021

@lbergelson thank you for your review. I moved the static final String from Gff3BaseData to Gff3Constants, and replaced setFilterOutAttribute by a new constructor.

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lindenb Looks good. Thank you.

@lbergelson lbergelson merged commit e927064 into samtools:master Dec 8, 2021
@lindenb lindenb deleted the pl_filter branch December 8, 2021 18:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants