I recently needed to quickly parse 8GiB+ .fastq text files to calculate a simple statistic over genomic data. My initial “pfastqcount” implementation in Ruby worked fine, but with many files to process it took longer than I had hoped and consumed an alarming amount of CPU. I ended up reimplementing the pfastqcount command-line program in C: it takes one or more .fastq files, memory maps them, and computes the statistic. Simply dropping the algorithm down to raw C significantly sped up the process and reduced CPU usage, as you would expect when moving away from an interpreted language. If any of you bioinformaticians need to implement a FASTQ data processing algorithm in C, I encourage you to fork the project and use it as a template. The project is Apache 2.0 licensed for your convenience and publicly available on GitHub.
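For readers who want a feel for the approach before looking at the repository, here is a minimal sketch of memory mapping a FASTQ file on a POSIX system and scanning it in place. It is not the pfastqcount source: the function names (`count_lines`, `count_fastq_records`, `count_fastq_records_in_file`) are hypothetical, and because the post does not say which statistic is computed, the sketch simply counts records, assuming strict four-line FASTQ records with no wrapped sequence lines. It also assumes a 64-bit build, so that a multi-gigabyte file size fits in `size_t`.

```c
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Count lines in a buffer. A final line without a trailing newline
 * is counted as well. */
size_t count_lines(const char *buf, size_t len)
{
    size_t lines = 0;
    const char *p = buf;
    const char *end = buf + len;

    while (p < end) {
        const char *nl = memchr(p, '\n', (size_t)(end - p));
        lines++;
        if (nl == NULL)
            break;          /* last line had no newline */
        p = nl + 1;
    }
    return lines;
}

/* ASSUMPTION: every record is exactly four lines
 * (header, sequence, '+', quality). */
size_t count_fastq_records(const char *buf, size_t len)
{
    return count_lines(buf, len) / 4;
}

/* Map the file read-only and count its records.
 * Returns the record count, or -1 on any error. */
long long count_fastq_records_in_file(const char *path)
{
    struct stat st;
    long long records;
    void *map;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    if (fstat(fd, &st) != 0) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {  /* mmap rejects zero-length mappings */
        close(fd);
        return 0;
    }

    map = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);              /* the mapping stays valid after close */
    if (map == MAP_FAILED)
        return -1;

    records = (long long)count_fastq_records(map, (size_t)st.st_size);
    munmap(map, (size_t)st.st_size);
    return records;
}
```

A small `main` could loop over `argv` and print the result of `count_fastq_records_in_file` for each path. The appeal of mapping is that the kernel pages the file in on demand, so the scan walks the data directly without copying it through a user-space read buffer.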
2 replies on “Quick FASTQ File Parsing Via Memory Mapping In C/C++”
Hi Preston- I was a student of yours in CST200 fall ’10; was wondering if (this has nothing to do with your post here) you might have any references to OMR Java libraries? I’ve since graduated and work for a SW dev co.- I’m researching a potential project that will involve the reading of a “play slip” and I’d like to assemble the job in Java. So far I’ve found zilch where OMR libraries are concerned. any ideas? (It’s looking like C# will be the viable alternative here)
Hi Allen,
Good to hear you’re doing the development thing! If you’re looking for object-relational mapping (ORM) software, check out this list:
http://en.wikipedia.org/wiki/List_of_object-relational_mapping_software
Preston