The myths of bioinformatics software

roel_v · on July 10, 2015

Much of this is equally applicable to other fields, but I very much disagree with point 2. Every lab should have one or more programmers, people who are professionals at writing software, and who guide researchers in their software development. For both efficiency and accuracy reasons.

But of course a software developer at a university is, at best, a 'lab assistant', but more likely regarded to be on the same level as the janitor (both in respect and in pay). With the result being thousands upon thousands of shitty programs you wouldn't wish work on to your worst enemy.

But hey, cool with me, I carved out a consulting niche in cleaning up such messes in exactly that environment. But man could a lot of money be saved, and a lot of much better work be done, if only researchers (and the hierarchy above them) would recognize that software development is both critical to pretty much any research today, as well as something they cannot just pick up on the side.

collyw · on July 10, 2015

I am a software engineer for a sequencing centre and I completely agree. Scientists don't value proper software development and think that writing one off scripts is the same thing.

wlievens · on July 10, 2015

> But hey, cool with me, I carved out a consulting niche in cleaning up such messes in exactly that environment.

Sounds like an interesting story there. Anything you can share about it?

roel_v · on July 11, 2015

Not so interesting really, just that funding is structured in such a way that there are ways to get funded to convert research prototypes into production-quality software, and as long as you have researchers willing to include it in their proposals, domain knowledge to have credibility and hold your ground in your consortium, and a network within which you're known for this; then you can make a living as an independent researcher/software consultant. I'll add to this that I stumbled into some peculiar, non-replicable circumstances which let me bypass the 'only eat ramen for several years' phase that is the rite of passage in the academic world. The opportunity cost of that would have made it non-rational for me to do this. So I don't have actionable career advice on this line of work, I'm afraid.

_Wintermute · on July 10, 2015

Problem is that scientists get paid pretty terrible wages, though this is expected and pretty much drilled into you since undergraduate 'If you wanted to be paid for your brains, you should have gone into finance' and so on.

On the other hand, programmers and software developers are not likely to put up with being paid peanuts when there are plenty of well paying jobs out there. I can't see much reason why many of them would stick around.

roel_v · on July 11, 2015

"Problem is that scientists get paid pretty terrible wages"

I can't claim to know the situation everywhere, but apart from the postdocs (and I guess the 'adjunct' temp positions endemic in the US), assistant and full professors actually get paid quite well (I know of Western Europe, AU/NZ, Canada, and I think it's the same in most places in the US too - although I don't know first hand there). Funding agencies would fund software people. There is no will to change it though, because hey I wrote 5000 lines of C or Fortran back in the 80's when I was a grad student, so why can't you do the same today? I have a CD-Rom with a very good C++ compiler I've been using since the early 1990's, I'll send you a copy, you should try it out! (I paraphrased that last line, but I was actually told this less than 4 months ago).

jghn · on July 10, 2015

I did the latter for years although it was really because I had no idea how much wages had inflated in that time period, primarily because Americans just don't talk about salary much and I didn't think to poke around. Once I realized the situation it didn't take long to double my salary.

GFK_of_xmaspast · on July 11, 2015

I turned down a job working as a scientific-programmer-at-large for a college recently because it would have been a 40% pay cut. (I might have taken it at 20%, but I have a mortgage and a family).

88e282102ae2e5b · on July 10, 2015

How does one carve out such a niche?

roel_v · on July 11, 2015

I don't know, I more stumbled into it - I'll admit that my wording made it sound like it was premeditated in my case...

meeper16 · on July 10, 2015

And the most long standing myth, which first started after the Human Genome was sequenced by Francis Collins and Craig Ventor (now working on human longevity) over 14 years ago: Bioinformatics software will single-handedly be responsible for discovering billion dollar drug targets. This is in large part why most early bioinformatics companies failed - due to lack of deliver on this front along with jangled software approaches that were being moved into the commercial world from academia. The reality is that most bionformatics software relies on old formal methods and is not geared toward true highly innovative discovery and data interpretation.

I do think however that we are entering a new age of bioinformatics and its associated data mining, interpreation, visualization and discovery tools which hopefully will push the bounderies of being less formal and more experimental. We need to make discoveries faster when it comes to Life Sciences.

I think new approaches in bioinformatics/datamining/data science/visualization will have the greatest impact in the areas of extending human lifespan. This is what Craig Ventor, Google Calico Labs, SENS, GenoPharmix, Buck Institute are all working on now.

nmrm2 · on July 10, 2015

This post comes across as 1) overly dismissive of what's been done so far, and 2) extraordinarily breathless about the capabilities of "data science", whatever the hell that is.

You don't get your massive data sets without all the work done in the 90's. Also, the whole history of Science is a counter-factual to the idea that "all you need is data". More often than not, you get the data. And then you have the data. And then a long time later you finally get the (right) theory. And then a long time later you finally get the product.

Unbridled optimism isn't a virtue. IMO, the sort of attitude you're expressing here (to be clear, not you, but the general attitude, which is very common) has a good shot at precipitating the next AI Winter. The promise of automating Scientific insight is a dangerous (and IMO stupid) promise to make.

meeper16 · on July 10, 2015

I admire, praise and was part of the initial revolution in bioinformatics. It is very true that it's not all just about the data and algorithms. When coupled properly, data and algorithmic processes result in a platform that enables the genomic scientits and researhers to make that leap in discovery.

Platforms that can help researchers and commercial organizations to form new hypothesis are key.

Ideally, it would be nice to see more platforms with the ability to combine pieces of knowledge to hypothisize new potential discoveries validated by researchers.

dalke · on July 10, 2015

I think this myth started long before the human genome was sequenced. It is, after all, little different than the dreams for computer aided drug design in the 1980s and nanotech of the 1990s.

Btw, it's Craig Venter.

Also, can you think of a time in the last century where phrases like "we are entering a new age" and "We need to make discoveries faster", weren't generally believed? I can understand your excitement, but suggest that those phrases are mostly meaningless.

meeper16 · on July 10, 2015

It's all about the data and we did not have it until the genome was sequenced. This is when the bioinformatics revolution started.

Convert the term "entering a new age" to "entering a new phase". Done.

Better to be optimistic when it comes to human health, science and lifespan as opposed to letting cynicism hold us back.

dalke · on July 10, 2015

I recognize that people may consider comments like mine to be cynical.

However, consider the phrases "jangled software approaches" and "old formal methods". Those appear to be used in a negative context. As I think people in the past are equally clever and insightful as people now, with the same sorts of optimism about "human health, science and lifespan", I regard those negative comments in a different sort of light. They have a certain cynicism about the past.

Why weren't the people of the 1960s "geared toward true highly innovative discovery and data interpretation"? Was it simply the lack of data? But this was also the era which coined the term "information explosion", and developed new industries to handle it.

If it is simply the lack of data, then surely in 50 years time people will say that Venter, Google Calico Labs, etc. themselves were hayseeds who used "jangled software approaches" that weren't "geared toward true highly innovative discovery and data interpretation", no? Your baseline requires knowing the human genome, but someone else's baseline might require the genome for 1 million people.

If so, then my belief is that such descriptions carry little intrinsic meaning, and mostly mean "we know more now then they did then."

meeper16 · on July 10, 2015

Well said. In terms of my mention of "old formal methods", these are being used today as academics tend to be risk averse related to employing more experiemental methods due to reputational and funding risks.

dalke · on July 10, 2015

Who developed those methods in the first place? If it was academics, why did they not have the same "reputational and funding risks" in the late 1990s, which is when you said "the bioinformatics revolution started"? What's changed in the last 20 years?

You compare academics in general to a small number of non-academic people and organizations. The latter are not representative of non-academics, because I know people in industry who use the same methods as those in academics. Is Patcher, the author of this piece and an academic, one of the people who uses "old formal methods" due to "reputational and funding risks"?

I'll add that I mostly work in cheminformatics. There is relatively little academic research in this field. Most of it is driven by industry, and more specifically drug discovery. A lot of people in industry also use "old formal methods".

Finally, here's a quote from someone who "served on the advisory council of [a longevity] organization, along with the chairmen of a number of major US corporations":

> Why has the problem of aging been such an intractable one? Up until [recently], the prevalent view of scientists had been that the task of controlling aging was fundamentally impossible. But today, such a consensus no longer exists. Many researchers now believe that their predecessors failed, not because their goals were misguided, but because the tools and the level of sophistication they could bring to the task were inadequate. Moreover, it is argued that progress has been hampered because funding has been scarce, and researchers concerned with aging have been too few and far between. ...

> Growing public and private support for aging research reflects the scientific community’s own increasing commitment. Today, aging research occupies unprecedented numbers of highly talented individuals, not only specialists in gerontology, but researchers from other disciplines as well. These include biochemistry, endocrinology, immunology, neurobiology, genetics, and cell biology, to name only a few.

Before following the link, when do you think that essay was written? http://www.garfield.library.upenn.edu/essays/v6p107y1983.pdf .

meeper16 · on July 10, 2015

[PDF]Origin & History of Bioinformatics - arXiv arxiv.org/pdf/0911.4230 by SM Thampi - ‎2009

http://arxiv.org/ftp/arxiv/papers/0911/0911.4230.pdf

"The gathering, archival, dissemination, modeling, and analysis of biological data falls within a relatively young field of scientific inquiry, currently known as ‘bioinformatics’, ‘Bioinformatics was spurred by wide accessibility of computers with increased compute power and by the advent of genomics. Genomics made it possible to acquire nucleic acid sequence and structural information from a wide range of genomes at an unprecedented pace and made this information accessible to further analysis and experimentation. For example, sequences were matched to those coding for globular proteins of known structure (defined by crystallography) and were used in high-throughput combinatorial approaches (such as DNA microarrays) to study patterns of gene expression. Inferences from equences and biochemical data were used to construct metabolic networks. These activities have generated terabytes of data that are now being analyzed with computer, statistical, and machine learning techniques. The sheer number of sequences and information derived from these endeavors has given the false impression that imagination and hypothesis do not play a role in acquisition of iological knowledge. However, bioinformatics becomes only a science when fueled by hypothesis-driven research and within the context of the complex and everchanging living world. "

kodisha · on July 10, 2015

I also think it has to happen on the web (open or closed), instead of the desktop software.

Also, d3 wont be able to pull it off. If we do a comparison, i think that d3 is MooTools, wee need to get to ES6/angular/react kind of libraries.

collyw · on July 10, 2015

What do you feel that ES6/Angular/React do that d3 won't? They are for building user interfaces. d3 is for charting.

(Most bioinformaticians I know seem to use R).

kodisha · on July 11, 2015

It was a comparison. Once MooTools was the library to use, now its dead.

dalacv · on July 10, 2015

Why do you think it has to be web-based?

kodisha · on July 11, 2015

- collaboration (easier than desktop, but still possible there)

- instant updates and critical bug fixes

- reproducibility

- visibility

also, because of "Atwood's Law" [1]

[1] http://blog.codinghorror.com/the-principle-of-least-power/

coliveira · on July 10, 2015

An important point in this article is the vital distinction between academic software and "general purpose" software. The goal of research software is to prove or exemplify a point explained in one or more papers. It should do so in the most direct and economical way for the researcher(s). It makes no sense to create multi-platform, software-engineering friendly software for this kind of use. In the last few years we have seen an untold number of complains that research software is not robust, user-friendly, etc., and it entirely misses this important fact.

angersock · on July 10, 2015

The problem is that the "direct and economical way" tends to basically enforce bit rot and lack a reproducibility. It also tends to screw whoever inherits the project.

coliveira · on July 10, 2015

Reproducibility is an issue in all scientific fields, it is not something unique to computer science. Can someone easily reproduce experiments performed in a particle accelerator? What about an extremely complex chemical reaction? The answer to these problems is to have more people doing research in that area to make sure that the problem is well understood, instead of requiring that scientists slow down the research because it needs to first achieve some kind of "engineering reproduction" standard.

bbgm · on July 10, 2015

Speaking for the chemical reaction - yes. That is exactly what should be possible. Some reactions are hard to reproduce and labs spend years trying to figure out how to do so. Without that you really don't have good science.

The key is .. can someone read your paper and reproduce your experiment and the results. If not, it's not valid; it's just a paper.

dalke · on July 10, 2015

That's not true. Some types of chemistry cannot (legally) be reproduced, but they are still good and valid science. For a trivial example, consider the many chemistry papers based on measuring fallout effects from nuclear testing.

More prosaically, some things can be verified even when a chemical process cannot be reproduced. Protein crystallization is a tricky process with many irreducible results. Yet if you have a crystal, and can use the crystal to determine the protein structure, then you can verify that the crystal indeed contains that protein - even if you are unable to reproduce the chemical process used to create the crystal in the first place. Others can use the same sample to re-verify the resulting x-ray structure even though they also might be unable to reproduce the crystallization step.

angersock · on July 10, 2015

My critique isn't aimed at computer science research--it applies equally well to other fields using computation to produce results.

There are a lot of projects (and I won't name names, because I've got friends here who work on them) whose results cannot be reproduced, because they ran on some weird-ass combination of a weird libc, using a weird build system, running on weird hardware environments (especially common in HPC), with weird one-off data sets.

They report the results using this setup, and then the setup is gone forever. The code is probably not kept correctly in source control, and generally the community just takes them at their word that the results they get are authentic.

When you run millions of iterations on some crazy quantum solver, it's hard to spot if you're wrong. If others can't at least run the same result, science becomes a matter of anecdote.

coliveira · on July 10, 2015

As hard to believe as it is for software engineers, science is not in the code, it is in the thought process that leads to the code and in the interpretation of the results. The code for an experiment can be lost forever, but if you did your job correctly you have described why, how, and where the experiment works. Future scientists can now "build on the shoulders" of this knowledge and use it to go further and validate the previous result or discredit if it is indeed incorrect.

People complain that published papers are full of incorrect results, but they don't understand that this is just part of the process. Incorrect results will be forgotten because they will never be validated. Correct results will be validated and used (cited) by new scientists as they try discover even more complicated things.

collyw · on July 10, 2015

Things would work so much better if the software was written in a better way.

cossatot · on July 10, 2015

Though you do have a point, I think it's really not binary like that; you've given end-members.

There are a good number of scientists (myself included) who spend at least part of the time making applications for other researchers; it's best when the products are multi-platform, because there are a lot of scientists who only use one platform. These application can be anything from plotting and basic analysis tools to simulators of various sorts. They're more specialized than Evernote or something but are still designed to be installed and used by members of the community, because it's a waste of time for everyone to write their own finite element models (even if everyone knew how).

When done right, not only are these tools enabling but they become very widely cited. For instance, in my field (geoscience) a tool called GMT (Generic Mapping Tools) has over 6000 citations, even though I'd estimate 70% of papers that have used it in some form don't cite it. This is an enormous number of citations in the geosciences; most very famous and influential papers have ~1-2000.

Maybe it's different in CS; I have heard that much of the software is illustrating a new algorithm or something.

stonemetal · on July 10, 2015

That seems rather short sighted. Science is often described as standing on the shoulders of giants, you can't do that if everyone has to start from scratch. Sure it doesn't have to follow NASA's coding guidelines but it also shouldn't be utter garbage that no one besides the authors can figure out.

arca_vorago · on July 10, 2015

The first opportunity to comment on bioinformatics since my non-compete/nda is over!

This seems like a very sloppily put together list of myths, but I'll bite anyway.

1. Not true, but I think that's largely because much of the software used is closed so the FOSS community is largely anemic in the bio world. For the tools that are FOSS or BSD, I saw plenty of contributions, but the other thing to keep in mind is that it's not just about the programming. You have to have a certain level of understanding of the application domain to program a solution for it properly, and there are very few of these people around. I predict a huge uptick in demand and salaries for bioprogrammers.

2. Is true. You need your own people on salary to program for your needs. I was the sysadmin part of a phd, sysadmin, programmer team and we were doing stuff that no-one else was going to do for us. You need to have your own programmer, and a good sysadmin, full stop.

3. Is also true. Picking the right license is important because many labs are pretty tight on cash flow. Sure, they probably have millions going through them a month, but operating costs are super high and margins are lower than you may think. It was during my time in the genetics lab that I fully realized why FOSS was so important, and I think it's the future. (with a few key proprietary exceptions that no FOSS has matched yet, (think Elmer vs Comsol))

4. Using a FOSS license makes this a moot point to address. Use GPLv3 code people, stop using BSD!

5-9: not worth addressing.

Anyway, my overall view of the field is this: with sequencing getting cheaper, the problem is in managing the levels of data being generated (sysadmin issue) and in interpreting the data for meaningful results (programmer/phd issue). Personally, I think that machine learning is going to be the right breakthrough to follow and apply to bio, and once we do that I expect it to take off to crazy levels. I'm talking sequencer in every doctors office, and artificial genetic manipulation becoming much easier and with more accurate predictions.

Also, the other thing everyone underestimates is the microbiome as an entity. You are more the bacteria that lives in you than you are you. Of course, I struggle to understand the science sometimes, I'm just a sysadmin, so take what I say with a grain of salt.

dalke · on July 11, 2015

1. "the FOSS community is largely anemic in the bio world"

I'm shocked by that. I think of the bio world as having a lot of FOSS software. With BOSC it even has its own yearly (satellite) conference. By comparison, I work in chemical informatics, and think that FOSS availability in chemistry is rather less than biology. I've blamed money - chemists have a longer history of making money from their research than biologists.

Which research field do you think has a robust FOSS community and where there are also for-profit companies developing commercial in the same field?

2. "You need to have your own programmer, and a good sysadmin".

You say that myth is true, while Pachter thinks it's false. Your disagreement is really more one of team size. Pachter's myth concerns "a large team." You have a team made of a handful of people. That is, Pachter wrote: "I agree with James Taylor, who ... stated that ”... Scientific software is often developed by one or a handful of people.”", which is what you have.

I can agree that it's not clear from the text that the myth concerns, having ~6 or more people in the team.

3. Could you explain why your explanation makes the myth true?

Pachter gave the example of the UCSC genome browser. From its web site, "The Genome Browser, Blat, and liftOver source are freely downloadable for academic, noncommercial, and personal use. For information on commercial licensing, see the Genome Browser and Blat licensing requirements." Were you working in a for-profit genetics lab when you realized that FOSS was so important? Otherwise, how would a restriction like the UCSC one have affected you?

4. I actually think the FOSS vs. licensing cost are somewhat orthogonal issues. I distribute my software under the MIT license, but only to those people who pay me about US $25,000. This is something the FSF encourages, as a way to bring revenue to a project. However, it only works because I'm not trying to establish some sort of "community" but prefer a vendor/customer relationship.

BTW, I think points #1, #2, #3, and #5 can be recast as "you need to build a community around your project." I think that's a myth in its own right.

infinite8s · on July 11, 2015

"4. I actually think the FOSS vs. licensing cost are somewhat orthogonal issues. I distribute my software under the MIT license, but only to those people who pay me about US $25,000. This is something the FSF encourages, as a way to bring revenue to a project. However, it only works because I'm not trying to establish some sort of "community" but prefer a vendor/customer relationship."

Why the MIT license then? Do you figure that people who paid $25k for the software aren't going to go upload it to github for all to see?

dalke · on July 11, 2015

Correct. My clients are drug development companies. Very few of them distribute any sort of software, in part because often they would need to get legal to sign off on it. Even fewer provide support for the software.

arca_vorago · on July 13, 2015

Apologies for the late reply, so you may not see this.

"Were you working in a for-profit genetics lab when you realized that FOSS was so important?"

Exactly it. I had a very tight budget and therefore I was really pushed towards free licenses for that reason, but also because I think FOSS licenses have not yet fully disproven many eyes theory.

roel_v · on July 10, 2015

"2. Is true. You need your own people on salary to program for your needs."

I think what he was saying is "do it all yourself", the opposite of you interpreted it, it seems. But maybe I got it wrong.

dragonwriter · on July 10, 2015

You may be confusing "FOSS" (Free/Open Source Software) with copy BSD is F/OSS, it is not copyleft. GPLv3 is both F/OSS and copyleft.

arca_vorago · on July 10, 2015

Not confused, its just that when I say FOSS I think FSF aligned licenses, eg, FOSS = Free AND Open Source Software != Free/Open Source Software. I feel like its an important distinction.

That being said, I understand thats now how most view the phrase so I will adjust verbiage accordingly.

dragonwriter · on July 10, 2015

> Not confused, its just that when I say FOSS I think FSF aligned licenses, eg, FOSS = Free AND Open Source Software

BSD License is both Free Software per the FSF and Open Source per OSI. In practice, pretty much all FSF-recognized Free Software licenses are OSI-recognized Open Source licenses and vice versa.

arca_vorago · on July 10, 2015

I stand corrected, thanks for the info.

cjbprime · on July 10, 2015

As https://twitter.com/madprime/status/619503684838387716 points out, the argument that you're cheating the US Government out of public money by releasing without a non-commercial clause is bizarre -- everything the US Government releases is required by law to be released into the public domain.

meeper16 · on July 10, 2015

With the big exception of what is funded by the DOD/DOE, DARPA etc and run through national labs like Berkeley, Livermore, Sandia... this is why they have commercial licensing and tech transfer operations in place.

maaku · on July 10, 2015

Except that's not true? Most things release "by the government" had a contractor involved somehow, and government contractors have a different set of rules.

i000 · on July 10, 2015

Are you suggesting scientists are "goverment contractors"? Never thought of myself as one.

dekhn · on July 11, 2015

What did you think you were doing when you got a grant from the NIH, NSF, or DOE? You entered in a research contract with the government.

angersock · on July 10, 2015

Well, yeah. What's the problem? Money has no smell.

GFK_of_xmaspast · on July 11, 2015

There's a huge difference between 'government employee' and 'accepts federal grants'.

jerven · on July 10, 2015

Your data is more important than your code. Is the often neglected fact in bioinformatics. Whatever you do document your file formats.

danieltillett · on July 11, 2015

As someone how actually makes a living selling bioinformatics software, the problem is mainly due to how scientist view software. The code you write is seen the same way lab books are - basically raw data. Nobody publishes their lab books and all too often software is thought of as just an electronic lab book. It would be great if this changed, but it needs a change in how scientist look at software.

bmir-alum-007 · on July 10, 2015

Disclaimer: I used to work at a Stanford bioinformatics shop.

There's a clear need of AWS-like features for bio/biomedical informatics specifically enabling sharing, security, reuse and anonymization of data (PHI), libraries (like R's bioconductor) and infrastructure (IaaS/PaaS/SaaS).

The issue is that some labs archive still archive their data on actual hard drives (USB and bare drives), making their data much less useful than somewhere readily available and sharable.

I think it's a huge (billion+) opportunity where the right execution would need loads of smart, consultingish customer service reps (huge overhead costs) to help researchers with coding, sysadmining and bio to some degree. Basically, a full-service (with self-service, a-la carte features) hosting company for bio / medical.

This space is only going to grow deeper and wider as more is discovered and confirmed about each gene, protein, pathway and each accompanying expansion in nosology. This sort of research knowledge is vital and unlikely to shrink. The main issues are that it would be a cash-intensive and undefensible business model because it requires paying lots of consultant/scientist brains and anyone can copy the model.

macarthy12 · on July 11, 2015

> There's a clear need of AWS-like features for bio/biomedical > informatics specifically enabling sharing, > security, reuse and anonymization of data (PHI), libraries > (like R's bioconductor) and infrastructure (IaaS/PaaS/SaaS).

I tried to build a startup like this, basically a Heroku for bioinformatics, with a bunch of experienced biologist / genome folks. They just didn't get it and on my part, I guess I couldn't sell it to them. Part of it was snobbery, and institutionalized thinking. It was a big disappointment. Some one will do it, but until then it will crappy bioperl scripts, with no version control etc.

infinite8s · on July 10, 2015

The main issue is this expertise is too expensive to provide at the lab level - probably needs to be provided at the institutional level.

x0x0 · on July 10, 2015

and you're competing with dirt-cheap grad students / student ras / post-docs

most labs are just too cheap

(I wrote custom stats software for image / FRET analysis and my lab was certainly too cheap.)

rch · on July 10, 2015

The article mentions code quality and license issues, and one of my favorites (MEME) seems to suffer a bit from both. I believe the first aspect is simply the result of being developed in a sequential fashion by different contributors (which is reasonable given the environment).

The main problem is that the license rules out using the software for 'commercial purposes' except under unspecified terms that would need to hashed out with the tech transfer office. I completely support the spirit of that construct, but it makes it difficult to advocate for in practice. At least in this case, GPL or LGPL would be a significant improvement.

dalke · on July 10, 2015

The MEME commercial license is at http://techtransfer.universityofcalifornia.edu/NCD/Media/MEM... . How is this "unspecified terms that would need to hashed out with the tech transfer office"? The license is US$2,500 for a license, which as Pachter correctly points out is a small cost for most companies.

Also, while it would be a significant improvement to you, would it be a significant improvement to the science?

For example, I can't speak to MEME but I know of a couple other projects where the software is at low/no cost to academics and has a license fee for commercial use. This money is used to fund future development, which gives a funding source that is independent of grant funding. Pachter also points out this possibility.

It can be frustrating if some people cannot use a package due to license terms or costs. But it can also be frustrating to use a package where no one is available to answer questions or fix bugs - which is something that funding can address.

rch · on July 10, 2015

Thanks for the link - I hadn't seen it. I did know the cost since I suggested a client buy a license a couple of years ago, which they provided, but I didn't know it was such a straightforward process. My comment was based on my experience of getting it approved internally, which took a number of months and more than a little effort. During that time more work was done in other areas.

bbgm · on July 10, 2015

If you are bootstrapped it is a huge barrier. And Meme is not that bad. There are others that are far worse. I once tried using Modeller purely for hobby projects (science was not even my day job) but was denied. That's a problem.

dalke · on July 10, 2015

Not all business models are economically viable, and others are not obligated to make it easy to support your choice of a bootstrap business model.

My experience is that people who get a piece of software, even if for free, often want some support. If you don't give them support, some will complain. A company may decide that it's easier to deal with complaints about the lack of a free version for hobbyists than to deal with complaints about a user not getting sufficient (unpaid) support.

shiggerino · on July 10, 2015

That's news to me, most companies seem to understand full well that if they want support on a free program they have to pay for it.

dalke · on July 10, 2015

bbgm made two comments; one about a company under bootstrap, the other about "using Modeller purely for hobby projects".

My first paragraph was a response to the first comment. My second paragraph was about the second.

shiggerino · on July 11, 2015

Oh, you were talking about the hobbyists?

That's easy. Tell them to RTFM, and if something doesn't work, have them submit a patch. That's not a heavy burden.

dalke · on July 11, 2015

There are very few people who do bioinformatics as a hobby. The term "hobbyist" generally refers to people who are in a related field and want to dabble with a different set of tools, with no firm intention to do serious research in that field.

However, these hobbyists tend to be experts in their (related) field, and talk with actual customers or potential customers of the company. Word of mouth is important, and if the company tell someone to RTFM and submit a patch, then this may lead to a bad vibe. Or it might not. But it's easier to say "only paying customers" and have no burden than to take on even the light burden of responding to non-paying users.

Also, some hobbyists will use their "hobby" status to do research on the cheap. (For a related example, some professors have their own company, doing work related to their research, and use their educational status to get software that ends up being used for work that helps the company.) They might, for example, have an idea that they want to "play around with", which might be competitive with the product. It's not a serious idea, "just a hobby", but they are curious to see if it's something interesting, so get a copy of the software, and use that to judge if the idea should be investigated more seriously. Or start a competitive company. Or publish a paper where they demonstrate that when the tool is used poorly, by someone who doesn't the tool, then it produces poor results.

Is that really a hobby, or is it self-delusion used to avoid committing oneself to a project?

I can't tell, and neither can a company. A policy of "paying customers only" makes it easy to decide. There are certainly other solutions, but is there a concrete advantage to taking on that burden, no matter how light?

shiggerino · on July 10, 2015

Insisting bioinformatics software be non-free is pretty rules out any possibility anyone is going to build on your code. If this is the case, that's regrettable, but why seal the fate?

If they are afraid of companies using and abusing the software, just put it under the GPL and they will at least have repay the favour to the users and the community.

dalke · on July 11, 2015

The essay gave an example of non-free project that others have built upon:

> One of the most widely used software suites in bioinformatics (if not the most widely used) is the UCSC genome browser and its associated tools. The software is not free, in that even though it is free for academic, non-profit and personal use, it is sold commercially. ... As far as development of the software, it has almost certainly been hacked/modified/developed by many academics and companies since its initial release (e.g. even within my own group).

Therefore, by demonstration, using a non-free license does not "seal the fate" and rule out others from building on your code.

Also, GPL does not require anyone to "repay the favor." There's no requirement to distribute modifications upstream or to "the community."

shiggerino · on July 11, 2015

>There's no requirement to distribute modifications upstream or to "the community." No, that would obviously be a onerous requirement. The GPL strikes a reasonable balance between the individual user and the community of users.

I'm just saying the customers should be allowed to get the derived software on the same generous terms as the company received the original on. The customers are obviously not required to redistribute anything, but it encourages good behaviour.

dalke · on July 11, 2015

If it's "obviously [an] onerous requirement" then what does "they will at least have repay the favour to the users and the community" mean?

You clarify that "customers should be allowed to get the derived software on the same generous terms as the company received the original on", but that assumes that the companies have customers. Very few companies that use academically produced bioinformatics software have downstream customers of that software.

For the vast majority of companies that only use the software in-house, which is likely 99+% of all companies, how does using the GPL or any other free software license lead to the company repaying the favor? What's wrong for asking for payment in cash instead of other more nebulous contributions?

shiggerino · on July 11, 2015

>If it's "obviously [an] onerous requirement" then what does "they will at least have repay the favour to the users and the community" mean?

The users will get the sources and the permission to use those sources so they do not have to be powerless and dependent on the company's good will to provide patches. I can't imagine why this is a difficult concept to comprehend.

>You clarify that "customers should be allowed to get the derived software on the same generous terms as the company received the original on", but that assumes that the companies have customers. Very few companies that use academically produced bioinformatics software have downstream customers of that software.

Yes, if they have customers, no matter how few, those customers should have all the permissions associated with free software. I don't see the problem here. It's not an unreasonable assumption that a nonzero number bioinformatics companies have a nonzero number of customers.

>For the vast majority of companies that only use the software in-house, which is likely 99+% of all companies, how does using the GPL or any other free software license lead to the company repaying the favor? What's wrong for asking for payment in cash instead of other more nebulous contributions?

If nothing else, the burden of maintaining a private fork, and integrating upstream changes with your own, can be an incentive to just send your changes upstream.

dalke · on July 11, 2015

I have studied how this happens in my field, which is cheminformatics, not bioinformatics.

One company decided to 'sell' a GPLv2 product, that is, provide the product for free and sell a support contract. They had about 12 paying support customers. They ran into a problem in that some of their customers made local changes in order to add new features. This is one of the well-known advantages of having the source code.

However, it was difficult for the customers to contribute the code upstream. One reason is that some of the changes were considered, or could be considered, proprietary. This means they would have to deal with legal in order to get permission to send upstream, and they didn't want to do that.

As a result, when the vendor distributed new releases, the customer would actually integrate the changes each time. Or rather, not integrate the new changes at all, because it was too much work each time. They ended up with code that was out of date (and buggy) because of the decision to make local changes that were difficult to integrate upstream.

The "right" solution would have been to work with the vendor to change the code and/or provide new APIs for what the customer wants in order to integrate the proprietary methods, but without fully integrating those methods. However, when you have the code Right There it's very easy to just go ahead and do everything. Talking with upstream to get consensus on changes, even from a vendor that you are paying, is more difficult than cranking out code.

As you can tell, this short-term decision to get code done NOW can to long-term problems. Frankly most of the customers don't have the software development experience to handle those problems. They are chemists who learned to program, not software developers.

The vendor polled all of their customers and found that while the customers liked having the source code but had no real need for the source code, and were more willing to use the traditional vendor/customer route to add new features than the free software/contribute upstream/"community" route. Another way to say it is that they had the money to pay for support, but didn't have the time to go through the community process.