
in reply to daniel:// stenberg://

your best year is the one with the highest ratio of (that year's lines of code still remaining X years later)/(number of lines written in that year)
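
A minimal sketch of that ratio in Python. The function names and the sample numbers are made up for illustration; they are not taken from curl's data.

```python
def survival_ratio(lines_written: int, lines_remaining: int) -> float:
    """Fraction of a year's lines that are still present X years later."""
    if lines_written <= 0:
        raise ValueError("lines_written must be positive")
    return lines_remaining / lines_written


def best_year(stats: dict[int, tuple[int, int]]) -> int:
    """Return the year with the highest survival ratio.

    stats maps year -> (lines written that year, lines remaining X years later).
    """
    return max(stats, key=lambda year: survival_ratio(*stats[year]))


# Hypothetical numbers, for illustration only.
example = {2001: (10_000, 2_500), 2002: (8_000, 4_000), 2003: (12_000, 3_000)}
print(best_year(example))
```

Note that a year with few lines written can win easily, so it may be worth looking at the absolute counts next to the ratio.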
in reply to daniel:// stenberg://

I think this one turned out to be the most informative one, or at least it piques my curiosity the most.

I think I'll try following along this graph with curl's version history at hand. For example, I now wonder what kind of refactoring happened around late 2011 - the older code amount drops rather sharply there :)

in reply to daniel:// stenberg://

the only surviving line from the 20th century
in reply to Urix Turing

@urixturing that is actually 1254 lines. I presume mostly file header comments or something...
in reply to Christoph Petrausch

@hikhvar @dascandy

extract the data using git blame => github.com/curl/stats/blob/mas…

render the graph from the data the script generated using gnuplot => github.com/curl/stats/blob/mas…

in reply to daniel:// stenberg://

Oh this is really nice! You've inspired me to generate this for the Linux kernel. The git blames are running now... I parallelized it, but it's still going to take a while! :)
in reply to Kees Cook

@hikhvar @dascandy LOL. 102 Linux kernel tags, averaging 3 minutes per blame run (so far). 5 hours to generate the data. O_O
in reply to daniel:// stenberg://

@kees @dascandy Very nice visualization! I see that the Linux kernel has less churn than curl, so once code is committed it is very likely to stay.
in reply to Christoph Petrausch

@hikhvar @dascandy Yeah, it's not as steep as with curl, but I'm starting to see it getting deeper with each segment. The 2016-2018 segment seems to eat into prior areas much more than the other year segments.

I'm so impatient! Blame, git, blame! ;)

in reply to Kees Cook

@kees @dascandy That is true. But the kernel code also sits on a very solid bedrock from the pre-2006 era.
in reply to Christoph Petrausch

@hikhvar @dascandy I'm curious to see how much of what's left from "start of git history" in Linux is blank lines and comments. :) I'll need a whole new scanner for that. :P
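
A naive sketch of what such a scanner could look like in Python. It is not taken from any of the scripts in this thread, and `classify_c_lines` is a hypothetical name. It assumes C-style comments, ignores comment markers inside string literals, and counts a line that mixes a comment with code as a comment.

```python
def classify_c_lines(source: str) -> dict[str, int]:
    """Count blank, comment and code lines in C source (rough heuristic)."""
    counts = {"blank": 0, "comment": 0, "code": 0}
    in_block = False  # True while inside a /* ... */ block comment
    for raw in source.splitlines():
        line = raw.strip()
        if in_block:
            # Everything inside a block comment counts as comment,
            # including empty lines.
            counts["comment"] += 1
            if "*/" in line:
                in_block = False
            continue
        if not line:
            counts["blank"] += 1
        elif line.startswith("//"):
            counts["comment"] += 1
        elif line.startswith("/*"):
            counts["comment"] += 1
            if "*/" not in line[2:]:
                in_block = True
        else:
            counts["code"] += 1
    return counts


sample = """/* header
 * comment */

#include <stdio.h>
// note
int main(void) { return 0; }
"""
print(classify_c_lines(sample))
```

To answer the question above, the input would have to be only the lines that blame attributes to the oldest commits, which this sketch does not do by itself.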
in reply to Kees Cook

yeah, that's basically what I found in the curl code left from < 2000. Mostly comments and a few #ifdef/#defines.
in reply to kurtseifried (he/him)

@kurtseifried @hikhvar @dascandy Yeah, once this finishes I'm going to rework the caching and also store paths. Then I can re-run it with arbitrary path filters. The counting phase is fast. The blame phase is sloooow. 😅
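
One way the "store paths, then filter" idea could work, sketched in Python. This is an assumption about a possible cache layout, not how the actual tool stores its data: per tag, keep a mapping from file path to per-year line counts, so the slow blame phase runs once and the fast counting phase can be repeated with any path prefix. The paths and numbers below are hypothetical.

```python
import json
from collections import Counter


def filter_counts(cache: dict[str, dict[str, int]], prefix: str) -> Counter:
    """Sum per-year line counts over all cached files under a path prefix.

    cache maps file path -> {year (as a string, as JSON requires): line count}.
    """
    total: Counter = Counter()
    for path, years in cache.items():
        if path.startswith(prefix):
            total.update({int(year): count for year, count in years.items()})
    return total


# Hypothetical cache for a single tag, round-tripped through JSON
# the way it would be when stored on disk.
cache = json.loads(json.dumps({
    "lib/url.c": {"2004": 120, "2016": 30},
    "lib/http.c": {"2004": 80},
    "src/main.c": {"2016": 50},
}))
print(filter_counts(cache, "lib/"))
```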
in reply to Kees Cook

and yet it runs the simplest form of blame. It could be argued that blame -CCC would give the more "right" info, but that's just so slow it's unbearable to use for this
in reply to daniel:// stenberg://

@hikhvar @dascandy Not sure if you want this too; I ended up tweaking the plot's display of lines slightly with this format:

set format y2 "%.0s%c"

So instead of, e.g., 200000, it'll show 200k

in reply to daniel:// stenberg://

How did you gather the data to generate this graph?

This would be very helpful for some repositories 🎉

in reply to Alex Rock

@pierstoval git blame is our friend. This is my (fairly small) perl script that extracts all the data:

github.com/curl/stats/blob/mas…
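
For readers who prefer Python, here is a rough sketch of the same idea. It is not a port of the perl script: it assumes `git blame --line-porcelain` output, uses the `author-time` field as the age of each line, and the function names are made up.

```python
import subprocess
from collections import Counter
from datetime import datetime, timezone


def years_from_porcelain(porcelain: str) -> Counter:
    """Count blamed lines per author year from --line-porcelain output."""
    years: Counter = Counter()
    for line in porcelain.splitlines():
        # Header fields start at column 0; file content lines start with a
        # TAB, so they can never be mistaken for an "author-time" header.
        if line.startswith("author-time "):
            epoch = int(line.split()[1])
            years[datetime.fromtimestamp(epoch, tz=timezone.utc).year] += 1
    return years


def blame_years(repo: str, rev: str, path: str) -> Counter:
    """Run git blame for one file at one revision and count lines per year."""
    out = subprocess.run(
        ["git", "-C", repo, "blame", "--line-porcelain", rev, "--", path],
        check=True, capture_output=True, text=True, errors="replace",
    ).stdout
    return years_from_porcelain(out)
```

Calling `blame_years` for every tracked file at every snapshot and summing the counters gives one row of data per snapshot.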

in reply to daniel:// stenberg://

That's very nice, I'm gonna try it out on the project I'm working on (which is probably about the same age as Curl)
in reply to Alex Rock

@pierstoval cool, just ask if there's anything I can help you with. You might spot that I have a way to list all tags and I get the age of the project at those moments in time. If you have tags like that, you can do it the same way; otherwise you need to figure out a different way to identify snapshot moments.
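
As an illustration of the "tags as snapshot moments" idea, a small Python sketch. It is not what the perl script does internally; it assumes `git for-each-ref` sorted by creation date is good enough for your repo, and the function names are hypothetical.

```python
import subprocess


def parse_tag_lines(output: str) -> list[tuple[str, str]]:
    """Turn 'YYYY-MM-DD tagname' lines into (date, tag) pairs."""
    snapshots = []
    for line in output.splitlines():
        date, _, tag = line.partition(" ")
        if date and tag:  # skip empty or malformed lines
            snapshots.append((date, tag))
    return snapshots


def list_tag_snapshots(repo: str) -> list[tuple[str, str]]:
    """List all tags in a repo with their creation date, oldest first."""
    out = subprocess.run(
        ["git", "-C", repo, "for-each-ref", "--sort=creatordate",
         "--format=%(creatordate:short) %(refname:short)", "refs/tags"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_tag_lines(out)
```

If the repo has many tags close together (release candidates, for example), filtering the list down to one snapshot per release keeps the blame run shorter.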
in reply to daniel:// stenberg://

I already tweaked the perl code fetching the tags, but I'm not getting any data yet, I'm trying to figure out the code :)
in reply to Alex Rock

@pierstoval note that the git blame command uses a specific path that you want to update/remove
in reply to daniel:// stenberg://

@pierstoval oh, and there's a version check in the loop at the end that you of course need to cut out
in reply to daniel:// stenberg://

Thanks, did that, also removed the "print cache" statement.

I'll make a fork in order to simplify reviewing it 👍

in reply to Alex Rock

@pierstoval once you've gathered a set of data, you want to make the cache work again as running a full run may take hours, depending on your repo
in reply to daniel:// stenberg://

Yep, it's 20 years old and has like thirty thousand commits, might take a while indeed :)

Here's the current diff: github.com/Pierstoval/stats/pu…

It's not gathering data yet, I'm on it :)

in reply to daniel:// stenberg://

I got caught up in many things so I didn't have time to continue, but I'll certainly work on it over the next few days! I'll keep you posted, in case you're interested :)
in reply to Alex Rock

I'm interested! If it helps, @kees made a port of the script over to python to make it perform better on larger code bases like the Linux kernel: github.com/kees/kernel-tools/t…
in reply to daniel:// stenberg://

@kees It's interesting, I might try this one too, though it also needs tweaking to be adapted to what I need :)
in reply to daniel:// stenberg://

I had some time this evening to check it out. It turns out the few small changes I made are enough to get output, but it looks like this:

❯ perl stats/codeage.pl
2015-09-15;0;0;0;0;0;0;0;0;0;0;0;0;0;0
2015-11-27;0;0;0;0;0;0;0;0;0;0;0;0;0;0
2016-04-01;0;0;0;0;0;0;0;0;0;0;0;0;0;0
2016-04-11;0;0;0;0;0;0;0;0;0;0;0;0;0;0
2016-07-19;0;0;0;0;0;0;0;0;0;0;0;0;0;0
2016-07-27;0;0;0;0;0;0;0;0;0;0;0;0;0;0
2016-11-15;0;0;0;0;0;0;0;0;0;0;0;0;0;0

I'm trying to find out where the 0s come from

in reply to Alex Rock

If I remove the "if" statement in the "sub show" function, apparently it gives me an output, though very slowly as you mentioned before:

❯ perl stats/codeage.pl
2015-09-15;0;0;0;12287;29598;54171;113862;150511;178495;178495;178495;178495;178495;178495
2015-11-27;0;0;0;12287;29337;53754;113326;149811;187962;187962;187962;187962;187962;187962

I don't know if this kind of data is relevant, but it's another output.

I pushed it to my fork, on the PR in an earlier post :)

in reply to Alex Rock

@pierstoval each line is a date and then line number counters for all the different time "slots" separated with semicolons, so that looks like perfectly fine output
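
A small Python sketch of reading one such line. The parsing follows the format described above. The `per_slot` helper rests on an assumption that is not confirmed in this thread: that the columns are cumulative (each column includes the ones before it), which is only a guess based on the sample values never decreasing from left to right.

```python
def parse_codeage_line(line: str) -> tuple[str, list[int]]:
    """Split 'date;n1;n2;...' into the date and a list of integer counters."""
    date, *fields = line.strip().split(";")
    return date, [int(field) for field in fields]


def per_slot(cumulative: list[int]) -> list[int]:
    """Lines belonging to each slot alone, ASSUMING cumulative columns."""
    return [cur - prev for prev, cur in zip([0] + cumulative, cumulative)]


# First line of the output shown earlier in the thread.
line = ("2015-09-15;0;0;0;12287;29598;54171;113862;150511;"
        "178495;178495;178495;178495;178495;178495")
date, counters = parse_codeage_line(line)
print(date, per_slot(counters))
```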
in reply to daniel:// stenberg://

@pierstoval note also that you can run the gnuplot on a partial data set too, so once you have 4-5 lines in there you can test that it seems correct
in reply to daniel:// stenberg://

It seems to be okay when using @kees's scripts! The automatic cache definitely helps a lot 🎉

I will let it run through all day and wait for more details 👌

in reply to Alex Rock

@pierstoval @kees I also experimented with improving the color palette to reduce the duplicates
in reply to daniel:// stenberg://

my current look

See the fc instructions per plot in github.com/curl/stats/blob/mas…


in reply to daniel:// stenberg://

is that how bedrocks are made? looks like it! that would make this geological time.
in reply to daniel:// stenberg://

Watch out for diamonds or other gemstones in the older layers near the bottom, from the digital mesoproterozoic age.
in reply to daniel:// stenberg://

Really the best visualization of this dataset so far!

I find it confusing that only even years like 2000, 2002, etc. are listed. Did you skip every 2nd year? If the data for every two years is accumulated, please write "2000-2001" in the key.

in reply to Daniel Böhmer

@dboehmer as said at the top, they are two-year segments. It's just a limit I decided on to keep the number of fields reasonable.
in reply to daniel:// stenberg://

Oh, I didn’t see/read this bit 🙈 Maybe that’s an indicator that this might be too subtle …
in reply to Daniel Böhmer

@dboehmer I wanted to keep the labels simple to reduce the amount of text, as it quickly becomes "heavy" otherwise. But yeah, I'll think of how to improve it.
in reply to daniel:// stenberg://

May I make two (edit: three) suggestions:

a) write "2000 f." for 2000–2001, as is common when giving page numbers in citations.
(I just learned that "f." is for giving someone’s birthdate in Swedish 😁 )
en.wiktionary.org/wiki/f.#Adje…

b) Use "≤" or "≥" mathematical operators. As the key is most probably read from the top to the bottom maybe give the lower number year instead like
- ≥ 2023
- ≥ 2021
- ≥ 2019
- …
- < 2000

c) short form 2000/01 to 2023/24

in reply to daniel:// stenberg://

You’re so quick! I find this better than take 4, for sure.

If you want to minimize text space I’d consider this the optimal solution.

But to be honest I think it’s a bit too technical, even for software people. It takes a moment to understand this means each color represents two years …

More than ½ h after posting my suggestions I tend to think option C (that I added to the post) might be the most common notation: just "2023/24". Don’t you think? At least Germans use that a lot.

in reply to Daniel Böhmer

@dboehmer unfortunately I think that version gets too messy, probably because there are too many numbers, and it still isn't crystal clear what it means. I think I'll stick with the ≥ for now.
in reply to daniel:// stenberg://

@dboehmer for me, reading the graph part makes everything very clear. Like, the year number is just a point in time, at the transition between two years (e.g. black covers 2010-2012).

It would also be possible to work with dashes, like saying "up to 2002", though that needs a different numbering then:

- 2000
- 2002
- 2004
...

in reply to daniel:// stenberg://

It might be interesting to see this with log scale on y axis and, if those lines seem to decrease roughly linearly, to compare how half-lives change over time.
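
A straight line on a log-scale y axis means exponential decay, N(t) = N0 · 2^(-t/T), where T is the half-life. Under that assumption (the real data may well not follow it), T can be estimated from two readings of the same segment. A minimal Python sketch with made-up numbers:

```python
import math


def half_life(n_start: float, n_end: float, years: float) -> float:
    """Half-life in years, assuming exponential decay between two readings.

    n_start: surviving lines at the first reading
    n_end:   surviving lines at the second reading, `years` later
    """
    if not (0 < n_end < n_start) or years <= 0:
        raise ValueError("need 0 < n_end < n_start and years > 0")
    return years * math.log(2) / math.log(n_start / n_end)


# Hypothetical segment: 1000 lines shrink to 250 over 10 years,
# i.e. the count halves twice in that time.
print(half_life(1000, 250, 10))
```

Computing this for each two-year segment would show whether newer code decays faster or slower than older code.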