mattkerlogue.bsky.social
Data/numbers/pretty charts, typically in #rstats. Ex-civil servant still doing government-y stuff, currently @blavatnikschool.bsky.social.
He/him, alphabet mafia type 🏳️🌈, SE London.
https://github.com/mattkerlogue
https://matt.kerlogue.co.uk
48 posts
60 followers
99 following
comment in response to
post
@mattdray.bsky.social is here
comment in response to
post
When I was dealing with datasets of 200,000+ observations, there’s a threshold at which you accept some stuff can’t be resolved, and that’s somewhat driven by whether the level of “dirt” actually has an impact on the final output.
comment in response to
post
On the impact point: I’ve recently been working on a project with country-level data, so 100–200 observations. I’ve invested quite a lot of time in cleaning the data because an error here is highly influential in the output.
comment in response to
post
Smaller datasets get proportionally more time but it’s also easier to identify each individual cleaning action that’s needed so I spend more time on customised cleans on top of the initial generalised routines.
comment in response to
post
I think yes. There’s a simple volume piece to it, the impact of a small number of “dirty” data points is higher. But also I think it’s partly about time. I don’t think my time cleaning data scales linearly with volume.
comment in response to
post
Yes, I didn't think {targets} was likely to be the right thing as it's very tightly concerned with pipelines but thought I'd raise it since the inspector/visualiser functions might help, but also might not.
comment in response to
post
Whereas the first two packages haven't been updated in years, {targets} is still regularly maintained (and is on CRAN).
comment in response to
post
There's also the {targets} package, which is more about analytical pipelines but has functions for inspecting and visualising dependencies docs.ropensci.org/targets/refe...
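For anyone curious, a minimal sketch of the inspection side of {targets} (assuming a `_targets.R` pipeline already exists in the working directory):

```r
library(targets)

# Interactive graph of targets plus the functions/objects they depend on
tar_visnetwork()

# Stripped-back version showing just the targets themselves
tar_glimpse()
```

These are the visualiser functions mentioned above; whether they help outside a pipeline context is another matter.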
comment in response to
post
Yes, I wondered if the second might be more useful. I also mis-read the very end of your post as "and to manage chaos", which depending on the nature of the package might also be true (at least in my own experience) 😂
comment in response to
post
As a social scientist who had worked largely in SPSS and Stata, my biggest challenge when first engaging with R was that you didn't get different types of missing-ness. The {haven} package's tagged_na() is very handy. haven.tidyverse.org/reference/ta...
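A quick sketch of what tagged_na() gives you (toy values, but the functions are from {haven} itself):

```r
library(haven)

# Two different kinds of missing: tag "a" (e.g. refused) and "b" (e.g. not asked)
x <- c(1, tagged_na("a"), 2, tagged_na("b"))

is.na(x)              # tagged values are still ordinary NAs to base R
na_tag(x)             # ...but the tags can be recovered: NA "a" NA "b"
is_tagged_na(x, "a")  # test for one specific kind of missing-ness
```

So downstream code that only knows about NA keeps working, while the SPSS/Stata-style distinctions survive the round trip.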
comment in response to
post
If you can install from GitHub then there are two fairly old packages not on CRAN that I know of: github.com/crsh/depgraph and datastorm-open.github.io/Dependencies...
comment in response to
post
Humans know to use different tools for different things, ergo a generalised LLM chatbot shouldn’t be trying to do maths… it should be asking a thing that can do maths to do the calculation and then returning the result.
comment in response to
post
It’s a perfect example of why I’m not sceptical about AI in general but I am sceptical about the current state of generalised AI/LLMs and what folk are using them for… they shouldn’t be/are unlikely to be an everything tool until they get past their own ego.
comment in response to
post
I have now! This was a delight thanks! 😊
comment in response to
post
[37]: the language written in England: some people living in the USA appropriate this name for their language.
[38]: with Americanisms.
comment in response to
post
"For example, R has catalogues for ‘en_GB’ that translate the Americanisms (e.g., ‘gray’) in the standard messages into English.[37] ... If no suitable translation catalogue is found or a particular message is not translated in any suitable catalogue, ‘English’[38] is used."
comment in response to
post
But today's find is this wonderful footnote deep in the #rstats install documentation cran.r-project.org/doc/manuals/...
comment in response to
post
You can't mention #rstats documentation without mentioning Clippy-gate in the {writexl} package github.com/ropensci/wri...
comment in response to
post
My first experience was years ago when I was working out why medians in Stata and #rstats weren't the same, which is due to different quantile calculations. But in the ?quantile documentation you learn that the 'type 5' quantile "is popular amongst hydrologists".
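For anyone who wants to see the divergence, a base-R sketch (toy data; R defaults to type 7, and other software uses different definitions, hence the Stata mismatch):

```r
x <- 1:10

quantile(x, probs = 0.25, type = 7, names = FALSE)  # R's default
quantile(x, probs = 0.25, type = 2, names = FALSE)  # a common "averaging" definition
quantile(x, probs = 0.25, type = 5, names = FALSE)  # "popular amongst hydrologists"
```

All nine types are documented in ?quantile, footnote-worthy trivia included.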
comment in response to
post
If my function is complex and/or has multiple possible outputs then I like to use return to be explicit to others (especially future me!).
But if the function is a simple helper where it’s just encapsulating a single repeated pipe chain or only a few lines of code then I’ll not bother with a return.
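Roughly what I mean, with made-up helper names:

```r
# Simple helper just wrapping a few lines: implicit return is fine
standardise <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

# More complex function with multiple possible outputs: explicit return()
# makes each exit point obvious to others (and future me!)
summarise_scores <- function(x, method = c("mean", "median")) {
  method <- match.arg(method)
  if (all(is.na(x))) {
    return(NA_real_)
  }
  if (method == "median") {
    return(median(x, na.rm = TRUE))
  }
  return(mean(x, na.rm = TRUE))
}
```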
comment in response to
post
Updated because the blog link got borked via a copy paste :/
Lesson here... re-check the links if you copy a draft post and then paste it somewhere else.
comment in response to
post
A little bot that cruises the canals and posts a random location from the CRT's network and, if one exists, a nearby photo published on Flickr (@flickrfdn.bsky.social ).
The bot is written in #rstats and powered by Github Actions. You can read more here: lapsedgeographer.london/2020-10/virt...
comment in response to
post
Oops... thanks for flagging, working link here: lapsedgeographer.london/2020-10/virt...
Current code: github.com/mattkerlogue...
comment in response to
post
This covers when it ran on Twitter; that's since stopped, but it's pretty much the same for Mastodon and Bsky. The code is here: github.com/mattkerlogue...
comment in response to
post
Oops, half the link got curtailed into "..." lapsedgeographer.london/2020-10/virt...
comment in response to
post
Re the missing data point you make at the end: thinking of the facet chart you show, I've sometimes displayed missing data as an outline/unfilled circle, or used a different shape (e.g. an x).
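A rough {ggplot2} sketch of the idea, with an entirely made-up series and a hypothetical placeholder value for the missing point:

```r
library(ggplot2)

df <- data.frame(
  year  = 2015:2020,
  value = c(3, 5, NA, 6, 8, 7)
)
df$missing <- is.na(df$value)
df$value[df$missing] <- 4  # e.g. an imputed value to give the point a position

ggplot(df, aes(year, value)) +
  geom_line() +
  # shape 19 = filled circle, shape 1 = open circle (shape 4 would be an "x")
  geom_point(aes(shape = missing), size = 3) +
  scale_shape_manual(values = c(`FALSE` = 19, `TRUE` = 1))
```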
comment in response to
post
This was a fab post, thanks. Far too few think about these sorts of issues so it's great to have a resource to point to on it.
comment in response to
post
Shout out to @mattdray.bsky.social, whose now-defunct londonmapbot was the inspiration to build the @narrowbotr.bsky.social in the first place.
The bot also exists on Mastodon (mastodon.social/@narrowbotr@...)
comment in response to
post
Currently I'm working on the Blavatnik Index of Public Administration for @blavatnikschool.bsky.social index.bsg.ox.ac.uk
comment in response to
post
I spent far too long at the Cabinet Office, where amongst many other things I was the lead analyst involved in helping set up the Civil Service People Survey (www.gov.uk/government/c...).
comment in response to
post
You can read more about the Blavatnik Index and explore the results here: index.bsg.ox.ac.uk
comment in response to
post
Something like this (from education.economist.com/insights/int...)
comment in response to
post
Can I suggest using points for the male/female score and a bar between them (geom_linerange in ggplot) rather than using separate bars. I think this would make it a lot easier to visualise both the salience patterns of the issues for each sex as well as any differences.
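Something like this sketch, with invented salience numbers just to show the geometry:

```r
library(ggplot2)

# Hypothetical salience scores by sex
df <- data.frame(
  issue = rep(c("Economy", "Health", "Crime"), each = 2),
  sex   = rep(c("Male", "Female"), times = 3),
  score = c(62, 55, 48, 64, 40, 38)
)

# One row per issue, so the linerange can span the two points
wide <- reshape(df, idvar = "issue", timevar = "sex", direction = "wide")

ggplot(wide, aes(y = issue)) +
  geom_linerange(aes(xmin = pmin(score.Male, score.Female),
                     xmax = pmax(score.Male, score.Female))) +
  geom_point(data = df, aes(x = score, colour = sex), size = 3)
```

The bar between the points carries the "difference" reading, while the points themselves still show each sex's level.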
comment in response to
post
Well that makes me feel a lot less bad about not realising it was you either!!
comment in response to
post
@gavinfreeguard.bsky.social has just told me he was chatting to you when I was next to him FML 🤦♂️🤦♂️🤦♂️🤦♂️🤦♂️🤦♂️🤦♂️
comment in response to
post
Very nearly went to the local/regional chat.
comment in response to
post
Aaah. Wonderful. Shame not to have spotted you, I know I ought to have posted about being there but still only just starting to re-engage with the whole social media whatnot.
comment in response to
post
Oh. Were you at the Stats Assembly?
comment in response to
post
I have had a complete essay in my head about this for ~10 years. If only I could actually force myself to write it.
comment in response to
post
Long story short, it’s largely due to emoji that have multiple code points messing up the calculation of the start/end bytes.
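You can see the mismatch in base R; the "family" emoji below is several code points joined with zero-width joiners, and byte-offset APIs count the UTF-8 bytes, not the characters:

```r
s <- "hi \U0001F468\u200D\U0001F469\u200D\U0001F467"  # man + ZWJ + woman + ZWJ + girl

nchar(s, type = "chars")  # code points
nchar(s, type = "bytes")  # UTF-8 bytes -- what start/end offsets are counted in
```

If you compute offsets in characters and the API wants bytes, any emoji before a link shifts it out of place.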
comment in response to
post
@paul.wales I think I’ve worked out the error causing this problem. I’ll apply a temporary fix in the next day or so while I work on a PR for the underlying package.