Hacker Public Radio

Your ideas, projects, opinions - podcasted.

New episodes every weekday Monday through Friday.

hpr2091 :: Everyday Unix/Linux Tools for data processing

In this episode, I give some examples of common and uncommon tools for processing data files

Hosted by Mr. Young on 2016-08-08. Flagged as Clean and released under a CC-BY-SA license.
Tags: linux, unix, data, command-line. Comments: 4.
The show is available on the Internet Archive at: https://archive.org/details/hpr2091

Duration: 00:30:15

Series: general.

Here are some of the tools I use to process and clean data from all manner of customers:

detox

The detox utility renames files to make them easier to work with. It removes spaces and other such annoyances. It will also translate or clean up Latin-1 (ISO 8859-1) characters encoded in 8-bit ASCII, Unicode characters encoded in UTF-8, and CGI-escaped characters.
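For readers who do not have detox installed, the basic idea can be approximated in plain shell. This is only a rough sketch and not detox's actual algorithm: the function name sanitize_names and the set of "safe" characters are assumptions made for illustration, and it simply drops non-ASCII bytes instead of translating them.

```shell
# Hypothetical helper, NOT detox itself: rename the files in one directory so
# that spaces become underscores and characters outside a safe set are dropped.
sanitize_names() {
    dir=${1:-.}
    for path in "$dir"/*; do
        [ -e "$path" ] || continue
        name=$(basename "$path")
        # Spaces to underscores, then delete every byte outside A-Z a-z 0-9 . _ -
        clean=$(printf '%s' "$name" | tr ' ' '_' | tr -cd 'A-Za-z0-9._-')
        # Skip names that would become empty or that need no change.
        [ -n "$clean" ] || continue
        [ "$name" = "$clean" ] && continue
        # Never overwrite a file that already has the cleaned name.
        [ -e "$dir/$clean" ] && continue
        mv -- "$path" "$dir/$clean"
    done
}
```

Calling sanitize_names on a directory of customer files renames only the entries that need it and leaves everything else alone.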

sed

See other episodes for great sed information. I like to remove DOS end-of-line and end-of-file characters:

sed -i 's/^M//g' *.txt

(Here ^M stands for a literal carriage return, typed as Ctrl-V Ctrl-M.)

or

sed -i 's/\r//g' *.txt
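The sed commands above remove carriage returns. A minimal sketch that also removes the DOS end-of-file marker is given here, assuming that "end of file character" means Ctrl-Z (byte 0x1A). The helper name strip_dos is hypothetical, and tr is swapped in for sed so that the sketch does not depend on the GNU-specific -i option.

```shell
# Hypothetical helper: strip DOS carriage returns and Ctrl-Z end-of-file
# markers from each file named on the command line, editing in place.
strip_dos() {
    for f in "$@"; do
        tmp=$(mktemp) || return 1
        # tr -d deletes every CR (octal 015) and Ctrl-Z (octal 032) byte.
        tr -d '\015\032' < "$f" > "$tmp" || { rm -f "$tmp"; return 1; }
        # Copy back with cat so the original file keeps its permissions.
        cat "$tmp" > "$f"
        rm -f "$tmp"
    done
}
```

It would be called the same way as the sed one-liners, for example strip_dos *.txt.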

Command-line tools

  • ack
  • awk
  • detox
  • grep
  • pandoc
  • pdftotext -layout
  • sed
  • unix2dos and dos2unix
  • wget
  • curl
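As an illustration of how a couple of the tools in this list chain together, here is a small sketch that is not taken from the episode. It uses grep and awk to flag rows of a delimited file whose field count differs from the header's. The function name check_fields, the comma default, and the report format are all assumptions, and the naive splitting does not understand quoted fields that contain the separator.

```shell
# Hypothetical helper: report rows whose field count differs from the header.
# Usage: check_fields FILE [SEPARATOR]   (the separator defaults to a comma)
check_fields() {
    file=$1
    sep=${2:-,}
    # grep -v drops blank lines; awk then compares each row's field count (NF)
    # with the count on the first remaining line. Row numbers therefore count
    # non-blank lines only.
    grep -v '^[[:space:]]*$' "$file" |
        awk -F"$sep" 'NR == 1 { want = NF; next }
                      NF != want { printf "row %d: %d fields, expected %d\n", NR, NF, want }'
}
```

A file that passes produces no output, so the check is easy to drop into a larger cleaning script.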

R libraries

  • RCurl
  • XML
  • rvest
  • tm
  • xlsx

Python libraries

Vim tricks

  • buffer searches (:vim /pattern/ ##)
  • Ack plugin
  • bufdo (:bufdo %s/pattern/replace/ge | update)

Other tools


Comments

Comment #1 posted on 2016-08-09 00:46:44 by Jonathan Kulp

Ack!

Thanks, this is a genius tool. Never heard of it before.

Comment #2 posted on 2016-08-17 16:55:35 by Ken Fallon

I love detox

detox -vr *

Wow, what an excellent tool.

Comment #3 posted on 2016-08-19 16:30:03 by Dave Morriss

Thanks for mentioning 'ack'

Wow! I had never encountered 'ack' before. It's amazing.

I have written a bunch of Bash scripts to work with a PostgreSQL database (yes, I know, it's a bit like wearing a hair shirt; self-mortification), and I found I could do things like:

ack --shell --pager=more psql .

There's no other easy way to do this that I know of.

Thanks very much for pointing this one out.

Comment #4 posted on 2016-08-21 14:53:50 by ivor

Interesting

I always love vim tips, so I got pulled in looking at the buffer search. Then I noticed the other tools mentioned. Most of them I know about, and I use all that are relevant to me very frequently. So now I'm going to subscribe...
