/— title: “How to Read a .data file with Python Pandas” date: 2020-05-25 draft: false tags: [“pandas”,“ETL”] # please only lowercase and separate with - category: “Data Analysis” weight: "” type: "” # doc, page, post diagrams: false # to add js diagrams Ex: https://mermaid-js.github.io/mermaid/
A .data
file usually contain plain text fixed-width formatted lines.
18.0 8 307.0 130.0 3504. 12.0 70 1 "chevrolet chevelle malibu"
15.0 8 350.0 165.0 3693. 11.5 70 1 "buick skylark 320"
18.0 8 318.0 150.0 3436. 11.0 70 1 "plymouth satellite"
16.0 8 304.0 150.0 3433. 12.0 70 1 "amc rebel sst"
17.0 8 302.0 140.0 3449. 10.5 70 1 "ford torino"
15.0 8 429.0 198.0 4341. 10.0 70 1 "ford galaxie 500"
14.0 8 454.0 220.0 4354. 9.0 70 1 "chevrolet impala"
14.0 8 440.0 215.0 4312. 8.5 70 1 "plymouth fury iii"
To read a table of fixed-width formatted lines into DataFrame, we can use the .read_fwf
method:
In this example, we will use the Auto MPG Data Set from the Center for Machine Learning and Intelligent Systems:
https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/
import pandas as pd
column_names = [
"mpg",
"cylinders",
"displacement",
"horsepower",
"weight",
"acceleration",
"year",
"origin",
"name",
]
df = pd.read_csv("../Data/auto-mpg.data", delim_whitespace=True, names=column_names)
df
mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin | name | |
---|---|---|---|---|---|---|---|---|---|
0 | 18.0 | 8 | 307.0 | 130.0 | 3504.0 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
1 | 15.0 | 8 | 350.0 | 165.0 | 3693.0 | 11.5 | 70 | 1 | buick skylark 320 |
2 | 18.0 | 8 | 318.0 | 150.0 | 3436.0 | 11.0 | 70 | 1 | plymouth satellite |
3 | 16.0 | 8 | 304.0 | 150.0 | 3433.0 | 12.0 | 70 | 1 | amc rebel sst |
4 | 17.0 | 8 | 302.0 | 140.0 | 3449.0 | 10.5 | 70 | 1 | ford torino |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
393 | 27.0 | 4 | 140.0 | 86.00 | 2790.0 | 15.6 | 82 | 1 | ford mustang gl |
394 | 44.0 | 4 | 97.0 | 52.00 | 2130.0 | 24.6 | 82 | 2 | vw pickup |
395 | 32.0 | 4 | 135.0 | 84.00 | 2295.0 | 11.6 | 82 | 1 | dodge rampage |
396 | 28.0 | 4 | 120.0 | 79.00 | 2625.0 | 18.6 | 82 | 1 | ford ranger |
397 | 31.0 | 4 | 119.0 | 82.00 | 2720.0 | 19.4 | 82 | 1 | chevy s-10 |
398 rows × 9 columns
delim_whitespace
specifies whether or not whitespace (e.g. ' '
or ' '
) will be used as the separator.