/— title: “How to Read a .data file with Python Pandas” date: 2020-05-25 draft: false tags: [“pandas”,“ETL”] # please only lowercase and separate with - category: “Data Analysis” weight: "” type: "” # doc, page, post diagrams: false # to add js diagrams Ex: https://mermaid-js.github.io/mermaid/

A .data file usually contain plain text fixed-width formatted lines.

18.0   8   307.0      130.0      3504.      12.0   70  1	"chevrolet chevelle malibu"
15.0   8   350.0      165.0      3693.      11.5   70  1	"buick skylark 320"
18.0   8   318.0      150.0      3436.      11.0   70  1	"plymouth satellite"
16.0   8   304.0      150.0      3433.      12.0   70  1	"amc rebel sst"
17.0   8   302.0      140.0      3449.      10.5   70  1	"ford torino"
15.0   8   429.0      198.0      4341.      10.0   70  1	"ford galaxie 500"
14.0   8   454.0      220.0      4354.       9.0   70  1	"chevrolet impala"
14.0   8   440.0      215.0      4312.       8.5   70  1	"plymouth fury iii"

To read a table of fixed-width formatted lines into DataFrame, we can use the .read_fwf method:

In this example, we will use the Auto MPG Data Set from the Center for Machine Learning and Intelligent Systems:

https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/

import pandas as pd

column_names = [
    "mpg",
    "cylinders",
    "displacement",
    "horsepower",
    "weight",
    "acceleration",
    "year",
    "origin",
    "name",
]
df = pd.read_csv("../Data/auto-mpg.data", delim_whitespace=True, names=column_names)
df

mpg cylinders displacement horsepower weight acceleration year origin name
0 18.0 8 307.0 130.0 3504.0 12.0 70 1 chevrolet chevelle malibu
1 15.0 8 350.0 165.0 3693.0 11.5 70 1 buick skylark 320
2 18.0 8 318.0 150.0 3436.0 11.0 70 1 plymouth satellite
3 16.0 8 304.0 150.0 3433.0 12.0 70 1 amc rebel sst
4 17.0 8 302.0 140.0 3449.0 10.5 70 1 ford torino
... ... ... ... ... ... ... ... ... ...
393 27.0 4 140.0 86.00 2790.0 15.6 82 1 ford mustang gl
394 44.0 4 97.0 52.00 2130.0 24.6 82 2 vw pickup
395 32.0 4 135.0 84.00 2295.0 11.6 82 1 dodge rampage
396 28.0 4 120.0 79.00 2625.0 18.6 82 1 ford ranger
397 31.0 4 119.0 82.00 2720.0 19.4 82 1 chevy s-10

398 rows × 9 columns

delim_whitespace specifies whether or not whitespace (e.g. ' ' or ' ') will be used as the separator.